"Read from Database, Process Documents From Data, kMeans Clustering"
nate
New Altair Community Member
Greetings - My question concerns what I imagine is something very simple that, as a newbie, I am merely overlooking. However, after reading the manual and similar posts (like this one: http://rapid-i.com/rapidforum/index.php/topic,5518.0.html ), I am still at a loss.
I am reading data from a DB with the following columns:
- entity_id
- raw_text
I have tried both of the following:
- Adding a Set Role operator that sets the role of the attribute entity_id to id, after the Process Documents from Data operator
- Adding a Set Role operator that sets the role of the attribute entity_id to id, before the Process Documents from Data operator
Many thanks in advance for helping me through what I imagine is a total noob oversight.
Answers
Hm, where does RapidMiner create an id? If you add the id before Process Documents, it survives until the end of the process, even if you add a clustering algorithm at the end. Please have a look at the attached process. If you keep having problems, please post your process setup, as described in the link in my signature.
Best regards,
Marius

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.000">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.3.000" expanded="true" name="Process">
<process expanded="true" height="583" width="889">
<operator activated="true" class="generate_nominal_data" compatibility="5.3.000" expanded="true" height="60" name="Generate Nominal Data" width="90" x="112" y="30"/>
<operator activated="true" class="generate_id" compatibility="5.3.000" expanded="true" height="76" name="Generate ID" width="90" x="246" y="30"/>
<operator activated="true" class="nominal_to_text" compatibility="5.3.000" expanded="true" height="76" name="Nominal to Text" width="90" x="380" y="30"/>
<operator activated="true" class="text:process_document_from_data" compatibility="5.2.005" expanded="true" height="76" name="Process Documents from Data" width="90" x="514" y="30">
<list key="specify_weights"/>
<process expanded="true" height="583" width="889">
<operator activated="true" class="text:tokenize" compatibility="5.2.005" expanded="true" height="60" name="Tokenize" width="90" x="194" y="40">
<parameter key="mode" value="specify characters"/>
<parameter key="characters" value=".: "/>
</operator>
<connect from_port="document" to_op="Tokenize" to_port="document"/>
<connect from_op="Tokenize" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="k_means" compatibility="5.3.000" expanded="true" height="76" name="Clustering" width="90" x="648" y="30"/>
<connect from_op="Generate Nominal Data" from_port="output" to_op="Generate ID" to_port="example set input"/>
<connect from_op="Generate ID" from_port="example set output" to_op="Nominal to Text" to_port="example set input"/>
<connect from_op="Nominal to Text" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
<connect from_op="Process Documents from Data" from_port="example set" to_op="Clustering" to_port="example set"/>
<connect from_op="Clustering" from_port="cluster model" to_port="result 1"/>
<connect from_op="Clustering" from_port="clustered set" to_port="result 2"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
</process>
</operator>
</process>
Hi Marius - Thanks for responding so quickly! In reference to your first question, I believe I am referring to the "id" column that is generated in the Results View. It appears to correspond to the row number from the exampleSet. Also, each of the documents within the clusters is identified with this same id value.
My data looks like this in the database I am querying:

id    raw_text
1003  This entity is about snowboarding and other fun winter sports
2097  This entity is about orange juice and pancakes
2318  This entity is about elephants

In real life, the values within the raw_text column are significantly longer. Also, I rename my database's id column to entity_id, only return ids over 1000, and limit the result to 100 rows.

The goal is to have k-means cluster on raw_text, then identify each document by its entity_id, not the row number (which seems to be named "id" and is always between 1 and 100). For example, if I am looking at the "Folder View", cluster_0 might expand to 2220, 3862, and 1034 (entity_id), not 12, 44, 86 (row number, called id).

The database call works fine. Once the data is in, here is my process:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.2.008">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.2.008" expanded="true" name="Process">
<process expanded="true" height="476" width="949">
<operator activated="true" class="read_database" compatibility="5.2.008" expanded="true" height="60" name="Read Database" width="90" x="45" y="30">
<parameter key="connection" value="Ramsey Test"/>
<parameter key="query" value="SELECT raw_text, id AS entity_id FROM ram_entity WHERE raw_text IS NOT NULL AND id > 1000 LIMIT 100"/>
<enumeration key="parameters"/>
</operator>
<operator activated="true" class="set_role" compatibility="5.2.008" expanded="true" height="76" name="Set Role" width="90" x="246" y="165">
<parameter key="name" value="entity_id"/>
<parameter key="target_role" value="id"/>
<list key="set_additional_roles"/>
</operator>
<operator activated="true" class="text:process_document_from_data" compatibility="5.2.004" expanded="true" height="76" name="Process Documents from Data" width="90" x="447" y="30">
<parameter key="add_meta_information" value="false"/>
<parameter key="keep_text" value="true"/>
<parameter key="prune_method" value="absolute"/>
<parameter key="prune_below_absolute" value="2"/>
<parameter key="prune_above_absolute" value="999"/>
<list key="specify_weights"/>
<process expanded="true" height="745" width="1027">
<operator activated="true" class="web:extract_html_text_content" compatibility="5.2.003" expanded="true" height="60" name="Extract Content" width="90" x="45" y="30"/>
<operator activated="true" class="text:tokenize" compatibility="5.2.004" expanded="true" height="60" name="Tokenize" width="90" x="165" y="28"/>
<operator activated="true" class="text:filter_stopwords_english" compatibility="5.2.004" expanded="true" height="60" name="Filter Stopwords (English)" width="90" x="313" y="30"/>
<operator activated="true" class="text:filter_by_length" compatibility="5.2.004" expanded="true" height="60" name="Filter Tokens (by Length)" width="90" x="470" y="29"/>
<operator activated="true" class="text:stem_snowball" compatibility="5.2.004" expanded="true" height="60" name="Stem (Snowball)" width="90" x="607" y="30"/>
<connect from_port="document" to_op="Extract Content" to_port="document"/>
<connect from_op="Extract Content" from_port="document" to_op="Tokenize" to_port="document"/>
<connect from_op="Tokenize" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
<connect from_op="Filter Stopwords (English)" from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/>
<connect from_op="Filter Tokens (by Length)" from_port="document" to_op="Stem (Snowball)" to_port="document"/>
<connect from_op="Stem (Snowball)" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="k_means" compatibility="5.2.008" expanded="true" height="76" name="Clustering" width="90" x="581" y="30">
<parameter key="k" value="3"/>
<parameter key="measure_types" value="NumericalMeasures"/>
<parameter key="numerical_measure" value="CosineSimilarity"/>
</operator>
<connect from_op="Read Database" from_port="output" to_op="Set Role" to_port="example set input"/>
<connect from_op="Set Role" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
<connect from_op="Set Role" from_port="original" to_port="result 3"/>
<connect from_op="Process Documents from Data" from_port="example set" to_op="Clustering" to_port="example set"/>
<connect from_op="Clustering" from_port="cluster model" to_port="result 1"/>
<connect from_op="Clustering" from_port="clustered set" to_port="result 2"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
<portSpacing port="sink_result 4" spacing="0"/>
</process>
</operator>
</process>
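
To make the goal above concrete outside RapidMiner, here is a minimal Python sketch of the mapping nate is after: cluster members listed by entity_id instead of by generated row number. It rests on assumptions that are not confirmed in this thread: that the generated "id" is the 1-based position of each row in the query result, and that row order is unchanged between the database read and the clustering output. The cluster assignments and the function name `members_by_entity_id` are invented for illustration only.

```python
from collections import defaultdict

# Rows in query-result order: (entity_id, raw_text).
# The three sample rows are taken from the post above.
rows = [
    (1003, "This entity is about snowboarding and other fun winter sports"),
    (2097, "This entity is about orange juice and pancakes"),
    (2318, "This entity is about elephants"),
]

# Hypothetical clustering output keyed by the generated 1-based row number.
# These assignments are made up for illustration, not real k-means results.
cluster_by_row = {1: "cluster_0", 2: "cluster_1", 3: "cluster_0"}


def members_by_entity_id(rows, cluster_by_row):
    """Group entity_ids by cluster, translating row numbers to entity_ids."""
    members = defaultdict(list)
    # Assumption: the generated id is the 1-based row position.
    for row_number, (entity_id, _text) in enumerate(rows, start=1):
        members[cluster_by_row[row_number]].append(entity_id)
    return dict(members)


folder_view = members_by_entity_id(rows, cluster_by_row)
for cluster in sorted(folder_view):
    print(cluster, folder_view[cluster])
```

This is only a fallback way of reading the results; if entity_id keeps its id role through to the clustering operator, as Marius describes, no such translation should be needed.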