"Survey analysys from Excel source"
Hi,
I hope you guys can help or direct me to a right direction. I just find RapidMiner and spent the last two days going through tutorials.
I have over 2000 surveys loaded into an excel file and I need to make a statistic for frequently mentioned phrases and brake it down by ID’s. The file contains few additional information, but the data and the ID columns are the most important. I was doing this breakdown in excel, but it’s not the ideal tool for this analysis. Can anybody help creating a solution?
Thanks a lot
I hope you guys can help or direct me to a right direction. I just find RapidMiner and spent the last two days going through tutorials.
I have over 2000 surveys loaded into an excel file and I need to make a statistic for frequently mentioned phrases and brake it down by ID’s. The file contains few additional information, but the data and the ID columns are the most important. I was doing this breakdown in excel, but it’s not the ideal tool for this analysis. Can anybody help creating a solution?
Thanks a lot
Find more posts tagged with
Sort by:
1 - 14 of
141
What exactly do you mean by breaking down by Id? Do you have several rows with the same Id in the excel file, and you want to combine the statistics for rows with similar Ids?
It would probably be helpful to post your current process setup and if possible some sample data. Post your process by clicking the XML tab in RapidMiner above the process view, and copy the XML code into your next post. Please use the "#" button above the input field of this forum.
Best,
Marius
It would probably be helpful to post your current process setup and if possible some sample data. Post your process by clicking the XML tab in RapidMiner above the process view, and copy the XML code into your next post. Please use the "#" button above the input field of this forum.
Best,
Marius
Hi Marius,
Thanks for the prompt replies.
The excel file has a unique ID column and there are multiple rows belong with the same ID. Here is the header of the Excel file looks like:
ID (text); Comment (text); Description (text); Code (text); Reference# (integer); Code Description (text); Resolution Description (text)
My target is to analyze the frequent words, phrases in the comment column and display the result in a chart. Including and excluding word and phase list would be great, but not essential.
I was able to create a working model with manual pre-sorting the source in excel and transferring it to a txt file, but I’m completely newbie. I couldn’t find the basics of working with excel files. I’m also having difficulties to export the wordlist into an Excel or a txt file and creating a chart from the results.
Here is the xml code just in case:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.1.017">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.1.017" expanded="true" name="Process">
<process expanded="true" height="531" width="637">
<operator activated="true" class="text:read_document" compatibility="5.1.004" expanded="true" height="60" name="Read Document" width="90" x="45" y="30">
<parameter key="file" value="E:\Comment Analysis\Dev\Description light.txt"/>
</operator>
<operator activated="true" class="text:process_documents" compatibility="5.1.004" expanded="true" height="94" name="Process Documents" width="90" x="176" y="31">
<process expanded="true" height="531" width="655">
<operator activated="true" class="text:transform_cases" compatibility="5.1.004" expanded="true" height="60" name="Transform Cases" width="90" x="45" y="75"/>
<operator activated="true" class="text:tokenize" compatibility="5.1.004" expanded="true" height="60" name="Tokenize" width="90" x="45" y="255"/>
<operator activated="true" class="text:filter_stopwords_english" compatibility="5.1.004" expanded="true" height="60" name="Filter Stopwords (English)" width="90" x="45" y="390"/>
<operator activated="true" class="text:filter_by_length" compatibility="5.1.004" expanded="true" height="60" name="Filter Tokens (by Length)" width="90" x="179" y="165">
<parameter key="min_chars" value="3"/>
<parameter key="max_chars" value="999"/>
</operator>
<operator activated="true" class="text:generate_n_grams_terms" compatibility="5.1.004" expanded="true" height="60" name="Generate n-Grams (Terms)" width="90" x="179" y="30">
<parameter key="max_length" value="3"/>
</operator>
<connect from_port="document" to_op="Transform Cases" to_port="document"/>
<connect from_op="Transform Cases" from_port="document" to_op="Tokenize" to_port="document"/>
<connect from_op="Tokenize" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
<connect from_op="Filter Stopwords (English)" from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/>
<connect from_op="Filter Tokens (by Length)" from_port="document" to_op="Generate n-Grams (Terms)" to_port="document"/>
<connect from_op="Generate n-Grams (Terms)" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<connect from_op="Read Document" from_port="output" to_op="Process Documents" to_port="documents 1"/>
<connect from_op="Process Documents" from_port="example set" to_port="result 1"/>
<connect from_op="Process Documents" from_port="word list" to_port="result 2"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
</process>
</operator>
</process>
I already can see that RapidMiner is a great tool for this purpose; I just have to find the way to get the things done.
Thank you in advance,
Peter
Thanks for the prompt replies.
The excel file has a unique ID column and there are multiple rows belong with the same ID. Here is the header of the Excel file looks like:
ID (text); Comment (text); Description (text); Code (text); Reference# (integer); Code Description (text); Resolution Description (text)
My target is to analyze the frequent words, phrases in the comment column and display the result in a chart. Including and excluding word and phase list would be great, but not essential.
I was able to create a working model with manual pre-sorting the source in excel and transferring it to a txt file, but I’m completely newbie. I couldn’t find the basics of working with excel files. I’m also having difficulties to export the wordlist into an Excel or a txt file and creating a chart from the results.
Here is the xml code just in case:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.1.017">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.1.017" expanded="true" name="Process">
<process expanded="true" height="531" width="637">
<operator activated="true" class="text:read_document" compatibility="5.1.004" expanded="true" height="60" name="Read Document" width="90" x="45" y="30">
<parameter key="file" value="E:\Comment Analysis\Dev\Description light.txt"/>
</operator>
<operator activated="true" class="text:process_documents" compatibility="5.1.004" expanded="true" height="94" name="Process Documents" width="90" x="176" y="31">
<process expanded="true" height="531" width="655">
<operator activated="true" class="text:transform_cases" compatibility="5.1.004" expanded="true" height="60" name="Transform Cases" width="90" x="45" y="75"/>
<operator activated="true" class="text:tokenize" compatibility="5.1.004" expanded="true" height="60" name="Tokenize" width="90" x="45" y="255"/>
<operator activated="true" class="text:filter_stopwords_english" compatibility="5.1.004" expanded="true" height="60" name="Filter Stopwords (English)" width="90" x="45" y="390"/>
<operator activated="true" class="text:filter_by_length" compatibility="5.1.004" expanded="true" height="60" name="Filter Tokens (by Length)" width="90" x="179" y="165">
<parameter key="min_chars" value="3"/>
<parameter key="max_chars" value="999"/>
</operator>
<operator activated="true" class="text:generate_n_grams_terms" compatibility="5.1.004" expanded="true" height="60" name="Generate n-Grams (Terms)" width="90" x="179" y="30">
<parameter key="max_length" value="3"/>
</operator>
<connect from_port="document" to_op="Transform Cases" to_port="document"/>
<connect from_op="Transform Cases" from_port="document" to_op="Tokenize" to_port="document"/>
<connect from_op="Tokenize" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
<connect from_op="Filter Stopwords (English)" from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/>
<connect from_op="Filter Tokens (by Length)" from_port="document" to_op="Generate n-Grams (Terms)" to_port="document"/>
<connect from_op="Generate n-Grams (Terms)" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<connect from_op="Read Document" from_port="output" to_op="Process Documents" to_port="documents 1"/>
<connect from_op="Process Documents" from_port="example set" to_port="result 1"/>
<connect from_op="Process Documents" from_port="word list" to_port="result 2"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
</process>
</operator>
</process>
I already can see that RapidMiner is a great tool for this purpose; I just have to find the way to get the things done.
Thank you in advance,
Peter
Heya,
try something like the process below. The main work is done by the aggregation operator. The Rename by Replace just removes the function names from the attribute names (set a breakpoint after the aggregation to see what I mean). I tested this process on a simple dataset like this:
try something like the process below. The main work is done by the aggregation operator. The Rename by Replace just removes the function names from the attribute names (set a breakpoint after the aggregation to see what I mean). I tested this process on a simple dataset like this:
id;text
1;cat dog fish
1;fish dog
2;cat cheese mouse
2;dog
2;cat dog
3;dog
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.1.017">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.1.017" expanded="true" name="Process">
<process expanded="true" height="531" width="915">
<operator activated="false" class="text:process_documents" compatibility="5.2.000" expanded="true" height="76" name="Process Documents" width="90" x="447" y="210">
<process expanded="true" height="633" width="705">
<operator activated="false" class="text:transform_cases" compatibility="5.2.000" expanded="true" height="60" name="Transform Cases" width="90" x="45" y="30"/>
<operator activated="false" class="text:tokenize" compatibility="5.2.000" expanded="true" height="60" name="Tokenize" width="90" x="180" y="30"/>
<operator activated="false" class="text:filter_stopwords_english" compatibility="5.2.000" expanded="true" height="60" name="Filter Stopwords (English)" width="90" x="315" y="30"/>
<operator activated="false" class="text:filter_by_length" compatibility="5.2.000" expanded="true" height="60" name="Filter Tokens (by Length)" width="90" x="450" y="30">
<parameter key="min_chars" value="3"/>
<parameter key="max_chars" value="999"/>
</operator>
<operator activated="false" class="text:generate_n_grams_terms" compatibility="5.2.000" expanded="true" height="60" name="Generate n-Grams (Terms)" width="90" x="585" y="30">
<parameter key="max_length" value="3"/>
</operator>
<connect from_port="document" to_op="Transform Cases" to_port="document"/>
<connect from_op="Transform Cases" from_port="document" to_op="Tokenize" to_port="document"/>
<connect from_op="Tokenize" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
<connect from_op="Filter Stopwords (English)" from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/>
<connect from_op="Filter Tokens (by Length)" from_port="document" to_op="Generate n-Grams (Terms)" to_port="document"/>
<connect from_op="Generate n-Grams (Terms)" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="read_csv" compatibility="5.1.017" expanded="true" height="60" name="Read CSV" width="90" x="45" y="30">
<parameter key="csv_file" value="C:\Users\marius\Documents\development\test.csv"/>
<parameter key="first_row_as_names" value="false"/>
<list key="annotations">
<parameter key="0" value="Name"/>
</list>
<parameter key="encoding" value="windows-1252"/>
<list key="data_set_meta_data_information">
<parameter key="0" value="id.true.integer.attribute"/>
<parameter key="1" value="text.true.polynominal.attribute"/>
</list>
</operator>
<operator activated="true" class="set_role" compatibility="5.1.017" expanded="true" height="76" name="Set Role" width="90" x="179" y="30">
<parameter key="name" value="id"/>
<parameter key="target_role" value="id"/>
<list key="set_additional_roles"/>
</operator>
<operator activated="true" class="nominal_to_text" compatibility="5.1.017" expanded="true" height="76" name="Nominal to Text" width="90" x="313" y="30"/>
<operator activated="true" class="text:process_document_from_data" compatibility="5.2.000" expanded="true" height="76" name="Process Documents from Data" width="90" x="447" y="30">
<parameter key="vector_creation" value="Term Occurrences"/>
<list key="specify_weights"/>
<process expanded="true" height="633" width="705">
<operator activated="true" class="text:transform_cases" compatibility="5.2.000" expanded="true" height="60" name="Transform Cases (2)" width="90" x="45" y="30"/>
<operator activated="true" class="text:tokenize" compatibility="5.2.000" expanded="true" height="60" name="Tokenize (2)" width="90" x="180" y="30"/>
<operator activated="true" class="text:filter_stopwords_english" compatibility="5.2.000" expanded="true" height="60" name="Filter Stopwords (2)" width="90" x="315" y="30"/>
<operator activated="true" class="text:filter_by_length" compatibility="5.2.000" expanded="true" height="60" name="Filter Tokens (2)" width="90" x="450" y="30">
<parameter key="min_chars" value="3"/>
<parameter key="max_chars" value="999"/>
</operator>
<operator activated="true" class="text:generate_n_grams_terms" compatibility="5.2.000" expanded="true" height="60" name="Generate n-Grams (2)" width="90" x="585" y="30">
<parameter key="max_length" value="3"/>
</operator>
<connect from_port="document" to_op="Transform Cases (2)" to_port="document"/>
<connect from_op="Transform Cases (2)" from_port="document" to_op="Tokenize (2)" to_port="document"/>
<connect from_op="Tokenize (2)" from_port="document" to_op="Filter Stopwords (2)" to_port="document"/>
<connect from_op="Filter Stopwords (2)" from_port="document" to_op="Filter Tokens (2)" to_port="document"/>
<connect from_op="Filter Tokens (2)" from_port="document" to_op="Generate n-Grams (2)" to_port="document"/>
<connect from_op="Generate n-Grams (2)" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="aggregate" compatibility="5.1.017" expanded="true" height="76" name="Aggregate" width="90" x="581" y="30">
<parameter key="use_default_aggregation" value="true"/>
<parameter key="default_aggregation_function" value="sum"/>
<list key="aggregation_attributes"/>
<parameter key="group_by_attributes" value="|id"/>
</operator>
<operator activated="true" class="rename_by_replacing" compatibility="5.1.017" expanded="true" height="76" name="Rename by Replacing" width="90" x="715" y="30">
<parameter key="replace_what" value="sum\((.*)\)"/>
<parameter key="replace_by" value="$1"/>
</operator>
<connect from_op="Read CSV" from_port="output" to_op="Set Role" to_port="example set input"/>
<connect from_op="Set Role" from_port="example set output" to_op="Nominal to Text" to_port="example set input"/>
<connect from_op="Nominal to Text" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
<connect from_op="Process Documents from Data" from_port="example set" to_op="Aggregate" to_port="example set input"/>
<connect from_op="Aggregate" from_port="example set output" to_op="Rename by Replacing" to_port="example set input"/>
<connect from_op="Rename by Replacing" from_port="example set output" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
You can use the Write Excel operator to save Example Sets to Excel files. Probably you want to connect the exa output of process documents to that operator. You can also convert the wor output to an Example Set with the Wordlist to Data operator and then store that dataset. Just do what fits your needs best 
Best, Marius

Best, Marius
That depends totally on what you want to show
hennessy wrote: PS: What do you think, what's the best solution to dispay the results in a chart?

Best,
Marius
The manual is available here: http://rapid-i.com/content/view/26/84/lang,en/
if I understand your problem correctly, you should have a look at the Text Processing Extension.
Best,
Marius