[SOLVED] Prepare data to implement k-means
MarcosRL
New Altair Community Member
Hello friends of the community, a query:
I'm working with text mining - clustering.
I performed the pre-processing of the text files, created the TF-IDF vector, and filtered the stop words; the next step is to apply k-means.
What should the input format for the algorithm be?
Where can I get information on this?
Answers
-
Hi,
once you have the TF-IDF vector you can directly apply the k-Means operator. If you have problems doing so, please post your process as far as you have it. How to do that is explained in my signature.
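Roughly speaking, the top-level wiring should look something like the sketch below (schematic only, not a complete process file; parameters and the pre-processing operators inside Process Documents from Files are omitted, and the class and port names are the ones from RapidMiner 5):
<!-- Schematic sketch: the text pre-processing happens inside Process Documents from Files,
     and its example set output (the TF-IDF vectors) feeds the k-Means operator. -->
<operator activated="true" class="text:process_document_from_file" name="Process Documents from Files">
  <!-- Tokenize, Transform Cases, stop word filters etc. go in here -->
</operator>
<operator activated="true" class="k_means" name="Clustering"/>
<connect from_op="Process Documents from Files" from_port="example set" to_op="Clustering" to_port="example set"/>
<connect from_op="Clustering" from_port="cluster model" to_port="result 1"/>
<connect from_op="Clustering" from_port="clustered set" to_port="result 2"/>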
Did I already give you the link to our video tutorials? They provide a lot of introductory material and will also help you to understand the basic concepts of RapidMiner: http://rapid-i.com/content/view/189/212/lang,en/
Best regards,
Marius
-
Hi Marius, thanks for your response. I tried doing what you told me, applying k-means to the TF-IDF vector, but I get the following error.
The error tells me:
"... Wrong data of type "Document" was delivered at port "example set".
Expected data of type "data table".
The data delivered at the specified port was of the wrong type. Please make sure your ports are connected correctly ..."
How can I attach images of the error? (The insert image option in the menu does not work.)
Thanks
-
Hi, no need to attach an image, just paste the xml of your process.
Please click where it says "click here" in my signature and read the section "How to attach a process".
Best regards,
Marius
-
Hi Marius, here is the XML of my process.
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.2.008">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.2.008" expanded="true" name="Process">
<process expanded="true" height="161" width="279">
<operator activated="true" class="text:process_document_from_file" compatibility="5.2.004" expanded="true" height="76" name="Process Documents from Files" width="90" x="179" y="75">
<list key="text_directories">
<parameter key="doc1" value="C:\Users\marcos\Desktop\Datos de prueba para clustering\Caso de prueba 2\En español\doc1"/>
<parameter key="doc2" value="C:\Users\marcos\Desktop\Datos de prueba para clustering\Caso de prueba 2\En español\doc2"/>
<parameter key="doc3" value="C:\Users\marcos\Desktop\Datos de prueba para clustering\Caso de prueba 2\En español\doc3"/>
</list>
<process expanded="true" height="415" width="758">
<operator activated="true" class="text:transform_cases" compatibility="5.2.004" expanded="true" height="60" name="Transform Cases" width="90" x="45" y="30"/>
<operator activated="true" class="text:tokenize" compatibility="5.2.004" expanded="true" height="60" name="Tokenize" width="90" x="45" y="120"/>
<operator activated="true" class="text:filter_stopwords_dictionary" compatibility="5.2.004" expanded="true" height="76" name="Filter stopwords_pronombres_preposiciones" width="90" x="45" y="210">
<parameter key="file" value="C:\Users\marcos\Desktop\stopwords\stopwords_pronombres_preposiciones.txt"/>
</operator>
<operator activated="true" class="text:filter_stopwords_dictionary" compatibility="5.2.004" expanded="true" height="76" name="Filter stopwords_caratula" width="90" x="45" y="300">
<parameter key="file" value="C:\Users\marcos\Desktop\stopwords\stopwords_caratula.txt"/>
</operator>
<operator activated="true" class="text:filter_stopwords_english" compatibility="5.2.004" expanded="true" height="60" name="Filter Stopwords (English)" width="90" x="179" y="30"/>
<operator activated="true" class="text:filter_by_length" compatibility="5.2.004" expanded="true" height="60" name="Filter Tokens (by Length)" width="90" x="179" y="120">
<parameter key="min_chars" value="3"/>
</operator>
<operator activated="true" class="k_means" compatibility="5.2.008" expanded="true" height="76" name="Clustering" width="90" x="336" y="179"/>
<connect from_port="document" to_op="Transform Cases" to_port="document"/>
<connect from_op="Transform Cases" from_port="document" to_op="Tokenize" to_port="document"/>
<connect from_op="Tokenize" from_port="document" to_op="Filter stopwords_pronombres_preposiciones" to_port="document"/>
<connect from_op="Filter stopwords_pronombres_preposiciones" from_port="document" to_op="Filter stopwords_caratula" to_port="document"/>
<connect from_op="Filter stopwords_caratula" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
<connect from_op="Filter Stopwords (English)" from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/>
<connect from_op="Filter Tokens (by Length)" from_port="document" to_op="Clustering" to_port="example set"/>
<connect from_op="Clustering" from_port="cluster model" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<connect from_op="Process Documents from Files" from_port="example set" to_port="result 1"/>
<connect from_op="Process Documents from Files" from_port="word list" to_port="result 2"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
</process>
</operator>
</process>
-
Your process is almost perfect; you just have to move the Clustering operator out of Process Documents. That is because you want to work on the final TF-IDF data, which is only available after Process Documents has finished. Please have a look at the attached process for a working version of your process.
Best regards,
Marius
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.000">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.3.000" expanded="true" name="Process">
<process expanded="true" height="516" width="433">
<operator activated="true" class="text:process_document_from_file" compatibility="5.2.005" expanded="true" height="76" name="Process Documents from Files" width="90" x="112" y="30">
<list key="text_directories">
<parameter key="doc1" value="C:\Users\marcos\Desktop\Datos de prueba para clustering\Caso de prueba 2\En español\doc1"/>
<parameter key="doc2" value="C:\Users\marcos\Desktop\Datos de prueba para clustering\Caso de prueba 2\En español\doc2"/>
<parameter key="doc3" value="C:\Users\marcos\Desktop\Datos de prueba para clustering\Caso de prueba 2\En español\doc3"/>
</list>
<process expanded="true" height="516" width="705">
<operator activated="true" class="text:transform_cases" compatibility="5.2.005" expanded="true" height="60" name="Transform Cases" width="90" x="45" y="30"/>
<operator activated="true" class="text:tokenize" compatibility="5.2.005" expanded="true" height="60" name="Tokenize" width="90" x="180" y="30"/>
<operator activated="true" class="text:filter_stopwords_dictionary" compatibility="5.2.005" expanded="true" height="76" name="Filter stopwords_pronombres_preposiciones" width="90" x="315" y="30">
<parameter key="file" value="C:\Users\marcos\Desktop\stopwords\stopwords_pronombres_preposiciones.txt"/>
</operator>
<operator activated="true" class="text:filter_stopwords_dictionary" compatibility="5.2.005" expanded="true" height="76" name="Filter stopwords_caratula" width="90" x="450" y="30">
<parameter key="file" value="C:\Users\marcos\Desktop\stopwords\stopwords_caratula.txt"/>
</operator>
<operator activated="true" class="text:filter_stopwords_english" compatibility="5.2.005" expanded="true" height="60" name="Filter Stopwords (English)" width="90" x="581" y="30"/>
<operator activated="true" class="text:filter_by_length" compatibility="5.2.005" expanded="true" height="60" name="Filter Tokens (by Length)" width="90" x="313" y="210">
<parameter key="min_chars" value="3"/>
</operator>
<connect from_port="document" to_op="Transform Cases" to_port="document"/>
<connect from_op="Transform Cases" from_port="document" to_op="Tokenize" to_port="document"/>
<connect from_op="Tokenize" from_port="document" to_op="Filter stopwords_pronombres_preposiciones" to_port="document"/>
<connect from_op="Filter stopwords_pronombres_preposiciones" from_port="document" to_op="Filter stopwords_caratula" to_port="document"/>
<connect from_op="Filter stopwords_caratula" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
<connect from_op="Filter Stopwords (English)" from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/>
<connect from_op="Filter Tokens (by Length)" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="162"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="k_means" compatibility="5.3.000" expanded="true" height="76" name="Clustering" width="90" x="313" y="30"/>
<connect from_op="Process Documents from Files" from_port="example set" to_op="Clustering" to_port="example set"/>
<connect from_op="Clustering" from_port="cluster model" to_port="result 1"/>
<connect from_op="Clustering" from_port="clustered set" to_port="result 2"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
</process>
</operator>
</process>