K-means clustering over 8000 text files

eman_alahmadi New Altair Community Member
edited November 5 in Community Q&A

Hi, I'm new to this platform. I want to use k-means to cluster 8000 text files that contain the tags of 8000 images. Is it possible to do this with RapidMiner? If so, what values of k and max runs should I choose?

 

Regards

Answers

  • Thomas_Ott
    Thomas_Ott New Altair Community Member

    Yes, you can do that with RapidMiner. Just to be sure: the text files don't contain actual images, like JPGs or PNGs? If you want to do image mining, you have to install the Image Mining extension.

     

    With respect to the optimal number of clusters, I usually use X-Means to figure that out automatically.

     

    Here's a sample process that will get you started. You will need to install the Text Mining extension to do this.

     

    <?xml version="1.0" encoding="UTF-8"?>
    <process version="7.4.000">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="7.4.000" expanded="true" name="Process">
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
    <operator activated="true" class="social_media:search_twitter" compatibility="7.3.000" expanded="true" height="68" name="Search Twitter" width="90" x="45" y="34">
    <parameter key="connection" value="NewConnection"/>
    <parameter key="query" value="rapidminer"/>
    <parameter key="language" value="en"/>
    </operator>
    <operator activated="true" class="select_attributes" compatibility="7.4.000" expanded="true" height="82" name="Select Attributes" width="90" x="179" y="34">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="Text"/>
    <parameter key="include_special_attributes" value="true"/>
    </operator>
    <operator activated="true" class="nominal_to_text" compatibility="7.4.000" expanded="true" height="82" name="Nominal to Text" width="90" x="313" y="34">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="Text"/>
    </operator>
    <operator activated="true" class="text:process_document_from_data" compatibility="7.4.001" expanded="true" height="82" name="Process Documents from Data" width="90" x="447" y="34">
    <parameter key="prune_method" value="percentual"/>
    <parameter key="prune_below_percent" value="5.0"/>
    <parameter key="prune_above_percent" value="50.0"/>
    <parameter key="prune_below_absolute" value="100"/>
    <parameter key="prune_above_absolute" value="500"/>
    <list key="specify_weights"/>
    <process expanded="true">
    <operator activated="true" class="text:replace_tokens" compatibility="7.4.001" expanded="true" height="68" name="Replace Tokens (2)" width="90" x="45" y="34">
    <list key="replace_dictionary">
    <parameter key="http.*" value="link"/>
    </list>
    </operator>
    <operator activated="true" class="text:tokenize" compatibility="7.4.001" expanded="true" height="68" name="Tokenize" width="90" x="179" y="34">
    <parameter key="characters" value=" .!;:[,"/>
    </operator>
    <operator activated="true" class="text:transform_cases" compatibility="7.4.001" expanded="true" height="68" name="Transform Cases" width="90" x="313" y="34"/>
    <operator activated="true" class="text:filter_by_length" compatibility="7.4.001" expanded="true" height="68" name="Filter Tokens (by Length)" width="90" x="447" y="34"/>
    <operator activated="true" class="text:filter_tokens_by_content" compatibility="7.4.001" expanded="true" height="68" name="Filter Tokens (by Content)" width="90" x="581" y="34">
    <parameter key="string" value="link"/>
    <parameter key="invert condition" value="true"/>
    </operator>
    <operator activated="true" class="text:filter_stopwords_english" compatibility="7.4.001" expanded="true" height="68" name="Filter Stopwords (English)" width="90" x="715" y="34"/>
    <connect from_port="document" to_op="Replace Tokens (2)" to_port="document"/>
    <connect from_op="Replace Tokens (2)" from_port="document" to_op="Tokenize" to_port="document"/>
    <connect from_op="Tokenize" from_port="document" to_op="Transform Cases" to_port="document"/>
    <connect from_op="Transform Cases" from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/>
    <connect from_op="Filter Tokens (by Length)" from_port="document" to_op="Filter Tokens (by Content)" to_port="document"/>
    <connect from_op="Filter Tokens (by Content)" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
    <connect from_op="Filter Stopwords (English)" from_port="document" to_port="document 1"/>
    <portSpacing port="source_document" spacing="0"/>
    <portSpacing port="sink_document 1" spacing="0"/>
    <portSpacing port="sink_document 2" spacing="0"/>
    </process>
    </operator>
    <operator activated="true" class="x_means" compatibility="7.4.000" expanded="true" height="82" name="X-Means" width="90" x="581" y="34">
    <parameter key="numerical_measure" value="CosineSimilarity"/>
    <parameter key="divergence" value="SquaredEuclideanDistance"/>
    </operator>
    <connect from_op="Search Twitter" from_port="output" to_op="Select Attributes" to_port="example set input"/>
    <connect from_op="Select Attributes" from_port="example set output" to_op="Nominal to Text" to_port="example set input"/>
    <connect from_op="Nominal to Text" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
    <connect from_op="Process Documents from Data" from_port="example set" to_op="X-Means" to_port="example set"/>
    <connect from_op="X-Means" from_port="cluster model" to_port="result 1"/>
    <connect from_op="X-Means" from_port="clustered set" to_port="result 2"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    <portSpacing port="sink_result 3" spacing="0"/>
    </process>
    </operator>
    </process>

     

  • eman_alahmadi
    eman_alahmadi New Altair Community Member

    Hello, thank you for your reply.

    I started by installing the Text Processing extension from Extensions and Updates. Then I dragged in the "Process Documents from Files" operator and clicked "Edit List" next to the "text directories" label to choose the files I want to cluster. Next, I opened the "Process Documents from Files" operator (by double-clicking it) and inserted the "Extract Content" operator, followed by the "Tokenize" operator. Then I left the "Process Documents from Files" subprocess and dragged the standard k-Means operator into the Main Process after the "Process Documents from Files" operator. I set k = 89 and max runs = 8000. Finally, I pressed the "Play" button. It has now been running for 5 hours 16 minutes and has not finished. Is that normal? Why has the run not finished yet?

     

     

    Regarding "W.R.T. to the # of optimal clusters. I usually use X-means to figure that out automatically": can you explain how I can do that?

     

     

    Best regards.

     

     

  • eman_alahmadi
    eman_alahmadi New Altair Community Member

    This is the screenshot. Could you help me, please?

    rapid.png

     

     

     

     

     

  • Thomas_Ott
    Thomas_Ott New Altair Community Member

    It's quite possible that it could take 10 hours; it's hard to say without knowing how wide your data set got after the text processing. I would consider pruning and getting your data set fully text processed before you do the clustering. This way you can speed up the process. Why do you need 89 clusters anyway?

  • eman_alahmadi
    eman_alahmadi New Altair Community Member

    Do you mean it's better to do the text processing first and then the clustering?

     

    I chose k = 89 because it is the closest integer to the square root of 8000. Would you mind telling me how I can choose it automatically? Also, if the laptop restarts, does the run start over or resume from where it stopped?

     

    Many thanks.
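
    The square-root rule of thumb mentioned above can be checked with a couple of lines of Python. This is only a sketch of that heuristic; the function name `sqrt_rule_k` is made up for illustration and is not part of RapidMiner.

```python
import math


def sqrt_rule_k(n_items: int) -> int:
    """Return the integer closest to sqrt(n_items), never below 1.

    Hypothetical helper illustrating the rule of thumb from the post.
    """
    return max(1, round(math.sqrt(n_items)))


# sqrt(8000) is about 89.44, so the nearest integer is 89.
print(sqrt_rule_k(8000))
```

    The heuristic only depends on the number of documents, not on their content, so it is a starting point rather than an optimal value.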

  • Thomas_Ott
    Thomas_Ott New Altair Community Member

    Just use the X-Means operator and set the k limits. The default has a min of 2 and a max of 60.

     

    What I would do is put a Store operator right after the EXA port of the Process Documents from Files operator. This way you can save the processed text and inspect it. You could also try a Sample operator to take a random sample of maybe 500 rows to see how long it would take to process then.  
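
    Outside RapidMiner, the same idea of testing on a random sample of about 500 rows can be sketched in Python. This is not the Sample operator itself; `sample_rows`, the seed, and the file names are assumptions made for illustration.

```python
import random


def sample_rows(rows, n=500, seed=42):
    """Return a reproducible random sample of n rows without replacement.

    If fewer than n rows exist, all rows are returned unchanged.
    """
    rng = random.Random(seed)  # fixed seed so repeated runs pick the same rows
    if n >= len(rows):
        return list(rows)
    return rng.sample(rows, n)


# Hypothetical file names standing in for the 8000 tag files.
file_names = [f"{i}.txt" for i in range(8000)]
subset = sample_rows(file_names, n=500)
print(len(subset))
```

    Timing the full pipeline on the subset first gives a rough feel for how long the complete run might take before committing to it.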

     

    In cases like this we usually suggest you use a RapidMiner Server on a dedicated box with lots of memory and cores. Of course that pre-supposes that you have a license that will unlock the cores and memory on the Server. 

  • eman_alahmadi
    eman_alahmadi New Altair Community Member

    rapid.png

    The output still has not appeared. Is that possible?

  • Thomas_Ott
    Thomas_Ott New Altair Community Member

    Based on it being 2% done, you'll have to wait about 98 days for it to finish.

     

    You must have a very, very wide data set. Did you try the sampling as I proposed? You might have to do some heavy pruning of your text files too.
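
    An estimate like the one above follows from simple proportional extrapolation: if a fraction p of the work took time t, the remainder takes roughly t * (1 - p) / p. The sketch below assumes constant progress speed and an elapsed time of two days; the thread does not state the actual elapsed time, so that number is purely illustrative.

```python
def remaining_time(elapsed: float, fraction_done: float) -> float:
    """Estimate the remaining time, assuming progress continues at a constant rate.

    elapsed and the return value share the same unit (e.g. days).
    """
    if not 0 < fraction_done <= 1:
        raise ValueError("fraction_done must be in (0, 1]")
    return elapsed * (1 - fraction_done) / fraction_done


# Assumed input: 2 days elapsed at 2% progress (illustrative only).
print(round(remaining_time(2.0, 0.02), 1))
```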

  • eman_alahmadi
    eman_alahmadi New Altair Community Member

    Not really; I'm a beginner at this. Would you mind explaining the steps for using sampling?

     

    Thanks in advance.

  • Thomas_Ott
    Thomas_Ott New Altair Community Member

    OK, here are a few things you should try to make this more manageable. For testing purposes, put a Sample operator right after the EXA port of the Process Documents from Files operator. The default value is 100 rows. Use that for the time being.

     

    Next, make sure you toggle on pruning in the Process Documents from Files operator. I typically use the percentual method with the default values of 3% and 30%. This should take a lot of junk out of the text documents. I would even go further and use a Filter Tokens operator inside the Process Documents operator.

     

    Start small and work up from there. 
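
    To make the effect of percentual pruning concrete, here is a minimal Python sketch of the general idea: drop every token that occurs in too small or too large a share of the documents. It is an illustration under assumptions, not RapidMiner's implementation; `prune_percentual` and the toy tags are hypothetical, and the exact boundary handling in RapidMiner may differ.

```python
from collections import Counter


def prune_percentual(docs, below=3.0, above=30.0):
    """Keep only tokens whose document frequency (in percent) lies in [below, above].

    docs is a list of token lists; the result has the same shape.
    """
    n_docs = len(docs)
    if n_docs == 0:
        return []
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each token once per document
    keep = {
        token
        for token, count in doc_freq.items()
        if below <= 100.0 * count / n_docs <= above
    }
    return [[t for t in tokens if t in keep] for tokens in docs]


# Toy corpus of 10 documents: "photo" is in 100%, "beach" in 20%, "xyzzy" in 10%.
docs = [["photo", "beach"], ["photo", "beach"], ["photo", "xyzzy"]] + [["photo"]] * 7
# Thresholds raised for the tiny example; only "beach" falls inside 15%..30%.
pruned = prune_percentual(docs, below=15.0, above=30.0)
print(pruned[:3])
```

    Every token removed this way is one fewer column in the resulting data set, which is why pruning narrows a very wide data set before clustering.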

  • eman_alahmadi
    eman_alahmadi New Altair Community Member

    Should the first step look like this?

    rapid.png



    @Thomas_Ott wrote:

     

     

    Next, make sure you toggle on pruning in the Process Documents from Files operator. I typically use the percentual method with the default values of 3% and 30%. This should take a lot of junk out of the text documents. I would even go further and use a Filter Tokens operator inside the Process Documents operator.

     

    Start small and work up from there. 


    What do you mean by the line marked in red? How can I toggle on pruning?
    Thank you for your help. Regards.
  • Thomas_Ott
    Thomas_Ott New Altair Community Member

    Pruning is toggled on in the Process Documents from Files operator. There is a parameter called "prune method"; enable it and select "percentual".

     

    You should confirm how wide your data set gets after the text processing. This is likely the problem.

  • eman_alahmadi
    eman_alahmadi New Altair Community Member

    Hello, is there an operator I can use at the start to remove all non-English words? Many of the tags in my text files are in different languages, so this could help reduce the size of the files.

     

    Regards.

  • Thomas_Ott
    Thomas_Ott New Altair Community Member

    Are the files you load into the Process Documents from Files operator a mix of English and non-English? If so, just separate out the non-English ones and run again. Unless there is some metadata in your texts that can be extracted to give you "lang = en", there is no easy way that I know of to do it.

     

    Some possible workarounds are maybe using the NameSor extension or even the Rosette extension, there might be some auto-language support in them.

  • eman_alahmadi
    eman_alahmadi New Altair Community Member

    Yes, my files contain mixed languages, but I have now used a script to remove the non-English words. I have another question: after I run k-means in RapidMiner and get the output, can I use the source code of the output to transform it into a specific format in a text file, like this:

    # 0
    @ 192 100886.txt
    @ 814 1034.txt
    @ 988 1042.txt
    @ 1854 107663.txt
    @ 1961 1081.txt
    @ 2011 1084.txt
    @ 2082 1086.txt
    @ 2188 1090.txt
    # 0
    @ 192 100886.txt
    @ 814 1034.txt
    @ 988 1042.txt
    .........

    and so on, where # refers to the cluster number and @ refers to the text file.

     

     

    Regards.

  • eman_alahmadi
    eman_alahmadi New Altair Community Member

    Hello, after running the script that removes all non-English words, numbers, and punctuation, I have a much smaller folder, maybe half the previous size. The folder contains 8000 text files. Which operators are enough to run k-means clustering over these files? I think I have to use the "Process Documents from Files" operator (with the "Tokenize" and "Transform Cases" operators inside it) and the k-Means operator.

     

    I look forward to your response. Regards.

  • Thomas_Ott
    Thomas_Ott New Altair Community Member

    You can use the Extract Cluster Prototypes operator to convert the results and save them as an example set.

  • Thomas_Ott
    Thomas_Ott New Altair Community Member

    I don't understand your question. It sounds like you have a process that will text process your data and then cluster it afterwards.

  • eman_alahmadi
    eman_alahmadi New Altair Community Member
    I mean, can I use the source code of the output and save the result in text files in the format I want?


    Regards.
  • Thomas_Ott
    Thomas_Ott New Altair Community Member

    You can use Write CSV (you can configure it to write a .txt file) or Write File.
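
    Once the clustered data has been exported as CSV, a short script outside RapidMiner could regroup it into the "# cluster / @ id file" layout asked for earlier in the thread. The sketch below is only one possible approach: the column names `id`, `file`, and `cluster`, the `cluster_N` label form, and the function name are assumptions about the export, and the row for cluster 1 is invented sample data.

```python
import csv
import io
from collections import defaultdict


def to_cluster_listing(csv_text: str) -> str:
    """Group CSV rows by cluster and render '# <cluster>' / '@ <id> <file>' lines.

    Assumes columns named id, file and cluster (labels such as 'cluster_0').
    """
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        cluster_no = int(row["cluster"].replace("cluster_", ""))
        groups[cluster_no].append((int(row["id"]), row["file"]))

    lines = []
    for cluster_no in sorted(groups):
        lines.append(f"# {cluster_no}")
        for row_id, file_name in sorted(groups[cluster_no]):
            lines.append(f"@ {row_id} {file_name}")
    return "\n".join(lines) + "\n"


# Hypothetical export; the last row is made up to show a second cluster.
example_csv = (
    "id,file,cluster\n"
    "814,1034.txt,cluster_0\n"
    "192,100886.txt,cluster_0\n"
    "7,55.txt,cluster_1\n"
)
print(to_cluster_listing(example_csv), end="")
```

    Writing the returned string to a text file with `open(path, "w")` would then produce the desired listing.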

  • eman_alahmadi
    eman_alahmadi New Altair Community Member

    When I double-click an object displayed in the Folder View of a cluster model, I get: "No visualization available for an object with id 1,020,793!". The result is the same for every item in the Folder View. How can I solve this, please?

     

    Regards

  • Thomas_Ott
    Thomas_Ott New Altair Community Member

    That just means the data it's trying to visualize doesn't lend itself to visualization. Do you need this for a reason?