Text Processing Help! (Beginner at Rapidminer)

antioquia_jonas New Altair Community Member
edited November 5 in Community Q&A

I'm new to RapidMiner and want to generate n-grams from an Excel file that contains comments and replies from forum posts. My process currently contains the following operators: Data, Process Documents (with Tokenize, Filter Stopwords English, Generate n-grams, and Filter Tokens by Length), and Write Excel. I am not sure why my results show every possible combination of words in the data instead of only the combinations that occur twice or more. Maybe I'm missing an important detail. I really need urgent help! TIA!
(The images below show my current problem.)
[Image: proper.PNG] What I want it to look like
[Image: improper.PNG] What it actually looks like

Answers

  • ahootanha New Altair Community Member
    Hello,
    I want to extract the five words with the highest TF-IDF from the output TF-IDF matrix.
    How should I do this?
    Thanks
  • lionelderkrikor New Altair Community Member

    Hi @antioquia_jonas,

    Here is a process that extracts each token and its number of occurrences and writes them to an Excel file.

    I don't know how to create the "string" attribute (where the token is repeated n times).

    You will need to adapt this process to your own data:

    <?xml version="1.0" encoding="UTF-8"?><process version="8.1.000">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="8.1.000" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="social_media:search_twitter" compatibility="8.1.000" expanded="true" height="68" name="Search Twitter" width="90" x="45" y="34">
    <parameter key="connection" value="dkk"/>
    <parameter key="query" value="tesla"/>
    </operator>
    <operator activated="true" class="nominal_to_text" compatibility="8.1.000" expanded="true" height="82" name="Nominal to Text" width="90" x="179" y="34">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="Text"/>
    </operator>
    <operator activated="true" class="select_attributes" compatibility="8.1.000" expanded="true" height="82" name="Select Attributes" width="90" x="313" y="34">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="Text"/>
    </operator>
    <operator activated="true" class="text:process_document_from_data" compatibility="8.1.000" expanded="true" height="82" name="Process Documents from Data" width="90" x="514" y="34">
    <parameter key="add_meta_information" value="false"/>
    <parameter key="prune_method" value="absolute"/>
    <parameter key="prune_below_absolute" value="1"/>
    <parameter key="prune_above_absolute" value="5"/>
    <list key="specify_weights"/>
    <process expanded="true">
    <operator activated="true" class="text:tokenize" compatibility="8.1.000" expanded="true" height="68" name="Tokenize" width="90" x="380" y="34"/>
    <connect from_port="document" to_op="Tokenize" to_port="document"/>
    <connect from_op="Tokenize" from_port="document" to_port="document 1"/>
    <portSpacing port="source_document" spacing="0"/>
    <portSpacing port="sink_document 1" spacing="0"/>
    <portSpacing port="sink_document 2" spacing="0"/>
    </process>
    </operator>
    <operator activated="true" class="text:wordlist_to_data" compatibility="8.1.000" expanded="true" height="82" name="WordList to Data" width="90" x="648" y="34"/>
    <operator activated="true" class="sort" compatibility="8.1.000" expanded="true" height="82" name="Sort" width="90" x="782" y="136">
    <parameter key="attribute_name" value="total"/>
    <parameter key="sorting_direction" value="decreasing"/>
    </operator>
    <operator activated="true" class="rename" compatibility="8.1.000" expanded="true" height="82" name="Rename" width="90" x="916" y="136">
    <parameter key="old_name" value="word"/>
    <parameter key="new_name" value="token"/>
    <list key="rename_additional_attributes">
    <parameter key="in documents" value="Amount (in documents)"/>
    <parameter key="total" value="Amount"/>
    </list>
    </operator>
    <operator activated="true" class="write_excel" compatibility="8.1.000" expanded="true" height="82" name="Write Excel" width="90" x="1050" y="136">
    <parameter key="excel_file" value="C:\Users\Lionel\Documents\Formations_DataScience\Rapidminer\Tests_Rapidminer\Count_TF_IDF\Count_TF_IDF.xlsx"/>
    </operator>
    <connect from_op="Search Twitter" from_port="output" to_op="Nominal to Text" to_port="example set input"/>
    <connect from_op="Nominal to Text" from_port="example set output" to_op="Select Attributes" to_port="example set input"/>
    <connect from_op="Select Attributes" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
    <connect from_op="Process Documents from Data" from_port="example set" to_port="result 1"/>
    <connect from_op="Process Documents from Data" from_port="word list" to_op="WordList to Data" to_port="word list"/>
    <connect from_op="WordList to Data" from_port="example set" to_op="Sort" to_port="example set input"/>
    <connect from_op="Sort" from_port="example set output" to_op="Rename" to_port="example set input"/>
    <connect from_op="Rename" from_port="example set output" to_op="Write Excel" to_port="input"/>
    <connect from_op="Write Excel" from_port="through" to_port="result 2"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    <portSpacing port="sink_result 3" spacing="0"/>
    </process>
    </operator>
    </process>
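
    To get closer to the original question (n-grams that occur twice or more), one possible adaptation of the "Process Documents from Data" operator above is sketched below. Treat it as a starting point rather than a tested process: the operator classes text:filter_stopwords_english, text:generate_n_grams_terms and text:filter_by_length, and the parameter keys max_length, min_chars and max_chars, are assumptions about the Text Processing extension and should be checked against your installed version. The values 2, 3, 50 and 100000 are placeholders.

    ```xml
    <!-- Sketch only: operator classes other than text:tokenize, and the keys max_length, min_chars and max_chars, are assumptions -->
    <operator activated="true" class="text:process_document_from_data" compatibility="8.1.000" expanded="true" name="Process Documents from Data">
      <parameter key="add_meta_information" value="false"/>
      <parameter key="prune_method" value="absolute"/>
      <!-- Keep only tokens that appear in at least 2 documents (rows) -->
      <parameter key="prune_below_absolute" value="2"/>
      <!-- Placeholder upper bound: set it at or above your number of rows so frequent tokens are kept -->
      <parameter key="prune_above_absolute" value="100000"/>
      <list key="specify_weights"/>
      <process expanded="true">
        <operator activated="true" class="text:tokenize" compatibility="8.1.000" expanded="true" name="Tokenize"/>
        <operator activated="true" class="text:filter_stopwords_english" compatibility="8.1.000" expanded="true" name="Filter Stopwords (English)"/>
        <operator activated="true" class="text:generate_n_grams_terms" compatibility="8.1.000" expanded="true" name="Generate n-Grams (Terms)">
          <!-- 2 = single words plus two-word combinations -->
          <parameter key="max_length" value="2"/>
        </operator>
        <operator activated="true" class="text:filter_by_length" compatibility="8.1.000" expanded="true" name="Filter Tokens (by Length)">
          <parameter key="min_chars" value="3"/>
          <parameter key="max_chars" value="50"/>
        </operator>
        <connect from_port="document" to_op="Tokenize" to_port="document"/>
        <connect from_op="Tokenize" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
        <connect from_op="Filter Stopwords (English)" from_port="document" to_op="Generate n-Grams (Terms)" to_port="document"/>
        <connect from_op="Generate n-Grams (Terms)" from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/>
        <connect from_op="Filter Tokens (by Length)" from_port="document" to_port="document 1"/>
      </process>
    </operator>
    ```

    Note that absolute pruning counts the number of documents a token appears in, not its total number of occurrences, so an n-gram that occurs twice inside a single comment would still be dropped. If total occurrences are what matters, an alternative is to leave pruning off and filter the "total" column (the same column the Sort operator uses) after "WordList to Data", keeping rows where total is 2 or more.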

    I hope it helps,

    Regards,

    Lionel