"Text processing operators on example set"

laurahajnalka New Altair Community Member
edited November 5 in Community Q&A

Hello Everyone,

 

I have several CSV files that all look the same: each has 2 attributes, a word list (extracted from a document) and the occurrences of each word. First, I have to filter them. For that, I made a Stopword Dictionary. Then, I have to build one huge matrix out of them, where the remaining words are in the header and every document is one row.

The "Process Documents from Files" operator works almost perfectly, BUT the occurrences are lost. This operator wants to count occurrences itself, so each value ends up being 1 or 0, depending on whether the given word is present in a document or not. How can I use the previously counted numbers?

I also tried it with the "Read CSV", "Nominal to Text" and "Process Documents from Data" operators, but that way I can't even filter the words.

I'll also need the file names in the final matrix, at the beginning of each row. I already found out how to use an existing macro, but I don't know how to create one. I would like to create a file_name macro.
I am a newbie, so if you know the answer to one of the questions, please explain it in as much detail as possible, because what is obvious to you may not be obvious to me.

 

Thank you in advance!

Laura

Answers

  • sgenzer
    Altair Employee

    hello @laurahajnalka welcome to the community. Could you please post your XML and your data set so we can better understand what you're trying to do? You can find instructions on how to do this here.

     

    Happy RapidMining!

     

    Scott

     

  • laurahajnalka
    laurahajnalka New Altair Community Member

    Dear @sgenzer,

     

    I didn't post them earlier because I didn't have much to show, but I have now attached a sample of my dataset (there are 5000-6000 rows in one CSV) and a sample of the matrix I got.

    And here is the XML:

    <?xml version="1.0" encoding="UTF-8"?>
    -<process version="9.0.003">
    -<context>
    <input/>
    <output/>
    <macros/>
    </context>
    -<operator name="Process" expanded="true" compatibility="9.0.003" class="process" activated="true">
    <parameter value="init" key="logverbosity"/>
    <parameter value="2001" key="random_seed"/>
    <parameter value="never" key="send_mail"/>
    <parameter value="" key="notification_email"/>
    <parameter value="30" key="process_duration_for_mail"/>
    <parameter value="SYSTEM" key="encoding"/>
    -<process expanded="true">
    -<operator name="Process Documents from Files" expanded="true" compatibility="8.1.000" class="text:process_document_from_file" activated="true" y="34" x="112" width="90" height="82">
    -<list key="text_directories">
    <parameter value="C:\Users\...\teszt" key="test"/>
    </list>
    <parameter value="*" key="file_pattern"/>
    <parameter value="true" key="extract_text_only"/>
    <parameter value="true" key="use_file_extension_as_type"/>
    <parameter value="txt" key="content_type"/>
    <parameter value="UTF-8" key="encoding"/>
    <parameter value="true" key="create_word_vector"/>
    <parameter value="Binary Term Occurrences" key="vector_creation"/>
    <parameter value="false" key="add_meta_information"/>
    <parameter value="false" key="keep_text"/>
    <parameter value="none" key="prune_method"/>
    <parameter value="3.0" key="prune_below_percent"/>
    <parameter value="30.0" key="prune_above_percent"/>
    <parameter value="0.05" key="prune_below_rank"/>
    <parameter value="0.95" key="prune_above_rank"/>
    <parameter value="double_sparse_array" key="datamanagement"/>
    <parameter value="auto" key="data_management"/>
    -<process expanded="true">
    -<operator name="Tokenize" expanded="true" compatibility="8.1.000" class="text:tokenize" activated="true" y="34" x="45" width="90" height="68">
    <parameter value="non letters" key="mode"/>
    <parameter value=".:" key="characters"/>
    <parameter value="English" key="language"/>
    <parameter value="3" key="max_token_length"/>
    </operator>
    <connect to_port="document" to_op="Tokenize" from_port="document"/>
    <connect to_port="document 1" from_port="document" from_op="Tokenize"/>
    <portSpacing spacing="0" port="source_document"/>
    <portSpacing spacing="0" port="sink_document 1"/>
    <portSpacing spacing="0" port="sink_document 2"/>
    </process>
    </operator>
    <connect to_port="word list" to_op="Process Documents from Files" from_port="input 1"/>
    <connect to_port="result 1" from_port="example set" from_op="Process Documents from Files"/>
    <portSpacing spacing="0" port="source_input 1"/>
    <portSpacing spacing="0" port="source_input 2"/>
    <portSpacing spacing="0" port="sink_result 1"/>
    <portSpacing spacing="0" port="sink_result 2"/>
    </process>
    </operator>
    </process

    [Attachments: dataset.PNG, result.PNG]

  • lionelderkrikor
    lionelderkrikor New Altair Community Member

    Hi @laurahajnalka,

     

    Your XML process is broken: it can't be loaded in RapidMiner.

    Anyway, if you already have the extracted words and their occurrences, then from my point of view a solution is to use the Loop Files operator (instead of the Process Documents from XXXX operators) together with the "Append with Union" building block.

    Here are the results (with 3 made-up files):

    Word_occurence.png

    NB: the "?" character means [occurrence = 0].
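For readers who want to see the idea outside RapidMiner, here is a minimal Python sketch of the same logic: each file's word/occurrence pairs become one row, stopwords are dropped, and rows are combined over the union of all words. It is only an illustration and not part of the process in this thread. The file names, the sample contents, the stopword set and the `build_matrix` helper are all hypothetical, and missing words are filled with 0 where the RapidMiner result shows "?".

```python
import csv
import io


def build_matrix(files, stopwords):
    """Combine per-file word counts into one matrix.

    files: mapping of file name -> CSV text with "word,occurrence" rows
           (hypothetical in-memory stand-in for the real CSV files).
    stopwords: set of words to drop before building the matrix.
    """
    vocabulary = {}  # dict used as an ordered set of the remaining words
    rows = {}        # file name -> {word: occurrence}
    for name, text in files.items():
        counts = {}
        for word, occurrence in csv.reader(io.StringIO(text)):
            if word in stopwords:
                continue  # filter step (the stopword dictionary)
            counts[word] = int(occurrence)  # keep the pre-counted value
            vocabulary[word] = None
        rows[name] = counts

    # Header: file name first, then every remaining word.
    header = ["file_name"] + list(vocabulary)
    # One row per file; words absent from a file get 0.
    matrix = [
        [name] + [counts.get(word, 0) for word in vocabulary]
        for name, counts in rows.items()
    ]
    return header, matrix


# Made-up sample data, assumed to match the "word, occurrence" layout.
files = {
    "doc1.csv": "apple,3\nthe,10\nbanana,1\n",
    "doc2.csv": "banana,4\ncherry,2\nthe,7\n",
}
stopwords = {"the"}

header, matrix = build_matrix(files, stopwords)
print(header)
for row in matrix:
    print(row)
```

With real data, the in-memory strings would be replaced by the contents of each CSV file in the directory, and the file name would play the same role as the `file_name` macro used in the process.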

     

    The process:

    <?xml version="1.0" encoding="UTF-8"?><process version="9.0.003">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="9.0.003" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="concurrency:loop_files" compatibility="9.0.003" expanded="true" height="82" name="Loop Files" width="90" x="179" y="85">
    <parameter key="directory" value="C:\Users\Lionel\Documents\Formations_DataScience\Rapidminer\Tests_Rapidminer\Word_Occurence"/>
    <parameter key="filter_type" value="regex"/>
    <parameter key="filter_by_regex" value=".*"/>
    <parameter key="enable_macros" value="true"/>
    <process expanded="true">
    <operator activated="true" class="read_excel" compatibility="9.0.003" expanded="true" height="68" name="Read Excel" width="90" x="179" y="34">
    <parameter key="excel_file" value="C:\Users\Lionel\Documents\Formations_DataScience\Rapidminer\Tests_Rapidminer\Word_Occurence\Word_Occurences.xlsx"/>
    <list key="annotations"/>
    <list key="data_set_meta_data_information"/>
    </operator>
    <operator activated="true" class="transpose" compatibility="9.0.003" expanded="true" height="82" name="Transpose" width="90" x="313" y="34"/>
    <operator activated="true" class="rename_by_example_values" compatibility="9.0.003" expanded="true" height="82" name="Rename by Example Values" width="90" x="447" y="34"/>
    <operator activated="true" class="generate_id" compatibility="9.0.003" expanded="true" height="82" name="Generate ID" width="90" x="581" y="34"/>
    <operator activated="true" class="provide_macro_as_log_value" compatibility="9.0.003" expanded="true" height="68" name="Provide Macro as Log Value" width="90" x="246" y="187">
    <parameter key="macro_name" value="file_name"/>
    </operator>
    <operator activated="true" class="log" compatibility="9.0.003" expanded="true" height="68" name="Log" width="90" x="380" y="187">
    <list key="log">
    <parameter key="file_name" value="operator.Provide Macro as Log Value.value.macro_value"/>
    </list>
    </operator>
    <operator activated="true" class="log_to_data" compatibility="9.0.003" expanded="true" height="82" name="Log to Data" width="90" x="514" y="187"/>
    <operator activated="true" class="generate_id" compatibility="9.0.003" expanded="true" height="82" name="Generate ID (2)" width="90" x="648" y="187"/>
    <operator activated="true" class="concurrency:join" compatibility="9.0.003" expanded="true" height="82" name="Join" width="90" x="782" y="85">
    <parameter key="use_id_attribute_as_key" value="true"/>
    <list key="key_attributes"/>
    </operator>
    <connect from_port="file object" to_op="Read Excel" to_port="file"/>
    <connect from_op="Read Excel" from_port="output" to_op="Transpose" to_port="example set input"/>
    <connect from_op="Transpose" from_port="example set output" to_op="Rename by Example Values" to_port="example set input"/>
    <connect from_op="Rename by Example Values" from_port="example set output" to_op="Generate ID" to_port="example set input"/>
    <connect from_op="Generate ID" from_port="example set output" to_op="Join" to_port="left"/>
    <connect from_op="Log to Data" from_port="exampleSet" to_op="Generate ID (2)" to_port="example set input"/>
    <connect from_op="Generate ID (2)" from_port="example set output" to_op="Join" to_port="right"/>
    <connect from_op="Join" from_port="join" to_port="output 1"/>
    <portSpacing port="source_file object" spacing="0"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_output 1" spacing="0"/>
    <portSpacing port="sink_output 2" spacing="0"/>
    </process>
    </operator>
    <operator activated="true" class="subprocess" compatibility="9.0.003" expanded="true" height="82" name="Union Append" origin="GENERATED_COMMUNITY" width="90" x="313" y="85">
    <process expanded="true">
    <operator activated="true" class="loop_collection" compatibility="9.0.003" expanded="true" height="82" name="Output (4)" origin="GENERATED_COMMUNITY" width="90" x="45" y="34">
    <parameter key="set_iteration_macro" value="true"/>
    <process expanded="true">
    <operator activated="false" breakpoints="after" class="select" compatibility="9.0.003" expanded="true" height="68" name="Select (5)" origin="GENERATED_COMMUNITY" width="90" x="112" y="34">
    <parameter key="index" value="%{iteration}"/>
    </operator>
    <operator activated="true" class="branch" compatibility="9.0.003" expanded="true" height="82" name="Branch (2)" origin="GENERATED_COMMUNITY" width="90" x="313" y="34">
    <parameter key="condition_type" value="expression"/>
    <parameter key="expression" value="%{iteration}==1"/>
    <process expanded="true">
    <connect from_port="condition" to_port="input 1"/>
    <portSpacing port="source_condition" spacing="0"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_input 1" spacing="0"/>
    <portSpacing port="sink_input 2" spacing="0"/>
    </process>
    <process expanded="true">
    <operator activated="true" class="recall" compatibility="9.0.003" expanded="true" height="68" name="Recall (5)" origin="GENERATED_COMMUNITY" width="90" x="45" y="187">
    <parameter key="name" value="LoopData"/>
    </operator>
    <operator activated="true" class="union" compatibility="9.0.003" expanded="true" height="82" name="Union (2)" origin="GENERATED_COMMUNITY" width="90" x="179" y="34"/>
    <connect from_port="condition" to_op="Union (2)" to_port="example set 1"/>
    <connect from_op="Recall (5)" from_port="result" to_op="Union (2)" to_port="example set 2"/>
    <connect from_op="Union (2)" from_port="union" to_port="input 1"/>
    <portSpacing port="source_condition" spacing="0"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_input 1" spacing="0"/>
    <portSpacing port="sink_input 2" spacing="0"/>
    </process>
    </operator>
    <operator activated="true" class="remember" compatibility="9.0.003" expanded="true" height="68" name="Remember (5)" origin="GENERATED_COMMUNITY" width="90" x="581" y="34">
    <parameter key="name" value="LoopData"/>
    </operator>
    <connect from_port="single" to_op="Branch (2)" to_port="condition"/>
    <connect from_op="Branch (2)" from_port="input 1" to_op="Remember (5)" to_port="store"/>
    <connect from_op="Remember (5)" from_port="stored" to_port="output 1"/>
    <portSpacing port="source_single" spacing="0"/>
    <portSpacing port="sink_output 1" spacing="0"/>
    <portSpacing port="sink_output 2" spacing="0"/>
    </process>
    </operator>
    <operator activated="true" class="select" compatibility="9.0.003" expanded="true" height="68" name="Select (6)" origin="GENERATED_COMMUNITY" width="90" x="179" y="34">
    <parameter key="index" value="%{iteration}"/>
    </operator>
    <connect from_port="in 1" to_op="Output (4)" to_port="collection"/>
    <connect from_op="Output (4)" from_port="output 1" to_op="Select (6)" to_port="collection"/>
    <connect from_op="Select (6)" from_port="selected" to_port="out 1"/>
    <portSpacing port="source_in 1" spacing="0"/>
    <portSpacing port="source_in 2" spacing="0"/>
    <portSpacing port="sink_out 1" spacing="0"/>
    <portSpacing port="sink_out 2" spacing="0"/>
    </process>
    </operator>
    <operator activated="true" class="order_attributes" compatibility="9.0.003" expanded="true" height="82" name="Reorder Attributes" width="90" x="447" y="85">
    <parameter key="attribute_ordering" value="file_name"/>
    </operator>
    <connect from_op="Loop Files" from_port="output 1" to_op="Union Append" to_port="in 1"/>
    <connect from_op="Union Append" from_port="out 1" to_op="Reorder Attributes" to_port="example set input"/>
    <connect from_op="Reorder Attributes" from_port="example set output" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    </process>
    </operator>
    </process>

    I hope it helps,

     

    Regards,

     

    Lionel