Unable to get desired result using Word2Vec

shashwat01 Altair Community Member
edited November 5 in Community Q&A

 I am trying to implement the Word2Vec extension operator in RapidMiner for the Cyber Solutions Spam Detection project. Despite reading numerous forums and RapidMiner-related materials, I am still struggling to achieve my desired results.

After significant effort, I was able to run the process without errors. However, the output is empty, and I am unsure why. Additionally, the output columns include 49 dimensions, but I am not sure what those dimensions refer to. I assume they are related to the Word2Vec parameters, but I would appreciate clarification on this and on how it relates to the ham-spam analysis.

I have attached my RapidMiner (RM) file for your reference. Your guidance in understanding the correct setup and the role of each operator in the Word2Vec process would be greatly appreciated.

Files attached.

Answers

  • MartinLiebig
    Altair Employee
    Hi there,

    The collection you are trying to apply the model to is empty. Try this process:


    <?xml version="1.0" encoding="UTF-8"?><process version="10.4.000">
    
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="10.4.000" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
    <operator activated="false" class="retrieve" compatibility="10.4.000" expanded="true" height="68" name="Retrieve SPAM-text-message (2)" width="90" x="179" y="442">
    <parameter key="repository_entry" value="../../../data/Data Exploration - Unstructured/Cyber Solutions_SPAM-text-message"/>
    </operator>
    <operator activated="false" class="word2vec:Get_Vocabulary" compatibility="1.0.000" expanded="true" height="82" name="Extract Vocabulary" width="90" x="782" y="442">
    <parameter key="Get Full Vocabulary" value="false"/>
    <parameter key="Take Random Words" value="true"/>
    <parameter key="Number of Words to Pull" value="100"/>
    </operator>
    <operator activated="false" class="text:process_document_from_data" compatibility="10.0.000" expanded="true" height="82" name="Process Documents from Data" width="90" x="380" y="493">
    <parameter key="create_word_vector" value="true"/>
    <parameter key="vector_creation" value="TF-IDF"/>
    <parameter key="add_meta_information" value="true"/>
    <parameter key="keep_text" value="true"/>
    <parameter key="prune_method" value="percentual"/>
    <parameter key="prune_below_percent" value="1.0"/>
    <parameter key="prune_above_percent" value="99.0"/>
    <parameter key="prune_below_rank" value="0.05"/>
    <parameter key="prune_above_rank" value="0.95"/>
    <parameter key="datamanagement" value="double_sparse_array"/>
    <parameter key="data_management" value="auto"/>
    <parameter key="select_attributes_and_weights" value="false"/>
    <list key="specify_weights"/>
    <process expanded="true">
    <operator activated="true" class="text:tokenize" compatibility="10.0.000" expanded="true" height="68" name="Tokenize" width="90" x="45" y="34">
    <parameter key="mode" value="non letters"/>
    <parameter key="characters" value=".:"/>
    <parameter key="language" value="English"/>
    <parameter key="max_token_length" value="3"/>
    </operator>
    <operator activated="true" class="text:stem_porter" compatibility="10.0.000" expanded="true" height="68" name="Stem (Porter)" width="90" x="179" y="34"/>
    <operator activated="true" class="text:transform_cases" compatibility="10.0.000" expanded="true" height="68" name="Transform Cases" width="90" x="313" y="34">
    <parameter key="transform_to" value="lower case"/>
    </operator>
    <operator activated="true" class="text:filter_stopwords_english" compatibility="10.0.000" expanded="true" height="68" name="Filter Stopwords (English)" width="90" x="447" y="34"/>
    <connect from_port="document" to_op="Tokenize" to_port="document"/>
    <connect from_op="Tokenize" from_port="document" to_op="Stem (Porter)" to_port="document"/>
    <connect from_op="Stem (Porter)" from_port="document" to_op="Transform Cases" to_port="document"/>
    <connect from_op="Transform Cases" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
    <connect from_op="Filter Stopwords (English)" from_port="document" to_port="document 1"/>
    <portSpacing port="source_document" spacing="0"/>
    <portSpacing port="sink_document 1" spacing="0"/>
    <portSpacing port="sink_document 2" spacing="0"/>
    </process>
    </operator>
    <operator activated="false" class="split_data" compatibility="10.0.000" expanded="true" height="68" name="Split Data" width="90" x="782" y="289">
    <enumeration key="partitions"/>
    <parameter key="sampling_type" value="automatic"/>
    <parameter key="use_local_random_seed" value="false"/>
    <parameter key="local_random_seed" value="1992"/>
    </operator>
    <operator activated="false" class="multiply" compatibility="10.4.000" expanded="true" height="68" name="Multiply" width="90" x="648" y="340"/>
    <operator activated="true" class="read_excel" compatibility="10.4.000" expanded="true" height="68" name="Read Excel" width="90" x="45" y="34">
    <parameter key="excel_file" value="C:\Users\mliebig\Downloads\SPAM-text-message (1).xlsx"/>
    <parameter key="sheet_selection" value="sheet number"/>
    <parameter key="sheet_number" value="1"/>
    <parameter key="imported_cell_range" value="A1"/>
    <parameter key="encoding" value="SYSTEM"/>
    <parameter key="use_header_row" value="true"/>
    <parameter key="header_row" value="1"/>
    <parameter key="first_row_as_names" value="true"/>
    <list key="annotations"/>
    <parameter key="date_format" value=""/>
    <parameter key="time_zone" value="SYSTEM"/>
    <parameter key="locale" value="English (United States)"/>
    <parameter key="read_all_values_as_polynominal" value="false"/>
    <list key="data_set_meta_data_information">
    <parameter key="0" value="Category.true.polynominal.attribute"/>
    <parameter key="1" value="Message.true.polynominal.attribute"/>
    </list>
    <parameter key="read_not_matching_values_as_missings" value="false"/>
    </operator>
    <operator activated="true" class="select_attributes" compatibility="10.4.000" expanded="true" height="82" name="Select Attributes" width="90" x="45" y="136">
    <parameter key="attribute_filter_type" value="all"/>
    <parameter key="attribute" value=""/>
    <parameter key="attributes" value="|Category|Message"/>
    <parameter key="use_except_expression" value="false"/>
    <parameter key="value_type" value="attribute_value"/>
    <parameter key="use_value_type_exception" value="false"/>
    <parameter key="except_value_type" value="time"/>
    <parameter key="block_type" value="attribute_block"/>
    <parameter key="use_block_type_exception" value="false"/>
    <parameter key="except_block_type" value="value_matrix_row_start"/>
    <parameter key="invert_selection" value="false"/>
    <parameter key="include_special_attributes" value="false"/>
    </operator>
    <operator activated="true" class="nominal_to_text" compatibility="10.4.000" expanded="true" height="82" name="Nominal to Text" width="90" x="45" y="238">
    <parameter key="attribute_filter_type" value="subset"/>
    <parameter key="attribute" value=""/>
    <parameter key="attributes" value="|Message|Category"/>
    <parameter key="use_except_expression" value="false"/>
    <parameter key="value_type" value="nominal"/>
    <parameter key="use_value_type_exception" value="false"/>
    <parameter key="except_value_type" value="file_path"/>
    <parameter key="block_type" value="single_value"/>
    <parameter key="use_block_type_exception" value="false"/>
    <parameter key="except_block_type" value="single_value"/>
    <parameter key="invert_selection" value="false"/>
    <parameter key="include_special_attributes" value="false"/>
    </operator>
    <operator activated="true" class="set_role" compatibility="10.4.000" expanded="true" height="82" name="Set Role" width="90" x="45" y="340">
    <parameter key="attribute_name" value="Category"/>
    <parameter key="target_role" value="label"/>
    <list key="set_additional_roles"/>
    </operator>
    <operator activated="true" class="text:data_to_documents" compatibility="10.0.000" expanded="true" height="68" name="Data to Documents" width="90" x="246" y="136">
    <parameter key="select_attributes_and_weights" value="false"/>
    <list key="specify_weights"/>
    </operator>
    <operator activated="true" class="loop_collection" compatibility="10.4.000" expanded="true" height="103" name="Loop Collection" width="90" x="380" y="136">
    <parameter key="set_iteration_macro" value="true"/>
    <parameter key="macro_name" value="iteration"/>
    <parameter key="macro_start_value" value="1"/>
    <parameter key="unfold" value="false"/>
    <process expanded="true">
    <operator activated="true" class="text:tokenize" compatibility="10.0.000" expanded="true" height="68" name="Tokenize (2)" width="90" x="45" y="34">
    <parameter key="mode" value="non letters"/>
    <parameter key="characters" value=".:"/>
    <parameter key="language" value="English"/>
    <parameter key="max_token_length" value="3"/>
    </operator>
    <operator activated="true" class="text:stem_porter" compatibility="10.0.000" expanded="true" height="68" name="Stem (Porter) (2)" width="90" x="246" y="34"/>
    <operator activated="true" class="text:transform_cases" compatibility="10.0.000" expanded="true" height="68" name="Transform Cases (2)" width="90" x="380" y="34">
    <parameter key="transform_to" value="lower case"/>
    </operator>
    <operator activated="true" class="text:filter_stopwords_english" compatibility="10.0.000" expanded="true" height="68" name="Filter Stopwords (English) (2)" width="90" x="648" y="34"/>
    <connect from_port="single" to_op="Tokenize (2)" to_port="document"/>
    <connect from_op="Tokenize (2)" from_port="document" to_op="Stem (Porter) (2)" to_port="document"/>
    <connect from_op="Stem (Porter) (2)" from_port="document" to_op="Transform Cases (2)" to_port="document"/>
    <connect from_op="Transform Cases (2)" from_port="document" to_op="Filter Stopwords (English) (2)" to_port="document"/>
    <connect from_op="Filter Stopwords (English) (2)" from_port="document" to_port="output 1"/>
    <portSpacing port="source_single" spacing="0"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_output 1" spacing="0"/>
    <portSpacing port="sink_output 2" spacing="0"/>
    <portSpacing port="sink_output 3" spacing="0"/>
    </process>
    </operator>
    <operator activated="true" class="multiply" compatibility="10.4.000" expanded="true" height="103" name="Multiply (2)" width="90" x="514" y="187"/>
    <operator activated="true" class="word2vec:Word2Vec_Learner" compatibility="1.0.000" expanded="true" height="68" name="Word2Vec " width="90" x="648" y="85">
    <parameter key="Minimal Vocab Frequency" value="5"/>
    <parameter key="Layer Size" value="50"/>
    <parameter key="Window Size" value="5"/>
    <parameter key="Use Negative Samples" value="5"/>
    <parameter key="Iterations" value="5"/>
    <parameter key="Down Sampling Rate" value="1.0E-4"/>
    </operator>
    <operator activated="true" class="word2vec:Apply_Word2Vec" compatibility="1.0.000" expanded="true" height="103" name="Apply Word2Vec (Documents) " width="90" x="916" y="136"/>
    <connect from_op="Read Excel" from_port="output" to_op="Select Attributes" to_port="example set input"/>
    <connect from_op="Select Attributes" from_port="original" to_op="Nominal to Text" to_port="example set input"/>
    <connect from_op="Nominal to Text" from_port="example set output" to_op="Set Role" to_port="example set input"/>
    <connect from_op="Set Role" from_port="example set output" to_op="Data to Documents" to_port="example set"/>
    <connect from_op="Data to Documents" from_port="documents" to_op="Loop Collection" to_port="collection"/>
    <connect from_op="Loop Collection" from_port="output 1" to_op="Multiply (2)" to_port="input"/>
    <connect from_op="Multiply (2)" from_port="output 1" to_op="Word2Vec " to_port="doc"/>
    <connect from_op="Multiply (2)" from_port="output 2" to_op="Apply Word2Vec (Documents) " to_port="doc"/>
    <connect from_op="Word2Vec " from_port="mod" to_op="Apply Word2Vec (Documents) " to_port="mod"/>
    <connect from_op="Apply Word2Vec (Documents) " from_port="exa" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    </process>
    </operator>
    </process>


    Cheers,
    Martin
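
On the question about the output dimensions: the process above sets the Word2Vec "Layer Size" parameter to 50, which is the length of each learned word vector. The sketch below is a toy illustration, not the extension's implementation. It assumes a document vector is the average of the vectors of its in-vocabulary tokens, which is a common approach but is not confirmed for Apply Word2Vec (Documents). The helper names `toy_word_vectors` and `document_vector` are hypothetical, and the vectors are random rather than trained.

```python
import random

# Matches the "Layer Size" parameter in the process above.
LAYER_SIZE = 50


def toy_word_vectors(vocabulary, size=LAYER_SIZE, seed=2001):
    """Hypothetical stand-in for a trained model: one random vector per word."""
    rng = random.Random(seed)
    return {
        word: [rng.uniform(-1.0, 1.0) for _ in range(size)]
        for word in vocabulary
    }


def document_vector(tokens, word_vectors):
    """Average the vectors of known tokens (assumed behaviour).

    Returns None when no token is in the vocabulary, for example when
    the document is empty.
    """
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return None
    size = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(size)]


vectors = toy_word_vectors(["free", "prize", "call", "meet", "lunch"])
spam_doc = document_vector(["free", "prize", "call", "now"], vectors)
empty_doc = document_vector([], vectors)

print(len(spam_doc))  # one value per dimension of the layer
print(empty_doc)      # nothing to average for an empty document
```

Under this assumption, each output column is one coordinate of the embedding and has no individual meaning; the columns are only useful together as input features for a ham/spam classifier. If the columns are numbered from 0, a last column labelled 49 would correspond to 50 dimensions, though that numbering is also an assumption.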

  • shashwat01
    shashwat01 Altair Community Member
    How do I get the .rmp file for this, or how should I import the solution that you provided above?
  • MartinLiebig
    Altair Employee
    Hi,
    Check out this thread, where Ingo explains how to import XML code as a process: https://community.rapidminer.com/discussion/50470/import-xml-code-to-process


    By the way, an RMP file is just XML. You can also save the XML in a file with the .rmp extension and load it.

    Best,
    Martin
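
The tip that an RMP file is just XML can be sketched as a small script: save the pasted process XML to a file with an .rmp extension. This is a minimal sketch, not part of RapidMiner. The function name `save_process_as_rmp` and the sample XML are hypothetical, and the check for a `<process>` root element is based only on the process shown in this thread.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path


def save_process_as_rmp(xml_text, target):
    """Save pasted process XML as an .rmp file after a basic sanity check."""
    cleaned = xml_text.strip()
    # Parse as bytes so the encoding declaration in the XML header is honoured.
    root = ET.fromstring(cleaned.encode("utf-8"))
    if root.tag != "process":
        raise ValueError(f"expected a <process> root element, got <{root.tag}>")
    path = Path(target).with_suffix(".rmp")
    path.write_text(cleaned, encoding="utf-8")
    return path


# Hypothetical, heavily shortened stand-in for the XML posted in the thread.
sample_xml = """
<?xml version="1.0" encoding="UTF-8"?><process version="10.4.000">
<context><input/><output/><macros/></context>
</process>
"""

with tempfile.TemporaryDirectory() as folder:
    saved = save_process_as_rmp(sample_xml, Path(folder) / "word2vec_spam.xml")
    print(saved.name)
```

In practice you would paste the full XML from the answer into a text file and pass its contents to the function, then open the resulting file in RapidMiner.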