"Using the Cross Distances Operator with Text Attributes"

evansh
evansh New Altair Community Member
edited November 2024 in Community Q&A
I'm trying to calculate the cosine similarity of a request set of documents against a larger reference set. However, the Cross Distances operator only ever returns 0 or 1. If I instead use Process Documents to create word vectors and feed those to Cross Distances, it simply doesn't work. I assume the latter is because the word vectors for the request set and the reference set are made up of different attributes, since each set produces its own vocabulary. I've been stuck on this for a while, so I'd really appreciate some help.
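To illustrate the suspected cause outside RapidMiner, here is a minimal Python sketch (standard library only) of cosine similarity between a request set and a reference set. It assumes the vocabulary and IDF weights are taken from the reference set alone, so both sets end up with the same attributes. The function names, the simple tokenizer, and the `log(n/df) + 1` IDF smoothing are illustrative assumptions, not RapidMiner's actual implementation.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and split on non-letters (a rough stand-in for
    # Tokenize + Transform Cases; no stemming or stopword removal).
    return [t for t in re.split(r"[^a-z]+", text.lower()) if t]


def build_vocabulary(reference_docs):
    # Vocabulary and IDF weights come from the reference set only.
    # The "+ 1.0" smoothing is an assumption made for this sketch.
    tokenized = [tokenize(d) for d in reference_docs]
    doc_freq = Counter(term for doc in tokenized for term in set(doc))
    n = len(tokenized)
    vocab = sorted(doc_freq)
    idf = {t: math.log(n / doc_freq[t]) + 1.0 for t in vocab}
    return vocab, idf


def vectorize(text, vocab, idf):
    # Terms that are not in the shared vocabulary are silently dropped,
    # so every vector has exactly the same dimensions.
    counts = Counter(tokenize(text))
    return [counts[t] * idf[t] for t in vocab]


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an all-zero vector has no direction
    return dot / (norm_a * norm_b)


def cross_similarities(request_docs, reference_docs):
    # Returns (request_index, reference_index, similarity) for every pair.
    vocab, idf = build_vocabulary(reference_docs)
    ref_vectors = [vectorize(d, vocab, idf) for d in reference_docs]
    rows = []
    for i, req in enumerate(request_docs):
        req_vector = vectorize(req, vocab, idf)
        for j, ref_vector in enumerate(ref_vectors):
            rows.append((i, j, cosine_similarity(req_vector, ref_vector)))
    return rows
```

Calling `cross_similarities(request_docs, reference_docs)` with two lists of strings yields one row per request/reference pair. In RapidMiner terms this roughly corresponds to building both word vectors from a single shared word list, although the sketch itself does not use any RapidMiner API.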


<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="6.5.002">
  <operator activated="true" class="retrieve" compatibility="6.5.002" expanded="true" height="60" name="Retrieve Scraped DB Files" width="90" x="45" y="390">
    <parameter key="repository_entry" value="//Local Repository/processes/SSR/Novelty Mining/Scraped DB Files"/>
  </operator>
</process>
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="6.5.002">
  <operator activated="true" class="nominal_to_text" compatibility="6.5.002" expanded="true" height="76" name="Nominal to Text" width="90" x="179" y="390">
    <parameter key="attribute_filter_type" value="all"/>
    <parameter key="attribute" value=""/>
    <parameter key="attributes" value=""/>
    <parameter key="use_except_expression" value="false"/>
    <parameter key="value_type" value="nominal"/>
    <parameter key="use_value_type_exception" value="false"/>
    <parameter key="except_value_type" value="file_path"/>
    <parameter key="block_type" value="single_value"/>
    <parameter key="use_block_type_exception" value="false"/>
    <parameter key="except_block_type" value="single_value"/>
    <parameter key="invert_selection" value="false"/>
    <parameter key="include_special_attributes" value="true"/>
  </operator>
</process>
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="6.5.002">
  <operator activated="false" class="text:data_to_documents" compatibility="6.5.000" expanded="true" height="60" name="Data to Documents (3)" width="90" x="313" y="615">
    <parameter key="select_attributes_and_weights" value="true"/>
    <list key="specify_weights">
      <parameter key="Description" value="1.0"/>
      <parameter key="Indication" value="1.0"/>
      <parameter key="Mechanism Of Action" value="1.0"/>
      <parameter key="Name" value="1.0"/>
      <parameter key="Pharmacodynamics" value="1.0"/>
    </list>
  </operator>
</process>
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="6.5.002">
  <operator activated="false" class="text:process_documents" compatibility="6.5.000" expanded="true" height="94" name="Process Documents (3)" width="90" x="447" y="615">
    <parameter key="create_word_vector" value="true"/>
    <parameter key="vector_creation" value="TF-IDF"/>
    <parameter key="add_meta_information" value="true"/>
    <parameter key="keep_text" value="false"/>
    <parameter key="prune_method" value="none"/>
    <parameter key="prune_below_percent" value="3.0"/>
    <parameter key="prune_above_percent" value="30.0"/>
    <parameter key="prune_below_rank" value="0.05"/>
    <parameter key="prune_above_rank" value="0.95"/>
    <parameter key="datamanagement" value="double_sparse_array"/>
    <process expanded="true">
      <operator activated="true" class="text:tokenize" compatibility="6.5.000" expanded="true" height="60" name="Tokenize (3)" width="90" x="179" y="75">
        <parameter key="mode" value="non letters"/>
        <parameter key="characters" value=".:"/>
        <parameter key="language" value="English"/>
        <parameter key="max_token_length" value="3"/>
      </operator>
      <operator activated="true" class="text:filter_stopwords_english" compatibility="6.5.000" expanded="true" height="60" name="Filter Stopwords (3)" width="90" x="179" y="165"/>
      <operator activated="true" class="text:transform_cases" compatibility="6.5.000" expanded="true" height="60" name="Transform Cases (3)" width="90" x="179" y="255">
        <parameter key="transform_to" value="lower case"/>
      </operator>
      <operator activated="true" class="text:stem_snowball" compatibility="6.5.000" expanded="true" height="60" name="Stem (3)" width="90" x="313" y="75">
        <parameter key="language" value="English"/>
      </operator>
      <operator activated="false" class="text:generate_n_grams_terms" compatibility="6.5.000" expanded="true" height="60" name="Generate n-Grams (3)" width="90" x="313" y="165">
        <parameter key="max_length" value="2"/>
      </operator>
      <connect from_port="document" to_op="Tokenize (3)" to_port="document"/>
      <connect from_op="Tokenize (3)" from_port="document" to_op="Filter Stopwords (3)" to_port="document"/>
      <connect from_op="Filter Stopwords (3)" from_port="document" to_op="Transform Cases (3)" to_port="document"/>
      <connect from_op="Transform Cases (3)" from_port="document" to_op="Stem (3)" to_port="document"/>
      <connect from_op="Stem (3)" from_port="document" to_port="document 1"/>
      <portSpacing port="source_document" spacing="0"/>
      <portSpacing port="sink_document 1" spacing="0"/>
      <portSpacing port="sink_document 2" spacing="0"/>
    </process>
  </operator>
</process>
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="6.5.002">
  <operator activated="false" class="text:data_to_documents" compatibility="6.5.000" expanded="true" height="60" name="Data to Documents (2)" width="90" x="313" y="480">
    <parameter key="select_attributes_and_weights" value="true"/>
    <list key="specify_weights">
      <parameter key="Content" value="1.0"/>
      <parameter key="Title" value="1.0"/>
    </list>
  </operator>
</process>
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="6.5.002">
  <operator activated="false" class="text:process_documents" compatibility="6.5.000" expanded="true" height="94" name="Process Documents (2)" width="90" x="447" y="435">
    <parameter key="create_word_vector" value="true"/>
    <parameter key="vector_creation" value="TF-IDF"/>
    <parameter key="add_meta_information" value="true"/>
    <parameter key="keep_text" value="false"/>
    <parameter key="prune_method" value="none"/>
    <parameter key="prune_below_percent" value="3.0"/>
    <parameter key="prune_above_percent" value="30.0"/>
    <parameter key="prune_below_rank" value="0.05"/>
    <parameter key="prune_above_rank" value="0.95"/>
    <parameter key="datamanagement" value="double_sparse_array"/>
    <process expanded="true">
      <operator activated="true" class="text:tokenize" compatibility="6.5.000" expanded="true" height="60" name="Tokenize (2)" width="90" x="179" y="75">
        <parameter key="mode" value="non letters"/>
        <parameter key="characters" value=".:"/>
        <parameter key="language" value="English"/>
        <parameter key="max_token_length" value="3"/>
      </operator>
      <operator activated="true" class="text:filter_stopwords_english" compatibility="6.5.000" expanded="true" height="60" name="Filter Stopwords (2)" width="90" x="179" y="165"/>
      <operator activated="true" class="text:transform_cases" compatibility="6.5.000" expanded="true" height="60" name="Transform Cases (2)" width="90" x="179" y="255">
        <parameter key="transform_to" value="lower case"/>
      </operator>
      <operator activated="true" class="text:stem_snowball" compatibility="6.5.000" expanded="true" height="60" name="Stem (2)" width="90" x="313" y="75">
        <parameter key="language" value="English"/>
      </operator>
      <operator activated="false" class="text:generate_n_grams_terms" compatibility="6.5.000" expanded="true" height="60" name="Generate n-Grams (2)" width="90" x="313" y="165">
        <parameter key="max_length" value="2"/>
      </operator>
      <connect from_port="document" to_op="Tokenize (2)" to_port="document"/>
      <connect from_op="Tokenize (2)" from_port="document" to_op="Filter Stopwords (2)" to_port="document"/>
      <connect from_op="Filter Stopwords (2)" from_port="document" to_op="Transform Cases (2)" to_port="document"/>
      <connect from_op="Transform Cases (2)" from_port="document" to_op="Stem (2)" to_port="document"/>
      <connect from_op="Stem (2)" from_port="document" to_port="document 1"/>
      <portSpacing port="source_document" spacing="0"/>
      <portSpacing port="sink_document 1" spacing="0"/>
      <portSpacing port="sink_document 2" spacing="0"/>
    </process>
  </operator>
</process>
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="6.5.002">
  <operator activated="false" class="select_attributes" compatibility="6.5.002" expanded="true" height="76" name="Select Attributes" width="90" x="581" y="435">
    <parameter key="attribute_filter_type" value="subset"/>
    <parameter key="attribute" value=""/>
    <parameter key="attributes" value="Author|Link|Categories|Published|ID"/>
    <parameter key="use_except_expression" value="false"/>
    <parameter key="value_type" value="attribute_value"/>
    <parameter key="use_value_type_exception" value="false"/>
    <parameter key="except_value_type" value="time"/>
    <parameter key="block_type" value="attribute_block"/>
    <parameter key="use_block_type_exception" value="false"/>
    <parameter key="except_block_type" value="value_matrix_row_start"/>
    <parameter key="invert_selection" value="true"/>
    <parameter key="include_special_attributes" value="true"/>
  </operator>
</process>
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="6.5.002">
  <operator activated="false" class="select_attributes" compatibility="6.5.002" expanded="true" height="76" name="Select Attributes (2)" width="90" x="581" y="615">
    <parameter key="attribute_filter_type" value="subset"/>
    <parameter key="attribute" value=""/>
    <parameter key="attributes" value="metadata_file|Category"/>
    <parameter key="use_except_expression" value="false"/>
    <parameter key="value_type" value="attribute_value"/>
    <parameter key="use_value_type_exception" value="false"/>
    <parameter key="except_value_type" value="time"/>
    <parameter key="block_type" value="attribute_block"/>
    <parameter key="use_block_type_exception" value="false"/>
    <parameter key="except_block_type" value="value_matrix_row_start"/>
    <parameter key="invert_selection" value="true"/>
    <parameter key="include_special_attributes" value="true"/>
  </operator>
</process>
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="6.5.002">
  <operator activated="true" class="retrieve" compatibility="6.5.002" expanded="true" height="60" name="Retrieve Pubmed RSS Data" width="90" x="45" y="255">
    <parameter key="repository_entry" value="../Pubmed RSS Data"/>
  </operator>
</process>
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="6.5.002">
  <operator activated="true" class="select_attributes" compatibility="6.5.002" expanded="true" height="76" name="Select Attributes (3)" width="90" x="179" y="255">
    <parameter key="attribute_filter_type" value="subset"/>
    <parameter key="attribute" value=""/>
    <parameter key="attributes" value="Source Text"/>
    <parameter key="use_except_expression" value="false"/>
    <parameter key="value_type" value="attribute_value"/>
    <parameter key="use_value_type_exception" value="false"/>
    <parameter key="except_value_type" value="time"/>
    <parameter key="block_type" value="attribute_block"/>
    <parameter key="use_block_type_exception" value="false"/>
    <parameter key="except_block_type" value="value_matrix_row_start"/>
    <parameter key="invert_selection" value="false"/>
    <parameter key="include_special_attributes" value="true"/>
  </operator>
</process>
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="6.5.002">
  <operator activated="true" class="generate_attributes" compatibility="6.5.002" expanded="true" height="76" name="Generate Attributes" width="90" x="313" y="345">
    <list key="function_descriptions">
      <parameter key="Source Text" value="concat(Description,Indication,[Mechanism Of Action],Pharmacodynamics)"/>
    </list>
    <parameter key="keep_all" value="false"/>
  </operator>
</process>
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="6.5.002">
  <operator activated="true" class="select_attributes" compatibility="6.5.002" expanded="true" height="76" name="Select Attributes (4)" width="90" x="447" y="345">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="Source Text"/>
    <parameter key="attributes" value=""/>
    <parameter key="use_except_expression" value="false"/>
    <parameter key="value_type" value="attribute_value"/>
    <parameter key="use_value_type_exception" value="false"/>
    <parameter key="except_value_type" value="time"/>
    <parameter key="block_type" value="attribute_block"/>
    <parameter key="use_block_type_exception" value="false"/>
    <parameter key="except_block_type" value="value_matrix_row_start"/>
    <parameter key="invert_selection" value="false"/>
    <parameter key="include_special_attributes" value="true"/>
  </operator>
</process>
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="6.5.002">
  <operator activated="true" class="filter_examples" compatibility="6.5.002" expanded="true" height="94" name="Filter Examples" width="90" x="447" y="210">
    <parameter key="parameter_expression" value=""/>
    <parameter key="condition_class" value="missing_labels"/>
    <parameter key="invert_filter" value="false"/>
    <list key="filters_list">
      <parameter key="filters_entry_key" value="Source Text.equals.?"/>
    </list>
    <parameter key="filters_logic_and" value="true"/>
    <parameter key="filters_check_metadata" value="true"/>
  </operator>
</process>
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="6.5.002">
  <operator activated="true" class="cross_distances" compatibility="6.5.002" expanded="true" height="94" name="Cross Distances" width="90" x="380" y="30">
    <parameter key="measure_types" value="NumericalMeasures"/>
    <parameter key="mixed_measure" value="MixedEuclideanDistance"/>
    <parameter key="nominal_measure" value="NominalDistance"/>
    <parameter key="numerical_measure" value="CosineSimilarity"/>
    <parameter key="divergence" value="GeneralizedIDivergence"/>
    <parameter key="kernel_type" value="radial"/>
    <parameter key="kernel_gamma" value="1.0"/>
    <parameter key="kernel_sigma1" value="1.0"/>
    <parameter key="kernel_sigma2" value="0.0"/>
    <parameter key="kernel_sigma3" value="2.0"/>
    <parameter key="kernel_degree" value="3.0"/>
    <parameter key="kernel_shift" value="1.0"/>
    <parameter key="kernel_a" value="1.0"/>
    <parameter key="kernel_b" value="0.0"/>
    <parameter key="only_top_k" value="false"/>
    <parameter key="k" value="10"/>
    <parameter key="search_for" value="nearest"/>
    <parameter key="compute_similarities" value="true"/>
  </operator>
</process>
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="6.5.002">
  <operator activated="true" class="aggregate" compatibility="6.5.002" expanded="true" height="76" name="Aggregate" width="90" x="514" y="30">
    <parameter key="use_default_aggregation" value="true"/>
    <parameter key="attribute_filter_type" value="all"/>
    <parameter key="attribute" value=""/>
    <parameter key="attributes" value=""/>
    <parameter key="use_except_expression" value="false"/>
    <parameter key="value_type" value="attribute_value"/>
    <parameter key="use_value_type_exception" value="false"/>
    <parameter key="except_value_type" value="time"/>
    <parameter key="block_type" value="attribute_block"/>
    <parameter key="use_block_type_exception" value="false"/>
    <parameter key="except_block_type" value="value_matrix_row_start"/>
    <parameter key="invert_selection" value="false"/>
    <parameter key="include_special_attributes" value="false"/>
    <parameter key="default_aggregation_function" value="average"/>
    <list key="aggregation_attributes">
      <parameter key="distance" value="maximum"/>
    </list>
    <parameter key="group_by_attributes" value="request"/>
    <parameter key="count_all_combinations" value="false"/>
    <parameter key="only_distinct" value="false"/>
    <parameter key="ignore_missings" value="true"/>
  </operator>
</process>

Answers

  • dunkin2025
    dunkin2025 New Altair Community Member
    Can RapidMiner do cosine similarity? I have been experimenting in R with a test set of two instances. That worked fine, but I need to do it for a couple of hundred thousand instances (tweets) against my predefined queries (a dictionary I built), converting them into vectors and calculating the angles between them.
  • MartinLiebig
    MartinLiebig
    Altair Employee
    Of course. It is part of the Cross Distances operator.
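
For reference, the "angle" calculation mentioned in this thread is the standard cosine similarity of two vectors. The symbols here are not from the thread: $a$ and $b$ denote the word vectors of one request document and one reference document, both defined over the same $n$ attributes.

```latex
\cos\theta
  = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}
  = \frac{\sum_{i=1}^{n} a_i b_i}
         {\sqrt{\sum_{i=1}^{n} a_i^2}\;\sqrt{\sum_{i=1}^{n} b_i^2}}
```

With non-negative weights such as TF-IDF, the value lies between 0 (no shared terms) and 1 (same direction). The sum only makes sense when index $i$ refers to the same term in both vectors.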
