How to replace Values after TF-IDF

Ruca
New Altair Community Member
Updated by Jocelyn
Hi all,

Sorry if this topic was already answered, but I was not able to find it.
Here's the situation.
I've computed TF-IDF scores for a set of documents, and my idea is to get rid of words whose TF-IDF score is very low (e.g. below 0.001).
How can I "clean" my example set so that values below 0.001 are set to 0?
I can't simply set a word's TF-IDF score to 0 everywhere, because a word can have a high TF-IDF score in one particular document and low values in other documents.
I hope I made myself clear.
Any help would be very much appreciated.
Thanks a lot.

Cheers,

Ruca

    Hi, you can use Generate Attributes to conditionally replace the values of a certain attribute:
    if(myAttribute < 0.001, 0, myAttribute)
    To do this iteratively for all attributes (i.e. words in your case) you can use the aforementioned expression in Generate Attributes together with the Loop Attributes operator.
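
    A minimal sketch of the single-attribute case as a process fragment, assuming an attribute that is literally named myAttribute (a placeholder, not a name from the example set in this thread). Reusing the same name as the key is intended to overwrite the existing column rather than create a new one; the operator attributes are trimmed to the essentials, so treat it as an illustration and not a complete process:

    ```xml
    <!-- Sketch only: "myAttribute" is a placeholder attribute name. -->
    <operator activated="true" class="generate_attributes" compatibility="5.2.008" expanded="true" name="Generate Attributes">
      <list key="function_descriptions">
        <parameter key="myAttribute" value="if(myAttribute &lt; 0.001, 0, myAttribute)"/>
      </list>
    </operator>
    ```

    Note that the comparison operator has to be written as &lt; inside the XML attribute value.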

    Regards,
    Marius
    Ruca (OP)
    Hi Marius,

    I've already implemented your approach. I'm using the "Loop Attributes" operator, with "Generate Attributes" in its subprocess.
    It seems that the loop_attribute generated by "Loop Attributes" is not recognized inside the "Generate Attributes" operator.
    Shouldn't the loop_attribute be the iterator variable?

    Thank you,

    Ruca
    Ruca (OP)
    It seems that I'm not able to get it working...
    Do I have to perform a "set macro" somewhere within the process?
    Here's my example.
    Any hint is more than welcome!

    Thanks a lot,

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.2.008">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.2.008" expanded="true" name="Process">
        <parameter key="logverbosity" value="all"/>
        <parameter key="logfile" value="H:\SEKS\KnowledgeSources\log.txt"/>
        <parameter key="resultfile" value="H:\SEKS\KnowledgeSources\result.csv"/>
        <parameter key="parallelize_main_process" value="true"/>
        <process expanded="true" height="791" width="815">
          <operator activated="false" class="text:process_document_from_file" compatibility="5.2.004" expanded="true" height="76" name="Process Documents from Files" width="90" x="45" y="30">
            <list key="text_directories">
              <parameter key="Waste Management" value="H:\SEKS\KnowledgeSources\ICONDA\Waste Management\TXT"/>
              <parameter key="Climate Control" value="H:\SEKS\KnowledgeSources\ICONDA\Climate Control\TXT"/>
              <parameter key="Utility &amp; Transportation" value="H:\SEKS\KnowledgeSources\ICONDA\Utility and Transportation\TXT"/>
              <parameter key="Electric Power &amp; Lighting" value="H:\SEKS\KnowledgeSources\ICONDA\Electric Power and Lighting\TXT"/>
              <parameter key="Information &amp; Communication" value="H:\SEKS\KnowledgeSources\ICONDA\Information and Communication\TXT"/>
            </list>
            <parameter key="content_type" value="pdf"/>
            <parameter key="prune_method" value="percentual"/>
            <parameter key="prunde_below_percent" value="2.0"/>
            <parameter key="prune_above_percent" value="100.0"/>
            <parameter key="prune_below_rank" value="0.9"/>
            <parameter key="prune_above_rank" value="0.5"/>
            <parameter key="parallelize_vector_creation" value="true"/>
            <process expanded="true" height="663" width="1094">
              <operator activated="false" class="text:tokenize" compatibility="5.2.004" expanded="true" height="60" name="Tokenize" width="90" x="112" y="75"/>
              <operator activated="false" class="text:transform_cases" compatibility="5.2.004" expanded="true" height="60" name="Transform Cases" width="90" x="112" y="210"/>
              <operator activated="false" class="text:filter_stopwords_english" compatibility="5.2.004" expanded="true" height="60" name="Filter Stopwords (English)" width="90" x="122" y="351"/>
              <operator activated="false" class="text:stem_snowball" compatibility="5.2.004" expanded="true" height="60" name="Stem (Snowball)" width="90" x="112" y="480"/>
              <operator activated="false" class="text:filter_by_length" compatibility="5.2.004" expanded="true" height="60" name="Filter Tokens (by Length)" width="90" x="380" y="480">
                <parameter key="max_chars" value="50"/>
              </operator>
              <operator activated="false" class="text:generate_n_grams_terms" compatibility="5.2.004" expanded="true" height="60" name="Generate n-Grams (Terms)" width="90" x="581" y="345">
                <parameter key="max_length" value="3"/>
              </operator>
              <connect from_port="document" to_op="Tokenize" to_port="document"/>
              <connect from_op="Tokenize" from_port="document" to_op="Transform Cases" to_port="document"/>
              <connect from_op="Transform Cases" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
              <connect from_op="Filter Stopwords (English)" from_port="document" to_op="Stem (Snowball)" to_port="document"/>
              <connect from_op="Stem (Snowball)" from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/>
              <connect from_op="Filter Tokens (by Length)" from_port="document" to_op="Generate n-Grams (Terms)" to_port="document"/>
              <connect from_op="Generate n-Grams (Terms)" from_port="document" to_port="document 1"/>
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="false" class="data_to_similarity" compatibility="5.2.008" expanded="true" height="76" name="Data to Similarity (2)" width="90" x="246" y="435">
            <parameter key="measure_types" value="NumericalMeasures"/>
          </operator>
          <operator activated="false" class="data_to_similarity" compatibility="5.2.008" expanded="true" height="76" name="Data to Similarity" width="90" x="246" y="75">
            <parameter key="measure_types" value="NumericalMeasures"/>
            <parameter key="numerical_measure" value="CosineSimilarity"/>
          </operator>
          <operator activated="false" class="store" compatibility="5.2.008" expanded="true" height="60" name="Store" width="90" x="441" y="59">
            <parameter key="repository_entry" value="ICONDAwl"/>
          </operator>
          <operator activated="false" class="item_distribution_performance" compatibility="5.2.008" expanded="true" height="76" name="Performance (4)" width="90" x="581" y="615"/>
          <operator activated="false" class="cluster_count_performance" compatibility="5.2.008" expanded="true" height="76" name="Performance (3)" width="90" x="581" y="525"/>
          <operator activated="false" class="map_clustering_on_labels" compatibility="5.2.008" expanded="true" height="76" name="Map Clustering on Labels" width="90" x="581" y="705"/>
          <operator activated="false" class="cluster_density_performance" compatibility="5.2.008" expanded="true" height="112" name="Performance" width="90" x="581" y="390"/>
          <operator activated="true" class="retrieve" compatibility="5.2.008" expanded="true" height="60" name="Retrieve" width="90" x="45" y="255">
            <parameter key="repository_entry" value="ICONDA/2nd Iteration/wordlist2"/>
          </operator>
          <operator activated="true" class="loop_attributes" compatibility="5.2.008" expanded="true" height="60" name="Loop Attributes" width="90" x="246" y="255">
            <process expanded="true" height="645" width="1076">
              <operator activated="true" class="generate_attributes" compatibility="5.2.008" expanded="true" height="76" name="Generate Attributes" width="90" x="313" y="30">
                <list key="function_descriptions">
                  <parameter key="loop_attribute" value="if(loop_attribute &lt; 0.001, 0, loop_attribute)"/>
                </list>
              </operator>
              <connect from_port="example set" to_op="Generate Attributes" to_port="example set input"/>
              <connect from_op="Generate Attributes" from_port="example set output" to_port="example set"/>
              <portSpacing port="source_example set" spacing="0"/>
              <portSpacing port="sink_example set" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="k_means" compatibility="5.2.008" expanded="true" height="76" name="Clustering" width="90" x="447" y="255">
            <parameter key="k" value="5"/>
            <parameter key="determine_good_start_values" value="true"/>
          </operator>
          <operator activated="true" class="cluster_distance_performance" compatibility="5.2.008" expanded="true" height="94" name="Performance (2)" width="90" x="648" y="255"/>
          <connect from_op="Retrieve" from_port="output" to_op="Loop Attributes" to_port="example set"/>
          <connect from_op="Loop Attributes" from_port="example set" to_op="Clustering" to_port="example set"/>
          <connect from_op="Clustering" from_port="cluster model" to_op="Performance (2)" to_port="cluster model"/>
          <connect from_op="Clustering" from_port="clustered set" to_op="Performance (2)" to_port="example set"/>
          <connect from_op="Performance (2)" from_port="performance" to_port="result 1"/>
          <connect from_op="Performance (2)" from_port="example set" to_port="result 2"/>
          <connect from_op="Performance (2)" from_port="cluster model" to_port="result 3"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
          <portSpacing port="sink_result 3" spacing="0"/>
          <portSpacing port="sink_result 4" spacing="0"/>
        </process>
      </operator>
    </process>
    Hi,

    To access the value of a macro, you have to enclose it in curly braces preceded by a percent sign, like this: %{myMacro}
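
    Applied to the process posted above, this would presumably mean replacing the bare loop_attribute with %{loop_attribute} in both the attribute name and the expression. A sketch of the adjusted loop, assuming the iteration macro keeps the name loop_attribute used in the posted process and that the attribute names are plain identifiers the expression parser accepts as-is (names containing spaces or special characters may need extra handling):

    ```xml
    <!-- Sketch only: same loop as in the posted process, with the macro
         written as %{loop_attribute} so it expands to the current attribute name. -->
    <operator activated="true" class="loop_attributes" compatibility="5.2.008" expanded="true" name="Loop Attributes">
      <process expanded="true">
        <operator activated="true" class="generate_attributes" compatibility="5.2.008" expanded="true" name="Generate Attributes">
          <list key="function_descriptions">
            <parameter key="%{loop_attribute}" value="if(%{loop_attribute} &lt; 0.001, 0, %{loop_attribute})"/>
          </list>
        </operator>
        <connect from_port="example set" to_op="Generate Attributes" to_port="example set input"/>
        <connect from_op="Generate Attributes" from_port="example set output" to_port="example set"/>
      </process>
    </operator>
    ```

    In each iteration the macro is replaced by the name of the current attribute before the expression is evaluated, so every word column gets its own threshold check, document by document.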

    Happy Clustering!
    ~Marius