"Text Processing: Select Attribute and Weights in Process Documents From Data"
Frank
New Altair Community Member
Hello,
I am working on a text classification problem where my input consists of news articles.
There are two text attributes: the title of a news article and the full text (ArticleText).
As part of my research I am investigating the effect of assigning different weights to the title and full-text attributes. This is done in the "Process Documents from Data" operator with the "Select Attributes and Weights" property. (I use TF-IDF weighting.)
I tested five cases of different weights: (4.0 Title, 1.0 ArticleText), (2.0 Title, 1.0 ArticleText), (1.0 Title, 1.0 ArticleText), (1.0 Title, 2.0 ArticleText), (1.0 Title, 4.0 ArticleText), each weighting configuration in a separate process.
I was hoping that the TF-IDF scores of words originating from the title would be multiplied by the corresponding weight, and likewise for the full text. However, no matter how I set the weights, the output document term matrices are all identical. Am I doing something wrong, or is there another way to achieve my goal?
I have attached a simplified process in which the Title attribute has weight 4.0 and ArticleText weight 1.0.
cheers,
Frank
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.000">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.3.000" expanded="true" name="Process">
<process expanded="true" height="523" width="845">
<operator activated="true" class="set_macros" compatibility="5.3.000" expanded="true" height="60" name="Set Macros" width="90" x="45" y="30">
<list key="macros">
<parameter key="TermMatrixPath" value="//RapidMinerRepository/ForReal/Data/TermMatrix/GEW500Text1GramNoBlackList%{TW}TW%{AW}AW"/>
<parameter key="StratifiedDataPath" value="//RapidMinerRepository/ForReal/Data/Dataset/GEW500TextWithoutSource"/>
<parameter key="WordNetPath" value="C:\Program Files (x86)\WordNet\2.1\dict"/>
<parameter key="TW" value="4.0"/>
<parameter key="AW" value="1.0"/>
</list>
</operator>
<operator activated="true" class="retrieve" compatibility="5.3.000" expanded="true" height="60" name="Retrieve" width="90" x="45" y="165">
<parameter key="repository_entry" value="%{StratifiedDataPath}"/>
</operator>
<operator activated="true" class="text:process_document_from_data" compatibility="5.3.000" expanded="true" height="76" name="ProcessTrainDocuments (2)" width="90" x="313" y="165">
<parameter key="prune_method" value="absolute"/>
<parameter key="prune_below_absolute" value="4"/>
<parameter key="prune_above_absolute" value="999999999"/>
<parameter key="prune_above_rank" value="0.05"/>
<parameter key="select_attributes_and_weights" value="true"/>
<list key="specify_weights">
<parameter key="ArticleText" value="1.0"/>
<parameter key="Title" value="4.0"/>
</list>
<process expanded="true">
<operator activated="true" class="text:tokenize" compatibility="5.3.000" expanded="true" name="Tokenize (4)"/>
<operator activated="true" class="text:filter_by_length" compatibility="5.3.000" expanded="true" name="Filter Tokens (4)">
<parameter key="min_chars" value="3"/>
<parameter key="max_chars" value="999"/>
</operator>
<operator activated="true" class="wordnet:open_wordnet_dictionary" compatibility="5.2.000" expanded="true" name="Open WordNet Dictionary (4)">
<parameter key="directory" value="%{WordNetPath}"/>
</operator>
<operator activated="true" class="text:filter_stopwords_english" compatibility="5.3.000" expanded="true" name="Filter Stopwords (4)"/>
<operator activated="true" class="wordnet:stem_wordnet" compatibility="5.2.000" expanded="true" name="Stem (4)">
<parameter key="keep_unmatched_stems" value="true"/>
<parameter key="keep_unmatched_tokens" value="true"/>
</operator>
<operator activated="true" class="subprocess" compatibility="5.3.000" expanded="true" name="Filter Clash Attributes">
<process expanded="true">
<operator activated="true" class="text:filter_tokens_by_content" compatibility="5.3.000" expanded="true" name="Filter Weight Attribute">
<parameter key="condition" value="equals"/>
<parameter key="string" value="weight"/>
<parameter key="invert condition" value="true"/>
</operator>
<operator activated="true" class="text:filter_tokens_by_content" compatibility="5.3.000" expanded="true" name="Filter Batch attribute">
<parameter key="condition" value="equals"/>
<parameter key="string" value="Fold"/>
<parameter key="invert condition" value="true"/>
</operator>
<operator activated="true" class="text:filter_tokens_by_content" compatibility="5.3.000" expanded="true" name="Filter Source Attribute">
<parameter key="condition" value="equals"/>
<parameter key="string" value="Source"/>
<parameter key="invert condition" value="true"/>
</operator>
<operator activated="true" class="text:filter_tokens_by_content" compatibility="5.3.000" expanded="true" name="Filter PRID attribute">
<parameter key="condition" value="equals"/>
<parameter key="string" value="PRID"/>
<parameter key="invert condition" value="true"/>
</operator>
<connect from_port="in 1" to_op="Filter Weight Attribute" to_port="document"/>
<connect from_op="Filter Weight Attribute" from_port="document" to_op="Filter Batch attribute" to_port="document"/>
<connect from_op="Filter Batch attribute" from_port="document" to_op="Filter Source Attribute" to_port="document"/>
<connect from_op="Filter Source Attribute" from_port="document" to_op="Filter PRID attribute" to_port="document"/>
<connect from_op="Filter PRID attribute" from_port="document" to_port="out 1"/>
<portSpacing port="source_in 1" spacing="0"/>
<portSpacing port="source_in 2" spacing="0"/>
<portSpacing port="sink_out 1" spacing="0"/>
<portSpacing port="sink_out 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="text:filter_by_length" compatibility="5.3.000" expanded="true" name="Filter Tokens (5)">
<parameter key="min_chars" value="3"/>
<parameter key="max_chars" value="999"/>
</operator>
<connect from_port="document" to_op="Tokenize (4)" to_port="document"/>
<connect from_op="Tokenize (4)" from_port="document" to_op="Filter Tokens (4)" to_port="document"/>
<connect from_op="Filter Tokens (4)" from_port="document" to_op="Filter Stopwords (4)" to_port="document"/>
<connect from_op="Open WordNet Dictionary (4)" from_port="dictionary" to_op="Stem (4)" to_port="dictionary"/>
<connect from_op="Filter Stopwords (4)" from_port="document" to_op="Stem (4)" to_port="document"/>
<connect from_op="Stem (4)" from_port="document" to_op="Filter Clash Attributes" to_port="in 1"/>
<connect from_op="Filter Clash Attributes" from_port="out 1" to_op="Filter Tokens (5)" to_port="document"/>
<connect from_op="Filter Tokens (5)" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<connect from_op="Retrieve" from_port="output" to_op="ProcessTrainDocuments (2)" to_port="example set"/>
<connect from_op="ProcessTrainDocuments (2)" from_port="example set" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
Answers
For me, specifying weights works flawlessly. However, when you use TF/IDF the weights do not result in an operation as simple as multiplying the final values by the weight. What happens is that the words are counted according to their weight, i.e. if you specify a weight of 4.0 for an attribute, each word/token that appears in that attribute is counted four times. You'll see this if you switch from TF/IDF to term_occurences for the vector_creation parameter of Process Documents.
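Marius's description (tokens counted weight-many times before vector creation) can be sketched in plain Python. This is an illustrative sketch, not RapidMiner's actual implementation; the attribute names and example documents are made up, and the TF-IDF formula is a standard textbook variant that may differ in detail from RapidMiner's.

```python
import math

def weighted_term_counts(doc, weights):
    """doc: {attribute_name: [tokens]}; weights: {attribute_name: weight}.
    Each token contributes its attribute's weight to the term count."""
    counts = {}
    for attr, tokens in doc.items():
        w = weights.get(attr, 1.0)
        for tok in tokens:
            counts[tok] = counts.get(tok, 0.0) + w  # counted w times
    return counts

# Hypothetical toy corpus with a Title and ArticleText attribute per document.
docs = [
    {"Title": ["market", "crash"], "ArticleText": ["market", "fell", "today"]},
    {"Title": ["sports"], "ArticleText": ["team", "won", "today"]},
]
weights = {"Title": 4.0, "ArticleText": 1.0}

count_vectors = [weighted_term_counts(d, weights) for d in docs]
# "market" in doc 0: 4.0 (Title) + 1.0 (ArticleText) = 5.0

def tf_idf(count_vectors):
    """Standard TF-IDF over the weighted counts: tf * log(N / df)."""
    n = len(count_vectors)
    df = {}
    for cv in count_vectors:
        for term in cv:
            df[term] = df.get(term, 0) + 1
    result = []
    for cv in count_vectors:
        total = sum(cv.values())
        result.append({t: (c / total) * math.log(n / df[t])
                       for t, c in cv.items()})
    return result

tfidf_vectors = tf_idf(count_vectors)
```

Because the weights enter as multiplied term counts, a term that appears in every document (like "today" above) still gets a TF-IDF of zero regardless of the weights, which is why the effect is easiest to see with plain term occurrences.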
If your results are still always the same, maybe an operator that you use inside Process Documents discards those weights. To test that, please start with a very simple process and add the remaining operators back one by one. That way you'll find out which operator breaks the weighting. If you find out anything useful, we would be very grateful if you posted your findings here.
Best regards,
Marius
Hey Marius,
Thanks for your quick response.
I followed your advice, eliminated the operators one by one inside the Process Documents operator, and quickly found the problem.
When I remove the Stem (WordNet) operator, the resulting term matrices are different, so I suspect the weighting then works.
Even when I use Stem (Snowball) the resulting term matrices differ, so I will probably use that as a workaround.
By the way, I have installed WordNet version 2.1.
regards,
Frank
Ok, I'll forward it to the developers, then!
Best regards,
Marius