How can I filter for non-missing values in a special attribute?
ArnoG
New Altair Community Member
I'm using the "Process Documents from Data" operator to select sentences containing a certain word. The operator generates an example set with a special attribute named text. Now I want to select only the records containing text, but what I tried doesn't seem to work. I tried Filter Examples with no_missing_attributes, but somehow I can't get only the records containing text. Does anybody have suggestions?
Regards Arno
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="6.0.003">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="6.0.003" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="read_excel" compatibility="6.0.003" expanded="true" height="60" name="Read Excel" width="90" x="45" y="75">
<parameter key="excel_file" value="C:\Improve Your Business\Qing\Rapidminer\Hampshire hotel\Prediction model.xlsx"/>
<parameter key="sheet_number" value="2"/>
<parameter key="imported_cell_range" value="A1:F8"/>
<parameter key="first_row_as_names" value="false"/>
<list key="annotations">
<parameter key="0" value="Name"/>
</list>
<list key="data_set_meta_data_information">
<parameter key="0" value="Date.false.date_time.attribute"/>
<parameter key="1" value="Rate.false.numeric.attribute"/>
<parameter key="2" value="Guest category.false.binominal.attribute"/>
<parameter key="3" value="Positivereview.true.text.attribute"/>
<parameter key="4" value="Negativereview.true.text.attribute"/>
<parameter key="5" value="Sentiment.true.attribute_value.label"/>
</list>
</operator>
<operator activated="true" class="text:process_document_from_data" compatibility="5.3.002" expanded="true" height="76" name="Process Documents from Data" width="90" x="246" y="75">
<parameter key="vector_creation" value="Term Occurrences"/>
<parameter key="keep_text" value="true"/>
<list key="specify_weights"/>
<process expanded="true">
<operator activated="true" class="text:tokenize" compatibility="5.3.002" expanded="true" height="60" name="Tokenize" width="90" x="112" y="30">
<parameter key="mode" value="specify characters"/>
<parameter key="characters" value=".:?!"/>
</operator>
<operator activated="true" class="text:filter_tokens_by_content" compatibility="5.3.002" expanded="true" height="60" name="Filter Tokens (by Content)" width="90" x="246" y="30">
<parameter key="string" value="room"/>
</operator>
<connect from_port="document" to_op="Tokenize" to_port="document"/>
<connect from_op="Tokenize" from_port="document" to_op="Filter Tokens (by Content)" to_port="document"/>
<connect from_op="Filter Tokens (by Content)" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="filter_examples" compatibility="6.0.003" expanded="true" height="94" name="Filter Examples" width="90" x="447" y="75">
<parameter key="condition_class" value="no_missing_attributes"/>
<list key="filters_list"/>
</operator>
<connect from_op="Read Excel" from_port="output" to_op="Process Documents from Data" to_port="example set"/>
<connect from_op="Process Documents from Data" from_port="example set" to_op="Filter Examples" to_port="example set input"/>
<connect from_op="Filter Examples" from_port="example set output" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
Answers
Hello Arno
I don't have your data so I can't be certain, but I think this is what is happening.
Your process reads in a spreadsheet and the resulting example set has three attributes. Two of these are of type text and the third is a label. The Process Documents from Data operator processes the text attributes together for each example in the example set. The tokenizer splits on the characters .:?! which means the document is split into sentence-like tokens, and you are keeping only those which contain the word 'room'. The resulting document vector will therefore contain attributes corresponding to sentences containing the word 'room'. The setting 'Term Occurrences' for the Process Documents operator counts the number of times each token appears in each example within the example set. A value of zero means the example has no match.
Could it be that you want to remove those examples which have the value 0 for all of these attributes?
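To make the effect concrete, here is a minimal Python sketch of the same logic outside RapidMiner. It is not how the Text Processing extension is implemented; the function names and the sample reviews are invented for illustration, and the keyword match here is a plain case-sensitive substring test, which is an assumption.

```python
import re
from collections import Counter

def sentence_tokens(text, delimiters=".:?!"):
    """Split text on any of the delimiter characters and drop empty pieces."""
    pattern = "[" + re.escape(delimiters) + "]"
    return [piece.strip() for piece in re.split(pattern, text) if piece.strip()]

def term_occurrences(documents, keyword="room"):
    """Keep only tokens containing the keyword, then count them per document."""
    kept = [[t for t in sentence_tokens(doc) if keyword in t] for doc in documents]
    # One column per distinct surviving token, in sorted order.
    vocabulary = sorted({t for tokens in kept for t in tokens})
    rows = []
    for tokens in kept:
        counts = Counter(tokens)
        rows.append([counts[t] for t in vocabulary])
    return vocabulary, rows

# Invented sample reviews, not the poster's data.
reviews = [
    "The room was clean. Staff were friendly.",
    "Great breakfast!",
    "Small room: noisy at night. The room was cold.",
]
vocabulary, rows = term_occurrences(reviews)
for review, row in zip(reviews, rows):
    print(row, review)
```

The second review has no sentence containing 'room', so its row is all zeros. Nothing in that row is missing, which would explain why a no-missing-values filter does not drop it.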
regards
Andrew
Hi Andrew,
That is exactly what I'm trying to do. My process reads a spreadsheet with 2 text columns and 1 label column for the sentiment. The process results in an example set containing 7 examples, 2 special attributes and 3 regular attributes.
3 out of the 7 examples contain text, 4 have no text. The examples with text have at least 1 regular attribute with a 1. The 4 examples without text have all 0.
I'd like to remove all the examples with a 0 for all attributes, so you're exactly right. Is that possible?
Regards.
Arno
Hello Arno
There are many ways. One to try would be "Remove Useless Attributes".
regards
Andrew
Hi Andrew,
Thanks for your response. I was not familiar with this operator. I tried it, but the operator removes attributes instead of examples. Am I using it the right way?
Regards,
Arno
Hello Arno
My mistake - silly me - not thinking straight.
You could add up all values of the attributes to create a new attribute and then keep only those examples where the new attribute is not zero. The operator to use would be "Generate Aggregation"; set the attribute filter type to "value_type", the value type to "numeric", and ensure the aggregation function is "sum". Using this operator means you don't need to know the names of the attributes you are summing.
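The idea can be sketched in a few lines of Python. This only illustrates the sum-then-filter logic and is not RapidMiner code; the function name, the column names, and the sample rows are hypothetical.

```python
from numbers import Number

def keep_nonzero_examples(examples, sum_name="sum"):
    """Sum every numeric value in each example and keep examples whose sum is not 0.

    Non-numeric values (such as the text attribute) are skipped, so the
    caller does not need to know the names of the numeric columns.
    """
    kept = []
    for example in examples:
        total = sum(
            value for value in example.values()
            if isinstance(value, Number) and not isinstance(value, bool)
        )
        if total != 0:
            # Store the sum as a new attribute, as the aggregation step would.
            kept.append({**example, sum_name: total})
    return kept

# Hypothetical rows: one text column plus three occurrence counts.
examples = [
    {"text": "The room was clean", "t1": 0, "t2": 1, "t3": 0},
    {"text": "", "t1": 0, "t2": 0, "t3": 0},
    {"text": "Small room The room was cold", "t1": 1, "t2": 0, "t3": 1},
]
filtered = keep_nonzero_examples(examples)
for example in filtered:
    print(example["sum"], example["text"])
```

Summing works here because term occurrences are never negative, so a sum of zero can only mean every count is zero.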
regards
Andrew
Hi Andrew,
Thanks! That worked. I created a new attribute and added up all the values. Then I used the Filter Examples operator, set to custom_filters with the condition that the new attribute is not 0. Now I have the examples containing text.
Thanks.
Regards,
Arno