Token in group
startx25
New Altair Community Member
Hi all,
I have read this wonderful tutorial on finding text needles in document haystacks:
https://docs.google.com/file/d/0BzlG_h9m5M7tVXUyeVl4cmhJZGc/edit?usp=sharing
It works fine, but now I want to add another text file in step 1: a second text needles file (with a label value, e.g. Groupe2),
so that step 1 has 2 text files as input.
In the final result of the process, I want to identify which text needles file (Groupe1 or Groupe2) each word found in my text file in step 3 comes from.
Thank you for any help.
Answers
-
Hello,
Is this what you need?
Regards,
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.008">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.3.008" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="text:create_document" compatibility="5.3.000" expanded="true" height="60" name="Create Document (2)" width="90" x="45" y="75">
<parameter key="text" value="binominal parameter binominal attributes Binominal operator "/>
</operator>
<operator activated="true" class="text:create_document" compatibility="5.3.000" expanded="true" height="60" name="Create Document (3)" width="90" x="45" y="165">
<parameter key="text" value="at this had the of the "/>
</operator>
<operator activated="true" class="collect" compatibility="5.3.008" expanded="true" height="94" name="Collect" width="90" x="179" y="120"/>
<operator activated="true" class="loop_collection" compatibility="5.3.008" expanded="true" height="76" name="Loop Collection" width="90" x="313" y="120">
<parameter key="set_iteration_macro" value="true"/>
<process expanded="true">
<operator activated="true" class="text:process_documents" compatibility="5.3.000" expanded="true" height="94" name="Process Documents" width="90" x="112" y="75">
<parameter key="vector_creation" value="Binary Term Occurrences"/>
<parameter key="add_meta_information" value="false"/>
<process expanded="true">
<operator activated="true" class="text:tokenize" compatibility="5.3.000" expanded="true" name="Tokenize (3)">
<parameter key="mode" value="regular expression"/>
<parameter key="expression" value="[^a-zA-Z ]"/>
</operator>
<operator activated="true" class="text:replace_tokens" compatibility="5.3.000" expanded="true" name="Replace Tokens (3)">
<list key="replace_dictionary">
<parameter key=" " value="_"/>
</list>
</operator>
<connect from_port="document" to_op="Tokenize (3)" to_port="document"/>
<connect from_op="Tokenize (3)" from_port="document" to_op="Replace Tokens (3)" to_port="document"/>
<connect from_op="Replace Tokens (3)" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="text:create_document" compatibility="5.3.000" expanded="true" height="60" name="Create Document (4)" width="90" x="112" y="300">
<parameter key="text" value="This Example Process mostly focuses on the transform binominal parameter. All remaining parameters are mostly for selecting the attributes. The Select Attributes operator also has many similar parameters for selection of attributes. You can study the Example Process of the Select Attributes operator if you want an understanding of these parameters. The Retrieve operator is used to load the Golf data set. A breakpoint is inserted at this point so that you can have look at the data set before application of the Nominal to Binominal operator. You can see that the 'Outlook' attribute has three possible values i.e. 'sunny', 'rain' and 'overcast'. The 'Wind' attribute has two possible values i.e. 'true' and 'false'. All parameters of the Nominal to Binominal operator are used with default values. Run the process. First you will see the Golf data set. Press the run button again and you will see the final results. You can see that the 'Outlook' attribute is replaced by three binominal attributes, one for each possible value of the original 'Outlook' attribute. These attributes are ' Outlook = sunny', ' Outlook = rain', and ' Outlook = overcast'. Only the value of one of these attributes is true for a specific example, the value of the other attributes is false. Examples whose 'Outlook ' attribute had the value 'sunny' in the original ExampleSet, will have the attribute ' Outlook =sunny' value set to 'true'in the new ExampleSet, the value of the 'Outlook =overcast' and 'Outlook =rain' attributes will be 'false'. The numeric attributes of the input ExampleSet remain unchanged. The 'Wind' attribute was not replaced by two binominal attributes, one for each possible value of the 'Wind' attribute because this attribute is already binominal. Still if you want to break it into two separate binominal attributes, this can be done by setting the transform binominal parameter to true. "/>
</operator>
<operator activated="true" class="text:process_documents" compatibility="5.3.000" expanded="true" height="94" name="Process Documents (2)" width="90" x="313" y="255">
<parameter key="vector_creation" value="Term Occurrences"/>
<process expanded="true">
<operator activated="true" class="text:tokenize" compatibility="5.3.000" expanded="true" name="Tokenize (4)">
<parameter key="expression" value="\\r\\n"/>
</operator>
<operator activated="true" class="text:generate_n_grams_terms" compatibility="5.3.000" expanded="true" name="Generate n-Grams (2)">
<parameter key="max_length" value="5"/>
</operator>
<connect from_port="document" to_op="Tokenize (4)" to_port="document"/>
<connect from_op="Tokenize (4)" from_port="document" to_op="Generate n-Grams (2)" to_port="document"/>
<connect from_op="Generate n-Grams (2)" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="generate_attributes" compatibility="5.3.008" expanded="true" height="76" name="Generate Attributes" width="90" x="313" y="75">
<list key="function_descriptions">
<parameter key="group" value="&quot;Group_%{iteration}&quot;"/>
</list>
</operator>
<connect from_port="single" to_op="Process Documents" to_port="documents 1"/>
<connect from_op="Process Documents" from_port="word list" to_op="Process Documents (2)" to_port="word list"/>
<connect from_op="Create Document (4)" from_port="output" to_op="Process Documents (2)" to_port="documents 1"/>
<connect from_op="Process Documents (2)" from_port="example set" to_op="Generate Attributes" to_port="example set input"/>
<connect from_op="Generate Attributes" from_port="example set output" to_port="output 1"/>
<portSpacing port="source_single" spacing="0"/>
<portSpacing port="sink_output 1" spacing="0"/>
<portSpacing port="sink_output 2" spacing="0"/>
</process>
</operator>
<connect from_op="Create Document (2)" from_port="output" to_op="Collect" to_port="input 1"/>
<connect from_op="Create Document (3)" from_port="output" to_op="Collect" to_port="input 2"/>
<connect from_op="Collect" from_port="collection" to_op="Loop Collection" to_port="collection"/>
<connect from_op="Loop Collection" from_port="output 1" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
Andrew
-
First, thank you for your help.
This is not really what I need. Maybe my explanation was not precise.
Here is a summary of what I need:
input 1: text file with some keywords (groupe1) (one word per line)
input 2: text file with some keywords (groupe2) (one word per line)
input 3: a flat text file
What I need is to count how many keywords from groupe1 and from groupe2 are present in my flat text file (input 3).
I think I need to add an Aggregate operator, but it can't count the correct value per group.
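To make the goal concrete, the counting described above can be sketched outside RapidMiner in a few lines of Python. This is only an illustration under assumptions, not the RapidMiner process itself: the function name `count_keywords_by_group` is hypothetical, the keyword lists and the sample text are taken from the example process, matching is assumed to be case-insensitive, and "count" is assumed to mean the total number of occurrences of a group's keywords in the flat text.

```python
import re
from collections import Counter


def count_keywords_by_group(keyword_groups, text):
    """Return {group name: total occurrences of that group's keywords in text}.

    Hypothetical helper for illustration only. Matching is case-insensitive
    and works on whole words made of the letters a-z.
    """
    # Count every lowercase word token in the flat text (input 3).
    token_counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return {
        # Deduplicate keywords per group so a repeated keyword is not counted twice.
        group: sum(token_counts[word] for word in {w.lower() for w in words})
        for group, words in keyword_groups.items()
    }


# Inputs 1 and 2: one keyword list per group.
groups = {
    "groupe1": ["dog", "cat", "bird"],
    "groupe2": ["car", "train", "boat", "truck"],
}
# Input 3: a fragment of the flat text used in the example process.
haystack = "and dog cat bird and dog cat bird and dog cat bird house car car house"

counts = count_keywords_by_group(groups, haystack)
print(counts)  # {'groupe1': 9, 'groupe2': 2}
```

If the number of distinct keywords found per group is wanted instead of the total occurrences, the `sum(...)` can be replaced by a count of the keywords whose token count is greater than zero.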
Here is an example:
Thank you for any help.
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.008">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.3.008" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="text:create_document" compatibility="5.3.000" expanded="true" height="60" name="Create Document" width="90" x="45" y="30">
<parameter key="text" value="dog&#10;cat&#10;bird&#10;"/>
<parameter key="add label" value="true"/>
<parameter key="label_type" value="text"/>
<parameter key="label_value" value="groupe1"/>
</operator>
<operator activated="true" class="text:create_document" compatibility="5.3.000" expanded="true" height="60" name="Create Document (2)" width="90" x="45" y="255">
<parameter key="text" value="This Example Process mostly focuses on the transform binominal parameter. All remaining parameters are mostly for selecting the attributes. The Select Attributes operator also has many similar parameters for selection of attributes. You can study the Example Process of the Select Attributes operator if you want an understanding of these parameters. The Retrieve operator is used to load the Golf data set. A breakpoint is inserted at this point so that you can have look at the data set before application of the Nominal to Binominal operator. You can see that the 'Outlook' attribute has three possible values i.e. 'sunny', 'rain' and 'overcast'. The 'Wind' attribute has two possible values i.e. 'true' and 'false'. All parameters of the Nominal to Binominal operator are used with default values. Run the process. First you will see the Golf data set. Press the run button again and you will see the final results. and dog cat bird and dog cat bird and dog cat bird house car car house You can see that the 'Outlook' attribute is replaced by three binominal attributes, one for each possible value of the original 'Outlook' attribute. These attributes are ' Outlook = sunny', ' Outlook = rain', and ' Outlook = overcast'. Only the value of one of these attributes is true for a specific example, the value of the other attributes is false. Examples whose 'Outlook ' attribute had the value 'sunny' in the original ExampleSet, will have the attribute ' Outlook =sunny' value set to 'true'in the new ExampleSet, the value of the 'Outlook =overcast' and 'Outlook =rain' attributes will be 'false'. The numeric attributes of the input ExampleSet remain unchanged. The 'Wind' attribute was not replaced by two binominal attributes, one for each possible value of the 'Wind' attribute because this attribute is already binominal. Still if you want to break it into two separate binominal attributes, this can be done by setting the transform binominal parameter to true. 
"/>
</operator>
<operator activated="true" class="text:create_document" compatibility="5.3.000" expanded="true" height="60" name="Create Document (3)" width="90" x="45" y="120">
<parameter key="text" value="car&#10;train&#10;boat&#10;truck"/>
<parameter key="add label" value="true"/>
<parameter key="label_type" value="text"/>
<parameter key="label_value" value="groupe2"/>
</operator>
<operator activated="true" class="text:process_documents" compatibility="5.3.000" expanded="true" height="112" name="Process Documents" width="90" x="246" y="75">
<parameter key="vector_creation" value="Binary Term Occurrences"/>
<parameter key="add_meta_information" value="false"/>
<process expanded="true">
<operator activated="true" class="text:tokenize" compatibility="5.3.000" expanded="true" height="60" name="Tokenize (3)" width="90" x="45" y="30">
<parameter key="mode" value="regular expression"/>
<parameter key="expression" value="[^a-zA-Z ]"/>
</operator>
<connect from_port="document" to_op="Tokenize (3)" to_port="document"/>
<connect from_op="Tokenize (3)" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="text:process_documents" compatibility="5.3.000" expanded="true" height="94" name="Process Documents (2)" width="90" x="313" y="255">
<parameter key="vector_creation" value="Term Occurrences"/>
<process expanded="true">
<operator activated="true" class="text:tokenize" compatibility="5.3.000" expanded="true" height="60" name="Tokenize (4)" width="90" x="45" y="30">
<parameter key="expression" value="\\r\\n"/>
</operator>
<connect from_port="document" to_op="Tokenize (4)" to_port="document"/>
<connect from_op="Tokenize (4)" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<connect from_op="Create Document" from_port="output" to_op="Process Documents" to_port="documents 1"/>
<connect from_op="Create Document (2)" from_port="output" to_op="Process Documents (2)" to_port="documents 1"/>
<connect from_op="Create Document (3)" from_port="output" to_op="Process Documents" to_port="documents 2"/>
<connect from_op="Process Documents" from_port="word list" to_op="Process Documents (2)" to_port="word list"/>
<connect from_op="Process Documents (2)" from_port="example set" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
-
Hello,
I think this might work for you.
I'll waive my usual fee of beer or money.
Regards,
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.008">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.3.008" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="text:create_document" compatibility="5.3.000" expanded="true" height="60" name="Create Document (2)" width="90" x="45" y="75">
<parameter key="text" value="binominal parameter binominal attributes Binominal operator "/>
</operator>
<operator activated="true" class="text:create_document" compatibility="5.3.000" expanded="true" height="60" name="Create Document (3)" width="90" x="45" y="165">
<parameter key="text" value="at this had the of the "/>
</operator>
<operator activated="true" class="collect" compatibility="5.3.008" expanded="true" height="94" name="Collect" width="90" x="246" y="75"/>
<operator activated="true" class="loop_collection" compatibility="5.3.008" expanded="true" height="76" name="Loop Collection" width="90" x="380" y="75">
<parameter key="set_iteration_macro" value="true"/>
<process expanded="true">
<operator activated="true" class="text:process_documents" compatibility="5.3.000" expanded="true" height="94" name="Process Documents" width="90" x="112" y="30">
<parameter key="vector_creation" value="Binary Term Occurrences"/>
<parameter key="add_meta_information" value="false"/>
<process expanded="true">
<operator activated="true" class="text:tokenize" compatibility="5.3.000" expanded="true" name="Tokenize (3)">
<parameter key="mode" value="regular expression"/>
<parameter key="expression" value="[^a-zA-Z ]"/>
</operator>
<operator activated="true" class="text:replace_tokens" compatibility="5.3.000" expanded="true" name="Replace Tokens (3)">
<list key="replace_dictionary">
<parameter key=" " value="_"/>
</list>
</operator>
<connect from_port="document" to_op="Tokenize (3)" to_port="document"/>
<connect from_op="Tokenize (3)" from_port="document" to_op="Replace Tokens (3)" to_port="document"/>
<connect from_op="Replace Tokens (3)" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="text:create_document" compatibility="5.3.000" expanded="true" height="60" name="Create Document (4)" width="90" x="112" y="300">
<parameter key="text" value="This Example Process mostly focuses on the transform binominal parameter. All remaining parameters are mostly for selecting the attributes. The Select Attributes operator also has many similar parameters for selection of attributes. You can study the Example Process of the Select Attributes operator if you want an understanding of these parameters. The Retrieve operator is used to load the Golf data set. A breakpoint is inserted at this point so that you can have look at the data set before application of the Nominal to Binominal operator. You can see that the 'Outlook' attribute has three possible values i.e. 'sunny', 'rain' and 'overcast'. The 'Wind' attribute has two possible values i.e. 'true' and 'false'. All parameters of the Nominal to Binominal operator are used with default values. Run the process. First you will see the Golf data set. Press the run button again and you will see the final results. You can see that the 'Outlook' attribute is replaced by three binominal attributes, one for each possible value of the original 'Outlook' attribute. These attributes are ' Outlook = sunny', ' Outlook = rain', and ' Outlook = overcast'. Only the value of one of these attributes is true for a specific example, the value of the other attributes is false. Examples whose 'Outlook ' attribute had the value 'sunny' in the original ExampleSet, will have the attribute ' Outlook =sunny' value set to 'true'in the new ExampleSet, the value of the 'Outlook =overcast' and 'Outlook =rain' attributes will be 'false'. The numeric attributes of the input ExampleSet remain unchanged. The 'Wind' attribute was not replaced by two binominal attributes, one for each possible value of the 'Wind' attribute because this attribute is already binominal. Still if you want to break it into two separate binominal attributes, this can be done by setting the transform binominal parameter to true. "/>
</operator>
<operator activated="true" class="text:process_documents" compatibility="5.3.000" expanded="true" height="94" name="Process Documents (2)" width="90" x="246" y="120">
<parameter key="vector_creation" value="Term Occurrences"/>
<process expanded="true">
<operator activated="true" class="text:tokenize" compatibility="5.3.000" expanded="true" name="Tokenize (4)">
<parameter key="expression" value="\\r\\n"/>
</operator>
<operator activated="true" class="text:generate_n_grams_terms" compatibility="5.3.000" expanded="true" name="Generate n-Grams (2)">
<parameter key="max_length" value="5"/>
</operator>
<connect from_port="document" to_op="Tokenize (4)" to_port="document"/>
<connect from_op="Tokenize (4)" from_port="document" to_op="Generate n-Grams (2)" to_port="document"/>
<connect from_op="Generate n-Grams (2)" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="generate_attributes" compatibility="5.3.008" expanded="true" height="76" name="Generate Attributes" width="90" x="380" y="120">
<list key="function_descriptions">
<parameter key="group" value="&quot;Group_%{iteration}&quot;"/>
</list>
</operator>
<operator activated="true" class="rename_by_generic_names" compatibility="5.3.008" expanded="true" height="76" name="Rename by Generic Names" width="90" x="380" y="210">
<parameter key="attribute_filter_type" value="subset"/>
<parameter key="attributes" value="|group"/>
<parameter key="invert_selection" value="true"/>
</operator>
<operator activated="true" class="generate_aggregation" compatibility="5.3.008" expanded="true" height="76" name="Generate Aggregation" width="90" x="380" y="300">
<parameter key="attribute_name" value="sum"/>
<parameter key="attribute_filter_type" value="subset"/>
<parameter key="attributes" value="|group"/>
<parameter key="invert_selection" value="true"/>
</operator>
<operator activated="true" class="select_attributes" compatibility="5.3.008" expanded="true" height="76" name="Select Attributes" width="90" x="514" y="120">
<parameter key="attribute_filter_type" value="subset"/>
<parameter key="attributes" value="group|sum|"/>
</operator>
<connect from_port="single" to_op="Process Documents" to_port="documents 1"/>
<connect from_op="Process Documents" from_port="word list" to_op="Process Documents (2)" to_port="word list"/>
<connect from_op="Create Document (4)" from_port="output" to_op="Process Documents (2)" to_port="documents 1"/>
<connect from_op="Process Documents (2)" from_port="example set" to_op="Generate Attributes" to_port="example set input"/>
<connect from_op="Generate Attributes" from_port="example set output" to_op="Rename by Generic Names" to_port="example set input"/>
<connect from_op="Rename by Generic Names" from_port="example set output" to_op="Generate Aggregation" to_port="example set input"/>
<connect from_op="Generate Aggregation" from_port="example set output" to_op="Select Attributes" to_port="example set input"/>
<connect from_op="Select Attributes" from_port="example set output" to_port="output 1"/>
<portSpacing port="source_single" spacing="0"/>
<portSpacing port="sink_output 1" spacing="0"/>
<portSpacing port="sink_output 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="append" compatibility="5.3.008" expanded="true" height="76" name="Append" width="90" x="514" y="75"/>
<connect from_op="Create Document (2)" from_port="output" to_op="Collect" to_port="input 1"/>
<connect from_op="Create Document (3)" from_port="output" to_op="Collect" to_port="input 2"/>
<connect from_op="Collect" from_port="collection" to_op="Loop Collection" to_port="collection"/>
<connect from_op="Loop Collection" from_port="output 1" to_op="Append" to_port="example set 1"/>
<connect from_op="Append" from_port="merged set" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
Andrew
-
Hi Andrew,
Great!
Thank you for this, I really appreciate it.
;D