"How to get Meta Data with the Data to Documents and Process Documents operators"
dramhampton
New Altair Community Member
I'm performing text analytics and am struggling with Meta Data.
In the toy process below, meta data should be available to the Data to Documents operator, but if you want to specify weights and click on Edit List, the source attribute list doesn't populate, so you have to type each name manually. That seems an unnecessary chore if there are several attributes to list. Is there a 'proper' way to do this?
Also, the Process Documents operator loses all the meta data (quite understandably because it is creating a bunch of new attributes from the text it is fed) - what is best practice for restoring the meta data so that subsequent operators can be set up easily?
Many thanks!
David
<?xml version="1.0" encoding="UTF-8"?><process version="9.2.000">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="9.2.000" expanded="true" name="Root" origin="GENERATED_SAMPLE">
<parameter key="logverbosity" value="warning"/>
<parameter key="random_seed" value="2001"/>
<parameter key="send_mail" value="never"/>
<parameter key="notification_email" value=""/>
<parameter key="process_duration_for_mail" value="30"/>
<parameter key="encoding" value="SYSTEM"/>
<process expanded="true">
<operator activated="true" class="retrieve" compatibility="9.2.000" expanded="true" height="68" name="Retrieve Golf" width="90" x="112" y="34">
<parameter key="repository_entry" value="../../data/Golf"/>
</operator>
<operator activated="true" class="text:data_to_documents" compatibility="8.1.000" expanded="true" height="68" name="Data to Documents" width="90" x="313" y="34">
<parameter key="select_attributes_and_weights" value="true"/>
<list key="specify_weights"/>
</operator>
<connect from_op="Retrieve Golf" from_port="output" to_op="Data to Documents" to_port="example set"/>
<connect from_op="Data to Documents" from_port="documents" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
Answers
-
I have had this metadata loss problem before and the best way I have found to handle it is to create an id (if you don't already have one) for each document, then multiply the data, use Process Documents, and then merge back in the metadata from the earlier dataset.
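A minimal sketch of the workflow described above, not a tested process. It assumes the Golf sample data from the original post and a Tokenize step inside Process Documents. The operator class names `generate_id`, `multiply` and `concurrency:join`, and their port names, are assumptions for version 9.2 and may differ in other releases. Nominal to Text is included on the assumption that the text operators only pick up attributes of type text. Whether the id keeps its id role through text processing is worth checking at a breakpoint; if it does not, a Set Role step before the Join would be needed.

```xml
<?xml version="1.0" encoding="UTF-8"?><process version="9.2.000">
  <operator activated="true" class="process" compatibility="9.2.000" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="retrieve" compatibility="9.2.000" expanded="true" name="Retrieve Golf">
        <parameter key="repository_entry" value="//Samples/data/Golf"/>
      </operator>
      <!-- Assumed operator: adds an id to every row so the two branches can be matched up later -->
      <operator activated="true" class="generate_id" compatibility="9.2.000" expanded="true" name="Generate ID"/>
      <!-- Assumed operator: one copy goes through text processing, the other keeps the original attributes -->
      <operator activated="true" class="multiply" compatibility="9.2.000" expanded="true" name="Multiply"/>
      <operator activated="true" class="nominal_to_text" compatibility="9.2.000" expanded="true" name="Nominal to Text">
        <parameter key="attribute_filter_type" value="all"/>
      </operator>
      <operator activated="true" class="text:data_to_documents" compatibility="8.1.000" expanded="true" name="Data to Documents">
        <parameter key="select_attributes_and_weights" value="false"/>
        <list key="specify_weights"/>
      </operator>
      <operator activated="true" class="text:process_documents" compatibility="8.1.000" expanded="true" name="Process Documents">
        <parameter key="create_word_vector" value="true"/>
        <parameter key="vector_creation" value="TF-IDF"/>
        <parameter key="add_meta_information" value="true"/>
        <process expanded="true">
          <operator activated="true" class="text:tokenize" compatibility="8.1.000" expanded="true" name="Tokenize">
            <parameter key="mode" value="non letters"/>
          </operator>
          <connect from_port="document" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_port="document 1"/>
        </process>
      </operator>
      <!-- Assumed operator and parameters: merges the word vector with the original attributes on the id -->
      <operator activated="true" class="concurrency:join" compatibility="9.2.000" expanded="true" name="Join">
        <parameter key="join_type" value="inner"/>
        <parameter key="use_id_attribute_as_key" value="true"/>
      </operator>
      <connect from_op="Retrieve Golf" from_port="output" to_op="Generate ID" to_port="example set input"/>
      <connect from_op="Generate ID" from_port="example set output" to_op="Multiply" to_port="input"/>
      <connect from_op="Multiply" from_port="output 1" to_op="Nominal to Text" to_port="example set input"/>
      <connect from_op="Nominal to Text" from_port="example set output" to_op="Data to Documents" to_port="example set"/>
      <connect from_op="Data to Documents" from_port="documents" to_op="Process Documents" to_port="documents 1"/>
      <connect from_op="Process Documents" from_port="example set" to_op="Join" to_port="left"/>
      <connect from_op="Multiply" from_port="output 2" to_op="Join" to_port="right"/>
      <connect from_op="Join" from_port="join" to_port="result 1"/>
    </process>
  </operator>
</process>
```

Layout attributes (height, width, x, y) are left out for brevity; the joined example set should carry both the term columns and the original Golf attributes if the id matches on both sides.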
-
Good suggestion on the way to handle the inevitable loss of meta data caused by Process Documents, many thanks Brian.
Any ideas about why Data to Documents cannot use the Meta Data provided to it by the previous operator?
-
hi @dramhampton - so both good questions.
1. Re specifying attributes and weights in Data to Documents: by default, every attribute of type text is used. Hence I rarely go in here, as I just make the attributes text ahead of time. And yes, the attribute names do not propagate into this list - likely a known bug in the Text Processing extension. I will investigate.
2. (I think Brian answered this faster than I could! I was going to say the same thing... )
Scott
-
Thank you both. This is very encouraging, I can feel I am getting closer to the point where I realise that I am doing it all wrong.
But here's the thing: I have created a toy process below that does 'Data to Documents' and 'Process Documents' on a simple dataset that just has two text columns. The version pasted below has these two columns manually selected in the Data to Documents operator, and it works just fine - creates a document term matrix. But if you now uncheck the 'select attributes and weights' box in Data to Documents (which should have the same result if I am reading Scott right), the Process Documents operator fails to produce a document term matrix. So the only way I can get the process to work is to manually specify all the text attributes that I want to use - which to my original point is very clunky because the Data to Documents operator appears to ignore the meta data it has been presented...
<?xml version="1.0" encoding="UTF-8"?><process version="9.2.000"><context><input/><output/><macros/></context><operator activated="true" class="process" compatibility="9.2.000" expanded="true" name="Process"><parameter key="logverbosity" value="init"/><parameter key="random_seed" value="2001"/><parameter key="send_mail" value="never"/><parameter key="notification_email" value=""/><parameter key="process_duration_for_mail" value="30"/><parameter key="encoding" value="SYSTEM"/><process expanded="true"><operator activated="true" class="retrieve" compatibility="9.2.000" expanded="true" height="68" name="Retrieve Golf" width="90" x="112" y="34"><parameter key="repository_entry" value="//Samples/data/Golf"/></operator><operator activated="true" class="select_attributes" compatibility="9.2.000" expanded="true" height="82" name="Select Attributes" width="90" x="246" y="34"><parameter key="attribute_filter_type" value="subset"/><parameter key="attribute" value=""/><parameter key="attributes" value="Outlook|Wind"/><parameter key="use_except_expression" value="false"/><parameter key="value_type" value="attribute_value"/><parameter key="use_value_type_exception" value="false"/><parameter
key="except_value_type" value="time"/><parameter key="block_type" value="attribute_block"/><parameter key="use_block_type_exception" value="false"/><parameter key="except_block_type" value="value_matrix_row_start"/><parameter key="invert_selection" value="false"/><parameter key="include_special_attributes" value="false"/></operator><operator activated="true" class="text:data_to_documents" compatibility="8.1.000" expanded="true" height="68" name="Data to Documents" width="90" x="447" y="34"><parameter key="select_attributes_and_weights" value="true"/><list key="specify_weights"><parameter key="Outlook" value="1.0"/><parameter key="Wind" value="1.0"/></list></operator><operator activated="true" class="text:process_documents" compatibility="8.1.000" expanded="true" height="103" name="Process Documents" width="90" x="581" y="34"><parameter key="create_word_vector" value="true"/><parameter key="vector_creation" value="TF-IDF"/><parameter key="add_meta_information" value="true"/><parameter key="keep_text" value="false"/><parameter key="prune_method" value="none"/><parameter key="prune_below_percent" value="3.0"/><parameter key="prune_above_percent" value="30.0"/><parameter key="prune_below_rank" value="0.05"/><parameter key="prune_above_rank" value="0.95"/><parameter key="datamanagement" value="double_sparse_array"/><parameter key="data_management" value="auto"/><process expanded="true"><operator activated="true" class="text:tokenize" compatibility="8.1.000" expanded="true" height="68" name="Tokenize" width="90" x="246" y="34"><parameter key="mode" value="non letters"/><parameter key="characters" value=".:"/><parameter key="language" value="English"/><parameter key="max_token_length" value="3"/></operator><connect from_port="document" to_op="Tokenize" to_port="document"/><connect from_op="Tokenize" from_port="document" to_port="document 1"/><portSpacing port="source_document" spacing="0"/><portSpacing port="sink_document 1" spacing="0"/><portSpacing port="sink_document 
2" spacing="0"/></process></operator><connect from_op="Retrieve Golf" from_port="output" to_op="Select Attributes" to_port="example set input"/><connect from_op="Select Attributes" from_port="example set output" to_op="Data to Documents" to_port="example set"/><connect from_op="Data to Documents" from_port="documents" to_op="Process Documents" to_port="documents 1"/><connect from_op="Process Documents" from_port="example set" to_port="result 1"/><portSpacing port="source_input 1" spacing="0"/><portSpacing port="sink_result 1" spacing="0"/><portSpacing port="sink_result 2" spacing="0"/></process></operator></process>
-
hi @dramhampton maybe this will help?
<?xml version="1.0" encoding="UTF-8"?><process version="9.2.000"> <context> <input/> <output/> <macros/> </context> <operator activated="true" class="process" compatibility="9.2.000" expanded="true" name="Process"> <parameter key="logverbosity" value="init"/> <parameter key="random_seed" value="2001"/> <parameter key="send_mail" value="never"/> <parameter key="notification_email" value=""/> <parameter key="process_duration_for_mail" value="30"/> <parameter key="encoding" value="SYSTEM"/> <process expanded="true"> <operator activated="true" breakpoints="after" class="retrieve" compatibility="9.2.000" expanded="true" height="68" name="Retrieve Golf" width="90" x="45" y="34"> <parameter key="repository_entry" value="//Samples/data/Golf"/> <description align="center" color="transparent" colored="false" width="126">Outlook and Wind are NOMINAL attributes here - Text Processing operators will ignore them</description> </operator> <operator activated="true" breakpoints="after" class="nominal_to_text" compatibility="9.2.000" expanded="true" height="82" name="Nominal to Text" width="90" x="179" y="34"> <parameter key="attribute_filter_type" value="all"/> <parameter key="attribute" value=""/> <parameter key="attributes" value=""/> <parameter key="use_except_expression" value="false"/> <parameter key="value_type" value="nominal"/> <parameter key="use_value_type_exception" value="false"/> <parameter key="except_value_type" value="file_path"/> <parameter key="block_type" value="single_value"/> <parameter key="use_block_type_exception" value="false"/> <parameter key="except_block_type" value="single_value"/> <parameter key="invert_selection" value="false"/> <parameter key="include_special_attributes" value="false"/> <description align="center" color="transparent" colored="false" width="126">now they're TEXT attributes</description> </operator> <operator activated="false" class="select_attributes" compatibility="9.2.000" expanded="true" height="82" name="Select Attributes" 
width="90" x="313" y="34"> <parameter key="attribute_filter_type" value="subset"/> <parameter key="attribute" value=""/> <parameter key="attributes" value="Outlook|Wind"/> <parameter key="use_except_expression" value="false"/> <parameter key="value_type" value="attribute_value"/> <parameter key="use_value_type_exception" value="false"/> <parameter key="except_value_type" value="time"/> <parameter key="block_type" value="attribute_block"/> <parameter key="use_block_type_exception" value="false"/> <parameter key="except_block_type" value="value_matrix_row_start"/> <parameter key="invert_selection" value="false"/> <parameter key="include_special_attributes" value="false"/> <description align="center" color="transparent" colored="false" width="126">not necessary</description> </operator> <operator activated="true" class="text:data_to_documents" compatibility="8.1.000" expanded="true" height="68" name="Data to Documents" width="90" x="447" y="34"> <parameter key="select_attributes_and_weights" value="false"/> <list key="specify_weights"/> <description align="center" color="transparent" colored="false" width="126">everything deleted and box is now unchecked</description> </operator> <operator activated="true" class="text:process_documents" compatibility="8.1.000" expanded="true" height="103" name="Process Documents" width="90" x="581" y="34"> <parameter key="create_word_vector" value="true"/> <parameter key="vector_creation" value="TF-IDF"/> <parameter key="add_meta_information" value="true"/> <parameter key="keep_text" value="false"/> <parameter key="prune_method" value="none"/> <parameter key="prune_below_percent" value="3.0"/> <parameter key="prune_above_percent" value="30.0"/> <parameter key="prune_below_rank" value="0.05"/> <parameter key="prune_above_rank" value="0.95"/> <parameter key="datamanagement" value="double_sparse_array"/> <parameter key="data_management" value="auto"/> <process expanded="true"> <operator activated="true" class="text:tokenize" 
compatibility="8.1.000" expanded="true" height="68" name="Tokenize" width="90" x="246" y="34"> <parameter key="mode" value="non letters"/> <parameter key="characters" value=".:"/> <parameter key="language" value="English"/> <parameter key="max_token_length" value="3"/> </operator> <connect from_port="document" to_op="Tokenize" to_port="document"/> <connect from_op="Tokenize" from_port="document" to_port="document 1"/> <portSpacing port="source_document" spacing="0"/> <portSpacing port="sink_document 1" spacing="0"/> <portSpacing port="sink_document 2" spacing="0"/> </process> <description align="center" color="transparent" colored="false" width="126">voil&#224;</description> </operator> <operator activated="false" class="text:process_document_from_data" compatibility="8.1.000" expanded="true" height="82" name="Process Documents from Data" width="90" x="514" y="238"> <parameter key="create_word_vector" value="true"/> <parameter key="vector_creation" value="TF-IDF"/> <parameter key="add_meta_information" value="true"/> <parameter key="keep_text" value="false"/> <parameter key="prune_method" value="none"/> <parameter key="prune_below_percent" value="3.0"/> <parameter key="prune_above_percent" value="30.0"/> <parameter key="prune_below_rank" value="0.05"/> <parameter key="prune_above_rank" value="0.95"/> <parameter key="datamanagement" value="double_sparse_array"/> <parameter key="data_management" value="auto"/> <parameter key="select_attributes_and_weights" value="false"/> <list key="specify_weights"/> <process expanded="true"> <portSpacing port="source_document" spacing="0"/> <portSpacing port="sink_document 1" spacing="0"/> </process> </operator> <connect from_op="Retrieve Golf" from_port="output" to_op="Nominal to Text" to_port="example set input"/> <connect from_op="Nominal to Text" from_port="example set output" to_op="Data to Documents" to_port="example set"/> <connect from_op="Data to Documents" from_port="documents" to_op="Process Documents" to_port="documents 
1"/> <connect from_op="Process Documents" from_port="example set" to_port="result 1"/> <portSpacing port="source_input 1" spacing="0"/> <portSpacing port="sink_result 1" spacing="0"/> <portSpacing port="sink_result 2" spacing="0"/> <description align="center" color="yellow" colored="false" height="223" resized="true" width="297" x="407" y="178">FWIW I would just use this operator instead of the two above</description> </process> </operator> </process>
Scott
-
Bingo. Thank you so much for taking the time to put this together Scott!
For the benefit of those who will read this later, the key points are:
- Before processing documents, make sure your text data is of type 'text' rather than 'polynominal'. Text means it's a string of any number of words, such as reviews of products on a website, whereas polynominal means it is nominal data that has a finite number of different values (even if it's a large number of them), such as names of items for sale on a website. It would be reasonable to make a Pareto chart from polynominal data to see which values occur most often, but it makes no sense to do so with text. It's a subtle difference and I will admit to having assumed that RapidMiner treated them the same.
- Scott introduced a Nominal to Text operator, which forced the polynominal attributes to text. That was why the Data to Documents operator was not seeing the meta data: it is looking for text attributes, not polynominal ones.
- And finally, to put the lid on it, he pointed out that it is actually easier to use just a single operator, Process Documents from Data, instead of Data to Documents and Process Documents.
All great stuff, many thanks Scott!
David
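For later readers, the single-operator variant summarised above might look roughly like the sketch below. It reuses the operator classes and parameters from Scott's process (Nominal to Text, Process Documents from Data, Tokenize); the outer port names on Process Documents from Data (`example set` in and out) are an assumption, and layout attributes are omitted, so treat it as a starting point to verify in your own installation rather than a tested process.

```xml
<?xml version="1.0" encoding="UTF-8"?><process version="9.2.000">
  <operator activated="true" class="process" compatibility="9.2.000" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="retrieve" compatibility="9.2.000" expanded="true" name="Retrieve Golf">
        <parameter key="repository_entry" value="//Samples/data/Golf"/>
      </operator>
      <!-- Outlook and Wind are nominal in Golf; convert them so the text operators pick them up -->
      <operator activated="true" class="nominal_to_text" compatibility="9.2.000" expanded="true" name="Nominal to Text">
        <parameter key="attribute_filter_type" value="all"/>
      </operator>
      <!-- Replaces the Data to Documents + Process Documents pair -->
      <operator activated="true" class="text:process_document_from_data" compatibility="8.1.000" expanded="true" name="Process Documents from Data">
        <parameter key="create_word_vector" value="true"/>
        <parameter key="vector_creation" value="TF-IDF"/>
        <parameter key="add_meta_information" value="true"/>
        <parameter key="keep_text" value="false"/>
        <parameter key="prune_method" value="none"/>
        <parameter key="select_attributes_and_weights" value="false"/>
        <list key="specify_weights"/>
        <process expanded="true">
          <operator activated="true" class="text:tokenize" compatibility="8.1.000" expanded="true" name="Tokenize">
            <parameter key="mode" value="non letters"/>
          </operator>
          <connect from_port="document" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_port="document 1"/>
        </process>
      </operator>
      <connect from_op="Retrieve Golf" from_port="output" to_op="Nominal to Text" to_port="example set input"/>
      <connect from_op="Nominal to Text" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
      <connect from_op="Process Documents from Data" from_port="example set" to_port="result 1"/>
    </process>
  </operator>
</process>
```

With `select_attributes_and_weights` left unchecked, every attribute of type text is used, so no attribute names have to be typed by hand.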