"How to get Meta Data with the Data to Documents and Process Documents operators"

User: "dramhampton"
New Altair Community Member
Updated by Jocelyn
I'm performing text analytics and am struggling with Meta Data.

In the toy process below, there should be meta data available to the Data to Documents operator, but if you want to specify weights and click on Edit List, the source attribute doesn't populate, so you have to type it manually.  Seems an unnecessary chore if there are several attributes to be listed, is there a 'proper' way to do this?

Also, the Process Documents operator loses all the meta data (quite understandably because it is creating a bunch of new attributes from the text it is fed) - what is best practice for restoring the meta data so that subsequent operators can be set up easily?

Many thanks!

David


<?xml version="1.0" encoding="UTF-8"?><process version="9.2.000">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="9.2.000" expanded="true" name="Root" origin="GENERATED_SAMPLE">
    <parameter key="logverbosity" value="warning"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
      <operator activated="true" class="retrieve" compatibility="9.2.000" expanded="true" height="68" name="Retrieve Golf" width="90" x="112" y="34">
        <parameter key="repository_entry" value="../../data/Golf"/>
      </operator>
      <operator activated="true" class="text:data_to_documents" compatibility="8.1.000" expanded="true" height="68" name="Data to Documents" width="90" x="313" y="34">
        <parameter key="select_attributes_and_weights" value="true"/>
        <list key="specify_weights"/>
      </operator>
      <connect from_op="Retrieve Golf" from_port="output" to_op="Data to Documents" to_port="example set"/>
      <connect from_op="Data to Documents" from_port="documents" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

Sort by:
1 - 1 of 11
    User: "sgenzer"
    Altair Employee
    Accepted Answer
    hi @dramhampton maybe this will help?

    <?xml version="1.0" encoding="UTF-8"?><process version="9.2.000">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="9.2.000" expanded="true" name="Process">
        <parameter key="logverbosity" value="init"/>
        <parameter key="random_seed" value="2001"/>
        <parameter key="send_mail" value="never"/>
        <parameter key="notification_email" value=""/>
        <parameter key="process_duration_for_mail" value="30"/>
        <parameter key="encoding" value="SYSTEM"/>
        <process expanded="true">
          <operator activated="true" breakpoints="after" class="retrieve" compatibility="9.2.000" expanded="true" height="68" name="Retrieve Golf" width="90" x="45" y="34">
            <parameter key="repository_entry" value="//Samples/data/Golf"/>
            <description align="center" color="transparent" colored="false" width="126">Outlook and Wind are NOMINAL attributes here - Text Processing operators will ignore them</description>
          </operator>
          <operator activated="true" breakpoints="after" class="nominal_to_text" compatibility="9.2.000" expanded="true" height="82" name="Nominal to Text" width="90" x="179" y="34">
            <parameter key="attribute_filter_type" value="all"/>
            <parameter key="attribute" value=""/>
            <parameter key="attributes" value=""/>
            <parameter key="use_except_expression" value="false"/>
            <parameter key="value_type" value="nominal"/>
            <parameter key="use_value_type_exception" value="false"/>
            <parameter key="except_value_type" value="file_path"/>
            <parameter key="block_type" value="single_value"/>
            <parameter key="use_block_type_exception" value="false"/>
            <parameter key="except_block_type" value="single_value"/>
            <parameter key="invert_selection" value="false"/>
            <parameter key="include_special_attributes" value="false"/>
            <description align="center" color="transparent" colored="false" width="126">now they're TEXT attributes</description>
          </operator>
          <operator activated="false" class="select_attributes" compatibility="9.2.000" expanded="true" height="82" name="Select Attributes" width="90" x="313" y="34">
            <parameter key="attribute_filter_type" value="subset"/>
            <parameter key="attribute" value=""/>
            <parameter key="attributes" value="Outlook|Wind"/>
            <parameter key="use_except_expression" value="false"/>
            <parameter key="value_type" value="attribute_value"/>
            <parameter key="use_value_type_exception" value="false"/>
            <parameter key="except_value_type" value="time"/>
            <parameter key="block_type" value="attribute_block"/>
            <parameter key="use_block_type_exception" value="false"/>
            <parameter key="except_block_type" value="value_matrix_row_start"/>
            <parameter key="invert_selection" value="false"/>
            <parameter key="include_special_attributes" value="false"/>
            <description align="center" color="transparent" colored="false" width="126">not necessary</description>
          </operator>
          <operator activated="true" class="text:data_to_documents" compatibility="8.1.000" expanded="true" height="68" name="Data to Documents" width="90" x="447" y="34">
            <parameter key="select_attributes_and_weights" value="false"/>
            <list key="specify_weights"/>
            <description align="center" color="transparent" colored="false" width="126">everything deleted and box is now unchecked</description>
          </operator>
          <operator activated="true" class="text:process_documents" compatibility="8.1.000" expanded="true" height="103" name="Process Documents" width="90" x="581" y="34">
            <parameter key="create_word_vector" value="true"/>
            <parameter key="vector_creation" value="TF-IDF"/>
            <parameter key="add_meta_information" value="true"/>
            <parameter key="keep_text" value="false"/>
            <parameter key="prune_method" value="none"/>
            <parameter key="prune_below_percent" value="3.0"/>
            <parameter key="prune_above_percent" value="30.0"/>
            <parameter key="prune_below_rank" value="0.05"/>
            <parameter key="prune_above_rank" value="0.95"/>
            <parameter key="datamanagement" value="double_sparse_array"/>
            <parameter key="data_management" value="auto"/>
            <process expanded="true">
              <operator activated="true" class="text:tokenize" compatibility="8.1.000" expanded="true" height="68" name="Tokenize" width="90" x="246" y="34">
                <parameter key="mode" value="non letters"/>
                <parameter key="characters" value=".:"/>
                <parameter key="language" value="English"/>
                <parameter key="max_token_length" value="3"/>
              </operator>
              <connect from_port="document" to_op="Tokenize" to_port="document"/>
              <connect from_op="Tokenize" from_port="document" to_port="document 1"/>
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
            <description align="center" color="transparent" colored="false" width="126">voil&amp;#224;</description>
          </operator>
          <operator activated="false" class="text:process_document_from_data" compatibility="8.1.000" expanded="true" height="82" name="Process Documents from Data" width="90" x="514" y="238">
            <parameter key="create_word_vector" value="true"/>
            <parameter key="vector_creation" value="TF-IDF"/>
            <parameter key="add_meta_information" value="true"/>
            <parameter key="keep_text" value="false"/>
            <parameter key="prune_method" value="none"/>
            <parameter key="prune_below_percent" value="3.0"/>
            <parameter key="prune_above_percent" value="30.0"/>
            <parameter key="prune_below_rank" value="0.05"/>
            <parameter key="prune_above_rank" value="0.95"/>
            <parameter key="datamanagement" value="double_sparse_array"/>
            <parameter key="data_management" value="auto"/>
            <parameter key="select_attributes_and_weights" value="false"/>
            <list key="specify_weights"/>
            <process expanded="true">
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Retrieve Golf" from_port="output" to_op="Nominal to Text" to_port="example set input"/>
          <connect from_op="Nominal to Text" from_port="example set output" to_op="Data to Documents" to_port="example set"/>
          <connect from_op="Data to Documents" from_port="documents" to_op="Process Documents" to_port="documents 1"/>
          <connect from_op="Process Documents" from_port="example set" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
          <description align="center" color="yellow" colored="false" height="223" resized="true" width="297" x="407" y="178">FWIW I would just use this operator instead of the two above</description>
        </process>
      </operator>
    </process>

    Scott