"Error - attribute text was already present in the example set"

B_
B_ New Altair Community Member
edited November 5 in Community Q&A
I've used this basic train/apply model successfully for other text classification jobs, but now it is producing an error.  Removing stemming and token filtering by size in the apply section of Process Documents still produces an error.  Version is 5.0.11 and text module is 5.0.7.

Any ideas about what to change?  The text is coming from the same table/field.  Training text is a subset of the full set of documents in the table.


Exception: com.rapidminer.operator.UserError
Message: The attribute text was already present in the example set.
Stack trace:

  com.rapidminer.operator.text.io.AbstractDocumentInputOperator.createWordAttributes(AbstractDocumentInputOperator.java:336)
  com.rapidminer.operator.text.io.AbstractDocumentInputOperator.doWork(AbstractDocumentInputOperator.java:243)
  com.rapidminer.operator.Operator.execute(Operator.java:771)
  com.rapidminer.operator.execution.SimpleUnitExecutor.execute(SimpleUnitExecutor.java:51)
  com.rapidminer.operator.ExecutionUnit.execute(ExecutionUnit.java:709)
  com.rapidminer.operator.OperatorChain.doWork(OperatorChain.java:368)
  com.rapidminer.operator.Operator.execute(Operator.java:771)
  com.rapidminer.Process.run(Process.java:899)
  com.rapidminer.Process.run(Process.java:795)
  com.rapidminer.Process.run(Process.java:790)
  com.rapidminer.Process.run(Process.java:780)
  com.rapidminer.gui.ProcessThread.run(ProcessThread.java:62)



Nov 22, 2010 10:19:39 AM SEVERE: Process failed: The attribute text was already present in the example set.
Nov 22, 2010 10:19:39 AM SEVERE: Here:          Root[1] (Process)
          subprocess 'Main Process'
            +- Read Database[1] (Read Database)
            +- Set Role[0] (Set Role)
            +- Set Role (2)[1] (Set Role)
            +- Replace[1] (Replace)
            +- Nominal to Text[1] (Nominal to Text)
            +- Process Documents from Data[1] (Process Documents from Data)
          subprocess 'Vector Creation'
            |    +- Transform Cases[19098] (Transform Cases)
            |    +- Tokenize[19098] (Tokenize)
            |    +- Filter Stopwords (2)[19098] (Filter Stopwords (English))
            |    +- Stem (2)[19098] (Stem (Porter))
            |    +- Filter Tokens (by Length)[19098] (Filter Tokens (by Length))
            |    +- Extract Token Number[19098] (Extract Token Number)
            +- SVM[1] (Support Vector Machine (LibSVM))
            +- Read Database (2)[1] (Read Database)
            +- Replace (2)[1] (Replace)
            +- Set Role (3)[1] (Set Role)
            +- Set Role (4)[0] (Set Role)
            +- Nominal to Text (2)[1] (Nominal to Text)
            +- Process Documents from Data (2)[1] (Process Documents from Data)
          subprocess 'Vector Creation'
            |    +- Transform Cases (2)[1] (Transform Cases)
      ==>  |    +- Tokenize (2)[1] (Tokenize)
            |    +- Filter Stopwords (English)[0] (Filter Stopwords (English))
            |    +- Filter Tokens (2)[0] (Filter Tokens (by Length))
            |    +- Stem (Porter)[0] (Stem (Porter))
            |    +- Extract Token Number (2)[0] (Extract Token Number)
            +- Apply Model[0] (Apply Model)
Nov 22, 2010 10:19:39 AM SEVERE: The attribute text was already present in the example set.


<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.0">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.0.0" expanded="true" name="Root">
    <description>Using a simple Naive Bayes classifier.</description>
    <process expanded="true" height="656" width="949">
      <operator activated="true" class="read_database" compatibility="5.0.10" expanded="true" height="60" name="Read Database" width="90" x="45" y="570">
        <list key="data_set_meta_data_information"/>
        <parameter key="attribute_names_already_defined" value="true"/>
        <parameter key="connection" value="sql_work"/>
        <parameter key="query" value="SELECT &quot;category&quot;, &quot;doctext&quot;&#10;  FROM &quot;traindocs&quot;&#10; WHERE&#13; &quot;filter&quot; = 'test'&#10;&#10;"/>
      </operator>
      <operator activated="true" class="set_role" compatibility="5.0.10" expanded="true" height="76" name="Set Role (2)" width="90" x="45" y="390">
        <parameter key="name" value="category"/>
        <parameter key="target_role" value="label"/>
      </operator>
      <operator activated="true" class="replace" compatibility="5.0.10" expanded="true" height="76" name="Replace" width="90" x="179" y="390">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="doctext"/>
        <parameter key="regular_expression" value="http.* "/>
        <parameter key="include_special_attributes" value="true"/>
        <parameter key="replace_what" value="http.*\s|#\w+\s|@\w*\s"/>
      </operator>
      <operator activated="true" class="nominal_to_text" compatibility="5.0.10" expanded="true" height="76" name="Nominal to Text" width="90" x="179" y="480">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="doctext"/>
        <parameter key="attributes" value="posttitle|postdesc"/>
      </operator>
      <operator activated="true" class="text:process_document_from_data" compatibility="5.0.6" expanded="true" height="76" name="Process Documents from Data" width="90" x="313" y="300">
        <list key="specify_weights"/>
        <process expanded="true" height="565" width="882">
          <operator activated="true" class="text:transform_cases" compatibility="5.0.6" expanded="true" height="60" name="Transform Cases" width="90" x="45" y="30"/>
          <operator activated="true" class="text:tokenize" compatibility="5.0.6" expanded="true" height="60" name="Tokenize" width="90" x="179" y="30"/>
          <operator activated="true" class="text:filter_stopwords_english" compatibility="5.0.7" expanded="true" height="60" name="Filter Stopwords (2)" width="90" x="313" y="30"/>
          <operator activated="true" class="text:stem_porter" compatibility="5.0.7" expanded="true" height="60" name="Stem (2)" width="90" x="447" y="30"/>
          <operator activated="true" class="text:filter_by_length" compatibility="5.0.7" expanded="true" height="60" name="Filter Tokens (by Length)" width="90" x="648" y="30">
            <parameter key="min_chars" value="2"/>
          </operator>
          <operator activated="true" class="text:extract_token_number" compatibility="5.0.7" expanded="true" height="60" name="Extract Token Number" width="90" x="794" y="45"/>
          <connect from_port="document" to_op="Transform Cases" to_port="document"/>
          <connect from_op="Transform Cases" from_port="document" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_op="Filter Stopwords (2)" to_port="document"/>
          <connect from_op="Filter Stopwords (2)" from_port="document" to_op="Stem (2)" to_port="document"/>
          <connect from_op="Stem (2)" from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/>
          <connect from_op="Filter Tokens (by Length)" from_port="document" to_op="Extract Token Number" to_port="document"/>
          <connect from_op="Extract Token Number" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="support_vector_machine_libsvm" compatibility="5.0.10" expanded="true" height="76" name="SVM" width="90" x="514" y="255">
        <parameter key="kernel_type" value="linear"/>
        <list key="class_weights"/>
      </operator>
      <operator activated="true" class="read_database" compatibility="5.0.10" expanded="true" height="60" name="Read Database (2)" width="90" x="45" y="30">
        <list key="data_set_meta_data_information"/>
        <parameter key="attribute_names_already_defined" value="true"/>
        <parameter key="connection" value="sql_work"/>
        <parameter key="query" value="SELECT &quot;id&quot;, &quot;doctext&quot;&#13;&#10;FROM &quot;traindocs&quot;&#13;&#10;"/>
      </operator>
      <operator activated="true" class="replace" compatibility="5.0.10" expanded="true" height="76" name="Replace (2)" width="90" x="179" y="30">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="text"/>
        <parameter key="regular_expression" value="http.* "/>
        <parameter key="include_special_attributes" value="true"/>
        <parameter key="replace_what" value="http.*\s|#\w+\s|@\w*\s"/>
      </operator>
      <operator activated="true" class="set_role" compatibility="5.0.10" expanded="true" height="76" name="Set Role (3)" width="90" x="45" y="120">
        <parameter key="name" value="id"/>
        <parameter key="target_role" value="id"/>
      </operator>
      <operator activated="true" class="nominal_to_text" compatibility="5.0.10" expanded="true" height="76" name="Nominal to Text (2)" width="90" x="179" y="120">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="text"/>
        <parameter key="attributes" value="posttitle|postdesc"/>
      </operator>
      <operator activated="true" class="text:process_document_from_data" compatibility="5.0.6" expanded="true" height="76" name="Process Documents from Data (2)" width="90" x="313" y="120">
        <parameter key="keep_text" value="true"/>
        <list key="specify_weights"/>
        <process expanded="true" height="657" width="882">
          <operator activated="true" class="text:transform_cases" compatibility="5.0.6" expanded="true" height="60" name="Transform Cases (2)" width="90" x="45" y="30"/>
          <operator activated="true" class="text:tokenize" compatibility="5.0.6" expanded="true" height="60" name="Tokenize (2)" width="90" x="313" y="30"/>
          <operator activated="false" class="text:filter_stopwords_english" compatibility="5.0.7" expanded="true" height="60" name="Filter Stopwords (English)" width="90" x="313" y="210"/>
          <operator activated="false" class="text:filter_by_length" compatibility="5.0.7" expanded="true" height="60" name="Filter Tokens (2)" width="90" x="179" y="210">
            <parameter key="min_chars" value="2"/>
          </operator>
          <operator activated="false" class="text:stem_porter" compatibility="5.0.7" expanded="true" height="60" name="Stem (Porter)" width="90" x="447" y="210"/>
          <operator activated="false" class="text:extract_token_number" compatibility="5.0.7" expanded="true" height="60" name="Extract Token Number (2)" width="90" x="581" y="210"/>
          <connect from_port="document" to_op="Transform Cases (2)" to_port="document"/>
          <connect from_op="Transform Cases (2)" from_port="document" to_op="Tokenize (2)" to_port="document"/>
          <connect from_op="Tokenize (2)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="apply_model" compatibility="5.0.10" expanded="true" height="76" name="Apply Model" width="90" x="447" y="120">
        <list key="application_parameters"/>
      </operator>
      <connect from_op="Read Database" from_port="output" to_op="Set Role (2)" to_port="example set input"/>
      <connect from_op="Set Role (2)" from_port="example set output" to_op="Replace" to_port="example set input"/>
      <connect from_op="Replace" from_port="example set output" to_op="Nominal to Text" to_port="example set input"/>
      <connect from_op="Nominal to Text" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
      <connect from_op="Process Documents from Data" from_port="example set" to_op="SVM" to_port="training set"/>
      <connect from_op="Process Documents from Data" from_port="word list" to_op="Process Documents from Data (2)" to_port="word list"/>
      <connect from_op="SVM" from_port="model" to_op="Apply Model" to_port="model"/>
      <connect from_op="Read Database (2)" from_port="output" to_op="Replace (2)" to_port="example set input"/>
      <connect from_op="Replace (2)" from_port="example set output" to_op="Set Role (3)" to_port="example set input"/>
      <connect from_op="Set Role (3)" from_port="example set output" to_op="Nominal to Text (2)" to_port="example set input"/>
      <connect from_op="Nominal to Text (2)" from_port="example set output" to_op="Process Documents from Data (2)" to_port="example set"/>
      <connect from_op="Process Documents from Data (2)" from_port="example set" to_op="Apply Model" to_port="unlabelled data"/>
      <connect from_op="Apply Model" from_port="labelled data" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="216"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

Answers

  • land
    land New Altair Community Member
    Hi,
    the problem is, as the error says, that there is already an attribute "text" present in your example set. During application, the Process Documents operator makes sure that the bag of words is built with exactly the same attribute names as in training, because the model would otherwise use the wrong attributes for classification.
    If the application data set already contains an attribute that wasn't in the training data set, and that attribute's name is also part of the word list, this error occurs. The best way to prevent it is either to include exactly the same attributes in training as in testing (switch on the "keep text" parameter in training, too), or to remove the additionally created attribute by switching off the "keep text" parameter in the application.

    Greetings,
      Sebastian
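
    As a minimal sketch of the second option, the apply-side operator from the posted process could look like the fragment below. The operator and parameter names are copied from the process XML in the question; the only change is the keep_text value. Layout attributes (height, width, x, y) are left out for brevity, which assumes RapidMiner re-creates them on import.

    ```xml
    <!-- Apply-side operator from the posted process, with keep_text switched off. -->
    <operator activated="true" class="text:process_document_from_data" compatibility="5.0.6" expanded="true" name="Process Documents from Data (2)">
      <!-- false: no extra "text" attribute is appended, so nothing can clash with a word list entry named "text" -->
      <parameter key="keep_text" value="false"/>
      <list key="specify_weights"/>
      <process expanded="true">
        <!-- Inner operators copied unchanged from the posted process -->
        <operator activated="true" class="text:transform_cases" compatibility="5.0.6" expanded="true" name="Transform Cases (2)"/>
        <operator activated="true" class="text:tokenize" compatibility="5.0.6" expanded="true" name="Tokenize (2)"/>
        <connect from_port="document" to_op="Transform Cases (2)" to_port="document"/>
        <connect from_op="Transform Cases (2)" from_port="document" to_op="Tokenize (2)" to_port="document"/>
        <connect from_op="Tokenize (2)" from_port="document" to_port="document 1"/>
      </process>
    </operator>
    ```

    For the first option, the same keep_text parameter would instead be set to "true" on the training-side "Process Documents from Data" operator, so that both example sets carry the same attributes.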
  • B_
    B_ New Altair Community Member
    Thanks, Sebastian.

    I'm surprised this error hasn't shown up before, since I use this basic structure for several classification tasks. Turning off keep text solved the problem.
  • AndrewB1
    AndrewB1 New Altair Community Member

    This bug is still present. As you note, when you select "keep_text = true", Process Documents adds a new field called text. If you have tokenized your data and "text" is one of the tokens, the process breaks. There does not seem to be an elegant workaround as of RM 7.4.

  • MartinLiebig
    MartinLiebig
    Altair Employee

    Hi,

    there is one. Just don't keep the text; join it back onto the example set later on.

    Best,

    Martin
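
    One possible way to wire this up in the process from the original question is sketched below. It is an assumption-laden fragment, not a tested process: the "Multiply" and "Join" operators, their class names ("multiply", "join") and their port names ("input", "output 1", "output 2", "left", "right", "join") do not appear anywhere in this thread and should be checked against your installed version. The sketch also assumes that keep_text is off on the apply side, that the "id" role set by "Set Role (3)" survives Process Documents, and that Join matches on the ID attribute by default. Layout attributes are omitted.

    ```xml
    <!-- Assumed operators: copy the apply data before vectorization, re-attach it afterwards -->
    <operator activated="true" class="multiply" expanded="true" name="Multiply"/>
    <operator activated="true" class="join" expanded="true" name="Join"/>

    <!-- Split the apply data after Nominal to Text (2) -->
    <connect from_op="Nominal to Text (2)" from_port="example set output" to_op="Multiply" to_port="input"/>
    <!-- Copy 1 is vectorized and scored as before (keep_text off) -->
    <connect from_op="Multiply" from_port="output 1" to_op="Process Documents from Data (2)" to_port="example set"/>
    <connect from_op="Process Documents from Data (2)" from_port="example set" to_op="Apply Model" to_port="unlabelled data"/>
    <!-- Copy 2 still holds id and doctext; join it onto the scored set by id -->
    <connect from_op="Apply Model" from_port="labelled data" to_op="Join" to_port="left"/>
    <connect from_op="Multiply" from_port="output 2" to_op="Join" to_port="right"/>
    <connect from_op="Join" from_port="join" to_port="result 1"/>
    ```

    Because the original text column is only attached after the model has been applied, it never has to share an example set with the word list attributes while they are being created.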

  • LarissaMoraes
    LarissaMoraes New Altair Community Member
    Hi,

    I have this problem too, but I don't understand how to solve it, because it occurs when I'm using Auto Model. Can anyone give me more details on what I need to do, please?