"Does Rapid Miner have Normalize White space in Text processing"

nawafpower New Altair Community Member
edited November 5 in Community Q&A
Hi everybody,
I wonder whether RapidMiner has a "Normalize White Space" step among its built-in functions. I am trying to preprocess text documents by normalizing the case (to lower case) and normalizing the white space in the text files. If anybody can help with this, that would be great.
Thanks

Answers

  • IngoRM
    IngoRM New Altair Community Member
    Hi,

    Sorry, I did not get what you are after. Could you give an example of a text before and after the desired transformation, together with a description of what happens in between?

    Cheers,
    Ingo
  • nawafpower
    nawafpower New Altair Community Member
    Hi,
    By "Normalize White Space" I mean "removing any leading or trailing space and reducing any internal white space to one space character per occurrence". It is available in the JGAAP application by Patrick Juola. I found that this preprocessing step is very important in the text classification process, and I need to implement it in RapidMiner if that is possible.
    Regards
  • el_chief
    el_chief New Altair Community Member
    Most text classification processes will tokenize the document, rendering white space removal pointless.

    But if for some reason you really need to do it, it can be accomplished with one line of Groovy script.

    Why do you need to do this?
  • nawafpower
    nawafpower New Altair Community Member
    Hi Neil,
    I have been playing with JGAAP and found that, for authorship purposes, the best results came with normalizing white space and unifying case. You mentioned doing this with one line of code: how can I do my own programming within the RapidMiner GUI? I asked on your YouTube channel whether you could do a small video on authorship. Maybe you don't have time, but if you can, it would be great.
    I value your notes, Neil; they have always been helpful.
  • IngoRM
    IngoRM New Altair Community Member
    Hi,

    Well, you could use a combination of the operators "Trim" (which removes leading and trailing white space) and "Replace" (which replaces any remaining run of white space with a single space) for this task. Please note that these two operators work on attributes (not on documents or tokens), so you would have to perform the transformation before you use the text processing operators.

    Below you will find a sample process which demonstrates the two operators.

    "how can I do own programming with Rapid Miner GUI?"
    There is a white paper in our shop which explains that:

    http://rapid-i.com/component/page,shop.product_details/flypage,flypage.tpl/product_id,52/category_id,5/option,com_virtuemart/Itemid,180/

    Cheers,
    Ingo

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.1.008">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.1.008" expanded="true" name="Process">
        <process expanded="true" height="235" width="413">
          <operator activated="true" breakpoints="after" class="subprocess" compatibility="5.1.008" expanded="true" height="76" name="Subprocess" width="90" x="45" y="30">
            <process expanded="true" height="674" width="924">
              <operator activated="true" class="retrieve" compatibility="5.1.008" expanded="true" height="60" name="Retrieve" width="90" x="45" y="30">
                <parameter key="repository_entry" value="//Samples/data/Labor-Negotiations"/>
              </operator>
              <operator activated="true" class="select_attributes" compatibility="5.1.008" expanded="true" height="76" name="Select Attributes" width="90" x="179" y="30">
                <parameter key="attribute_filter_type" value="single"/>
                <parameter key="attribute" value="vacation"/>
                <parameter key="include_special_attributes" value="true"/>
              </operator>
              <operator activated="true" class="replace" compatibility="5.1.008" expanded="true" height="76" name="Replace" width="90" x="313" y="30">
                <parameter key="attribute_filter_type" value="single"/>
                <parameter key="attribute" value="vacation"/>
                <parameter key="replace_what" value="-"/>
                <parameter key="replace_by" value="            "/>
              </operator>
              <operator activated="true" class="replace" compatibility="5.1.008" expanded="true" height="76" name="Replace (2)" width="90" x="447" y="30">
                <parameter key="attribute_filter_type" value="single"/>
                <parameter key="attribute" value="vacation"/>
                <parameter key="replace_what" value="(.*)"/>
                <parameter key="replace_by" value="            $1          "/>
              </operator>
              <operator activated="true" class="filter_examples" compatibility="5.1.008" expanded="true" height="76" name="Filter Examples" width="90" x="581" y="30">
                <parameter key="condition_class" value="no_missing_attributes"/>
              </operator>
              <operator activated="true" class="nominal_to_text" compatibility="5.1.008" expanded="true" height="76" name="Nominal to Text" width="90" x="715" y="30">
                <parameter key="attribute_filter_type" value="single"/>
                <parameter key="attribute" value="vacation"/>
              </operator>
              <connect from_op="Retrieve" from_port="output" to_op="Select Attributes" to_port="example set input"/>
              <connect from_op="Select Attributes" from_port="example set output" to_op="Replace" to_port="example set input"/>
              <connect from_op="Replace" from_port="example set output" to_op="Replace (2)" to_port="example set input"/>
              <connect from_op="Replace (2)" from_port="example set output" to_op="Filter Examples" to_port="example set input"/>
              <connect from_op="Filter Examples" from_port="example set output" to_op="Nominal to Text" to_port="example set input"/>
              <connect from_op="Nominal to Text" from_port="example set output" to_port="out 1"/>
              <portSpacing port="source_in 1" spacing="0"/>
              <portSpacing port="sink_out 1" spacing="0"/>
              <portSpacing port="sink_out 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="trim" compatibility="5.1.008" expanded="true" height="76" name="Trim" width="90" x="179" y="30">
            <parameter key="attribute_filter_type" value="single"/>
            <parameter key="attribute" value="vacation"/>
          </operator>
          <operator activated="true" class="replace" compatibility="5.1.008" expanded="true" height="76" name="Replace (3)" width="90" x="313" y="30">
            <parameter key="attribute_filter_type" value="single"/>
            <parameter key="attribute" value="vacation"/>
            <parameter key="replace_what" value="\s+"/>
            <parameter key="replace_by" value=" "/>
          </operator>
          <connect from_op="Subprocess" from_port="out 1" to_op="Trim" to_port="example set input"/>
          <connect from_op="Trim" from_port="example set output" to_op="Replace (3)" to_port="example set input"/>
          <connect from_op="Replace (3)" from_port="example set output" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
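
For readers who want to check the expected result outside RapidMiner, here is a minimal Python sketch of the same two steps used in the process above (trim, then replace `\s+` with a single space), followed by lower-casing as asked in the original question. This is not RapidMiner code; the function names `normalize_whitespace` and `preprocess` and the sample values are made up for illustration.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Remove leading/trailing white space and collapse internal runs to one space."""
    trimmed = text.strip()               # plays the role of the "Trim" operator
    return re.sub(r"\s+", " ", trimmed)  # plays the role of "Replace" with \s+ -> " "


def preprocess(text: str) -> str:
    """Hypothetical helper: normalize white space, then unify case to lower case."""
    return normalize_whitespace(text).lower()


# Illustrative values only, padded with extra white space on purpose.
values = ["   below            average   ", "\tGenerous\n", "average"]
cleaned = [preprocess(v) for v in values]
print(cleaned)
```

The order of the two steps does not matter for the final string here; trimming first simply mirrors the operator order in the sample process.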