"Regarding Text Mining"

maria_godric New Altair Community Member
edited November 5 in Community Q&A
Hi,

I have a text document. How can I delete the content between two special characters (for example, my document contains #something#)? I want to delete the special characters as well. I tried TextCleaner, but it requires listing the exact content to delete, so I don't think it will work for a large amount of data. Are there any operators available in RM for this?

Thanks,
Maria

Answers

  • land
    land New Altair Community Member
    Hi,
    you might add a TokenReplace operator before the tokenizer during TextProcessing and then use a regular expression to capture whatever you want to remove.

    Here's an example process setup:
    <operator name="Root" class="Process" expanded="yes">
        <operator name="TextInput" class="TextInput" expanded="yes">
            <list key="texts">
            </list>
            <list key="namespaces">
            </list>
            <operator name="TokenReplace" class="TokenReplace">
                <list key="replace_dictionary">
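                  <!-- regex key: a '#', any run of non-'#' characters, then the closing '#'; each match is replaced by a single space -->
                  <parameter key="#[^#]*#" value=" "/>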
                </list>
            </operator>
            <operator name="StringTokenizer" class="StringTokenizer">
            </operator>
        </operator>
    </operator>
    For more information about regular expressions, you could visit Wikipedia: http://en.wikipedia.org/wiki/Regular_expression. To try out an expression without executing the process, you could use the online form at http://en.wikipedia.org/wiki/Regular_expression.

    Greetings,
      Sebastian
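
To see what the pattern in the replace_dictionary does without running the process, here is a minimal Python sketch of the same substitution. It is not part of the RapidMiner setup; the function name `strip_marked_spans` is made up for illustration, and the only thing taken from the answer is the regular expression `#[^#]*#` and the single-space replacement.

```python
import re

# Same pattern as the replace_dictionary key in the process XML:
# a '#', then any run of characters that are not '#', then the closing '#'.
MARKED_SPAN = re.compile(r"#[^#]*#")


def strip_marked_spans(text: str, replacement: str = " ") -> str:
    """Replace every #...# span, delimiters included, with `replacement`.

    Hypothetical helper for illustration only. A lone '#' with no partner
    is left untouched because the pattern needs both delimiters.
    """
    return MARKED_SPAN.sub(replacement, text)
```

Calling `strip_marked_spans("a#b#c")` gives `"a c"`. Note that `[^#]*` also matches line breaks, so a span that opens on one line and closes on a later one is removed as a whole.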
  • maria_godric
    maria_godric New Altair Community Member
    Thanks, Sebastian.

    It worked fine. But I would like to get the edited text in the same format as the original data, i.e. I need to save it as a .txt file.

    Thanks,
    Maria
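
The thread does not say which RapidMiner operator writes the cleaned text back out, so the following is only a sketch of one way to get a .txt result outside RapidMiner, using plain Python. The function name `clean_text_file`, the UTF-8 default, and any file names you pass in are assumptions; the pattern and the single-space replacement are the ones from the answer above.

```python
import re
from pathlib import Path

# Pattern from the answer above: '#', any non-'#' characters, closing '#'.
MARKED_SPAN = re.compile(r"#[^#]*#")


def clean_text_file(src: Path, dst: Path, encoding: str = "utf-8") -> None:
    """Read a plain-text file, blank out #...# spans, write plain text back.

    Hypothetical helper: `src` and `dst` are whatever paths you choose,
    and the encoding is assumed to be UTF-8 unless stated otherwise.
    """
    text = src.read_text(encoding=encoding)
    cleaned = MARKED_SPAN.sub(" ", text)  # each span becomes one space
    dst.write_text(cleaned, encoding=encoding)
```

For example, `clean_text_file(Path("input.txt"), Path("output.txt"))` would leave the original file untouched and write the cleaned copy next to it; both file names here are placeholders.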