
"Retaining selected word pairs when tokenizing"

User: "carl"
New Altair Community Member
Updated by Jocelyn

When tokenizing into single word tokens, is there a way to keep selected pairs of words together as a single token?  

 

For example, in soccer the term "centre forward" makes more sense as a single token. I looked at n-grams, but they also pair words that I do not want paired. I tried the stem dictionary, but it does not seem to work across multiple tokens. Applying the stemming step before Tokenize, e.g. to change "centre forward" to "centre-forward", does not appear to work either.

    User: "IngoRM"
    New Altair Community Member
    Accepted Answer

    Hi Carl,

     

    All observations are correct. Since there is no operator that replaces text across multiple tokens, I think you have to apply the Replace operator to the data set in your case. The other options do not really seem feasible here.

     

    But don't worry, you can actually do this by first transforming your document into an example set, performing the replacement, and then transforming it back into a document. The process below shows you how you can do this. Please note that you either need to change your tokenization to something other than "non letters", or you need to use letters as the delimiter in your replacement (or just no delimiter at all). Otherwise the tokenizer splits the joined pair again at the delimiter.

     

    This probably won't win first prize for elegance, but it does the job ;-)

     

    Hope that helps,

    Ingo

     

    <?xml version="1.0" encoding="UTF-8"?><process version="7.3.001">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="7.3.001" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="open_file" compatibility="7.3.001" expanded="true" height="68" name="Open File" width="90" x="45" y="34">
    <parameter key="filename" value="C:\Users\IngoMierswa\Desktop\Latest Materials\Data\mini_newsgroups\mini_newsgroups\alt.atheism\51121"/>
    </operator>
    <operator activated="true" class="text:read_document" compatibility="7.3.000" expanded="true" height="68" name="Read Document" width="90" x="179" y="34"/>
    <operator activated="true" class="text:documents_to_data" compatibility="7.3.000" expanded="true" height="82" name="Documents to Data" width="90" x="313" y="34">
    <parameter key="text_attribute" value="text"/>
    </operator>
    <operator activated="true" class="replace" compatibility="7.3.001" expanded="true" height="82" name="Replace" width="90" x="447" y="34">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="text"/>
    <parameter key="replace_what" value="political correctness"/>
    <parameter key="replace_by" value="politicalDELIMcorrectness"/>
    </operator>
    <operator activated="true" class="text:data_to_documents" compatibility="7.3.000" expanded="true" height="68" name="Data to Documents" width="90" x="581" y="34">
    <list key="specify_weights"/>
    </operator>
    <operator activated="true" class="select" compatibility="7.3.001" expanded="true" height="68" name="Select" width="90" x="715" y="34"/>
    <operator activated="true" class="text:tokenize" compatibility="7.3.000" expanded="true" height="68" name="Tokenize" width="90" x="849" y="34">
    <parameter key="characters" value=".: "/>
    </operator>
    <connect from_op="Open File" from_port="file" to_op="Read Document" to_port="file"/>
    <connect from_op="Read Document" from_port="output" to_op="Documents to Data" to_port="documents 1"/>
    <connect from_op="Documents to Data" from_port="example set" to_op="Replace" to_port="example set input"/>
    <connect from_op="Replace" from_port="example set output" to_op="Data to Documents" to_port="example set"/>
    <connect from_op="Data to Documents" from_port="documents" to_op="Select" to_port="collection"/>
    <connect from_op="Select" from_port="selected" to_op="Tokenize" to_port="document"/>
    <connect from_op="Tokenize" from_port="document" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    </process>
    </operator>
    </process>
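
For readers who want to see the idea outside RapidMiner, here is a minimal Python sketch of the same replace-then-tokenize trick. It is not part of the process above and does not use any RapidMiner API. The phrase list, the `DELIM` string, and the function names `join_phrases` and `tokenize_non_letters` are all hypothetical, chosen only for illustration. The sketch also matches phrases case-insensitively, which is an assumption on my side and not necessarily how the Replace operator is configured.

```python
import re

# Hypothetical phrase list and delimiter, for illustration only.
PHRASES = ["centre forward", "political correctness"]
DELIM = "DELIM"  # letters only, so a non-letter tokenizer will not split on it


def join_phrases(text, phrases=PHRASES, delim=DELIM):
    """Replace the whitespace inside each listed phrase with a letter-only delimiter."""
    for phrase in phrases:
        words = phrase.split()
        # Allow any run of whitespace between the words; ignore case (assumption).
        pattern = re.compile(
            r"\b" + r"\s+".join(re.escape(w) for w in words) + r"\b",
            re.IGNORECASE,
        )
        # Keep the original casing of the matched words and only swap the gap.
        text = pattern.sub(lambda m: delim.join(m.group(0).split()), text)
    return text


def tokenize_non_letters(text):
    """Split on every run of non-letter characters (ASCII letters only in this sketch)."""
    return [tok for tok in re.split(r"[^A-Za-z]+", text) if tok]


tokens = tokenize_non_letters(join_phrases("The centre forward scored twice."))
print(tokens)
```

The same tokenizer splits "centre-forward" into two tokens, because the hyphen is a non-letter. This is why the delimiter has to consist of letters (or be empty) when tokenizing on non letters.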