Define splitting characters for tokenizer?

silentguy
silentguy New Altair Community Member
edited November 5 in Community Q&A
Hi!
I was playing around with the text plugin because it seemed to be the easiest way to run SVMs on the data I am working with, and the examples already seem quite useful. However, the StringTokenizer does too much splitting for my files: it splits "get_file" at "_", "c:\windows" at "\", and so on.
Is there a way to tell it to split only on blank spaces, only on newlines, etc.? I tried making my own Tokenizer, but sadly the given one only calls edu.udo.cs.wvtool.generic.tokenizer.StringTokenizer, which comes from a library...

Answers

  • fischer
    fischer New Altair Community Member
    Hi,

    I think a custom tokenizer is the way to go. You are right that having most of the text plugin hidden in a library makes it hard to extend. The text operators are being migrated into the core, so you can check with 5.0 whether implementing a custom tokenizer becomes easier for you.

    Best,
    Simon
  • land
    land New Altair Community Member
    Hi,
    here's a workaround, if you can't wait for the next version:
    • Store the texts in a nominal attribute in an example set.
    • Use the split operator to split the texts according to your needs and distribute the parts over a number of attributes.
    • Use the MissingValueReplenishment operator to replace missing values with a blank " ".
    • Either change all generated nominal attributes to String attributes or use the filter_nominal_attributes parameter of the string text input.
    • Perform the text input, but don't use a String tokenizer, since all tokenization has already been carried out beforehand.
    This should do the trick...
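
    Steps 2 and 3 above (splitting the texts and filling the gaps with a blank) might look roughly like this outside RapidMiner. This is a plain-Java sketch, not the split or MissingValueReplenishment operators themselves; the class and method names are made up for illustration, and it assumes the texts should be split on whitespace only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Sketch only: plain Java, not the RapidMiner operators named in the steps above. */
class PreSplitSketch {

    /** Splits each text on runs of whitespace only and pads short rows with a blank " ". */
    static List<String[]> splitAndPad(List<String> texts) {
        List<String[]> rows = new ArrayList<String[]>();
        int width = 0;
        for (String text : texts) {
            // trim() avoids an empty leading field when the text starts with whitespace
            String trimmed = text.trim();
            String[] parts = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
            rows.add(parts);
            width = Math.max(width, parts.length);
        }
        List<String[]> padded = new ArrayList<String[]>();
        for (String[] parts : rows) {
            // copyOf leaves the extra positions null; they play the role of missing values
            String[] row = Arrays.copyOf(parts, width);
            for (int i = parts.length; i < width; i++) {
                row[i] = " ";
            }
            padded.add(row);
        }
        return padded;
    }

    public static void main(String[] args) {
        List<String> texts = Arrays.asList("get_file c:\\windows", "vm_read");
        for (String[] row : splitAndPad(texts)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

    Every row ends up with the same number of columns, so each column can be treated as one attribute, and terms such as "get_file" or "c:\windows" stay in one piece.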

    Greetings,
      Sebastian
  • silentguy
    silentguy New Altair Community Member
    Whoops, missed the second post... thanks for the tip...
    *scratch* okay, I wanted to try that, but I noticed how little I actually know about what I can do...
    Currently I have a lot of files looking like this:
    <Prozess 1>
    <Thread 1:1>
    get_file_attributes()
    create_file()
    vm_read()
    vm_read()
    vm_read()
    vm_read()
    vm_read()
    vm_read()
    enum_modules()
    </Thread 1>
    </Prozess 1>
    <Prozess 2>
    <Thread 2>
    load_image()
    get_system_directory()
    get_file_attributes()
    open_key()
    open_key()
    delete_key()
    </Thread 2>
    </Prozess 2>
    Right now I can't even find a way to load them so that I have the text as attributes (and the file names as ids and the folder as label or something)...
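
    To make the intended layout concrete, here is a rough plain-Java sketch of reading such files into rows of (id, label, text). It does not use RapidMiner or the text plugin at all. It assumes one subfolder per label directly under a root folder; the class name, the Row type and the default folder name "traces" are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/** Sketch only: plain Java, independent of RapidMiner and the text plugin. */
class TraceLoaderSketch {

    /** One row per file: id = file name, label = parent folder name, text = file content. */
    static final class Row {
        final String id;
        final String label;
        final String text;

        Row(String id, String label, String text) {
            this.id = id;
            this.label = label;
            this.text = text;
        }
    }

    /** Reads every regular file found directly inside each subfolder of root. */
    static List<Row> load(Path root) throws IOException {
        List<Row> rows = new ArrayList<Row>();
        try (DirectoryStream<Path> folders = Files.newDirectoryStream(root)) {
            for (Path folder : folders) {
                if (!Files.isDirectory(folder)) {
                    continue; // files lying directly in root have no label folder
                }
                try (DirectoryStream<Path> files = Files.newDirectoryStream(folder)) {
                    for (Path file : files) {
                        if (Files.isRegularFile(file)) {
                            // Assumes the files are UTF-8 text
                            String text = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
                            rows.add(new Row(file.getFileName().toString(),
                                    folder.getFileName().toString(), text));
                        }
                    }
                }
            }
        }
        return rows;
    }

    public static void main(String[] args) throws IOException {
        // "traces" is a hypothetical default; pass the real root folder as the first argument.
        Path root = Paths.get(args.length > 0 ? args[0] : "traces");
        if (!Files.isDirectory(root)) {
            System.out.println("No such folder: " + root);
            return;
        }
        for (Row row : load(root)) {
            System.out.println(row.id + "\t" + row.label + "\t" + row.text.length() + " chars");
        }
    }
}
```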
  • Ryujakk
    Ryujakk New Altair Community Member
    Hi!

    I had the same request as silentguy. I managed to hack together a new operator which lets the user specify which characters should be used as separators. You can download the modified plugin here: http://www.filedropper.com/rapidminer-advancedstringtokenizer-46
    It's a temporary solution until 5.0 comes out, but it does the trick for me. It's still probably full of bugs though, so don't use it to secure a nuclear plant  ;D
    PM me if you want the source code or find any bugs!
  • dbrown
    dbrown New Altair Community Member
    I was using the "advanced string tokenizer" provided by Ryujakk for RM 4.6 and found it was just what I needed. Is there any plan to include this functionality in RM 5.0?

    I downloaded the text processing add-on for RM 5.0 and found that it contains only the simple Tokenize block which splits at any non-letter character.  I want to be able to define my splitting characters (e.g., split only at whitespace and square brackets) so that RM will not split terms such as netshare1_user1 (which I want to keep as a single term).
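
    For illustration, splitting only at whitespace and square brackets can be sketched in plain Java with a regular expression. This is not the Tokenize operator of the add-on; the class name and the sample line are made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Sketch only: plain Java, not the Tokenize operator of the text processing add-on. */
class BracketWhitespaceSplitSketch {

    // One or more whitespace characters or square brackets act as a single separator.
    private static final Pattern SEPARATORS = Pattern.compile("[\\s\\[\\]]+");

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        for (String part : SEPARATORS.split(text)) {
            // split() yields an empty first element when the text starts with a separator
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The sample line is made up for illustration.
        System.out.println(tokenize("[netshare1_user1] logon from host-7"));
    }
}
```

    Because "_" and "-" are not in the character class, netshare1_user1 and host-7 each come out as a single term.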

    Will this feature be offered, and if so, when do you plan to release it?

    Thanks,
    David
  • fischer
    fischer New Altair Community Member
    Hi,

    looks like it is not worth making a plugin for a single operator :-) But of course, if you send me the source code, I'd love to build it into the next release of the text extension. Unfortunately, the link seems to be no longer working.

    Cheers,
    Simon
  • Ryujakk
    Ryujakk New Altair Community Member
    Hi,

    Indeed, for some reason my promotional 30-year account with 250 GB of storage space at filedropper seems to have been terminated :'(
    Anyhow, I ported the operator to be compatible with RM 5.0. For now, you can download it at http://www.megaupload.com/?d=0WP7NMQG (source included.)

    I added the following code to the package com.rapidminer.operator.text.io.tokenizer :

    /*
    *  RapidMiner
    *
    *  Copyright (C) 2001-2009 by Rapid-I and the contributors
    *
    *  Complete list of developers available at our web site:
    *
    *      http://rapid-i.com
    *
    *  This program is free software: you can redistribute it and/or modify
    *  it under the terms of the GNU Affero General Public License as published by
    *  the Free Software Foundation, either version 3 of the License, or
    *  (at your option) any later version.
    *
    *  This program is distributed in the hope that it will be useful,
    *  but WITHOUT ANY WARRANTY; without even the implied warranty of
    *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    *  GNU Affero General Public License for more details.
    *
    *  You should have received a copy of the GNU Affero General Public License
    *  along with this program.  If not, see http://www.gnu.org/licenses/.
    */
    package com.rapidminer.operator.text.io.tokenizer;

    import java.util.ArrayList;
    import java.util.List;

    import com.rapidminer.operator.OperatorDescription;
    import com.rapidminer.operator.UserError;
    import com.rapidminer.operator.text.Document;
    import com.rapidminer.operator.text.Token;
    import com.rapidminer.operator.text.io.AbstractTokenProcessor;
    import com.rapidminer.parameter.ParameterType;
    import com.rapidminer.parameter.ParameterTypeString;

    /**
    * This class tokenizes all tokens in the input.
    * The characters used as separators can be specified.
    *
    * @author Ryujakk
    */
    public class AdvancedTokenizerOperator extends AbstractTokenProcessor {

        /** Key of the parameter that holds the separator characters. */
        public static final String SEPARATORS = "characters";

        public AdvancedTokenizerOperator(OperatorDescription description) {
            super(description);
        }

        @Override
        protected Document doWork(Document textObject) throws UserError {
            String separators = getParameterAsString(SEPARATORS);

            List<Token> newSequence = new ArrayList<Token>();
            for (Token token : textObject.getTokenSequence()) {
                char[] tokenChars = token.getToken().toCharArray();
                int start = 0; // index at which the current sub-token begins
                for (int i = 0; i < tokenChars.length; i++) {
                    // Test the single character at position i against the separators
                    if (separators.contains("" + tokenChars[i])) {
                        // Emit the text collected so far, skipping empty pieces
                        if (i - start > 0) {
                            newSequence.add(new Token(new String(tokenChars, start, i - start), token));
                        }
                        start = i + 1;
                    }
                }
                // Emit whatever follows the last separator
                if (tokenChars.length - start > 0) {
                    newSequence.add(new Token(new String(tokenChars, start, tokenChars.length - start), token));
                }
            }
            textObject.setTokenSequence(newSequence);
            return textObject;
        }

        @Override
        public List<ParameterType> getParameterTypes() {
            List<ParameterType> types = super.getParameterTypes();
            types.add(new ParameterTypeString(SEPARATORS, "The characters used to separate individual tokens.", " "));
            return types;
        }
    }

    Basically, 2 lines changed from the original StringTokenizerOperator! It's all yours now.
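
    The splitting loop can also be tried on plain strings, without the RapidMiner Token and Document classes. The following is a hedged stand-alone sketch of the same idea (each character is checked against the separator string one at a time); the class and method names are made up and are not part of the plugin.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch only: the splitting idea on plain strings, without any RapidMiner classes. */
class SeparatorSplitSketch {

    /** Cuts text at every character that occurs in separators and drops empty pieces. */
    static List<String> split(String text, String separators) {
        List<String> tokens = new ArrayList<String>();
        int start = 0; // index at which the current token begins
        for (int i = 0; i < text.length(); i++) {
            if (separators.indexOf(text.charAt(i)) >= 0) {
                if (i > start) {
                    tokens.add(text.substring(start, i));
                }
                start = i + 1;
            }
        }
        // Keep whatever follows the last separator
        if (start < text.length()) {
            tokens.add(text.substring(start));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Separators are a blank and a newline; "_" and "\" are left alone.
        System.out.println(split("get_file  c:\\windows\nvm_read", " \n"));
    }
}
```

    With a blank and a newline as separators, "get_file" and "c:\windows" are kept whole, which is the behaviour asked for at the start of the thread.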

    - R
  • land
    land New Altair Community Member
    Hi,
    thank you very much. I will include it in the tokenizer now.

    Greetings,
      Sebastian