"How to filter out a text so as to keep only words given in a list of words"

barthos
barthos New Altair Community Member
edited November 5 in Community Q&A
Hello,

I would like to filter a text so that the operator keeps only the words that are present in a provided list (or, equivalently, removes all the words that are not in the list). Ideally, a Stopword by dictionary operator with an "invert selection" option would be perfect.
As a side question, I would like to know the purpose of the "wor" entry (I guess it means word) in the Process_Document_from_Data operator.

Thanks,
Barthélémy

Answers

  • colo
    colo New Altair Community Member
    Hi Barthélémy,

    When I read your post, I remembered a similar question posted some time ago. You can find it here: http://rapid-i.com/rapidforum/index.php/topic,3493.0.html (did you even search for it? ;)) But don't expect a fully satisfying solution there. I don't know whether the developers have anything new at hand today...

    Which "wor" entry do you mean? The input port of the operator?

    Regards
    Matthias
  • land
    land New Altair Community Member
    Hi,
    If you have a word list and want to count only the words that are in this list, you can simply forward the word list to the "wor" input port of the Process Documents operator. Only then is it guaranteed that new texts get the same representation as during training. If you don't do this, the set of words can differ and the TF-IDF calculation will be different.

    If you need the filtered text itself rather than a filtered TF-IDF representation, there is unfortunately no way to do that yet. You could raise a feature request in our bugtracker for that.

    With kind regards,
    Sebastian
  • barthos
    barthos New Altair Community Member
    Thanks a lot!
    However, I've tried to make a list of words to pass to the "wor" entry, but I haven't found a way to do it. Is there a special operator to transform documents or an example set into a list of words?
    Thanks again,
    Barthélémy
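
For readers who need the filtered text itself rather than a filtered TF-IDF representation, the idea discussed above can be sketched outside RapidMiner in plain Python. This is only a minimal illustration under stated assumptions: the function names `keep_listed_words` and `count_listed_words` and the tokenizer pattern are hypothetical, they are not RapidMiner operators, and matching is assumed to be case-insensitive. The second function illustrates the point about the word list: counting over a fixed vocabulary gives every text the same set of columns.

```python
import re
from collections import Counter

# Assumed tokenizer: runs of letters/digits (accented letters included),
# optionally joined by apostrophes. Not taken from RapidMiner.
TOKEN_RE = re.compile(r"[^\W_]+(?:'[^\W_]+)*")


def keep_listed_words(text, word_list):
    """Return the text reduced to the tokens that appear in word_list."""
    allowed = {w.lower() for w in word_list}
    tokens = TOKEN_RE.findall(text)
    # Keep original token order and casing; drop everything not in the list.
    return " ".join(t for t in tokens if t.lower() in allowed)


def count_listed_words(text, word_list):
    """Count occurrences over a fixed vocabulary (zero for absent words)."""
    vocabulary = [w.lower() for w in word_list]
    counts = Counter(t.lower() for t in TOKEN_RE.findall(text))
    # Every text yields the same keys, so representations stay comparable.
    return {w: counts[w] for w in vocabulary}


if __name__ == "__main__":
    words = ["quick", "fox", "dog"]
    print(keep_listed_words("The quick brown fox jumps over the lazy dog", words))
    # prints: quick fox dog
    print(count_listed_words("The fox saw another fox", words))
    # prints: {'quick': 0, 'fox': 2, 'dog': 0}
```

Using a set for the lookup keeps the filter fast even for long word lists, while the fixed vocabulary order in the counting function mirrors why a shared word list keeps the representation of new texts consistent with the training data.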