Process Documents from Data: Apply to a new set of data

btibert
btibert New Altair Community Member
edited November 5 in Community Q&A
Perhaps I am missing something obvious, but the Process Documents from Data operator seems comparable to other pre-processing models that we can use with Apply Model.  After processing an ExampleSet of text with this operator, is there a way to apply the same model to a new ExampleSet?

A comparable flow in sklearn would be fitting a CountVectorizer on one dataset and then calling transform on new data.

Answers

  • btibert
    btibert New Altair Community Member
    Is the approach to connect the "Word List" to a second Process Documents operator that takes in the new data?  Would the subprocess then contain no operators and simply pass its input to the output?
  • Telcontar120
    Telcontar120 New Altair Community Member
    Answer ✓
    You would actually need to do both.  Any document processing steps that you take inside Process Documents (e.g., tokenization, n-grams, etc.) need to be applied to future datasets as well.  You then use the wordlist input port to ensure that only the words that were present when the initial model was built get counted in subsequent scoring.  Otherwise, the new documents may generate new words, and the output may be missing words that the model is looking for.
  • btibert
    btibert New Altair Community Member
    When I attached the Word List to process new documents, it behaved as I expected.  If a new document contains new tokens, they are treated as out-of-vocabulary (OOV) and ignored.
  • btibert
    btibert New Altair Community Member
    Actually, my last comment was an oversight.  You do indeed have to re-use the same operators again, even though we are passing in the word list.  I feel as if this is an extra step that should have been avoided, but it does work.  Other toolkits abstract away the need to do the same processing.
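
For readers who think in code, the point made in this thread can be sketched in plain Python. This is only an illustration of the idea, not how RapidMiner implements Process Documents: the helper names (tokenize, build_word_list, vectorize) and the ngram_max parameter are made up for the example. The same tokenization and n-gram steps are run on both the training and the new documents, and the word list from training fixes which columns get counted.

```python
import re
from collections import Counter

# Hypothetical helpers for illustration only; these are not RapidMiner or sklearn APIs.

def tokenize(text, ngram_max=1):
    """Lowercase, split on non-letters, and optionally append n-grams joined by '_'."""
    words = [w for w in re.split(r"[^a-z]+", text.lower()) if w]
    tokens = list(words)
    for n in range(2, ngram_max + 1):
        tokens += ["_".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    return tokens

def build_word_list(docs, ngram_max=1):
    """Collect the sorted vocabulary seen in the training documents."""
    return sorted({tok for doc in docs for tok in tokenize(doc, ngram_max)})

def vectorize(docs, word_list, ngram_max=1):
    """Count only tokens that appear in word_list; all other tokens are OOV and dropped."""
    rows = []
    for doc in docs:
        counts = Counter(tokenize(doc, ngram_max))
        rows.append([counts[w] for w in word_list])
    return rows

train_docs = ["The cat sat", "The dog sat down"]
new_docs = ["The cat chased the bird"]

# Build the word list once, from the training data only.
word_list = build_word_list(train_docs, ngram_max=2)
train_matrix = vectorize(train_docs, word_list, ngram_max=2)

# New data: same processing steps plus the training word list.
new_matrix = vectorize(new_docs, word_list, ngram_max=2)

# Passing the word list but skipping the n-gram step leaves the
# bigram columns (such as "the_cat") at zero.
mismatched = vectorize(new_docs, word_list, ngram_max=1)
```

Here "chased" and "bird" never get a column because they are not in the training word list, and the mismatched result shows why the word list alone is not enough: without the same processing steps, the new documents never produce the tokens that the word list expects.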