Hi Marco,
Thanks for the idea. The process is clearly fast!
But two things don't fit my needs:
1 - Loss of the grouping of word matches under category names (the column headers of the dictionary): I lose the ability to pivot at the end of the process in order to group results under the dictionary's categories. Applying your process to my specific case (described at the beginning of the thread) returns a dataset containing 6742 columns, one for each word matched in the dictionary, and no categories.
2 - Loss of the findings that result from matching a character string (e.g. if I put « app » in the dictionary, my « loop in loop » process returns every verbatim containing this string of characters). In a way, it behaves like a stemming process (I don't mind having false positives, because a manual verification will be done in a second step). The process you propose uses tokenization (the result of word processing), and as a consequence this capability is lost.
The first point is a sine qua non, but the second could be an acceptable loss.
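To make the two requirements concrete, here is a minimal sketch in Python (not RapidMiner) of the behaviour described above: substring matching, with hits counted per category rather than per word. The dictionary contents, the verbatims, and the name `count_by_category` are all hypothetical and only serve as an illustration.

```python
# Hypothetical dictionary: category name (column header) -> character strings to find.
DICTIONARY = {
    "software": ["app", "program"],
    "billing": ["invoice", "refund"],
}


def count_by_category(verbatim, dictionary):
    """Count substring hits per category (not per word) in one verbatim."""
    text = verbatim.lower()
    return {
        # str.count finds the string anywhere, including inside longer words.
        category: sum(text.count(term.lower()) for term in terms)
        for category, terms in dictionary.items()
    }


# Hypothetical verbatims.
verbatims = [
    "The application crashed and I want a refund",
    "Happy with the new app",
]

# One row per verbatim, one column per category.
rows = [count_by_category(v, DICTIONARY) for v in verbatims]
```

In this sketch "app" matches inside "application" and also inside "Happy", which is the kind of false positive that the manual verification step would filter out.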
Cheers
Loops inside loops are computationally expensive. I think the complexity here is O(n^2), which is undesirable in any programming language, not just RapidMiner Studio. Perhaps there is a way to simplify the search by applying some tricks? Also, it sounds like you would benefit from tokenizing words rather than using columns for your searches.
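One possible trick, sketched in Python rather than as a RapidMiner process: build a reverse index from each term to its category once, then do a single lookup per token instead of looping over the whole dictionary for every verbatim. The dictionary and the function names are hypothetical, and the sketch assumes each term belongs to exactly one category. Note that it matches whole tokens only, so it does not find a term inside a longer word.

```python
import re
from collections import Counter

# Hypothetical dictionary: category name -> terms.
DICTIONARY = {
    "software": ["app", "program"],
    "billing": ["invoice", "refund"],
}


def build_term_index(dictionary):
    """Invert the dictionary once: term -> category."""
    return {
        term.lower(): category
        for category, terms in dictionary.items()
        for term in terms
    }


def categories_for(verbatim, term_index):
    """Tokenize the verbatim and count category hits with one lookup per token."""
    tokens = re.findall(r"\w+", verbatim.lower())
    return Counter(term_index[t] for t in tokens if t in term_index)


TERM_INDEX = build_term_index(DICTIONARY)
counts = categories_for("The app sent an invoice, then another invoice", TERM_INDEX)
```

With the index in place, the cost per verbatim grows with the number of tokens rather than with tokens times dictionary size, and the result is already grouped by category.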
Would you mind sharing your process with us so that we can check whether there is anything we can do?
About your question: is there a way to define the instance design, in consideration of the number of columns?
The number of columns isn't a real measure of memory consumption unless you know exactly how large each column is and what it is composed of. I think your big issue isn't memory but optimization, though. (I may be wrong, but it's worth a shot.)
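As a rough back-of-the-envelope illustration of why the column count alone says little, here is a small Python sketch. It assumes a dense table where every cell has a fixed width in bytes; the row count and the cell widths are hypothetical, and this is not a description of how RapidMiner actually stores data.

```python
def estimate_dense_bytes(n_rows, column_widths):
    """Rough size of a dense table: rows times the sum of per-cell byte widths."""
    return n_rows * sum(column_widths)


N_ROWS = 10_000   # hypothetical number of examples
N_COLS = 6742     # column count mentioned earlier in the thread

# Same number of columns, different composition (assumed cell widths).
as_doubles = estimate_dense_bytes(N_ROWS, [8] * N_COLS)  # 8-byte numeric cells
as_flags = estimate_dense_bytes(N_ROWS, [1] * N_COLS)    # 1-byte flag cells
```

Under these assumptions the same 6742 columns come to roughly 539 MB in one case and about 67 MB in the other, so the composition of the columns matters as much as their number.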
All the best,
Rod.