Text Mining classification problem with two data sets

mschmidkon New Altair Community Member
edited November 5 in Community Q&A
Hey!
I have an issue with text mining and keyword-based classification across two data sets. The goal is to classify products based on their textual descriptions.

INITIAL SITUATION:
I've got two data sets. The first contains a unique identifier (a number representing a product) and four text columns describing that product (short description, long description, etc.). The second contains two columns: a text label used for classification and the corresponding classification code. The goal is to classify the products from data set 1 according to data set 2. To do that, identical word occurrences have to be identified, and the classification code with the highest number of matching word occurrences should be chosen. The process should take one product from the first data set and compare it against all labels from the second data set in order to find the best-fitting label.
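To make the goal concrete, the matching rule described above could be sketched roughly as follows in plain Python (outside RapidMiner). This is only an illustration under assumptions: the function names `tokenize` and `best_label`, the sample labels, and the sample product are all made up, and no stopword filtering or stemming is applied.

```python
from collections import Counter
import re

def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def best_label(product_texts, labels):
    """Return (code, score) for the label sharing the most word occurrences.

    product_texts: list of description strings for one product
    labels: list of (label_text, code) pairs
    Returns (None, 0) if no label shares any word with the product.
    """
    # Count every word occurrence across all text columns of the product.
    product_counts = Counter()
    for text in product_texts:
        product_counts.update(tokenize(text))

    best_code, best_score = None, 0
    for label_text, code in labels:
        # Each distinct label word contributes its occurrence count in the product.
        label_tokens = set(tokenize(label_text))
        score = sum(product_counts[t] for t in label_tokens)
        if score > best_score:
            best_code, best_score = code, score
    return best_code, best_score

# Hypothetical sample data, not taken from the real files.
labels = [("steel screws and bolts", "A10"), ("office paper supplies", "B20")]
product = ["Hex bolt M8", "Galvanized steel bolt, 50 pieces", "bolt", "steel hardware"]
result = best_label(product, labels)
```

In this sample only "steel" matches, because "bolt" and "bolts" are different tokens without stemming, which is one reason the stemming step matters for this kind of matching.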
CURRENT SITUATION:
I created a RapidMiner process that reads the two CSV files separately and converts each input with 'Process Documents from Data', including Tokenizing, Filter Stopwords, Stem and Generate n-Grams. The result sets contain the occurrences of the tokenized words. Now I want to compare the result sets of the two data sets (they don't have the same number of attributes in the same order, but some attributes are identical) in order to find 'similar' words and classify each product. Does anybody know how to compare these two data sets with a RapidMiner operator and how to classify the products?
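One way to think about comparing two word-occurrence tables whose attributes differ is to treat each row as a sparse term-to-count mapping, so that only the shared terms contribute to the comparison. The sketch below is a minimal Python illustration of that idea using cosine similarity, not a RapidMiner process. The names `cosine` and `classify`, the product ids, the codes, and the stemmed tokens are all hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse term-count dicts."""
    # Only terms present in both vectors contribute to the dot product,
    # so differing attribute sets and attribute order do not matter.
    shared = set(a) & set(b)
    dot = sum(a[t] * b[t] for t in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def classify(product_vectors, label_vectors):
    """Map each product id to the label code with the highest similarity."""
    return {
        pid: max(label_vectors, key=lambda code: cosine(pvec, label_vectors[code]))
        for pid, pvec in product_vectors.items()
    }

# Hypothetical word vectors, as they might look after tokenizing and stemming.
product_vectors = {
    1001: {"steel": 2, "bolt": 3, "hex": 1},
    1002: {"paper": 1, "a4": 1},
}
label_vectors = {
    "A10": {"steel": 1, "screw": 1, "bolt": 1},
    "B20": {"offic": 1, "paper": 1, "suppli": 1},
}
assignments = classify(product_vectors, label_vectors)
```

Cosine similarity is swapped in here as an assumption. Unlike a raw count of matching words, it normalizes for description length, so long product texts do not win simply by containing more words. When several labels tie, `max` returns the first one in the mapping.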

Thank you very much!

Michael

Best Answer

  • rfuentealba New Altair Community Member
    Answer ✓

    Would you mind sharing your process with us so that we can give you better guidance?

    All the best,

    Rod.

Answers