Set Class Label to The Dataset

Hi All,
Answers
Hello @Fatin_Fezarudin,
Let me see if I get it:
- There is text.
- There should be a text classification based on certain words.
True?
It all depends on how you want to determine what words are important, and there are at least three ways (that I know of) to determine such a thing:
1.- Having a collection of words.
Just:
- Have a list of words somewhere.
- Use Loop Examples to walk through that list of words, and inside this loop:
----> Filter Examples and use a "contains" filter.
----> Add your word as an attribute.
----> Join and save (you can use a Remember/Retrieve operator so you can handle what is saved and how)
- Retrieve the final results. Add Set Role to create your labels.
Except for the Remember/Retrieve, this is the easiest thing you can do, but for that you should already know what words are important.
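For readers who think better in code, the loop above can be sketched in plain Python. This is only an analogue of the RapidMiner process, not the operators themselves: the keyword list, the sample texts, and the `label_by_keyword` helper are all made up for illustration, and the "first keyword found wins" rule is an assumption.

```python
# Hypothetical keyword list and documents, for illustration only.
KEYWORDS = ["invoice", "refund", "delivery"]


def label_by_keyword(texts, keywords, default="unknown"):
    """Return (text, label) pairs; the label is the first keyword found in the text."""
    labeled = []
    for text in texts:
        lowered = text.lower()
        # Rough equivalent of a "contains" filter, tried once per keyword.
        label = next((kw for kw in keywords if kw in lowered), default)
        labeled.append((text, label))
    return labeled


docs = [
    "Please send the invoice again",
    "I want a refund for my order",
    "Nothing relevant here",
]
labeled_docs = label_by_keyword(docs, KEYWORDS)
for text, label in labeled_docs:
    print(label, "|", text)
```

Texts that match no keyword get the `default` label here, which is roughly what would be left over after the Join if you kept the unmatched examples.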
2.- Creating a collection of words.
Our great community manager @sgenzer posted this solution a while ago. I'm using Bold to indicate the names of the operators you should use on each step. Unfortunately I'm abroad and don
- (This is a suggestion) Filter Stopwords before doing the rest. Stopwords are words that connect other words but don't add meaning by themselves.
- Take your text and use the Split operator to create a ton of attributes.
- Transpose this mess so that your text is listed word by word in one attribute and a ton of examples.
- Use the Join operator with your keyword database list to see overlap.
- Aggregate to see word frequencies.
- (This is my addition) Filter Examples to get the most important words, Select Attributes to get a good grasp of your data, and then Label by the word list, and you will have many classes for each doc.
Now, since you have many classes here, I wouldn't save the result of the Join in a dataset, because that will end up in a huge file.
This is not difficult either, but since you don't have control over which words appear, you will need to add and remove breakpoints quite a bit to get an idea of how things go.
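The split / transpose / join / aggregate idea can also be sketched in plain Python. Again, this is an analogue and not the RapidMiner process: the stopword set, the keyword list, the sample reviews, and the helper names are all assumptions made for the example.

```python
from collections import Counter

STOPWORDS = {"the", "a", "is", "and", "of", "to"}  # tiny illustrative stopword list
KEYWORD_LIST = {"battery", "screen", "price"}  # hypothetical keyword database


def word_rows(texts):
    """Split each text into words and yield one (doc_id, word) row per word."""
    for doc_id, text in enumerate(texts):
        for word in text.lower().split():
            word = word.strip(".,;:!?")
            if word and word not in STOPWORDS:
                yield doc_id, word


def keyword_frequencies(texts, keywords):
    """Keep only rows whose word is in the keyword list, then count per word."""
    return Counter(word for _, word in word_rows(texts) if word in keywords)


def labels_per_doc(texts, keywords):
    """Map each document id to the sorted keywords it contains."""
    found = {doc_id: set() for doc_id in range(len(texts))}
    for doc_id, word in word_rows(texts):
        if word in keywords:
            found[doc_id].add(word)
    return {doc_id: sorted(words) for doc_id, words in found.items()}


reviews = [
    "The battery is great and the price is fair.",
    "Screen cracked, battery died.",
]
freqs = keyword_frequencies(reviews, KEYWORD_LIST)
doc_labels = labels_per_doc(reviews, KEYWORD_LIST)
print(freqs.most_common())
print(doc_labels)
```

`word_rows` plays the role of Split plus Transpose (one word per row), the `in keywords` check stands in for the Join, and the `Counter` stands in for Aggregate. Each document can end up with several labels, as described above.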
3.- Analyze the text with text mining operators.
The usual process is:
- Use the Process Documents From Files or one of the appropriate text mining tools to:
----> create TF-IDF vectors,
----> Tokenize,
----> Lowercase,
----> Filter Stopwords,
----> Generate N-Grams if you need associations of words.
----> Or Filter Tokens by POS to get only verbs, nouns, adjectives...
----> Or Filter Tokens by one of the others.
----> Or Lemmatize to create some meaning.
- Once you get your results, you can apply some kind of segmentation operator (it's up to you, I'm running out of knowledge here) to define which words are important.
- Once you get that segmentation, you can do some magic to associate these important words to the original texts.
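To make the TF-IDF step a bit more concrete, here is a minimal sketch in plain Python. It is not what the text mining operators do internally and it skips the segmentation step entirely: it simply assumes that the highest-weighted TF-IDF term of each text is a reasonable candidate label. The stopword set, the sample corpus, and the function names are hypothetical.

```python
import math
import re
from collections import Counter

STOPWORDS = {"the", "a", "is", "and", "of", "to", "in", "on"}  # illustrative only


def tokenize(text):
    """Lowercase, keep runs of letters, and drop stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def tfidf(docs):
    """Return one {term: tf-idf weight} dict per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1
        vectors.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


def top_term(vector):
    """Pick the highest-weighted term; ties are broken alphabetically."""
    if not vector:
        return None
    return min(vector, key=lambda term: (-vector[term], term))


corpus = [
    "The cat sat on the mat today and the cat slept",
    "Today the dog chased the ball and the dog barked",
    "Stock prices fell today and stock traders panicked",
]
vectors = tfidf(corpus)
candidate_labels = [top_term(v) for v in vectors]
print(candidate_labels)
```

A word that appears in every text ("today" in this toy corpus) gets a weight of zero, which is the reason for using TF-IDF rather than raw counts. Whether one top term per text is a good label depends heavily on your data, so treat it as a starting point.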
That said, I consider text mining and natural language processing a complete area inside Machine Learning. There is so much to know about how languages work, how sentences are built and all that... But as a first step, this should serve as your initial guide.
All the best,
@rfuentealba Hi, I have the same problem, and thank you for your answer. I also have texts, and with "Process Documents" I split each text into words. I have 100 texts that I need to classify. Would you please let me know how I can choose one word for each row as a class, so that I can use a clustering operator? Thanks in advance for the reply.