Text Classification/Categorization Help

Question

Hey there!

Just looking for some help regarding a project I'm currently working on. I'm very new to RapidMiner and AI in general and I'm looking for some direction.

I have a NoSQL MongoDB database storing 8000 different scraped jobs. The main attributes are Description, Title, Text and Keywords, and I have assigned the label jobs to all of them.

I want to be able to automatically classify/categorize all my jobs into different job sectors based on their job titles. For example, a software development job would be categorized into the technology sector. I don't know how to go about implementing this or how RapidMiner's different classification models work, so any help would be greatly appreciated.

Thanks for reading!

kypexin · Answer

Hi @1505993

I did a project on text classification once, so I'll point you to one of my answers in another thread on the same topic. I hope it is helpful or inspiring for you in some way: https://community.rapidminer.com/t5/RapidMiner-Studio-Forum/autotagging-and-autocategorizing-text-pieces/m-p/43717/highlight/true#M29049

lionelderkrikor · Answer

Hi @1505993,

I am having difficulty understanding:

Do you already have an AI program in C# that is able to automatically label jobs according to different variables (job title etc.)?

If that's the case, then indeed you don't need to train a model and you don't need RapidMiner.

But one question: have you evaluated the performance of this program (accuracy = correct predictions / total predictions)?
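As a minimal sketch of that accuracy check, the Python snippet below compares a program's labels against a manually verified sample. The function name `accuracy` and the example labels are hypothetical and only illustrate the formula.

```python
def accuracy(predicted, actual):
    """Fraction of predictions that match the manually checked labels."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("need at least one labeled example")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Hypothetical example: labels from the existing program vs. a manual check.
program_labels = ["technology", "finance", "technology", "healthcare"]
manual_labels = ["technology", "finance", "healthcare", "healthcare"]

print(accuracy(program_labels, manual_labels))  # 3 of 4 correct -> 0.75
```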

To explain my approach in more detail:

1. First, you have to manually label 1000 jobs. I insist on "manually" because these 1000 jobs have to be 100 % correctly labeled (an AI program can't reach 100 % accuracy), and that's why I said "it takes more work".

2. Train several models (kNN, Neural Networks etc.) on this labeled dataset of 1000 jobs.

3. Evaluate the accuracy of these models using the Cross Validation operator. (This accuracy is an estimate of how accurate your models will be on unlabelled data.)

4. Select the best model and apply it to your unlabelled dataset (your remaining 7000 jobs).
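Outside RapidMiner, the four steps above can be sketched in plain Python. This is only an illustration under stated assumptions: the tiny `LABELED` and `UNLABELED` lists are hypothetical stand-ins for the 1000 labeled and 7000 unlabelled jobs, and the two candidate models (a majority-class baseline and a simple naive Bayes over title words) are swapped in for the kNN and neural network models mentioned above so that the example needs only the standard library.

```python
import math
import random
from collections import Counter, defaultdict

# Step 1: hypothetical hand-labeled sample (stand-in for the 1000 jobs).
LABELED = [
    ("software developer", "technology"), ("senior software engineer", "technology"),
    ("web developer", "technology"), ("data engineer", "technology"),
    ("registered nurse", "healthcare"), ("staff nurse", "healthcare"),
    ("pediatric nurse", "healthcare"), ("nurse practitioner", "healthcare"),
    ("financial accountant", "finance"), ("senior accountant", "finance"),
    ("tax accountant", "finance"), ("junior accountant", "finance"),
]
# Hypothetical unlabelled titles (stand-in for the remaining 7000 jobs).
UNLABELED = ["nurse manager", "software architect", "accountant assistant"]


def tokenize(title):
    return title.lower().split()


class MajorityBaseline:
    """Always predicts the most common sector seen in training."""

    def fit(self, titles, labels):
        self.label = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, title):
        return self.label


class NaiveBayes:
    """Multinomial naive Bayes over title words with add-one smoothing."""

    def fit(self, titles, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for title, label in zip(titles, labels):
            self.word_counts[label].update(tokenize(title))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        self.total = len(labels)
        return self

    def predict(self, title):
        best_label, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n / self.total)
            for word in tokenize(title):
                if word in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def cross_val_accuracy(make_model, titles, labels, k=4, seed=0):
    """Step 3: k-fold cross-validation; returns overall held-out accuracy."""
    idx = list(range(len(titles)))
    random.Random(seed).shuffle(idx)
    correct = 0
    for fold in (idx[i::k] for i in range(k)):
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        model = make_model().fit([titles[i] for i in train],
                                 [labels[i] for i in train])
        correct += sum(model.predict(titles[i]) == labels[i] for i in fold)
    return correct / len(titles)


titles = [t for t, _ in LABELED]
labels = [s for _, s in LABELED]

# Steps 2 and 3: train each candidate and estimate its accuracy.
candidates = {"majority baseline": MajorityBaseline, "naive Bayes": NaiveBayes}
scores = {name: cross_val_accuracy(cls, titles, labels)
          for name, cls in candidates.items()}

# Step 4: refit the best candidate on all labeled data and apply it.
best_name = max(scores, key=scores.get)
best_model = candidates[best_name]().fit(titles, labels)
predictions = {t: best_model.predict(t) for t in UNLABELED}

print(scores)
print(best_name, predictions)
```

With only twelve labeled titles the cross-validated scores are very noisy; the point is the shape of the workflow, not the numbers.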

I hope that's clearer.

Regards,

Lionel

1505993 · Answer

I will try training several classification models and compare them to see how accurate the results are.

The only concern I have with this method is the labeling of the 1000 jobs. I will write a function in C# that changes the labels in the database to the sectors, but doesn't that make the classification model redundant? Couldn't I just do that for all of the jobs?
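A rule-based labeling function of this kind might look roughly like the sketch below. It is written in Python rather than C# purely for illustration, and the sector names, keyword lists and example titles are hypothetical placeholders, not taken from the actual database.

```python
# Hypothetical keyword rules; sectors and keywords are illustrative only.
SECTOR_KEYWORDS = {
    "technology": ["software", "developer", "engineer"],
    "healthcare": ["nurse", "doctor"],
    "finance": ["accountant", "analyst"],
}


def label_by_keywords(title):
    """Return the first sector whose keyword appears in the title, else None."""
    words = title.lower().split()
    for sector, keywords in SECTOR_KEYWORDS.items():
        if any(keyword in words for keyword in keywords):
            return sector
    return None  # no rule matched this title


example_titles = ["Senior Software Developer", "Registered Nurse", "Barista"]
rule_labels = {title: label_by_keywords(title) for title in example_titles}
print(rule_labels)
```

Titles that match none of the keywords come back as `None`, so a function like this only labels the jobs its rules anticipate.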

Appreciate the help, just needed some direction.