"my text mining query - short quick teaser for you experts"
harryharriet
New Altair Community Member
Hi all, can anybody help me? I am fairly new to data mining and its techniques and tools. However, I am clear on the concept behind it.
My aim is this:
I have 200+ thousand company records, with details such as organisation name, telephone number, address, etc. (all stored in a spreadsheet file), and I am trying to put each of these companies into a specific business-type category.
I have 3 levels of classification: Top Level > Second Level > Third Level (if necessary). For example:
Consultancy > Management > Planning.
Now, my team and I have already manually done about 30 thousand companies. We did this by using various web page searches and making our predictions based on the information we read from the Google searches and company websites. Now we would like to automatically classify the remaining 100+ thousand companies.
Now, we have already figured out a concept:
a) crawl the web using the company name + address via a Google search, directory websites, etc., and save all the web pages into a document for each of the 30 thousand completed companies.
b) we then use this data to train the tool.
c) we repeat step a) on the remaining companies and then run the classifier on the new documents, using the keywords etc. to classify each company...
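To make steps a)–c) concrete, here is a minimal sketch of the "train on page text, then classify new pages" idea, assuming the crawled pages have already been reduced to plain text. This is plain Python with a tiny Naive Bayes classifier, not RapidMiner; the category names and the toy training texts are invented for illustration, and a real run would use the 30 thousand hand-labelled companies:

```python
import math
from collections import Counter

def tokenize(text):
    # crude tokenizer: lowercase, keep alphabetic words only
    return [w for w in text.lower().split() if w.isalpha()]

class NaiveBayes:
    def __init__(self):
        self.word_counts = {}        # category -> Counter of word frequencies
        self.doc_counts = Counter()  # category -> number of training documents
        self.vocab = set()

    def train(self, text, category):
        # step b): accumulate word statistics per hand-labelled category
        words = tokenize(text)
        self.word_counts.setdefault(category, Counter()).update(words)
        self.doc_counts[category] += 1
        self.vocab.update(words)

    def classify(self, text):
        # step c): score each category by log prior + smoothed word likelihoods
        words = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        best, best_score = None, float("-inf")
        for cat, counts in self.word_counts.items():
            score = math.log(self.doc_counts[cat] / total_docs)
            total_words = sum(counts.values())
            for w in words:
                # Laplace smoothing so unseen words don't zero out the score
                score += math.log((counts[w] + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best, best_score = cat, score
        return best

nb = NaiveBayes()
# toy stand-ins for the crawled page text of already-classified companies
nb.train("management planning consultancy strategy advice", "Consultancy")
nb.train("pipes boiler plumbing heating repair", "Plumbing")
print(nb.classify("strategy and planning advice for managers"))
```

In practice a dedicated tool would replace this with proper tokenization, TF-IDF weighting, and a stronger learner, but the pipeline shape (crawl, label, train, classify) is the same.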
Now, with a little understanding of probability, I am sure that training on the 30 thousand already done will give me good results when testing on those same 30 thousand. But once we try to test it on a new, unseen set of data, the results may be terrible. Am I correct?
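The standard way to get an honest answer to that question is a hold-out check: reserve part of the 30 thousand hand-labelled companies as a test set and score the classifier only on records it never saw during training. A minimal sketch, where `classify()` is a hypothetical stand-in keyword rule rather than a real trained model, and the labelled pairs are invented for illustration:

```python
import random

# pretend labelled data: (page text, hand-assigned top-level category)
labelled = [
    ("management planning strategy", "Consultancy"),
    ("boiler repair heating", "Plumbing"),
    ("strategy advice planning", "Consultancy"),
    ("pipes plumbing heating", "Plumbing"),
] * 10  # repeated to mimic a larger labelled set

def classify(text):
    # hypothetical stand-in for a trained model
    return "Consultancy" if "planning" in text or "strategy" in text else "Plumbing"

random.seed(0)
random.shuffle(labelled)
split = int(len(labelled) * 0.8)
train, test = labelled[:split], labelled[split:]  # a real model is fitted on `train`

# accuracy measured only on the held-out 20%
accuracy = sum(classify(text) == cat for text, cat in test) / len(test)
print(f"held-out accuracy: {accuracy:.2f}")
```

If the held-out accuracy is much lower than the accuracy on the training records, that is exactly the overfitting worry described above.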
I do not mind having just the top level done, as this will be a starting point. Now my problem is: WHERE DO WE GO FROM HERE? Which is the best tool (if any) to do the job: RapidDoc or RapidMiner?
Can anyone help me with the process, or tell me if I am going down the wrong route? Thanks a lot in advance; any help will be useful.
P.S. Sorry if I haven't explained it well. It's all pretty new to me too.
Answers
Hi there,
This looks remarkably similar to your previous post, so my answer remains...
http://rapid-i.com/rapidforum/index.php/topic,2906.msg11533.html#msg115330