I need help building taxonomies from a large number of documents
boatanchorguy
New Altair Community Member
Thank you to Marius for the Read Before Posting instructions. Following his suggestions:
1. Describe what you are doing.
I need to build many taxonomies from a large number of documents.
2. If you are working with data, give a detailed description of your data (number of examples and attributes, attribute types, label type etc.).
After months of extensive searching, I now have several thousand documents to process: mostly PDF, plus some Word, Excel, and PowerPoint files.
3. Describe which results or actions you are expecting.
I need good, clean taxonomies for many different topics. I am hoping to set up a proper method using RapidMiner, but there does not seem to be an obvious pathway for this.
Ideally, for each topic, the method would a) pre-process the documents, filtering on items such as the right word or key phrase in the title or abstract; b) assemble the filtered documents; c) (optionally) extract tables of contents, indices, glossaries, etc.; d) extract and amalgamate the sub-topics appropriate to the particular topic; and e) generate the taxonomy.
I am new to RapidMiner, and relatively new to data mining in general, so please keep it simple for me.
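To make steps a), b), d), and e) a bit more concrete, here is a rough Python sketch of the kind of pipeline I have in mind. It is only an illustration under several assumptions: the text has already been extracted from the PDF and Office files into plain strings, each document is a dictionary with hypothetical `title`, `abstract`, and `body` keys, and simple word counting stands in for real sub-topic extraction. All function names are made up for this example, and it is not a RapidMiner process.

```python
import re
from collections import Counter

# A tiny illustrative stopword list (an assumption; a real one would be much longer).
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "for", "is", "on", "with"}


def matches_topic(doc, phrase):
    """Step a: keep a document if the phrase occurs in its title or abstract."""
    phrase = phrase.lower()
    return phrase in doc["title"].lower() or phrase in doc["abstract"].lower()


def candidate_subtopics(docs, topic, top_n=3):
    """Step d: propose sub-topics as the most frequent non-stopword body words."""
    topic_words = set(topic.lower().split())
    counts = Counter()
    for doc in docs:
        words = re.findall(r"[a-z]+", doc["body"].lower())
        counts.update(
            w for w in words if w not in STOPWORDS and w not in topic_words
        )
    return [word for word, _ in counts.most_common(top_n)]


def build_taxonomy(docs, topics, top_n=3):
    """Steps a, b, d, e: map each topic to a list of candidate sub-topics."""
    taxonomy = {}
    for topic in topics:
        selected = [d for d in docs if matches_topic(d, topic)]  # steps a and b
        taxonomy[topic] = candidate_subtopics(selected, topic, top_n)  # steps d and e
    return taxonomy


# Made-up example documents, standing in for already-extracted text.
example_docs = [
    {"title": "Vacuum tube amplifiers", "abstract": "",
     "body": "triode triode pentode"},
    {"title": "Gardening", "abstract": "notes on soil",
     "body": "soil soil compost"},
]

if __name__ == "__main__":
    print(build_taxonomy(example_docs, ["vacuum tube", "soil"], top_n=2))
```

This gives only a flat, two-level result (topic, then frequent words). A real taxonomy would need phrases rather than single words and some way to nest sub-topics, which is the part I most need help with.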
Please help me understand any and all methods I could use to accomplish this.
And please let me know if I am following the proper procedures for this forum, or how I can improve this post.
Thank you very much.
Sam