"Segmenting millions of text segments with TextSegmenter"
Braulio
New Altair Community Member
Hi there,
I am building a text mining process that should be able to process various XML files with the following properties:
Each XML file contains several thousand blog posts plus some information about each post (author, time, etc.).
My question: Is there a way to process such a file, taking the segments into account, without necessarily dividing it into millions of other files as the TextSegmenter does?
My assumption: It will take ages to process millions of files when mining for knowledge or doing sentiment analysis.
Any help will be greatly appreciated.
Thanks
Braulio
Answers
Hi,
If there is a clear splitting criterion in your original texts, you could load the text as the content of a single attribute and then use the new split operator, which will deliver one attribute per text. Be aware that this can lead to memory problems, since meta data for attributes is rather costly in terms of memory. After that, you could transpose the example set, so that each text becomes one example, and work with the StringTextInput operator.
Cheers,
Ingo
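For readers who want to see the general idea outside of the operator chain, here is a minimal Python sketch of handling each segment in memory instead of writing one file per post. It is not the split/transpose process described above, only an illustration of the same goal. The element names `blog`, `post`, `author`, `time` and `text`, and the helper `iter_posts`, are assumptions made for the example; the real files will have their own structure.

```python
import io
import xml.etree.ElementTree as ET


def iter_posts(source, tag="post"):
    """Yield one dict per <post> element; no intermediate files are written.

    The tag and child names are hypothetical and must be adapted to the real XML.
    """
    # iterparse reads the document incrementally and reports each element
    # once its closing tag has been seen.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            yield {
                "author": elem.findtext("author", default=""),
                "time": elem.findtext("time", default=""),
                "text": elem.findtext("text", default=""),
            }
            # Drop the element's children and text once the values are copied.
            elem.clear()


# Tiny in-memory sample standing in for a large XML file.
sample = io.BytesIO(
    b"<blog>"
    b"<post><author>ann</author><time>2009-01-01</time><text>First post</text></post>"
    b"<post><author>bob</author><time>2009-01-02</time><text>Second post</text></post>"
    b"</blog>"
)

posts = list(iter_posts(sample))
```

Each dict then corresponds to one example (one row) that can be passed on to whatever text processing or sentiment step comes next, and a file path can be given to `iter_posts` in place of the in-memory sample.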