"text input from a single text file using text plugin"
angshu
New Altair Community Member
Hi,
I am new to the text plugin and am trying to do some text clustering using RapidMiner with it. All of my text is in one file, in which each line needs to be treated as a separate document. I tried using SplitSegmenter, but since a new file is created for every line, disk usage blows up, which will hamper scalability.
Can someone suggest a way to cluster the individual lines of the same file so that I don't have to create separate files?
Appreciate your response
Regards
Angshu
Answers
Hi,
This is possible with a little trick: load the file using the CSVExampleSource operator and configure it so that only one column is created from the file. To do so, specify a text that never occurs in the file as the regular expression for the column separators. Then insert a Nominal2String operator to change the value type to string. After this, using StringTextInput, you can transform the texts into word vectors for clustering. To make your life easier, I have appended a sample process:
<operator name="Root" class="Process" expanded="yes">
<operator name="CSVExampleSource" class="CSVExampleSource">
<parameter key="filename" value="C:\Dokumente und Einstellungen\sland\Desktop\test.txt"/>
<parameter key="read_attribute_names" value="false"/>
<parameter key="column_separators" value="This text never occures in the file --- sdhaksj dhaskljdh alkdjsh sa"/>
<parameter key="use_comment_characters" value="false"/>
</operator>
<operator name="Nominal2String" class="Nominal2String">
</operator>
<operator name="StringTextInput" class="StringTextInput" expanded="yes">
<list key="namespaces">
</list>
</operator>
</operator>
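For readers who want to see the idea outside RapidMiner, the "one line equals one document, then build word vectors" step can be sketched in plain Python. This is only an illustration of the concept, not what the operators above do internally; the helper names `lines_to_documents` and `word_vectors`, the simple lowercase tokenizer, and the sample text are all assumptions made for this sketch.

```python
from collections import Counter
import re


def lines_to_documents(text):
    """Treat every non-empty line of the input as one document."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def word_vectors(documents):
    """Return (vocabulary, vectors) holding raw term counts per document."""
    # Assumed tokenizer: lowercase runs of letters and digits.
    tokenized = [re.findall(r"[a-z0-9]+", doc.lower()) for doc in documents]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)  # missing terms count as 0
        vectors.append([counts[term] for term in vocabulary])
    return vocabulary, vectors


if __name__ == "__main__":
    # Hypothetical sample standing in for the contents of the single text file.
    sample = "the cat sat\nthe dog barked\n\ncat and dog\n"
    docs = lines_to_documents(sample)
    vocab, vecs = word_vectors(docs)
    print(vocab)
    for doc, vec in zip(docs, vecs):
        print(vec, doc)
```

The resulting count vectors, one per line, are the kind of input a clustering step would then work on; no intermediate files are created.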
Greetings,
Sebastian
Hi Angshu,
Just to add to what Sebastian said: in the GUI, you can use the following operator flow:
1. ExampleSource: configure your input (tab- or comma-delimited), the format of the input fields (nominal, string, etc.), and the type of each variable (label for the dependent variable, attribute for independent variables, id for keys); then save it in an attribute file.
2. StringTextInput: generates the word vectors. For further info, visit http://kmandcomputing.blogspot.com/2008/06/opinion-mining-with-rapidminer-quick.html
I faced the same problem, and the flow described above helped.
Thanks,
Ram
Thanks Sebastian and Ram, your replies helped a lot.
Best Regards
Angshu