"text mining visualization"
emolano
New Altair Community Member
Hi all,
Help for a new user! I'm doing some text mining and want to visualize the word frequency. How can I do this?
Something like a tag cloud/word cloud would be nice.
This is what I have so far...
<operator name="Root" class="Process" expanded="yes">
    <description text="#ylt#h3#ygt#CRM Data Mining#ylt#/h3#ygt##ylt#p#ygt#.#ylt#/p#ygt#"/>
    <operator name="DatabaseExampleSource" class="DatabaseExampleSource">
        <parameter key="database_url" value="jdbc:mysql://test:3306/test"/>
        <parameter key="username" value="test"/>
        <parameter key="password" value="C2jgjgjh4JiellkjDOm4="/>
        <parameter key="query" value="SELECT `ID_NUM`, `SHORT_DESC`, `PLATFORM` FROM `PROBLEM` WHERE platform is not null;"/>
        <parameter key="label_attribute" value="PLATFORM"/>
        <parameter key="id_attribute" value="ID_NUM"/>
    </operator>
    <operator name="StringTextInput" class="StringTextInput" expanded="yes">
        <parameter key="filter_nominal_attributes" value="true"/>
        <parameter key="remove_original_attributes" value="true"/>
        <parameter key="vector_creation" value="TermOccurrences"/>
        <parameter key="output_word_list" value="C:\Documents and Settings\emolano\My Documents\rm_workspace\output"/>
        <list key="namespaces">
        </list>
        <operator name="StringTokenizer" class="StringTokenizer">
        </operator>
        <operator name="EnglishStopwordFilter" class="EnglishStopwordFilter">
        </operator>
        <operator name="TokenLengthFilter" class="TokenLengthFilter">
            <parameter key="min_chars" value="2"/>
        </operator>
        <operator name="ToLowerCaseConverter" class="ToLowerCaseConverter">
        </operator>
        <operator name="PorterStemmer" class="PorterStemmer">
        </operator>
        <operator name="TermNGramGenerator" class="TermNGramGenerator">
        </operator>
    </operator>
</operator>
... I get the word frequency but do not know how to visualize it...
Thanks
e
Answers
Hi,
I would suggest a parallel plot, at least if you have fewer than a few thousand terms. Alternatively, you could also use the CorpusBasedWeighting for each class and visualize the different weight vectors.
Cheers,
Ingo
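For the tag cloud asked about in the original question, here is a minimal sketch outside RapidMiner. It assumes you can get the terms and their counts out of the word list as word/count pairs; the function name `tag_cloud_html`, the pixel sizes, and the example counts are all made up for illustration. It only scales font size linearly with frequency and writes plain HTML, nothing more.

```python
from collections import Counter
from html import escape


def tag_cloud_html(counts, min_px=12, max_px=48, top_n=50):
    """Return an HTML fragment in which more frequent words get a larger font."""
    top = Counter(counts).most_common(top_n)
    if not top:
        return "<div></div>"
    hi = top[0][1]   # highest count among the kept terms
    lo = top[-1][1]  # lowest count among the kept terms
    span = (hi - lo) or 1  # avoid division by zero when all counts are equal
    parts = []
    for word, n in sorted(top):  # alphabetical order, as in a classic tag cloud
        px = round(min_px + (n - lo) / span * (max_px - min_px))
        parts.append(
            f'<span style="font-size:{px}px" title="{n}">{escape(word)}</span>'
        )
    return "<div>" + " ".join(parts) + "</div>"


# Hypothetical terms and counts, standing in for the exported word list.
example_counts = {"printer": 40, "login": 25, "crash": 10, "network": 10}
html = tag_cloud_html(example_counts)
print(html)
```

Saving the returned string to an .html file and opening it in a browser gives a rough cloud; a log scale may look better if a few terms dominate.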
Hi Ingo,
You said "the CorpusBasedWeighting for each class". How can I define such a class? In my case, the weighting values are 0 or 1, which seems to deliver no usable results.
I have two further related questions:
1) In my setting, I am loading some txt files and get a list of words with values like "avg = 0.029 +/- 0.167". I don't understand exactly what this means. Can I group the words using this information, depending on their occurrence in the source files?
2) Most important, I would like to separate my txt files into groups and visualize their analyses to compare them. For a tiny example, one group could be female texts and one group male texts, and I would like to compare the usage of words or combinations of words (like: these are typical female phrases: ...). Is there a possibility to tell RapidMiner which text belongs to which group and to consider this information?
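On question 1, one plausible reading (an assumption, not confirmed anywhere in this thread) is that "avg = 0.029 +/- 0.167" is the mean and standard deviation of that term's value over all documents. With 0/1 values, a term that appears in roughly 1 out of 35 documents would produce exactly these numbers. The 35-document corpus below is hypothetical, and so is the choice of the population standard deviation.

```python
from statistics import mean, pstdev

# Hypothetical corpus: one term that occurs in exactly 1 of 35 documents,
# encoded as 0/1 per document.
occurrences = [1] + [0] * 34

avg = mean(occurrences)
std = pstdev(occurrences)  # population standard deviation (an assumption)
summary = f"avg = {avg:.3f} +/- {std:.3f}"
print(summary)  # avg = 0.029 +/- 0.167
```

Read this way, a low average simply means the term is rare across the source files, so sorting or binning the words by the average would group them by how many files they occur in.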
Cheers,
Chris
Setting:
<operator name="Root" class="Process" expanded="yes">
    <operator name="TextInput" class="TextInput" expanded="yes">
        <list key="texts">
            <parameter key="NoteLens" value="C:\Dokumente und Einstellungen\cniemann\Eigene Dateien\NoteLens Documents\store"/>
        </list>
        <parameter key="default_content_type" value="txt"/>
        <parameter key="default_content_encoding" value="UTF-8"/>
        <parameter key="default_content_language" value="german"/>
        <parameter key="vector_creation" value="TermOccurrences"/>
        <parameter key="id_attribute_type" value="short"/>
        <list key="namespaces">
        </list>
        <parameter key="create_text_visualizer" value="true"/>
        <operator name="StringTokenizer" class="StringTokenizer">
        </operator>
        <operator name="GermanStopwordFilter" class="GermanStopwordFilter">
        </operator>
        <operator name="EnglishStopwordFilter" class="EnglishStopwordFilter">
        </operator>
        <operator name="TokenLengthFilter" class="TokenLengthFilter">
        </operator>
    </operator>
    <operator name="CorpusBasedWeighting" class="CorpusBasedWeighting">
        <parameter key="normalize_weights" value="false"/>
        <parameter key="class_to_characterize" value="3"/>
    </operator>
</operator>
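To make the group comparison in question 2 concrete, here is a rough sketch of the idea outside RapidMiner. It is not what CorpusBasedWeighting computes; it is a plain add-one-smoothed frequency ratio between one group and all the others. The function name `typical_terms`, the group labels, and the tiny token lists are hypothetical, and the texts are assumed to be tokenized already.

```python
from collections import Counter


def typical_terms(docs_by_group, group, top_n=5):
    """Rank terms by how much more often they occur in `group` than elsewhere."""
    inside = Counter()   # term counts within the chosen group
    outside = Counter()  # term counts in all other groups
    for name, docs in docs_by_group.items():
        target = inside if name == group else outside
        for tokens in docs:
            target.update(tokens)
    n_in = sum(inside.values())
    n_out = sum(outside.values())
    vocab = len(set(inside) | set(outside))
    scores = {}
    for term in inside:
        # Add-one smoothing keeps terms unseen in the other groups finite.
        p_in = (inside[term] + 1) / (n_in + vocab)
        p_out = (outside[term] + 1) / (n_out + vocab)
        scores[term] = p_in / p_out
    # Highest ratio first; ties broken alphabetically.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_n]


# Hypothetical, already tokenized documents in two labeled groups.
docs_by_group = {
    "group_a": [["lovely", "day", "today"], ["lovely", "weather"]],
    "group_b": [["match", "today"], ["match", "result", "day"]],
}
top_a = typical_terms(docs_by_group, "group_a", top_n=2)
print(top_a)  # ['lovely', 'weather']
```

The point is only that each text needs a group label before any per-group weighting or plot can say which terms are typical for which group.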