keyword-based text mining
Hello there,
I have a list of 50 keywords and want to analyze their occurrence frequency in my dataset.
The general text mining process is not the problem, but I only want to analyze these 50 keywords.
How can I restrict the analysis to them?
Thank you very much in advance!
Best Answer
You just create a wordlist with those 50 words and then apply that specific wordlist (using the wordlist input port) for any subsequent document you are going to process.
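To make the idea concrete outside RapidMiner, here is a minimal Python sketch of what applying a fixed wordlist amounts to: tokens are counted only if they appear in the keyword list, and everything else is ignored. The function name `keyword_frequencies` and the simple letters-only tokenization are assumptions made for illustration; they are not part of RapidMiner, and the sketch only handles single-word keywords.

```python
import re


def keyword_frequencies(documents, keywords):
    """Count how often each keyword occurs across the documents.

    Hypothetical helper for illustration: tokens that are not in the
    keyword list are ignored, which mirrors applying a fixed wordlist.
    """
    # Start every keyword at zero so absent keywords still show up.
    counts = {k.lower(): 0 for k in keywords}
    for text in documents:
        # Assumed tokenization: lowercase, split on anything that is not a-z.
        for token in re.findall(r"[a-z]+", text.lower()):
            if token in counts:
                counts[token] += 1
    return counts


keywords = ["apples", "oranges", "bananas"]
documents = ["apples are sweeter than oranges but bananas are the sweetest of them all"]
freq = keyword_frequencies(documents, keywords)
print(freq)
```

Keywords that never occur stay in the result with a count of 0, which is usually what you want when reporting on a fixed list of 50 terms.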
Answers
Hi @seba77,
if I understand correctly, you can use the Create ExampleSet operator to enter your list of 50 keywords, and then use the Process Documents
from Data and Process Documents operators to filter all other words out of your document.
Here is an example process to adapt to your keywords and document:
<?xml version="1.0" encoding="UTF-8"?><process version="8.0.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="8.0.001" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="operator_toolbox:create_exampleset_from_doc" compatibility="0.7.000" expanded="true" height="68" name="Create Exampleset" width="90" x="45" y="34">
<parameter key="Input Csv" value="att1 apples oranges bananas"/>
<parameter key="Parse all as Nominal" value="true"/>
</operator>
<operator activated="true" class="text:process_document_from_data" compatibility="7.5.000" expanded="true" height="82" name="Process Documents from Data" width="90" x="179" y="34">
<list key="specify_weights"/>
<process expanded="true">
<operator activated="true" class="text:tokenize" compatibility="7.5.000" expanded="true" height="68" name="Tokenize (2)" width="90" x="447" y="34"/>
<connect from_port="document" to_op="Tokenize (2)" to_port="document"/>
<connect from_op="Tokenize (2)" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="text:create_document" compatibility="7.5.000" expanded="true" height="68" name="Create Document" width="90" x="447" y="187">
<parameter key="text" value="apples are sweeter than oranges but bananas are the sweetest of them all"/>
</operator>
<operator activated="true" class="text:process_documents" compatibility="7.5.000" expanded="true" height="103" name="Process Documents" width="90" x="581" y="85">
<process expanded="true">
<connect from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<connect from_op="Create Exampleset" from_port="output" to_op="Process Documents from Data" to_port="example set"/>
<connect from_op="Process Documents from Data" from_port="word list" to_op="Process Documents" to_port="word list"/>
<connect from_op="Create Document" from_port="output" to_op="Process Documents" to_port="documents 1"/>
<connect from_op="Process Documents" from_port="example set" to_port="result 2"/>
<connect from_op="Process Documents" from_port="word list" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
</process>
</operator>
</process>
Does this example process meet your needs?
Regards,
Lionel