Hello.
I'm working on a task I couldn't find any help for on the web. What I'm trying to do is extract the most frequent words from a document collection and associate each of them with all the sentences and all the documents they appear in. I would like the result to be a tree structure, e.g. an Excel file containing all the information, with an output like this:
Token1
    Sentence1 (hyperlink) ---- Doc \
    Sentence2 ---------------- Doc  | (ordered by the token's highest TF-IDF value)
    Sentence3 ---------------- Doc /
Token2
    Sentence1 (hyperlink) ---- Doc
    Sentence2
    ...
etc., you get the idea.
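To make the goal a bit more concrete, here is a rough Python sketch of the structure I have in mind (toy data and naive sentence/word splitting, nothing RapidMiner-specific; it just illustrates the token -> sentence -> document tree and the TF-IDF ordering):

import math
import re
from collections import Counter, defaultdict

# made-up stand-ins for my 16 documents
docs = {
    "doc1.txt": "Stirling has a castle. The castle is old.",
    "doc2.txt": "Stirling lies in Scotland.",
}

occurrences = defaultdict(list)            # token -> [(sentence, document), ...]
tf = {name: Counter() for name in docs}    # per-document term frequencies

for name, text in docs.items():
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        for token in re.findall(r"\w+", sentence.lower()):
            occurrences[token].append((sentence, name))
            tf[name][token] += 1

n_docs = len(docs)
df = {t: len({d for _, d in pairs}) for t, pairs in occurrences.items()}

def best_tfidf(token):
    # the token's highest TF-IDF value across all documents
    return max(tf[d][token] * math.log(n_docs / df[token])
               for d in docs if tf[d][token])

# print the tree: token, then every sentence/document it occurs in,
# tokens ordered by their best TF-IDF value
for token in sorted(occurrences, key=best_tfidf, reverse=True):
    print(token)
    for sentence, doc in occurrences[token]:
        print("    " + sentence + "  --  " + doc)

(With this toy data, "stirling" sorts last: it occurs in every document, so its IDF is zero, while rarer tokens come first.)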
To accomplish this, I try to add the sentences and the documents as meta information to the tokens (SentenceTokenizer and WordTokenizer operators) and to read this meta information into Excel (Write Excel operator). The result is rather redundant: it contains 13245 examples from only 16 documents, so I think scaling this process up is going to be quite hard. I also wonder whether it is possible to add meta information on different "levels", specifically to add the document as meta information to the sentences it contains and then add this "package" to the tokens as meta information?
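What I mean by "levels" is roughly this kind of nesting (again only a toy Python sketch with made-up data): the document is attached to its sentences first, and the tokens then only hold references into that package instead of copies of the text:

import re
from collections import defaultdict

docs = {
    "doc1.txt": "Stirling has a castle. The castle is old.",
    "doc2.txt": "Stirling lies in Scotland.",
}

# Level 1: each sentence is stored under its document.
# Level 2: each sentence carries its list of tokens.
nested = {
    name: {s: re.findall(r"\w+", s.lower())
           for s in re.split(r"(?<=[.!?])\s+", text)}
    for name, text in docs.items()
}

# A token then only needs references (document name, sentence number),
# so no sentence or document text is ever written out twice.
index = defaultdict(set)
for doc, sentences in nested.items():
    for sent_no, sentence in enumerate(sentences):
        for token in nested[doc][sentence]:
            index[token].add((doc, sent_no))

print(dict(index))

That way every sentence and document exists exactly once, and the token table cannot blow up to one full-text row per occurrence the way my current export does.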
I am not very familiar with data structures or RapidMiner, but I hope this is possible. Here's my process so far:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.000">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.3.000" expanded="true" name="Process">
    <process expanded="true" height="530" width="748">
      <operator activated="true" class="text:process_document_from_file" compatibility="5.3.000" expanded="true" height="76" name="Process Documents from Files" width="90" x="45" y="75">
        <list key="text_directories">
          <parameter key="stirling" value="C:\Users\Marc\Desktop\Data\Stirling"/>
        </list>
        <parameter key="keep_text" value="true"/>
        <process expanded="true">
          <connect from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="information_extraction:sentence_tokenizer" compatibility="1.0.000" expanded="true" height="76" name="SentenceTokenizer" width="90" x="179" y="165">
        <parameter key="optionalAttribute" value="text"/>
        <parameter key="new token-name" value="Sentences"/>
      </operator>
      <operator activated="true" class="information_extraction:word_tokenizer" compatibility="1.0.000" expanded="true" height="76" name="WordTokenizer" width="90" x="313" y="255">
        <parameter key="optionalAttribute" value="Sentences"/>
        <parameter key="new token-name" value="Words"/>
      </operator>
      <operator activated="true" class="write_excel" compatibility="5.3.000" expanded="true" height="76" name="Write Excel" width="90" x="581" y="300">
        <parameter key="excel_file" value="C:\Users\Marc\Desktop\Data\Excel_Result\result.xls"/>
      </operator>
      <connect from_op="Process Documents from Files" from_port="example set" to_op="SentenceTokenizer" to_port="example set input"/>
      <connect from_op="Process Documents from Files" from_port="word list" to_port="result 1"/>
      <connect from_op="SentenceTokenizer" from_port="example set output" to_op="WordTokenizer" to_port="example set input"/>
      <connect from_op="WordTokenizer" from_port="example set output" to_op="Write Excel" to_port="input"/>
      <connect from_op="WordTokenizer" from_port="original example set output" to_port="result 2"/>
      <connect from_op="Write Excel" from_port="through" to_port="result 3"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
      <portSpacing port="sink_result 4" spacing="0"/>
    </process>
  </operator>
</process>
Thank you in advance.