"[SOLVED] Empty Word List"
beedaan
New Altair Community Member
Hi All,
I am counting word occurrences in a text document that contains the titles and abstracts of other documents. The file has the following general format:
<document name>
<abstract>
<white space>
...
This continues for roughly 36,00 documents. The total size of the document is 46MB. I am expecting to get a word list of word occurrences as a result. What I actually get is an empty word list. Here is my attached process:
I used this YouTube video as a guide: https://www.youtube.com/watch?feature=endscreen&NR=1&v=EjD2M4r4mBM
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.2.008">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.2.008" expanded="true" name="Process">
<process expanded="true" height="641" width="1024">
<operator activated="true" class="text:read_document" compatibility="5.2.004" expanded="true" height="60" name="Read Document" width="90" x="179" y="75">
<parameter key="file" value="C:\Users\Administrator\Desktop\DTIC_RDF\sample.xml"/>
</operator>
<operator activated="true" class="text:process_documents" compatibility="5.2.004" expanded="true" height="94" name="Process Documents" width="90" x="447" y="75">
<parameter key="create_word_vector" value="false"/>
<parameter key="add_meta_information" value="false"/>
<parameter key="keep_text" value="true"/>
<parameter key="prune_method" value="absolute"/>
<parameter key="prune_below_absolute" value="2"/>
<parameter key="prune_above_absolute" value="9999"/>
<process expanded="true" height="645" width="1024">
<operator activated="true" class="text:tokenize" compatibility="5.2.004" expanded="true" height="60" name="Tokenize" width="90" x="125" y="28"/>
<operator activated="true" class="text:transform_cases" compatibility="5.2.004" expanded="true" height="60" name="Transform Cases" width="90" x="313" y="75"/>
<connect from_port="document" to_op="Tokenize" to_port="document"/>
<connect from_op="Tokenize" from_port="document" to_op="Transform Cases" to_port="document"/>
<connect from_op="Transform Cases" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<connect from_op="Read Document" from_port="output" to_op="Process Documents" to_port="documents 1"/>
<connect from_op="Process Documents" from_port="word list" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
Please let me know what I am doing wrong. Thanks.
Answers
-
Heya,
It might help to check the option "create word vector" in the Process Documents operator.
Additionally, you are reading only one document, but your pruning settings are configured to ignore words that appear in fewer than two documents. So for testing I suggest disabling pruning.
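To illustrate why absolute pruning can leave nothing behind when only one document is read, here is a minimal Python sketch. It is not RapidMiner's implementation: the function names `document_frequencies` and `prune_absolute` are hypothetical, whitespace splitting stands in for the Tokenize operator, and treating both bounds as inclusive is an assumption.

```python
from collections import Counter


def document_frequencies(documents):
    """Count, for each token, how many documents contain it at least once."""
    df = Counter()
    for text in documents:
        # set() so a token repeated within one document is counted only once
        df.update(set(text.lower().split()))
    return df


def prune_absolute(df, below, above):
    """Keep tokens whose document frequency lies in [below, above] (assumed inclusive)."""
    return {token: n for token, n in df.items() if below <= n <= above}


# One document only, as with a single Read Document operator:
# every token has a document frequency of 1, so nothing reaches the threshold of 2.
single = ["empty word list empty word list"]
print(prune_absolute(document_frequencies(single), below=2, above=9999))

# Two documents: only the token shared by both survives.
pair = ["empty word list", "another word vector"]
print(prune_absolute(document_frequencies(pair), below=2, above=9999))
```

The first call prints an empty dictionary and the second keeps only `word`, which matches the behaviour described above: with a single input document, a lower bound of 2 removes every token.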
Happy mining,
Marius
-
Thanks for the help. This worked for me. I have a question, though: I got it to work first by creating a word vector, and then again by not creating a word vector. In both cases my results still contained a word list. What does "create word vector" actually do?
-
If disabled, it should prevent the creation of the word vector. However, I have never disabled the option, because I see no reason why I would not create a word list.
After changing options, it is generally a good idea to hit "enter" or click somewhere on the process pane to make sure that the changes are actually submitted. Maybe the options were not applied when you hit the run button (yes, this needs improvement :-\ )
Best, Marius
-
Thanks for the response. I'm tinkering around with some of the text association features. I am having issues with the program crashing. I can tell you what I am doing to get these crashes if you are interested.
-
Of course we are interested in that, but please open a new thread for it. If you get a dialog with "Submit Bug", you can also just click that button and describe everything in the dialog that pops up. That way the bug is submitted directly into our bug tracking system and won't get lost in the depths of the forum. Additionally, the bug report will contain some valuable information about the program state at the moment of the crash, which will greatly help us to fix it.
-
Great! Thanks for the reply.