"Problem with filtering the text"
kersor
New Altair Community Member
Hi everyone,
I want to filter a text document and remove the stopwords. I set up the process as Read Document, then Tokenize, then Filter Stopwords, and then Write Document, but the result is the same: the stopwords were not removed. The XML is below. No problems or warnings are reported; the file in the result folder is simply identical to the text I put into Read Document.
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.1.006">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.1.006" expanded="true" name="Process">
<parameter key="parallelize_main_process" value="true"/>
<process expanded="true" height="341" width="681">
<operator activated="true" class="text:read_document" compatibility="5.1.001" expanded="true" height="60" name="Read Document" width="90" x="36" y="86">
<parameter key="file" value="C:\Users\Αlkis_!!\Desktop\negative.txt"/>
</operator>
<operator activated="true" class="text:tokenize" compatibility="5.1.001" expanded="true" height="60" name="Tokenize" width="90" x="176" y="88"/>
<operator activated="true" class="text:filter_stopwords_dictionary" compatibility="5.1.001" expanded="true" height="60" name="Filter Stopwords (Dictionary)" width="90" x="317" y="79">
<parameter key="file" value="C:\Users\Αlkis_!!\Desktop\stopwords_greek.txt"/>
</operator>
<operator activated="true" class="text:write_document" compatibility="5.1.001" expanded="true" height="60" name="Write Document" width="90" x="447" y="75">
<parameter key="file" value="C:\Users\Αlkis_!!\Desktop\result\ρεσσσσσσσσσσσσσσσσσ"/>
</operator>
<connect from_op="Read Document" from_port="output" to_op="Tokenize" to_port="document"/>
<connect from_op="Tokenize" from_port="document" to_op="Filter Stopwords (Dictionary)" to_port="document"/>
<connect from_op="Filter Stopwords (Dictionary)" from_port="document" to_op="Write Document" to_port="document"/>
<connect from_op="Write Document" from_port="document" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
Can anyone help me? I am using the dictionary-based stopword filter because my text uses Greek characters.
Answers
-
Hi,
Edit: Please note the workaround for the rather unpleasant operator behaviour posted by colo which is probably the solution for your problem.
Original message:
Make sure to select an appropriate encoding for your Greek symbols in your process/operators.
Apart from that, it's hard to tell what's wrong, as the process works fine for me (on my own data files, of course).
Did you step through the process (selecting an operator and pressing F7 creates a breakpoint after the operator) to see where the process stops doing what you want it to do?
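To illustrate the encoding point above: the following is a minimal Python sketch, not RapidMiner code, and the stopwords and sample sentence are made up for the example. It assumes the stopword file was saved as UTF-8 and shows that decoding it with a different encoding (Latin-1 here, standing in for any mismatched default) produces entries that never match the Greek tokens, so nothing is filtered and no error is raised.

```python
# Hypothetical data: bytes of a stopword file saved as UTF-8, and a sample text.
STOPWORD_FILE_BYTES = "και\nτο\nη\n".encode("utf-8")
TEXT = "η ταινία και το τέλος"


def load_stopwords(raw, encoding):
    """Decode the raw file bytes and return the non-empty lines as a set."""
    return {line.strip() for line in raw.decode(encoding).splitlines() if line.strip()}


def filter_stopwords(tokens, stopwords):
    """Keep only the tokens that are not in the stopword set."""
    return [token for token in tokens if token not in stopwords]


tokens = TEXT.split()  # naive whitespace tokenization, enough for the demo

# Wrong encoding: the decoded entries are mojibake and match no token.
wrong_result = filter_stopwords(tokens, load_stopwords(STOPWORD_FILE_BYTES, "latin-1"))

# Matching encoding: the stopwords are recognized and removed.
right_result = filter_stopwords(tokens, load_stopwords(STOPWORD_FILE_BYTES, "utf-8"))

print(wrong_result)
print(right_result)
```

With a mismatched encoding the output is identical to the input, which looks exactly like "the filter did nothing", so it is worth ruling this out first.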
Regards,
Marco
-
Hi,
This is a problem with the Document data type that I ran into earlier (I mentioned it here: http://rapid-i.com/rapidforum/index.php/topic,2126.0.html).
You can modify whatever you want; in the end the original document content is used (by "Write Document", for example, or by operators like "Extract Information"). Intended behavior or not, this made the data type mostly unusable for me, so I always convert to example sets and work on the columns instead of documents. Documents still offer more possibilities and operators for text mining tasks, though, such as handling multiple matches from regular expressions or XPath ("Cut Document"), stopword filters, etc.
The only way I found to use the modified document content is to encapsulate the operators in one of the "Process Documents" operators with the option "keep text" enabled. This results in an example set, which has to be transformed again before it can be written to a file as a document (Extract Macro, Create Document, Write Document, for example). BUT the "funny" thing is this: if you place your operator chain inside a "Process Documents" operator, suddenly the modified output is used by "Write Document".
I would prefer the document type to be more flexible to use. I expected exactly the same behavior as you did, got confused, and still don't know why it works this way. Why should I modify documents if the output is always just the original content? Why is the modified content used only inside "Process Documents" and not everywhere?
It should work if you modify your example this way:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.1.008">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.1.008" expanded="true" name="Process">
<parameter key="parallelize_main_process" value="true"/>
<process expanded="true" height="359" width="681">
<operator activated="true" class="text:read_document" compatibility="5.1.001" expanded="true" height="60" name="Read Document" width="90" x="45" y="30">
<parameter key="file" value="C:\Users\Αlkis_!!\Desktop\negative.txt"/>
</operator>
<operator activated="true" class="text:process_documents" compatibility="5.1.001" expanded="true" height="94" name="Process Documents" width="90" x="179" y="30">
<parameter key="create_word_vector" value="false"/>
<parameter key="keep_text" value="true"/>
<process expanded="true" height="607" width="773">
<operator activated="true" class="text:tokenize" compatibility="5.1.001" expanded="true" height="60" name="Tokenize" width="90" x="45" y="30"/>
<operator activated="true" class="text:filter_stopwords_dictionary" compatibility="5.1.001" expanded="true" height="60" name="Filter Stopwords (Dictionary)" width="90" x="179" y="30">
<parameter key="file" value="C:\Users\Αlkis_!!\Desktop\stopwords_greek.txt"/>
</operator>
<operator activated="true" class="text:write_document" compatibility="5.1.001" expanded="true" height="60" name="Write Document" width="90" x="313" y="30">
<parameter key="file" value="C:\Users\Αlkis_!!\Desktop\result\ρεσσσσσσσσσσσσσσσσσ"/>
</operator>
<connect from_port="document" to_op="Tokenize" to_port="document"/>
<connect from_op="Tokenize" from_port="document" to_op="Filter Stopwords (Dictionary)" to_port="document"/>
<connect from_op="Filter Stopwords (Dictionary)" from_port="document" to_op="Write Document" to_port="document"/>
<connect from_op="Write Document" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<connect from_op="Read Document" from_port="output" to_op="Process Documents" to_port="documents 1"/>
<connect from_op="Process Documents" from_port="example set" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
Best regards
Matthias
-
Hi,
Oh, I stumbled upon this problem a while ago in a different context, where I had to use Documents to create a new web plugin operator, but I did not know that it affects other operators that use Documents.
I will bring this up as soon as possible.
Regards,
Marco
-
Thanks for the replies, I hope it will work.