Problem with the Filter Stopwords (Dictionary) operator
Mohamad1367
New Altair Community Member
Hi. I am working on a sentiment analysis project in Persian and installed the Rosette extension for text preprocessing in this language, such as tokenization.
I have a problem with the Filter Stopwords (Dictionary) operator: when I apply it to my data set (after tokenization), I receive only the tokenized data set, with no stop words filtered out. What is the cause of this problem?
Answers
-
For this to work, you need to supply a dictionary file to the second input port of the "Filter Stopwords (Dictionary)" operator. The operator screens out words that appear in the dictionary file. Since you are not supplying a dictionary file, it is not filtering anything.
1 -
Thank you for your answer, @Telcontar120.
Could you share an example process with me so that I can understand it better? Thanks very much.
0 -
@Telcontar120 I connected the Open File operator to the file input of the Filter Stopwords (Dictionary) operator and attached the stop word dictionary to it, but it didn't work. Is this what you meant in your previous comment?
0 -
Sorry, I don't read Persian, so I am not able to make much of the data files. But yes, you should be able to do this with the Open File operator. You can also specify the file directly in the parameters of the Filter Stopwords (Dictionary) operator, where there is a field for the path to the file you want to use.
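As a rough sketch of the second option (setting the dictionary path on the operator itself instead of wiring in Open File), the operator element might look like the fragment below. The file path is hypothetical, and the `UTF-8` encoding is an assumption that may help when the stopword list contains Persian text; the dictionary is expected to be a plain text file with one stopword per line.

```xml
<operator activated="true" class="text:filter_stopwords_dictionary" compatibility="9.3.001" expanded="true" height="82" name="Filter Stopwords (Dictionary)" width="90" x="380" y="34">
  <!-- Hypothetical path: replace with the location of your own stopword list -->
  <parameter key="file" value="C:\path\to\persian_stopwords.txt"/>
  <parameter key="case_sensitive" value="false"/>
  <!-- Assumption: the list is saved as UTF-8; SYSTEM uses the machine default, which may misread Persian text -->
  <parameter key="encoding" value="UTF-8"/>
</operator>
```

With the path set this way, the operator's file input port can be left unconnected; only the document port from Tokenize needs to be wired in.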
A simplified process that works is attached, you would just need to swap your file paths and names.
<?xml version="1.0" encoding="UTF-8"?>
<process version="9.6.000">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="9.6.000" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
      <operator activated="true" class="text:read_document" compatibility="9.3.001" expanded="true" height="68" name="Read Document" width="90" x="45" y="85">
        <parameter key="file" value="C:\Users\brian\Google Drive\RapidMiner\Training Text Mining\SourceData\Room Service Reviews\food_swissotel_chicago.2.gold.txt"/>
        <parameter key="extract_text_only" value="true"/>
        <parameter key="use_file_extension_as_type" value="true"/>
        <parameter key="content_type" value="txt"/>
        <parameter key="encoding" value="SYSTEM"/>
      </operator>
      <operator activated="true" class="text:process_documents" compatibility="9.3.001" expanded="true" height="103" name="Process Documents" width="90" x="246" y="85">
        <parameter key="create_word_vector" value="true"/>
        <parameter key="vector_creation" value="TF-IDF"/>
        <parameter key="add_meta_information" value="true"/>
        <parameter key="keep_text" value="false"/>
        <parameter key="prune_method" value="none"/>
        <parameter key="prune_below_percent" value="3.0"/>
        <parameter key="prune_above_percent" value="30.0"/>
        <parameter key="prune_below_rank" value="0.05"/>
        <parameter key="prune_above_rank" value="0.95"/>
        <parameter key="datamanagement" value="double_sparse_array"/>
        <parameter key="data_management" value="auto"/>
        <process expanded="true">
          <operator activated="true" class="open_file" compatibility="9.6.000" expanded="true" height="68" name="Open File" width="90" x="112" y="136">
            <parameter key="resource_type" value="file"/>
            <parameter key="filename" value="C:\Users\brian\Downloads\stopwords.txt"/>
          </operator>
          <operator activated="true" class="text:tokenize" compatibility="9.3.001" expanded="true" height="68" name="Tokenize" width="90" x="112" y="34">
            <parameter key="mode" value="non letters"/>
            <parameter key="characters" value=".:"/>
            <parameter key="language" value="English"/>
            <parameter key="max_token_length" value="3"/>
          </operator>
          <operator activated="true" class="text:filter_stopwords_dictionary" compatibility="9.3.001" expanded="true" height="82" name="Filter Stopwords (Dictionary)" width="90" x="380" y="34">
            <parameter key="case_sensitive" value="false"/>
            <parameter key="encoding" value="SYSTEM"/>
          </operator>
          <connect from_port="document" to_op="Tokenize" to_port="document"/>
          <connect from_op="Open File" from_port="file" to_op="Filter Stopwords (Dictionary)" to_port="file"/>
          <connect from_op="Tokenize" from_port="document" to_op="Filter Stopwords (Dictionary)" to_port="document"/>
          <connect from_op="Filter Stopwords (Dictionary)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Read Document" from_port="output" to_op="Process Documents" to_port="documents 1"/>
      <connect from_op="Process Documents" from_port="example set" to_port="result 1"/>
      <connect from_op="Process Documents" from_port="word list" to_port="result 2"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
  </operator>
</process>
This should be enough to get you started. You can of course do more with the processing of the documents if you desire.