I have a problem removing URLs and hashtags in the data (from Excel)

fangirl96
New Altair Community Member
I'm having a problem removing URLs and hashtags from my data (from Excel). I read in the data (tweets) using three Read Excel operators and append them. I then connected the Append operator to Replace, entering a regex for URLs and hashtags in the "regular expression" and "replace what" parameters. After that, I connected it to Data to Documents and then Process Documents, which contains Transform Cases, Tokenize, and Filter Stopwords (Dictionary), in that order. The results were tokenized and the stopwords I listed were removed. But for hashtags, only the # symbol is removed: for example, the original text #vscocam comes out as vscocam. The URLs are not removed at all; they are just tokenized too.
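To illustrate what I'm trying to do, here is the same cleanup sketched in Python. The patterns below are only an illustration of the intended behavior (strip whole URLs and whole hashtags), not the exact regex I entered in the Replace operator:

```python
import re

# Illustrative patterns (assumptions, not the exact ones from my process):
# match full http/https URLs, and whole hashtags including the # symbol.
url_pattern = r"https?://\S+"
hashtag_pattern = r"#\w+"

tweet = "Loving the view #vscocam https://t.co/abc123"
cleaned = re.sub(url_pattern, " ", tweet)       # drop the whole link
cleaned = re.sub(hashtag_pattern, " ", cleaned)  # drop the whole hashtag
cleaned = re.sub(r"\s+", " ", cleaned).strip()   # collapse leftover whitespace
print(cleaned)  # -> "Loving the view"
```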
Answers
Hello @fangirl96 - welcome to the community. I think I understand, and I believe you just need to adjust your regex. Can you give some examples and the process you're using? (See the "Read Before Posting" instructions on the right.)
Scott
This is the full XML of my process:
<?xml version="1.0" encoding="UTF-8"?><process version="7.5.003">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="7.5.003" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="read_excel" compatibility="7.5.003" expanded="true" height="68" name="Read Excel" width="90" x="45" y="34">
<parameter key="excel_file" value="C:\Users\ace\Desktop\Airasia1 total.xlsx"/>
<parameter key="imported_cell_range" value="A1:A14"/>
<parameter key="first_row_as_names" value="false"/>
<list key="annotations">
<parameter key="0" value="Name"/>
</list>
<list key="data_set_meta_data_information">
<parameter key="0" value="Text.true.text.attribute"/>
</list>
</operator>
<operator activated="true" class="read_excel" compatibility="7.5.003" expanded="true" height="68" name="Read Excel (3)" width="90" x="45" y="136">
<parameter key="excel_file" value="C:\Users\ace\Dropbox\Thesis V3.0\Thesis 2 - data gathering (testing 3) with additional\Negative\neg_airasia.xlsx"/>
<parameter key="imported_cell_range" value="A1:A184"/>
<parameter key="first_row_as_names" value="false"/>
<list key="annotations">
<parameter key="0" value="Name"/>
</list>
<list key="data_set_meta_data_information">
<parameter key="0" value="Text.true.text.attribute"/>
</list>
</operator>
<operator activated="true" class="read_excel" compatibility="7.5.003" expanded="true" height="68" name="Read Excel (4)" width="90" x="45" y="238">
<parameter key="excel_file" value="C:\Users\ace\Dropbox\Thesis V3.0\Thesis 2 - data gathering (testing 3) with additional\Negative\neg_cebupac.xlsx"/>
<parameter key="imported_cell_range" value="A1:A53"/>
<parameter key="first_row_as_names" value="false"/>
<list key="annotations">
<parameter key="0" value="Name"/>
</list>
<list key="data_set_meta_data_information">
<parameter key="0" value="Text.true.text.attribute"/>
</list>
</operator>
<operator activated="true" class="append" compatibility="7.5.003" expanded="true" height="124" name="Append" width="90" x="179" y="136"/>
<operator activated="true" class="text:data_to_documents" compatibility="7.5.000" expanded="true" height="68" name="Data to Documents" width="90" x="313" y="34">
<list key="specify_weights"/>
</operator>
<operator activated="true" class="text:process_documents" compatibility="7.5.000" expanded="true" height="103" name="Process Documents" width="90" x="447" y="34">
<process expanded="true">
<operator activated="true" breakpoints="before,after" class="text:replace_tokens" compatibility="7.5.000" expanded="true" height="68" name="Replace Tokens" width="90" x="112" y="34">
<list key="replace_dictionary">
<parameter key="@[a-zA-Z]*" value=" "/>
<parameter key="#[a-zA-Z0-9]*" value=" "/>
</list>
</operator>
<operator activated="true" class="text:transform_cases" compatibility="7.5.000" expanded="true" height="68" name="Transform Cases" width="90" x="112" y="136"/>
<operator activated="true" class="text:tokenize" compatibility="7.5.000" expanded="true" height="68" name="Tokenize" width="90" x="112" y="238">
<parameter key="expression" value="\[\d*\][^\[\]]*"/>
</operator>
<operator activated="true" class="text:stem_porter" compatibility="7.5.000" expanded="true" height="68" name="Stem (Porter)" width="90" x="246" y="136"/>
<operator activated="true" class="text:filter_stopwords_dictionary" compatibility="7.5.000" expanded="true" height="82" name="Filter Stopwords (Dictionary)" width="90" x="246" y="238">
<parameter key="file" value="C:\Users\ace\Dropbox\Thesis V3.0\THESIS 4\airasia.txt"/>
</operator>
<operator activated="true" class="text:generate_n_grams_terms" compatibility="7.5.000" expanded="true" height="68" name="Generate n-Grams (Terms)" width="90" x="380" y="238"/>
<connect from_port="document" to_op="Replace Tokens" to_port="document"/>
<connect from_op="Replace Tokens" from_port="document" to_op="Transform Cases" to_port="document"/>
<connect from_op="Transform Cases" from_port="document" to_op="Tokenize" to_port="document"/>
<connect from_op="Tokenize" from_port="document" to_op="Stem (Porter)" to_port="document"/>
<connect from_op="Stem (Porter)" from_port="document" to_op="Filter Stopwords (Dictionary)" to_port="document"/>
<connect from_op="Filter Stopwords (Dictionary)" from_port="document" to_op="Generate n-Grams (Terms)" to_port="document"/>
<connect from_op="Generate n-Grams (Terms)" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<connect from_op="Read Excel" from_port="output" to_op="Append" to_port="example set 1"/>
<connect from_op="Read Excel (3)" from_port="output" to_op="Append" to_port="example set 2"/>
<connect from_op="Read Excel (4)" from_port="output" to_op="Append" to_port="example set 3"/>
<connect from_op="Append" from_port="merged set" to_op="Data to Documents" to_port="example set"/>
<connect from_op="Data to Documents" from_port="documents" to_op="Process Documents" to_port="documents 1"/>
<connect from_op="Process Documents" from_port="example set" to_port="result 1"/>
<connect from_op="Process Documents" from_port="word list" to_port="result 2"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
</process>
</operator>
</process>

The links are not removed, but the hashtags were removed.
PS: The links included in my data start with https.
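Testing the two patterns from my replace dictionary outside RapidMiner (in Python, with made-up example text) shows the same behavior - neither pattern matches a link, so it passes through untouched:

```python
import re

text = "check this https://t.co/abc #vscocam"

# The same patterns as in the Replace Tokens dictionary above
# (@mentions and hashtags only; there is no URL pattern).
for pattern in [r"@[a-zA-Z]*", r"#[a-zA-Z0-9]*"]:
    text = re.sub(pattern, " ", text)

print(text)  # the https link survives because nothing matches it
```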
@fangirl96 take a look at my tutorial process here: http://www.neuralmarkettrends.com/blog/entry/use-rapidminer-discover-twitter-content
I extract hashtags and replace links starting with https: with a generic word called 'link'.
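For reference, that normalization step could be sketched in Python like this. The pattern and token name follow the description above; they are assumptions, not the exact regex from the tutorial process:

```python
import re

def normalize_links(text: str) -> str:
    # Replace any https link with the generic token 'link'.
    return re.sub(r"https\S+", "link", text)

print(normalize_links("flight delayed https://t.co/xyz again"))
# -> "flight delayed link again"
```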