Remove or replace URL and RT from Twitter dataset

User: "ikayunida123"
New Altair Community Member
Updated by Jocelyn

Hello everyone!

I'm currently working on the data cleaning phase of a text classification task using a Twitter dataset, but I can't figure out how to replace (or remove) URLs, "RT" markers and the @ character. I've read some posts on the forum, but I didn't understand them :catsad:

For the URLs in the dataset, I want to replace anything starting with "https:" or "http:" with the word "link" (I don't know why the replacement can't be an empty value like " "). But after I ran my process with the Replace operator, "http://blablabla" did not become just "link"; the result came out as "linkblablabla". Maybe it has something to do with the RegEx? :catsad: I know what RegEx is, but I don't know how to use or write it :catsad:

I'm really confused right now. Please help me.

This is my RapidMiner process:

<?xml version="1.0" encoding="UTF-8"?><process version="8.1.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="8.1.001" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="retrieve" compatibility="8.1.001" expanded="true" height="68" name="Retrieve Dataset Skripsi" width="90" x="45" y="34">
<parameter key="repository_entry" value="Dataset Skripsi"/>
</operator>
<operator activated="true" class="nominal_to_text" compatibility="8.1.001" expanded="true" height="82" name="Nominal to Text" width="90" x="179" y="34">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="Text"/>
</operator>
<operator activated="true" class="set_role" compatibility="8.1.001" expanded="true" height="82" name="Set Role" width="90" x="313" y="34">
<parameter key="attribute_name" value="Label"/>
<parameter key="target_role" value="label"/>
<list key="set_additional_roles"/>
</operator>
<operator activated="true" class="filter_examples" compatibility="8.1.001" expanded="true" height="103" name="Filter Examples" width="90" x="447" y="34">
<parameter key="condition_class" value="no_missing_attributes"/>
<list key="filters_list"/>
</operator>
<operator activated="true" class="remove_duplicates" compatibility="8.1.001" expanded="true" height="103" name="Remove Duplicates" width="90" x="581" y="34">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="Text"/>
</operator>
<operator activated="true" class="replace" compatibility="8.1.001" expanded="true" height="82" name="Replace" width="90" x="715" y="34">
<parameter key="replace_what" value="(https://)"/>
<parameter key="replace_by" value="link"/>
</operator>
<connect from_op="Retrieve Dataset Skripsi" from_port="output" to_op="Nominal to Text" to_port="example set input"/>
<connect from_op="Nominal to Text" from_port="example set output" to_op="Set Role" to_port="example set input"/>
<connect from_op="Set Role" from_port="example set output" to_op="Filter Examples" to_port="example set input"/>
<connect from_op="Filter Examples" from_port="example set output" to_op="Remove Duplicates" to_port="example set input"/>
<connect from_op="Remove Duplicates" from_port="example set output" to_op="Replace" to_port="example set input"/>
<connect from_op="Replace" from_port="example set output" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>

I need your help. Thank you!

    User: "rfuentealba"
    New Altair Community Member
    Accepted Answer

    Hi @ikayunida123

     

    I found another one for your viewing pleasure (...or not):

     

    (https?|http)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]

     

    This one produces the following results:

     

    [Screenshot: Screen Shot 2018-06-07 at 01.20.12.png]

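    It's not simple to read, but it is indeed easy to understand: instead of using (.+) or (.*), which match any character, the square brackets list exactly which characters may follow "http://" or "https://". The match therefore stops at the first character that cannot be part of a URL, such as a space. The second pair of brackets leaves out punctuation like ".", "," and "!", so a URL at the end of a sentence does not swallow the punctuation that follows it.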
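To make the difference concrete, here is a minimal sketch in Python rather than RapidMiner. It assumes the Replace operator applies standard regex semantics to this pattern; the names `URL_PATTERN`, `SCHEME_ONLY` and `replace_urls` are made up for the illustration and are not part of RapidMiner. It compares the scheme-only pattern from the original process with the full pattern above.

```python
import re

# Scheme-only pattern, as in the "replace_what" parameter of the original process.
# It matches just the "https://" prefix, so the rest of the URL is left in place.
SCHEME_ONLY = re.compile(r"(https://)")

# Pattern copied from the answer above. The (https?|http) group is kept as posted,
# although "https?" alone already covers both "http" and "https".
URL_PATTERN = re.compile(
    r"(https?|http)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]"
)


def replace_urls(text, placeholder="link"):
    """Replace every whole URL in text with the placeholder (hypothetical helper)."""
    return URL_PATTERN.sub(placeholder, text)


if __name__ == "__main__":
    # Only the prefix is replaced, which reproduces the "linkblablabla" problem.
    print(SCHEME_ONLY.sub("link", "https://blablabla"))

    # The whole URL is replaced; trailing punctuation stays outside the match.
    print(replace_urls("this https://blablabla.com/page?id=1, nice!"))
```

In the process itself, the equivalent change would be to put the full pattern into the `replace_what` parameter of the Replace operator and keep `replace_by` set to "link".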

     

    Hmmm... I've decided that it is not easy to understand either. I tried to explain, I swear. But hopefully this or the other ones I already shared with you might help you.

     

    All the best,