How to clean tweets from hashtags and @

baran
baran New Altair Community Member
edited November 2024 in Community Q&A
Hi everybody
I have tried for 3 days to remove hashtags and @ from tweets, but I couldn't. Can anybody help?


Answers

  • IngoRM
    IngoRM New Altair Community Member

    Hi,

     

    Do you mean just getting rid of the symbols "@" and "#", or do you also want to remove what follows them, so that e.g. "@ingomierswa" and "#datascience" are removed completely?

     

    Both are easily possible with the "Replace" operator and a simple regular expression. Below is a small sample process showing how this is done.

     

    Hope this helps,

    Ingo

     

    <?xml version="1.0" encoding="UTF-8"?><process version="7.3.001">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="7.3.001" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="generate_data_user_specification" compatibility="7.3.001" expanded="true" height="68" name="Generate Data by User Specification" width="90" x="112" y="34">
    <list key="attribute_values">
    <parameter key="sample_tweet" value="&quot;This is just a sample tweet from @ingomierswa on #datascience - end of tweet.&quot;"/>
    </list>
    <list key="set_additional_roles"/>
    </operator>
    <operator activated="true" class="multiply" compatibility="7.3.001" expanded="true" height="103" name="Multiply" width="90" x="246" y="34"/>
    <operator activated="true" class="replace" compatibility="7.3.001" expanded="true" height="82" name="Only remove symbols" width="90" x="380" y="34">
    <parameter key="replace_what" value="@|#"/>
    </operator>
    <operator activated="true" class="replace" compatibility="7.3.001" expanded="true" height="82" name="Complete entities removed" width="90" x="380" y="136">
    <parameter key="replace_what" value="@[a-zA-Z]*|#[a-zA-Z]*"/>
    </operator>
    <connect from_op="Generate Data by User Specification" from_port="output" to_op="Multiply" to_port="input"/>
    <connect from_op="Multiply" from_port="output 1" to_op="Only remove symbols" to_port="example set input"/>
    <connect from_op="Multiply" from_port="output 2" to_op="Complete entities removed" to_port="example set input"/>
    <connect from_op="Only remove symbols" from_port="example set output" to_port="result 1"/>
    <connect from_op="Complete entities removed" from_port="example set output" to_port="result 2"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="84"/>
    <portSpacing port="sink_result 3" spacing="0"/>
    </process>
    </operator>
    </process>
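
To see what the two Replace patterns in the process above do outside of RapidMiner, here is a minimal sketch in Python. It assumes that Python's `re` module treats these two simple patterns the same way the Replace operator does, and the variable names are only illustrative.

```python
import re

# Sample tweet taken from the "Generate Data by User Specification" operator.
tweet = "This is just a sample tweet from @ingomierswa on #datascience - end of tweet."

# Pattern of the "Only remove symbols" operator: drop @ and #, keep the words.
symbols_only = re.sub(r"@|#", "", tweet)

# Pattern of the "Complete entities removed" operator:
# drop the symbol together with the letters that follow it.
entities_removed = re.sub(r"@[a-zA-Z]*|#[a-zA-Z]*", "", tweet)

print(symbols_only)
print(entities_removed)
```

The second result keeps the spaces that surrounded the removed entities, so a follow-up whitespace cleanup may be worth adding.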
  • baran
    baran New Altair Community Member
    Yes, exactly. Thank you! I will try it tomorrow and then edit this post.
  • Hyram
    Hyram New Altair Community Member
    Hi @IngoRM. This worked, thank you, but it only removes the letters after the @ or #, not any other characters. For example, I had @g_smug and it only removed @g and stopped at the underscore. Any suggestions?

    Thanks 
  • kayman
    kayman New Altair Community Member

    Extend your regex a bit, like this:

    [@#][^.\s,]+

    It looks a bit ugly, but it basically means: find any 'word' that starts with either @ or #, and select everything up to the next whitespace, dot or comma. Replace this with nothing and it's gone.
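
The extended idea can be sketched in Python as follows. This is only an illustration under a few assumptions: the names `ENTITY` and `strip_entities` are hypothetical, the whitespace cleanup at the end is an extra step not mentioned in the thread, and the pattern is written without a leading `\b`, because a word boundary in front of `@` or `#` only matches when the symbol directly follows a letter, digit or underscore, not after a space.

```python
import re

# Hypothetical pattern name: an @ or # followed by everything
# up to the next whitespace, dot, or comma.
ENTITY = re.compile(r"[@#][^.\s,]+")

def strip_entities(text: str) -> str:
    """Remove @mentions and #hashtags, including underscores and digits."""
    without_entities = ENTITY.sub("", text)
    # Extra step (an assumption, not from the thread): collapse leftover spaces.
    return re.sub(r"\s+", " ", without_entities).strip()

print(strip_entities("Nice talk today @g_smug #big_data"))
```

Because the character class excludes only whitespace, dots and commas, handles such as @g_smug are removed as a whole instead of stopping at the underscore.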
