Removing mentions with "@" and emojis from Excel Data
Anna_May1
Hello RapidMiner Community,
I am currently working on a supervised sentiment analysis. The sentiment analysis itself works, but I'm not quite happy with the data it uses.
As part of the data preparation, I want to remove mentions (that is, names following an "@"), and I have tried out some suggestions. The process I have built so far is uploaded here, along with the test data.
I am working with the "Replace" operator but sadly, with this process, the output still contains some mentions. These mentions remain because either a) they are the second mention in a row or b) the mention is not right at the beginning of the row.
Do any of you guys have some input regarding this?
In general, the goals I am trying to achieve are:
-remove any word (not the whole row) starting with "@".
-remove empty rows
-remove duplicates
-remove emojis (right now, with this process I ended up with question marks instead of the emojis as output, so I'd rather remove the emojis right away)
Grateful for any suggestions!
Anna May
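The four goals listed in the question can be sketched outside RapidMiner in a few lines of Python. This is only an illustration under stated assumptions, not the process discussed in this thread: the function name `clean_comments` and the sample rows are hypothetical, rows are plain strings, and "emoji" is approximated as "any non-ASCII character", which also drops accented letters.

```python
import re

# Assumption: a mention is "@" followed by one or more word characters.
MENTION = re.compile(r"@\w+")
# Assumption: emojis are handled by dropping everything outside 7-bit ASCII.
NON_ASCII = re.compile(r"[^\x00-\x7F]")


def clean_comments(rows):
    """Return rows with mentions and non-ASCII characters removed,
    skipping rows that end up empty and rows already seen."""
    cleaned = []
    seen = set()
    for row in rows:
        text = NON_ASCII.sub("", row)   # remove emojis (and any other non-ASCII)
        text = MENTION.sub("", text)    # remove every mention, wherever it occurs
        text = " ".join(text.split())   # collapse whitespace left by the removals
        if not text or text in seen:    # drop empty rows and duplicates
            continue
        seen.add(text)
        cleaned.append(text)
    return cleaned


# Hypothetical sample rows, not taken from the uploaded test data.
sample = [
    "@anna great product \U0001F600",
    "thanks @bob and @carol!",
    "@dave",
    "great product",
]
print(clean_comments(sample))
```

Because the mention pattern is not anchored to the start of the row, it also catches a second mention in the same row and mentions in the middle of a sentence. Note that it would also cut the part after "@" out of an e-mail address, so a stricter pattern may be needed if the data contains any.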
MartinLiebig
Hi @Anna_May1,
good one! I needed to google a bit for the right regex. The attached process should do the trick.
Best,
Martin
<?xml version="1.0" encoding="UTF-8"?><process version="9.8.000">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="9.8.000" expanded="true" name="Process">
<parameter key="logverbosity" value="init"/>
<parameter key="random_seed" value="2001"/>
<parameter key="send_mail" value="never"/>
<parameter key="notification_email" value=""/>
<parameter key="process_duration_for_mail" value="30"/>
<parameter key="encoding" value="SYSTEM"/>
<process expanded="true">
<operator activated="true" class="read_excel" compatibility="9.8.000" expanded="true" height="68" name="Read Excel" width="90" x="45" y="34">
<parameter key="excel_file" value="C:\Users\MartinSchmitz\Downloads\Test Comments 1.xlsx"/>
<parameter key="sheet_selection" value="sheet number"/>
<parameter key="sheet_number" value="1"/>
<parameter key="imported_cell_range" value="A1"/>
<parameter key="encoding" value="SYSTEM"/>
<parameter key="first_row_as_names" value="true"/>
<list key="annotations"/>
<parameter key="date_format" value=""/>
<parameter key="time_zone" value="SYSTEM"/>
<parameter key="locale" value="English (United States)"/>
<parameter key="read_all_values_as_polynominal" value="false"/>
<list key="data_set_meta_data_information">
<parameter key="0" value="Comments.true.polynominal.attribute"/>
</list>
<parameter key="read_not_matching_values_as_missings" value="false"/>
<parameter key="datamanagement" value="double_array"/>
<parameter key="data_management" value="auto"/>
</operator>
<operator activated="true" class="replace" compatibility="9.8.000" expanded="true" height="82" name="Replace" width="90" x="179" y="34">
<parameter key="attribute_filter_type" value="all"/>
<parameter key="attribute" value=""/>
<parameter key="attributes" value=""/>
<parameter key="use_except_expression" value="false"/>
<parameter key="value_type" value="nominal"/>
<parameter key="use_value_type_exception" value="false"/>
<parameter key="except_value_type" value="file_path"/>
<parameter key="block_type" value="single_value"/>
<parameter key="use_block_type_exception" value="false"/>
<parameter key="except_block_type" value="single_value"/>
<parameter key="invert_selection" value="false"/>
<parameter key="include_special_attributes" value="false"/>
<parameter key="replace_what" value="[^\x00-\x7F]"/>
<description align="center" color="transparent" colored="false" width="126">Replace all non-ascii letters</description>
</operator>
<operator activated="true" class="replace" compatibility="9.8.000" expanded="true" height="82" name="Replace (2)" width="90" x="313" y="34">
<parameter key="attribute_filter_type" value="all"/>
<parameter key="attribute" value=""/>
<parameter key="attributes" value=""/>
<parameter key="use_except_expression" value="false"/>
<parameter key="value_type" value="nominal"/>
<parameter key="use_value_type_exception" value="false"/>
<parameter key="except_value_type" value="file_path"/>
<parameter key="block_type" value="single_value"/>
<parameter key="use_block_type_exception" value="false"/>
<parameter key="except_block_type" value="single_value"/>
<parameter key="invert_selection" value="false"/>
<parameter key="include_special_attributes" value="false"/>
<parameter key="replace_what" value="@"/>
<description align="center" color="transparent" colored="false" width="126">Replace @</description>
</operator>
<connect from_op="Read Excel" from_port="output" to_op="Replace" to_port="example set input"/>
<connect from_op="Replace" from_port="example set output" to_op="Replace (2)" to_port="example set input"/>
<connect from_op="Replace (2)" from_port="example set output" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
Anna_May1
Hi @mschmitz,
thanks for the quick reply. I tried your code and it did remove the emojis, but it didn't remove any of the mentions. So all mentions are still there, even the ones at the beginning of a row that were removed before.
Do you have any input as to why this might be the case?
Cheers,
Anna May
Tes Data Prep 3.rmp
MartinLiebig
Hi @Anna_May1,
sorry, my fault. I thought you wanted to replace only the @-symbol and not the @ together with the name. Attached is the correct one.
Best,
Martin
<?xml version="1.0" encoding="UTF-8"?><process version="9.8.000">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="9.8.000" expanded="true" name="Process">
<parameter key="logverbosity" value="init"/>
<parameter key="random_seed" value="2001"/>
<parameter key="send_mail" value="never"/>
<parameter key="notification_email" value=""/>
<parameter key="process_duration_for_mail" value="30"/>
<parameter key="encoding" value="SYSTEM"/>
<process expanded="true">
<operator activated="true" class="read_excel" compatibility="9.8.000" expanded="true" height="68" name="Read Excel" width="90" x="45" y="34">
<parameter key="excel_file" value="C:\Users\MartinSchmitz\Downloads\Test Comments 1.xlsx"/>
<parameter key="sheet_selection" value="sheet number"/>
<parameter key="sheet_number" value="1"/>
<parameter key="imported_cell_range" value="A1"/>
<parameter key="encoding" value="SYSTEM"/>
<parameter key="first_row_as_names" value="true"/>
<list key="annotations"/>
<parameter key="date_format" value=""/>
<parameter key="time_zone" value="SYSTEM"/>
<parameter key="locale" value="English (United States)"/>
<parameter key="read_all_values_as_polynominal" value="false"/>
<list key="data_set_meta_data_information">
<parameter key="0" value="Comments.true.polynominal.attribute"/>
</list>
<parameter key="read_not_matching_values_as_missings" value="false"/>
<parameter key="datamanagement" value="double_array"/>
<parameter key="data_management" value="auto"/>
</operator>
<operator activated="true" class="replace" compatibility="9.8.000" expanded="true" height="82" name="Replace" width="90" x="179" y="34">
<parameter key="attribute_filter_type" value="all"/>
<parameter key="attribute" value=""/>
<parameter key="attributes" value=""/>
<parameter key="use_except_expression" value="false"/>
<parameter key="value_type" value="nominal"/>
<parameter key="use_value_type_exception" value="false"/>
<parameter key="except_value_type" value="file_path"/>
<parameter key="block_type" value="single_value"/>
<parameter key="use_block_type_exception" value="false"/>
<parameter key="except_block_type" value="single_value"/>
<parameter key="invert_selection" value="false"/>
<parameter key="include_special_attributes" value="false"/>
<parameter key="replace_what" value="[^\x00-\x7F]"/>
<description align="center" color="transparent" colored="false" width="126">Replace all non-ascii letters</description>
</operator>
<operator activated="true" class="replace" compatibility="9.8.000" expanded="true" height="82" name="Replace (2)" width="90" x="313" y="34">
<parameter key="attribute_filter_type" value="all"/>
<parameter key="attribute" value=""/>
<parameter key="attributes" value=""/>
<parameter key="use_except_expression" value="false"/>
<parameter key="value_type" value="nominal"/>
<parameter key="use_value_type_exception" value="false"/>
<parameter key="except_value_type" value="file_path"/>
<parameter key="block_type" value="single_value"/>
<parameter key="use_block_type_exception" value="false"/>
<parameter key="except_block_type" value="single_value"/>
<parameter key="invert_selection" value="false"/>
<parameter key="include_special_attributes" value="false"/>
<parameter key="replace_what" value="@(\w+)"/>
<description align="center" color="transparent" colored="false" width="126">Replace @</description>
</operator>
<connect from_op="Read Excel" from_port="output" to_op="Replace" to_port="example set input"/>
<connect from_op="Replace" from_port="example set output" to_op="Replace (2)" to_port="example set input"/>
<connect from_op="Replace (2)" from_port="example set output" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
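To see why the first attempt left the names in place while the corrected pattern removes them, the two "replace what" expressions can be compared directly. The snippet below is a rough sketch using Python's `re` module rather than RapidMiner's Java regex engine; for these particular patterns the two are assumed to behave the same, and the sample string is made up.

```python
import re

# Hypothetical comment with two mentions and one emoji.
text = "@anna thanks, and hi @bob \U0001F44D"

# Step 1 (first Replace operator): drop every non-ASCII character.
ascii_only = re.sub(r"[^\x00-\x7F]", "", text)

# Step 2, first attempt: the pattern "@" removes only the symbol itself.
only_at_removed = re.sub(r"@", "", ascii_only)

# Step 2, corrected: "@(\w+)" removes the symbol together with the name.
mentions_removed = re.sub(r"@(\w+)", "", ascii_only)

print(repr(only_at_removed))
print(repr(mentions_removed))
```

The corrected pattern leaves the surrounding spaces behind, so a trim or whitespace-collapsing step afterwards may still be useful.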
Anna_May1
Hi @mschmitz,
thanks again for your time! I have no idea why but this still doesn't work for me. Would you mind sharing your process in another format?
Cheers,
Anna May
MartinLiebig
Hi @Anna_May1,
you are right. There is something wrong with the XML; let's try the .rmp file.
Best,
Martin
replace ascii and ats.rmp
Nicole_Samson
Hi, I'm also working on the same issue. How do I use this solution? Is it a macro or something else? TIA
MartinLiebig
Hi,
you can just download this process and load it into your RapidMiner using File->Load Process.
Best,
Martin
danni72
Streamlining Excel data is a breeze with a simple formula. By removing '@' mentions and emojis, you're enhancing data clarity. Clean, concise information fosters efficiency and makes analysis a smoother process. Excel mastery in action.
DominKirla
I tried this code to remove emojis and then shared the file on WhatsApp, but my friend told me the emojis were still there. Can anyone suggest an alternative way to remove the unwanted emojis?