How can I extract the unique URLs per user from a set of tweets in a Twitter dataset with RapidMiner?
Answers
-
Hi @ramzanzadeh72,
To extract URLs from tweets, you can use the Extract Entities operator from the Aylien extension (download it from the Marketplace; you also have to obtain an API key on the Aylien site).
However, in your case you may have to purchase a paid license, because the free license is limited to 1,000 examples per day.
You can then use the Aggregate operator to count the unique URLs per user.
And be patient: the Extract Entities operator takes a long time to compute.
Here is the process:
<?xml version="1.0" encoding="UTF-8"?><process version="8.2.000">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="8.2.000" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="read_csv" compatibility="8.2.000" expanded="true" height="68" name="Read CSV" width="90" x="112" y="34">
<parameter key="csv_file" value="C:\Users\Lionel\Documents\Formations_DataScience\Rapidminer\Tests_Rapidminer\Extract_URL\data.csv"/>
<parameter key="column_separators" value=","/>
<parameter key="first_row_as_names" value="false"/>
<list key="annotations">
<parameter key="0" value="Name"/>
</list>
<parameter key="encoding" value="windows-1252"/>
<list key="data_set_meta_data_information">
<parameter key="0" value="text.true.polynominal.attribute"/>
<parameter key="1" value="user_id.true.real.attribute"/>
</list>
</operator>
<operator activated="true" class="filter_example_range" compatibility="8.2.000" expanded="true" height="82" name="Filter Example Range" width="90" x="246" y="34">
<parameter key="first_example" value="40"/>
<parameter key="last_example" value="100"/>
</operator>
<operator activated="true" class="com.aylien.textapi.rapidminer:aylien_entities" compatibility="0.2.000" expanded="true" height="68" name="Extract Entities" width="90" x="447" y="34">
<parameter key="connection" value="Aylien_dkk"/>
<parameter key="input_attribute" value="text"/>
</operator>
<operator activated="true" class="aggregate" compatibility="8.2.000" expanded="true" height="82" name="Aggregate" width="90" x="581" y="34">
<list key="aggregation_attributes">
<parameter key="url" value="count"/>
</list>
<parameter key="group_by_attributes" value="user_id"/>
</operator>
<connect from_op="Read CSV" from_port="output" to_op="Filter Example Range" to_port="example set input"/>
<connect from_op="Filter Example Range" from_port="example set output" to_op="Extract Entities" to_port="Example Set"/>
<connect from_op="Extract Entities" from_port="Example Set" to_op="Aggregate" to_port="example set input"/>
<connect from_op="Aggregate" from_port="example set output" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
Regards,
Lionel
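As a cross-check of what the aggregation step is meant to produce, here is a minimal pandas sketch of a per-user unique-URL count (the data is made up; the user_id and url column names mirror the attributes used in the process above):

```python
import pandas as pd

# Hypothetical result of URL extraction: one extracted URL per row.
df = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "url": [
        "http://a.com", "http://a.com", "http://b.com",
        "http://a.com", "http://c.com",
    ],
})

# Group by user and count distinct URLs, ignoring repeats.
unique_counts = df.groupby("user_id")["url"].nunique()
print(unique_counts.to_dict())  # {1: 2, 2: 2}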
0 -
If you don't want to pay for the Aylien plan, you could also try to extract URLs with specific regular expressions. Search the forum for several examples of how to do this (it has been covered in a couple of other threads). The manual method is a bit more cumbersome, but it should be able to extract any URL with the standard format of http://..., https://..., or www....
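As a rough sketch of the manual approach, a simple pattern covering those three standard formats can be applied with Python's re module (real-world URL regexes get much more elaborate, so treat this as a starting point):

```python
import re

# Matches http://..., https://..., or www.... up to the next whitespace.
URL_PATTERN = r"(?:https?://|www\.)\S+"

tweet = "check www.example.com and https://foo.org/page?id=1 now"
urls = re.findall(URL_PATTERN, tweet)
print(urls)  # ['www.example.com', 'https://foo.org/page?id=1']
```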
0 -
@Telcontar120
My problem is that some tweets contain two or more URLs; in this case, what can I do? I need to first store the URLs of each user and then count the unique URLs. Is this possible in RapidMiner?
0 -
My problem is that some tweets contain two or more URLs, and when I extract the URLs from these tweets and then use Aggregate, only the first URL is considered. What can I do?
0 -
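For reference, this multi-URL case is straightforward in Python: re.findall returns every URL in a tweet, and exploding the resulting lists before a distinct count per user ignores repeats. A minimal pandas sketch (data and column names are illustrative):

```python
import re
import pandas as pd

URL_PATTERN = r"https?://\S+"  # simplistic pattern, for illustration

tweets = pd.DataFrame({
    "user_id": [1, 1, 2],
    "text": [
        "see http://a.com and http://b.com",  # two URLs in one tweet
        "again http://a.com",
        "only http://c.com",
    ],
})

# findall keeps every URL in a tweet, not just the first one.
tweets["urls"] = tweets["text"].apply(lambda t: re.findall(URL_PATTERN, t))

# explode turns the list column into one row per URL,
# then a distinct count per user ignores repeats.
per_user = tweets.explode("urls").groupby("user_id")["urls"].nunique()
print(per_user.to_dict())  # {1: 2, 2: 1}
```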
As mentioned by @Telcontar120, a free method to extract URLs is to use specific regular expressions.
However, I don't know if it is possible to do what you want with RapidMiner's native operators.
So I propose a process with two branches using two Python scripts:
- one branch to extract all the URLs;
- one branch to extract the URLs and count them.
In your dataset, the URLs seem to be very simple, so I chose a simple regex to extract them (but you can look up a better pattern
and set it in the Set Macro operator).
Here is the process:
<?xml version="1.0" encoding="UTF-8"?><process version="8.2.000">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="8.2.000" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="read_csv" compatibility="8.2.000" expanded="true" height="68" name="Read CSV" width="90" x="45" y="34">
<parameter key="csv_file" value="C:\Users\Lionel\Documents\Formations_DataScience\Rapidminer\Tests_Rapidminer\Extract_URL\data.csv"/>
<parameter key="column_separators" value=","/>
<parameter key="first_row_as_names" value="false"/>
<list key="annotations">
<parameter key="0" value="Name"/>
</list>
<parameter key="encoding" value="windows-1252"/>
<list key="data_set_meta_data_information">
<parameter key="0" value="text.true.polynominal.attribute"/>
<parameter key="1" value="user_id.true.real.attribute"/>
</list>
</operator>
<operator activated="true" class="set_macro" compatibility="8.2.000" expanded="true" height="82" name="URL_Pattern" width="90" x="179" y="34">
<parameter key="macro" value="urlPattern"/>
<parameter key="value" value="r'(https?://\S+)'"/>
</operator>
<operator activated="true" class="multiply" compatibility="8.2.000" expanded="true" height="103" name="Multiply" width="90" x="313" y="34"/>
<operator activated="true" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="82" name="Extract URLs" width="90" x="514" y="34">
<parameter key="script" value="import pandas as pd import re # rm_main is a mandatory function, # the number of arguments has to be the number of input ports (can be none) def rm_main(data): URLPATTERN = %{urlPattern} #data['urlcount'] = data.text.apply(lambda x: re.findall(URLPATTERN, x)).str.len() data['url'] = data.text.apply(lambda x: re.findall(URLPATTERN, x)) #data.groupby('user_id').sum()['urlcount'] # connect 2 output ports to see the results return data"/>
</operator>
<operator activated="true" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="82" name="Extract and count unique URL" width="90" x="514" y="187">
<parameter key="script" value="import pandas as pd import re # rm_main is a mandatory function, # the number of arguments has to be the number of input ports (can be none) def rm_main(data): URLPATTERN = %{urlPattern} data['urlcount'] = data.text.apply(lambda x: re.findall(URLPATTERN, x)).str.len() #data.groupby('user_id').sum()['urlcount'] # connect 2 output ports to see the results return data"/>
</operator>
<operator activated="true" class="aggregate" compatibility="8.2.000" expanded="true" height="82" name="Aggregate" width="90" x="648" y="187">
<list key="aggregation_attributes">
<parameter key="urlcount" value="sum"/>
</list>
<parameter key="group_by_attributes" value="user_id"/>
</operator>
<connect from_op="Read CSV" from_port="output" to_op="URL_Pattern" to_port="through 1"/>
<connect from_op="URL_Pattern" from_port="through 1" to_op="Multiply" to_port="input"/>
<connect from_op="Multiply" from_port="output 1" to_op="Extract URLs" to_port="input 1"/>
<connect from_op="Multiply" from_port="output 2" to_op="Extract and count unique URL" to_port="input 1"/>
<connect from_op="Extract URLs" from_port="output 1" to_port="result 1"/>
<connect from_op="Extract and count unique URL" from_port="output 1" to_op="Aggregate" to_port="example set input"/>
<connect from_op="Aggregate" from_port="example set output" to_port="result 2"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
</process>
</operator>
</process>
To execute this process, you need to:
- install Python on your computer;
- install the Execute Python operator (from the Marketplace).
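For readability, here is the script inside the two Execute Python operators (it appears flattened in the XML above). The two operator scripts differ only in which column they keep, so this standalone sketch merges them; %{urlPattern} is written out as the literal pattern held by the Set Macro operator:

```python
import re
import pandas as pd

# The Set Macro operator injects this via %{urlPattern};
# here it is written out literally.
URLPATTERN = r'(https?://\S+)'

# rm_main is the mandatory entry point of the Execute Python operator;
# it receives one DataFrame per connected input port.
def rm_main(data):
    # branch 1: keep the full list of URLs found in each tweet
    data['url'] = data.text.apply(lambda x: re.findall(URLPATTERN, x))
    # branch 2: count the URLs per tweet (Aggregate then sums per user)
    data['urlcount'] = data['url'].str.len()
    return data
```

Note that summing urlcount per user counts duplicate URLs; deduplicating each list first (e.g. len(set(...)) instead of len) is one way to count unique URLs instead.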
I hope it helps,
Regards,
Lionel
0 -
Hi @ramzanzadeh72,
As @Telcontar120 and @lionelderkrikor mentioned, you may want to use regular expressions to identify your matches. A few days ago I wrote about identifying and removing URLs through regular expressions here. Long story short, you can use the Replace operator to apply a regular expression. This was the final expression:
https?://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]
However, I've been playing with the most common patterns I know for at least 30 minutes now, and I couldn't find a way to match everything that isn't this pattern (so that you could remove the rest and keep the URLs only). It appears that in Java (hence, in RapidMiner) you can't rely on negative matching: the idea is instead to create the pattern you want matched and then either replaceAll("") the matches or find() the next one and act on it (among other methods).
Sorry I couldn't come up with a solution, but at least you know that regular expressions with pure RapidMiner might not be the place to look for what you want (and by the way, this looks like a nice-to-have feature, doesn't it?).
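One way around the negative-matching limitation: instead of building a pattern for everything that is not a URL, extract the matches themselves and keep only those. A Python sketch using the expression quoted above (the example text is made up):

```python
import re

# The expression quoted above, as a Python raw string.
URL = r"https?://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]"

text = "read this: https://example.com/a?b=1, then reply"
# Collect the matches directly rather than deleting the non-matches;
# the final character class excludes the trailing comma from the match.
urls = re.findall(URL, text)
print(urls)  # ['https://example.com/a?b=1']
```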
All the best,
0 -
Hello
How can I get improved spelling of words in RapidMiner?
For example, a word:
meeseg -> message
or
veeeery gooood -> very good
Does anyone know?
0 -
This last post should be in a new thread.
You can use "replace token" to swap a misspelling for a correct one.
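As an illustration, here is a Python sketch of both ideas: a fixed replacement table (what a token-replace step does), plus a heuristic for non-fixed elongated words like "veeeery gooood" that tries shrinking each run of repeated letters and checks the result against a vocabulary. The vocabulary and replacement table here are assumptions; this will not catch scrambled misspellings such as "meeseg" without an explicit entry.

```python
import re
from itertools import product

VOCAB = {"very", "good", "message"}   # assumed vocabulary of correct words
FIXES = {"meeseg": "message"}         # hand-maintained token replacements

def normalize(token):
    """Fix a known misspelling, or try shrinking elongated letter runs."""
    token = FIXES.get(token, token)
    # split into runs of identical letters, e.g. 'gooood' -> g, oooo, d
    runs = [m.group(0) for m in re.finditer(r"(.)\1*", token)]
    # each run may legitimately be 1 or 2 letters ('very' vs 'good')
    options = [{r[0], r[:2]} for r in runs]
    for cand in {"".join(p) for p in product(*options)}:
        if cand in VOCAB:
            return cand
    return token

print(" ".join(normalize(t) for t in "veeeery gooood".split()))  # very good
print(normalize("meeseg"))  # message
```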
0 -
Hello. I know, but my words are not fixed; those were just examples. Is there no other way?
0 -
Hi there @jozeftomas_2020,
Please search first, as there are a few posts in the Community on replacing text.
Can you please post this question in a new thread under the Getting Started Forum? This way others that have the same question will be able to find it at a later date.
Thanks,
Allie Tamulewicz
1