Calculate tweet time intervals for each user

ramzanzadeh72
ramzanzadeh72 New Altair Community Member
edited November 5 in Community Q&A

Hi, I have a Twitter dataset and I want to calculate the time intervals between tweets for each user. Can I do this with RapidMiner?

In my dataset, the user_id attribute holds the ID of the user who sent the tweet, and the time attribute holds the time each tweet was sent.

How can I build this process in RapidMiner?

 

Answers

  • lionelderkrikor
    lionelderkrikor New Altair Community Member

    Hi @ramzanzadeh72,

     

    Does this process meet your needs?

     

    <?xml version="1.0" encoding="UTF-8"?><process version="8.2.000">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="8.2.000" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="social_media:search_twitter" compatibility="8.1.000" expanded="true" height="68" name="Search Twitter" width="90" x="112" y="34">
    <parameter key="connection" value="dkk"/>
    <parameter key="query" value="test"/>
    </operator>
    <operator activated="true" class="select_attributes" compatibility="8.2.000" expanded="true" height="82" name="Select Attributes" width="90" x="246" y="34">
    <parameter key="attribute_filter_type" value="subset"/>
    <parameter key="attributes" value="From-User-Id|Created-At"/>
    </operator>
    <operator activated="true" class="generate_attributes" compatibility="8.2.000" expanded="true" height="82" name="Generate Attributes (3)" width="90" x="380" y="34">
    <list key="function_descriptions">
    <parameter key="sent_at" value="[Created-At]"/>
    </list>
    </operator>
    <operator activated="true" class="series:lag_series" compatibility="7.4.000" expanded="true" height="82" name="Lag Series" width="90" x="514" y="34">
    <list key="attributes">
    <parameter key="sent_at" value="1"/>
    </list>
    </operator>
    <operator activated="true" class="generate_attributes" compatibility="8.2.000" expanded="true" height="82" name="Generate Attributes (2)" width="90" x="648" y="34">
    <list key="function_descriptions">
    <parameter key="tweet_interval" value="date_diff(sent_at,[sent_at-1])"/>
    </list>
    </operator>
    <operator activated="true" class="sort" compatibility="8.2.000" expanded="true" height="82" name="Sort" width="90" x="782" y="34">
    <parameter key="attribute_name" value="From-User-Id"/>
    <parameter key="sorting_direction" value="decreasing"/>
    </operator>
    <connect from_op="Search Twitter" from_port="output" to_op="Select Attributes" to_port="example set input"/>
    <connect from_op="Select Attributes" from_port="example set output" to_op="Generate Attributes (3)" to_port="example set input"/>
    <connect from_op="Generate Attributes (3)" from_port="example set output" to_op="Lag Series" to_port="example set input"/>
    <connect from_op="Lag Series" from_port="example set output" to_op="Generate Attributes (2)" to_port="example set input"/>
    <connect from_op="Generate Attributes (2)" from_port="example set output" to_op="Sort" to_port="example set input"/>
    <connect from_op="Sort" from_port="example set output" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    </process>
    </operator>
    </process>

    Regards,

     

    Lionel

  • ramzanzadeh72
    ramzanzadeh72 New Altair Community Member

    Hi @lionelderkrikor,

    Thank you for your reply and attention.

    It works for a single user, but my dataset has a set of users, and each user sent a set of tweets. What should I do to calculate this interval for each user?

     

  • lionelderkrikor
    lionelderkrikor New Altair Community Member

    Hi again @ramzanzadeh72,

     

    Could you share your dataset(s) and process so I can better understand your problem?

     

    Regards,

     

    Lionel

  • ramzanzadeh72
    ramzanzadeh72 New Altair Community Member

    @lionelderkrikor

    I am sharing part of my dataset. The user_id attribute holds the ID of the user who sent the tweet, and created_at holds the time the tweet was sent. This dataset has 3 users, and each user sent multiple tweets.

    So we should first sort each user's tweets by created_at and then calculate the interval between consecutive tweets of each user.

    data.csv 237.9K
  • lionelderkrikor
    lionelderkrikor New Altair Community Member
    Answer ✓

    @ramzanzadeh72,

     

    We should sort the dataset by user_id and then, indeed, you're right, by created_at. For this operation, I used the Sort (Advanced) operator from the Jackhammer extension (which you can install from the Marketplace).

    Here is the new process:

    <?xml version="1.0" encoding="UTF-8"?><process version="8.2.000">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="8.2.000" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="read_csv" compatibility="8.2.000" expanded="true" height="68" name="Read CSV" width="90" x="45" y="34">
    <parameter key="csv_file" value="C:\Users\Lionel\Documents\Formations_DataScience\Rapidminer\Tests_Rapidminer\Tweets_Interval\data.csv"/>
    <parameter key="column_separators" value=","/>
    <parameter key="first_row_as_names" value="false"/>
    <list key="annotations">
    <parameter key="0" value="Name"/>
    </list>
    <parameter key="encoding" value="windows-1252"/>
    <list key="data_set_meta_data_information">
    <parameter key="0" value="created_at.true.polynominal.attribute"/>
    <parameter key="1" value="user_id.true.real.attribute"/>
    </list>
    </operator>
    <operator activated="true" class="nominal_to_date" compatibility="8.2.000" expanded="true" height="82" name="Nominal to Date" width="90" x="179" y="34">
    <parameter key="attribute_name" value="created_at"/>
    <parameter key="date_type" value="date_time"/>
    <parameter key="date_format" value="EEE MMM dd HH:mm:ss +0000 yyyy"/>
    </operator>
    <operator activated="true" class="rmx_toolkit:sort_advanced" compatibility="2.1.784" expanded="true" height="82" name="Sort (Advanced)" width="90" x="380" y="34">
    <parameter key="primary_sort_attribute" value="user_id"/>
    <list key="additional_sort_attributes">
    <parameter key="created_at" value="increasing"/>
    </list>
    </operator>
    <operator activated="true" class="series:lag_series" compatibility="7.4.000" expanded="true" height="82" name="Lag Series" width="90" x="514" y="34">
    <list key="attributes">
    <parameter key="created_at" value="1"/>
    </list>
    </operator>
    <operator activated="true" class="generate_attributes" compatibility="8.2.000" expanded="true" height="82" name="Generate Attributes (2)" width="90" x="648" y="34">
    <list key="function_descriptions">
    <parameter key="tweet_interval" value="date_diff([created_at-1],created_at)"/>
    </list>
    </operator>
    <connect from_op="Read CSV" from_port="output" to_op="Nominal to Date" to_port="example set input"/>
    <connect from_op="Nominal to Date" from_port="example set output" to_op="Sort (Advanced)" to_port="example set input"/>
    <connect from_op="Sort (Advanced)" from_port="example set output" to_op="Lag Series" to_port="example set input"/>
    <connect from_op="Lag Series" from_port="example set output" to_op="Generate Attributes (2)" to_port="example set input"/>
    <connect from_op="Generate Attributes (2)" from_port="example set output" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    </process>
    </operator>
    </process>
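
    One caveat with the process above: Lag Series shifts values row by row over the whole sorted example set, so the first tweet of each user ends up compared with the last tweet of the previous user. A minimal sketch of one way to blank those boundary rows follows. It assumes that user_id is added to the Lag Series attribute list (which should produce a user_id-1 attribute, in the same way created_at-1 is produced) and that the if() function and the MISSING_NUMERIC constant are available in your version's expression parser. It has not been run against the shared data.csv.

    ```xml
    <!-- Fragment 1: replaces the attribute list of Lag Series. -->
    <!-- Lagging user_id as well is an assumption added for this sketch. -->
    <list key="attributes">
    <parameter key="created_at" value="1"/>
    <parameter key="user_id" value="1"/>
    </list>

    <!-- Fragment 2: replaces the function list of Generate Attributes (2). -->
    <!-- Keeps the interval only when the previous row belongs to the same user; -->
    <!-- otherwise the interval is set to a missing value. -->
    <list key="function_descriptions">
    <parameter key="tweet_interval" value="if([user_id-1]==user_id,date_diff([created_at-1],created_at),MISSING_NUMERIC)"/>
    </list>
    ```

    With this change, the first tweet of each user should get a missing tweet_interval instead of an interval that spans two different users.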

    Note that the interval between tweets is in milliseconds. You can customize the formula in the last Generate Attributes operator to convert the interval to seconds, minutes, hours, days, etc.
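
    As an illustration of that conversion, here is a small sketch of the function list for the last Generate Attributes operator. The attribute name tweet_interval_s is hypothetical and chosen only for this example. The division by 1000 relies on date_diff returning milliseconds, as noted above.

    ```xml
    <!-- Fragment: function list of Generate Attributes (2). -->
    <!-- tweet_interval_s is a hypothetical name for the interval in seconds. -->
    <list key="function_descriptions">
    <parameter key="tweet_interval_s" value="date_diff([created_at-1],created_at)/1000"/>
    </list>
    ```

    Dividing by 60000, 3600000 or 86400000 instead gives minutes, hours or days.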

     

    Regards,

     

    Lionel

     

  • ramzanzadeh72
    ramzanzadeh72 New Altair Community Member
    @lionelderkrikor
    Thank you, that's right.
    But I have another question: how can I calculate the entropy of these intervals for each user?