Extracting wrapped cells from a CSV file

Thomas_Ott
Thomas_Ott New Altair Community Member
edited November 5 in Community Q&A

Ok, here's one that's stumping me. I'm attaching a CSV file that has some Instagram comments. The file is generated from a Python script that I'll eventually embed into RM, but right now I'm just trying to get this to work. My goal is to use these comments for some text processing later on.

 

My problem is correctly parsing the entire comment from each cell. For some strange reason, I can't seem to do it correctly. The text seems to be wrapped in the cell. Any ideas on how to do this?

Answers

  • lionelderkrikor
    lionelderkrikor New Altair Community Member

    Hi @Thomas_Ott,

     

    Here, with the Nominal to Text operator placed between the Read CSV and Process Documents from Data / Tokenize operators, a TF-IDF dataset is generated correctly:

    Text_Mining_Thomas_Ott.png

     

    Text_Mining_Thomas_Ott_2.png

     

    "...The text seems to be wrapped in the cell..."

    What problem exactly are you encountering?

     

    Regards,

     

    Lionel

     

  • Thomas_Ott
    Thomas_Ott New Altair Community Member
    The problem is when I use the import wizard for Read CSV, the preview doesn’t look right.
  • lionelderkrikor
    lionelderkrikor New Altair Community Member

    Hi again @Thomas_Ott,

     

    Indeed, RapidMiner has difficulty reading your file correctly.

    So, once again, I propose a Python script as a possible solution (I used the read_csv function and filtered out the carriage returns, which cause problems in RapidMiner):

    Text_Mining_Thomas_Ott_3.png

     

    The process : 

    <?xml version="1.0" encoding="UTF-8"?><process version="9.0.001">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="9.0.001" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="82" name="Execute Python" width="90" x="112" y="85">
    <parameter key="script" value="import pandas as pd&#10;import re&#10;&#10;&#10;# rm_main is a mandatory function, &#10;# the number of arguments has to be the number of input ports (can be none)&#10;&#10;path = 'C:/Users/Lionel/Documents/Formations_DataScience/Rapidminer/Tests_Rapidminer/Text_Mining_Thomas_Ott/out2.csv'&#10;&#10;&#10;&#10;def rm_main():&#10;&#10; data = pd.read_csv(path)&#10; data.iloc[0] = data.iloc[0].str.replace('[^\w\s#@/:%.,_-]', '', flags=re.UNICODE)&#10; data['0'] = data['0'].str.replace('[^\w\s#@/:%.,_-]', '', flags=re.UNICODE)&#10; data = data.replace(r'\n','', regex=True)&#10; data = data.replace(r'\r','', regex=True)&#10; &#10; # connect 2 output ports to see the results&#10; return data"/>
    </operator>
    <connect from_op="Execute Python" from_port="output 1" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    </process>
    </operator>
    </process>

    To execute this process, don't forget to set the path (where your file is stored) in the Python script.
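    For readability, the script embedded in the Execute Python operator amounts to the sketch below. It reads from an in-memory string here (rather than a local file path) so it runs standalone; the sample CSV content is made up for illustration:

    ```python
    import re
    from io import StringIO

    import pandas as pd

    # Stand-in for the CSV on disk: one column named "0" whose cell
    # contains an embedded newline and a non-text symbol.
    csv_text = '0\n"first line\nsecond line \u2603"\n'

    def rm_main():
        data = pd.read_csv(StringIO(csv_text))
        # keep only word characters, whitespace and a few punctuation marks
        data['0'] = data['0'].str.replace(r'[^\w\s#@/:%.,_-]', '',
                                          regex=True, flags=re.UNICODE)
        # drop the carriage returns / line feeds that trip up Read CSV
        data = data.replace(r'\n', '', regex=True)
        data = data.replace(r'\r', '', regex=True)
        return data
    ```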

     

    I hope it helps,

     

    Regards,

     

    Lionel

     

  • Thomas_Ott
    Thomas_Ott New Altair Community Member

    @lionelderkrikor I see, you used iloc[0] on the dataframe. So RM loads it in one row at a time, hence you can use iloc[0] instead of looping. Correct?

     

    I wonder if the trick is to write the data out correctly in the first place and avoid all this mess. Do you happen to know of a 'remove carriage return option' when writing out the dataframe?
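    For what it's worth, to_csv has no option that strips newlines embedded in cell values (by default it just quotes such fields), so one approach is to clean them right before writing. A minimal sketch with a made-up dataframe:

    ```python
    import pandas as pd

    df = pd.DataFrame({"0": ["line one\r\nline two", "clean"]})
    # Replace embedded CR/LF inside cell values with a single space
    # before export, so every record occupies exactly one physical
    # line in the CSV.
    df = df.replace(r'\r?\n', ' ', regex=True)
    csv_text = df.to_csv(index=False)  # or df.to_csv('out2.csv', index=False)
    ```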

     

    Before you ask why not skip exporting the CSV and just pass it through the RM process: I get that, but the requirement is to spit out a CSV. :)

     

    Thanks BTW!

     

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-

    import pandas as pd

    df = pd.read_json('https://www.instagram.com/explore/tags/keto/?__a=1')

    text_export = []
    rawtext = df['graphql']['hashtag']['edge_hashtag_to_media']['edges']
    for i in rawtext:
        captions = i['node']['edge_media_to_caption']['edges']
        #print(captions)
        for j in captions:
            final_text = j['node']['text']
            text_export.append(final_text)
            #print(final_text)

    text_out = pd.DataFrame(text_export)
    text_out.to_csv('out2.csv')

     

  • Thomas_Ott
    Thomas_Ott New Altair Community Member

    @lionelderkrikor ok, I solved it. The data was incredibly messy. I just stripped all the whitespace using \s and replaced it with a single space. Kinda stupid, but it works.

     

    text_out.replace(r'\s',' ', regex=True, inplace=True)

    Now I get everything I want in the output file AND I can productionalize things.
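    One caveat with that pattern: \s replaces every whitespace character one-for-one, so a \r\n pair becomes two spaces. Using \s+ collapses each whole run of whitespace into a single space instead, as in this small made-up example:

    ```python
    import pandas as pd

    text_out = pd.DataFrame({"0": ["word1\r\nword2\tword3"]})
    # \s+ matches each run of whitespace (spaces, tabs, CR, LF) and
    # collapses it into one space, rather than one space per character.
    text_out.replace(r'\s+', ' ', regex=True, inplace=True)
    ```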

     

    Thanks for your help. 

  • SGolbert
    SGolbert New Altair Community Member

    Hi @Thomas_Ott,

     

    Good that you solved it. I've had similar problems before when reading CSVs generated with R or Python.

     

    As a general note, I have found that saving your external tables as XML can be advantageous, as the Read XML operator is far more forgiving than Read CSV. That way you can also retain special characters.
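    For instance, recent pandas (1.3 and later) can write XML directly; reserved characters are escaped, and embedded newlines remain valid inside element text, so nothing needs to be stripped. A sketch with a made-up dataframe:

    ```python
    import pandas as pd

    df = pd.DataFrame({"text": ["comment with\nan embedded newline", "plain"]})
    # DataFrame.to_xml returns the XML document as a string when no path
    # is given (pass a filename to write to disk); parser="etree" uses the
    # standard library so lxml is not required.
    xml_text = df.to_xml(index=False, parser="etree")
    ```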

     

    Regards,

    Sebastian

     

  • Thomas_Ott
    Thomas_Ott New Altair Community Member

    @SGolbert I haven't thought of doing that. A handy tip indeed. 

  • sgenzer
    sgenzer
    Altair Employee

    Another trick I use in situations like these is to convert the text to URL-encoded text, which makes it MUCH easier to strip out unwanted characters, then convert back.
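    As a hypothetical sketch of that round trip in Python, using the standard library's quote/unquote (the sample string is made up):

    ```python
    from urllib.parse import quote, unquote

    raw = "messy caption with\r\nembedded newlines"
    encoded = quote(raw)  # CR/LF become the visible tokens %0D and %0A
    # Unwanted characters are now easy to strip or substitute as plain text.
    cleaned = encoded.replace("%0D", "").replace("%0A", " ")
    result = unquote(cleaned)
    ```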

     

    Screen Shot 2018-08-16 at 9.38.54 AM.png


    Scott