Import a CSV without false line breaks

MarkusW · New Altair Community Member
edited November 5 in Community Q&A
Hi,
I'm currently writing my Bachelor's thesis on machine learning (sarcasm detection, specifically).
My professor recommended RapidMiner.
Here's the problem: when I try to import the corpus I intend to work with (which I'm already using in a survey, and which is also used in dozens of works I'm referencing), RapidMiner moves content that belongs in the last column into the first, presumably because the text there contains line breaks.
If any other software I've had to work with so far did this, I'd probably know what to do...
I've tried as hard as I can to tell RapidMiner to disregard everything that isn't a TAB as a column separator.


Answers

  • BalazsBarany · New Altair Community Member
    Hi,

    you can use other software to import the CSV and export it in a more structured format, like xlsx, or load it into a database. RapidMiner will read the line breaks from those without a problem.

    If you're working with survey software anyway, you should have other export options besides CSV.

    Regards,
    Balázs
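
    A minimal Python sketch of that route, using pandas as the "other software" (the file names and the tab separator are assumptions about the corpus, and writing .xlsx requires openpyxl to be installed):

    # Convert the tab-separated corpus to .xlsx so RapidMiner's Read Excel
    # operator can load it. pandas keeps line breaks inside quoted fields as
    # part of the field value instead of starting a new record.
    import pandas as pd

    df = pd.read_csv("corpus.tsv", sep="\t", quotechar='"', encoding="utf-8")
    df.to_excel("corpus.xlsx", index=False)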
  • MarkusW · New Altair Community Member
    Hi Balázs,
    thanks for the quick response.
    The problem is: it doesn't work. I can't import .db files at all.
    And while opening it in LibreOffice and saving it as .xlsx seems to work, I wouldn't call that a solution to the problem of working with text.
  • kayman · New Altair Community Member
    edited September 2021 · Answer ✓
    You could try replacing every line break with a dummy string first (something like [lb]) and replacing it back with a line break after you've loaded the CSV.

    Adding line breaks back is a bit of a dirty trick, since you can't easily insert them with a regex. What works for me is to first create an attribute with the value %0A (the linefeed character), decode it using the Decode URL operator, and store the result as a macro. Then you can insert it as a replacement value via that macro.

    Or you can replace them upfront using Notepad++ or similar; there you can replace directly with \r\n.

    Then again, if your CSV import is looking for tabs as separators, it should ignore the 'false' line breaks altogether. So could it be that there are Unicode tab characters in your content causing this behaviour?
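
    A minimal Python sketch of the upfront replacement, as a scriptable alternative to Notepad++ (file names and the tab separator are assumptions). Every line break inside a quoted field becomes the token [lb], so each record ends up on a single physical line; after import, a Replace operator with the decoded-%0A macro described above can turn [lb] back into real line breaks:

    import csv

    # The csv module parses quoted fields, so embedded line breaks stay inside
    # the field value while we rewrite them as the [lb] placeholder.
    with open("corpus.tsv", newline="", encoding="utf-8") as src, \
         open("corpus_flat.tsv", "w", newline="", encoding="utf-8") as dst:
        reader = csv.reader(src, delimiter="\t", quotechar='"')
        writer = csv.writer(dst, delimiter="\t", quotechar='"')
        for row in reader:
            writer.writerow(
                [f.replace("\r\n", "[lb]").replace("\n", "[lb]").replace("\r", "[lb]")
                 for f in row]
            )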
  • BalazsBarany · New Altair Community Member
    Answer ✓
    It's true that CSV doesn't have a good specification, and some programs cope better with line breaks inside quoted strings than RapidMiner does.

    For me, manually converting to Excel and then using Read Excel in RapidMiner was a workable workaround.

    Of course, I strive to put everything into relational databases as early as possible, so these kinds of problems go away.
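
    A rough sketch of the database route (table and file names are placeholders; RapidMiner would then read the table via Read Database, which needs a matching JDBC driver, e.g. for SQLite, configured):

    import sqlite3

    import pandas as pd

    # Load the corpus once with a parser that understands quoted line breaks,
    # then store it in a table; the breaks survive as part of the text values.
    df = pd.read_csv("corpus.tsv", sep="\t", quotechar='"', encoding="utf-8")
    conn = sqlite3.connect("corpus.db")
    df.to_sql("corpus", conn, if_exists="replace", index=False)
    conn.close()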