Import a CSV without false line breaks

MarkusW · New Altair Community Member
edited November 5 in Community Q&A
Hi,
I'm currently writing my Bachelor's thesis on machine learning (sarcasm detection, specifically).
My professor recommended RapidMiner.
Here's the problem: when I try to import the corpus I intend to work with (which I'm already using in a survey, and which is also used in dozens of works I'm referencing), RapidMiner moves content that belongs in the last column into the first, presumably because the text there contains line breaks.
If any other software I've had to work with so far did this, I'd probably know what to do...
I've tried as hard as I can to tell RapidMiner to disregard everything that isn't a TAB as a column separator.


Answers

  • BalazsBarany · New Altair Community Member
    Hi,

    you can use other software to import the CSV and export it in a more structured format, like xlsx, or load it into a database. RapidMiner will read the line breaks from those without a problem.

    If you're working with survey software anyway, you should have other export options besides CSV.

    Regards,
    Balázs
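
    A minimal Python sketch of that route, using pandas as the "other software" (the file names and the tab separator are assumptions about the corpus, and writing .xlsx requires openpyxl to be installed):

    # Convert the tab-separated corpus to .xlsx so RapidMiner's Read Excel
    # operator can load it. pandas keeps line breaks inside quoted fields as
    # part of the field value instead of starting a new record.
    import pandas as pd

    df = pd.read_csv("corpus.tsv", sep="\t", quotechar='"', encoding="utf-8")
    df.to_excel("corpus.xlsx", index=False)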
  • MarkusW · New Altair Community Member
    Hi Balázs,
    thanks for the quick response.
    The problem is: it doesn't work. I can't import .db files at all.
    And while opening it in LibreOffice and saving it as .xlsx seems to work, I wouldn't call that a solution to the problem of working with text.
  • kayman · New Altair Community Member
    edited September 2021 · Answer ✓
    You could try replacing every line break with a dummy string first (something like [lb]) and replacing it back with a line break after you've loaded the CSV.

    Adding line breaks back is a bit of a dirty trick, since you can't easily insert them with a regex. What works for me is to first create an attribute with the value %0A (the linefeed character), decode it using the Decode URL operator, and store the result as a macro. Then you can insert it as a replacement value via that macro.

    Or you can replace them upfront using Notepad++ or similar; there you can replace directly with \r\n.

    Then again, if your CSV import is looking for tabs as separators, it should ignore the 'false' line breaks altogether. So could it be that there are Unicode tab characters in your content causing this behaviour?
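
    A minimal Python sketch of the upfront replacement, as a scriptable alternative to Notepad++ (file names and the tab separator are assumptions). Every line break inside a quoted field becomes the token [lb], so each record ends up on a single physical line; after import, a Replace operator with the decoded-%0A macro described above can turn [lb] back into real line breaks:

    import csv

    # The csv module parses quoted fields, so embedded line breaks stay inside
    # the field value while we rewrite them as the [lb] placeholder.
    with open("corpus.tsv", newline="", encoding="utf-8") as src, \
         open("corpus_flat.tsv", "w", newline="", encoding="utf-8") as dst:
        reader = csv.reader(src, delimiter="\t", quotechar='"')
        writer = csv.writer(dst, delimiter="\t", quotechar='"')
        for row in reader:
            writer.writerow(
                [f.replace("\r\n", "[lb]").replace("\n", "[lb]").replace("\r", "[lb]")
                 for f in row]
            )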
  • BalazsBarany · New Altair Community Member
    Answer ✓
    It's true that CSV doesn't have a good specification, and some programs cope better with line breaks inside quoted strings than RapidMiner does.

    For me, manually converting to Excel and then using Read Excel in RapidMiner was a workable workaround.

    Of course, I strive to put everything into relational databases as early as possible, so these kinds of problems go away.
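
    A rough sketch of the database route (table and file names are placeholders; RapidMiner would then read the table via Read Database, which needs a matching JDBC driver, e.g. for SQLite, configured):

    import sqlite3

    import pandas as pd

    # Load the corpus once with a parser that understands quoted line breaks,
    # then store it in a table; the breaks survive as part of the text values.
    df = pd.read_csv("corpus.tsv", sep="\t", quotechar='"', encoding="utf-8")
    conn = sqlite3.connect("corpus.db")
    df.to_sql("corpus", conn, if_exists="replace", index=False)
    conn.close()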