Read CSV - How to use quotes as escape character for quotes

LearnAW
LearnAW New Altair Community Member
edited November 2024 in Community Q&A
I'm reading a series of CSV data files, comma-separated and using quotes.  Within a data line, if quotes are used in a field, it indicates that using double-quotes.  i.e., It essentially uses quotes as the escape character for quotes.  An example line could be:
  "News Alert","Mon, 13 May 2019 08:29:58","""NEWS OFFICE"" <newsoffice@spamdude.com>"
which it SHOULD interpret as 3 fields as follows: 
  (1) News Alert  (2) Mon, 13 May 2019 08:29:58 (3) "NEWS OFFICE" <newsoffice@spamdude.com>

I'm using the Read CSV operator, with "use quotes" checked and using quotes as both the quotes character and escape character.  The result is that it not only doesn't read the line correctly, it completely skips reading any line that has the double-quotes in it.  My operator XML is as follows:

          <operator activated="true" class="read_csv" compatibility="9.0.003" expanded="true" height="68" name="Read CSV" width="90" x="112" y="34">
            <parameter key="column_separators" value=","/>
            <parameter key="escape_character" value="&quot;"/>
            <list key="annotations"/>
            <list key="data_set_meta_data_information"/>
            <parameter key="read_not_matching_values_as_missings" value="false"/>
          </operator>

Is there a way to do this so it reads and interprets my example line properly, or do I have to preprocess all my data files with a Python script or something similar to replace the double-quotes with some other escape character (like the default backslash), before ingesting to RapidMiner?  Thanks for the help!
Tagged:

Best Answers

  • SGolbert
    SGolbert New Altair Community Member
    Answer ✓


    I have reproduced your problem by creating a csv file. It's better when you share your entire process and a test file.


    Where does the csv come from? Is there a possible to change the escape character in the csv generator? I have to say the problem lies in the operator, I can read the file just fine with Pandas.


    So a solution would be to use the read_csv function with Pandas in a Execute Python operator. You could also read without the "use quotes" option and fix the result in RapidMiner, but I don't recommend it.


    Kind regards,
    Sebastian
  • Marco_Boeck
    Marco_Boeck New Altair Community Member
    edited May 2019 Answer ✓
    Hi,

    that's why I hate CSV, you get the most weird contents and >9000 special cases which would need manual handling..  :s
    In your case, you defined " as escape character and " as quote character. That cannot work, because what escape character simply means is: the next characters (whatever it is!) is read as is into the result, it will have no semantic meaning. So if you have the following line and would like to split on commas (so 2 columns):
    "Hello",123
    this will create only a single column with the content
    Hello,123
    as output, because you escaped the comma. This is necessary because escaping must always work, otherwise the point of escaping becomes mood.

    Obviously it gets even worse when you now have lines like 
    "Hello World",""Hi World"",123
    or
    "Hello World","""Hi World""",123
    or
    "Hello World","""Hi"" World",123
    This is not a well defined syntax anymore, because now you have different meanings for the occurance of escape/quote characters - but the characters are identical and thus are treated the same by the code.
    Python probably solves this by having a huge segment of "if current character x and next character x/y and next character x/y/z then do something else than you normally would" statements.. We instead opted for a clear finite state machine which more or less assumes that the CSV file has a consistent syntax and thus goes through the line parsing methodically.
    Just some insight into why this file will not work in Studio.

    Regards,
    Marco
  • LearnAW
    LearnAW New Altair Community Member
    Answer ✓
    Thanks Sebastian and Marco, I appreciate your perspectives.  And yes, I think I was assuming (or hoping for) a level of sophistication in parsing that I don't think Read CSV provides. (i.e. A precedence would have to exist, to first interpret a quote as the start of a string, and then any subsequent double quotes as an escaped quote.)  If it's treating the escape character the same across the entire input, then I see Marco's point how this can't work.  It looks like I have some preprocessing to do.  If anyone else knows a way to avoid this and read it correctly with just RapidMiner, please illuminate for me!  Thanks to all.

Answers

  • SGolbert
    SGolbert New Altair Community Member
    Answer ✓


    I have reproduced your problem by creating a csv file. It's better when you share your entire process and a test file.


    Where does the csv come from? Is there a possible to change the escape character in the csv generator? I have to say the problem lies in the operator, I can read the file just fine with Pandas.


    So a solution would be to use the read_csv function with Pandas in a Execute Python operator. You could also read without the "use quotes" option and fix the result in RapidMiner, but I don't recommend it.


    Kind regards,
    Sebastian
  • Marco_Boeck
    Marco_Boeck New Altair Community Member
    edited May 2019 Answer ✓
    Hi,

    that's why I hate CSV, you get the most weird contents and >9000 special cases which would need manual handling..  :s
    In your case, you defined " as escape character and " as quote character. That cannot work, because what escape character simply means is: the next characters (whatever it is!) is read as is into the result, it will have no semantic meaning. So if you have the following line and would like to split on commas (so 2 columns):
    "Hello",123
    this will create only a single column with the content
    Hello,123
    as output, because you escaped the comma. This is necessary because escaping must always work, otherwise the point of escaping becomes mood.

    Obviously it gets even worse when you now have lines like 
    "Hello World",""Hi World"",123
    or
    "Hello World","""Hi World""",123
    or
    "Hello World","""Hi"" World",123
    This is not a well defined syntax anymore, because now you have different meanings for the occurance of escape/quote characters - but the characters are identical and thus are treated the same by the code.
    Python probably solves this by having a huge segment of "if current character x and next character x/y and next character x/y/z then do something else than you normally would" statements.. We instead opted for a clear finite state machine which more or less assumes that the CSV file has a consistent syntax and thus goes through the line parsing methodically.
    Just some insight into why this file will not work in Studio.

    Regards,
    Marco
  • LearnAW
    LearnAW New Altair Community Member
    Answer ✓
    Thanks Sebastian and Marco, I appreciate your perspectives.  And yes, I think I was assuming (or hoping for) a level of sophistication in parsing that I don't think Read CSV provides. (i.e. A precedence would have to exist, to first interpret a quote as the start of a string, and then any subsequent double quotes as an escaped quote.)  If it's treating the escape character the same across the entire input, then I see Marco's point how this can't work.  It looks like I have some preprocessing to do.  If anyone else knows a way to avoid this and read it correctly with just RapidMiner, please illuminate for me!  Thanks to all.