Arabic Light Stemming a CSV file

NoorKhalifa
NoorKhalifa New Altair Community Member
edited November 5 in Community Q&A
I have a CSV file with around 4000 rows of text. I want to use the Arabic Light Stemmer to stem each record.

I have done the following but the text is not being stemmed. The output is the same as the input.


and inside the Process

Answers

  • BalazsBarany
    BalazsBarany New Altair Community Member
    Hi!

    To stem words, first you need words. Use Tokenize before Stem to split the text into words.

    Regards,

    Balázs
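
Why tokenization matters can be illustrated outside RapidMiner. The following is a minimal Python sketch, not the RapidMiner Stem (Arabic, Light) operator: the affix lists and the `tokenize`, `light_stem`, and `stem_text` helpers are hypothetical and only loosely modeled on typical Arabic light-stemming rules. It shows that a stemmer which strips affixes from the two ends of its input leaves a whole sentence unchanged, while the same stemmer applied per token changes the individual words.

```python
# Illustrative affix lists (an assumption, not RapidMiner's actual rules).
# Longer prefixes come first so that "وال" is tried before "ال" or "و".
PREFIXES = ["وال", "بال", "كال", "فال", "ال", "لل", "و"]
SUFFIXES = ["ها", "ان", "ات", "ون", "ين", "يه", "ية", "ه", "ة", "ي"]


def tokenize(text):
    """Split text into tokens on whitespace (the simplest possible tokenizer)."""
    return text.split()


def light_stem(word):
    """Strip at most one prefix and one suffix, keeping at least two characters."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= 2:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            word = word[:-len(suffix)]
            break
    return word


def stem_text(text):
    """Tokenize first, stem each token, then join the stems back into one string."""
    return " ".join(light_stem(token) for token in tokenize(text))


sentence = "قرأ الطالب الكتاب"
print(light_stem(sentence))  # stemming the untokenized sentence
print(stem_text(sentence))   # stemming after tokenization
```

In this sketch the first call treats the sentence as one long "word", so no affix rule matches and the input comes back as it went in, which resembles the symptom described in the question.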
  • NoorKhalifa
    @BalazsBarany

    I did the following



    inside the Process, but the output is still exactly the same as the input.

    Is there a problem with reading Arabic text?

    I set the encoding to UTF-8 when I imported the CSV file. Is there anything else I should do?
  • BalazsBarany
    Hi,

    Put a breakpoint after Tokenize and experiment with its settings. If the words appear in different colors, the tokenization is working correctly.

    I don't know the conventions for Arabic text; a different word separator may be necessary, for example.

    If the text looks normal to you in RapidMiner, then the encoding is correct. With a wrong encoding, the text would be visibly garbled.

    Regards,

    Balázs
  • NoorKhalifa
    @BalazsBarany

    I am now facing this issue. What could be the reason?


  • BalazsBarany
    Hi,

    You need to use Nominal to Text before Process Documents to mark your nominal attributes as text, which is the type the Text Processing operators expect.

    Regards,
    Balázs
  • NoorKhalifa
    @BalazsBarany

    When I put a breakpoint after Stem, I can see the correctly stemmed sentence. But the final output in the Results looks like the following. What can I do to fix this? I want the output to be one row per stemmed sentence.


  • BalazsBarany
    Hi!

    Use the "keep text" option that all "Process Documents" operators have. 

    By default, Process Documents creates a wide table of word vectors (one column per token), which is the format suitable for machine learning methods.

    Tokenization can split your text into letters, words or sentences. Stemming works on words, at least in Western languages. 

    Regards,
    Balázs
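
The difference between the default wide table and the kept text can be sketched in Python. This is only an illustration of the two output shapes, not how Process Documents is implemented: the toy documents are invented, and raw term counts are used for simplicity where RapidMiner may apply a different weighting.

```python
from collections import Counter

# Two toy documents, assumed to be tokenized and stemmed already.
docs = [["data", "mine", "text"], ["text", "mine", "text"]]

# Wide table: one column per distinct token, one row per document.
vocabulary = sorted({token for doc in docs for token in doc})
wide_table = []
for doc in docs:
    counts = Counter(doc)
    wide_table.append([counts[term] for term in vocabulary])

# Kept text: one string per document, which is the shape asked for in the question.
kept_text = [" ".join(doc) for doc in docs]

print(vocabulary)
for row in wide_table:
    print(row)
print(kept_text)
```

The wide table is what a learner needs; the kept text is what you want when the goal is simply to export the stemmed sentences row by row.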
  • NoorKhalifa
    @BalazsBarany

    Great, that solved it. But now, when I use Write CSV, I don't get Arabic text in the output CSV file.

    I set the encoding to UTF-8 for Read CSV, for Write CSV, and for the process itself (in the parameters shown after clicking the white canvas).

    What can I do to solve that?


  • BalazsBarany
    Hi!

    Try opening the file in software that lets you set the import encoding. Excel does not reliably detect the encoding when you simply open a CSV file. Excel's Import function should also work, since it gives you a dialog for selecting the encoding.

    The encoding of text files is not obvious to most software. It often needs to be specified manually. You can use an advanced editor (GVim, Notepad++ etc.) to determine if the file itself is really in UTF-8.

    Regards,
    Balázs
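
Whether the file itself is really UTF-8 can also be checked with a few lines of Python instead of an editor. This is a minimal sketch under stated assumptions: `describe_encoding` and `add_bom` are hypothetical helper names, and the workaround relies on Excel typically recognizing UTF-8 when a CSV file starts with a byte order mark (BOM), which may depend on your Excel version.

```python
import codecs
from pathlib import Path


def describe_encoding(path):
    """Return (is_valid_utf8, has_bom) for the file at `path`."""
    raw = Path(path).read_bytes()
    has_bom = raw.startswith(codecs.BOM_UTF8)
    try:
        raw.decode("utf-8")
        is_valid_utf8 = True
    except UnicodeDecodeError:
        is_valid_utf8 = False
    return is_valid_utf8, has_bom


def add_bom(src, dst):
    """Copy a UTF-8 text file to `dst`, writing it with a leading BOM."""
    # "utf-8-sig" skips a BOM on reading if one is present and always writes one.
    text = Path(src).read_text(encoding="utf-8-sig")
    Path(dst).write_text(text, encoding="utf-8-sig")
```

If `describe_encoding` reports valid UTF-8 without a BOM, the file written by RapidMiner is probably fine and the problem lies in how it is opened; `add_bom("stemmed.csv", "stemmed_bom.csv")` (both file names are placeholders) then produces a copy that Excel is more likely to display correctly.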
  • jwpfau
    jwpfau New Altair Community Member
    Hi Noor,

    Excel seems to have moved the CSV import to Data → From Text/CSV.



    Greetings,
    Jonas
  • NoorKhalifa
    @jwpfau

    Hello!

    After clicking From Text/CSV, what should I do?


  • jwpfau
    Hi Noor, 

    For me, the first dialog was the "Import Data" file selector, and the second was the CSV table preview shown in my screenshot.

    I fear Excel's autodetection failed completely for your file. Is there anything in the "Open As" menu that says CSV or UTF-8?

    Greetings,
    Jonas
  • NoorKhalifa
    @jwpfau

    I didn't manage to do that in Excel, but opening the file in Notepad showed the Arabic text.

    Thanks!
  • jwpfau
    Hi Noor,

    You can force CSV parsing here.



    But you will stay in the more cumbersome Power Query Editor flow afterwards.

    Greetings,
    Jonas