
Arabic Light Stemming a CSV file

User: "NoorKhalifa"
New Altair Community Member
Updated by Jocelyn
I have a CSV file with around 4000 rows of text. I want to use the Arabic Light Stemmer to stem each record.

I have done the following but the text is not being stemmed. The output is the same as the input.


and inside the Process

    User: "BalazsBaranyRM"
    New Altair Community Member
    Hi!

    To stem words, first you need words. Use Tokenize before Stem to split the text into words.

    Regards,

    Balázs
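
To illustrate why tokenization has to come first, here is a minimal Python sketch of the same two steps outside RapidMiner. It is not the Arabic Light Stemmer that RapidMiner ships: the affix lists `PREFIXES` and `SUFFIXES` and the helper names are hypothetical and deliberately tiny, and the text is assumed to be written without diacritics.

```python
import re

# Hypothetical, very small affix lists for illustration only.
# A real Arabic light stemmer uses larger, carefully chosen lists.
PREFIXES = ["وال", "بال", "كال", "فال", "ال", "و"]
SUFFIXES = ["ات", "ون", "ين", "ها", "ة"]


def tokenize(text):
    """Split text into word tokens (runs of Unicode word characters)."""
    return re.findall(r"\w+", text)


def light_stem(word):
    """Strip at most one prefix and one suffix, keeping at least 3 letters."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= 3:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[:-len(suffix)]
            break
    return word


def stem_text(text):
    """Tokenize first, then stem each token, then rejoin with spaces."""
    return " ".join(light_stem(token) for token in tokenize(text))


if __name__ == "__main__":
    print(stem_text("المعلمون في المدرسة"))
```

The stemmer only ever looks at the start and end of the string it is given, so it needs one word at a time; that is the role Tokenize plays before Stem.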
    User: "NoorKhalifa"
    New Altair Community Member
    OP
    @BalazsBarany

    I did the following



    inside the Process, but the output is still exactly the same as the input.

    Is there a problem with reading Arabic text?

    I set the encoding to UTF-8 when I imported the CSV file. Is there anything else I should do?
    User: "BalazsBaranyRM"
    New Altair Community Member
    Hi,

    Put a breakpoint after Tokenize and play with the settings. If you see the words in different colors, the tokenization is working correctly.

    I have no idea about the conventions with Arabic text, maybe a different word separator is necessary etc.

    If the text looks normal to you in RapidMiner, then the encoding is correct. You would see that it is broken with a wrong encoding.

    Regards,

    Balázs
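
The point about encoding can be checked outside RapidMiner with a short Python sketch. The sample string and the choice of Latin-1 as the "wrong" encoding are assumptions made for illustration; any mismatched encoding shows the same kind of damage.

```python
# Sample Arabic text (an arbitrary example string).
text = "مرحبا بالعالم"

# The bytes as they would be stored in a UTF-8 file.
raw = text.encode("utf-8")

# Decoding with the right encoding gives the original text back.
ok = raw.decode("utf-8")

# Decoding with a wrong encoding does not fail here, but every byte
# becomes a separate Latin-1 character, so the result is unreadable.
broken = raw.decode("latin-1")

if __name__ == "__main__":
    print(ok)
    print(broken)
```

If the text in the results view looks like the first line rather than the second, the import encoding is not the problem.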
    User: "NoorKhalifa"
    New Altair Community Member
    OP
    @BalazsBarany

    I am facing this issue, what could be a possible reason?


    User: "BalazsBaranyRM"
    New Altair Community Member
    Hi,

    you need to use Nominal to Text before Process Documents in order to mark your nominal attributes as text (suitable for the Text Processing operators).

    Regards,
    Balázs
    User: "NoorKhalifa"
    New Altair Community Member
    OP
    @BalazsBarany

    When I put a break point after Stem, I can see the correctly stemmed sentence. But the final output in the Results is like the following. What can I do to fix this? I want the output to be rows of the stemmed sentence.


    User: "BalazsBaranyRM"
    New Altair Community Member
    Hi!

    Use the "keep text" option that all "Process Documents" operators have. 

    The default operation mode of Process Documents is to create the wide table suitable for machine learning methods. 

    Tokenization can split your text into letters, words or sentences. Stemming works on words, at least in Western languages. 

    Regards,
    Balázs
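
As a rough analogy for the two output shapes, the Python sketch below builds a wide table with one column per token and then a variant that also keeps the original text. It uses plain token counts and whitespace splitting; these, and the function names `wide_table` and `with_text`, are assumptions for illustration and not a description of how Process Documents computes its values.

```python
def wide_table(docs):
    """Return (vocabulary, rows) with one count column per distinct token."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({token for tokens in tokenized for token in tokens})
    rows = [[tokens.count(term) for term in vocab] for tokens in tokenized]
    return vocab, rows


def with_text(docs):
    """Same table as dicts, plus a 'text' field holding the original string.

    Assumes no token is literally named 'text'.
    """
    vocab, rows = wide_table(docs)
    return [{"text": doc, **dict(zip(vocab, row))} for doc, row in zip(docs, rows)]


if __name__ == "__main__":
    for record in with_text(["معلم في مدرس", "مدرس كبير"]):
        print(record)
```

Without the extra field, only the per-token columns remain, which matches the kind of output described in the question above.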
    User: "NoorKhalifa"
    New Altair Community Member
    OP
    @BalazsBarany

    Great, that solved it. But now, when I use Write CSV, I don't get Arabic text in the output CSV file.

    I set the encoding to UTF-8 for Read CSV, for Write CSV, and for the process itself (in the parameters shown when clicking on the white canvas).

    What can I do to solve that?


    User: "BalazsBaranyRM"
    New Altair Community Member
    Hi!

    Try opening the file in software that lets you set the import encoding. Excel is not very smart when you just open a CSV file directly. An Import function in Excel should also work, since it gives you a dialog for selecting the encoding.

    The encoding of text files is not obvious to most software. It often needs to be specified manually. You can use an advanced editor (GVim, Notepad++ etc.) to determine if the file itself is really in UTF-8.

    Regards,
    Balázs
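
If no advanced editor is at hand, the same check can be sketched in Python. The helper names below are hypothetical. The second helper writes a UTF-8 file with a byte order mark (Python's `utf-8-sig` codec); Excel generally recognizes that mark when a CSV is opened directly, but treat that as an assumption to verify with your Excel version.

```python
import csv


def is_utf8(path):
    """Return True if the whole file decodes as UTF-8 without errors."""
    with open(path, "rb") as handle:
        data = handle.read()
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def has_bom(path):
    """Return True if the file starts with the UTF-8 byte order mark."""
    with open(path, "rb") as handle:
        return handle.read(3) == b"\xef\xbb\xbf"


def write_csv_with_bom(path, rows):
    """Write rows as CSV in UTF-8 with a leading byte order mark."""
    with open(path, "w", encoding="utf-8-sig", newline="") as handle:
        csv.writer(handle).writerows(rows)
```

A file that passes `is_utf8` but shows garbled text in a viewer was most likely opened with the wrong encoding rather than written incorrectly.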
    User: "jwpfau"
    Altair Employee
    Hi Noor,

    Excel seems to have moved the CSV import to Data → From Text/CSV.



    Greetings,
    Jonas
    User: "NoorKhalifa"
    New Altair Community Member
    OP
    @jwpfau

    Hello!

    After clicking From Text/CSV, what should I do?


    User: "jwpfau"
    Altair Employee
    Hi Noor, 

    For me, the first dialog was the "Import Data" file selector, and the second one was the CSV table preview from my screenshot.

    I fear the Excel autodetection failed completely for your file. Is there anything in the "Open As" menu that says CSV or UTF-8?

    Greetings,
    Jonas
    User: "NoorKhalifa"
    New Altair Community Member
    OP
    @jwpfau

    I didn't manage to do that in Excel, but opening the file in Notepad showed the Arabic text correctly.

    Thanks!
    User: "jwpfau"
    Altair Employee
    Hi Noor,

    You can force CSV parsing here.



    But you will stay in the more cumbersome Power Query Editor flow afterwards.

    Greetings,
    Jonas