Arabic Light Stemming a CSV file
NoorKhalifa
I have a CSV file with around 4000 rows of text. I want to use the Arabic Light Stemmer to stem each record.
I have done the following but the text is not being stemmed. The output is the same as the input.
and inside the Process
BalazsBaranyRM
Hi!
To stem words, first you need words. Use Tokenize before Stem to split the text into words.
Regards,
Balázs
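As a rough illustration of why tokenization has to come before stemming, here is a minimal Python sketch outside RapidMiner. The functions `tokenize`, `light_stem`, and `stem_text` and the tiny affix lists are hypothetical, made up for this example; they are not the algorithm RapidMiner's Arabic light stemmer uses, and tokenizing on whitespace is an assumption that ignores punctuation.

```python
# Toy affix lists (assumption): real light stemmers use larger, tuned lists.
PREFIXES = ("وال", "بال", "ال")
SUFFIXES = ("ات", "ون", "ين")


def tokenize(text):
    """Split the text into words on whitespace (simplest possible tokenizer)."""
    return text.split()


def light_stem(word):
    """Strip at most one known prefix and one known suffix from a single word."""
    for prefix in PREFIXES:
        # Keep at least three letters so short words are not destroyed.
        if word.startswith(prefix) and len(word) - len(prefix) >= 3:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[:-len(suffix)]
            break
    return word


def stem_text(text):
    """Tokenize first, stem each word, then join the words back into one string."""
    return " ".join(light_stem(token) for token in tokenize(text))


print(stem_text("الكتاب الجديد"))
```

A stemmer of this kind only looks at the beginning and end of the string it receives, so passing a whole untokenized sentence to `light_stem` can affect the first and last word at most.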
NoorKhalifa
@BalazsBarany
I did the following
inside the Process, but the output is still exactly the same as the input.
Is there a problem with reading Arabic text?
I set the encoding to UTF-8 when I imported the CSV file. Is there anything else I should do?
BalazsBaranyRM
Hi,
Put a breakpoint after Tokenize and play with the settings. If you see the words in different colors, the tokenization is working correctly.
I have no idea about the conventions for Arabic text; maybe a different word separator is necessary, for example.
If the text looks normal to you in RapidMiner, then the encoding is correct. With a wrong encoding, you would see that the text is broken.
Regards,
Balázs
NoorKhalifa
@BalazsBarany
I am facing this issue, what could be a possible reason?
BalazsBaranyRM
Hi,
You need to use Nominal to Text before Process Documents in order to mark your nominal attributes as text (suitable for the Text Processing operators).
Regards,
Balázs
NoorKhalifa
@BalazsBarany
When I put a breakpoint after Stem, I can see the correctly stemmed sentence. But the final output in the Results is like the following. What can I do to fix this? I want the output to be rows of stemmed sentences.
BalazsBaranyRM
Hi!
Use the "keep text" option that all "Process Documents" operators have.
The default operation mode of Process Documents is to create the wide table suitable for machine learning methods.
Tokenization can split your text into letters, words or sentences. Stemming works on words, at least in Western languages.
Regards,
Balázs
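To make the difference between the two output shapes concrete, here is a small Python sketch outside RapidMiner. The sample documents and variable names are made up, and plain word counts are an assumption used for simplicity; the actual weighting in Process Documents depends on its vector creation setting.

```python
# Two made-up documents standing in for rows of the input table.
docs = ["red apple", "green apple tree"]

# Tokenize each document into words.
tokenized = [doc.split() for doc in docs]

# Wide table: one column per distinct word, one row per document.
vocabulary = sorted({token for tokens in tokenized for token in tokens})
wide_table = [[tokens.count(word) for word in vocabulary] for tokens in tokenized]

# "Keep text" analogue: the processed tokens joined back into one string per row.
kept_text = [" ".join(tokens) for tokens in tokenized]

print(vocabulary)
print(wide_table)
print(kept_text)
```

The wide table is what a learner needs, while the joined text column is what you want when the goal is one processed sentence per row.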
NoorKhalifa
@BalazsBarany
Great, that solved it. But now, when I use Write CSV, I don't get Arabic text in the output CSV file.
I set the encoding to UTF-8 for Read CSV, Write CSV, and the process (the setting that appears when clicking on the white canvas).
What can I do to solve that?
BalazsBaranyRM
Hi!
Try using software in which you can set the import encoding. Excel is not very smart when just opening a CSV file. An Import function should also work in Excel; it gives you a dialog for selecting the encoding.
The encoding of text files is not obvious to most software. It often needs to be specified manually. You can use an advanced editor (GVim, Notepad++ etc.) to determine if the file itself is really in UTF-8.
Regards,
Balázs
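If no advanced editor is at hand, the same check can be sketched in a few lines of Python. The helper names below are hypothetical. The sketch tests whether raw bytes decode as UTF-8 and whether they start with a byte order mark (BOM); prepending a BOM is a commonly used workaround because Excel typically recognizes UTF-8 CSV files that start with one, but treat that as an assumption to verify with your Excel version.

```python
import codecs


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def has_utf8_bom(data: bytes) -> bool:
    """Return True if the bytes start with the UTF-8 byte order mark."""
    return data.startswith(codecs.BOM_UTF8)


def add_bom(data: bytes) -> bytes:
    """Prepend the UTF-8 BOM unless it is already present."""
    return data if has_utf8_bom(data) else codecs.BOM_UTF8 + data


# In-memory stand-in for the bytes of an exported CSV file.
sample = "نص,text\n".encode("utf-8")
print(is_valid_utf8(sample), has_utf8_bom(sample))
```

For a real file, read it with `open(path, "rb").read()`, pass the bytes to these helpers, and write the result of `add_bom` back in binary mode.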
jwpfau
Hi Noor,
Excel seems to have moved the CSV import to Data → From Text/CSV.
Greetings,
Jonas
NoorKhalifa
@jwpfau
Hello!
After clicking From Text/CSV, what should I do?
jwpfau
Hi Noor,
For me, the first dialog was the "Import Data" file selector, and the second one was the CSV table preview from my screenshot.
I fear the Excel autodetection completely failed for your file. Is there anything in the "Open As" menu that says CSV or UTF-8?
Greetings,
Jonas
NoorKhalifa
@jwpfau
I didn't manage to do that in Excel, but opening the file in Notepad showed me the Arabic text.
Thanks!
jwpfau
Hi Noor,
You can force CSV parsing here.
But you will stay in the more cumbersome Power Query Editor flow afterwards.
Greetings,
Jonas