Arabic words recognition

shk721 · June 2008

I was wondering if someone could solve an encoding problem with the Arabic language. Basically, by choosing the right encoding format in the content_encoding parameter, the system displays Arabic words correctly in the result view. However, three problems have arisen:

1. When I apply a model, the message viewer displays the words as "?????".
2. The word list produced also consists of question marks instead of words.
3. When I try to use StopWordFilter, the system isn't able to match Arabic words in order to filter them.

Thanks in advance,
Hassan
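
For readers hitting the same symptom, here is a minimal Java sketch of one common way "?????" can arise: encoding Arabic text with a charset that has no Arabic characters. It is only an illustration of the general mechanism, assuming nothing about RapidMiner's internals; the class name QuestionMarkDemo and the method roundTrip are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of RapidMiner.
class QuestionMarkDemo {

    // Encode the text to bytes with the given charset, then decode it again.
    // String.getBytes replaces characters the charset cannot represent
    // with that charset's replacement byte, which is '?' for ISO-8859-1.
    static String roundTrip(String text, Charset charset) {
        byte[] bytes = text.getBytes(charset);
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // Five Arabic letters, written as Unicode escapes so the source
        // file's own encoding cannot interfere with the demonstration.
        String word = "\u0627\u0644\u0631\u0647\u0646";

        // ISO-8859-1 has no Arabic letters: every letter becomes '?'.
        System.out.println(roundTrip(word, StandardCharsets.ISO_8859_1));

        // UTF-8 can represent the word, so it survives unchanged
        // (whether the console can display it is a separate question).
        System.out.println(roundTrip(word, StandardCharsets.UTF_8).equals(word));
    }
}
```

Once a character has been replaced by '?', the original letter cannot be recovered from the output, which would be consistent with input looking fine while the written results do not.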

IngoRM · June 2008

Hello Hassan,

Did you also try defining the encoding in the main process operator (root)? Maybe this helps.

About the stop words: RM currently does not support a stop word filter for Arabic words, but you could simply create one with the file-based stop word filter (I don't remember the exact name right now).

Cheers,
Ingo
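
The idea of a file-based stop word filter can be sketched in plain Java as follows. This is not the RapidMiner operator (whose exact name is not given above); the class FileStopWordSketch, its methods, and the one-word-per-line UTF-8 file format are assumptions made for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch, not the RapidMiner implementation.
class FileStopWordSketch {

    // Read one stop word per line, decoding the file explicitly as UTF-8
    // and skipping blank lines.
    static Set<String> loadStopWords(Path file) throws IOException {
        Set<String> stopWords = new HashSet<>();
        for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
            String word = line.trim();
            if (!word.isEmpty()) {
                stopWords.add(word);
            }
        }
        return stopWords;
    }

    // Keep only the tokens that are not in the stop word set.
    static List<String> removeStopWords(List<String> tokens, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!stopWords.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) throws IOException {
        // Write a tiny stop word file containing two Arabic prepositions.
        Path file = Files.createTempFile("arabic-stopwords", ".txt");
        Files.write(file, Arrays.asList("\u0641\u064A", "\u0639\u0646"), StandardCharsets.UTF_8);

        Set<String> stopWords = loadStopWords(file);
        // Two tokens: a preposition (a stop word here) followed by a noun.
        List<String> tokens = Arrays.asList("\u0641\u064A", "\u0627\u0644\u062F\u0648\u0644");
        System.out.println(removeStopWords(tokens, stopWords).size()); // only the noun remains
    }
}
```

Matching only works if the stop word file and the tokens were decoded with the correct encoding; a word that has already turned into "?" will never match an entry in the list.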

shk721 · June 2008

Hi Ingo,

Thanks for your prompt response.

Actually, I have defined the encoding in the root process and in the preferences, and it didn't work. However, I want to know if there is an enhancement of the output encoding in RapidMiner because, as I said in the beginning, the reading of the input data was perfect.

I am looking for your help to resolve this problem.

Cheers,
Hassan

IngoRM · July 2008

Hello Hassan,

Hmm, that's sort of weird. I must admit that we do not have any experience with Arabic characters, but we know that the output should also work for Chinese characters, so I assume there is no fundamental problem with this. Could you provide us with some texts so we can try to find out what's going on?

Thanks and cheers,
Ingo

shk721 · July 2008

Hi Ingo,

I have been waiting for your response.

Here is a sample of Arabic text:

ان الرهن العقاري ذا الأصول الإسلامية، عُمل فيه بطرق موسعة وناجحة بكل المقاييس، في الدول الأجنبية، ونقل هذا النظام عن طريق باحثين تخصصوا في الرهن العقاري، إلى دول إسلامية مثل ماليزيا، وسنغافورة، وكذلك البحرين ودبي.

I appreciate your response, and I really need to sort this out. Also, to keep you informed about the problem: it occurs when the program writes its output. It appears to write directly with the default encoding rather than with the specified encoding.

I am eagerly waiting to hear from you, because I need to sort this out to start my dissertation.

Cheers,
Hassan
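
The difference between writing with the platform default encoding and writing with a specified encoding can be sketched in Java like this. It is a generic illustration under stated assumptions, not RapidMiner code; the class ExplicitEncodingWriter and its methods are hypothetical names.

```java
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch, not taken from RapidMiner.
class ExplicitEncodingWriter {

    // Write text using the charset passed in, never the platform default.
    static void write(Path file, String text, Charset charset) throws IOException {
        try (Writer out = new OutputStreamWriter(Files.newOutputStream(file), charset)) {
            out.write(text);
        }
    }

    // Read the whole file back, decoding with the given charset.
    static String read(Path file, Charset charset) throws IOException {
        return new String(Files.readAllBytes(file), charset);
    }

    public static void main(String[] args) throws IOException {
        // A writer created without a charset falls back to this value,
        // which depends on the JVM version and platform settings.
        System.out.println("Platform default charset: " + Charset.defaultCharset());

        String word = "\u0627\u0644\u0631\u0647\u0646"; // five Arabic letters
        Path file = Files.createTempFile("arabic-output", ".txt");

        write(file, word, StandardCharsets.UTF_8);
        System.out.println(read(file, StandardCharsets.UTF_8).equals(word));
    }
}
```

If the printed default charset cannot represent Arabic, any code path that writes without naming a charset would produce question marks even though the input was read correctly.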

IngoRM · July 2008

Hello,

I must admit that I was not even able to work properly with the test sample since I had no program available which was able to display it. I wanted to create a small data file containing some of the words, together with an .aml file describing the data, in order to work with that, but I didn't manage to create those files - at least I was not able to see anything, and I assume I lost the information about the characters somewhere in this process.

My suggestion: please create a data file together with an .aml file which I can load directly with the ExampleSource operator. Please also specify the encoding in the .aml file and attach both files here, together with the information about the correct encoding. Maybe then I will be able to sort out what's happening in the output.

Cheers,
Ingo

kochan · July 2009

I just came across this thread, which is one year old. I have had the same problem with languages written in non-Latin scripts. I resolved the problems that I had in the development version by fixing the text plugin code as I have described here:

https://sourceforge.net/tracker/?func=detail&aid=2724678&group_id=131810&atid=722307

If you need to save a modelfile as something other than binary, then changes also have to be made to the ModelWriter operator.

Regards,

Andreas

land · July 2009

Hi Andreas,
Thank you for this hint. Since we do not deal with non-Latin text in our usual daily work, we weren't fully aware of this. But we will keep this in mind while revising the text plugin for the next major version of RapidMiner.

Greetings,
Sebastian

drstevekramer · April 2010

Andreas and Sebastian, is it possible to use the getContentEncoding approach with the separate wvtool Java library, which I am using (rather than the Text plug-in)? I am having the same problem of Arabic text being displayed as question marks, even though I specified utf-8 as the encoding when creating the WVTDocumentInfo.

Thanks,
Steve
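
Without assuming anything about wvtool's internals, one way to narrow this kind of problem down is to check whether the strings really contain question marks (U+003F) or whether the Arabic letters are intact and only the display is failing. The helper below is a hypothetical debugging sketch in plain Java; the class CodePointDump is not part of wvtool or RapidMiner.

```java
// Hypothetical debugging helper, independent of wvtool and RapidMiner.
class CodePointDump {

    // Render each code point of the text as U+XXXX, separated by spaces,
    // so the content can be inspected without relying on fonts or consoles.
    static String dump(String text) {
        StringBuilder sb = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // Intact Arabic letters show up in the U+06xx range.
        System.out.println(dump("\u0627\u0644"));
        // Text that was already damaged shows up as U+003F.
        System.out.println(dump("??"));
    }
}
```

If the dump shows U+06xx values, the data survived and the question marks come from the console or viewer; if it shows U+003F, the characters were lost earlier, at a read or write step.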
