Arabic words recognition
shk721
New Altair Community Member
I was wondering if someone could solve the encoding problem for the Arabic language. Basically, by choosing the right encoding format in the content_encoding parameter, the system displays the Arabic words correctly in the result view. However, the following problems arose:
1. The message viewer displays the words as "?????" when I apply a model.
2. The word list produced also consists of question marks instead of words.
3. When I try to use StopWordFilter, the system isn't able to match the Arabic words it should filter.
Thanks in advance;
Hassan
Answers
Hello Hassan,
Did you also try to define the encoding in the main process operator (root)? Maybe this helps.
About the stop words: RapidMiner currently does not support a stop word filter for Arabic words, but you could simply create one with the file-based stop word filter (I don't remember the exact name right now).
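The idea behind a file-based stop word filter can be sketched in plain Java. This is only an illustration under stated assumptions, not the RapidMiner operator: the class name `ArabicStopWordFilter`, its methods, and the temporary stop word file are hypothetical. The key point is that the stop word file is decoded with an explicit charset, so the words in it can match the tokens. The Arabic words are written as Unicode escapes so that the encoding of the source file itself does not matter.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ArabicStopWordFilter {

    /** Loads one stop word per line, decoding the file explicitly as UTF-8. */
    public static Set<String> loadStopWords(Path file) throws IOException {
        Set<String> stopWords = new HashSet<>();
        for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
            String word = line.trim();
            if (!word.isEmpty()) {
                stopWords.add(word);
            }
        }
        return stopWords;
    }

    /** Returns the tokens that are not in the stop word set, in their original order. */
    public static List<String> filter(List<String> tokens, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!stopWords.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical stop word file, created here only to keep the demo self-contained.
        Path file = Files.createTempFile("arabic-stopwords", ".txt");
        List<String> words = List.of(
                "\u0641\u064A",        // "in"
                "\u0639\u0646",        // "about"
                "\u0647\u0630\u0627"); // "this"
        Files.write(file, words, StandardCharsets.UTF_8);

        Set<String> stopWords = loadStopWords(file);
        List<String> tokens = List.of(
                "\u0648\u0646\u0642\u0644",              // "and transferred"
                "\u0647\u0630\u0627",                    // "this" (stop word)
                "\u0627\u0644\u0646\u0638\u0627\u0645",  // "the system"
                "\u0639\u0646");                         // "about" (stop word)
        // Two of the four tokens remain after filtering.
        System.out.println(filter(tokens, stopWords).size());
        Files.delete(file);
    }
}
```

If the file were decoded with the wrong charset, the loaded strings would differ from the tokens and nothing would be filtered, which matches the symptom described in the question.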
Cheers,
Ingo
Hi Ingo;
Thanks for your prompt response.
Actually, I have defined the encoding in the root process and in the preferences, and it didn't work. However, I want to know whether there is an enhancement of the output encoding in RapidMiner because, as I said in the beginning, reading the input data worked perfectly.
I am looking for your help to resolve this problem.
Cheers;
Hassan
Hello Hassan,
Hmm, that's sort of weird. I must admit that we do not have any experience with Arabic characters, but we know that the output should also work for Chinese characters, so I assume there is no fundamental problem with this. Could you provide us with some texts so we can try to find out what's going on?
Thanks and cheers,
Ingo
Hi Ingo;
I have been waiting for your response.
Here is a sample of Arabic text:
ان الرهن العقاري ذا الأصول الإسلامية، عُمل فيه بطرق موسعة وناجحة بكل المقاييس، في الدول الأجنبية، ونقل هذا النظام عن طريق باحثين تخصصوا في الرهن العقاري، إلى دول إسلامية مثل ماليزيا، وسنغافورة، وكذلك البحرين ودبي.
I appreciate your response, and I really need to sort this out. Also, to keep you informed about the problem: it occurs when the program writes its output. It looks as if the output is written directly with the default encoding, not with the specified encoding.
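A small Java sketch shows how this kind of symptom can arise. It assumes, without confirmation from the RapidMiner source, that the output path encodes text with a charset that has no Arabic letters. The class name `EncodingRoundTrip` is hypothetical. `String.getBytes(Charset)` replaces every character the charset cannot represent with that charset's replacement byte, which for ISO-8859-1 is a question mark.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingRoundTrip {

    /** Encodes the text to bytes with the given charset and decodes it again. */
    public static String roundTrip(String text, Charset charset) {
        byte[] bytes = text.getBytes(charset);
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // Five Arabic letters, written as escapes so the source file encoding is irrelevant.
        String word = "\u0627\u0644\u0631\u0647\u0646";

        // The charset used when none is specified depends on the JVM and the platform.
        System.out.println("Default charset: " + Charset.defaultCharset());

        // UTF-8 can represent Arabic, so the round trip is lossless.
        System.out.println(roundTrip(word, StandardCharsets.UTF_8).equals(word)); // true

        // ISO-8859-1 has no Arabic letters, so each one becomes '?'.
        System.out.println(roundTrip(word, StandardCharsets.ISO_8859_1)); // ?????
    }
}
```

Once the characters have been replaced this way, the original text cannot be recovered from the output, so the encoding has to be set correctly at the point where the text is written.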
I am eagerly waiting to hear from you, because I need to sort this out before I can start my dissertation.
Cheers;
Hassan
Hello,
I must admit that I was not even able to work properly with the test sample, since I had no program available that was able to display it. I wanted to create a small data file containing some of the words, together with an .aml file describing the data, in order to work with that, but I didn't manage to create those files. At least I was not able to see anything, and I assume I lost the information about the characters somewhere in this process.
My suggestion: please create a data file together with an .aml file which I can load directly with the ExampleSource operator. Please also specify the encoding in the .aml file and attach both files here, together with the information about the correct encoding. Maybe then I will be able to sort out what's happening in the output.
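One way to produce such a data file without losing the characters is to write it from Java with an explicit charset, rather than through an editor whose encoding is unknown. The sketch below covers only the data file, not the .aml description. The class name `Utf8DataFile` and the temporary file name are hypothetical.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class Utf8DataFile {

    /** Writes one value per line, encoding the file explicitly as UTF-8. */
    public static void write(Path file, List<String> rows) throws IOException {
        try (BufferedWriter out = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
            for (String row : rows) {
                out.write(row);
                out.newLine();
            }
        }
    }

    /** Reads the file back, decoding it explicitly as UTF-8. */
    public static List<String> read(Path file) throws IOException {
        return Files.readAllLines(file, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical file name; any path works.
        Path file = Files.createTempFile("arabic-sample", ".dat");
        List<String> rows = List.of(
                "\u0627\u0644\u0631\u0647\u0646",        // first sample word
                "\u0627\u0644\u0646\u0638\u0627\u0645"); // second sample word
        write(file, rows);
        // The rows read back are identical to the rows written.
        System.out.println(read(file).equals(rows));
        Files.delete(file);
    }
}
```

Because the same charset is named on both the write and the read side, the file content does not depend on the platform default encoding.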
Cheers,
Ingo
I just came across this thread, which is one year old. I have had the same problem with languages written in non-Latin scripts. I resolved the problems that I had in the development version by fixing the text plugin code, as I have described here:
https://sourceforge.net/tracker/?func=detail&aid=2724678&group_id=131810&atid=722307
If you need to save a model file as something other than binary, then changes also have to be made to the ModelWriter operator.
Regards,
Andreas
Hi Andreas,
Thank you for this hint. Since we do not deal with non-Latin text in our usual daily work, we weren't fully aware of this. But we will keep this in mind while revising the text plugin for the next major version of RapidMiner.
Greetings,
Sebastian
Andreas and Sebastian, is it possible to use the getContentEncoding approach with the separate wvtool Java library, which I am using (rather than the Text plug-in)? I am having the same problem of Arabic text being displayed as question marks, even though I specified utf-8 as the encoding when creating the WVTDocumentInfo.
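For what it's worth, one thing that can be checked independently of wvtool is whether the text is lost on input or only when it is printed. The sketch below does not use any wvtool API. It only wraps an output stream in a `PrintStream` with an explicit UTF-8 encoding, on the assumption that the question marks might come from the display side. The class name `Utf8Output` is hypothetical.

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class Utf8Output {

    /** Wraps a byte stream in a PrintStream that encodes characters as UTF-8. */
    public static PrintStream utf8Stream(OutputStream target) throws UnsupportedEncodingException {
        // Second argument enables auto-flush; third names the charset explicitly.
        return new PrintStream(target, true, "UTF-8");
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // Five Arabic letters as escapes, so the source file encoding is irrelevant.
        String word = "\u0627\u0644\u0631\u0647\u0646";

        // Print through a UTF-8 stream instead of relying on the platform default.
        PrintStream out = utf8Stream(System.out);
        out.println(word);
    }
}
```

If the word still appears as question marks with this stream, the terminal or viewer itself may not support UTF-8 or Arabic glyphs, or the characters were already replaced earlier in the pipeline.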
Thanks,
Steve