Web Mining for Other Languages
Hi!
I have some Japanese pages in my analysis, and I noticed that the "Extract Content" operator doesn't cope well with text in encodings other than UTF-8. Is there any way to change how it handles encoding?
(EDIT: While this question is pending, I'm going to try the opposite approach: removing all HTML tags and the <script> regions. It seems to work okay so far.)
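For anyone wanting to try the same workaround outside the operator, here is a rough sketch in Python using only the standard library. It assumes you already know the page's encoding (Shift_JIS is just an example here); the names `TextOnly` and `html_to_text` are made up for illustration and are not part of any tool mentioned in this thread.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collects visible text, skipping <script> and <style> regions."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside <script>/<style>.
        if self.skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(raw_bytes, encoding):
    """Decode raw bytes with the given encoding, then drop tags and scripts."""
    html = raw_bytes.decode(encoding, errors="replace")
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


# Example input: a tiny page encoded as Shift_JIS (an assumption for this demo).
page = (
    "<html><body><script>var x = 1;</script>"
    "<p>日本語のページ</p></body></html>"
).encode("shift_jis")
print(html_to_text(page, "shift_jis"))
```

The key point is decoding the bytes with the page's actual encoding before doing any tag stripping; if the encoding is guessed wrong, the text will come out garbled no matter how the tags are removed.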

I agree. My colleagues and I plan to conduct comparative studies on English and Chinese online newspapers. Would you have anything in the pipeline for Cantonese and Mandarin character recognition? Thanks!
Hi there,
I know people who do this sort of thing. They say that one problem with Chinese is tokenising the sentences, as there are no spaces to separate the words. Check this out: http://www.foreverastudent.com/2012/03/chinese-word-frequency-list-news.html
It is possible, but not easy!
Good luck.
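To give a feel for the tokenising problem, here is a toy sketch of dictionary-based "forward maximum matching", one simple way of splitting text that has no spaces. The function name `max_match` and the four-word vocabulary are made up for illustration; a real study would need a large word list or a statistical segmenter, and this is not what any particular tool does.

```python
def max_match(text, vocab, max_len=4):
    """Greedy left-to-right segmentation: at each position take the longest
    dictionary word that matches, falling back to a single character."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical mini-dictionary, just for the demo.
vocab = {"我们", "喜欢", "阅读", "新闻"}
print(max_match("我们喜欢阅读新闻", vocab))
# → ['我们', '喜欢', '阅读', '新闻']
```

Characters that are not in the dictionary simply come out as single-character tokens, which is one reason the quality of the word list matters so much.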