Web Mining for Other Languages

User: "krsnewwave"
New Altair Community Member
Updated by Jocelyn
Hi!

I have some Japanese pages in my analysis, and I noticed that the "Extract Content" operator doesn't handle text in encodings other than UTF-8 well. Is there any way to change the encoding it uses?

(EDIT: While this question is pending, I'm going to try the complement: removing all HTML tags and the <script> regions. It seems to work okay so far.)
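
For anyone who wants to try the same workaround outside RapidMiner, here is a minimal Python sketch of the idea: decode the page bytes with an explicitly chosen encoding, then drop the tags and the <script> regions. It uses only the standard library. The names `TextExtractor` and `html_bytes_to_text`, the sample page, and the choice of Shift_JIS are assumptions made for illustration; this is not how "Extract Content" works internally, and skipping <style> as well is my own addition.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> regions."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped region.
        if self.skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_bytes_to_text(raw: bytes, encoding: str) -> str:
    # Decode explicitly instead of assuming UTF-8;
    # bytes that cannot be decoded become U+FFFD.
    html = raw.decode(encoding, errors="replace")
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


# Hypothetical Japanese page stored as Shift_JIS bytes.
sample = (
    "<html><head><script>var x = 1;</script></head>"
    "<body><p>こんにちは</p><p>世界</p></body></html>"
).encode("shift_jis")

print(html_bytes_to_text(sample, "shift_jis"))
```

The encoding still has to come from somewhere, for example the HTTP Content-Type header or the page's <meta charset> declaration; the sketch simply takes it as an argument.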

    User: "restuar"
    New Altair Community Member
    I agree. My colleagues and I plan to conduct comparative studies on English and Chinese online newspapers. Would you have anything in the pipeline for Cantonese and Mandarin character recognition? Thanks!
    User: "haddock"
    New Altair Community Member
    Hi there,

    I know people who do this sort of thing. They say that one problem with Chinese is tokenising the sentences, as there are no spaces to separate the words. Check this out: http://www.foreverastudent.com/2012/03/chinese-word-frequency-list-news.html

    It is possible, but not easy!

    Good luck.
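
To make the tokenising problem concrete, here is a small Python sketch of one simple dictionary-based approach, forward maximum matching: at each position, take the longest word found in a vocabulary, and fall back to a single character when nothing matches. The function name `forward_max_match`, the `max_len` parameter, and the toy vocabulary are all made up for illustration. Production segmenters typically rely on much larger dictionaries and statistical models, so treat this only as a sketch of why the vocabulary matters.

```python
def forward_max_match(text, vocabulary, max_len=4):
    """Greedy left-to-right segmentation using a word list."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a
        # vocabulary word matches; a single character always matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in vocabulary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Toy vocabulary (hypothetical); a real one would hold many thousands of words.
toy_vocabulary = {"我们", "比较", "中文", "英文", "报纸"}

tokens = forward_max_match("我们比较中文报纸", toy_vocabulary)
print(tokens)
```

With an empty vocabulary the same function degrades to one token per character, which is why a word frequency list like the one linked above is a useful starting point.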
    User: "restuar"
    New Altair Community Member
    Thank you. My teammates and I have agreed to use the English version of the website. My problem now is how to tell RapidMiner to use the translated version when it accesses the Chinese website. Would you know how to solve this problem?