Language filter to retain English only
JamieLim
I have documents that include English mixed with other languages. Can I filter them to retain only the English text, without going through every document to identify all the other languages I want to exclude?
Accepted answers
Telcontar120
In theory you could tokenize based on spaces, which would give you a set of "words" potentially in multiple languages. You could then use the filter token with dictionary operator to retain only those tokens that appear in a given language dictionary (which you would need to supply as a txt file). This would be a kind of crude language filter using only native RapidMiner operators, but I think the accuracy would not be as high as you would like, both because of ambiguous words and because of how potentially mixed-language texts would be handled.
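For readers who want to see the dictionary idea outside RapidMiner, here is a minimal Python sketch of the same logic. It is not the RapidMiner operator itself; the function names `load_dictionary` and `filter_by_dictionary` are made up for illustration, and the word list is assumed to be a plain txt file with one word per line.

```python
def load_dictionary(path):
    """Read a word list (assumed: one word per line, UTF-8) into a lowercase set."""
    with open(path, encoding="utf-8") as fh:
        return {line.strip().lower() for line in fh if line.strip()}


def filter_by_dictionary(text, dictionary):
    """Keep only whitespace-separated tokens whose bare form is in the dictionary."""
    kept = []
    for token in text.split():
        # Strip surrounding punctuation before the lookup, but keep the
        # original token (with its punctuation) in the output.
        bare = token.strip(".,;:!?\"'()").lower()
        if bare in dictionary:
            kept.append(token)
    return " ".join(kept)


# Tiny in-memory dictionary for demonstration; a real list would come
# from load_dictionary("english_words.txt") (hypothetical file name).
demo_dictionary = {"hello", "world"}
print(filter_by_dictionary("Hello monde, world!", demo_dictionary))
```

As noted above, words that exist in several languages will pass the filter regardless of which language the surrounding text is in.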
JamieLim
I ended up using Python to split the paragraphs into sentences, then separated the English sentences from the non-English ones, which gave a pretty good filter. I then passed the filtered text into RapidMiner and tokenized it; a few non-English words were still left, and I removed these by adding them to a stopwords dictionary.
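The post does not say how the English sentences were identified, so the following is only a rough sketch of the sentence-level approach under stated assumptions: sentences are split naively on `.`, `!` and `?`, and a sentence counts as English when enough of its words are common English function words. The word set `ENGLISH_HINTS`, the `threshold` value and all function names are illustrative, not taken from the original workflow.

```python
import re

# Small illustrative set of common English function words (an assumption;
# a practical list would be much longer).
ENGLISH_HINTS = {
    "the", "a", "an", "and", "or", "of", "to", "in", "is", "are", "was",
    "it", "that", "this", "for", "with", "on", "i", "you", "we",
}


def split_sentences(text):
    """Naively split after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def looks_english(sentence, threshold=0.2):
    """Heuristic: share of words that are common English function words."""
    words = re.findall(r"[a-z']+", sentence.lower())
    if not words:
        return False  # e.g. a sentence written entirely in a non-Latin script
    hits = sum(1 for w in words if w in ENGLISH_HINTS)
    return hits / len(words) >= threshold


def keep_english(text):
    """Return only the sentences that pass the English heuristic."""
    return " ".join(s for s in split_sentences(text) if looks_english(s))


sample = "This is a test of the filter. Das Wetter heute sehr schön. We like it."
print(keep_english(sample))
```

A dedicated language-identification library would likely be more accurate than this heuristic; the sketch only shows the split-then-filter structure described in the post.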
All comments
sgenzer
Ah, interesting question. The short answer is "not easily".
In my mind you have two options:
- Manually classify a set of documents and train a ML model to discriminate between them, then apply the model on all new documents.
- Use an external API such as Google Translate or AWS Translate to do this for you
Scott
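To make the first option a little more concrete, here is a toy sketch of training a classifier on manually labelled text and applying it to new text. It uses a hand-rolled character-trigram Naive Bayes in plain Python purely for illustration; the class name `NgramNaiveBayes`, the labels `"en"` and `"other"`, and the six training sentences are all made up, and a real project would need far more labelled data and most likely an established ML library.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Character n-grams of the lowercased text, padded with spaces."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramNaiveBayes:
    """Toy multinomial Naive Bayes over character trigrams."""

    def fit(self, texts, labels):
        self.counts = defaultdict(Counter)   # label -> trigram counts
        self.doc_counts = Counter(labels)    # label -> number of documents
        for text, label in zip(texts, labels):
            self.counts[label].update(char_ngrams(text))
        self.vocab = {g for c in self.counts.values() for g in c}
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, grams in self.counts.items():
            total = sum(grams.values())
            # Log prior plus add-one smoothed log likelihoods.
            score = math.log(self.doc_counts[label] / total_docs)
            for g in char_ngrams(text):
                score += math.log((grams[g] + 1) / (total + len(self.vocab) + 1))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical hand-labelled training set.
train_texts = [
    "the cat sat on the mat",
    "this is the english text",
    "we are going to the market",
    "le chat est sur le tapis",
    "der Hund ist im Garten",
    "el gato está en la casa",
]
train_labels = ["en", "en", "en", "other", "other", "other"]

model = NgramNaiveBayes().fit(train_texts, train_labels)
print(model.predict("the cat is on the mat"))
```

Grouping every non-English language under a single "other" label matches the original question, where the other languages are not known in advance.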
JamieLim
@sgenzer
What if we just retain alphanumeric characters and spaces in the text? Is there an easier way to achieve this?
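For reference, keeping only alphanumeric characters and spaces can be sketched in Python with a regular expression, assuming "alphanumeric" means ASCII letters and digits. The function name `keep_ascii_alnum` is illustrative.

```python
import re


def keep_ascii_alnum(text):
    """Drop everything except ASCII letters, digits and spaces."""
    cleaned = re.sub(r"[^A-Za-z0-9 ]+", "", text)
    # Collapse the runs of spaces left behind by removed characters.
    return re.sub(r" +", " ", cleaned).strip()


print(keep_ascii_alnum("Hello 世界 123 café"))
```

This removes text in non-Latin scripts, but it also strips accented letters from otherwise valid words and leaves untouched any non-English words written in plain ASCII, so on its own it is a character filter rather than a language filter.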
sgenzer
@JamieLim
To quote Euclid of Alexandria:
"There is no royal road to geometry."
Or in other words, sometimes there is no quick-and-dirty answer.
Scott