
Language filter to retain English only

JamieLim

I have documents that contain English mixed with other languages. Can I filter them to retain only the English text, without going through every document to identify each of the other languages I want to exclude?


Accepted answers

Telcontar120

In theory you could tokenize on spaces, which would give you a set of "words" potentially drawn from multiple languages. You could then use the filter token with dictionary operator to retain only those tokens that appear in a given language dictionary (which you would need to supply as a txt file). This would be a crude language filter built only from native RapidMiner operators, but I think the accuracy would not be as high as you would like, both because of ambiguous words and because of how mixed-language texts would have to be treated.
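As a rough illustration of the dictionary idea outside RapidMiner, here is a minimal Python sketch. The names `load_dictionary` and `filter_tokens`, the punctuation list, and the one-word-per-line file layout are assumptions made for this example; they are not RapidMiner operators.

```python
# Characters stripped from token edges before the dictionary lookup
# (an assumption for this sketch, not an exhaustive list).
PUNCTUATION = ".,;:!?\"'()[]"


def load_dictionary(path):
    """Read a word list from a text file, assuming one word per line."""
    with open(path, encoding="utf-8") as handle:
        return {line.strip().lower() for line in handle if line.strip()}


def filter_tokens(text, dictionary):
    """Split on whitespace and keep tokens whose normalized form is in the dictionary."""
    kept = []
    for token in text.split():
        key = token.strip(PUNCTUATION).lower()
        if key in dictionary:
            kept.append(token)
    return kept
```

Calling `filter_tokens("The cat est sur la table.", {"the", "cat", "table"})` keeps `The`, `cat`, and `table.`. Note that `table` is also a French word, which is exactly the ambiguity problem described above.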

JamieLim

I ended up using Python to split the paragraphs into sentences, then separated the English sentences from the non-English ones, which gave a pretty good filter. The filtered text was then passed into RapidMiner and tokenized. A few non-English words were still left, and I removed these by adding them to a stopwords dictionary.
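The post does not say which Python library or method was used. The following is only a sketch of the general approach, assuming a regex sentence splitter and a small function-word heuristic. `ENGLISH_HINTS`, `english_score`, `keep_english`, and the 0.2 threshold are all hypothetical choices for this example.

```python
import re

# Small hand-picked set of common English function words. This is an
# assumption for the sketch; a real filter would use a larger list or a
# dedicated language-identification library.
ENGLISH_HINTS = frozenset({
    "the", "a", "an", "and", "or", "of", "to", "in", "on", "is", "are",
    "was", "were", "it", "this", "that", "for", "with", "as", "at", "by",
    "be", "have", "has", "not", "i", "you", "we", "they",
})

# Split after sentence-ending punctuation followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")
# Runs of letters only (Unicode-aware, excludes digits and underscores).
_WORD = re.compile(r"[^\W\d_]+")


def split_sentences(paragraph):
    """Naive sentence splitter; abbreviations such as 'e.g.' will be split too."""
    return [s.strip() for s in _SENTENCE_END.split(paragraph) if s.strip()]


def english_score(sentence):
    """Fraction of words in the sentence that are common English function words."""
    words = [w.lower() for w in _WORD.findall(sentence)]
    if not words:
        return 0.0
    hits = sum(1 for w in words if w in ENGLISH_HINTS)
    return hits / len(words)


def keep_english(paragraph, threshold=0.2):
    """Return the sentences whose score reaches the threshold."""
    return [s for s in split_sentences(paragraph) if english_score(s) >= threshold]
```

Short words shared between languages (for example `a` and `on`, which are also French) can push a non-English sentence over the threshold, so the leftover words would still need the stopwords-dictionary cleanup described above.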

All comments

sgenzer

Ah, interesting question. The short answer is "not easily".

In my mind you have two options:

- Manually classify a set of documents and train a ML model to discriminate between them, then apply the model on all new documents.
- Use an external API such as Google Translate or Amazon Translate to do this for you.

Scott
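For the first option, a minimal sketch of a classifier trained on manually labeled text might look like the following. It uses character trigrams with a naive Bayes model written against the Python standard library only. The class name `NgramLanguageClassifier`, the label strings, and the add-one smoothing are assumptions for illustration, and a usable model would need far more labeled data than a handful of sentences.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Return overlapping character n-grams of the lowercased, space-padded text."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramLanguageClassifier:
    """Multinomial naive Bayes over character n-grams with add-one smoothing."""

    def __init__(self, n=3):
        self.n = n
        self.counts = defaultdict(Counter)  # label -> n-gram counts
        self.doc_counts = Counter()         # label -> number of training texts
        self.vocab = set()                  # all n-grams seen in training

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            grams = char_ngrams(text, self.n)
            self.counts[label].update(grams)
            self.doc_counts[label] += 1
            self.vocab.update(grams)
        return self

    def predict(self, text):
        grams = char_ngrams(text, self.n)
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, gram_counts in self.counts.items():
            total = sum(gram_counts.values())
            # Log prior plus smoothed log likelihood of each n-gram.
            score = math.log(self.doc_counts[label] / total_docs)
            for gram in grams:
                score += math.log((gram_counts[gram] + 1) / (total + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

After `fit` is called with labeled examples (say `"english"` and `"other"`), `predict` returns the label whose trigram statistics best match a new text. Documents predicted as `"other"` could then be dropped before tokenizing.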

JamieLim

@sgenzer What if we just retain alphanumeric characters and spaces in the text? Is there an easier way to achieve this?
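A minimal Python sketch of that idea, assuming "alphanumeric" means ASCII letters and digits; the function name `keep_ascii_alnum` is hypothetical:

```python
import re

# Anything that is not an ASCII letter, an ASCII digit, or a space.
_NON_ASCII_ALNUM = re.compile(r"[^A-Za-z0-9 ]+")
_MULTI_SPACE = re.compile(r" {2,}")


def keep_ascii_alnum(text):
    """Replace every other character with a space, then collapse repeated spaces."""
    cleaned = _NON_ASCII_ALNUM.sub(" ", text)
    return _MULTI_SPACE.sub(" ", cleaned).strip()
```

This removes text in non-Latin scripts, but it leaves other Latin-alphabet languages untouched (`"Bonjour le monde"` passes through unchanged) and it mangles accented words (`"café"` becomes `"caf"`), so on its own it is not a language filter.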


sgenzer

@JamieLim to quote Euclid of Alexandria:

There is no royal road to geometry.

or in other words, sometimes there is no quick-and-dirty answer.

Scott
