"Text Mining: How to split data according to language"

uteheinze
uteheinze New Altair Community Member
edited November 5 in Community Q&A
Hi there,

I am trying to split my text corpus by the language each text is written in, but I am stuck and would appreciate some help.
First, I classified the language of each text with a Naive Bayes based language detector, so I already know which texts are, e.g., German or English. Now I want to select only the German or only the English texts in order to analyze them separately, but I don't know which operators to use. I already tried the Filter Examples operator, but it looks like only the prediction labels for the languages are filtered and the corresponding texts are omitted.

Can anybody help?

Thanks in advance!!

Ute

Answers

  • ighyboo
    ighyboo New Altair Community Member
    Not sure how you structured your process, but maybe you can use the Apply Model operator to create a "language" label and then filter according to that label?

    Alternatively, you can use a language detection API through the "Enrich Data by Webservice" operator to create such an attribute. I personally used http://detectlanguage.com/ and it was very good and easy to implement.

    Hope this helps

    Igor
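
A minimal sketch of the "filter by label" idea, written in plain Python rather than as a RapidMiner process. It assumes each example already carries its predicted language next to its text; the field names `text` and `language` and the sample rows are made up for illustration. The point is that whole rows are grouped, so the text travels with its label instead of being dropped.

```python
from collections import defaultdict

# Hypothetical rows: each keeps the original text next to the predicted label.
rows = [
    {"text": "Guten Morgen zusammen", "language": "de"},
    {"text": "Good morning everyone", "language": "en"},
    {"text": "Wie geht es dir?", "language": "de"},
]


def split_by_language(examples, label_key="language"):
    """Group whole rows by their language label, so the text is not dropped."""
    groups = defaultdict(list)
    for row in examples:
        groups[row[label_key]].append(row)
    return dict(groups)


by_language = split_by_language(rows)

# Select only the German texts for separate analysis.
german_texts = [row["text"] for row in by_language.get("de", [])]
print(german_texts)
```

In RapidMiner terms this roughly corresponds to filtering on the label attribute while keeping the text attribute in the example set.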
  • tibi
    tibi New Altair Community Member

    Igor, I tried your suggestion of using the "Enrich Data by Webservice" operator to create such an attribute, but I am not sure about:

    1. What query type to use

    2. What the regular expression would have to look like for this to work.

    I do have an API key from detectlanguage.com and I am able to pass data to the service. Now the question is how to parse the language value out of the response.

     

    Thanks in advance for your help.

  • sgenzer
    sgenzer
    Altair Employee

    hello @tibi - welcome to the community.  This is an old thread but maybe I can help?  Can you please post your XML process (see instructions on the right)?

     

    Thanks.


    Scott

     

  • tibi
    tibi New Altair Community Member

    Thank you for writing back. Attached is my XML code. I edited the code so that it does not show my API key.

     

    <?xml version="1.0" encoding="UTF-8"?><process version="8.0.001">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="8.0.001" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="retrieve" compatibility="8.0.001" expanded="true" height="68" name="Retrieve Samorincan_facebook_statuses - orig short" width="90" x="313" y="238">
    <parameter key="repository_entry" value="//Facebooklanguage/data/Samorincan_facebook_statuses - orig short"/>
    </operator>
    <operator activated="true" class="web:enrich_data_by_webservice" compatibility="7.3.000" expanded="true" height="68" name="Enrich Data by Webservice" width="90" x="581" y="238">
    <parameter key="query_type" value="Regular Expression"/>
    <list key="string_machting_queries"/>
    <list key="regular_expression_queries"/>
    <list key="regular_region_queries"/>
    <list key="xpath_queries"/>
    <list key="namespaces"/>
    <list key="index_queries"/>
    <list key="jsonpath_queries"/>
    <parameter key="url" value="http://ws.detectlanguage.com/0.2/detect?q=&lt;%status_message%&gt;&amp;key=MYKYHERE"/>
    <parameter key="delay" value="1"/>
    <list key="request_properties"/>
    </operator>
    <connect from_op="Retrieve Samorincan_facebook_statuses - orig short" from_port="output" to_op="Enrich Data by Webservice" to_port="Example Set"/>
    <connect from_op="Enrich Data by Webservice" from_port="ExampleSet" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    </process>
    </operator>
    </process>
  • sgenzer
    sgenzer
    Altair Employee

    hello @tibi - looks like an encoding issue.  Give this a try (again deleting API key):

     

    <?xml version="1.0" encoding="UTF-8"?><process version="8.0.001">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="8.0.001" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="generate_data_user_specification" compatibility="8.0.001" expanded="true" height="68" name="Generate Data by User Specification" width="90" x="45" y="34">
    <list key="attribute_values">
    <parameter key="message" value="&quot;buenos dias señor&quot;"/>
    </list>
    <list key="set_additional_roles"/>
    </operator>
    <operator activated="true" class="web:encode_urls" compatibility="7.3.000" expanded="true" height="82" name="Encode URLs" width="90" x="179" y="34">
    <parameter key="url_attribute" value="message"/>
    <parameter key="encoding" value="UTF-8"/>
    </operator>
    <operator activated="true" class="web:enrich_data_by_webservice" compatibility="7.3.000" expanded="true" height="68" name="Enrich Data by Webservice" width="90" x="313" y="34">
    <parameter key="query_type" value="JsonPath"/>
    <list key="string_machting_queries"/>
    <list key="regular_expression_queries">
    <parameter key="foo" value=".*"/>
    </list>
    <list key="regular_region_queries"/>
    <list key="xpath_queries"/>
    <list key="namespaces"/>
    <list key="index_queries"/>
    <list key="jsonpath_queries">
    <parameter key="language" value="$..language"/>
    <parameter key="isReliable" value="$..isReliable"/>
    <parameter key="confidence" value="$..confidence"/>
    </list>
    <parameter key="url" value="http://ws.detectlanguage.com/0.2/detect?q=&lt;%message%&gt;&amp;key=YOUR-KEY-HERE"/>
    <list key="request_properties"/>
    </operator>
    <connect from_op="Generate Data by User Specification" from_port="output" to_op="Encode URLs" to_port="example set input"/>
    <connect from_op="Encode URLs" from_port="example set output" to_op="Enrich Data by Webservice" to_port="Example Set"/>
    <connect from_op="Enrich Data by Webservice" from_port="ExampleSet" to_port="result 2"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    <portSpacing port="sink_result 3" spacing="0"/>
    </process>
    </operator>
    </process>

    Scott
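
To illustrate what the two changes in this process do, here is a rough Python sketch outside RapidMiner. It percent-encodes the message before placing it in the URL (the job of the Encode URLs step) and then collects every `language` value from the JSON, similar in spirit to the `$..language` JsonPath query. The helper names `build_url` and `find_all` are hypothetical, the sample response only mimics the shape the service returns in this thread, and no request is actually sent.

```python
import json
from urllib.parse import quote

BASE = "http://ws.detectlanguage.com/0.2/detect"  # endpoint used in this thread


def build_url(message, key="YOUR-KEY-HERE"):
    """Percent-encode the text as UTF-8 so spaces and 'ñ' survive in the query."""
    return f"{BASE}?q={quote(message, safe='')}&key={key}"


def find_all(node, name):
    """Collect every value stored under `name` at any depth (like `$..name`)."""
    found = []
    if isinstance(node, dict):
        for k, v in node.items():
            if k == name:
                found.append(v)
            found.extend(find_all(v, name))
    elif isinstance(node, list):
        for item in node:
            found.extend(find_all(item, name))
    return found


url = build_url("buenos dias señor")
print(url)

# Illustrative response in the shape the service returns (values are made up).
sample_text = (
    '{"data": {"detections": '
    '[{"language": "es", "isReliable": true, "confidence": 39.45}]}}'
)
sample = json.loads(sample_text)
languages = find_all(sample, "language")
print(languages)
```

Without the encoding step, the raw spaces and the non-ASCII character would go into the query string unescaped, which is the kind of problem the Encode URLs operator is there to avoid.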

  • tibi
    tibi New Altair Community Member

    Scott,

     

    Yes. That is what it was. Thank you!

    One more thing: when I have a text string with two languages in it, the API actually returns two sets of values for language, isReliable and confidence, and I need all of them. Here is an example of what the API returns in this situation:

    {
      "data": {
        "detections": [
          { "language": "sk", "isReliable": true, "confidence": 13.38 },
          { "language": "hu", "isReliable": false, "confidence": 14.68 }
        ]
      }
    }

    I assume I have to edit the jsonpath queries for the Enrich Data by Webservice operator.  Any suggestions, please?

     

    Thanks,

    Tibor

     

  • sgenzer
    sgenzer
    Altair Employee

    ok I think that would be fine but...can you please give me a text string that will give that result?  :)


    Scott

     

    [EDIT: OK, I got a snippet from the DetectLanguage site. I have never found a reliable way to parse anything beyond simple JSON with that operator, so, strangely enough, I find it more straightforward to convert the JSON to XML and go from there. It looks completely bizarre, but until RapidMiner makes a good Read JSON operator, this is what I have found works best for me.]

     

    <?xml version="1.0" encoding="UTF-8"?><process version="8.0.001">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="8.0.001" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="generate_data_user_specification" compatibility="8.0.001" expanded="true" height="68" name="Generate Data by User Specification" width="90" x="45" y="34">
    <list key="attribute_values">
    <parameter key="message" value="&quot;jak sie jambo prosze bardzo&quot;"/>
    </list>
    <list key="set_additional_roles"/>
    </operator>
    <operator activated="true" class="web:encode_urls" compatibility="7.3.000" expanded="true" height="82" name="Encode URLs" width="90" x="179" y="34">
    <parameter key="url_attribute" value="message"/>
    <parameter key="encoding" value="UTF-8"/>
    </operator>
    <operator activated="true" class="web:enrich_data_by_webservice" compatibility="7.3.000" expanded="true" height="68" name="Enrich Data by Webservice" width="90" x="313" y="34">
    <parameter key="query_type" value="Regular Expression"/>
    <list key="string_machting_queries"/>
    <list key="regular_expression_queries">
    <parameter key="foo" value=".*"/>
    </list>
    <list key="regular_region_queries"/>
    <list key="xpath_queries"/>
    <list key="namespaces"/>
    <list key="index_queries"/>
    <list key="jsonpath_queries">
    <parameter key="language" value="$..language"/>
    <parameter key="isReliable" value="$..isReliable"/>
    <parameter key="confidence" value="$..confidence"/>
    </list>
    <parameter key="url" value="http://ws.detectlanguage.com/0.2/detect?q=&lt;%message%&gt;&amp;key=YOUR-KEY-HERE"/>
    <list key="request_properties"/>
    </operator>
    <operator activated="true" class="text:data_to_documents" compatibility="7.5.000" expanded="true" height="68" name="Data to Documents" width="90" x="447" y="34">
    <parameter key="select_attributes_and_weights" value="true"/>
    <list key="specify_weights">
    <parameter key="foo" value="1.0"/>
    </list>
    </operator>
    <operator activated="true" class="text:combine_documents" compatibility="7.5.000" expanded="true" height="82" name="Combine Documents" width="90" x="581" y="34"/>
    <operator activated="false" class="text:create_document" compatibility="7.5.000" expanded="true" height="68" name="Create Document" width="90" x="581" y="136">
    <parameter key="text" value="{&#10; &quot;data&quot;:{&#10; &quot;detections&quot;:[&#10; {&#10; &quot;isReliable&quot;:true,&#10; &quot;confidence&quot;:39.45,&#10; &quot;language&quot;:&quot;es&quot;&#10; },&#10; {&#10; &quot;isReliable&quot;:false,&#10; &quot;confidence&quot;:3.08,&#10; &quot;language&quot;:&quot;pt&quot;&#10; }&#10; ]&#10; }&#10;}"/>
    </operator>
    <operator activated="true" class="web:json_to_xml" compatibility="7.3.000" expanded="true" height="68" name="JSON to XML" width="90" x="715" y="34"/>
    <operator activated="true" class="text:write_document" compatibility="7.5.000" expanded="true" height="82" name="Write Document" width="90" x="849" y="34"/>
    <operator activated="true" class="advanced_file_connectors:read_xml" compatibility="8.0.001" expanded="true" height="68" name="Read XML" width="90" x="981" y="85">
    <parameter key="file" value="/Users/genzerconsulting/Desktop/Untitled 3.xml"/>
    <parameter key="xpath_for_examples" value="//json"/>
    <enumeration key="xpaths_for_attributes">
    <parameter key="xpath_for_attribute" value="data[1]/detections[1]/isReliable[1]/text()"/>
    <parameter key="xpath_for_attribute" value="data[1]/detections[1]/confidence[1]/text()"/>
    <parameter key="xpath_for_attribute" value="data[1]/detections[1]/language[1]/text()"/>
    <parameter key="xpath_for_attribute" value="data[1]/detections[2]/isReliable[1]/text()"/>
    <parameter key="xpath_for_attribute" value="data[1]/detections[2]/confidence[1]/text()"/>
    <parameter key="xpath_for_attribute" value="data[1]/detections[2]/language[1]/text()"/>
    </enumeration>
    <list key="namespaces"/>
    <parameter key="use_default_namespace" value="false"/>
    <list key="annotations"/>
    <list key="data_set_meta_data_information">
    <parameter key="0" value="isReliable[1].true.nominal.attribute"/>
    <parameter key="1" value="confidence[1].true.numeric.attribute"/>
    <parameter key="2" value="language[1].true.nominal.attribute"/>
    <parameter key="3" value="isReliable[2].true.nominal.attribute"/>
    <parameter key="4" value="confidence[2].true.numeric.attribute"/>
    <parameter key="5" value="language[2].true.nominal.attribute"/>
    </list>
    </operator>
    <connect from_op="Generate Data by User Specification" from_port="output" to_op="Encode URLs" to_port="example set input"/>
    <connect from_op="Encode URLs" from_port="example set output" to_op="Enrich Data by Webservice" to_port="Example Set"/>
    <connect from_op="Enrich Data by Webservice" from_port="ExampleSet" to_op="Data to Documents" to_port="example set"/>
    <connect from_op="Data to Documents" from_port="documents" to_op="Combine Documents" to_port="documents 1"/>
    <connect from_op="Combine Documents" from_port="document" to_op="JSON to XML" to_port="document"/>
    <connect from_op="JSON to XML" from_port="document" to_op="Write Document" to_port="document"/>
    <connect from_op="Write Document" from_port="file" to_op="Read XML" to_port="file"/>
    <connect from_op="Read XML" from_port="output" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    </process>
    </operator>
    </process>
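
For comparison, the flattening that the Read XML step performs above (one column per field per detection) can be sketched in a few lines of Python. This is only an illustration under assumptions: it expects the `data` / `detections` response shape posted earlier in the thread, and the column names such as `language_1` are hypothetical, not something RapidMiner or the service produces.

```python
import json

# Response shape as posted earlier in the thread for a two-language string.
response_text = """
{"data": {"detections": [
  {"language": "sk", "isReliable": true,  "confidence": 13.38},
  {"language": "hu", "isReliable": false, "confidence": 14.68}
]}}
"""


def flatten_detections(text):
    """Turn the detections list into one flat row with numbered columns."""
    detections = json.loads(text)["data"]["detections"]
    row = {}
    for i, det in enumerate(detections, start=1):
        for field in ("language", "isReliable", "confidence"):
            # Missing fields become None rather than raising an error.
            row[f"{field}_{i}"] = det.get(field)
    return row


row = flatten_detections(response_text)
print(row)
```

Unlike the fixed XPath list in the process, this loop adapts to however many detections come back, at the cost of rows having a varying number of columns.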