"Dictionary Spanish (text mining)"

ronel74
New Altair Community Member
Updated by Jocelyn
Hi, I recently started using RapidMiner and I am having trouble with some of the text processing operators, because the language I am working with is Spanish.

The operators that I would like to use are:

Stemming
Tokenize (linguistic mode)
Filter Stopwords

Are these operators available for Spanish texts?

    The Stem (Snowball) operator supports Spanish.
    Still no Filter Stopwords operator for Spanish, though, right? :(

    Actually, there are Spanish stopword lists you can download from the internet and add to your process using the Filter Stopwords (Dictionary) operator.
    Just follow the operator documentation: create a file with one Spanish word per line and use that.
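
    As a rough sketch of that setup, the operator can also point at the file directly through its `file` parameter instead of receiving it on the file port. The path `spanish_stopwords.txt` is a placeholder, and the parameter keys are assumed to match the Text Processing extension version used in this thread:

```xml
<!-- Sketch only: the file path is a placeholder for your own stopword list (one word per line, UTF-8). -->
<operator activated="true" class="text:filter_stopwords_dictionary" compatibility="7.0.000" expanded="true" name="Filter Stopwords (Dictionary)">
  <parameter key="file" value="spanish_stopwords.txt"/>
  <parameter key="case_sensitive" value="false"/>
  <parameter key="encoding" value="UTF-8"/>
</operator>
```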

    Here's a short example using the stopwords listed here: http://www.ranks.nl/stopwords/spanish
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="7.0.001">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="6.0.002" expanded="true" name="Root">
        <parameter key="logverbosity" value="init"/>
        <parameter key="random_seed" value="1969"/>
        <parameter key="send_mail" value="never"/>
        <parameter key="notification_email" value=""/>
        <parameter key="process_duration_for_mail" value="30"/>
        <parameter key="encoding" value="UTF-8"/>
        <parameter key="parallelize_main_process" value="false"/>
        <process expanded="true">
          <operator activated="true" class="text:create_document" compatibility="7.0.000" expanded="true" height="68" name="Spanish Stopwords" width="90" x="45" y="187">
            <parameter key="text" value="un&#10;una&#10;unas&#10;unos&#10;uno&#10;sobre&#10;todo&#10;también&#10;tras&#10;otro&#10;algún&#10;alguno&#10;alguna&#10;algunos&#10;algunas&#10;ser&#10;es&#10;soy&#10;eres&#10;somos&#10;sois&#10;estoy&#10;esta&#10;estamos&#10;estais&#10;estan&#10;como&#10;en&#10;para&#10;atras&#10;porque&#10;por qué&#10;estado&#10;estaba&#10;ante&#10;antes&#10;siendo&#10;ambos&#10;pero&#10;por&#10;poder&#10;puede&#10;puedo&#10;podemos&#10;podeis&#10;pueden&#10;fui&#10;fue&#10;fuimos&#10;fueron&#10;hacer&#10;hago&#10;hace&#10;hacemos&#10;haceis&#10;hacen&#10;cada&#10;fin&#10;incluso&#10;primero&#10;desde&#10;conseguir&#10;consigo&#10;consigue&#10;consigues&#10;conseguimos&#10;consiguen&#10;ir&#10;voy&#10;va&#10;vamos&#10;vais&#10;van&#10;vaya&#10;gueno&#10;ha&#10;tener&#10;tengo&#10;tiene&#10;tenemos&#10;teneis&#10;tienen&#10;el&#10;la&#10;lo&#10;las&#10;los&#10;su&#10;aqui&#10;mio&#10;tuyo&#10;ellos&#10;ellas&#10;nos&#10;nosotros&#10;vosotros&#10;vosotras&#10;si&#10;dentro&#10;solo&#10;solamente&#10;saber&#10;sabes&#10;sabe&#10;sabemos&#10;sabeis&#10;saben&#10;ultimo&#10;largo&#10;bastante&#10;haces&#10;muchos&#10;aquellos&#10;aquellas&#10;sus&#10;entonces&#10;tiempo&#10;verdad&#10;verdadero&#10;verdadera&#10;cierto&#10;ciertos&#10;cierta&#10;ciertas&#10;intentar&#10;intento&#10;intenta&#10;intentas&#10;intentamos&#10;intentais&#10;intentan&#10;dos&#10;bajo&#10;arriba&#10;encima&#10;usar&#10;uso&#10;usas&#10;usa&#10;usamos&#10;usais&#10;usan&#10;emplear&#10;empleo&#10;empleas&#10;emplean&#10;ampleamos&#10;empleais&#10;valor&#10;muy&#10;era&#10;eras&#10;eramos&#10;eran&#10;modo&#10;bien&#10;cual&#10;cuando&#10;donde&#10;mientras&#10;quien&#10;con&#10;entre&#10;sin&#10;trabajo&#10;trabajar&#10;trabajas&#10;trabaja&#10;trabajamos&#10;trabajais&#10;trabajan&#10;podria&#10;podrias&#10;podriamos&#10;podrian&#10;podriais&#10;yo&#10;aquel"/>
            <parameter key="add label" value="false"/>
            <parameter key="label_type" value="nominal"/>
          </operator>
          <operator activated="true" class="text:write_document" compatibility="7.0.000" expanded="true" height="82" name="Create a file of these words" width="90" x="179" y="187">
            <parameter key="overwrite" value="true"/>
            <parameter key="encoding" value="UTF-8"/>
          </operator>
          <operator activated="true" class="text:read_document" compatibility="7.0.000" expanded="true" height="68" name="Read Document" width="90" x="45" y="34">
            <parameter key="file" value="myFile.txt"/>
            <parameter key="extract_text_only" value="true"/>
            <parameter key="use_file_extension_as_type" value="true"/>
            <parameter key="content_type" value="txt"/>
            <parameter key="encoding" value="UTF-8"/>
          </operator>
          <operator activated="true" class="text:tokenize" compatibility="7.0.000" expanded="true" height="68" name="Tokenize" width="90" x="179" y="34">
            <parameter key="mode" value="non letters"/>
            <parameter key="characters" value=".:"/>
            <parameter key="language" value="English"/>
            <parameter key="max_token_length" value="3"/>
          </operator>
          <operator activated="true" class="text:filter_stopwords_dictionary" compatibility="7.0.000" expanded="true" height="82" name="Filter Stopwords (Dictionary)" width="90" x="380" y="34">
            <parameter key="case_sensitive" value="false"/>
            <parameter key="encoding" value="UTF-8"/>
          </operator>
          <connect from_op="Spanish Stopwords" from_port="output" to_op="Create a file of these words" to_port="document"/>
          <connect from_op="Create a file of these words" from_port="file" to_op="Filter Stopwords (Dictionary)" to_port="file"/>
          <connect from_op="Read Document" from_port="output" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_op="Filter Stopwords (Dictionary)" to_port="document"/>
          <connect from_op="Filter Stopwords (Dictionary)" from_port="document" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
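
    If you also want stemming, a Stem (Snowball) operator could be placed after the stopword filter. The fragment below is a minimal sketch, not a tested process: it assumes the operator class `text:stem_snowball` with a `language` parameter, and the operator name is illustrative. The two `connect` lines would replace the existing connection from Filter Stopwords (Dictionary) to `result 1`.

```xml
<!-- Sketch only: merge into the inner <process> element of the example above. -->
<operator activated="true" class="text:stem_snowball" compatibility="7.0.000" expanded="true" name="Stem (Snowball)">
  <parameter key="language" value="Spanish"/>
</operator>
<connect from_op="Filter Stopwords (Dictionary)" from_port="document" to_op="Stem (Snowball)" to_port="document"/>
<connect from_op="Stem (Snowball)" from_port="document" to_port="result 1"/>
```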
    Thank you very much, I did that and it worked perfectly.