New Extension for Applied Onomastics (name recognition) on GitHub + help needed

NamSor
NamSor New Altair Community Member
edited November 5 in Community Q&A
Hi,

Last month we prototyped a RapidMiner integration with the NamSor GendRE API, which recognizes the gender of names, using the 'Enrich Data by Webservice' operator:
http://namesorts.com/2014/04/23/rapidminer-to-enrich-gender-data/

We've since started building a custom extension to offer more functionality, but we're running into problems.
https://github.com/namsor/rapidminer-onomastics-extension

1) The firstName values in the CSV output don't correspond to the input.
2) The REAL value is shown rounded instead of at full precision (ignore the value itself; it is randomly generated).
3) We had to create a 'DummyOperator' with the name 'generate_extract'; otherwise RM complains that the documentation is missing.

Otherwise, the integration seems to work with RM 5.3.015; the operator appears under /Onomastics/Name2Gender.

Any help welcome!
Thanks,
Elian

Input file:
firstName;lastName;countryIso2
Blas;PEREZ+HENRIQUEZ;
A.+Craig;COPETAS;
Abdel;AISSOU;
Abderrahman;BEDDI;
Achmad+Danny;GAZALI;
Ada;COLAU;
Adam;GREEN;
Adam+S.;POSEN;
Adeline;BRAESCU+KERLAN;
Aditya;GARG;
Adnan;BALI;
Adnane;EL+FASSI;
Adriaan;SMIT;
Adrian;MCGINN;
Adrián;MICHEL+ESPINO;
Adriana;VERDIER;
Adrien;REGNIER+LAURENT;fr
Adrien;SURU;
Илья;Ковальчук;ru


What we get in the output (genderScale is a random number):

"firstName";"lastName";"countryIso2";"genderScale";"gender"
"Blas";"PEREZ+HENRIQUEZ";;0.0;"Male"
"A.+Craig";"COPETAS";;1.0;"Female"
"Abdel";"AISSOU";;2.0;"Unknown"
"Blas";"BEDDI";;0.0;"Male"
"A.+Craig";"GAZALI";;1.0;"Female"
"Blas";"COLAU";;0.0;"Male"
"Abdel";"GREEN";;2.0;"Unknown"
"Blas";"POSEN";;0.0;"Male"
"Blas";"BRAESCU+KERLAN";;0.0;"Male"
"Blas";"GARG";;0.0;"Male"
"Abdel";"BALI";;2.0;"Unknown"
"A.+Craig";"EL+FASSI";;1.0;"Female"
"Blas";"SMIT";;0.0;"Male"
"A.+Craig";"MCGINN";;1.0;"Female"
"Abdel";"MICHEL+ESPINO";;2.0;"Unknown"
"Abdel";"VERDIER";;2.0;"Unknown"
"A.+Craig";"REGNIER+LAURENT";"fr";1.0;"Female"
"A.+Craig";"SURU";;1.0;"Female"
"Blas";"Ковальчук";"ru";0.0;"Male"

Answers

  • Marco_Boeck
    Marco_Boeck New Altair Community Member
    Hi,

    cool stuff 8)

    1) I don't quite get the problem. What CSV output?
    2) By default, RapidMiner rounds to 3 fraction digits when displaying data. You can change the default setting in the preferences under "General" -> "rapidminer.general.fractiondigits.numbers". When calculating, the actual numbers are used.
    3) Not quite sure what that is about. Are you getting this warning in the console also when removing your extension? I don't think it has anything to do with it.
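
    To illustrate point 2, here is a minimal Java sketch (not RapidMiner code; the class and method names are made up for the example) showing that formatting a double to 3 fraction digits only changes the displayed string, not the stored value:

    ```java
    import java.util.Locale;

    // Hypothetical demo class: display rounding versus the stored value.
    public class FractionDigitsDemo {

        // Formats a value with a fixed number of fraction digits, for display only.
        static String display(double value, int fractionDigits) {
            return String.format(Locale.ROOT, "%." + fractionDigits + "f", value);
        }

        public static void main(String[] args) {
            double genderScale = 0.123456789;
            System.out.println(display(genderScale, 3)); // rounded for display
            System.out.println(genderScale);             // stored value is unchanged
        }
    }
    ```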

    Regards,
    Marco
  • NamSor
    NamSor New Altair Community Member
    Hi Marco! Thanks for helping out.

    I've created a simple process loading data from an Excel file with

    >firstName;lastName;countryIso2
    >Blas;PEREZ+HENRIQUEZ;
    >A.+Craig;COPETAS;
    >Abdel;AISSOU;

    Then I've connected this Import Excel operator with my custom Extension operator Name2Gender, and connected the output to a CSV file. Unfortunately, the output of my Extension operator seems completely mixed up, with the same firstName being repeated several times, incorrect numeric values, etc.

    I think the problem comes from the way I pass parameters in and out in the doWork method


    @Override
    public void doWork() throws OperatorException {

        ExampleSet exampleSet = inputSet.getData();
        Attributes attributes = exampleSet.getAttributes();
        Attribute fnAttribute = attributes.get(ATTRIBUTE_FN);
        Attribute lnAttribute = attributes.get(ATTRIBUTE_LN);
        Attribute iso2Attribute = attributes.get(ATTRIBUTE_ISO2);

        String mashapeAPIKey = getParameterAsString(MASHAPE_API_KEY);
        String defaultISO2 = getParameterAsString(DEFAULT_COUNTRY_ISO2);
        double threshold = getParameterAsDouble(ATTRIBUTE_THRESHOLD);

        Attribute genderScaleAttribute = AttributeFactory.createAttribute(
                ATTRIBUTE_GENDERSCALE, Ontology.REAL);
        genderScaleAttribute.setTableIndex(fnAttribute.getTableIndex());
        attributes.addRegular(genderScaleAttribute);

        Attribute genderAttribute = AttributeFactory.createAttribute(
                ATTRIBUTE_GENDER, Ontology.STRING);
        genderAttribute.setTableIndex(fnAttribute.getTableIndex());
        attributes.addRegular(genderAttribute);

        for (Example example : exampleSet) {
            String firstName = example.getValueAsString(fnAttribute);
            String lastName = example.getValueAsString(lnAttribute);
            String iso2 = example.getValueAsString(iso2Attribute);
            if (iso2 != null && iso2.trim().length() == 2) {
                // real value
            } else if (defaultISO2 != null && defaultISO2.trim().length() == 2) {
                iso2 = defaultISO2.trim();
            } else {
                // invalid value, set to null
                iso2 = null;
            }

            double genderScale = 0d;
            if (MOCKUP) {
                genderScale = RND.nextDouble() * 2 - 1;
            } else {
                // API stuff goes here
            }
            String gender = "Unknown";
            if (genderScale > threshold) {
                gender = "Female";
            } else if (genderScale < -threshold) {
                gender = "Male";
            }
            example.setValue(genderScaleAttribute, genderScale);
            example.setValue(genderAttribute, gender);
        }
        outputSet.deliver(exampleSet);
    }

    Any idea?
    Thx,
    Elian
  • Marco_Boeck
    Marco_Boeck New Altair Community Member
    Hi,

    the call

    genderScaleAttribute.setTableIndex(fnAttribute.getTableIndex());
    seems dangerous. Generally speaking, you can only append new attribute columns on the right. Does removing said line fix your problem?
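
    One possible explanation, consistent with the output posted above: if nominal values are stored in the data row as numeric indices into a per-attribute mapping, then two attributes that share a table index overwrite each other's slot. The toy model below is only a sketch under that assumption; ToyAttribute and SharedColumnDemo are hypothetical names, not RapidMiner classes.

    ```java
    import java.util.ArrayList;
    import java.util.List;

    // Toy model (NOT the RapidMiner classes): each attribute reads and writes
    // one slot of a double[] row, and nominal values are stored as mapping indices.
    public class SharedColumnDemo {

        static class ToyAttribute {
            final String name;
            final int tableIndex;                             // column in the data row
            final List<String> mapping = new ArrayList<>();   // index -> nominal value

            ToyAttribute(String name, int tableIndex) {
                this.name = name;
                this.tableIndex = tableIndex;
            }

            // Stores the mapping index of the value in this attribute's column.
            void setNominal(double[] row, String value) {
                int idx = mapping.indexOf(value);
                if (idx < 0) {
                    mapping.add(value);
                    idx = mapping.size() - 1;
                }
                row[tableIndex] = idx;
            }

            // Reads the column and translates the index through this attribute's mapping.
            String getNominal(double[] row) {
                return mapping.get((int) row[tableIndex]);
            }
        }

        static String run() {
            ToyAttribute firstName = new ToyAttribute("firstName", 0);
            ToyAttribute gender = new ToyAttribute("gender", 0); // same column as firstName

            String[] names = {"Blas", "A.+Craig", "Abdel", "Abderrahman"};
            String[] genders = {"Male", "Female", "Unknown", "Male"};
            StringBuilder out = new StringBuilder();
            for (int i = 0; i < names.length; i++) {
                double[] row = new double[1];
                firstName.setNominal(row, names[i]);
                gender.setNominal(row, genders[i]); // overwrites the firstName slot
                out.append(firstName.getNominal(row)).append(';')
                   .append(gender.getNominal(row)).append('\n');
            }
            return out.toString();
        }

        public static void main(String[] args) {
            System.out.print(run());
        }
    }
    ```

    In this model the fourth row reads back as "Blas;Male" even though "Abderrahman" was written, which is the same pattern as the "Blas";"BEDDI" line in the CSV output.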

    Regards,
    Marco
  • NamSor
    NamSor New Altair Community Member
    Hi Marco,

    Without this call, I get an ArrayIndexOutOfBoundsException. I took this call from the "How-to-Extend-RapidMiner-5" documentation. Is there an updated document?

    Thx in advance for your help,
    Elian

    SEVERE: java.lang.ArrayIndexOutOfBoundsException: -1
    java.lang.ArrayIndexOutOfBoundsException: -1
            at com.rapidminer.example.table.DoubleArrayDataRow.set(DoubleArrayDataRow.java:61)
            at com.rapidminer.example.table.AbstractAttribute.setValue(AbstractAttribute.java:184)
            at com.rapidminer.example.table.DataRow.set(DataRow.java:85)
            at com.rapidminer.example.Example.setValue(Example.java:140)
            at com.namsor.api.rapidminer.Name2GenderOperator.doWork(Name2GenderOperator.java:160)
            at com.rapidminer.operator.Operator.execute(Operator.java:866)
            at com.rapidminer.operator.execution.SimpleUnitExecutor.execute(SimpleUnitExecutor.java:51)
            at com.rapidminer.operator.ExecutionUnit.execute(ExecutionUnit.java:711)
            at com.rapidminer.operator.OperatorChain.doWork(OperatorChain.java:375)
            at com.rapidminer.operator.Operator.execute(Operator.java:866)
  • Marco_Boeck
    Marco_Boeck New Altair Community Member
    Hi,

    the document will be updated, but I cannot give a date yet.
    Please use these calls to add new attributes to an existing ExampleSet:

    exampleSet.getExampleTable().addAttribute(newAttribute);
    exampleSet.getAttributes().addRegular(newAttribute);
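
    As a rough picture of why the first call matters (a toy model with made-up class names, not the actual RapidMiner implementation): a freshly created attribute has no column yet, so writing through it hits index -1, which matches the exception above. Registering it with the table first appends a column on the right and assigns the index.

    ```java
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    // Toy model (NOT the RapidMiner classes) of a table whose rows are double[].
    public class AppendColumnDemo {

        static class ToyAttribute {
            final String name;
            int tableIndex = -1; // stays -1 until a table assigns a column

            ToyAttribute(String name) {
                this.name = name;
            }
        }

        static class ToyTable {
            private final List<double[]> rows = new ArrayList<>();
            private int columns;

            ToyTable(int columns, int rowCount) {
                this.columns = columns;
                for (int i = 0; i < rowCount; i++) {
                    rows.add(new double[columns]);
                }
            }

            // Appends a column on the right: grows every row, then assigns the index.
            void addAttribute(ToyAttribute attribute) {
                for (int i = 0; i < rows.size(); i++) {
                    rows.set(i, Arrays.copyOf(rows.get(i), columns + 1));
                }
                attribute.tableIndex = columns++;
            }

            void set(int row, ToyAttribute attribute, double value) {
                rows.get(row)[attribute.tableIndex] = value;
            }

            double get(int row, ToyAttribute attribute) {
                return rows.get(row)[attribute.tableIndex];
            }
        }

        // Writing through an unregistered attribute uses index -1 and fails.
        static boolean failsWithoutRegistration() {
            ToyTable table = new ToyTable(3, 2);
            ToyAttribute scale = new ToyAttribute("genderScale");
            try {
                table.set(0, scale, 0.42);
                return false;
            } catch (ArrayIndexOutOfBoundsException e) {
                return true;
            }
        }

        // After registration the attribute owns a new column of its own.
        static double worksAfterRegistration() {
            ToyTable table = new ToyTable(3, 2);
            ToyAttribute scale = new ToyAttribute("genderScale");
            table.addAttribute(scale);
            table.set(0, scale, 0.42);
            return table.get(0, scale);
        }

        public static void main(String[] args) {
            System.out.println(failsWithoutRegistration());
            System.out.println(worksAfterRegistration());
        }
    }
    ```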
    Regards,
    Marco
  • NamSor
    NamSor New Altair Community Member
    Thanks a lot Marco, that worked! E.