New Extension for Applied Onomastics (name recognition) on GitHub + help needed
NamSor
Hi,
Last month we prototyped a RapidMiner integration with the NamSor GendRE API to recognize the gender of names
http://namesorts.com/2014/04/23/rapidminer-to-enrich-gender-data/
using 'Enrich Data by Webservice'.
We've started building a custom extension to offer more functionality, but we're running into problems.
https://github.com/namsor/rapidminer-onomastics-extension
1) The firstName in the CSV output doesn't correspond to the input
2) The REAL value shows a rounded value instead of full precision (ignore the value itself, it is randomly generated)
3) We had to create a 'DummyOperator' with 'name generate_extract', otherwise RM complains that the documentation is missing
Otherwise, the integration seems to work with RM 5.3.015; the operator appears under /Onomastics/Name2Gender
Any help welcome!
Thanks,
Elian
Input file:
firstName;lastName;countryIso2
Blas;PEREZ+HENRIQUEZ;
A.+Craig;COPETAS;
Abdel;AISSOU;
Abderrahman;BEDDI;
Achmad+Danny;GAZALI;
Ada;COLAU;
Adam;GREEN;
Adam+S.;POSEN;
Adeline;BRAESCU+KERLAN;
Aditya;GARG;
Adnan;BALI;
Adnane;EL+FASSI;
Adriaan;SMIT;
Adrian;MCGINN;
Adrián;MICHEL+ESPINO;
Adriana;VERDIER;
Adrien;REGNIER+LAURENT;fr
Adrien;SURU;
Илья;Ковальчук;ru
What we get in the output (genderScale is a random number):
"firstName";"lastName";"countryIso2";"genderScale";"gender"
"Blas";"PEREZ+HENRIQUEZ";;0.0;"Male"
"A.+Craig";"COPETAS";;1.0;"Female"
"Abdel";"AISSOU";;2.0;"Unknown"
"Blas";"BEDDI";;0.0;"Male"
"A.+Craig";"GAZALI";;1.0;"Female"
"Blas";"COLAU";;0.0;"Male"
"Abdel";"GREEN";;2.0;"Unknown"
"Blas";"POSEN";;0.0;"Male"
"Blas";"BRAESCU+KERLAN";;0.0;"Male"
"Blas";"GARG";;0.0;"Male"
"Abdel";"BALI";;2.0;"Unknown"
"A.+Craig";"EL+FASSI";;1.0;"Female"
"Blas";"SMIT";;0.0;"Male"
"A.+Craig";"MCGINN";;1.0;"Female"
"Abdel";"MICHEL+ESPINO";;2.0;"Unknown"
"Abdel";"VERDIER";;2.0;"Unknown"
"A.+Craig";"REGNIER+LAURENT";"fr";1.0;"Female"
"A.+Craig";"SURU";;1.0;"Female"
"Blas";"Ковальчук";"ru";0.0;"Male"
Answers
Hi,
cool stuff 8)
1) I don't quite get the problem. What CSV output?
2) By default, RapidMiner rounds to 3 fraction digits when displaying data. You can change the default setting in the preferences under "General" -> "rapidminer.general.fractiondigits.numbers". Calculations always use the actual, unrounded numbers.
3) Not quite sure what that is about. Do you also get this warning in the console when you remove your extension? I don't think it has anything to do with it.
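On point 2, a minimal plain-Java sketch of the difference between a displayed value and the stored value. This is only an illustration of display rounding in general, not RapidMiner's own formatting code; the class name, the `display` helper, and the sample value are hypothetical.

```java
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.util.Locale;

public class FractionDigitsDemo {

    /** Formats a value for display with at most the given number of fraction digits. */
    static String display(double value, int fractionDigits) {
        DecimalFormat format =
                new DecimalFormat("0.0", DecimalFormatSymbols.getInstance(Locale.ROOT));
        format.setMinimumFractionDigits(1);
        format.setMaximumFractionDigits(fractionDigits);
        return format.format(value);
    }

    public static void main(String[] args) {
        double genderScale = -0.4567891234;           // hypothetical sample value

        // What a viewer limited to 3 fraction digits would show.
        System.out.println(display(genderScale, 3));  // prints -0.457

        // The stored double is untouched by the display formatting.
        System.out.println(genderScale);              // prints -0.4567891234
    }
}
```

Only the string produced for display is shortened; any later computation on `genderScale` still sees the full-precision double.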
Regards,
Marco
Hi Marco! Thanks for helping out.
I've created a simple process loading data from an Excel file with
>firstName;lastName;countryIso2
>Blas;PEREZ+HENRIQUEZ;
>A.+Craig;COPETAS;
>Abdel;AISSOU;
Then I connected this Excel import operator to my custom extension operator Name2Gender, and connected its output to a CSV file. Unfortunately, the output of my extension operator seems completely mixed up: the same firstName is repeated several times, the numeric values are incorrect, etc.
I think the problem comes from the way I pass parameters in and out in the doWork method:
    @Override
    public void doWork() throws OperatorException {
        ExampleSet exampleSet = inputSet.getData();
        Attributes attributes = exampleSet.getAttributes();
        Attribute fnAttribute = attributes.get(ATTRIBUTE_FN);
        Attribute lnAttribute = attributes.get(ATTRIBUTE_LN);
        Attribute iso2Attribute = attributes.get(ATTRIBUTE_ISO2);
        String mashapeAPIKey = getParameterAsString(MASHAPE_API_KEY);
        String defaultISO2 = getParameterAsString(DEFAULT_COUNTRY_ISO2);
        double threshold = getParameterAsDouble(ATTRIBUTE_THRESHOLD);
        Attribute genderScaleAttribute = AttributeFactory.createAttribute(
                ATTRIBUTE_GENDERSCALE, Ontology.REAL);
        genderScaleAttribute.setTableIndex(fnAttribute.getTableIndex());
        attributes.addRegular(genderScaleAttribute);
        Attribute genderAttribute = AttributeFactory.createAttribute(
                ATTRIBUTE_GENDER, Ontology.STRING);
        genderAttribute.setTableIndex(fnAttribute.getTableIndex());
        attributes.addRegular(genderAttribute);
        for (Example example : exampleSet) {
            String firstName = example.getValueAsString(fnAttribute);
            String lastName = example.getValueAsString(lnAttribute);
            String iso2 = example.getValueAsString(iso2Attribute);
            if (iso2 != null && iso2.trim().length() == 2) {
                // real value
            } else if (defaultISO2 != null && defaultISO2.trim().length() == 2) {
                iso2 = defaultISO2.trim();
            } else {
                // invalid value, set to null
                iso2 = null;
            }
            double genderScale = 0d;
            if (MOCKUP) {
                genderScale = RND.nextDouble() * 2 - 1;
            } else {
                // API stuff goes here
            }
            String gender = "Unknown";
            if (genderScale > threshold) {
                gender = "Female";
            } else if (genderScale < -threshold) {
                gender = "Male";
            }
            example.setValue(genderScaleAttribute, genderScale);
            example.setValue(genderAttribute, gender);
        }
        outputSet.deliver(exampleSet);
    }
Any idea?
Thx,
Elian
Hi,
the call
    genderScaleAttribute.setTableIndex(fnAttribute.getTableIndex());
seems dangerous. Generally speaking, you can only append new attribute columns on the right. Does removing said line fix your problem?
Regards,
Marco
Hi Marco,
Without this call, I get an ArrayIndexOutOfBoundsException. I took this method from the "How-to-Extend-RapidMiner-5" documentation. Is there an updated document?
Thx in advance for your help,
Elian
SEVERE: java.lang.ArrayIndexOutOfBoundsException: -1
java.lang.ArrayIndexOutOfBoundsException: -1
    at com.rapidminer.example.table.DoubleArrayDataRow.set(DoubleArrayDataRow.java:61)
    at com.rapidminer.example.table.AbstractAttribute.setValue(AbstractAttribute.java:184)
    at com.rapidminer.example.table.DataRow.set(DataRow.java:85)
    at com.rapidminer.example.Example.setValue(Example.java:140)
    at com.namsor.api.rapidminer.Name2GenderOperator.doWork(Name2GenderOperator.java:160)
    at com.rapidminer.operator.Operator.execute(Operator.java:866)
    at com.rapidminer.operator.execution.SimpleUnitExecutor.execute(SimpleUnitExecutor.java:51)
    at com.rapidminer.operator.ExecutionUnit.execute(ExecutionUnit.java:711)
    at com.rapidminer.operator.OperatorChain.doWork(OperatorChain.java:375)
    at com.rapidminer.operator.Operator.execute(Operator.java:866)
Hi,
the document will be updated, however I cannot name any date as of yet.
Please use these calls to add new attributes to an existing ExampleSet:
    exampleSet.getExampleTable().addAttribute(newAttribute);
    exampleSet.getAttributes().addRegular(newAttribute);
Regards,
Marco
Thanks a lot Marco, that worked! E.
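For readers who hit the same symptoms, here is a toy model in plain Java of why sharing a table index mixes up the columns. It is a sketch under assumptions, not RapidMiner code: `ToyAttribute` and `ToyTable` are hypothetical stand-ins that assume each row is an array of doubles and that string values are stored as codes into a per-attribute mapping. Under that assumption, writing "Male", "Female", "Unknown" through an attribute that reuses the firstName column overwrites the name codes with 0, 1, 2, which then read back as the first three names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TableIndexDemo {

    /** Toy attribute: a column index plus a mapping from strings to numeric codes. */
    static class ToyAttribute {
        int tableIndex = -1; // -1 = not yet attached to a table column
        final List<String> mapping = new ArrayList<>();

        double encode(String value) {
            int code = mapping.indexOf(value);
            if (code < 0) {
                mapping.add(value);
                code = mapping.size() - 1;
            }
            return code;
        }

        String decode(double code) {
            return mapping.get((int) code);
        }
    }

    /** Toy table: every row is a double[]; strings are stored as mapping codes. */
    static class ToyTable {
        final List<double[]> rows = new ArrayList<>();
        int columns = 0;

        /** Appends a column on the right and hands its index to the attribute. */
        void addAttribute(ToyAttribute attribute) {
            attribute.tableIndex = columns++;
            for (int i = 0; i < rows.size(); i++) {
                rows.set(i, Arrays.copyOf(rows.get(i), columns));
            }
        }

        void addRow() {
            rows.add(new double[columns]);
        }

        void set(int row, ToyAttribute attribute, String value) {
            rows.get(row)[attribute.tableIndex] = attribute.encode(value);
        }

        String get(int row, ToyAttribute attribute) {
            return attribute.decode(rows.get(row)[attribute.tableIndex]);
        }
    }

    /** Fills firstName, then writes gender; returns the first names read back. */
    static List<String> run(boolean shareIndex) {
        String[] names = {"Blas", "A.+Craig", "Abdel", "Abderrahman"};
        String[] genders = {"Male", "Female", "Unknown", "Male"};

        ToyTable table = new ToyTable();
        ToyAttribute firstName = new ToyAttribute();
        table.addAttribute(firstName);
        for (int i = 0; i < names.length; i++) {
            table.addRow();
            table.set(i, firstName, names[i]);
        }

        ToyAttribute gender = new ToyAttribute();
        if (shareIndex) {
            gender.tableIndex = firstName.tableIndex; // the problematic pattern
        } else {
            table.addAttribute(gender);               // new column on the right
        }
        for (int i = 0; i < genders.length; i++) {
            table.set(i, gender, genders[i]);
        }

        List<String> result = new ArrayList<>();
        for (int i = 0; i < names.length; i++) {
            result.add(table.get(i, firstName));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(true));  // prints [Blas, A.+Craig, Abdel, Blas]
        System.out.println(run(false)); // prints [Blas, A.+Craig, Abdel, Abderrahman]
    }
}
```

In the same toy model, calling `set` with an attribute that was never attached (index still -1) throws an ArrayIndexOutOfBoundsException, which mirrors the stack trace above. Letting the table assign the column index, as the two calls Marco posted do for the real ExampleSet, avoids both problems.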