Get and set roles from a reference data set
christos_karras
New Altair Community Member
I have a data set in which various features have been excluded by marking them with a role. They were not removed because they can be useful for reference, even though they should be excluded from most operators. Now I would like to apply the same roles to another data set that has the same columns but on which no roles have been set.
In Python, I was able to do this to achieve my objective:
def rm_main(data, refdata):
    data.rm_metadata = refdata.rm_metadata
    return data, refdata
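A narrower variant of the same idea copies only the roles and leaves each column's own type untouched. This is a minimal sketch, assuming `rm_metadata` is a plain dict that maps each column name to a `(type, role)` tuple; the helper name `copy_roles` is hypothetical and not part of RapidMiner.

```python
def copy_roles(data_meta, ref_meta):
    """Keep each column's own type, but take its role from the reference
    metadata whenever the column also exists there."""
    merged = {}
    for name, (col_type, role) in data_meta.items():
        if name in ref_meta:
            # Each entry is assumed to be a (type, role) tuple; index 1 is the role.
            role = ref_meta[name][1]
        merged[name] = (col_type, role)
    return merged


def rm_main(data, refdata):
    # Assumption: rm_metadata is a dict of column name -> (type, role).
    data.rm_metadata = copy_roles(data.rm_metadata, refdata.rm_metadata)
    return data, refdata
```

Columns that are missing from the reference set keep whatever role they already had, so the script does not fail if the two sets differ slightly.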
However, this can be slow for large data sets because the whole data set is passed back and forth between Python and RapidMiner, which is unnecessary when all I want to do is manipulate the column metadata.
Is there a native way to do something similar with RapidMiner operators (or with an extension that adds such an operator)?
Otherwise, would the Groovy scripting operator be usable for this? I tried experimenting with it but could not find something that works.
Example (not functional, all attributes are seen to have a "null" role):
ExampleSet inputData = input[0];
ExampleSet referenceData = input[1];
ExampleSetMetaData inputMetaData = operator.getInputPorts().getPortByIndex(0).getMetaData();
ExampleSetMetaData referenceMetaData = operator.getInputPorts().getPortByIndex(1).getMetaData();
for (Attribute attribute: referenceData.getAttributes()) {
    AttributeMetaData referenceAttributeMetaData = referenceMetaData.getAttributeByName(attribute.getName());
    String referenceRole = referenceAttributeMetaData.getRole();
    LogService.root.log(Level.INFO, "Role for " + attribute.getName() + ": " + referenceRole);
}
Answers
I thought of a solution that combines the Filter and Append operators. It happens to do what I want, even though this behavior is not made explicit, and it seems to be working fine. I'm still curious, however, about whether the scripting operator could be used for this.
- Filter removes all rows from the "reference dataset": a dataset where the columns have the roles I want to set
- First input of the Append operator is the "reference dataset", second input is the actual data, with the same columns but without any role set
The resulting dataset will use the metadata of the first input (with the roles), but will include all rows from the actual data.
<?xml version="1.0" encoding="UTF-8"?><process version="9.6.000">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="9.6.000" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
      <operator activated="true" class="utility:create_exampleset" compatibility="9.6.000" expanded="true" height="68" name="Create ExampleSet" width="90" x="179" y="85">
        <parameter key="generator_type" value="numeric series"/>
        <parameter key="number_of_examples" value="1000000"/>
        <parameter key="use_stepsize" value="false"/>
        <list key="function_descriptions"/>
        <parameter key="add_id_attribute" value="false"/>
        <list key="numeric_series_configuration">
          <parameter key="A" value="linear.0\.0.1\.0"/>
          <parameter key="C" value="linear.0\.0.1\.0"/>
          <parameter key="F" value="linear.0\.0.1\.0"/>
          <parameter key="B" value="linear.0\.0.1\.0"/>
          <parameter key="D" value="linear.0\.0.1\.0"/>
          <parameter key="G" value="linear.0\.0.1\.0"/>
          <parameter key="E" value="linear.0\.0.1\.0"/>
        </list>
        <list key="date_series_configuration"/>
        <list key="date_series_configuration (interval)"/>
        <parameter key="date_format" value="yyyy-MM-dd HH:mm:ss"/>
        <parameter key="time_zone" value="SYSTEM"/>
        <parameter key="column_separator" value=","/>
        <parameter key="parse_all_as_nominal" value="false"/>
        <parameter key="decimal_point_character" value="."/>
        <parameter key="trim_attribute_names" value="true"/>
      </operator>
      <operator activated="true" class="utility:create_exampleset" compatibility="9.6.000" expanded="true" height="68" name="Create ExampleSet - Reference data with roles" width="90" x="179" y="187">
        <parameter key="generator_type" value="numeric series"/>
        <parameter key="number_of_examples" value="100"/>
        <parameter key="use_stepsize" value="false"/>
        <list key="function_descriptions"/>
        <parameter key="add_id_attribute" value="false"/>
        <list key="numeric_series_configuration">
          <parameter key="A" value="linear.0\.0.1\.0"/>
          <parameter key="B" value="linear.0\.0.1\.0"/>
          <parameter key="C" value="linear.0\.0.1\.0"/>
          <parameter key="D" value="linear.0\.0.1\.0"/>
          <parameter key="E" value="linear.0\.0.1\.0"/>
          <parameter key="F" value="linear.0\.0.1\.0"/>
          <parameter key="G" value="linear.0\.0.1\.0"/>
        </list>
        <list key="date_series_configuration"/>
        <list key="date_series_configuration (interval)"/>
        <parameter key="date_format" value="yyyy-MM-dd HH:mm:ss"/>
        <parameter key="time_zone" value="SYSTEM"/>
        <parameter key="column_separator" value=","/>
        <parameter key="parse_all_as_nominal" value="false"/>
        <parameter key="decimal_point_character" value="."/>
        <parameter key="trim_attribute_names" value="true"/>
      </operator>
      <operator activated="true" class="set_role" compatibility="9.6.000" expanded="true" height="82" name="Set Roles" width="90" x="313" y="187">
        <parameter key="attribute_name" value="D"/>
        <parameter key="target_role" value="regular"/>
        <list key="set_additional_roles">
          <parameter key="B" value="ignoreB"/>
          <parameter key="F" value="ignoreF"/>
          <parameter key="A" value="label"/>
          <parameter key="C" value="id"/>
        </list>
      </operator>
      <operator activated="true" class="filter_examples" compatibility="9.6.000" expanded="true" height="103" name="Filter All Examples" width="90" x="447" y="187">
        <parameter key="parameter_expression" value="false"/>
        <parameter key="condition_class" value="expression"/>
        <parameter key="invert_filter" value="false"/>
        <list key="filters_list"/>
        <parameter key="filters_logic_and" value="true"/>
        <parameter key="filters_check_metadata" value="true"/>
        <description align="center" color="transparent" colored="false" width="126">Create an empty dataset for its column's metadata</description>
      </operator>
      <operator activated="true" class="multiply" compatibility="9.6.000" expanded="true" height="103" name="Multiply Reference Data" width="90" x="648" y="187"/>
      <operator activated="true" class="order_attributes" compatibility="9.6.000" expanded="true" height="82" name="Reorder Attributes" width="90" x="849" y="85">
        <parameter key="sort_mode" value="reference data"/>
        <parameter key="attribute_ordering" value=""/>
        <parameter key="use_regular_expressions" value="false"/>
        <parameter key="handle_unmatched" value="append"/>
        <parameter key="sort_direction" value="ascending"/>
      </operator>
      <operator activated="true" class="append" compatibility="9.6.000" expanded="true" height="103" name="Append" width="90" x="1050" y="187">
        <parameter key="datamanagement" value="double_array"/>
        <parameter key="data_management" value="auto"/>
        <parameter key="merge_type" value="all"/>
      </operator>
      <connect from_op="Create ExampleSet" from_port="output" to_op="Reorder Attributes" to_port="example set input"/>
      <connect from_op="Create ExampleSet - Reference data with roles" from_port="output" to_op="Set Roles" to_port="example set input"/>
      <connect from_op="Set Roles" from_port="example set output" to_op="Filter All Examples" to_port="example set input"/>
      <connect from_op="Filter All Examples" from_port="example set output" to_op="Multiply Reference Data" to_port="input"/>
      <connect from_op="Multiply Reference Data" from_port="output 1" to_op="Reorder Attributes" to_port="reference_data"/>
      <connect from_op="Multiply Reference Data" from_port="output 2" to_op="Append" to_port="example set 1"/>
      <connect from_op="Reorder Attributes" from_port="example set output" to_op="Append" to_port="example set 2"/>
      <connect from_op="Append" from_port="merged set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>