Can RapidMiner de-identify or make data anonymous?

User: "CraigBostonUSA"
New Altair Community Member
Updated by Jocelyn

Is it possible to anonymize or de-identify data with RapidMiner?

 

Thanks!

    User: "sgenzer"
    Altair Employee
    Accepted Answer

    Yup. Try the "Obfuscate" operator.


    Scott

    User: "earmijo"
    New Altair Community Member

    I think "Obfuscate" will mask some variables, but that's all it does. For real anonymization (k-anonymity, l-diversity, etc.), look into specialized software. I have one tool in mind; I've used it and it is very powerful:

     

    http://arx.deidentifier.org/
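
To make the two terms above concrete, here is a minimal sketch of how k-anonymity and (distinct) l-diversity can be measured on a table. It is not how ARX or RapidMiner implement anything; the function names, the column names (`age`, `zip`, `diagnosis`) and the toy records are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def equivalence_classes(rows, quasi_identifiers):
    """Group rows that share the same value on every quasi-identifier."""
    classes = defaultdict(list)
    for row in rows:
        key = tuple(row[q] for q in quasi_identifiers)
        classes[key].append(row)
    return classes


def k_anonymity(rows, quasi_identifiers):
    """Size of the smallest equivalence class, i.e. the k the table satisfies."""
    classes = equivalence_classes(rows, quasi_identifiers)
    return min(len(group) for group in classes.values())


def l_diversity(rows, quasi_identifiers, sensitive):
    """Smallest number of distinct sensitive values found in any class."""
    classes = equivalence_classes(rows, quasi_identifiers)
    return min(len({r[sensitive] for r in group}) for group in classes.values())


# Hypothetical, already-generalized toy records (not real data).
rows = [
    {"age": "30-39", "zip": "021**", "diagnosis": "flu"},
    {"age": "30-39", "zip": "021**", "diagnosis": "asthma"},
    {"age": "40-49", "zip": "021**", "diagnosis": "flu"},
    {"age": "40-49", "zip": "021**", "diagnosis": "flu"},
]
QI = ["age", "zip"]

print(k_anonymity(rows, QI))               # 2
print(l_diversity(rows, QI, "diagnosis"))  # 1
```

The toy table is 2-anonymous, yet everyone in the 40-49 group has the same diagnosis, so the sensitive value still leaks for that group. That gap is what l-diversity is meant to catch.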

     

    User: "sgenzer"
    Altair Employee

    Obfuscate will anonymize nominal attribute names and nominal data values:

     

    <?xml version="1.0" encoding="UTF-8"?>
    <process version="7.6.001">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="7.6.001" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="retrieve" compatibility="7.6.001" expanded="true" height="68" name="Retrieve Iris" width="90" x="179" y="136">
            <parameter key="repository_entry" value="//Samples/data/Iris"/>
          </operator>
          <operator activated="true" class="productivity:obfuscate" compatibility="7.6.001" expanded="true" height="82" name="Obfuscate" width="90" x="380" y="136">
            <parameter key="use_local_random_seed" value="true"/>
          </operator>
          <connect from_op="Retrieve Iris" from_port="output" to_op="Obfuscate" to_port="example set input"/>
          <connect from_op="Obfuscate" from_port="example set output" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>

    I agree that this is not extremely robust, and I would recommend a proper hashing algorithm of some kind if you truly want to protect your data and/or you have numerical values. Often I create a hash myself using a form of public key crypto rather than using Obfuscate.

     

    Thanks for suggesting that software, @earmijo - I will need to check it out.

     

    Scott
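
As an illustration of the "proper hashing" idea in the post above, here is a minimal sketch that replaces identifier values with a keyed hash. Note that it swaps in HMAC-SHA256 from the Python standard library rather than the public key approach Scott mentions, which the thread does not describe in detail. The function name `pseudonymize`, the key and the sample IDs are hypothetical.

```python
import hashlib
import hmac


def pseudonymize(value, secret_key):
    """Return the HMAC-SHA256 digest of `value` as a 64-character hex string."""
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()


# Hypothetical key: in practice it should be randomly generated and stored
# separately from the data, since anyone holding it can recompute the tokens.
SECRET_KEY = b"replace-with-a-randomly-generated-secret"

# Hypothetical identifiers (not real data).
student_ids = ["S-1001", "S-1002", "S-1001"]
tokens = [pseudonymize(s, SECRET_KEY) for s in student_ids]

# The same input always yields the same token, so joins on the column still work.
print(tokens[0] == tokens[2])  # True
print(tokens[0] == tokens[1])  # False
```

The key matters because a plain, unkeyed hash of a short identifier can be reversed by hashing every possible ID and comparing. Even with a key, this only pseudonymizes the direct identifier; other columns in the data are left untouched.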

    User: "earmijo"
    New Altair Community Member

    Hi Scott. I am currently learning about the subject. Completely masking direct identifiers (PII, personally identifiable information) is standard practice. The key problem with anonymization is what to do with quasi-identifiers. If you completely mask them, the dataset is rendered useless. Finding the right balance between usefulness and anonymity seems to be the goal. (Some researchers believe this compromise is not possible; see, for instance, "The False Promise of Anonymization": "Data can be either useful or perfectly anonymous but never both.") A nice intro to the main issues is given in this video:

     

    https://www.youtube.com/watch?v=O3hxp117EHs
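
The usefulness-versus-anonymity trade-off for quasi-identifiers can be sketched in a few lines. This is only a toy illustration under assumed column names (`age`, `zip`, `diagnosis`) and made-up records; the generalization rules (10-year age bands, 3-digit ZIP prefixes) are hypothetical choices, not a recommendation.

```python
from collections import Counter


def generalize(row):
    """Coarsen the quasi-identifiers: 10-year age bands and 3-digit ZIP prefixes."""
    low = (row["age"] // 10) * 10
    return {
        "age": f"{low}-{low + 9}",
        "zip": row["zip"][:3] + "**",
        "diagnosis": row["diagnosis"],  # the sensitive value is kept as-is
    }


def smallest_class(rows, quasi_identifiers):
    """Size of the smallest group of rows sharing all quasi-identifier values."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in rows)
    return min(counts.values())


# Hypothetical raw records (not real data).
raw = [
    {"age": 31, "zip": "02139", "diagnosis": "flu"},
    {"age": 37, "zip": "02141", "diagnosis": "asthma"},
    {"age": 42, "zip": "02115", "diagnosis": "flu"},
    {"age": 48, "zip": "02118", "diagnosis": "diabetes"},
]
QI = ["age", "zip"]
generalized = [generalize(r) for r in raw]

print(smallest_class(raw, QI))          # 1
print(smallest_class(generalized, QI))  # 2
```

In the raw table every person is unique on age and ZIP. After generalization each person shares those values with one other record, but exact ages and full ZIP codes are no longer available for analysis. Coarsening further raises the group size and lowers the usefulness again.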

     

     

    User: "sgenzer"
    Altair Employee

    That is very interesting, @earmijo. Thank you for sharing. I worked a lot with PII-sensitive data when I was freelancing (mostly FERPA compliance here in the USA, protecting the PII of student data from schools), and I like the way you phrase this tough quandary. Food for thought.

     

    Scott

    Hey @earmijo,

    Looks like a lot of brain food for me :). Since the library you linked is written in Java, have you considered embedding it into RM as operators?

     

    Cheers,

    Martin

    User: "earmijo"
    New Altair Community Member

    Hi @mschmitz

     

    That would be awesome (an extension linking both applications). Unfortunately, it is beyond my skills. I know how to drive the car, but I have no idea what's under the hood :-) 

    Hi,

     

    an important question first - how good is your Java? :)

     

    Best,

    Martin

    User: "Telcontar120"
    New Altair Community Member
    Maybe, @sgenzer, another simpler option would be to add this one to the API dev list, since it looks like some of the basics would be interoperable that way: http://arx.deidentifier.org/api/

    But clearly the best option would be to develop a full extension using their API toolkit (they have a lot of options)!
    User: "sgenzer"
    Altair Employee

    Excellent idea, @Telcontar120. Noted. @mschmitz and I are making some (very slow) progress on this project. The delay is purely my fault: I am an abysmal Java programmer, and the rest of the dev team is completely booked with RM 8.0. We will get there!

     

    Scott