Can RapidMiner de-identify or make data anonymous?

CraigBostonUSA New Altair Community Member
edited November 2024 in Community Q&A

Is it possible to anonymize or de-identify data with RapidMiner?

 

Thanks!

Answers

  • sgenzer
    Altair Employee
    Answer ✓

    Yup. Try the "Obfuscate" operator.


    Scott

  • earmijo
    earmijo New Altair Community Member

    I think "Obfuscate" will mask some variables, but that's all it does. For real anonymization (k-anonymity, l-diversity, etc.), look into specialized software. I have one in mind, ARX. I've used it and it is very powerful:

     

    http://arx.deidentifier.org/
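For readers new to the terms above, here is a minimal sketch of how k-anonymity and l-diversity can be measured on a small table. It is not how ARX or RapidMiner implement anything; the function names, column names, and records are all hypothetical and chosen only for illustration.

```python
from collections import Counter

def k_anonymity(rows, quasi_identifiers):
    """Size of the smallest group of rows that share the same
    quasi-identifier values (the k in k-anonymity)."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(groups.values()) if groups else 0

def l_diversity(rows, quasi_identifiers, sensitive):
    """Smallest number of distinct sensitive values found in any group."""
    values = {}
    for row in rows:
        key = tuple(row[q] for q in quasi_identifiers)
        values.setdefault(key, set()).add(row[sensitive])
    return min((len(v) for v in values.values()), default=0)

# Hypothetical toy records; column names are assumptions for this sketch.
records = [
    {"age_band": "30-39", "zip3": "021", "diagnosis": "flu"},
    {"age_band": "30-39", "zip3": "021", "diagnosis": "asthma"},
    {"age_band": "40-49", "zip3": "021", "diagnosis": "flu"},
    {"age_band": "40-49", "zip3": "021", "diagnosis": "flu"},
]
QI = ["age_band", "zip3"]

k = k_anonymity(records, QI)               # every group has 2 rows
l = l_diversity(records, QI, "diagnosis")  # one group has a single diagnosis
print(f"k = {k}, l = {l}")
```

In this toy table every combination of quasi-identifiers appears twice, yet one group contains only a single diagnosis, so knowing someone is in that group still reveals the sensitive value. That gap is what l-diversity is meant to catch.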

     

  • sgenzer
    Altair Employee

    Obfuscate will anonymize nominal attribute names and nominal data values:

     

    <?xml version="1.0" encoding="UTF-8"?>
    <process version="7.6.001">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="7.6.001" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="retrieve" compatibility="7.6.001" expanded="true" height="68" name="Retrieve Iris" width="90" x="179" y="136">
            <parameter key="repository_entry" value="//Samples/data/Iris"/>
          </operator>
          <operator activated="true" class="productivity:obfuscate" compatibility="7.6.001" expanded="true" height="82" name="Obfuscate" width="90" x="380" y="136">
            <parameter key="use_local_random_seed" value="true"/>
          </operator>
          <connect from_op="Retrieve Iris" from_port="output" to_op="Obfuscate" to_port="example set input"/>
          <connect from_op="Obfuscate" from_port="example set output" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>

    I agree that this is not extremely robust, and I would recommend a proper hashing algorithm of some kind if you truly want to protect your data and/or you have numerical values. I often create a hash myself using a form of public key crypto rather than use Obfuscate.
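As a rough illustration of the hashing idea, the sketch below replaces identifiers with a keyed hash (HMAC-SHA256) from Python's standard library. Note that this is a swapped-in technique, not the public key approach mentioned above and not anything built into RapidMiner; the function name, key, and sample IDs are hypothetical. A key is used because a plain, unkeyed hash of low-entropy values such as numeric IDs can be reversed by simply hashing every candidate value.

```python
import hashlib
import hmac

def pseudonymize(value, secret_key):
    """Return a hex HMAC-SHA256 digest of `value` under `secret_key`.

    Numbers are converted to text first, so numerical identifiers work too.
    """
    message = str(value).encode("utf-8")
    return hmac.new(secret_key, message, hashlib.sha256).hexdigest()

# Hypothetical key for this sketch; a real one should be long, random,
# and stored separately from the data.
SECRET_KEY = b"replace-with-a-long-random-secret"

student_ids = [1001, 1002, 1001]  # made-up identifiers
tokens = [pseudonymize(s, SECRET_KEY) for s in student_ids]

for original, token in zip(student_ids, tokens):
    print(original, "->", token[:16], "...")
```

The same input always maps to the same token, so joins and group-bys on the hashed column still work, while anyone without the key cannot rebuild the mapping by brute force over the ID range. This only covers direct identifiers; it does nothing about quasi-identifiers.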

     

    Thanks for the suggestion for that software, @earmijo - I will need to check that out.

     

    Scott

  • earmijo
    earmijo New Altair Community Member

    Hi Scott. I am currently learning about the subject. The handling of PII (personally identifiable information) by completely masking it is standard. The key problem with anonymization is what to do with quasi-identifiers: if you completely mask them, the dataset is rendered useless. Finding the right balance between usefulness and anonymity seems to be the goal. (Some researchers believe this compromise is not possible. See for instance The False Promise of Anonymization: "Data can be either useful or perfectly anonymous but never both.") A nice intro to the main issues is given in this video:

     

    https://www.youtube.com/watch?v=O3hxp117EHs
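The trade-off described above can be made concrete with a small sketch. It generalizes two quasi-identifiers (age into bands, ZIP code into a prefix) and reports the smallest group size at each level. The records, column names, and helper functions are hypothetical, and real tools search over many such generalizations rather than trying two by hand.

```python
from collections import Counter

def generalize(row, age_width, zip_digits):
    """Coarsen the quasi-identifiers; leave the analysis column untouched."""
    low = (row["age"] // age_width) * age_width
    masked = "*" * (len(row["zip"]) - zip_digits)
    return {
        "age": f"{low}-{low + age_width - 1}",
        "zip": row["zip"][:zip_digits] + masked,
        "grade": row["grade"],
    }

def smallest_group(rows, quasi_identifiers):
    """Size of the smallest set of rows sharing all quasi-identifier values."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in rows)
    return min(counts.values())

# Made-up records for illustration only.
students = [
    {"age": 14, "zip": "02139", "grade": "B"},
    {"age": 15, "zip": "02138", "grade": "A"},
    {"age": 16, "zip": "02141", "grade": "C"},
    {"age": 17, "zip": "02144", "grade": "B"},
]
QI = ["age", "zip"]

mild = [generalize(s, age_width=2, zip_digits=4) for s in students]
strong = [generalize(s, age_width=10, zip_digits=3) for s in students]

print("raw   :", smallest_group(students, QI))  # every row is unique
print("mild  :", smallest_group(mild, QI))      # rows pair up
print("strong:", smallest_group(strong, QI))    # all rows look alike
```

With the raw data every student is unique on age and ZIP. The mild setting groups them in pairs, and the strong setting makes all four indistinguishable, at which point age and ZIP no longer say anything useful about the grades. That is the balance between usefulness and anonymity in miniature.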

     

     

  • sgenzer
    Altair Employee

    That is very interesting, @earmijo. Thank you for sharing. I worked a lot with PII-sensitive data when I was freelancing (mostly FERPA compliance here in the USA, protecting the PII in student data from schools), and I like the way you phrase this tough quandary. Food for thought.

     

    Scott

  • MartinLiebig
    Altair Employee

    Hey @earmijo,

    Looks like a lot of brain food for me :). Since the library you linked is Java, have you considered embedding it into RM as operators?

     

    Cheers,

    Martin

  • earmijo
    earmijo New Altair Community Member

    Hi @mschmitz

     

    That would be awesome (an extension linking both applications). Unfortunately, it is beyond my skills. I know how to drive the car, but I have no idea what's under the hood :-) 

  • MartinLiebig
    Altair Employee

    Hi,

     

    an important question first - how good is your Java? :)

     

    Best,

    Martin

  • Telcontar120
    Telcontar120 New Altair Community Member
    Maybe, @sgenzer, another simpler option would be to add this one to the API dev list, since it looks like some of the basics would be interoperable that way: http://arx.deidentifier.org/api/

    But clearly the best option would be to develop a full extension using their API toolkit (they have a lot of options)!
  • sgenzer
    Altair Employee

    Excellent idea, @Telcontar120. Noted. @mschmitz and I are making some (very slow) progress on this project. The delay is purely my fault: I am an abysmal Java programmer, and the rest of the dev team is completely booked with RM8.0. We will get there!

     

    Scott