08 / 08 / 2019
Is Data Anonymization Nothing More Than An Easy Guessing Game?
THE ANSWER IS NO. BUT BY THE WAY SOME CLAIM TO “ANONYMIZE” DATA, YOU’D CERTAINLY THINK SO!
A recent academic study shone a light on practices that claim to achieve anonymization but in fact fall short. On one level, it highlights the pitfalls of simply removing or obfuscating direct identifiers and sampling when attempting to anonymize datasets. However, it also exposes an uglier truth – many out there are misusing the term “anonymization” when describing how they claim to address privacy concerns regarding the use of personal data.
Guess who? Hmm. You got it right…again.
Here’s an easy way to understand what’s going on. Remember the popular board game Guess Who? Your opponent “picks” a character and you ask questions about that character’s features to progressively narrow down the list of possibilities until you can identify them. The more you know about the target, the easier it gets. Importantly, you always win; it’s just a matter of how many questions it takes.
What does that mean when it comes to anonymization? It illustrates that releasing a dataset with only its direct identifiers obfuscated is perilous: you hand over all the facts at once, so re-identification is almost certain for a dataset of any complexity. Even “anonymizing” certain attributes to make them less identifiable (generalising person names to gender, exact hair colour to “light/dark”, etc.) still leaves re-identification risk if there are enough attributes to combine. Let’s play a game of Guess Who to demonstrate this in action.
“Are you wearing a hat?”
“No.” This leaves 19 people it could be.
“Do you have yellow hair?”
“No.” That leaves 15 people.
“Do you have glasses?”
“Yes.” Down to 3 people.
“Do you have white hair?”
“No.” Congratulations, you’ve identified the person with 4 pieces of information (I’m Tom!).
If I had said “yes” to the question about wearing a hat and “yes” if you then asked if I’m a woman, you would have identified me with only 2 pieces of information (Claire).
Imagine how easy it is with a database containing tens or even hundreds of fields.
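The narrowing in the game above can be sketched in a few lines of Python. This is only an illustration: the six-person board, the names and the attributes are invented for the example and are not the real Guess Who board, so the counts differ from the ones in the game above.

```python
# Hypothetical mini-board; names and attributes are invented for illustration.
BOARD = [
    {"name": "Ana",  "hat": False, "hair": "brown",  "glasses": True},
    {"name": "Ben",  "hat": False, "hair": "white",  "glasses": True},
    {"name": "Cara", "hat": True,  "hair": "brown",  "glasses": False},
    {"name": "Dev",  "hat": False, "hair": "yellow", "glasses": False},
    {"name": "Eli",  "hat": False, "hair": "brown",  "glasses": False},
    {"name": "Fay",  "hat": True,  "hair": "yellow", "glasses": True},
]


def narrow(candidates, attribute, value):
    """Keep only the candidates whose attribute equals the given value."""
    return [c for c in candidates if c[attribute] == value]


def play(board, known_facts):
    """Apply each known fact in turn.

    Returns the surviving candidates and the number left after each fact.
    """
    remaining = list(board)
    counts = []
    for attribute, value in known_facts:
        remaining = narrow(remaining, attribute, value)
        counts.append(len(remaining))
    return remaining, counts


# Three facts about the target are enough to single out one record here.
facts = [("hat", False), ("glasses", True), ("hair", "brown")]
survivors, counts = play(BOARD, facts)
print(counts)                          # [4, 2, 1]
print([p["name"] for p in survivors])  # ['Ana']
```

Note that the name column plays no part in the matching. Removing or scrambling it would not change the result, because the remaining attributes already single out one row.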
If you can ask an unlimited number of questions, you can go through the sequence above and re-identify any individual. If some characteristics are tweaked, it becomes slightly more difficult to re-identify individuals, but you can simply ask more questions. If all attributes are completely obfuscated or jumbled, it may be impossible to re-identify anyone, but that ruins the game (or in analytical terms, the utility is destroyed). Similarly, if you have a set of descriptions like those shown below, you can match each row to someone on the board, and the more features you have, the easier it is to find the match.
The Value of True Anonymization
Based on the above, you might be wondering how valuable anonymization is at all. But not all efforts to anonymize are equally effective. Or, more accurately, not all types of “anonymization” are actually anonymization. As the recent academic study demonstrated, many datasets are being released and presented as anonymised but in fact allow for re-identification with relative ease.
True anonymization may analyse granular data but only returns aggregate statistics or calculations. These aggregate responses do not include identifiers and require that responses meet certain thresholds (minimum number of individuals represented, each individual contributing no more than x% to the results, etc.). How do you apply this to the Guess Who example? Well, you would simply refuse to answer the question about glasses or any other question that leaves fewer than some minimum threshold of people. It might not make the 8-year-old you’re playing against happy, but it prevents individuals from being easily identified with the data available.
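As a rough sketch of that refusal rule, the Python below answers a counting question only when enough people match. The records, the function name `guarded_count` and the threshold of 3 are all assumptions made up for this example. It covers only the minimum-count check, not the other safeguards mentioned above, such as capping each individual's contribution.

```python
# Hypothetical records; names and attributes are invented for illustration.
RECORDS = [
    {"name": "Ana",  "hat": False, "hair": "brown",  "glasses": True},
    {"name": "Ben",  "hat": False, "hair": "white",  "glasses": True},
    {"name": "Cara", "hat": True,  "hair": "brown",  "glasses": False},
    {"name": "Dev",  "hat": False, "hair": "yellow", "glasses": False},
    {"name": "Eli",  "hat": False, "hair": "brown",  "glasses": False},
    {"name": "Fay",  "hat": True,  "hair": "yellow", "glasses": True},
]

MIN_GROUP_SIZE = 3  # assumed threshold; a real value would be set by policy


def guarded_count(records, attribute, value, min_group=MIN_GROUP_SIZE):
    """Return how many records match, or None if the group is too small.

    Only an aggregate count ever leaves this function; the matching rows
    and their names are never returned to the caller.
    """
    matches = sum(1 for r in records if r[attribute] == value)
    if matches < min_group:
        return None  # refuse: the answer would describe too few people
    return matches


print(guarded_count(RECORDS, "hat", False))     # 4
print(guarded_count(RECORDS, "hair", "white"))  # None (only one person matches)
```

Returning `None` instead of a small number is a design choice for the sketch; a real system might also need to limit how answers to several questions can be combined.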
In the real world (board games aside), achieving true anonymization is not easy to do. Striking the balance between privacy and utility requires a sophisticated approach that incorporates innovative technological, structural, legal and organisational safeguards. All applied by true experts – in both data science and privacy. Not all solutions do this and not all solutions achieve true GDPR-grade anonymization. Then again, nobody ever said that applying anonymization was child’s play.