28 / 04 / 2021
Anonymizing data: everything you need to know
Forward-thinking organizations around the world are looking towards data anonymization as a proactive solution for overcoming the commercial challenges presented by increasingly stringent data privacy regulations and rising consumer concerns. If carried out effectively, anonymizing data can both protect the privacy rights of individual data subjects and preserve the goals of revenue-generating strategies that rely on data for success. But as your organization looks towards data anonymization tools, it’s important to understand the what, why and how of true anonymization in order to ensure you move forward with an effective solution.
What is data anonymization?
Data anonymization is the process of protecting private or sensitive information so that individuals cannot be directly or indirectly identified from the personal data. The aim is to irreversibly prevent the identifiability of an individual from the data that is held; only then can it be considered effectively and sufficiently anonymized.
It’s important to be aware that even when you clear datasets of personal data – the direct identifiers – that data is still at risk of re-identification, whereby the controller or another party has the ability to associate records with an individual using combinations of fields (quasi-identifiers).
Under the strict rules of the GDPR, if you intend to use and store data for an indefinite period of time, all identifiers, both the direct and indirect identifiers, must be removed from the data to ensure that individuals are protected.
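To illustrate how quasi-identifiers can undo the removal of direct identifiers, here is a minimal Python sketch of a linkage check. The field names (birth_year, postcode, diagnosis), the sample records and the link function are all hypothetical and chosen only for illustration.

```python
def link(released, auxiliary, quasi_identifiers):
    """Return (name, released_row) pairs where the quasi-identifiers
    match exactly one person in the auxiliary dataset."""
    index = {}
    for person in auxiliary:
        key = tuple(person[q] for q in quasi_identifiers)
        index.setdefault(key, []).append(person)

    matches = []
    for row in released:
        key = tuple(row[q] for q in quasi_identifiers)
        candidates = index.get(key, [])
        if len(candidates) == 1:  # an unambiguous match re-identifies the row
            matches.append((candidates[0]["name"], row))
    return matches


# Hypothetical "de-identified" release: names removed, quasi-identifiers kept.
released = [
    {"birth_year": 1985, "postcode": "AB1", "diagnosis": "asthma"},
    {"birth_year": 1990, "postcode": "CD2", "diagnosis": "diabetes"},
]
# Hypothetical second source that still carries names.
auxiliary = [
    {"name": "A. Example", "birth_year": 1985, "postcode": "AB1"},
    {"name": "B. Example", "birth_year": 1990, "postcode": "CD2"},
    {"name": "C. Example", "birth_year": 1990, "postcode": "CD2"},
]

matches = link(released, auxiliary, ["birth_year", "postcode"])
```

In this toy data the first released row matches exactly one named person and is re-identified, while the second row matches two people and stays ambiguous.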
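The Article 29 Working Party's Opinion 05/2014 on anonymisation techniques, guidance that is widely relied on when interpreting the GDPR, outlines re-identification risks as follows: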
- Singling out: when it is possible to distinguish or isolate the data relating to one individual from all other information within that dataset. For example, a dataset may contain the weight of individuals, where only one person is 64kg; since this is a unique value relating to one specific individual, they could be singled out.
- Linkability: when there is the possibility to link quasi-identifiers across datasets, making it much easier to identify an individual. When anonymized data is combined or matched with another source (which is often the case in business), a major risk is posed and highly accurate re-identification can occur, despite the anonymization that took place.
- Inference: when there is a possibility to infer a link between two pieces of personal information in a dataset, even though the information is not explicitly linked. One example of this would be if the dataset contained statistics on salaries and levels of seniority. An inference could be drawn between the two pieces of information, which could enable an individual to be identified.
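The singling-out risk described above can be checked mechanically. The following Python sketch, which assumes records are held as simple dictionaries with hypothetical field names, flags every record whose values for the chosen fields are unique within the dataset.

```python
from collections import Counter


def singled_out(records, fields):
    """Return the records whose combination of `fields` values
    appears exactly once in the dataset."""
    def key(record):
        return tuple(record[f] for f in fields)

    counts = Counter(key(r) for r in records)
    return [r for r in records if counts[key(r)] == 1]


# Hypothetical dataset echoing the weight example above.
people = [
    {"weight_kg": 64, "postcode": "AB1"},
    {"weight_kg": 80, "postcode": "AB1"},
    {"weight_kg": 80, "postcode": "AB1"},
    {"weight_kg": 72, "postcode": "CD2"},
    {"weight_kg": 72, "postcode": "CD2"},
]

unique_rows = singled_out(people, ["weight_kg"])
```

Here only the 64 kg record is returned, because every other weight is shared by at least two people.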
Why are businesses anonymizing data?
So, why anonymize data in the first place? Anonymizing data enables you to share and utilize your data, whilst preserving privacy, which creates a number of commercial benefits.
By intentionally removing the identifiers and carefully reducing the risk of re-identification, you ensure that the data is no longer personal data; this means that the data subject rights, requirements and restrictions imposed by the GDPR no longer apply. As a result of “switching off” the GDPR, organizations can tap into the full value of their anonymized data to meet their business needs.
Anonymizing data also opens up a number of other commercial benefits. When handling personal data, organizations need to support the rights of data subjects (for example, the right to be forgotten and the right of access). Not only does this cause significant instability in the data universe, but it also demands significant resources and cost and carries organizational risk, all of which can prevent your organization from driving forward at pace. Tapping into the right data anonymization tools alleviates a number of these challenges.
In addition to this, the GDPR lays out strict rules around data retention, namely that the period for which data is retained must be justified under a lawful basis. By anonymizing data, an organization can store and analyze the data for as long as it is useful, which is critical for those who want to lawfully conduct longitudinal analyses.
Organizations that invest in anonymizing data not only widen their lens to deliver more meaningful insights, they also demonstrate to consumers that data privacy is of high importance. In a privacy-conscious economy, this ability to build and preserve trust is the key to leveraging long-term loyalty. Data that is effectively anonymized cannot be joined to other datasets; therefore, individuals cannot be re-identified and their fundamental rights to privacy are fully protected.
As with any significant societal shift, it is the businesses that evolve at pace to overcome the challenges that will gain recognition for taking the stance global consumers are calling for. Recognition as a privacy leader in a data-led economy is now a key business differentiator, so there has never been a more opportune time to tap into an expert data anonymization service to gain that competitive advantage.
Data anonymization techniques
Anonymization techniques need to be carefully considered by data controllers based on a variety of risk factors. Some of the key techniques for anonymizing data are outlined below. It should be noted that many of the data anonymization techniques listed below can be configured to transform a data field to a lesser or greater extent. Transforming a field to a significant degree increases the privacy protection; however, it also reduces analytical utility. Thus, while these techniques can result in anonymized data, quite frequently when put into practice, they do not reduce re-identification risk to the degree required to meet the definition of anonymized data.
Encryption covers a number of different algorithms that enhance the privacy of data by converting it into an unintelligible form through a reversible process. Encryption has primarily been used by organizations to protect data-in-transit and data-at-rest, so that a third party without the correct decryption key will not be able to read or understand the data. Typically, to use or analyze the data, it must first be decrypted. A technique called homomorphic encryption enables protection of data-in-use, by enabling analysis to be performed over data without having to decrypt it first.
Tokenization is the process of converting an identifiable piece of information into an unintelligible sequence of characters and numbers – referred to as a token – by using any of the well-known techniques such as a hash algorithm or format-preserving encryption algorithm. Hash algorithms are cryptographic functions that are designed to be one-way, so that they can’t be reverse-engineered to derive the original value.
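As a rough illustration of hash-based tokenization, the sketch below uses Python's standard hmac and hashlib modules to derive a token from an email address. The tokenize function, the key and the sample values are hypothetical. A keyed hash (HMAC) is used here rather than a plain hash as an assumption of this example, since plain hashes of guessable values such as emails or phone numbers can be matched by hashing candidate values.

```python
import hashlib
import hmac


def tokenize(value: str, secret_key: bytes) -> str:
    """Derive a deterministic token from `value` using HMAC-SHA256."""
    # Normalize so that trivially different spellings map to the same token.
    normalized = value.strip().lower().encode("utf-8")
    return hmac.new(secret_key, normalized, hashlib.sha256).hexdigest()


# Hypothetical key for illustration; a real key would be generated and
# stored securely, separately from the tokenized data.
key = b"example-key-not-for-production"

token_a = tokenize("alice@example.com", key)
token_b = tokenize(" Alice@Example.com ", key)
```

Because the same input always yields the same token, records can still be joined on the tokenized field; that is also why tokenization on its own typically results in pseudonymized rather than anonymized data.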
Perturbation (noise addition) is a privacy-preserving technique that adds ‘noise’ to a dataset. As an example, mathematical functions (addition, multiplication or more sophisticated equations) can be applied to numerical fields to alter the value. Values are typically perturbed by a random amount of ‘noise’ and the random increases or decreases may be evenly distributed across a column to preserve aggregate statistics for the column. Adding certain types of noise to the results of a query over a dataset, or when generating a dataset, can achieve a mathematical guarantee of privacy such as differential privacy. Differential privacy means that if the records for any given data subject are removed from a dataset, the outputs for a query may differ from those generated from the entire dataset by no more than the degree specified by epsilon, the “privacy parameter”. A related guarantee, k-anonymity, means that for a set of quasi-identifying fields, each data subject is indistinguishable from at least k-1 others; it is more commonly achieved through generalization and suppression than through noise.
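One common way of adding noise to a query result is the Laplace mechanism. The Python sketch below applies it to a simple counting query, assuming a sensitivity of 1 (adding or removing one person changes the count by at most 1). The function names and salary figures are hypothetical, and the floating-point sampling is for illustration only rather than a hardened implementation.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) noise as the difference of two
    independent exponential samples that each have mean `scale`."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def noisy_count(values, predicate, epsilon: float, rng: random.Random) -> float:
    """Count the values matching `predicate`, then add Laplace noise
    with scale = sensitivity / epsilon (sensitivity is 1 for a count)."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1.0 / epsilon, rng)


rng = random.Random(0)  # seeded only to make the example repeatable
salaries = [31000, 42000, 58000, 61000, 75000, 90000]  # hypothetical data
released = noisy_count(salaries, lambda s: s > 50000, epsilon=0.5, rng=rng)
```

A smaller epsilon means more noise and stronger protection; a larger epsilon keeps the released count closer to the true value of 4 at the cost of weaker protection.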
Generalization is a technique that takes a precise value and makes it imprecise to hide the original value of each field while preserving some analytical value. For example, a precise street address could be generalized to just the street, county or country level; or an occupation could be generalized to the industry level. To generalize a field, an appropriate “generalization hierarchy” is required so that an appropriate level of the hierarchy may be chosen.
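A generalization hierarchy can be represented very simply. In the Python sketch below, the address levels, the occupation-to-industry mapping and both function names are hypothetical examples rather than a standard scheme.

```python
# Hypothetical hierarchy for an address, ordered from most to least precise.
ADDRESS_LEVELS = ["street_address", "street", "county", "country"]

# Hypothetical one-step hierarchy from occupation to industry.
OCCUPATION_TO_INDUSTRY = {
    "cardiologist": "healthcare",
    "nurse": "healthcare",
    "welder": "manufacturing",
}


def generalize_address(address: dict, level: str) -> dict:
    """Keep only the chosen level of the hierarchy and anything coarser."""
    keep_from = ADDRESS_LEVELS.index(level)
    return {field: address[field] for field in ADDRESS_LEVELS[keep_from:]}


def generalize_occupation(occupation: str) -> str:
    """Replace an occupation with its industry, or 'other' if unmapped."""
    return OCCUPATION_TO_INDUSTRY.get(occupation, "other")


address = {
    "street_address": "12 High Street",
    "street": "High Street",
    "county": "Kent",
    "country": "UK",
}
county_level = generalize_address(address, "county")
industry = generalize_occupation("nurse")
```

Choosing a coarser level (for example "country" instead of "county") increases protection and reduces analytical detail, which is the trade-off described above.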
Masking is a technique which hides original data by replacing it with a default character to protect the more identifiable portion of the field. For example, the first six digits of a credit card number might be preserved (the numbers that indicate the issuing bank) while the remainder of the digits are replaced with “0” if the field is numerical, or “X” if it is a string, thus protecting the sensitive data.
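The credit card example above could be sketched in Python as follows. The mask_card_number function and its parameters are hypothetical, and the card number is a commonly used test value rather than a real account.

```python
def mask_card_number(card_number: str, keep: int = 6, fill: str = "X") -> str:
    """Preserve the first `keep` digits and replace the remaining digits
    with `fill`, leaving separators such as spaces untouched."""
    digits_seen = 0
    masked = []
    for ch in card_number:
        if ch.isdigit():
            digits_seen += 1
            masked.append(ch if digits_seen <= keep else fill)
        else:
            masked.append(ch)  # keep spaces or dashes as they are
    return "".join(masked)


masked_text = mask_card_number("4111 1111 1111 1111")
# → "4111 11XX XXXX XXXX"
masked_numeric = mask_card_number("4111111111111111", fill="0")
# → "4111110000000000"
```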
Binning is a technique that takes exact values which are then grouped – or bucketed – into smaller groups and replaced by a general value that makes them less identifiable.
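As a small illustration of binning, the Python sketch below groups exact ages into fixed-width bands. The bin_age function, the ten-year width and the sample ages are assumptions made for the example.

```python
def bin_age(age: int, width: int = 10) -> str:
    """Replace an exact age with the label of the band it falls into."""
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"


ages = [23, 27, 34, 41, 49]  # hypothetical exact values
binned = [bin_age(a) for a in ages]
# → ["20-29", "20-29", "30-39", "40-49", "40-49"]
```

Wider bands make individuals harder to distinguish but leave less detail for analysis.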
Redaction is a technique whereby some, or all, of a field is deleted entirely, rather than being replaced with default characters.
Data synthesis is a technique whereby a new dataset is generated based on a learned model of an original dataset. There are a number of approaches to synthesizing data, each with varying levels of sophistication and implications on the risk profile and analytical usefulness of the synthetic data.
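To make the idea of a "learned model" concrete, here is a deliberately naive Python sketch: the model is simply the observed values of each column, and synthetic rows are built by sampling each column independently. The function names and data are hypothetical. This approach discards relationships between columns and offers no privacy guarantee by itself (a synthetic row can coincide with a real one), so it should be read as an illustration of the concept, not a recommended method.

```python
import random


def fit_marginals(records):
    """'Learn' a model: the list of observed values for each column."""
    columns = records[0].keys()
    return {col: [r[col] for r in records] for col in columns}


def synthesize(marginals, n, rng):
    """Generate n rows by sampling every column independently,
    in proportion to how often each value was observed."""
    return [
        {col: rng.choice(values) for col, values in marginals.items()}
        for _ in range(n)
    ]


original = [  # hypothetical source data
    {"age_band": "20-29", "region": "north"},
    {"age_band": "20-29", "region": "south"},
    {"age_band": "30-39", "region": "south"},
]
model = fit_marginals(original)
synthetic = synthesize(model, n=5, rng=random.Random(0))
```

More sophisticated approaches model the joint structure of the data, which improves analytical usefulness and changes the risk profile, as noted above.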
What is the difference between data masking, de-identification and anonymization?
These terms – data masking, de-identification and anonymization – are widely used (and misused) when talking about data privacy, but it is important to have a clear understanding of each.
Effective anonymization prevents any future re-identification of an individual, even by the data controller. The process involves transforming direct and indirect re-identification risks so that individuals cannot be re-identified through singling out, linkability or inference. When the application of the anonymization techniques listed above results in data where re-identification can occur, the result is pseudonymization/pseudonymized data rather than anonymization/anonymized data (see below).
However, de-identification may preserve identifying information, which means it is capable of being re-linked in certain scenarios, especially where the original data is retained and stored by the same controller. In the USA, anonymization and de-identification are often used interchangeably. However, in the EU, de-identification typically refers to removing, altering or tokenizing direct identifiers. As mentioned above, data masking is a technique used to remove or hide information that can directly identify an individual (e.g. their name or phone number).
What is the difference between anonymization and pseudonymization?
Another term that has come to the forefront of data privacy in recent times is that of pseudonymization. The primary difference between pseudonymization and anonymization is that anonymized data can no longer be considered as personal data whereas pseudonymized data is still considered as personal data.
If data is effectively anonymized, it should not be possible to re-identify an individual using “all means reasonably likely to be used by the controller or any party”, taking into account the technology available at that time. “All means reasonably likely to be used” includes linking fields in the anonymized dataset with the same fields in the original data. This means that, where the same controller has the original and anonymized datasets, the level of transformation required to achieve this threshold is likely to be overly destructive to the dataset’s analytical usefulness. Taking “technology available at the time” into account means that, as more powerful computing technology comes to the fore, such as quantum computers, formerly “anonymized data” may not be considered anonymous anymore.
Pseudonymized data, on the other hand, is still personal data and, as such, it still falls under the full scope of the data protection legislation. When pseudonymizing data, direct identifiers should not be intelligible and any quasi-identifiers in the data should be sufficiently transformed so that they cannot be used to re-identify an individual without the use of supplementary information (such as mapping tables that link unintelligible tokens with identifiable individuals). It is also worth noting that when pseudonymizing data, certain data subject rights do not need to be considered if the data is sufficiently removed from the original data that re-identification cannot occur without additional data that is held separately from the pseudonymized dataset.
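The role of the mapping table can be shown with a short Python sketch. The pseudonymize function, the customer_id field and the sample records are hypothetical; random tokens from the standard secrets module are one possible choice for the unintelligible values.

```python
import secrets


def pseudonymize(records, id_field):
    """Replace a direct identifier with a random token and return both
    the transformed records and the mapping table (original -> token)."""
    mapping = {}
    output = []
    for record in records:
        original = record[id_field]
        if original not in mapping:
            mapping[original] = secrets.token_hex(8)  # 16 random hex characters
        output.append({**record, id_field: mapping[original]})
    return output, mapping


orders = [  # hypothetical records
    {"customer_id": "C-1001", "basket_total": 42.50},
    {"customer_id": "C-1002", "basket_total": 13.99},
    {"customer_id": "C-1001", "basket_total": 7.25},
]
pseudo, mapping = pseudonymize(orders, "customer_id")

# Whoever holds the mapping table can reverse the process, which is why
# it counts as supplementary information and must be held separately.
reverse = {token: original for original, token in mapping.items()}
```

The transformed records remain personal data precisely because the mapping table, in the hands of the controller, restores the link to each individual.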
Data anonymization tools: what you need to know
As more and more data-driven organizations look to anonymization as a solution to overcoming data privacy challenges and an opportunity to drive growth, it is important to distinguish between those who offer pseudonymization services or software and those who provide a true, irreversible anonymization solution.
Risk quantification is key to effective anonymization; if you can’t measure data privacy risk, you can’t manage that risk. Under-anonymizing data leaves companies exposed to too much risk, while over-anonymizing it leads to inaccurate analytics. Many of today’s privacy-enhancing tools do not provide a before and after view of quantified risk levels, which means that many companies who use such tools are leaving themselves exposed and unnecessarily out of pocket. In addition, a number of data anonymization tools require organizations to have deep technical and legal expertise, which is a significant resource cost if provided for and a significant risk if overlooked.
Data anonymization tools are not one-size-fits-all solutions. Privacy-enhancing technologies (PETs) must be configured for the specific scenario at hand, since each context has its own requirements and each PET has areas of strength and weakness. In order to maximize data utility while preserving privacy, organizations need to look to the right technologies powered by the right expertise, and ensure that they are signing up for a solution that understands the ins and outs of effective anonymization to avoid data risk and accelerate data-driven innovation.
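As one simple example of what a before and after risk measurement could look like, the Python sketch below reports the k of k-anonymity (the size of the smallest group of records sharing the same quasi-identifier values) for a dataset before and after a transformation. The metric choice, field names, data and transformation are all assumptions of this example; real tools may quantify risk quite differently.

```python
from collections import Counter


def k_anonymity(records, quasi_identifiers):
    """Return the size of the smallest group of records that share the
    same quasi-identifier values (k = 1 means someone is unique)."""
    groups = Counter(
        tuple(r[q] for q in quasi_identifiers) for r in records
    )
    return min(groups.values())


def transform(record):
    """Hypothetical transformation: ten-year age bands and the outward
    part of the postcode only."""
    return {
        "age": (record["age"] // 10) * 10,
        "postcode": record["postcode"].split()[0],
    }


before = [  # hypothetical records
    {"age": 23, "postcode": "AB1 2CD"},
    {"age": 27, "postcode": "AB1 3EF"},
    {"age": 34, "postcode": "CD2 4GH"},
    {"age": 38, "postcode": "CD2 5IJ"},
]
after = [transform(r) for r in before]

k_before = k_anonymity(before, ["age", "postcode"])  # → 1
k_after = k_anonymity(after, ["age", "postcode"])    # → 2
```

Comparing the two numbers shows whether a given configuration has actually reduced the risk, and by how much, before any analytical detail is given up unnecessarily.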