Not All Synthetic Data Is Created Equal

White paper

PREPARED BY MICHAEL FENTON, IMRAN KHAN, MAURICE COYLE, AND AOIFE SEXTON

The privacy risk contained within a synthetic dataset can be objectively quantified so that more informed decisions may be made.

One of the biggest impediments to innovation in commercial and academic spheres is access to data. Building novel algorithms and technologies requires realistic data, so that business logic, predictive model accuracy, and algorithm performance can be tested and validated. However, the original data is typically highly confidential and may contain personal data, so repurposing it for testing creates a compliance risk. Even transferring this data to secure development environments within an organization increases the company’s risk profile.

Recently, methods for generating synthetic data that resembles the original data have received a lot of attention and investment, often with claims that the synthesized dataset does not contain personal data. These techniques vary in how they balance the analytic similarity and privacy characteristics of the synthetic data. As we will describe in this paper, maximum privacy and maximum utility cannot be achieved simultaneously; a trade-off must always be made.

Understanding Privacy Risk

Data synthesis techniques that claim to preserve general analytic equivalence while simultaneously making re-identification of individuals impossible are highly unlikely to achieve both. More likely, they underestimate the risk of re-identification, and the datasets they produce carry considerable privacy risk. In this paper, we will define synthetic data and describe a number of different data synthesis techniques. We will describe how privacy and analytic utility are conflicting goals, which limits the use cases to which a given synthetic dataset may be applied. We will detail how the privacy risk contained within a synthetic dataset can be objectively quantified so that better, more informed decisions may be made, leading to increased confidence in the appropriate use of synthetic data.

Download the white paper to review several approaches to data synthesis and use cases for the datasets they produce.