Synthetic data is a subject that, at least for the uninitiated, is surrounded by scepticism and uncertainty. While synthetic data has existed for some time, the manner in which it is used has evolved substantially. As with all novel implementations of technology, there are teething problems and it takes time to develop best practices.

In its more recent application, synthetic data is used in place of traditional anonymised data, particularly in the training and development of AI by organisations. The main goal of this is to maintain high standards of data quality during training, validation, and testing phases while minimising the privacy impact on potential data subjects. As a result, we are seeing an increase in the use of this technology, in particular within the healthcare, financial services and educational sectors, where organisations have had to, for better or worse, rely on older anonymisation techniques to avoid the problems that arise when information can be attributed to a particular person.

In this article, we explore this challenge, common to many organisations that develop and provide AI systems, and consider how synthetic data can be used for the benefit of both organisations and wider society.

The importance of data quality

Data quality is fundamental to the training and testing of effective and fair AI systems. It is all the more important when the system will be used to influence or support decisions affecting humans, such as whether or not a person qualifies to receive a loan or mortgage. Organisations should therefore aim to obtain the highest quality data set possible. The higher the quality of the data used to train the AI system, the more likely its results will be accurate. In theory, the system should produce fewer false negatives and false positives and, ultimately, more objective, consistent, and predictable outputs.

The importance of quality data in the area of AI is understandably recognised. In fact, data quality has been deemed so important that it is recognised within a number of principles for the use and development of ethical AI and consistently features in the discussions over future regulation of the technology. For example, the OECD.AI’s principles include, among other things, direct reference to fairness and transparency.

These aims would be unattainable without the ability to gather data that is accurate enough to reflect the true position of its data subjects. The European Commission has also recognised the importance of quality data, requiring that providers of AI systems access and use high quality data sets. To ensure this, the current wording of the draft AI Act includes requirements on data quality, including quality criteria for the training, validation, and testing data sets for all high-risk AI systems[1].

Data protection challenges

Throughout the development and implementation of AI, there are a number of challenges which arise from the use of personal data and the data protection obligations that come with it.

During the training phase, and while engineers rigorously focus on data quality, simultaneous data protection risks can be easily missed or mismanaged. Organisations should therefore reflect on the fact that wherever training data relates to real people, data protection laws will likely apply. In practice, this can cause tension in interests between those trying to create the most accurate and fair model and those responsible for ensuring the privacy standards are met.

Data protection laws implement several important rules that create a number of challenges for the developer of an AI system. For example, organisations must have a legal basis for the processing activity itself, personal data used must be transparently processed, and the use of personal data should be minimised to the greatest extent possible.

Many anonymisation techniques cannot resolve the issue

How do organisations attempt to resolve this? In practice, we see organisations leveraging anonymisation techniques to avoid data protection requirements during many of the phases of AI’s use and development. After all, personal data laws will not apply to data that cannot be attributed to an individual. Although anonymisation has been cautiously recognised by data protection regulators as an effective method for protecting personal data, in reality, the utility of this technique varies significantly depending on the anonymisation strategy.

For anonymisation to be effective, the risk of re-identification must be “sufficiently remote”.[2] This excludes situations of pseudonymisation (a practice which involves potential re-identification by combining with additional information). The standards required for re-identification risks are very high[3] and in some cases, this standard cannot be met in a manner that is practicable for those involved. As a result, in many cases, a large number of data points will need to be removed or adjusted to meet the requisite standard.

In effect, the high standard of “remoteness” of identification necessary for anonymous data can significantly lower the quality of the training data. As noted above, low quality training data will typically reduce the statistical accuracy of the AI system, exposing the system (and individuals) to other potentially serious risks when deployed (such as bias or unintended decision-making processes). For example, where the AI system will be used to make recommendations relating to health care and treatment, if it has been fed data that is heavily weighted towards one type of patient, this may disadvantage the ability of other patients to receive treatment.
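The effect described above can be illustrated with a minimal sketch. It assumes one simple strategy (suppressing any record whose combination of indirect identifiers is shared by fewer than K people), which is only one of many anonymisation approaches. The records, the field choices and the value of K are all hypothetical.

```python
from collections import Counter

K = 3  # assumed minimum group size; not a figure taken from any regulator

# Hypothetical records: (age band, region) act as indirect identifiers.
records = (
    [("30-39", "urban")] * 6
    + [("40-49", "urban")] * 4
    + [("30-39", "rural")] * 2
    + [("70-79", "rural")] * 1
)

group_sizes = Counter(records)

# Suppress every record whose identifier combination is rarer than K.
kept = [r for r in records if group_sizes[r] >= K]

removed = len(records) - len(kept)
rural_before = sum(1 for r in records if r[1] == "rural")
rural_after = sum(1 for r in kept if r[1] == "rural")
print(f"removed {removed} of {len(records)} records")
print(f"rural records: {rural_before} before, {rural_after} after")
```

In this toy data set the suppressed records all belong to the smaller rural group, so a model trained on what remains would see no rural patients at all. Real data sets and real anonymisation strategies will behave differently, but the trade-off is of the same kind.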

Synthetic data as a real alternative

Enter synthetic data, a privacy-preserving alternative. Like many technical terms, there are varying definitions for synthetic data. For example, the UK’s Office for National Statistics defines synthetic data as:

“microdata records created to improve data utility while preventing disclosure of confidential respondent information”.[4]

Despite the plethora of definitions, it is most widely understood to mean data that does not relate to real people, as it has been generated artificially.

Synthetic data is usually generated from original (personal) data by intelligent models that are trained to reproduce the characteristics and structure of the original data. The dual benefit is that models trained using synthetic data should deliver very similar results to models trained on the original data when tested, but without personal data being used in the training process.
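As a rough illustration of the idea, the sketch below learns only aggregate statistics (means, standard deviations and one correlation) from a small table and then samples new, artificial records from them. It is a deliberately simple stand-in: real synthesis tools typically use far richer models, and the "original" records here are themselves simulated. All names and figures are hypothetical.

```python
import math
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def fit_model(records):
    """Learn only aggregate statistics from (age, income) pairs."""
    ages = [r[0] for r in records]
    incomes = [r[1] for r in records]
    return {
        "mean": (statistics.fmean(ages), statistics.fmean(incomes)),
        "std": (statistics.stdev(ages), statistics.stdev(incomes)),
        "rho": pearson(ages, incomes),
    }


def sample_synthetic(model, n, rng):
    """Draw n artificial records from a bivariate Gaussian with the fitted statistics."""
    (m_age, m_inc), (s_age, s_inc) = model["mean"], model["std"]
    rho = model["rho"]
    out = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        age = m_age + s_age * z1
        income = m_inc + s_inc * (rho * z1 + math.sqrt(1 - rho**2) * z2)
        out.append((age, income))
    return out


# Hypothetical "original" records, simulated here purely for illustration.
rng = random.Random(0)
original = []
for _ in range(500):
    age = rng.uniform(20, 65)
    income = 15000 + 800 * age + rng.gauss(0, 5000)
    original.append((age, income))

model = fit_model(original)
synthetic = sample_synthetic(model, 500, rng)
```

No synthetic record is copied from an original one, yet the relationship between age and income is broadly preserved. Whether a given generator preserves enough structure for a particular AI system is something that has to be checked case by case.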

Whilst initially met with scepticism, synthetic data continues to gain traction and to be recognised by regulators for its benefits. For example, the UK’s Information Commissioner’s Office has recognised synthetic data as a viable privacy-enhancing method to be applied to training data.[5] The European Data Protection Supervisor (the relevant data protection authority for all European Union institutions) has also recognised the potential added value of synthetic data from a privacy perspective and its contribution towards mitigating bias risk.[6]

Proceed with caution

Synthetic data, however, is not a silver bullet solution to the data protection vs data quality conundrum. Like all technical processes, users should proceed with caution, ensuring their use of synthetic data is carefully considered and implemented.

Firstly, before using synthetic data, organisations should ensure that the synthetic data provides a viable and valuable alternative to the originally intended training (personal) data. Where synthetic data is the most appropriate alternative, organisations should also ensure that the artificial data used is in fact reflective of the trends and circumstances it aims to replicate.

Secondly, synthetic data can be generated using many different methods, each posing different risks and implications for the data set (including from a privacy perspective). Organisations planning to use synthetic data will need to understand the method used and its implications for the training of their specific AI system. Failure to do so may result in unintended trends in the data created, leading to inaccurate decisions and bias emerging in outputs.

Finally, the use of any synthetic data requires careful implementation to work in practice. As mentioned above, original data (often personal data) is required to create synthetic data. An organisation planning to use synthetic training data will therefore still need to consider data protection obligations before and after synthesis. These include:

  • Ensuring all relevant parties are in compliance with data protection requirements prior to synthesis. For example, ensuring that there is a lawful basis for using the original personal data, deleting the original personal data within identified timeframes, and complying with transparency obligations;
  • Identifying and implementing the least intrusive methods of processing personal data to create the synthetic data set. For example, considering whether it is possible to use a synthesis program within the organisation’s own network rather than sharing the personal data with third parties to synthesise on the organisation’s behalf;
  • Prior to synthesis, reviewing the original data set to ensure that any inherent issues or biases have been identified and mitigated to the extent possible. Bias in the original data set will likely be replicated in the synthetic data. A benefit of synthetic data may be that organisations can spot these issues early and adjust the synthesis process so that a fairer and more diverse synthetic data set is used to train the system; and
  • After synthesis, reviewing the synthetic data to ensure that there is a sufficiently low re-identification risk (i.e. that someone could identify a data subject in the original data set from the synthetic data). This can be done by measuring the correlation between the original data and the synthetic data. The European Data Protection Supervisor[7] has recommended that a privacy assurance assessment is performed to ensure the synthetic data is not actually data that can still be attributed to the original subjects.
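By way of illustration only, the post-synthesis check in the final point above could start with something as simple as the sketch below, which flags synthetic records that sit suspiciously close to a real record. This nearest-record distance test is an assumed heuristic chosen for simplicity; it is not the assessment recommended by the European Data Protection Supervisor, and the values and threshold are hypothetical.

```python
import math


def nearest_distance(record, originals):
    """Euclidean distance from one synthetic record to its closest original record."""
    return min(math.dist(record, o) for o in originals)


def flag_near_copies(synthetic, originals, threshold):
    """Return synthetic records lying within `threshold` of some original record."""
    return [s for s in synthetic if nearest_distance(s, originals) <= threshold]


# Hypothetical, already-normalised (age, income) values for illustration only.
originals = [(0.20, 0.35), (0.55, 0.60), (0.90, 0.10)]
synthetic = [(0.21, 0.35), (0.40, 0.80), (0.70, 0.30)]

# The threshold is an assumption; a suitable value depends on the data set.
risky = flag_near_copies(synthetic, originals, threshold=0.05)
print(risky)
```

Here the first synthetic record is flagged because it is almost identical to an original record. A flagged record is a prompt for further review rather than proof of re-identification, and passing such a check does not by itself show that the risk is sufficiently remote.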

Synthetic data, a potential solution?

Synthetic data provides an interesting and often viable alternative to using anonymisation techniques for real-life personal data sets. It can be an effective privacy-enhancing technique while minimising any impact on training data quality. However, the use of synthetic data is not a fool-proof remedy. It is still relatively new in this form of application and the development of best practices is therefore still ongoing. In addition, access to synthesis programs and synthetic data sets is still limited and costly, which poses an additional challenge for SMEs in particular. For those organisations that do proceed to utilise this technology, the implementation of synthetic data should be carefully considered.