As AI continues to dominate the headlines, one word that keeps cropping up is “transparency”, particularly in relation to the origin of the datasets used to train AI systems. Over the last few months, we’ve seen several calls for transparency make their way into legislative discussions around how best to regulate AI, including in the draft EU AI Act, the Bletchley Declaration that was signed on the first day of the UK’s AI Safety Summit last month and a recent UK Private Members’ Bill that was introduced into the House of Lords by Lord Holmes of Richmond.
But it’s not just legislative bodies that are considering this. Established corporates are also making their views known, particularly in the US. Last week, the Data & Trust Alliance announced what are believed to be the first cross-industry standards for data provenance. These voluntary standards are designed to help companies understand where, when and how the data they manage was collected or generated, with the aim of providing increased transparency into the origin of datasets that are used in both traditional data and AI applications – a noble intention if ever there was one!
What are the new data provenance standards and what are they to be used for?
Founded in September 2020, the Data & Trust Alliance (“D&TA”) is a not-for-profit consortium of large US corporates and institutions across industries that work together to develop and adopt responsible data and AI practices.
In its latest initiative, experts from 19 companies within the D&TA, including Deloitte, IBM and Nielsen, worked together to create a set of eight proposed “Data Provenance Standards” for use in relation to both traditional data and AI use cases. The eight standards are: lineage, source, legal rights, privacy and protection, generation date, data type, generation method, and intended use and restrictions.
In essence, these standards propose a sophisticated and uniform way of labelling datasets (at the dataset level) by reference to specific metadata tags for each standard. For example, the “source” standard identifies the origin of the dataset (person, organisation, system etc.); the “generation method” standard identifies how the data was produced (e.g. through data mining, web crawling or user-generated content); the “legal rights” standard will identify things like whether any copyright-protected material is present in the dataset; and the “intended use and restrictions” standard will identify any permitted uses and express restrictions on how the dataset may be used.
Whilst labelling data in and of itself is not new, the alliance says that these are the first cross-industry standards of their kind and go beyond the traditional tags that have been used in the past.
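To make the idea of dataset-level labelling more concrete, here is a minimal sketch in Python of what such a label might look like. It is purely illustrative: the class name `ProvenanceLabel`, the field names, the example values and the `permits` helper are our own assumptions and are not taken from the D&TA's draft, which defines its own metadata. For convenience, the sketch also splits the “intended use and restrictions” standard into two fields.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ProvenanceLabel:
    """Hypothetical dataset-level label (not the D&TA's actual schema)."""
    lineage: tuple[str, ...]       # upstream datasets this one was derived from
    source: str                    # origin: person, organisation, system etc.
    legal_rights: str              # e.g. whether copyright-protected material is present
    privacy_and_protection: str    # e.g. whether personal data is present
    generation_date: date          # when the data was collected or generated
    data_type: str                 # e.g. structured or unstructured
    generation_method: str         # e.g. data mining, web crawling, user-generated
    intended_use: tuple[str, ...]  # uses the provider permits
    restrictions: tuple[str, ...] = ()  # uses the provider expressly prohibits

    def permits(self, use: str) -> bool:
        # A use is allowed only if it is listed as intended and not restricted.
        return use in self.intended_use and use not in self.restrictions


# Example values are invented for illustration only.
label = ProvenanceLabel(
    lineage=("vendor-raw-2023",),
    source="Example Data Ltd (organisation)",
    legal_rights="contains licensed copyright-protected material",
    privacy_and_protection="no personal data",
    generation_date=date(2023, 11, 30),
    data_type="structured",
    generation_method="web crawling",
    intended_use=("analytics", "model-training"),
    restrictions=("resale",),
)

print(label.permits("model-training"))
print(label.permits("resale"))
```

A business deciding whether to acquire a dataset could, on this sketch, check its planned use against the label before going any further, which is the kind of question the standards are intended to help answer.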
The standards are helpfully accompanied by a set of use cases and scenarios highlighting the sorts of questions that the standards can help to answer (and those they can’t). From these, it is clear that they can help businesses decide whether to acquire a particular dataset (e.g. for use in predictive AI modelling) and whether that dataset is appropriate for their intended purposes. More generally, they can be used to identify other risks and benefits of a particular dataset, for example by showing whether the data is structured or unstructured and how many sources were used. However, whilst the standards can be used in many different scenarios, the D&TA has expressly highlighted that it may be challenging to apply them effectively to “large language models that are trained on vast amounts of public data sourced from diverse locations on the internet”.
The standards remain in draft form for now as the D&TA continues to refine them and seeks external input from interested practitioners, but its stated aim is to have version 1 ready for release in Q2 2024.
It is becoming increasingly clear that transparency around the origin of datasets, particularly in an AI context, is vitally important. It matters not only for building trust in AI tools, but also for addressing broader questions, such as those relating to IP infringement and bias. Voluntary initiatives such as the D&TA’s proposed standards are therefore to be welcomed.
However, as the D&TA openly acknowledges, standards like this can be difficult to apply effectively to large language models like ChatGPT. Whether such models will be legally required to provide transparency over their datasets and, if so, how far those requirements will go, remains to be seen. But with the trilogue negotiations over the EU AI Act reaching their final stages and the UK’s Code of Practice on Copyright and AI eagerly anticipated, we remain hopeful of getting a steer on that in the near future.
“When implemented, the standards will provide transparency into the origin of the datasets used for both traditional data applications and a rapidly growing number of artificial intelligence (AI) applications, which is expected to enhance AI value and trustworthiness.” Data & Trust Alliance press release, 30 November 2023.