AI training data play a key role in the development of AI systems. However, they carry the risk of being inaccurate, discriminatory or imbalanced, and can therefore trigger significant liability claims. Neither current nor draft EU law addresses this liability risk adequately. Contractual agreements between the parties involved therefore need to fill this void.

Artificial Intelligence (AI) represents one of the most critical technologies of the 21st century due to its massive impact on our (future) lives. Whilst there are various legislative approaches to address the use of AI itself (most notably the EU draft for an AI Act), the same cannot be said about the underlying data, i.e., the data that are necessary to train these systems. This is all the more astonishing as these training data are almost as relevant as the AI itself. In particular for all forms of “supervised learning” (on which most of the current AI systems are built), the amount and quality of the training data are critical. Therefore, companies wanting to reap the benefits of integrating AI systems into their operations will need to obtain as much high-quality training data as possible.

Where do these data come from? In general, there are three different approaches to generating training data for AI systems:

  • Data collection and annotation: The training data may be collected in a raw format and then be annotated. Annotation consists mainly of the mundane task of connecting the relevant raw data with a description, e.g., describing a picture showing a horse as such. Collecting and annotating training data has the advantage of self-reliance but is often costly. Moreover, a company may have difficulties gathering sufficient training data for various reasons. In these cases, the AI system would not reach the level required to perform the intended tasks.

  • Data purchase: A speedier process is to buy a set of training data that is ready for use. A market for data providers has been established in recent years and is continuously growing. Depending on the application area, such specialized data providers offer already annotated training data for AI systems in many different fields.

  • Use of synthetic training data: Another option is synthetic training data. Synthetic data is information that is artificially manufactured rather than generated by real-world events. It is created on the basis of existing data, mainly through minor variations of them. Specialized providers already offer such services.

Liability constellations in the event of erroneous training data

There are two main risks associated with training data, depending on the application area and the origin of the data: First, data can be wrong or of low quality. Such quality concerns may relate to the accuracy of the data but also to how up to date they are. Second, data can be biased or at least imbalanced. Biased training data, although substantively accurate, typically reflect inherent biases of societies, for example, by overrepresenting middle-aged, middle-class, white, heterosexual males from a developed country.
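
The imbalance risk described above can be made measurable. The following is a minimal sketch, assuming the training records carry a group attribute and the parties have agreed on a minimum share per group; the function names, the sample data and the 15% threshold are hypothetical and only serve as illustration.

```python
from collections import Counter


def group_shares(labels):
    """Return each group's share of the dataset as a fraction of all records."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()}


def underrepresented(labels, min_share):
    """List (alphabetically) the groups whose share is below min_share.

    min_share is a hypothetical, contractually agreed threshold.
    """
    shares = group_shares(labels)
    return sorted(group for group, share in shares.items() if share < min_share)


# Hypothetical age-group attribute of ten training records.
sample = ["30-50"] * 8 + ["under 30"] + ["over 50"]

print(group_shares(sample))
print(underrepresented(sample, min_share=0.15))
```

Such a check only detects imbalance with respect to attributes that are actually recorded; it says nothing about the accuracy of individual records.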

The consequence of erroneous training data can be malfunctions of the AI system, leading to incorrect or, in any case, undesirable outputs. Such malfunctions can have significant consequences, including personal injuries, for example, when autonomous cars cause traffic accidents because of incorrect visual recognition data. In addition, individuals may be denied key life opportunities, for example when applying for jobs, because AI systems make decisions based on biased training data. As these examples show, the type and composition of training data may lead not only to bias but also to potential liability. This situation becomes even more complicated if the provider of the data and the operator of the AI system are different entities that are independent from one another.

The liability of the data provider towards the AI system operator

In these situations, the data providers may be liable to the AI system operator. The starting point for any such liability consideration would be the statutory risk allocations. The permanent transfer of training data in return for payment constitutes a purchase contract. According to the law of purchases, the purchaser, i.e., the AI operator, has warranty claims if the purchased item deviates from the agreed quality. The temporary transfer of training data in return for payment is usually seen as a license agreement with elements of a lease. Given that all these contractual relationships would typically involve some sort of warranties in the most general sense, both parties have an interest in agreeing more precisely what it means for either party if the training data provided turn out to be erroneous or biased.

Particular care is also required not to simply adopt the language that many standard contracts provide with regard to the warranty for data delivered. Not infrequently, such data purchase agreements include far-reaching warranties by the data provider that the data supplied is free of errors. However, such far-reaching, unilateral guarantees usually do not align with the interests at stake. They may disproportionately favor the AI system developer or operator, which comes at the risk of obstructing the emerging market for training data: in certain situations, data providers might refrain from providing training data due to liability risks.

A more granular approach to contracts would better reflect the interests of both parties and the general public. One way could be for the contracting parties to require a specific data quality based on recognized standards depending on the AI system's application area:

  • Application area of AI system: For high-risk AI systems, the parties may agree on a high-quality standard of training data. In contrast, lower training data quality requirements might apply to AI systems operating in less risky or almost completely risk-free domains.

  • Performance catalog: The contractual liability of the data provider should not cover every erroneous data point but should arise only where a previously defined data performance catalog is infringed, possibly only in the case of a qualified infringement. The catalog could be drafted along the lines of IT service level agreements.

  • Liability limitations: There should be clear rules on when the data provider's liability is excluded (e.g., in cases of force majeure) and possible maximum liability limits.
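
A performance catalog of this kind can be expressed as a small set of measurable thresholds per risk tier, much like the metrics in a service level agreement. The following is a minimal sketch under stated assumptions: the metrics (label error rate, record age), the tier names and all numeric values are hypothetical and are not taken from any standard or statute.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PerformanceCatalog:
    """Hypothetical contractual thresholds for a delivered training data set."""
    max_label_error_rate: float  # tolerated share of wrongly annotated records
    max_record_age_days: int     # tolerated age of the oldest record in days


# Illustrative tiers: stricter thresholds for high-risk application areas.
CATALOGS = {
    "high_risk": PerformanceCatalog(max_label_error_rate=0.01, max_record_age_days=90),
    "low_risk": PerformanceCatalog(max_label_error_rate=0.05, max_record_age_days=365),
}


def breaches(catalog, label_error_rate, oldest_record_age_days, force_majeure=False):
    """Return the names of the breached metrics; empty if liability is excluded."""
    if force_majeure:
        # Agreed exclusion: no liability-relevant breach is recorded.
        return []
    found = []
    if label_error_rate > catalog.max_label_error_rate:
        found.append("label_error_rate")
    if oldest_record_age_days > catalog.max_record_age_days:
        found.append("record_age")
    return found


# The same delivery breaches the high-risk catalog but not the low-risk one.
print(breaches(CATALOGS["high_risk"], 0.03, 30))
print(breaches(CATALOGS["low_risk"], 0.03, 30))
```

Under this design, a single erroneous data point does not trigger liability; only a measured deviation beyond the agreed threshold does.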

Even a detailed performance description does not solve a severe challenge regarding liability: it is highly complicated to determine individual responsibilities for causing an AI system malfunction. In particular, it is difficult to prove whether and to what extent the training data contributed to a malfunction of the AI system. First, these problems of proof are attributable to the fact that the decision-making of AI systems, when based on artificial neural networks, is not entirely understandable to humans. How exactly a decision emerges is a black box. Second, creating training data is an early step in developing and applying an AI system, which means that several parties and factors are involved along the way. These include, for example, AI developers influencing the nature of the AI system through their training, other data providers delivering training data, and environmental influences on the AI system. Numerous other contributions are possible. The contracting parties should pay as much attention to these evidentiary issues as to the AI training data performance description. Potential approaches could be:

  • Expert decision: One possible approach would be to agree in the contract on an expert or an arbitration body that would determine the causation contribution of the training data on a binding percentage basis for the parties.

  • Fixed rates: Alternatively, the parties could agree on fixed liability rates in order to avoid disputes about the causation contribution of the training data and to create legal certainty. Depending on the field of application and the importance of the training data for the output, these liability rates can vary greatly.

  • Insurance: Finally, the parties could jointly take out an insurance policy that covers liability cases. They would then have to pay premiums for such a policy for a certain period.
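
The first two approaches lead to the same simple calculation: the damage is multiplied by the causation share attributed to the training data, whether that share is determined by an expert or fixed in the contract, and the result may be limited by an agreed cap. The following is a minimal sketch; the function name, the rate table and all amounts are hypothetical.

```python
def provider_liability(damage, data_share, cap=None):
    """Return the amount owed by the data provider for a given damage.

    data_share is the causation contribution of the training data (0 to 1),
    either determined by an agreed expert or fixed in the contract.
    cap is an optional, contractually agreed maximum liability.
    """
    if damage < 0:
        raise ValueError("damage must not be negative")
    if not 0.0 <= data_share <= 1.0:
        raise ValueError("data_share must be between 0 and 1")
    amount = damage * data_share
    return amount if cap is None else min(amount, cap)


# Hypothetical fixed rates per application area.
FIXED_RATES = {"high_risk": 0.5, "low_risk": 0.25}

# Fixed-rate approach, with and without a liability cap.
print(provider_liability(200_000, FIXED_RATES["low_risk"]))
print(provider_liability(200_000, FIXED_RATES["high_risk"], cap=80_000))
```

With an expert decision, the only change is that data_share comes from the binding expert determination rather than from the rate table.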


Recent EU legislation has not sufficiently addressed liability for AI training data. This omission is surprising given the enormous importance of AI training data for establishing an AI system. The parties concerned should address two central issues: the grounds for liability and evidentiary issues. To define the grounds for liability, a risk-based approach is appropriate: the more risk-prone the intended use of the AI system, the higher the requirements to be placed on the training data. In any case, a data provider and its contractual partner should set out the liability for erroneous training data as part of the contractual distribution of risk. For proof requirements in the event of a liability case, much depends on the intended use of the AI system. The parties must define reasonable rules for determining the cause in the event of a malfunction of the AI system.