Due diligence considerations with respect to licensing data and acquiring data-dependent businesses.

In the relentless pursuit of the competitive advantages that arise from efficiency and speed, companies are increasingly using artificial intelligence and machine learning (AI/ML) in their businesses. But the story does not end there. Whether you believe it is the new "oil" or not,1 data is essential to such AI/ML technology, and every AI/ML application is only as good as the quality of data on which it is trained.2 As such, companies are increasingly looking to collect data not just via their own products and services but also from public and third-party sources (including from the internet through automated means like web scraping or by licensing from data vendors and aggregators). This article explores the rights and protections that may apply to such public or third-party data, and several due diligence questions to consider when looking to acquire or license such data.

In recent years, we have seen an increase in the number of M&A targets whose products and services are based on or rely upon AI/ML technology or large pools of business-critical data. More and more of our clients are also seeking to license third-party data to power or augment their products and services. In both instances, the protectability of the data and how it is (or may be) exploited can lead to novel due diligence issues and determinations of risk allocation. We often hear target companies or vendors say that the relevant data is "in the public domain" or is not collected from "behind a login" and therefore may be freely accessed, collected and used in the relevant business. This analysis, however, is often not so straightforward. The rights that a target company or data vendor may have to certain data (and, in the licensing context, what rights such entity may grant) often turn on the nature of the data, how it is collected and processed, and its intended use.

Under United States law, data may be subject to various legal restrictions and regimes, including contractual protections, rights of privacy or publicity, and intellectual property rights. Further, certain computer trespass statutes and tort claims may also apply depending on the manner in which the data is accessed and collected. Some of the more common rights and protections that may apply to data in the US are:

  • Contractual rights – Website operators often place limits and restrictions on access to information3 through terms of service or other "browsewrap" terms and conditions (i.e., terms and conditions to which website users are purportedly bound due to the mere act of navigating or using the website). Generally, courts find that "browsewrap" agreements are unenforceable unless there is actual or constructive knowledge of such terms.4 Courts are, however, increasingly considering the nature of the user accessing the website (e.g., whether such user is a business or sophisticated party) and whether the user received actual or constructive knowledge outside of the ordinary website terms (e.g., via a cease-and-desist notice).5 If such "browsewrap" terms are held to form a binding agreement between the user and website operator, then the access, collection or use of data hosted on such website by the user in breach of such terms may give rise to a breach of contract claim against the user.
  • Copyrights – While the underlying facts or data may not be protectable, a data set as a whole may be protected by copyright as a compilation if it has the required minimum level of originality.6 Moreover, unless the terms and conditions of a website explicitly allow the copying of particular images or text, then copyright law precludes the reproduction (e.g., scraping) of (as well as the display of and creation of derivative works based on) images or text that are subject to copyright protection. This is the case regardless of the purpose for which the data is collected unless a fair use exception exists (which hinges on the subsequent application of the copied material, not upon the nature of the material itself). A claim under the Digital Millennium Copyright Act may also exist if the user circumvents technological measures (e.g., robots.txt files, monitoring software or firewall software) that control access to a copyrighted work.7
  • Computer Fraud and Abuse Act (and state equivalents8) – Claims under the Computer Fraud and Abuse Act (CFAA) are commonly asserted as a means of protecting data.9 Among other things, the CFAA protects any computer used in or affecting interstate commerce where the violator has intentionally accessed the computer without authorization (or exceeded authorization) and thereby obtained information from any such computer.10 The scope and coverage of the CFAA with respect to data scraping claims, in particular the meanings of "without authorization" and "exceeds authorization," have been the subject of significant and, at times, conflicting case law. In recent years, courts have grappled with, inter alia, whether this language extends to breaches of a website's terms of service or other "browsewrap" terms and conditions for publicly available websites (or only to certain breaches of such terms and conditions) or whether such language should only apply to non-public data found behind a login or other authentication process. In 2017, in awarding hiQ Labs, Inc. (hiQ) a preliminary injunction that prevented LinkedIn Corp. (LinkedIn) from blocking hiQ's access to the public profiles of LinkedIn's members,11 the District Court for the Northern District of California found that hiQ raised serious questions as to the applicability of the CFAA to hiQ's conduct. hiQ had continued to access public profiles after LinkedIn both issued a cease-and-desist letter and took proactive security measures to block such access. The Court distinguished previous authorities based on the fact that the defendants in both prior cases had gained access to computer networks that were protected by a password authentication system. The Court also distinguished between "public" data and data not generally visible to the public and noted that "authorization" is most naturally read in reference to the identity of the person accessing the data (and not the method of access).12 This decision, however, is on appeal.
  • Trespass to chattels and other rights and protections – The tort of trespass to chattels has been interpreted to apply to using a computer system without, or in excess of, authorization where the website operator can establish actual damage (e.g., where an automated bot that is used to crawl a website and access data consumes a significant portion of the capacity of the website operator's servers or computer system).13 Moreover, although privacy and antitrust considerations are beyond the scope of this article, they may also form part of this analysis. Similarly, this article only considers protection of data under US law. Under certain foreign laws, databases may receive additional protections (e.g., EU Directive 96/9/EC on the Legal Protection of Databases (Database Directive), as implemented by the EU Member States, provides sui generis protection for databases).

As can be seen from the above, there are various causes of action that may apply to publicly available data, and the landscape of data scraping case law is emerging and, in part, unsettled. For example, the question of whether data has been accessed from behind a password authentication system may be relevant to a CFAA analysis but will not be relevant to a copyright infringement analysis. As such, when undertaking due diligence with respect to a target company whose AI/ML technology or products or services critically rely on, or whose competitive advantage is derived from, certain data or when licensing data from a third party, it is important to understand the following:

  • The types of data involved: For example, does the subject data only comprise facts (e.g., certain product attributes) or does it contain creative expression such as long-form prose and/or images? Is personal information or other sensitive information included?
  • How each type of data is collected and where from: For example, is the data collected from publicly available websites or from websites or forums only accessible behind a login? Are the sources from which it is collected subject to any "clickwrap" or "browsewrap" or other terms and conditions?
  • How each type of data is used (or is intended to be used): For example, is the data being used internally to train AI/ML technology, is it reproduced (wholly or in part) in the target company's products or services (including on its website), or will it be used to produce derivative works or in a transformative manner?
  • The existence of any demands or claims with respect to the data or the company's collection practices: Has the target company or vendor ever received any cease-and-desist notices with respect to its data (or its data collection practices)? If so, what is the extent of the data affected and how has the company addressed such claims?

All of these questions and considerations are directed to understanding whether the relevant data is subject to any third-party rights or other protections and provide a starting point to assess and analyze any risks (and their materiality) in connection with the subject transaction.
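
For teams that track diligence responses in a structured form, the questions above can be sketched as a simple intake record. The Python below is only an illustration under stated assumptions: the DataSource fields and the regimes_to_review function are hypothetical names chosen for this sketch, and the mapping deliberately simplifies the discussion above (for example, a purely factual data set may still be protected as a compilation). It flags topics for closer review and does not draw any legal conclusion.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DataSource:
    """One data source under review. All field names are hypothetical."""
    name: str
    contains_creative_expression: bool   # e.g., long-form prose or images
    contains_personal_information: bool
    behind_login: bool                    # collected from behind a login
    subject_to_site_terms: bool           # "clickwrap" or "browsewrap" terms
    collected_by_automated_means: bool    # e.g., web scraping
    cease_and_desist_received: bool


def regimes_to_review(source: DataSource) -> List[str]:
    """List the regimes discussed in the article that merit a closer look.

    Simplified assumption: each answer maps to at most one regime. The
    result is a prompt for further review, not a legal determination.
    """
    flags: List[str] = []
    if source.subject_to_site_terms:
        flags.append("contract")
    if source.contains_creative_expression:
        flags.append("copyright")
    if source.behind_login or source.cease_and_desist_received:
        flags.append("CFAA and state equivalents")
    if source.collected_by_automated_means:
        flags.append("trespass to chattels")
    if source.contains_personal_information:
        flags.append("privacy")
    return flags
```

A record for a public website that is scraped for product images and is subject to "browsewrap" terms would, under these assumptions, be flagged for contract, copyright and trespass to chattels review. How much weight each flag deserves still depends on the facts and on the unsettled case law described above.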