By: Allison Trimble, associate senior counsel at DST Systems, Inc., and Soo Y. Kang, general counsel and director of the consulting division for Zasio Enterprises Inc.

This article was published as part of ACC’s new column “Breaking Down Big Data” that is written by members of the big data sub-committee of the ACC IT, Privacy & eCommerce Committee. Here, they discuss how to manage the ever-changing big data issues in the legal field.

The first article of this series provided an overview of select laws that govern de-identification of personal information, specifically outlining the different standards and the effect of de-identification on the use of the remaining data set. In this second part, we focus on the practical challenges of meeting de-identification standards, including both the GDPR’s heightened standard for anonymization and the more traditional standards tied to the likelihood of re-identification. In addition, this article will touch on the typical technical methods for de-identifying personal information, and how the evolution of those methods may affect corporate compliance.

Minimizing the risk of re-identification

Simply removing direct and indirect personal identifiers is not sufficient to de-identify a dataset. Data controllers must also analyze the context in which the data is presented and the resulting risk of re-identification. In evaluating this risk, consider all methods of re-identification reasonably likely to be used, including:

  • Singling out: occurs when data related to one individual can be distinguished from all other information in a dataset (e.g., a dataset records the height of individuals and only one individual is 6 feet tall);
  • Data linking: occurs when identifiers in a data set (or from external sources) are linked together (e.g., where data has been pseudonymized, direct comparisons can be made between the data masked by a pseudonym and other available data); and,
  • Inference: occurs when an inference is drawn between two pieces of information in a dataset (e.g., an inference can be drawn where a dataset shows a correlation between salary and age).
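The first of these risks, singling out, can be tested for mechanically. The sketch below (using hypothetical records and field names) flags any record whose combination of quasi-identifiers is unique within the dataset and could therefore be distinguished from all others:

```python
from collections import Counter

# Hypothetical records; none contain a name, but "height_in" combined with
# other fields may still single an individual out.
records = [
    {"zip": "83702", "gender": "F", "height_in": 64},
    {"zip": "83702", "gender": "F", "height_in": 64},
    {"zip": "83702", "gender": "M", "height_in": 72},  # unique combination
]

def singled_out(records, keys):
    """Return records whose quasi-identifier combination appears exactly once."""
    combos = Counter(tuple(r[k] for k in keys) for r in records)
    return [r for r in records if combos[tuple(r[k] for k in keys)] == 1]

unique = singled_out(records, ["zip", "gender", "height_in"])
print(len(unique))  # 1 record can be singled out
```

In practice, this kind of uniqueness check is the intuition behind formal measures such as k-anonymity: a dataset where every quasi-identifier combination appears at least k times resists singling out by definition.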

The most difficult re-identification risk to manage is data linking, due in part to the growing availability of large datasets in the public domain. One frequently cited study found that 87 percent of the US population can be re-identified through the combination of their zip code, gender, and date of birth. As an example, in 2006, Netflix released an anonymized dataset of movie ratings from 500,000 Netflix subscribers as part of its inaugural prize contest to find a superior movie recommendation algorithm.

Unfortunately, the subscribers were re-identified by cross-referencing that dataset against publicly available information, such as the Internet Movie Database. That same year, AOL’s research team released an anonymized search dataset of 20 million search queries from 650,000 AOL users for public consumption. While the dataset was de-identified by removing IP addresses and usernames, this did not prevent the identification of individuals by cross-referencing the dataset with other publicly available information. As these incidents show, the size and nature of the audience to which a dataset is released is itself a risk that must be factored into the overall equation.
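The linking attacks described above follow a simple pattern: join a “de-identified” dataset to a public one on shared quasi-identifiers. A minimal sketch, using entirely hypothetical records and a hypothetical voter-roll-style public dataset:

```python
# "De-identified" dataset: direct identifiers removed, quasi-identifiers kept.
deidentified = [
    {"zip": "02138", "gender": "F", "dob": "1960-07-15", "rating": 5},
]

# Publicly available dataset (e.g., a voter roll) with names attached.
public = [
    {"name": "J. Doe", "zip": "02138", "gender": "F", "dob": "1960-07-15"},
    {"name": "A. Smith", "zip": "02139", "gender": "M", "dob": "1955-01-02"},
]

QUASI = ("zip", "gender", "dob")

def link(deid, pub):
    """Attach a name to each de-identified record that matches a public one."""
    index = {tuple(p[k] for k in QUASI): p["name"] for p in pub}
    return [(index.get(tuple(d[k] for k in QUASI)), d) for d in deid]

matches = link(deidentified, public)
# The lone record re-identifies as "J. Doe" via zip + gender + date of birth.
```

No special tooling is required: the attack is an ordinary database join, which is why the availability of auxiliary public datasets so strongly drives re-identification risk.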

In addition, corporations must take into consideration numerous other variables that can heighten the risk of re-identification, such as using core datasets that contain sensitive information (e.g., financial or health account information), which may be of high value to bad actors and motivate such actors to attempt to re-identify data subjects.

Technical methods for de-identification

Technical methods for performing de-identification are not prescribed by law, but rather are often left to the discretion of the data controller. While guidance is sometimes offered by interested parties (e.g., Article 29 Working Party opinion on anonymization techniques), the law only requires that the method employed be sufficient to meet the de-identification standard articulated. Due to this ambiguity, the type of technical method deployed is often dictated by the intended use of the final dataset. Common techniques for de-identification typically include:

  • Randomization: occurs when “noise” or other random changes are made to a dataset;
  • Generalization: occurs by reducing the granularity of the data so less precise data is disclosed;
  • Masking: occurs when direct or indirect identifiers are removed (this method often accompanies other technical methods); and,
  • Pseudonymization: occurs when identifying characteristics are replaced with a pseudonym.
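Each of the four techniques above can be illustrated on a single hypothetical record. This is a minimal sketch, not a compliance-grade implementation; the field names, noise range, and key are assumptions for illustration:

```python
import hashlib
import random

record = {"name": "Jane Doe", "age": 42, "zip": "83702", "salary": 85_000}

# Randomization: perturb a numeric value with random noise.
noisy_salary = record["salary"] + random.randint(-5_000, 5_000)

# Generalization: reduce granularity (exact age -> age band, full ZIP -> prefix).
decade = record["age"] // 10 * 10
age_band = f"{decade}-{decade + 9}"          # "40-49"
zip_prefix = record["zip"][:3] + "**"        # "837**"

# Masking: remove the direct identifier entirely.
masked = {k: v for k, v in record.items() if k != "name"}

# Pseudonymization: replace the identifier with a keyed hash. The key (or
# the pseudonym-to-identity mapping) must be stored separately; if it is
# retained anywhere, the data remains pseudonymous rather than anonymous.
SECRET_KEY = b"rotate-me"  # hypothetical key
pseudonym = hashlib.sha256(SECRET_KEY + record["name"].encode()).hexdigest()[:12]
```

Note that the techniques compose: a dataset might be masked, generalized, and perturbed all at once, with the choice of combination driven by how the final dataset will be used.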

Standards such as anonymization prove challenging for most corporations to meet, as one of the few technical methods sufficient to meet this standard involves deleting the raw or source data, so that no original dataset exists for purposes of “reversing” the de-identification efforts. Corporations may lack the appetite to pursue this standard, as deleting raw or source data often cuts against the data’s value. Accordingly, the appropriate method for de-identification will often be driven by an assessment of risk versus value. The continuing evolution of legal standards for de-identification requires corporations to remain flexible in the technical methods they use so that they are best positioned to address changes.

Recommended best practices

Corporations will be best positioned to limit the risk of re-identification and respond quickly with appropriate technical methods of de-identification by adopting the following practices:

  • Engage stakeholders to assess the risk of re-identification and underscore the importance of evaluating context. Scope of inquiry should include factors such as the intended use, whether the de-identified data will be released to third parties, the nature of the source data, and the de-identification techniques used.
  • Keep in mind that anonymization is an extremely burdensome standard and one that, in this technological age, can rarely be achieved without sacrificing value. Accordingly, in reviewing contracts, do not over-obligate by agreeing to meet anonymization standards unless the term is defined as something other than “irreversible.”
  • Be proactive by regularly monitoring changes to laws in the jurisdictions of interest to the corporation, as well as keeping up to date with new technologies and their capabilities.
