Data can be valuable for a variety of reasons. Organizations often find that one of data's greatest uses is researching product or service markets, customer behaviors, or market trends. When data is used for research, it can raise both privacy and security concerns. Specifically, if the research is shared with third parties, either to help conduct analysis or to share findings, organizations may inadvertently violate privacy representations. If data is copied into research environments, security may also be weakened, depending on the security controls used by researchers.
One way to reduce privacy and security risks is to de-identify the data. De-identification of data refers to the process used to prevent personal identifiers from being connected with information. The FTC indicated in its 2012 report Protecting Consumer Privacy in an Era of Rapid Change: Recommendations for Businesses and Policymakers that the FTC’s privacy framework only applies to data that is “reasonably linkable” to a consumer.1 The report explains that “data is not ‘reasonably linkable’ to the extent that a company: (1) takes reasonable measures to ensure that the data is de-identified; (2) publicly commits not to try to re-identify the data; and (3) contractually prohibits downstream recipients from trying to re-identify the data.”2 With respect to the first prong of the test, the FTC clarified that this “means that a company must achieve a reasonable level of justified confidence that the data cannot reasonably be used to infer information about, or otherwise be linked to, a particular consumer, computer, or other device.”3 Thus, the FTC recognizes that while it may not be possible to remove the disclosure risk completely, de-identification is considered successful when there is a reasonable basis to believe that the remaining information in a particular record cannot be used to identify an individual.
De-identification is not a single technique, but rather a collection of approaches, tools, and algorithms that can be applied to different kinds of data with differing levels of effectiveness. In 2010, the National Institute of Standards and Technology (NIST) published the Guide to Protecting the Confidentiality of Personally Identifiable Information (PII) that provides a set of instructions and de-identification techniques for federal agencies, which can also be used by non-governmental organizations on a voluntary basis. The guide defines “de-identified information” as “records that have had enough PII removed or obscured, also referred to as masked or obfuscated, such that the remaining information does not identify an individual and there is no reasonable basis to believe that the information can be used to identify an individual.”4
NIST has identified the following five techniques that can be used to de-identify records of information:
- Suppression: The personal identifiers can be suppressed, removed, or replaced with completely random values.
- Averaging: The personal identifiers of a selected field of data can be replaced with the average value for the entire group of data.
- Generalization: The personal identifiers can be reported as being within a given range or as a member of a set (e.g., names can be replaced with “PERSON NAME”).
- Perturbation: The personal identifiers can be exchanged with other information within a defined level of variation (e.g., date of birth may be randomly adjusted by up to ±5 years).
- Swapping: The personal identifiers can be replaced between records (e.g., swapping the ZIP codes of two unrelated records).
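The five techniques above can be sketched in code. The following is a minimal illustration in Python; the record fields, function names, and parameter choices are illustrative assumptions for this sketch, not drawn from the NIST guide, and a production system would apply these transformations with far more care (e.g., validating that the result meets a formal disclosure-risk threshold).

```python
import random

# Illustrative toy records; field names are assumptions, not from NIST SP 800-122.
records = [
    {"name": "Alice Smith", "age": 34, "zip": "10001"},
    {"name": "Bob Jones",   "age": 41, "zip": "94105"},
    {"name": "Carol Lee",   "age": 29, "zip": "60601"},
]

def suppress(recs, field):
    """Suppression: remove the identifying field entirely."""
    return [{k: v for k, v in r.items() if k != field} for r in recs]

def average(recs, field):
    """Averaging: replace each value with the mean for the whole group."""
    mean = sum(r[field] for r in recs) / len(recs)
    return [{**r, field: mean} for r in recs]

def generalize(recs, field, width=10):
    """Generalization: report a value as a range rather than an exact number."""
    def bucket(v):
        lo = (v // width) * width
        return f"{lo}-{lo + width - 1}"
    return [{**r, field: bucket(r[field])} for r in recs]

def perturb(recs, field, spread=5, rng=None):
    """Perturbation: randomly adjust each value within a defined variation."""
    rng = rng or random.Random(0)  # fixed seed for reproducibility in this sketch
    return [{**r, field: r[field] + rng.randint(-spread, spread)} for r in recs]

def swap(recs, field):
    """Swapping: exchange the field's values between records."""
    values = [r[field] for r in recs]
    random.Random(0).shuffle(values)  # fixed seed for reproducibility
    return [{**r, field: v} for r, v in zip(recs, values)]
```

For example, `generalize(records, "age")` reports an age of 34 as the range "30-39", and `swap(records, "zip")` keeps the same set of ZIP codes in the data while breaking their link to particular records.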
The following provides snapshot information concerning de-identification, anonymization, and pseudonymization.