The Department of Health and Human Services’ (HHS) Office for Civil Rights (OCR) released its long-anticipated Guidance Regarding Methods for De-identification of Protected Health Information in Accordance with the Health Insurance Portability and Accountability Act (HIPAA)1 Privacy Rule2 (Guidance) on November 26, 2012.3 The Guidance — which synthesizes stakeholder feedback from a March 2010 HHS workshop4 and provides clarification of existing rules — fulfills Section 13424(c) of the Health Information Technology for Economic and Clinical Health (HITECH) Act,5 requiring the Secretary of HHS to issue guidance concerning implementation of the HIPAA Privacy Rule’s requirements for the de-identification of Protected Health Information (PHI).

While the Guidance does not contain any surprises, it clarifies many commonly accepted interpretations of the Privacy Rule related to de-identification and provides additional detail concerning technical aspects of the two de-identification methodologies designated by the Privacy Rule. Under the Privacy Rule, information is considered de-identified if it does not identify the relevant individual and there is no reasonable basis to believe it could be used to identify the individual.6 The Privacy Rule specifies two methods that can be used to render PHI no longer “individually identifiable”: a “Safe Harbor” whereby a pre-set list of specific identifiers is removed7 and an “Expert Determination” method under which a qualified expert determines that the likelihood of re-identification of a data set is very small.8 The Privacy Rule permits covered entities to use and disclose PHI for the creation of de-identified information using either the Expert Determination or Safe Harbor methodology.9 

Expert Determination method of de-identification

The Expert Determination method requires a covered entity to obtain a formal determination by a qualified expert that the risk is very small that information can be used alone or in combination with other reasonably available information to identify an individual. HHS recognizes that some risk of re-identification exists even with de-identified information, but nonetheless, de-identified information is not considered PHI. 

OCR does not require any specific qualification or professional degree for an expert to be qualified in rendering health information de-identified. The agency notes, however, that experts may be found in statistical, mathematical, or other scientific fields, and points out that “from an enforcement perspective,” OCR will look to the individual’s relevant professional experience and training, including experience in health information de-identification methodologies.

The Guidance states that there is no specific numerical level of risk (e.g., 2, 5, or 10 percent) for an expert to determine that there is a very small risk for information to be used to identify an individual. The appropriate level of risk will depend on the anticipated recipient and other environmental factors.

OCR also makes clear that there is no designated process for an expert to use in assessing the risk of identification. The agency emphasizes the importance of documentation throughout the de-identification process and requires an entity to make this documentation available upon request. The Guidance also outlines a general process for experts determining the identification risk, which includes: evaluating the extent to which the information is identifiable; working with the covered entity to determine appropriate statistical or scientific methods to mitigate the risk of identification; applying such methods; assessing the resulting risk; and potentially repeating the process until the covered entity and expert agree the data has reached the “very small” risk level required by Section 164.514(b)(1)(i) of the Privacy Rule.

The Guidance also points out that experts can derive multiple solutions from the same data. This is allowed as long as the expert ensures the different data sets cannot be combined to identify individuals who are subjects of the data. The Guidance provides that experts can accomplish this by obtaining technical proof that the data cannot be merged or by requiring contractual limitations like data use agreements.

Principles experts may use to determine the risk of identifiability include:

  • Replicability: the chance a data element will consistently occur or vary (e.g., date of birth will remain consistent and is, therefore, high risk).
  • Data source availability: the chance external data sources contain patient identifiers and replicable features in health information (e.g., patient name and demographics are often found in public data sources and are, therefore, high risk).
  • Distinguishability: the extent to which the subject’s data is distinguishable in the health information (e.g., one study suggests that over half of U.S. residents can be uniquely identified by the three data elements of birth date, gender, and five-digit ZIP code, which is, therefore, high risk).

OCR states that the risk assessment often turns on the extent to which a data set can be “linked” to a data source divulging the identity of the relevant individuals. Experts may apply commonly accepted statistical or scientific principles to determine this probability. For data to be linked, it must be unique, accompanied by a naming data source, and have a mechanism for connecting the de-identified data to the naming data source. For example, some features like age, ZIP code, or gender in a data set can easily be linked to public data sources like a voter registration database, which may reveal the individual’s name and other information. This is why commonly found and publicly available features (like demographic information) tend to be higher risk than clinical features, which fewer people can access. 
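The distinguishability concern described above can be illustrated with a minimal Python sketch. The records, field names, and values below are invented for illustration; an actual assessment would be performed by a qualified expert on real data using accepted statistical methods.

```python
# Hypothetical sketch: measuring how distinguishable records are on a set
# of quasi-identifiers (here, the birth date / gender / five-digit ZIP
# combination discussed above). All data is fabricated for illustration.
from collections import Counter

records = [
    {"birth_date": "1960-03-15", "gender": "F", "zip5": "33139"},
    {"birth_date": "1960-03-15", "gender": "F", "zip5": "33139"},
    {"birth_date": "1975-11-02", "gender": "M", "zip5": "90210"},
]

def unique_fraction(records, keys):
    """Fraction of records whose quasi-identifier combination is unique,
    and therefore potentially linkable to an external naming data source."""
    combos = Counter(tuple(r[k] for k in keys) for r in records)
    return sum(1 for r in records
               if combos[tuple(r[k] for k in keys)] == 1) / len(records)

print(unique_fraction(records, ["birth_date", "gender", "zip5"]))
```

A higher unique fraction means more records could, in principle, be matched one-to-one against a public source such as a voter registration database.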

If the risk assessment reveals a greater than very small risk, the expert may apply a number of strategies to modify the information and mitigate the risk:

  • Suppression: removing certain risky features from the data (e.g., redacting the age or ZIP code of a record if those features apply to a narrow pool of individuals).
  • Generalization: broadening the data of a record to make it less granular (e.g., revealing only the first four digits of a ZIP code, or categorizing age into five-year categories).
  • Perturbation: replacing specific values with other specific values (e.g., reporting a patient’s age as a random value within a five-year window of the actual age; this can be done in a way that preserves statistical properties such as the mean or variance of the original data).
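The three mitigation strategies above can be sketched in a few lines of Python. The record fields, bucket width, and perturbation window are illustrative choices, not values prescribed by the Guidance.

```python
# Illustrative sketches of the three risk-mitigation strategies described
# above; field names and parameters are hypothetical.
import random

def suppress(record, risky_fields):
    """Suppression: drop risky fields from the record entirely."""
    return {k: v for k, v in record.items() if k not in risky_fields}

def generalize_age(age, width=5):
    """Generalization: report age only as a five-year band, e.g. '35-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def perturb_age(age, window=2, rng=random):
    """Perturbation: replace the age with a random value within +/- window
    years of the actual age."""
    return age + rng.randint(-window, window)

record = {"age": 37, "zip5": "33139", "diagnosis": "J45"}
print(suppress(record, {"zip5"}))     # {'age': 37, 'diagnosis': 'J45'}
print(generalize_age(record["age"]))  # '35-39'
```

As the Guidance notes, an expert may combine these approaches (for example, suppressing one field while generalizing another) depending on the data set.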

In the Guidance, OCR acknowledges that risk mitigation strategies sometimes involve multiple approaches, such as applying both generalization and suppression. OCR stresses that no particular approach is required, and experts must make case-by-case determinations concerning which methods best suit the risk mitigation needs of a particular data set. Experts may, as an added precaution, execute a Data Use Agreement limiting, for example, who can access the data or forbidding the data recipient from re-identifying it.10 This goes beyond the requirements of the Privacy Rule, as no such agreement is required when sharing de-identified data. 

OCR also notes that disclosure of records with “codes” is not strictly prohibited under the Expert Determination method. Sharing records with codes, ciphertext, or values produced by cryptographic hash functions is permissible as long as an expert determines that the data meets the de-identification requirements of Section 164.514(b)(1) and the covered entity does not reveal the key associated with the functions, including to the data recipient. OCR points to the National Institute of Standards and Technology (NIST) for further guidance.11 
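One common way to generate such a code is a keyed hash, where the code cannot be reproduced or reversed without the secret key. The sketch below is a hypothetical illustration, not a method prescribed by OCR; the key value and record-number format are invented, and in practice the covered entity must keep the key confidential, including from the data recipient.

```python
# Hedged sketch: deriving a record code with a keyed hash (HMAC-SHA-256).
# The key below is a placeholder; a real key must be generated securely
# and never disclosed, including to the data recipient.
import hmac
import hashlib

SECRET_KEY = b"kept-by-covered-entity-only"  # hypothetical placeholder

def record_code(record_number: str) -> str:
    """Replace a record number with an HMAC-SHA-256 code (64 hex chars)."""
    return hmac.new(SECRET_KEY, record_number.encode(),
                    hashlib.sha256).hexdigest()

code = record_code("MRN-0012345")
print(len(code))  # 64
```

Because the same input always yields the same code under a given key, the covered entity can still link records internally, while recipients without the key cannot recover the underlying identifier.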

Finally, the Privacy Rule does not require attaching an “expiration date” to an expert determination that information is de-identified. OCR recommends, however, that before re-releasing the information at a later time, covered entities reexamine it and make sure no additional de-identification process is necessary. Risk levels are situation specific and can change over time, so covered entities must make sure the very small risk level of identification is maintained.

Safe Harbor method of de-identification

Under the Safe Harbor method, a covered entity must remove the 18 specific data elements (name, Social Security number, etc.) enumerated in Section 164.514(b), relating to the individual or to his or her employer, household, or family members.12 In addition, the covered entity must have no “actual knowledge” that the information could be used to identify an individual who is a subject of the information. 

Covered entities may leave the first three digits of a ZIP code in the data as long as the geographic region represented by those three digits includes more than 20,000 people — otherwise, these digits must be changed to “000.” OCR expects covered entities to use the most current publicly available Census data concerning ZIP codes when determining which ZIP codes may have fewer than 20,000 people. As for dates, consistent with the current Privacy Rule, OCR states that nothing more specific than the year is allowed. This also applies to dates associated with patient test measures, such as dates on a laboratory report. Ages over 89 must be generalized as “90 or above,” as the pool of individuals at each age above 89 may be narrow enough to risk identification. Finally, not all names must be suppressed under the Safe Harbor method: names of individuals corresponding to the health information and of their relatives, employers, or household members must be suppressed, but names of providers or workforce members of the covered entity or business associate may remain in the data set.13 
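The ZIP code, date, and age rules above are mechanical enough to sketch in Python. The restricted-prefix set below is a hypothetical placeholder: in practice it must be derived from the most current publicly available Census data on three-digit ZIP areas containing fewer than 20,000 people.

```python
# Illustrative sketch of three Safe Harbor generalizations described above.
# RESTRICTED_PREFIXES is a placeholder; the real set must come from current
# Census data on three-digit ZIP areas with fewer than 20,000 residents.
RESTRICTED_PREFIXES = {"036", "059", "063"}  # hypothetical example values

def safe_harbor_zip(zip5: str) -> str:
    """Keep the first three ZIP digits only if the area is populous enough;
    otherwise report '000'."""
    prefix = zip5[:3]
    return "000" if prefix in RESTRICTED_PREFIXES else prefix

def safe_harbor_date(iso_date: str) -> str:
    """Retain nothing more specific than the year (input 'YYYY-MM-DD')."""
    return iso_date[:4]

def safe_harbor_age(age: int):
    """Ages over 89 collapse into a single '90 or above' category."""
    return "90+" if age > 89 else age

print(safe_harbor_zip("10023"))        # '100'
print(safe_harbor_date("2012-11-26"))  # '2012'
print(safe_harbor_age(93))             # '90+'
```

Note that code like this only automates the enumerated-identifier rules; the separate “actual knowledge” condition discussed below still requires human judgment.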

OCR is also unequivocal that partial or derivative identifiers are not allowed under the Safe Harbor method. For example, data containing patient initials or the last four digits of social security numbers does not qualify as de-identified information.

While it does not provide an explicit definition, OCR offers examples of identifiers that may be considered “any other unique identifying number, characteristic, or code” — which is the last of the 18 Safe Harbor data elements. The examples include identifying numbers such as clinical trial record numbers, identifying codes such as embedded bar codes or codes derived from a secure hash function without a secret key, and identifying characteristics such as information that someone is the “current President of State University.”

A final requirement of the HIPAA Safe Harbor de-identification method is that covered entities must have no “actual knowledge” that the data could be used to identify an individual data subject. OCR defines “actual knowledge” as “clear and direct” knowledge. If a covered entity is aware that a data set is not in fact de-identified — if it concludes that the information could identify an individual who is a subject of the data — it has “actual knowledge” and may not disclose the information. Merely being aware of studies concerning methods to re-identify information or to identify individuals using de-identified information, however, does not constitute actual knowledge.

These HIPAA de-identification standards apply regardless of whether the data comes from standardized fields or free-text fields. OCR specifically notes that covered entities should be mindful to remove identifying information from free-text fields such as clinical narratives, which can be minefields of sensitive information. OCR references Health Level Seven (HL7) and the International Organization for Standardization (ISO) as two resources with additional information on best practices in documentation and standards.

The full text of OCR’s Guidance is available at: http://www.hhs.gov/ocr/privacy/hipaa/understanding/coveredentities/De-identification/hhs_deid_guidance.pdf.