Data analytics is undergoing a watershed moment internationally that is likely to reshape common industry norms. In Québec, Bill 64 and its draconian penalties will come into force largely in September 2023, including Canada’s first statutory treatment of technologies “that allow a person to be identified, located or profiled.” Europe is even further ahead: on November 23, 2021, the Internal Market and Consumer Protection Committee of the European Parliament unanimously backed the proposed Digital Markets Act (the DMA), which seeks to prohibit major advertising platforms from using combined personal information to deliver targeted advertising.
Providers of data for targeted advertising and data insights have also felt pressure from lawmakers regarding third-party tracking, which often takes the form of third-party cookies placed in browsers to track users and gather information on their behavioural patterns and interests. The industry is in the midst of a significant upheaval: Firefox and Safari have blocked third-party cookies from their browsers entirely, Apple has implemented privacy settings on its mobile devices through iOS 14.5 that require opt-in to third-party tracking in apps, and Google has committed to phasing out its support for third-party cookies by 2023.
Data analytics is a constant battle between the utility and the anonymity of the underlying data set. Businesses may wish to anonymize personal information to simplify regulatory obligations and reduce breach risks, while retaining enough critical personal information for the data to be useful. This leads to a pivotal question — how can businesses learn the most about group behaviours while knowing as little as possible about the specific individuals in the group?
Facing increased regulatory scrutiny, businesses have come up with a range of solutions that retain critical personal information while minimizing the associated privacy risks through anonymization. By strategically applying anonymization techniques, businesses can maximize the analytical value of personal information while minimizing the risks of keeping it. In doing so, the risk of harm associated with privacy violations, regulatory investigations, and disclosure obligations can be reduced, as the personal information held by a business either ceases to specifically identify individuals or poses greatly reduced potential harm to those individuals. We discuss these solutions below.
COMMON ANONYMIZATION AND MINIMIZATION TECHNIQUES
Privacy regulators have increasingly supported the use of anonymization techniques to reduce the risks associated with businesses processing and keeping personal information. As a useful guideline, the European Commission identified three factors to assess the level of security provided by an anonymization technique: (i) is it still possible to single out an individual?; (ii) is it still possible to link records relating to an individual?; and (iii) can information still be inferred concerning an individual? In practice, perfect anonymization of data would render data nearly unusable from a business perspective. However, businesses can implement a variety of anonymization and minimization techniques that preserve the analytical usefulness of data to draw business insights, while at the same time protecting personal information from being widely disseminated. As these techniques technically permit the reidentification of data for analysis purposes, they are referred to as “pseudo-anonymization.” Through a combination of methods for pseudo-anonymizing personal information, businesses have implemented a variety of creative ways to maximize analytical usefulness while reducing the legal risk involved with data processing.
Data suppression is the practice of eliminating certain categories of data that are irrelevant to a given analytics exercise. As an example, if the full name of an individual is irrelevant to analytics but was collected as part of the payment information process, the full name would be removed from any analyst’s request for data. Ideally, suppression should be used when a category of personal information is either irrelevant or when the category cannot otherwise be suitably anonymized with another technique, as the data cannot subsequently be recovered.
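As a minimal sketch of suppression, the following Python snippet removes an irrelevant field before records reach an analyst. The record layout and field names are hypothetical and chosen only for illustration.

```python
def suppress(records, fields_to_drop):
    """Return copies of the records without the suppressed fields."""
    return [
        {key: value for key, value in record.items() if key not in fields_to_drop}
        for record in records
    ]

# Hypothetical payment records; the field names are illustrative only.
payments = [
    {"full_name": "Alice Tremblay", "postal_code": "H2X 1Y4", "amount": 42.50},
    {"full_name": "Bob Roy", "postal_code": "G1R 4P5", "amount": 17.25},
]

# The analyst's view never contains the full name.
analytics_view = suppress(payments, {"full_name"})
```

The original records are left untouched in this sketch; in a real deployment, suppression is permanent only if the source copy is also deleted or never shared.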
Masking is similar in principle to suppression, but a less permanent method of anonymizing data. The technique involves replacing characters in personal information with dummy characters to reduce the possibility of unauthorized access to sensitive data. A common example is the use of uniform characters when inputting a password to prevent recording (e.g. passwords become • • • • • • • • when typed). The same practice is used to mask credit card information, replacing all but the last four digits with dummy characters (XXXX-XXXX-XXXX-1234) to prevent malicious use. Masking can be a useful, but non‑permanent, means of providing added security by preventing the widespread dissemination of sensitive personal information across an organization.
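The credit card example can be sketched in a few lines of Python. The function name and its parameters are assumptions made for illustration, not a reference to any particular product.

```python
def mask_card_number(card_number, visible=4, mask_char="X"):
    """Replace every digit except the last `visible` ones with a dummy character."""
    total_digits = sum(ch.isdigit() for ch in card_number)
    digits_seen = 0
    masked = []
    for ch in card_number:
        if ch.isdigit():
            digits_seen += 1
            # Keep only the trailing digits; separators pass through unchanged.
            masked.append(ch if digits_seen > total_digits - visible else mask_char)
        else:
            masked.append(ch)
    return "".join(masked)

masked = mask_card_number("4111-1111-1111-1234")
# masked == "XXXX-XXXX-XXXX-1234"
```

Because the unmasked value still exists in the underlying system, masking of this kind limits who sees the data rather than removing it.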
Mixing, Scrambling or Shuffling
This process describes shifting the letters or digits of personal information, either within one instance of personal information or across an entire data set. By breaking the logical order in which a data set is organized, the amount of identifying information that can be extracted by malicious actors is significantly reduced. In addition, scrambling or mixing makes it more complicated to identify the personal information of other data subjects by attempting to decode the mixing process, as the columns or data sets subject to mixing are most often re-randomized on each access.
Generalization
Generalization involves deliberately reducing the accuracy of a data set by replacing specific values with a range or broader definition. Data categories that benefit from generalization are often those whose analytical value is preserved even when abstracted to a certain degree. For service offerings, an example can include moving from a specific postal code to the first three characters of that code or even to a broader neighbourhood level. Another example would be to move from a specific date of birth to a month and year of birth, then to a specific age (55), and finally to a general age range (50 to 60). Generalization is most effective when implemented selectively, as how much a data value is generalized has a strong impact on the protection afforded to individuals in the data set.
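Both shuffling across a data set and generalization can be illustrated with a short Python sketch. The helper names, the record layout, and the choice of non-overlapping ten-year age bands are assumptions made for this example.

```python
import random

def shuffle_column(records, field, rng=None):
    """Return copies of the records with one field's values permuted across rows."""
    rng = rng or random.Random()
    values = [record[field] for record in records]
    rng.shuffle(values)
    return [{**record, field: value} for record, value in zip(records, values)]

def generalize_postal_code(postal_code):
    """Keep only the first three characters of a Canadian postal code."""
    return postal_code.replace(" ", "")[:3].upper()

def generalize_age(age, width=10):
    """Map an exact age to a non-overlapping range such as '50-59'."""
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"

# Hypothetical customer records.
customers = [
    {"postal_code": "H2X 1Y4", "age": 55, "amount": 42.50},
    {"postal_code": "G1R 4P5", "age": 31, "amount": 17.25},
    {"postal_code": "M5V 2T6", "age": 47, "amount": 88.00},
]

# Shuffling detaches each amount from the row it came from.
shuffled = shuffle_column(customers, "amount", random.Random(0))

# Generalization keeps the shape of the data but blurs the specifics.
generalized = [
    {"area": generalize_postal_code(c["postal_code"]), "age_range": generalize_age(c["age"])}
    for c in customers
]
```

Shuffling preserves column-level statistics such as totals and averages, while generalization trades precision in each record for protection; the width of the age band is the kind of parameter that would be tuned selectively.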
The process of adding noise hides personal information collected by adding in false data in select amounts. “Noise” is defined as data points, or entire fields of data, that do not actually correlate to an individual. The process of adding noise is also highly variable depending on the data collected, but the general principle involves “hiding” real personal information among randomly generated data that serves no actual purpose. When an organization seeds false data among real data, malicious actors are significantly hampered from using the data set for nefarious means or reverse engineering the above-mentioned anonymization techniques by using the data set as a whole. A newer method used by businesses called “differential privacy,” discussed below, applies the practice of adding noise in unique ways to increase the security of personal information held by businesses.
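One way to picture seeding false data is the Python sketch below, which mixes randomly generated decoy records into a real data set. The record fields, value ranges, and identifier format are all hypothetical, and the approach of keeping a separate list of decoy identifiers is an assumption of this example rather than an established standard.

```python
import random

def seed_decoys(records, n_decoys, rng=None):
    """Mix randomly generated decoy records into the real ones.

    Returns the mixed list and the set of decoy IDs, which would be kept
    separately so that authorized analysts can filter the decoys out.
    """
    rng = rng or random.Random()
    decoys = [
        {
            "id": f"r{rng.getrandbits(32):08x}",      # same format as real IDs
            "age": rng.randint(18, 90),                # assumed plausible range
            "amount": round(rng.uniform(5, 500), 2),   # assumed plausible range
        }
        for _ in range(n_decoys)
    ]
    mixed = records + decoys
    rng.shuffle(mixed)
    return mixed, {decoy["id"] for decoy in decoys}

# Hypothetical real records.
real = [
    {"id": "r3f9a1c07", "age": 34, "amount": 120.00},
    {"id": "r0b77e2d4", "age": 58, "amount": 64.50},
]

mixed, decoy_ids = seed_decoys(real, 3, random.Random(1))

# Only a holder of decoy_ids can recover the genuine records.
recovered = [record for record in mixed if record["id"] not in decoy_ids]
```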
Encryption is an effective means of implementing the above-mentioned techniques. The process involves filtering collected data through an encryption algorithm that renders the data useless to a human reader, which can then be unscrambled using a private password. A common and easily used method is symmetric encryption, where data is hidden by an algorithm on collection and becomes readable only after inputting a private key password. Encryption comes in a variety of formats ranging from simple private key encryption to complex end-to-end encryption, but serves the common purpose of making the personal information collected by the business unreadable by malicious actors. Techniques like “salting and hashing” increase the difficulty of breaking the code. However, authorized analysts with a need to access the data set can reap the analytical benefits of the data with access to the decryption key.
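The "salting and hashing" idea mentioned above can be sketched with Python's standard library. Note that hashing, unlike encryption, is one-way: the sketch can confirm that a value matches a stored digest but cannot recover the value. The function names and the iteration count are assumptions chosen for illustration.

```python
import hashlib
import hmac
import secrets

def hash_with_salt(value, salt=None, iterations=200_000):
    """Derive a salted digest of a value using PBKDF2-HMAC-SHA256."""
    # A fresh random salt means identical values produce different digests.
    salt = salt if salt is not None else secrets.token_bytes(16)
    digest = hashlib.pbkdf2_hmac("sha256", value.encode("utf-8"), salt, iterations)
    return salt, digest

def verify(value, salt, expected_digest, iterations=200_000):
    """Check a candidate value against a stored salt and digest."""
    _, digest = hash_with_salt(value, salt, iterations)
    # Constant-time comparison avoids leaking information through timing.
    return hmac.compare_digest(digest, expected_digest)

salt, digest = hash_with_salt("alice@example.com")
matches = verify("alice@example.com", salt, digest)
```

Reversible protection of a data set, where an authorized analyst later decrypts the data with a key, would instead rely on a vetted encryption library; that is outside the scope of this sketch.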
NEWER ANONYMIZATION AND MINIMIZATION TECHNIQUES
Federated Learning of Cohorts (FLoC)
FLoC is a combination of generalization, suppression, and adding noise that involves the collection of personal information and sorting it into anonymized cohorts by its identifying factors. Google implemented the technique as an alternative to third‑party cookie tracking technology on its Chrome browser in March 2021. Cohorts are sorted by the types of internet activity that users have in common, serving as a method of generalization and suppression by providing advertisers with only the most pertinent data categories on an abstract level. Cohorts also contain hundreds if not thousands of users, making any individual’s behaviour difficult to associate back to a specific person.
FLoC has been deployed on Chrome browsers as a pilot project, resulting in a potential radical shift in the effectiveness of third-party cookies. Google’s privacy sandbox provides the mechanisms behind FLoC on an open-source basis, permitting businesses the option of exploring whether FLoC could be of use for their own purposes. In principle, the technology behind FLoC could equally apply to businesses who are seeking to generalize personal information held to shield themselves from privacy breaches, resulting only in abstract cohorts rather than personally identifying information.
Tokenization is a more thoroughly applied method of encryption and masking that replaces personal information with a series of tokens that identifies specific pieces of personal information. The principle has already seen broad use in the payment processing industry, where credit card payment information has been tokenized to permit transfer requests between acquirer banks, payment networks, and issuer banks without revealing sensitive personal information during transfers.
Tokenization acts as a further step to masking by replacing the personal information values entirely. The process involves the use of a “token vault,” which stores the core algorithm used to generate a variety of tokens. Personal information that is submitted to the business is stored in the token vault, and the token is then transferred for various purposes. Only once a request is made to the token vault can the token be exchanged for the personal information it represents. As the token itself has no intrinsic value, even if malicious actors could crack the encryption, the token would not subsequently reveal any personal information. As an added benefit, any request to exchange a token for the personal information it represents could be tracked by the business to facilitate the investigation of a privacy incident. Tokens are also frequently randomized every time they are entered, even if the underlying personal information remains the same.
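A token vault can be sketched in Python as follows. This is a simplified illustration that uses random tokens and an in-memory lookup table; the class name, method names, and logging format are hypothetical, and a production vault would add encryption at rest, access control, and durable storage.

```python
import secrets

class TokenVault:
    """Toy token vault: maps random tokens to the values they stand for."""

    def __init__(self):
        self._store = {}        # token -> personal information
        self.access_log = []    # (requester, token) pairs for later audits

    def tokenize(self, value):
        # The token is random, so it reveals nothing about the value, and
        # tokenizing the same value twice yields two different tokens.
        token = secrets.token_urlsafe(16)
        self._store[token] = value
        return token

    def detokenize(self, token, requester):
        # Every exchange is recorded to support incident investigations.
        self.access_log.append((requester, token))
        return self._store[token]

vault = TokenVault()
card = "4111-1111-1111-1234"
token_a = vault.tokenize(card)
token_b = vault.tokenize(card)

# Only a request to the vault turns a token back into the card number.
original = vault.detokenize(token_a, requester="fraud-team")
```

Because the tokens here are purely random, an attacker who obtains a token has nothing to decrypt; the only route back to the personal information is through the vault itself.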
The technology behind tokenization is a strongly proven concept, with consistent innovations due to the popularization of blockchain technology. However, tokenization is often not implemented as a stand-alone security offering and is frequently paired with other solutions to offer a more comprehensively secure privacy system. Depending on the type of personal information being processed and traded, tokenization can be an effective means of protecting the transfer of personal information.
Multiparty Computation (MPC)
Secure Multiparty Computation (or “split processing”) is a cryptographic solution that permits the sharing of data processing results while keeping secret the data used to produce those insights. Previously, this process required a “trusted third-party source” to act as an intermediary. The process involved two parties giving relevant data to a third party, who computed the required insights and delivered the results confidentially, without revealing either party’s values to the other.
MPC cuts out the intermediary by emulating the third party through advanced cryptography. The result is that business insights can be accurately gained without ever having access to the personal information that produces them, especially in relation to larger data sets. If used properly, MPC has the potential to provide businesses with a secure means of deriving data insights even when the operating environment poses serious privacy risks. One example is a case where a data exporter wants to process personal information jointly using two service providers in jurisdictions with limited legal protections for personal information. The data exporter can implement an MPC system where the two service providers process personal information simultaneously without ever having access to the specific data set in question.
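The two-provider scenario can be illustrated with additive secret sharing, one of the simplest building blocks used in MPC. The Python sketch below is an assumption-laden toy: the modulus, variable names, and the choice of a sum as the computation are illustrative, and real MPC protocols handle far richer computations and adversarial behaviour.

```python
import secrets

MODULUS = 2**61 - 1  # illustrative modulus; all arithmetic is done mod this value

def share(value, n_parties):
    """Split a value into random-looking shares that sum to it mod MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % MODULUS)
    return shares

def reconstruct(shares):
    """Combine shares (or partial results) back into a value."""
    return sum(shares) % MODULUS

# Hypothetical sensitive values held by the data exporter.
inputs = [52_000, 61_500, 48_250]

# Each value is split in two; each provider receives one share per value.
all_shares = [share(value, 2) for value in inputs]

# Each provider adds up only the shares it holds. On its own, each
# provider's view is statistically independent of the real inputs.
provider_a = sum(pair[0] for pair in all_shares) % MODULUS
provider_b = sum(pair[1] for pair in all_shares) % MODULUS

# Combining the two partial results reveals only the total.
total = reconstruct([provider_a, provider_b])
```

Neither provider ever sees an individual input, yet the exporter obtains the exact aggregate once the two partial results are combined.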
Though MPC is a method that has existed for some time, its recent application in data protection strategies is in no small part due to international regulators recognizing its effectiveness as a privacy protection measure. The European Data Protection Board specifically identifies MPC as an effective supplementary measure to protect data in non-EU jurisdictions, and speaks to its potential as a technology for systems adhering to privacy by default standards. The International Association of Privacy Professionals reported that in the United States, public institutions implement MPC to protect federal databases, and the Promoting Digital Privacy Technologies Act identifies MPC as a cryptography technique of note to be studied.
Differential Privacy is a technique that simplifies the process of adding noise to a data set for even authorized users. In this model, the database is segregated from the analyst, who cannot see the personal information collected by the business. When analysts seek to generate a conclusion from certain data values, they submit requests to an intermediary piece of software known as a “Privacy Guard.” The Privacy Guard assesses the privacy risk associated with a given request, and adds random noise to compensate before returning a data value.
The result is that the value given back to the analyst is close enough to the real value to be useful, while at the same time sufficiently noisy to prevent any kind of reverse engineering that would expose an individual’s personal information. Businesses including Microsoft, Apple, and Google have implemented differential privacy with some success. By calibrating the amount of random noise to the privacy risk, differential privacy can offer a comprehensive solution that retains analytical usefulness: it shields the true data values while still providing an accurate overall picture of trends within a data set.
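The Privacy Guard model can be sketched in Python using the Laplace mechanism, a standard way of calibrating noise for counting queries. The class and method names, the record layout, and the fixed privacy parameter `epsilon` are assumptions of this example; a fuller implementation would also track a cumulative privacy budget across queries, which is omitted here.

```python
import random

class PrivacyGuard:
    """Toy intermediary that answers counting queries with Laplace noise."""

    def __init__(self, records, epsilon, rng=None):
        self._records = records      # never exposed to the analyst
        self._epsilon = epsilon      # smaller epsilon means more noise
        self._rng = rng or random.Random()

    def noisy_count(self, predicate):
        true_count = sum(1 for record in self._records if predicate(record))
        # Adding or removing one person changes a count by at most 1, so
        # the noise scale for a counting query is 1 / epsilon.
        rate = self._epsilon
        # The difference of two exponential draws follows a Laplace
        # distribution centred on zero with scale 1 / rate.
        noise = self._rng.expovariate(rate) - self._rng.expovariate(rate)
        return true_count + noise

# Hypothetical data set: ages of 1,000 customers.
data_rng = random.Random(42)
customers = [{"age": data_rng.randint(18, 90)} for _ in range(1000)]

guard = PrivacyGuard(customers, epsilon=0.5, rng=random.Random(7))
answer = guard.noisy_count(lambda record: record["age"] >= 65)
```

Each individual answer is perturbed, but the noise averages out over large groups, which is why aggregate trends remain usable while single records stay protected.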
Synthetic data is an addition to the above-mentioned practice of “Adding Noise.” The general practice is the use of an algorithm that simulates the connections made through analysis of personal information, and reverse-engineers the conclusions to generate sets of dummy data. MIT has released the Synthetic Data Vault to assist developers in this regard. In a test of the usefulness of insights drawn from the use of synthetic data when compared to actual datasets, researchers were capable of drawing accurate conclusions 70% of the time even while using synthetic datasets.
In principle, synthetic data methods could sidestep the use of personal information entirely. Businesses could draw useful insights and analytics from simulated customer behaviour, rather than exposing the business to privacy risks involved with collecting data from customers. However, synthetic data solutions are still in the early stages of implementation. Depending on the type of analytics a business is seeking to replicate, synthetic data could be a costly means of anonymizing data compared to the alternative methods mentioned herein.
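As a deliberately simple illustration of the idea, the Python sketch below fits a basic statistical summary to a real data set and then samples dummy records from it. It does not use or represent the Synthetic Data Vault; the function names and record fields are hypothetical, and treating each column as an independent normal distribution is a strong simplifying assumption that real tools improve upon by modelling relationships between columns.

```python
import random
import statistics

def fit(records, fields):
    """Summarize each numeric field by its mean and standard deviation."""
    model = {}
    for field in fields:
        values = [record[field] for record in records]
        model[field] = (statistics.mean(values), statistics.stdev(values))
    return model

def sample(model, n, rng=None):
    """Generate n dummy records from the fitted summary."""
    rng = rng or random.Random()
    return [
        {field: rng.gauss(mean, stdev) for field, (mean, stdev) in model.items()}
        for _ in range(n)
    ]

# Hypothetical real records.
real = [
    {"age": 23, "monthly_spend": 120.0},
    {"age": 35, "monthly_spend": 310.0},
    {"age": 41, "monthly_spend": 275.0},
    {"age": 52, "monthly_spend": 190.0},
    {"age": 67, "monthly_spend": 95.0},
]

model = fit(real, ["age", "monthly_spend"])
synthetic = sample(model, 5000, random.Random(3))
```

Analysts could then work with `synthetic` rather than `real`, at the cost of losing whatever structure the model failed to capture.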
Universal ID technology is an applied use of both encryption and suppression, which identifies individual users by a generic username rather than collecting a broad spectrum of personal information for users online. The most prominent version of this technology is the open-source Unified ID 2.0, established by The Trade Desk and adopted by BuzzFeed, AMC Networks, Foursquare, Salon, and the LA Times. Universal ID involves an open-source, encrypted, and unique username for individuals who browse partner websites. Users who create a profile have their email addresses encrypted and tokenized (as explained above), and the universal ID token is traded between service providers and advertisers to provide targeted advertising to individuals without revealing many of the unnecessary particulars about the underlying individual that may leave them open to malicious actors.
Universal ID systems are not exclusive to private industry, and the technology has seen successful application in the public sector. Examples include the ID Austria program, whose pilot phase concluded in autumn 2021. The system uses the same tokenization methodology to encrypt the personal information of Austrian citizens, who can now use the digital identifier as a means of accessing public services. Though universal ID systems are often discussed in the context of cross-business applicability, businesses with multiple parent or subsidiary service offerings could also benefit from a unified ID system. An example in practice is the Universal ID offering by SAP, which unifies the service offerings into a single system.
Personal information and data analytics are an essential part of the financial projections for many businesses worldwide. As regulators continue to clamp down and impose exacting standards on the processing of personal information, while potential penalties reach staggeringly high proportions of revenue, strategic anonymization can offer practical benefits by actively reducing legal risks while preserving the usefulness of personal information. Businesses should consider implementing one or more of the above-mentioned techniques in order to achieve more effective compliance without compromising the efficiency of business practices.