In this third article of our "Big Data & Issues & Opportunities" series (see our previous article here), we look, on the one hand, at the impact of anonymisation and pseudonymisation in a personal data protection context and, on the other hand, at the possible use of anonymisation and pseudonymisation techniques as a way to protect non-personal data.

First and foremost, it shall be noted that a discrepancy may exist between the legal and technical definitions of certain anonymisation and pseudonymisation techniques discussed in this article. For the purpose of our legal analysis, this article will rely on the legal definitions as outlined below.

Anonymisation, nowadays used as a common denominator for different types of techniques, can be described as a process by which information is manipulated (concealed or hidden) to make it difficult to identify data subjects.[1] The Oxford English Dictionary defines it as the act of removing identifying particulars or details for statistical or other purposes.

In its Opinion 05/2014 on Anonymisation Techniques, the Article 29 Working Party (the predecessor of the European Data Protection Board) discusses two different families of anonymisation techniques:[2]

  • Randomisation: anonymisation techniques that alter the veracity of the data in order to remove the strong link between the data and the individual. This family includes techniques such as noise addition, permutation, and differential privacy.
  • Generalisation: anonymisation techniques that generalise, or dilute, the attributes of data subjects by modifying their respective scale or order of magnitude. This family includes techniques such as aggregation and K-anonymity, as well as L-diversity and T-closeness.
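By way of illustration, the two families can be sketched in a few lines of Python (a simplified, hypothetical example for illustration only; neither technique, applied in isolation, guarantees effective anonymisation):

```python
import random

ages = [23, 37, 41, 58]  # hypothetical attribute values

# Randomisation (noise addition): perturb values so that individual
# records no longer reflect the exact original data.
noisy = [age + random.randint(-3, 3) for age in ages]

# Generalisation: dilute values into coarser categories, modifying
# their scale or order of magnitude (here, 10-year age bands).
bands = [f"{(age // 10) * 10}-{(age // 10) * 10 + 9}" for age in ages]

print(noisy)  # e.g. [25, 35, 43, 56]
print(bands)  # ['20-29', '30-39', '40-49', '50-59']
```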

Pseudonymisation as a specific technique has gained attention more recently with its explicit codification into the General Data Protection Regulation[3] (hereinafter "GDPR"). Indeed, the GDPR now specifically defines pseudonymisation as the processing of personal data in such a way that they can no longer be attributed to a specific individual without the use of additional information, provided that such additional information is kept separately and is subject to technical and organisational measures ensuring that the personal data are not attributed to an identified or identifiable natural person.[4]

The Article 29 Working Party had however already discussed it in its Opinion 05/2014 on Anonymisation Techniques, and notably gave the following examples of pseudonymisation techniques:[5]

  • Encryption with secret key: a technique whereby plain text is changed into unintelligible code and the decryption key is kept secret.
  • Deterministic encryption with deletion of the key: a technique whereby a random number is selected as a pseudonym for each attribute in a database and the correspondence table is subsequently deleted.
  • Hashing: a technique that consists of irreversibly mapping input of any size to a fixed-size output. In order to reduce the likelihood of deriving the input value, salted-hash functions or keyed-hash functions with stored or deleted key may be used.
  • Tokenisation: a technique that consists of replacing card ID numbers with values that have reduced usefulness for an attacker.
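As a purely illustrative Python sketch (all identifiers are hypothetical and chosen by us, not taken from the Opinion), a salted-hash function and a tokenisation table along the lines described above could look as follows:

```python
import hashlib
import secrets

# Salted hashing: a secret salt reduces the likelihood of deriving the
# input value by exhaustive search; deleting the salt afterwards
# approximates a keyed-hash function with deleted key.
salt = secrets.token_hex(16)

def hash_pseudonym(identifier: str) -> str:
    return hashlib.sha256((salt + identifier).encode()).hexdigest()

# Tokenisation: each identifier is replaced with a random value of
# reduced usefulness to an attacker; the correspondence table must be
# kept separately and secret.
token_table: dict[str, str] = {}

def tokenise(identifier: str) -> str:
    if identifier not in token_table:
        token_table[identifier] = secrets.token_hex(8)
    return token_table[identifier]

card_number = "4556-7375-8689-9855"  # hypothetical card ID number
print(hash_pseudonym(card_number))   # stable 64-character digest
print(tokenise(card_number))         # random token, reversible only via the table
```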

The techniques and their respective definitions discussed above demonstrate the techniques' importance in a personal data protection context. However, on the basis of our research, we believe that anonymisation and pseudonymisation techniques may prove to be apt instruments to protect non-personal information in a technical manner.

Anonymisation and pseudonymisation of personal data

By their very nature, anonymisation and pseudonymisation perform different functions in the framework of data protection law. A major difference between the two concepts relates to the goals of the techniques. The goal of anonymisation is primarily to remove linking attributes and to avoid or impede the identification of individuals.[6] Pseudonymisation, however, is not aimed at rendering a data subject unidentifiable, given that – at least in the hands of the data controller – the original data are either still available or deducible. The different functions are discussed below.

Anonymisation and pseudonymisation as processing operations subject to data protection law

The Article 29 Working Party Opinion 05/2014 on Anonymisation Techniques emphasises that "anonymisation constitutes a further processing of personal data."[7] The same reasoning can be applied to pseudonymisation, which is apparent from the definition of pseudonymisation included in the GDPR.[8]

This entails that, when applying an anonymisation or pseudonymisation technique to personal data, one must comply with the data protection principle of purpose limitation, and notably with the requirement of compatibility with the purpose for which the data were initially collected (see also our previous article here).[9] In other words, anonymising or pseudonymising personal data for purposes not compatible with the original purpose amounts to a violation of data protection rules unless there are other lawful grounds for processing.[10]

Such a strict application is open to criticism, as it may discourage data controllers from applying such techniques in the first place. Furthermore, as will be demonstrated below, anonymisation and pseudonymisation may serve as a means to comply with certain data protection rules, such as data protection by design, security of processing, and the purpose limitation principle itself. Therefore, on the premise that anonymisation and pseudonymisation techniques are applied to appropriately secure personal data and comply with other aspects of the GDPR, this should be considered to be compatible with – or even an inherent part of – the original processing purpose.

Anonymisation as a means to avoid the applicability of data protection law

Recital 26 of the GDPR specifies that data protection principles should not apply to anonymous information or to personal data rendered anonymous in such a way that the data subject is no longer identifiable. The Recital further explicitly excludes anonymous information from the GDPR's scope.[11]

The same Recital however specifically states that personal data which have undergone pseudonymisation, but which could be attributed to a natural person by the use of additional information, should be considered to be information on an identifiable natural person and thus as falling within the scope of the GDPR.[12] In a big data context, this may be a preferred approach given that some level of identifiability may be needed, notably to achieve predictability in the analytics. It does imply, however, that pseudonymised data remain subject to data protection rules.[13]

This article therefore further examines whether and, if so, how the use of anonymisation techniques may provide a way out of the scope of data protection law.

In the context of the Data Protection Directive (repealed by the GDPR), the Article 29 Working Party highlighted in its Opinion 05/2014 that data will not constitute personal data only when they are anonymised to the effect that it is no longer possible to associate them with an individual, taking into account all the means likely reasonably to be used either by the data controller or by a third party.[14] According to the Working Party Opinion, an effective anonymisation technique prevents all parties from singling out an individual in a dataset, from linking two records within a dataset (or between two separate datasets), and from inferring any information in such a dataset.[15] In the opinion of the Working Party, this would require anonymisation to be as permanent as erasure, i.e. irreversible anonymisation.[16] The Working Party examines, in the third and substantial section of Opinion 05/2014, various anonymisation practices and techniques, none of which meets the criteria of effective anonymisation with certainty. Consequently, a case-by-case approach, in combination with a risk analysis, should be favoured in order to determine the optimal solution. Combinations of different anonymisation techniques could be used to reach the required (high) level of anonymisation, in which case data protection law would not apply.[17]

Some commentators have been critical of the Article 29 Working Party's proposition on the basis that the Article 29 Working Party applies an absolute definition of acceptable risk in the form of zero risk.[18] They argue that data protection law itself does not require a zero risk approach and that, if the acceptable risk threshold is zero for any potential recipient of the data, there is no existing technique that can achieve the required degree of anonymisation.[19] This might encourage the processing of data in identifiable form, which in fact presents higher risks. Therefore, such commentators claim that, when one assesses identifiability taking into account all means reasonably likely to be used, one should focus on whether identification has become "reasonably" impossible. This would be measured mainly in terms of time and resources required to identify the individual, while taking into consideration the available technology as well as technological developments.[20]

A judgment from the Court of Justice of the European Union (hereinafter "CJEU") of 19 October 2016 in the Breyer case, though still rendered under the Data Protection Directive, might indicate a more practical mind-set. In that judgment, which dealt with the question whether dynamic IP addresses may constitute personal data, the CJEU held that the possibility to combine a dynamic IP address with the additional data held by the Internet service provider does not constitute a means likely reasonably to be used to identify the data subject "if the identification of the data subject is prohibited by law or practically impossible on account of the fact that it requires a disproportionate effort in terms of time, cost and man-power, so that the risk of identification appears in reality to be insignificant."[21] This seems to indicate that the CJEU prefers to steer towards a risk-based approach and away from the Article 29 Working Party's absolute approach.

In conclusion, although the Working Party Opinion and the GDPR provide a clarification of the legal status of anonymisation and pseudonymisation techniques, they regrettably do not contain any guidance for data controllers or data processors on how to effectively anonymise or pseudonymise data.[22] Pursuant to the GDPR, however, associations and other bodies representing categories of data controllers or processors may prepare codes of conduct regarding the pseudonymisation of personal data.[23] We believe such codes of conduct are indispensable to the uptake of pseudonymisation techniques in a big data context, including in the transport sector.

Illustration in the transport sector: In its Code of Practice on Anonymisation, the UK Information Commissioner's Office ("ICO") looks into a case study involving the use of mobile phone data to study road traffic speeds. In that scenario, a telecommunications provider would share subscriber records with a research body, which would try to derive information about traffic speeds by looking at the speed with which individual phones are moving between particular locations. This would entail the processing of potentially intrusive personal information, i.e. geo-location data. According to the ICO, such processing can be avoided by replacing the mobile phone numbers with dummy values. The telecommunications provider could achieve this either through encryption of the individual data records or through tokenisation. In both instances, it is essential that the encryption key or the mapping table, respectively, is kept secret.
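The ICO's scenario can be mimicked in a short Python sketch (the phone numbers, distances, and table are hypothetical; the code of practice itself contains no code): the provider replaces each subscriber number with a dummy value via a secret mapping table, and the research body can still compute per-phone speeds without ever seeing a subscriber number.

```python
import secrets
from collections import defaultdict

# Hypothetical subscriber records: (phone number, seconds elapsed, km marker)
records = [
    ("+44700900001", 0, 0.0),
    ("+44700900001", 60, 1.8),
    ("+44700900002", 0, 0.0),
    ("+44700900002", 60, 1.1),
]

# The provider replaces each number with a dummy value; the mapping
# table never leaves the provider and must be kept secret.
mapping: dict[str, str] = {}

def dummy(number: str) -> str:
    return mapping.setdefault(number, secrets.token_hex(4))

shared = [(dummy(number), t, km) for number, t, km in records]

# The research body derives traffic speeds per (pseudonymous) phone.
trips = defaultdict(list)
for token, t, km in shared:
    trips[token].append((t, km))

for token, points in trips.items():
    (t0, km0), (t1, km1) = sorted(points)
    speed = (km1 - km0) / ((t1 - t0) / 3600)  # km/h
    print(token, round(speed, 1))
```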

Anonymisation and pseudonymisation as a means to avoid the applicability of specific data protection obligations

Even if data protection law applies in general, anonymisation and pseudonymisation may serve as mechanisms to release data controllers or processors from certain specific data protection obligations related to personal data breach (such obligations will be further addressed in our upcoming article on Breach-related Obligations).

Anonymisation and pseudonymisation as a means to comply with data protection law

Anonymisation and pseudonymisation may also constitute a means to comply with certain data protection rules. Thus, even when the application of data protection law cannot be bypassed, some techniques may facilitate complying with it. In this respect, Recital 28 of the GDPR explicitly provides that "the application of pseudonymisation to personal data can […] help controllers and processors to meet their data protection obligations."

  • Data protection by design and by default: As discussed in our previous article on Privacy & Data Protection here, controllers must implement 'appropriate technical and organisational measures' to ensure the data protection principles under Article 5 of the GDPR are complied with in an effective way and to integrate the necessary safeguards into the processing in order to meet the requirements of the GDPR. Such measures may result, for example, from pseudonymisation techniques.[25]
  • Security of processing: Controllers (and processors) are required to implement appropriate technical and organisational measures.[26] Such measures shall take into account several factors such as (i) the state of the art; (ii) the costs of implementation; (iii) the nature, scope, context, and purposes of the processing; and (iv) the risk of varying likelihood and severity for the rights and freedoms of natural persons. The GDPR goes further than the former Data Protection Directive as it provides specific – yet limited – suggestions for what types of security measures might be considered "appropriate to the risk". The first of these suggested measures is "the pseudonymisation and encryption of personal data".[27]
  • Purpose limitation (further processing of personal data): According to the purpose limitation principle[28], personal data must be collected for specified, explicit and legitimate purposes and not further processed in a manner incompatible with those purposes. In order to ascertain whether such processing for another purpose is compatible with the purpose for which the personal data were initially collected, the GDPR requires the data controller to take into account the existence of appropriate safeguards, including pseudonymisation and encryption.[29]
  • Storage limitation: The storage limitation principle[30] requires personal data to be kept in a form permitting identification of data subjects for no longer than is necessary for the purposes for which the data were collected or for which they are further processed. This would call for either the deletion or the (effective) anonymisation of such data.[31]

Techniques of anonymisation as a way to protect non-personal data

It cannot be excluded that certain stakeholders participating in big data analytics, including in the transport sector, engage in the disclosure of their trade secrets.[32] The big data analytics lifecycle may also include the analysis of confidential information, which for some reason may not qualify as a trade secret. Any disclosure of such confidential information may be potentially harmful to the commercial interests of the stakeholder involved.

Considering the commercial value of trade secrets and/or confidential information to any given company, it is essential to protect them prudently. This may be done by solely providing access to such information on a strict need-to-know basis or by putting in place non-disclosure agreements with anyone who needs to have access to the information. Such practical and contractual considerations may well be a good basis for protection, but they are not always sufficient. For instance, a contract cannot be enforced against third parties to that contract. Moreover, a breach of a non-disclosure agreement inevitably entails the loss of the "secret" character of a trade secret and is therefore usually irreversible. In that case, only financial compensation is available as a remedy. Finally, practical and contractual solutions do not cover the situation of loss of information through theft or leaks, when the company was not willing to share the information in the first place.

It may therefore prove useful to implement a technical protection to supplement the practical and contractual protection and to render theft or leaks of non-personal information difficult or even impossible. The requirements related to the technical protection of data may then be reflected in the contractual terms, such as in the parties' obligations and warranties. From a legal perspective[33], anonymisation and pseudonymisation techniques may prove to be good protection mechanisms, given that their legal significance has already been recognised in the context of data protection legislation, and most recently by the GDPR.[34]

Using anonymisation for the protection of non-personal information, notably in a big data analytics context, may yield the following benefits:

  • The implementation of anonymisation techniques may qualify as a reasonable step "under the circumstances, by the person lawfully in control of the information, to keep it secret" in order to have one's information fall within the scope of the Trade Secrets Directive and thus to be able to invoke legal protection.
  • More generally, by implementing a technical protection like anonymisation, one may be able to demonstrate, e.g. in court, that one has acted as a bonus pater familias[35] in protecting one's own or another's assets.
  • Duplicating the mechanisms of protection (i.e. implementing a combination of legal, practical, contractual and technical protections) results in greater protection overall.
  • Sufficiently anonymised or pseudonymised information will not be compromised in case of a data leak or breach. The same would be true for encrypted information, provided that the key to the encrypted information does not reside with a third party.
  • The implementation of a technical protection can be a means to strengthen contracts between the stakeholders involved in the big data analytics; i.e. by increasing the data importer's liability in case it does not adequately anonymise the imported information or in case it does not sufficiently protect the key to pseudonymised information.
  • Whereas a legal framework for the ownership of data is currently lacking, a more pragmatic solution may be found for the ownership of the key to pseudonymised data in the existing legal framework on software protection.[36] Hence, it may be possible for companies to frame the sharing of pseudonymised information with a copyright-type software licence over said key, thus adding an extra layer of (both legal and contractual) protection.

Taking into account the advantages anonymisation offers in protecting non-personal information, it is commendable to apply anonymisation techniques to such sensitive non-personal information shared in a big data analytics context. Indeed, if companies can be reassured about the technical protection of their information in a big data environment, they will be more willing to share that information with big data analytics service providers or with big data analytics platforms.

Illustration in the transport sector: In their paper on Anonymization of Data from Field Operational Tests, Y. Barnard et al. discuss the use of anonymisation and other data processing techniques to strip logs of personal and confidential information in order to encourage data sharing for transport research and innovation projects, with a particular focus on field operational tests ("FOTs"). FOTs involve the collection of large amounts of data to study driving behaviour when interacting with intelligent transport systems ("ITS"), including cooperative intelligent transport systems ("C-ITS") and automated vehicles. The data gathered in such context may be personal, commercial, and/or research sensitive. Y. Barnard et al. therefore advocate the use of anonymisation techniques, while pointing out the potential risk of losing essential information in the process. According to them, an effective anonymisation technique, preserving however research-essential information, would facilitate the access to and re-use of valuable data.

Conclusion

Anonymisation and pseudonymisation techniques generally provide fertile ground for opportunities with respect to big data applications, including in the transport sector. In this respect, it shall be noted that the use of anonymisation is specifically encouraged by Recital 13 of the ITS Directive[38] as "one of the principles of enhancing individuals' privacy". In addition, this article explored the possibility of applying anonymisation and pseudonymisation techniques to non-personal information.

Nevertheless, account must be taken of the challenges that may arise in this respect. Most importantly, a balance will need to be struck between, on the one hand, the aspired level of anonymisation (and its legal consequences) and, on the other hand, the desired level of predictability and utility of the big data analytics.

Illustration in the transport sector: The CabAnon Project run by the 'Laboratoire d'Innovation Numérique de la CNIL' ("LINC") aims to assess the utility of properly anonymised data. For this purpose, the LINC team analysed records of taxi rides in New York City. While recognising that anonymisation entails a certain loss of information and, hence, a loss in terms of accuracy and utility, LINC aims to quantify such loss. It notably looked at the NYC taxi dataset's utility with respect to the following applications: (i) allowing taxi users to identify spots in their vicinity where they are likely to quickly find a taxi, using density of traffic; (ii) allowing city planners to conceive other solutions to organise mobility, based on the number of passengers per taxi; (iii) allowing people to determine the best moments to commute and city planners to identify places with traffic congestion, on the basis of traffic speed; and (iv) providing insights to city planners on how people move through the city and how to improve public transportation, based on the direction of traffic. LINC's first experiments showed that exploitable results could be achieved with a rather coarse but robust anonymisation approach.
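The kind of coarse anonymisation at issue can be loosely illustrated in Python (a hypothetical sketch with invented records, not LINC's actual methodology): pickup coordinates are truncated to two decimal places (roughly one-kilometre cells) and timestamps to the hour, after which only aggregate counts per cell are used.

```python
from collections import Counter

def coarsen(lat: float, lon: float, hour: int) -> tuple:
    """Snap a pickup to a ~1 km grid cell and keep only the hour."""
    return (round(lat, 2), round(lon, 2), hour)

# Hypothetical pickup records: (latitude, longitude, hour of day)
pickups = [
    (40.7581, -73.9857, 8),
    (40.7584, -73.9851, 8),
    (40.7306, -73.9866, 9),
]

# Only aggregate densities per cell are released, which still supports
# applications such as spotting where taxis are easy to find.
density = Counter(coarsen(*p) for p in pickups)
for cell, count in density.items():
    print(cell, count)
```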

It follows from the foregoing that, as such, anonymisation and pseudonymisation techniques and their legal consequences are desirable concepts in the big data analytics lifecycle, including in the transport sector. However, a better alignment is needed between the legal and technical interpretations of those concepts, so that legal and technical professionals may share a common understanding on the consequences of the use of such techniques.

Additionally, the creation of codes of conduct and similar initiatives is indispensable to support stakeholders in assessing the risk of re-identification. Such initiatives should be further developed throughout the EU, including in the transport sector.

Finally, a wider and better uptake of anonymisation and pseudonymisation techniques should be encouraged, not only in the field of personal data protection, but also with respect to non-personal information requiring or meriting protection (e.g. trade secrets), in light of the advantages of those techniques discussed in this article. To this end, investment in terms of both time and money should be made to further research, elaborate, and increase the robustness of such techniques, taking into consideration their possible concrete application to different types of data.