Data types, Data sets and Big Data

A dataset is a collection of data which is manipulated by computers to produce relevant information on any given subject matter. Big Data is the colloquial way of describing the aggregation, analysis and increasing value of vast exploitable datasets.

We have seen that the increase in big data use has led to new areas of technology that are hitting the headlines today. These include machine learning (a branch of Artificial Intelligence, “AI”), 3D printing, virtual reality, the Internet of Things and nanotechnology. (For more information see our blog on the Internet of Things.)

Big Data is the collection of massive amounts of data taken from a variety of sources such as internet searches, credit card purchases and mobile location services. Near real-time analysis provides what is known as repurposed data (that is to say, data taken from a source for one purpose and analysed for use for a different purpose). This offers huge potential to marketers and advertisers, amongst many other commercial and non-commercial uses. Big Data may allow those using it to analyse present behaviour and predict consumers’ needs before consumers know those needs themselves. Big Data is more than just a buzzword if you fully understand what it consists of and, more importantly, how to use it to commercial advantage.

Big Data is typically characterised by “aggregation” and “analysis”:

Aggregation is often described in terms of the three V’s:

  • Volume: vast volumes of data;
  • Variety: many different formats (text, image, video, sound), both unstructured (typically 80%) and structured (typically 20%); and
  • Velocity: speed of processing.

Analysis involves datasets being analysed by quantitative analysis software (for example, using algorithmic computation or artificial intelligence). This may enable a shift from analysing behaviour retrospectively to predicting behaviour, and this is where the commercial value is realised.

Is it personal?

The Data Protection Act 1998 (DPA) defines personal data as data which relates to a living individual who can be identified from those data or from those data and other information which is in the possession of, or is likely to come into the possession of, the data controller.

The Information Commissioner’s Office (“ICO”) sets out guidelines for determining whether data is personal or not. It considers whether a living individual can be identified from the data, or from the data and other information in your possession or likely to come into your possession. It highlights that, to be personal, the data should ‘relate to’ the identifiable living individual, whether in personal or family life, business or profession, and if so asks whether the data is ‘obviously about’ a particular individual. The guidelines consider whether the data is ‘linked to’ an individual so that it provides particular information about that individual; whether the data is used, or is to be used, to inform or influence actions or decisions affecting an identifiable individual; and whether the data has any biographical significance in relation to the individual. Importantly, the ICO asks whether the data impacts or has the potential to impact on an individual, whether in a personal, family, business or professional capacity.

The “anonymisation” of data

Why anonymise data?

The law provides that the principles of data protection shall not apply to data rendered anonymous in such a way that the data subject is no longer identifiable.

Essentially, data controllers who hold information about individuals will not fall foul of the DPA and will fulfil their data protection obligations if they render the information they hold about individuals anonymous. In doing so, data controllers can use the anonymised data in different ways as there are fewer regulatory burdens on it and the DPA’s purpose-limitation does not apply.

The primary objective, as obvious as it sounds, is to protect the individual’s privacy when making available the data resources that activities such as research and planning rely on.

Can you truly anonymise data?

It would seem on the face of it that it would be simple to say whether a particular set of data relates to and identifies an individual, and therefore simple to render it anonymous. However, it can be impossible to assess re-identification risk with absolute certainty. Individual sets of data may be “anonymous” when looked at in isolation, yet identification of the individual may become possible when they are combined, perhaps with supplementary data. Harvesting and analysing vast amounts of Big Data may therefore result in the re-identification of personal data. If you do produce personal data through re-identification, you will take on your own data controller responsibilities. The sheer volume of data and the variety of sources of data available about any one individual raise the question: can you truly anonymise data?
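To make the combination risk concrete, the following is a minimal Python sketch of how two datasets that each look harmless might be linked. Everything in it is hypothetical and invented for illustration: the records, the field names, the choice of postcode, birth year and sex as linking fields, and the function names. It is not a description of any real dataset or of a method endorsed by the ICO.

```python
from collections import Counter

# Hypothetical "anonymised" records: names removed, other attributes kept.
health_records = [
    {"postcode": "AB1 2CD", "birth_year": 1980, "sex": "F", "diagnosis": "asthma"},
    {"postcode": "AB1 2CD", "birth_year": 1975, "sex": "M", "diagnosis": "diabetes"},
    {"postcode": "EF3 4GH", "birth_year": 1980, "sex": "F", "diagnosis": "migraine"},
]

# Hypothetical supplementary data that contains names.
public_register = [
    {"name": "Alice Example", "postcode": "AB1 2CD", "birth_year": 1980, "sex": "F"},
    {"name": "Bob Example", "postcode": "AB1 2CD", "birth_year": 1975, "sex": "M"},
    {"name": "Carol Example", "postcode": "EF3 4GH", "birth_year": 1980, "sex": "F"},
    {"name": "Dana Example", "postcode": "EF3 4GH", "birth_year": 1980, "sex": "F"},
]

# Assumed linking fields (often called quasi-identifiers).
QUASI_IDENTIFIERS = ("postcode", "birth_year", "sex")


def linking_key(record):
    """Build the combination of attributes used to link the two datasets."""
    return tuple(record[field] for field in QUASI_IDENTIFIERS)


def reidentify(anonymous, register):
    """Return (name, record) pairs where the attribute combination
    matches exactly one person in the register."""
    counts = Counter(linking_key(person) for person in register)
    unique_names = {
        linking_key(person): person["name"]
        for person in register
        if counts[linking_key(person)] == 1
    }
    return [
        (unique_names[linking_key(record)], record)
        for record in anonymous
        if linking_key(record) in unique_names
    ]


matches = reidentify(health_records, public_register)
for name, record in matches:
    print(name, "->", record["diagnosis"])
```

In this invented example two of the three records are re-identified because their attribute combination is unique in the register, while the third is shared by two people and so remains ambiguous. The more sources that are combined, the fewer such ambiguous groups tend to remain.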

The answer is most probably no. It is therefore important for organisations that use anonymised data to focus on mitigating the risks of re-identification to the point where the chance is “extremely remote”. Organisations should be able to demonstrate that they have carried out a robust assessment of the risks of re-identification and have adopted solutions proportionate to any risk. This is likely to involve implementing procedures and protocols for collecting, storing and using data, which may combine a range of technical measures such as data masking, ‘pseudonymisation’, aggregation and banding, with legal and organisational safeguards.
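As a rough illustration of three of the technical measures named above, the Python sketch below shows one possible form of pseudonymisation (a keyed hash), banding (grouping ages into ranges) and masking (truncating a postcode). The record, the field names, the key and the function names are all assumptions made for this example. It is a sketch of the general techniques, not a compliance recipe, and it does not cover the legal and organisational safeguards.

```python
import hashlib
import hmac

# Hypothetical secret key; in practice it would be stored securely and
# kept separate from the data, since whoever holds it can recreate the mapping.
SECRET_KEY = b"replace-with-a-securely-stored-key"


def pseudonymise(identifier: str, key: bytes = SECRET_KEY) -> str:
    """Replace a direct identifier with a keyed hash (HMAC-SHA256).
    The same input always yields the same reference, so records can
    still be linked to each other without exposing the name."""
    digest = hmac.new(key, identifier.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def band_age(age: int, width: int = 10) -> str:
    """Banding: report an age range instead of the exact age."""
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"


def mask_postcode(postcode: str) -> str:
    """Masking: keep only the first part of the postcode (before the space)."""
    return postcode.split(" ")[0]


def protect(record: dict) -> dict:
    """Apply all three measures to one record and drop the name."""
    return {
        "person_ref": pseudonymise(record["name"]),
        "age_band": band_age(record["age"]),
        "postcode_area": mask_postcode(record["postcode"]),
        "purchase": record["purchase"],
    }


# Hypothetical input record.
record = {"name": "Alice Example", "age": 37, "postcode": "AB1 2CD", "purchase": "books"}
protected = protect(record)
print(protected)
```

Measures like these reduce the detail available for linking but do not remove the risk on their own, which is why the assessment of re-identification risk still has to consider what other data the output could be combined with.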