Big data has been a “burgeoning” information technology trend for a number of years now, but indications are that it is finally about to be accepted into the mainstream. And Australian companies are at the forefront of this trend or, more accurately, at least some of them are. A recent study by Tata Consultancy Services found that Australia had a relatively low take-up of big data (with only 32% of Australian respondents undertaking big data initiatives in 2012, compared to 70% in India and 68% in the US), but that Australian companies that do use big data spend more than those in other countries (with a median spend in Australia of $50 million per company, compared to only $9 million in the US).

These statistics clearly suggest that the time has come for all companies to look at big data and seriously consider what role it has to play in their business. And a good starting point is to understand what “big data” really is and the purposes for which it can be used.

What is big data and how is it used?

In short, the term “big data” usually refers to very large sets of often unstructured data, and can be contrasted with the “small data” subsets more commonly used for traditional data analysis. A key advantage of big data is that it can reveal useful correlations and insights that might otherwise be missed due to the sampling choices and other assumptions that must be made when applying a traditional small data approach.

The shift towards big data analysis techniques has essentially been driven by two factors:

  • Firstly, the ever-increasing volume of data that is available for analysis as people around the world conduct more and more of their business and personal lives in a digital environment. In a recent book on big data, academic Viktor Mayer-Schonberger and journalist Kenneth Cukier give a sense of the volume of data currently in play by estimating that if all the digital data in the world in 2013 was printed in books and then laid on the ground, the books would cover the entire surface of the United States in a stack 52 volumes high. And given that the authors also estimate that the amount of data is doubling every 3 years, by 2016 the stack of books would be over 100 volumes high!
  • Secondly, improvements in technology that have reduced the cost and increased the availability of the computing power required to crunch through massive data sets. In the past, attempting to process the volume of data referred to by Mayer-Schonberger and Cukier would have been unthinkable. However, modern technology is removing many of those limitations, and as a result companies are becoming increasingly ambitious in their data processing goals. As one example, Google has a stated aim of digitising every unique book that has ever been published (copyright laws allowing). By Google’s own estimate, this would involve scanning 130 million books.
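The authors’ doubling estimate can be checked with a line of arithmetic. The sketch below is illustrative only: it simply extrapolates their 2013 figure (a stack 52 volumes high, doubling every 3 years) and is not a forecast of actual data growth.

```python
def stack_height(year, base_year=2013, base_volumes=52, doubling_period=3):
    """Extrapolate the 'stack of books' height under a simple doubling model.

    The base figures are Mayer-Schonberger and Cukier's 2013 estimate;
    the exponential model is purely illustrative.
    """
    return base_volumes * 2 ** ((year - base_year) / doubling_period)

print(stack_height(2016))  # 104.0 -- "over 100 volumes high" by 2016
```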

These are overwhelming figures and it can be difficult for a lay person to see how this mass of data could ever be distilled into useful information. Perhaps the most famous application of big data to date comes from the retailer Target in the United States, which analysed customer purchase records to identify products that women frequently purchase shortly after becoming pregnant (such as nutritional supplements, a larger handbag to hold child-related paraphernalia and so on). By applying this information to new purchases, Target found that it could predict with a reasonable degree of accuracy when a particular shopper was pregnant, and then start marketing to them accordingly. Indeed, in one case Target sent coupons for baby clothes to a young woman who, based on the products she had been buying, it predicted was pregnant. Her father promptly complained to the store, claiming it was encouraging his still high school-aged daughter to become pregnant. When the store manager called a few days later to offer an apology, the father conceded that Target had been correct: his daughter was indeed pregnant.

Another more public-minded example comes from Google, which found that it could predict the geographic spread of flu viruses by tracking web searches entered in different locations (presumably entered by users searching for cough medicines and local medical centres). Google’s tracking methods proved to be much faster than official records, which struggled to keep pace with rapidly spreading viruses because they relied on reports received from medical practitioners. Given that most patients would be sick with the virus for some time before consulting a doctor, these records would be almost out of date before they had even been received by the authorities, let alone by the time they had been properly analysed and plotted geographically.

Many people would say that this is exciting stuff and it is certainly true that big data can deliver some interesting and revolutionary insights from common transactions. However, others may find the potential of big data a little scary and would, with some justification, be concerned about the privacy implications of their everyday transactions being recorded and analysed in this way. Given this natural tension and the ever expanding reach of privacy regulation, it is important for businesses to carefully consider privacy implications before embarking on any significant big data initiative.

How do Australian privacy laws affect big data?

In Australia, privacy law is in a stage of transition, as significant changes to the Privacy Act 1988 (Cth), Australia’s key piece of privacy legislation, came into effect in March 2014. The discussion in this article will focus on the landscape prevailing after those changes took effect.

At its heart, privacy regulation in Australia is concerned with the treatment of “personal information”, which is defined to include any information or opinion “about an identified individual, or an individual who is reasonably identifiable”. As such, any information in a customer record, such as the purchasing records kept by Target, is likely to be personal information for the purposes of Australian privacy law. Even if the record is assigned an “anonymous” identifier in place of the customer’s name, it will still be personal information if the identifier can be easily connected with the customer in question (such as where the identifier is linked to the customer’s credit card, so that the customer will be recognised on the next occasion that they use their card to make a purchase).
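To see why a pseudonymous identifier of this kind can still count as personal information, consider a minimal sketch (hypothetical card numbers, Python's standard hashlib) in which the “anonymous” customer ID is derived from the customer’s card number. Because the same card always yields the same ID, each new purchase links straight back to the same individual:

```python
import hashlib

def pseudonym(card_number: str) -> str:
    """Derive a stable 'anonymous' customer ID from a card number (sketch only)."""
    return hashlib.sha256(card_number.encode()).hexdigest()[:12]

# Two visits with the same (hypothetical) card produce the same identifier,
# so the "anonymised" purchase history is still linkable to one individual.
visit_1 = pseudonym("4111 1111 1111 1111")
visit_2 = pseudonym("4111 1111 1111 1111")
print(visit_1 == visit_2)  # True
```

The identifier never names the customer, yet because it can be joined back to the cardholder, the individual remains “reasonably identifiable” in the sense used by the Privacy Act.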

Where a big data project involves personal information, it will be important that it is carried out in accordance with the Australian Privacy Principles (“APPs”), which are made under the Privacy Act and set out the core tenets of Australian privacy law. In particular, the following aspects of the APPs will be relevant:

  • APPs 1.3 and 1.4 require that an organisation have a privacy policy that, amongst other things, describes the types of personal information that the organisation collects. Arguably, information that has been inferred or otherwise generated from big data analysis of existing data has not been “collected” in the ordinary sense of the word. However, given that this analysis would not have been possible without the original data, our view is that the inferred information should be deemed to have been collected in the same way as the original data was collected. 

    The only difference is one of timing: the original data is collected at the time it is obtained from the original data source, whereas the inferred information should only be deemed to be collected at the time that the relevant analysis that leads to that inference takes place. This interpretation is supported by the fact that “personal information” is defined to include an opinion and an opinion is something that can only ever be inferred from some base information.  If drawing an inference did not amount to a “collection” of information, then the rules on collection in the APPs could not apply to personal information comprised of an opinion, which would significantly undermine the effectiveness of the APPs in regulating that type of personal information. It follows from this that an organisation intending to carry out big data analysis on customer records must specify in its privacy policy that it may collect information in this way.
  • Other rules on collection also need to be considered. In particular, APP 3.1 prohibits collection of personal information by an organisation unless it is reasonably necessary for, or directly related to, the organisation’s functions or activities. In other words, an organisation should not generate new personal information about an individual simply because it is able to do so. On this basis, an organisation should only carry out big data analysis that may produce new personal information if the information is reasonably necessary for one of the organisation’s legitimate functions or activities. This means that the goals and likely outcomes of any big data exercise need to be considered in advance, in order to ensure that the organisation has a legitimate basis for collecting any outputs that constitute personal information.
  • Various APPs apply special rules to “sensitive information”, which is personal information relating to a sensitive topic area (such as an individual’s health, racial or ethnic origin, political opinions and so on). It is not difficult to see how the product of big data analysis could potentially qualify as sensitive information. For example, if a grocery retailer analyses the types of food that a particular customer usually buys and deduces that the customer regularly purchases products designed for those with an intolerance to gluten, this may qualify as information about health and should be treated as sensitive information (even though the gluten-intolerant individual may not be the person who purchases the household groceries). Generally speaking, under APP 3.3 an organisation must not collect sensitive information unless it has the relevant individual’s consent to do so. 

    Accordingly, in our example, unless the customer had previously consented to the retailer collecting information about their food intolerances, the retailer may be in breach of the APPs. Other APPs will also be relevant. For example, under APP 7 an organisation must generally not use sensitive information for direct marketing unless the individual has consented to that use (by contrast, other “ordinary” types of personal information may be used for direct marketing in a broader range of circumstances). Accordingly, to return to our example, if the retailer forms the opinion that their customer is likely to have an intolerance to gluten, the retailer would need the customer’s consent in order to use that information to send them marketing material about alternative diets, gluten-free recipe books or other similar products designed for people with that type of intolerance.
  • APP 8 limits the circumstances in which personal information may be transferred outside Australia. These limitations will need to be considered if an organisation wishes to use an entity overseas to carry out any big data analysis of personal information (e.g. if an overseas data processor is to be engaged to do the data crunching). In most cases, assuming it is not practical to obtain consents from all of the individuals to whom the information relates, the organisation in Australia will need to procure a contractual commitment from the overseas entity to comply with Australian privacy laws in its handling of the relevant information. In addition, in order to comply with privacy notification requirements under APPs 1.4 and 5.2, the organisation will need to ensure that its privacy notices accurately disclose its plans to share information overseas.
  • Under APP 11.2 an organisation must either destroy or de-identify / anonymise personal information in its possession once it is no longer required for any purpose for which it was originally collected. Anonymisation may seem a simple way to avoid privacy concerns, as once information can no longer be linked to an individual it cannot pose a threat to that individual’s privacy. However, anonymisation is not always a straightforward task, particularly when dealing with the type of rich and detailed data used for big data analysis. For example, a number of years ago America Online (“AOL”) was embarrassed when it released an “anonymised” database of 20 million search queries, compiled over a period of 3 months, for public research purposes. 

    Despite AOL’s attempts at anonymisation (which included removing search queries for social security and credit card numbers), it turned out that a sufficiently detailed search history could still be traced back to a particular user. Two New York Times reporters traced one collection of searches (which included geographic searches such as “landscapers in Lilburn, Ga” and searches for a series of people with the last name “Arnold”) to a particular user named Thelma Arnold in Georgia (who clearly enjoyed gardening and researching her family history). In short, when a particular data set is sufficiently detailed, there will always be a risk that it can be traced back to a unique individual. Perhaps the best form of anonymisation is to aggregate the data in a statistical format and then to destroy the underlying records used in the aggregation process. However, there is a trade-off that comes with this approach, as deleting the underlying data also prevents any future big data analysis that may produce new insights not already captured in the aggregated format.
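The aggregate-then-destroy approach described above can be sketched in a few lines (hypothetical purchase records, Python standard library only): the raw per-customer rows are rolled up into product-level totals, and the underlying records are then deleted so that no individual can be traced.

```python
from collections import Counter

# Hypothetical raw records: (customer_id, product) pairs.
records = [
    ("c1", "gluten-free bread"),
    ("c2", "milk"),
    ("c1", "gluten-free pasta"),
    ("c3", "milk"),
]

# Aggregate into a statistical format that carries no per-customer detail...
product_totals = Counter(product for _, product in records)

# ...then destroy the underlying records used in the aggregation.
records.clear()

print(product_totals["milk"])  # 2 -- totals survive, individuals do not
```

The trade-off noted above is visible here: once `records` is cleared, no future analysis can extract anything beyond what was captured in `product_totals`.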

As the above examples illustrate, when dealing with information about identified individuals, there are many issues that require careful consideration before proceeding with any big data analysis. The consequences of failing to ensure proper compliance with applicable privacy regulations may be severe. As part of a package of recent changes to the Privacy Act, the Australian Privacy Commissioner’s investigatory powers have been enhanced and new potential fines for breaching privacy laws have been introduced (to a maximum of $1.7 million for a serious contravention). Other orders or undertakings may also be required in a breach scenario, and these could conceivably require things such as the deletion of information that has been improperly obtained or collected. Accordingly, an organisation that misuses personal information as part of a big data exercise may find that the fruit of its labour, and any associated expense, goes to waste.

However, apart from official sanctions, would-be big data users need to think carefully about how their actions may be perceived in the court of public opinion. As mentioned above, the public may view some of the more innovative uses of big data with some concern. If not carefully handled, big data projects run the risk of turning off customers who are concerned about safeguarding their privacy, even if they do not technically involve any contravention of privacy regulations. To mitigate this risk, it is important for organisations to ensure that their big data initiatives satisfy relevant privacy standards both in spirit as well as in substance. After all, not every father wants to find out that his daughter is pregnant through the local discount retailer!