“Big data” is among the biggest buzzwords circulating today, not only in industry but throughout the public consciousness. Its most visible examples range from the NSA’s PRISM program, revealed by Edward Snowden, to Amazon.com’s product recommendations. Less well known is how the proliferation of low-cost tools for implementing big data projects has drastically lowered big data’s barrier to entry, empowering enterprises large and small to make smarter decisions and offer sophisticated services powered by extensive data analysis. Lowered barriers to entry, however, do not mean lowered obligations or reduced risk.
What Is “Big Data”?
The term “big data” refers to a wide variety of analyses performed on data sets too large for legacy data-analysis techniques. Industry usually defines big data by the three Vs: volume, velocity and variety. Volume refers to the vast amounts of data being collected, (possibly) stored and analyzed. Velocity refers to the speed at which information is generated and analyzed, often at or near real time. Variety refers to the myriad types of collected data: structured and unstructured, private and public. A big data analysis might, for example, span terabytes or petabytes of data consisting of a mix of highly structured data extracted from a private enterprise data warehouse, loosely structured transactional data from line-of-business applications, and unstructured data from private email or public social media sources.
What Low Cost Tools Are Available To Implement Big Data?
At their core, big data architectures comprise three layers: data sources, data infrastructure, and analytics and presentation tools. Low-cost or open source solutions are available for all three.
Social media platforms such as Twitter and Facebook provide access to some of their data through free APIs. Similarly, the Open Government API provides free access to an ever-increasing supply of governmental data. Google Analytics and the open source alternative Piwik can provide a wealth of data on website traffic, traffic sources, conversions, sales tracking and the like, all at little or no cost. Traditional line-of-business applications also provide data for use in big data analyses.
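To make the data-source layer concrete, here is a minimal Python sketch of the usual pattern: take a JSON payload of the kind a public data API returns and reduce it to the fields you need. The payload shape and field names here are hypothetical; real APIs such as Twitter’s require registration, API keys and their own schemas.

```python
import json

# Hypothetical API response payload, shaped like what a web-analytics or
# open-data endpoint might return. Field names are invented for illustration.
SAMPLE_RESPONSE = """
{
  "results": [
    {"page": "/home", "visits": 1200},
    {"page": "/products", "visits": 875},
    {"page": "/contact", "visits": 310}
  ]
}
"""

def top_pages(payload, n=2):
    """Parse a JSON API payload and return the n most-visited pages."""
    records = json.loads(payload)["results"]
    records.sort(key=lambda r: r["visits"], reverse=True)
    return [r["page"] for r in records[:n]]

print(top_pages(SAMPLE_RESPONSE))  # most-visited pages first
```

In a real project the string above would instead come over HTTP from the provider’s endpoint, but the parse-filter-sort step is the same.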
Data infrastructure for big data consists of frameworks and tools for storing and processing massive data sets. Apache Hadoop is a widely used open source framework that forms the foundation of most big data infrastructures; Facebook’s Hadoop cluster, for example, purportedly holds over 300 petabytes of data. Several NoSQL databases, such as MongoDB, offer open source storage solutions.
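Hadoop’s core programming model is MapReduce. The following single-process Python sketch shows that pattern with the canonical word-count example; a real Hadoop job would distribute the map and reduce phases across a cluster rather than run them in one interpreter.

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in a document."""
    return [(word.lower(), 1) for word in document.split()]

def reduce_phase(pairs):
    """Reduce step: sum the counts emitted for each distinct word."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

documents = ["big data big tools", "data tools"]
pairs = chain.from_iterable(map_phase(d) for d in documents)
print(reduce_phase(pairs))
# {'big': 2, 'data': 2, 'tools': 2}
```

The appeal of the model is that the map and reduce functions are independent of where the data lives, which is what lets Hadoop scale the same logic from a laptop to hundreds of machines.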
Big data analytic tools include software for performing traditional statistical analyses and for applying machine learning algorithms. R is a popular open source language and environment for statistical analysis and graphical output. Apache Mahout is a collection of free libraries for implementing distributed or otherwise scalable machine learning algorithms.
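As a language-neutral illustration of the kind of descriptive statistics R handles, here is a sketch using Python’s standard `statistics` module; the daily-visit figures are invented for illustration.

```python
import statistics

# Toy data set: a week of daily website visits, the kind of metric an
# analytics tool such as Piwik or Google Analytics might export.
visits = [120, 135, 128, 160, 152, 148, 170]

mean_visits = statistics.mean(visits)      # central tendency
median_visits = statistics.median(visits)  # robust to outliers
stdev_visits = statistics.stdev(visits)    # day-to-day variability

print(mean_visits, median_visits)
```

In R the equivalent one-liners would be `mean(visits)`, `median(visits)` and `sd(visits)`, with built-in plotting a single `plot()` call away; the point is that basic analysis of this kind requires no commercial tooling at all.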
These technologies make big data possible for enterprises of all sizes – and new solutions will continue to emerge.
What Obligations Have I Incurred?
The use of open source software is not obligation-free: each open source component comes with its own license obligations, and the licenses attached to many available big data tools may pose problems depending on your use case. For example, R and Piwik are licensed under the GNU General Public License (GPL), whose “copyleft” provisions require that any modifications to the licensed product be redistributed under the same license terms. The GNU Affero General Public License (AGPL) attached to MongoDB includes similar provisions. Service providers that customize such software and redistribute it to clients may be obligated to release the source code for their customizations; under the AGPL, this may be true even when the customized software is only offered as a service accessible via remote access. In short, license terms must be thoroughly evaluated on a case-by-case basis before incorporating any open source software into your enterprise.
In addition, accumulating personally identifiable information for big data projects incurs obligations and risks of its own. Entities holding such data have a legal obligation to protect it from acquisition by third parties. While definitions vary from state to state, personally identifiable information typically means an individual’s first and last name combined with additional information such as an account number, a Social Security number, or health information. If a breach occurs, you must report it to the affected individuals and, depending on the jurisdiction, to consumer reporting agencies, state attorneys general and others; reporting requirements likewise vary from state to state.
If your big data analyses include personally identifiable information, you must take appropriate steps to protect it. The National Institute of Standards and Technology recently released its “Framework for Improving Critical Infrastructure Cybersecurity,” which focuses on using business drivers to guide cybersecurity activities and on treating cybersecurity risk as part of the organization’s broader risk management processes. The Framework, available from NIST, is a valuable resource for determining what steps may be appropriate for your big data project.
Big data analytical projects offer great promise, but you must carefully consider the potential risks and obligations attached to the technologies and data used in those projects.