Regrettably, the rise of "Big Data" has been accompanied by the often mistaken notion that organizations should strive to retain as much data as possible. Technologies such as the Hadoop® Distributed File System have allowed large volumes of data to be stored cheaply. Accumulating vast quantities of data can nonetheless be problematic: the quality and relevance of the data may begin to suffer, and the organization takes on additional cost and risk exposure in audit, investigation, production, and eDiscovery processes, as well as in the event of a data security breach.

For Big Data to be effective within your organization, a carefully crafted data governance strategy is key. The following principles can help an organization manage its data hoarding tendencies:

Determine the 'best-before date'

Most data has a required or useful lifecycle. Once the use and purpose of the data have been exhausted, the data becomes redundant, obsolete, or trivial, and provides limited or no value to an organization. Not only does storing irrelevant data cost money, but the over-retention of personal data can also run afoul of privacy laws.

Some jurisdictions have data privacy laws that prohibit the retention of an individual's personal information beyond a defined maximum period of time. Countries with strong privacy laws generally mandate that personal data be securely deleted or destroyed when it is no longer required by law or needed to fulfill the purposes for which it was collected. A clear data retention policy is vital to helping organizations comply with their legal obligations. Organizational data, including personal information, should be earmarked for deletion after the appropriate date, and a responsible manager should be appointed to ensure that expired data is routinely purged.
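As a rough illustration of how a retention policy can be turned into a routine check, the following Python sketch flags records whose retention period has passed. The category names, retention periods, and the `Record` fields are hypothetical placeholders; actual periods depend on the applicable law and the purpose for which the data was collected.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Hypothetical retention periods, in days, per record category.
# These values are placeholders, not legal guidance.
RETENTION_DAYS = {
    "marketing_contact": 365,
    "customer_account": 7 * 365,
}


@dataclass
class Record:
    record_id: str
    category: str
    collected_on: date


def expiry_date(record: Record) -> date:
    """Return the date after which the record should no longer be kept.

    Raises KeyError if the category has no retention period defined,
    so that unclassified data is surfaced rather than silently kept.
    """
    return record.collected_on + timedelta(days=RETENTION_DAYS[record.category])


def records_due_for_deletion(records: list[Record], today: date) -> list[Record]:
    """Return the records whose retention period has ended as of `today`."""
    return [r for r in records if expiry_date(r) <= today]
```

A scheduled job could call `records_due_for_deletion` with the current date and pass the result to whatever secure deletion process the organization uses, with the responsible manager reviewing the output.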

Keep data organized and under control

Large amounts of data that are not organized will quickly turn into an unwieldy mess. Although technology such as Apache Hadoop® allows for the analysis of structured and loosely structured data sets, the principle of 'bad data in, bad data out' still applies. Analytics and insights derived from bad data (such as unorganized, obsolete, or expired data) may cause more harm than good. To help keep things under control, the data governance function within an organization should be led by a data governance officer and a data governance committee, and the committee should include stakeholders from information technology (IT), legal/compliance, and business functions. An appropriately structured and staffed committee can help reduce data silos, implement data quality and cleansing processes, and enforce organizational data schemas. Appropriate oversight helps ensure that data remains useful for thoughtful analysis and decision-making.
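The 'bad data in, bad data out' point can be made concrete with a simple quality gate that runs before any analysis. The Python sketch below is one possible approach, not a prescribed method: the required field names and the `expires_on` field are assumptions standing in for whatever schema the governance committee defines.

```python
from datetime import date

# Hypothetical schema: fields every record must carry before it is analyzed.
REQUIRED_FIELDS = ("customer_id", "country", "expires_on")


def validate(record: dict, today: date) -> list[str]:
    """Return a list of problems found in the record (empty if it is clean)."""
    problems = []
    for field in REQUIRED_FIELDS:
        if record.get(field) in (None, ""):
            problems.append(f"missing {field}")
    expires_on = record.get("expires_on")
    if isinstance(expires_on, date) and expires_on < today:
        problems.append("expired")
    return problems


def split_clean_and_rejected(records: list[dict], today: date):
    """Separate records that pass validation from those that do not."""
    clean, rejected = [], []
    for record in records:
        problems = validate(record, today)
        if problems:
            rejected.append((record, problems))
        else:
            clean.append(record)
    return clean, rejected
```

Only the `clean` list would be passed on to analytics; the `rejected` list, with its reasons, gives the committee a view of where cleansing effort is needed.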