Lawyers seeking the latest in eDiscovery approaches to find relevant information faster and at lower cost are turning to machine learning tools. Pursuant to comment 8 to ABA Model Rule of Professional Conduct 1.1, they are required to “keep abreast of changes in the law and its practice, including the benefits and risks associated with relevant technology.” In other words, they need not become technical experts in machine learning to meet this duty, but they must understand the technology well enough to oversee its proper use.
While many lawyers and legal support professionals are familiar with, and use (or oversee the use of), machine learning technology to more efficiently cull through and sort data for litigation, investigations and regulatory compliance matters, there are new machine learning approaches worth taking stock of.
Machine Learning in eDiscovery
First, a brief history of machine learning in eDiscovery. Broadly defined, machine learning refers to computers that identify and act upon data patterns and, over time, learn to improve their accuracy. In eDiscovery, machine learning is behind critical analytics like predictive coding and clustering.
eDiscovery machine learning has been available for many years in market-leading review platforms. Learning technology falls into two types: supervised machine learning and unsupervised machine learning.
Supervised learning depends on a human-generated seed set that teaches the software how it should classify data. Predictive coding, also called technology-assisted review, is a prime example: the software uses the seed set to match data patterns and assign each document a relevancy percentage, and over time it learns from ongoing reviewer feedback. Newer forms of technology-assisted review rely on either a hybrid approach or Continuous Active Learning, which uses judgmental seeds to start and then keeps training, primarily on the highly relevant documents that reviewers code as the review proceeds.
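To make the seed-set idea concrete, here is a minimal sketch of predictive coding built on a small naive Bayes classifier. It is an illustration only, not how any particular review platform works: the function names, the sample documents and the choice of naive Bayes are all assumptions made for this example.

```python
import math
from collections import Counter

def tokenize(text):
    # Deliberately simple: lowercase and split on whitespace.
    return text.lower().split()

def train(seed_set):
    """seed_set: list of (text, is_relevant) pairs coded by a human reviewer.
    Assumes the seed set contains at least one relevant and one irrelevant document."""
    word_counts = {True: Counter(), False: Counter()}
    doc_counts = Counter()
    for text, label in seed_set:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = set(word_counts[True]) | set(word_counts[False])
    return word_counts, doc_counts, vocab

def relevance_score(model, text):
    """Return an estimated probability (0 to 1) that the document is relevant."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    log_prob = {}
    for label in (True, False):
        total_words = sum(word_counts[label].values())
        lp = math.log(doc_counts[label] / total_docs)
        for word in tokenize(text):
            if word in vocab:  # words never seen in training are ignored
                # Laplace smoothing keeps unseen word/label pairs from zeroing the score.
                lp += math.log((word_counts[label][word] + 1) / (total_words + len(vocab)))
        log_prob[label] = lp
    # Convert the two log scores into a normalized probability.
    top = max(log_prob.values())
    exp_true = math.exp(log_prob[True] - top)
    exp_false = math.exp(log_prob[False] - top)
    return exp_true / (exp_true + exp_false)

# Hypothetical seed set coded by a reviewer.
seed_set = [
    ("pricing agreement with competitor discussed", True),
    ("confidential pricing call with competitor", True),
    ("lunch menu for friday", False),
    ("office holiday party schedule", False),
]
model = train(seed_set)

unreviewed = ["notes from competitor pricing meeting", "friday party lunch order"]
ranked = sorted(unreviewed, key=lambda doc: relevance_score(model, doc), reverse=True)
```

Ongoing reviewer feedback can be modeled in this sketch by appending each newly coded document to `seed_set` and calling `train` again, so the ranking of the remaining documents is refreshed as the review proceeds.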
Unsupervised machine learning depends on recognizing patterns contained within the data itself, without human-coded examples, and comparing them to other data or search queries. These techniques become more useful over time as data sets grow and more patterns emerge. In eDiscovery, unsupervised machine learning includes clustering, concept search, and near-duplicate identification.
Clustering matches similar conceptual content between documents and groups documents accordingly. These groupings, or clusters, can be presented as simple text labels or by any number of visual methods. Concept search expands text-based queries by identifying data sets based on the ideas they contain and not necessarily specific text. Near-duplicate identification compares the text of all documents and groups them based on degrees of similarity.
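Near-duplicate identification is the easiest of these to sketch. The example below compares documents by Jaccard similarity over three-word shingles and groups those above a threshold. The shingle size, the 0.5 threshold, the greedy grouping strategy and the sample documents are all assumptions chosen for illustration. Commercial tools may use different similarity measures.

```python
def shingles(text, k=3):
    """Return the set of overlapping k-word sequences in the text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Share of shingles two documents have in common (0 = none, 1 = identical sets)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def group_near_duplicates(docs, threshold=0.5):
    """Greedy grouping: each document joins the first group whose
    representative (the group's first document) is similar enough."""
    groups = []  # list of (representative_shingles, [document ids])
    for doc_id, text in docs.items():
        doc_shingles = shingles(text)
        for representative, members in groups:
            if jaccard(doc_shingles, representative) >= threshold:
                members.append(doc_id)
                break
        else:
            groups.append((doc_shingles, [doc_id]))
    return [members for _, members in groups]

# Hypothetical documents: A and B differ by one word, C is unrelated.
docs = {
    "A": "please review the attached draft contract before friday",
    "B": "please review the attached draft contract before monday",
    "C": "the quarterly sales numbers are attached for your records",
}
groups = group_near_duplicates(docs)
```

Here documents A and B land in one group and C in its own, so a reviewer could code A and B together rather than reading both in full.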
New Machine Learning Approaches that Dramatically Lower the Costs and Risks of Document Review
Predictive coding, clustering, concept searching and other analytics are now standard in industry-leading review platforms. However, they only work on a case-by-case basis, limiting our ability to apply what we have learned on one case to future cases. New research and development has extended analytics to massive data sets located in different systems across multiple cases and matters. Exposing the analytics to much larger data sets, and therefore to many more learning opportunities, makes machine learning considerably more valuable.
The heart of this environment is a Massively Parallel Processing (MPP) architecture, which enables high performance machine learning across massive data sets and multiple platforms. In eDiscovery terms, this means that this type of analytics solution can store attorney decisions, work product, and critical document facts across all legal data for better insights into current and future matters—saving more cost and time than review platform machine learning techniques alone.
Big data analytics (which can also be used in combination with review platform analytics) has the power to turn the traditional analytics-based approach on its head. By consolidating all legal information from firms, vendors and databases into a single, secure repository, this approach segments the data and automatically classifies and identifies data that needs to be reviewed—assessing up to billions of classification codes to identify privileged documents and documents relevant to new cases.
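The idea of carrying attorney decisions from one matter to the next can be illustrated with a very small sketch. It is far simpler than the platforms described here, which predict relevance and privilege rather than look up exact matches, and everything in it is hypothetical: the `WorkProductStore` class, the `triage` function, the matter names and the sample documents are assumptions made for this example.

```python
import hashlib

def fingerprint(text):
    """Hash of the text after normalizing case and whitespace."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

class WorkProductStore:
    """Hypothetical repository of attorney coding decisions across matters."""

    def __init__(self):
        self._decisions = {}

    def record(self, text, matter, privileged, relevant):
        entry = {"matter": matter, "privileged": privileged, "relevant": relevant}
        self._decisions.setdefault(fingerprint(text), []).append(entry)

    def prior_decisions(self, text):
        return self._decisions.get(fingerprint(text), [])

def triage(store, new_docs):
    """Split a new matter's documents into those previously coded as
    privileged and those that need fresh review."""
    flagged, needs_review = [], []
    for doc_id, text in new_docs.items():
        history = store.prior_decisions(text)
        if any(decision["privileged"] for decision in history):
            flagged.append(doc_id)
        else:
            needs_review.append(doc_id)
    return flagged, needs_review

store = WorkProductStore()
store.record("Advice from counsel re: merger", "matter-1", privileged=True, relevant=True)

new_docs = {
    "d1": "advice from counsel   re: merger",  # same text, different spacing and case
    "d2": "weekly status update",
}
flagged, needs_review = triage(store, new_docs)
```

In this sketch only the privilege call is carried forward, on the assumption that relevance depends on the issues in each matter. A prior privilege call would still be a suggestion for an attorney to confirm, not a final determination.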
Using the best of human expertise plus machine learning, big data analytics platforms go further for lawyers than any other machine learning technique by predicting relevance and privilege and applying this learning to subsequent reviews. This translates into the following benefits:
- Substantial savings unlocked on new matters by reusing work product
- Platform-agnostic approach with proven workflows that supports legal teams’ use of multiple review tools
- Insights from one matter applied across multiple cases and future matters
- Better coding consistency and quality control
- Improved review processes through workflow automation
- Predictive insights into data within the organization that could become a future liability
Cases in Point
Here’s how big data analytics works in practice. One eDiscovery matter involved millions of potentially relevant documents. The review team’s traditional manual process netted nearly 1.5 million false positives, results that reviewers had to weed out of the review set by hand. In contrast, a big data analytics platform, used in conjunction with Conduent data scientists, eDiscovery process experts and subject matter experts, identified only 32,000 false positives in the same data.
Across the entire review process, the manual review resulted in a precision rate of 10.39%. The machine learning platform had a precision rate of 87.49%, an improvement of roughly 77 percentage points. And it took a fraction of the time to do it.
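For readers who want to check the arithmetic: precision is the share of documents flagged as relevant that really are relevant. The short sketch below defines it and compares the two reported rates. The gap between 10.39% and 87.49% is about 77 percentage points, which in relative terms is roughly an eightfold increase. The counts passed to `precision` are hypothetical, since the article reports rates and false positives but not true positives.

```python
def precision(true_positives, false_positives):
    """Fraction of flagged documents that are actually relevant."""
    flagged = true_positives + false_positives
    return true_positives / flagged if flagged else 0.0

# Hypothetical counts, only to show how the formula works.
example = precision(true_positives=90, false_positives=10)

# Precision rates reported in the article, in percent.
manual_rate = 10.39
platform_rate = 87.49

point_gain = platform_rate - manual_rate     # difference in percentage points
relative_gain = platform_rate / manual_rate  # how many times higher

print(f"Gain: {point_gain:.2f} percentage points ({relative_gain:.1f}x the manual rate)")
```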
It’s no wonder that lawyers are getting up to speed on the latest machine learning approaches.