It was not long ago that business — and with it, litigation involving business — was conducted far differently. Managers drafted memoranda, employees created reports and assistants typed letters. Photocopies were made, and files and archives organized all the paper. A litigator collecting documents would identify the right employees, find their relevant files and examine the archive index to locate historical documents. Perhaps after a peek into the email account, a review for privilege and the application of Bates numbers, the production was done.
Today’s employee — at nearly every level, and in nearly every field — generates little paper but mountains of data. Email chains lengthen, splinter and multiply. Texts and instant messages fly from workstations, laptops and mobile devices. Documents reside on file servers and collaborative workspaces, and iterations proliferate. Materials are exchanged within and between companies on encrypted websites. Operative business and contractual communication often occurs via electronic exchange, not signed paper. The speed of business, the shrinking administrative workforce and the availability of useful (albeit limited) search tools all mean that nobody pays attention to organizing this amorphous data for posterity. And then the lawsuit hits.
Litigators were among the last to realize that business, and business litigation, had changed. At first, we pretended that electronic data was another category of paper, and satisfied ourselves (and, we thought, our discovery obligations) by asking lay custodians to identify, search and print their relevant electronic files. Then, Zubulake was issued, the Federal Rules were amended and courts took the offensive on discovery of electronically stored information (ESI). Minimizing ESI search and production obligations now posed untenable risks to our cases and to our clients.
Faced with this seismic shift, we improvised. Full manual review was impossible in most large cases, so we negotiated search terms, hoping to stumble on balanced terms that identified relevant documents and excluded the irrelevant. But terms proved under-inclusive, over-inclusive or both, leaving reviewers to hunt for a few needles in very large haystacks. We hired armies of contract attorneys to sift through data to identify responsive — and "hot" — documents. But even the brightest contract attorney had limited expertise with the legal issues in a case and limited visibility into the factual issues. This produced idiosyncratic decisions and, often, inconsistency and error. Then, once the review was completed and the contract army disbanded, the institutional knowledge developed by those closest to the documentary record largely evaporated, just in time for depositions, motion practice and trial. For all the money spent and gigabytes exchanged, this process — which became the new standard for discovery compliance — advanced substantive case development too little, or not at all.
Emerging technology created this quandary. Now, emerging technology holds new promise to mitigate these challenges for litigators, and for business clients weary of paying too much for too little return. 2012 saw the first steps toward judicial validation of new "technology-assisted review" approaches that endeavor to re-balance the scale. 2013 has witnessed judicial reliance, endorsement and advocacy of machine learning tools that increase effectiveness and efficiency in locating responsive documents.
Technology-assisted review will not produce perfect discovery. In today’s data-intensive business world, no review approach will.1 Nor will these new tools completely replace previously employed review tools, at least not for a while. Indeed, for smaller data sets, the cost of using predictive coding assistance may well be outweighed by any practical benefit of its use.2
However, for the right case, these new technologies offer efficient and defensible processes for reviewing large quantities of ESI with more consistency and fewer errors than the alternatives. Not only do these tools hold the promise of fewer overall dollars spent, they promise that the dollars will be spent in a smarter way. These new tools place a premium on the training of a computer algorithm by lawyers with knowledge of the facts, the law and the key issues. The more those training the algorithm know about the case, the smarter the algorithm will be and the better it will work. This increases overall accuracy, expedites identification of critical documents and builds the litigation team's early hands-on facility with the documentary record. In short, these tools promise not only to satisfy a party’s discovery obligations, but to more completely and more expeditiously arm the party’s litigators for the litigation to come.
The use of technology to assist with manual review is not new. Keyword searching applies search terms (sometimes Boolean, proximity and/or wildcard terms) to limit the universe of data for review. "Deduplication" and grouping of "near duplicates" further reduce the quantity of documents to be reviewed and minimize the risk of inconsistent review. Threading email chains enables all iterations of the same chain to be coded together, enhancing efficiency and consistency. "Smart filters" permit reviewers to restrict documents by email or domain address, for example, and eliminate many of the "junk" emails captured by keywords. Each of these tools is employed to winnow the review set and to streamline what remains, in essence, a manual review process.
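To make one of these winnowing steps concrete, the following is a minimal Python sketch of exact deduplication by content hashing. The document structure, normalization rules and field names here are illustrative assumptions for the example only, not the schema or method of any particular review platform.

```python
import hashlib

def dedupe(documents):
    """Collapse exact duplicates by hashing normalized text.

    `documents` is a list of dicts with 'id' and 'text' keys
    (an illustrative structure, not any vendor's schema).
    Returns one representative per unique text, plus a map of
    duplicate ids so coding decisions can propagate to the copies.
    """
    seen = {}        # digest -> representative document
    duplicates = {}  # representative id -> list of duplicate ids
    for doc in documents:
        # Normalize whitespace and case so trivial variations collapse.
        normalized = " ".join(doc["text"].lower().split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest in seen:
            duplicates.setdefault(seen[digest]["id"], []).append(doc["id"])
        else:
            seen[digest] = doc
    return list(seen.values()), duplicates

docs = [
    {"id": 1, "text": "Quarterly results attached."},
    {"id": 2, "text": "quarterly   results attached."},  # trivial variant
    {"id": 3, "text": "Please review the draft contract."},
]
unique, dupes = dedupe(docs)
print(len(unique))  # 2 unique documents
print(dupes)        # {1: [2]}
```

Because the duplicate map is retained, an attorney's coding decision on the representative document can be applied automatically to its copies, which is how deduplication reduces both review volume and the risk of inconsistent calls on identical documents.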
Now, new software technology platforms offer iterative learning methods intended to reduce — but not replace — human review. These tools are referred to variously as "technology-assisted coding" and/or "predictive coding." Though the algorithms and predictive analytics offered by particular software differ, these tools all rely on attorneys to train an algorithm that is applied to identify responsive, privileged and "hot" documents based upon similarity to the human-coded set. As documents are retrieved, lawyers — preferably those with significant knowledge of the facts, the law and the key issues in the case — instruct the algorithm whether its "predicted" coding is accurate. This process repeats itself over multiple iterations. As the database "trains" itself with more human reviewer feedback, the algorithm's predictive accuracy improves.
In practice, one or more members of the case team review an initial set of documents randomly selected from the full review population. These attorneys code each as either responsive or non-responsive, and with coding indicating substantive issues, privilege concerns and/or criticality. Using built-in analytical tools — usually, a sophisticated mix of keywords, Boolean connectors, concept searches and categorical groupings — the database identifies underlying elements and properties of the coded documents, and uses those elements and properties to make coding predictions for the un-reviewed universe of documents. A new set of documents is fed to the attorney review team, along with the algorithm’s "predicted" coding for each. Attorneys correct the proposed coding if necessary and "re-train" the algorithm; then the process repeats. When the database’s predictions and the attorneys' coding coincide to a determined level of confidence, the system has learned enough to make confident predictions for the remaining documents. Although the figures vary by case and by tool, attorneys may need to review only a small fraction of the overall data set to reach this point. The process concludes with quality control rounds, where random samples are selected and computer-predicted results are tested against human coding. If the coding corresponds to a determined degree of confidence, the process is complete. If not, the algorithm returns for further rounds of training until it meets the quality control metric.
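The iterative workflow just described can be sketched in miniature. The self-contained Python snippet below is an illustration only, not any vendor's implementation: the toy corpus, the "attorney" oracle function and the crude word-overlap "model" are all assumptions invented for the example, standing in for real documents, human reviewers and commercial predictive analytics. What the sketch does preserve is the structure of the process: a coded seed set, training, batched prediction, attorney correction, a stopping rule based on prediction/coding agreement, and a final quality control sample.

```python
import random

# Hypothetical vocabulary: responsive documents mention the (assumed)
# transaction at issue; routine documents do not.
RESPONSIVE_WORDS = ["merger", "valuation", "diligence", "termsheet"]
ROUTINE_WORDS = ["lunch", "schedule", "payroll"]

def make_corpus(n=200, seed=7):
    rng = random.Random(seed)
    corpus = []
    for i in range(n):
        if rng.random() < 0.3:  # ~30% prevalence of responsive material
            text = " ".join(rng.sample(RESPONSIVE_WORDS, 2))
            corpus.append({"id": i, "text": text, "label": True})
        else:
            corpus.append({"id": i, "text": " ".join(ROUTINE_WORDS),
                           "label": False})
    return corpus

def attorney_codes(doc):
    # Stand-in for human review; in practice a lawyer reads the document.
    return doc["label"]

def train(labeled):
    # Naive "model": the vocabulary of words seen in responsive documents.
    vocab = set()
    for doc, is_responsive in labeled:
        if is_responsive:
            vocab |= set(doc["text"].split())
    return vocab

def predict(vocab, doc):
    # Predict responsive if the document shares any trained vocabulary.
    return bool(vocab & set(doc["text"].split()))

def iterative_review(corpus, batch=20, target=0.95, seed=11):
    rng = random.Random(seed)
    pool = corpus[:]
    rng.shuffle(pool)
    # Round 0: attorneys code a randomly selected seed set.
    labeled = [(d, attorney_codes(d)) for d in pool[:batch]]
    pool, rounds = pool[batch:], 0
    while pool:
        vocab = train(labeled)
        batch_docs, pool = pool[:batch], pool[batch:]
        agree = 0
        for doc in batch_docs:
            predicted = predict(vocab, doc)
            actual = attorney_codes(doc)  # attorney corrects the prediction
            labeled.append((doc, actual))
            agree += int(predicted == actual)
        rounds += 1
        if agree / len(batch_docs) >= target:  # predictions have stabilized
            break
    vocab = train(labeled)
    # Quality control round: test predictions against human coding
    # on a random sample of unreviewed documents.
    sample = rng.sample(pool, min(20, len(pool))) if pool else []
    hits = sum(predict(vocab, d) == attorney_codes(d) for d in sample)
    qc_rate = hits / len(sample) if sample else 1.0
    return rounds, len(labeled), qc_rate

corpus = make_corpus()
rounds, n_reviewed, qc_rate = iterative_review(corpus)
print(rounds, n_reviewed, round(qc_rate, 2))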
It is important to remember that tool selection is not one-size-fits-all. Each software platform employs a slightly different approach, and each provides a distinctive workflow. Accordingly, consideration of predictive coding technologies should be a collaborative process between a client, outside counsel, litigation support staff and prospective vendors. The selection should reflect consideration of the volume of ESI, the scope of review, applicable timing and cost. Also, certain cases may benefit from hybrid approaches: for example, predictive coding may be applied to thorny data sets (e.g., large email archives) while remaining documents are reviewed manually, or keyword and deduplication processes may initially reduce the volume of a data set, followed by predictive coding to locate responsive documents within the remaining corpus.3
Benefits of Predictive Coding
We enumerate here several of the chief advantages of predictive coding technologies that were highlighted in the discussion above.
1. Greater Accuracy and Consistency
Magistrate Judge Peck of the U.S. District Court for the Southern District of New York — the first judge to publicly endorse the use of predictive coding — explained: "while some lawyers still consider manual review to be the 'gold standard,' that is a myth, as statistics clearly show that computerized searches are at least as accurate, if not more so, than manual review."4
While some may resist the idea that machines can perform portions of our jobs better than we do, human error is endemic in manual review. As litigators know and as the judiciary has begun to recognize, "such review is prone to human error and marred with inconsistencies from the various attorneys' determination of whether a document is responsive."5 Large-scale manual review almost invariably requires the use of contract attorneys. These attorneys' knowledge of a client's industry and its business is likely negligible, and their experience litigating cases involving similar issues is limited. Their knowledge of the facts of the case is contained entirely within the background information provided in the course of their review. The amount of data to be reviewed is enormous, and court-imposed deadlines are often tight, requiring long hours and sprawling teams. Individual reviewers differ in their ability to maintain alertness, spot responsive documents, assess a document's potential criticality to the case, navigate privilege issues, and make correct and consistent decisions about marginal documents.6 Notwithstanding quality control procedures, inconsistencies, inaccuracies and errors inevitably remain.
Keyword searches are not the answer, for several reasons. First, these searches are formulated at the earliest stages of case development, often before extensive interviews of key players, and generally before the issues and applicable lexicon are understood fully. Second, such searches fail to include variations — for example, slang, misspellings and acronyms — and thus could exclude key data. Third, keyword searches often capture a large amount of irrelevant data through "false hits," requiring the same extensive manual review processes described above. In short, as Judge Shira Scheindlin of the U.S. District Court for the Southern District of New York put it, "[s]imple keyword searching is often not enough . . . there is increasing strong evidence that keyword searching is not nearly as effective at identifying relevant information as many lawyers would like to believe."7
Utilizing manual review or keyword searches — or, most commonly, a hybrid of the two — often results in coding and production inconsistencies. Inconsistencies become fertile ground for exploitation by adversaries, often resulting in costly fights where nothing is gained and credibility may be lost. Similarly, these methods pose the risk that relevant, even critical, documents may remain undiscovered; the producing party may be without the benefit of documents that would make its case, or may be subject to sanctions for non-production of documents that would make its adversary's case.
Technology-assisted coding avoids many of these pitfalls by permitting a single attorney or small team to review and categorize large quantities of ESI with less effort and demonstrably greater consistency and accuracy. The software mechanisms for training the algorithm rely on a flexible and proven set of methods for identifying documents similar to those deemed relevant, or "hot," such that the risk of overlooking critical documents is reduced. And once trained to a defensible degree of confidence, the algorithm is applied consistently across the universe of collected data, eliminating inconsistencies by ensuring that all data is reviewed pursuant to a single set of parameters.
2. Prioritizing Review
Moreover, technology-assisted review allows for efficient workflows. Documents may be batched for review based upon the algorithm’s prediction of the likelihood they are responsive and/or "hot." In a case where depositions will immediately follow large productions, those preparing for depositions will have the benefit of the most critical documents much earlier in the process. Also, prioritizing review allows a party to expedite the algorithm training process, such as by assigning documents with the highest predicted relevance to attorneys with the most underlying knowledge of the case.
3. Enhanced Institutional Knowledge
Leaving aside error rates, manual contract attorney review also suffers from a lack of institutional knowledge and memory: once the project is complete, the contract team disbands, and their coding notations are the only record of their process. Although this constitutes compliance with discovery obligations, it often leaves counsel ill-equipped to use the documentary record effectively in depositions, motion practice and at trial. Counsel who will be handling the litigation going forward are often several steps removed from the teams wading through data. Although senior litigation counsel may supervise the overall review process and may review certain key documents as they are identified, it is often not until deposition preparation that senior litigators in document-intensive cases get their arms around the documentary record. By that time, the senior litigator has a document set filtered through multiple sets of junior attorneys: the contract attorneys who identify responsive and "hot" documents, the associates who quality-control the production and assemble a definitive set of critical documents and the associate who pulls together critical documents relevant to depositions.
Predictive coding turns this dynamic on its head. By having a senior attorney (or a team thereof) assume active involvement in the initial stages of teaching the database what is responsive and what is important, these senior attorneys acquire a greater understanding of the documentary record as it is being developed. And, by involving senior attorneys in the initial stages of review, those attorneys' knowledge of the industry, the client, the facts and the law is incorporated into their coding, which in turn produces a more robust algorithm. Junior attorneys may be called upon to complete the process, but with the algorithm doing much of the heavy lifting that contract attorneys once performed, review teams are leaner and composed of associates who will continue to work on the case going forward.
Similarly, once the investment is made to train the algorithm, the algorithm stays trained. If a new set of client data is later collected, the trained algorithm can be applied to that new data set without the need to mount a brand-new review process. And when the opponent's production arrives, the algorithm — understanding by that point a great deal about what counsel considers most important — will quickly identify potentially critical documents for review.
All of this means that effective use of these tools allows a party’s litigation counsel — those responsible for setting litigation strategy, counseling clients and handling subsequent stages of litigation — to get much smarter, much faster.
Conducting a Defensible Technology-Assisted Review
Though technology-assisted coding is relatively new, having received its first judicial approval only last year, courts are trending in its favor. Indeed, in 2013, several federal courts across the United States have encouraged parties to use predictive coding and acknowledged the advantages of this technology.8
In light of the many advantages detailed above, now is the time to consider its use in appropriate cases. Below is a non-exhaustive list of guidelines, gleaned from recent experience and from the few court opinions issued to date, for developing and implementing an efficient, robust and defensible predictive coding process.
First, the Sedona Conference Cooperation Proclamation states that "the best solution in the entire area of electronic discovery is cooperation among counsel."9 Predictive coding is no exception to this rule. Absent good reason not to, counsel should advise opposing counsel that it intends to use technology-assisted coding and attempt to secure opposing counsel's agreement. Counsel should also confer with opposing counsel on a review protocol. A non-exhaustive list of issues for discussion includes: (1) use of keywords in the collection of documents, (2) number of custodians, (3) size of the seed set, (4) use of concept groups, (5) number of iterative rounds to stabilize algorithm training, and (6) targeted confidence level. Open discussion with opposing counsel on these issues can ensure defensibility by securing agreement, or can narrow disputed issues for judicial resolution. At a minimum, up-front discussion prevents later claims of sandbagging.
Second, in business litigations where discovery burdens are roughly balanced, the invitation to use predictive coding will often be welcome and may result in a bilateral agreement.10 In the event an adversary also elects to use technology-assisted coding, counsel should consider whether to engage a single vendor and split database costs. Of course, if this approach is proposed, the parties need a protocol for protecting the confidentiality of unproduced and privileged documents, and of party-specific issue coding.
Third, counsel should maintain transparency throughout the review process. This may mean providing opposing counsel with: (1) a list of custodians, (2) the keywords applied, (3) the documents reviewed as part of the initial seed or control set and whether they were ultimately coded as responsive or non-responsive, (4) issue codes or concept groups, and/or (5) proof of a valid quality control process, including the confidence level determined to conclude the review.11 Timely disclosure of these matters could strengthen the protocol's defensibility in the event of later challenge. Such disclosure could require the adversary to articulate its objection(s) on the matters being disclosed or risk waiver.
Fourth, regardless of disclosure, all aspects of the process should be carefully documented. Even where there is agreement among counsel as to the use of predictive coding, the variations among tools and the presence of individualized determinations counsel in favor of meticulous documentation. The need to document becomes stronger still where the decision to use predictive coding is made unilaterally or over the adversary’s objection. Further, the responsibility for documentation should not be ceded to the vendor; counsel and/or its litigation support staff should document the process that they may be called upon to defend.
Fifth, to further control costs and fees, a party may consider staging review. For example, this could involve collecting and reviewing documents solely from sources or custodians most likely to have relevant data, without prejudice to the requesting party seeking additional documents after the conclusion of that first-stage review. Or, where the client has identified an initial set of key documents, these documents can be seeded into an initial training set as responsive and "hot," potentially shaving rounds off the training process.
A Highly Promising Tool
In sum, predictive coding technologies hold great promise for extricating businesses, and their litigators, from the burdens posed by data proliferation. Structured and executed properly, a review protocol using these tools can be an efficient, cost-effective and defensible means of complying with discovery obligations. But just as importantly, these new technologies advance case development by enhancing review accuracy and consistency, increasing the likelihood that key documents are captured and allowing a party and its senior litigation team real-time visibility into the documentary record as it develops.