Disclosure of documents is a significant driver of costs. Where the relevant documents are electronic, the problem is usually exacerbated. This is simply because the vast majority of documents are now created electronically and the proliferation and storage capacity of day-to-day IT equipment is such that the amount of information available may be enormous.

Consequently, increasingly sophisticated electronic processes have been developed to reduce the burden of disclosure, identify as many relevant documents as possible and reduce the volume of documents to be reviewed manually.

Predictive coding is one technique that is increasingly used in several jurisdictions, particularly the United States, and has been deployed in larger commercial litigation and arbitration in England & Wales. Until recently, however, the English courts had not expressly approved or offered guidance on its use. That changed with two High Court decisions: Pyrrho Investments v MWB Property and BCA Trading Ltd v Feltham.

In both decisions, the court confirmed that predictive coding:

  • Could amount to a reasonable search.
  • Was permitted under CPR 31 and PD31B.

Both decisions placed heavy emphasis on the perceived cost savings. Further, both confirmed the current view of the e-disclosure marketplace that predictive coding can be as accurate as, if not more accurate than, traditional linear review linked to keyword searching.

Following these decisions, there is little doubt that use of predictive coding will increase; indeed, in most high-value cases involving significant disclosure, predictive coding is likely to become the norm sooner rather than later. Litigators and other stakeholders in litigation (such as insurers and funders) will need to familiarise themselves with predictive coding as a process and put in place appropriate procedures for its use.

Predictive coding is a process, not just a “black box”

Predictive coding involves a human reviewer “training” a computer system to identify and classify relevant documents within large volumes of data. To train the system, one or more members of the legal team who are familiar with the case issues, usually more senior fee-earners (“subject matter experts”), will review one or more initial batches of documents (“seed sets”) and code each document for relevance, privilege and specific issues. The system then uses complex algorithms to extrapolate the human reviewer’s decisions and apply them to the remainder of the document set.
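The idea of extrapolating from a coded seed set can be illustrated with a deliberately simple example. The following is a toy sketch only, assuming a basic word-frequency (naive Bayes) classifier; commercial predictive coding platforms use their own, far more sophisticated algorithms, and nothing here reflects any particular provider's method. The function names and the sample documents are hypothetical.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Crude tokenizer: lower-case and split on whitespace.
    return text.lower().split()


def train(seed_set: list[tuple[str, bool]]) -> dict:
    """Learn word counts from documents coded by the subject matter expert."""
    counts = {True: Counter(), False: Counter()}
    docs = {True: 0, False: 0}
    for text, relevant in seed_set:
        counts[relevant].update(tokenize(text))
        docs[relevant] += 1
    vocab = set(counts[True]) | set(counts[False])
    return {"counts": counts, "docs": docs, "vocab": vocab}


def predict_relevant(model: dict, text: str) -> bool:
    """Apply the reviewer's coding decisions to a document not yet reviewed."""
    total_docs = sum(model["docs"].values())
    scores = {}
    for label in (True, False):
        # Log prior for the class, with add-one smoothing.
        score = math.log((model["docs"][label] + 1) / (total_docs + 2))
        total_words = sum(model["counts"][label].values())
        denominator = total_words + len(model["vocab"]) + 1
        for word in tokenize(text):
            # Smoothed log likelihood of each word given the class.
            score += math.log((model["counts"][label][word] + 1) / denominator)
        scores[label] = score
    return scores[True] > scores[False]


# Hypothetical seed set: (document text, coded relevant?)
seed_set = [
    ("lease termination notice for the property", True),
    ("payment under the lease agreement", True),
    ("office party on friday", False),
    ("lunch menu for the canteen", False),
]
model = train(seed_set)
print(predict_relevant(model, "notice about the lease payment"))
print(predict_relevant(model, "friday lunch at the office"))
```

Even in this toy form, the dependence on the seed set is plain: the system has nothing to go on except the reviewer's coding decisions.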

It goes without saying that if the seed set is coded incorrectly then the predictive coding system will multiply those errors across the entire document set; therefore, the training of the predictive coding system should be carried out by an experienced lawyer with a full command of the case.

Courts are unlikely to be persuaded as to the reliability or defensibility of a predictive coding process simply by being given a detailed explanation of how the specific algorithm operates. The central issue is therefore to ensure that the underlying process is well considered and the output can be verified in terms of statistical accuracy.

To achieve this, it is necessary to carry out quality control exercises. These generally require the review of samples of the documents identified by the computer. The sample sizes will be determined by reference to the volume of documents within the whole set, and considerations of the acceptable level of confidence and margin of error to be attained. Generally, the higher the required level of confidence and the lower the acceptable margin of error, the larger the sample must be and the more costly the review exercise.
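The relationship between confidence level, margin of error and sample size can be illustrated with a short calculation. The following is a minimal sketch, assuming simple random sampling and the standard normal-approximation formula for estimating a proportion, with a finite population correction; neither judgment prescribes a formula, and e-disclosure platforms may use different methods. The function name and the confidence and margin figures are illustrative assumptions; the population of 3.1 million is the post-de-duplication figure reported in Pyrrho.

```python
import math
from statistics import NormalDist


def sample_size(population: int, confidence: float, margin: float, p: float = 0.5) -> int:
    """Documents to sample to estimate a proportion within +/- margin.

    p = 0.5 is the most conservative assumption about the true proportion
    (it yields the largest sample).
    """
    # z-score for a two-sided interval at the chosen confidence level
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    # Sample size for an effectively unlimited population
    n0 = (z ** 2) * p * (1 - p) / margin ** 2
    # Finite population correction for the actual document set
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)


# Illustrative figures only: 95% / 99% confidence and 2% / 1% margins are assumptions.
population = 3_100_000
for confidence, margin in [(0.95, 0.02), (0.99, 0.02), (0.95, 0.01)]:
    print(confidence, margin, sample_size(population, confidence, margin))
```

On these assumptions, tightening either parameter increases the sample; halving the margin of error roughly quadruples it, which is where the additional review cost arises.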

The samples are manually reviewed and coded by the subject matter expert on a “blind” basis; that is, the human reviewer does not know how the system has classified the documents. The system then compares its decisions with those of the human reviewer. It logs any instances where the human reviewer has “overturned” decisions made by the software. The results are fed back into the software, analysed and used to further calibrate the system’s decisions on the document set as a whole. This sampling process is iterative and the system continues to learn and refine its analysis with each repetition. In Pyrrho, Master Matthews noted that this process might usually need to go through several iterations to attain the required level of confidence and the permitted margin of error.
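The comparison between the system's decisions and the reviewer's blind coding can be sketched as follows. This is an illustration under stated assumptions rather than a description of any provider's software: the names `SampleResult` and `compare_decisions` are hypothetical, and precision and recall are common accuracy measures that a protocol might adopt alongside a simple overturn rate.

```python
from dataclasses import dataclass


@dataclass
class SampleResult:
    overturn_rate: float  # share of sampled documents where the reviewer disagreed
    precision: float      # of documents the system marked relevant, share truly relevant
    recall: float         # of truly relevant documents, share the system found


def compare_decisions(system: list[bool], reviewer: list[bool]) -> SampleResult:
    """Compare system coding with blind human coding on the same sample."""
    if len(system) != len(reviewer) or not system:
        raise ValueError("both lists must be non-empty and of equal length")
    pairs = list(zip(system, reviewer))
    true_pos = sum(s and r for s, r in pairs)
    false_pos = sum(s and not r for s, r in pairs)   # system said relevant, reviewer disagreed
    false_neg = sum(not s and r for s, r in pairs)   # system missed a relevant document
    overturned = false_pos + false_neg
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return SampleResult(overturned / len(pairs), precision, recall)


# Hypothetical sample of eight documents (True = coded relevant).
system_coding = [True, True, False, False, True, False, True, False]
reviewer_coding = [True, False, False, True, True, False, True, False]
result = compare_decisions(system_coding, reviewer_coding)
print(result)
```

Logging a result of this kind at each iteration is one way of building the record of outputs that a defensible process is likely to require.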

Litigators will need to engage with e-disclosure providers early in order to develop appropriate predictive coding processes. Senior lawyers are likely to be deployed in document review at the outset. Furthermore, to maintain a defensible process, it will generally be necessary to maintain a clear audit trail of the steps followed, the processes applied to the document set and the outputs from the system at each iteration. Evidence will be needed to demonstrate that the end result of the predictive coding process falls within the acceptable margin of error.

Predictive coding is unlikely to replace human review completely

One central misunderstanding is the notion that the output from the predictive coding exercise is simply disclosed to the opposing party. The experience in the United States suggests that this very rarely happens without further human review. For example:

  • Predictive coding can be effective for identifying the responsive documents within a dataset; however, legal teams still need to review the responsive documents.
  • Most litigators would prefer to combine manual and automatic review as a means of quality control.
  • Parties often wish to carry out further review in order to guard against the possibility of inadvertent production of privileged information.
  • Certain documents such as image or audio files, photographs or design drawings do not lend themselves to predictive coding and usually require a separate manual review exercise.
  • When large quantities of hard-copy documents are scanned and subjected to optical character recognition (OCR), the results will inevitably be less than 100% accurate; some document types (such as hand-written notes) are not suitable for OCR at all.

Accordingly, lawyers need to consider early on the types of electronic documents that are likely to exist and their suitability to be subjected to electronic techniques. This is likely to lead to a rise in “early data assessment” where an e-disclosure provider maps and assesses the nature and types of information stored. An appropriate balance must be struck in each case between manual and automated review processes.

Predictive coding in case management

CPR PD31B requires the parties to discuss appropriate methods of dealing with e-disclosure prior to the first case management conference (CMC). If predictive coding is agreed between the parties, it is usually necessary to record the agreed processes in a predictive coding protocol, as was the case in Pyrrho. The parties’ agreement was a significant factor in favour of approving predictive coding.

If the parties do not agree on the use of predictive coding, they will usually need to set out their positions in evidence. That evidence generally needs to be supported by e-disclosure experts.

Both Pyrrho and BCA Trading placed heavy emphasis on the need to save costs; it was the principal determinative factor in BCA Trading. A party seeking to justify the use of predictive coding usually needs evidence to demonstrate potential costs savings. This generally involves an assessment of the party’s electronic data by an e-disclosure consultant together with projections of the likely costs of different search and review methods. This should be carried out at an early stage as it will inform the required disclosure order and any costs budgets.

The party seeking predictive coding should also set out a detailed protocol covering the proposed methods, the quality assurance measures and the basis upon which the margin of error has been determined. A transparent and defensible protocol was the key to the approval of predictive coding in the Irish case Irish Bank Resolution Corporation Ltd v Quinn, cited with approval in Pyrrho.

Further, some qualities of a case will lend themselves to predictive coding, such as a high volume of electronic documents or a large number of custodians. In Pyrrho, more than 17.6 million electronic files were restored from the second claimant’s back-up tapes.

Although the number was reduced to around 3.1 million by de-duplication and electronic processing, there remained a large number to review. Reviewing them manually would have been exceedingly costly. Usually, because the “decisions” of a human reviewer are extrapolated to a whole data set, the cost effectiveness of predictive coding tends to increase together with the size of the dataset.
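The way in which cost effectiveness scales with the size of the dataset can be shown with a simple break-even calculation. This is a rough sketch only, assuming a fixed set-up cost for predictive coding (senior fee-earner training time, processing and hosting) and a manual review limited to a fraction of the documents; the cost model, the function names and every figure other than the 3.1 million document count are hypothetical and are not drawn from either judgment.

```python
def linear_review_cost(documents: int, cost_per_document: float) -> float:
    # Every document is reviewed manually.
    return documents * cost_per_document


def predictive_coding_cost(documents: int, cost_per_document: float,
                           setup_cost: float, reviewed_fraction: float) -> float:
    # Fixed front-loaded cost plus manual review of only a fraction of the set.
    return setup_cost + documents * reviewed_fraction * cost_per_document


def break_even_documents(cost_per_document: float, setup_cost: float,
                         reviewed_fraction: float) -> float:
    # Dataset size at which the two approaches cost the same.
    return setup_cost / (cost_per_document * (1 - reviewed_fraction))


# Hypothetical inputs: 1.00 per document, 50,000 set-up, 20% still reviewed manually.
COST_PER_DOC, SETUP, FRACTION = 1.0, 50_000.0, 0.2
for documents in (10_000, 100_000, 3_100_000):
    print(documents,
          linear_review_cost(documents, COST_PER_DOC),
          predictive_coding_cost(documents, COST_PER_DOC, SETUP, FRACTION))
print(break_even_documents(COST_PER_DOC, SETUP, FRACTION))
```

On these assumed figures, predictive coding is the more expensive option for a small dataset and becomes progressively cheaper than linear review once the break-even point is passed.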

A party opposing predictive coding may face an uphill struggle in light of Pyrrho and BCA Trading. However, a number of factors might militate against its use:

  • In lower value cases, the front-loading of cost and the costs of processing and hosting electronic documents may not be proportionate. In 2012, FTI Consulting suggested that US litigators felt that cases valued at under US$200,000 were unlikely to be large enough to warrant predictive coding.
  • Although often cost-effective and proportionate in “big document” cases, the process requires some front-loading of cost, specifically, engaging a more senior fee-earner to review the seed set and the iterative review of the sample sets; therefore, in some cases the cost of carrying out a traditional exercise remains lower. The same research by FTI suggested that cases involving fewer than 100,000 documents may not be appropriate for predictive coding.
  • Cost is not the only factor. In many cases, the bulk of the relevant documents may not be conducive to electronic review.

This article was first published in Practical Law's Dispute Resolution Blog.