The claim (worth in excess of £50 million) was struck out following the decision of HHJ Pelling QC to reject the claimants' application for relief from sanction. This decision ends a disclosure saga that spanned three years and numerous hearings, including in the Court of Appeal.
The failings identified in the first instance judgment of Mr Justice Birss, considered in the Court of Appeal's judgment, and finally revisited in this judgment (all required reading for litigators) covered many of the issues that can arise in a disclosure exercise. The judgments highlight several failings on the claimants' part. One "serious" and "significant" failing ultimately resulted in the breach of an "unless" order and the striking out of the claim. It is on this failing that this brief article concentrates.
The claimants' first list of documents, served in July 2012, revealed a failure to undertake any search for disclosable documents within the voluminous hardcopy material they held. The list merely recorded every conceivable document by reference only to the boxes in which the documents were stored. Attempting to remedy this, the claimants then instituted a process of sifting through the hardcopy files, removing clearly irrelevant material, and scanning and uploading the remainder onto an electronic database. To assist with this process they engaged a third party provider, Unified.
The scanned documents were subjected to an Optical Character Recognition ("OCR") process by Unified in order to capture the text content of the scanned images. In theory, this makes them fully searchable using keyword searches. In practice, however, the OCR process is not infallible. It works best on well preserved documents with clearly printed, legible black text on a white background, and can struggle to "read" anything that falls short of that standard. Handwriting (even when in clear block capitals), highlighting and other marks, as well as non-English printed characters, also cause difficulties.
The court in this case was shown numerous examples of documents that the OCR process had failed to "read" accurately. The logical conclusion was that an unknown but significant number of potentially disclosable documents, which on their face contained relevant keywords, had been excluded from the search because of the failures of the OCR process. This amounted to a failure by the claimants to conduct a reasonable search, and the claim was struck out for breach of an "unless" order, a sanction from which the court then refused to grant relief.
How could these failures have been avoided?
It is clear from the judgments that while the claimants and their lawyers had attempted to mitigate the failings of the OCR process, their descriptions of these efforts were opaque. Consequently, the defendants and ultimately the court could have little confidence in the adequacy of those efforts. It is possible that early, open communication with the defendants as to the issues being faced would have set the claimants on a different path, avoiding the draconian consequences that befell the claim.
In terms of the actual process, the OCR difficulties could have been anticipated and mitigated in a number of ways, including:
- Undertaking a basic relevance assessment of documents or categories of documents at the point of selecting what was to be scanned by Unified – obviously highly relevant documents could have been marked for immediate review regardless of whether they responded to search terms.
- Documents which were likely to cause difficulties for the OCR software could have been identified (by non-lawyers) at Unified and set aside for a manual review.
- Further quality control (QC) processes to assess the accuracy of the OCR process could have been undertaken by Unified.
- The claimants' lawyers could have undertaken QC checks themselves. For example, a simple search for documents not containing one or more of the most common words in the English language (e.g. "the", "and", "to" etc) or common words found in the documents (e.g. the parties' names) should identify documents which the OCR process had completely failed to "read".
- The search terms could have been simplified – searching for complete phrases or long words carries a risk that hardcopy documents that would be responsive are wrongly excluded due to just a single character being "misread" in the OCR process. "Fuzzy" searching can also mitigate this issue and was in fact undertaken to some degree by the claimants.
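The last two points lend themselves to a short illustration. What follows is a minimal Python sketch, not a description of what the claimants or Unified actually did, and not a substitute for the search tools in a review platform. The function names, the list of common words and the similarity threshold of 0.8 are all assumptions chosen for illustration. The first function flags OCR output containing none of a handful of very common English words; the second finds words that are close to, but not exactly, a search term.

```python
import re
from difflib import SequenceMatcher

# Illustrative list only; a real exercise might add the parties' names.
COMMON_WORDS = {"the", "and", "to", "of", "a", "in"}


def tokenize(text):
    """Split text into lower-case runs of letters and digits."""
    return re.findall(r"[^\W_]+", text.lower())


def looks_unread(ocr_text, min_common_hits=1):
    """Return True if the OCR text contains fewer than `min_common_hits`
    distinct common words, suggesting the OCR process failed on it."""
    distinct_tokens = set(tokenize(ocr_text))
    return len(distinct_tokens & COMMON_WORDS) < min_common_hits


def fuzzy_hits(ocr_text, term, threshold=0.8):
    """Return the words in the OCR text whose similarity to `term`
    (difflib's ratio, between 0 and 1) is at least `threshold`."""
    term = term.lower()
    return [
        token
        for token in tokenize(ocr_text)
        if SequenceMatcher(None, term, token).ratio() >= threshold
    ]


if __name__ == "__main__":
    garbled = "Tbe c0ntract w4s s1gned 8y b0th part1es"
    print(looks_unread(garbled))  # candidate for manual review if True

    # "m" misread as "rn": an exact search for "indemnity" would miss this.
    print(fuzzy_hits("the indernnity clause", "indemnity"))
```

Any document flagged by the first check would be a candidate for manual review rather than keyword filtering. A lower threshold in the second check catches more misreadings but also returns more false positives, so the setting would itself need to be tested on a sample and, ideally, explained to the other side.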
Is OCR still used? Should it be?
Modern business practice may mean that the number of documents currently created as originals in hardcopy without an electronic counterpart is limited. One might therefore assume that this is not an issue that will commonly affect future cases. However, particularly for regulated industries (where hardcopy documents may be required to be kept for long periods), it is an issue which should be taken into account by in-house lawyers and information management professionals in those sectors. Many companies still receive handwritten forms, "know your client" hardcopy data and other documents in scanned hardcopy or hardcopy-only format. Others are migrating their archived hardcopy material to electronic format by scanning it. In addition to accurate indexing of such archives, quality control of the scanning and OCR process and a clear understanding of its limitations are essential to avoid mistakes which may cause problems later.
A further limitation, not relevant in this case but increasingly encountered in cases with a foreign element, is that the OCR process may not accurately "read" foreign languages. This is not limited to non-Latin alphabets: European languages with special characters (e.g. accents, umlauts and cedillas) are often "misread". If they may be relevant, hardcopy documents containing foreign language text need to be considered carefully. If they form a readily identifiable set of documents, it may be easiest to treat them separately, perhaps by having them reviewed in full by reviewers with the relevant language skills. If search terms are to be used to filter the documents, the terms should be chosen carefully, with alternative spellings used in anticipation of likely inaccuracies in the OCR output.
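By way of illustration, alternative spellings of a search term containing special characters can be generated mechanically. The Python sketch below is an assumption-laden example only: the function names and the small transliteration table (covering German umlauts and "ß") are hypothetical. It covers two predictable variants, namely the term with its accents dropped and the term with conventional transliterations applied. It does not attempt to predict genuine OCR misreadings, which depend on the software and the quality of the scans.

```python
import unicodedata

# Hypothetical, deliberately small table of conventional transliterations.
TRANSLITERATIONS = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"}


def strip_diacritics(term):
    """Remove accents, umlauts and cedillas, leaving the base letters."""
    decomposed = unicodedata.normalize("NFD", term)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def spelling_variants(term):
    """Return a set of lower-case spellings to search for: the original,
    an accent-free version, and a transliterated version."""
    lowered = unicodedata.normalize("NFC", term).lower()
    transliterated = "".join(TRANSLITERATIONS.get(ch, ch) for ch in lowered)
    return {lowered, strip_diacritics(lowered), strip_diacritics(transliterated)}


if __name__ == "__main__":
    for name in ("Müller", "façade"):
        print(name, sorted(spelling_variants(name)))
```

Each variant would then be run as a separate search term (or combined with "fuzzy" searching), and the results sampled to check whether the variants are in fact catching documents that the original spelling missed.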
This decision is a salutary lesson to litigants and practitioners about the dangers of failing to properly plan and execute a disclosure exercise in compliance with the Civil Procedure Rules. It also highlights the dangers facing lawyers who do not understand the limitations of the technology on which they rely. If nothing else, practitioners should appreciate the old chestnut that the quality of output is entirely reliant on the quality of input, whether inputs take the form of human instruction, or the form and content of raw data.