After preserving and collecting electronically stored information ("ESI"), the next step is to determine what is relevant (or responsive to document requests or a subpoena), versus what is privileged, irrelevant or non-responsive. A great deal of the cost related to e‑discovery is incurred at this stage, and companies should be aware of the tools and strategies available for accomplishing this in the most appropriate and cost-effective manner.

Review tools are necessary because simply opening files one-by-one in their many different source applications is impractical in all but the smallest of productions, and also would risk altering metadata that some courts have held is an integral part of an electronic document. In most cases, therefore, it is necessary to load the ESI into an application that allows it to be reviewed, searched and analyzed. Some companies that are frequently involved in litigation choose to purchase such applications for their own use, but many use applications hosted on their law firm's or an e-discovery vendor's systems. Review tools usually require the ESI to be processed before loading.

Processing involves indexing and formatting collected ESI so that it can be culled and searched in a review tool. It is performed using specialized software by litigation support staff or an outside e‑discovery service provider. The nature of the processing differs with different applications, and "add-in" tools may be employed to expand the range of tasks. Typical processing tasks include extracting files from container files (e.g., .pst, .zip and other archive or compressed formats), separating attachments, converting files to formats the review tool can use, extracting text and metadata, removing system and application files that do not contain user-generated data and have no evidentiary value (called "de-NISTing," which refers to the list of over 28 million file signatures maintained by the National Institute of Standards and Technology), and de-duplicating identical files either within or across custodians. Beyond de-NISTing, it can be beneficial to exclude file types outside of the 80-100 most common file types. Additionally, some files, such as PDFs created from faxes or scans, TIFFs and JPEGs, may be flat images with no extractable text. These must be isolated and run through optical character recognition ("OCR") so that they have searchable text.
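To make de-NISTing and de-duplication concrete, the following is a minimal sketch of how a processing tool might compare file fingerprints. It is an illustration only: the function names, the use of SHA-256, and the sample files are assumptions, and the small set of "known system" hashes merely stands in for the NIST list.

```python
import hashlib


def fingerprint(content: bytes) -> str:
    """Return a SHA-256 hex digest used here as the file's fingerprint."""
    return hashlib.sha256(content).hexdigest()


def process(files, known_system_hashes, across_custodians=True):
    """Drop known system files and exact duplicates.

    files: iterable of (custodian, filename, content_bytes) tuples.
    known_system_hashes: set of digests standing in for the NIST list.
    across_custodians: if False, duplicates are removed only within
        each custodian's own files.
    """
    seen = set()
    kept = []
    for custodian, name, content in files:
        digest = fingerprint(content)
        if digest in known_system_hashes:
            continue  # known system or application file: "de-NISTed"
        key = digest if across_custodians else (custodian, digest)
        if key in seen:
            continue  # exact duplicate of a file already kept
        seen.add(key)
        kept.append((custodian, name))
    return kept


# Hypothetical collection: two custodians, one shared memo, one system file.
files = [
    ("alice", "memo.docx", b"Q3 pricing memo"),
    ("bob", "memo-copy.docx", b"Q3 pricing memo"),
    ("alice", "notepad.exe", b"system binary"),
    ("bob", "notes.txt", b"call notes"),
]
known = {fingerprint(b"system binary")}

kept_across = process(files, known, across_custodians=True)
kept_within = process(files, known, across_custodians=False)
```

With de-duplication across custodians, only one copy of the memo survives. With de-duplication within custodians, each custodian's copy is kept, which may matter when it is important to show who possessed a document.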

Culling involves limiting the data set to the ESI that is likely to be relevant. It should be an ongoing, iterative process. Of course, the best and least expensive form of "culling" may be to target only the appropriate custodians, data sources and date ranges in the initial collection. Some applications allow certain pre-processing culling, such as de-NISTing and applying date range and custodian limitations, before the ESI is loaded into the processing application. Although these pre-processing tools also may generate costs, they are often at lower rates than full-blown processing.
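Date range and custodian limitations amount to a simple filter over the collected items. The sketch below assumes each item carries a custodian name and a date; the field names, custodians and dates are hypothetical.

```python
from datetime import date


def cull(items, custodians, start, end):
    """Keep items from the named custodians dated within [start, end]."""
    return [
        item for item in items
        if item["custodian"] in custodians and start <= item["date"] <= end
    ]


# Hypothetical collected items.
collected = [
    {"id": 1, "custodian": "alice", "date": date(2019, 3, 14)},
    {"id": 2, "custodian": "alice", "date": date(2015, 6, 1)},
    {"id": 3, "custodian": "carol", "date": date(2019, 5, 2)},
    {"id": 4, "custodian": "bob", "date": date(2019, 12, 31)},
]

remaining = cull(
    collected,
    custodians={"alice", "bob"},
    start=date(2019, 1, 1),
    end=date(2019, 12, 31),
)
remaining_ids = [item["id"] for item in remaining]
```

Here item 2 falls outside the date range and item 3 belongs to a custodian who is not in scope, so only items 1 and 4 move on to processing.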

After processing, the ESI can be loaded into the review tool. Costs are also typically incurred at this stage, as many review tools are billed according to the volume of data loaded onto the tool or the number of reviewers using the tool. Ongoing monthly hosting charges for the stored data are standard. Accordingly, it can pay to cull as much irrelevant ESI as possible before loading it onto the review tool.
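Simple arithmetic shows why culling before loading can pay. The sketch below assumes a one-time per-gigabyte loading fee plus a monthly per-gigabyte hosting fee. The rates, volumes and duration are placeholders chosen for illustration, not market figures, and actual billing models vary.

```python
def review_tool_cost(gb_loaded, load_rate_per_gb, hosting_rate_per_gb_month, months):
    """Estimate cost as a one-time loading fee plus monthly hosting."""
    loading = gb_loaded * load_rate_per_gb
    hosting = gb_loaded * hosting_rate_per_gb_month * months
    return loading + hosting


# Placeholder figures: 500 GB collected, 200 GB left after culling,
# hosted for 12 months at illustrative per-gigabyte rates.
LOAD_RATE = 10      # hypothetical one-time fee per GB
HOSTING_RATE = 5    # hypothetical monthly fee per GB
MONTHS = 12

cost_unculled = review_tool_cost(500, LOAD_RATE, HOSTING_RATE, MONTHS)
cost_culled = review_tool_cost(200, LOAD_RATE, HOSTING_RATE, MONTHS)
savings = cost_unculled - cost_culled
```

Because hosting is charged every month, the savings from culling grow with the length of the matter.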

Culling options increase once the ESI is processed and loaded into a review tool. The most common culling methodology is to apply Boolean keyword searches and then to review each document that "hits" the search terms. It can be helpful to test alternative search terms against a sample of documents to maximize their "precision" (i.e., the proportion of the documents retrieved by the terms that are actually relevant) and their "recall" (i.e., the proportion of all the relevant documents in the set that the terms retrieve).
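Precision and recall can be measured by running candidate terms against a sample that a reviewer has already coded. The following is a minimal sketch, assuming a very simple AND/OR keyword matcher; the sample documents, the reviewer's coding and the function names are all hypothetical.

```python
import re


def tokens(text):
    """Lower-case the text and split it into a set of words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def hits(text, all_of=(), any_of=()):
    """True if the text has every all_of term and, if given, any any_of term."""
    words = tokens(text)
    has_all = all(term in words for term in all_of)
    has_any = not any_of or any(term in words for term in any_of)
    return has_all and has_any


def precision_recall(retrieved, relevant):
    """Return (precision, recall) for sets of document ids."""
    true_hits = retrieved & relevant
    precision = len(true_hits) / len(retrieved) if retrieved else 0.0
    recall = len(true_hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical sample, with the ids a reviewer coded as relevant.
sample = {
    1: "Pricing agreement with distributor attached",
    2: "Lunch on Friday?",
    3: "Revised distributor pricing schedule",
    4: "Distributor holiday party pricing for catering",
    5: "Rebate terms for the reseller contract",
}
coded_relevant = {1, 3, 5}

# Query: pricing AND distributor
retrieved = {
    doc_id for doc_id, text in sample.items()
    if hits(text, all_of=("pricing", "distributor"))
}
precision, recall = precision_recall(retrieved, coded_relevant)
```

In this sample the query retrieves documents 1, 3 and 4. Two of the three retrieved documents are relevant (precision of two-thirds), and two of the three relevant documents are retrieved (recall of two-thirds). Document 4 is a false hit, and document 5 is a relevant document the terms missed.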

Advanced search methodologies such as "fuzzy searching" (where the application generates variants, such as common misspellings, of user-generated search terms) and concept searching can enhance precision and recall. Other advanced methodologies include e-mail threading, where the application identifies and assembles all the individual e-mails in a thread, and identification of "near duplicates" and similar documents. All of these promise more efficient and consistent review by routing related documents to a single reviewer and significantly reducing the overall number of documents reviewed. By pulling similar documents together, a reviewer can focus on specific concepts and themes of a case.
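Two of these ideas can be sketched in a few lines. The example below is not how any particular review tool works. For fuzzy searching, rather than generating variants, it looks for words in the document set that closely resemble the search term, using Python's standard difflib module. For near duplicates, it compares overlapping three-word sequences ("shingles") using Jaccard similarity, which is one common approach among several. The cutoff value, sample words and sample text are assumptions.

```python
import difflib


def fuzzy_variants(term, vocabulary, cutoff=0.85):
    """Return words in the vocabulary that closely resemble the term."""
    return difflib.get_close_matches(term, vocabulary, n=10, cutoff=cutoff)


def shingles(text, k=3):
    """Return the set of overlapping k-word sequences in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}


def similarity(text_a, text_b):
    """Jaccard similarity of the two texts' shingle sets (0.0 to 1.0)."""
    a, b = shingles(text_a), shingles(text_b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Hypothetical words found in the document set, including two misspellings.
vocabulary = ["indemify", "indemnfy", "indemnify", "invoice", "payment"]
variants = fuzzy_variants("indemnify", vocabulary)

# Hypothetical documents: two drafts of one e-mail and an unrelated one.
draft_1 = "please review the attached draft supply agreement today"
draft_2 = "please review the attached draft supply agreement tomorrow"
other = "lunch on friday at noon works for me"

near_duplicate_score = similarity(draft_1, draft_2)
unrelated_score = similarity(draft_1, other)
```

The two drafts share five of seven distinct shingles, so they score well above the unrelated e-mail, which shares none. A tool might group documents above a chosen threshold so that one reviewer sees them together.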

Additionally, a number of litigation software vendors have developed (and many others are in the process of developing) applications with functions known as "predictive coding" or "computer-assisted review" that promise to automate, to some extent, the identification of relevant documents. These applications typically assign relevance scores to the entire document set based on a sample of documents reviewed by someone knowledgeable about the issues in the case. Counsel may then limit or prioritize review of the documents based on the software's evaluation of their relevance, with the goal of limiting the cost and time burdens associated with human "eyes on" review of all the documents. As with all search methodologies, sampling and other quality control procedures may be helpful in ensuring the reasonableness of the approach taken. 
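The general idea of scoring an entire document set from a reviewed sample can be illustrated with a simple word-frequency model. The sketch below uses a basic naive Bayes calculation, which is an assumption chosen for brevity; commercial predictive coding applications use their own, typically more sophisticated, methods. The seed documents, coding decisions and function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into words."""
    return re.findall(r"[a-z]+", text.lower())


def train(seed):
    """Build word counts from (text, is_relevant) pairs coded by a reviewer.

    The seed must contain at least one relevant and one irrelevant document.
    """
    counts = {True: Counter(), False: Counter()}
    docs = Counter()
    for text, is_relevant in seed:
        counts[is_relevant].update(tokenize(text))
        docs[is_relevant] += 1
    vocab = set(counts[True]) | set(counts[False])
    return counts, docs, vocab


def score(text, model):
    """Return a log-odds relevance score; higher means more likely relevant."""
    counts, docs, vocab = model
    log_odds = math.log(docs[True]) - math.log(docs[False])
    totals = {label: sum(counts[label].values()) for label in (True, False)}
    for word in tokenize(text):
        if word not in vocab:
            continue  # words never seen in the seed set carry no signal
        # Add-one smoothing keeps unseen word/label pairs from zeroing out.
        p_rel = (counts[True][word] + 1) / (totals[True] + len(vocab))
        p_irr = (counts[False][word] + 1) / (totals[False] + len(vocab))
        log_odds += math.log(p_rel) - math.log(p_irr)
    return log_odds


# Hypothetical seed set coded by someone knowledgeable about the case.
seed = [
    ("pricing agreement with distributor", True),
    ("distributor rebate schedule and pricing", True),
    ("lunch on friday", False),
    ("holiday party schedule", False),
]
model = train(seed)

# Hypothetical unreviewed documents, ranked from most to least likely relevant.
unreviewed = ["friday lunch party", "revised pricing for distributor"]
ranked = sorted(unreviewed, key=lambda text: score(text, model), reverse=True)
```

Counsel could review the highest-ranked documents first, or sample the lowest-ranked documents to test whether a cutoff is reasonable.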

These advanced search methodologies often require "plug-ins" or "add-ons" to review platforms. Consequently, they usually involve additional charges. They may also require additional processing and the assistance of litigation support personnel. Companies should carefully weigh the benefits and costs of the available tools to select those that are most appropriate for the needs of a particular case.

In our next installment of E-Discovery Basics, we will discuss production of ESI.
