The volume of client data generated and exchanged during litigation can be daunting, expensive to manage and impossible to absorb in toto. Factual development of a case usually depends on carefully crafted keyword searches that target important data, while blindly hoping that any “leftovers” are benign. The alternative is to review everything, which is almost always cost- and time-prohibitive. With the advent of so many powerful concept-based searching tools, we no longer need to wade through 99% of the data to isolate the 1% that is at the heart of a case. We can identify the 1% at the outset and build from there.

The Index – the heart of it all

Words in a document are individually parsed and added to an independent index that can be accessed by running queries in a database. You type a query; the database compares it against the index, finds the matching document references and retrieves the records. Oversimplified, perhaps, but this fundamental understanding is often overlooked, or simply assumed among frequent database users, and remains buried beneath discussions about data tables, lexicology and syntax. The key is to understand the different types of indices, the zenith and nadir of their retrieval capability, and how to use them all to develop the facts of your case.
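As a rough illustration of the parse-index-query cycle described above, the following Python sketch builds a tiny keyword index and runs a query against it. The document ids, sample text and function names are hypothetical, and real review platforms use far more elaborate indices; this is only a minimal sketch of the idea.

```python
import re
from collections import defaultdict

def build_index(documents):
    """Map each lowercased word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in re.findall(r"[a-z0-9']+", text.lower()):
            index[word].add(doc_id)
    return index

def search(index, query):
    """Return the ids of documents that contain every word in the query."""
    words = re.findall(r"[a-z0-9']+", query.lower())
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

# Hypothetical three-document collection.
documents = {
    "DOC-001": "The stock purchase agreement was signed on Friday.",
    "DOC-002": "If we are in agreement on Japanese food, I'll make reservations.",
    "DOC-003": "Please send the plan to counsel before closing.",
}

index = build_index(documents)
hits = search(index, "agreement")  # finds DOC-001 and DOC-002
```

Note that the query never touches the documents themselves: it is answered entirely from the index, which is why the way the index is built determines what a search can and cannot find.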

Keyword and proximity searching – bobbing for apples

Practice holds that once data is collected, an attorney sits down and develops a list of search terms to run across the data set to return the “relevant” documents for subsequent review and potential production. The list of search terms is normally developed after meeting with client personnel, reviewing a small set of client-designated records, or reading the case filings. Essentially, we are guessing at the terms, and the proximity of those terms, as they exist in the index, based on limited or extraneous information. If the terms remain too broad, the client suffers by paying for the review of false search-term hits (e.g., a search for “agreement” could pull in any reference to a formal or informal agreement as well as casual email discussions unrelated to the case issues, such as “if we are in agreement on Japanese food, I’ll make reservations for lunch today.”). If the terms are too limiting, key documents may be missed because we have wrongly assumed the presence of one word next to another in a document (e.g., agreement w/5 “stock purchase” would not capture an email discussion where players referred to the relevant agreement as “the plan” or SPA).

Keyword and proximity searching, although built on different indices, employ the same leaps of faith. For that reason, the review of materials culled using keyword and proximity searching needs to remain fluid. As new terms are identified during the review process, they should be run across the original data set to pull in additional documents that were missed by the “key” terms developed at the very beginning of a case.
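The agreement w/5 “stock purchase” example can be sketched in Python as follows. This is a simplified illustration, not how any particular review platform implements proximity operators: it assumes that w/5 means the later term starts no more than five word positions after the earlier term ends, in either order, and vendors define the operator differently. The function names and sample sentences are hypothetical.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def phrase_positions(tokens, phrase):
    """Return every position where the phrase (one or more words) starts."""
    words = tokenize(phrase)
    if not words:
        return []
    n = len(words)
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == words]

def within(text, left, right, distance):
    """True if `left` and `right` occur within `distance` words, in either order."""
    tokens = tokenize(text)
    left_len, right_len = len(tokenize(left)), len(tokenize(right))
    for i in phrase_positions(tokens, left):
        for j in phrase_positions(tokens, right):
            if j > i:
                # Steps from the last word of `left` to the first word of `right`.
                steps = j - (i + left_len - 1)
            else:
                # Steps from the last word of `right` to the first word of `left`.
                steps = i - (j + right_len - 1)
            if 1 <= steps <= distance:  # overlapping matches give steps <= 0
                return True
    return False

# Hypothetical sample sentences.
formal = "The agreement for the stock purchase was signed on Friday."
informal = "Has everyone signed off on the plan? The SPA closes Monday."

formal_hit = within(formal, "agreement", "stock purchase", 5)      # True
informal_hit = within(informal, "agreement", "stock purchase", 5)  # False
```

The second sentence is about the same deal, yet the search misses it because neither guessed term appears. That is the leap of faith in miniature.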

Keyword and proximity searching, however, can be very powerful in identifying specific events for the development of a case timeline, targeting a particular string of conversations between two key players, or targeting all references to a specific product within its R&D phase. When you know what you are looking for, usually much later in the development of the case, the leap of faith becomes a small step in plugging the holes in your factual analysis.

Concept-mining: the view from above

Latent semantic indexing (“LSI”) is the basis of concept-searching in a database, generically referred to as concept analytics. A concept analytics index doesn’t compare individual terms within each document but instead catalogs the relationships among terms, measured by their co-occurrence, across all documents in a data set. Related terms are then grouped as “concepts.” For example, if the terms “denial” and “claim” exist in a single document, and in another document the terms “claim” and “payment” co-exist, a relationship among all of these terms is drawn and ranked, and the “claim” concept is developed. The more frequently the terms co-occur, the stronger the relationship among “claim,” “payment” and “denial,” and the higher the ranking. In a concept analytics index, a ranking of 95% would represent a relatively strong concept, meaning the terms in the index co-occur frequently. Using the relationships drawn by concept analytics, the case team can see what is happening conceptually across a data collection early on in a case.
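The co-occurrence idea behind the “claim” example can be sketched with a simple count of how many documents each pair of terms shares. This is a deliberate simplification: latent semantic indexing proper typically works from a mathematical decomposition of a term-by-document matrix rather than raw pair counts, and the percentage rankings a commercial tool reports are not reproduced here. The sample documents and function names are hypothetical.

```python
import re
from collections import Counter
from itertools import combinations

def cooccurrence(documents):
    """Count, for each pair of distinct terms, how many documents contain both."""
    pairs = Counter()
    for text in documents:
        terms = sorted(set(re.findall(r"[a-z]+", text.lower())))
        pairs.update(combinations(terms, 2))
    return pairs

def related_terms(pairs, term):
    """Return (other_term, shared_document_count) pairs for `term`."""
    scores = Counter()
    for (a, b), count in pairs.items():
        if a == term:
            scores[b] += count
        elif b == term:
            scores[a] += count
    return scores.most_common()

# Hypothetical documents, reduced to their key terms for readability.
documents = [
    "claim denial",
    "claim payment",
    "claim denial payment",
    "lunch reservations",
]

pairs = cooccurrence(documents)
claim_concept = dict(related_terms(pairs, "claim"))  # {"denial": 2, "payment": 2}
```

Here “claim” ties “denial” and “payment” together because it shares documents with both, while “lunch” and “reservations” form an unrelated cluster of their own.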

Imagine a scenario at the start of a sexual harassment case where the attorney is asked to cobble together a list of search terms to identify key materials for review. Those search terms would probably contain the usual suspects: names of players and the term “harassment” and all its synonyms and iterations. Utilizing keyword and proximity searching, the term “harassment” may pull up a small collection of records, but how often would the players involved actually use “harassment” to describe relevant events as they were happening? How would you know what terms were used without knowing the players and their vernacular? This illustrates the limitation of using keyword and proximity searching at the very beginning of a case. Using one database feature available with concept analytics, called keyword expansion, the term “harassment” could be run across the index and yield a treasure trove of materials that are conceptually linked to it through terms such as belittle, humiliate and demean. These are all terms related to harassment that exist in the data set but may not have made it onto the original list developed by the case team, an omission that could leave critical materials unreviewed.
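A minimal sketch of keyword expansion, under stated assumptions: commercial tools expand a term using their concept index, whereas this illustration substitutes a much cruder rule, namely that any word sharing at least two documents with the seed term is treated as related. The emails, the small stop-word list and the function names are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical stop-word list: common words ignored during expansion.
STOP_WORDS = {"a", "and", "he", "her", "in", "me", "of", "on", "the", "to", "would"}

def terms(text):
    """Return the set of meaningful lowercased words in the text."""
    return set(re.findall(r"[a-z]+", text.lower())) - STOP_WORDS

def expand(documents, seed, min_docs=2):
    """Words that appear alongside `seed` in at least `min_docs` documents."""
    counts = Counter()
    for text in documents.values():
        found = terms(text)
        if seed in found:
            counts.update(found - {seed})
    return sorted(word for word, n in counts.items() if n >= min_docs)

def search_any(documents, words):
    """Ids of documents containing at least one of the given words."""
    wanted = set(words)
    return {doc_id for doc_id, text in documents.items() if terms(text) & wanted}

# Hypothetical email collection.
documents = {
    "E-1": "Harassment complaint: he would belittle and demean her in meetings.",
    "E-2": "The harassment report says he tried to humiliate and demean the team.",
    "E-3": "Another harassment note: he likes to belittle and humiliate staff.",
    "E-4": "He continues to belittle me in front of clients.",
    "E-5": "Lunch on Friday? Japanese food works for me.",
}

expanded = expand(documents, "harassment")  # ["belittle", "demean", "humiliate"]
keyword_hits = search_any(documents, ["harassment"])
new_hits = search_any(documents, expanded) - keyword_hits  # {"E-4"}
```

Document E-4 never uses the word “harassment,” so the original keyword alone would have missed it; the expanded terms, drawn from the data set's own vocabulary, bring it in.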

Another available database feature based on concept analytics is categorization. Categorization allows the attorney to find specific examples from the documents that relate to key issues in a case and apply those examples to a targeted set of documents to identify similar content. For instance, if an attorney were to identify support for defending a breach of contract claim from within the client’s data early on in a case, and then apply those conceptual examples to the opposing parties’ production materials, navigating a massive document collection and production could be done adeptly and cost-effectively without sacrificing development of the themes in the case.
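Categorization by example can be sketched as a similarity comparison between an example document and each document in the target set. The sketch below assumes a plain word-overlap measure (cosine similarity over word counts) and an arbitrary cutoff of 0.3; a concept analytics engine would compare documents conceptually rather than word for word, so treat this only as an illustration of the workflow. The issue label, texts and function names are hypothetical.

```python
import math
import re
from collections import Counter

def vector(text):
    """Represent a text as a count of its lowercased words."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two word-count vectors (0.0 to 1.0)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def categorize(examples, documents, threshold=0.3):
    """Tag each document with the issue whose example it most resembles."""
    result = {}
    for doc_id, text in documents.items():
        doc_vec = vector(text)
        best_issue, best_score = None, 0.0
        for issue, example in examples.items():
            score = cosine(vector(example), doc_vec)
            if score > best_score:
                best_issue, best_score = issue, score
        if best_score >= threshold:  # the threshold is an assumed cutoff
            result[doc_id] = best_issue
    return result

# One example drawn from the client's data, applied to the other side's production.
examples = {
    "breach of contract": "supplier failed to deliver the goods by the contract deadline",
}
production = {
    "P-1": "the supplier missed the contract deadline and failed to deliver",
    "P-2": "lunch reservations for friday",
}

tagged = categorize(examples, production)  # {"P-1": "breach of contract"}
```

Documents that fall below the cutoff are simply left untagged, which mirrors the practical point: the examples steer the reviewer toward similar content without requiring a read of everything else first.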

The hybrid: fuel-efficient

With all of the search tools that are available, the best approach is to use them all but be mindful of their precision. By utilizing concept analytics for the development of case themes early on and keyword/proximity searching for development of targeted timelines and trends as discovery unfolds, the case team can stay on course and not miss the leaves within the forest.

Leslie Nash