Duplicates abound in data collections: custodians store files in multiple locations—on servers, laptops, and mobile devices—and e-mail chains are sent to numerous custodians. A federal court recently acknowledged how analytics can boost the efficiency of review by eliminating this redundant information in a case involving breach of contract, misrepresentation, unjust enrichment, and unfair trade practices claims.

What Happened in the Case?

In Family Wireless #1, LLC v. Automotive Technologies, Inc., the plaintiffs, wireless company franchisees, filed a motion to compel discovery from the defendant franchisor. The parties had met and conferred many times to develop an ESI protocol, but the sticking point was the number of custodians. Originally, the parties agreed to search the files of seven custodians; however, the plaintiffs asked the court for leave to search six more.

The court evaluated the plaintiffs’ request under Federal Rule of Civil Procedure 26(b)(2)(B), which explains that parties can object to discovery of relevant ESI that is “not reasonably accessible because of undue burden or cost” unless there is “good cause” for the discovery. The plaintiffs argued the new custodians were “‘conduits of relevant information’” despite not being top-level decision-makers. The defendant objected to the plaintiffs’ request on two grounds. First, it argued that the additional custodians’ e-mails would be duplicative, since most of those custodians reported to custodians whose records had already been searched. Second, it argued that searching these files would lead to “tens of thousands of additional documents and hours of costly review.”

The defendant’s argument failed to persuade the court. The court observed that the defendant had acknowledged that deduplication would exclude the production of duplicative e-mails, “alleviat[ing] some of the cost and time concerns.” The court also found that the defendant’s relevance argument failed, as “[t]he mere fact that many documents have already been produced is not sufficient to establish that there are no other relevant materials to be found.” Accordingly, the court permitted the plaintiffs to expand the search to three additional custodians; the plaintiffs could not establish sufficient relevance to warrant searches of the other three custodians’ e-mails.

What Is Deduplication?

Deduplication is the elimination of identical files from a data set. eDiscovery platforms can apply deduplication within or across custodians and data sources, giving administrators or users choices as the data moves through the system. Data can typically be deduplicated in one of two ways, depending on the needs of the matter: vertically, which removes duplicates within each custodian’s files, or horizontally, which removes duplicates globally across a collection. To find duplicate files, the algorithm compares each document’s hash value, an effectively unique mathematical “fingerprint” computed from the document’s content. Any change—as small as altering one character or changing a font—will alter a document’s hash value, so only exact duplicates are marked for culling from a collection.
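To make the mechanics concrete, here is a minimal Python sketch of hash-based deduplication. The custodian-to-files layout and the choice of SHA-256 are illustrative assumptions; commercial platforms use their own hashing schemes (often MD5 or SHA-1 over normalized content plus metadata), but the vertical-versus-horizontal logic is the same.

```python
import hashlib
from pathlib import Path

def file_hash(path: Path) -> str:
    """Compute a SHA-256 'fingerprint' of a file's exact bytes.
    Changing even one character produces a different hash."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def deduplicate(custodians: dict[str, list[Path]],
                horizontal: bool = True) -> dict[str, list[Path]]:
    """Return each custodian's files with exact duplicates removed.

    horizontal=True  -> global dedup: a file is kept only the first
                        time its hash appears anywhere in the collection.
    horizontal=False -> vertical dedup: duplicates are removed only
                        within each custodian's own files.
    """
    seen_global: set[str] = set()
    result: dict[str, list[Path]] = {}
    for custodian, files in custodians.items():
        # Horizontal dedup shares one "seen" set across all custodians;
        # vertical dedup starts fresh for each custodian.
        seen = seen_global if horizontal else set()
        kept = []
        for path in files:
            digest = file_hash(path)
            if digest not in seen:
                seen.add(digest)
                kept.append(path)
        result[custodian] = kept
    return result
```

Note one design consequence of horizontal deduplication: the first custodian processed “wins” the surviving copy. For that reason, platforms typically record which custodians held each suppressed duplicate, so reviewers can still see who had the document even though only one copy is reviewed.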

With recent estimates placing the proportion of duplicates in data sets between 30 and 80 percent, removing duplicates early in the eDiscovery process can pay huge dividends in lower hosting, review, and production costs. Moreover, deduplication can reduce the risk of inconsistent responsiveness and privilege decisions on identical documents.

By JB Hinrichs, Vice President, Marketing & Communication at Xerox Legal Business Services