Tuesday, September 13, 2016

Recall, Magical Thinking, and the Assessment of eDiscovery

The current version of the Federal Rules of Civil Procedure highlights the importance of reasonableness and proportionality. As is widely understood, the cost of dealing with the volume of documents that could potentially play a role in a legal dispute can easily overwhelm the value of the case. Some kind of technology use is essential if we are to maintain a justice system that depends on evidence.

The problem is generally not the number of documents that will ultimately be introduced as evidence; rather, it is the winnowing process that goes from the domain of potentially relevant documents down to the ones that must be produced. Ultimately, only a handful of those may end up being critical to a case. If we knew without effort which those documents were, we would not have to go through the complex discovery process.

Discovery involves more than winnowing, of course. The legal team not only has to decide which documents are pertinent to a case, but also has to understand the content of those documents and how they fit into and guide the theory of the case. Data analysis and understanding have not, historically, had the benefit of a well-structured process, but the winnowing task has. In this context, I am focusing on the problem of identifying the documents to be produced from large collections.

Assessing the reasonableness of any process can be facilitated by measurement.  There is a saying that you cannot improve what you do not measure.  Although one can use intuition or other forms of judgment to assess reasonableness, intuitive feelings of reasonableness alone may not be sufficient.  In these cases, we would like to know how reasonable a process was.  For this, we need measurement.

Overwhelmingly, the primary measurement of the efficacy of the winnowing process in eDiscovery is Recall.  Of the documents that are relevant in a collection, how many (what proportion) of them have been identified?  The idea is that the more complete the identification process, the better it has been.  All other things being equal, a better process is a more reasonable process. 

Still, from time to time, questions arise about whether Recall is a good measure for assessing the winnowing process.

As I read it, there are four related arguments about why Recall might be inappropriate as a measure of the eDiscovery winnowing process:
  1. Recall measures completeness, but completeness is not enough
  2. Recall is overly sensitive to the easy to find documents
  3. Recall is insufficiently sensitive to rare, but critical sources of information (smoking guns)
  4. Recall measures the number of documents that are identified, but not their importance

Before discussing these criticisms, I want to spend some time thinking about measures.  A good measure should have validity and reliability.  Validity means that it actually measures the property that you are interested in.  Reliability means that measuring the property repeatedly gives consistent results. A good measure should also be easy to obtain and yield a quantity that has a minimum and maximum value (say 0.0 and 1.0 or 0.0 and 100.0). Finally, it should be transparently related to the goals of the task, so that it is easy to interpret.  Although computing it can take some effort, Recall meets these criteria for a good measure.

Completeness may not be complete

Recall is a statistic for measuring completeness.  It corresponds directly to the requirement in The Federal Rules of Civil Procedure, Rule 26(g) that the producing party certify that a production is complete and correct, following a reasonable inquiry.  So, by these standards, completeness would seem to be a central criterion against which to judge a production.

The usefulness of any statistic depends critically on the question you are trying to ask.  If we want to know how complete an eDiscovery process has been we can simply ask how close we have come to identifying all of the relevant documents. It is difficult to think of a more transparent or valid measure than Recall to answer this question.  If you know the number of responsive documents in a collection and you know the number that have been identified, then you know how complete your process is.
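As a minimal sketch of the calculation just described, assuming both counts are known (in practice the total has to be estimated), Recall is a simple proportion. The function name and the counts below are hypothetical, chosen only for illustration.

```python
def recall(identified_responsive: int, total_responsive: int) -> float:
    """Proportion of the responsive documents in a collection that were identified."""
    if total_responsive <= 0:
        raise ValueError("total_responsive must be positive")
    if not 0 <= identified_responsive <= total_responsive:
        raise ValueError("identified_responsive must be between 0 and total_responsive")
    return identified_responsive / total_responsive


# Hypothetical counts: 8,000 of 10,000 responsive documents identified.
print(recall(8_000, 10_000))  # 0.8
```

The result always falls between 0.0 and 1.0, which is one of the properties of a good measure mentioned earlier.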

To be sure, there are challenges when measuring Recall.  The primary one is that we do not actually know directly how many relevant documents are in a collection.  We need to estimate that number, and for this we use various statistical sampling and other methods.  I have discussed some of these methods elsewhere, but all of them are essentially different ways of estimating Recall.  If you want to know about the completeness of a discovery process, Recall, however estimated, is your answer (I count Elusion as being one of the methods of estimating Recall).
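One common way to turn an Elusion sample into a Recall estimate can be sketched as follows. This is an illustration under stated assumptions, not a complete statistical procedure: it assumes a simple random sample drawn from the unproduced (discard) set, it yields only a point estimate with no confidence interval, and every name and number in it is hypothetical.

```python
def recall_from_elusion(
    produced_responsive: int,
    discard_size: int,
    sample_size: int,
    sample_responsive: int,
) -> float:
    """Point estimate of Recall from a random sample of the discard set.

    Elusion is the proportion of responsive documents found in the sample
    of documents that were NOT selected for production.
    """
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    elusion = sample_responsive / sample_size
    # Scale the sample rate up to the whole discard set.
    estimated_missed = elusion * discard_size
    denominator = produced_responsive + estimated_missed
    if denominator == 0:
        raise ValueError("no responsive documents found or estimated")
    return produced_responsive / denominator


# Hypothetical numbers: 9,000 responsive documents produced; 10 responsive
# documents in a random sample of 1,000 drawn from a 100,000-document discard set.
estimate = recall_from_elusion(9_000, 100_000, 1_000, 10)
print(round(estimate, 3))  # 0.9
```

The smaller the sample, the wider the uncertainty around such an estimate, which is part of the effort that computing Recall can take.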

Critics of Recall sometimes claim that there must be more to completeness than the number of documents available and the number identified.  We turn to a couple of those suggestions next.

Sensitivity to the easy to find documents

According to the second argument, completeness in terms of documents is not completeness in terms of information.  We should really be using a measure of the completeness of information.  Some documents contain unique information and some are simply repeats of already known information.  The responsive ones with unique information tend to be more valuable than the redundant ones.

Once one responsive document is found, other similar documents are found automatically, but finding many duplicates of an easy-to-find document does not add value to the discovery. For example, if 80% of the responsive documents are nearly identical to one another and we find one of them, we can achieve 80% Recall without finding another document. We could appear to be successful just by finding the easy-to-find documents and still miss a lot of information.
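The arithmetic behind this objection can be sketched with made-up data. The example assumes, purely for illustration, that every responsive document has already been assigned to a near-duplicate group; the document IDs and group labels are hypothetical, and no such grouping is available for free in a real collection.

```python
# Hypothetical collection of 100 responsive documents:
# 80 belong to a single near-duplicate group, 20 are each one of a kind.
group_of = {f"doc{i}": "dup_group" for i in range(80)}
group_of.update({f"doc{i}": f"unique_{i}" for i in range(80, 100)})

# Suppose the process finds only the near-duplicate group.
found = {doc for doc, group in group_of.items() if group == "dup_group"}

# Document-level Recall: found responsive documents / all responsive documents.
document_recall = len(found) / len(group_of)

# Share of distinct groups with at least one found document.
groups_total = set(group_of.values())
groups_found = {group_of[doc] for doc in found}
group_coverage = len(groups_found) / len(groups_total)

print(document_recall)           # 0.8
print(round(group_coverage, 3))  # 0.048 (1 of 21 groups)
```

Counting groups is only a crude stand-in for "information," and it depends entirely on how the groups were drawn.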

But just how do we measure this missing information?  Counting documents is relatively easy, but measuring the information content of each one is practically impossible.  Experimental psychology had a flirtation with measuring information in text in ways that could be automated, but that approach generally did not work out. 

I don’t want to claim that there could never be a way of effectively measuring the amount of information in a document or a collection of documents, but at present, I don’t know of any practical way. The best we could do, I think, is to determine that a document is dissimilar to any that have been found so far to be responsive. Even that would be a challenge to convert into any meaningful measure of the completeness of a production, however, let alone a practical measure.

Recall does not measure the effectiveness of finding smoking guns or rare documents

It is common in eDiscovery to say that smoking guns tend to have friends.  That is, they are generally not unique.  A representative sample of documents has a good chance of catching smoking gun documents, if they exist in a collection.  But truly rare documents can occur, and a sampling process is unlikely to find them.  That is the definition of rare.

The challenge of finding rare documents might be a criticism of sampling, but it is not a criticism of Recall.  No matter what process we employ, even exhaustively reading all of the documents, truly rare documents necessarily present a challenge to discovery.  Many documents in a collection are rare, but their rarity does not guarantee their relevance.  Rarity is not a value by itself. Individual junk emails could also be rare and of no value at all to the litigation.

If a document type is truly rare, then it is unlikely to be encountered during the review process, or if it is encountered, it is unlikely to be recognized.  Since World War II, it has been known that humans have difficulty sustaining their attention in the face of rare signals, an effect called "vigilance decrement."  Studies of human reviewers in eDiscovery confirm that people are relatively poor at independently identifying responsive documents.  We found, for example, that only 28% of the documents identified by either of two professional reviewers were identified by both reviewers. When two reviewers disagree on whether a document is responsive, one of them must be wrong. 
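The agreement figure quoted above is an overlap ratio: the documents identified by both reviewers divided by the documents identified by either one. A minimal sketch of that calculation follows; the document IDs are invented and chosen only so that the ratio comes out to 0.28, and they are not the data from the study.

```python
def reviewer_overlap(reviewer_a: set, reviewer_b: set) -> float:
    """Share of documents marked responsive by either reviewer that both marked."""
    either = reviewer_a | reviewer_b
    if not either:
        raise ValueError("neither reviewer marked any document as responsive")
    both = reviewer_a & reviewer_b
    return len(both) / len(either)


# Invented document IDs: each reviewer marks 64 documents; 28 are shared
# and 100 distinct documents are marked by at least one reviewer.
reviewer_a = set(range(0, 64))
reviewer_b = set(range(36, 100))
print(reviewer_overlap(reviewer_a, reviewer_b))  # 0.28
```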

Documents do not have to be rare for human reviewers to miss them. It is a common occurrence in eDiscovery that a category of documents is not recognized until after many thousands of documents have been reviewed.  Human reviews rarely go back and fix such mistakes because it is simply too expensive.

Furthermore, truly rare documents are unlikely to appear in our estimate of the truly responsive documents in a collection against which we compute Recall. If they are not encountered or if they are not recognized when they are encountered, they cannot count either for or against Recall. We would have no knowledge that they exist. Documents that we do not know about cannot affect any measurement. Moreover, it would be extremely difficult to practically identify such unique documents in a large collection. Again, this is not a problem with Recall, but with the search process in general. These documents might magically exist, but none of the processes we have available are likely to find them. Again, that is the definition of rare. If they were easy to find, they would not be a problem.

Recall is an “average” kind of measure.  It is a characteristic of how a process performs over the population of all documents in a collection.  Each document may be unique in what makes it relevant and in how important it is, but Recall captures the overall quality of the process.  Rare kinds of documents contribute less than common kinds of documents.  According to decision theory, it is more difficult to accurately judge rare events relative to more frequent events, whether that judgment is done by a computer or by a human reader.

Recall does not measure importance

Recall treats each responsive document as making an equal contribution to completeness. It treats each responsive document found as a count toward either prevalence or completeness. But documents are not equal in their probative value. Could there be a measure that takes account of the probative value of a document? This would, of course, be a different measure than Recall, addressing a different question.

Probativeness concerns an individual document’s contribution to the case. It is not a measure of the completeness of a process at finding responsive documents. A document has probative value if it raises some new piece of evidence, but not if it is the tenth or hundredth document providing that same information. It is difficult to see how probativeness could be used as a measure of the success of a predictive coding project rather than as a measure of an individual document in that collection. We could not, for example, simply sum up the probativeness of each document in the collection. The probative value of a document is contingent on the document and on the already discovered documents in the collection.

Recall can be used to some extent in the context of probativeness.  Some predictive coding projects, for example, compute separate Recall measures for “hot” documents, the most important ones to the case, and merely responsive documents, the rest of the responsive ones. This does not indicate a failure of Recall, but its application to a special subset of responsive documents.
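Computing Recall separately for each subset can be sketched as below. The category labels ("hot" and "responsive"), the document IDs, and the assumption that every responsive document has a known category are all hypothetical.

```python
from collections import defaultdict


def recall_by_category(category_of: dict, found: set) -> dict:
    """Recall computed separately for each category of responsive document.

    category_of maps every responsive document ID to its category;
    found is the set of document IDs the process identified.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for doc, category in category_of.items():
        totals[category] += 1
        if doc in found:
            hits[category] += 1
    return {category: hits[category] / totals[category] for category in totals}


# Hypothetical labels: two "hot" documents and four merely responsive ones.
category_of = {
    "d1": "hot", "d2": "hot",
    "d3": "responsive", "d4": "responsive",
    "d5": "responsive", "d6": "responsive",
}
found = {"d1", "d3", "d4", "d5"}
print(recall_by_category(category_of, found))
# {'hot': 0.5, 'responsive': 0.75}
```

Each value is still ordinary Recall, just restricted to one subset of the responsive documents.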

As with responsiveness, we cannot know the probativeness of a document before the discovery process. If we did, we would not need to conduct the eDiscovery process. Some analysis needs to be conducted to assess probativeness, and it may take the development of new approaches to machine learning to automate the estimation of a document’s probative value. The probativeness of a document, though, is not contained solely within the document, but in the relationship between a document’s content and other sources of information. Any process directed at automating the assessment of probativeness will have to include much more information than that contained within a document or even a document collection. As mentioned earlier, measuring the information content of a document is itself difficult; measuring the document’s relation to the facts and needs of the case is, at least for the present, impossible.

If we knew the probativeness of each document, then we could use that information to weight our Recall.  Unfortunately, at this point, wishing for a measure of probativeness is just magical thinking.  Someday, we may be able to automate its assessment, but until we have an automated measure, basing an assessment on probativeness seems unlikely to be anything more than wishful thinking.
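To make the hypothetical concrete, here is what a probativeness-weighted Recall would look like if such weights existed. The weights below are invented for illustration only; as argued above, there is currently no practical way to obtain them.

```python
def weighted_recall(weight_of: dict, found: set) -> float:
    """Recall in which each responsive document counts in proportion to its weight.

    weight_of maps every responsive document ID to a non-negative weight
    (a stand-in for probative value); found is the set of IDs identified.
    """
    total = sum(weight_of.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    found_weight = sum(w for doc, w in weight_of.items() if doc in found)
    return found_weight / total


# Invented weights: document "a" is assumed to be five times as probative
# as each of the others.
weight_of = {"a": 5.0, "b": 1.0, "c": 1.0, "d": 1.0}
found = {"b", "c", "d"}

print(len(found & weight_of.keys()) / len(weight_of))  # 0.75  (ordinary Recall)
print(weighted_recall(weight_of, found))               # 0.375 (weighted Recall)
```

With equal weights the two numbers coincide, so ordinary Recall is the special case in which every responsive document counts the same.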

Furthermore, I don’t think that that is what the winnowing process is all about.  Would it be reasonable for a producing party to say, “we are only producing a small percentage of these documents, but these are the most probative ones?”  Would such a production be compatible with FRCP Rule 26(g) (requirement for complete and correct productions after reasonable enquiry)?  Could the producing party even judge which documents would be most probative to the requesting party?  Is not the probative value of documents part of the essential legal reasoning in a case?

The status of Recall

We can make up imaginary situations where Recall fails to assess the reasonableness of our selection process, but these situations are contrived and simply not realistic.  For example, one commonly suggested scenario is that one process will find more total responsive documents and thus have higher Recall than another while the second process finds fewer documents (lower Recall), but better ones. 

This scenario, it seems to me, is unlikely to actually occur. In order for one system to have lower Recall than another, but still find a substantial number of better documents, (a) there have to be a substantial number of better documents to find, (b) the lower-Recall system would have to miss a substantial number of documents found by the other process, and (c) we would have to find evidence of these other documents. Generally speaking, an eDiscovery activity uses only one kind of eDiscovery process, though sometimes keywords are used on the same set as predictive coding. Parties have speculated that there might be substantial numbers of documents detected by the keywords that were not identified by predictive coding, but these claims have been mostly speculation (e.g., Dynamo Holdings).

If such a scenario could happen, there might be some abstract sense in which we would prefer the lower-Recall process over the higher-Recall process. The production from the lower-Recall system in this scenario, though, is less complete than the production from the higher-Recall system. According to this scenario, it misses a large number of responsive documents that are found by the higher-scoring process.

Finally, how could we know? We do not have access to some catalog of ultimate truth about the responsiveness of documents.  How could we tell that the system produced better quality documents without running the comparison (i.e., doing predictive coding twice) and without having found the more valuable documents? We can imagine a situation where we have a god's-eye view of the true nature of documents, but in reality, we can only know what we observe.

Often the objections to the use of Recall seem to be thinly veiled arguments that human review is somehow superior to computer-assisted review. Some people still cling to the view that human review is the gold standard, that it is better to have a team of reviewers spend many hours over many months reviewing documents because somehow we will get results that we cannot get using any other approach. There is no empirical support for such a claim.

Many studies find that reviewers are inconsistent when making independent judgments about the responsiveness of documents. I know of no studies, or even cases, that have found that people are better at finding rare documents or smoking guns than computer-assisted review is. Some lawyers may think that they are somehow better at identifying responsive documents than the statistics of human review would imply, but these lawyers are probably overestimating their ability (the overconfidence effect), and they are unlikely to be the ones who actually review the documents during the winnowing process. Some lawyers are surely above average at recognizing responsive documents, but not all of them can be. And the average seems actually to be rather low.

It seems clear that if complete and correct productions are the goal, then we need measures of completeness and correctness.  Completeness is clearly indicated by Recall but correctness depends on the validity of the decisions made during the review process.  Correctness is much more affected by the people using the technology than by the technology itself.

Obviously, if we produce all of the responsive documents, then we must be producing the correct ones as well. The closer the production is to complete, the closer it must be to correct. 

Rule 26(g) also refers to reasonable enquiry. Any process we demand must be practical to execute. No eDiscovery process is likely to be perfect. Hypothetical processes that demand information that is not practically obtainable may be useful for making abstract arguments, but they are unlikely to find any useful role in litigation. As long as we are interested in completeness, I think that our focus will remain on the measure of that completeness: Recall and its analogs.
