The current version of the Federal Rules of Civil Procedure
highlights the importance of reasonableness and proportionality. As is widely understood, the cost of dealing
with the volume of documents that could potentially play a role in a legal
dispute can easily overwhelm the value of the case. Some kind of technology use is essential if
we are to maintain a justice system that depends on evidence.
The problem is generally not the number of documents that
will ultimately be introduced as evidence; rather, it is the winnowing process
that goes from the domain of potentially relevant documents down to the ones
that must be produced. Ultimately, only
a handful of those may end up being critical to a case. If we knew without effort which those
documents were, we would not have to go through the complex discovery
process.
Discovery involves more than winnowing, of course. The legal team not only has to decide which
documents are pertinent to a case, but also to understand the content of those
documents and how they fit into and guide the theory of the case. Data analysis
and understanding have not, historically, had the benefit of a well-structured
process, but the winnowing task has. In
this context, I am focusing on the problem of identifying the documents to be
produced from large collections.
Assessing the reasonableness of any process can be
facilitated by measurement. There is a
saying that you cannot improve what you do not measure. Although one can use intuition or other forms
of judgment to assess reasonableness, intuitive feelings of reasonableness
alone may not be sufficient. In these
cases, we would like to know how reasonable a process was. For this, we need measurement.
Overwhelmingly, the primary measurement of the efficacy of
the winnowing process in eDiscovery is Recall.
Of the documents that are relevant in a collection, how many (what
proportion) of them have been identified?
The idea is that the more complete the identification process, the
better it has been. All other things
being equal, a better process is a more reasonable process.
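To make the computation concrete, here is a minimal sketch in Python (the counts are invented for illustration):

```python
def recall(responsive_found: int, responsive_total: int) -> float:
    """Proportion of the responsive documents in the collection
    that the winnowing process actually identified."""
    if responsive_total == 0:
        return 1.0  # nothing to find, so nothing was missed
    return responsive_found / responsive_total

# Hypothetical numbers: 9,000 responsive documents identified out of
# an estimated 12,000 responsive documents in the collection.
print(recall(9_000, 12_000))  # 0.75, i.e., 75% Recall
```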
Still, from time to time, questions arise about whether Recall is a
good measure for assessing the winnowing process.
As I read it, there are four related arguments about why
Recall might be inappropriate as a measure of the eDiscovery winnowing process:
- Recall measures completeness, but completeness is not enough
- Recall is overly sensitive to the easy-to-find documents
- Recall is insufficiently sensitive to rare but critical sources of information (smoking guns)
- Recall measures the number of documents that are identified, but not their importance
Before discussing these criticisms, I want to spend some
time thinking about measures. A good
measure should have validity and reliability.
Validity
means that it actually measures the property that you are interested in. Reliability
means that measuring the property repeatedly gives consistent results. A good
measure should also be easy to obtain and yield a quantity that has a minimum
and maximum value (say 0.0 and 1.0 or 0.0 and 100.0). Finally, it should be
transparently related to the goals of the task, so that it is easy to
interpret. Although computing it can
take some effort, Recall meets these criteria for a good measure.
Completeness may not be enough
Recall is a statistic for measuring completeness. It corresponds directly to the requirement in
the Federal Rules of Civil Procedure, Rule 26(g), that the producing party
certify that a production is complete and correct, following a reasonable
inquiry. So, by these standards,
completeness would seem to be a central criterion against which to judge a
production.
The usefulness of any statistic depends critically on the
question you are trying to answer. If we
want to know how complete an eDiscovery process has been we can simply ask how
close we have come to identifying all of the relevant documents. It is
difficult to think of a more transparent or valid measure than Recall to answer
this question. If you know the number of
responsive documents in a collection and you know the number that have been
identified, then you know how complete your process is.
To be sure, there are challenges when measuring Recall. The primary one is that we do not actually
know directly how many relevant documents are in a collection. We need to estimate that number, and for this
we use various statistical sampling and other methods. I have discussed some of these methods elsewhere,
but all of them are essentially different ways of estimating Recall. If you want to know about the completeness of
a discovery process, Recall, however estimated, is your answer (I count Elusion
as one of the methods of estimating Recall).
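To make the estimation point concrete, here is a rough Python sketch of one common way to approximate Recall from an Elusion sample of the documents that would not be produced. The counts are hypothetical, and a real estimate would also carry a confidence interval.

```python
def recall_from_elusion(produced_responsive: int,
                        discard_count: int,
                        sample_size: int,
                        sample_responsive: int) -> float:
    """Estimate Recall from an Elusion sample: review a random sample of
    the documents not being produced, project the rate of responsive
    documents found there (the Elusion rate) onto the whole discard
    pile, and treat that projection as the number of missed documents."""
    elusion_rate = sample_responsive / sample_size
    estimated_missed = elusion_rate * discard_count
    return produced_responsive / (produced_responsive + estimated_missed)

# Hypothetical numbers: 9,000 responsive documents produced; 100,000
# documents set aside; a 1,500-document sample of the set-aside
# documents turns up 30 responsive ones.
print(round(recall_from_elusion(9_000, 100_000, 1_500, 30), 3))  # ~0.818
```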
Critics of Recall sometimes claim that there must be more to
completeness than the number of documents available and the number
identified. We turn to a couple of those
suggestions next.
Sensitivity to the easy-to-find documents
According to the second argument, completeness in terms of
documents is not completeness in terms of information. We should really be using a measure of the
completeness of information. Some
documents contain unique information and some are simply repeats of already
known information. The responsive ones
with unique information tend to be more valuable than the redundant ones.
After one responsive document is found, other similar
documents are automatically found as well, but finding many duplicates of an easy-to-find
document does not add value to the discovery. For example, if 80% of the
responsive documents are nearly identical to one another and we find one of
them, we can achieve 80% Recall without finding another document. We could appear to be successful just by
finding the easy-to-find documents and still miss a lot of information.
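A small sketch makes the contrast explicit. In this toy example the near-duplicate clusters are given to us, which is exactly what we do not have in practice; the counts are invented.

```python
# Hypothetical collection: 10,000 responsive documents. One cluster of
# 8,000 near-identical documents dominates; the remaining 2,000 spread
# across 200 smaller clusters, each carrying distinct information.
clusters = {"big_cluster": 8_000, **{f"cluster_{i}": 10 for i in range(200)}}

# Suppose the process finds every copy in the big cluster and nothing else.
found = {"big_cluster": 8_000}

doc_recall = sum(found.values()) / sum(clusters.values())
info_recall = len(found) / len(clusters)  # clusters with at least one hit

print(doc_recall)   # 0.8   -> 80% of the responsive documents
print(info_recall)  # ~0.005 -> a tiny fraction of the distinct information
```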
But just how do we measure this missing information? Counting documents is relatively easy, but measuring
the information content of each one is practically impossible. Experimental psychology had a flirtation with
measuring information in text in ways that could be automated, but that
approach generally did
not work out.
I don’t want to claim that there could never be a way of
effectively measuring the amount of information in a document or a collection
of documents, but at present, I don’t know of any practical way. The best we could do, I think, is to
determine that a document is dissimilar to any that have been found so far to
be responsive. Even that, however, would be a challenge
to convert into any meaningful measure of the completeness of a production,
let alone a practical one.
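The closest practical proxy I can imagine looks something like the following sketch: flag a candidate document as potentially carrying new information when its word overlap with every responsive document found so far falls below some threshold. The tokenizer, the similarity measure, and the threshold here are all illustrative choices, not a validated method.

```python
import re

def tokens(text: str) -> set:
    """Crude bag-of-words tokenizer (illustrative only)."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity between two token sets."""
    return len(a & b) / len(a | b) if (a or b) else 0.0

def looks_novel(candidate: str, found_so_far: list, threshold: float = 0.3) -> bool:
    """True if the candidate is dissimilar to every responsive document
    found so far -- a rough stand-in for 'contains new information'."""
    cand = tokens(candidate)
    return all(jaccard(cand, tokens(doc)) < threshold for doc in found_so_far)

found = ["The merger closes in Q3 pending board approval.",
         "Board approval expected; merger to close in Q3."]
print(looks_novel("The merger closes in Q3 pending approval.", found))   # False: near-duplicate
print(looks_novel("Destroy the backup tapes before the audit.", found))  # True: dissimilar
```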
Recall does not measure the effectiveness of finding smoking guns or rare
documents
It is common in eDiscovery to say that smoking guns tend to
have friends. That is, they are
generally not unique. A representative
sample of documents has a good chance of catching smoking gun documents, if
they exist in a collection. But truly
rare documents can occur, and a sampling process is unlikely to find them. That is the definition of rare.
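The arithmetic behind that intuition is straightforward. A sketch, with a hypothetical collection size, number of rare documents, and sample size:

```python
def p_sample_contains_rare(collection_size: int,
                           rare_count: int,
                           sample_size: int) -> float:
    """Probability that a simple random sample of sample_size documents
    contains at least one of rare_count rare documents in a collection
    of collection_size (hypergeometric 'at least one' calculation)."""
    p_none = 1.0
    for i in range(sample_size):
        p_none *= (collection_size - rare_count - i) / (collection_size - i)
    return 1.0 - p_none

# Hypothetical: 1,000,000 documents, 5 truly rare smoking guns,
# and a 2,000-document random sample.
print(round(p_sample_contains_rare(1_000_000, 5, 2_000), 3))  # ~0.01
```

In other words, under these assumed numbers the sample has roughly a one percent chance of containing even one of the rare documents.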
The challenge of finding rare documents might be a criticism
of sampling, but it is not a criticism of Recall. No matter what process we employ, even
exhaustively reading all of the documents, truly rare documents necessarily
present a challenge to discovery. Many
documents in a collection are rare, but their rarity does not guarantee their
relevance. Rarity is not a value by
itself. Individual junk emails could also be rare and of no value at all to the
litigation.
If a document type is truly rare, then it is unlikely to be
encountered during the review process, or if it is encountered, it is unlikely
to be recognized. Since World War II, it
has been known
that humans have difficulty sustaining their attention in the face of rare
signals, an effect called "vigilance decrement."
Studies of human reviewers in eDiscovery confirm that people are
relatively poor at independently identifying responsive documents. We
found, for example, that only 28% of the documents identified by either of two professional reviewers were identified by both reviewers. When two reviewers disagree on
whether a document is responsive, one of them must be wrong.
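For clarity, the 28% figure is an overlap statistic: of the documents flagged by either reviewer, the proportion flagged by both. A small sketch with invented document sets:

```python
def mutual_overlap(reviewer_a: set, reviewer_b: set) -> float:
    """Of the documents identified by either reviewer, what proportion
    was identified by both (intersection over union)?"""
    either = reviewer_a | reviewer_b
    both = reviewer_a & reviewer_b
    return len(both) / len(either) if either else 1.0

# Invented example: each reviewer flags 100 documents, 44 of them in common.
a = set(range(0, 100))    # reviewer A's responsive calls
b = set(range(56, 156))   # reviewer B's responsive calls
print(round(mutual_overlap(a, b), 2))  # 0.28
```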
Documents do not have to be rare for human reviewers to miss
them. It is a common occurrence in eDiscovery that a category of documents is
not recognized until after many thousands of documents have been reviewed. Human review teams rarely go back and fix such
mistakes because doing so is simply too expensive.
Furthermore, truly rare documents are unlikely to appear in
our estimate of the truly responsive documents in a collection against which we
compute Recall. If they are not
encountered or if they are not recognized when they are encountered, they
cannot count either for or against Recall.
We would have no knowledge that they exist. Documents that we do not know about cannot affect any measurement. Moreover, it would be extremely difficult to
practically identify such unique documents in a large collection. Again, this is not a problem with
Recall, but with the search process in general. These documents might magically exist, but
none of the processes we have available are likely to find them. Again, that is the definition of rare. If they were easy to find, they would not be
a problem.
Recall is an “average” kind of measure. It is a characteristic of how a process
performs over the population of all documents in a collection. Each document may be unique in what makes it
relevant and in how important it is, but Recall captures the overall quality of
the process. Rare kinds of documents
contribute less than common kinds of documents.
According to decision theory, it is more difficult to accurately judge
rare events relative to more frequent events, whether that judgment is done by
a computer or by a human reader.
Recall does not measure importance
Recall treats each responsive document as making an equal
contribution to completeness. It treats
each responsive document found as a count toward either prevalence or
completeness. But documents are not
equal in their probative value. Could
there be a measure that takes account of the probative value of a
document? This would, of course, be a
different measure than Recall, addressing a different question.
Probativeness concerns an individual document’s contribution
to the case. It is not a measure of the
completeness of a process at finding responsive documents. A document has probative value if it raises
some new piece of evidence, but not if it is the tenth or hundredth document
providing that same information. It is
difficult to see how probativeness could be used as a measure of the success of
a predictive coding project rather than as a measure of an individual document
in that collection. We could not, for
example, simply sum up the probativeness of each document in the
collection. The probative value of a
document is contingent on the document and on the already discovered documents
in the collection.
Recall can be used to some extent in the context of
probativeness. Some predictive coding
projects, for example, compute separate Recall measures for “hot” documents,
the most important ones to the case, and merely responsive documents, the rest
of the responsive ones. This does not indicate a failure of Recall, but its
application to a special subset of responsive documents.
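A sketch of that stratified use of Recall, with hypothetical counts for the two categories:

```python
def recall(found: int, total: int) -> float:
    return found / total if total else 1.0

# Hypothetical estimates for a project that tracks "hot" documents
# separately from the merely responsive ones.
strata = {
    "hot":        {"found": 180,   "total": 200},
    "responsive": {"found": 7_500, "total": 10_000},
}

for name, counts in strata.items():
    print(name, round(recall(counts["found"], counts["total"]), 2))
# hot 0.9
# responsive 0.75
```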
As with responsiveness, we cannot know the probativeness of a
document before the discovery process. If
we did, we would not need to conduct the eDiscovery process. Some analysis needs to be conducted to assess
probativeness, and it may take the development of new approaches to machine
learning to automate the estimation of a document’s probative value. The probativeness of a document, though, is
not contained solely within the document, but in the relationship between a
document’s content and other sources of information. Any process directed at automating the
assessment of probativeness will have to include much more information than
that contained within a document or even a document collection. As mentioned earlier, measuring the
information content of a document is itself difficult; measuring the document’s
relation to the facts and needs of the case is, at least for the present,
impossible.
If we knew the probativeness of each document, then we could
use that information to weight our Recall.
Unfortunately, at this point, wishing for a measure of probativeness is
just magical thinking. Someday, we may
be able to automate its assessment, but until we have an automated measure, basing
an assessment on probativeness seems unlikely to be anything more than wishful
thinking.
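Purely as a thought experiment, if per-document probative weights did exist, a weighted variant of Recall could be computed along the following lines. The weights here are invented, and nothing in current practice supplies them, which is exactly the problem.

```python
def weighted_recall(weights_found: list, weights_all: list) -> float:
    """Hypothetical probativeness-weighted Recall: the share of the total
    probative weight carried by the documents that were found."""
    total = sum(weights_all)
    return sum(weights_found) / total if total else 1.0

# Invented weights for six responsive documents (higher = more probative);
# suppose the process finds the first four.
all_weights   = [5.0, 3.0, 1.0, 1.0, 0.5, 0.5]
found_weights = all_weights[:4]

print(round(len(found_weights) / len(all_weights), 2))        # plain Recall: 0.67
print(round(weighted_recall(found_weights, all_weights), 2))  # weighted Recall: 0.91
```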
Furthermore, I don’t think that that is what the winnowing
process is all about. Would it be
reasonable for a producing party to say, “we are only producing a small
percentage of these documents, but these are the most probative ones?” Would such a production be compatible with
FRCP Rule 26(g) (requirement for complete and correct productions after
reasonable inquiry)? Could the producing
party even judge which documents would be most probative to the requesting
party? Is not the probative value of
documents part of the essential legal reasoning in a case?
The status of Recall
We can make up imaginary situations where Recall fails to assess
the reasonableness of our selection process, but these situations are contrived
and simply not realistic. For example,
one commonly suggested scenario is that one process will find more total
responsive documents and thus have higher Recall than another while the second
process finds fewer documents (lower Recall), but better ones.
This scenario, it seems to me, is unlikely to actually
occur. In order for one system to have
lower Recall than another, but still find a substantial number of better
documents (a) there have to be a substantial number of better documents to
find, (b) the lower-Recall system would have to miss a substantial number of
documents found by the other process, and (c) we would have to find evidence of
these other documents. Generally
speaking, an eDiscovery activity uses only one kind of eDiscovery process,
though sometimes keywords are used on the same set as predictive coding. Parties have speculated that there might be
substantial numbers of documents detected by the keywords that were not
identified by predictive coding, but such claims have remained mostly speculation (e.g., Dynamo
Holdings).
If such a scenario could happen, there might be some
abstract sense in which we would prefer the lower-Recall process over the
higher-Recall process. The production from the lower-Recall system in this
scenario, though, is less complete than the one from the higher-Recall system. According to
this scenario, it misses a large number of responsive documents that are found
by the higher-scoring process.
Finally, how could we know? We do not have access to some
catalog of ultimate truth about the responsiveness of documents. How could we tell that the system produced
better quality documents without running the comparison (i.e., doing predictive
coding twice) and without having found the more valuable documents? We can imagine a situation where we have a god's-eye view of the true nature of documents, but in reality, we can only know what we observe.
Often the objections to the use of Recall seem to be thinly
veiled arguments that human review is somehow superior to computer-assisted
review. Some people still cling to the
view that human review is the gold standard, that it is better to have a team of reviewers spend many hours over many
months reviewing documents because somehow we will get results that we cannot
get using any other approach. There is
no empirical support for such a claim.
Many studies find that reviewers are inconsistent when
making independent judgments about the responsiveness of documents. I know of no studies, or even cases, that have
found that people are better at finding rare documents or smoking guns than
computer assisted review is. Some
lawyers may think that they are somehow better at identifying responsive
documents than the statistics of human review would imply, but these lawyers
are probably over-estimating their ability (the overconfidence
effect) and they are unlikely to be the ones who actually do review the
documents during the winnowing process.
Some lawyers are surely above average at recognizing responsive
documents, but not all of them can be.
And the average seems actually to be rather low.
It seems clear that if complete and correct productions are
the goal, then we need measures of completeness and correctness. Completeness is clearly indicated by Recall,
but correctness depends on the validity of the decisions made during the review
process. Correctness is much more
affected by the people using the technology than by the technology itself.
Obviously, if we produce all of the responsive documents,
then we must be producing the correct ones as well. The closer the production
is to complete, the closer it must be to correct.
Rule 26(g) also refers to reasonable inquiry. Any process we demand must be practical to
execute. No eDiscovery process is likely
to be perfect. Hypothetical processes
that demand information that is not practically obtainable may be useful for
making abstract arguments, but they are unlikely to find any useful role in
litigation. As long as we are interested
in completeness, then I think that our focus will remain on the measure of that
completeness—Recall and its analogs.