Computer Assisted Review (sometimes called Technology
Assisted Review, or more particularly, Predictive Coding) is a process used to
separate responsive documents in a collection from the non-responsive ones. There
are several products and approaches to this general classification problem,
many of which appear to do at least a reasonably good job. These processes depend on some form of supervised machine learning, in which the computer learns from labeled examples how to distinguish between responsive and non-responsive documents.
We are starting to understand the variables that affect the quality of the output of Computer Assisted Review systems, and these variables should play a role in choosing the appropriate technology for each specific situation. Among them are:
(1) the technology used
(2) the skill with which the technology is applied
(3) the prevalence of responsive documents in the collection
(4) the subject-matter expertise of the reviewer(s) training the system
Much of the writing about predictive coding has been about
the technology used to implement it, but how that technology is deployed can
also have a profound effect. For example, the EDI/Oracle
study found that “Technology providers using similar underlying technology,
but different human resources, performed in both the top-tier and bottom-tier
of all categories.”
The prevalence of responsive documents in a collection can also affect the outcome of computer assisted review. The more common responsive documents are, the easier it is to train a computer to recognize them; when they are rare, more effort is typically needed. At 1% prevalence, for example, finding 100 responsive training examples by simple random sampling would require reviewing roughly 10,000 documents on average.
The contribution of subject matter expertise is mostly common sense. If the people training a computer assisted review system cannot reliably tell the difference between responsive and non-responsive documents, then a computer trained to mimic their responses will also do poorly.
Ultimately every predictive coding system is a tool used by
a party to evaluate and classify a set of documents. The computer implements and generalizes the
input it receives from the humans who train it, using a “supervised machine
learning” procedure. The quality and nature of the supervision (the input) it receives from its human trainers are essential to determining the categorization pattern it produces. By itself, the computer
knows nothing of the categories it is trying to reproduce. Its only source of information about what
makes a document responsive is through the examples it receives. During training, the computer compares its
predicted category for a training document with the category provided to it and
adjusts its predictions to better agree with the categories provided through
supervision. Training is finished when the
computer’s predictions best match the example categorization it has been given.
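As a rough illustration of that training step (the documents, labels, and the scikit-learn pipeline below are invented for the example; actual predictive coding products use their own, often proprietary, implementations), the supervised learning might look something like this:

# Minimal sketch of supervised learning for responsiveness classification.
# The documents and labels are invented placeholders, not real data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_docs = [
    "Q3 pricing agreement with Acme, see attached terms",
    "Lunch on Friday? The usual place works for me",
    "Draft of the indemnification clause for the Acme contract",
    "Fantasy football picks for this week",
]
train_labels = ["responsive", "non-responsive", "responsive", "non-responsive"]

# Fitting adjusts the model's weights so that its predictions agree as closely
# as possible with the labels supplied by the human reviewer -- the
# "adjustment" described above.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(train_docs, train_labels)

# Once trained, the model generalizes the reviewers' input to unseen documents.
print(model.predict(["Revised pricing terms for the Acme deal"]))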
Supervised machine learning is susceptible to what can be
called “class noise” or “label noise” (see Frénay & Kabán, 2014). People reviewing a given document do not
always apply the same category (class or label) to it. In fact, such disagreements are quite common
in eDiscovery. According to one study,
two reviewers, working independently, may disagree on as many as half of the
documents in a sample (Roitblat, Kershaw, & Oot, 2010). When people disagree as to whether a given
document is responsive, one of them must be wrong. This inconsistency is class noise.
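As a small illustration of how such disagreement can be quantified (the coding decisions below are invented placeholders; Roitblat, Kershaw, and Oot report their own measures), two reviewers' labels on the same sample can be compared directly:

# Two reviewers' independent coding of the same eight documents
# (invented placeholder labels: "R" = responsive, "N" = non-responsive).
reviewer_a = ["R", "N", "R", "R", "N", "N", "R", "N"]
reviewer_b = ["R", "R", "N", "R", "N", "R", "N", "N"]

pairs = list(zip(reviewer_a, reviewer_b))
agreement = sum(a == b for a, b in pairs) / len(pairs)

# Overlap on the responsive side (a Jaccard-style measure): of the documents
# either reviewer called responsive, how many did both call responsive?
both_r = sum(a == b == "R" for a, b in pairs)
either_r = sum("R" in (a, b) for a, b in pairs)
overlap = both_r / either_r if either_r else 0.0

print(f"agreement = {agreement:.2f}, overlap = {overlap:.2f}")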
Class noise is not limited to disagreement between
reviewers. Even the same reviewer may
behave inconsistently from one time to the next. In a recent conversation, for example, an attorney confided that she had been looking at a document and wondered who had ever called it responsive. On further examination, she found that she was the one who had done it.
Sculley and Cormack (2008) confirm that some machine
learning systems are more sensitive to class noise than others, but the
accuracy of every system tested declines as inconsistency of training
categories (class noise) increases. It is somewhat ironic that they also found one of the simplest machine learning approaches, the so-called Naïve Bayes Classifier, to be the most resistant to this noise of any system they tested. On the other hand, Naïve Bayes Classifiers are commonly used as spam filters, typically to good effect, so perhaps it is not so surprising after all.
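A back-of-the-envelope way to see this kind of effect for yourself (this is only a sketch, not a replication of Sculley and Cormack's experiments; it uses the public 20 Newsgroups dataset, which scikit-learn downloads on first use, as a stand-in for a document collection) is to flip a fraction of the training labels at random and watch accuracy fall:

# Simulate class noise by randomly flipping a fraction of training labels,
# then compare how two classifiers degrade as the noise level rises.
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB

cats = ["sci.space", "rec.autos"]            # stand-ins for responsive / non-responsive
train = fetch_20newsgroups(subset="train", categories=cats)
test = fetch_20newsgroups(subset="test", categories=cats)

vec = TfidfVectorizer()
X_train, X_test = vec.fit_transform(train.data), vec.transform(test.data)
rng = np.random.default_rng(0)

for noise in (0.0, 0.1, 0.2, 0.3):
    y = np.array(train.target)
    flip = rng.random(len(y)) < noise        # mislabel roughly this fraction
    y[flip] = 1 - y[flip]
    for clf in (MultinomialNB(), LogisticRegression(max_iter=1000)):
        acc = accuracy_score(test.target, clf.fit(X_train, y).predict(X_test))
        print(f"noise={noise:.1f}  {type(clf).__name__}: accuracy={acc:.3f}")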
Naïve Bayes Classifiers are typically the whipping boy of
machine learning comparisons. They are
simple, assuming that each word in a
document can be treated as an independent piece of evidence as to how to
categorize the document. Although this assumption is clearly false (hence the label “naïve”; word occurrences are, in fact, correlated), these systems are nonetheless often remarkably effective.
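To make the independence assumption concrete, here is a toy, hand-rolled Naïve Bayes scorer (the word counts and priors are invented for the example): each word contributes its own piece of log-evidence, and a document's score for a class is simply the sum of those contributions plus the class prior.

import math
from collections import Counter

# Word counts observed in (hypothetical) responsive vs. non-responsive training documents.
counts = {
    "responsive":     Counter({"contract": 40, "pricing": 30, "lunch": 2,  "football": 1}),
    "non-responsive": Counter({"contract": 3,  "pricing": 2,  "lunch": 25, "football": 20}),
}
priors = {"responsive": 0.3, "non-responsive": 0.7}
vocab = set().union(*counts.values())

def log_likelihood(word, cls, alpha=1.0):
    # Laplace-smoothed estimate of P(word | class).
    total = sum(counts[cls].values())
    return math.log((counts[cls][word] + alpha) / (total + alpha * len(vocab)))

def classify(words):
    # "Naive" step: treat each word as independent evidence and just add the logs.
    scores = {cls: math.log(priors[cls]) + sum(log_likelihood(w, cls) for w in words)
              for cls in counts}
    return max(scores, key=scores.get)

print(classify(["pricing", "contract", "lunch"]))   # -> 'responsive'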
Some eDiscovery systems try to minimize the amount of class
noise by having one person provide the training examples. Then the noise consists solely of that
person’s inconsistency from time to time.
Other systems use multiple trainers to provide examples. Under those circumstances, the noise would
consist of each person’s inconsistency over time as well as the inconsistency
between individuals.
One concern that is sometimes raised about using a single
reviewer is that relying on one person to provide the training could result in
systematic errors in the selection of responsive documents. This risk is more theoretical than real,
however, because most cases have one person directing the team and that person
specifies the intention of the review. The
reviewers are tasked with identifying the documents that the director
specifies. Therefore, a team of
reviewers is about as likely to be systematically wrong as is a single
authoritative reviewer. Cost is sometimes another concern. Depending on their relative costs, a single reviewer may be more or less expensive than a team of reviewers. Different situations may call for different approaches, but it is inescapable that the more people feeding examples to the system, the more inconsistent, and therefore noisier, the training examples will be.
The likelihood of systematic error can be reduced by having
a subject-matter expert label the examples used for training. Because the computer comes to mimic the
choices in its training examples, the better the quality of those examples, the
better the quality of the resultant categorization system.
An expert can be designated as the topic authority and treated as the source of truth. With multiple reviewers, a topic authority is often used to instruct the reviewers and resolve disagreements.
In eDiscovery, one person (the person who signs the 26(g)
declaration) takes on the responsibility for declaring that a reasonable
search has been conducted. That person
would appear to be an appropriate candidate for ultimately deciding whether a
document is responsive or not. In
practice, that authority is often delegated.
When using multiple reviewers to prepare the training
examples, there are several methods that can be used to reduce the class noise
in their choices. Generically, this
process is sometimes called “truth discovery.”
The topic authority is designated as the source of truth and
noisy training examples can be submitted to this authority for a definitive
decision. But it is not immediately obvious how to identify which examples are the noisy ones (see Brodley & Friedl, 1999, for a suggestion). Each
document is typically seen by only one person in first-pass review. The topic authority could sample among these
documents and provide feedback to the reviewer.
This approach recognizes that each reviewer is then a kind of supervised
learning system, albeit one based on wetware (brains) rather than on software.
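One way to flag candidate noisy examples for the topic authority, loosely in the spirit of Brodley and Friedl's suggestion (this sketch uses invented documents and labels and a generic scikit-learn classifier; it is an illustration, not their exact method), is to compare each reviewer label against a cross-validated prediction and escalate only the disagreements:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# First-pass coding decisions (invented placeholders); 1 = responsive, 0 = non-responsive.
docs = [
    "pricing terms for the acme contract",
    "acme contract indemnification draft",
    "revised acme pricing schedule",
    "acme agreement signature page",
    "acme contract renewal terms",          # deliberately mislabeled below
    "lunch on friday at the usual place",
    "fantasy football league standings",
    "weekend hiking trip photos",
    "office birthday cake in the kitchen",
    "parking garage closed on saturday",
]
labels = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])

X = TfidfVectorizer().fit_transform(docs)
predicted = cross_val_predict(LogisticRegression(max_iter=1000), X, labels, cv=3)

# Documents whose out-of-fold prediction disagrees with the reviewer's label
# are candidates for adjudication by the topic authority.
suspect = np.flatnonzero(predicted != labels)
print(f"{len(suspect)} of {len(labels)} training labels flagged for review")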
Disagreements between reviewers and the topic authority are
likely to be common, and would have to be resolved to reduce the class noise. The computer would then ultimately be trained
on the resolved examples. If the topic
authority samples many documents and the reviewer has to change many decisions,
this approach could become expensive.
Many documents will have to be read multiple times by the reviewer, the
topic authority, and then again by the reviewer after adjusting his or her
judgment criteria. The topic authority
again provides the truth for the training examples, but additional noise is
added by the sampling process by which the documents are selected for the topic
authority and by the reviewer’s imperfect implementation of the authority’s
decisions.
Other approaches also require documents to be read more than
once. All or a sample of documents could
be read by more than one reviewer. If
all of the reviewers agree on how to categorize a document, then that decision
could be taken to be the “truth.” If
there is disagreement, then that document could be excluded from the training
set (meaning that many documents will be reviewed and not used), or it could be
submitted to the topic authority for a definitive judgment.
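A minimal sketch of this agreement-based filtering (the documents and coding decisions are invented placeholders): unanimous documents become training examples, and the rest are either excluded or escalated to the topic authority.

# Each document's coding decisions from three reviewers ("R" / "N").
codes = {
    "DOC-001": ["R", "R", "R"],
    "DOC-002": ["R", "N", "R"],
    "DOC-003": ["N", "N", "N"],
    "DOC-004": ["N", "R", "N"],
}

# Keep only unanimous decisions as training "truth."
unanimous = {doc: votes[0] for doc, votes in codes.items() if len(set(votes)) == 1}
# Everything else is either dropped from the training set or escalated.
escalate = [doc for doc in codes if doc not in unanimous]

print("training examples:", unanimous)                   # DOC-001, DOC-003
print("exclude or send to topic authority:", escalate)   # DOC-002, DOC-004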
Voting is another approach to defining the true status of a
document. Say that four reviewers each
read a document and three of them say that it is responsive, but one says that
it is not. A voting scheme would call
that document truly responsive. But is
each reviewer’s opinion of equal value?
What if one of those reviewers is the topic authority?
A standard voting scheme treats all reviewers as equal, but we
may be able to achieve better results if some of the reviewers are more
reliable than others and we weight the opinion of the reviewers by their
reliability. If a reviewer with more
expertise called a document non-responsive when three other reviewers called it
responsive, it might actually be more accurate to call the document
non-responsive.
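A small sketch of the difference (the reviewers, votes, and weights are invented; in practice the weights would have to come from some measure of reviewer reliability): with equal weights the document above is called responsive, but giving the most expert reviewer a large enough weight reverses the call.

# One document's votes from four reviewers (invented placeholders).
votes = {"reviewer_1": "R", "reviewer_2": "R", "reviewer_3": "R", "topic_authority": "N"}
weights = {"reviewer_1": 1.0, "reviewer_2": 1.0, "reviewer_3": 1.0, "topic_authority": 4.0}

def tally(votes, weights=None):
    totals = {}
    for reviewer, label in votes.items():
        w = 1.0 if weights is None else weights[reviewer]
        totals[label] = totals.get(label, 0.0) + w
    return max(totals, key=totals.get)

print("simple majority:  ", tally(votes))             # -> 'R' (3 votes to 1)
print("weighted majority:", tally(votes, weights))    # -> 'N' (4.0 to 3.0)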
The expertise of each reviewer may not be apparent at the
start of a review and we may have to simultaneously estimate the reliability of
the reviewers while we are assessing the true category of the documents. There are several ways this estimation can be conducted; most involve an iterative process that looks at both the reliability of the reviewers and the consistency of the decisions about individual documents (see Li et al., 2016). These truth-discovery methods generally
result in higher levels of accuracy than simple voting schemes, or than leaving
the noise in the training set.
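A stripped-down sketch of such an iterative scheme (the votes are invented placeholders, and the real methods surveyed by Li et al. add probabilistic models, smoothing, and convergence tests) alternates between estimating the labels by reliability-weighted vote and re-estimating each reviewer's reliability as agreement with those labels:

# Votes from three reviewers on four documents ("R" = responsive, "N" = non-responsive).
votes = {
    "DOC-1": {"alice": "R", "bob": "R", "carol": "N"},
    "DOC-2": {"alice": "N", "bob": "R", "carol": "N"},
    "DOC-3": {"alice": "R", "bob": "N", "carol": "R"},
    "DOC-4": {"alice": "N", "bob": "N", "carol": "N"},
}
reliability = {"alice": 1.0, "bob": 1.0, "carol": 1.0}  # start by trusting everyone equally

for _ in range(10):
    # Step 1: estimate each document's label by reliability-weighted vote.
    truth = {}
    for doc, doc_votes in votes.items():
        scores = {}
        for reviewer, label in doc_votes.items():
            scores[label] = scores.get(label, 0.0) + reliability[reviewer]
        truth[doc] = max(scores, key=scores.get)
    # Step 2: re-estimate reliability as the fraction of documents on which the
    # reviewer matches the current truth estimates (lightly smoothed to avoid zeros).
    for reviewer in reliability:
        agree = sum(votes[doc][reviewer] == truth[doc] for doc in votes)
        reliability[reviewer] = (agree + 1) / (len(votes) + 2)

print(truth)
print(reliability)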
Conclusion
Even though many predictive coding tools yield respectable
results, they do have differences. Among
these differences is their sensitivity to class noise (inconsistency) in the
training set. As you might expect, the greater the inconsistency in the coding of the training documents, the poorer the performance of the machine learning system. This inconsistency has rarely been examined in eDiscovery, but we do have enough information to say that the greater the number of people categorizing the training documents, the higher the expected level of inconsistency in their judgments (i.e., the higher the noise). Truth-discovery methods could be used to
reduce the class noise, but these methods can become expensive.
Not every eDiscovery case merits detailed examination of the
accuracy of the search process. But
knowledge of the variables that affect that accuracy can help to select the
right tools and methods. In addition to
the variables described in the introduction to this article, we should add (5)
class noise in the training set.
References
Brodley, C. E. & Friedl, M. A. (1999). Identifying mislabeled training data. Journal of Artificial Intelligence Research, 11, 131-167. https://www.jair.org/media/606/live-606-1803-jair.pdf
Frénay, B. &
Kabán, A. (2014). A Comprehensive
Introduction to Label Noise. In ESANN 2014 proceedings, European Symposium
on Artificial Neural Networks, Computational Intelligence and Machine Learning.
Bruges (Belgium), 23-25 April 2014. http://www.cs.bham.ac.uk/~axk/ESANNtut.pdf
Li, Y., Gao, J., Meng, C., Li, Q., Su, L., Zhao, B., Fan, W. & Han, J. (2016). A Survey on Truth Discovery. SIGKDD Explorations Newsletter, 17(2), 1-16. http://arxiv.org/pdf/1505.02463.pdf
Roitblat, H. L.,
Kershaw, A. & Oot, P. (2010). Document Categorization in Legal
Electronic Discovery: Computer Classification vs. Manual Review. Journal
of the American Society for Information Science and Technology, 61(1):70-80.
Sculley, D. & Cormack, G. V. (2008). Filtering spam in the
presence of noisy user feedback. In Proceedings
of the 5th Conference on Email and Anti-Spam (CEAS 2008). http://www.eecs.tufts.edu/~dsculley/papers/noisySpam.pdf