Tuesday, April 12, 2016

Predictive Coding with Multiple Reviewers Can Be Problematic: And What You Can Do About It

Herbert L. Roitblat, Ph.D.

Computer Assisted Review (sometimes called Technology Assisted Review, or more particularly, Predictive Coding) is a process used to separate responsive documents in a collection from the non-responsive ones. There are several products and approaches to this general classification problem, many of which appear to do at least a reasonably good job.  These processes depend on one or another kind of supervised machine learning where the computer learns from labeled examples how to distinguish between responsive and non-responsive documents.

We are starting to understand the variables that affect the quality of the output of Computer Assisted Review systems.  These are the variables that should play a role in choosing the appropriate technology for each specific situation. 

Among these variables are:
  1. the technology used
  2. the skill with which the technology is applied
  3. the prevalence of responsive documents in the collection
  4. the subject-matter expertise of the reviewer(s) training the system


Much of the writing about predictive coding has been about the technology used to implement it, but how that technology is deployed can also have a profound effect. For example, the EDI/Oracle study found that “Technology providers using similar underlying technology, but different human resources, performed in both the top-tier and bottom-tier of all categories.”

The prevalence of responsive documents in a collection can also affect the outcome of computer assisted review.  The more common responsive documents are, the easier it is to train a computer to recognize them.  More effort is typically needed when responsive documents are rare than when they are common.

The contribution of subject-matter expertise is mostly common sense.  If the people training a computer assisted review system do not know how to tell the difference between responsive and non-responsive documents, then a computer trained to mimic their responses will also do poorly.

Ultimately, every predictive coding system is a tool used by a party to evaluate and classify a set of documents.  The computer implements and generalizes the input it receives from the humans who train it, using a “supervised machine learning” procedure.  The quality and nature of the supervision (the input) it receives from its human trainers are essential to determining the categorization pattern it produces. By itself, the computer knows nothing of the categories it is trying to reproduce.  Its only source of information about what makes a document responsive is the examples it receives.  During training, the computer compares its predicted category for a training document with the category provided to it and adjusts its predictions to better agree with the categories provided through supervision.  Training is finished when the computer’s predictions best match the example categorizations it has been given.
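
As a rough illustration of this training loop, the sketch below fits a text classifier to a handful of example documents labeled responsive or non-responsive. The library (scikit-learn), the invented example documents, and the pipeline choices are illustrative assumptions, not a description of any particular eDiscovery product.

```python
# A minimal sketch of supervised learning for responsiveness, using scikit-learn.
# The example documents and labels are invented; a real matter would use
# reviewer-coded training documents drawn from the actual collection.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_docs = [
    "Q3 revenue forecast attached for the merger discussion",
    "lunch on Friday?",
    "draft term sheet for the acquisition, please review",
    "fantasy football picks for this week",
]
train_labels = ["responsive", "non-responsive", "responsive", "non-responsive"]

# The model adjusts its parameters so that its predictions agree with the
# reviewer-supplied labels -- the "supervision" in supervised learning.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(train_docs, train_labels)

# Once trained, the model generalizes to documents it has not seen.
print(model.predict(["revised acquisition forecast for the board"]))
```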

Supervised machine learning is susceptible to what can be called “class noise” or “label noise” (see Frénay & Kabán, 2014).  People reviewing a given document do not always apply the same category (class or label) to it.  In fact, such disagreements are quite common in eDiscovery.  According to one study, two reviewers, working independently, may disagree on as many as half of the documents in a sample (Roitblat, Kershaw, & Oot, 2010).  When people disagree as to whether a given document is responsive, one of them must be wrong.  This inconsistency is class noise.

Class noise is not limited to disagreement between reviewers.  Even the same reviewer may behave inconsistently from one time to the next.  For example, in a recent conversation, an attorney confided that she had been looking at a document and wondered who had ever called it responsive.  On further examination, she found that she was the one who had done it.

Sculley and Cormack (2008) confirm that some machine learning systems are more sensitive to class noise than others, but the accuracy of every system they tested declines as the inconsistency of the training categories (class noise) increases.  It is somewhat ironic that one of the simplest machine learning approaches, the so-called Naïve Bayes classifier, is the most resistant to this noise of any that they tested.  On the other hand, Naïve Bayes classifiers are commonly used as spam filters, typically to good effect, so maybe it is not so surprising after all.

Naïve Bayes classifiers are typically the whipping boy of machine learning comparisons.  They are simple, assuming that each word in a document can be treated as an independent piece of evidence about how to categorize the document.  Although this assumption is clearly false (hence the label “naïve”; word occurrences are, in fact, correlated), these systems are also often said to be remarkably effective.
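
To make the “naïve” independence assumption concrete, here is a toy scoring function that treats each word as an independent piece of evidence, adding up per-word log-probabilities for each class.  The probability tables are invented for illustration; a real classifier would estimate them from the labeled training documents.

```python
import math

# Toy Naive Bayes scoring: each word contributes an independent piece of
# log-likelihood evidence. The probabilities below are invented; a real
# classifier would count them from the labeled training documents.
prior = {"responsive": 0.3, "non-responsive": 0.7}
word_prob = {
    "responsive": {"merger": 0.020, "forecast": 0.015, "lunch": 0.001},
    "non-responsive": {"merger": 0.001, "forecast": 0.002, "lunch": 0.020},
}

def naive_bayes_score(words, label, unseen=1e-4):
    """log P(label) + sum of log P(word | label), assuming words are independent."""
    score = math.log(prior[label])
    for word in words:
        score += math.log(word_prob[label].get(word, unseen))
    return score

document = ["merger", "forecast"]
scores = {label: naive_bayes_score(document, label) for label in prior}
print(max(scores, key=scores.get))  # the higher-scoring class wins: "responsive"
```
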
[Figure: Classification accuracy under different levels of class noise (labeling inconsistency) for different machine learning tools.  A noise level of 0.25 means that 25% of the training documents were purposely misclassified before training.  See Sculley and Cormack (2008) for details, including the specific machine learning methods used.  Sculley and Cormack report a measure of how much each system is negatively affected by noise (1 – ROCA); this graph shows the complementary measure of accuracy (ROCA).]
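
Readers who want to see the effect for themselves can run a small-scale version of the same kind of experiment: deliberately flip a fraction of the training labels, retrain, and measure accuracy on a clean test set.  The sketch below uses scikit-learn and the 20 Newsgroups corpus purely as a stand-in for a document collection; Sculley and Cormack’s data, classifiers, and metric (ROCA) were different.

```python
# A rough, small-scale analogue of a label-noise experiment: flip a chosen
# fraction of training labels and watch test accuracy fall. The 20 Newsgroups
# corpus stands in here for an eDiscovery collection.
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB

cats = ["sci.space", "rec.autos"]  # two classes standing in for responsive / not
train = fetch_20newsgroups(subset="train", categories=cats)
test = fetch_20newsgroups(subset="test", categories=cats)

vec = TfidfVectorizer()
X_train, X_test = vec.fit_transform(train.data), vec.transform(test.data)

rng = np.random.default_rng(0)
for noise in [0.0, 0.1, 0.25, 0.4]:
    y = np.array(train.target)
    flip = rng.random(len(y)) < noise   # choose a fraction of labels to flip
    y[flip] = 1 - y[flip]
    model = MultinomialNB().fit(X_train, y)
    acc = accuracy_score(test.target, model.predict(X_test))
    print(f"label noise {noise:.0%}: test accuracy {acc:.3f}")
```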



Some eDiscovery systems try to minimize the amount of class noise by having one person provide the training examples.  Then the noise consists solely of that person’s inconsistency from time to time.  Other systems use multiple trainers to provide examples.  Under those circumstances, the noise would consist of each person’s inconsistency over time as well as the inconsistency between individuals. 

One concern that is sometimes raised about using a single reviewer is that relying on one person to provide the training could result in systematic errors in the selection of responsive documents.  This risk is more theoretical than real, however, because most cases have one person directing the team, and that person specifies the intention of the review.  The reviewers are tasked with identifying the documents that the director specifies.  Therefore, a team of reviewers is about as likely to be systematically wrong as a single authoritative reviewer.  Cost is sometimes another concern.  Depending on their relative costs, a single reviewer may be more or less expensive than a team of reviewers.  Different situations may call for different approaches, but it is inescapable that the more people feeding examples to the system, the more inconsistent and, therefore, the noisier the training examples will be.

The likelihood of systematic error can be reduced by having a subject-matter expert label the examples used for training.  Because the computer comes to mimic the choices in its training examples, the better the quality of those examples, the better the quality of the resultant categorization system.

An expert can be selected to be a topic authority, who can be designated as the source of truth.  With multiple reviewers, a topic authority is often used to instruct the reviewers and resolve disagreements.  

In eDiscovery, one person (the person who signs the 26(g) declaration) takes on the responsibility for declaring that a reasonable search has been conducted.  That person would appear to be an appropriate candidate for ultimately deciding whether a document is responsive or not.  In practice, that authority is often delegated.
 
When using multiple reviewers to prepare the training examples, there are several methods that can be used to reduce the class noise in their choices.  Generically, this process is sometimes called “truth discovery.”

The topic authority is designated as the source of truth, and noisy training examples can be submitted to this authority for a definitive decision.  But it is not immediately obvious how to identify which examples are the noisy ones (see Brodley & Friedl, 1999, for a suggestion).  Each document is typically seen by only one person in first-pass review.  The topic authority could sample among these documents and provide feedback to the reviewer.  This approach recognizes that each reviewer is then a kind of supervised learning system, albeit one based on wetware (brains) rather than on software.
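
One way to surface candidate noisy examples, in the spirit of the filtering idea in Brodley and Friedl (1999), is to obtain a cross-validated model prediction for each training document and flag those where the model disagrees with the reviewer’s label; the flagged documents can then be routed to the topic authority.  The sketch below, with invented documents and scikit-learn as the assumed library, illustrates that generic idea rather than the authors’ exact procedure.

```python
# Flag possibly mislabeled training documents: if a cross-validated model
# disagrees with the reviewer's label, route that document to the topic
# authority. A generic filter in the spirit of Brodley & Friedl (1999).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline

# Invented first-pass review output: (document text, reviewer label),
# where 1 = responsive and 0 = non-responsive.
reviewed = [
    ("merger term sheet draft", 1), ("acquisition price forecast", 1),
    ("board deck on the merger", 1), ("merger diligence checklist", 1),
    ("office holiday party rsvp", 0), ("parking garage closed friday", 0),
    ("gym membership renewal", 0), ("merger closing timeline", 0),  # questionable label, for the demo
]
docs = [text for text, _ in reviewed]
labels = np.array([label for _, label in reviewed])

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
predicted = cross_val_predict(model, docs, labels, cv=4)

for doc, label, pred in zip(docs, labels, predicted):
    if label != pred:
        print(f"Send to topic authority: {doc!r} (reviewer said {label}, model says {pred})")
```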

Disagreements between reviewers and the topic authority are likely to be common, and would have to be resolved to reduce the class noise.  The computer would then ultimately be trained on the resolved examples.  If the topic authority samples many documents and the reviewer has to change many decisions, this approach could become expensive.  Many documents will have to be read multiple times by the reviewer, the topic authority, and then again by the reviewer after adjusting his or her judgment criteria.  The topic authority again provides the truth for the training examples, but additional noise is added by the sampling process by which the documents are selected for the topic authority and by the reviewer’s imperfect implementation of the authority’s decisions.

Other approaches also require documents to be read more than once.  All or a sample of documents could be read by more than one reviewer.  If all of the reviewers agree on how to categorize a document, then that decision could be taken to be the “truth.”  If there is disagreement, then that document could be excluded from the training set (meaning that many documents will be reviewed and not used), or it could be submitted to the topic authority for a definitive judgment. 

Voting is another approach to defining the true status of a document.  Say that four reviewers each read a document and three of them say that it is responsive, but one says that it is not.  A voting scheme would call that document truly responsive.  But is each reviewer’s opinion of equal value?  What if one of those reviewers is the topic authority?

A standard voting scheme treats all reviewers as equal, but if some reviewers are more reliable than others, we may achieve better results by weighting each reviewer’s opinion by his or her reliability.  If a reviewer with more expertise called a document non-responsive when three other reviewers called it responsive, it might actually be more accurate to call the document non-responsive.
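
The difference between a simple majority vote and a reliability-weighted vote is easy to express in code.  In the hypothetical below, three reviewers call a document responsive and one more reliable reviewer calls it non-responsive; the reliability figures and the log-odds weighting rule are illustrative choices, not a prescribed method.

```python
import math

# One document, four reviewers. True = responsive, False = non-responsive.
# Reliability estimates are invented for illustration.
votes = {"reviewer_a": True, "reviewer_b": True, "reviewer_c": True, "expert": False}
reliability = {"reviewer_a": 0.55, "reviewer_b": 0.60, "reviewer_c": 0.58, "expert": 0.95}

# Simple majority: every opinion counts equally.
simple_majority = sum(votes.values()) > len(votes) / 2

# Weighted vote: one common choice is to weight each vote by the log-odds of
# the reviewer's estimated reliability, so a highly reliable reviewer can
# outweigh several less reliable ones.
def log_odds(p):
    return math.log(p / (1 - p))

score = sum(log_odds(reliability[r]) * (1 if v else -1) for r, v in votes.items())
weighted_call = score > 0

print("simple majority says responsive:", simple_majority)  # True (3 of 4)
print("weighted vote says responsive:", weighted_call)      # False: the expert prevails
```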

The expertise of each reviewer may not be apparent at the start of a review, so we may have to estimate the reliability of the reviewers at the same time as we are assessing the true category of the documents.  There are several ways that this estimation can be conducted; most of them involve an iterative process that looks at the reliability of the reviewers and the consistency of the decisions about individual documents (see Li et al., 2016).  These truth-discovery methods generally result in higher levels of accuracy than simple voting schemes, or than leaving the noise in the training set.
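
Many of the truth-discovery methods surveyed by Li et al. alternate between two steps: estimate each document’s label with a reliability-weighted vote, then re-estimate each reviewer’s reliability from agreement with those estimates, and repeat until the answers settle.  The sketch below is a deliberately simplified version of that loop, with invented votes; the published methods are considerably more sophisticated.

```python
# A deliberately simplified truth-discovery loop: alternate between
# (1) estimating each document's label by a reliability-weighted vote and
# (2) re-estimating each reviewer's reliability from agreement with those
# estimates. Votes are invented; see Li et al. (2016) for real methods.
votes = {  # document -> {reviewer: responsive?}
    "doc1": {"ann": True,  "bob": True,  "cam": False},
    "doc2": {"ann": True,  "bob": False, "cam": False},
    "doc3": {"ann": False, "bob": False, "cam": False},
    "doc4": {"ann": True,  "bob": True,  "cam": True},
}
reviewers = {"ann", "bob", "cam"}
reliability = {r: 0.5 for r in reviewers}  # start with no preference

for _ in range(10):  # a few alternating passes; this toy example settles quickly
    # Step 1: weighted vote per document, given current reliability estimates.
    truth = {}
    for doc, doc_votes in votes.items():
        weight_yes = sum(reliability[r] for r, v in doc_votes.items() if v)
        weight_no = sum(reliability[r] for r, v in doc_votes.items() if not v)
        truth[doc] = weight_yes >= weight_no
    # Step 2: reliability = fraction of documents on which the reviewer agrees
    # with the current truth estimate (smoothed to stay strictly between 0 and 1).
    for r in reviewers:
        agree = sum(votes[d][r] == truth[d] for d in votes)
        reliability[r] = (agree + 1) / (len(votes) + 2)

print(truth)        # estimated "true" labels
print(reliability)  # estimated reviewer reliabilities
```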

Conclusion

Even though many predictive coding tools yield respectable results, they do have differences.  Among these differences is their sensitivity to class noise (inconsistency) in the training set.  As you might expect, the greater the inconsistency in the coding of the training documents, the poorer the performance of the machine learning system.  This inconsistency has rarely been examined in eDiscovery, but we do have enough information to say that the greater the number of people categorizing the training documents, the higher the expected level of inconsistency (i.e., noise) in their judgments.  Truth-discovery methods can be used to reduce the class noise, but they can become expensive.

Not every eDiscovery case merits detailed examination of the accuracy of the search process.  But knowledge of the variables that affect that accuracy can help to select the right tools and methods.  In addition to the variables described in the introduction to this article, we should add (5) class noise in the training set. 

References

Brodley, C. E., and Friedl, M. A. (1999). Identifying mislabeled training data. Journal of Artificial Intelligence Research, 11, 131-167. https://www.jair.org/media/606/live-606-1803-jair.pdf

Frénay, B.  & Kabán, A. (2014).  A Comprehensive Introduction to Label Noise.  In ESANN 2014 proceedings, European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning. Bruges (Belgium), 23-25 April 2014.  http://www.cs.bham.ac.uk/~axk/ESANNtut.pdf

Li, Y., Gao, J., Meng, C., Li, Q., Su, L., Zhao, B., Fan, W. & Han, J. (2016).  A Survey on Truth Discovery. SIGKDD Explorations Newsletter, 17(2), 1-16. http://arxiv.org/pdf/1505.02463.pdf

Roitblat, H. L.,  Kershaw, A. & Oot, P. (2010). Document Categorization in Legal Electronic Discovery: Computer Classification vs. Manual Review.  Journal of the American Society for Information Science and Technology, 61(1):70-80.

Sculley, D. & Cormack, G. V. (2008). Filtering spam in the presence of noisy user feedback. In Proceedings of the 5th Conference on Email and Anti-Spam (CEAS 2008). http://www.eecs.tufts.edu/~dsculley/papers/noisySpam.pdf