Tuesday, March 9, 2021

When to stop searching: An example from continuous active learning

 

Making reasonable decisions in eDiscovery, as elsewhere, requires that we have reasonable expectations about the consequences of those decisions.  One of those decisions is when to stop searching for relevant documents.  For example, in continuous active learning, a commonly used approach to identifying relevant eDiscovery documents, the search process continues until a reasonable level of completeness has been achieved.  The reasonableness of any given level is often judged by the cost of finding additional relevant documents.  But there seems to be a hidden, and erroneous, assumption that continued effort will eventually yield all of the relevant documents, and that the only question is what level of effort is reasonable in the context of a specific case.

The problem of when to stop searching is not limited to continuous active learning or to any other specific method of discovery.  Every search process necessarily entails comparable questions about when to stop.  My concern in this essay is not with the adequacy or efficiency of continuous active learning, but with what appears to be a fundamental misunderstanding of search effort and its consequences, a misunderstanding that undermines efforts to make reasonable decisions about that effort and about when to stop.  I use continuous active learning as a more or less typical example of eDiscovery search methods.

In continuous active learning, a classifier (often a machine learning process called a support vector machine) is trained to predict which documents in the collection are likely to be relevant.  The documents predicted most likely to be relevant are shown to reviewers, who either endorse or reverse the classifier's prediction.  The reviewers' decisions are then added to the training set, the classifier is retrained, and it again predicts relevance for the remaining documents, that is, those that have not yet been seen.  Once more, the documents predicted most likely to be relevant are presented to the reviewers, and the process repeats for some number of cycles.  The stopping rule determines when to terminate this process of classify-judge-predict cycles.
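
A minimal sketch of this loop, assuming scikit-learn's TfidfVectorizer and LinearSVC as the classifier, might look like the following.  The helpers get_reviewer_judgments and stopping_rule are hypothetical stand-ins for the human review step and for whatever stopping criterion a team adopts; this is an illustration of the classify-judge-predict cycle, not any vendor's implementation.

```python
# Illustrative sketch of a continuous active learning loop.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

def continuous_active_learning(documents, seed_labels, get_reviewer_judgments,
                               stopping_rule, batch_size=100, max_cycles=50):
    """One way the classify-judge-predict cycle could be organized (illustrative)."""
    vectorizer = TfidfVectorizer()
    features = vectorizer.fit_transform(documents)

    labeled = dict(seed_labels)   # document index -> 0/1 relevance judgment
    for _ in range(max_cycles):
        # Retrain on every judgment collected so far (the seed set must
        # contain at least one relevant and one non-relevant example).
        train_idx = sorted(labeled)
        classifier = LinearSVC()
        classifier.fit(features[train_idx], [labeled[i] for i in train_idx])

        # Score only the documents no reviewer has seen yet.
        unseen = [i for i in range(len(documents)) if i not in labeled]
        if not unseen:
            break
        scores = classifier.decision_function(features[unseen])

        # Present the highest-ranked unseen documents to the reviewers.
        ranked = sorted(zip(scores, unseen), reverse=True)
        batch = [idx for _, idx in ranked[:batch_size]]
        labeled.update(get_reviewer_judgments(batch))   # human judgment step

        if stopping_rule(labeled):   # e.g., a target level of completeness
            break
    return labeled
```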

The implicit assumption in this reasonableness judgment seems to be that if we just continue through enough cycles, eventually we will identify all of the relevant documents in the collection.  But that assumption is wrong and here’s why.

The success of any machine learning exercise depends on three things: the distinguishability of the data, the accuracy of the machine learning algorithm (for example, the support vector machine), and the quality of the human judgments that go into training and assessing the process.  To achieve complete Recall, all three of these error sources have to be reduced to zero.  Let’s take them in order.

The two graphs show hypothetical, simplified situations for a machine learning process that must distinguish the positive (orange) from the negative (blue) instances.  For any categorization system, the goal is to find a separator that puts all of the positive instances (orange dots) on one side and all of the negative instances (blue dots) on the other side of this separator.  That task is relatively simple for the first graph, where all of the positive instances are above and to the right of a diagonal line and only one negative instance would be included with them.

A relatively easy categorizing problem with little overlap between the positive (orange) set and the negative (blue) set.


In the second graph, the task is much more difficult because there is substantial overlap between the two groups.  No surface exists that will perfectly separate the positive from the negative instances.  A categorizer might reach 100% Recall with the data in the first graph, but it cannot do so with the data in the second without also including a lot of negative instances.  The distinction between positive and negative instances may be subtle, difficult, or obscure, and the ultimate accuracy of any decision system is limited by the ability of even a fully informed categorizer to make the right choices.
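
The point can be made concrete with a small, admittedly artificial simulation: the same linear support vector machine applied to two synthetic data sets, one with little overlap between the classes and one with heavy overlap.  The data and numbers are invented; the point is only that forcing 100% Recall on overlapping data drags in many negative instances.

```python
# Illustrative only: synthetic 2-D data, not real documents.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

def make_data(separation):
    """Two Gaussian clouds whose centers are `separation` apart in each dimension."""
    pos = rng.normal(loc=separation, scale=1.0, size=(200, 2))
    neg = rng.normal(loc=0.0, scale=1.0, size=(800, 2))
    return np.vstack([pos, neg]), np.array([1] * 200 + [0] * 800)

for name, sep in [("easy (little overlap)", 6.0), ("hard (heavy overlap)", 1.0)]:
    X, y = make_data(sep)
    clf = LinearSVC().fit(X, y)
    scores = clf.decision_function(X)

    # The threshold needed to recover every positive instance (100% Recall)...
    threshold = scores[y == 1].min()
    predicted = scores >= threshold
    # ...and how many negatives get swept in at that threshold.
    false_positives = int(predicted[y == 0].sum())
    print(f"{name}: 100% Recall requires accepting {false_positives} "
          f"of {int((y == 0).sum())} negative instances")
```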

Second, machine learning algorithms do not differ very much among themselves in their ability to learn categorizations.  Systems can differ, however, in the way that they represent the documents (for example, as individual words, as phrases, or as mathematically represented concepts).  They can differ in the number of training examples they need.  They may draw on different sources of information; for example, some may include metadata, while others use just the body of the document.  And they may differ in how they are deployed.  All of these factors can affect how well a system identifies the relevant documents, even if every system uses the very same underlying algorithm.  Conversely, if all of these other factors are held constant, different systems may give essentially the same level of accuracy, but that accuracy is seldom perfect.
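
As a small sketch of what “representation” can mean here, the snippet below shows the same two toy documents represented first as individual words and then as two-word phrases, using scikit-learn's TfidfVectorizer.  The documents and settings are invented for illustration; nothing here reflects any particular review platform.

```python
# Illustrative only: two toy "documents" represented two different ways.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the merger agreement was signed", "the agreement was never signed"]

# Representation 1: individual words (unigrams).
words = TfidfVectorizer(ngram_range=(1, 1))
words.fit(docs)
print("word features:  ", sorted(words.vocabulary_))

# Representation 2: two-word phrases (bigrams).
phrases = TfidfVectorizer(ngram_range=(2, 2))
phrases.fit(docs)
print("phrase features:", sorted(phrases.vocabulary_))
```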

A relatively difficult categorizing problem with substantial overlap between the positive (orange) set and the negative (blue) set.

The third major source of potential errors comes from the people doing the task.  The request for production may be vague.  The lead attorneys may not know precisely what they are looking for and so may make errors when instructing the line reviewers.  The line reviewers may have differing beliefs about what constitutes a relevant document.  They learn more about the subject matter as they go through the review, so their own decision patterns may change over time.  Most studies of human reviewer accuracy find that reviewer Recall is relatively poor, only about 50% and sometimes lower.  The parties will disagree about which documents are relevant.  Even the reviewers working for one party will disagree among themselves about what constitutes a relevant document.  A reviewer may disagree with his or her own earlier judgment and so make inconsistent judgments over time.

On top of all of this disagreement, people make mistakes.  Attention wanders.  A machine learning system depends on the human judgments it is given in order to learn which document features are associated with each category.  If these example documents are misclassified, the machine can learn to make incorrect decisions.  With enough training documents and enough reviewers, the system may still learn more or less correct classifications, but these inconsistencies can still lead it to make errors, which will limit its accuracy.
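
A rough simulation suggests the shape of this effect.  The sketch below uses invented, overlapping data and assumes that reviewers miss some fraction of the truly relevant training documents; the exact numbers are arbitrary, but the direction of the effect is the point: the noisier the training judgments, the lower the Recall the classifier reaches.

```python
# Illustrative only: reviewer false negatives in the training data limit Recall.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.metrics import recall_score

rng = np.random.default_rng(1)

# 500 relevant and 2000 non-relevant "documents" with overlapping features.
X = np.vstack([rng.normal(2.0, 1.0, size=(500, 2)),
               rng.normal(0.0, 1.0, size=(2000, 2))])
y_true = np.array([1] * 500 + [0] * 2000)

for miss_rate in [0.0, 0.2, 0.4]:
    # Reviewers miss a fraction of the truly relevant training documents.
    y_train = y_true.copy()
    missed = (y_true == 1) & (rng.random(len(y_true)) < miss_rate)
    y_train[missed] = 0

    clf = LinearSVC().fit(X, y_train)
    recall = recall_score(y_true, clf.predict(X))
    print(f"reviewer miss rate {miss_rate:.0%}: classifier Recall = {recall:.2f}")
```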

Continuous active learning highlights another aspect of the human factor in review.  At each step, the machine learning system ranks the documents that have not yet been seen.  Documents that have already been reviewed are generally not included in the set for which relevance is predicted.  So if a reviewer at one stage of the review incorrectly classifies a document as not relevant, that document will not be available for eventual production.  It will never be counted toward the Recall level of the process, no matter how much more effort is expended.  This is another factor that limits the ultimately achievable level of Recall.
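
The arithmetic is simple but worth making explicit.  Under the simplifying assumption that a document judged not relevant never re-enters the process, every relevant document a reviewer misses is permanently subtracted from the Recall the review can ever achieve.  The numbers below are made up purely for illustration.

```python
# Hypothetical numbers, for illustration only.
relevant_in_collection = 10_000      # truly relevant documents (unknown in practice)
relevant_seen_by_reviewers = 8_000   # relevant documents the ranking has surfaced
reviewer_miss_rate = 0.15            # fraction of seen relevant docs judged not relevant

# Relevant documents wrongly judged not relevant are never re-ranked or produced.
lost_forever = relevant_seen_by_reviewers * reviewer_miss_rate
best_possible_found = relevant_seen_by_reviewers - lost_forever

# Even if every remaining unseen relevant document were eventually surfaced
# and judged correctly, Recall could not exceed this ceiling:
unseen_relevant = relevant_in_collection - relevant_seen_by_reviewers
recall_ceiling = (best_possible_found + unseen_relevant) / relevant_in_collection
print(f"Recall ceiling: {recall_ceiling:.0%}")   # 88% with these made-up numbers
```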

Without considering the limits on the ultimate accuracy of any search, we will over-estimate the value of continuing that search.  Those limits depend on factors that cannot be overcome simply by searching further.

Up to this point in the review, the active ranking of documents has improved the accuracy of the reviewers, primarily by keeping their judgments consistent with the predictions of the categorizer.  As the process continues, however, the ranking of the remaining documents comes to be dominated by non-relevant documents, and the categorizer becomes of diminishing value to the reviewers.  Instead of relying on dependable predictions about which documents are relevant, the reviewers will have to make independent judgments.  The combination of the sparsity of the remaining relevant documents and the inaccuracy of the predictions will cause the reviewers’ accuracy to fall substantially from the level they had achieved, perhaps even below the expected level of a complete manual review.  Continued search will be not only less valuable but also less accurate than the effort that preceded it.  Unless these factors are carefully considered, there will be a very strong tendency to over-estimate the value of continued search and to impose an excessive burden on the producing party.
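
To put rough, purely hypothetical numbers on that diminishing value: as the richness of the remaining documents falls, the number of documents that must be reviewed to find each additional relevant one grows in proportion.

```python
# Hypothetical richness levels in the remaining, unreviewed documents.
# Richness = fraction of the remaining documents that are actually relevant.
for richness in [0.20, 0.05, 0.01, 0.001]:
    docs_per_relevant = 1 / richness
    print(f"remaining richness {richness:.1%}: about {docs_per_relevant:,.0f} "
          "documents reviewed per additional relevant document found")
```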