Wednesday, September 7, 2016

Understanding Dynamo Holdings Predictive Coding


The Dynamo Holdings case (Dynamo Holdings Limited Partnership v. Commissioner of Internal Revenue, 143 T.C. No. 9) and its use of predictive coding are marked by high levels of cooperation, unusual methods, and poor results. Among other things, this case shows how important the human factor is in predictive coding.

 Introduction

In September 2014, Judge Buch of the US Tax Court ruled that the petitioners (Dynamo Holdings, the producing party) were free to use predictive coding to identify responsive documents, noting that “the Court is not normally in the business of dictating to parties the process that they should use when responding to discovery.”  The petitioners chose to use predictive coding and worked out a protocol with the IRS to accomplish this.

The Commissioner originally requested that the petitioners produce all of the documents on two specified backup tapes.  The petitioners responded that fulfilling the Commissioner’s request would take many months and cost a significant amount, and that the documents would have to be reviewed for privilege and confidential information.  They also argued that the Commissioner’s approach amounted to a “fishing expedition” in search of new issues that could be raised against the petitioners.  Instead, they proposed to use predictive coding to reduce the document set to something more manageable.  Given their argument against providing the Commissioner with access to many potentially irrelevant documents, the protocol that they ended up agreeing to is rather curious.

As part of their negotiated protocol, the petitioners randomly selected two sets of 1,000  documents each, from a backup tape and from an Exchange database. The first set was to be used as a training set and the second as a validation or control set.   Both sets were coded by the Commissioner. 

Having the Commissioner (the receiving party) review the documents in the training set indicates a high level of cooperation (as noted by the Court), but it also shifts the burden to the receiving party and may not be sensible in all cases.  Even with a clawback agreement, this process exposes many of the very documents that the petitioners had sought to restrict in their response to the Commissioner’s original discovery request.

After training on these 1,000 documents, the petitioners reported that the predictive coding process was not performing well, so they had the Commissioner code an additional 1,200 documents that were drawn using some kind of judgmental sample to make the training set richer in responsive documents.  There is no further information regarding this supplemental training set.

The petitioners invited the Commissioner to review an additional set of 1,000 documents that they called a validation sample, but they also said that reviewing this sample would be unlikely to improve the model, and the Commissioner declined to review it.  These are puzzling statements.  First, a validation sample would presumably be used to assess the training process, not to train it, so it is unclear why the petitioners framed its value in terms of improving the model.  Second, if performance was poor, then the most obvious solution would be to provide more training, so why would they say that it would not improve the model, and why would a dissatisfied Commissioner decline to provide this small amount of additional training?

It is also puzzling that, if the petitioners knew and represented that the predictive coding model was ineffective, the parties did not stop there and try something else, perhaps a different training method, different documents, or a different predictive coding model.  The parties had been cooperating; why did that cooperation break down at this point?

Recall With a Substantial Dose of False Positives

The court noted that there is often a tradeoff between Precision and Recall: “A broad search that misses few relevant documents will usually capture a lot of irrelevant documents, while a narrower search that minimizes ‘false positives’ will be more likely to miss some relevant documents.”  Although this tradeoff is widely appreciated, this particular case provides an opportunity to make it more comprehensible.

The petitioners presented a table of estimated results at different levels of Recall.  It’s not entirely clear how this table was derived (but see Ringtail Visual Predictive Coding for a similar table).  I infer that the true positives, and therefore the Recall and Precision estimates, are based on the validation set (the second 1,000 documents) reviewed by the Commissioner.  This table, then, consists of estimates of the number of documents to be expected at each Recall Target.

Here is the Dynamo table:
Recall target                                      65%      70%      75%       80%       85%       90%       95%
Projected True Positives                         8,712    9,075    9,801    10,527    11,253    11,979    12,705
Projected True Positives Plus False Positives   52,336   54,880   69,781   122,116   139,563   157,736   174,091
Precision                                          16%      16%      14%        8%        8%        7%        7%


Each document was assigned a relevance score by the predictive coding system.  This score can be used to order the documents from lowest to highest relevance.  We can then use a certain score as a cutoff or threshold.  Documents with scores above this threshold would be designated as positives (putatively responsive) and documents with scores below this cutoff would be designated as negatives (putatively non-responsive).  Presumably, they used the Commissioner’s judgments as a random sample of responsive and non-responsive documents and estimated the expected Recall (Recall Target) from this sample at each of seven thresholds. 

As the cutoff score is lowered, more documents are included in the putatively positive set.  These putatively positive documents will include some that are truly responsive (true positives) and some that are not (false positives).  Lower cutoff scores yield more true positives, but also more false positives.  For example, setting the cutoff score at a relatively high level would, according to the table, yield 52,336 positive documents, of which, presumably, 8,712 would be truly responsive (the 65% Recall target).  Setting a low cutoff would yield 174,091 positive documents, of which 12,705 are expected to be truly responsive (the 95% Recall target).  Each increase in the so-called Recall Target in the table corresponds to a decrease in the threshold, resulting in the selection of more putatively responsive documents, both truly responsive (true positives) and truly non-responsive (false positives).
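To make the mechanics concrete, here is a minimal sketch in Python of how a single set of relevance scores yields different counts of true and false positives as the cutoff is lowered.  The scores and labels are invented for illustration; they are not the Dynamo data.

```python
# A minimal sketch of the cutoff mechanics described above.
# The scores and responsiveness labels are invented for illustration;
# they are not the actual Dynamo Holdings data.

# (relevance_score, truly_responsive) pairs
docs = [
    (0.95, True), (0.90, True), (0.85, False), (0.80, True),
    (0.70, False), (0.60, True), (0.50, False), (0.40, False),
    (0.30, True), (0.20, False), (0.10, False), (0.05, False),
]

total_responsive = sum(1 for _, responsive in docs if responsive)

for cutoff in (0.8, 0.5, 0.2, 0.0):
    selected = [(s, r) for s, r in docs if s >= cutoff]
    true_pos = sum(1 for _, r in selected if r)
    false_pos = len(selected) - true_pos
    recall = true_pos / total_responsive
    precision = true_pos / len(selected)
    print(f"cutoff={cutoff:.1f}  TP={true_pos}  FP={false_pos}  "
          f"Recall={recall:.0%}  Precision={precision:.0%}")
```

Lowering the cutoff never decreases Recall, but it tends to erode Precision, which is exactly the pattern in the Dynamo table.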

"One system, many levels of Recall"
Keep in mind that the same predictive coding system produced all of these Recall Targets.  The only thing that differs between Recall Targets (table columns) is the cutoff score, not the model.  The same model is used at all Recall Target levels.

Assessing the Tradeoff

Another way to look at the tradeoff is with a graph called an ROC curve, shown below.  Assuming that the relevance scores range from 0.0 to 1.0, if the cutoff score is set to 0.0, then all of the documents will be selected and we will have a point in the upper right-hand corner of the chart: the true positive rate will be 100%, but so will the false positive rate.  Conversely, if we set the cutoff score to 1.0, then none of the documents will be selected and the true positive and false positive rates will both be 0%.  We can always achieve a Recall level of 100% by setting the threshold to 0.0 and producing all of the documents.  All ROC curves therefore include the points (0% false positives, 0% true positives) and (100% false positives, 100% true positives), because even a random system can produce these results.

The red line in the graph shows what would happen if the relevance score of each document was randomly assigned.  A predictive coding system that was perfectly ineffective would trace a straight line from (0% false positives, 0% true positives) to (100% false positives, 100% true positives).

An ROC curve showing the tradeoff between False Positives and True Positives for a perfect categorizer (green), the Dynamo Holdings categorizer (blue), and a random categorizer (red).
A perfect predictive coding system, in contrast, would look like the green line and yield 100% true positives at 0% false positives.  Its curve would still run from (0% false positives, 0% true positives) at a cutoff score of 1.0 to (100% false positives, 100% true positives) at a cutoff score of 0.0, but instead of a straight line, it would rise vertically to 100% true positives (at 0% false positives) and then move horizontally across the top of the graph, connecting the three points (0%, 0%), (0%, 100%), and (100%, 100%).

By comparison, the blue line shows the estimated Dynamo predictive coding results according to the table presented by the petitioners.  This predictive coding exercise yielded a middling level of estimated accuracy: it is neither very near to a perfect system nor to a random system.  I will return to some possible explanations for this low accuracy later.  Depending on where the threshold is placed (different points along the line), the mix of false positive and true positive results changes in a regular way.  Each point corresponds to the false positive and true positive counts shown in the table, expressed as rates.
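For readers who want to reconstruct the blue line themselves, here is a rough sketch of how the table can be converted into ROC points.  It assumes a collection of roughly 406,000 documents and infers the total number of responsive documents from the 95% column (12,705 / 0.95, about 13,400); neither figure was reported by the parties in exactly this form.

```python
# A rough sketch of converting the petitioners' table into ROC points.
# Assumptions (not reported in this form by the parties): the collection
# holds about 406,000 documents, and the total number of responsive
# documents is inferred from the 95% column as 12,705 / 0.95.

collection_size = 406_000
recall_targets  = [0.65, 0.70, 0.75, 0.80, 0.85, 0.90, 0.95]
true_positives  = [8_712, 9_075, 9_801, 10_527, 11_253, 11_979, 12_705]
selected_docs   = [52_336, 54_880, 69_781, 122_116, 139_563, 157_736, 174_091]

total_responsive = round(true_positives[-1] / recall_targets[-1])   # about 13,400
total_nonresponsive = collection_size - total_responsive

for recall, tp, sel in zip(recall_targets, true_positives, selected_docs):
    fp = sel - tp
    tpr = tp / total_responsive        # true positive rate (approximately the Recall target)
    fpr = fp / total_nonresponsive     # false positive rate
    print(f"Recall target {recall:.0%}:  TPR={tpr:.2f}  FPR={fpr:.2f}")
```

Plotting these (FPR, TPR) pairs, together with the corner points that every ROC curve shares, reproduces the general shape of the blue line.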

The Commissioner chose to accept a large number of false positives in order to get the highest number of true positives.  In fact, the Commissioner’s original request was to receive all of the documents, which would have guaranteed the receipt of all of the true positives (a cutoff score of 0.0).  That’s an important point.  Even a random process can achieve 95% Recall if you are willing to accept a large number of non-responsive documents.  In fact, you can achieve 100% Recall if you are willing to accept near-zero Precision: if you receive all of the documents, you are guaranteed to receive all of the responsive ones among them.  In this case, the petitioners turned over about 43% of the collection to achieve what they thought was 95% Recall.  As it turns out, though, they were seriously mistaken about the level of Recall they actually achieved.

Predictive coding effectiveness for several systems, including human review (triangles) and predictive coding (circles and squares).  The Dynamo Holdings system is shown in the upper left corner as a blue diamond.
The predictive coding results in this case are, in my experience, exceptionally poor.  F1, the harmonic mean of Recall and Precision, is only about 13%.  F1 is often used in recognition of the tradeoff between Recall and Precision, and of the fact that one can always achieve arbitrarily high levels of Recall by producing more documents.  An effective process, however, will return as close to all of the responsive documents as possible with as few of the non-responsive ones as possible, which would result in a high F1 score.  The petitioners achieved high levels of Recall, but only by producing a very substantial number of non-responsive documents as well.
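For anyone who wants to check the arithmetic, here is the F1 calculation using the rounded values from the table’s 95% column (7% Precision, 95% Recall).

```python
# F1 is the harmonic mean of Precision and Recall.
precision = 0.07   # rounded Precision at the 95% Recall target, from the table
recall = 0.95

f1 = 2 * precision * recall / (precision + recall)
print(f"F1 = {f1:.1%}")   # about 13%
```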

Other cases, such as Global Aerospace, report Recall and Precision in the 80% to 95% range (with F1 in about that same range), so it is quite unusual to have Precision in the single digits.  Even the petitioners were clear that the accuracy of this system in this case was poor.  The second chart shows the Precision and Recall for several predictive coding systems (circles and squares), human review (triangles), and a negotiated keyword search.  The current results from the Dynamo petitioners’ production are shown as a diamond in the upper left-hand corner (95% Recall and 7% Precision).  The Dynamo Recall level is comparable to that of other systems (it was forced by the Commissioner), but this level of Recall was achieved only by producing a substantial number of non-responsive documents.  Remember that arbitrarily high levels of Recall can be achieved as long as one is willing to accept high levels of false positives along with the responsive documents.

Even this level of accuracy may, however, overstate the success of this predictive coding task.  The petitioners predicted that their production of 174,091 documents would contain 12,705 responsive ones.  The Commissioner, instead, reported that only 5,796 of them were responsive.  This difference is beyond what one would expect based on a reasonable sampling error.
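To get a rough sense of why this gap exceeds sampling noise, here is a heavily hedged sketch.  It assumes, purely for illustration, that the Precision estimate came from the roughly 430 validation documents that would have scored above the final cutoff (about 43% of the 1,000-document validation sample, mirroring the 43% of the collection that was produced); none of this is stated in the case record.

```python
from math import sqrt

# A rough plausibility check under an assumption NOT in the case record:
# that Precision was estimated from the ~430 validation documents scoring
# above the final cutoff (43% of a 1,000-document random sample).
predicted_rate = 12_705 / 174_091   # about 7.3% of the production predicted responsive
observed_rate = 5_796 / 174_091     # about 3.3% found responsive by the Commissioner
sample_size = 430                   # hypothetical validation documents above the cutoff

std_err = sqrt(predicted_rate * (1 - predicted_rate) / sample_size)
gap = (predicted_rate - observed_rate) / std_err
print(f"standard error = {std_err:.1%}, gap = {gap:.1f} standard errors")
```

A gap of roughly three standard errors would be unlikely to arise from sampling alone, which points toward the reviewer-related explanations discussed next.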

It’s not clear why there should be such a large discrepancy.  The training set was judged by the receiving party, though we do not know whether the same person did the training and the final assessment of the responsiveness of the produced documents.  Reviewers often differ in the documents they call responsive.  If different people trained and tested the system, then there could be a substantial difference in their responsiveness judgments.

It’s possible that different standards were applied when training predictive coding than when assessing the final product.  During training, the reviewer might use a looser criterion for what counts as responsive than during the final review.  During training, the reviewer might seek to include documents that are marginally responsive in order to get a more complete production, but then, once that production has been delivered, use a narrower criterion for what is actually useful in the case.  We have no information about the validity of this speculation in this case.

Given that the Commissioner found only 5,796 documents of the produced set to be responsive, the actual Recall rate is likely to be substantially lower than the nominal 95%, but we have no way to estimate that from the available information.  As far as I am aware, this discrepancy was not mentioned by the Commissioner when petitioning for additional documents to be produced.

Even if we accept the nominal Precision and Recall measures proffered by the petitioners, the predictive coding performance in this case is quite poor.  The second graph shows this case in comparison to some others, measured as Precision and Recall.

There are at least four possible explanations for this low level of predictive coding performance. 

  1. Predictive coding does not work
  2. The specific predictive coding application used in this case does not work
  3. Insufficient training examples were provided
  4. Inaccurate training examples were provided


Given the wide use of predictive coding in eDiscovery, and its success, as shown by the graph above, and given the use of similar machine learning technologies in other areas, such as spam filtering, I think that we can categorically reject the first possibility. It would be a serious mistake, I believe, to throw out a class of technologies because of this case.  Predictive coding has frequently been found to work quite well and this case is an outlier.

The second explanation is slightly more plausible.  It appears that this project used Ringtail Visual Predictive Coding.  The petitioners’ expert was James Scarazzo of FTI, so it would make sense for him to use a predictive coding system offered by his company.  The table showing the Precision at each Recall level is similar to one on the Ringtail website.

FTI is a very reputable company, and its Ringtail software is widely used.  Nonetheless, an analysis of the results presented on their website, promoting the use of their software, shows similar levels of performance (low Precision at high levels of Recall).  So it is possible that their predictive coding software is somehow limited.  It is, of course, also possible that the software is good but that the problem is in their marketing; perhaps they chose an unflattering example for their website.

The third explanation is, I think, far more likely.  The total population of documents was about 406,000.  Of these, the Commissioner found only 5,796 to be truly responsive, which is about 1.4% of the total collection.  The petitioners claimed that 12,705 documents were responsive, but even that is only 3.1% of the whole document set.  In the original randomly selected training set of 1,000 documents, therefore, we would expect only between 14 and 31 responsive ones (1.4% or 3.1% of 1,000).  The second stage of training, where they tried to focus on documents more likely to be responsive, may have added more positive examples, but even 50 or 75 responsive documents may not be enough to effectively train a predictive coding system.
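A quick back-of-the-envelope calculation, using the figures above, shows how few responsive training examples a random 1,000-document sample would be expected to contain at this prevalence (the counts are expectations, not reported numbers).

```python
collection_size = 406_000
commissioner_responsive = 5_796    # documents the Commissioner found responsive
petitioner_responsive = 12_705     # documents the petitioners projected as responsive
training_sample = 1_000

low_prevalence = commissioner_responsive / collection_size    # about 1.4%
high_prevalence = petitioner_responsive / collection_size     # about 3.1%

# Expected responsive documents in a random 1,000-document training sample
print(f"expected responsive training examples: "
      f"{low_prevalence * training_sample:.0f} to {high_prevalence * training_sample:.0f}")
```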

The fourth possible explanation is in many ways the most likely.  Effectively training a predictive coding system depends on the validity, consistency, and representativeness of the training set.  These are factors that people control—independent of the technology. 

Random samples provide a reasonable means of assuring a representative sample of documents. 

The validity and consistency of the review, on the other hand, may be problems.  If the review of the training documents was delegated to someone with a low level of expertise (low validity) in these matters or someone who was distracted (low consistency) during the review, then the documents used to train the system may not have been accurately or consistently categorized. 

For example, in one matter that I worked on, several different people did predictive coding on the same set of documents to identify documents that were responsive to very similar issues.  One of those people did the predictive coding training in a couple of days, reviewing a couple thousand documents.  The others tried to do the training over several weeks, doing only a few documents at a time.  The concentrated training resulted in very high predictive coding performance; the piecemeal training resulted in relatively poor performance, apparently because it was difficult to maintain a consistent perspective on responsiveness over a long time with many interruptions.  The same software used with the same data resulted in different levels of success depending on how it was used.  A poor training set would almost certainly lead to a poor outcome.

Conclusion

There may be other factors that could contribute to the poor performance on this predictive coding task.  It is important to keep in mind that even the most powerful predictive coding system is still just a tool used by humans.

The power of a categorization system, such as predictive coding, is its ability to separate the document classes from one another (e.g., the responsive from the non-responsive documents).  For a system with any amount of power, specific levels of Recall can be achieved by adjusting the criterion for what one calls responsive, accepting more or fewer true positives and therefore more or fewer false positives.  By itself, then, achieving high levels of Recall does not indicate a powerful system, because when high levels of Recall are accompanied by high levels of false positives, there is very little separation at all.  A more powerful system is one in which the proportion of truly responsive documents grows more quickly than the proportion of false positives as the criterion is lowered; it achieves high Recall while producing few false positives.  In this light, the system used by Dynamo Holdings was not very powerful.  Rather than separating the responsive from the non-responsive, it simply provided both.

It is important to remember that a system, particularly in eDiscovery, consists not just of the software used to implement the machine learning, but also of the training examples and other methods used.  People are a critical part of any predictive coding system and, by some measures, they are the most error-prone part.

Predictive coding is not magic.  You don’t get something for nothing.  What you do get is a tool that makes the most of relatively small amounts of effort.  Unsupervised, the computer has no way to distinguish what is legally important from what is not; it still requires human judgment to guide it.  The computer then amplifies that judgment, but it can amplify poor judgment as well as good judgment.

Effective predictive coding requires good technology, good methods for applying that technology, and good judgment to guide the technology.  At least one of those appears to have been missing in this case.

