The Dynamo Holdings case (Dynamo Holdings Limited Partnership v. Commissioner of Internal Revenue, 143 T.C. No. 9) and its use of predictive coding featured a high level of cooperation, unusual methods, and poor results. Among other things, this case shows how important the human factor is in predictive coding.
Introduction
In September of 2014, Judge Buch of the US Tax Court ruled
that the petitioners (Dynamo Holdings, the producing party) were free to use
predictive coding to identify responsive documents, noting that “the Court is
not normally in the business of dictating to parties the process that they
should use when responding to discovery.”
The Petitioners chose to use predictive coding and worked out a protocol
with the IRS to accomplish this.
The Commissioner originally requested
that the petitioners produce all of the documents on two specified backup
tapes. The petitioners responded that fulfilling the Commissioner’s request would take many months and significant cost, and that the documents would have to be reviewed for privilege and confidential information. They also
argued that the Commissioner’s approach amounted to a “fishing expedition” in
search of new issues that could be raised against the petitioners. Instead, they proposed to use predictive
coding to reduce the document set to something more manageable. Given their argument against providing the
Commissioner with access to many potentially irrelevant documents, the protocol
that they ended up agreeing to is rather curious.
As part of their negotiated protocol, the petitioners
randomly selected two sets of 1,000
documents each, from a backup tape and from an Exchange database. The
first set was to be used as a training set and the second as a validation or
control set. Both sets were coded by
the Commissioner.
Having the Commissioner (the receiving party) review the
documents in the training set indicates a high level of cooperation (as noted
by the Court),
but it also shifts the burden to the receiving party and may not be sensible in
all cases. Even with a clawback
agreement, this process exposes many of the very documents that the petitioners
had sought to restrict in their response to the Commissioner’s original
discovery request.
After training on these 1,000 documents, the petitioners
reported that the predictive coding process was not performing well, so they
had the Commissioner code an additional 1,200 documents that were drawn using
some kind of judgmental sample to make the training set richer in responsive
documents. There is no further
information regarding this supplemental training set.
The petitioners invited the Commissioner to review an
additional set of 1,000 documents that they called a validation sample. They also said that reviewing this validation sample would be unlikely to improve the model, and the Commissioner declined to review it. These are puzzling statements. First, a validation sample would presumably be used to assess the training process, not to train it, so it is unclear why the petitioners framed the offer in terms of improving the model. Second, if performance was poor, then the most obvious solution would be to provide more training, so why would they say that it would not improve the model, and why would a dissatisfied Commissioner decline to provide this small amount of additional training?
It is also puzzling that, if the petitioners knew and represented that the predictive coding model was ineffective, the parties did not stop there and try something else, perhaps a different training method, different documents, or a different predictive coding model. The parties had been cooperating; why did that cooperation break down at this point?
Recall With a Substantial Dose of False Positives
The court noted
that there is often a tradeoff between Precision and Recall: “A broad search
that misses few relevant documents will usually capture a lot of irrelevant
documents, while a narrower search that minimizes ‘false positives’ will be
more likely to miss some relevant documents.”
Although this tradeoff is widely appreciated, this particular case
provides an opportunity to make it more comprehensible.
The petitioners presented a table
of estimated results at different levels of Recall. It’s not entirely clear how this table was
derived (but see Ringtail
Visual Predictive Coding for a similar table). I infer that the true positives, and
therefore, the Recall and Precision estimates, are based on the validation set
(the second 1,000 documents) reviewed by the Commissioner. This table, then, consists of estimates of the
number of documents to be expected at each Recall Target.
Here is the Dynamo table:
Recall target | 65% | 70% | 75% | 80% | 85% | 90% | 95% |
Projected True Positives | 8,712 | 9,075 | 9,801 | 10,527 | 11,253 | 11,979 | 12,705 |
Projected True Positives Plus False Positives | 52,336 | 54,880 | 69,781 | 122,116 | 139,563 | 157,736 | 174,091 |
Precision | 16% | 16% | 14% | 8% | 8% | 7% | 7% |
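As a quick sanity check (mine, not the petitioners'), the Precision row follows directly from the other two rows, since Precision is simply the ratio of projected true positives to all projected positives:

```python
# Precision = true positives / (true positives + false positives),
# computed from the two count rows of the petitioners' table.
recall_targets = [0.65, 0.70, 0.75, 0.80, 0.85, 0.90, 0.95]
true_positives = [8_712, 9_075, 9_801, 10_527, 11_253, 11_979, 12_705]
all_positives = [52_336, 54_880, 69_781, 122_116, 139_563, 157_736, 174_091]

for target, tp, produced in zip(recall_targets, true_positives, all_positives):
    precision = tp / produced
    print(f"Recall target {target:.0%}: Precision = {tp:,}/{produced:,} = {precision:.1%}")
# Prints roughly 16.6%, 16.5%, 14.0%, 8.6%, 8.1%, 7.6%, 7.3%, consistent with
# the table's rounded 16%, 16%, 14%, 8%, 8%, 7%, 7% (the 90% column appears
# to have been rounded down).
```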
Each document was assigned a relevance score by the
predictive coding system. This score can be used to order the documents
from lowest to highest relevance. We can
then use a certain score as a cutoff or threshold. Documents with scores above this threshold
would be designated as positives (putatively responsive) and documents with
scores below this cutoff would be designated as negatives (putatively
non-responsive). Presumably, they used the
Commissioner’s judgments as a random sample of responsive and non-responsive
documents and estimated the expected Recall (Recall Target) from this sample at
each of seven thresholds.
As the cutoff score is lowered, more documents are included
in the putatively positive set. These
putatively positive documents will include some that are truly responsive (true
positives) and some that are not truly responsive (false positives). Lower cutoff scores yield more true
positives, but also more false positives.
For example, setting the cutoff score at a relatively high level would, according to the table, yield 52,336 positive documents, of which, presumably, 8,712 would be truly responsive (the 65% Recall target). Setting a low criterion would yield 174,091 positive documents, of which 12,705 are expected to be truly responsive (the 95% Recall target). Increases in the so-called Recall Target in the table correspond to decreases in the threshold, resulting in the selection of more putatively responsive documents, including both those that are truly responsive (true positives) and those that are truly non-responsive (false positives).
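To make the mechanics concrete, here is a minimal sketch (mine, not the petitioners' actual workflow; the function name and inputs are illustrative) of how Recall and Precision can be estimated from a labeled validation sample by sweeping the cutoff score:

```python
def recall_precision_at_cutoffs(scores, labels, cutoffs):
    """Estimate Recall and Precision at each cutoff from a validation sample.

    scores: relevance scores assigned by the predictive coding model.
    labels: True where the reviewer coded the document as responsive.
    """
    total_responsive = sum(labels)
    estimates = []
    for cutoff in cutoffs:
        selected = [score >= cutoff for score in scores]
        tp = sum(sel and lab for sel, lab in zip(selected, labels))
        fp = sum(sel and not lab for sel, lab in zip(selected, labels))
        recall = tp / total_responsive if total_responsive else 0.0
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        estimates.append((cutoff, recall, precision))
    return estimates

# Lowering the cutoff can only add documents to the putatively positive set,
# so estimated Recall rises (or stays flat) while Precision typically falls.
```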
"One system, many levels of Recall"
Keep in mind that the same predictive coding system produced
all of these Recall Targets. The only
thing that differs between Recall Targets (table columns) is the cutoff score,
not the model. The same model is used at
all Recall Target levels.
Assessing the Tradeoff
Another way to look at the tradeoff is with a graph called
an ROC
curve. This graph is shown
below. Assuming that the relevance scores
range from 0.0 to 1.0, if the cutoff score is set to 0.0, then all of the
documents will be selected and we will have a point in the upper right-hand corner of the chart. The true positive
rate will be 100%, but also the false-positive rate will be 100%. Conversely, if we set the cutoff score to be
1.0, then none of the documents will be selected and the true positive rate and
the false positive rates will both be 0%.
We can always achieve a Recall level of 1.0 by setting the threshold to
0.0 and producing all of the documents.
All ROC curves include the points (0% false positives, 0% true positives) and (100% false positives, 100% true positives), because even a random system can produce these results.
The red line in the graph shows what would happen if the relevance score of each document were randomly assigned. A predictive coding system that was completely ineffective would trace a straight line from (0% false positives, 0% true positives) to (100% false positives, 100% true positives).
An ROC curve showing the tradeoff between False Positives and True Positives for a perfect categorizer (green), the Dynamo Holdings categorizer (blue), and a random categorizer (red).
By comparison, the blue line shows the estimated Dynamo
predictive coding results according to the table
presented by the petitioners. This
predictive coding exercise yielded a middling level of estimated accuracy; it is neither very near to a perfect system nor to a random system. I will return to some
possible explanations for this low accuracy later. Depending on where the threshold is placed
(different points along the line), the mix of false positive and true positive
results changes in a regular way. Each
point corresponds to the false positive and true positive rates shown in the
table.
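Out of curiosity, the blue line's points can be roughly reconstructed from the petitioners' table. This is my own back-of-the-envelope calculation; it assumes a collection of about 406,000 documents (the figure cited later in this post) and backs the total number of truly responsive documents out of the 95% column:

```python
# Rough reconstruction (mine, under the stated assumptions) of the blue-line
# points from the petitioners' table.
collection_size = 406_000
total_responsive = round(12_705 / 0.95)          # ~13,374 (assumed)
total_non_responsive = collection_size - total_responsive

table = [  # (recall target, projected TP, projected TP + FP)
    (0.65, 8_712, 52_336), (0.70, 9_075, 54_880), (0.75, 9_801, 69_781),
    (0.80, 10_527, 122_116), (0.85, 11_253, 139_563),
    (0.90, 11_979, 157_736), (0.95, 12_705, 174_091),
]
for recall, tp, positives in table:
    fpr = (positives - tp) / total_non_responsive
    print(f"TPR {recall:.0%} -> FPR {fpr:.0%}")
# Roughly (TPR, FPR): (65%, 11%), (70%, 12%), (75%, 15%), (80%, 28%),
# (85%, 33%), (90%, 37%), (95%, 41%) -- the points tracing the blue curve
# between the random and perfect lines.
```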
The Commissioner chose to accept a large number of false
positives in order to get the highest number of true positives. In fact, the Commissioner’s original request
was to receive all of the documents, which would have guaranteed the receipt of
all of the true positives (cutoff score of 0.0). That’s an important point. Even a random process can achieve 95% Recall
if you are willing to accept a large number of non-responsive documents. In fact, you can achieve 100% Recall simply by producing everything: if you receive all of the documents, you are guaranteed to receive all of the responsive ones among them, and Precision then falls to the prevalence of responsive documents in the collection (here, only a few percent). In this case, the petitioners turned over about 43%
of the collection to achieve what they thought was 95% Recall. As it turns out, though, they were seriously
mistaken about the level of Recall they did actually achieve.
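The arithmetic behind that last point is worth spelling out (my own calculation, again assuming the roughly 406,000-document collection size cited below):

```python
# Producing everything guarantees 100% Recall; Precision then equals the
# prevalence of responsive documents in the collection.
collection_size = 406_000
claimed_responsive = 12_705   # the petitioners' projected responsive count
produced = 174_091            # documents actually turned over

print(f"Fraction of collection produced: {produced / collection_size:.0%}")                  # ~43%
print(f"Precision if everything were produced: {claimed_responsive / collection_size:.1%}")  # ~3.1%
```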
Other cases, such as Global Aerospace, report Recall and Precision in the 80% to 95% range (with F1 in about that same range), so it is quite unusual to have Precision in the single digits. Even the petitioners were clear that the accuracy of this system in this case was poor. The second chart shows the Precision and Recall for several predictive coding systems (circles and squares), human review (triangles), and a negotiated keyword search. The current results from the Dynamo petitioners' production are shown as a diamond in the upper left-hand corner (95% Recall and 7% Precision). The Dynamo Recall level is comparable to that of other systems (it was, in effect, dictated by the Commissioner), but this level of Recall was achieved only by producing a substantial number of non-responsive documents. Remember that arbitrarily high levels of Recall can be achieved as long as one is willing to accept high levels of false positives along with the responsive documents.
Even this level of accuracy may, however, overstate the
success of this predictive coding task. The
petitioners predicted that their production of 174,091 documents would contain
12,705 responsive ones. The
Commissioner, instead, reported that only 5,796 of them were responsive. This difference is beyond what one would
expect based on a reasonable sampling error.
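To give a sense of scale, here is a rough back-of-the-envelope check of my own. It assumes (and this is only an assumption; the filings do not say) that the projection came from the 1,000-document validation sample, of which roughly 43% (about 430 documents) would have scored above the 95% Recall cutoff:

```python
import math

# Assumption (mine): the 12,705 projection rests on about 430 validation
# documents scoring above the cutoff (43% of the 1,000-document sample).
n = 430
p_projected = 12_705 / 174_091   # ~7.3% of the production projected responsive
p_observed = 5_796 / 174_091     # ~3.3% found responsive by the Commissioner

standard_error = math.sqrt(p_observed * (1 - p_observed) / n)
z = (p_projected - p_observed) / standard_error
print(f"z = {z:.1f}")  # ~4.6, far outside ordinary sampling variation
```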
It’s not clear why there should be this large discrepancy. The training set was judged by the receiving
party, though we do not know if the same person did the training and the final
assessment of the responsiveness of the produced documents. Reviewers often differ in the documents they
call responsive. If different people trained and tested the system, then there could be a substantial difference in their responsiveness judgments.
It’s possible that different standards were applied when
training predictive coding relative to assessing the final product. During training, the reviewer might use a looser criterion for what would be responsive than during the final review. During training, the reviewer
might seek to include documents that are marginally responsive in order to get
a more complete production, but then once that production has been delivered,
use a narrower criterion for what is actually useful in the case. We have no information about the validity of
this speculation in this case.
Given that the Commissioner found only 5,796 documents of
the produced set to be responsive, the actual Recall rate is likely to be
substantially lower than the nominal 95%, but we have no way to estimate that from
the available information. As far as I
am aware, this discrepancy was not mentioned by the Commissioner when petitioning
for additional documents to be produced.
Even if we accept the nominal Precision and Recall measures proffered
by the petitioners, the predictive coding performance in this case is quite
poor. The second graph shows this case
in comparison to some others, measured as Precision and Recall.
There are at least four possible explanations for this low
level of predictive coding performance.
- Predictive coding does not work
- The specific predictive coding application used in this case does not work
- Insufficient training examples were provided
- Inaccurate training examples were provided
Given the wide use of predictive coding in eDiscovery, and
its success, as shown by the graph above, and given the use of similar machine
learning technologies in other areas, such as spam filtering, I think that we
can categorically reject the first possibility. It would be a serious mistake,
I believe, to throw out a class of technologies because of this case. Predictive coding has frequently been found
to work quite well and this case is an outlier.
The second explanation is slightly more plausible. It appears that this project used Ringtail
Visual Predictive Coding. The
petitioners’ expert was James
Scarazzo of FTI, so it would make sense to use a predictive coding system
used by his company. The table showing the Precision at each Recall level is
similar to one on the Ringtail website.
FTI is a very reputable company, and its Ringtail software
is widely used. Nonetheless, an analysis
of the results presented on their website, promoting the use of their software,
shows similar levels of performance (low Precision at high levels of Recall). So it is possible that their predictive
coding software is somehow limited. It
is, of course, also possible the software is good, but that the problem is in
their marketing; perhaps they chose an unflattering example for their website.
The third explanation is, I think, far more likely. The total population of documents was about 406,000. Of these, the Commissioner found only 5,796 to be truly responsive, or about 1.4% of the total collection. The petitioners claimed that 12,705 documents were responsive, but even that is only 3.1% of the whole document set. The original randomly selected training set of 1,000 documents, therefore, would be expected to contain between 14 and 31 responsive documents (1.4% or 3.1% of 1,000). The second stage of training, where they
tried to focus on documents more likely to be responsive may have added more
positive examples, but even 50 or 75 responsive documents may not be enough to
effectively train a predictive coding system.
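The prevalence arithmetic above is worth laying out explicitly, since it drives the expected number of responsive documents in the first training sample (using only the figures quoted in this post):

```python
collection_size = 406_000
training_sample_size = 1_000

# Two estimates of the number of truly responsive documents in the collection.
for label, responsive in [("Commissioner's count", 5_796),
                          ("petitioners' projection", 12_705)]:
    prevalence = responsive / collection_size
    expected = prevalence * training_sample_size
    print(f"{label}: prevalence {prevalence:.1%}, "
          f"about {expected:.0f} responsive documents per 1,000 sampled")
# Commissioner's count: 1.4%, about 14; petitioners' projection: 3.1%, about 31.
```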
The fourth possible explanation is in many ways the most
likely. Effectively training a
predictive coding system depends on the validity, consistency, and
representativeness of the training set.
These are factors that people control—independent of the
technology.
Random samples provide a reasonable means of assuring a
representative sample of documents.
The validity and consistency of the review, on the other
hand, may be problems. If the review of
the training documents was delegated to someone with a low level of expertise
(low validity) in these matters or someone who was distracted (low consistency)
during the review, then the documents used to train the system may not have
been accurately or consistently categorized.
For example, in one matter that I worked on, several
different people did predictive coding on the same set of documents to identify
documents that were responsive to very similar issues. One of those people did the predictive coding
training in a couple of days, reviewing a couple thousand documents. The others tried to do the training over
several weeks, doing only a few documents at a time. The concentrated training resulted in very high predictive coding performance; the piecemeal training resulted in relatively poor performance, apparently because it was difficult to maintain a
consistent perspective on responsiveness over a long time with many
interruptions. The same software used
with the same data resulted in different levels of success depending on how it
was used. A poor training set would
almost certainly lead to a poor outcome.
Conclusion
There may be other potential factors that could contribute
to the poor performance on this predictive coding task. It is important to keep in mind that even the most powerful predictive coding system is still just a tool used by humans.
The power of a categorization system, such as predictive
coding, is its ability to separate the document classes from one another (e.g.,
responsive from the non-responsive documents).
For a system with any amount of power, specific levels of Recall can be
achieved by adjusting the criterion of what one calls responsive to accept more
or fewer true positives and therefore more or fewer false positives. By itself, therefore, achieving high levels of Recall does not indicate a powerful system, because when high levels of Recall are accompanied by high levels of false positives, there is very little separation at all. A more powerful
system is one that increases the proportion of truly responsive documents more
quickly than the proportion of false positives as this criterion is
lowered. A more powerful system will
achieve high Recall at the same time as it achieves few false positives. In this light, the system used by Dynamo
Holdings was not very powerful. Rather
than separating the responsive from non-responsive, it simply provided both.
It is important to remember that a system, particularly in
eDiscovery, consists not just of the software used to implement the machine
learning, but also of the training examples and other methods used. People are a critical part of a predictive coding system and, by some measures, they are the most error-prone part.
Predictive coding is
not magic. You don’t get
something for nothing. What you do get
is a tool that makes the most out of relatively small amounts of effort. Unsupervised, the computer has no way to distinguish what is legally important from what is not; it still requires human judgment to guide it. The computer then
amplifies that judgment, but can amplify poor judgment as well as good
judgment.
Effective predictive coding requires good technology, good
methods for applying that technology, and good judgment to guide the
technology. At least one of those appears
to have been missing in this case.