Tuesday, September 13, 2016

Recall, Magical Thinking, and the Assessment of eDiscovery

The current version of the Federal Rules of Civil Procedure highlights the importance of reasonableness and proportionality.  As is widely understood, the cost of dealing with the volume of documents that could potentially play a role in a legal dispute can easily overwhelm the value of the case.  Some kind of technology use is essential if we are to maintain a justice system that depends on evidence.

The problem is generally not the number of documents that will ultimately be introduced as evidence; rather, it is the winnowing process that goes from the domain of potentially relevant documents down to the ones that must be produced.  Ultimately, only a handful of those may end up being critical to a case.  If we knew, without effort, which documents those were, we would not have to go through the complex discovery process.

Discovery involves more than winnowing, of course.  The legal team not only has to decide which documents are pertinent to a case, but also has to understand the content of those documents and how they fit into and guide the theory of the case. Data analysis and understanding have not, historically, had the benefit of a well-structured process, but the winnowing task has.  In this context, I am focusing on the problem of identifying the documents to be produced from large collections.

Assessing the reasonableness of any process can be facilitated by measurement.  There is a saying that you cannot improve what you do not measure.  Although one can use intuition or other forms of judgment to assess reasonableness, intuitive feelings of reasonableness alone may not be sufficient.  In these cases, we would like to know how reasonable a process was.  For this, we need measurement.

Overwhelmingly, the primary measurement of the efficacy of the winnowing process in eDiscovery is Recall.  Of the documents that are relevant in a collection, how many (what proportion) of them have been identified?  The idea is that the more complete the identification process, the better it has been.  All other things being equal, a better process is a more reasonable process. 
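To make this concrete, here is a minimal sketch of the computation that Recall describes, written in Python with made-up counts (in practice, as discussed below, the total number of responsive documents has to be estimated):

    def recall(found_responsive: int, total_responsive: int) -> float:
        """Proportion of the truly responsive documents that the process identified."""
        if total_responsive == 0:
            return 1.0  # nothing to find; treat the process as complete
        return found_responsive / total_responsive

    # Hypothetical numbers: 9,000 of 12,000 responsive documents identified.
    print(recall(9_000, 12_000))  # 0.75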

Still, from time to time, questions arise about whether Recall is a good measure for assessing the winnowing process.

As I read it, there are four related arguments about why Recall might be inappropriate as a measure of the eDiscovery winnowing process:
  1. Recall measures completeness, but completeness is not enough
  2. Recall is overly sensitive to the easy-to-find documents
  3. Recall is insufficiently sensitive to rare, but critical sources of information (smoking guns)
  4. Recall measures the number of documents that are identified, but not their importance


Before discussing these criticisms, I want to spend some time thinking about measures.  A good measure should have validity and reliability.  Validity means that it actually measures the property that you are interested in.  Reliability means that measuring the property repeatedly gives consistent results. A good measure should also be easy to obtain and yield a quantity that has a minimum and maximum value (say 0.0 and 1.0 or 0.0 and 100.0). Finally, it should be transparently related to the goals of the task, so that it is easy to interpret.  Although computing it can take some effort, Recall meets these criteria for a good measure.

Completeness may not be complete

Recall is a statistic for measuring completeness.  It corresponds directly to the requirement in the Federal Rules of Civil Procedure, Rule 26(g), that the producing party certify that a production is complete and correct, following a reasonable inquiry.  So, by these standards, completeness would seem to be a central criterion against which to judge a production.

The usefulness of any statistic depends critically on the question you are trying to ask.  If we want to know how complete an eDiscovery process has been we can simply ask how close we have come to identifying all of the relevant documents. It is difficult to think of a more transparent or valid measure than Recall to answer this question.  If you know the number of responsive documents in a collection and you know the number that have been identified, then you know how complete your process is.

To be sure, there are challenges when measuring Recall.  The primary one is that we do not directly know how many relevant documents are in a collection.  We need to estimate that number, and for this we use various statistical sampling and other methods.  I have discussed some of these methods elsewhere, but all of them are essentially different ways of estimating Recall.  If you want to know about the completeness of a discovery process, Recall, however estimated, is your answer (I count Elusion as one of the methods of estimating Recall).
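To illustrate, here is a hedged sketch of two common ways Recall might be estimated by sampling.  The sample sizes and counts are hypothetical, and real estimates would also carry confidence intervals; the point is only that both routes answer the same completeness question:

    # Two illustrative ways to estimate Recall by sampling (hypothetical numbers).

    def recall_from_prevalence(found: int, collection_size: int,
                               sample_responsive: int, sample_size: int) -> float:
        """Estimate Recall from a random sample of the whole collection:
        estimated responsive documents = prevalence * collection size."""
        prevalence = sample_responsive / sample_size
        return found / (prevalence * collection_size)

    def recall_from_elusion(found: int, discard_size: int,
                            sample_responsive: int, sample_size: int) -> float:
        """Estimate Recall from an Elusion sample drawn from the discard pile:
        estimated missed documents = elusion rate * discard pile size."""
        elusion = sample_responsive / sample_size
        return found / (found + elusion * discard_size)

    # 9,500 documents identified in a 400,000-document collection; 30 of 1,000
    # sampled collection documents were responsive; 5 of 1,000 sampled
    # discard-pile documents were responsive.
    print(recall_from_prevalence(9_500, 400_000, 30, 1_000))   # ~0.79
    print(recall_from_elusion(9_500, 390_500, 5, 1_000))       # ~0.83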

Critics of Recall sometimes claim that there must be more to completeness than the number of documents available and the number identified.  We turn to a couple of those suggestions next.

Sensitivity to the easy-to-find documents

According to the second argument, completeness in terms of documents is not completeness in terms of information.  We should really be using a measure of the completeness of information.  Some documents contain unique information and some are simply repeats of already known information.  The responsive ones with unique information tend to be more valuable than the redundant ones.

After finding one responsive document, similar documents can be found almost automatically, but finding many duplicates of an easy-to-find document does not add value to the discovery. For example, if 80% of the responsive documents are nearly identical to one another and we find one of them (and, with it, its near-duplicates), we can achieve 80% Recall without finding anything else.  We could appear to be successful just by finding the easy-to-find documents and still miss a lot of information.

But just how do we measure this missing information?  Counting documents is relatively easy, but measuring the information content of each one is practically impossible.  Experimental psychology had a flirtation with measuring information in text in ways that could be automated, but that approach generally did not work out. 

I don’t want to claim that there could never be a way of effectively measuring the amount of information in a document or a collection of documents, but at present, I don’t know of any practical way.  The best we could do, I think, is to determine that a document is dissimilar to any that have been found so far to be responsive.  Even that, however, would be a challenge to convert into any meaningful measure of the completeness of a production, let alone a practical one.

Recall does not measure the effectiveness of finding smoking guns or rare documents


It is common in eDiscovery to say that smoking guns tend to have friends.  That is, they are generally not unique.  A representative sample of documents has a good chance of catching smoking gun documents, if they exist in a collection.  But truly rare documents can occur, and a sampling process is unlikely to find them.  That is the definition of rare.

The challenge of finding rare documents might be a criticism of sampling, but it is not a criticism of Recall.  No matter what process we employ, even exhaustively reading all of the documents, truly rare documents necessarily present a challenge to discovery.  Many documents in a collection are rare, but their rarity does not guarantee their relevance.  Rarity is not a value by itself. Individual junk emails could also be rare and of no value at all to the litigation.

If a document type is truly rare, then it is unlikely to be encountered during the review process, or if it is encountered, it is unlikely to be recognized.  Since World War II, it has been known that humans have difficulty sustaining their attention in the face of rare signals, an effect called "vigilance decrement."  Studies of human reviewers in eDiscovery confirm that people are relatively poor at independently identifying responsive documents.  We found, for example, that only 28% of the documents identified by either of two professional reviewers were identified by both reviewers. When two reviewers disagree on whether a document is responsive, one of them must be wrong. 

Documents do not have to be rare for human reviewers to miss them. It is a common occurrence in eDiscovery that a category of documents is not recognized until after many thousands of documents have been reviewed.  Human review teams rarely go back and fix such mistakes because doing so is simply too expensive.

Furthermore, truly rare documents are unlikely to appear in our estimate of the truly responsive documents in a collection against which we compute Recall.  If they are not encountered, or if they are not recognized when they are encountered, they cannot count either for or against Recall.  We would have no knowledge that they exist.  Documents that we do not know about cannot affect any measurement.  Moreover, it would be extremely difficult, as a practical matter, to identify such unique documents in a large collection.  Again, this is not a problem with Recall, but with the search process in general.  These documents might magically exist, but none of the processes we have available are likely to find them.  Again, that is the definition of rare.  If they were easy to find, they would not be a problem.

Recall is an “average” kind of measure.  It is a characteristic of how a process performs over the population of all documents in a collection.  Each document may be unique in what makes it relevant and in how important it is, but Recall captures the overall quality of the process.  Rare kinds of documents contribute less than common kinds of documents.  According to decision theory, it is more difficult to accurately judge rare events relative to more frequent events, whether that judgment is done by a computer or by a human reader.

Recall does not measure importance

Recall treats each responsive document as making an equal contribution to completeness.  It treats each responsive document found as a count toward either prevalence or completeness.  But documents are not equal in their probative value.  Could there be a measure that takes account of the probative value of a document?  This would, of course, be a different measure from Recall, addressing a different question.

Probativeness concerns an individual document’s contribution to the case.  It is not a measure of the completeness of a process at finding responsive documents.  A document has probative value if it raises some new piece of evidence, but not if it is the tenth or hundredth document providing that same information.  It is difficult to see how probativeness could be used as a measure of the success of a predictive coding project rather than as a measure of an individual document in that collection.  We could not, for example, simply sum up the probativeness of each document in the collection.  The probative value of a document is contingent both on the document itself and on the documents already discovered in the collection.

Recall can be used to some extent in the context of probativeness.  Some predictive coding projects, for example, compute separate Recall measures for “hot” documents, the most important ones to the case, and merely responsive documents, the rest of the responsive ones. This does not indicate a failure of Recall, but its application to a special subset of responsive documents.
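As a small illustration, computing Recall separately for each stratum is straightforward once the strata are defined; the counts below are, of course, hypothetical:

    # A minimal sketch of stratified Recall: "hot" documents versus the merely
    # responsive ones.  The counts are hypothetical.

    def recall(found: int, total: int) -> float:
        return found / total if total else 1.0

    strata = {
        "hot":        {"found": 45,    "total": 50},      # most important to the case
        "responsive": {"found": 8_000, "total": 11_000},  # the rest of the responsive set
    }

    for name, counts in strata.items():
        print(f"{name}: Recall = {recall(counts['found'], counts['total']):.2f}")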

As with responsiveness, we cannot know the probativeness of a document before the discovery process.  If we did, we would not need to conduct the eDiscovery process.  Some analysis needs to be conducted to assess probativeness, and it may take the development of new approaches to machine learning to automate the estimation of a document’s probative value.  The probativeness of a document, though, is not contained solely within the document, but in the relationship between a document’s content and other sources of information.  Any process directed at automating the assessment of probativeness will have to include much more information than that contained within a document or even a document collection.  As mentioned earlier, measuring the information content of a document is itself difficult; measuring the document’s relation to the facts and needs of the case is, at least for the present, impossible.

If we knew the probativeness of each document, then we could use that information to weight our Recall.  Unfortunately, at this point, wishing for a measure of probativeness is just magical thinking.  Someday, we may be able to automate its assessment, but until we have an automated measure, basing an assessment on probativeness seems unlikely to be anything more than wishful thinking.

Furthermore, I don’t think that that is what the winnowing process is all about.  Would it be reasonable for a producing party to say, “we are only producing a small percentage of these documents, but these are the most probative ones?”  Would such a production be compatible with FRCP Rule 26(g) (the requirement for complete and correct productions after reasonable inquiry)?  Could the producing party even judge which documents would be most probative to the requesting party?  Is not the probative value of documents part of the essential legal reasoning in a case?

The status of Recall

We can make up imaginary situations where Recall fails to assess the reasonableness of our selection process, but these situations are contrived and simply not realistic.  For example, one commonly suggested scenario is that one process will find more total responsive documents and thus have higher Recall than another while the second process finds fewer documents (lower Recall), but better ones. 

This scenario, it seems to me, is unlikely to actually occur.  In order for one system to have lower Recall than another, but still find a substantial number of better documents, (a) there would have to be a substantial number of better documents to find, (b) the lower-Recall system would have to miss a substantial number of documents found by the other process, and (c) we would have to find evidence of these other documents.  Generally speaking, an eDiscovery activity uses only one kind of eDiscovery process, though sometimes keywords are used on the same set as predictive coding.  Parties have speculated that there might be substantial numbers of documents detected by the keywords that were not identified by predictive coding, but this has remained mostly speculation (e.g., Dynamo Holdings).

If such a scenario could happen, there might be some abstract sense in which we would prefer the lower-Recall process over the higher-Recall process. The production from the lower-Recall system in this scenario, though, is less complete than the one from the higher-Recall system.  According to this scenario, it misses a large number of responsive documents that are found by the higher-scoring process.

Finally, how could we know? We do not have access to some catalog of ultimate truth about the responsiveness of documents.  How could we tell that the system produced better quality documents without running the comparison (i.e., doing predictive coding twice) and without having found the more valuable documents? We can imagine a situation where we have a god's-eye view of the true nature of documents, but in reality, we can only know what we observe.

Often the objections to the use of Recall seem to be thinly veiled arguments that human review is somehow superior to computer-assisted review.  Some people still cling to the view that human review is the gold standard, that it is better to have a team of reviewers spend many hours over many months reviewing documents because somehow we will get results that we cannot get using any other approach.  There is no empirical support for such a claim.

Many studies find that reviewers are inconsistent when making independent judgments about the responsiveness of documents.  I know of no studies, or even cases, that have found that people are better at finding rare documents or smoking guns than computer assisted review is.  Some lawyers may think that they are somehow better at identifying responsive documents than the statistics of human review would imply, but these lawyers are probably over-estimating their ability (the overconfidence effect) and they are unlikely to be the ones who actually do review the documents during the winnowing process.  Some lawyers are surely above average at recognizing responsive documents, but not all of them can be.  And the average seems actually to be rather low.

It seems clear that if complete and correct productions are the goal, then we need measures of completeness and correctness.  Completeness is clearly indicated by Recall but correctness depends on the validity of the decisions made during the review process.  Correctness is much more affected by the people using the technology than by the technology itself.

Obviously, if we produce all of the responsive documents, then we must be producing the correct ones as well. The closer the production is to complete, the closer it must be to correct. 


Rule 26(g) also refers to reasonable inquiry.  Any process we demand must be practical to execute.  No eDiscovery process is likely to be perfect.  Hypothetical processes that demand information that is not practically obtainable may be useful for making abstract arguments, but they are unlikely to find any useful role in litigation.  As long as we are interested in completeness, then I think that our focus will remain on the measure of that completeness: Recall and its analogs.

Wednesday, September 7, 2016

Understanding Dynamo Holdings Predictive Coding


The use of predictive coding in Dynamo Holdings (Dynamo Holdings Limited Partnership v. Commissioner of Internal Revenue, 143 T.C. No. 9) is a story of high levels of cooperation, unusual methods, and poor results. Among other things, this case shows how important the human factor is in predictive coding.

Introduction

In September of 2014, Judge Buch of the US Tax Court ruled that the petitioners (Dynamo Holdings, the producing party) were free to use predictive coding to identify responsive documents, noting that “the Court is not normally in the business of dictating to parties the process that they should use when responding to discovery.”  The Petitioners chose to use predictive coding and worked out a protocol with the IRS to accomplish this. 

The Commissioner originally requested that the petitioners produce all of the documents on two specified backup tapes.  The petitioners responded that it would take many months and significant cost to fulfill the Commissioner’s request, and that the documents would have to be reviewed for privilege and confidential information.  They also argued that the Commissioner’s approach amounted to a “fishing expedition” in search of new issues that could be raised against the petitioners.  Instead, they proposed to use predictive coding to reduce the document set to something more manageable.  Given their argument against providing the Commissioner with access to many potentially irrelevant documents, the protocol that they ended up agreeing to is rather curious.

As part of their negotiated protocol, the petitioners randomly selected two sets of 1,000 documents each, from a backup tape and from an Exchange database. The first set was to be used as a training set and the second as a validation or control set.  Both sets were coded by the Commissioner.

Having the Commissioner (the receiving party) review the documents in the training set indicates a high level of cooperation (as noted by the Court), but it also shifts the burden to the receiving party and may not be sensible in all cases.  Even with a clawback agreement, this process exposes many of the very documents that the petitioners had sought to restrict in their response to the Commissioner’s original discovery request.

After training on these 1,000 documents, the petitioners reported that the predictive coding process was not performing well, so they had the Commissioner code an additional 1,200 documents that were drawn using some kind of judgmental sample to make the training set richer in responsive documents.  There is no further information regarding this supplemental training set.

The petitioners invited the Commissioner to review an additional set of 1,000 documents that they called a validation sample.  They also said that this validation sample would be unlikely to improve the model, and the Commissioner declined to review it.  These are puzzling statements.  First, a validation sample would presumably be used to assess the training process, not to train it, so it is unclear why the petitioners framed it in terms of improving the model.  Second, if performance was poor, then the most obvious solution would be to provide more training, so again, why would they say that it would not improve the model, and why would a dissatisfied Commissioner decline to provide this small amount of additional training?

It is also puzzling that, if the petitioners knew and represented that the predictive coding model was ineffective, the parties did not stop there and try something else, perhaps a different training method, different documents, or a different predictive coding model.  The parties had been cooperating; why did that cooperation break down at this point?

Recall With a Substantial Dose of False Positives

The court noted that there is often a tradeoff between Precision and Recall: “A broad search that misses few relevant documents will usually capture a lot of irrelevant documents, while a narrower search that minimizes ‘false positives’ will be more likely to miss some relevant documents.”  Although this tradeoff is widely appreciated, this particular case provides an opportunity to make it more comprehensible.

The petitioners presented a table of estimated results at different levels of Recall.  It’s not entirely clear how this table was derived (but see Ringtail Visual Predictive Coding for a similar table).  I infer that the true positives, and therefore, the Recall and Precision estimates, are based on the validation set (the second 1,000 documents) reviewed by the Commissioner.  This table, then, consists of estimates of the number of documents to be expected at each Recall Target.

Here is the Dynamo table:
Recall target                                     65%      70%      75%      80%      85%      90%      95%
Projected True Positives                        8,712    9,075    9,801   10,527   11,253   11,979   12,705
Projected True Positives Plus False Positives  52,336   54,880   69,781  122,116  139,563  157,736  174,091
Precision                                         16%      16%      14%       8%       8%       7%       7%


Each document was assigned a relevance score by the predictive coding system.  This score can be used to order the documents from lowest to highest relevance.  We can then use a certain score as a cutoff or threshold.  Documents with scores above this threshold would be designated as positives (putatively responsive) and documents with scores below this cutoff would be designated as negatives (putatively non-responsive).  Presumably, they used the Commissioner’s judgments as a random sample of responsive and non-responsive documents and estimated the expected Recall (Recall Target) from this sample at each of seven thresholds. 

As the cutoff score is lowered, more documents are included in the putatively positive set.  These putatively positive documents will include some that are truly responsive (true positives) and some that are not truly responsive (false positives).  Lower cutoff scores yield more true positives, but also more false positives.  For example, setting the cutoff score at a relatively high level would, according to the table, yield 52,336 positive documents, of which, presumably, 8,712 would be truly responsive (the 65% Recall target).  Setting a low criterion would yield 174,091 positive documents, of which 12,705 are expected to be truly responsive (the 95% Recall target).  An increase in the so-called Recall Target in the table corresponds to a decrease in the threshold, resulting in the selection of more putatively responsive documents, including both those that are truly responsive (true positives) and those that are truly non-responsive (false positives).
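For readers who like to see the mechanics, here is a hedged sketch of how a table like this could be generated from relevance scores plus a labeled sample.  The prevalence and score distributions below are synthetic, not the Dynamo data; the point is only to show the cutoff being lowered until each Recall target is reached:

    # Synthetic illustration of deriving precision-at-recall-target figures.
    import random

    random.seed(0)
    N = 100_000
    docs = []
    for _ in range(N):
        responsive = random.random() < 0.03           # assume ~3% prevalence
        # responsive documents tend to score higher, but the distributions overlap
        score = random.betavariate(3, 2) if responsive else random.betavariate(1.2, 4)
        docs.append((score, responsive))

    docs.sort(reverse=True)                            # highest scores first
    total_responsive = sum(1 for _, r in docs if r)

    for target in (0.65, 0.75, 0.85, 0.95):
        produced = true_pos = 0
        for score, responsive in docs:                 # lower the cutoff document by document
            produced += 1
            true_pos += responsive
            if true_pos / total_responsive >= target:
                break
        print(f"Recall target {target:.0%}: {produced:,} documents produced, "
              f"{true_pos:,} true positives, precision {true_pos / produced:.0%}")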

"One system, many levels of Recall"
Keep in mind that the same predictive coding system produced all of these Recall Targets.  The only thing that differs between Recall Targets (table columns) is the cutoff score, not the model.  The same model is used at all Recall Target levels.

Assessing the Tradeoff

Another way to look at the tradeoff is with a graph called an ROC curve.  This graph is shown below.  Assuming that the relevance scores range from 0.0 to 1.0, if the cutoff score is set to 0.0, then all of the documents will be selected and we will have a point in the upper right-hand corner of the chart.  The true positive rate will be 100%, but the false positive rate will also be 100%.  Conversely, if we set the cutoff score to 1.0, then none of the documents will be selected and the true positive rate and the false positive rate will both be 0%.  We can always achieve a Recall level of 1.0 by setting the threshold to 0.0 and producing all of the documents.  All ROC curves include the points (0% false positives, 0% true positives) and (100% false positives, 100% true positives) because even a random system can produce these results.

The red line in the graph shows what would happen if the relevance score of each document were randomly assigned.  A predictive coding system that was perfectly ineffective would cast a straight line from (0% false positives, 0% true positives) to (100% false positives, 100% true positives).
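For the curious, the points on such a curve can be computed directly from scored, labeled documents; here is a minimal sketch with a tiny, made-up sample:

    # A minimal sketch of computing ROC points: at each cutoff, the true positive
    # rate is TP / (all responsive) and the false positive rate is FP / (all
    # non-responsive).  The sample below is made up.

    def roc_points(scored_docs):
        """scored_docs: list of (relevance_score, is_responsive) pairs."""
        docs = sorted(scored_docs, reverse=True)       # descending score
        positives = sum(1 for _, r in docs if r)
        negatives = len(docs) - positives
        tp = fp = 0
        points = [(0.0, 0.0)]                          # cutoff above every score
        for _, responsive in docs:                     # lower the cutoff one document at a time
            if responsive:
                tp += 1
            else:
                fp += 1
            points.append((fp / negatives, tp / positives))
        return points                                  # ends at (1.0, 1.0), a cutoff of zero

    sample = [(0.9, True), (0.8, True), (0.7, False), (0.6, True), (0.4, False), (0.1, False)]
    for fpr, tpr in roc_points(sample):
        print(f"false positive rate {fpr:.2f}, true positive rate {tpr:.2f}")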

[Figure: An ROC curve showing the tradeoff between False Positives and True Positives for a perfect categorizer (green), the Dynamo Holdings categorizer (blue), and a random categorizer (red).]
A perfect predictive coding system, in contrast, would look like the green line and yield 100% true positives at 0% false positives.  The line would pass from (0% false positives, 0% true positives) at a cutoff score of 1.0 to (100% false positives, 100% true positives) at a cutoff score of 0.0, but instead of a straight line, it would initially rise vertically to 100% true positives (at 0% false positives) and then move horizontally across the top of the graph, connecting the three points (0%, 0%), (0%, 100%), and (100%, 100%).

By comparison, the blue line shows the estimated Dynamo predictive coding results according to the table presented by the petitioners.  This predictive coding exercise yielded a middling level of estimated accuracy.  It is neither very near to a perfect system nor to a random system.  I will return to some possible explanations for this low accuracy later.  Depending on where the threshold is placed (different points along the line), the mix of false positive and true positive results changes in a regular way.  Each point corresponds to the false positive and true positive rates shown in the table.

The Commissioner chose to accept a large number of false positives in order to get the highest number of true positives.  In fact, the Commissioner’s original request was to receive all of the documents, which would have guaranteed the receipt of all of the true positives (cutoff score of 0.0).  That’s an important point.  Even a random process can achieve 95% Recall if you are willing to accept a large number of non-responsive documents.  In fact, you can achieve 100% Recall if you are willing to accept 0% Precision.  If you receive all of the documents, you are guaranteed to receive all of the responsive ones among them. In this case, the petitioners turned over about 43% of the collection to achieve what they thought was 95% Recall.  As it turns out, though, they were seriously mistaken about the level of Recall they did actually achieve.

[Figure: Predictive coding effectiveness for several systems, including human review (triangles) and predictive coding (circles and squares).  The Dynamo Holdings system is shown in the upper left corner as a blue diamond.]
The predictive coding results in this case are, in my experience, exceptionally poor.  F1, which is a type of average of Recall and Precision, is only about 13%.  F1 is often used in recognition of the tradeoff between Recall and Precision, and in recognition of the fact that one can always achieve arbitrarily high levels of Recall by producing more documents.  An effective process, however, will return as close to all of the responsive documents as possible with as few of the non-responsive ones as possible (which would result in a high F1 score).  The petitioners achieved high levels of Recall, but only by producing a very substantial number of non-responsive documents as well.
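F1 is the harmonic mean of Precision and Recall, so the 13% figure follows directly from the nominal 95% Recall and roughly 7% Precision:

    # F1 from the nominal figures in the petitioners' table.
    def f1(precision: float, recall: float) -> float:
        return 2 * precision * recall / (precision + recall)

    print(f1(0.07, 0.95))   # roughly 0.13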

Other cases, such as Global Aerospace, report Recall and Precision in the 80% to 95% range (with F1 in about that same range), so it is quite unusual to have Precision in the single digits.  Even the petitioners were clear that the accuracy of this system in this case was poor. The second chart shows the Precision and Recall for several predictive coding systems (circles and squares), human review (triangles), and a negotiated keyword search.  The current results from the Dynamo petitioners’ production are shown as a diamond in the upper left-hand corner (95% Recall and 7% Precision).  The Dynamo Recall level is comparable to that of other systems (it was forced by the Commissioner), but this level of Recall was achieved only by producing a substantial number of non-responsive documents.  Remember that arbitrarily high levels of Recall can be achieved as long as one is willing to accept high levels of false positives along with the responsive documents.

Even this level of accuracy may, however, overstate the success of this predictive coding task.  The petitioners predicted that their production of 174,091 documents would contain 12,705 responsive ones.  The Commissioner, instead, reported that only 5,796 of them were responsive.  This difference is beyond what one would expect based on a reasonable sampling error.
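As a rough check, and assuming (this is my assumption, not something stated in the record) that the 12,705 projection rests on a validation sample of about 1,000 documents, the discrepancy falls well outside ordinary binomial sampling error:

    # Rough check of whether the shortfall could be sampling error.
    import math

    produced = 174_091
    projected_responsive = 12_705
    observed_responsive = 5_796
    sample_size = 1_000                               # assumed validation sample size

    p_hat = projected_responsive / produced           # ~7.3% projected precision
    se = math.sqrt(p_hat * (1 - p_hat) / sample_size) # binomial standard error
    low, high = p_hat - 1.96 * se, p_hat + 1.96 * se  # ~95% confidence interval
    observed = observed_responsive / produced         # ~3.3% observed

    print(f"projected {p_hat:.1%}, 95% CI [{low:.1%}, {high:.1%}], observed {observed:.1%}")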

It’s not clear why there should be this large discrepancy.  The training set was judged by the receiving party, though we do not know if the same person did the training and the final assessment of the responsiveness of the produced documents.  Reviewers often differ in the documents they call responsive.  If different people trained and tested the system, then there could be a substantial difference in their responsiveness judgments.

It’s possible that different standards were applied when training predictive coding than when assessing the final product.  During training, the reviewer might use a looser criterion for what would be responsive than during the final review.  During training, the reviewer might seek to include documents that are marginally responsive in order to get a more complete production, but then, once that production has been delivered, use a narrower criterion for what is actually useful in the case.  We have no information about the validity of this speculation in this case.

Given that the Commissioner found only 5,796 documents of the produced set to be responsive, the actual Recall rate is likely to be substantially lower than the nominal 95%, but we have no way to estimate that from the available information.  As far as I am aware, this discrepancy was not mentioned by the Commissioner when petitioning for additional documents to be produced.

Even if we accept the nominal Precision and Recall measures proffered by the petitioners, the predictive coding performance in this case is quite poor.  The second graph shows this case in comparison to some others, measured as Precision and Recall.

There are at least four possible explanations for this low level of predictive coding performance. 

  1. Predictive coding does not work
  2. The specific predictive coding application used in this case does not work
  3. Insufficient training examples were provided
  4. Inaccurate training examples were provided


Given the wide use of predictive coding in eDiscovery, and its success, as shown by the graph above, and given the use of similar machine learning technologies in other areas, such as spam filtering, I think that we can categorically reject the first possibility. It would be a serious mistake, I believe, to throw out a class of technologies because of this case.  Predictive coding has frequently been found to work quite well and this case is an outlier.

The second explanation is slightly more plausible.  It appears that this project used Ringtail Visual Predictive Coding.  The petitioners’ expert was James Scarazzo of FTI, so it would make sense to use a predictive coding system used by his company. The table showing the Precision at each Recall level is similar to one on the Ringtail website.

FTI is a very reputable company, and its Ringtail software is widely used.  Nonetheless, an analysis of the results presented on their website, promoting the use of their software, shows similar levels of performance (low Precision at high levels of Recall).  So it is possible that their predictive coding software is somehow limited.  It is, of course, also possible the software is good, but that the problem is in their marketing; perhaps they chose an unflattering example for their website.

The third explanation is, I think, far more likely.  The total population of documents was about 406,000.  Of these, the Commissioner found only 5,796 to be truly responsive; that is, about 1.4% of the total collection was found to be responsive.  The petitioners claimed that 12,705 documents were responsive, but even that is only 3.1% of the whole document set.  In the original randomly selected training set of 1,000 documents, therefore, there were probably only about 14 to 31 responsive ones (1.4% or 3.1% of 1,000).  The second stage of training, where they tried to focus on documents more likely to be responsive, may have added more positive examples, but even 50 or 75 responsive documents may not be enough to effectively train a predictive coding system.
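The arithmetic behind those figures is easy to verify:

    # Checking the prevalence arithmetic in the paragraph above.
    collection = 406_000
    found_by_commissioner = 5_796
    claimed_by_petitioners = 12_705
    training_sample = 1_000

    low = found_by_commissioner / collection          # ~1.4%
    high = claimed_by_petitioners / collection        # ~3.1%

    print(f"prevalence between {low:.1%} and {high:.1%}")
    print(f"expected responsive training examples: {low * training_sample:.0f} to {high * training_sample:.0f}")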

The fourth possible explanation is in many ways the most likely.  Effectively training a predictive coding system depends on the validity, consistency, and representativeness of the training set.  These are factors that people control—independent of the technology. 

Random samples provide a reasonable means of assuring a representative sample of documents. 

The validity and consistency of the review, on the other hand, may be problems.  If the review of the training documents was delegated to someone with a low level of expertise (low validity) in these matters or someone who was distracted (low consistency) during the review, then the documents used to train the system may not have been accurately or consistently categorized. 

For example, in one matter that I worked on, several different people did predictive coding on the same set of documents to identify documents that were responsive to very similar issues.  One of those people did the predictive coding training in a couple of days, reviewing a couple of thousand documents.  The others tried to do the training over several weeks, doing only a few documents at a time.  The concentrated training resulted in very high predictive coding performance; the piecemeal training resulted in relatively poor performance, apparently because it was difficult to maintain a consistent perspective on responsiveness over a long time with many interruptions.  The same software used with the same data resulted in different levels of success depending on how it was used.  A poor training set would almost certainly lead to a poor outcome.

Conclusion

There may be other factors that could contribute to the poor performance on this predictive coding task.  It is important to keep in mind that even the most powerful predictive coding system is still just a tool used by humans.

The power of a categorization system, such as predictive coding, is its ability to separate the document classes from one another (e.g., the responsive from the non-responsive documents).  For a system with any amount of power, specific levels of Recall can be achieved by adjusting the criterion for what one calls responsive, accepting more or fewer true positives and therefore more or fewer false positives.  Achieving high levels of Recall by itself, therefore, does not mean a powerful system, because when high levels of Recall are accompanied by high levels of false positives, there is very little separation at all.  A more powerful system is one in which the proportion of truly responsive documents grows more quickly than the proportion of false positives as this criterion is lowered.  A more powerful system will achieve high Recall while producing few false positives.  In this light, the system used by Dynamo Holdings was not very powerful.  Rather than separating the responsive from the non-responsive, it simply provided both.

It is important to remember that a system, particularly in eDiscovery, consists not just of the software used to implement the machine learning, but also of the training examples and other methods used.  People are a critical part of a predictive coding system and, by some measures, they are the most error-prone part.

Predictive coding is not magic.  You don’t get something for nothing.  What you do get is a tool that makes the most out of relatively small amounts of effort.  Unsupervised, the computer has no way to distinguish what is legally important from what is not; it still requires human judgment to guide it.  The computer then amplifies that judgment, but it can amplify poor judgment as well as good judgment.

Effective predictive coding requires good technology, good methods for applying that technology, and good judgment to guide the technology.  At least one of those appears to have been missing in this case.