Monday, July 13, 2020

Getting the inside straight on eDiscovery


One of the beneficial outcomes of the expansion of tools like predictive coding is that lawyers have managed to embrace some level of numerical thinking.  When discussing how to identify relevant documents in a collection, lawyers now readily use terms like Recall, Precision, and Confidence Interval.  I think that this has been very beneficial for the field, in part because it has added substance to our understanding of such concepts as “reasonable” and “proportional.”  In this essay I want to go beyond the typical eDiscovery numbers and dig into some of the internals of machine learning categorizers.  These numbers can also be important to the strategy and tactics of eDiscovery search.

Many eDiscovery practitioners are familiar with concepts like Precision and Recall.  Precision measures the exclusivity of a search process and Recall measures the completeness of that process.  Of the documents that are predicted to be relevant, what proportion actually are?  That’s Precision.  Of the documents that actually are relevant, what proportion of them have been identified? That’s Recall.  A complementary measure that is sometimes used is Elusion (Roitblat, 2007).  Of the documents that have been rejected as non-relevant, what proportion are actually relevant?  These measures are a few of those available to assess the success of any search process.
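
To make these definitions concrete, here is a minimal sketch in Python, using hypothetical review counts rather than data from any real matter, of how the three measures are computed from the four possible outcomes of a review decision.

# Hypothetical review counts -- not data from any real matter.
true_pos = 80     # predicted relevant, actually relevant
false_pos = 20    # predicted relevant, actually non-relevant
false_neg = 10    # predicted non-relevant, actually relevant
true_neg = 890    # predicted non-relevant, actually non-relevant

precision = true_pos / (true_pos + false_pos)   # exclusivity of the search
recall = true_pos / (true_pos + false_neg)      # completeness of the search
elusion = false_neg / (false_neg + true_neg)    # relevant documents among the rejects

print(f"Precision={precision:.2f}  Recall={recall:.2f}  Elusion={elusion:.3f}")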

Recall and Precision assume that documents can be classified reliably into one group or the other.  In machine learning, that is called a binary decision.  In addition, some machine learning systems compute a score for each document.  Those documents that receive a high score are designated as relevant, and those that receive a low score are designated non-relevant.  I want to talk about where those scores come from and how they can affect Recall and Precision.
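
As a simple illustration (a sketch, not any vendor's implementation), the default decision rule can be written in a couple of lines: any document whose score reaches 0.5 is assigned to the relevant category.

import numpy as np

# Hypothetical scores a categorizer might assign to six documents.
scores = np.array([0.93, 0.71, 0.52, 0.48, 0.22, 0.05])

# Default rule: assign each document to the more likely category,
# i.e., call it relevant when its score is at least 0.5.
predicted_relevant = scores >= 0.5
print(predicted_relevant)    # [ True  True  True False False False]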

For this analysis, I used a set of texts collected to support training and testing of a machine learning process intended to identify microaggressions.  Racism and sexism not only infect our institutions, but they have woven their way deep into the fabric of our interpersonal communications.  These latent attitudes often manifest themselves in microaggressions, which are uncivil, sometimes subtle, sometimes veiled, and sometimes unconscious communication patterns that manifest social bias.  Many technology companies, for example, are examining their use of such biased terms as “white lists,” which refer to things that are known to be good or permitted, and “black lists,” which refer to things known to be bad.  I don’t believe that the association between white and good versus black and bad is intended as an overtly racist position, but these small patterns can add up in important ways to communicate that some members of our community are less worthy than others.
Microaggressive communications are often subtle, relative to overt aggression, and so present a significant challenge to machine learning.  The communications collected for this study were selected to help people identify potentially harmful statements before they are transmitted.  But they also provide an opportunity to explore some of the numerical properties of machine learning similar to that used in eDiscovery, for example, in predictive coding.

A few example microaggressive statements:
  • At least I don't sit on my ass all day collecting welfare, I EARN my money.
  • Do you girls have children yet? How old are you? Oh, well you will have them soon.
  • Don't you think it would be better if gay people adopted the older kids that nobody else wants, leaving the babies for normal people?
  • He's totally white on the inside.

The communications used in this project tend to be shorter than typical emails, but they are otherwise similar to emails in many ways. 

The data are publicly available from a study done by Luke Breitfeller and colleagues who mounted a substantial effort to collect, identify, and label microaggressive statements from social media.  See their paper for details.

There are over 200 machine learning methods that can be used to categorize documents.  According to many studies, most of them are capable of producing about the same levels of Precision and Recall.  Not all of them, of course, are used in current eDiscovery products, but understanding the differences among them helps us to understand what the numbers mean.

For this analysis, I looked at the eight machine learning categorizers listed in Table 1.  Two thirds of the documents were used for the training set and the remaining documents were used for the test set, a split that is commonly used in studies of machine learning.  The results described in this article all concern performance on the held-out test set.  All systems were trained on the same training set and tested with exactly the same test set.  Because the numbers of microaggressive and non-aggressive documents were close, and because they were identical for all systems, accuracy provides a reasonable estimate of each system’s success and is a fair measure for comparing one system against another.  Another measure, F1, is the harmonic mean of Precision and Recall and gives a good overall account of a system’s performance.  One thing this table shows is that arguments about which method is best for eDiscovery are probably misplaced, because they are all about equally good.  These accuracies, by the way, were derived from the default versions of these categorizers; no effort was made to explore a range of parameters that might make them more accurate.  When trained on the same data in the same way, though, they are all remarkably similar in their outcomes.  (A sketch of this kind of evaluation pipeline appears just after Table 1.)

Table 1. The machine learning methods used in the present study along with measures of their accuracy.
Categorizer                    Precision (%)   Recall (%)   F1 (%)   Accuracy
Multinomial Bayes                   89             71          79       0.82
SGDClassifier                       76             75          76       0.77
Random Forest                       89             62          73       0.79
Gaussian Process Classifier         83             75          79       0.81
MLP Neural Network                  76             75          76       0.77
Logistic Regression                 89             67          76       0.81
SVM SVC                             92             66          77       0.82
ReLU Neural Network                 78             75          76       0.78
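
For readers who want to see what this kind of comparison looks like in code, here is a minimal sketch using scikit-learn’s default categorizers.  The eight documents are just the four example microaggressions quoted above plus four innocuous placeholder sentences, so the numbers it prints will not reproduce Table 1; the feature extraction and the settings of the two neural networks are my assumptions, not a description of the study’s actual code.

# Sketch only: toy corpus, default scikit-learn settings.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

texts = [
    "At least I don't sit on my ass all day collecting welfare, I EARN my money.",
    "Do you girls have children yet? How old are you? Oh, well you will have them soon.",
    "Don't you think it would be better if gay people adopted the older kids "
    "that nobody else wants, leaving the babies for normal people?",
    "He's totally white on the inside.",
    "The meeting has been moved to three o'clock on Thursday.",
    "Please send me the final version of the quarterly report.",
    "The printer on the third floor is out of toner again.",
    "Let's review the project budget when you get a chance.",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]   # 1 = microaggression, 0 = non-aggression

# Two thirds of the documents for training, the rest held out for testing.
X = TfidfVectorizer().fit_transform(texts).toarray()
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=1 / 3, stratify=labels, random_state=0)

categorizers = {
    "Multinomial Bayes": MultinomialNB(),
    "SGDClassifier": SGDClassifier(),
    "Random Forest": RandomForestClassifier(),
    "Gaussian Process Classifier": GaussianProcessClassifier(),
    "MLP Neural Network": MLPClassifier(max_iter=2000),          # settings assumed
    "Logistic Regression": LogisticRegression(),
    "SVM SVC": SVC(),
    "ReLU Neural Network": MLPClassifier(activation="relu", max_iter=2000),  # settings assumed
}

for name, clf in categorizers.items():
    clf.fit(X_train, y_train)
    pred = clf.predict(X_test)
    print(f"{name:28s} "
          f"P={precision_score(y_test, pred, zero_division=0):.2f} "
          f"R={recall_score(y_test, pred, zero_division=0):.2f} "
          f"F1={f1_score(y_test, pred, zero_division=0):.2f} "
          f"Acc={accuracy_score(y_test, pred):.2f}")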

I will describe these classifiers in more detail later.  For now, it is sufficient to know that these categorizers represent a variety of the available supervised machine learning methods. 

The goal of each of these machine learning systems is to divide the test set documents into two groups, one group containing the microaggressive communications and one containing the other communications.  All eight systems perform about equally well on this challenging task.  However, the patterns by which they assign documents to the two groups differ substantially.  If all we cared about was the ultimate categorization of communications, the internal differences among systems would not matter at all.  But if we want to examine the process in more detail, then these specific differences do matter.  By default, each of these categorizers uses a decision rule that assigns communications to the more likely category (microaggression or nonaggression).  Each one of them computes its prediction of this likelihood by assigning each document a score between 0.0 and 1.0.  We often talk about these scores as probabilities, but, in fact, they are not very accurate estimates of probability, as shown in Figure 1.  This chart is called a calibration curve.  It shows how the predicted probability of relevance compares with the actual probability of relevance.
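
A calibration curve like the ones in Figure 1 can be computed along the following lines.  The sketch below uses scikit-learn’s calibration_curve on synthetic data just to keep it self-contained; the actual figure, of course, used the microaggression test set and each of the eight categorizers.

# Sketch: computing a calibration curve (synthetic data).
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1 / 3, random_state=0)

model = LogisticRegression().fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]    # the per-document "probability" scores

# Bin the scores and compare the mean predicted score in each bin with the
# observed proportion of positive (relevant) documents in that bin.
observed, predicted = calibration_curve(y_test, scores, n_bins=10)

plt.plot(predicted, observed, marker="o", label="categorizer")
plt.plot([0, 1], [0, 1], linestyle="--", label="perfect calibration")
plt.xlabel("Predicted probability of relevance")
plt.ylabel("Observed proportion relevant")
plt.legend()
plt.show()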

We can turn to weather forecasts to get a better idea of what it would mean for the scores assigned to documents to correspond to actual probabilities.  If a weather forecaster were accurate, we would expect the forecaster’s estimate of the probability of rain (the score) to match closely the rate at which it did in fact rain.  We would expect to see rain on 20% of the days for which the forecaster predicted a 20% chance of rain.  We would expect to see rain on 50% of the days for which the forecaster predicted a 50% chance of rain.  Likewise, if a model predicted a 20% chance of relevance, then we should expect that 20% of the documents with this prediction would, in fact, be relevant.  The model’s predicted “probability” is shown along the horizontal axis and the observed proportion of documents that were actually relevant is shown along the vertical axis.  Like the expert weather forecaster, we would expect these two proportions (predicted and observed) to be very similar, lying on or near the dashed line.  Obviously, they are not.

As with weather forecasts, we divide the predicted probabilities into ranges (bins) and, for each bin, compare the predicted proportion of relevant communications with the proportion actually observed.  The scores are ranked in approximately the same order as the true probabilities, but they do not correspond directly to probabilities.  There are even dips in the curves of Figure 1, meaning that some higher-score bins contain a smaller proportion of actually relevant documents than lower-score bins do.
The scores differ from one categorizer to the next.  Again, if we only looked at whether a document is more or less likely to be in the positive category, we would get approximately the same Recall and Precision from all of these systems (as shown in Table 1).  But if we want to examine the scores in more detail, for example as part of a predictive coding protocol, we might be surprised at how much they differ.  These models all used a score cutoff that assigns documents to the more likely category, but parties often disagree about where to put the cutoff between documents that are classified as relevant and those that are not.  We see from Figure 1 that the relationship between scores and probabilities is complex.  The scores do not actually represent the probability that a document is relevant (the dashed line), so basing a strategy on the assumption that they do could be very misleading and inappropriate.
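
To see how much the placement of the cutoff matters, here is a small sketch (with synthetic scores and labels, not the study’s data) that reports Precision and Recall at three different cutoffs; raising the cutoff typically trades Recall for Precision.

# Sketch: how the placement of the cutoff changes Precision and Recall.
import numpy as np
from sklearn.metrics import precision_score, recall_score

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=1000)            # synthetic relevance labels
scores = labels * 0.3 + rng.random(1000) * 0.7    # synthetic, noisy scores

for cutoff in (0.3, 0.5, 0.7):
    pred = (scores >= cutoff).astype(int)
    print(f"cutoff={cutoff:.1f}  "
          f"Precision={precision_score(labels, pred, zero_division=0):.2f}  "
          f"Recall={recall_score(labels, pred, zero_division=0):.2f}")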

Figure 1. Calibration curves for eight machine learning categorizers.  The dashed line corresponds to a perfectly calibrated categorizer.


Some eDiscovery protocols demand that documents with nonzero scores be assessed, or at least sampled, even though they are not classified as relevant.  The kind of data shown in Figure 1 makes it very difficult to predict the value of such examinations.  For example, if an eDiscovery protocol mandates examining documents with intermediate scores, perhaps sampling from each decade of scores (for example, those with scores between 0.4 and 0.5 and those between 0.5 and 0.6), then the pattern of results we might observe could be very different, and have different potential value, depending on the machine learning method that was used.
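
A protocol like that amounts to something like the following sketch: bin the documents by score decade and draw a sample from each bin.  The scores and document identifiers here are placeholders; in practice they would come from the categorizer and the review platform.

# Sketch: sampling documents from each decade of scores.
import numpy as np

rng = np.random.default_rng(0)
scores = rng.random(1000)             # placeholder scores; use the categorizer's scores
doc_ids = np.arange(len(scores))      # placeholder document identifiers

sample_per_decade = 10
for low in np.arange(0.0, 1.0, 0.1):
    in_bin = doc_ids[(scores >= low) & (scores < low + 0.1)]
    if len(in_bin) == 0:
        continue
    k = min(sample_per_decade, len(in_bin))
    sample = rng.choice(in_bin, size=k, replace=False)   # documents to examine
    print(f"{low:.1f}-{low + 0.1:.1f}: {len(in_bin):4d} documents in bin, {len(sample)} sampled")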


Figure 2. The distribution of documents for each category in each score range for eight machine learning categorizers. P indicates relevant communications (microaggressions). N indicates non-relevant communications (non-aggressions).


Figure 2 shows a different view of these data.  The orange bars represent the number of non-relevant (non-aggressive) communications in the designated score bin.  The blue line shows the number of relevant (aggressive) communications in the corresponding score bin.  Each classifier has a unique way of distributing communications to score bins.  The first thing to note is that the range of scores for the two categories of documents overlaps significantly.  Four of the machine learning systems score a substantial number of documents near the middle of the score range (Multinomial Bayes, Random Forest, Gaussian Process Classifier, and Logistic Regression) and the remaining systems tend to score documents near the edges of the range.
The differences in scoring patterns across classifiers reinforce the concern raised above about protocols in which parties expect to examine communications with scores that are not near the endpoints of the scoring range.  With a Multinomial Bayes model, for example, the decades between 0.4 and 0.6 would contain substantial numbers of both relevant and non-relevant communications.  But the SVM SVC model places only a few documents in this range, and the neural network models would yield almost none.  SVM (support vector machine) categorizers are common in eDiscovery, so understanding how documents are scored in this context may be very important.
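
The view in Figure 2 can be reproduced, for any one categorizer, by counting how many documents of each category land in each score decade.  The sketch below uses placeholder scores and labels just to show the bookkeeping.

# Sketch: tabulating relevant (P) and non-relevant (N) documents by score decade.
import numpy as np

rng = np.random.default_rng(1)
y_test = rng.integers(0, 2, size=1000)             # placeholder relevance labels
scores = y_test * 0.3 + rng.random(1000) * 0.7     # placeholder categorizer scores

bins = np.linspace(0.0, 1.0, 11)                        # ten score decades
p_counts, _ = np.histogram(scores[y_test == 1], bins)   # relevant (microaggressions)
n_counts, _ = np.histogram(scores[y_test == 0], bins)   # non-relevant (non-aggressions)

for low, p, n in zip(bins[:-1], p_counts, n_counts):
    print(f"{low:.1f}-{low + 0.1:.1f}:  P={p:4d}  N={n:4d}")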

We will leave as an open question for now whether examination of those intermediate-scoring documents is actually useful, but it is clear that whatever information it does supply will be strongly affected by the method used to identify relevant documents.

What the scores mean


It should be clear from the above figures that the scores produced by a categorizer do not reflect directly the probability that a document is relevant.  Nor do they reflect the degree of relevance.  The probability that a document is relevant is different from the document’s relevance.  The probability that it will rain is generally uncorrelated with the amount of rain that will fall.  Documents that score higher may be more likely to be relevant, they may be more like other relevant documents, or they may be more different from non-relevant documents, but higher scores do not mean that they are necessarily more relevant. 

For example, taller, heavier people are more likely to be men than women, but one would not conclude that taller, heavier people are more manly than smaller people.  Being more likely to be a man is unrelated to the degree of “manliness.”  Maybe it is just my taste, but I do not believe, for example, that John Candy (6’2”, 300 pounds) was more manly than Humphrey Bogart (5’8”, 150 pounds, #9 on IMDb’s list of most manly actors).  Similarly, being more likely to be a relevant document is unrelated to the degree of relevance.

Relevance is a complex issue, depending in part on the information needs of the parties.  Classifiers, on the other hand, only have available the content of the communication and its similarity to other communications.  Machine learning is possible because we assume that similar items should be treated similarly.  Machine learning does not simply memorize the correct class for each communication, because then it could never generalize to new communications.  Similarity does not always correspond to relevance either.  For example, an email that contained the phrase “you’re fired, you’re fired, you’re fired” might score higher than one that said only “you’re fired,” but the second one may be more serious and more important to the question you are trying to answer.
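
A quick illustration of that last point (a sketch, not the study’s code): a bag-of-words representation gives the repeated phrase three times the weight, even though the two messages mean essentially the same thing, so a similarity-based categorizer may well score it higher.

# Sketch: repetition inflates term counts, and with them similarity-based scores.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["you're fired you're fired you're fired", "you're fired"]
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs).toarray()

print(vectorizer.get_feature_names_out())   # ['fired' 're' 'you']
print(counts)                               # [[3 3 3]
                                            #  [1 1 1]]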

The scores are a mathematical product of the algorithms used by the system.  Scoring functions are often complex and can be quite subtle.  Some words are more informative than others.  Two documents near a boundary may differ in many words, and identifying the contribution of each of these words to the distinction between categories would be a formidable exercise, with little expected value.

The scores produced by machine learning methods depend on the specific algorithms employed.  Each machine learning method employs its own way to compute the similarity among documents.  But, as described, all of the categorizers studied here produce about the same level of classification accuracy, so the ability of these systems to categorize the communications is not in doubt.  The point I want to make is that even though the classifiers agree on the overall classification, how they come to that classification can be very different.

The internal properties of machine learning categorizers are important for understanding just what the scores mean.  These properties become critical when the parties want to go beyond the binary decisions and try to exploit internal details.  Asking the right questions, and the value of the answers one gets to them, depends critically on just what information a categorizer uses and how it uses it.  The very strategy of eDiscovery processing may depend on a proper understanding of these properties.  Our intuitions about what we will find when we look at particular documents may not match what is actually there.