One of the beneficial outcomes of the expansion of tools
like predictive coding is that lawyers have managed to embrace some level of
numerical thinking. When discussing how
to identify relevant documents in a collection, lawyers now readily use terms
like Recall, Precision, and Confidence Interval. I think this shift has been good
for the field, in part because it has added substance to our understanding of
such concepts as “reasonable” and “proportional.” In this essay I want to go beyond the typical
eDiscovery numbers and dig into some of the internals of machine learning
categorizers. These numbers can also be
important to the strategy and tactics of eDiscovery search.
Many eDiscovery practitioners are familiar with concepts
like Precision and Recall. Precision
measures the exclusivity of a search process and Recall measures the
completeness of that process. Of the
documents that are predicted to be relevant, what proportion actually are? That’s Precision. Of the documents that actually are relevant,
what proportion of them have been identified? That’s Recall. A complementary measure that is sometimes used
is Elusion (Roitblat, 2007). Of the
documents that have been rejected as non-relevant, what proportion are actually
relevant? These measures are a few of those
available to assess the success of any search process.
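To make these definitions concrete, here is a minimal sketch in Python; the counts are invented for illustration and do not come from any real matter.

```python
# Hypothetical counts from a hypothetical review.
true_positives = 800    # relevant documents predicted relevant
false_positives = 200   # non-relevant documents predicted relevant
false_negatives = 400   # relevant documents predicted non-relevant
true_negatives = 8600   # non-relevant documents predicted non-relevant

# Precision: of the documents predicted relevant, what proportion actually are?
precision = true_positives / (true_positives + false_positives)

# Recall: of the documents that actually are relevant, what proportion were found?
recall = true_positives / (true_positives + false_negatives)

# Elusion: of the documents rejected as non-relevant, what proportion are relevant?
elusion = false_negatives / (false_negatives + true_negatives)

print(f"Precision {precision:.2f}, Recall {recall:.2f}, Elusion {elusion:.3f}")
# Precision 0.80, Recall 0.67, Elusion 0.044
```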
Recall and Precision assume that documents can be classified
reliably into one group or the other. In
machine learning, that is called a binary decision. In addition, some machine learning systems
compute a score for each document. Those
documents that receive a high score are designated as relevant, and those that
receive a low score are designated non-relevant. I want to talk about where those scores come
from and how they can affect Recall and Precision.
For this analysis, I used a set of texts collected to
support training and testing of a machine learning process intended to identify
microaggressions. Racism and sexism not
only infect our institutions, but they have woven their way deep into the
fabric of our interpersonal communications.
These latent attitudes often manifest themselves in microaggressions, which
are uncivil, sometimes subtle, sometimes veiled, and sometimes unconscious communication
patterns that manifest social bias. Many
technology companies, for example, are examining their use of such biased terms
as “white lists,” which refer to things that are known to be good or permitted,
and “black lists,” which refer to things known to be bad. I don’t believe that the association between
white and good versus black and bad is intended as an overtly racist position,
but these small patterns can add up in important ways to communicate that some
members of our community are less worthy than others.
Microaggressive communications are often subtle, relative to
overt aggression, and so present a significant challenge to machine learning. The communications collected for this study were
selected to help people identify potentially harmful statements before they are
transmitted. But they also provide an
opportunity to explore some of the numerical properties of machine learning
methods similar to those used in eDiscovery, for example, in predictive coding.
A few example microaggressive statements:
- At least I don't sit on my ass all day collecting welfare, I EARN my money.
- Do you girls have children yet? How old are you? Oh, well you will have them soon.
- Don't you think it would be better if gay people adopted the older kids that nobody else wants, leaving the babies for normal people?
- He's totally white on the inside.
The communications used in this project tend to be shorter
than typical emails, but they are otherwise similar to emails in many
ways.
The data are publicly available from a study done by Luke
Breitfeller and colleagues who mounted a substantial effort to collect,
identify, and label microaggressive statements from social media. See their paper for details.
There are over 200 machine learning methods that can be used
to categorize documents. According to
many studies, most of them are capable of producing about the same levels of
Precision and Recall. Not all of them, of course, are used in current
eDiscovery products. But understanding
the differences among them helps us understand what the numbers mean.
For this analysis, I looked at the eight machine learning
categorizers listed in Table 1. Two thirds of the documents were used for the
training set and the remaining documents were used for the test set, a method
that is commonly used in studies of machine learning. The results described in this article all
concern the performance of the models on the holdout test set. All systems were trained with the same
training set and tested with exactly the same test set. Because the numbers of microaggressive and
nonaggressive documents were close and because they were identical for all systems,
accuracy provides a reasonable estimate of the success of each system and a fair
measure for comparing one system against another. Another measure, F1, is
the harmonic mean of Precision and Recall and gives a good overall account of
a system's performance.
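To make the formula concrete, take the Multinomial Bayes row of Table 1: F1 = 2 × (Precision × Recall) / (Precision + Recall) = 2 × (89 × 71) / (89 + 71) ≈ 79%.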
One of the things that Table 1 shows is that arguments about which method
is the best for eDiscovery are probably misplaced because they are all about
equally good. These accuracies, by the way,
were derived from the default versions of these categorizers; no effort was
made to explore a range of parameters that might make them more accurate. When trained on the same data in the same way,
though, they are all remarkably similar in their outcomes.
Table 1. The machine learning methods used in the present study along with measures of their accuracy.

| Categorizer | Precision (%) | Recall (%) | F1 (%) | Accuracy |
|---|---|---|---|---|
| Multinomial Bayes | 89 | 71 | 79 | 0.82 |
| SGDClassifier | 76 | 75 | 76 | 0.77 |
| Random Forest | 89 | 62 | 73 | 0.79 |
| Gaussian Process Classifier | 83 | 75 | 79 | 0.81 |
| MLP Neural Network | 76 | 75 | 76 | 0.77 |
| Logistic Regression | 89 | 67 | 76 | 0.81 |
| SVM SVC | 92 | 66 | 77 | 0.82 |
| ReLU Neural Network | 78 | 75 | 76 | 0.78 |
I will describe these classifiers in more detail later. For now, it is sufficient to know that these categorizers
represent a variety of the available supervised machine learning methods.
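For readers who want to see the shape of such a comparison, here is a minimal sketch using scikit-learn, which provides classifiers with these names. It is not the exact code behind Table 1: the load_texts() helper is a placeholder for reading the labeled data, the tf-idf vectorizer is my assumption, and only a subset of the eight categorizers is listed, each run with its default settings.

```python
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import SGDClassifier, LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score

texts, labels = load_texts()   # placeholder: documents and 0/1 labels
X = TfidfVectorizer().fit_transform(texts)

# Two thirds for training, the remaining third held out for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=1/3, random_state=0)

categorizers = {
    "Multinomial Bayes": MultinomialNB(),
    "SGDClassifier": SGDClassifier(),
    "Random Forest": RandomForestClassifier(),
    "Logistic Regression": LogisticRegression(),
    "SVM SVC": SVC(probability=True),   # probability=True exposes predict_proba
    "MLP Neural Network": MLPClassifier(),
}

for name, clf in categorizers.items():
    clf.fit(X_train, y_train)           # default parameters, no tuning
    pred = clf.predict(X_test)
    print(name,
          f"P={precision_score(y_test, pred):.2f}",
          f"R={recall_score(y_test, pred):.2f}",
          f"F1={f1_score(y_test, pred):.2f}",
          f"Acc={accuracy_score(y_test, pred):.2f}")
```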
The goal of each of these machine learning systems is to divide
the test set documents into two groups, one group containing the microaggressive
communications and one containing the other communications. All eight systems perform about equally well
on this challenging task. However, the
patterns by which they assign documents to the two groups differ substantially. If all we cared about was the ultimate
categorization of communications, the internal differences among systems would
not matter at all. But if we want to
examine the process in more detail, then these specific differences do
matter. By default, each of these
categorizers uses a decision rule that assigns communications to the more likely
category (microaggression or nonaggression).
Each one of them computes its prediction of this likelihood by assigning
each document a score between 0.0 and 1.0.
We often talk about these scores as probabilities, but, in fact, they
are not very accurate estimates of probability, as shown in Figure
1. This chart is called a calibration
curve. It shows how the predicted
probability of relevance compares with the actual probability of relevance.
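Calibration curves like those in Figure 1 can be drawn with scikit-learn's calibration_curve function. A minimal sketch, assuming clf is one of the fitted categorizers from the earlier sketch and X_test, y_test are the held-out documents and labels:

```python
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve

# Scores (predicted "probabilities") for the relevant class on the test set.
scores = clf.predict_proba(X_test)[:, 1]

# Bin the scores and compare each bin's mean predicted probability with the
# observed proportion of relevant documents in that bin.
observed, predicted = calibration_curve(y_test, scores, n_bins=10)

plt.plot(predicted, observed, marker="o", label="categorizer")
plt.plot([0, 1], [0, 1], linestyle="--", label="perfectly calibrated")
plt.xlabel("Predicted probability of relevance")
plt.ylabel("Observed proportion relevant")
plt.legend()
plt.show()
```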
We can turn to weather forecasts to get a better idea of
what it would mean for the scores assigned to documents to correspond to actual
probabilities. If a weather forecaster
were to accurately predict rain, then we would expect that the forecaster’s
estimate of the probability of rain (the score) would match closely the rate at
which it did in fact rain. We would
expect to see rain on 20% of the days for which the forecaster predicted a 20%
chance of rain. We would expect to see
rain on 50% of the days for which the forecaster predicted a 50% chance of
rain. If a model predicted a 20% chance of relevance, then we should expect
that 20% of the documents with this prediction would, in fact, be relevant. The model’s predicted “probability” is shown
along the horizontal axis and the observed proportion of documents that were
actually relevant is shown along the vertical axis. Like the expert weather forecaster, we would
expect that these two proportions (predicted and observed) should be very
similar, lying on or near the dashed line.
Obviously, they are not.
As with weather forecasts, we divide the predicted
probabilities into ranges (bins) and compare the predicted probability in each bin
with the proportion of communications in that bin that are actually relevant. The scores
are approximately ranked in the same order as the probabilities, but they do
not correspond directly with probabilities. There are dips in the curves of Figure 1,
meaning that some higher-scoring bins contain a lower proportion of actually relevant
documents than lower-scoring bins.
The scores differ from one categorizer to the next. Again, if we only looked at whether a document
is more or less likely to be in the positive category, we would get approximately
the same Recall and Precision from all of these systems (as shown in Table 1),
but if we want to examine the scores in more detail, for example as part of a
predictive coding protocol, we might be surprised at how much they differ. These models all used a score cutoff
that assigns documents to the more likely category, but parties do often
disagree about where to put the cutoff between documents that are classified as
relevant and those that are not. We see from Figure 1
that the relationship between score and relevance is very complex.
The scores do not actually represent the probability that a document is
relevant (the dashed line), so basing a strategy on the assumption that
they do could be very misleading and inappropriate.
Figure 1. Calibration curves for eight machine learning categorizers. The dashed line corresponds to a perfectly calibrated categorizer.
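To see how much the choice of cutoff matters in practice, one can simply re-threshold the scores. Another hedged sketch, again assuming a fitted clf with a predict_proba method and the same held-out test set:

```python
from sklearn.metrics import precision_score, recall_score

scores = clf.predict_proba(X_test)[:, 1]

# The default decision rule assigns each document to the more likely category,
# which in a two-class problem amounts to a cutoff of 0.5.
for cutoff in (0.3, 0.5, 0.7):
    pred = (scores >= cutoff).astype(int)
    print(f"cutoff={cutoff:.1f}",
          f"Precision={precision_score(y_test, pred, zero_division=0):.2f}",
          f"Recall={recall_score(y_test, pred, zero_division=0):.2f}")
```

Because the scores are not calibrated probabilities, the same numeric cutoff can mean something quite different from one categorizer to the next.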
Some eDiscovery protocols demand that documents with nonzero
scores be assessed or at least sampled, even though they are not classified as
relevant. The kind of data represented in Figure 1
makes it very difficult to predict the value of such examinations. For example, if an eDiscovery protocol
mandates examining documents with intermediate scores, perhaps sampling from
each decade of scores (for example, those with scores between 0.4 and 0.5 and
those between 0.5 and 0.6), then the pattern of results we might observe could
be very different, and have different potential value, depending on the machine
learning method that was used.
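As an illustration of what such a protocol involves, the sketch below bins the test-set scores into decades and draws a small sample from each; the bin edges and the sample size of 25 are invented for the example, and clf and X_test are again carried over from the earlier sketches.

```python
import numpy as np
import pandas as pd

scores = clf.predict_proba(X_test)[:, 1]
docs = pd.DataFrame({"doc_id": np.arange(len(scores)), "score": scores})

# Assign each document to a score decade: 0.0-0.1, 0.1-0.2, ..., 0.9-1.0.
docs["decade"] = pd.cut(docs["score"], bins=np.linspace(0, 1, 11),
                        include_lowest=True)

# How many documents land in each decade depends heavily on the categorizer.
print(docs["decade"].value_counts().sort_index())

# Draw up to 25 documents from each decade for manual examination.
sample = (docs.groupby("decade", observed=True)
              .apply(lambda g: g.sample(min(len(g), 25), random_state=0))
              .reset_index(drop=True))
```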
Figure 2. The distribution of documents for each category in each score range for eight machine learning categorizers. P indicates relevant communications (microaggressions). N indicates non-relevant documents (non-aggressions).
Figure 2 shows a different view of these data. The orange bars represent the number of non-relevant (non-aggressive) communications in the designated score bin. The blue line shows the number of relevant (aggressive) communications in the corresponding score bin. Each classifier has a unique way of distributing communications to score bins. The first thing to note is that the range of scores for the two categories of documents overlaps significantly. Four of the machine learning systems score a substantial number of documents near the middle of the score range (Multinomial Bayes, Random Forest, Gaussian Process Classifier, and Logistic Regression) and the remaining systems tend to score documents near the edges of the range.
The difference in scoring patterns across classifiers
reinforces the concern raised above for protocols in which parties expect to
examine communications with scores that are not near the endpoints of scoring
range. With the Multinomial Bayes model,
for example, the decades between 0.4 and 0.6 contain substantial numbers of
both relevant and non-relevant communications. But the SVM SVC model places only a few
documents in this range, and the neural network models yield
almost none. SVM (support vector machine) use is common
in eDiscovery, so understanding how documents are scored in this context may be
very important.
We will leave as an open question for now whether
examination of those intermediate-scoring documents is actually useful, but it
is clear that whatever information it does supply will be strongly affected by
the method used to identify relevant documents.
What the scores mean
It should be clear from the above figures that the scores
produced by a categorizer do not reflect directly the probability that a
document is relevant. Nor do they
reflect the degree of relevance. The
probability that a document is relevant is different from the document’s
relevance. Likewise, the probability that it will
rain is not the same thing as the amount of rain that will fall. Documents that score higher may be more
likely to be relevant, they may be more like other relevant documents, or they
may be more different from non-relevant documents, but higher scores do not mean that they are necessarily more
relevant.
For example, taller, heavier people are more likely to be men than
women, but one would not conclude that taller, heavier people are more manly
than smaller people. Being more likely
to be a man is unrelated to the degree of “manliness.” Maybe it is just my taste, but I do not
believe, for example, that John Candy (6’2”, 300 pounds) was more manly than
Humphrey Bogart (5’8”, 150 pounds, #9 on IMDb’s list of most manly actors). Similarly, being more likely to be a relevant
document is unrelated to the degree of relevance.
Relevance is a complex issue, depending in part on the information
needs of the parties. Classifiers, on
the other hand, only have available the content of the communication and its
similarity to other communications.
Machine learning is possible because we assume that similar items should
be treated similarly. Machine learning
does not just memorize the correct arbitrary class for each communication
because then it could never generalize to new communications. Similarity does not always correspond to
relevance either. For example, an email
that contained the phrase “you’re fired, you’re fired, you’re fired,” might
score higher than one that said only “you’re fired” but the second one may be
more serious and more important to the question you are trying to answer.
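A toy sketch makes the point concrete. The miniature training corpus and labels below are invented purely for illustration; with raw word counts and a linear model, repeating a heavily weighted phrase pushes the score up even though the repetition adds nothing to the seriousness of the message.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Invented miniature training set: 1 = relevant, 0 = not relevant.
train_texts = ["you're fired effective today",
               "your employment is terminated",
               "lunch is in the break room",
               "the quarterly meeting moved to friday"]
train_labels = [1, 1, 0, 0]

vec = CountVectorizer()
model = LogisticRegression().fit(vec.fit_transform(train_texts), train_labels)

tests = ["you're fired",
         "you're fired, you're fired, you're fired"]
for text, score in zip(tests, model.predict_proba(vec.transform(tests))[:, 1]):
    print(f"{score:.2f}  {text}")
# The repeated phrase receives the higher score, but that does not make it
# the more serious message.
```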
The scores are a mathematical product of the algorithms used
by the system. Scoring functions are
often complex and can be quite subtle. Some
words are more informative than others. Two
documents near a boundary may differ in many words, and identifying the
contribution of each of these words to the distinction between categories would
be a formidable exercise, with little expected value.
The scores produced by machine learning methods depend on
the specific algorithms employed. Each
machine learning method employs its own way to compute the similarity among
documents. But, as described, all of the
categorizers studied here produce about the same level of classification
accuracy, so the ability of these systems to categorize the communications is
not in doubt. The point I want to make
is that even though the classifiers agree on the overall classification, how
they come to that classification can be very different.
The internal properties of machine learning categorizers are
important for understanding just what the scores mean. These properties become
critical when the parties want to go beyond the final classifications and
try to exploit internal details.
Asking the right questions, and judging the value of the answers one gets to
them, depend critically on just what information a categorizer uses and how it
uses it. The very strategy of eDiscovery
processing may depend on a proper understanding of these properties. Our intuitions about what we will find when we look at particular documents may not match what is actually there.