Tuesday, March 9, 2021

When to stop searching: An example from continuous active learning

 

Making reasonable decisions in eDiscovery, as elsewhere, requires that we have reasonable expectations about the consequences of those decisions.  One of those decisions is when to stop searching for relevant documents.  For example, in continuous active learning, a commonly used approach to identifying relevant eDiscovery documents, the search process continues until a reasonable level of completeness has been achieved.  The reasonableness of any given level is usually judged against the cost of finding more relevant documents.  But there seems to be a hidden, and erroneous, assumption that continued effort will eventually yield all relevant documents and that the only question is what level of effort is reasonable in the context of a specific case.

The problem of when to stop searching is not limited to continuous active learning or any other specific method of discovery.  Every search process necessarily entails comparable questions about when to stop. My concern in this essay is not with the adequacy or efficiency of continuous active learning, but with what appears to be a fundamental misunderstanding of search effort and its consequences, a misunderstanding that undermines efforts to make reasonable decisions about that effort and when to stop.  I use continuous active learning as a more or less typical example of eDiscovery search methods.

In continuous active learning, a classifier (often a machine learning process called a support vector machine) is trained to predict which documents in the collection are likely to be relevant.  The most likely documents are shown to reviewers, who then either endorse or reverse the classifier’s prediction.  The reviewers’ decisions are added to the classifier’s training set, and the classifier again predicts the relevance of the remaining documents, that is, those that have not yet been seen.  Again, the documents predicted to be most likely to be relevant are presented to the reviewers, and the process repeats for some number of cycles.  The stopping rule determines when to terminate this process of classify-judge-predict cycles. 
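To make the loop concrete, here is a minimal sketch of a classify-judge-predict cycle in Python with scikit-learn.  The classifier choice, the batch size, the hypothetical human_review function standing in for the reviewers, and the placeholder stopping rule are all my assumptions for illustration, not a description of any particular product.

```python
from sklearn.svm import LinearSVC

def continuous_active_learning(X, seed_idx, seed_labels, human_review,
                               batch_size=100, max_rounds=50):
    """Sketch of a CAL loop: train, rank unseen documents, review the top batch, repeat."""
    labeled = dict(zip(seed_idx, seed_labels))
    unseen = set(range(X.shape[0])) - set(seed_idx)
    for _ in range(max_rounds):
        idx = list(labeled)
        clf = LinearSVC().fit(X[idx], [labeled[i] for i in idx])
        # Rank only the not-yet-reviewed documents, most likely relevant first.
        candidates = sorted(unseen,
                            key=lambda i: clf.decision_function(X[i:i+1])[0],
                            reverse=True)[:batch_size]
        if not candidates:
            break
        for i in candidates:              # reviewers endorse or reverse the prediction
            labeled[i] = human_review(i)
            unseen.discard(i)
        if sum(labeled[i] for i in candidates) == 0:
            break                         # placeholder stopping rule: a batch with no relevant documents
    return labeled
```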

The implicit assumption in this reasonableness judgment seems to be that if we just continue through enough cycles, eventually we will identify all of the relevant documents in the collection.  But that assumption is wrong and here’s why.

The success of any machine learning exercise depends on three things: the distinguishability of the data, the accuracy of the machine learning algorithm (for example, the support vector machine), and the quality of the human judgments that go into training and assessing the process.  In order to achieve complete Recall, all three of these error sources have to be reduced to zero.  Let’s take them in order.

The two graphs show hypothetical, simplified situations for a machine learning process that must distinguish the positive (orange) from the negative (blue) instances.  For any categorization system, the goal is to find a separator that puts all of the positive instances (orange dots) on one side and all of the negative instances (blue dots) on the other side of this separator.  That task is relatively simple for the first graph, where all of the positive instances are above and to the right of a diagonal line and only one negative instance would be included.

A relatively easy categorizing problem with little overlap between the positive (orange) set and the negative (blue) set.


In the second graph, the task is much more difficult because there is substantial overlap between the two groups.  No surface exists that will perfectly separate the positive from the negative instances.  A categorizer might reach 100% Recall with the data in the first graph, but it cannot do so with the data in the second without also including a lot of negative instances.  The distinction between positive and negative instances may be subtle, difficult, or obscure, and the ultimate accuracy of any decision system is limited by the ability of even a fully informed categorizer to make the right choices.
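The tradeoff can be seen with toy data (not the author’s figures): two overlapping clusters stand in for the second graph, and lowering the decision threshold to chase Recall inevitably drags in negatives.  The data, classifier, and thresholds below are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score, precision_score

rng = np.random.default_rng(0)
# Two overlapping 2-D clusters: no perfect separator exists.
pos = rng.normal(loc=[1.0, 1.0], scale=1.0, size=(500, 2))
neg = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(500, 2))
X = np.vstack([pos, neg])
y = np.array([1] * 500 + [0] * 500)

clf = LogisticRegression().fit(X, y)
scores = clf.predict_proba(X)[:, 1]

# Lowering the threshold raises Recall but pulls in negatives (lower Precision).
for threshold in (0.5, 0.2, 0.05):
    pred = (scores >= threshold).astype(int)
    print(threshold,
          round(recall_score(y, pred), 2),
          round(precision_score(y, pred), 2))
```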

Second, machine learning algorithms do not differ very much among themselves in their ability to learn categorizations.  Overall, systems can differ in the way that they represent the documents (for example, as individual words versus as phrases or as mathematically represented concepts).  They can differ in the number of training examples they need.  They may include different sources of information; for example, some may include metadata, some just the body of the document.  And they may differ in how they are deployed.  All of these factors may be consequential in terms of how effective they are at identifying all of the relevant documents, even if they all use the very same underlying algorithms. Conversely, if all of these other factors are held constant, they may all give essentially the same level of accuracy, but that accuracy is seldom perfect.

A relatively difficult categorizing problem with substantial overlap between the positive (orange) set and the negative (blue) set.

The third major source of potential errors comes from the people doing the task.  The request for production may be vague. The lead attorneys may not know exactly what they are looking for and so may make errors when instructing the line reviewers.  The line reviewers may have differing beliefs about what constitutes a relevant document.  They learn more of the subject matter as they go through the review, so their own decision patterns may change over time.  Most studies of human reviewer accuracy find that reviewer Recall is relatively poor, only about 50% and sometimes lower. The parties will disagree about which documents are relevant.  Even the reviewers working for one party will disagree among themselves about what constitutes a relevant document.  A reviewer may disagree with his or her own earlier judgment and make inconsistent judgments over time. 

On top of all of this disagreement, people make mistakes.  Attention wanders.  A machine learning system depends on the human judgments it is given in order to learn what document features are associated with each category.  If these example documents are misclassified, the machine could learn to make incorrect decisions.  With enough training documents and enough reviewers, it may still be possible for a system to learn more or less correct classifications, but these inconsistencies may still lead the system to make errors, which will limit its accuracy.  

Continuous active learning highlights another aspect of the human factor in review.  At each step, the machine learning system ranks the so-far unseen documents for review.  Any documents that have already been seen are generally excluded from the set for which relevance is predicted.  So, if a reviewer at one stage of the review incorrectly classifies a document as not relevant, that document will not be available for ultimate production.  It will never be counted toward the Recall level of the process, no matter how much more effort is expended.  This is another factor that limits the ultimate achievable level of Recall.
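A back-of-the-envelope sketch makes the ceiling explicit.  The collection size and the reviewer error rate below are invented for illustration; the point is only that rejected-but-relevant documents that are never re-ranked cap achievable Recall regardless of further effort.

```python
# Hypothetical illustration: relevant documents wrongly rejected by reviewers and never
# revisited can never be produced, so Recall cannot reach 100% however long the review runs.
relevant_in_collection = 10_000       # assumed number of truly relevant documents
reviewer_false_negative_rate = 0.25   # assumed rate of relevant docs marked not-relevant

# Even if the classifier eventually surfaces every relevant document for review,
# the ones the reviewers reject are locked out of production.
max_recall = 1 - reviewer_false_negative_rate
print(f"Ceiling on achievable Recall: {max_recall:.0%}")   # 75%
```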

Without considering the limits on the ultimate accuracy of any search, we will over-estimate the value of that search.  The limit on achievable accuracy depends on factors that cannot be overcome simply by continuing the search further.

Up to this point in the review, the active ranking of documents has improved the accuracy of the reviewers, primarily by keeping their judgments consistent with the predictions of the categorizer.  As the process continues, however, the ranking of the remaining documents comes to be dominated by non-relevant documents, and the categorizer is of diminishing value to the reviewers.  Instead of relying on dependable predictions about which documents are relevant, the reviewers have to make increasingly independent judgments.  The combination of the sparsity of remaining relevant documents and the inaccuracy of the predictions will cause the reviewers’ accuracy to fall substantially from the level they had achieved, perhaps even dropping below the expected level of a complete manual review.  Continued search will not only be less valuable, but also less accurate than the effort that preceded it.  Unless these factors are carefully considered, there will be a very strong tendency to over-estimate the value of continued search and impose excessive burden on the producing party.

Monday, July 13, 2020

Getting the inside straight on eDiscovery


One of the beneficial outcomes of the expansion of tools like predictive coding is that lawyers have managed to embrace some level of numerical thinking.  When discussing how to identify relevant documents in a collection, lawyers now readily use terms like Recall, Precision, and Confidence Interval.  I think that this has been very beneficial for the field, in part because it has added substance to our understanding of such concepts as “reasonable” and “proportional.”  In this essay I want to go beyond the typical eDiscovery numbers and dig into some of the internals of machine learning categorizers.  These numbers can also be important to the strategy and tactics of eDiscovery search.

Many eDiscovery practitioners are familiar with concepts like Precision and Recall.  Precision measures the exclusivity of a search process and Recall measures the completeness of that process.  Of the documents that are predicted to be relevant, what proportion actually are?  That’s Precision.  Of the documents that actually are relevant, what proportion of them have been identified? That’s Recall.  A complementary measure that is sometimes used is Elusion (Roitblat, 2007).  Of the documents that have been rejected as non-relevant, what proportion are actually relevant?  These measures are a few of those available to assess the success of any search process.
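For readers who like to see the definitions as arithmetic, here is a small sketch computing the three measures from the four cells of a confusion matrix; the counts in the example are made up.

```python
def search_measures(true_pos, false_pos, false_neg, true_neg):
    """Precision, Recall, and Elusion from the cells of a confusion matrix."""
    precision = true_pos / (true_pos + false_pos)   # of the predicted-relevant, how many are relevant
    recall = true_pos / (true_pos + false_neg)      # of the truly relevant, how many were found
    elusion = false_neg / (false_neg + true_neg)    # of the rejected documents, how many are relevant
    return precision, recall, elusion

# Example with made-up counts.
print(search_measures(true_pos=800, false_pos=200, false_neg=200, true_neg=8800))
# (0.8, 0.8, 0.0222...)
```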

Recall and Precision assume that documents can be classified reliably into one group or the other.  In machine learning, that is called a binary decision.  In addition, some machine learning systems compute a score for each document.  Those documents that receive a high score are designated as relevant, and those that receive a low score are designated non-relevant.  I want to talk about where those scores come from and how they can affect Recall and Precision.

For this analysis, I used a set of texts collected to support training and testing of a machine learning process intended to identify microaggressions.  Racism and sexism not only infect our institutions, they have woven their way deep into the fabric of our interpersonal communications.  These latent attitudes often manifest themselves in microaggressions: uncivil, sometimes subtle, sometimes veiled, and sometimes unconscious communication patterns that express social bias.  Many technology companies, for example, are examining their use of such biased terms as “white lists,” which refer to things that are known to be good or permitted, and “black lists,” which refer to things known to be bad.  I don’t believe that the association between white and good versus black and bad is intended as an overtly racist position, but these small patterns can add up in important ways to communicate that some members of our community are less worthy than others.
Microaggressive communications are often subtle, relative to overt aggression, and so present a significant challenge to machine learning.  The communications collected for this study were selected to help people identify potentially harmful statements before they are transmitted.  But they also provide an opportunity to explore some of the numerical properties of machine learning similar to that used in eDiscovery, for example, in predictive coding.

A few example microaggressive statements:
  • At least I don't sit on my ass all day collecting welfare, I EARN my money.
  • Do you girls have children yet? How old are you? Oh, well you will have them soon.
  • Don't you think it would be better if gay people adopted the older kids that nobody else wants, leaving the babies for normal people?
  • He's totally white on the inside.

The communications used in this project tend to be shorter than typical emails, but they are otherwise similar to emails in many ways. 

The data are publicly available from a study done by Luke Breitfeller and colleagues who mounted a substantial effort to collect, identify, and label microaggressive statements from social media.  See their paper for details.

There are over 200 machine learning methods that can be used to categorize documents.  According to many studies, most of them are capable of producing about the same levels of Precision and Recall. Not all of them, of course, are used in current eDiscovery products.  But understanding the differences helps us to understand what the numbers mean. 

For this analysis, I looked at the eight machine learning categorizers listed in Table 1.  Two thirds of the documents were used for the training set and the remaining documents were used for the test set, a method that is commonly used in studies of machine learning.  The results described in this article all concern the performance of the models on the holdout test set.  All systems were trained with the same training set and tested with exactly the same test set.  Because the numbers of microaggressive and nonaggressive documents were nearly equal, and because they were identical for all systems, accuracy provides a reasonable estimate of the success of each system and a fair measure for comparing one system against another.  Another measure, F1, is the harmonic mean of Precision and Recall and gives a good overall account of a system’s accuracy.  One of the things that this table shows is that arguments about which method is best for eDiscovery are probably misplaced, because they are all about equally good.  These accuracies, by the way, were derived from the default versions of these categorizers; no effort was made to explore a range of parameters that might make them more accurate.  When trained on the same data in the same way, though, they are all remarkably similar in their outcomes.

Table 1. The machine learning methods used in the present study along with measures of their accuracy.

Categorizer                    Precision (%)   Recall (%)   F1 (%)   Accuracy
Multinomial Bayes                    89            71          79       0.82
SGDClassifier                        76            75          76       0.77
Random Forest                        89            62          73       0.79
Gaussian Process Classifier          83            75          79       0.81
MLP Neural Network                   76            75          76       0.77
Logistic Regression                  89            67          76       0.81
SVM SVC                              92            66          77       0.82
ReLU Neural Network                  78            75          76       0.78

I will describe these classifiers in more detail later.  For now, it is sufficient to know that these categorizers represent a variety of the available supervised machine learning methods. 
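As a rough sketch of how a comparison like Table 1 can be set up in scikit-learn (whose classifier names match several rows of the table): the TF-IDF features, the particular split, and the default parameters below are my assumptions for illustration, not necessarily the author’s exact pipeline, and `texts` and `labels` stand in for the microaggression corpus, which is not included here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import SGDClassifier, LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score

# Hold out one third of the documents as the test set.
X_train_txt, X_test_txt, y_train, y_test = train_test_split(
    texts, labels, test_size=1/3, random_state=0)

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train_txt)
X_test = vectorizer.transform(X_test_txt)

models = {
    "Multinomial Bayes": MultinomialNB(),
    "SGDClassifier": SGDClassifier(),
    "Random Forest": RandomForestClassifier(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVM SVC": SVC(),
}

for name, model in models.items():   # every model sees the same training and test split
    pred = model.fit(X_train, y_train).predict(X_test)
    print(name,
          precision_score(y_test, pred),
          recall_score(y_test, pred),
          f1_score(y_test, pred),
          accuracy_score(y_test, pred))
```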

The goal of each of these machine learning systems is to divide the test set documents into two groups, one group containing the microaggressive communications and one containing the other communications.  All eight systems perform about equally well on this challenging task.  However, the patterns by which they assign documents to the two groups differ substantially.  If all we cared about was the ultimate categorization of communications, the internal differences among systems would not matter at all.  But if we want to examine the process in more detail, then these specific differences do matter.  By default, each of these categorizers uses a decision rule that assigns communications to the more likely category (microaggression or nonaggression).  Each one of them computes its prediction of this likelihood by assigning each document a score between 0.0 and 1.0.   We often talk about these scores as probabilities, but, in fact, they are not very accurate estimates of probability, as shown in Figure 1.  This chart is called a calibration curve.  It shows how the predicted probability of relevance compares with the actual probability of relevance.

We can turn to weather forecasts to get a better idea of what it would mean for the scores assigned to documents to correspond to actual probabilities.  If a weather forecaster were accurate, we would expect that the forecaster’s estimate of the probability of rain (the score) would closely match the rate at which it did in fact rain.  We would expect to see rain on 20% of the days for which the forecaster predicted a 20% chance of rain.  We would expect to see rain on 50% of the days for which the forecaster predicted a 50% chance of rain. Similarly, if a model predicted a 20% chance of relevance, then we should expect that 20% of the documents with this prediction would, in fact, be relevant.  The model’s predicted “probability” is shown along the horizontal axis and the observed proportion of documents that were actually relevant is shown along the vertical axis.  Like the expert weather forecaster’s, these two proportions (predicted and observed) should be very similar, lying on or near the dashed line.  Obviously, they are not.
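A curve like Figure 1 can be computed with scikit-learn’s calibration_curve.  The sketch assumes a fitted classifier `model` that exposes predict_proba and the holdout `X_test`, `y_test` from the earlier sketch; the bin count is an arbitrary choice.

```python
from sklearn.calibration import calibration_curve

# Scores ("probabilities") for the relevant class on the holdout test set.
scores = model.predict_proba(X_test)[:, 1]

# Bin the predicted probabilities and compare each bin's mean score with the
# observed fraction of relevant documents in that bin.
observed_fraction, mean_predicted = calibration_curve(y_test, scores, n_bins=10)
for p, o in zip(mean_predicted, observed_fraction):
    print(f"predicted {p:.2f}  observed {o:.2f}")   # a well-calibrated model keeps these close
```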

As with weather forecasts, we divide the predicted probabilities into ranges (bins) and, for each bin, compare the predicted probability with the proportion of communications in that bin that are actually relevant.  The scores are ranked in approximately the same order as the probabilities, but they do not correspond directly with probabilities.  There are dips in the curves of Figure 1, meaning that some higher-score bins contain a lower proportion of relevant documents than lower-score bins.
The scores also differ from one categorizer to the next.  Again, if we only looked at whether a document is more or less likely to be in the positive category, we would get approximately the same Recall and Precision from all of these systems (as shown in Table 1), but if we examine the scores in more detail, for example as part of a predictive coding protocol, we might be surprised at how much they differ.  These models all used a score cutoff that assigns each document to its more likely category, but parties do often disagree about where to put the cutoff between documents that are classified as relevant and those that are not.  We see from Figure 1 that the relationship between score and relevance is very complex.  The scores do not actually represent the probability that a document is relevant (the dashed line), so basing a strategy on the assumption that they do represent probabilities could be very misleading and inappropriate.

Figure 1. Calibration curves for eight machine learning categorizers.  The dashed line corresponds to a perfectly calibrated categorizer.


Some eDiscovery protocols demand that documents with nonzero scores be assessed or at least sampled, even though they are not classified as relevant.  The kind of data represented in Figure 1 makes it very difficult to predict the value of such examinations.  For example, if an eDiscovery protocol mandates examining documents with intermediate scores, perhaps sampling from each decade of scores (for example, those with scores between 0.4 and 0.5 and those between 0.5 and 0.6), then the pattern of results we might observe could be very different, and have different potential value, depending on the machine learning method that was used.
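A sketch of what sampling by score decade might look like; `scores` stands in for a model’s scores on the not-produced documents, and the bin edges and per-decade sample size are placeholders rather than a recommended protocol.

```python
import numpy as np

rng = np.random.default_rng(0)
bins = np.linspace(0.0, 1.0, 11)                       # ten decades: 0.0-0.1, ..., 0.9-1.0
decade = np.clip(np.digitize(scores, bins) - 1, 0, 9)  # which decade each score falls in

for d in range(10):
    members = np.where(decade == d)[0]
    if members.size == 0:
        continue                                       # some categorizers leave whole decades nearly empty
    sample = rng.choice(members, size=min(25, members.size), replace=False)
    print(f"{bins[d]:.1f}-{bins[d+1]:.1f}: {members.size} docs, sampling {sample.size}")
```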


Figure 2. The distribution of documents for each category in each score range for eight machine learning categorizers. P indicates relevant communications (microaggressions). N indicates non-relevant communications (non-aggressions).


Figure 2 shows a different view of these data.  The orange bars represent the number of non-relevant (non-aggressive) communications in the designated score bin.  The blue line shows the number of relevant (aggressive) communications in the corresponding score bin.  Each classifier has a unique way of distributing communications to score bins.  The first thing to note is that the range of scores for the two categories of documents overlaps significantly.  Four of the machine learning systems score a substantial number of documents near the middle of the score range (Multinomial Bayes, Random Forest, Gaussian Process Classifier, and Logistic Regression) and the remaining systems tend to score documents near the edges of the range.
The difference in scoring patterns across classifiers reinforces the concern raised above about protocols in which parties expect to examine communications with scores that are not near the endpoints of the scoring range.  With the Multinomial Bayes model, for example, the decades between 0.4 and 0.6 contain substantial numbers of both relevant and non-relevant communications.  But the SVM SVC model produces only a few documents in this range, and using one of the neural network models would yield almost no documents in this range. SVM (support vector machine) use is common in eDiscovery, so understanding how documents are scored in this context may be very important. 

We will leave as an open question for now whether examination of those intermediate-scoring documents is actually useful, but it is clear that whatever information it does supply will be strongly affected by the method used to identify relevant documents.

What the scores mean


It should be clear from the above figures that the scores produced by a categorizer do not reflect directly the probability that a document is relevant.  Nor do they reflect the degree of relevance.  The probability that a document is relevant is different from the document’s relevance.  The probability that it will rain is generally uncorrelated with the amount of rain that will fall.  Documents that score higher may be more likely to be relevant, they may be more like other relevant documents, or they may be more different from non-relevant documents, but higher scores do not mean that they are necessarily more relevant. 

For example, taller, heavier people are more likely to be men than women, but one would not conclude that taller, heavier people are more manly than smaller people.  Being more likely to be a man is unrelated to the degree of “manliness.”  Maybe it is just my taste, but I do not believe, for example, that John Candy (6’2”, 300 pounds) was more manly than Humphrey Bogart (5’8”, 150 pounds, #9 on IMDb’s list of most manly actors).  Similarly, being more likely to be a relevant document is unrelated to the degree of relevance.

Relevance is a complex issue, depending in part on the information needs of the parties.  Classifiers, on the other hand, only have available the content of the communication and its similarity to other communications.  Machine learning is possible because we assume that similar items should be treated similarly.  Machine learning does not just memorize the correct arbitrary class for each communication because then it could never generalize to new communications.  Similarity does not always correspond to relevance either.  For example, an email that contained the phrase “you’re fired, you’re fired, you’re fired,” might score higher than one that said only “you’re fired” but the second one may be more serious and more important to the question you are trying to answer. 
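A toy illustration (not drawn from the microaggression corpus) of how a bag-of-words representation can reward repetition: under a simple word-count model, the repeated phrase typically receives the higher score even though it is not more relevant.  The training snippets and labels below are invented.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_texts = ["you're fired", "please send the report", "fired for cause",
               "lunch at noon?", "you are being fired"]
train_labels = [1, 0, 1, 0, 1]      # 1 = relevant in this toy example

vec = CountVectorizer()
clf = MultinomialNB().fit(vec.fit_transform(train_texts), train_labels)

tests = ["you're fired", "you're fired, you're fired, you're fired"]
print(clf.predict_proba(vec.transform(tests))[:, 1])
# The repeated phrase typically gets the higher score, though it is not more relevant.
```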

The scores are a mathematical product of the algorithms used by the system.  Scoring functions are often complex and can be quite subtle.  Some words are more informative than others.  Two documents near a boundary may differ in many words, and identifying the contribution of each of those words to the distinction between categories would be a formidable exercise, with little expected value. 

The scores produced by machine learning methods depend on the specific algorithms employed.  Each machine learning method employs its own way to compute the similarity among documents.  But, as described, all of the categorizers studied here produce about the same level of classification accuracy, so the ability of these systems to categorize the communications is not in doubt.  The point I want to make is that even though the classifiers agree on the overall classification, how they come to that classification can be very different.

The internal properties of machine learning categorizers are important for understanding just what the scores mean. These properties become critical when the parties want to go beyond the final classification decisions and try to exploit internal details.  Asking the right questions, and judging the value of the answers one gets to them, depends critically on just what information a categorizer uses and how it uses it.  The very strategy of eDiscovery processing may depend on a proper understanding of these properties.  Our intuitions about what we will find when we look at particular documents may not match up with what is actually there.



Thursday, February 22, 2018

Machine Learning: It Depends


If you ask an attorney a simple yes/no question, you’re likely to get the answer “It depends.” If we could identify all of the things on which that opinion depends, and if we could identify the values of those things that would make the answer come out one way or another, then we would have an artificial intelligence agent that could replace the attorney on that particular question.  That agent could take the form of a checklist, a flowchart, or a computer program.  The agent is defined by the variables (the things the opinion depends on) and by how the values of those variables lead to one decision or the other, not by the technology that implements those relations.

If there are only a few variables and simple relations between them and the decision, then the problem is easy and we may be able to simply write down a set of rules (if damages are greater than $1,000 then …).  If the opinion depends on lots of variables and if those variables can combine in complex ways, then we may need machine learning and lots of examples to extract and formulate these relations.  Each variable may contribute ambiguously to the decision, but when you combine many ambiguous predictors, you can often achieve a reliable decision process.

The FICO credit score is an example of this kind of decision agent.  The FICO score is used to make consumer credit risk decisions.  It depends on five broad groups of variables, some of which are proprietary.  Because it considers various combinations of variables with varying levels of importance, it is difficult to describe the specific impact that any one of them has on the ultimate score.  Two people with the same value for one of these variables may, as a result, have very different FICO scores.  Still, this scoring system has been found over the years to be quick and reliable.
Systems like the FICO score typically depend on machine learning.  The system designer picks the variables that she thinks are likely to be involved in making the decision and then lets the machine learning system figure out how to use these variables to make the decision.  Training the system consists typically of providing it lots of examples (for example, of coded credit reports) and their outcomes (for example, whether the people associated with those credit reports repaid their loan and how quickly). 
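A minimal sketch of this kind of scoring agent, in the spirit of the paragraphs above but with entirely invented variables, data, and effects: several individually weak predictors are combined by a logistic regression into a single repayment score.  This is not the FICO model, just an illustration of the pattern.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical training data: each row is a coded credit history, each label records
# whether the loan was repaid.  The variables and their effects are invented for illustration.
rng = np.random.default_rng(1)
n = 1000
X = np.column_stack([
    rng.integers(0, 10, n),      # number of past loans
    rng.uniform(0, 1, n),        # fraction of payments made on time
    rng.uniform(0, 1, n),        # credit utilization
])
# Each variable contributes only weakly, but together they predict repayment fairly well.
p_repay = 1 / (1 + np.exp(-(0.1 * X[:, 0] + 2.0 * X[:, 1] - 1.5 * X[:, 2])))
y = rng.binomial(1, p_repay)

model = LogisticRegression().fit(X, y)      # the machine learning step: learn the weights
print(model.coef_)                          # learned importance of each variable
print(model.predict_proba(X[:5])[:, 1])     # a FICO-like score would be a rescaling of this
```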

There are hundreds of methods for machine learning, even if we restrict ourselves to just those that are used to support decision making.  Machine learning can be summarized by three things:
  • The representation 
  • The assessment
  • The adaptation (or optimization) method.  

The representation is how the problem is described.  What are the potential variables that will be used? How are the values of these variables described numerically? For example, we might include a variable indicating how much has been borrowed in the past and another indicating how much has been repaid on time.  

If we wanted to build an agent that played chess, we might represent the chess game as a tree of potential moves.  At each point in the game, the tree has one branch for every legal move from that point.  The problem could also be represented as a computational network (often called a neural network or deep learning) where inputs lead to multiple layers of simulated neurons and eventually to a predicted categorization.

In document classification, how we represent the text is a critical part of the representation.  Each variation is a different representation.  One document representation could treat each word as an independent item, for example, how often it appears in the document.  Another might include information about the context in which the word appears.  One representation might include the words just as they appear, another might transform the words to their root form, for example, representing "time," "timer," and "timing" all as the single root "time."

Machine learning can only distinguish items that are different in their representation.  If we normalize words to remove prefixes and suffixes, for example, then distinctions between words with different suffixes would be unavailable. Words like "take," "taken," "taking," would all be represented in the same way and distinguishing between them would be impossible.  On the other hand, the system would not have to separately learn that words that differ in tense should be treated identically.  There can be tradeoffs to choosing between representations.
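The effect of representation choices can be seen with two vectorizations of the same sentence: one keeps each word as it appears, the other maps words to a root form first.  The lookup table below is a toy stand-in for a real stemmer (e.g., Porter stemming), used only to show that the vocabulary, and therefore what the learner can distinguish, changes.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the timer is timing the time trial"]

# Representation 1: each word exactly as it appears.
raw = CountVectorizer()
raw.fit(docs)
print(sorted(raw.vocabulary_))     # ['is', 'the', 'time', 'timer', 'timing', 'trial']

# Representation 2: words reduced to a root form (a toy lookup standing in for a stemmer).
roots = {"timer": "time", "timing": "time"}
normalize = lambda d: " ".join(roots.get(t, t) for t in d.lower().split())

stemmed = CountVectorizer(preprocessor=normalize)
stemmed.fit(docs)
print(sorted(stemmed.vocabulary_))  # ['is', 'the', 'time', 'trial'] -- the suffix distinctions are gone
```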

The assessment part of machine learning includes the goal of the project and how to measure approximations to that goal.  In the FICO example, the goal is to predict the creditworthiness of borrowers.  For each borrower in the training set we have a representation of the relevant credit history and a measure of how creditworthy that person turned out to be.  The goal would be to predict the appropriate score for each of the training examples, minimizing the difference between the score predicted by the method and the actual score assigned to that borrower.  The smaller the difference, the closer the system is to achieving its goal.  

In a decision system, the goal might be a decision about the input, for example, the category in which it belongs, and the approximation might be measured by the accuracy of that decision relative to the training examples.

The adaptation or optimization method is how the system works to improve its predictions, to get closer to the specified goal.  There are many optimization methods.  Generally they work by adjusting the importance of each variable or each combination of variables.  The importance may be all-or-nothing, as might be the case when selecting a variable, or it might be a continuous value, including negative values.  The importance controls how the input variable affects the system’s prediction.  The optimization method might increase or decrease the importance of variables and keep just those changes that result in a better prediction. 
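A minimal sketch of the adjust-and-keep-improvements idea: a random hill-climb that nudges one importance (weight) at a time and keeps the change only if accuracy improves.  This is a toy optimizer for illustration, not how any particular product trains its models; real systems typically use gradient-based methods.

```python
import numpy as np

def hill_climb(X, y, rounds=2000, step=0.1, seed=0):
    """Toy optimizer: adjust one weight at a time, keep the change only if accuracy improves."""
    rng = np.random.default_rng(seed)
    weights = np.zeros(X.shape[1])

    def accuracy(w):
        return np.mean((X @ w > 0).astype(int) == y)

    best = accuracy(weights)
    for _ in range(rounds):
        trial = weights.copy()
        i = rng.integers(X.shape[1])
        trial[i] += rng.choice([-step, step])   # importance may move up or down, even below zero
        score = accuracy(trial)
        if score > best:                        # keep only changes that improve the prediction
            weights, best = trial, score
    return weights, best
```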

The accuracy and speed of machine learning, and the scope of problems it can solve, have grown dramatically over the last couple of decades because the creators of those systems have invented better ways of representing problems and adaptation methods that are better at selecting potential changes.  It also does not hurt that computers have gotten much faster and can hold more memory than ever before.

Despite these improvements, machine learning still stands on the three core features of representation, assessment, and optimization.  As a result, many machine learning systems tend to return approximately the same results.    Given a choice between more clever algorithms and better quality training data, it is often preferable to spend the effort on better data.  

Particularly in the context of categorizing documents, the machine learning algorithm makes little difference to the ultimate outcome of the project.  The same algorithm can lead to good results or to bad results, depending on how it is used.  We had a project, for example, in which several very similar data sets were to be categorized.  Each set was handled by a different attorney, who provided the training.  One attorney paid close attention to the problem, worked at making consistent decisions, and selected the training examples over a few days.  The system did really well on this set.  The other attorneys selected their training documents less systematically, choosing only a few documents per day, with several days in between.  The very same system did poorly on these sets.

To summarize: learning algorithms, particularly those involved in document categorization, all work to maximize the probability that a document will be categorized correctly given its content.  The tool that machine learning provides to accomplish this is adjusting the importance of the document elements (for example, the words, or whatever representation is used).  Methods differ in how they represent the documents, and this difference can be critically important.  The quality, primarily the consistency, of the training examples is also a critically important part of machine learning.  It does not much matter how these documents are selected, provided that the examples are representative of those that need to be categorized, that they are consistently coded, and that the coding represents the actual desired code for each one. 

Sunday, November 26, 2017

Comparing 179 Machine Learning Categorizers on 121 Data Sets


It is often argued that the algorithm used for machine learning is less important than the amount of data used to train the algorithm (e.g., Domingos, 2012; “More data beats a cleverer algorithm”).  In a monumental study, Fernández-Delgado and colleagues tested 179 machine learning categorizers on 121 data sets. They found that a large majority of them were essentially identical in their accuracy. In fact, 121 of them (that’s a coincidence) were within ±5 percentage points of one another when averaged over all of the data sets.
The following two graphs show the same data organized either by family (color and order) or by accuracy (order) and family (color).



Families
1. Bagging (BAG): 24 classifiers.
2. Bayesian (BY) approaches: 6 classifiers.
3. Boosting (BST): 20 classifiers.
4. Decision trees (DT): 14 classifiers.
5. Discriminant analysis (DA): 20 classifiers.
6. Generalized Linear Models (GLM): 5 classifiers.
7. Logistic and multinomial regression (LMR): 3 classifiers.
8. Multivariate adaptive regression splines (MARS): 2 classifiers.
9. Nearest neighbor methods (NN): 5 classifiers.
10. Neural networks (NNET): 21 classifiers.
11. Other ensembles (OEN): 11 classifiers.
12. Other Methods (OM): 10 classifiers.
13. Partial least squares and principal component regression (PLSR): 6 classifiers.
14. Random Forests (RF): 8 classifiers.
15. Rule-based methods (RL): 12 classifiers.
16. Stacking (STC): 2 classifiers.
17. Support vector machines (SVM): 10 classifiers.

The classifiers within a family share a core method but may use different parameters or different transformations of the data.  There is no simple way to assess the variety of the specific classifiers in each group. 

A few observations

The observation that so many of the classifiers performed so well over a variety of different data sets is remarkable.  More than 2/3 of the classifiers that were tested performed within plus or minus 5 percentage points of one another over a large number of different data sets.
The observation that the range of accuracies differed almost as much within a family as between families is also remarkable.  Classifiers in the Bagging family (BAG), for example, were among the most and among the least accurate classifiers in the experiment.  Bagging is an ensemble approach, where several different classifiers are combined using a kind of averaging method.  The Boosting, Stacking, and OEN (their abbreviation for other ensembles) families also involve ensembles of classifiers.  The high level of variability among members of these families is a little surprising and may be, at least partially, due to the ways in which the parameters for these models were chosen.
Although Fernández-Delgado and associates tried to choose optimal parameters for each method, there is no guarantee that their methods of selection were optimal for each classifier.  Poor classifiers may have performed poorly either because they were ill suited to one or more of the data sets in the collection or because their parameters were chosen poorly.
Three other families showed relatively high accuracy, and also high consistency.  The best performing family was Random Forest (RF), followed by Support Vector Machines (SVM) families. A Random Forest classifier uses sets of decision trees to perform its classification.  Support Vector Machines learn separators between classes of objects.  These are two relatively old machine learning methods.  Classifiers in the Decision Trees family were also relatively consistent, though slightly less accurate.
Classifiers in the Bayesian family (BY) were also quite consistent, but slightly less accurate.  Bayesian models tend to be the simplest models to compute with relatively few parameters and no iterative training (repeatedly adjusting parameters using multiple passes over the same data).

Conclusion

So, what do we make of this result?  Classification is not particularly sensitive to the family of classifier that is employed.  Practically any family of classifier can be used to achieve high quality results.  Based on these results, the choice of a Random Forest or SVM classifier is likely to be the most reliable choice, in that they seem to work well under a variety of data and a variety of configurations.  Many classifiers from other families, if effectively tuned, are also likely to be effective.  There is no guarantee that all classifiers are equally affected by a single tuning method, or that all varieties of classifier are equal, but many of them will yield high quality results.  It appears that how a classifier is used is more important than what kind of classifier it is.
I have left out many of the details of exactly how these different classifiers work.  That information can be found in the Fernández-Delgado paper or on Wikipedia.


Domingos, P. (2012) A few useful things to know about machine learning. Communications of the ACM, 55(10), 78-87.
Fernández-Delgado, M., Cernadas, E., Barro, S. and Amorim, D. (2014) Do We Need Hundreds of Classifiers to Solve Real World Classification Problems? Journal of Machine Learning Research, 15, 3133-3181.


Thursday, January 12, 2017

Intelligence: Natural and Artificial



It does not take a genius to recognize that artificial intelligence is going to be one of the hot topics for 2017.  AI is suddenly everywhere from phones that answer your questions to self-driving cars. Once a technology achieves prominence in the consumer space, it moves into the mainstream of applied fields, even for fields that are slow to adopt technology.  

Predictions 2017.  Source: http://www.psychics.com/blog/a-brief-history-of-the-crystal-ball/


AI has also caught the imagination of people who like to worry about the future.  Are we going to achieve some kind of singularity where we can upload human consciousness to a computer or is Skynet going to determine that people are simply superfluous and work to destroy them?

The prospects for either of these worrisome events seem extraordinarily remote.  I think that these concerns rest largely on a deep misunderstanding of just what intelligence is.  Intelligence is often used to describe something akin to cognitive processing power, specifically, processing power of a kind that reflects the cultural achievements of Western Civilization (e.g., success in school or business).

Intelligent people/things generally are those that are better able to think.  The implication is that some people are better than others at thinking—they are generally more intelligent.  This is the kind of idea that underlies the concept of IQ (intelligence quotient).  IQ was originally invented to predict how well children would do in school. 

This notion of general intelligence, one that is intended to measure how well people think overall, has proven to be elusive.  Although there is some correlation between performance on one cognitive test and another, that correlation is not particularly high.  Moreover, the correlation may be more indicative of the similarity between cognitive tests and tasks than of shared cognitive abilities.  The correlation may be a result of the kind of situations where we attribute intelligence (for example, multiple classroom activities or business) and not be general at all.  Even among so-called intellectual activities, the correlation may be absent.  

The same applies to artificial intelligence.  We don’t have any generally intelligent machines.  So far, artificial intelligence machines are rather narrowly specialized.  Building a great chess-playing machine is unlikely to be of any use for winning at Jeopardy. Intelligence, both natural and artificial, seems to be largely domain specific.  If the evidence were stronger for general human intelligence, I might be more willing to predict that kind of success in general artificial intelligence, but so far, the evidence seems strongly to the contrary.  

Further, the problems that seem to rely most on intellectual capacity, such as playing chess or answering Jeopardy questions, turn out to be the easier problems to solve with computers.  Problems that people find natural, such as recognizing a voice or a face, turn out to be more difficult for computers.  It is only recently that we have made progress on addressing such problems with computers.  

Chess playing and Jeopardy answering by computers use approaches that are different from those used by humans.  The differences are often revealed in the kinds of mistakes people and machines make.  IBM’s Watson beat the human Jeopardy players handily, but it made certain mistakes that humans would not (for example, asserting that Toronto was a US city).  The difference in mistakes (artificial vs. natural stupidity?) is not a sign of AI’s failure, just a sign that the computer is doing things in a different way than a smart person would.  

Similarly, the kinds of mistakes people make tell us something about how they form their intelligence.  For example, people will give different answers to the same question, depending on how precisely it is asked.  In a seminal study by Kahneman and Tversky, participants were asked to choose between two treatments for 600 people infected with a deadly disease.   

If the people were given a positively framed choice, 72% chose Treatment A:
  • Positive Frame: With Treatment A 200 people will be saved.  With Treatment B, there is a 33% chance of saving all 600 and a 66% chance of saving no one.
On the other hand, if the situation was described more negatively, only 22% chose Treatment A:
  • Negative Frame: With Treatment A, 400 people will die.  With Treatment B, there is a 33% chance that no one will die and 66% chance that all 600 people will die.
With both sets of alternatives, 200 people are predicted to live and 400 people will die under treatment A, and under Treatment B there is a 33% chance that everyone will survive and a 66% chance that no one will survive.  Logically, people should give the same answer to both, but instead they are affected by the pattern of how the question was asked.  The first pair match a positive pattern and the second pair match a negative pattern, and thus lead to different choices.

People have a tendency to jump to conclusions based on first impressions or other patterns that they perceive.  In fact, the root factor underlying human cognition seems to be pattern recognition.  People see patterns in everything.  The gambler’s fallacy, for example, relies on the fact that people see patterns in random events.  If heads come up six times in a row, people are much more likely to think that the next flip will result in tails, but in reality heads and tails remain equally likely.  

Humans evolved the ability to exploit patterns over millions of years.  Artificial intelligence, on the other hand, has seen dramatic progress over the last few decades because it is only recently that computer software has been designed to take advantage of patterns.

People are naturally impressionistic intuitive reasoners.  Computers are naturally logical and consistent.  Computers recognize patterns, and thereby become more intelligent, to the extent that these patterns can be accommodated in a logical, mathematical framework.  Humans have a difficult time with logic.  They can use logic to the extent that it is consistent with the patterns that are perceived or are “emulated” by external devices.  But logic is difficult to learn and difficult to employ consistently.

Every increase in human intelligence over the last several thousand years, I would argue, has been caused by the introduction of some artifact that helped people to think more effectively.  These artifacts range from language itself, which makes a number of thinking processes more accessible, to things like checklists and pro/con lists, which help make decisions more systematic, to mathematics.

In contrast, the kinds of tasks that people find easy (and that challenge computers), such as recognizing faces, are apparently a property of specific brain structures, which have evolved over millions of years.  Other aspects of what we usually think of as intelligence are much more recent developments, evolutionarily speaking, over a time frame of, at most, a few thousand years.  Our species, Homo sapiens, has only been around for about 150,000 years.  There have been quite a few changes to our intellectual capacity over that time, particularly over the last few thousand years.  The cave paintings at Lascaux in France, among the earliest known artifacts of human intelligence, are only about 20,000 years old.

An example of face recognition by humans.  Although upside down, this face is easily recognizable, but there is something strange about it.  See below.


Computer face-recognition systems do not yet have the same capacity as human face recognition, but the progress in computerized face recognition has come largely from algorithms that exploit invariant measurements of the faces (such as the ratio of the distance between the eyes relative to the length of the nose).  

The birth of self-driving cars can be attributed to the DARPA Grand Challenge.  DARPA offered a million-dollar prize for a self-driving car that could negotiate, unrehearsed, a 142-mile off-road course through the Mojave desert.  The 2004 competition was a complete failure.  None of the vehicles managed more than about 5% of the course.  In 2005, on the other hand, things were dramatically different.  Stanley, the Stanford team’s car, negotiated the full course in just under 7 hours.  A major source of its success, I believe, lay in the sensors that were deployed and in the way information from those sensors was processed and pattern analyzed.  Core to the system were machine learning algorithms that learned to avoid obstacles from example incidents in which human drivers avoided obstacles.

Enhancement in computer intelligence similarly has come from hacks and artifacts.  Face recognition algorithms, navigational algorithms, sensors, parallel distributed computing, and pattern recognition (often called machine learning) have all contributed to the enhancement of machine intelligence.   But, given the elusiveness of general intelligence in humans, it is doubtful that we will see anything resembling general intelligence in computers.   

Winning a Nobel Prize in physics is no guarantee that one can come to sensible conclusions about race and intelligence, for example.  Being able to answer Jeopardy questions is arguably more challenging than winning chess games, but it is not the same thing as being able to navigate a vehicle from Barstow, California to Primm, Nevada.  Computers are getting better at what they do, but the functions on which each one is successful are still narrowly defined.  Computers may increasingly take over jobs that used to require humans, but they are unlikely, I think, to replace them altogether. Ultimately, computers, even those endowed with artificial intelligence, are tools for raising human capabilities.  They are not a substitute for those capabilities.  


The same face turned right-side up.  In the view above, the mouth and eyes were upright while the rest of the face was upside down.  The effect is readily seen when the face is right-side up, but was glossed over when upside down.  We recognize the pattern inherent in the parts and in the whole.  Source: http://thatchereffect.com/