Sunday, November 26, 2017

Comparing 179 Machine Learning Categorizers on 121 Data Sets


It is often argued that the algorithm used for machine learning is less important than the amount of data used to train it (e.g., Domingos, 2012: “More data beats a cleverer algorithm”). In a monumental study, Fernández-Delgado and colleagues (2014) tested 179 machine learning categorizers on 121 data sets. They found that a large majority of them were essentially identical in their accuracy. In fact, 121 of them (that’s a coincidence) were within ±5 percentage points of one another when averaged across all of the data sets.
The following two graphs show the same data organized either by family (color and order) or by accuracy (order) and family (color).



Families
1. Bagging (BAG): 24 classifiers.
2. Bayesian (BY) approaches: 6 classifiers.
3. Boosting (BST): 20 classifiers.
4. Decision trees (DT): 14 classifiers.
5. Discriminant analysis (DA): 20 classifiers.
6. Generalized Linear Models (GLM): 5 classifiers.
7. Logistic and multinomial regression (LMR): 3 classifiers.
8. Multivariate adaptive regression splines (MARS): 2 classifiers.
9. Nearest neighbor methods (NN): 5 classifiers.
10. Neural networks (NNET): 21 classifiers.
11. Other ensembles (OEN): 11 classifiers.
12. Other Methods (OM): 10 classifiers.
13. Partial least squares and principal component regression (PLSR): 6 classifiers.
14. Random Forests (RF): 8 classifiers.
15. Rule-based methods (RL): 12 classifiers.
16. Stacking (STC): 2 classifiers.
17. Support vector machines (SVM): 10 classifiers.

The classifiers within a family rely on the same core method but may use different parameters or different transformations of the data. There is no simple way to assess how varied the specific classifiers in each group are.
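As a quick arithmetic check, the family sizes listed above can be tallied to confirm that they add up to the 179 classifiers in the study, and to see what share the 121 classifiers within the ±5 point band represent. This is only a sketch; the dictionary name and layout are my own, and the numbers are copied from the list and the summary above.

```python
# Family sizes as listed above (abbreviation -> number of classifiers).
family_sizes = {
    "BAG": 24, "BY": 6, "BST": 20, "DT": 14, "DA": 20, "GLM": 5,
    "LMR": 3, "MARS": 2, "NN": 5, "NNET": 21, "OEN": 11, "OM": 10,
    "PLSR": 6, "RF": 8, "RL": 12, "STC": 2, "SVM": 10,
}

total = sum(family_sizes.values())

# Number of classifiers reported above as falling within +/-5 points.
within_band = 121
share = within_band / total

print(total)            # 179
print(round(share, 3))  # 0.676
```

The share comes out just above two thirds.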

A few observations

The observation that so many of the classifiers performed so well over a variety of different data sets is remarkable.  More than 2/3 of the classifiers that were tested performed within plus or minus 5 percentage points of one another over a large number of different data sets.
The observation that the range of accuracies differed almost as much within a family as between families is also remarkable. Classifiers in the Bagging family (BAG), for example, were among the most and among the least accurate classifiers in the experiment. Bagging is an ensemble approach, where several different classifiers are combined using a kind of averaging method. The Boosting, Stacking, and OEN (their abbreviation for other ensembles) families also involve ensembles of classifiers. The high levels of variability among members of these families are a little surprising and may be due, at least partially, to the ways in which the parameters for these models were chosen.
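To make the idea of bagging concrete, here is a minimal sketch of bootstrap aggregating: the same simple base learner is trained on resampled copies of the training data, and the copies vote on each prediction. The base learner (a one-feature threshold rule), the toy data, and all function names are illustrative assumptions of mine, not the configurations used in the study.

```python
import random
from collections import Counter

def fit_stump(points):
    """Fit a one-feature threshold rule to (x, label) pairs with labels 0/1."""
    best = None
    for threshold, _ in points:
        # Try both orientations: (label at or below threshold, label above).
        for left, right in ((0, 1), (1, 0)):
            errors = sum(
                (left if x <= threshold else right) != label
                for x, label in points
            )
            if best is None or errors < best[0]:
                best = (errors, threshold, left, right)
    _, threshold, left, right = best
    return lambda x: left if x <= threshold else right

def fit_bagged(points, n_models=25, seed=0):
    """Train n_models stumps on bootstrap samples; predict by majority vote."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        # Bootstrap sample: draw len(points) items with replacement.
        sample = [rng.choice(points) for _ in points]
        models.append(fit_stump(sample))

    def predict(x):
        votes = Counter(model(x) for model in models)
        return votes.most_common(1)[0][0]

    return predict

# Hypothetical toy data: label 1 when x >= 5.0, with two flipped labels.
data = [(i / 10, int(i >= 50)) for i in range(100)]
data[20] = (2.0, 1)
data[80] = (8.0, 0)

bagged = fit_bagged(data)
low_prediction = bagged(1.0)
high_prediction = bagged(9.0)
```

Even in this small sketch there are several choices to make (the base learner, the number of models, how votes are combined), which gives some sense of how members of one family can end up behaving quite differently.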
Although Fernández-Delgado and associates tried to choose optimal parameters for each method, there is no guarantee that their methods of selection were optimal for each classifier.  Poor classifiers may have performed poorly either because they were ill suited to one or more of the data sets in the collection or because their parameters were chosen poorly.
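The dependence on parameter choice can be illustrated with a small sketch of validation-based tuning. Here a nearest-neighbor classifier picks its number of neighbors from a fixed candidate list by accuracy on held-out data. The classifier, the candidate list, the toy data, and the function names are all assumptions for illustration and do not reproduce the tuning procedure of the study.

```python
from collections import Counter

def knn_predict(train, x, k):
    """Predict the majority label among the k training points nearest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def accuracy(train, held_out, k):
    """Fraction of held-out points that k-NN labels correctly."""
    hits = sum(knn_predict(train, x, k) == label for x, label in held_out)
    return hits / len(held_out)

def select_k(train, validation, candidates):
    """Return the candidate k with the highest validation accuracy."""
    scores = {k: accuracy(train, validation, k) for k in candidates}
    best_k = max(scores, key=scores.get)
    return best_k, scores

# Hypothetical toy data: label 1 when x >= 5.0, with a few flipped labels.
points = [(i / 10, int(i >= 50)) for i in range(100)]
for i in (10, 30, 70, 90):
    x, label = points[i]
    points[i] = (x, 1 - label)

train = points[::2]        # even-indexed points for training
validation = points[1::2]  # odd-indexed points for validation

candidates = (1, 3, 5, 7)
best_k, scores = select_k(train, validation, candidates)
```

The selected value is only the best among the candidates that were tried, judged on one particular validation set. A different candidate list or a different split could favor a different setting, which is the sense in which a tuning method may suit some classifiers better than others.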
Three other families showed relatively high accuracy and also high consistency. The best performing family was Random Forest (RF), followed by the Support Vector Machine (SVM) family. A Random Forest classifier uses sets of decision trees to perform its classification. Support Vector Machines learn separators between classes of objects. These are two relatively old machine learning methods. Classifiers in the Decision Trees family were also relatively consistent, though slightly less accurate.
Classifiers in the Bayesian family (BY) were also quite consistent, but slightly less accurate. Bayesian models tend to be the simplest models to compute, with relatively few parameters and no iterative training (repeatedly adjusting parameters using multiple passes over the same data).
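The point about few parameters and no iterative training can be seen in a minimal Gaussian naive Bayes sketch. I am using Gaussian naive Bayes as a representative example, which may differ from the specific Bayesian classifiers in the study; the function names and toy data are hypothetical. Training amounts to computing a prior, a mean, and a variance per class and feature directly from the data.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(rows, labels):
    """Estimate per-class log prior, feature means, and feature variances."""
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)

    model = {}
    for label, members in by_class.items():
        n = len(members)
        columns = list(zip(*members))
        means = [sum(col) / n for col in columns]
        # Small constant keeps the variance strictly positive.
        variances = [
            sum((v - m) ** 2 for v in col) / n + 1e-9
            for col, m in zip(columns, means)
        ]
        model[label] = (math.log(n / len(rows)), means, variances)
    return model

def predict_gaussian_nb(model, row):
    """Return the class with the highest log prior plus log likelihood."""
    def log_posterior(label):
        log_prior, means, variances = model[label]
        return log_prior + sum(
            -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
            for v, m, var in zip(row, means, variances)
        )
    return max(model, key=log_posterior)

# Hypothetical toy data with two features and two classes.
rows = [[1.0, 2.1], [1.2, 1.9], [0.8, 2.0],
        [3.0, 4.2], [3.1, 3.9], [2.9, 4.0]]
labels = ["a", "a", "a", "b", "b", "b"]

model = fit_gaussian_nb(rows, labels)
first = predict_gaussian_nb(model, [1.0, 2.0])
second = predict_gaussian_nb(model, [3.0, 4.0])
```

The estimates are closed-form summaries of the training data, so there is no loop that repeatedly adjusts parameters, and the whole model is one prior plus one mean and one variance per feature for each class.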

Conclusion

So, what do we make of this result? Classification is not particularly sensitive to the family of classifier that is employed. Practically any family of classifier can be used to achieve high quality results. Based on these results, a Random Forest or SVM classifier is likely to be the most reliable choice, in that these families seem to work well across a variety of data sets and a variety of configurations. Many classifiers from other families, if effectively tuned, are also likely to be effective. There is no guarantee that all classifiers are equally affected by a single tuning method, or that all varieties of classifier are equal, but many of them will yield high quality results. It appears that how a classifier is used is more important than what kind of classifier it is.
I have left out many of the details of exactly how these different classifiers work. That information can be found in the Fernández-Delgado paper or on Wikipedia.


Domingos, P. (2012) A few useful things to know about machine learning. Communications of the ACM, 55(10), 78-87.
Fernández-Delgado, M., Cernadas, E., Barro, S. and Amorim, D. (2014) Do We Need Hundreds of Classifiers to Solve Real World Classification Problems? Journal of Machine Learning Research, 15, 3133-3181.