If you ask an attorney a simple yes/no question, you’re likely to get the answer “It depends.” If we could identify all of the things on which that opinion depends, and the values of those things that would make the answer come out one way or the other, then we would have an artificial intelligence agent that could replace the attorney on that particular question. That agent could take the form of a checklist, a flowchart, or a computer program. The agent is defined by the variables (the things the opinion depends on) and by how the values of those variables lead to one decision or the other, not by the technology that implements those relations.
If there are only a few variables and simple relations between them and the decision, then the problem is easy and we may be able to simply write down a set of rules (if damages are greater than $1,000 then …). If the opinion depends on lots of variables and if those variables can combine in complex ways, then we may need machine learning and lots of examples to extract and formulate these relations. Each variable may contribute ambiguously to the decision, but when you combine many ambiguous predictors, you can often achieve a reliable decision process.
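The claim that many ambiguous predictors can combine into a reliable decision can be illustrated with a small calculation. The sketch below assumes, purely for illustration, that every predictor is right with the same probability and that the predictors err independently of one another, which real variables rarely do. The function name `majority_vote_accuracy` and the 60% figure are hypothetical choices, not part of any particular system.

```python
from math import comb


def majority_vote_accuracy(n_predictors: int, p_correct: float) -> float:
    """Probability that a strict majority of n independent predictors is right.

    Assumes every predictor is correct with the same probability p_correct
    and that their errors are independent (a simplifying assumption).
    """
    majority = n_predictors // 2 + 1
    return sum(
        comb(n_predictors, k) * p_correct**k * (1 - p_correct) ** (n_predictors - k)
        for k in range(majority, n_predictors + 1)
    )


# Each predictor alone is only slightly better than a coin flip.
for n in (1, 11, 101):
    print(n, round(majority_vote_accuracy(n, 0.6), 3))
```

Under these assumptions the combined vote becomes more reliable as predictors are added, even though no single predictor improves. When predictors are correlated, the gain is smaller.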
The FICO credit score is an example of this kind of decision agent. The FICO score is used to make consumer credit risk decisions. It depends on five broad groups of variables, some of which are proprietary. Because it considers various combinations of variables with varying levels of importance, it is difficult to describe the specific impact that any one of them has on the ultimate score. Two people with the same values for one of these variables may, as a result, have very different FICO scores. Still, this scoring system has been found over the years to be quick and reliable.
Systems like the FICO score typically depend on machine learning. The system designer picks the variables that she thinks are likely to be involved in making the decision and then lets the machine learning system figure out how to use these variables to make the decision. Training the system typically consists of providing it with lots of examples (for example, coded credit reports) and their outcomes (for example, whether the people associated with those credit reports repaid their loans and how quickly).
There are hundreds of methods for machine learning, even if we restrict ourselves to just those that are used to support decision making. Machine learning can be summarized by three things:
- The representation
- The assessment
- The adaptation (or optimization) method
The representation is how the problem is described. What are the potential variables that will be used? How are the values of these variables described numerically? For example, we might include a variable indicating how much has been borrowed in the past and another indicating how much has been repaid on time.
If we wanted to build an agent that played chess, we might represent the chess game as a tree of potential moves. At each point in the game, the tree has one branch for every legal move from that point. The problem could also be represented as a computational network (often called a neural network or deep learning) where inputs lead to multiple layers of simulated neurons and eventually to a predicted categorization.
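A tree of potential moves can be sketched in a few lines. Chess is far too large to enumerate this way, so the example below substitutes a toy game as a stand-in: players alternately remove one or two stones from a pile until none remain. The game, the function names, and the nested-dictionary layout are all illustrative assumptions.

```python
def build_move_tree(stones: int, max_take: int = 2) -> dict:
    """Return a nested dict with one key per legal move from this position.

    Each key is the number of stones taken; its value is the subtree of
    moves available afterwards. An empty dict means the game is over.
    """
    return {
        take: build_move_tree(stones - take, max_take)
        for take in range(1, min(max_take, stones) + 1)
    }


def count_lines_of_play(tree: dict) -> int:
    """Count the leaves of the tree, i.e. the distinct complete games."""
    if not tree:
        return 1
    return sum(count_lines_of_play(subtree) for subtree in tree.values())


tree = build_move_tree(3)
print(tree)                       # {1: {1: {1: {}}, 2: {}}, 2: {1: {}}}
print(count_lines_of_play(tree))  # 3
```

Even in this tiny game the tree grows with every extra stone, which is why realistic game-playing agents search only part of the tree rather than building all of it.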
In document classification, how we represent the text is a critical choice. One document representation could treat each word as an independent item, recording, for example, how often it appears in the document. Another might include information about the context in which the word appears. One representation might include the words just as they appear; another might transform the words to their root form, for example, representing "time," "timer," and "timing" all as the single root "time." Each of these variations is a different representation.
Machine learning can only distinguish items that are different in their representation. If we normalize words to remove prefixes and suffixes, for example, then distinctions between words that differ only in their suffixes would be lost. Words like "take," "taken," and "taking" would all be represented in the same way, and distinguishing between them would be impossible. On the other hand, the system would not have to separately learn that words that differ only in tense should be treated identically. Choosing between representations involves tradeoffs.
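The tradeoff between these two document representations can be made concrete with a rough sketch. The suffix list and the `toy_stem` function below are deliberately crude, hypothetical stand-ins for a real stemmer; they produce truncated stems such as "tak" rather than true root forms, and would mishandle many English words.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny suffix list for illustration only.
TOY_SUFFIXES = ("ing", "en", "er", "e")


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, leaving at least three letters."""
    for suffix in TOY_SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def word_counts(text: str, normalize: bool = False) -> Counter:
    """Represent a document as word counts, optionally after stemming."""
    words = tokenize(text)
    if normalize:
        words = [toy_stem(w) for w in words]
    return Counter(words)


doc = "Taking the time: she had taken the timer, and timing it would take time."
print(word_counts(doc))                  # 'take', 'taken', 'taking' counted separately
print(word_counts(doc, normalize=True))  # all three merged into the stem 'tak'
```

In the first representation a learning system can tell "taken" from "taking" but must learn about each form separately. In the second it sees a single item with a higher count, and the distinction is no longer available to it.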
The assessment part of machine learning includes the goal of the project and how to measure approximations to that goal. In the FICO example, the goal is to predict the creditworthiness of borrowers. For each borrower in the training set we have a representation of the relevant credit history and a measure of how creditworthy that person was. The goal would be to predict the appropriate score for each of the training examples, minimizing the difference between the score predicted by the method and the actual score assigned to that borrower. The smaller the difference, the closer the system is to achieving its goal.
In a decision system, the goal might be a decision about the input, for example, the category in which it belongs, and the approximation might be measured by the accuracy of that decision relative to the training examples.
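Both kinds of assessment can be written as short functions. The sketch below uses mean absolute error for the score-prediction case, which is only one of several reasonable ways to measure "the difference," and simple accuracy for the decision case. The scores and labels are made up for illustration.

```python
def mean_absolute_error(predicted: list[float], actual: list[float]) -> float:
    """Average size of the gap between predicted and actual scores."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def accuracy(predicted_labels: list[str], actual_labels: list[str]) -> float:
    """Fraction of examples whose predicted category matches the actual one."""
    matches = sum(p == a for p, a in zip(predicted_labels, actual_labels))
    return matches / len(actual_labels)


# Made-up training examples, for illustration only.
actual_scores = [700, 650, 720]
predicted_scores = [690, 670, 720]
print(mean_absolute_error(predicted_scores, actual_scores))  # 10.0

actual_labels = ["relevant", "not relevant", "relevant", "relevant"]
predicted_labels = ["relevant", "not relevant", "not relevant", "relevant"]
print(accuracy(predicted_labels, actual_labels))  # 0.75
```

A lower error or a higher accuracy means the system is closer to its goal on the training examples.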
The adaptation or optimization method is how the system works to improve its prediction, to get closer to the specified goal. There are many optimization methods. Generally, they work by adjusting the importance of each variable or each combination of variables. The importance may be all-or-nothing, as might be the case in selecting a variable, or it might be a continuous value, including negative values. The importance controls how the input variable affects the system's prediction. The optimization method might increase or decrease the importance of variables and keep just those changes that result in a better prediction.
The accuracy, speed, and scope of problems that can be solved by machine learning have grown dramatically over the last couple of decades because the creators of those systems have invented better ways of representing problems and adaptation methods that are better at selecting potential changes. It also does not hurt that computers have gotten much faster and have more memory than ever before.
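The idea of nudging importances up or down and keeping only the changes that help can be sketched as a simple hill-climbing loop. This is a minimal illustration under stated assumptions, not a description of how any production system is trained: the prediction is assumed to be a weighted sum of the variables, the error is a sum of squared differences, and the step size and example data are made up.

```python
def predict(weights: list[float], features: list[float]) -> float:
    """Weighted sum: each weight is the importance of one variable."""
    return sum(w * x for w, x in zip(weights, features))


def total_error(weights: list[float], examples: list) -> float:
    """Sum of squared differences between predictions and outcomes."""
    return sum((predict(weights, x) - y) ** 2 for x, y in examples)


def hill_climb(examples: list, n_features: int, step: float = 0.1, rounds: int = 200):
    """Try raising and lowering each weight; keep only changes that reduce error."""
    weights = [0.0] * n_features
    best = total_error(weights, examples)
    for _ in range(rounds):
        improved = False
        for i in range(n_features):
            for delta in (step, -step):
                candidate = weights.copy()
                candidate[i] += delta
                err = total_error(candidate, examples)
                if err < best:  # keep the change only if it helps
                    weights, best = candidate, err
                    improved = True
        if not improved:
            break  # no single adjustment helps any more
    return weights, best


# Made-up examples generated from importances of 2 and -1.
examples = [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0), ([1.0, 1.0], 1.0)]
weights, error = hill_climb(examples, n_features=2)
print(weights, error)
```

Note that one of the learned importances is negative, meaning that larger values of that variable push the prediction down. Practical systems usually rely on more efficient methods for choosing which changes to try, but the underlying loop of adjust, assess, and keep what helps is the same.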
Despite these improvements, machine learning still stands on the three core features of representation, assessment, and optimization. As a result, many machine learning systems tend to return approximately the same results. Given a choice between more clever algorithms and better quality training data, it is often preferable to spend the effort on better data.
Particularly in the context of categorizing documents, the machine learning algorithm makes little difference to the ultimate outcome of the project. The same algorithm can lead to good results or to bad results, depending on how it is used. We had a project, for example, in which several very similar data sets were to be categorized. Each set was handled by a different attorney who was to provide the training. One attorney paid close attention to the problem, worked at making consistent decisions, and selected the training examples over a few days. The system did really well on this set. The other attorneys selected their training documents less systematically, choosing only a few documents per day with several days in between. The very same system did poorly on these sets.
To summarize: Learning algorithms, particularly those involved in document categorization, all work to maximize the probability that a document will be categorized correctly given its content. The tool that machine learning provides to accomplish this is adjusting the importance of the document elements (for example, the words, or whatever representation is used). Methods differ in how they represent the documents, and this difference can be critically important. The quality of the training examples, primarily their consistency, is also a critically important part of machine learning. It does not much matter how these documents are selected, provided that the examples are representative of those that need to be categorized, that they are consistently coded, and that the coding represents the actual desired code for each one.