To hear some people tell the story, we are on the verge of
economic disaster. According to some,
the problem is that debt continues to “explode.” The data, however, may not support these
claims. The argument for explosion rests
on dubious data presentations, which, at best, make these claims difficult to
assess, and, at worst, contradict them.
Given the importance of this topic, I want to use this context to make a
few comments on the presentation of quantitative information. Today's post has little to do with eDiscovery, but a lot to do with data and how we analyze and present it.
Data science is principled storytelling about data. But inadequate understanding of data can lead
to misleading stories. Moreover, how you
present the data has a big impact on how people understand the story that the
data are telling. This is a story about
misrepresenting data, leading to the wrong conclusion.
In particular, I want to focus on an analysis by Grant
Williams. Williams created a 40-minute video called “Crazy, A
Story of Debt,” in which he claims that there is too much debt, which will
cause future economic collapse when that debt comes due. He argues that “The
relationship between [total debt and GDP] is what everything that we have been
through has been all about.” He argues
specifically that high levels of debt were the cause of the 2008 credit crisis
and that the level of debt has increased even more since then. Williams argues that “absurd levels of
short-term debt” have grown more absurd since the Great Recession.
Figure 1. The graph to rule them all: debt and GDP from 1951 to 2015.
In the first chart (Figure 1),
I have redrawn the chart that Williams calls the one chart to rule them all. The debt and GDP data are available from
the Federal Reserve Bank of St. Louis.
This chart shows the total outstanding debt for the US (“All
Sectors; Debt Securities and Loans; Liability, Level”) for each quarter from
1951 to 2015. It is clear that the
amount of debt has increased substantially over that time period, and that it
has increased at a higher rate than the GDP (Gross Domestic Product; the market
value of goods and services produced in a country), particularly over the last
35 years.
What is less clear from Figure 1
is whether the debt increased at the same rate after 2008 as it did before
2008. The time scale on the chart makes
the post-2008 debt look like it is increasing substantially, and it is difficult to compare two
slopes visually in a line graph.
Figure 2,
on the other hand, makes this comparison explicit. From 2001 to 2007 the
debt increased by $9.5 billion per day.
After 2008, it increased at the much lower rate of $3.4 billion per day. Both rates may be economically unhealthy, but
it would be wrong to claim that the debt continues to grow unabated after 2008.
Its growth is substantially lower.
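The per-day figures above are average rates of change between two points in time. As a minimal sketch of that arithmetic, the snippet below computes an average daily increase from two levels and two dates. The function name and the sample levels are hypothetical illustrations, not the FRED series, and the use of calendar days between the endpoints is an assumption about how such a rate might be calculated.

```python
from datetime import date


def average_daily_change(start_level, end_level, start_day, end_day):
    """Average change per calendar day between two observations."""
    days = (end_day - start_day).days
    if days <= 0:
        raise ValueError("end_day must be later than start_day")
    return (end_level - start_level) / days


# Hypothetical levels in billions of dollars, chosen only to illustrate
# the calculation. They are NOT the actual debt figures.
rate_before = average_daily_change(1000.0, 1730.0, date(2001, 1, 1), date(2003, 1, 1))
rate_after = average_daily_change(2000.0, 2365.0, date(2009, 1, 1), date(2011, 1, 1))

# The level is higher in the second period, yet it grows more slowly.
print(f"before: {rate_before:.2f} per day, after: {rate_after:.2f} per day")
```

Putting the two rates side by side as plain numbers makes the comparison that is hard to read off the slopes of a line chart.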
We can focus our chart on the period since 2008, where
Williams focuses his attention. Figure 3
shows just this time period.
The increase in the debt level over this time period does not look as
dramatic in this graph as it does in the first one, though the numbers are
exactly the same.
Figure 3. Debt and GDP from 2008 to 2015. These are the same data as shown in Figure 1 for the time from 2008 to the end of 2015.
Larger economies are likely to have larger debt, all other
things being equal. The GDP is a global
measure of the economy’s output. Looking
at the debt as a proportion of GDP presents a very different story than looking
at the absolute dollar level. The
relative amount of debt has actually declined even as the absolute value has
increased.
Both debt and GDP have increased since 2008. Debt has grown
more quickly than GDP in absolute dollars (as measured by the slope of the two lines), but because debt started from a much larger base, debt as
a percentage of GDP has actually declined since 2008.
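It can seem paradoxical that debt gains more dollars than GDP while the ratio of debt to GDP falls. The following sketch works through the arithmetic with made-up numbers. All values are hypothetical and chosen only so that debt starts from a larger base than GDP; they are not the actual US figures.

```python
def debt_to_gdp(debt, gdp):
    """Ratio of debt to GDP; 3.0 means debt is three times GDP."""
    return debt / gdp


# Hypothetical start and end levels (arbitrary units), NOT real data.
debt_start, gdp_start = 300.0, 100.0
debt_end, gdp_end = 330.0, 120.0

debt_gain = debt_end - debt_start   # absolute increase in debt
gdp_gain = gdp_end - gdp_start      # absolute increase in GDP

ratio_start = debt_to_gdp(debt_start, gdp_start)
ratio_end = debt_to_gdp(debt_end, gdp_end)

print(f"debt gained {debt_gain}, GDP gained {gdp_gain}")
print(f"debt/GDP went from {ratio_start} to {ratio_end}")
```

In this example debt rises by more units than GDP, but GDP grows by a larger percentage of its own starting level, so the ratio declines.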
Figure 4. The ratio of debt to GDP for the period from 2008 to 2015. In this version, it is easier to see that debt has fallen as a percentage of GDP since 2008.
In contrast to the overall debt, the US Federal Debt has, in
fact, increased since 2008 in absolute terms, relative to GDP, and
relative to the total debt.
Conclusions
Good data science starts with a question, analyzes data to
address that question, and presents the results of that analysis in a form
accessible to the target audience. In
the present context, the primary question is whether there has been an
explosion of debt relative to GDP since 2008.
The answer to that question seems to be a resounding no. Although the debt has, as expected, increased,
it has been at a lower rate than before 2008 and it has actually declined
relative to GDP.
In conducting this analysis, we have had to make some
decisions about how we interpret the informal language of English into precise
testable hypotheses. We interpreted
“explosion” to mean a higher rate of increase after 2008 than before. We interpreted “relative to” to mean the
ratio of debt to GDP. Effective data
science, or any other kind of science, for that matter, always requires such translations from
informal language into mathematically precise language.
People often find it
difficult to understand what it means to say that some factor increased, but at
a lower rate than before. Although
people can, if they think about it, understand that the GDP growth rate is
lower at one time than at another, if they don’t actively work at it, they
might expect that a decreasing growth rate would result in a decreasing
GDP. It is not that people are incapable of
understanding such rate claims, but they may find it difficult.
Effective data science presentations, therefore, are the ones that reduce
the cognitive load of making difficult comparisons by designing visualizations
that make these comparisons as explicit as possible. Figure 5
makes the comparison of growth rates before and after 2008 clearer than Figure 1 does, and Figure 2
focuses on it even more directly.
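The distinction between a falling growth rate and a falling level can be checked with a few lines of arithmetic. This is a minimal sketch with invented growth rates and an arbitrary starting level; the function name is hypothetical and none of the numbers come from the GDP data.

```python
def levels_from_growth_rates(start, rates):
    """Apply a sequence of per-period growth rates to a starting level."""
    levels = [start]
    for rate in rates:
        levels.append(levels[-1] * (1 + rate))
    return levels


# Hypothetical growth rates that shrink every period but stay positive.
rates = [0.04, 0.03, 0.02, 0.01]
levels = levels_from_growth_rates(100.0, rates)

rates_falling = all(later < earlier for earlier, later in zip(rates, rates[1:]))
levels_rising = all(later > earlier for earlier, later in zip(levels, levels[1:]))

print("growth rate falls every period:", rates_falling)
print("level still rises every period:", levels_rising)
```

As long as the growth rate stays above zero, the level keeps climbing, only more slowly each period.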
A lot of data science focuses on big data and machine
learning, but good data science is necessary when dealing with more moderately scaled data sets, as well. Basic line charts can provide interesting and
useful information, but they can also be misused.
The data scientist's conclusion should be apparent to the reader from the chosen visualization. The audience should not
have to dig through the graph to assess one's claims (see Figure 4, debt relative to GDP).
Figure 7. Federal debt as a proportion of GDP. This version makes it clear that the Federal debt has been increasing relative to GDP since 2008.
The goal of effective data science should be to find
patterns in data, not to selectively present the data to support preconceived conclusions. Visualizations can help to
make data more comprehensible, but they can also be used to mislead. Visualizations are not arbitrarily related to
the data. Rather, the combination of theory and data constrains the kind of
visualization that is appropriate and useful.