Data analysis

Even when we have access to a lot of data, it can be hard to make sense of it. Being able to analyze our current situation allows us to make informed decisions about our future, which is useful no matter what our profession is. Data analysis is only one aspect of our work, and in isolation it can't do much. But combined with interdisciplinary know-how, it can support execution and at the same time depend on it. Every business has at least a few key performance indicators that generate data, which needs to be analyzed to find out whether everyone is moving in the right direction. Accounting data is also of existential importance.

Everything around us is data, even if we don't usually think of it that way. Very often this fact remains hidden until something brings it to our attention. Sometimes data is like the air: it's there, but we can't see it. This is why it can be useful to think about the data behind every visible object: how it was made, what its origins are, who created it and under what circumstances. It's often much easier for us to describe products through their quantifiable characteristics than it is to tell how we felt using them.

Data is also useful for understanding relationships, analyzing trends, even making approximate predictions. It can be used for comparisons of all kinds (for example, determining how the average size of a web page has changed over the years). But without knowing what we are looking for, we risk collecting the wrong or incomplete data. Collecting data can be problematic, as it sometimes needs to be cleaned up, corrected and/or normalized before we can interpret it and present it in the most appropriate way. (This process also predetermines the strength of our conclusions.) Plotting data improves people's perception of the meaning of the numbers, and this becomes increasingly useful as the dataset grows. Complex data shouldn't lead to a crowded graphic; to avoid that, we need to take more time to understand what the data really means and what wouldn't be relevant to the presentation.
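To make the cleanup and normalization steps concrete, here is a minimal sketch in Python. The values and helper names are entirely made up for illustration, not taken from any real dataset:

```python
# Hypothetical raw measurements (e.g. page sizes in kilobytes),
# some of them missing or malformed, as often happens in practice.
raw = ["1024", "2048", None, "n/a", "1536.5", "  987 "]

def clean(values):
    """Drop missing/malformed entries and coerce the rest to floats."""
    cleaned = []
    for v in values:
        if v is None:
            continue
        try:
            cleaned.append(float(str(v).strip()))
        except ValueError:
            continue  # skip entries like "n/a"
    return cleaned

def normalize(values):
    """Express each value relative to the mean (the mean becomes 100)."""
    mean = sum(values) / len(values)
    return [round(100 * v / mean, 1) for v in values]

data = clean(raw)
print(data)            # [1024.0, 2048.0, 1536.5, 987.0]
print(normalize(data))
```

Only after such a pass does it make sense to interpret or plot the numbers; every decision made here (what counts as malformed, what we normalize against) shapes the conclusions.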

Eurostat is one of the many publishers of freely available data that anyone can take and analyze. Since inequality within the EU is a common theme these days, I decided to look for signs of whether it is growing or shrinking, and by how much. So I took part of the data from the dataset on GDP per capita, which is said to be the true measure of wealth in a society. I was only interested in the data for individual countries within the EU, so I discarded everything else. Unfortunately, the data was already normalized (numbers were relative to the average) rather than raw, as I wanted it to be. On the one hand, this eliminates unneeded noise in the results; on the other, it makes subsequent analysis harder. I was interested in the spread of the data, so I thought a box plot would present it best. Before we continue, I want to make sure you know what a box plot is and how to read it.

Anatomy of a box plot

The box plot gives an overview of the distribution of data. As you can see, it has two whiskers, which denote the maximum and the minimum. The difference between them is the range. The lower bound of the box is the first quartile and the upper bound is the third quartile. The difference between them is the interquartile range (IQR), which is a measure of spread. In the center of the box is the median, not the mean (average); here they are equal only for the sake of clarity. The mean is not resistant to extreme observations, whereas the median is, which can give us less biased results. Even the trimmed mean (cutting the whiskers) is more robust than the mean in the presence of extremes. This diagram type is also useful for checking for unusual observations.
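All of these quantities can be computed directly. A minimal sketch using Python's standard library and made-up sample values (not the Eurostat data); note that `statistics.quantiles` requires Python 3.8+ and that different quantile methods can give slightly different cut points:

```python
import statistics

# Made-up sample with one extreme value at the top.
sample = [55, 62, 71, 74, 78, 81, 86, 90, 95, 120]

q1, median, q3 = statistics.quantiles(sample, n=4)  # the three quartile cut points
iqr = q3 - q1                                       # interquartile range: the box height
whisker_low, whisker_high = min(sample), max(sample)
data_range = whisker_high - whisker_low

mean = statistics.mean(sample)
# The median (79.5) resists the extreme value 120; the mean (81.2) is pulled toward it.
print(q1, median, q3, iqr)
print(mean)
```

This is exactly the information a single box in the plot encodes: the box spans `q1` to `q3`, the line inside it is the median, and the whiskers reach toward the extremes.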

Let's look at the data from Eurostat now. Side-by-side box plots can help us compare annual data and see how it changed over time.

GDP per capita in EU countries (2007-2012)

At the top we see dots, which are outliers: values so different from the rest that they don't fit the overall pattern. We can see that since 2008 the median has been constantly decreasing, which should indicate that countries as a whole have decreased their GDP per capita, even though they are measured against the same average of 100%. The maximums have declined, but only up to a point, whereas the minimums have been constantly increasing since 2007, which means that the poorest countries seem to be increasing their GDP per capita. The first quartile has moved up even more strongly and is quickly closing the gap with the median. This could be a sign that inequality is slightly shrinking, if we can trust the data.
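A common convention for flagging those dots is Tukey's rule: a point is an outlier if it lies more than 1.5 × IQR beyond either quartile. A small sketch with invented relative values (average = 100), not the real Eurostat numbers:

```python
import statistics

def tukey_outliers(values):
    """Flag points outside the 1.5 * IQR fences (a common box plot convention)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lower or v > upper]

# Invented relative GDP values; one country far above the rest.
values = [47, 65, 72, 80, 91, 100, 104, 110, 123, 131, 271]
print(tukey_outliers(values))  # [271]
```

Plotting libraries typically apply this same rule when deciding which points to draw as individual dots beyond the whiskers.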

Analyzing datasets like this can help us come to our own conclusions about data that is of direct interest to us. There are many other plot types for knowledge compression (bar, scatter, histogram, pie, etc.), and knowing when to use which can help with the interpretation of the data. Just because we find a particular type more beautiful than others doesn't mean we should use it. I encourage you to find a simple dataset and see whether you can spot hidden signals that you would otherwise miss if they were encoded in pure numbers. Even if we aren't data analysts, we can still find interesting details.

What I've learned is that in some cases choosing a logarithmic scale can help us see trends in the data more clearly. Linear regression allows us to understand linear relationships in the data; the strength of the relationship is summarized by the correlation of the line of “best fit”. Correlation ranges from -1 to +1: a value near +1 indicates a strong positive relationship, a value near -1 indicates a strong negative (inverse) relationship, and a value near 0 indicates a lack of linear relationship. We can graph the residuals (differences between actual and predicted values) to analyze the fit, but the residuals must be trend-free if we want our conclusions to be correct.

It's best if we can collect data through experiments, and to fall back on observation only when experiments aren't possible (we risk influencing the people we observe, which is also true in web design). According to the central limit theorem, if we repeat an experiment over and over, the distribution of the average result will converge to the normal distribution (whose data is consistent with the empirical rule), which decreases the amount of uncertainty in the results. Randomization is another technique that can be used in experiments to reduce bias or the inability to draw conclusions. It is important for the selection of representative samples.
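A quick simulation can illustrate the idea: averages of many repeated experiments cluster tightly around the true mean, even when the underlying values are uniformly spread out. The sample sizes and seed below are arbitrary choices for the sketch:

```python
import random
import statistics

random.seed(42)  # arbitrary seed, just for reproducibility

def sample_mean(n):
    """Average of n draws from a uniform(0, 1) 'experiment'."""
    return statistics.mean(random.uniform(0, 1) for _ in range(n))

# 1000 repetitions of averaging 100 draws each.
means = [sample_mean(100) for _ in range(1000)]

print(statistics.mean(means))   # close to the true mean of 0.5
print(statistics.stdev(means))  # far smaller than the spread of single draws
```

The histogram of `means` would look approximately normal, and its spread shrinks as the per-experiment sample size grows, which is exactly the reduction in uncertainty mentioned above.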