Plotting 10 million data points

Every day we encounter large amounts of data that in one way or another describe some aspect of our environment. This data can be generated by individuals, sensors, systems, machines or business operations. Our task is often to try to understand it and see whether we could use it to support or improve our decisions.

We often hear that we are drowning in data but starving for insight. This is not surprising: high-speed streaming data now arrives faster than our hardware and software can analyze it. The magnitude of the data has changed; what was considered a big dataset before is now considered to be of moderate size.

Our code needs to run that much faster, or we won't be able to manage the data flood. If we have only a partial or incomplete picture of this data, the quality of our decisions will be equally partial. We may start to think that our decisions must be hasty, since the data that can make them obsolete arrives just as fast and thus increases the uncertainty.

Buying faster hardware can partially postpone the problem with the amount of data, but this can become too expensive or even economically infeasible. Moreover, many such upgrades were and are financed by debt, which is not sustainable. The value that a potential insight or better understanding of the data can bring must somehow offset the higher hardware cost, which is not always the case.

This means that we need to look at our software tools, improve our programming languages and data plotting/analysis machinery. For instance, we may be able to use a very fast programming language, but our plotting library may be too slow. Or our programming language can be slow, but our plotting library fast. In both cases, we would be limited to the lowest common denominator in terms of how much data we could process and analyze when both are used together. And if there isn't tighter integration between the two, this will slow down the process even further.

Imagine that a loop iteration takes time m and that a dataset consists of n items. Now consider another scenario, where the loop is slightly faster, say by 20%, but the dataset is suddenly 10 times bigger. In the first case we would spend m*n time to touch all items; in the second case we would spend 0.8*m*10*n, or 8*m*n. This is only a rough theoretical approximation, but it makes a point: slow technological advancements cannot possibly cope with a fast increase in the problem sizes we encounter. The speed of the loop remains relatively constant, but suddenly this constant has to be multiplied by a much bigger factor.

What we could do is review our tools and seek whether there are inefficiencies in our process or implementation.

It is surprising that a single-core mobile CPU is capable of generating 10 million lines of 2 space-separated random integers in the range [1, 100 million] and writing them to a file in just 47 seconds. Depending on the amount of free RAM we have, plotting all this data in a scatterplot could take around 3-5 minutes. This is why plotting 100 million data points may actually fail, but it is always good to know where the limits are.

Scatterplots are important when we have lots of potentially unrelated data points or when we want to look for outliers. Below you can see the code used to generate the points.

#include <iostream>
#include <cstdlib>
#include <ctime>
#include <fstream>

using namespace std;

int main() {
    int x;
    int y;
    clock_t begin = clock();

    // Open the output file (the original listing was missing this call)
    ofstream file("file.txt");
    for (int i = 0; i < 10000000; ++i) {
        x = rand() % 100000000 + 1;
        y = rand() % 100000000 + 1;
        file << x << " " << y << endl;
    }
    file.close();

    clock_t end = clock();
    double time = double(end - begin) / CLOCKS_PER_SEC;
    cout << "Elapsed time:" << endl;
    cout << time << "s" << endl;
    return 0;
}

What you can see is that we don't plot anything here, but we store all the data in a file. Reading that file later in a single pass allows us to be more efficient than if we had to use some plotting function in a tight loop. Additionally, keeping a separate data file gives us the flexibility to select any plotting library of our choice.

After we build and execute, the file is generated (117 MB) and we can now use Gnuplot with it. If we chose to plot 100 million data points, that would create a 1.2 GB file in 9 minutes and 5 seconds, but we would be unable to plot such an amount of data without running out of memory.

set style fill transparent solid 0.5
plot 'Desktop/file.txt' with points pt 6 lt 3 ps 0.0001

The first line makes the points semi-transparent. Then we need to know the current working directory of Gnuplot, which is obtained with the pwd command. If your data file happens to be on your Desktop and the Desktop is a subdirectory of what pwd has returned, you could write 'Desktop/file.txt' to plot the contents of that file directly. Here we plot with points, but in some cases you might prefer linespoints instead. pt 6 indicates the type of the marker (here an empty circle); a filled circle is pt 7. lt 3 indicates the color index, which stands for blue; lt 4 would be yellow (at least here). ps stands for point size, and the default is 1. Since we have an extremely large number of points, we need to make this number very small to ensure that they remain somewhat distinguishable.

This is the result we get:

A scatterplot with 10 million data points

What is interesting is not how the plot looks, but how many other use cases can be reduced to it. Perhaps there is an even better way to visualize lots of data or to explore how individual items might be related. Hopefully this will help you create your own visualizations to look for interesting patterns.