Fitting a spline to the time-series data of the project "Search for papers"

Different spline fits to the data

The horizontal axes show the number of days since the project started, while the vertical axes show the number of paper titles added so far. The points at which measurements were taken are shown in black. In the first period of more than 200 days no measurements were taken, so there is no data on how the number of papers changed during that time. Spline interpolation fits a smooth curve through our data points in a way that could reveal useful information about the regions where few or no measurements exist.

Here you see three different kinds of spline (linear, quadratic and cubic) fitted to the data.

The linear spline passes through all data points, but it does so abruptly, with a visible kink at each point. For the first 200 or so days it paints the picture that the same number of papers was added every single day, which is simply not true. We know that on some days no new paper titles were added, while on others a large number were added. On the other hand, it looks somewhat realistic: if we extend the first straight line in our imagination, its slope is approximately the same as the slope of the line connecting the points from day 233 to day 413. The first line grows slightly faster than the second, which indicates that the initial effort was bigger than the effort later spent on improvement, and that makes sense. Additionally, the number of new paper titles not already included has started to decline, so it became relatively harder to add new ones.
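The slope comparison above can be sketched in a few lines of Python. This is only an illustration: the `days` and `titles` values are made-up placeholders (only the day numbers 233 and 413 come from the text), and `linear_spline` and `segment_slopes` are hypothetical helper names, not part of any library.

```python
from bisect import bisect_right

# Hypothetical measurements, NOT the project's real data:
# day since project start -> number of paper titles.
days = [0, 233, 413]
titles = [0, 50000, 80000]


def linear_spline(x, xs, ys):
    """Evaluate the piecewise-linear interpolant through (xs, ys) at x."""
    if not xs[0] <= x <= xs[-1]:
        raise ValueError("x is outside the measured range")
    # Index of the segment [xs[i], xs[i + 1]] that contains x.
    i = min(bisect_right(xs, x) - 1, len(xs) - 2)
    slope = (ys[i + 1] - ys[i]) / (xs[i + 1] - xs[i])
    return ys[i] + slope * (x - xs[i])


def segment_slopes(xs, ys):
    """Average number of titles added per day on each segment."""
    return [(ys[i + 1] - ys[i]) / (xs[i + 1] - xs[i])
            for i in range(len(xs) - 1)]


slopes = segment_slopes(days, titles)
print("titles per day on each segment:", slopes)
print("estimate for day 100:", linear_spline(100, days, titles))
```

With these placeholder numbers the first segment is steeper than the second, which is the pattern described above. The constant slope inside each segment is exactly the "same number of papers every day" assumption that makes the linear spline unrealistic in detail.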

The quadratic spline is the most appealing of the three, and it passes smoothly through all data points. The problem is that at day 233 it looks as if the number of paper titles will keep growing at an ever-increasing rate, which is rather unrealistic. Nothing changed so dramatically in the way resources were assigned to the project to justify such thinking.
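To see where that steep slope at day 233 can come from, here is a minimal sketch of one simple quadratic spline construction. It assumes knots at the data points and a freely chosen slope at the first point, which is not necessarily the construction the plotting library used. The data values are again made-up placeholders and `quadratic_spline` is a hypothetical helper.

```python
# Hypothetical measurements, NOT the project's real data.
days = [0, 233, 413]
titles = [0, 50000, 80000]


def quadratic_spline(xs, ys, start_slope=0.0):
    """Build a piecewise-quadratic interpolant with a continuous first derivative.

    The slope at the first knot is a free parameter (start_slope); all other
    knot slopes follow from requiring the derivative to be continuous.
    Returns the evaluation function and the list of knot slopes.
    """
    z = [start_slope]  # z[i] is the slope of the spline at xs[i]
    for i in range(len(xs) - 1):
        h = xs[i + 1] - xs[i]
        # The average of the two end slopes must equal the segment's mean slope.
        z.append(2 * (ys[i + 1] - ys[i]) / h - z[i])

    def evaluate(x):
        # Find the segment containing x (the last segment if x is at the end).
        i = 0
        while i < len(xs) - 2 and x > xs[i + 1]:
            i += 1
        h = xs[i + 1] - xs[i]
        dx = x - xs[i]
        return ys[i] + z[i] * dx + (z[i + 1] - z[i]) / (2 * h) * dx * dx

    return evaluate, z


curve, knot_slopes = quadratic_spline(days, titles)
print("slope at each measurement:", knot_slopes)
print("estimate for day 100:", curve(100))
```

In this construction, a start slope of zero forces the slope at the second point to be twice the average rate of the first segment. A flat start followed by a long gap without measurements is enough to make the curve look as if growth is accelerating when it reaches day 233.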

Finally, we can see that the cubic spline overfits this data. It cannot pass through all data points without distorting the curve between them. It presents things as if we had added more than 70000 titles at the beginning and then dropped their number to approx. 50000, which is simply not true. Normally, the higher the order of the spline (cubic = 3), the higher the tendency to overfit. It is preferable to use a lower-order spline if we can. Cubic Bézier curves, a closely related family of cubic curves defined by the Bernstein polynomials, are often used in computer graphics software (e.g. the Path tool) or on the web (SVG). The cubic-bezier() function in CSS can express the standard named easing curves we use in our animations: we can think of names such as "ease-in" and "ease-in-out" as particular parameter choices for this function. But as always, a handful of names can't describe every possible curve.
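The link between Bernstein polynomials and the CSS easing names can be sketched as follows. The control point values for "ease-in" and "ease-in-out" are the ones defined by CSS. The function names and the bisection solver are illustrative choices for this post, not how any browser necessarily implements it.

```python
def bernstein3(t):
    """The four cubic Bernstein basis polynomials evaluated at t in [0, 1]."""
    u = 1.0 - t
    return (u ** 3, 3 * u * u * t, 3 * u * t * t, t ** 3)


def cubic_bezier_point(t, x1, y1, x2, y2):
    """Point on the curve with control points (0, 0), (x1, y1), (x2, y2), (1, 1)."""
    _, b1, b2, b3 = bernstein3(t)  # the (0, 0) control point contributes nothing
    return (b1 * x1 + b2 * x2 + b3, b1 * y1 + b2 * y2 + b3)


def ease(progress, x1, y1, x2, y2, iterations=50):
    """Map animation progress in [0, 1] to an eased output value.

    Solves x(t) = progress by bisection, which works because x(t) is
    monotonic when x1 and x2 lie in [0, 1], as CSS requires.
    """
    lo, hi = 0.0, 1.0
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if cubic_bezier_point(mid, x1, y1, x2, y2)[0] < progress:
            lo = mid
        else:
            hi = mid
    return cubic_bezier_point((lo + hi) / 2, x1, y1, x2, y2)[1]


# Control points of two CSS keywords, as cubic-bezier(x1, y1, x2, y2).
EASE_IN = (0.42, 0.0, 1.0, 1.0)
EASE_IN_OUT = (0.42, 0.0, 0.58, 1.0)

for p in (0.25, 0.5, 0.75):
    print(p, ease(p, *EASE_IN), ease(p, *EASE_IN_OUT))
```

Each keyword is just one point in a four-parameter space, which is why a few names can only ever cover a small part of what the function can produce.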

Splines are very powerful, so it is not surprising that we often return to them.