Reducing the size of file access logs

Logs are very helpful in general; it is often said that most systems should have some form of logging enabled. Logs help us understand what is happening on a website or on a given system: they help us diagnose problems, find bottlenecks, answer queries, and pinpoint the exact time when particular events occurred.

But if we are not careful, our log files can grow very large, consuming precious system resources. If left unchecked, an outsized PHP error log can limit our ability to store more data on space-constrained shared hosting. Every month that our site remains in operation produces another batch of log data, and depending on the frequency of visits, the combined size of all log files can add up.

If we take a critical look at the data we collect, we may find that we don't need to store some of it. For instance, recording the seconds of each visit may not be useful if we can live with a granularity of one minute. We might also not care whether a request was a GET or a POST if that is already reflected in the form of the visited URL. Similarly, whether the request used the HTTP/1.1 protocol or a newer one may not be relevant to us. And the "compatible" part of the user-agent description may still leave us wondering what type of browser accessed the page, since browsers have historically been known to misreport themselves.

To better understand what I mean, suppose that we have the following access log fragment:

access_log_data = """
 - - [10/Feb/2014:05:34:17 +0200] "GET /wp-admin/ HTTP/1.1" 404 - "-" "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0)"
 - - [16/Feb/2014:06:06:25 +0200] "GET / HTTP/1.1" 302 2449 "-" "GoogleBot 1.0"
 - - [16/Feb/2014:06:06:26 +0200] "GET /index.php?page=1 HTTP/1.1" 200 3525 "-" "GoogleBot 1.0"
"""

You might have already seen the term "code bumps" on my blog, and that is exactly what is wrong here. If we have a file with a million lines in this format, finding something quickly by reading through it becomes really hard. But we could transform the data into a form that reduces our storage costs and at the same time lets us search through the data with automated tools. For instance, the CSV format is both space-efficient and easy to parse, so we could use it as a target.
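To illustrate how friendly delimiter-separated records are to automated tools, here is a minimal sketch that parses a couple of records with Python's csv module. The field layout (time and date, IP, page, browser, separated by pipes) matches what the transformation below will produce, but the sample strings themselves are made up for the example:

```python
import csv
import io

# Two records in the pipe-separated layout we will be producing:
# "HH:MM DD.MM.YYYY|ip|page|browser" (the ip field is empty in this sample)
sample = ("05:34 10.02.2014||/wp-admin/|Mozilla/5.0\n"
          "06:06 16.02.2014||/|GoogleBot 1.0\n")

# csv.reader accepts any single-character delimiter, not just commas.
rows = list(csv.reader(io.StringIO(sample), delimiter='|'))
for timestamp, ip, page, browser in rows:
    print(timestamp, page)
```

The same data can of course be grepped or loaded into a spreadsheet, which is the whole point of choosing a predictable, delimited layout.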

Here is the short code that we will use for the transformation:

transformed_data = []
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

# Skip the empty first and last lines of the triple-quoted string.
for line in access_log_data.split('\n')[1:-1]:
    ip, rest = line.split(' - - ')
    dt, rest = rest.split('] ')
    page, browser = rest.split(' "-" ')
    dt = dt[1:]                   # drop the leading '['
    browser = browser[1:-1]       # drop the surrounding quotes
    page = page.split(' ', 2)[1]  # keep only the URL from the request line
    dt_components = dt.split(':', 1)
    date_month_year = dt_components[0]
    hours_mins = dt_components[1].split(' ')[0][:-3]  # drop the seconds
    date, m, year = date_month_year.split('/')
    month = '%02d' % (months.index(m) + 1)  # zero-pad single-digit months
    date_month_year = '.'.join([date, month, year])
    transformed_data.append('|'.join([hours_mins + ' ' + date_month_year,
                                      ip, page, browser]))

transformed_data = '\n'.join(transformed_data)
print(transformed_data)
"""
05:34 10.02.2014||/wp-admin/|Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0)
06:06 16.02.2014||/|GoogleBot 1.0
06:06 16.02.2014||/index.php?page=1|GoogleBot 1.0
"""

original_data_len, transformed_data_len = len(access_log_data), len(transformed_data)
print('Original data length: %d, Transformed data length: %d'
      % (original_data_len, transformed_data_len))
print('{0:.2f}% from the original size'
      .format((transformed_data_len / original_data_len) * 100))
"""
Original data length: 354, Transformed data length: 217
61.30% from the original size
"""
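The snippet above works on an in-memory string, which is fine for a demo. For a real log of millions of lines we would rather stream the file one line at a time, so that memory use stays constant no matter how large the input is. Here is a sketch of that idea; the file names and the helper function are my own for the example, not part of the code above:

```python
# Streaming variant: read one line, write one reduced record, repeat.
MONTHS = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

def transform_line(line):
    """Turn one access-log line into a pipe-separated record."""
    ip, rest = line.rstrip('\n').split(' - - ')
    dt, rest = rest.split('] ')
    page, browser = rest.split(' "-" ')
    date_month_year, time_part = dt[1:].split(':', 1)
    hours_mins = time_part.split(' ')[0][:-3]          # drop the seconds
    date, m, year = date_month_year.split('/')
    month = '%02d' % (MONTHS.index(m) + 1)
    page = page.split(' ', 2)[1]                       # keep only the URL
    return '|'.join([hours_mins + ' ' + '.'.join([date, month, year]),
                     ip, page, browser[1:-1]])

# Write a tiny sample log so the sketch is runnable end to end.
with open('access.log', 'w') as f:
    f.write(' - - [16/Feb/2014:06:06:25 +0200] '
            '"GET / HTTP/1.1" 302 2449 "-" "GoogleBot 1.0"\n')

with open('access.log') as src, open('access.csv', 'w') as dst:
    for line in src:
        dst.write(transform_line(line) + '\n')
```

Because each line is processed and discarded, this version handles a multi-gigabyte log with the same tiny memory footprint as the three-line sample.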

We could have used a regular expression instead, but regular expressions are often hard to read, being by nature full of code bumps. The stepwise approach lets us label every piece of data, so that we always know where to find what we need. Note that using the pipe symbol as a separator works only if we know it cannot occur anywhere else on a line of the access log. Notice also that on the first line the "compatible" string has been preserved; without it the size reduction would have been even greater.

This is how we arrive at almost 39% savings over the original size. That might not seem like much, but if your log files take up several gigabytes, it can save you several hundred megabytes, provided you can live without the originals. To do such a transformation correctly, however, you need to know which data you should keep in your case and which you can afford to discard.
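For comparison, a regular-expression version of the same parsing might look like the sketch below. The pattern is my own attempt for this specific log format, not something from the code above, and it illustrates the kind of code bumps that make such solutions harder to read:

```python
import re

# One pattern capturing the same four pieces as the stepwise splits.
LOG_RE = re.compile(
    r'(?P<ip>\S*) - - '                      # IP (empty in our sample data)
    r'\[(?P<day>\d{2})/(?P<mon>\w{3})/(?P<year>\d{4})'
    r':(?P<hm>\d{2}:\d{2}):\d{2} [^\]]+\] '  # keep hours:minutes, drop seconds
    r'"\w+ (?P<page>\S+) [^"]+" \S+ \S+ '    # request line, status, size
    r'"-" "(?P<browser>[^"]*)"'              # referrer placeholder, user agent
)

m = LOG_RE.match(' - - [16/Feb/2014:06:06:25 +0200] '
                 '"GET / HTTP/1.1" 302 2449 "-" "GoogleBot 1.0"')
print(m.group('hm'), m.group('page'), m.group('browser'))
# prints: 06:06 / GoogleBot 1.0
```

It does the job in fewer lines, but six months later the named groups will likely be the only part of it that still reads easily.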