LZ4 vs. GZIP

Compression becomes more important as we continue to add content to the web every single day. The average size of our websites keeps growing rapidly, so we need to look for ways to minimize the waiting time for our users. The nature of our content determines the most appropriate compression algorithm. For example, if we have images, PNG or JPEG will work, but to choose between them we need to know how many colors our image has and whether a small loss in detail is tolerable. WebP is another option, but we need to ensure that it is well supported and just as easy to use.

The majority of online content is textual. If it is natural language, we can't remove the white space, because individual words would no longer be clearly distinguishable. But if we have code, such minification is possible and can decrease the file size without altering the original functionality. This is one reason why ending each line of code with a semicolon is recommended: it allows the parser to recognize individual statements and interpret them correctly even after the line breaks are removed. After minification we can decrease the file size further through compression. But then we need to make sure that the browser will be able to decompress our file in a reasonable time. We can check on the server whether the request header "Accept-Encoding: gzip, deflate" is present, which tells us which compressed formats the browser accepts.
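As an illustration (this sketch is mine, not from the original article), a minimal Python WSGI application could check that header before compressing the response:

    import gzip
    from wsgiref.simple_server import make_server

    def app(environ, start_response):
        body = b"<html><body>Hello</body></html>" * 100  # placeholder content
        headers = [("Content-Type", "text/html")]
        # The browser lists the encodings it understands in Accept-Encoding.
        # (A real server would parse quality values instead of this naive check.)
        if "gzip" in environ.get("HTTP_ACCEPT_ENCODING", ""):
            body = gzip.compress(body)  # compress only when supported
            headers.append(("Content-Encoding", "gzip"))
        headers.append(("Content-Length", str(len(body))))
        start_response("200 OK", headers)
        return [body]

    if __name__ == "__main__":
        make_server("", 8000, app).serve_forever()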

GZIP is widely used today and for a good reason. It provides very good compression and can be applied to anything that contains text. It also provides relatively fast decompression, which is very important on the web. But lately another algorithm has been gaining popularity and is already used in a variety of products, from databases to OS kernels: LZ4, which is claimed to offer much faster compression and decompression speeds, especially on multi-core systems. My first thought was that speed is important on the web too, where every millisecond counts, so why not use it there as well? So I created some simple tests to compare the performance of GZIP and LZ4 and to understand in which use cases each would excel. I decided to test only compression ratios, since speed measurements are system-specific and would be harder and more error-prone.

First, I decided to compare both algorithms on something familiar and in wide use on the web today: jQuery. I took the most recent development and production versions and supplied them as input to both algorithms. I did the GZIP compression through the gzip module in Python, which uses compression level 9 (best compression, slowest speed) by default, so I needed to make sure that LZ4 used an equivalent setting. For LZ4 I used the command-line tool for Windows provided by Yann Collet, the creator of the algorithm.
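A rough sketch of such a comparison in Python might look like this. The file names are placeholders, and I use the python-lz4 bindings here for brevity, whereas the original tests used the command-line tool:

    import gzip
    import lz4.frame  # pip install lz4

    def compare(path):
        # Print the original, GZIP, and LZ4 sizes of a file.
        with open(path, "rb") as f:
            data = f.read()
        gz = gzip.compress(data, compresslevel=9)  # best compression
        lz = lz4.frame.compress(
            data, compression_level=lz4.frame.COMPRESSIONLEVEL_MAX)
        print(f"{path}: original {len(data)}, "
              f"gzip {len(gz)}, lz4 {len(lz)} bytes")

    compare("jquery.js")      # development version
    compare("jquery.min.js")  # production version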

Compression of the jQuery development version

With the uncompressed development version, only small differences were noticeable, but GZIP offered better compression.

Compression of the jQuery production version

With the minified production version, the differences were somewhat more pronounced, with GZIP again coming out ahead.

Then I decided to see if there is a difference with files in natural language, so I took Peter Norvig's big.txt (6.18MB!) and used it as input.
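With the compare() sketch from earlier, this test would be a one-liner (the file name is again an assumption):

    compare("big.txt")  # Peter Norvig's 6.18MB natural-language sample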

Compression of a big file in natural language

Again we see only small differences, so in a case like this one, where the original file is big, the speed of compression and decompression matters much more. If LZ4 decompression is much faster than GZIP decompression, users will see the content much sooner, which will make LZ4 appear to perform better.

Where else do we have a lot of content? Databases. So I took a simple SQL file to see whether there is any difference this time.

Compression of an SQL file

It seems that in this case both algorithms perform a bit better, bringing the compressed versions down to almost 25% of the original size. But a bigger original file could have made this test a bit more representative.

Compression of a large file with permutations of a set of characters

Finally, I gave as input my own file, which contained all possible permutations of a set of characters, separated by a delimiter. We can see that in this case the differences in output size are no longer small. Something else I noticed is that on my single-core processor, GZIP compressed this file in 21 seconds, whereas LZ4 needed 66 seconds; on a multi-core system LZ4 might have performed much better. Decompression, on the other hand, was different: GZIP took around 4 seconds, while LZ4 finished in less than a second, which is very fast for a 112MB file. Applications that have to deal with very large datasets could certainly benefit from this.

So the decision of which algorithm to use could really depend on connection speed. If users have fast Internet connections, they are less likely to notice that a larger file took a couple of seconds more to download, and they could then decompress it almost instantly with LZ4 (if browsers supported this). But on a slower connection, the time required to download the larger file could quickly overshadow the gains from faster decompression. In that case GZIP would still be the more appropriate choice.
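For reference, a file like the one in this last test could be generated with a few lines of Python; the character set, delimiter, and file name below are my own assumptions, since nothing about them is specified above:

    from itertools import permutations

    CHARS = "abcdefghij"  # a hypothetical set of ten characters

    # Write every permutation of CHARS on its own line;
    # the newline acts as the delimiter between permutations.
    with open("permutations.txt", "w") as f:
        for p in permutations(CHARS):
            f.write("".join(p) + "\n")

Content this repetitive compresses extremely well, which may be why the differences between the two algorithms finally become visible here.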

This may be one reason we don't see more excitement about LZ4 for compressing content on the web today. But things can change, which is why it's not so simple to say that one algorithm is better than another when they may have been designed for specific and very different scenarios. We first need to understand how the context affects an algorithm's performance before we can evaluate whether it is appropriate for the task.

bit.ly/1g6K86Y