Data storage options

HTML data attributes. These seem to be suitable mostly for small amounts of data that will be accessed only occasionally. But if we have to add data attributes to every single element, the HTML quickly becomes hard to read and change. Not only that, its size grows disproportionately to the value each additional byte adds. Although many attributes store some kind of data, their keys are fixed and can't be changed; data attributes effectively let us define our own keys, which adds flexibility. There are already proposals for custom HTML elements too, though whether this will make code more readable for web designers is still debatable. We also need to keep in mind that data attributes are generally slower to work with than retrieving the data from non-persistent arrays and objects (see below).
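
A minimal sketch of how this looks in practice; the element id and the key names below are made up for illustration:

    <li id="user-42" data-role="admin" data-last-login="2014-05-01">Ann</li>

    // Reading and writing the custom keys from JavaScript
    var item = document.getElementById('user-42');
    var role = item.dataset.role;                  // "admin"
    item.dataset.lastLogin = '2014-06-01';         // maps back to data-last-login
    var sameRole = item.getAttribute('data-role'); // equivalent, more verbose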

Base64 encoding. This is a way to embed a small image (preferably under 1-2kB) directly in the markup or stylesheet as text (a data URI), in order to save an additional HTTP request for the image. In some cases such a representation can be useful, because every extra request over the network adds latency. But base64 output is roughly a third larger than the original binary, so for larger images the encoding costs more than it saves, which is a drawback. Another reason not to use base64 encoding is when we need the image to be cached by the browser, since an inlined image cannot be cached separately from the document that contains it. Still, this encoding may be useful in isolated cases.
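
As an illustration, a tiny icon can be inlined directly in the markup or the stylesheet as a data URI (the base64 payload is shortened here):

    <img alt="bullet" src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUg..." />

    /* The same idea in CSS */
    .bullet {
        background-image: url("data:image/png;base64,iVBORw0KGgoAAAANSUhEUg...");
    }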

Cache between the client and server. These caches store data temporarily in order to prevent queries from hitting the server and thus slowing it down. Memcached, Varnish, Squid, APC and Cache_Lite can all be used to speed up requests for data on websites, but we need to be aware of their individual constraints. It may be a good idea to examine the cache hit rate and, if possible, the conditions that invalidate the previously stored data (frequent invalidation can lead to cache thrashing). We should also remember that we are effectively storing the available data twice, so we must evaluate whether this is worth the effort. The cache policy (write-through/write-back) may be another thing to consider if we want things to work correctly.
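
A rough sketch of the common cache-aside flow; cache and db below stand in for whichever client libraries are actually in use, so both objects and their methods are hypothetical:

    // Try the cache first, fall back to the database on a miss
    function getProduct(id, cache, db, callback) {
        var key = 'product:' + id;
        cache.get(key, function (err, cached) {
            if (!err && cached) {
                return callback(null, JSON.parse(cached)); // cache hit
            }
            db.findProductById(id, function (err, product) { // cache miss
                if (err) { return callback(err); }
                // Keep a copy for later requests and let it expire after 300s
                cache.set(key, JSON.stringify(product), 300);
                callback(null, product);
            });
        });
    }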

Cookies. They are used to store very small amounts of data (less than 4kB) on the client's machine, usually in the form of named values. Creating cookies is relatively easy, but deleting them can be less intuitive, since we need to set the same cookie again with an expiry date in the past; in other words, one mechanism is used for both operations. Cookies may need to be set at a particular domain level, and it may not be obvious why. Sometimes cookies feel intrusive, especially when websites start asking whether we want them stored on our machine or not. But the alternative is data being stored without our knowledge, which is even worse considering how insecure that could be. Cookies can be disabled in the browser, which makes the data they contain only conditionally available. Another disadvantage is that over time we collect too many of them, which requires frequent cleaning.
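
A quick sketch of both operations in JavaScript; note that "deleting" really is just another write, only with an expiry date in the past:

    // Create (or update) a cookie that expires in 7 days
    var expires = new Date(Date.now() + 7 * 24 * 60 * 60 * 1000);
    document.cookie = 'theme=dark; expires=' + expires.toUTCString() + '; path=/';

    // Remove the same cookie by setting it again with a past date
    document.cookie = 'theme=; expires=Thu, 01 Jan 1970 00:00:00 GMT; path=/';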

Web storage. localStorage and sessionStorage are part of the Web Storage specification. This is a key-value store made accessible to client-side JavaScript. Data is kept on the client; localStorage persists across sessions, while sessionStorage is cleared once the tab or browser is closed. Both are implemented as objects in the browser that provide convenient get/set methods. If we have modified element attributes through the getAttribute/setAttribute methods in JavaScript, we already know how this feels, which makes web storage intuitive and easy to use. Storing data this way means that we don't have to worry whether cookies are available or whether we can fit within their size limit.
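
The API really is as simple as the getAttribute/setAttribute analogy suggests; a small sketch:

    // Survives browser restarts
    localStorage.setItem('lastSearch', 'data storage');
    var lastSearch = localStorage.getItem('lastSearch');

    // Only strings are stored, so objects have to be serialized first
    sessionStorage.setItem('cart', JSON.stringify({ items: 2, total: 18.40 }));
    var cart = JSON.parse(sessionStorage.getItem('cart'));

    localStorage.removeItem('lastSearch');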

Databases on the client. IndexedDB is a database that is part of the HTML5 specification and is still under active development. Browser implementations differ in how much data they allow to be stored. But the data is stored persistently, which is good when we need more than web storage can hold (on average around 5MB). The question is what we do with the data we store and how we access it. Reading medium-sized data (20-30MB) from IndexedDB at once can quickly make our website unacceptably slow, no matter how beautiful the end result is.
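
A minimal sketch of writing and reading back a single record; the database and store names are arbitrary:

    var request = indexedDB.open('appData', 1);

    request.onupgradeneeded = function (event) {
        // Runs only when the database is created or its version changes
        event.target.result.createObjectStore('notes', { keyPath: 'id' });
    };

    request.onsuccess = function (event) {
        var db = event.target.result;
        var store = db.transaction('notes', 'readwrite').objectStore('notes');
        store.put({ id: 1, text: 'IndexedDB survives page reloads' });
        store.get(1).onsuccess = function (e) {
            console.log(e.target.result.text); // "IndexedDB survives page reloads"
        };
    };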

XML. This format is used to store data in a semantically meaningful and portable/interoperable way. The format is very similar to HTML (both descend from SGML), with the difference that we can define our own elements and attributes. But traversing XML may not always be straightforward. The format is used in RSS feeds, EPUB books, SVG graphics, MathML formulae and others. XML files are bigger than their JSON equivalents, which makes them slower to download; for the same number of bytes they simply carry less useful information. XML can be populated on the fly with data from a database, which can be done on the server. Parts of the XML tree can be queried efficiently with XPath, if it is available (more often on the server than on the client). XML data can also be styled with XSLT, but the latter is fairly difficult to use and thus of less practical value.
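
A small sketch of parsing an XML fragment and querying it with XPath in the browser, where document.evaluate is available; the fragment itself is made up:

    var xml = '<books>' +
              '<book lang="en"><title>First</title></book>' +
              '<book lang="de"><title>Zweites</title></book>' +
              '</books>';
    var doc = new DOMParser().parseFromString(xml, 'application/xml');

    // Select the titles of all English books
    var result = doc.evaluate('//book[@lang="en"]/title', doc, null,
                              XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null);
    for (var i = 0; i < result.snapshotLength; i++) {
        console.log(result.snapshotItem(i).textContent); // "First"
    }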

Non-persistent arrays or objects. These are available within the programming language and allow us to store data that will be manipulated only during a single page load. As soon as we reload the page, this data is wiped out and the variables are populated again. The lack of persistence makes them suitable only for one-time tasks. The need to separate data from business logic means that we can't simply mix large amounts of data with the rest of our code; the least we can do is encapsulate the data and the related operations in a separate class.
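
A minimal sketch of such encapsulation; the names here are made up:

    // Keeps the raw data and the operations on it in one place,
    // away from the rest of the application logic
    function PriceList(entries) {
        this.entries = entries; // lives only until the next page load
    }

    PriceList.prototype.total = function () {
        return this.entries.reduce(function (sum, e) { return sum + e.price; }, 0);
    };

    var prices = new PriceList([{ name: 'hosting', price: 6 }, { name: 'domain', price: 12 }]);
    console.log(prices.total()); // 18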

Final words. Storing no data is the most effective way to speed up a website, but it's also the least practical. Every time we add more data, we decrease the speed with which it will be retrieved, so we need to make sure that each data entry carries as much weight as possible in order to save the user's bandwidth. There is no single best way to store data; it all depends on what we are trying to do. Minimizing the working set is one thing we can do to achieve better performance; databases like MySQL offer the LIMIT clause for this. Another consideration is that fetching data can be slower than computing it. If we are sure that the computation gives an equivalent result and is not very expensive, it is probably better to compute. We may be able to save further on computations if we can apply existing knowledge from asymptotic combinatorics, or simply shift individual bits to arrive at the same result. Data often requires tradeoffs: we can frequently choose between speed and size, but rarely have both. The data we choose to store needs to be accurate and relevant to the business goals. When we start to model it, it pays to think upfront about how it will be queried and what questions users will ask. This can lead to choosing one database over another, one model over another or one approach over another.
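
As a tiny illustration of the bit-shifting idea (modern JavaScript engines often apply such micro-optimizations on their own, so it's worth measuring before relying on it):

    var bytes = 3 * 1024 * 1024;      // the straightforward calculation
    var sameBytes = 3 << 20;          // shifting left by 20 multiplies by 2^20
    console.log(bytes === sameBytes); // true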

Data on its own is rarely useful unless we act on it. If we use slow algorithms to analyze it, it doesn't matter that it can be fetched relatively fast. If the data is huge, we may be able to extract a representative sample, hopefully small enough to fit in memory, to accelerate the analysis. Then we can operate on this small subset and draw conclusions about the data at large. But we have to be very careful, as this requires solid statistical knowledge and not just good knowledge of the data itself. As always, there will be some error or bias in our conclusions. It is generally accepted that models with higher variance tend to have lower bias and models with lower variance tend to have higher bias (the bias-variance tradeoff), but it's probably good to be skeptical about applying this rule blindly. Last but not least, collecting data from users raises ethical concerns, especially when we consider that data is more often stored than deleted. Even if we store the right data, the probability that attackers will gain access to it at some point is probably quite high in the long term. The purposes for which the data is collected need to be communicated explicitly.

bit.ly/1gV53dE