Data storage options

Information retrieval is an entire branch of computer science that deals with data. The goal of storing and analyzing data is to arrive at new knowledge that can inform further decisions. We can store data in many different ways or make it accessible through our websites, but it's not always obvious which option will be the most suitable for our task. Let's take a look at some of them.

HTML. When the data needs to be made widely accessible, it has to be part of the HTML document. We can describe it semantically to make it more meaningful to robots and screen readers. But HTML isn't the best option if we need too many elements, because as the DOM tree grows, our website becomes progressively slower. Such data is best presented in paragraphs. We can use tables for small tabular data if we want to avoid the additional cost of connecting to and reading from a database. Within the HTML, we can combine data from many different sources, which helps us to separate our concerns. The HTML can be seen as a base that shows an aggregated collection of all the data that needs to be made available to the user. But this data doesn't have to be hard-coded; it can be generated on the server.
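
As a small illustration of the last point, here is a minimal Node.js-style sketch that generates an HTML table from an array of records instead of hard-coding the markup; the products data and the renderTable function are made-up examples.

```js
// Made-up example data that would normally come from a database or file.
const products = [
  { name: 'Notebook', price: 4.99 },
  { name: 'Pen', price: 1.2 },
];

// Build a small HTML table on the server instead of hard-coding it.
function renderTable(rows) {
  const body = rows
    .map(r => `<tr><td>${r.name}</td><td>${r.price}</td></tr>`)
    .join('\n');
  return `<table>
  <thead><tr><th>Product</th><th>Price</th></tr></thead>
  <tbody>
${body}
  </tbody>
</table>`;
}

console.log(renderTable(products));
```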

HTML data attributes. These seem suitable mostly for small amounts of data that will only occasionally be accessed. But if we have to add data attributes to every single element, the HTML will quickly become hard to read and change. Not only that, its size will grow disproportionately to the value each additional byte adds. Although many attributes store some kind of data, their keys are rigid and can't be changed. Data attributes effectively allow us to define our own keys, which adds flexibility. There are already proposals for custom HTML elements too; whether this will make code more readable for web designers is still debatable. We also need to keep in mind that data attributes are generally slower to work with than retrieving the data from non-persistent arrays and objects (see below).
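
A minimal sketch of how such attributes can be read and written from JavaScript through the dataset API; the element and the attribute name are made up for illustration:

```js
// Create an element and attach a custom data attribute to it.
const item = document.createElement('li');
item.dataset.productId = '42';                      // renders as data-product-id="42"

// The same value is visible through the generic attribute API.
console.log(item.getAttribute('data-product-id'));  // "42"

// Updating is just another assignment.
item.dataset.productId = '43';
```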

Base64 encoding. This is a way to encode a small image (preferably under 1-2kB) as text in order to save an additional HTTP request for the image. In some cases such a representation can be useful, because sending files over the network is generally slow. But for larger images, the base64 string can actually take up more space than the image file itself, which is a drawback. Another reason not to use base64 encoding is when we need the image to be cached separately by the browser. Still, this encoding may be useful in isolated cases.
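
A small sketch of the idea, embedding a well-known 1x1 transparent GIF as a base64 data URI so that no extra HTTP request is needed:

```js
// The base64 payload below is the commonly used 1x1 transparent GIF.
const tinyGif =
  'data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7';

// The image is available immediately, without a separate download.
const img = document.createElement('img');
img.src = tinyGif;
document.body.appendChild(img);
```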

Cache between the client and server. These caches store data temporarily in order to prevent requests from hitting the server (or the database) and slowing it down. Memcached, Varnish, Squid, APC and Cache_Lite can all be used to speed up requests for data on websites, but we need to be aware of their individual constraints. It may be a good idea to examine the cache hit rate and, if possible, the conditions that invalidate previously stored data (cache thrashing). We need to be aware that we are effectively storing the available data twice, so we must evaluate whether this is worth the effort. The cache policy (write-through or write-back) may be another thing to consider if we want things to work correctly.
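
The pattern behind many of these tools can be sketched as cache-aside: check the cache first, fall back to the slower source on a miss and fill the cache for the next request. A rough illustration, using an in-memory Map as a stand-in for Memcached and a hypothetical fetchFromDatabase function:

```js
// Illustrative only: a Map stands in for a real cache such as Memcached.
const cache = new Map();
const TTL_MS = 60 * 1000; // keep entries for one minute

async function getCached(key, fetchFromDatabase) {
  const hit = cache.get(key);
  if (hit && Date.now() - hit.storedAt < TTL_MS) {
    return hit.value;                                 // cache hit: skip the slow source
  }
  const value = await fetchFromDatabase(key);         // cache miss: query the source
  cache.set(key, { value, storedAt: Date.now() });    // fill the cache for next time
  return value;
}
```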

Cookies. They are used to store very small amounts of data (less than 4kB) on the client's machine, usually in the form of named values. Creating cookies is relatively easy, but deletion can be less intuitive, since we need to set the same cookie again, but with an expiration date in the past; in other words, one mechanism is used for both operations. Cookies may need to be set at a particular domain level, and it may not be obvious why. Sometimes cookies feel intrusive, especially when websites start asking us whether we want to store them on our machine or not. But the alternative is storing data we don't know about, which is even worse considering how insecure that could be. Cookies can be disabled in the browser, which makes the data they contain only conditionally available. Another disadvantage is that over time we collect too many of them, which requires frequent cleaning.
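
A short sketch of both operations; the cookie name and values are made up:

```js
// Create a cookie that lives for one day.
document.cookie = 'theme=dark; path=/; max-age=86400';
console.log(document.cookie); // contains "theme=dark"

// "Delete" it by setting the same cookie with an expiration date in the past.
document.cookie = 'theme=; path=/; expires=Thu, 01 Jan 1970 00:00:00 GMT';
```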

Session arrays. Sessions are another mechanism through which data is preserved across multiple web pages. In PHP, before sessions can be used, each individual page needs to call session_start(), or session variables won't be read or written. It is usually practical to store multiple variables in the $_SESSION array; we rarely want to store a single piece of data, and if we do, the use of sessions may not be justified at all. The data has to be non-critical, because attackers may be able to tamper with it. Once we store something in this array, it gets written to a file on the server that is tied to a session ID, which is used to uniquely identify a returning visitor. In Python we can use the pickle or cPickle modules if we want to serialize data for a similar effect.

Web storage. localStorage and sessionStorage are part of the Web Storage specification. This is a key-value datastore that is made accessible through client-side JavaScript. Data is stored on the client; in the case of localStorage it persists across browser sessions, while sessionStorage is cleared once the tab or window is closed. Both are implemented as objects in the browser that provide convenient get/set methods (getItem/setItem). If we have modified element attributes through the getAttribute/setAttribute methods in JavaScript, we already have a sense of how they work, which makes them intuitive and easy to use. Storing data in this way means that we don't have to worry whether cookies are available or whether we can fit within their size limit.
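
A minimal sketch of the getItem/setItem methods; the keys and values are made up:

```js
// Survives browser restarts.
localStorage.setItem('lastVisited', '/articles/42');
console.log(localStorage.getItem('lastVisited')); // "/articles/42"

// Gone when the tab or window is closed.
sessionStorage.setItem('draft', JSON.stringify({ title: 'Untitled' }));

// Entries can be removed individually.
localStorage.removeItem('lastVisited');
```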

Databases on the client. IndexedDB is a database that is part of the HTML5 specification and is still under active development. Browser implementations differ in how much data they allow to be stored. But the data is stored persistently, which is good when we need to keep more than fits within the web storage mechanisms (typically around 5MB). The question is what we do with the stored data and how we access it. Reading medium-sized data (20-30MB) from IndexedDB all at once can quickly make our website unacceptably slow, no matter how beautiful the end result is.
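
A minimal IndexedDB sketch, assuming a made-up database and object store name, that stores one record and reads it back:

```js
// Open (or create) a client-side database.
const request = indexedDB.open('demo-db', 1);

// Runs only when the database is created or its version changes.
request.onupgradeneeded = () => {
  request.result.createObjectStore('articles', { keyPath: 'id' });
};

request.onsuccess = () => {
  const db = request.result;
  const tx = db.transaction('articles', 'readwrite');
  const store = tx.objectStore('articles');
  store.put({ id: 1, title: 'Data storage options' });       // write a record
  store.get(1).onsuccess = e =>                               // read it back
    console.log(e.target.result.title);
};
```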

Databases on the server. There are many different SQL and NoSQL databases, and the decision of which to use depends on the type of data we need to store. Relational databases like MySQL and PostgreSQL are suitable for data whose format can be closely represented by a table. They are generally efficient to query and store data that isn't too long, since length can adversely impact performance. The data is organized in rows and columns and is read sequentially, starting from a given index. When a primary key in one table is referenced by a foreign key in another table, we can join the two tables in a relation and query the combined data with the help of SQL. (As the size of the data grows, JOINs between tables become progressively slower.) Relational databases and their indexes are often implemented with B(+) trees, so to understand how indexing and search work, we need to learn more about these data structures. Hierarchical databases also exist, but they are said to be less practical because connections between the various nodes and leaves of the tree are hard to establish (even when inserts and deletes are fast).

Object-oriented databases (like Wakanda) store data as properties of objects, which is close to how programmers naturally tend to work. The data itself starts to feel more like code, which in many cases can be more intuitive. Key-value databases like Redis access data through a key that is hashed with a specific function, making it unique up front. Data access is fast, and scalability can be easier to achieve than with relational databases; the number of requests per second that can be satisfied can really make a relational database look old. Another advantage is that fluff is eliminated, which makes it easier to gain insights about the data. However, key-value databases may require some time to learn, and many hosting providers still don't have them installed. Document-oriented databases like MongoDB and CouchDB store data as documents, in BSON or JSON formats, which makes them quite flexible. If we have many such documents to store, using such a database may be beneficial.

Graph databases like Neo4j are efficient for storing lots of data that has many relationships. They are very practical and multi-purpose, with use cases spanning recommendation systems, social networks, retrospective analysis, predictive analytics, clickthrough analysis, longitudinal studies, the web of things, system profiling, scientific computing, geographic data, epidemiology, even neuron wiring in the brain. It is said that 90-95% of the available data can be modeled with graphs, which makes them very versatile, especially when we consider how interconnected our world is. The web is also an example of a graph, in which search engines examine the connections between the nodes (websites) to decide what weight to assign to them (PageRank). Characteristic of graph databases is that query latency is proportional to how deep we want to traverse the graph, not to the amount of data stored. This makes them very fast when working with extremely large structures of connected data. As soon as we need to traverse more than about three connections, the advantages of a graph database start to outweigh those of a relational database that would need multiple JOINs between tables (a short comparison of the two query styles appears at the end of this section). Neo4j has its own query language (Cypher) with specific syntax that can take time to get used to. Even though the graph database itself is fast, it is often the programmer's performance in writing code that matters, and in the case of Neo4j, having to specify individual connections within a graph can be slow with a large number of nodes. Another disadvantage is that graph objects can have a less efficient disk representation than relational tables. If we assume that each node in the network is intelligent, we must proceed from the standpoint that each one will act on the combined behavior of all others to maximize its own payoff through an individual strategy. This means that node readjustments will cause ripple effects through the network, which makes it more difficult, but also more interesting, to examine. Graph database structure can be represented with variants of adjacency lists and matrices. Internally, Neo4j uses doubly linked lists for the relationships between nodes.

Distributed databases like HBase/Hadoop, Riak and RethinkDB are useful for storing data that will require lots of computing power to make sense of. This is why it is good to make the data accessible on many parallel machines, to decrease the time needed to solve complex problems. These databases use a technique called MapReduce, which splits the data into many pieces, inspects each piece on a different machine (in parallel) and finally gathers the individual results and aggregates them into a single one. In-memory databases like MemSQL, Redis and others read data directly from RAM, and if the RAM is sufficiently large, they can be incredibly fast, since access to the slower hard drives is eliminated. Special-purpose systems like the message broker RabbitMQ are efficient at one specific thing, in this case accepting and forwarding messages. Queues are a FIFO data structure frequently used for buffering or even content updates, for example when adding a new article makes the oldest one disappear so that the number of visible articles remains constant. We also have queues in stores, in front of the cash register: as soon as one person's purchase has been processed, that person leaves the queue and a new person can join at the end. Optimizing queue performance has been an effective way for stores to increase their revenue. High-performance databases like Aerospike, Clustrix and InfoBright are designed for extreme workloads and scalability, but they may require expensive support. Nevertheless, they may be a good choice for real-time, mission-critical applications that deal with large amounts of data.
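
As a rough illustration of the JOIN-versus-traversal point made earlier, here is a hedged sketch contrasting the two query styles; the tables, labels and the runSql/runCypher helpers are all hypothetical:

```js
// Relational: finding friends of friends of friends requires one JOIN per hop.
const sqlQuery = `
  SELECT DISTINCT f3.friend_id
  FROM friendships f1
  JOIN friendships f2 ON f2.user_id = f1.friend_id
  JOIN friendships f3 ON f3.user_id = f2.friend_id
  WHERE f1.user_id = 42`;

// Graph (Cypher): the same three-hop traversal expressed as a path pattern.
const cypherQuery = `
  MATCH (u:User {id: 42})-[:FRIEND*3]->(friend)
  RETURN DISTINCT friend.id`;

// await runSql(sqlQuery);        // hypothetical helpers, shown only to
// await runCypher(cypherQuery);  // indicate where the queries would run
```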

JSON. This is a key-value data format that can be easily read on the client and read/written on the server. JSON files are small, which makes them fast to download. They are also easy to parse, so the data becomes immediately available. JSON's notation is very close to JavaScript's: it looks like an array or object literal that can contain many nested properties, objects and arrays. On the client we have methods like JSON.parse() and JSON.stringify() (JavaScript), and on the server json_encode() and json_decode() (PHP), that allow us to work with JSON data. This format is often used in the configuration files of various applications, because it's easy to work with.
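
A small sketch of a round trip through these methods on the client; the settings object is a made-up example:

```js
const settings = { theme: 'dark', fontSize: 14, plugins: ['spellcheck'] };

// Serialize to a JSON string (what we would store or send over the network).
const text = JSON.stringify(settings); // '{"theme":"dark","fontSize":14,...}'

// Parse it back into a live object.
const restored = JSON.parse(text);
console.log(restored.plugins[0]);      // "spellcheck"
```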

XML. This format is used to store data in a semantically meaningful and portable/interoperable way. The format is very similar to HTML (both have their roots in SGML), with the difference that we can define our own elements and attributes. But traversing XML may not always be straightforward. The format is used in RSS feeds, EPUB books, SVG graphics, MathML formulae and elsewhere. XML files are bigger than their JSON counterparts, which makes them slower to download; for the same amount of data, they simply provide less value. XML can be populated on the fly with data from a database, which can be done on the server. Parts of the XML tree can be efficiently queried with XPath, if it is available (more often on the server than on the client). XML data can also be styled with XSLT, but the latter is fairly difficult to use and thus of less practical value.
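
A small client-side sketch, assuming a made-up feed structure, that parses an XML string with DOMParser and queries it with XPath through document.evaluate():

```js
const xml = `<feed><item><title>First</title></item>
             <item><title>Second</title></item></feed>`;

// Parse the string into an XML document.
const doc = new DOMParser().parseFromString(xml, 'application/xml');

// Query it with XPath; STRING_TYPE returns the value of the first match.
const result = doc.evaluate('/feed/item/title', doc, null,
                            XPathResult.STRING_TYPE, null);
console.log(result.stringValue); // "First"
```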

CSV. Delimiter-separated data can be stored efficiently. We can easily split it on the separator to obtain an array whose items can be conveniently accessed through a numeric index. The delimiter doesn't have to be a comma; it can be any arbitrarily chosen character that is unique enough for the purpose of splitting. We can also use multiple separators, first splitting on one character to divide the data into “lines” and then on another character to further divide the “lines” into “columns”. This way we arrive at a two-dimensional array, but data access follows the same principle. CSV files are often used for data migration, since they are small and provide high value per kilobyte. We can use them for a variety of purposes. But if we have lots of very small items (say, all the single characters of the alphabet), the cost of each separator can become disproportionate to the amount of data stored. Although storing characters from the alphabet isn't very useful, it illustrates the separator cost.
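
A short sketch of the two-step split described above, using a made-up comma-and-newline separated string:

```js
const csv = 'name,price\nNotebook,4.99\nPen,1.20';

// First split into "lines", then split each line into "columns".
const rows = csv.split('\n').map(line => line.split(','));

console.log(rows[1][0]); // "Notebook"
console.log(rows[2][1]); // "1.20"
```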

Flat file. Flat files (often in .txt format) are useful when we have long texts that could fit in a database, but only at the expense of making table queries much slower. In this case, it is often better not to store these texts directly in the database, but in separate .txt files that can be loaded dynamically as needed. This way the database can preserve its fast performance on queries that ask for short data. With lots of content, flat files can be faster to read from, since the filesystem is more immediately accessible to the programming language than the database. By saving on expensive database queries, we can make our websites a bit more responsive. If we need to search for data within a big text file, it's probably best to do this within the programming language rather than going to the database, using expensive functions and returning the result to the programming language anyway. Database functions have to be used sparingly, not only because they can be slow, but because in certain queries they may prevent an index from being used. The File API allows us to read flat files directly on the client, which can also be useful, but writing files on the client can be insecure. When we have a large file that could take too long to download, we could also establish a persistent connection to it through WebSockets, read the data in pieces and then close the connection when we're done. This saves us from having to make an additional HTTP request for each additional piece. How files are stored depends on the underlying filesystem and its cluster size; some filesystems rely on tree structures (such as B-trees) internally. The disadvantage of flat files is that they are less flexible and hard to relate to other data when we don't know it upfront. Seeking to a numeric offset within a file doesn't immediately tell us whether we have reached the point at which we want to read or write. But other than that, opening, reading/writing and closing a file is fast.
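
A minimal Node.js sketch of loading a long text from a flat file on demand instead of querying the database for it; the directory layout and the loadArticleBody helper are hypothetical:

```js
const { readFile } = require('fs/promises');

// Load the body of an article from a flat file, keyed by a slug.
async function loadArticleBody(slug) {
  return readFile(`./articles/${slug}.txt`, 'utf8');
}

// Example usage:
// loadArticleBody('data-storage-options').then(text => console.log(text.length));
```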

Non-persistent arrays or objects. These are available within the programming language and allow us to store data that is needed only for the duration of a single page load. As soon as we reload the page, this data is wiped out and the variables are populated again. The lack of persistence makes them suitable only for one-time tasks. The need to separate data from business logic means that we can't simply mix large amounts of data with the rest of our code. The least we can do is to encapsulate the data and the related operations in a separate class.
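
A tiny sketch of such encapsulation; the class and its data are made up for illustration:

```js
// Keep one-off data and its related operations together in one place.
class CountryList {
  constructor() {
    this.countries = ['Bulgaria', 'Germany', 'France']; // example data
  }
  has(name) {
    return this.countries.includes(name);
  }
}

const list = new CountryList();
console.log(list.has('Germany')); // true
```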

Final words. Storing no data is the most effective way to speed up a website, but it's also the least practical. Every time we add more data, we decrease the speed with which it can be retrieved. So we need to make sure that each data entry carries as much weight as possible in order to save the user's bandwidth. There is no single best way to store data; it all depends on what we are trying to do. Minimizing the working set is one thing we can do to achieve better performance; databases like MySQL support the LIMIT clause for this. Another consideration is that fetching data can be slower than computing it. If we are absolutely sure that the computation is equivalent and not very expensive, it's probably better to compute. We may be able to save further on computations if we can apply existing knowledge from asymptotic combinatorics or just shift individual bits to arrive at the same result. Data often requires making tradeoffs, and we may have to choose between speed and size rather than have both. The data that we choose to store needs to be accurate and relevant to the business goals. When we start to model it, it pays to think upfront about how it will be queried and what questions users will ask. This can lead to choosing one database over another, one model over another or one approach over another.

Data on its own is rarely useful unless we act on it. If we use slow algorithms to analyze it, it doesn't matter that it can be fetched relatively fast. If the data is huge, we may be able to extract a representative sample, which will hopefully be small enough to fit in memory and accelerate analysis. Then we can operate on this small subset and draw conclusions about the data at large. But we have to be very careful, as this requires solid statistical knowledge and not only good knowledge of the data itself. As always, there will be some error or bias in our conclusions. It is generally accepted that there is a tradeoff between bias and variance in our models: the higher the variance, the lower the bias, and vice versa. But it's probably good to be skeptical about applying this rule blindly. Last but not least, collecting data from users raises ethical concerns, especially when we consider that data is more often stored than deleted. Even if we store the right data, the probability that hackers will gain access to it is probably quite high in the long term. The purposes for which the data is collected need to be explicitly communicated.

bit.ly/1gV53dE