Linked data

With the appearance of the Google Knowledge Graph and Facebook's Open Graph, we saw once again how important it has become to connect data in new and meaningful ways. Search results are no longer a simple list of individual links ranked according to multiple criteria; increasingly, the ranking is determined by how well the links relate to each other. Instead of links to click, we see worlds to discover, which is much more natural and closer to how we behave in real life. Each node of the graph isn't simply a source of information, but the distilled, factual knowledge extracted from it. People can query the graph and arrive at this knowledge in different ways, depending on the chosen path and the distance between the source and destination nodes.

Microformats and the semantic web movement have strongly influenced how we think about data. We tried to describe relations through rel attributes, and we created specific, standardized HTML components in the hope that they would eventually spread widely enough to matter. But this approach was somewhat inflexible and ignored the fact that everyone has different needs, which can't all be fulfilled with the same markup. We knew that metadata could also be described with the XML-based RDF and later with RDFa, but although they offered good structure, they were cumbersome. Tools like Protégé could help us generate such documents with the help of the Web Ontology Language (OWL). But the ease of use for a single person couldn't really compensate for the need to embed complex and partially hard-to-read XML documents that would be downloaded by everyone else. The scale remained lopsided.

This is how schema.org was born, as a joint effort of many large corporations to provide the means for linked data. I found this initiative highly impressive: for the first time I saw so many big corporations willing to collaborate on a common cause. The solution they came up with was to use attributes like itemscope, itemtype and itemprop in the HTML. They called it microdata. This was and still is a great solution, because it allows the content to be described at the right place at the right time, during content creation. Item scopes map quite well to existing HTML components, which is convenient and minimizes the risk of type conflicts. New and existing metadata parsers can go through the markup and easily pick up these properties, but they need the active help of the content creator to know how to interpret the data. Some companies advise webmasters to prefer microdata over microformats if possible. (Here is one nice tutorial if you are looking for a way to get started.)
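To make this concrete, here is a minimal sketch of what such markup could look like, using the schema.org Person and Organization types; the person, company and URL are invented for illustration:

    <article itemscope itemtype="http://schema.org/Person">
      <!-- itemscope opens a new item; itemtype names its schema.org type -->
      <h1 itemprop="name">Jane Doe</h1>
      <p>
        Works as <span itemprop="jobTitle">web developer</span> at
        <!-- a nested item: the value of worksFor is itself an Organization -->
        <span itemprop="worksFor" itemscope itemtype="http://schema.org/Organization">
          <span itemprop="name">Example Corp</span>
        </span>.
      </p>
      <!-- for links, the property value is taken from the href attribute -->
      <a itemprop="url" href="http://example.com/jane">Profile</a>
    </article>

Note how the metadata piggybacks on elements that would be in the page anyway, which is exactly the in-place advantage described above.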

With so many tools and technologies at our disposal, it was confusing for a long time whether to link our data or not, since future changes could quickly convert our asset into a liability. It is always hard to imagine what we might be missing. When an old technology gets replaced, it is good to think about what was suboptimal in it, even if the new is sometimes the old with a new label. We can learn a lot this way. I was a bit surprised to read about JSON-LD on the schema.org blog as a new and recommended approach to linked data. (Then I noticed that Google is already using it in Gmail.) So what was wrong with microdata? My only clue is that it was sprinkled through the entire HTML markup, which could be hard to maintain on large web sites. The in-place advantage actually becomes a maintenance disadvantage. This is probably what JSON-LD is trying to solve. It uses the same data types as microdata, but combines them with the elegance and cohesiveness of the JSON data format. This allows us to separate the content from the metadata and to improve maintenance, especially if we keep a high-level mental image of what we want to describe without constantly feeling the need to inspect concrete HTML elements to derive that meaning. In a sense, the clear separation of concerns comes at the price of a loss in clarity. Here we see why these approaches are neither interchangeable nor complementary.
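As a rough sketch, the same hypothetical person from the microdata example above could be described in JSON-LD with a single script block, kept apart from the visible content (the names and URL are again invented):

    <script type="application/ld+json">
    {
      "@context": "http://schema.org",
      "@type": "Person",
      "name": "Jane Doe",
      "jobTitle": "web developer",
      "worksFor": {
        "@type": "Organization",
        "name": "Example Corp"
      },
      "url": "http://example.com/jane"
    }
    </script>

The types and property names are the same schema.org vocabulary as before; only the packaging changes, and that is what makes the maintenance trade-off possible in the first place.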

We can write a lot of metadata, but once it reaches a certain size, it becomes increasingly hard to understand which connections we actually affect without the use of some visualization tool. Search engine providers could create one that shows content creators, directly on a graph, where they need to work. This in itself would contribute to a more intensive use of linked data on websites, once people could see the effect of their work immediately instead of just waiting for an index to be updated. Linked data could help us understand things better by extracting the unique knowledge we have and discarding the irrelevant, the majority of which is plaguing us now. The sooner we find a way to decrease our information overload, the better we'll be able to apply our knowledge.

bit.ly/172V7qd