I’ve been thinking a lot about ontologies and triplestore-related things lately. Without going into too much detail about that, in this post I want to compare their utility with that of markup like HTML and type annotations in a programming language.

I use the word metadata here to mean data about other data, which can be a bit of an arbitrary distinction depending on who you ask. And of course metadata is also data itself. For example, the triple (Alice hair-color blonde) is typically viewed as metadata about Alice. I might make a probability judgment on that data, like ((Alice hair-color blonde) likelihood 90%), in which case what was once metadata is not so meta after all. The point is that it is a bit of a silly word; the important thing is that there is a reference to some other data. That reference is the key theme here.
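To make the nesting concrete, here is a minimal Haskell sketch of that idea. The types and names are mine, not any particular triplestore’s API; the point is just that a triple can appear as the subject of another triple:

```haskell
-- A minimal sketch of triples whose subject (or object) can itself be
-- another triple, so a statement can be made about another statement.
-- (Hypothetical types, not any particular triplestore's API.)

data Node
  = Atom String        -- a plain value like "Alice" or "blonde"
  | Nested Triple      -- a whole triple used as a subject or object
  deriving Show

data Triple = Triple
  { subject   :: Node
  , predicate :: String
  , object    :: Node
  } deriving Show

-- (Alice hair-color blonde)
aliceHair :: Triple
aliceHair = Triple (Atom "Alice") "hair-color" (Atom "blonde")

-- ((Alice hair-color blonde) likelihood 90%): metadata about metadata
aliceHairLikely :: Triple
aliceHairLikely = Triple (Nested aliceHair) "likelihood" (Atom "90%")

main :: IO ()
main = print aliceHairLikely
```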

To me, markup first meant what English teachers would do to my essays: use red (or purple, because it was supposedly less harsh) pens to cross out my words, add punctuation, write little notes in the margins about who knows what, etc. Marking up a physical paper is a very clean model for the layered information between my words and the teacher’s thoughts because the two are clearly separated and I can’t change my words out from underneath the teacher. We hold a mutex, if you will, on the essay. This means the teacher’s underlining of a run-on sentence marks the same sentence when she underlined it and the same sentence when I read the comments. Just kidding, I’m not sure I actually read those comments, and that is why this thing people call a blog is full of ramblings instead of the coherent thoughts you might expect.

Today, markup is typically associated with HTML. HTML has obviously been bastardized over the years to get the job done (whatever job that may be), but at its heart it is a way to add more information to text on a page. HTML stores this metadata inline with the content (probably because that is convenient in text editors). This means that things like hyperlinks have a very good chance of still surrounding the right text when the rest of the document changes: inserting characters before an <a> and after the closing </a> will not change what is inside that element.

Programming languages with type annotations that appear inline with a variable, like C, have a similar property: global changes to a program’s text are unlikely to alter the association of type with variable. Things do get a little blurry with a language like Haskell, where you can write a full type signature above a function: removing a function parameter from the definition would still parse as a valid program, but the type checker would reject it; whereas in C, if you removed only the parameter name and not the type, the definition wouldn’t even parse. By keeping the metadata stored with close locality, these languages keep a close correspondence of the data (the variable name) with the metadata (its type). This lessens the possibility of making a change that incorrectly associates the metadata with a new piece of data (kinda like a false positive) or removes the metadata altogether (kinda like a false negative). The latter is an interesting case because statically typed languages like these do not really treat types as optional metadata (okay, yes, type inference, but eventually the typing information gets there), so should we still call them metadata?
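To make the Haskell side of this concrete, here is a small, made-up example. The signature sits near the definition rather than inside it, so an edit to the definition alone can leave the metadata describing something that no longer exists:

```haskell
-- The signature is metadata written *near* the definition, not inside it.
area :: Double -> Double -> Double
area width height = width * height

-- If I later delete the `height` parameter from the definition only:
--
--   area :: Double -> Double -> Double
--   area width = width * width
--
-- the file still parses; it is the type checker that notices the
-- signature and the definition have drifted apart. With C-style inline
-- annotations, the type travels with the parameter name itself, so this
-- particular kind of drift doesn't arise.
main :: IO ()
main = print (area 3 4)
```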

Something like Haskell is interesting because, with things like Liquid Haskell and the slow but seemingly unstoppable push toward full dependent types, it occurs to me that a single Haskell function will admit more than one typing judgment. I might say square :: Int -> Int but Simon might say square :: Int -> {Int >= 0} (you get the gist: refinement types). In the world of types, we have a mathematical notion of when two metadata statements about something are consistent (in the case above, one is strictly tighter than the other). [Side note: other domains are a lot fuzzier about what counts as consistent: (Alice thinks (color sky blue)) and (Bob thinks (color sky aqua)).] The reason I find this interesting is that eventually I’d like to reduce the amount of code in the world, so it is important for multiple agents (people and/or other programs) to be able to make statements, claims, and observations about code (and any data in general), and for the underlying system to be able to reconcile when these are statements about the same thing. And then, when the underlying thing changes, can we know whether we are still permitted to infer that these claims hold or do not hold?
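Here is roughly what those two judgments look like side by side. The refinement is written in Liquid Haskell’s annotation syntax as I understand it; since the annotation lives in a special comment, plain GHC still compiles the file, and Liquid Haskell would be the tool that checks the stronger claim against the definition:

```haskell
-- Simon's stricter claim, as a Liquid Haskell refinement annotation
-- (a special comment: plain GHC ignores it, Liquid Haskell checks it):
{-@ square :: Int -> {v:Int | v >= 0} @-}

-- My plainer typing judgment about the same function:
square :: Int -> Int
square x = x * x

-- Two statements about one definition, one strictly tighter than the
-- other, so a system could in principle reconcile them automatically.
main :: IO ()
main = print (square (-3))
```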

Another interesting thing to note, in comparison to triplestores, is that a type annotation is metadata on a highly structured piece of data (ignoring the textual representation). In order to write down a simple subject-predicate-value triple for a type annotation, we would have to either a) come up with a standard way to reference nodes in an AST (and any structure, in the general case); something like /a/file/system/path comes to mind, or b) encode the entire AST as a series of triples in the store (this seems unnatural at first thought but is appealing for its consistency). You can already see that I’ve cheated in the few triples in this post by nesting triples in order to show that one references the other. We can always extract a nested triple and replace it with (you guessed it) a reference to that triple (made unambiguous by a hash if necessary), so you can see how this might be applied to an AST. Even the first one, (Alice hair-color blonde), should really be (Alice (hair attribute color) blonde) so that we can see the similarity with (Alice (eye attribute color) green); otherwise we would have to come up with a predicate for every colored thing we could relate to Alice.
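As a toy illustration of that extraction, here is a sketch of pulling the nested triple out and pointing at it by reference. The names are mine, and the canonical text of the triple stands in for what would really be a content hash:

```haskell
-- A sketch of extracting a nested triple and replacing it with a
-- reference: the likelihood statement points at the hair-color
-- statement instead of containing it. (Toy ids; a real store would
-- use a proper content hash rather than the rendered text.)

data FlatTriple = FlatTriple String String String
  deriving Show

-- a toy content address: just the canonical text of the triple
addr :: FlatTriple -> String
addr (FlatTriple s p o) = "#(" ++ s ++ " " ++ p ++ " " ++ o ++ ")"

hairColor :: FlatTriple
hairColor = FlatTriple "Alice" "hair-color" "blonde"

-- metadata about the first triple, by reference rather than by nesting
likelihood :: FlatTriple
likelihood = FlatTriple (addr hairColor) "likelihood" "90%"

main :: IO ()
main = mapM_ print [hairColor, likelihood]
```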

My closing thought is that while I would love to just put everything in a triplestore and call it a day, I also want something that works for image data, genomics data, weather data, etc. While we could come up with an encoding that breaks these large things apart into a giant set of triples, it would be slow and wasteful. At some point something is “just” data and is an atomic unit, even if it can be broken down into pixels, base pairs, etc. We would still like to be able to reference into these blobs of data and annotate a recognized object in a picture or a region of DNA that a researcher finds interesting. And we want to be able to share these statements with other people, and for multiple statements about the same thing (or about other statements) to be related in an automated fashion. We need less meaningless information.
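As a parting sketch of what referencing into a blob might look like (the shapes and names here are purely illustrative, not a real schema), a statement could carry the blob’s id and a region, without ever encoding the pixels or base pairs themselves:

```haskell
-- A toy sketch of annotating a region *inside* an opaque blob without
-- encoding the blob itself as triples. The blob stays "just" data; the
-- statement only carries a reference into it. (Hypothetical shapes and
-- example values, purely illustrative.)

data Region
  = ByteRange Int Int             -- e.g. a slice of a genome file
  | BoundingBox Int Int Int Int   -- e.g. x, y, width, height in an image
  deriving Show

data Statement = Statement
  { blobId      :: String   -- an id for the blob (say, a content hash)
  , region      :: Region   -- where in the blob the claim applies
  , claim       :: String   -- the predicate
  , value       :: String   -- the object
  , accordingTo :: String   -- who is making the claim
  } deriving Show

main :: IO ()
main = mapM_ print
  [ Statement "sha256:ab12..." (BoundingBox 40 60 120 80) "contains" "cat" "Alice"
  , Statement "sha256:9f3e..." (ByteRange 10200 10950) "region-of-interest" "promoter?" "a researcher"
  ]
```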