Musings on data, formats and types

Lately I’ve been attending a study group on the Clojure programming language, and since I am a firm believer in statically typed languages, there have of course been some discussions of the virtues of statically versus dynamically typed languages. On my way home tonight it occurred to me that perhaps the real issue is not static versus dynamic types, but rather documenting the format of data. I would like to expand on that idea in this post.

Let me first define what I mean by data for the purposes of this discussion.

Data is something that can be used to inform an action or decision.

By this definition, here are some examples that are not data:

26
https://emlun.se
{ "name": "Emil", "age": 26, "website": "https://emlun.se" }

On the other hand, here are some examples that are data:

Emil is 26 years old.
Emil’s web site is located at the URL https://emlun.se.
Emil is described by the JSON object { "name": "Emil", "age": 26, "website": "https://emlun.se" }, where the keys name, age and website respectively contain his name, age in years, and web site URL.

Spot the difference?

The examples in the first group are just blobs of… something, meaningless without context. The third one is debatable, since its keys hint at a meaning, but the fact remains that interpreting it requires guessing what those keys are supposed to mean. The examples in the second group, however, come with a description of what they are. They have a format imposed upon them by a context of meaning.
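To make the contrast concrete, here is a minimal Python sketch. The `Person` class and the `parse_person` function are hypothetical names invented for this illustration. The point is only that the same string of characters becomes usable once a format is attached to it.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Person:
    """The format: what each field of the JSON object is taken to mean."""
    name: str      # the person's name
    age: int       # age in years
    website: str   # URL of the person's web site


def parse_person(blob: str) -> Person:
    """Interpret an opaque JSON string as a Person, or fail loudly."""
    raw = json.loads(blob)
    if not isinstance(raw, dict):
        raise ValueError("expected a JSON object")
    name, age, website = raw.get("name"), raw.get("age"), raw.get("website")
    if not isinstance(name, str) or not isinstance(website, str):
        raise ValueError("name and website must be strings")
    # bool is a subclass of int in Python, so exclude it explicitly
    if not isinstance(age, int) or isinstance(age, bool):
        raise ValueError("age must be an integer")
    return Person(name=name, age=age, website=website)


# Without parse_person this is just a string; with it, it is a statement about Emil.
blob = '{ "name": "Emil", "age": 26, "website": "https://emlun.se" }'
emil = parse_person(blob)
print(f"{emil.name} is {emil.age} years old.")
```

Feeding the bare `26` from the first group to `parse_person` raises a `ValueError`, which is the sketch's way of saying that there is no format to interpret it by.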

The point I want to make is that data always has a format. If you don’t know what something means, then you cannot use it to inform an action or decision. Therefore, “unformatted data”, “unstructured data” and “schema-less format” are oxymorons; data without a format is not data. What remains without a format is noise, text or some other kind of opaque payload. The term data mining actually illustrates this idea surprisingly well: you start out with an opaque blob of something, then refine and process some of it to extract useful data.

So what I ultimately like about statically typed languages is that they force you to express your data formats - including function types - explicitly, whereas dynamically typed languages leave them implicit by default. But in the end, no matter how you choose to express, access and process your data, your data always has a format - and by making explicit the rules and assumptions accompanying that format, statically typed languages do their best to make it immediately and painfully obvious when you violate them. They help debug your code by preventing one class of bugs - inconsistent assumptions about data formats - from ever happening in the first place.
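As a rough illustration of explicit versus implicit formats, here is a small Python sketch using type annotations. Python itself is dynamically typed and ignores annotations at run time, so the sketch assumes a separate static checker such as mypy is run over the code. The names `Person`, `years_until` and `years_until_implicit` are made up for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Person:
    name: str
    age: int  # age in years


def years_until(person: Person, target_age: int) -> int:
    """Function type made explicit: (Person, int) -> int."""
    return max(0, target_age - person.age)


def years_until_implicit(person, target_age):
    """Same logic, but the expected format lives only in the author's head."""
    return max(0, target_age - person["age"])


emil = Person(name="Emil", age=26)
print(years_until(emil, 30))

# A static checker would be expected to reject the call below before the
# program runs, because a dict is not a Person:
#     years_until({"name": "Emil", "age": 26}, 30)
# The implicit version accepts a dict, but if it is handed a Person instead
# it fails only at run time, with a TypeError.
```

Both functions make the same assumption about their input. The difference is that the first one writes it down where a tool can check it.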