awarm.space newsletter

Some facts about facts

My project for the next couple of weeks is implementing a personal database. I want it to store information about the things I read, the thoughts I'm thinking, the websites I visit, the people that matter to me, and anything else I can cram into it.

One of the biggest decisions in implementing this database is figuring out the data model: how do we organize the data and its relationships?

I decided pretty early to go for a facts-based data model. In it, all the information in a system is represented by a set of "facts".

Each fact has three parts: an entity, an attribute, and a value.

Using this simple building block, you can represent a huge variety of useful constructs.

Let's look at an example.

Here's a list of facts:

42, name, "Jared"
42, place of birth, "Muscat, Oman"

What this means is that the entity 42 has two attributes, name and place of birth, which have the values "Jared", and "Muscat, Oman" respectively.
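As a rough sketch, the two facts described above could be written down in Python like this. The `Fact` type and the `attributes_of` helper are names made up for illustration, not part of any real implementation.

```python
from typing import NamedTuple, Union


class Fact(NamedTuple):
    """One fact: an entity, an attribute, and a value."""
    entity: int
    attribute: str
    value: Union[str, int]


facts = [
    Fact(42, "name", "Jared"),
    Fact(42, "place of birth", "Muscat, Oman"),
]


def attributes_of(entity, facts):
    """Collect the attribute/value pairs recorded for one entity.

    Simplification: if an attribute appears more than once,
    only the last value is kept.
    """
    return {f.attribute: f.value for f in facts if f.entity == entity}


person = attributes_of(42, facts)
```

Here `person` ends up as a plain dictionary with the keys `name` and `place of birth`.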

The values in facts can also be references to other entities. So we could represent the same information like this:

42, name, "Jared"
42, place of birth, 58
58, city/name, "Muscat"
58, city/country, "Oman"

Instead of storing the value of "place of birth" as a piece of data, we store it as a reference to another entity, 58. That entity then has the attributes city/name and city/country, which define its name and country.

By using these references you could build up a rich graph of data.
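To make the graph idea concrete, here is a minimal Python sketch of following references from one entity to another. The `Ref` wrapper, `lookup`, and `follow` are hypothetical names, and marking references with a separate type is just one assumed way to tell them apart from plain data.

```python
from typing import NamedTuple, Union


class Ref(NamedTuple):
    """A value that points at another entity instead of holding data."""
    entity: int


class Fact(NamedTuple):
    entity: int
    attribute: str
    value: Union[str, Ref]


facts = [
    Fact(42, "name", "Jared"),
    Fact(42, "place of birth", Ref(58)),
    Fact(58, "city/name", "Muscat"),
    Fact(58, "city/country", "Oman"),
]


def lookup(entity, attribute, facts):
    """Return the first value stored for (entity, attribute), or None."""
    for f in facts:
        if f.entity == entity and f.attribute == attribute:
            return f.value
    return None


def follow(entity, path, facts):
    """Walk a chain of attributes, hopping across references as needed."""
    current = Ref(entity)
    for attribute in path:
        if not isinstance(current, Ref):
            return None  # hit plain data (or nothing) before the path ran out
        current = lookup(current.entity, attribute, facts)
    return current


birth_country = follow(42, ["place of birth", "city/country"], facts)
```

Starting at entity 42, the path hops through the reference to entity 58 and reads its country.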

Why facts?

Every data model has different trade-offs and is suited for different domains. I'm making the case that facts are particularly well suited for modelling the domain of "personal data"; roughly, all the information a single human being might want to keep and interconnect.

The most important constraint of this domain is that it's dynamic and inconsistent. Human lives are messy, and human thoughts even more so. The information we're dealing with does not fit into fixed categories, and is constantly shifting.

You have much richer information about your closest friends than about passing acquaintances. The same goes for a website you frequent daily and one you clicked on once. Yet they're all people, and all websites. Facts let us represent individual pieces of information instead of fitting things into these large categories.

That doesn't mean it's a data model completely without constraints. Instead of constraining what attributes things can have (e.g. saying "all people have a name"), you constrain what values an attribute can have (e.g. "all names are text"). Constraining at the attribute level lets you coordinate with yourself more easily, without making it too difficult to add new information.
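A small Python sketch of what an attribute-level constraint could look like. The `ATTRIBUTE_TYPES` table and `check_value` function are assumptions for illustration. The point is that the check looks only at the attribute and the value, never at what kind of entity the fact belongs to.

```python
# Hypothetical schema: attribute name -> the Python type its values must have.
ATTRIBUTE_TYPES = {
    "name": str,
    "rating": int,
}


def check_value(attribute, value):
    """Return True if the value is allowed for this attribute.

    Attributes missing from the schema are unconstrained, so brand
    new kinds of information can always be added.
    """
    expected = ATTRIBUTE_TYPES.get(attribute)
    return expected is None or isinstance(value, expected)
```

With this, `check_value("name", "Jared")` passes, `check_value("name", 42)` fails, and an attribute nobody has described yet is simply accepted.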

Ultimately, I think facts strike the right balance between simplicity and expressivity, and flexibility and constraints.

What does this actually look like?

Deciding on a data model doesn't give you any clue about how to actually store and represent that data. As I mentioned last time, one of my constraints for this project is to lean on the file system, so the question becomes: how am I going to represent facts in files?

The main goal is to represent them in a way that makes reading and writing data easy, given the read and write patterns that are going to occur most often.

Grouping facts in files by entity is the simplest answer here, as entities already correspond to the most "meaningful" relationships between facts. For example, two facts about the same person are more closely related than the names of two random people.
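Grouping by entity is easy to sketch in Python. Here facts are plain `(entity, attribute, value)` tuples, and `group_by_entity` is a made-up helper. Each resulting group is what would end up in one file.

```python
from collections import defaultdict

facts = [
    (42, "name", "Jared"),
    (58, "city/name", "Muscat"),
    (42, "place of birth", 58),
    (58, "city/country", "Oman"),
]


def group_by_entity(facts):
    """Bucket facts by entity, keeping their original order within each bucket."""
    grouped = defaultdict(list)
    for entity, attribute, value in facts:
        grouped[entity].append((attribute, value))
    return dict(grouped)


files = group_by_entity(facts)
```

After this, `files[42]` holds everything known about entity 42 and `files[58]` everything about entity 58, regardless of the order the facts arrived in.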

I experimented briefly with using YAML as a format to write this data, but it's honestly more complicated than I need it to be, and that complexity created some problems. So I'm defining my own little syntax. It looks something like this:

name: Jared Pereira
notes: [[[
---
This is a note about me
---
---
This is another note about me!
So much information!
---
]]]

I still have to work out the kinks, but the idea is to represent each fact as a property on the entity, with some properties having a single value and some having multiple, and with support for both single-line and multi-line values.
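For illustration, here is a minimal Python sketch of a parser for the example above. The syntax isn't pinned down yet, so this rests on assumptions read off that one example: a property is `key: value` on one line, `key: [[[` opens a list of values that `]]]` closes, and each value in the list sits between a pair of `---` lines. The function name `parse_entity` is made up.

```python
def parse_entity(text):
    """Parse one entity file into a dict of property -> value or list of values."""
    properties = {}
    lines = iter(text.splitlines())
    for line in lines:
        if not line.strip():
            continue  # skip blank lines between properties
        key, _, rest = line.partition(":")
        key, rest = key.strip(), rest.strip()
        if rest != "[[[":
            properties[key] = rest  # single-line, single value
            continue
        values, current = [], None
        for inner in lines:  # keeps consuming the same iterator
            if inner.strip() == "]]]":
                break  # end of the multi-value block
            if inner.strip() == "---":
                if current is None:
                    current = []  # opening delimiter: start a new value
                else:
                    values.append("\n".join(current))  # closing delimiter
                    current = None
            elif current is not None:
                current.append(inner)
        properties[key] = values
    return properties


sample = """name: Jared Pereira
notes: [[[
---
This is a note about me
---
---
This is another note about me!
So much information!
---
]]]
"""

parsed = parse_entity(sample)
```

On the sample, `parsed["name"]` is a single string and `parsed["notes"]` is a list of two strings, the second of which spans two lines. The sketch does no error handling: an unclosed block or a line without a colon is accepted silently.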

I've started writing a very simple parser for this format and it's a lot of fun! It's evocative of the point I was at in spring last year on, I suppose, this very same project.

How does this turn from files to a database?

What's the difference between a file system and a database? Constraints. You can dump any kind of information into a file system, but the purpose of a database is to obstruct access to data [1], so that all the data you put in is structured.

The most fundamental constraint is the data model. A file can represent anything, but the database forces everything to be facts. Higher-level constraints are the ones on the values of attributes that I talked about earlier.

Really, these constraints can get arbitrarily complex. You could say every book has an author, or you have to rate every book you finish, or the average rating of every book you've rated should be 5/10.

Big databases would call this "business logic" and they're very good at representing it. But people aren't businesses and we need a different kind of logic, one that's a lot more flexible and emergent. I'm not yet quite sure how to implement this.

One guess I have is that instead of actually enforcing constraints, it's enough to make the user aware of them by creating an interface to them. This leaves the choice up to the user, and maintains the property that they always take the action that changes the database, while still pushing the database in the direction a past version of the same user wanted it to go. This is one of the most exciting areas of exploration for me.
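One way to picture "aware, not enforced" is a write that always succeeds but reports which constraints the new data doesn't meet. This Python sketch is only a guess at the shape of such a thing. `SOFT_CONSTRAINTS` and `save` are hypothetical names, and the two rules are the book examples from earlier.

```python
# Hypothetical soft constraints: (description, predicate over one entity).
SOFT_CONSTRAINTS = [
    ("every book has an author", lambda book: "author" in book),
    (
        "every finished book has a rating",
        lambda book: not book.get("finished") or "rating" in book,
    ),
]


def save(store, entity_id, entity):
    """Always write the entity, then return the constraints it does not meet."""
    store[entity_id] = entity
    return [desc for desc, holds in SOFT_CONSTRAINTS if not holds(entity)]


store = {}
warnings = save(store, 1, {"title": "Dune", "finished": True})
```

The book is stored either way. `warnings` simply lists the two rules it falls short of, which an interface could surface to the user and leave at that.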

One more difference is that a file system has a very simple query mechanism: "give me a file", or "give me a folder". A database, though, can leverage the structure it enforces to let you ask richer questions, like "give me the names of all books I rated more than 10". The way you ask those questions is the query language, which we'll get into next time!
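Ahead of any real query language, here is how that kind of question could be answered directly over a list of facts in Python. The attribute names `book/name` and `book/rating`, the placeholder data, and the function `names_of_books_rated_above` are all assumptions for illustration.

```python
# Placeholder facts as (entity, attribute, value) tuples.
facts = [
    (1, "book/name", "Book A"),
    (1, "book/rating", 9),
    (2, "book/name", "Book B"),
    (2, "book/rating", 4),
    (3, "book/name", "Book C"),  # no rating recorded
]


def names_of_books_rated_above(threshold, facts):
    """Names of every entity whose book/rating is greater than threshold."""
    rated = {e for e, a, v in facts if a == "book/rating" and v > threshold}
    return [v for e, a, v in facts if a == "book/name" and e in rated]


favourites = names_of_books_rated_above(5, facts)
```

The query is really a join: first find the entities with a matching rating, then look up the names of those same entities. That only works because every piece of data has the same three-part shape.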


Last week I forgot to link you to the notes I wrote on this topic for my final essay. This week I'm preparing a little outline.


  1. Taken from "The Image of Postgres", a talk by r0ml