Why I care so much about markup languages

Plaintext is the most important format for human communication. It's incredibly resilient and can be written and consumed by almost any computer or program, from devices built decades ago to a smart toaster.

Markup languages are structured ways of writing plaintext. Crucially, they're meant to be read and written by both humans and computers. They can be manipulated by computer programs, but even without any processing people can understand and edit them. This makes them perfect for communication between heterogeneous people and systems. I can send someone a file written in a markup language regardless of what kind of system they're running or what programs they use.

At their best, markup languages augment the human intellect (a la Douglas Engelbart). Their structure makes it easier for people to think and to communicate their thoughts.

Features

Uniform syntax

The syntax should:

  1. be easy to parse
  2. have few concepts for a new user to learn
  3. allow for extensions in a natural way

Currently I think the best format for this is S-expressions, the notation used by Lisps. They're nested lists delimited by parentheses and are very simple to parse.

(This is a list ( containing this nested list) ( and this one))
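To make "simple to parse" concrete, here's a minimal sketch of an S-expression reader in TypeScript. The names and the whitespace-based tokenization are my own assumptions for illustration, not a spec for the language.

// A parsed S-expression is either an atom (a string) or a list of expressions.
type SExpr = string | SExpr[];

// Split input into parentheses and whitespace-separated atoms.
function tokenize(input: string): string[] {
  return input
    .replace(/\(/g, " ( ")
    .replace(/\)/g, " ) ")
    .split(/\s+/)
    .filter((t) => t.length > 0);
}

// Recursively build nested lists from the token stream.
function parse(tokens: string[]): SExpr {
  const token = tokens.shift();
  if (token === undefined) throw new Error("unexpected end of input");
  if (token === "(") {
    const list: SExpr[] = [];
    while (tokens[0] !== ")") {
      if (tokens.length === 0) throw new Error("missing closing parenthesis");
      list.push(parse(tokens));
    }
    tokens.shift(); // consume ")"
    return list;
  }
  if (token === ")") throw new Error("unexpected closing parenthesis");
  return token; // an atom
}

// parse(tokenize("(This is a list (containing this nested list) (and this one))"))
// => ["This", "is", "a", "list", ["containing", ...], ["and", ...]]

Real prose would need smarter tokenization (quoted strings, preserved whitespace), but the core shape of the reader stays about this small.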

Manipulable Structure

The structure of the document should be easy to edit, whether that means changing the order of items or "promoting" and "demoting" sections.

This gets slightly complicated if our language does not express nesting directly. In Markdown and org-mode, for example, nested sections are described by an increasing prefix length: a header starting with "**" is a child of the nearest "*" header above it. A section's contents are never actually enclosed by anything, so re-levelling a section means rewriting the prefixes beneath it.

However, if we're using a system based on parentheses this does not hold: a section literally contains its children, so promoting or demoting it is just moving a subtree, as the sketch below illustrates.
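As a rough illustration (the node shape here is hypothetical, not anything defined above), "demoting" a section in a parenthesized tree is just moving one list inside another:

// Hypothetical document tree: a section is a list whose first element is its title.
type Node = string | Node[];

// Demote a section: remove it from its parent and append it to its previous sibling.
function demote(parent: Node[], index: number): void {
  if (index === 0) throw new Error("no previous sibling to demote into");
  const previous = parent[index - 1];
  if (!Array.isArray(previous)) throw new Error("previous sibling is not a section");
  const [section] = parent.splice(index, 1);
  previous.push(section);
}

const doc: Node[] = [
  ["Intro", "Some text."],
  ["Details", "More text."],
];

demote(doc, 1);
// doc is now [["Intro", "Some text.", ["Details", "More text."]]]

No header prefixes have to be rewritten; the hierarchy lives entirely in the nesting.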

"Package"/Dependency/Import Manager

We want to be able to easily share texts written in this language, sending them anywhere, and we also want to be able to reference other files.

Interdocument references

It should be possible to reference (and maybe even transclude) specific sections of other documents.
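One way this could look, purely as a sketch: the (ref ...) form and the section-id convention below are assumptions of mine, not anything specified here. Transclusion is then a walk over the tree that replaces reference nodes with the sections they point at.

type Node = string | Node[];

// Hypothetical reference form: (ref "other-file" "section-id")
function isRef(node: Node): node is Node[] {
  return Array.isArray(node) && node[0] === "ref";
}

// Replace every (ref file section) node with the transcluded section,
// looked up through a caller-supplied resolver.
function transclude(node: Node, resolve: (file: string, section: string) => Node): Node {
  if (typeof node === "string") return node;
  if (isRef(node)) return resolve(node[1] as string, node[2] as string);
  return node.map((child) => transclude(child, resolve));
}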

Embedded code

This is where things get interesting. A powerful feature would be the ability to write code that generates text to be used in text files, for example to produce reports or different views of other files. This makes each file written in this language analogous to a program that produces a file.

In Lisps there is the "quote" operator, which prevents a form from being evaluated and instead returns it as a list. For example:

> (+ 4 7)
11
> '(+ 4 7)
(+ 4 7)

In our case, though, our language treats text as the default and code (i.e. function calls) as the special case. We can simply flip the meaning of the quote operator so that it means "evaluate this block".
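A minimal sketch of that inverted evaluation model. The node shape, the use of a (quote ...) form as the "evaluate this" marker, and the single built-in operator are all assumptions for illustration, not a defined semantics.

type Node = string | Node[];

// In this document language, lists are just text by default.
// A (quote ...) form is the exception: it gets evaluated as code.
function render(node: Node): string {
  if (typeof node === "string") return node;
  if (node[0] === "quote") return String(evaluate(node[1]));
  return node.map(render).join(" ");
}

// A tiny evaluator, only enough to run (+ ...) for the sake of the example.
function evaluate(node: Node): number {
  if (typeof node === "string") return Number(node);
  const [op, ...args] = node;
  if (op === "+") return args.map(evaluate).reduce((a, b) => a + b, 0);
  throw new Error(`unknown operator: ${String(op)}`);
}

// render(["Totals", "so", "far:", ["quote", ["+", "4", "7"]]])
// => "Totals so far: 11"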

Gotchas

  • Parsing unformatted text is difficult. Where in the pipeline do we handle it? Probably after tokenization?
  • This needs to support Unicode, since people will use it to write!

Parsing

We need to parse files into an AST while keeping track of each node's location in the original file, so that functions can modify that original file.
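Concretely, that just means every node carries a source span. The field names here are an assumption, not a fixed design:

// Every node remembers where it came from in the source text,
// so an edit to the tree can be written back as an edit to the file.
interface Span {
  start: number; // offset of the first character
  end: number;   // offset one past the last character
}

type Located =
  | { kind: "atom"; text: string; span: Span }
  | { kind: "list"; children: Located[]; span: Span };

// Replace the text covered by a node's span inside the original source.
function replaceNode(source: string, node: Located, newText: string): string {
  return source.slice(0, node.span.start) + newText + source.slice(node.span.end);
}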

Another option that could achieve similar functionality would be a standard way of serializing an AST back to text, the way code formatters such as prettier or black do.

However, this is harder to do for what is ostensibly prose as opposed to code, as people may use whitespace and odd formatting intentionally.

The standard approach for parsers seems to be a token-based system, with an interface for looking at the current token and peeking at future ones.
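Something along these lines, a common peek/next shape rather than any particular library's API:

// A token stream with lookahead: peek() inspects without consuming, next() consumes.
class TokenStream {
  private position = 0;

  constructor(private tokens: string[]) {}

  // Look at the token `offset` positions ahead without consuming it.
  peek(offset = 0): string | undefined {
    return this.tokens[this.position + offset];
  }

  // Consume and return the current token.
  next(): string | undefined {
    return this.tokens[this.position++];
  }

  done(): boolean {
    return this.position >= this.tokens.length;
  }
}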
