Why I care so much about markup languages

Plaintext is the most important format for human communication. It's incredibly resilient and can be written and consumed by almost any computer or program, from decades-old devices to a smart toaster.

Markup languages are structured ways of writing plaintext. Crucially, they're meant to be read and written by both humans and computers. They can be manipulated by computer programs, but even without any processing people can understand and edit them. This makes them perfect for communication between heterogeneous people and systems. I can send someone a file written in a markup language regardless of what kind of system they're running or what programs they use.

At their best, markup languages augment the human intellect (à la Douglas Engelbart). Their structure makes it easier for people to think and communicate their thoughts.

Features

Uniform syntax

The syntax should:

  1. be easy to parse
  2. have few concepts for a new user to learn
  3. allow for extensions in a natural way

Currently I think the best format for this is S-expressions, the format of Lisps. They're nested lists delimited by parentheses and are very simple to parse.

(This is a list ( containing this nested list) ( and this one))
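To give a sense of how little is needed, here is a minimal sketch of a reader for the example above. It assumes atoms are just whitespace-separated words and ignores strings, escapes, and comments; the function names are my own, not part of any existing tool.

```python
def tokenize(text):
    """Split text into "(", ")" and whitespace-separated words."""
    return text.replace("(", " ( ").replace(")", " ) ").split()


def parse(tokens):
    """Consume tokens from the front and build nested Python lists."""
    if not tokens:
        raise SyntaxError("unexpected end of input")
    token = tokens.pop(0)
    if token == ")":
        raise SyntaxError("unexpected ')'")
    if token != "(":
        return token  # a plain word (atom)
    items = []
    while tokens and tokens[0] != ")":
        items.append(parse(tokens))
    if not tokens:
        raise SyntaxError("missing ')'")
    tokens.pop(0)  # discard the closing ")"
    return items


tree = parse(tokenize("(This is a list ( containing this nested list) ( and this one))"))
print(tree)
```

The result is an ordinary nested list, so every later step (editing, rendering, serializing) can be plain list manipulation.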

Manipulable Structure

The structure of the program should be easy to edit, whether that means changing the order of items, or "promoting" or "demoting" sections.

This gets slightly complicated if our language is not indentation-aware. For example, in markdown and org-mode nested sections are marked by an increasing prefix length, so a header starting with "**" is a child of the nearest "*" header above it.

However, if we're building a system based on parentheses, this does not hold true.
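One way to read this: if hierarchy is carried by nesting rather than by a prefix, then promoting or demoting a section means moving a whole subtree rather than editing a few characters. The sketch below assumes a document is a nested list in which each section is itself a list; `demote` and `promote` are hypothetical helper names.

```python
def demote(parent, index):
    """Nest parent[index] under the section just before it."""
    if index < 1 or not isinstance(parent[index - 1], list):
        raise ValueError("no preceding section to nest under")
    parent[index - 1].append(parent.pop(index))


def promote(grandparent, parent_index, index):
    """Lift a child out of its parent and place it right after that parent."""
    parent = grandparent[parent_index]
    grandparent.insert(parent_index + 1, parent.pop(index))


doc = [["# A", "text"], ["# B", "more"]]

demote(doc, 1)
print(doc)  # [['# A', 'text', ['# B', 'more']]]

promote(doc, 0, 2)
print(doc)  # [['# A', 'text'], ['# B', 'more']]
```

Both operations are a single pop and insert on the tree, which suggests that structural editing stays cheap as long as an editor works on the parsed form rather than on raw text.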

"Package"/Dependency/Import Manager

We want to be able to easily share texts written in this language, sending them wherever they need to go. We also want to be able to reference other files.

Interdocument references

It should be possible to reference (and maybe even transclude) specific sections of other documents.

Embedded code

This is where things get interesting. A powerful feature would be to write code that generates text to be used in text files. This could be to produce things like reports, or different views on files. This makes each file written in this language analogous to a program that produces a file.

In Lisps there is the "quote" operator, which prevents an expression from being evaluated and returns it as a list instead. For example:

> (+ 4 7)
11
> '(+ 4 7)
(+ 4 7)

In our case, though, the language treats text as the default and code (i.e. function calling) as the special case. We can just flip the meaning of the quote operator so that it means "execute this block".
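A rough sketch of that inversion, working on an already-parsed tree of nested lists. Everything here is an assumption made for illustration: the "'" marker as the first element of a list, the `FUNCTIONS` table, and the choice to join text with single spaces.

```python
# Hypothetical table of functions that documents are allowed to call.
FUNCTIONS = {
    "+": lambda *xs: sum(int(x) for x in xs),
    "upper": lambda s: s.upper(),
}


def render(node):
    """Turn a parsed tree into text, executing only quoted lists."""
    if isinstance(node, str):
        return node  # plain text is the default case
    if node and node[0] == "'":
        name, *args = node[1:]  # quoted list: a function call
        return str(FUNCTIONS[name](*[render(arg) for arg in args]))
    return " ".join(render(child) for child in node)


doc = ["The", "total", "is", ["'", "+", "4", "7"], "and", ["'", "upper", "done"]]
print(render(doc))  # The total is 11 and DONE
```

Unquoted lists are never looked up in the function table, so ordinary prose cannot accidentally trigger code.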

Gotchas

  • Parsing unformatted text is difficult. When in the process do we need to implement it? After tokenization probably?
  • This needs to support Unicode, as people will use it to write!

Parsing

We need to parse files into an AST while still keeping track of their location in the original file. This is so that we can modify that original file with functions.
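As a sketch of what keeping track of location could look like, the parser below stores start and end character offsets on every node. The `Node` class and its field names are made up for this example, and the grammar is the same simplified one as before (parentheses plus whitespace-separated atoms).

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    start: int  # offset of the node's first character in the source
    end: int    # offset one past the node's last character
    text: str = ""  # filled in for atoms
    children: list = field(default_factory=list)  # filled in for lists


def parse(source, pos=0):
    """Parse the list that starts at source[pos] and return its Node."""
    if pos >= len(source) or source[pos] != "(":
        raise SyntaxError("expected '('")
    node = Node(start=pos, end=pos)
    pos += 1
    while pos < len(source):
        ch = source[pos]
        if ch == ")":
            node.end = pos + 1
            return node
        if ch == "(":
            child = parse(source, pos)
            node.children.append(child)
            pos = child.end
        elif ch.isspace():
            pos += 1
        else:
            start = pos
            while pos < len(source) and source[pos] not in "()" and not source[pos].isspace():
                pos += 1
            node.children.append(Node(start, pos, text=source[start:pos]))
    raise SyntaxError("missing ')'")


source = "(# Title (* bold) text)"
tree = parse(source)
bold = tree.children[2]
print(source[bold.start:bold.end])  # (* bold)

# Rewrite just that span, leaving the rest of the file untouched.
edited = source[:bold.start] + "(/ bold)" + source[bold.end:]
print(edited)  # (# Title (/ bold) text)
```

Because the edit is spliced in by offset, any whitespace or odd formatting outside the changed node survives exactly as the author wrote it.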

Another option that could achieve similar functionality would be a standard way of serializing an AST back to text, as code formatters such as prettier or black do.

However, this is harder to do for what is ostensibly prose as opposed to code, as people may use whitespace and odd formatting intentionally.
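To make the trade-off concrete, here is a minimal sketch of such a canonical serializer for a tree of nested lists. The single-space, single-line layout is an arbitrary choice for the example, not a proposal.

```python
def serialize(node):
    """Write a nested-list tree back out in one fixed layout."""
    if isinstance(node, str):
        return node
    return "(" + " ".join(serialize(child) for child in node) + ")"


tree = ["This", "is", "a", "list", ["containing", "this", "nested", "list"], ["and", "this", "one"]]
print(serialize(tree))
# (This is a list (containing this nested list) (and this one))
```

The output is always the same for the same tree, but the author's original spacing (for instance "( containing" with a space after the parenthesis) is gone, which is exactly the loss described above.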

What seems to be the standard for parsers is a token-based system, with an interface for looking at the current token and upcoming ones.
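A small sketch of that kind of interface, again assuming only parentheses and whitespace-separated atoms. `TokenStream`, `peek`, and `next` are names picked for illustration.

```python
import re

# A token is "(", ")" or a run of characters that are neither space nor parens.
TOKEN_RE = re.compile(r"\(|\)|[^\s()]+")


class TokenStream:
    def __init__(self, text):
        self.tokens = TOKEN_RE.findall(text)
        self.pos = 0

    def peek(self, offset=0):
        """Return the token `offset` places ahead without consuming it, or None."""
        i = self.pos + offset
        return self.tokens[i] if i < len(self.tokens) else None

    def next(self):
        """Consume and return the current token, or None at the end."""
        token = self.peek()
        if token is not None:
            self.pos += 1
        return token


stream = TokenStream("(- item one)")
print(stream.peek())   # (
print(stream.peek(1))  # -
print(stream.next())   # (
print(stream.peek())   # -
```

Lookahead is what lets a parser decide, for example, whether an opening parenthesis starts a list item before it commits to consuming anything.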

Examples and experiments

Markdown replacement

(# This is a header)

From here I can just write out this block. This would get parsed as a list of
atoms. I could make things (* bold) or (/ italic).

If I wanted to include a source block it would be:

(src
function test()  {
  return "This is a src block"
}
)

((# This is now a new section)
  The first atom in a section is its name.

  ((## A sub section)
    This is the content of the sub section

    - Can we parse this as just a list?
    - Maybe each line by default becomes a list of its own?

    )
)

Figuring out Lists

(# I want to try to make a list here)

(## An unordered List)

(- This is an item)
(- also an item (* this is bold)
   and just more text as well!)

When parsing, if we encounter a list item we start a list. If the next atom
is also a list item then it gets added to the list. This could complicate
parsing a little bit, but it does make it relatively clean.
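The grouping rule could be sketched as a pass over already-parsed nodes. This assumes each list item arrives as a nested list whose first element is "-", and that the grouped result is tagged "ul"; both conventions are invented for the example.

```python
def group_list_items(nodes):
    """Collect runs of adjacent "-" items into a single ["ul", ...] node."""
    grouped = []
    for node in nodes:
        is_item = isinstance(node, list) and node[:1] == ["-"]
        if not is_item:
            grouped.append(node)
            continue
        previous = grouped[-1] if grouped else None
        if isinstance(previous, list) and previous[:1] == ["ul"]:
            previous.append(node[1:])  # extend the list we are already in
        else:
            grouped.append(["ul", node[1:]])  # start a new list
    return grouped


nodes = [
    ["-", "This", "is", "an", "item"],
    ["-", "also", "an", "item"],
    "text",
    ["-", "alone"],
]
print(group_list_items(nodes))
```

Here the first two items end up in one "ul" node, the plain text breaks the run, and the last item starts a new list of its own.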

(## An ordered list)

(1. This is a list item)
(2. This is another list item)
(3. This is a list item that's also a paragraph
    It has another line here. (* something bold))

(# An alternative format for lists)

(ul
  (This is one item in the list)
  (This is another item in the list))

(ol
  ( this is the first item in the list)
  ( this is the second item in the list))

(- [_] )
(- [x] )

This avoids syntax that can be inferred from the structure, but that also makes it
a bit harder to quickly scan or pick up for someone new to this whole thing.

Prior Art

Markdown

Markdown is probably the most widely adopted markup language for writing prose. It's available in editors all over the internet and has parsers written in pretty much every language. This is largely due to its simplicity. There's very little syntax in markdown, and most of it is highly intuitive for both reading and writing.

A shortcoming of markdown is the fragmentation that has come along with its proliferation. There are different "flavors" of markdown that have different quirks and features. This can range from small things, like how to specify the language in a code block, to almost whole new formats like mdx (which I used to set up the new fathom site).

Org-mode

It's hard to talk about org-mode directly, as its value really comes from the combination of a markup format with an incredibly full-featured editing mode in Emacs.

Its core is a hierarchy of headings, each of which can be folded and moved around easily. This makes it extremely useful for outlining and for progressively fleshing out ideas. You can fold headings you don't care about at a particular moment or even narrow down to a specific heading so that everything else is completely hidden.

Org-mode's biggest downside is its complexity. While the core structure is simple and easy to understand and manipulate, it has a ton of other features that are implemented in different ways. For example, you can set properties for a header with a drawer, delimited by :PROPERTIES: and :END:, but properties for code blocks are implemented as key-value pairs on the same line.

I talk more about how I use org-mode here
