A Minimal Markup Language

2022-03-21

For some of my projects, I've been using a custom markup language to annotate text. I'd like to capture some notes about it here.

As a motivating example, suppose we're writing a book that makes abundant reference to different Spanish phrases. So, we want to be able to annotate whether some text is Spanish or not.

In HTML, we might try <span lang="es">hola</span>, which is workable if Spanish text will appear just once. But if we expect to refer to hundreds or thousands of Spanish words, this is tedious.

For a second iteration, we might instead consider a custom XML element, such as <es>hola</es>. This is much better — the <es> element is lightweight, expressive, and focuses on semantics instead of presentational details.

But even so, I have two problems with this approach. The first is that <es>hola</es> isn't very readable. I think that markup should be kept to a minimum so that I can focus on content. But this is minor compared to the second problem: this markup is horrendously painful if the tags are just a little bit longer. Suppose I want to annotate that a phrase is a Spanish idiom. Do I then use <idiom lang="es">hola</idiom>?

I've considered a few different options, and ultimately I find something like this:

{es hola}
{idiom[lang="es"] hola}

to be the most readable. This is basically just an s-expression. But the problem I have here is that if the text needs to start with whitespace, this approach is no longer so elegant. And as a secondary issue, it becomes difficult to type { as an ordinary character.

So I'm also drawn to a more TeX-like approach, which is also used by Pollen (which my friend S. told me about today). In such an approach, we mark the element name with some meta character. Pollen, for example, uses the lozenge symbol (◊):

◊es{hola}
◊idiom[#:lang "es"]{hola}

But while I have no quarrel with Unicode characters, I do think that a markup language should stick to characters that are easy to type on a standard keyboard, which is why I lean more toward a traditional backslash as used in TeX.

So, I think my final preference is something like this:

\es{hola}
\idiom[lang="es"]{hola}

A brief spec

Not especially formal, but:

Document:
    ""
    Nodes
Nodes:
    Node
    Nodes Node
Node:
    Text
    Element
Text:
    ""
    Character
    Text Character
Character:
    [^\\]
    '\' [bfnrt"]
    '\\'
Element:
    '\' SymbolName '{' Nodes '}'
    '\' SymbolName ElementAttributes '{' Nodes '}'
SymbolName:
    [^{}\[\]\\]+
ElementAttributes:
    '[' KeyValues ']'
KeyValues:
    ws KeyValue ws
    ws KeyValues \s+ KeyValue ws
KeyValue:
    SymbolName ws '=' ws '"' Text '"'
ws:
    \s*
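
To get a feel for how this grammar behaves in practice, here is a rough recursive-descent parser in Python. It is only a sketch, and it makes a few choices the spec above leaves open, so treat all of these as my assumptions: an unescaped } closes the nearest open element (and is plain text at the top level); names additionally exclude whitespace, = and "; a backslash is tried as an element first and only falls back to an escape such as \t if no { or [ follows the name; and the attribute list is slightly more lenient than the grammar (it accepts an empty [] and does not insist on whitespace between pairs). The names Element, Parser and parse are made up for this illustration.

```python
from dataclasses import dataclass, field

ESCAPES = {"b": "\b", "f": "\f", "n": "\n", "r": "\r", "t": "\t", '"': '"', "\\": "\\"}
# Assumption: besides {}[]\ from the spec, names also stop at '=', '"' and whitespace.
NAME_STOP = set('{}[]\\="')


@dataclass
class Element:
    name: str
    attrs: dict = field(default_factory=dict)
    children: list = field(default_factory=list)  # items are str or Element


class Parser:
    def __init__(self, src):
        self.src, self.pos = src, 0

    def peek(self):
        # Current character, or "" at end of input.
        return self.src[self.pos] if self.pos < len(self.src) else ""

    def expect(self, ch):
        if self.peek() != ch:
            raise SyntaxError(f"expected {ch!r} at offset {self.pos}")
        self.pos += 1

    def skip_ws(self):
        while self.peek().isspace():
            self.pos += 1

    def name(self):
        start = self.pos
        while self.peek() and self.peek() not in NAME_STOP and not self.peek().isspace():
            self.pos += 1
        return self.src[start:self.pos]

    def escape(self):
        # Called with pos just past the backslash.
        ch = self.peek()
        if ch not in ESCAPES:
            raise SyntaxError(f"bad escape at offset {self.pos}")
        self.pos += 1
        return ESCAPES[ch]

    def quoted(self):
        # Called with pos just past the opening quote of an attribute value.
        buf = []
        while self.peek() != '"':
            ch = self.peek()
            if ch == "":
                raise SyntaxError("unterminated attribute value")
            self.pos += 1
            buf.append(self.escape() if ch == "\\" else ch)
        self.pos += 1  # closing quote
        return "".join(buf)

    def element(self, name):
        # Called with pos on the '[' or '{' that follows the name.
        attrs = {}
        if self.peek() == "[":
            self.pos += 1
            self.skip_ws()
            while self.peek() != "]":
                key = self.name()
                if not key:
                    raise SyntaxError(f"expected attribute name at offset {self.pos}")
                self.skip_ws()
                self.expect("=")
                self.skip_ws()
                self.expect('"')
                attrs[key] = self.quoted()
                self.skip_ws()
            self.pos += 1  # closing ']'
        self.expect("{")
        children = self.nodes(inside_element=True)
        self.expect("}")
        return Element(name, attrs, children)

    def nodes(self, inside_element):
        out, text = [], []
        while self.peek():
            ch = self.peek()
            if ch == "}" and inside_element:
                break  # the caller consumes the closing brace
            if ch != "\\":
                text.append(ch)
                self.pos += 1
                continue
            after_backslash = self.pos + 1
            self.pos = after_backslash
            name = self.name()
            if name and self.peek() in ("{", "["):
                if text:
                    out.append("".join(text))
                    text = []
                out.append(self.element(name))
            else:
                # Not an element: rewind and read a single-character escape.
                self.pos = after_backslash
                text.append(self.escape())
        if text:
            out.append("".join(text))
        return out


def parse(src):
    return Parser(src).nodes(inside_element=False)
```

With this sketch, parse(r'\idiom[lang="es"]{hola}') gives back a single Element named idiom, with attrs {"lang": "es"} and children ["hola"]. Writing it also surfaced the one real ambiguity I noticed in the grammar: \t can be read either as a tab escape or as the start of an element name like \text, which is why the sketch has to pick an order.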