A Minimal Markup Language
2022-03-21
For some of my projects, I've been using a custom markup language to annotate text. I'd like to capture some notes about it here.
As a motivating example, suppose we're writing a book that makes abundant reference to different Spanish phrases. So, we want to be able to annotate whether some text is Spanish or not.
In HTML, we might try <span lang="es">hola</span>, which is workable if
Spanish text appears just once. But if we expect to refer to hundreds or
thousands of Spanish words, this is tedious.
For a second iteration, we might instead consider a custom XML element, such as
<es>hola</es>. This is much better: the <es> element is lightweight,
expressive, and focuses on semantics instead of presentational details.
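As a rough illustration of what "semantics instead of presentational details" buys us, here is a minimal Python sketch that expands the custom element back into the HTML form at render time. The function name es_to_span and the sample sentence are hypothetical, chosen only for this example; it relies on the standard library's xml.etree.ElementTree.

```python
import xml.etree.ElementTree as ET


def es_to_span(xml_text: str) -> str:
    """Rewrite every <es> element as <span lang="es"> (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    # Collect matches first so that renaming tags cannot disturb the traversal.
    for elem in list(root.iter("es")):
        elem.tag = "span"
        elem.set("lang", "es")
    return ET.tostring(root, encoding="unicode")


print(es_to_span("<p>She said <es>hola</es> and left.</p>"))
# prints: <p>She said <span lang="es">hola</span> and left.</p>
```

The source text stays short, and the verbose HTML is produced mechanically in one place.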
But even so, I have two problems with this approach. The first is that
<es>hola</es> isn't very readable. I think that markup should be kept to a
minimum so that I can focus on content. But this is minor compared to the
second problem: this markup is horrendously painful if the tags are just a
little bit longer. Suppose I want to annotate that a phrase is a Spanish idiom.
Do I then use <idiom lang="es">hola</idiom>?
I've considered a few different options, and ultimately I find something like this:
[two-line code example not preserved]
to be the most readable. This is basically just an s-expression. But the
problem I have here is that if the text needs to start with whitespace, this
approach is no longer so elegant. And as a secondary issue, it becomes
difficult to type { as an ordinary character.
So I'm also drawn to a more TeX-like approach, which is also used by
Pollen (which my friend S. told me about today). In such an approach,
we mark the element name with some meta character. Pollen, for example, uses
the lozenge symbol (◊):
[two-line code example not preserved]
But while I have no quarrel with using a Unicode symbol as the meta character, I do think that a markup language should stick to characters that are easy to type on a standard keyboard, which is why I lean more toward a traditional backslash as used in TeX.
So, I think my final preference is something like this:
[two-line code example not preserved]
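To make the TeX-like idea concrete, here is a minimal Python sketch of a parser for a backslash-marked element. The concrete syntax is an assumption made for illustration, not necessarily the one I settled on: an element is written \name{body}, elements may nest, and \\, \{ and \} stand for the literal characters. The names parse, _parse_until, and NAME are likewise hypothetical, and attributes are left out entirely.

```python
import re

# Assumed element-name syntax: a letter followed by letters, digits, "_" or "-".
NAME = re.compile(r"[A-Za-z][A-Za-z0-9_-]*")


def parse(src: str) -> list:
    """Parse src into a list of text strings and (name, children) tuples."""
    nodes, _ = _parse_until(src, 0, closing=False)
    return nodes


def _parse_until(src: str, pos: int, closing: bool):
    nodes, text = [], []

    def flush():
        # Emit any accumulated plain text as a single string node.
        if text:
            nodes.append("".join(text))
            text.clear()

    while pos < len(src):
        ch = src[pos]
        if ch == "}":
            if not closing:
                raise ValueError(f"unmatched '}}' at offset {pos}")
            flush()
            return nodes, pos + 1
        if ch == "\\":
            nxt = src[pos + 1 : pos + 2]
            if nxt and nxt in "\\{}":
                # Escaped meta character: keep it as ordinary text.
                text.append(nxt)
                pos += 2
                continue
            m = NAME.match(src, pos + 1)
            if m is None or src[m.end() : m.end() + 1] != "{":
                raise ValueError(f"expected \\name{{ at offset {pos}")
            flush()
            children, pos = _parse_until(src, m.end() + 1, closing=True)
            nodes.append((m.group(), children))
            continue
        # Anything else, including a bare "{", is ordinary text in this sketch.
        text.append(ch)
        pos += 1

    if closing:
        raise ValueError("missing closing '}'")
    flush()
    return nodes, pos


print(parse(r"Say \es{hola} to \idiom{\es{de nada}}!"))
```

Because the body is delimited by braces rather than by whitespace after the element name, text that starts with a space (as in \es{ hola}) keeps that space, which is the case that made the s-expression form awkward.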
A brief spec
Not especially formal, but:
[31-line spec listing not preserved]