      1 # Writing a Custom Markup Parser for this Site
      2 
      3 _Published: December 28, 2021_
      4 
      5 ## How it All Happened
      6 
      7 Almost exactly a year ago, I moved this site from templated
      8 markdown to a [static site built with `mdoc(7)`](/blog/my-old-man.html).
      9 I ported 5 blog posts in the process and ended up writing 7 more
     10 over the course of the year.
     11 
     12 While it was a success in teaching me
     13 [`mdoc(7)`](https://man.openbsd.org/mdoc.7), I found that it
     14 slowed me down a bit in authoring blog posts.
     15 
     16 Around the same time as I was feeling this slowness, I began
     17 actively phlogging on my [gopherhole](gopher://alexkarle.com) and
     18 [even started posting gopher-only content](/blog/burrowing.html).
     19 I really enjoyed writing plaintext posts because they are quick
     20 to write and, more importantly, highly durable to changes in
     21 technology. I can't imagine a world where one cannot open a plain
     22 `.txt` file and read/edit it.
     23 
     24 I got a real sense for the importance of optimizing for archival
     25 while browsing gopher--it's incredible to be reading textfiles
     26 older than me. It made me realize that I want to be sure that my
     27 content can survive for decades with minimal effort.
     28 
     29 So, I started thinking about how I could move my site's source
     30 (pre-HTML) to a more readable plaintext format. Markdown was the
     31 obvious choice, but I wanted to stay true to my
     32 [creative limitation](/blog/creative-coding.html) of keeping this
     33 site buildable by base OpenBSD, so I had to find another option.
     34 
     35 It was about 10 days into solving Advent of Code puzzles that I
     36 realized I could redirect some of the puzzling effort at the
     37 problem and write my own markup parser. The result, a few weeks
     38 later, is [`nihdoc(1)`](https://git.sr.ht/~akarle/nihdoc).
     39 `nihdoc(1)` (a play on the fact that markdown is *N*ot *I*nvented
     40 *H*ere) provides support for all the basic syntax I'd want in a
     41 blog post--nested lists, _inline_ *styles* and `code`, code
     42 blocks and block quotes, and headers. It was a blast to write,
     43 and I learned a lot in the process!
     44 
     45 I suspect the CSS for the blog will still change (maybe a dark
     46 mode? or something a little less plain), but I tried to keep the
     47 resulting HTML pretty bare in support of accessibility and
     48 portability--it should read well in screenreaders, embedded in
     49 RSS feeds, and more.
     50 
     51 If you want to see the source for any post, just replace the
     52 `.html` extension in the URL with `.txt`! For example, here's
     53 [this post's source](/blog/mdoc-to-nihdoc.txt).
     54 
     55 ## Implementation Highlights
     56 
     57 If you read this far, I figure you might be interested in some of
     58 the implementation details and design decisions.
     59 
     60 ### Stream Based Parsing
     61 
     62 Probably the most interesting detail of the parser is that it is
     63 stream-based with constant memory usage. In other words, it will
     64 start spitting out the input and the HTML markup as soon as it
     65 can decisively figure out what state it's in (e.g. whether the
     66 paragraph has ended). Keeping track of this state is done with a
     67 handful of booleans/integers and doesn't involve storing lines in
     68 memory. In fact, the current implementation reads the input one
     69 character at a time!
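
As a rough illustration of the idea (a Python sketch, not the actual nihdoc source), the function below wraps blank-line-separated paragraphs in `<p>` tags while reading one character at a time. The names `stream_paragraphs`, `in_para`, and `pending_nl` are made up for this example, and HTML escaping is left out to keep it short.

```python
import io


def stream_paragraphs(src, out):
    """Wrap blank-line-separated paragraphs in <p> tags, one character at a time."""
    in_para = False  # is a <p> currently open?
    pending_nl = 0   # newlines seen but not yet acted on
    while True:
        c = src.read(1)
        if c == "":
            break  # end of input
        if c == "\n":
            # We can't tell yet whether this ends the paragraph, so just count it.
            pending_nl += 1
            continue
        if in_para and pending_nl >= 2:
            out.write("</p>\n")  # a blank line closed the paragraph
            in_para = False
        elif in_para and pending_nl == 1:
            out.write("\n")      # a single newline is just a line break
        if not in_para:
            out.write("<p>")
            in_para = True
        pending_nl = 0
        out.write(c)
    if in_para:
        out.write("</p>\n")


out = io.StringIO()
stream_paragraphs(io.StringIO("one\ntwo\n\nthree\n"), out)
print(out.getvalue(), end="")
```

The only state carried between characters is one boolean and one counter, so memory use does not grow with the size of the document.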
     70 
     71 This is an efficiency win for large documents (not that my posts
     72 are that long), but was also just a fun constraint to try to code
     73 within. In practice, I found I was able to get support for almost
     74 everything I wanted (nested lists, etc) with maybe the exception of
     75 "bottom of the document" links that markdown allows. More on that
     76 later.
     77 
     78 ### Balancing Ease of Implementation with Syntax
     79 
     80 One of the most interesting challenges in designing a markup
     81 language is settling on a syntax that's both easy-ish to
     82 implement (I'm a big believer in simpler = fewer bugs) but also
     83 syntactically appealing in plaintext format (after all, one of
     84 the main motivations was to make the source archive-ready).
     85 
     86 The best example of this was deciding how to write links.
     87 
     88 I started off with the easiest implementation, which is also the
     89 least appealing (IMHO). A link looked like this:
     90 
     91 	[https://alexkarle.com/blog my blog]
     92 
     93 This is super easy to parse one character at a time. In
     94 pseudo-code:
     95 
     96 1. If current character is *`[`*, print *`<a href="`*
     97 2. Print all characters (the href) until you see a space/newline
     98 3. Once we see the space/newline, print *`">`*
     99 4. Print all characters (the description) up until the `]`
    100 5. Once we see the *`]`*, print the closing *`</a>`*
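
The five steps above can be sketched as a small state machine. This is a Python illustration under assumed names (`convert_prefix_links` and the three state labels are hypothetical), not the code nihdoc actually uses.

```python
import io


def convert_prefix_links(src, out):
    """Rewrite [href description] as <a href="href">description</a>, streaming."""
    state = "text"  # one of "text", "href", "desc"
    while True:
        c = src.read(1)
        if c == "":
            break
        if state == "text":
            if c == "[":
                out.write('<a href="')  # step 1
                state = "href"
            else:
                out.write(c)
        elif state == "href":
            if c in " \n":
                out.write('">')         # step 3
                state = "desc"
            else:
                out.write(c)            # step 2
        else:  # state == "desc"
            if c == "]":
                out.write("</a>")       # step 5
                state = "text"
            else:
                out.write(c)            # step 4


out = io.StringIO()
convert_prefix_links(io.StringIO("See [https://alexkarle.com/blog my blog]!"), out)
print(out.getvalue())  # prints: See <a href="https://alexkarle.com/blog">my blog</a>!
```

Every input character is either copied straight to the output or replaced by a fixed piece of HTML, so nothing has to be remembered beyond the current state. This sketch does not handle a `]` that arrives while still reading the href.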
    101 
    102 This fits really nicely into our "parse one character at a time" approach,
    103 since each special character in the link corresponds to a direct
    104 piece of HTML to output. However, it's ugly to print links that
    105 have no description, such as:
    106 
    107 	[https://alexkarle.com https://alexkarle.com]
    108 
    109 To address this, the next evolution added a (stack-allocated)
    110 "link buffer" that stores the href as it is printed. If the
    111 `]` is hit before a space/newline, the parser assumes the
    112 description is the href and prints the link buffer in place of
    113 the description, enabling "bare links" like so:
    114 
    115 	[https://alexkarle.com]
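
A sketch of how such a link buffer might work, again in Python with hypothetical names. `LINK_BUF_MAX` is an assumed cap standing in for a fixed-size, stack-allocated buffer; the real size and overflow behavior are not stated here.

```python
import io

LINK_BUF_MAX = 1024  # assumed cap, mimicking a fixed-size buffer


def convert_bare_links(src, out):
    """Handle both [href description] and bare [href] links."""
    state = "text"  # one of "text", "href", "desc"
    linkbuf = []    # copy of the href, kept in case there is no description
    while True:
        c = src.read(1)
        if c == "":
            break
        if state == "text":
            if c == "[":
                out.write('<a href="')
                linkbuf.clear()
                state = "href"
            else:
                out.write(c)
        elif state == "href":
            if c == "]":
                # Bare link: no description followed, so replay the href.
                out.write('">' + "".join(linkbuf) + "</a>")
                state = "text"
            elif c in " \n":
                out.write('">')
                state = "desc"
            else:
                out.write(c)  # the href is still printed as it is read
                if len(linkbuf) < LINK_BUF_MAX:
                    linkbuf.append(c)  # past the cap, the replay is truncated
        else:  # state == "desc"
            if c == "]":
                out.write("</a>")
                state = "text"
            else:
                out.write(c)


out = io.StringIO()
convert_bare_links(io.StringIO("[https://alexkarle.com]"), out)
print(out.getvalue())
```

The href is still streamed out immediately; the buffer only exists so it can be printed a second time as the link text.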
    116 
    117 I was about to go live on my blog with that iteration because I
    118 liked it _enough_, but the one thing that really bothered me was
    119 that it's hard to read the description after the link. To the
    120 plaintext reader, the description is way more important than the
    121 link! Especially for long links, it's distracting to have to scan
    122 ahead to continue a sentence.
    123 
    124 I really wanted markdown-style links like so:
    125 
    126 	[my blog](https://alexkarle.com/blog)
    127 
    128 The immediate problem was that the parser can no longer print the
    129 characters as it sees them, since the URL happens after the
    130 description in the input but needs to come before the description
    131 in the output. I realized, however, that this is a similar problem
    132 to the way I used the link buffer for bare links--all I had to do was
    133 store the description in the buffer, and play it back after
    134 printing the href. It's the same amount of memory, but a tad more
    135 complex, since the description is allowed to have inline styles,
    136 so before pushing onto the link buffer, we need to check for styles
    137 and push those too (effectively a smaller version of the main
    138 loop).
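
One way to picture the description buffer, as a hedged Python sketch rather than the real implementation: the description is rendered into the buffer (including inline styles), the href is streamed out when it arrives, and the buffer is replayed afterwards. The `STYLES` mapping from markup characters to HTML tags is an assumption made for this example, as are all function and variable names.

```python
import io

# Assumed mapping of inline style markers to HTML tags (illustrative only).
STYLES = {"_": "em", "*": "strong", "`": "code"}


def convert_markdown_links(src, out):
    """Rewrite [description](href) as <a href="href">description</a>."""
    state = "text"       # "text", "desc", "after_desc", or "href"
    linkbuf = []         # rendered description, replayed after the href
    open_styles = set()  # style tags currently open inside the description
    pending = ""         # one character of pushback
    while True:
        c = pending or src.read(1)
        pending = ""
        if c == "":
            break
        if state == "text":
            if c == "[":
                linkbuf.clear()
                open_styles.clear()
                state = "desc"
            else:
                out.write(c)
        elif state == "desc":
            if c == "]":
                state = "after_desc"
            elif c in STYLES:
                # A miniature version of the main loop: toggle the style tag.
                tag = STYLES[c]
                if tag in open_styles:
                    open_styles.remove(tag)
                    linkbuf.append("</%s>" % tag)
                else:
                    open_styles.add(tag)
                    linkbuf.append("<%s>" % tag)
            else:
                linkbuf.append(c)
        elif state == "after_desc":
            if c == "(":
                out.write('<a href="')
                state = "href"
            else:
                # Not a link after all: emit the bracketed text, reprocess c.
                out.write("[" + "".join(linkbuf) + "]")
                state = "text"
                pending = c
        else:  # state == "href"
            if c == ")":
                out.write('">' + "".join(linkbuf) + "</a>")
                state = "text"
            else:
                out.write(c)


out = io.StringIO()
convert_markdown_links(io.StringIO("[my _blog_](https://alexkarle.com/blog)"), out)
print(out.getvalue())
```

The buffer is bounded by the length of one description, not by the length of the document. Links left unterminated at the end of input are not handled in this sketch.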
    139 
    140 The final form of markdown links that I'd like to support but
    141 can't is a "postfix link", like so:
    142 
    143 	This is a [link] in
    144 	a paragraph
    145 	...
    146 	[link]: https://alexkarle.com
    147 
    148 Since the actual link could be anywhere in the document, this
    149 kind of parsing requires buffering potentially the whole
    150 document, which violates the streaming condition (which I'd like
    151 to keep!), so I stopped short of it.
    152 
    153 ## Conclusion
    154 
    155 I hope you found this discussion of syntax, tradeoffs, and
    156 parsers interesting! I'm sure there's a lot more I can learn and
    157 improve on, but it's been a fun evolution from the `mdoc(7)` I
    158 started with! Check out the
    159 [source](https://git.sr.ht/~akarle/nihdoc) if you're curious. I
    160 expect it'll change rather frequently in the next few weeks, so I
    161 wouldn't advise depending on it yourself (but I wanted to open
    162 source it to share with others as a teaching tool regardless!).
    163 
    164 [Back to blog](/blog)