# Writing a Custom Markup Parser for this Site

_Published: December 28, 2021_

## How it All Happened

Almost exactly a year ago, I moved this site from templated
markdown to a [static site built with `mdoc(7)`](/blog/my-old-man.html).
I ported 5 blog posts in the process and ended up writing 7 more
over the course of the year.

While it was a success in teaching me
[`mdoc(7)`](https://man.openbsd.org/mdoc.7), I found that it
slowed me down a bit in authoring blog posts.

Around the same time as I was feeling this slowness, I began
actively phlogging on my [gopherhole](gopher://alexkarle.com) and
[even started posting gopher-only content](/blog/burrowing.html).
I really enjoyed writing plaintext posts because they are quick
to write and, more importantly, highly durable to changes in
technology. I can't imagine a world where one cannot open a plain
`.txt` file and read/edit it.

I got a real sense for the importance of optimizing for archival
while browsing gopher--it's incredible to be reading textfiles
older than me. It made me realize that I want to be sure that my
content can survive for decades with minimal effort.

So, I started thinking about how I could move my site's source
(pre-HTML) to a more readable plaintext format. Markdown was the
obvious choice, but I wanted to stay true to my
[creative limitation](/blog/creative-coding.html) of keeping this
site buildable by base OpenBSD, so I had to find another option.

It was about 10 days into solving Advent of Code puzzles that I
realized I could redirect some of the puzzling effort at the
problem and write my own markup parser. The result, a few weeks
later, is [`nihdoc(1)`](https://git.sr.ht/~akarle/nihdoc).
`nihdoc(1)` (a play on the fact that markdown is *N*ot *I*nvented
*H*ere) provides support for all the basic syntax I'd want in a
blog post--nested lists, _inline_ *styles* and `code`, code
blocks and block quotes, and headers. It was a blast to write,
and I learned a lot in the process!

I suspect the CSS for the blog will still change (maybe a dark
mode? or something a little less plain), but I tried to keep the
resulting HTML pretty bare in support of accessibility and
portability--it should read well in screenreaders, embedded in
RSS feeds, and more.

If you want to see the source for any post, just replace the
`.html` extension in the URL with `.txt`! For example, here's
[this post's source](/blog/mdoc-to-nihdoc.txt).

## Implementation Highlights

If you read this far, I figure you might be interested in some of
the implementation details and design decisions.

### Stream Based Parsing

Probably the most interesting detail of the parser is that it is
stream-based with constant memory usage. In other words, it will
start spitting out the input and the HTML markup as soon as it
can decisively figure out what state it's in (e.g. whether the
paragraph has ended). Keeping track of this state is done with a
handful of booleans/integers and doesn't involve storing lines in
memory. In fact, the current implementation reads the input one
character at a time!

This is an efficiency win for large documents (not that my posts
are that long), but was also just a fun constraint to try to code
within. In practice, I found I was able to get support for almost
everything I wanted (nested lists, etc) with maybe the exception of
"bottom of the document" links that markdown allows. More on that
later.
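
To make the streaming idea concrete, here's a rough sketch in Python. It is _not_ the actual `nihdoc(1)` source; the function names (`stream_to_html`, `convert`) and the tiny syntax it handles (blank-line-separated paragraphs and backtick code spans) are assumptions made purely for illustration. The point is that the only state is two booleans and one integer, and output is written as soon as the parser knows what a character means.

```python
import io

# Characters that must be escaped in HTML text.
ESCAPES = {"&": "&amp;", "<": "&lt;", ">": "&gt;"}


def stream_to_html(src, out):
    """Read `src` one character at a time and write HTML to `out`.

    Hypothetical mini-syntax: paragraphs are separated by a blank
    line, and backticks toggle a <code> span. No lines are stored.
    """
    in_para = False  # is a <p> currently open?
    in_code = False  # is a <code> currently open?
    newlines = 0     # consecutive newlines seen since the last text char

    while True:
        ch = src.read(1)
        if ch == "":          # end of input
            break
        if ch == "\n":
            # We can't yet tell a soft line break from a paragraph
            # break, so just count and decide on the next text char.
            newlines += 1
            continue

        if in_para and newlines >= 2:
            # A blank line ended the paragraph; close what is open.
            if in_code:
                out.write("</code>")
                in_code = False
            out.write("</p>\n")
            in_para = False
        elif in_para and newlines == 1:
            out.write("\n")   # soft break inside the same paragraph
        newlines = 0

        if not in_para:
            out.write("<p>")
            in_para = True

        if ch == "`":
            out.write("</code>" if in_code else "<code>")
            in_code = not in_code
        else:
            out.write(ESCAPES.get(ch, ch))

    # End of input: close anything still open.
    if in_code:
        out.write("</code>")
    if in_para:
        out.write("</p>\n")


def convert(text):
    """Convenience wrapper for trying the parser on a string."""
    out = io.StringIO()
    stream_to_html(io.StringIO(text), out)
    return out.getvalue()
```

Calling `convert("one\n\ntwo\n")` yields two `<p>` elements. The one place the sketch has to "wait" is on newlines, since a single newline can't be classified until the next character arrives, and even that only costs one integer.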

### Balancing Ease of Implementation with Syntax

One of the most interesting challenges in designing a markup
language is settling on a syntax that's both easy-ish to
implement (I'm a big believer in simpler = fewer bugs) and
syntactically appealing in plaintext format (after all, one of
the main motivations was to make the source archive-ready).

The best example of this was deciding how to write links.

I started off with the easiest implementation, which is also the
least appealing (IMHO). A link looked like this:

    [https://alexkarle.com/blog my blog]

This is super easy to parse one character at a time. In
pseudo-code:

1. If current character is *`[`*, print *`<a href="`*
2. Print all characters (the href) until you see a space/newline
3. Once we see the space/newline, print *`">`*
4. Print all characters (the description) up until the *`]`*
5. Once we see the *`]`*, print the closing *`</a>`*

This fits really nicely into our "parse one character at a time"
model, since each special character in the link corresponds to a
direct piece of HTML to output. However, it's ugly to write links
that have no description, such as:

    [https://alexkarle.com https://alexkarle.com]

To address this, the next evolution added a (stack-allocated)
"link buffer" that would store the href as it was printed so that
if the `]` was hit before a space/newline, it was assumed that
the description was the href and it would print the link buffer
in the place of the description, enabling "bare links" like so:

    [https://alexkarle.com]

I was about to go live on my blog with that iteration because I
liked it _enough_, but the one thing that really bothered me was
that it's hard to read the description after the link. To the
plaintext reader, the description is way more important than the
link!
Especially for long links, it's distracting to have to scan
ahead to continue a sentence.

I really wanted markdown-style links like so:

    [my blog](https://alexkarle.com/blog)

The immediate problem was that the parser can no longer print the
characters as it sees them, since the URL comes after the
description in the input but needs to come before the description
in the output. I realized, however, that this is a similar problem
to the way I used the link buffer for bare links--all I had to do
was store the description in the buffer and play it back after
printing the href. It's the same amount of memory, but a tad more
complex, since the description is allowed to have inline styles,
so before pushing onto the link buffer, we need to check for
styles and push those too (effectively a smaller version of the
main loop).

The final form of markdown links that I'd like to support but
can't is a "postfix link", like so:

    This is a [link] in
    a paragraph
    ...
    [link]: https://alexkarle.com

Since the actual link could be anywhere in the document, this
kind of parsing requires buffering potentially the whole
document, which violates the streaming condition (which I'd like
to keep!), so I stopped short of it.

## Conclusion

I hope you found this discussion of syntax, tradeoffs, and
parsers interesting! I'm sure there's a lot more I can learn and
improve on, but it's been a fun evolution from the `mdoc(7)` I
started with! Check out the
[source](https://git.sr.ht/~akarle/nihdoc) if you're curious. I
expect it'll change rather frequently in the next few weeks, so I
wouldn't advise depending on it yourself (but I wanted to open
source it to share with others as a teaching tool regardless!).

[Back to blog](/blog)