
The path to structured content with Markdown

This post was inspired by the Pros and Cons of Using Markdown for Technical Documentation Panel Discussion with Ed Marsh, Eric Holscher, and Fabrizio Ferri-Benedetti. Around the 40-minute mark, the discussion moves on to how to maintain a consistent style with Markdown documentation, especially in larger teams.

The panel agrees that currently there are no clear ways to enforce a specific document structure when using Markdown.

Fabrizio then goes on to say (54:15):

Some day, someone is going to figure out a way to seamlessly grow Markdown into something that can mature into structured content

I’ve been thinking about this a lot, and my claim is that Markdown and XML (which is traditionally used for structured authoring) are more similar than you would initially expect, and that the dream of adding validations and structure to Markdown is perhaps not too far away.

The structure behind Markdown

There’s a commonly held belief that Markdown is not structured. There is some truth to this, but I’d argue it’s more the case that the structure is hidden. While with most XML-based authoring tools you are manipulating the structure directly (or through a WYSIWYG editor), with Markdown we are one level removed from the actual underlying structure.

To demonstrate, let’s look at a quick example of how Markdown is actually almost equivalent to XML, if you squint a little.

Let’s take this Markdown content.

# Hello, world

This is a paragraph **with some bold text**

![and this is an image](cat.jpg)

When this Markdown is processed into HTML, it goes through a number of transformations. First, the Markdown is converted into an Abstract Syntax Tree (AST). This means taking the raw text of the Markdown and converting it into something more structured that a program can manipulate.

Each Markdown parser does this a little bit differently, but we can see this in action with, for example, the AST Explorer tool. Let’s take the Markdown above, paste it into this tool, and select the JSON output format to inspect the AST.

We get something like this (I’ve removed some unimportant fields for brevity).

{
  "type": "root",
  "children": [
    {
      "type": "element",
      "tagName": "h1",
      "children": [
        {
          "type": "text",
          "value": "Hello, world"
        }
      ]
    },
    {
      "type": "element",
      "tagName": "p",
      "children": [
        {
          "type": "text",
          "value": "This is a paragraph "
        },
        {
          "type": "element",
          "tagName": "strong",
          "children": [
            {
              "type": "text",
              "value": "with some bold text"
            }
          ]
        }
      ]
    },
    {
      "type": "element",
      "tagName": "p",
      "children": [
        {
          "type": "element",
          "tagName": "img",
          "properties": {
            "src": "cat.jpg",
            "alt": "and this is an image"
          }
        }
      ]
    }
  ]
}

The Markdown has been transformed into a tree structure. The top-level object is of type root and has a list of children. It has three children: the first is the heading, and the other two are paragraphs. The first paragraph contains some text, part of which is bold, and the second contains an image.
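To make "something a program can manipulate" concrete, here is a minimal Python sketch that walks the tree and prints one indented line per node. It assumes the AST has been loaded into a dict with the same fields as the JSON above; the `outline` function is a name made up for this example, not part of any Markdown library.

```python
# The AST from above as a Python dict (same fields as the JSON).
ast = {
    "type": "root",
    "children": [
        {"type": "element", "tagName": "h1", "children": [
            {"type": "text", "value": "Hello, world"},
        ]},
        {"type": "element", "tagName": "p", "children": [
            {"type": "text", "value": "This is a paragraph "},
            {"type": "element", "tagName": "strong", "children": [
                {"type": "text", "value": "with some bold text"},
            ]},
        ]},
        {"type": "element", "tagName": "p", "children": [
            {"type": "element", "tagName": "img", "properties": {
                "src": "cat.jpg", "alt": "and this is an image",
            }},
        ]},
    ],
}


def outline(node, depth=0):
    """Return one indented line per node, depth-first."""
    if node["type"] == "text":
        label = f'text: {node["value"]!r}'
    elif node["type"] == "element":
        label = node["tagName"]
    else:
        label = node["type"]  # e.g. "root"
    lines = ["  " * depth + label]
    for child in node.get("children", []):
        lines.extend(outline(child, depth + 1))
    return lines


print("\n".join(outline(ast)))
```

The printed outline is the "hidden" structure of the document: a root, a heading, and two paragraphs, each with their own children.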

Now, if we compare this to an equivalent DITA topic:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE topic PUBLIC "-//OASIS//DTD DITA Topic//EN" "topic.dtd">
<topic id="hello-world">
    <title>Hello, world</title>
    <body>
        <p>This is a paragraph <b>with some bold text</b></p>
        <p>
            <image href="cat.jpg" alt="and this is an image"/>
        </p>
    </body>
</topic>

The JSON AST is a bit more verbose, but fundamentally the same structure and information is there. In fact, you could convert the JSON into XML and end up with something very close to DITA.
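As a rough illustration of that conversion, the sketch below turns the AST from earlier into XML using Python's standard `xml.etree.ElementTree` module. The `to_xml` function is a hypothetical helper written for this post. It keeps the HTML-style tag names as they are, so getting to real DITA would still mean renaming elements and wrapping the content in a `body`.

```python
import xml.etree.ElementTree as ET

# Compact copy of the AST shown earlier.
ast = {"type": "root", "children": [
    {"type": "element", "tagName": "h1", "children": [
        {"type": "text", "value": "Hello, world"}]},
    {"type": "element", "tagName": "p", "children": [
        {"type": "text", "value": "This is a paragraph "},
        {"type": "element", "tagName": "strong", "children": [
            {"type": "text", "value": "with some bold text"}]}]},
    {"type": "element", "tagName": "p", "children": [
        {"type": "element", "tagName": "img", "properties": {
            "src": "cat.jpg", "alt": "and this is an image"}}]},
]}


def to_xml(node):
    """Convert a root or element node into an ElementTree element."""
    elem = ET.Element(node.get("tagName", node["type"]),
                      node.get("properties", {}))
    last_child = None
    for child in node.get("children", []):
        if child["type"] == "text":
            # ElementTree stores text either on the parent (before the
            # first child element) or as the "tail" of the previous child.
            if last_child is None:
                elem.text = (elem.text or "") + child["value"]
            else:
                last_child.tail = (last_child.tail or "") + child["value"]
        else:
            last_child = to_xml(child)
            elem.append(last_child)
    return elem


xml_string = ET.tostring(to_xml(ast), encoding="unicode")
print(xml_string)
```

The mapping is almost mechanical: node types become tags, `properties` become attributes, and `children` become nested elements.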

This becomes even more clear when this AST finally gets converted to HTML as part of the Markdown to HTML pipeline:

<h1>Hello, world</h1>
<p>This is a paragraph <strong>with some bold text</strong></p>
<p><img src="cat.jpg" alt="and this is an image"></p>

Markdown is just a level “above” XML, but it is representing very similar information.

The horror of Markdown templating

With bigger projects you eventually want to enable some kind of content reuse.

The issue is that most tools incorrectly assume Markdown can be processed as plain text. This causes surprising breakage and prevents robust templating and content reuse.

Let me demonstrate.

Let’s say you have a warning snippet that you want to reuse across your pages:

**Warning**

Do not attempt without expert supervision

Next, let’s say I’m creating a list of steps in Markdown, and I want to include this snippet (I’m assuming Liquid syntax here).

1. Press the big red button

   {% include "snippets/warning.md" %}

2. Perform questionable science

3. Profit??

This will, slightly surprisingly, not work, and your list will be broken:

  1. Press the big red button

    Warning

Do not attempt without expert supervision

  1. Perform questionable science

  2. Profit??

Note how the list is not ordered correctly, and the second line of the snippet is not rendered inside the list.

The reason for this is that, unlike XML, Markdown is whitespace-sensitive. When Liquid does the string template replacement, what you are actually producing is this:

1. Press the big red button

   <!-- snippet is injected here -->
   **Warning**

Do not attempt without expert supervision   <!-- Incorrectly indented -->

2. Perform questionable science

3. Profit??

The include snippet was just replaced verbatim with the content of the snippet file.

But because the indentation did not exist in the snippet, the second line is not indented correctly in the list, breaking the flow of the ordered list.

Any robust template system used for Markdown needs to be aware of the Markdown syntax itself.
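To show how little awareness is needed to fix this particular case, here is a small Python sketch of an include expander that re-indents every line of the snippet to match the include tag. This is a simplification I am assuming for illustration, not how Liquid, Doctave, or Markdoc are implemented: a fully Markdown-aware system would work on the parsed tree rather than on text. The `expand_includes` function and the in-memory `snippets` dict are both hypothetical.

```python
import re

# Matches a line that contains only an include tag, capturing its indentation.
INCLUDE = re.compile(
    r'^(?P<indent>[ \t]*)\{%\s*include\s+"(?P<path>[^"]+)"\s*%\}[ \t]*$',
    re.MULTILINE,
)


def expand_includes(source, snippets):
    """Replace include tags, indenting each snippet line like the tag."""
    def replace(match):
        indent = match.group("indent")
        body = snippets[match.group("path")]
        # Indent non-blank lines only, so blank lines stay truly empty.
        return "\n".join(
            indent + line if line.strip() else line
            for line in body.splitlines()
        )
    return INCLUDE.sub(replace, source)


snippets = {
    "snippets/warning.md": "**Warning**\n\nDo not attempt without expert supervision\n",
}
page = (
    "1. Press the big red button\n"
    "\n"
    '   {% include "snippets/warning.md" %}\n'
    "\n"
    "2. Perform questionable science\n"
)
expanded = expand_includes(page, snippets)
print(expanded)
```

With the indentation carried over, both lines of the warning stay inside the first list item, so the numbering of the following steps is no longer interrupted.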

This is something we learnt the hard way at Doctave, and fixed in our 2.0 release. Our template and component system is fully Markdown-aware, which removes pitfalls like the one above. Stripe also solved this with Markdoc. Both options can handle the case described above - they can see “ah, I’m inside a list, so I know that the paragraph must be made a child of the list item” instead of doing naive string replacements.

This should be considered table stakes for any authoring system using Markdown.

Enforcing structure

Now we get to the part that does not exist today: how can you ensure and validate a consistent structure in your Markdown-based documentation project? There’s a great section in the panel discussion above at around 45 minutes on this topic.

DITA and other structured authoring tools were designed for this: using schemas and other structures to ensure that your content conforms to a specific standard. Be it for maintaining consistency and/or ensuring legal or regulatory compliance, this is what lets you scale large authoring teams and projects.

Markdown, on the other hand, doesn’t support this. While we can use tools like Vale to help us enforce a style guide, constraining the actual structure of the Markdown isn’t something that we can do.

Think about how in Markdown you would ensure that “all our how-to guides must have an h1 title, followed by one or more paragraphs, followed by one or more steps to achieve the guide’s goal”, and then consider how easy it is in DITA.

This is just not really possible today, at least in any popular Markdown-based framework.

There is a path

Bridging the gap between Markdown and structured authoring will require building new tooling and standards. It’s unlikely that CommonMark or any other popular Markdown flavor would consider going in this direction, so we’ll have to create tools around Markdown.

We at Doctave have some ideas about how to achieve this, and have a roadmap on how to get there. At a high level there are a few things we would need:

  • ✅ A parser and template system that is Markdown-aware
  • ❌ A language for describing constraints and rules for your Markdown content
  • ❌ An engine that enforces those rules on your content

We’re already part of the way there!
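To make the two missing pieces a little less abstract, here is a toy Python sketch of both, using the how-to guide rule from earlier. The "constraint language" is just a regular expression over the top-level tag names of an AST like the one shown above, and the "engine" is a single function. All names here (`element`, `shape`, `HOW_TO_GUIDE`, `is_how_to_guide`) are invented for the example, and a real system would need far better error reporting.

```python
import re


def element(tag, *children):
    """Build an AST element node (helper for this example only)."""
    return {"type": "element", "tagName": tag, "children": list(children)}


def shape(ast):
    """Describe a document as the sequence of its top-level tag names."""
    return " ".join(
        c["tagName"] for c in ast["children"] if c["type"] == "element"
    )


# Hypothetical rule: an h1, then one or more paragraphs, then an ordered list.
HOW_TO_GUIDE = re.compile(r"h1( p)+ ol")


def is_how_to_guide(ast):
    """Check the top-level shape, then require at least one step."""
    if not HOW_TO_GUIDE.fullmatch(shape(ast)):
        return False
    elements = [c for c in ast["children"] if c["type"] == "element"]
    steps = elements[-1]  # the trailing ordered list
    return any(c.get("tagName") == "li" for c in steps["children"])


good = {"type": "root", "children": [
    element("h1"), element("p"), element("ol", element("li"), element("li")),
]}
missing_intro = {"type": "root", "children": [
    element("h1"), element("ol", element("li")),
]}

print(is_how_to_guide(good))
print(is_how_to_guide(missing_intro))
```

The second document fails because it jumps straight from the title to the steps without an introductory paragraph.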

Let’s imagine a world for a moment where you could add a little annotation to your Markdown file and have a format enforced:

---
schema: ../schemas/task.schema
---

# Uploading files

...

Or perhaps an MDX-inspired format where Markdown is embedded inside HTML-like tags:

<Task>
  <Task.Title>
    Uploading Files
  </Task.Title>

  <Task.Description>
    To upload files using our [portal](https://www.example.com)...
  </Task.Description>
</Task>

And your format would be validated automatically:

Document validation error

  <Task>
  ▲▲▲▲▲▲
    └─ Missing `Title` field

This would let you enjoy the low barrier to entry of Markdown, but scale up to a more robust authoring system, layering on rules and schemas as required.
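A check like the one above could start out very small. The Python sketch below validates required fields on a `Task` component. It leans on a shortcut: the example document happens to be well-formed XML, so the standard library's XML parser can read it, whereas real MDX would need a proper parser. The `REQUIRED_FIELDS` schema and the `validate` function are hypothetical, not an existing Doctave feature.

```python
import xml.etree.ElementTree as ET

# Hypothetical schema: the fields each component must contain.
REQUIRED_FIELDS = {"Task": ["Title", "Description"]}


def validate(source):
    """Return a list of error messages for missing required fields."""
    root = ET.fromstring(source.strip())
    present = {child.tag for child in root}
    return [
        f"<{root.tag}>: Missing `{field}` field"
        for field in REQUIRED_FIELDS.get(root.tag, [])
        if f"{root.tag}.{field}" not in present
    ]


document = """
<Task>
  <Task.Description>
    To upload files using our [portal](https://www.example.com)...
  </Task.Description>
</Task>
"""

errors = validate(document)
print(errors)
```

Because the document above has no `Task.Title`, the validator reports a missing `Title` field for the `Task` component.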

Fundamentally, this is possible. We just need to go out and build it.

I find this really exciting, and something we are exploring at Doctave. If you have ideas or opinions about this area, or just want to nerd out about Markdown toolchains, reach out! You can email me at nik@doctave.com.
