ata.soy

Simple Compiler Construction - md2jsx

8 min readAta Atasoy
Markdown to JSX Compiler Architecture

TL;DR: After two months of vibe coding, I decided to dive into computer science fundamentals by building a Markdown to JSX compiler. This article walks through the process of creating a simple compiler, understanding lexers, parsers, and code generation, all while using only NVIM and no AI assistance.

Introduction

I had enough of vibe coding for around two months, so I wanted to take control back and learn more about computer science. So I built a simple Markdown to JSX compiler. I'm currently studying programming language fundamentals where we also go through this stuff! So it was the perfect time to get my hands dirty with it.

So the challenge is:

What is a compiler?

Software that translates files from one language to another (Markdown to JSX in our case). A typical example would be the Go compiler, where Go source code is compiled into executable machine code.

It has several phases and components. Here are the main ones:

Lexer (scanner):

Parser (analyzer):

Code Generation:

md2jsx Outline

So here is the outline of the project:

I have 3 main files:

Lexer

First of all, we should define our types:

type token =
  | { type: "header"; level: number; content: string }
  | { type: "paragraph"; content: inline_token[] };

// inline tokens
type inline_token =
  | { type: "bold"; content: string }
  | { type: "highlight"; content: string }
  | { type: "normal"; content: string };

// modifier dict
const modifiers: { [key: string]: string } = {
  "*": "bold",
  "=": "highlight",
};

For this small compiler, I will add support for headers and some inline modifiers like * and =.

Diving into lexer function

As the name scanner suggests, first we should scan the file and start iterating through it. I decided to use the Deno runtime for this project!

const content = await Deno.readTextFile("example3.md");

I need some way to determine whether we're at a header or a paragraph. My idea was to check if the character is # if so, it means we're at a header. But this implementation is prone to bugs and edge cases. For example, if # appears in a paragraph, our compiler could break :((

// get header level
function get_header_level(index: number, content: string): number {
  if (peek(index, content) != "#") {
    return 1;
  }
  return get_header_level(index + 1, content) + 1;
}

Here is my cool recursive function to get the header level (# → 1, ## → 2, ### → 3). After getting the level of the header, we should eat (skip) those indexes so we don't repeat ourselves. For example:

const header_level = get_header_level(i, content); // calculate header level eg. # ## ###
i += header_level; // jump so we dont check # again

If we're not at a header, it means we're scanning a paragraph. Here, we should also handle modifiers like bold and highlight.

if (modifiers[char] != undefined) { 
// checking if char is modifier
        // check if normal content exisits
        if (paragraph_content != "") {
          let sub_inline_token: inline_token = {
            type: "normal",
            content: paragraph_content,
          };
          inline_tokens.push(sub_inline_token);
          paragraph_content = "";
        }
        let sub_content_data = get_sub_content(i, content, char);
        let sub_inline_token: inline_token = {
          type: modifiers[char],
          content: sub_content_data[0],
        };
        inline_tokens.push(sub_inline_token);
        i += sub_content_data[1];

        continue;
      }

      // normal text
      paragraph_content += char;
    }

Generator

AST Output from Lexer

This is our AST output from the lexer. Now we need to generate JSX from it. The generator is pretty straightforward. I chose to use map methods to handle this, but there are other different and possibly easier ways as well.

export function generator(ast: Token[]): string {
  const jsx = ast.map((node) => {
    switch (node.type) {
      case "header":
        return `<h${node.level}>${node.content}</h${node.level}>`;

      case "paragraph":
        const inner = node.content
          .map((sub_node) => {
            switch (sub_node.type) {
              case "normal":
                return sub_node.content;
              case "bold":
                return `<strong>${sub_node.content}</strong>`;
              case "highlight":
                return `<mark>${sub_node.content}</mark>`;
            }
          })
          .join("");
        return `<p>${inner}</p>`;
    }
  });

  return `export default function Md2jsx(){
      return (${jsx.join("\n")} )}`;

Summary

So, to build a compiler, you need to code a lexer, parser, and generator. If you are building a fairly simple compiler, you might not need a separate parser for syntax analysis and AST generation.

In my project, I'm scanning (lexing) a .md file and generating .jsx from the AST it produces. It's kind of extensible with token types, and it can even support custom tokens and styles.

My main motivation for this is that I write my blogs in Markdown, but I want to host them as static files in my Next.js app. So this is a great starting point for that!

You can check out the whole project on my GitHub: https://github.com/atasoya/md2jsx-compiler

Related Topics

  • Compiler Design: The study of how to build compilers and interpreters, including lexical analysis, parsing, and code generation.
  • Markdown Parsing: Understanding how to parse and process Markdown syntax into other formats.
  • AST (Abstract Syntax Tree): A tree representation of the abstract syntactic structure of source code.

Keep Learning

If you enjoyed this article and want to learn more about compiler construction, check out these resources:

Have questions or feedback? Feel free to email me or connect with me on LinkedIn.