I'm writing a compiler?!

Or: An Incremental Approach to Compiler Construction

I appear to be writing a compiler. It’s not entirely by accident, but it’s not entirely intentional either. I’ve been interested in compilers for a long time, but I haven’t learned assembly, so most of my experiments have been compiling to different languages (like C), and interpreters.

But now I’ve found a paper that builds a compiler in 24 incremental steps:

An Incremental Approach to Compiler Construction, by Abdulaziz Ghuloum

It’s about writing a simple compiler for a sizable subset of Scheme (up to an interpreter), to raw x86 (32 bit) assembly.

I think it’s awesome!

My compiler targets x86_64, because that’s what my laptop is running. For now that seemed to amount to using the wide registers whenever pointers/memory locations are in play, which means at several steps I got loads of segfaults.

I’m currently at step seven, which introduces heap allocation, and with that several types that aren’t representable with just a stack. (Or maybe they are, but not without many difficulties, at least in this compiler.)

It’s fun, it’s challenging, but it is also doable, which means that I (mostyl) understand what it is doing, without having much experience with assembly. I did know Scheme and compilers, though, but in theory this should also be possible if it were a compiler for Python. (In fact, I might port the first few sections to Python, so people will look at that, and then hopefully continue with the rest of the paper.)

A taste of the compiler

So, how do things work?

In the beginning, there was nothing. Except, the paper starts with just numbers, just a function returning a numeric constant:

// integer.c

int scheme_entry() {
    return 42;
}

What kind of assembly does that generate?

$ gcc integer.s -S
$ cat integer.s
    .file   "integers.c"
    .text
    .p2align 4,,15
    .globl  scheme_entry
    .type   scheme_entry, @function
scheme_entry:
.LFB0:
    .cfi_startproc
    movl    $42, %eax
    ret
    .cfi_endproc
.LFE0:
    .size   scheme_entry, .-scheme_entry
    .ident  "GCC: (GNU) 7.2.0"
    .section    .note.GNU-stack,"",@progbits

While that may seem like a whole lot of gibberish, what’s important are three lines:

// scheme.s

scheme_entry:          ; define a label called "scheme_entry"
    movl $42, %eax     ; move the number "42" into the "eax" register
    ret                ; return

And now, we can call it from C:

// driver.c

#include <stdio.H>

extern int scheme_entry();

int main(int argc, char **argv) {
	int val = scheme_entry();
  	printf("%d\n", val);
  	return 0;
}

And sure enough, it prints 42:

$ gcc scheme.s driver.c -o scheme
$ ./scheme
42

To me, that was amazing! I didn’t know a thing about assembly, and here I was, writing a compiler, which about a week later supported ifs and let. (That was yesterday, today it’s learning about heap allocation.)

It’s not all sunshine

Here come the caveats.

You will have a much easier time if you can work in a 32 bit x86 environment, because that’s what the compiler in the paper targets. With a little more effort and debugging segfaults it is possible to port it to 64 bits, but if you can get a compiler for a 32 bit environment, I’d recommend that.

The paper is rather sparse in places. For example, it explains how to encode numbers and shows code for that, but then it leaves you on your own to write the code for booleans, characters and the other types. It’s doable, but it seemed a bit daunting to me at first. However, I think that was very much a didactic decision to not include all the code, because that will require you to think about what you’re doing.

I also think I may have found a few mistakes, but I’m not entirely sure about those, partially because I work on a differrent architecture. But it works.

Resources

Want to write your own compiler as well? Great, here are some pointers to helpful things:

First and foremost, the paper itself.