Skip to content


Creating a Programming Language – Part 1: Overview of the first project

After deciding that I needed to do a few one-off projects to learn my way around the language techniques I’m not quite sure how to implement, it struck me it might be fun to do a series to go along with them.

Since I tend to overdue it, I’m going to stick to a “keep it simple” philosophy. Some ground rules:

  1. The language is not allowed to be general purpose, or particularly useful.
  2. We’ll intentionally pick the easiest option when we have a choice to make. This is to get us going as fast as possible.
  3. But we’ll still hand-code it, since, well, I enjoy the visceral feel of code when I’m learning something.

Sifula – The Simple Functional Language

To kick it off, let’s start with a simple functional language. When you first start putting a language together, you’ll notice that you spend quite a lot of time coming up with syntax that fits your requirements, which you then have to tweak as you learn where things might be ambiguous or confusing – unless, that is, you borrow from another project. Since Rule #2 is to pick the easiest option, here we’ll borrow from Haskell:

add x y = x+y
add 3 4

From the example we start making our list of rules:

  • The ‘=’ creates a definition of the left hand side for the right hand side.
  • The function is the left most item in a list separated by spaces, eg “add 3 4″ and “add x y”.
  • Items can be separated by operators to form math expressions, eg “x+y”.
  • Variables are created during a definition.
  • We’ll only allow integers as values.

We’ll also add one more: “an expression by itself is printed to the screen”, so that we can see the output.

Kicking it off: the lexer

The first step, now that we have a feel for the language, is to build an interpreter for it. To get started with that, we first need the ability to take the source and break it into pieces called ‘tokens’. This is generally called the lexer or tokenizer. Taking the example string above:

add x y = x+y

What comes out of the lexer will be a linked list that looks like:

[add] - [x] - [y] - [=] - [x] - [+] - [y]

You’ll notice that though we’re using the whitespace as a delimiter, we don’t keep it around. This is because we generally keep around information only as long as it’s useful. Once it starts getting in the way, we discard it. If your code senses are tingling now, that’s good… we’ve already done something we probably shouldn’t have: we’ve thrown away too many spaces. If we keep important spaces around, we can use them as “operators” just like plus or equals. The trick is knowing which whitespaces to keep and which to throw out.

To do this, we’ll do the lexer in two stages (again, to keep things simple). The first stage will be to lex w/ spaces intact, so our revised example will look like this:

[add] - [ ] - [x] - [ ] - [y] - [ ] - [=] - [ ] - [x] - [+] - [y]

We care about only the first two spaces, since they give us information about the function and its arguments. We don’t generally care about the spaces around the ‘=’. In our second pass, we’ll keep only whitespace that is both preceded and followed by identifiers (eg ‘add’, ‘x’, and ‘y’ in our example). This gives us:

[add] - [ ] - [x] - [ ] - [y] - [=] - [x] - [+] - [y]

Parsing: making trees

For our linked list to become useful, we need to get a better feel for the importance of the various tokens. We can see in the above list that between every identifier is an operator of some sort. To uncover the precedence, we’ll borrow the idea of order of operations from math, and extend it to work with the other operators in our language. The result gives us, from most precedence to least:

  1. ‘ ‘ (the space)
  2. ‘+’
  3. ‘=’

Which lets us build the tree:
picture-1The lower the precedence, the higher (or more to the left) the operator will be in our tree. This tells us that its children are of higher precedence and must be completed first.

Analysis: what’s it mean?

We’ve turned our line of text into a tree, so the next thing we want to do is to teach our interpreter to understand it. At this point, where analysis begins and ends and where our interpreter just starts “doing” things can get a little fuzzy. Following Rule #2, we’ll keep it simple and say for this project that analysis is execution. Our evaluator will look at each node, evaluate its children, and then evaluate it.

Every time we evaluate a node we’ll want to return two things: the answer to the evaluation and its type. “Why it’s type?” you might ask. As it turns out, there’s actually more than one type already. The first, ‘int’, we already know about. That’s the type of the numbers we’ll be using. But there’s a second type hidden in there as well: the return type of ‘=’. This one is more like a status type, letting us know whether or not a definition was successful[1].

At the top-most level, once we’ve solved the node that begins the expression we’re looking at, we’re free to print its value to the screen if it has one. We’ll know whether it’s an expression or a definition by looking at its result type.

Then we’re done

We just got started, and we’re already done. Once we put those three pieces in place, we’ll have a working interpreter for our toy language that can create and solve functions. Nothing particularly tricky so far, so the trick instead will be to make a solid foundation we can grow with new features without making too much of a mess. Before we get sophisticated, though, we first need to see some results. With the next article we’ll start coding the three parts for our interpreter. Once those are done, we should have a working demo.

The Sifula Project

  1. Overview of the first project
  2. Lexing Sifula
  3. Parsing Sifula
  4. Evaluating Sifula
  5. Sifula in Practice
  1. As we’ll find out, this status type is implied in most of the expressions we’ll use. []

Posted in Research.

Tagged with .


0 Responses

Stay in touch with the conversation, subscribe to the RSS feed for comments on this post.



Some HTML is OK

or, reply to this post via trackback.