I did it! I'm done! I finished CS50. Now is the time for a feeling of accomplishment, mixed with sadness. GOODBYE, CS50! Will I ever MOOC as well as I MOOCed with thee? Who knows.

Anyway. The final project! Yes! I had a GREAT IDEA for my final project:

To write a programming language in not-English, and thus examine whether not-English languages can convey new programming concepts.

Ambitious! I had originally wanted to do it all in C as well: I envisioned fields of malloc() and free(). But, well, I gave up on that and ended up writing and compiling it to Python. EASY MODE. My middle names.


tldr: Here it is!

triestin


Writing a programming language: In theory

Let's say you open up https://repl.it/, choose a language, write some stuff, and then run it. What, exactly, happens when you run run?

Well: Under the hood, a programming language is doing three things. This is apparently what all languages do. Let's go step by step.

Inputting the string

The zero-th step for any language is accepting a text (string) input. You type in print("hello world") in Python. That string - 'print("hello world")' - gets accepted by the language. Okay!

Lexing/Tokenizing

The next step is tokenizing. This is where some heavy regex comes in. You (the language) need to look at the string and figure out how to split it into its constituent parts. The split is determined by the idiosyncracies of your syntax. That is, how are strings of letters interpreted - is print the name of a variable? A number? Is it a pigeon? Or is it a keyword (that is, a special/reserved word in the syntax)? Spoiler, but print is a keyword in Python - it calls the print() function.

Actually, this is easier with an example. Let's print "hello world" in a few different languages (including triestin!) and write out the resulting tokenization:

# Python
print("hello world")
# Tokens: "print", "(", ""hello world"", ")"

# C
printf("hello world");
# Tokens: "printf", "(", ""hello world"", ")", ";"

# Clojure
(println "hello world")
# Tokens:  "(", "println", ""hello world"", ")"

# Triestin
dimmi hello world fin
# Tokens:  "dimmi", "hello world", "fin"

(Note to self: Add syntax highlighting for triestin...)

In triestin, I saved my tokens as namedtuples with a type (tipo) and a value (val). For example, from the above, if triestin found a dimmi token, it would save it as Token(tipo="print", val="dimmi").

One important thing about tokenizing is that it's ordered - that is, we need to look through our input string and make sure we don't, e.g., mistake dimmi for a variable name. There's precedence, and keywords come first!

Parsing

Okay, so now we have a bunch of tokens. Or rather: we have an ordered string of tokens, like so:

# input string: dimmi hello world fin
Token(tipo="print", val="dimmi"), Token(tipo="identifier", val="hello world"), Token(tipo="fin", val="fin")

Now what? Now we need to parse them. Parsing means taking those tokens and composing them into an abstract syntax tree (AST). That is, we need to translate the input string -> tokens -> something semantic and meaningful which conveys what the user wanted to do. For dimmi hello world fin, we know that fin is the delimiter and anything after dimmi is what we'll want to print. Parsing those tokens is easy then. In pseudocode:

  1. From the stack of tokens, pop off the first one.
  2. Inspect it. Is it dimmi?
  3. If so, keep popping off more tokens until you get to fin.
  4. Put all those tokens into a string.
  5. Save all that into a PrintNode in your abstract syntax tree. The val (value) of the PrintNode will just be the string to be printed.

Easy!

The harder stuff is in parsing function definitions and function calls. Our language needs to be able to support an arbitrary number of nested arguments and functions. This can be accomplished by:

  1. From the stack of tokens, pop off the first one.
  2. Inspect - is it a function definition (ecco)?
  3. If so, parse a function definition. This includes parsing its arguments, its body, and - if the body has a function call in it - looping back, parsing that call's arguments and expressions, and so on. You can keep recursively parsing through a function of functions, etc, to an arbitrary depth of levels.

You can see all this in parser.py.

In the end, though, you'll have an AST with, one hopes, the input code correctly represented as a tree of calls, variables, functions, and so on.

Compiling

And now! What do we do with our AST? This is somewhat tricky, but mostly easy (parsing is the hardest part, I reckon): we crawl through our tree and, node by node, convert each node to the target language - i.e. the language we want to compile to. I chose Python because it's easy! And triestin made the same assumptions as Python (i.e. it was dynamically typed, it didn't handle memory explicitly, etc.).

So how do we compile? Well, we just untangle the node into its constituent parts and then create an output string in Python! With a PrintNode (which parsed dimmi, remember), it's super simple: "print('{}')".format(node.val)! Just pipe that output string to Python and Python will do what it's told! Voila.

One note: like with parsing, compiling is recursive - that is, you have no idea how many nodes you have, and how nested they are. So you gotta keep generating and generating, node by node, until you run out of nodes and have gone down every "branch" of the tree.

Here's compiler.py.


Writing a programming language: In practice

Yeah, right. That all sounds fine. But how was the process?

Well - my process was OK. I took it fairly easy and it took me about six weeks of dabble-here, dabble-there.

commitz

I started with Build Your Own Lisp, which is a (free) book to learn how to write a Lisp-like language in C. I got through 11 (!) of its 16 chapters, but eventually called it quits, since it had just gotten away from me. I had no idea what I was doing, I was confused, I was copy-pecking way too much, and bah. So that was a big dead end.

Thankfully, I found Destroy All Software's great videos - specifically, A compiler from scratch. In that video, DAS writes a language in Ruby that compiles to JavaScript. I know JavaScript and Ruby is similar enough to Python that I could translate it pretty easily. triestin is heavily based on this video.

All the other speed bumps were small potatoes - figuring out how to do math seemed inordinately hard, until I sat down and employed the Lazy Angela Way - i.e. I just directly fed anything that seemed mathy straight to Python and let Python handle it.

I actually ended up learning a lot of handy Python while building this: breaking everything out into modules, learning about when to use and when not to use a Class, working with namedtuples. It was great!


Linguistic side-eye: A detour into anzi and è

anzi

The big point of triestin, the big hypothesis, is that because most (all?) programming languages are written in English, we may be limiting the programmatic concepts that get out there. That is, it's kind of like the Sapir-Whorf hypothesis - tldr, that language affects our thoughts - only for software design. This is beyond the obvious implication that English-speaking programmers have a sociocultural, and not technical, advantage over everyone else just because every damn language is written in their own damn language!

I was super into Sapir-Whorf back in grad school, but I'm less enamored with it now (though I still highly recommend this book). I decided to try to think through some words which are difficult to translate into English. German is famous for having words like this, but, interestingly, I think most languages have words without exact counterparts (languages are not one completely overlapping Venn diagram!). For example:

  • doch in German, which I find impossible to understand totally, but kinda means "the opposite of what you just said".
  • The -va- infix in Hindi, e.g. karna (करना, "to do") vs. karvana (करवाना, "to have done"), which converts verbs from active to passive.
  • Swahili's precisely logical "compounding words": nimekupenda ("I have loved you"), which breaks out into ni- ("I"), -me- ("have"), -ku- ("you") and -penda ("love"). Man, Swahili logic is so cool. (Also, infixes.)

All of these could be turned into interesting programmatic concepts! I think programming languages are more than pure logic (though they are mostly that) - there are language design choices that, in turn, determine the design of products built in that language. And our software design world is, for now, monolingual!

So I thought I'd try to implement a not-English concept. Namely, I thought I'd try to implement a programmatic concept for anzi, an Italian word that kinda means "the opposite of what was just said" (so, I think like doch in German?) but can also mean "rather" or "instead of" or "actually". I feel like, meta-semantically, anzi means "HARK, CONTEXT!" But how to use a contexty word like anzi in a programming language? It would have to know context, i.e. remember what had just been said.

In pseudocode, I decided anzi would be a "reassignment operator". That is, after assigning a variable, x, to some value, y, if the user said anzi z fin, then x would be reassigned to z!

But this meant, ugggghhh, that I would have to figure out how to have state in my language. Ugghghhhh. STTAAAAATTTEEE. Noooooo. Ohhhhhh nooooo.

OK, I complained a lot. It seemed hard. I ended up hacking an awful solution. Here are the steps:

First, in parser.py, let's make an AnziNode and parse it:

def parse_anzi(self):
    """
    `anzi` is kind of like 'actually' or 'rather'
    it's kind of like "doch" (German) - it can signify "scratch that"
    programmatic `anzi` will revise whatever assignment you just made (for now)
    e.g.
    x è 1 fin
    anzi 2 fin
    x fin
    >> 2
    """
    self.consume('anzi')
    value = self.consume_next().val
    self.consume('fin')
    return AnziNode(val=value)

That is, when someone uses anzi in triestin, we'll save the value that they gave us. This is the z value - the replacement value. Now it needs to be reassigned!

In compiler.py, I just fed the AnziNode through, telling the REPL that it was coming and was special and had a value.

Then, in the REPL - here - I did this awful hack:

if __name__=="__main__":

    print("Triestin v0.0.1")
    print("ctrl+C per uscire")

    local_state = State()

    while True:
        try:
            # Python 2 on CLI
            user_input = raw_input('triestin> ')
            output = triestin(source_code=user_input)
            local_state.add_history(output)
            if output['state'] == 'exec':
                exec(output['code'])
            if output['state'] == 'eval':
                if type(eval(output['code'])) == tuple:
                    print(eval(output['code'])[0])
                else:
                    print(eval(output['code']))
            if output['state'] == 'anzi':
                if local_state.history[-2]['node'] == 'AssignNode':
                    prev_node = local_state.history[-2]
                    prev_var = prev_node['code'].split('=')[0].strip()
                    new_code = '{} = {}'.format(prev_var, output['code'])
                    exec(new_code)
        except Exception as e:
            print(e)

Ho jeez. This looks so hideous to me. A local history of whatever has happened in the REPL gets saved and anzi just looks to the previous command. If it was an assignment node (e.g. x è 1) then it reassigns whatever 1 was to the AnziNode value. MADNESS.

(Note to self: let anzi reassign to previous assignment, without that assignment having happened just before...)

Okay, anyway, good enough.

è

My next stab for polylingual justice was replacing something considered very banal and very basic, the assignment operator (normally =), to something that should be equally banal and basic: the Italian word for "is" - è!

That way, someone could write x è 1 fin, which can be read as "x è uno" - that is, "x is one". Easy!

No, not easy. È has an accent over it, which causes English keyboards to fuss and fritz and generally complain. It also caused Python to fritz and complain: WHAT ENCODING IS THIS?! Because, of course, the computer hegemony is a 26-letter English alphabet one, and "è" isn't easy to type on my English keyboard, and rant rant rave. It's one character! Just as efficient as =! But, because of hardware and software design choices, it's a giant pain in the ass to use.

Anyway, you get my point. Now every time someone has to Google "e accent grave" to get è so they can copy it over into the triestin REPL, THEY SHALL KNOW MY RIGHTEOUS INDIGNATION.


Resources

Computer languages

Human languages


This post is part 5 of the CS50 series:

  1. Starting CS50
  2. CS50 - Algorithms, pointers and malloc, oh my
  3. CS50 - Data structures achievement unlocked
  4. CS50 - No more lectures 😭
  5. Introducing triestin