Clojure + cryptography = maximum fun

In March 2017, I attended the Boston ClojureBridge. Clojure is a functional programming language; it's a dialect of Lisp, written on top of the Java Virtual Machine. In October 2017, I attended Clojure/conj in Baltimore. I don't really use Clojure in my day-to-day programming (yet!), but I've definitely been flirting with it for a while now.

The tldr on why Clojure has been interesting and useful to me is:

A different programming paradigm. I learned languages in the following order: Stata -> JavaScript -> Python. I've been resting comfortably in PythonLand since 2014. It's a wonderful place to be. But it's also, ya know, only one way of doing things! ClojureBridge exposed me to a totally different way of conceiving of code. The ideas of an immutable state, of everything as a function, of functions as arguments, and even just the damn parentheses, were mind-bending and amazing. It's helped me write better Python.
Exposure to more machine. Since I don't have a degree in computer science (yet!), there's a whole big wide world out there in The Machine. I didn't really know anything about compilers until watching a couple lectures of CS50 and writing a bit of C. I still don't know much! Like, I don't really know what the JVM is. So Clojure is a fun way for me to explore!
Ground floor data science. Several Clojure/conj speakers talked a big game about how Clojure (or functional programming, anyway) and deep learning are a, ahem, match made in heaven. This was super interesting! Most of the Clojure deep learning, stats and machine learning libraries are still a bit alpha, but it was great to see the enthusiasm. And writing stuff from scratch is the best way to really learn and understand something (see below...).

Anyway, I wanted to write up my ClojureBridge notes because I ended up FALLING IN LOVE with the project from that day: some simple cryptography tasks. CRYPTOGRAPHY IS AMAZING, PEOPLE!

Whirlwind Clojure syntax

Everything is a function. Some functions are built-in (like +). Everything is parentheses. PARENS!

; This is a comment
;; This too

; General syntax
(function arg1 arg2 ..)

; Example
(+ 1 2)
>> 3

; Defining a variable (evaluated only once)
(def answer 42)

; Defining a function (evaluated every time code is run)
(defn hello-world [name] (println (str "Hello, " name  "!")))

; Printing a docstring
(doc func-name)

Project: Cryptography

The day's project (github/track2-ciphers) was written by Elena Machkasova - thank you, Elena! The project takes you through some basic cryptography tools and puzzles: specifically, you write functions to encrypt/decrypt texts using both the Caesar cipher and the Vigenère cipher. It also touches on cryptanalysis using frequency analysis. In the end, Elena presents an encrypted text, and you try to Clojure your way to a solution.

It took me about ~15 hours to crack the final puzzle; 4 of them were at the actual ClojureBridge workshop in Boston, where I bumbled through Clojure and then got pretty tired by the afternoon, and the remaining 9 hours were me at home, puttering away for a few more evenings. Decrypting was hard, but SO FUN too. And the whole journey was FANTASTIC. I loved learning about Clojure, and I loved learning about cryptography - something I apparently love!? It was like the Platonic ideal of a day-long workshop. I felt super inspired by the end, and like I had enough tools and a motivating project to continue.

I'll go through my method, with some notes on both Clojure the language and cryptography the mission.

Caesar cipher: Encryption and decryption

The Caesar cipher replaces each letter with the letter n places to the left. You select your n and that's your encryption key. To decrypt, just shift everything n steps to the right. For example, if we choose a key of 2, we shift everything to the left by 2 steps and get an alphabet like so:

PLAIN: abcdefghijklmnopqrstuvwxyz
CIPHER: cdefghijklmnopqrstuvwxyzab

PLAIN: hello, world!
CIPHER: jgnnq, yqtnf!

If you choose a key greater than 26, you can divide by 26 and use the remainder. That is, modulo 26.

Here's a Caesar cipher in Clojure:

;; Return the ASCII position of the first letter of the alphabet, A
(def ascii-a (int \a))

(defn abs-value
  [x y]
  (if (> x y) (- x y) (- y x)))

(defn to-int
  "takes a lowercase letter character and returns its position in the alphabet: a = 0, b = 1, etc."
  [c]
  (- (int c) ascii-a))

(defn to-char
  "take a number and return the lowercase letter character that it refers to"
  [num]
  (def abs_num (abs-value num 0))
  (if
   (<= abs_num 25)
   (char (+ abs_num ascii-a))
   (println "Warning: Number must be between 0 and 25 (inclusive)")))

(defn shift
  "shifting a letter to the left by the key"
  [char key]
  (to-char (mod (+ (to-int char) key) 26)))

(defn caesar-encrypt
  "encrypt a word with a key using Caesar cipher"
  [word key]
  (apply str (mapv #(shift % key) word)))

(defn caesar-decrypt
  [word key]
  (caesar-encrypt word (* key -1)))

Cracking the Caesar cipher

The Caesar cipher is pretty straightforward to crack. One way would be to just try all 25 possible keys (1 - 25). Even if the encryption key was originally >26, the decryption key will be modulo 26 of it.

Another way, if you don't want to brute force it, is to conduct frequency analysis. Letters don't appear randomly in a language; they have distinct frequencies. In English, e, t and a occur way more often than letters like x, q and z.

-- Letter frequency in the English language

If your ciphertext is long enough, you could conduct frequency analysis and compare the resulting frequencies to the plaintext English letter frequency above. Like, you could run a Chi-square test for comparing whether two discrete distributions are the same.

Side note but this was SUPER GLORIOUS to me! What an amazing use of statistics! I felt like John Nash in that touching scene in A Beautiful Mind where he exclaims, astonished, "I would never have thought of that!" when the Nobel Prize guy tells him of all the amazing uses of Nash equilibria. (And, yes, they are pretty wild - behavioral ecology?! cancer?!)

Also, side side note but Nash - of course - did a lot of crypto. GLORIOUS CRYPTO HISTORY!

Vigenère cipher

The Vigenère cipher is an extension of the Caesar cipher. Well, sorta. Instead of choosing a key, n, which we shift alphabet letters by, we instead choose a key phrase (a password, if you will...) and use that to jumble up our plaintext. Specifically, the password letters are added to the plaintext letters; that sum is the ciphertext.

PLAIN:  abcdefghijklmnopqrstuvwxyz
KEY:    passwordpasswordpasswordpa
CIPHER: pbuvatxkxjcdibfsfrklqjnanz

And in Clojure:

(defn encrypt-letter
  "Add two letter's alpha positions together, convert that sum to new letter"
  [c1 c2]
  (def summed (reduce + (mapv #(to-int %) [c1 c2])))
  (to-char (mod summed 26)))

(defn decrypt-letter
  "Subtract two letter's alpha positions together, convert that difference to new letter"
  [c1 c2]
  (def subbed (reduce - (mapv #(to-int %) [c1 c2])))
  (to-char (mod subbed 26)))

(defn vigenere-encrypt
  "Cycling through a keyword, encrypt by summing the plaintext letter + keyword letter"
  [text keyword]
  (def clean_text (get-letters text))
  (apply str (mapv #(encrypt-letter %1 %2) (cycle keyword) clean_text)))

(defn vigenere-decrypt
  "Cycling through a keyword, decrypt by subtracting the plaintext letter + keyword letter"
  [text keyword]
  (apply str (mapv #(decrypt-letter %1 %2) text (cycle keyword))))

Cracking the Vigenère cipher

Cracking the Vigenère cipher is similar to cracking the Caesar cipher: frequency analysis. (There's other methods, but I haven't worked through them yet.) This only works if you have:

Some idea of how long the key phrase is. (Otherwise, you could be conducting freq analysis for a very long time indeed.)
A ciphertext that is pretty long. (Otherwise, your sample size will be too small and your freq analysis too noisy.)

In Elena's example, she gave us two hints: the key phrase was between 2 and 8, and the ciphertext was related to ClojureBridge. She also gave a great guide on why subsequence frequency analysis works - this blew my mind.

Trying to explain it back to myself, I broke down my cryptanalysis into these steps:

You guess a key of length n (e.g. [k, e, y] would have length, n = 3), and break up your ciphertext into n subsequences.
Each subsequence is all the letters that that key character (e.g. k) is encrypting.
Use the Chi-square test to check whether the distribution of letters in your subsequence is similar to the English language. If you see a near uniform distribution of letter frequencies, it’s probably the wrong n - that is, if you just got random mush back, you didn't pick the right key length. Start again.
If you do see patterns in your subsequences' histograms, then you can check the Caesar encryption distance between the most frequent letter in your subsequence and the most frequent letters in English (e, t, a). You then take the subsequence of the nth+1-th letters, and predict which letter the next character in key ([k,e,y] -> e) is.
You put together all the most-frequent-letters by subspace into your n-length key (which should, hopefully, sound like a word).
Then you apply your key to decrypt the entire text.

(This speaks to the value of long passwords and the inherent insecurity of having lots of data about ourselves out there. The longer your key phrase, the smaller the subsequences, the noisier the frequency analysis. The more encrypted data you put out there about yourself, the more data is available for cryptanalysis. I know the Vigenère cipher isn't how modern encryption works, but the general principle did give me pause.)

And in Clojure:

(defn count-letters
  "Count the number of times a letter, char, occurs in a string, s"
  [char s]
  (count (filterv #(= char %) s)))

(defn freq
  "Return frequencies of each letter in a string, s"
  [s]
  (zipmap alphabet (map #(count-letters % s) alphabet)))

(defn freqs
  "Sort letter frequencies (desc) in a string, s"
  [s]
  (sort-by second > (freq s)))

(defn take-nth
  "Split text, s, into subsequences of every n letter"
  [n s]
  (take n (freqs s)))

(defn chi-square
  "Taking the observed freq hash, obs, and compute Pearson coeff (lower is better) against English lang freqs"
  [obs]
  (def exp (hash-map \e 12.702, \t 9.056, \a 8.167, \o 7.507, \i 6.966, \n 6.749, \s 6.327, \h 6.094, \r 5.987, \d 4.253, \l 4.025, \c 2.782, \u 2.758, \m 2.406, \w 2.360, \f 2.228, \g 2.015, \y 1.974, \p 1.929, \b 1.492, \v 0.978, \k 0.772, \j 0.153, \x 0.150, \q 0.095, \z 0.074))
  (def e (vals (sort-by first exp)))
  (def o (vals (sort-by first obs)))
  (def subtracts (map #(- %1 %2) o e))
  (def squares (map #(* % %) subtracts))
  (def normed (map #(/ %1 %2) squares e))
  (reduce + normed))

(defn try-key
  "Given an encrypted text, text, try a key of lenth n and return Chi-square test"
  [text n]
  (def every-nth-letter (take-nth n text))
  (def freq-every-nth (freqs every-nth-letter))
  (def freq-hash (freq_hash freq-every-nth))
  (chi-square freq-hash))

Honestly, I haven't looked at this code in a couple months, and it's a mess. One thing I need to learn is organizing a Clojure project. For now, though, I just manually ran through try-key with the key lengths between 2 and 8, and noted down, first, whether it looked like non-mush came back and, if so, what the likely key character was.

I did this until I had a key length and key that looked reasonable, and then I checked it against the encrypted exercise text. When it worked, it was AMAZING!

This manual method was not great, though, since my human brain is a pattern-recognition machine, and so I saw a lot of patterns that weren't there. Automating it more would have probably helped in avoiding false-positive traps that my brain kept laying, as well as removing some of the general manual pain.