I've been learning about hierarchical models in Bayesian statistics lately. This has meant digging into some YouTube micro-lectures, some Gelman, some PyMC3 documentation. And! Revisiting some of my old problem sets from CS109B.

Ye olde pset: Text prediction with naive Bayes

The old problem set had two main problems. The first was a text analysis/prediction problem. Specifically: given 100 articles and two authors, use word frequencies and Bayesian methods to predict which author wrote which article. I can see how this could be extended to, for example, automated categorization of articles or maybe even spam filtering. The first method was a naive Bayes approach, which assumes that word frequencies are independent of each other (probably not true) and just tries to measure the likelihood of observing some words, \(\mathbf{x}\), conditional on that author (or category, or whatever), \(C_{k}\).

Here it is for multinomial naive Bayes (looking at word count frequency):

$$p(\mathbf{x} | C_{k}) = \frac{(\sum_{i} x_{i})!}{\prod_{i} x_{i}!}\prod_{i} p_{ki}^{x_i} $$

Which I stole from Wikipedia and is saying:

  • The likelihood of these words, \(\mathbf{x}\), conditional on it being this person, \(C_{k}\), (where we have \(k\) people/categories/whatever), is
  • The factorial of the total word count, \((\sum_{i} x_{i})!\),
  • Divided by the product of each individual word count's factorial, \(\prod_{i} x_{i}!\),
  • Multiplied, for each word, by the probability that author \(k\) uses that word, raised to that word's count (\(p_{ki}^{x_i}\)).

Phew! But basically, we're doing our usual Bayesian thing: we observe some data (\(x_i\), word counts) and compute its likelihood under each author's word-usage probabilities (\(p_{ki}\), the probability that author \(k\) uses word \(i\)); multiply that by a prior over authors and you get a posterior for who wrote the article. The assumption is that authors use words with different probabilities. The other big assumption is that we can just do simple multiplication, since the words are treated as independent. That's probably not the case (e.g. words come in phrases, pairs, other stuff).
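To make that concrete, here's a toy sketch (not from the pset; the three "words" and the per-author probabilities are made up) of computing that likelihood in Python, in log space, with scipy's gammaln standing in for the factorials:

import numpy as np
from scipy.special import gammaln

def multinomial_log_likelihood(x, p):
    """log p(x | C_k): word counts x, author k's word probabilities p."""
    x, p = np.asarray(x, float), np.asarray(p, float)
    # log[(sum_i x_i)! / prod_i x_i!], using gammaln(n + 1) = log(n!)
    coef = gammaln(x.sum() + 1) - gammaln(x + 1).sum()
    return coef + (x * np.log(p)).sum()

# Toy example: three "words", two authors with different word habits
x = [4, 1, 0]                   # counts of each word in a mystery article
p_author_a = [0.7, 0.2, 0.1]    # made-up word probabilities for author A
p_author_b = [0.2, 0.3, 0.5]    # ...and for author B
print(multinomial_log_likelihood(x, p_author_a))   # ≈ -1.4 (more plausible)
print(multinomial_log_likelihood(x, p_author_b))   # ≈ -6.0 (less plausible)

Whichever author gives the higher likelihood (times their prior) wins the article.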

Dirichlet refresh

But hey! What if we just know that assumption of independence can't hold? Language comes in chunks, after all! Not. Independent. Words. In that case, the pset had us work through a Dirichlet model.

I really like Dirichlet models! The Dirichlet-multinomial is the n-dimensional extension of my favorite model, the beta-binomial. It's also (holy music) a model with a conjugate prior. That is, the posterior has an analytical solution. You don't need to do any Markov chain Monte Carlo sampling (unless you want to, which whatever, you could want to). And the Dirichlet-multinomial, like the beta-binomial, is Dirichlet going in and Dirichlet coming out.

The Dirichlet is parametrized by an n-dimensional vector of \(\alpha\) - this is similar to the beta being parametrized by two parameters (\(\alpha, \beta\)). Same deal.
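And the conjugacy really is as mechanical as it sounds. A minimal sketch (the counts here are invented): start from a flat Dirichlet prior, add the observed word counts onto \(\alpha\), and you already have the posterior:

import numpy as np

alpha_prior = np.ones(4)                 # flat prior over 4 words: Dir(1, 1, 1, 1)
counts = np.array([10, 3, 0, 7])         # made-up word counts observed for one author

alpha_posterior = alpha_prior + counts   # conjugacy: Dirichlet in, Dirichlet out
print(alpha_posterior)                   # [11.  4.  1.  8.]

# Posterior mean word probabilities: alpha_i / sum(alpha)
print(alpha_posterior / alpha_posterior.sum())

# And you can sample whole word-probability vectors directly - no MCMC required
samples = np.random.dirichlet(alpha_posterior, size=5)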

Here's a helpful comparison visualization, courtesy Wikipedia:

From a 2-dimensional beta... [beta distribution figure]

...to a 3-dimensional Dirichlet! Ta da! [Dirichlet distribution figure]

So, just like you can set an uninformative prior by using a beta distribution of \(\mathrm{Beta}(1,1)\), you can do the same for the Dirichlet: \(\mathrm{Dir}(1,1,1,1,\ldots)\). It'll be a flat line or plane (uniform over the simplex)!

The benefit, of course, of using the Dirichlet is that you can now model all those words jointly: you can say, given this big long list of words that author \(k\) wrote, what's the joint probability of those words?
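The quantity you actually score each article with is the Dirichlet-multinomial marginal likelihood. Here's a minimal log-space sketch of it (my own paraphrase, not the TAs' function; the counts and alphas are invented):

import numpy as np
from scipy.special import gammaln

def dirichlet_multinomial_log_likelihood(x, alpha):
    """log p(x | alpha): word counts x under a Dirichlet(alpha) over word probabilities."""
    x, alpha = np.asarray(x, float), np.asarray(alpha, float)
    n, a = x.sum(), alpha.sum()
    return (gammaln(n + 1) - gammaln(x + 1).sum()      # multinomial coefficient
            + gammaln(a) - gammaln(n + a)              # normalizing constants
            + (gammaln(x + alpha) - gammaln(alpha)).sum())

# Score a mystery article's counts against each author's posterior alphas
x_article = [4, 1, 0, 2]                 # made-up word counts
alpha_author_a = [11, 4, 1, 8]           # e.g. the posterior from the sketch above
alpha_author_b = [2, 9, 6, 1]
print(dirichlet_multinomial_log_likelihood(x_article, alpha_author_a))
print(dirichlet_multinomial_log_likelihood(x_article, alpha_author_b))

(Working in log space here is also what keeps the factorials from blowing up - more on that below.)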

Porting from R to Python

Alright. So, I do think that building stuff from scratch is a valuable way to learn. And translating my old R pset into Python is a valuable way to dig in. But, oooof. This took forever! It forced me to think through each step, and to ask things like: (a) which Python libraries can I lean on (i.e. happily give up and not write it myself)? And (b) what is happening at stage N of this interminable math journey? Things like that.

Okay, naive Bayes was easy. Here it is in Python, courtesy sklearn (thanks, sklearn):

import pandas as pd
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix

# A little convenience function to make my
# confusion matrix labelled and in % terms (rather than absolute)
def confmat(actual, predictions, columns, index):
    # labels= keeps the matrix's row/column order in sync with our names
    cm = pd.DataFrame(confusion_matrix(actual, predictions, labels=columns),
                      columns=columns, index=index)
    cm['sum'] = cm.sum(axis=1)
    return cm.apply(lambda x: x/x['sum'], axis=1)

# Fitting the model, predicting, running through the confusion matrix
mnb = MultinomialNB()
mnb.fit(X_train, y_train)
y_preds_mnb = mnb.predict(X_test)

confmat(y_test, y_preds_mnb, y_test.unique(), y_test.unique())

Err, I should probably have mentioned that my X_{train|test} are pandas DataFrames of word counts: each row is an article, each column is a word, and each cell is the number of times that word appears in that article. And my y_{train|test} are pandas Series of the category (in this case, the author's name), one per article.
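If you want to build that kind of count matrix from raw text yourself, sklearn's CountVectorizer does the heavy lifting. A quick sketch (the two "articles" and authors are obviously made up):

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the whale the whale the sea", "elementary my dear watson"]   # toy articles
authors = pd.Series(["Melville", "Doyle"])                            # toy labels

vec = CountVectorizer()
counts = vec.fit_transform(docs)       # sparse matrix: rows = articles, columns = words
X = pd.DataFrame(counts.toarray(), columns=vec.get_feature_names_out())
print(X)                               # each cell: times that word appears in that article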

Fine. That's all fine. I recommend you try this out. Maybe using the Enron email archive - see if you can predict email senders! Woohoo. Or analyze Trump's tweets and see which are written by him vs. others. ("Forensic linguistics", ooh la la.)

Anyway. What became horrible was when I decided to port a mammoth R function (which was given to us in the pset by the TAs - i.e. we weren't meant to code this from scratch) into Python.

Oh noez overflow!

The Harvard TAs had written this gargantuan function to manually calculate the Dirichlet-multinomial posterior for each article. That is, the function would return, as a floating point number, the posterior probability that a given article was written by author \(A\) versus author \(B\). I ported it into Python, but won't copy it here (probably violates Harvard copyright?).

But I did learn some stuff, which relates back to my CS50 learnings: MEMORY ALLOCATION.

So Python is dynamically typed - and duck typed: if it looks and quacks like an int, it's an int! Which is fine and well for 90% of what you do. But sometimes you want to be explicit. Sometimes you also just plain overflow.

Overflow is when a number is too big for the bits set aside to hold it. Say you want to hold a really big number. Like, \(e^{710}\). Well, you can't. Not in a standard 64-bit float, anyway. Python actually, for integers, will just take up ALL the memory it needs and, dammit, give you your number. But this breaks down for floating points. Because a float has a fixed 64 bits to cover everything to the right and the left of the decimal point - and that's a tradeoff! You trade off range against precision, and the range tops out around \(1.8 \times 10^{308}\) (whereas \(e^{710} \approx 2.2 \times 10^{308}\)). This is also why some floating points get "rounded" by Python under the hood - for example:

# Ten decimal places: all looks well
print(f"{1/3:.10f}")
# 0.3333333333

# Twenty, and we see the lies we've been told
print(f"{1/3:.20f}")
# 0.33333333333333331483

ANYWAY. All this to say, I was trying to make my Dirichlet posterior function, and it kept crapping out at, yep, \(e^{710}\). And the way it crapped out varied by basic Python vs. numpy vs. my eventual savior, the decimal module (docs).

import math
import numpy as np
import decimal

math.exp(709)
# 8.218407461554972e+307

math.exp(710)
# ---------------------------------------------------------------------------
# OverflowError                             Traceback (most recent call last)
# <ipython-input-11-0ad1e5224725> in <module>()
# ----> 1 math.exp(710)
# OverflowError: math range error

np.exp(709)
# 8.2184074615549724e+307

np.exp(710)
# RuntimeWarning: overflow encountered in exp
# inf

decimal.Decimal(709).exp()
# Decimal('8.218407461554972189241372387E+307')

decimal.Decimal(710).exp()
# Decimal('2.233994766161711031253644458E+308')

Ah ha! Bottom line: use decimal and you may math with ease. Or use numpy for its amusing "so big, basically infinite!"
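For the record, another escape hatch - which this post doesn't use - is to never leave log space at all, and let scipy's logsumexp handle the normalization:

import numpy as np
from scipy.special import logsumexp

# Two log-likelihoods whose exponentials would overflow a 64-bit float
log_like = np.array([710.0, 705.0])

# Normalize in log space: posterior probabilities without ever computing e^710
posterior = np.exp(log_like - logsumexp(log_like))
print(posterior)   # approximately [0.9933 0.0067]

No overflow, no Decimal gymnastics.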