Learning how to dive into a big, wide codebase is - I'm realizing - an important meta-skill. Kinda like learning how to read an academic article.

I've struggled with it, though, since I didn't realize it was a meta-skill and something to improve. I guess I thought other people just magically absorbed big codebases?! That, coupled with the shrieking sirens of impostor syndrome, made me freeze up whenever someone asked me, "Hey, could you make [X small change] to [Y giant new codebase]?"

So I wanted to note down what I do and what's been helpful.


Imagine, if you will, this scenario:

SCENE 1. A bustling 15th century RENAISSANCE BOTTEGA. Large lavendar plants spill from big ceramic jars. The work of the great Renaissance masters - Da Vinci, Brunelleschi, Michelangelo - adorn the walls, along with a cyberpunk poster or two. The smell of paper-thin pizza is in the air. Your COLLEAGUE is laboring over some giant convolutional neural net. YOU arrive, stage left, pursued by a bear.

COLLEAGUE: Oh! Angela! Welcome to the team. Glad you could join us.

YOU: (nervously) Glad to be here!

COLLEAGUE: We thought a nice first project for you would be extending some of the features for our service's chatbot.

YOU: ...

COLLEAGUE: Are your familiar with chatbots?

YOU: Um, well, I guess I did build one in my old job, but that was just --

COLLEAGUE: GREAT! Ours is powered by some natural language processing, regular expressions, and a few APIs. Pretty straightforward, really, no big deal. We'd love it if it could start responding to user questions about our DATA SCIENCE SERVICE, so you'll need to plug it into that as well. We're just trying to give a window into that infamous black box, eh?

The DATA SCIENCE SERVICE is wheeled out. It is a large black box, thrumming softly and crackling with occasional bursts of electric sparks. There is a small post-it attached that reads: README.md. YOU peer at the post-it, peer at the box, and the blood drains from your face. It reminds me you of that one freaky Dune scene. You think you hear the hiss of snakes inside.

YOU: (to self) Oh God.


YOU: I said, sounds great!

So. What to do?

This has happened a few times, where I'll be introduced to a codebase that has thousands of lines of code spread out over several modules. It can be really intimidating! So this is how I break it down.

Step 1: Get the general context

The biggest mistake I make - and I make it again and again - is that I get into the weeds too fast. Whenever I'm introduced to a black box, I find the first weed and start studying it very, very closely. This is no good. This is not just putting the cart before the horse, this is putting the screws before the wheels before the cart before the horse.

So, one thing I've been training myself to do is a top down approach. Mainly, this means asking myself, before anything: What is the point of this codebase? If I can describe it in one line (e.g. "It's a micro-service that provides a probable classification in response to an HTTP request"), that's a start.

Beyond getting the one-line tldr of the overall, I then try to build out the context more. Why does this codebase exist? What happened before it existed? Who uses it? If there's documentation, I try to read it - even if it's outdated. I try to ask stupid questions, and keep asking them. I try to draw out a diagram of its place in the world (e.g. connected to databases or a website or what).

project in the world I love my LCD writer - perfect extension for your brain!

Step 2: Get the software context

If you don't have lots of software development experience, directory structures look pretty arbitrary. When I realized that there were norms around how code projects were organized, this sped things up a lot. Suddenly, I knew where to look; I knew which folders to focus on. The Hitchhiker's Guide to Python has a good overview of a Python project template, and I've also seen and used DrivenData's cookiecutter data science template.

Step 3: The project outline (drawn)

I find it really helpful to draw a diagram linking all the modules together. Sometimes this includes drilling down into each module, mapping the functions too and whether they're calling external stuff (like a database) or calling other pieces. The main thing I want to do is get a sense of what's in there.

I also am a big fan of color-coded everything, since - thanks to CS171 - I now know that COLOR REIGNS SUPREME, and the human brain processes color blaaazingly fast (yeah, V1 neurons!).

project in the world Blue classes hold red functions

Step 4: Reading and writing docstrings

Once the above steps are done, I feel like I am NOW ALLOWED INTO THE WEEDS. At this point, I'll start digging into specific functions by order of subjective "importance". I really like PyCharm's goto definition thing - where I can cmd+some_function_name() and it'll jump me to the source. I'll try to follow this goto-stream up and down the layers of abstraction, reading through docstrings.

Unless there aren't any docstrings! Which sometimes happen! In this case, I might start drafting some docstrings for the functions - just (1) what I think the func does, (2) arguments and their data types, (3) returned values and their types. I'll try to understand the funcs well enough to write docstrings that could be merged into the develop branch.

Also step 4: Running and writing tests

I recently learned about the wonders of testing, and now I use tests as my sanity check for everything. When I'm getting to know a new codebase, I might run tests as the first thing I do - this often pops out whether my installation or virtual environment is missing something. And that helps understanding too! I also try to read through the tests and see what's being tested; this can be the best place to see the overarching behavior of all the code.

I've lately also found that writing tests - especially super small unit tests - and checking out test coverage, are reeeeally helpful ways to get to know things better. Robin Andeer's three part series --

  1. How I test my code: motivation and strategy (part 1)
  2. How I test my code: pytest and fixtures (part 2)
  3. How I test my code: coverage and automation (part 3)

-- was really eye-opening.

The last step 4: Jupyter notebooks

And finally, Jupyter notebooks! This has also been a helpful way to get to know code. By this point, I should have an overall sense of what the codebase is supposed to be doing. I might want to dig into an important or main function. A good way to do that is, after creating a virtual environment and pip installing the project's dependencies, creating a (.gitignored!) folder called notebooks/ and starting to explore the tests, the code, everything. One thing I might do is pull in a function and break it down into each of its parts - that way, I can see what goes where, and what needs what.

A slightly more onerous way, and something I sometimes do with testing as well, is just splattering print() statements everywhere, especially of parameters and returned values. This helps me understand the state of things at each point in the code.

I'm still not great at this

That's been the journey so far. And that's my current best method. If I were to boil it down, it would be: (1) top-down first, (2) draw, (3) test (safely).