
I started incorporating a few new tools into my programming lately, and it's been rad. I wanted to share how and why I think these tools are helpful. They may be standard to software engineers, but - as a data scientist, and someone without a formal CS background (ahem, for now!) - these sorts of programming paradigms aren't always as prioritized. But it's good to change that!

I've been filling in my CS foundations with CS50 lately, and - beyond the big stuff like data structures and algorithms - I feel like C is giving me good "everyday" habits of being explicit about my data types and more serious about adhering to programming styles. Remember: code is read far more often than it's written, so standardizing your style improves readability and makes you better!

I've started incorporating some better programming practices into my day-to-day, and I really like the results. Namely: logging, type hints, and linting.

Logging

My style of coding - which I assume is pretty common - is something like:

  1. (Semi-optional) Pseudo code/sketch something out, if it's complicated-seeming
  2. Bash something out in Python/Clojure/JavaScript/C
  3. Run it and see what happens
  4. Try to understand the inevitable error message

In the beginning, step (4) was mostly impossible, and so I sprinkled print() statements liberally throughout my code. Then I'd sort of anxiously watch the terminal to see what stdout would say as my program ran.

As I got better at reading error messages and generally more confident, I used print() less. But it's still something that I'll use when something breaks in a way I'm not expecting.

And that's fine. But, at the same time, I saw how some of my senior colleagues used logging - and, generally, talked a lot about logs. Searching logs. Logs getting big. Finding something in the logs. Like... what are these logs you people keep talking about? Time to learn!

Python's standard library includes the logging module. The general idea is that, instead of using print() statements - which write to stdout (i.e. your shell's output) - and then disappear into the mists of time, logs are saved to a .log file for as long as you like, so you can peruse as much as you like. (Or, better yet, do cool data analysis on them.)

Logging in Python is really straightforward. The basic steps are:

  1. Configure a log file: its name and file path, its formatting, and what it'll pick up (there are six levels of logging in Python).
  2. Sprinkle log messages throughout!
  3. Profit!

import logging
logging.basicConfig(filename=f"{__name__}.log", level=logging.DEBUG,
                    format="%(name)s - %(asctime)s - %(levelname)s - %(message)s",
                    filemode="w")

That is: configure a .log file named (dynamically) after __name__ (i.e. the name of the Python module currently being executed), catch all priority levels (from DEBUG and up), and format the output with the name of the logger, the local time when that log statement was written (asctime), the level of priority (DEBUG, CRITICAL, etc.), and the message (which you'll define later). Do all this and write (w) to the file, overwriting it if it already exists (you can change this to append, a, if you prefer).
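
To see what the level setting actually filters, here's a small sketch. It is not the file-based setup above: it writes to an in-memory buffer instead of a .log file so the result is easy to inspect, and the logger name "demo" and the helper emitted_levels are made up for illustration.

```python
import io
import logging


def emitted_levels(threshold: int) -> list[str]:
    """Log one message per level and return the level names that got through."""
    buffer = io.StringIO()
    handler = logging.StreamHandler(buffer)
    handler.setFormatter(logging.Formatter("%(levelname)s"))

    logger = logging.getLogger("demo")  # hypothetical logger name
    logger.setLevel(threshold)
    logger.propagate = False  # keep the demo out of the root logger's output
    logger.addHandler(handler)
    try:
        logger.debug("very chatty detail")
        logger.info("normal progress")
        logger.warning("something looks off")
        logger.error("something broke")
        logger.critical("everything is on fire")
    finally:
        logger.removeHandler(handler)

    # One level name per line was written to the buffer.
    return buffer.getvalue().split()


print(emitted_levels(logging.DEBUG))    # all five messages pass
print(emitted_levels(logging.WARNING))  # DEBUG and INFO are dropped
```

Setting the threshold to DEBUG lets everything through; raising it to WARNING quietly drops the chattier messages without touching the logging calls themselves.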

Later in the code, I initiated my logger object and wrote my first message (an INFO message):

logger = logging.getLogger()
logger.info('Starting log...')

A .log file is born!
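
Since every line follows the format string from the config, the file is easy to pull apart again later. Here's a rough sketch, assuming the "name - asctime - levelname - message" layout; the sample lines and the count_levels helper are invented for illustration and are not real output.

```python
from collections import Counter

# Invented sample lines in the "name - asctime - levelname - message" layout.
SAMPLE_LINES = [
    "root - 2019-01-01 10:00:00,000 - INFO - Starting log...",
    "root - 2019-01-01 10:00:01,250 - INFO - ID 1",
    "root - 2019-01-01 10:00:02,500 - WARNING - Failed to fit",
]


def count_levels(lines):
    """Count how many log lines were written at each level."""
    counts = Counter()
    for line in lines:
        # maxsplit=3 keeps any " - " inside the message itself intact.
        parts = line.rstrip("\n").split(" - ", 3)
        if len(parts) == 4:  # skip lines that don't match the layout
            _name, _timestamp, level, _message = parts
            counts[level] += 1
    return counts


print(count_levels(SAMPLE_LINES))
```

With a real file you would pass the open file object instead of the sample list, since iterating over a file yields its lines.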


Yet later, while I was running a gnarly loop that tried to do some stats on a bunch of things, I logged every model like so:

def foo(thing_id):
    logging.info(f'ID {thing_id}')
    try:
        b0, b1 = some_fitting_function(data(thing_id))
        logging.info(f'Parameters: {b0, b1}')
    except Exception:
        logging.warning('Failed to fit')

That way, I kept track of everything I was able to fit a model to, and everything that failed. By using a try/except block, my loop kept running. It was super handy, since I needed this to run for a while, for thousands of things, and I was actually trying to get a sense of how often it would fail to fit (and why).
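
Here's a self-contained sketch of that pattern. The fitting function is a stand-in (a plain least-squares line that refuses to fit when all the x values are the same), and DATA, fit_line, and fit_all are hypothetical names, not the original code. One small variation: logging.exception is used in the except block, which logs at ERROR level and includes the traceback, so the log also captures the "why".

```python
import logging

logging.basicConfig(level=logging.DEBUG)

# Hypothetical data: thing_id -> list of (x, y) points.
DATA = {
    1: [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)],
    2: [(1.0, 2.0)],  # a single point: no line can be fitted
    3: [(0.0, 0.0), (2.0, 1.0)],
}


def fit_line(points):
    """Ordinary least squares for y = b0 + b1 * x (stand-in fitting function)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    spread = sum((x - mean_x) ** 2 for x, _ in points)
    if spread == 0:
        raise ValueError("need at least two distinct x values")
    b1 = sum((x - mean_x) * (y - mean_y) for x, y in points) / spread
    return mean_y - b1 * mean_x, b1


def fit_all(data):
    """Fit every thing, logging successes and failures; return both."""
    fitted, failed = {}, []
    for thing_id, points in data.items():
        logging.info("ID %s", thing_id)
        try:
            fitted[thing_id] = fit_line(points)
            logging.info("Parameters: %s", fitted[thing_id])
        except ValueError:
            # Logs at ERROR level and appends the traceback to the message.
            logging.exception("Failed to fit")
            failed.append(thing_id)
    return fitted, failed


fitted, failed = fit_all(DATA)
print(f"{len(failed)} of {len(DATA)} failed: {failed}")
```

The loop keeps going after the failure on the second item, and the returned list of failed IDs gives the failure rate directly.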

Type hints

Python is dynamically typed - meaning you don't have to specify your data types (string, integer, float); they're worked out at runtime from the values themselves. C, on the other hand, is statically typed and much more explicit. Consider the same code in Python vs. C:

# Python
x = "hello"
y = 2
z = 0.14

/* C */
char x[] = "hello";
int y = 2;
float z = 0.14;

Like, C doesn't even have a string datatype! It only has an array of characters! So you have to literally tell it, hey, I'm going to make an array, x[], of characters, char, and those characters are h, e, l, l, o. Mamma mia.

Actually, I kid. I love that C does this. I love that an array name decays to a pointer to the 0th element. And I actually also love the types! They make the code more readable.

So now I've discovered the best of both worlds with gradual typing.
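
As a minimal sketch of what "gradual" means here: the three Python variables from before can carry annotations, while other code stays unannotated right next to them. The names label and scale are made up for illustration.

```python
x: str = "hello"
y: int = 2
z: float = 0.14

# Annotated and unannotated code can live side by side.
label = f"{x} x{y}"  # no annotation; still fine


def scale(value: float, factor=10):  # only one parameter is hinted
    """Multiply value by factor; factor is deliberately left unhinted."""
    return value * factor


print(label, scale(z))
```

Nothing about how the code runs changes; the annotations are there for readers and tools.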


[Image: Google auto-complete drama]

When I was recently trying to understand a big, wide codebase, I was starting to get annoyed that I'd have to constantly go up and down the function stack to try to understand what the hell that one argument was. It's hard to tell what function foo(arg) is expecting for arg if it's something like:

def foo(arg):
    # do some stuff
    return deep_foo(arg)

def deep_foo(arg):
    # do different stuff
    return deeper_foo(arg)

def deeper_foo(arg):
    # for the love of god
    # what is going on here
    return deepest_foo(arg)

def deepest_foo(arg):
    # arg is an int!! AN INT!!!!
    ...

I'd be spelunking up and down the function stack, trying to write down what arg was and keep it in mind. But with a lot of parameters and a lot of functions, spread across multiple modules, that got out of hand fast.

This was when I decided that I would spelunk ONCE AND FOR ALL and just type hint everything up and down. Type hinting is great. You basically state, in your foo(), what type arg is, what type foo() returns. It works for keyword arguments too. It's very flexible. You can type hint a couple parameters, skip others, and so on. You can use the usual data types (str, int, float, bool), but you can also type hint user-created types and classes.
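
Here's a short sketch of that flexibility: a keyword argument with a default, one parameter left unhinted, and a user-created class used as a type. Experiment and summarize are invented names for illustration.

```python
from dataclasses import dataclass
from typing import Optional, get_type_hints


@dataclass
class Experiment:
    """A user-created type that can be used in hints like any built-in."""
    name: str
    n_samples: int


def summarize(exp: Experiment, verbose, prefix: Optional[str] = None) -> str:
    """exp and prefix are hinted; verbose is deliberately left bare."""
    text = f"{exp.name}: {exp.n_samples} samples"
    if verbose:
        text += " (verbose)"
    return f"{prefix} {text}" if prefix else text


print(summarize(Experiment("pilot", 30), verbose=False, prefix="[run]"))

# The hints are stored on the function and can be inspected;
# the unhinted parameter simply doesn't show up.
print(get_type_hints(summarize))
```

Skipping the hint on verbose is perfectly legal, which makes it easy to add hints to an existing codebase a little at a time.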

import pandas as pd

def foo(arg: int) -> pd.DataFrame:
    # do some stuff
    return deep_foo(arg)

def deep_foo(arg: int) -> pd.DataFrame:
    # do different stuff
    return deeper_foo(arg)

def deeper_foo(arg: int) -> pd.DataFrame:
    # for the love of god
    # what is going on here
    return deepest_foo(arg)

def deepest_foo(arg: int) -> pd.DataFrame:
    # arg is an int!! AN INT!!!!
    ...

You can also put your type hints in a stub file (a separate .pyi file that sits alongside the module). The benefits of using type hints, as I see them, are:

  • Readability. No more spelunking. Better understanding of what each function is doing.
  • IDEs. I learned a lot about type hints by watching Joel Grus live-code a neural net in 1 hr. His IDE (Visual Studio Code) and many others are smart about type hints and will raise a flag if you're not being internally consistent.
  • Linting. Linters also do logic checks on internal consistency with types.
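
To make the "internally consistent" point concrete, here's a sketch of the kind of mismatch these tools catch. Python itself runs the bad call without complaint, because hints are not enforced at runtime. A static type checker would report it before the code ever runs; mypy is one such checker, though that's my example and not a tool mentioned above. The function double is made up.

```python
def double(arg: int) -> int:
    """Hinted to take and return an int."""
    return arg * 2


good = double(21)

# The hints are not enforced at runtime, so this call "works" and silently
# returns a repeated string. A type checker flags it because "21" is a str,
# not an int.
bad = double("21")

print(good, bad)
```

The bug here is quiet: no exception, just a wrong value travelling onward, which is exactly the sort of thing that is nicer to hear about from a tool than from production.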

Which brings me to...

Linting

Did you know pylint gives your code a SCORE? YES. A SCORE. No longer will you kind of vaguely feel like your code is fine, but could be improved. NO. Now you know whether your code is a 4.32/10 or 6.16/10 or 8.23/10!

So pylint was another thing I long glossed over, but - when I discovered that it QUANTIFIES YOUR CODE QUALITY - well, you can bet I'm on that train. So now I pylint everything I can. My emails, my blog posts. All PEP8 everything, dammit!

What pylint does is automatically check your .py files for adherence to style standards (the PEP8 style guide), plus handy stuff like software design rules of thumb. For example, it'll encourage you to refactor a function into more functions (or a class!) if that function has more than five arguments. I guess this is what I hear people at work call a "code smell". When I googled some of pylint's output messages, I found this wonderful resource: PyLint Messages - and what they're trying to tell you.
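
As an illustration of that rule of thumb, here's a hypothetical before-and-after. With default settings, pylint reports too-many-arguments (R0913) for the first function because it takes more than five arguments; bundling related values into small classes brings the count down. All names here are invented.

```python
from dataclasses import dataclass


# Before: six arguments, which trips pylint's default limit of five.
def describe_plot(title, x_label, y_label, width, height, dpi):
    return f"{title} ({x_label} vs {y_label}) at {width}x{height}, {dpi} dpi"


# After: related values are grouped, so each function takes fewer arguments.
@dataclass
class Labels:
    title: str
    x_label: str
    y_label: str


@dataclass
class Size:
    width: int
    height: int
    dpi: int = 100


def describe_plot_refactored(labels: Labels, size: Size) -> str:
    return (
        f"{labels.title} ({labels.x_label} vs {labels.y_label}) "
        f"at {size.width}x{size.height}, {size.dpi} dpi"
    )


before = describe_plot("Fit", "x", "y", 8, 6, 100)
after = describe_plot_refactored(Labels("Fit", "x", "y"), Size(8, 6))
print(before == after)
```

Both versions produce the same string; the refactored one just makes it obvious which values belong together.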

Again, you can configure your IDE to adhere to PEP8 guidelines too - making it even easier.


