Apr 30, 2018 - 8 Minute Read

My Take on Documenting Code and Readability

Recently we had a workshop at my workplace about how to document code. Unfortunately I wasn’t able to attend it, but the topic was in the air in the following days and I chatted about it a few times with my colleagues. I thought this was a solved problem because a lot of ink has already been spilled on it, but after a lot of discussions it became clear to me that it is not the case.

I’m going to use the following snippet(that I slightly adapted from real production code) to show the guidelines I tend to use in most cases.

def should_analyze(obj):
    # ...business logic
    return True

def get_category(obj):
    # ...business logic
    return "cat1"

def get_last_event_timestamp(obj):
    # ...business logic
    return datetime.datetime(2010, 1, 1)

def analyze_all(objects):
    res = {}

    for obj in objects:
        category, analysis = analyze(obj)

        if category not in res:
            res[category] = []

        res[category].append(analysis)

    return res

def analyze(obj):
    # Categorize the object
    category = get_category(obj)
    # Find object's last event timestamp
    last_event_timestamp = get_last_event_timestamp(obj)
    # Find out if the object should be analyzed
    analyze = should_analyze(obj)
    if not analyze or not last_event_timestamp:
        return None, None
    # Calculate the delta and add the result to the category
    timestamp = datetime.datetime.now()
    analysis = {
        "timestamp": timestamp,
        "delta": timestamp - last_event_timestamp,
        "last_event_timestamp": last_event_timestamp,
    }

    return category, analysis

First things first

The snippet from above has multiple functions in it and we’ll assume it’s part of a single file. Let’s also assume that you didn’t write this code and that you’re trying to understand what it does. You’ll have to read the code of a lot of functions: get_category, get_last_event_timestamp, should_analyze and analyze_all before reaching the meat of the module that is the analyze function. In this simple example it’s not a big deal since the amount of code is relatively small, but imagine how much code you should read(or skip) before reaching the main functions in a much bigger file. By simply putting the entry points or main functions at the very top of the file we can make the purpose of the file more clear automatically documenting it.

This reasoning has further implications though. Note how the analyze function calls get_category, get_last_event_timestamp and should_analyze in its body in this very specific order. On the other hand, the function definitions are not in the same order. This means that while reading the body of analyze you’ll have to jump at random locations in the file to find the other functions. This gets out of hand pretty quickly. Instead, if the functions follow the same order they’re called then finding the function will be so much simpler. This reasoning can also be applied to the analyze_all function that relies on the analyze function.

Are we done with reordering things? Nope. Note that the category variable inside analyze is used only in the very last return statement. Besides being inefficient, declaring a variable at a location far from where it’s first used is annoying because it’s information that you have to store in your brain for a long time before it can be useful. I don’t know about you, but if I don’t use notions my brain will likely forget them. Therefore, I think it’s best to define variables near to the location of first use.

Let’s reorganize the above snippet following this idea.

def analyze_all(objects):
    res = {}

    for obj in objects:
        category, analysis = analyze(obj)

        if category not in res:
            res[category] = []

        res[category].append(analysis)

    return res

def analyze(obj):
    # Find object's last event timestamp
    last_event_timestamp = get_last_event_timestamp(obj)
    # Find out if the object should be analyzed
    analyze = should_analyze(obj)
    if not analyze or not last_event_timestamp:
        return None, None
    # Calculate the delta and add the result to the category
    timestamp = datetime.datetime.now()
    analysis = {
        "timestamp": timestamp,
        "delta": timestamp - last_event_timestamp,
        "last_event_timestamp": last_event_timestamp,
    }
    # Categorize the object
    category = get_category(obj)
    return category, analysis

def get_category(obj):
    # ...business logic
    return "cat1"

def get_last_event_timestamp(obj):
    # ...business logic
    return datetime.datetime(2010, 1, 1)

def should_analyze(obj):
    # ...business logic
    return True

I think that at this point, we:

  • made clear where’s the meat of the file that is in the analyze function because it’s at very top;
  • optimized the amount of scrolling you have to do to read the entire file.

These might seem a small things, but once you’ll get accustomed to it you’ll find yourself reading code much more quickly and easily.

Now that the overall snippet structure satisfies me, it’s time to concentrate on the analyze function.

Code Comments

I’m not a big fan of comments mainly because of two reasons:

  • they get outdated pretty quickly and if they’re not updated diligently they’ll do exactly the opposite of what they’re meant to;
  • to me they are often abused to try to improve the code clarity without actually having to change any code. This happens especially when the developer isn’t confident enough with the changes he’d be making. There are plenty of reasons why it’d be the case, but the lack of a test suite should be a clear example.

Before I put any comments in the code I take a step back, I read the code from scratch again and ask myself what I could do to improve the code first. Most of the time this involves renaming things and moving stuff around as described above. If the code still looks too complex then eventually I’ll write some comments.

Comments should explain why not how.

In any case I tend to not write comments to explain what the code is doing, but why is doing so. Comments are also often used to give an overview of an algorithm or a complex idea but it should still be more focused on the why and not the how because the latter is already expressed in the code.

Let’s take a look at the snippet now. You could say that the snippet is actually well commented and the comments are there for a purpose. Well, imho the comments in the analyze function are quite useless. They do not provide any additional information I don’t already know just by looking at the code. Let’s consider the very first comment for example:

# Find object's last event timestamp
last_event_timestamp = get_last_event_timestamp(obj)

What is the comment telling us that the code is not already expressing? The only thing I could think of is that maybe get_last_event_timestamp is not a simple getter function, but it needs to perform some heavier computation to find the timestamp of the last event. If that’s the case then I’d rather change the function name to find_last_event_timestamp to make it explicit. In any case, I think the comment is just superfluous and can be removed. A similar reasoning can be applied to the second and fourth comments.

The third comment is worse imho and I’ll go as far as to say that it’s actually misleading and is harming code clarity.

# Calculate the delta and add the result to the category
timestamp = datetime.datetime.now()
analysis = {
    "timestamp": timestamp,
    "delta": timestamp - last_event_timestamp,
    "last_event_timestamp": last_event_timestamp,
}

We can break the comment into two propositions: Calculate the delta and add the result to the category. I won’t spend too much time on the former because it’s of the same “type” of the comments we have already considered. I’d like to concentrate on the latter. The proposition doesn’t make any sense, because the code is not appending anything to anything. I think the comment got outdated and I think I also see how. At the origin there was no analyze_all function and everything was done in analyze and as such there was some code like the following:

# Calculate the delta and add the result to the category
timestamp = datetime.datetime.now()
analysis = {
    "timestamp": timestamp,
    "delta": timestamp - last_event_timestamp,
    "last_event_timestamp": last_event_timestamp,
}
res[category].append(analysis)

which aligns to what the comment is saying. The problem is that eventually the code was rightly refactored into an analyze_all and analyze functions but the comments were not updated. This is a major problem because it leads to more confusion than other. Remember, the source of truth is always the code. I think that at this point we all agree to nuke that misleading comment.

The resulting code of the analyze function looks like the following now.

def analyze(obj):
    last_event_timestamp = get_last_event_timestamp(obj)
    analyze = should_analyze(obj)
    if not analyze or not last_event_timestamp:
        return None, None
    timestamp = datetime.datetime.now()
    analysis = {
        "timestamp": timestamp,
        "delta": timestamp - last_event_timestamp,
        "last_event_timestamp": last_event_timestamp,
    }
    category = get_category(obj)
    return category, analysis

I can say we definitely didn’t lost any clarity with the benefit of having less text to read.

Whitespace matters

As of now the analyze function looks like a complex algorithm because everything is so dense. Well, it turns out that’s not the case. The problem is that there is not even a single empty line in that function. Empty lines are powerful because they allow us to group text into meaningful sections(aka paragraphs). I usually group lines in the following sections:

  • variables initialization;
  • computations with related side effects;
  • any for, while, if or whatever constructs your language has to offer.

Let’s modify the analyze function to follow this principle.

def analyze(obj):
    last_event_timestamp = get_last_event_timestamp(obj)
    analyze = should_analyze(obj)

    if not analyze or not last_event_timestamp:
        return None, None

    category = get_category(obj)

    timestamp = datetime.datetime.now()
    analysis = {
        "timestamp": timestamp,
        "delta": timestamp - last_event_timestamp,
        "last_event_timestamp": last_event_timestamp,
    }

    return category, analysis

Even though the diff is small, I think the benefits are quite big because the logical sections in the functions are crystal clear now. Also, did I mention that refactoring the different sections into separate functions is as easy as cut and paste?

Conclusions

At this point I’m quite satisfied with the results and I don’t think there are other improvements I’d make. I’d argue that by following simple steps we were able to greatly improve the quality of the code.

I’d like to leave you with a consideration. The snippet that I used through out this post is trivial and most of real world code is not. This doesn’t mean that the code should look complex though.

Code should look simple even when it’s doing complex stuff.