Thoughts from a code yellow
A long time ago I was working somewhere whose data pipeline had severe problems. Borrowing a term from Google, I declared a Code Yellow and wrote a doc with a plan for getting out of it.
Being relatively new to the org at the time, I also wrote down some core principles for getting out of the Code Yellow. Reading over them recently reminded me that they are actually relevant almost all of the time.
Quantify the problem
Pretty much everyone agreed there was a serious problem, but we had no consistent quantification of it.
Without numbers, we couldn't know if we were making progress. Quantifying the problem led to better decisions and more motivation.
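Even a single number tracked over time is enough to show whether things are improving. As an illustration (the record shape and field names here are hypothetical, not from the original pipeline), an error rate over each batch is about the simplest useful quantification:

```python
from dataclasses import dataclass


@dataclass
class Record:
    """One row flowing through a (hypothetical) pipeline."""
    id: int
    valid: bool  # did this row pass all validation checks?


def error_rate(records: list[Record]) -> float:
    """Fraction of records that failed validation; 0.0 for an empty batch."""
    if not records:
        return 0.0
    bad = sum(1 for r in records if not r.valid)
    return bad / len(records)


batch = [Record(1, True), Record(2, False), Record(3, True), Record(4, True)]
print(f"error rate: {error_rate(batch):.2%}")  # error rate: 25.00%
```

Graphing that one number per day answers "are we making progress?" without any further machinery.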
Fix the leak, then fill the bucket
Whenever we encountered a problem in our production systems, in our code, or in the data, our first reaction had to be “how can we change the system so it is impossible for this to happen again?” Only then should we address the problem at hand.
(Note: don't do this if you are paged for a system serving live user traffic.)
This was difficult, because:
- There is always significant pressure to just fix the problem
- Identifying deeper solutions only added to our already-long todo list
Nevertheless, we can’t just solve problems; we need to eliminate problem generators.
Close the loop
Whenever a person or a machine let us know about a problem with our systems, we should fix the problem, and then tell them we have fixed the problem, ideally in the same forum they raised it.
For example, if we got an alert, we should:
- Reply to it saying we’re working on it (ideally silencing the alert for the duration)
- When it’s resolved, mark it as resolved
- If there’s follow-up work to be done, link to that from the alert itself
This was important, because our goal was to restore trust in our system, and being trustworthy ourselves is a key part of that. Pragmatically, it also reduced interruptions, questions, and hassling, because people knew where to look for status updates.
Learn from failure
When things go wrong, we don’t blame anyone, but instead see the failure as an opportunity to learn. Put another way, accidents and mistakes will always happen, and it is our responsibility to build a system that can tolerate them.
Specifically, whenever something went wrong in our production systems, we had to write a post-mortem, and then review it within the team.
This is absolutely not about blame, but rather about making sure we are actually “fixing the leak”. Post-mortems will give us valuable insight into which issues are hurting us most, and how we can systematically address them.
We started by being over-enthusiastic in writing official post-mortem documents, and then backed off as we became more familiar with the process and could incorporate that kind of thinking into our day-to-day work.
Build learning in
When we learned a better way of doing something, we changed the system (either our production system or our development processes) so that the new, better way was the default. Adding checks to CI is a great example of this.
We should be extremely suspicious of advice like “be careful not to…”, “make sure you…”, “watch out for…”. Humans are bad at vigilance. Instead, let’s make machines that check things for us.
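As a concrete example, "be careful not to commit configs with debug mode on" becomes a mechanical CI check. A minimal sketch (the file format and rules here are invented for illustration; a real CI script would read actual files and exit nonzero on failure):

```python
import json


def check_config(text: str) -> list[str]:
    """Return a list of problems with a JSON config; empty means it passes."""
    try:
        config = json.loads(text)
    except json.JSONDecodeError as e:
        return [f"not valid JSON: {e}"]

    problems = []
    if config.get("debug"):
        problems.append("debug mode must be off in committed configs")
    if "timeout_seconds" not in config:
        problems.append("timeout_seconds must be set explicitly")
    return problems


if __name__ == "__main__":
    # In CI this would loop over real config files; here, an inline example.
    for problem in check_config('{"debug": true}'):
        print(" -", problem)
```

Once this runs on every commit, nobody has to remember the rule; the machine is vigilant so the humans don't have to be.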
I don't think any of these principles are revolutionary or unique to me. I also don't think that they are fundamental or complete or come anywhere near describing a system of thought. Instead, these were gaps between how I like to operate and where that particular team was at that particular time.
That said, I do find myself referring to them a lot. They sit in that big grey space between broad principles like "reliability" and more concrete processes.