jml's notebook

Excellence in infrastructure teams

I started re-reading Turn the Ship Around! the other night, this time thinking very specifically about how I might apply the ideas in my own situation.

One of the mechanisms for clarity is "Achieve excellence, don't just avoid errors". If someone in charge of a floating nuclear reactor armed with multiple nuclear warheads can have this attitude, it's at least worth trying on the idea to see how it fits.

What does excellence in an infrastructure team look like?

First, I guess I should step back and describe my team a bit.

I'm in charge of the infrastructure team at Memrise. We're responsible for our AWS infrastructure, Kubernetes deployment, monitoring, alerting, continuous integration, continuous deployment, data pipelines, and probably a bunch of other things too. Everyone in the team is on the on-call rota.

Literally the first thing I look at when I want to measure the health of the team are the rate of errors in production (perfectly fine), and the number of critical pages per day (well below 1). However, both of these are about avoiding errors.

Obviously an ops team needs to avoid errors. But as the saying goes, if the highest duty of a captain were the safety of her ship, she'd never leave the harbour.

Here are some ideas, liberally pinched from the SRE book, and conversations with my colleague JF-P.

Fundamentally, our role is about ensuring a great user experience for our end users and customers. This means a service that is error-free, fast, and reliable. While "error-free" and "reliable" include the avoidance of errors, "fast" is something you can always get better at.

A lot of what we do is oriented toward our internal users: the software developers and data analysts at Memrise. In many ways, the scope of their achievements will be the mark of our own excellence.

This means we'll be building a system together with them that they can understand, because it's simple and because we've provided great tools for visibility. It means they'll be equipped to debug their own problems because they've got the tools to do so, and whatever knowledge their lacking is discoverable.

It also means that they'll have capabilities they didn't have before, and they are using these to provide features we couldn't imagine before, faster than ever.

And if we're being truly excellent, the nature of our own challenges will change. The problems that we deal with today will be gone and we'll be facing newer, tougher problems. What does this mean for hiring junior team members? As Miles Vorkosigan is wont to say, "the reward for a job well done is usually a harder job".

That's all I've got for now. To summarise, excellence is:

  1. Reliable
  2. Fast
  3. Equipping developers to go higher, further, faster
  4. Solving old problems, tackling new ones

This means nothing unless a team comes up with it for themselves, but maybe this is a useful starting point.

What have I missed?