A lot of my job is about safety. Safety prevents errors from happening but, more importantly, when people feel safe, things become safer.
It’s a strange phenomenon that I’ve seen time and time again: if you lay out processes and tools that make things like software deployments safer, the effects continue to compound long after the change has happened.
Something often overlooked during an incident is how we communicate with our customers and reassure them about the situation. How you convey an incident to the people paying for your service can make all the difference when contract renewal comes around. At a past company, our sales team regularly reported back from client meetings that clients consistently mentioned how helpful and reassuring the status page updates from the SRE team were, even when our service was fully down.
Often, when someone works on changes that span multiple services, they think of it as a separate Pull Request for every project. Then, when it comes to deploy day, there’s a concern: We want to make a change to X but Y also needs that change to work - how do we deploy these at the same time?
As Operations Engineers, we often overlook the user experience of tooling in favour of functionality. CLIs end up with vast sprawling seas of flags and nested commands requiring a minotaur to traverse. Now, I believe that ChatOps is the best path to giving independence and control to everyone in the company. This post is going to talk about why we rebuilt our ChatOps from scratch and what I learned along the way about UX, people and myself.
In part one of this series, I talked about my early weeks as an SRE at Hosted Graphite. After jumping into on-call, getting to grips with our architecture and getting acquainted with five years’ worth of tasks, I was almost ready to call myself a fully fledged member of SRE. Little did I know, my onboarding wasn’t quite finished yet…
Onboarding a new hire is a tricky process and can be very difficult to get right. I’ve worked at, and with, companies that had either zero onboarding or way too much. In the past, it went one of two ways: either I was pushed out of the plane without a parachute, or the parachute was already deployed and I didn’t make it to production for three months.
This blog post is based on a talk I gave at SREcon18 EMEA. If you’d like to see the slides or watch the talk, they are available on the USENIX website.
Let’s start with a radical concept: you’ve already got 50% of the work done to start capacity planning. A lot of what SRE teams do already feeds directly into understanding and forecasting capacity.
Recently, some of our dedicated hosts had network problems because high traffic was triggering a known bug in particular Realtek NICs.
Unfortunately, Puppet doesn’t expose facts about a server’s networking equipment, so I wrote the below to expose the NIC drivers in use, their firmware versions and which interfaces use them.
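The fact from the full post isn’t reproduced in this excerpt, so what follows is not the original code. It is only a minimal Python sketch of the same idea, assuming a Linux host with `ethtool` installed: it parses the `key: value` lines that `ethtool -i <interface>` prints and groups interfaces by driver. The function names and the `nic_drivers` key are hypothetical, and how you feed the JSON into Facter (for example as an external fact) depends on your Facter version.

```python
import json
import os
import subprocess


def parse_ethtool_info(output):
    """Parse the 'key: value' lines printed by `ethtool -i <iface>`."""
    info = {}
    for line in output.splitlines():
        # Split on the first colon only, so values such as bus addresses survive.
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def group_by_driver(info_by_iface):
    """Map each driver name to its firmware versions and the interfaces using it."""
    drivers = {}
    for iface, info in sorted(info_by_iface.items()):
        driver = info.get("driver") or "unknown"
        entry = drivers.setdefault(
            driver, {"firmware_versions": [], "interfaces": []}
        )
        entry["interfaces"].append(iface)
        firmware = info.get("firmware-version", "")
        if firmware and firmware not in entry["firmware_versions"]:
            entry["firmware_versions"].append(firmware)
    return drivers


def collect(sys_net="/sys/class/net"):
    """Run `ethtool -i` for every interface listed under sys_net (Linux only)."""
    info_by_iface = {}
    for iface in os.listdir(sys_net):
        result = subprocess.run(
            ["ethtool", "-i", iface], capture_output=True, text=True
        )
        if result.returncode == 0:
            info_by_iface[iface] = parse_ethtool_info(result.stdout)
    return group_by_driver(info_by_iface)


if __name__ == "__main__":
    # "nic_drivers" is a hypothetical key chosen for this sketch.
    print(json.dumps({"nic_drivers": collect()}, indent=2))
```

Keeping the parsing and grouping separate from the `ethtool` call means the logic can be checked against captured output without touching real hardware.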
A lot of Puppet configurations recommend using Puppet’s tidy resource type to manage Puppet reports. The problem with this is that, in order to delete each file, Puppet creates a file resource entry in state.yaml. The state file grows pretty quickly as a result, and I’ve seen it slow down Puppet runs after a certain point.
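One way to avoid touching state.yaml at all is to prune old reports outside of Puppet, for example from a scheduled job. This is a minimal Python sketch of that idea and not necessarily the approach the full post takes. The function name `prune_reports`, the seven-day default and the `.yaml` filter are assumptions, and the report directory is passed in because its location varies between installations.

```python
import os
import sys
import time


def prune_reports(report_dir, max_age_days=7, now=None):
    """Delete *.yaml files older than max_age_days and return the removed paths."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    removed = []
    # Walk subdirectories too, since reports are often stored per host.
    for root, _dirs, files in os.walk(report_dir):
        for name in files:
            if not name.endswith(".yaml"):
                continue
            path = os.path.join(root, name)
            if os.path.getmtime(path) < cutoff:
                os.remove(path)
                removed.append(path)
    return sorted(removed)


if __name__ == "__main__":
    # Usage: python prune_reports.py <report directory>
    for path in prune_reports(sys.argv[1]):
        print("removed", path)
```

Because the deletion happens outside a Puppet run, no per-file resources are generated and nothing is added to the state file.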