Posts related to SRE

Psychological Safety and the Only Pyramid Scheme That Works

A lot of my job is about safety. Safety prevents errors from happening but, more importantly, when people feel safe, things become safer. It’s a strange phenomenon I’ve seen time and time again: if you lay out processes and tools that make things like software deployments safer, the effects continue to compound long after the change has happened.


Customer Communication During Incidents: The How-To of Status Page Updates

Something often overlooked during an incident is how we communicate with our customers and reassure them about the situation. How you convey an incident to the people paying for your service can make all the difference when the contract renewal period comes around. At a past company, our sales team regularly reported back from client meetings that clients consistently mentioned how helpful and reassuring the status page updates from the SRE team were, even when our service was fully down.


Building Someone You’d Want To Have A Beer With

As Operations Engineers, we often overlook the user experience of tooling in favour of functionality. CLIs end up with vast sprawling seas of flags and nested commands requiring a minotaur to traverse. Now, I believe that ChatOps is the best path to giving independence and control to everyone in the company. This post is going to talk about why we rebuilt our ChatOps from scratch and what I learned along the way about UX, people and myself.


Capacity Planning in Four Parts: Telling the Future without a Crystal Ball

This blog post is based on a talk I gave at SREcon18 EMEA. If you’d like to see the slides or watch the talk, they are on the USENIX website. Let’s start with a radical concept: you’ve already got 50% of the work done to start capacity planning. A lot of what SRE teams do already feeds directly into understanding and forecasting capacity.


NIC Driver Facts for puppet

Recently, some dedicated hosts had network problems because high traffic triggered a known bug in particular Realtek NICs. Unfortunately, Puppet doesn’t expose facts about the networking equipment of a server, so I wrote the below to expose the NIC drivers in use, their firmware versions and which interfaces use them.


Tidying up puppet reports

A lot of Puppet configurations recommend using Puppet’s tidy resource to manage Puppet reports. The problem with this, though, is that in order to delete each file, Puppet creates a file resource in state.yaml. Because of this, the state file grows pretty quickly, and I’ve seen it slow down Puppet runs after a certain point.
