Something often overlooked during an incident is how we communicate with our customers and reassure them about the situation. How you convey an incident to the people paying for your service can make all the difference when contract renewal comes around. At a past company, our sales team regularly reported back from client meetings that clients consistently mentioned how helpful and reassuring the SRE team’s status page updates were, even when our service was fully down. We need to stop treating our customers like they’re non-technical children and give them information that helps them make their use of our service more reliable in the long term.
The most likely and easiest point of contact between a technical team and its customers is the status page. It’s the first port of call when you suspect a third party is causing issues or not behaving itself. It doesn’t really matter what yours looks like as long as it lets people view timestamped updates posted by your oncall team during an incident.
A good status page helps build trust with our users. It gives them a glimpse into how we behave during incidents and shows how committed we are not only to reducing the number and impact of incidents, but also to being transparent about them (sometimes sharing your mistakes is a good way of shaming yourself into fixing them!).
This will be an ongoing record of past and present incidents so our customers can have a reasonable and current idea of: 1) how well our services are performing; and 2) whether an issue they are experiencing is related to a wider problem or isolated to them.
Assuming you follow something like the Incident Command System, this task can either be done by the Incident Commander or delegated to someone else. It’s up to the person leading the incident to decide who posts updates and how frequently.
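To make the idea of a record of timestamped updates a little more concrete, here is a minimal sketch of how an incident and its updates might be modelled. The class and field names (`Incident`, `StatusUpdate`, `posted_at` and so on) are hypothetical and not taken from any particular status page product; treat it as an illustration rather than a recommended schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class StatusUpdate:
    posted_at: datetime  # when the oncall engineer posted the update
    message: str         # the text customers will read


@dataclass
class Incident:
    title: str
    started_at: datetime
    resolved_at: Optional[datetime] = None
    updates: List[StatusUpdate] = field(default_factory=list)

    def post(self, message: str, now: Optional[datetime] = None) -> StatusUpdate:
        """Append a timestamped update, defaulting to the current UTC time."""
        update = StatusUpdate(
            posted_at=now or datetime.now(timezone.utc),
            message=message,
        )
        self.updates.append(update)
        return update

    @property
    def is_ongoing(self) -> bool:
        """An incident without a resolution time is still in progress."""
        return self.resolved_at is None
```

Keeping resolved incidents alongside ongoing ones is what lets a customer check both how the service has performed over time and whether their current problem is part of something wider.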
Simple answer: as soon as possible.
Until we’ve posted something, our users won’t know that we know something is wrong and are working to mitigate it. Generally, if you work for a sizeable company, you’ll have experienced the floods of emails to support with “my shit’s broken, yo! WHERE’S YOUR STATUS PAGE?” and while it’s true you can never eliminate all of those, putting up something as soon as you are alerted to an incident is definitely going to help minimise them a lot (and your customers will appreciate you more for it).
With that said though, there is a trade-off in how quickly an update goes out. You need to find a balance between providing accurate and relevant information (what the exact impact is and the actual start time of an incident) and responding fast. There’s no definitive answer here; it’s mostly something you’ll work out with experience. But remember: you can always update the status page to be more specific later, so it’s better to err on the side of speed than to wait too long.
The team should sit down and decide on an appropriate interval to use for incidents going forward. In the past, I’ve found that 20-minute intervals work well enough as a starting point.
In that case, you should aim to have an update out every 20 minutes with relevant information - do not copy and paste the same message. If we are at a point where there is no new information to be added, it might be time to consider increasing the update interval.
Increasing the interval should be explicitly stated in an update. If we’re moving from 20 minutes to 1 hour, we should state the current impact and tell our customers to expect the next update within the hour, or sooner if more information becomes available.
Increasing the interval can also be a good way to free up an extra person from the communication role if there aren’t enough hands on deck.
Giving people a reasonable timeframe for more information eases stress and allows them to plan their own mitigation strategy if they rely on our service. We also don’t want someone furiously refreshing our page for hours waiting for an update that isn’t coming.
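The cadence described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the 20-minute default is the starting point suggested earlier rather than a rule, and the wording of the notice is just one way of phrasing it.

```python
from datetime import datetime, timedelta

# Starting interval suggested above; each team should tune this.
DEFAULT_INTERVAL = timedelta(minutes=20)


def next_update_due(last_posted_at: datetime,
                    interval: timedelta = DEFAULT_INTERVAL) -> datetime:
    """Return the time by which the next update should be posted."""
    return last_posted_at + interval


def is_overdue(last_posted_at: datetime, now: datetime,
               interval: timedelta = DEFAULT_INTERVAL) -> bool:
    """True once the promised interval has passed without a new update."""
    return now > next_update_due(last_posted_at, interval)


def is_repeat(new_message: str, previous_message: str) -> bool:
    """Flag a copy-pasted update, ignoring case and surrounding whitespace."""
    return new_message.strip().lower() == previous_message.strip().lower()


def interval_change_notice(current_impact: str, new_interval: timedelta) -> str:
    """Build an update that restates the impact and announces the new interval."""
    minutes = int(new_interval.total_seconds() // 60)
    return f"{current_impact} Our next update will be in {minutes} minutes or less."
```

If `is_repeat` keeps firing because there is genuinely nothing new to say, that is the signal to lengthen the interval and say so explicitly, for example with `interval_change_notice("API response times remain elevated.", timedelta(hours=1))`.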
I would wager you aren’t lucky enough to have a staff of technical writers on hand to pen your updates, so these messages fall to regular ol' engineers. That also means we don’t expect you to write the next literary masterpiece. What we do want to accomplish is:
Take some time to consider the right tone of your communication, maybe write a doc outlining some common examples for the team.
The title is a very important part of an update. Sometimes it’s the only thing the user sees before deciding whether to click through to the rest, so we need to help the user answer the following question: “Do I need to care about this?” How do we do this? By stating how it affects them as clearly as possible.
Is quite vague and very poor at indicating what’s actually affected.
“Increased API Response Times”
Is much better and clearly identifies the service affected, what symptoms to look for and how that may affect the customer.
How you write this is going to come down to a mixture of experience and empathy. You’re going to need to put yourself in your customer’s shoes and think “what would I care about if I saw this incident?”
You’re going to need a couple elements for certain though:
For more info:
These are all examples of postmortems but you’ll notice that a lot of the language is the same and some include all the status updates as well. They’re useful examples as both status updates and postmortems and I would highly recommend reading all the different styles.