Something often overlooked during an incident is how we communicate with our customers and reassure them about the situation. How you convey an incident to the people paying for your service can make all the difference when contract renewal comes around. At a past company, our sales team regularly reported back from client meetings that clients consistently mentioned how helpful and reassuring the status page updates from the SRE team were, even when our service was fully down. We need to stop treating our customers like they’re non-technical children and give them information that will help them make their usage of our service more reliable in the long term.
Main Point Of Contact
The most likely and easiest point of contact between a technical team and its customers is the status page. It’s the first port of call when you suspect a third party is causing issues or not behaving itself. It doesn’t really matter what yours looks like as long as it lets people view timestamped updates posted by your oncall team during an incident.
A good status page helps build trust with our users and shows a glimpse of how we behave during incidents and how committed we are, not only to reducing the number and impact of incidents, but to being transparent about them (sometimes sharing your mistakes is a good way of shaming yourself into fixing them!).
This will be an ongoing record of past and present incidents so our customers can have a reasonable and current idea of: 1) how well our services are performing; and 2) whether an issue they are experiencing is related to a wider problem or just isolated to them.
Who Writes The Message
Assuming you follow something like the Incident Command System, this task can either be done by the Incident Commander or delegated to someone else. It’s up to the person leading the incident to decide who posts the updates and how frequently.
How Quickly Should An Update Go Out?
Simple answer: as soon as possible.
Until we’ve posted something, our users won’t know that we know something is wrong and are working to mitigate it. Generally, if you work for a sizeable company, you’ll have experienced the floods of emails to support with “my shit’s broken, yo! WHERE’S YOUR STATUS PAGE?” and while it’s true you can never eliminate all of those, putting up something as soon as you are alerted to an incident is definitely going to help minimise them a lot (and your customers will appreciate you more for it).
With that said though, there is a trade-off in how quickly an update goes out. You need to find a balance between providing accurate and relevant information (what the exact impact is and the actual start time of an incident) and having a fast response. There’s no definitive answer here; it’s mostly something you’ll work out with experience. But remember: you can always update the status page to be more specific later, so it’s better to err on the side of speed than to wait too long.
How Often Should We Update?
The team should sit down and decide on an appropriate interval to use for incidents going forward. In the past, I’ve found that 20-minute intervals work well enough as a starting point.
In that case, you should aim to have an update out every 20 minutes with relevant information - do not copy and paste the same message. If we are at a point where there is no new information to be added, it might be time to consider increasing the update interval.
Increasing the interval should be explicitly stated in an update. If we’re moving from 20 minutes to 1 hour, we should state the current impact and say that our customers should expect an update in an hour or less if more information becomes relevant.
Increasing the interval can also be a good way to free up an extra person from the communication role if there aren’t enough hands on deck.
Giving people a reasonable timeframe for more information eases stress and allows them to plan their own mitigation strategy if they rely on our service. We also don’t want someone furiously refreshing our page for hours waiting for an update that isn’t coming.
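If you want to nudge whoever holds the communication role, the cadence above is easy to sketch in code. This is only a rough illustration, not a real status page integration: `UpdateSchedule` and its methods are names I made up, and the 20-minute default is just the starting point mentioned earlier.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class UpdateSchedule:
    """Hypothetical helper that tracks when the next status update is due."""

    # Assumed default: the 20-minute starting interval suggested in the text.
    interval: timedelta = timedelta(minutes=20)

    def next_due(self, last_posted: datetime) -> datetime:
        """Return the time by which the next update should be posted."""
        return last_posted + self.interval

    def is_overdue(self, last_posted: datetime, now: datetime) -> bool:
        """True once the current interval has fully elapsed."""
        return now >= self.next_due(last_posted)

    def widen(self, new_interval: timedelta) -> str:
        """Increase the interval and return a sentence to put in the update.

        The change has to be announced to customers, so this returns the
        wording instead of silently changing the cadence.
        """
        if new_interval <= self.interval:
            raise ValueError("widen() only increases the interval")
        self.interval = new_interval
        minutes = int(new_interval.total_seconds() // 60)
        return f"Next update in {minutes} minutes, or sooner if we learn more."


if __name__ == "__main__":
    schedule = UpdateSchedule()
    last = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    print(schedule.next_due(last).strftime("%H:%M UTC"))
    print(schedule.widen(timedelta(hours=1)))
```

The one design choice worth copying is that widening the interval hands back the sentence to publish, which makes it hard to change the cadence without telling anyone.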
The Language Of Good Status Updates
I would wager you aren’t lucky enough to have a staff of technical writers on hand to pen your updates, so these messages fall to regular ol’ engineers. That also means we don’t expect you to write the next literary masterpiece. What we do want to accomplish is:
- Clear communication
- The right amount of details
- A good representation of the company in the public eye
Take some time to consider the right tone of your communication, maybe write a doc outlining some common examples for the team.
The Title
The title is a very important part of an update. Sometimes it’s the only thing the user sees before making the decision of clicking to see the rest or not, so we need to help the user answer the following question: “Do I need to care about this?”. How do we do this? By stating how it affects them as clearly as possible.
“Website issues”
Is quite vague and very poor at indicating what’s actually affected.
“Increased API Response Times”
Is much better and clearly identifies the service affected, what symptoms to look for and how that may affect the customer.
The Message
How you write this is going to come down to a mixture of experience and empathy. You’re going to need to put yourself in your customer’s shoes and think “what would _I_ care about if I saw this incident?”
You’re going to need a few elements for certain though:
- The exact time the problem started. Be sure to include a timezone to avoid any ambiguity. It’s also important to note that this is not necessarily the same time you opened the status page or were alerted to the problem.
- A clear description of what is and isn’t impacted. This should include information a customer can use to diagnose if their particular issue is related to the incident. It’s also a great time to state what isn’t affected by the incident and relieve some stress. For example: if you ingest user data but are just having delays processing it, it can be good to state that ingestion is unaffected and no user data has been lost.
- The exact time of each update and, once it’s over, the exact time the issue was resolved. Again, include timezones; these will help customers build a timeline to compare to their own logs.
- A technical description of the problem. Your customers are smarter than you give them credit for and deserve to be treated like adults. Be open in your communication of what the problem is because it will also build respect for the next element.
- What are we doing to resolve the problem? Don’t just say “we are dealing with the situation” or “we’re applying a fix” because that isn’t fair to your customers. This is also going to be very useful to you when you create a postmortem for the incident.
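One way to make sure none of these elements gets forgotten at 3am is to turn the list into a template. The sketch below is an assumption about how you might structure that, not any particular status page tool's API: `StatusUpdate`, `utc_stamp` and every field name are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


def utc_stamp(moment: datetime) -> str:
    """Format a timezone-aware datetime as 24hr UTC; reject naive ones."""
    if moment.tzinfo is None:
        raise ValueError("timestamp needs a timezone to avoid ambiguity")
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")


@dataclass
class StatusUpdate:
    """Hypothetical template holding the elements listed above."""

    title: str             # how it affects the customer, stated plainly
    started_at: datetime   # when the problem began, not when we noticed
    posted_at: datetime    # when this update was written
    impacted: str          # symptoms a customer can match against
    not_impacted: str      # what is safe, to relieve some stress
    cause: str             # the technical description
    action: str            # what we are actually doing about it
    resolved_at: Optional[datetime] = None

    def render(self) -> str:
        """Build the plain-text update in a fixed, predictable order."""
        lines = [
            self.title,
            f"Posted: {utc_stamp(self.posted_at)}",
            f"Started: {utc_stamp(self.started_at)}",
        ]
        if self.resolved_at is not None:
            lines.append(f"Resolved: {utc_stamp(self.resolved_at)}")
        lines += [
            f"Impacted: {self.impacted}",
            f"Not impacted: {self.not_impacted}",
            f"Cause: {self.cause}",
            f"What we are doing: {self.action}",
        ]
        return "\n".join(lines)
```

Because every field except `resolved_at` is required, you can't construct an update that skips the start time or the "what isn't affected" line, which is the whole point.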
Some General Tips
- Don’t apologise or use “sorry for the inconvenience” bullshit language
- Be clear, be concise and state the impact up front
- Be as specific as possible, but only when it helps
- 24hr UTC is the crowned king of timestamps
- Generic maintenance pages frustrate/scare people
- Please, for the love of god do not copy and paste the same variation of the same message over and over and over again
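If you want a guard rail for that last tip, a crude similarity check against the previous update will catch most copy-paste jobs. This is a minimal sketch using Python's standard `difflib`; the function names and the 0.9 threshold are my own assumptions, so tune them against your real updates.

```python
from difflib import SequenceMatcher


def normalise(text: str) -> str:
    """Lowercase and collapse whitespace so trivial edits don't hide a repeat."""
    return " ".join(text.lower().split())


def looks_copy_pasted(previous: str, draft: str, threshold: float = 0.9) -> bool:
    """Return True when the draft is nearly identical to the previous update.

    The threshold is an arbitrary starting point, not a recommended value.
    """
    ratio = SequenceMatcher(None, normalise(previous), normalise(draft)).ratio()
    return ratio >= threshold
```

A check like this only flags the draft for a second look. It can't tell you whether the new message says anything useful.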
For more info:
- “How To Write A Status Page Update” by Fran Garcia
- “The Bullshit Of Outage Language” by David Heinemeier Hansson
- “How To Write A Good Status Update” by Blake Thorne
Examples
These are all examples of postmortems but you’ll notice that a lot of the language is the same and some include all the status updates as well. They’re useful examples as both status updates and postmortems and I would highly recommend reading all the different styles.