This blog post is based on a talk I gave at SREcon18 EMEA. If you’d like to see the slides or watch the talk, they’re available on the USENIX website.
Let’s start with a radical concept: you’ve already got 50% of the work done to start capacity planning. A lot of what SRE teams do already feeds directly into understanding and forecasting capacity. SLOs, monitoring, alerting – it’s all relevant to creating and enforcing a capacity plan.
I know, I know. You’re not sure if you should even bother going through all the hassle of creating a capacity plan, monitoring it, and tweaking it. “We’ve been fine without one for years,” I hear people plead. I’d like to encourage you to take on a few (or all) of these steps and see if your cash flow and your oncall shifts stabilise a bit afterwards.
For the business, we need to forecast and analyse capacity to make sure we’re only spending money on what we actually need, and to ensure we can anticipate and handle sharp growth in a timely manner. Dealing with a sudden burst in traffic is hard enough, but if you don’t plan your capacity, you might end up gritting your teeth and losing business until you can procure more servers to beef things up.
For SREs, the big win is that an enforced capacity plan helps avoid 3AM pages. If you’ve ever been woken in the middle of the night by an alert that could have been solved by adding another server or two to the load balancer, you need capacity plans and you need them yesterday.
The goal of capacity planning should be to drive the system to the appropriate level of risk for the lowest possible cost. If you accept more risk, your plans will cost less to execute; if you want less risk, you’ll pay a lot more to beef up your service and keep that risk low.
Intents and Service Level Objectives (SLOs) are definitely something you already have for your systems or, at least, you should have – but we know that’s not always the case. It wouldn’t be an SRE blog post without mentioning SLOs and error budgets.
The start of any new service begins as a conversation between Service Owners and SREs. What does the business want? What does the service require? Understanding intent begins with an SLO. Often, this is a back-and-forth between SRE and the other parties to help establish realistic, appropriate and oncall-friendly SLOs.
A conversation about SLOs with any service owner will be a long one but it’s worth it to come out the other end with a clear understanding of what they expect, what we can provide, and what’s definitely unacceptable. The SLO shapes the risk we’re willing to tolerate in the system.
Unless you’re in a Google-like company, if someone drops a project on your desk and says “the SLO is five 9s of uptime”, your first thought is definitely going to be “this person doesn’t know what that means”. It’s more than likely that the service owner doesn’t understand that 5 minutes (and 15 seconds) of downtime a year is going to cost a fortune to maintain and leaves very little time for your error budget and their new features.
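If you ever need to sanity-check those numbers, the arithmetic is simple enough to script. This is just the raw maths of availability targets, nothing service-specific:

```python
# Downtime allowed per year at each availability level.
# Five nines works out to roughly 5 minutes 15 seconds per year.
MINUTES_PER_YEAR = 365.25 * 24 * 60

for nines in (2, 3, 4, 5):
    availability = 1 - 10 ** -nines
    downtime = MINUTES_PER_YEAR * (1 - availability)
    print(f"{availability:.5f} uptime -> {downtime:.1f} minutes of downtime/year")
```

Seeing “five nines” translated into a downtime budget is usually enough to restart the SLO conversation on more realistic terms.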
Determining a good SLO is a human task that has some technical requirements, not the other way around. You will be able to determine the best SLOs by understanding the people who built the service and what the business expects.
Also be wary of any request that pushes a lot of knowledge of the resource consumption and load tolerance back on the SRE. Often, it’s easier to give the service owner appropriate guidance on figuring out these figures – teach a man to fish style. It’ll make your life easier in the long-run.
Part Two: Service Trigger Boogaloo. Service Triggers are the factors that affect your Service Level Indicator (SLI) – they’re the thing that drives it one way or another.
In capacity planning, these are called Driver Metrics. Finding these metrics can be tricky, but thinking logically about the bounds of your system (CPU? RAM? Throughput? Queue size?) and reviewing your SLI over the last couple of months can give you a good feel for what drives changes in it.
When plotting your SLI, it’s good to search for inflection points in the graph and line them up with changes in other metrics. Inflection points are points at which the trend of the graph changes significantly.
Inflection points aren’t always very obvious but it’s worth paying close attention to what you plot because finding them can lead to more intuitive links/differences between metrics.
Above, circled in red, are examples of inflection points. They denote a significant change in the trend of user numbers and, when compared with our server capacity for this particular service, line up with capacity changes. These correlations are a good indicator that the number of users affects this particular service.
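If eyeballing the graph isn’t enough, you can flag candidate inflection points programmatically. This is a minimal sketch (the smoothing window and threshold are illustrative values, not tuned recommendations): smooth the series, then look for big jumps in its slope.

```python
def moving_average(series, window=7):
    """Smooth a series with a trailing moving average."""
    return [
        sum(series[max(0, i - window + 1): i + 1]) / (i - max(0, i - window + 1) + 1)
        for i in range(len(series))
    ]

def inflection_points(series, threshold=1.0):
    """Return indices where the smoothed slope changes by more than threshold."""
    smooth = moving_average(series)
    slopes = [b - a for a, b in zip(smooth, smooth[1:])]
    return [
        i + 1
        for i, (s1, s2) in enumerate(zip(slopes, slopes[1:]))
        if abs(s2 - s1) > threshold
    ]

# Toy data: two flat weeks, then a sudden ramp.
rates = [100] * 14 + [100 + 20 * d for d in range(14)]
print(inflection_points(rates, threshold=1.0))  # flags the start of the ramp
```

Anything this flags is only a candidate – you still need the human step of lining it up with other metric changes to decide whether it means anything.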
A quick note on a common pitfall: often, it’s easy to think your system scales with your users but, unless you’re an authentication service, it’s very uncommon for the number of users to have a direct impact on capacity. Generally, it’s something the users do that’s the actual culprit.
For example: At Hosted Graphite, it’s not the number of users that’s important. It’s the number of datapoints each user sends or the datapoints per protocol that informs our decision to scale our ingestion services.
You’ve done the research, you’ve sat down and hammered out a reasonable SLO, and now it’s time to discuss when you should increase or decrease capacity (yes, decrease – that one’s very important) and what kind of buffer you need around those actions.
Document everything! People can learn from what you’ve done to better refine your plan. If you don’t leave any form of trail about how your investigation was conducted, what insights you gained or what methods you learned about, you’re going to have people retreading the same ground every time a new capacity plan gets made.
Be the change you want to see in the world and start documenting everything you do.
When you come back to re-evaluate this plan in a couple months, it will make so much more sense if you know why you chose this metric over that metric or made this assumption based on factor X.
Any insight you gain from the metrics should be actionable. That means building graphs on dashboards that explain, in simple terms, whether capacity needs to be increased or decreased right now. There’s no benefit to throwing a bunch of figures at an engineer or business person and expecting them to have read the five-page capacity plan doc from three months ago.
Most importantly though, don’t bury the more technical details. At some point, you should set up a non-paging alert for changing your capacity. When that alert goes off, the oncall person(s) shouldn’t have to go hunting through several dashboards to assess whether or not the recommended action in the alert is the right call.
Put the context in a collapsible set of dashboard rows below the actions. Provide the driver metrics, the current capacity against its trigger threshold, and a link back to the capacity plan itself.
One thing I haven’t mentioned yet: adding a buffer to your service trigger gives you the opportunity to only ever scale for capacity during business hours. You should be able to safely ignore these alerts outside of your normal work schedule.
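Here’s a sketch of what that buffered, non-paging check might look like. The 20% buffer, the thresholds, and the figures are all illustrative assumptions, not recommendations:

```python
from dataclasses import dataclass

@dataclass
class CapacityCheck:
    current: float        # e.g. datapoints/s currently being ingested
    limit: float          # measured capacity of the current fleet
    buffer: float = 0.20  # act while still 20% short of the limit

    def recommendation(self) -> str:
        trigger = self.limit * (1 - self.buffer)
        if self.current >= trigger:
            return "increase"  # add servers during the next business day
        if self.current < trigger * 0.5:
            return "decrease"  # fleet looks oversized; consider scaling down
        return "ok"

print(CapacityCheck(current=85_000, limit=100_000).recommendation())  # increase
print(CapacityCheck(current=30_000, limit=100_000).recommendation())  # decrease
```

The buffer is what buys you the time: by triggering well before the hard limit, the action can comfortably wait for office hours instead of paging someone at 3AM.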
Lastly, we’ve got to take everything we’ve learned and pull it all together to project those estimates into the future. This section is the most important one to come back to in a couple of months, to assess whether the plan is still working or not.
To begin with, if you’re not tracking your capacity right now, start immediately. It’s so much harder to do any of this if you haven’t been monitoring the capacity of your services already.
This is probably the easiest indicator of whether your plan is working or not – does what we predicted line up with what actually happened? How much did it differ? Maybe we can just change the amount we increase by?
Another advantage to recording capacity is that it’s so much easier to predict the future if you can use all your data from the past.
Those inflection points and capacity increases we talked about earlier are exactly what informed our service triggers, so we should also use them to predict the kind of timeframe we expect between increases. Otherwise, your alerts and dashboards are still just reactive – you need to know where you’re going.
One of the most important points I’d like to hammer home is that it’s not just you looking at all these graphs and forecasts. Put that shit in a table and show off your hard work.
Other engineers and the business folk need to know what the infrastructure should look like in 5, 10, 20 months.
For your table, I recommend leaving a couple columns so you can record the actual state and dates of increases to see how well they match up against your plan. Use these to re-evaluate your capacity plan and determine when it’s not working anymore.