As Operations Engineers, we often overlook the user experience of tooling in favour of functionality. CLIs end up as vast, sprawling seas of flags and nested commands, a labyrinth only the Minotaur could navigate.
UX is an important part of tooling. As a user, well-thought-out interfaces reinforce confidence in the actions you take and the output they produce. No tool is perfect, but building a consistent and uniform experience allows any engineer to provide a safe, repeatable path for everyone else.
Trust me, it’s terrifying to hand over the keys of the castle to anyone outside of our meme-filled ops cliques. I used to fear the kind of destruction they’d wreak.
Now, I believe that ChatOps is the best path to giving independence and control to everyone in the company. This post is going to talk about why we rebuilt our ChatOps from scratch and what I learned along the way about UX, people and myself.
Early last year, Slack announced that they would be turning down their IRC gateway, which was our then-ChatBot's only way of communicating.
At this point, our ChatBot was an integral tool to the SRE and development teams. Over a number of years, we’d built up an arsenal of tooling, and to only have a month to either port or rebuild all of that was an impossible-sounding task.
In the end, we opted to rebuild it for a couple of reasons:
So there we were. Well, there I was: committed to a tight deadline, for a vital piece of infrastructure, that had to be rebuilt from the ground up. No pressure, Evan. After all, you've been working here, what, 3 months?
After quite a bit of research, and a month so tightly packed I didn't even have time to feel like an imposter, we had (nearly) everything we'd had before, but built specifically to be consistent, predictable and, my always-favourite, pretty.
It's been nearly a year since I first embarked on this project and I'm still one of its primary contributors.
What follows are the principles on which I designed the bot, combined with all the learnings from then and since.
The old adage of computer science is that naming is the hardest thing you'll ever do. There's so much to a good name. From early on, all command names followed a convention of the format
<noun> [<subnoun>] <action> [<options>] [<parameters>]
Nouns and subnouns refer to the resource being manipulated and the action is the verb being applied to them. For example, adding a puppet class to a group of servers in our External Node Classifier (ENC) would use:
enc class add <group_name> <class>
The main benefit of a good naming convention is predictability. Predictability makes it easy to guess the command for what you want to do, which also leads to discoverability: if you can predict what a command should be, you don't need to look at the docs to find it. This had a hidden benefit: a command that doesn't exist yet (but fits the pattern) becomes a feature suggestion. Someone would look at the commands we have, make the logical leap to something that doesn't yet exist and then, a week later, it would be a thing.
Looking back now, what I would do differently is put less focus on positional arguments. I made the mistake of creating a substructure, like the noun/subnoun pattern, for the ENC commands. While it made logical sense to follow a hierarchy like <cluster> <server_group> <class> <parameter> <value>, that's a lot of cognitive load to put on someone. The biggest issue wasn't even remembering the order (most people got that it ran from biggest to smallest magnitude); it was that if you forgot a piece of the chain, the command couldn't tell which piece was missing. If I had the chance to change it, I would standardise a set of flags across all the commands to replace that hierarchy.
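To make that concrete, here's a sketch of what a flag-based replacement for the positional hierarchy might look like, using Python's argparse. The command and flag names are hypothetical, not the bot's real interface; the point is that a missing flag produces an explicit error naming the missing piece, rather than silently shifting every later argument.

```python
import argparse

# Hypothetical sketch: replacing the positional hierarchy
# <cluster> <server_group> <class> <parameter> <value>
# with named flags. Forgetting --param now fails loudly with
# "the following arguments are required: --param" instead of
# the parser mis-assigning the remaining values.
parser = argparse.ArgumentParser(prog='enc param set')
parser.add_argument('--cluster', required=True)
parser.add_argument('--group', required=True)
parser.add_argument('--class', dest='klass', required=True)  # 'class' is a Python keyword
parser.add_argument('--param', required=True)
parser.add_argument('--value', required=True)

args = parser.parse_args([
    '--cluster', 'eu-west', '--group', 'web',
    '--class', 'nginx', '--param', 'worker_processes',
    '--value', '4',
])
```

The flags can also appear in any order, so nobody has to remember the biggest-to-smallest convention at all.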
At first, it was a happy accident of modularity that led to all of our colours being consistent, but it didn't take long to realise that colour was going to be a big part of using our newly-discovered Slack powers.
We started with the three traffic lights and haven't felt the need to add any other colours since. Colour became a big part of the ChatBot's future, but it's the minimalism of the palette that makes the colours easy to understand.
There’s something really simple you can do when defining the colours in your code: use a named dictionary.
```python
# The dictionary name is illustrative; the hex values are ours.
COLOURS = {
    'FAILED': '#CD4B41',
    'RUNNING': '#EBD746',
    'SUCCEEDED': '#99c140',
}
```
Giving each of the colours a name means that every time they appear in your code, it’s clear why you’re using that colour and makes it a hundred times more readable. I’ve a background in web design and I hated nothing more than people hard-coding colour hexcodes across projects. It’s horrible to read, it makes it hard to change and, inevitably, someone makes a small change to a colour that makes it near-impossible to mass-edit.
The above names worked well for us, but I've seen teams use the Bootstrap colour names (success, warning, danger and so on), which have always seemed pretty clear to me too, although primary, secondary and muted probably don't make quite as much sense here.
Y’know, this seems like something that should have been majorly obvious. Every time someone interacts with the bot, reply with a simple “I got you fam” message.
A subtle but very annoying problem with the old bot was that there wasn't always any signal that it had received your message. Combine that with the fact that incorrect or misspelled commands also wouldn't get a response, and it was often very unclear whether the bot was broken or you'd ham-fisted the keyboard again by accident.
What's worked amazingly for us is having two Python decorators: one to immediately respond and one to delete the response when the command returns. Whenever someone wants to add to the codebase, you can simply slap these two decorators on any new command and they'll take care of responsiveness for you.
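A minimal sketch of that acknowledgement pattern, folded into a single decorator for brevity: the `bot.send`/`bot.delete` methods here are hypothetical stand-ins for whatever your chat backend exposes, and the fake backend exists only to show the flow.

```python
import functools

def acknowledge(fn):
    """Post an immediate placeholder reply, then delete it once the
    wrapped command returns (even if it raises)."""
    @functools.wraps(fn)
    def wrapper(bot, msg, *args, **kwargs):
        ack = bot.send(msg['channel'], "On it! :eyes:")
        try:
            return fn(bot, msg, *args, **kwargs)
        finally:
            bot.delete(ack)  # clean up once the real answer is ready
    return wrapper

# Minimal fake backend, just to demonstrate the flow.
class FakeBot:
    def __init__(self):
        self.sent, self.deleted = [], []
    def send(self, channel, text):
        self.sent.append((channel, text))
        return len(self.sent) - 1  # handle for later deletion
    def delete(self, handle):
        self.deleted.append(handle)

@acknowledge
def deploy(bot, msg):
    return "deployed!"

bot = FakeBot()
result = deploy(bot, {'channel': '#ops'})
```

Because the cleanup lives in a `finally` block, the placeholder disappears even when the command blows up, which matters for trust in the bot.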
Responding on receipt of a command is a good first step, but if the command is going to run for a while, you need to signal that to the user. For long-running jobs, add a spinning custom emoji reaction to the message, akin to the loading spinners we're used to. When the job finishes, remove the reaction and return the command's output. This goes a long way to setting a user's expectation of when they'll get an answer.
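That spinner pattern can be sketched as a context manager. I'm assuming a client shaped like the Slack Web API's `reactions.add`/`reactions.remove` methods (e.g. slack_sdk's WebClient), and that `loading` is a custom animated emoji in your workspace; the fake client below just records the calls.

```python
import contextlib

@contextlib.contextmanager
def loading_reaction(client, channel, ts, emoji="loading"):
    """Add a spinner emoji reaction while a long job runs,
    remove it when the job finishes (or fails)."""
    client.reactions_add(channel=channel, timestamp=ts, name=emoji)
    try:
        yield
    finally:
        client.reactions_remove(channel=channel, timestamp=ts, name=emoji)

# Fake client to demonstrate the call sequence.
class FakeClient:
    def __init__(self):
        self.events = []
    def reactions_add(self, channel, timestamp, name):
        self.events.append(('add', name))
    def reactions_remove(self, channel, timestamp, name):
        self.events.append(('remove', name))

client = FakeClient()
with loading_reaction(client, '#ops', '1234.5678'):
    pass  # the long-running job would go here
```

Wrapping the job in `with loading_reaction(...)` means nobody can forget to clear the spinner on the error path.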
The final thing, which I've already briefly mentioned, is please, for the love of everything, tell the user when they've got a command wrong. A really nice feature of Errbot is that, not only will it respond when a command is wrong, it'll also suggest what it thinks the command name should have been.
Okay, so something I hate about most of the chatbot libraries I’ve encountered (including Errbot), is that they default the overall help text to just a ma-HU-ssive plaintext list of command names with a one-line description. After the initial proof of concept, the one thing I immediately did was override the help function.
Our help function is divided into categories (i.e. “plugin”s in Errbot) and each has a unique colour that lets you differentiate the individual commands when doing a search. Executing the bare help command will return a deck of slack cards (which I’ve just realised are actually called slack attachments) with each card having the colour on the left, the category title and a short description of the category.
Providing a category name to the help command will return a similar deck of cards, but the names and descriptions will be the commands from that category. Having this really basic categorisation makes it easy to explore commands around a topic. For example, we have a DNS category with commands for managing DNS and interacting with AWS/terraform.
An important part of any help command is letting you search through the commands. Providing a search string returns output similar to the category view, but listing the commands that matched your query.
Finally, providing the full command name to the help command will return argparse/CLI-like help output: the help text, the arguments, the options and an explanation of what everything does.
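The top-level deck of cards can be sketched as a list of Slack attachments. The category names, colours and descriptions below are hypothetical, but the payload shape (`color` drawing the bar on the left of each card) is what the Slack attachment API accepts.

```python
# Hypothetical category registry: name -> (colour, description).
CATEGORIES = {
    'DNS': ('#4B9CD3', 'Manage DNS records and the AWS/terraform glue'),
    'ENC': ('#99c140', 'External Node Classifier groups and classes'),
}

def help_deck(categories):
    """Build one Slack attachment ("card") per category.
    The 'color' field renders as the coloured bar on the card's left edge."""
    return [
        {
            'color': colour,
            'title': name,
            'text': description,
        }
        for name, (colour, description) in categories.items()
    ]

deck = help_deck(CATEGORIES)
```

The per-category and search views reuse the same card shape, just with command names and docstrings in place of the category metadata, which keeps all three help modes visually consistent.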
Our old bot had a similar-ish help command, but it only supported the search-string and full-name methods of discovery. Although the new categories and improved friendliness were major improvements, my favourite has been that when you request help, the bot threads its response. This means we no longer have chat channels spammed with help output; large swathes of command information are now tucked away nicely to the side, away from conversations.
Our ChatBot is, to us, a critical part of everyday life and its reliability is paramount. Because some commands are long-running operations, like server provisioning or job tracking, we need a way to know whether the bot is active and in use before we decide to update and restart it.
Before this project, I hadn’t really thought much about read-write locks but it turns out: they are magical and I’m sorry for all the insults while I learned you.
If you're unfamiliar, RW-locks allow an indefinite number of simultaneous read locks, but only one write lock at any one time, and read and write locks are mutually exclusive with each other. Extending the Slack backend, we wrapped the core command delegation with a decorator that acquires a read lock and releases it when the command returns.
Separately, there's a poller that periodically tries to acquire a write lock. If it succeeds, we check for the existence of a sentinel file that signifies a restart request. If the file is present, we shut down the bot and restart the entire process, all through Errbot's API.
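A minimal sketch of that flow (Python's standard library has no readers-writer lock, so this builds a simple one from two mutexes; the sentinel path and poller body are hypothetical, and a production version would want fairness and timeout handling):

```python
import os
import threading

class RWLock:
    """Minimal readers-writer lock: many concurrent readers, one writer."""
    def __init__(self):
        self._readers = 0
        self._lock = threading.Lock()    # guards the reader count
        self._writer = threading.Lock()  # held while readers or a writer are active

    def acquire_read(self):
        with self._lock:
            self._readers += 1
            if self._readers == 1:
                self._writer.acquire()   # first reader blocks writers

    def release_read(self):
        with self._lock:
            self._readers -= 1
            if self._readers == 0:
                self._writer.release()   # last reader lets writers in

    def acquire_write(self, blocking=True):
        return self._writer.acquire(blocking)

    def release_write(self):
        self._writer.release()

RESTART_SENTINEL = '/tmp/chatbot.restart'  # hypothetical path
rwlock = RWLock()

def with_read_lock(fn):
    """Wrap command dispatch so a restart can't happen mid-command."""
    def wrapper(*args, **kwargs):
        rwlock.acquire_read()
        try:
            return fn(*args, **kwargs)
        finally:
            rwlock.release_read()
    return wrapper

def restart_poller():
    """Periodic check: only restart when no command holds a read lock."""
    if rwlock.acquire_write(blocking=False):
        try:
            if os.path.exists(RESTART_SENTINEL):
                pass  # shut down and re-exec via the bot framework's API
        finally:
            rwlock.release_write()
```

The non-blocking `acquire_write` is the key trick: if any command is mid-flight, the poller simply gives up and tries again on the next tick instead of stalling.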
It’s nice, it’s clean and it’s a flow we had in the old bot that was built upon for the new one.
Previously, I've made ChatBots in Go and PHP, but my new favourite is Python 3. At every point we needed it, it was incredibly easy to extend and override any part of the library's functionality. That ease of change made it simple to alter only what we needed within complex workflows, without having to duplicate or disrupt them.
Our code extends the backend to add functionality the library didn't already provide. As we approached the final release, we extended the backend with an API to deterministically load plugins, which let us circumvent Errbot's circular-dependency detection; it never really worked for us and led to a lot of headaches.
Something else we added was a set of Slack-specific API extensions for:
The SRE team I joined back in Dec '17 are all great; they've taught me so much about being an SRE, and I've loved building tooling that puts a lot of those concepts into practice.
When it comes down to it, my job is to try my damnedest to automate myself out of a job. When building tooling like ChatOps, it can be tempting to devote yourself to fully automating everything possible. In reality, though, you don't have the time to hand-carve an artificial intelligence from silicon to understand the intricacies of every intervention; often, it's wisest to automate a complex process but leave the decision to act on it to a human.
We have battle-worn and proven health checks for the machines behind the static IPs on the edge of our infrastructure. Without a healthy server behind those static IPs, we can't receive data from any of our customers, so it's pretty important we ensure those directors are healthy and reachable. As part of our ChatBot, an automated check verifies that every server with one or more static IPs is healthy and, if not, automatically fails those IPs over to servers that are. This is full automation: we can safely rely on the servers' health checks to determine when a failover is needed, and we trust the bot to do it.
On the other hand, we use MySQL databases within our infrastructure, and promoting a replica to a master and doing all the swapping is a fairly involved process. For full DB failovers, we've automated the complexity of the failover task, but the ChatBot has no authority to decide when we do it; that's a human-determined task. We chose partial automation here because the reasoning behind performing a DB failover can be complex, and the effort required to codify that reasoning would outweigh the benefit of the automation. For now, at least.
I’ve learned (and am still learning) about what should be fully automated or what can be done with partial automation and a good runbook.
Within the company, our ChatBot has a regular-ish name (his name is Dougal and we love him) and some character and personality to him, to the point where it felt kind of weird, dehumanising even, not to refer to him by name in this piece. As such, people are very comfortable giving out to it when it messes up. In my experience, the humanity, combined with the knowledge that it's not actually human, makes people much more likely to complain at it directly.
It’s still something I have trouble with today because it feels like I am the bot. I gave birth to this being and I feel kind of responsible for what it does and how it behaves – even if I didn’t write any of the code in the command that’s currently exploding.
Everyone who makes anything has to go through that period of taking criticism too personally, and I'm still pretty bad at separating myself from my work. However, I have found a couple of things that help: