When to kill the canary
For a while now I have been thinking about introducing canary releases to the deployment pipeline we have set up at my current project. Our move to Kubernetes as an infrastructure framework was paired with a restructuring of our internal architecture (moving away from a bloated monolith). This meant that we suddenly had a lot of new, greenfield systems where we could push our code directly to production, with little risk of unknown consequences thanks to our thorough understanding of the systems (and good test coverage).
However, the pipeline is still lacking somewhat. We have set up a pretty good monitoring system that gives us an easy overview of exceptions, heavy resource usage and other problems that might occur in the system, but it is all a bit... just there. I read all over the Internet that this monitoring needs to be combined with alerts, but other than loosely mentioning resource usage and exception rates, the guides stop pretty abruptly.
For day-to-day operations, what is considered interesting enough to warrant an alert? Every single 500 error? An increase in 500 errors from a particular service? How big should that increase be? Should every single web application be aware of exactly how much CPU and memory it is using, and how much it should be using?
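I do not have a firm answer, but here is a Python sketch of the kind of rule I have in mind (the thresholds and the function are hypothetical, not something we actually run): alert on a clear increase in a service's 5xx rate over a window, rather than on every single 500.

```python
# Hypothetical thresholds -- these would need tuning per service.
MIN_REQUESTS = 100        # do not alert on a statistically tiny sample
RELATIVE_INCREASE = 3.0   # alert if the 5xx rate triples compared to baseline
ABSOLUTE_FLOOR = 0.02     # ...and is at least 2% of all requests

def should_alert(window_5xx: int, window_total: int, baseline_rate: float) -> bool:
    """Alert on a clear increase in 5xx errors, not on every single one."""
    if window_total < MIN_REQUESTS:
        return False  # too little traffic in the window to say anything
    window_rate = window_5xx / window_total
    return (window_rate >= ABSOLUTE_FLOOR
            and window_rate >= RELATIVE_INCREASE * max(baseline_rate, 1e-6))
```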
I guess these metrics are what the Google SRE philosophy calls Service Level Indicators (SLIs). They claim:
If you’re building a system from scratch, make sure that SLIs and SLOs are part of your system requirements. If you already have a production system but don’t have them clearly defined, then that’s your highest priority work.
(For context, a Service Level Objective (SLO) is the target you set on the SLI range for your service, e.g. 99.99% for the SLI "uptime".)
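To make the terms concrete, here is a minimal sketch with availability as the example SLI (the numbers are made up): the SLI is the measured fraction of successful requests, and the SLO is just the target we compare it against.

```python
def availability_sli(successful_requests: int, total_requests: int) -> float:
    """SLI: the fraction of requests that succeeded."""
    if total_requests == 0:
        return 1.0  # no traffic, nothing failed
    return successful_requests / total_requests

SLO = 0.9999  # the target: 99.99% of requests should succeed

sli = availability_sli(successful_requests=999_950, total_requests=1_000_000)
print(f"SLI = {sli:.5f}, SLO met: {sli >= SLO}")  # SLI = 0.99995, SLO met: True
```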
This has started to move away from the blog title somewhat, but SLIs are particularly interesting in the case of canary releasing, because the hardest part of canary releasing is deciding when to kill the canary. How do you know that the new deployment is worse than the previous one?
It seems to me that the criteria for rejecting a deployment through canary releases should be way stricter than the triggers for sending alerts in day-to-day operations. And some indicators make little sense, because they are binary and unlikely to occur (e.g. uptime or critical disk usage). Or I guess, if the whole application goes down, that is a pretty good indicator that you should roll back.
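As a sketch of what such a stricter rule could look like (hypothetical thresholds, not something we have implemented): compare the canary's error rate against the stable version's over the same window, and kill the canary as soon as it is clearly worse.

```python
def should_kill_canary(canary_errors: int, canary_total: int,
                       stable_errors: int, stable_total: int,
                       max_ratio: float = 1.5,
                       min_requests: int = 50) -> bool:
    """Kill the canary if its error rate is clearly worse than the stable one's.

    Deliberately stricter than an alert: even a modest regression stops the rollout.
    """
    if canary_total < min_requests:
        return False  # not enough canary traffic yet to judge
    canary_rate = canary_errors / canary_total
    stable_rate = stable_errors / max(stable_total, 1)
    # Allow some slack over the stable rate, plus a small absolute floor so a
    # single error against a near-zero baseline does not kill the canary.
    return canary_rate > max(stable_rate * max_ratio, 0.01)
```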
I am also not convinced that simply relying on 500 rates and resource usage is sufficient, because unless your system is quite heavily used, this rate can vary quite a bit on its own. Sometimes users might not trigger already existing broken functionality until after you deploy your new release, and the failure is then falsely attributed to the release. For an application that is lightly used, you could need days to get a reliable picture of your metrics, and a single canary release would take ages.
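One way to reason about this (again just a sketch, not something we have in place) is to treat it as a two-proportion z-test: only attribute the extra errors to the canary when the difference is statistically significant. With little traffic the standard error stays large, which is exactly why a lightly used system needs so long to give a reliable answer.

```python
from math import sqrt

def canary_significantly_worse(canary_errors: int, canary_total: int,
                               stable_errors: int, stable_total: int,
                               z_threshold: float = 2.33) -> bool:
    """Two-proportion z-test: is the canary's error rate significantly higher?

    With little traffic the standard error stays large, so even a sizeable
    difference in rates will not be flagged -- the "you could need days"
    problem for lightly used applications.
    """
    p_canary = canary_errors / canary_total
    p_stable = stable_errors / stable_total
    pooled = (canary_errors + stable_errors) / (canary_total + stable_total)
    se = sqrt(pooled * (1 - pooled) * (1 / canary_total + 1 / stable_total))
    if se == 0:
        return False  # no errors observed on either side
    return (p_canary - p_stable) / se > z_threshold
```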
Going back to the quote from Google, it does seem smart to approach this from a business and systems angle. What are the possible failure scenarios of a deployment, and what would their consequences be? For lightly used systems, the consequences of a poor deployment could still be severe. I see that this is probably where I need to focus.
I am aware that I am exposing quite a few holes in my understanding and knowledge with this post. And that is sort of the intention; by exploring my thoughts on this matter I can distill the knowledge I do actually have, and hopefully start a discussion or two on how to proceed. Please contact me or leave a comment if you have any thoughts!