The Joys and Challenges of Simple Tasks
I will be the first to admit that there is a certain joy that comes only from replacing a repetitive manual task with a short snippet of script that runs it for you. No longer do you have to remember the exact steps to renew the TLS certificates on the mail server: you have a short shell script for that. Instead of manually exporting all the UML diagrams to SVG and committing them to git every time you update the design documents, you have a Makefile target and a CI action that do it for you without fail. The list goes on and on. These little snippets of automation go a long way towards reducing the drudge-work of your day, freeing you to think, organise and deliver faster and more reliably, day after day.
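For a sense of scale, the kind of snippet I mean is tiny. Here is a minimal sketch in Python (the real thing would more likely be a few lines of shell); it assumes certbot manages the certificates and postfix is the mail server, both of which are stand-ins for whatever your setup actually uses.

```python
#!/usr/bin/env python3
"""The kind of tiny snippet that replaces a remembered ritual.

Assumptions for illustration: certbot manages the certificates and the
mail server is postfix; swap in whatever your setup actually uses.
"""
import subprocess
import sys


def main() -> int:
    # Ask certbot to renew any certificates that are close to expiry.
    renew = subprocess.run(["certbot", "renew", "--quiet"])
    if renew.returncode != 0:
        print("certificate renewal failed", file=sys.stderr)
        return renew.returncode

    # Reload the mail server so it picks up the refreshed certificates.
    reload_mail = subprocess.run(["systemctl", "reload", "postfix"])
    return reload_mail.returncode


if __name__ == "__main__":
    sys.exit(main())
```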
But there is a dark side to all this automation that isn't discussed enough. Like an ornamental garden, automation needs continual tending to remain beautiful. The assumptions made when the scripts were written change over time: perhaps the file that has always lived in one particular place in the filesystem now needs to be somewhere else to work with an updated version of some upstream tool. Well, that is a simple fix with a few extra lines of code, so not a major issue, right?
Of course, simple scripts are quick to update, but as the number of little snippets of automation increases, the number of quick maintenance tasks increases with it, until a tipping point is reached where the time recovered by automating the manual tasks is consumed by maintaining the automation itself. With all this automation, you are now achieving much more per day than you were when doing everything manually, but you have again run out of bandwidth to think, organise and deliver.
Script Saturation in DevOps
Let us take a classic DevOps use-case: over time, an average engineering team will accumulate a large number of automation scripts to provision cloud infrastructure, deploy services onto that infrastructure, configure the various logging and analytics systems, and forward monitoring notifications to the relevant support personnel. There are lots of great tools out there to perform all these tasks with a sprinkling of customisation, but over time the infrastructure landscape shifts, the application grows more complex, and that sprinkling of customisation grows with them.
As the business requirements change, the time spent by the DevOps team tending to the automation scripts stays fairly constant at best. If the business is driven by growth, customer demand or regulation to deploy the applications across multiple cloud providers, then the development and maintenance load on the team grows fast. This increasing workload means the engineering team now has less time day-to-day to evolve the products - the things that actually generate the revenue for the business. The complexity of the automation that allowed the team to deliver the product in the first place is now slowly poisoning the team's ability to deliver improvements.
Going Back One Step
As in most of engineering, there comes a point when making something work where you have to stop just trying to fix all the issues, take a mental step back, and ask yourself "What am I actually trying to achieve here?". A lot of the time, a bit of perspective reveals that the cause of all the problems you face is that you are solving the problem the wrong way, building up an ever-increasing set of corner-cases where various parts of your solution just don't work. Hammering harder at square pegs hoping they will fit into round holes doesn't fix the underlying problem that the pegs are the wrong shape, and until you stop hammering and look at the holes, nothing gets better.
Returning to our DevOps example, the problem we see is that the engineering team is starting to become swamped with the ever-increasing workload of maintaining the growing number of automation scripts, and there are some obvious potential solutions:
- More People! If the engineering team is over-worked, add more people. That might well work in the short term, but the problem isn't the current workload; the problem is that it is ever-increasing.
- Better Automation! If the automation is consuming too much valuable engineering time, then it must be either badly written or built with bad tools. Transitioning from one technology to another always has a cost, and rewriting significant parts of the automation codebase may yield some improvement to the ongoing workload; however, the outcome will remain the ever-increasing burden of maintaining a growing number of automation scripts.
When we take a mental step back, we can recognise that the problem isn't the tools the team uses to write the automation scripts, or the capacity of the team to maintain them; the problem is the nature of the automation itself. The current approach to automatically provisioning infrastructure and deploying applications just isn't scaling in line with the demands of the business. As the number of infrastructure differences and product options increases, the number of conditional sequences in the automation multiplies. The scripts now have to handle all these conditions, but fundamentally, and critically, the desired outcome is simply to "create a cluster capable of running our product". The simplicity of the objective gives us an alternative lens through which to look at the problem.
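To make that multiplication concrete, here is a deliberately simplified Python sketch; the providers, regions and tiers are invented, but the shape of the problem is the point: every new axis of variation multiplies the number of code paths the scripts have to handle and test.

```python
# Deliberately simplified: each axis of variation multiplies the branches
# the automation has to handle. Provider/region/tier names are invented.

PROVIDERS = ["aws", "gcp", "azure"]          # 3 choices
REGIONS = ["eu-west", "us-east"]             # x 2
PRODUCT_TIERS = ["standard", "enterprise"]   # x 2  -> 12 distinct code paths


def provision(provider: str, region: str, tier: str) -> None:
    if provider == "aws":
        if tier == "enterprise":
            ...  # dedicated VPC, larger instances, extra logging
        else:
            ...  # shared VPC, default instances
    elif provider == "gcp":
        ...      # a parallel set of branches, with subtly different flags
    elif provider == "azure":
        ...      # and another, with its own corner cases
    # Every new provider, region or tier multiplies the paths to maintain:
    # len(PROVIDERS) * len(REGIONS) * len(PRODUCT_TIERS) combinations.
```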
Outcomes, Not Sequences
If you're reading this, then I'm pretty sure I don't have to convince you that computers can perform some amazing tasks. You want to travel from London to Paris on Tuesday? There are plenty of great applications out there that will plan multiple routes, using different means of transportation, in the blink of an eye. You want to find the winning chess moves from a given position? There's software for that as well. Applications that find answers to those kinds of problems are called "constraint solvers", and although they are a well-studied area of Computer Science, they are often seen as far too esoteric to be considered when writing a few quick automation scripts.
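To make "constraint solver" a little less abstract, here is a toy Python sketch: given an invented catalogue of instance types and a handful of constraints, it brute-forces the cheapest combination that satisfies them all. Real solvers search far more cleverly, but the principle is the same: describe the constraints, and let the machine find an answer.

```python
from itertools import product

# Invented catalogue of instance types: (vCPUs, RAM in GiB, cost per hour).
CATALOGUE = {
    "small":  (2, 4, 0.05),
    "medium": (4, 16, 0.20),
    "large":  (8, 32, 0.40),
}


def solve(need_cpu: int, need_ram: int, budget: float, max_nodes: int = 3):
    """Find the cheapest combination of nodes satisfying every constraint,
    or return None to report that no solution exists in the search space."""
    best = None
    # Brute force every combination up to max_nodes; real solvers prune far more cleverly.
    for n in range(1, max_nodes + 1):
        for combo in product(CATALOGUE, repeat=n):
            cpu = sum(CATALOGUE[t][0] for t in combo)
            ram = sum(CATALOGUE[t][1] for t in combo)
            cost = sum(CATALOGUE[t][2] for t in combo)
            if cpu >= need_cpu and ram >= need_ram and cost <= budget:
                if best is None or cost < best[1]:
                    best = (combo, cost)
    return best


print(solve(need_cpu=6, need_ram=20, budget=0.50))
# -> (('small', 'medium'), 0.25): the cheapest combination meeting all three constraints
```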
However, on revisiting the earlier question of "What am I actually trying to do here?", the answer should be: "I'm trying to reach my desired outcome, given a start state and a set of constraints". With this change in mindset, we can see that our automation requirements are satisfied by a constraint solver that takes a defined end state and the current start state as input, and either outputs (or performs) the steps required to get there, or reports why it couldn't find a solution.
Going back to our DevOps example, what we actually need is a smarter constraint solver, tuned for cloud infrastructure. We give it, as input, the resource requirements and placement rules for our application's compute, networking and storage needs. The constraint solver considers the infrastructure currently on hand, and then produces a series of commands optimised for our available infrastructure and application configuration. The outcome is a functional, painlessly deployed product. This paradigm shift - from maintaining the scripts that automate the steps expected to achieve the outcome, to updating the requirements of the product and feeding those requirements to a smarter constraint solver - massively reduces the ever-increasing maintenance workload on the engineering team, restoring the lost time to think, organise and deliver.
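As a rough sketch of the shape such a solver might take (everything here - the field names, the sizes, the command strings - is invented for illustration, and the search is a simple greedy placement rather than a full solver): one input is a declared requirement, the other is the observed infrastructure, and the output is either a plan of commands or an explanation of why no placement satisfies the constraints.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    free_cpu: int
    free_ram: int
    zone: str


# Declared requirement for the application (all field names invented).
REQUIREMENT = {"replicas": 3, "cpu_per_replica": 2, "ram_per_replica": 4,
               "placement": "spread_across_zones"}


def plan(requirement: dict, nodes: list[Node]) -> list[str]:
    """Turn a declared requirement plus observed infrastructure into commands,
    or raise with an explanation of why no placement satisfies the constraints."""
    commands, used_zones = [], set()
    # Greedy first-fit over the roomiest nodes; a real solver would search properly.
    candidates = sorted(nodes, key=lambda n: (n.free_cpu, n.free_ram), reverse=True)
    for replica in range(requirement["replicas"]):
        placed = False
        for node in candidates:
            spread_ok = (requirement["placement"] != "spread_across_zones"
                         or node.zone not in used_zones)
            if (node.free_cpu >= requirement["cpu_per_replica"]
                    and node.free_ram >= requirement["ram_per_replica"] and spread_ok):
                commands.append(f"deploy replica-{replica} on {node.name}")
                node.free_cpu -= requirement["cpu_per_replica"]
                node.free_ram -= requirement["ram_per_replica"]
                used_zones.add(node.zone)
                placed = True
                break
        if not placed:
            raise RuntimeError(f"no node satisfies the constraints for replica-{replica}")
    return commands


nodes = [Node("n1", 4, 8, "eu-a"), Node("n2", 4, 8, "eu-b"), Node("n3", 2, 4, "eu-c")]
print(plan(REQUIREMENT, nodes))
# -> ['deploy replica-0 on n1', 'deploy replica-1 on n2', 'deploy replica-2 on n3']
```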
Closing the Loop
An astute reader might at this point object: a constraint solver as described above may well reduce the maintenance of automation scripts, by tying the deployed description of the product to the product requirements themselves, but the infrastructure on which the product must run will still change according to the schedules of the cloud providers and customer demand. The set of steps generated to deploy the application at the time the release was baselined will undoubtedly fail if something in the target infrastructure changes after the plan was made.
In short, we have now automated so much of our deployment process that we have exposed a manual step: every time the start state changes, the constraint solver has to be run again, and that means collecting the new infrastructure status and producing a new plan based on it. Luckily, this is just the kind of thing computers are good at: running repetitive tasks in a loop. To complete our shift away from imperative scripted steps to something that takes a definition of "correct" and makes it so, we just need to add a monitoring service to our constraint solver and combine them in a loop. The monitoring service checks the state of the deployment and the infrastructure. If it detects that the running state has deviated from the planned state, the constraint solver generates a new plan to move from the current state back to the desired state, and then the loop begins again.
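Sketched in Python, the whole loop is conceptually tiny; `observe`, `plan` and `apply_steps` below are hypothetical stand-ins for whatever collects infrastructure status, solves for new steps, and executes them.

```python
import time


# Hypothetical stand-ins: in a real system these would query the cloud provider,
# run the constraint solver, and execute the resulting commands.
def observe() -> dict:
    return {"replicas_running": 2}          # pretend one replica has died


def plan(desired: dict, current: dict) -> list[str]:
    missing = desired["replicas_running"] - current["replicas_running"]
    return ["start replica" for _ in range(max(0, missing))]


def apply_steps(steps: list[str]) -> None:
    for step in steps:
        print("executing:", step)


def reconcile(desired: dict, interval_seconds: int = 30, iterations: int = 1) -> None:
    """The orchestration control loop: observe, compare, re-plan, apply, repeat.

    In production this runs forever; 'iterations' is only here so the sketch
    terminates when run as a demo.
    """
    for _ in range(iterations):
        current = observe()                       # collect the new start state
        if current != desired:                    # has reality drifted from the plan?
            apply_steps(plan(desired, current))   # solve for, then walk, a route back
        time.sleep(interval_seconds)              # then go round again


reconcile({"replicas_running": 3}, interval_seconds=0)
```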
This combination of a declared desired outcome, monitoring of status, and a constraint solver to reconverge, all wrapped in an endless loop, is what defines Orchestration. Although I have already acknowledged that there is a cost in transitioning from what a team currently uses to a new technology or paradigm, there comes a time in most successful engineering companies when the quick automation scripts that worked great at the beginning start destroying the productivity of the DevOps team.
Achieve Orchestration Simplicity
The good news is that Orchestration tools do not need to be built in-house from scratch. The paradigm shift from automation to orchestration is gathering momentum in commercial and open-source software. Runtimes such as Kubernetes use an orchestration control loop as described above to provide a resilient container platform used by millions worldwide, and companies such as Ori Industries use the same approach with the Ori Global Cloud to provide resilient, declarative multi-cloud application deployment.
So the next time you find yourself staring at a script, asking "Why am I changing this again?", take a step back and ask yourself "What am I actually trying to do here?". If the answer is "Achieve an outcome given a set of constraints on a shifting target", then do yourself a favour: stop automating, start orchestrating.