Thursday, December 31, 2020

A problem with manual release approval gates we didn't realise we had

Arghhh! There are problems during deployment - what's gone wrong? I thought we had it sorted.

This article describes a source of problems during deployments which I witnessed at a previous client.
I hope this story helps you avoid this problem with your CI/CD pipeline setup too.

Background

We had a "CI/CD" pipeline involving:
  • build/run tests/generate docs etc
  • zero downtime deployment to a manual testing environment
  • once manual testing showed the system was OK, zero downtime deployment to production
Sounds quite standard and maybe quite safe?

Zero downtime deployments

If you are running a system/service which has to keep running without downtime, you'll need zero downtime deployments which cope with the switch-over from an old version of a service to a new version. This will involve making backward compatible changes to database schemas or APIs so that the system isn't broken while a deployment is in progress and can continue to be used throughout.

See this excellent workshop for more about this subject, and there are plenty of articles about it.
The main technique is often called "expand-contract" because there's an "expand" while you keep something backward compatible and a "contract" where you delete the left-overs once the relevant change has been made.

Let's consider an example - I've chosen to do an example involving a database and a service, but the equivalent issues can apply for collaborating microservices too. In this example, let's say you want to rename a column in a table from ALICE to BRIAN.

You could do something like (each line represents a deployment):
  1. add the new column BRIAN
  2. make the service save to both columns ALICE and BRIAN
  3. copy all values from ALICE to BRIAN
  4. make the service use only the column BRIAN
  5. delete the column ALICE
We were using Flyway to make the schema changes and run scripts against the database.
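
To see why each of the five deployments above is safe on its own, here is a minimal sketch in Python. It is only a model, not the Flyway setup we actually used: the Deployment class, the cols helper and deploy_one_at_a_time are hypothetical names invented for illustration. It assumes the schema change in a deployment is applied first, while the previously deployed version of the service is still handling traffic, and that the new version then takes over.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Deployment:
    description: str
    service_uses: frozenset  # columns the newly deployed service reads or writes
    add_columns: frozenset = frozenset()
    drop_columns: frozenset = frozenset()


def cols(*names):
    """Small helper to build an immutable set of column names."""
    return frozenset(names)


# The five deployments from the list above (hypothetical model, one entry each).
DEPLOYMENTS = [
    Deployment("add the new column BRIAN", cols("ALICE"), add_columns=cols("BRIAN")),
    Deployment("save to both ALICE and BRIAN", cols("ALICE", "BRIAN")),
    Deployment("copy all values from ALICE to BRIAN", cols("ALICE", "BRIAN")),
    Deployment("use only the column BRIAN", cols("BRIAN")),
    Deployment("delete the column ALICE", cols("BRIAN"), drop_columns=cols("ALICE")),
]


def deploy_one_at_a_time(schema, running_uses, deployments):
    """Apply each deployment separately and record whether it was safe.

    A deployment counts as safe if, once its schema change has been applied,
    both the version still running and the version about to take over only
    use columns that exist.
    """
    results = []
    for deployment in deployments:
        schema = (schema | deployment.add_columns) - deployment.drop_columns
        safe = running_uses <= schema and deployment.service_uses <= schema
        results.append((deployment.description, safe))
        running_uses = deployment.service_uses  # new version now handles traffic
    return results


RESULTS = deploy_one_at_a_time(cols("ALICE"), cols("ALICE"), DEPLOYMENTS)
for description, safe in RESULTS:
    print(f"{description}: {'safe' if safe else 'UNSAFE'}")
```

In this model every step is reported as safe, because the version running at the time never depends on a column that the step removes.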

The developers were experienced enough to know how to do this (with the occasional mistake, but that wasn't the interesting problem), but we still had problems during deployments even when the developers got everything right.

Manual release approval gates

The problem was that due to manual release approval gates, deployments were being batched up, so there could easily be one deployment to production which represented several deployments to the manual testing environment. Although the individual deployments to the manual testing environment worked just fine, when batched up there were sometimes problems during deployments.

Consider the earlier example. During deployment of a service, the schema changes are applied and once that's finished the newly deployed version of the app starts to receive traffic.

If those individually safe deployments end up being batched together, the database schema changes to add the new column BRIAN and delete the column ALICE will happen while the old version of the service is still running. If that old (currently running) version of the service tries to write to column ALICE before the new version of the service starts handling traffic, then it'll fail because the old column has been deleted.
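
The failure can be reproduced in miniature. The following is an illustrative sketch using Python's built-in sqlite3 module with an in-memory database, not the Flyway scripts or the database from the real system. The table name person and the functions old_service_write, batched_migrations and simulate_batched_deployment are hypothetical. The column is dropped by rebuilding the table, as a portable stand-in for a DROP COLUMN statement.

```python
import sqlite3


def old_service_write(conn, value):
    """The version currently in production: it only knows about ALICE."""
    conn.execute("INSERT INTO person (ALICE) VALUES (?)", (value,))


def batched_migrations(conn):
    """All schema and data changes from the five steps, applied in one go."""
    conn.execute("ALTER TABLE person ADD COLUMN BRIAN TEXT")
    conn.execute("UPDATE person SET BRIAN = ALICE")
    # Drop ALICE by rebuilding the table without it.
    conn.execute("CREATE TABLE person_new (id INTEGER PRIMARY KEY, BRIAN TEXT)")
    conn.execute("INSERT INTO person_new (id, BRIAN) SELECT id, BRIAN FROM person")
    conn.execute("DROP TABLE person")
    conn.execute("ALTER TABLE person_new RENAME TO person")


def simulate_batched_deployment():
    """Return the error the old service hits during the batched deployment."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, ALICE TEXT)")
    old_service_write(conn, "before deployment")  # works against the old schema

    batched_migrations(conn)  # schema is migrated before the new service starts

    try:
        # The old service is still handling traffic at this point.
        old_service_write(conn, "during deployment")
    except sqlite3.OperationalError as error:
        return str(error)
    finally:
        conn.close()
    return None


if __name__ == "__main__":
    print(simulate_batched_deployment())
```

The write made before the deployment succeeds, while the write made after the batched schema changes raises an error because the column ALICE no longer exists. None of the individual deployments to the manual testing environment would have shown this.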

Possible solutions

  • Do continuous deployments - i.e. don't have manual release approval gates
    • there are other benefits than just avoiding this problem; beyond the scope of this article
    • not an option if you can't trust your automated testing enough (which itself is a massive problem)
  • Do not make the next change in the expand-contract sequence until the previous one is in production
    • this is what I did, but it was frustratingly slow because often there would only be one production release per day
  • Run the same separate (intermediate) deployments that went into the manual testing environment to production, one at a time and in the same order
    • not easy to set up, and not an option in our case where problems were found in the manual testing environment which meant that some intermediate deployments would not have been safe to deploy to production even if immediately followed by a fix in the next deployment
  • Reset the manual testing environment to having the same versions of everything (including database schemas) as production before doing a deployment so you can test the deployment that will actually happen
    • not easy because setting the manual testing database back to the schema version in production while preserving the data related to the manual testing was impractical
I'm sure other solutions are possible - please add your ideas in the comments.

Copyright © 2020 Ivan Moore