Thursday, December 31, 2020

A problem with manual release approval gates we didn't realise we had

Arghhh! There are problems during deployment - what's gone wrong? I thought we had it sorted.

This article describes a source of problems during deployments which I witnessed at a previous client.
I hope this story helps you avoid this problem with your CI/CD pipeline setup too.

Background

We had a "CI/CD" pipeline involving:
  • build/run tests/generate docs etc
  • zero downtime deployment to a manual testing environment
  • once manual testing showed the system was OK, zero downtime deployment to production
Sounds quite standard and maybe quite safe?

Zero downtime deployments

If you are running a system/service which has to keep running without downtime, you'll need zero downtime deployments which cope with the switch-over from an old version of a service to a new one. This involves making backward compatible changes to database schemas or APIs, so that the system isn't broken while a deployment is in progress and can continue to be used throughout.

See this excellent workshop for more about this subject, and there are plenty of articles about it.
The main technique is often called "expand-contract" because there's an "expand" phase where you add something while keeping backward compatibility, and a "contract" phase where you delete the left-overs once the relevant change has been made.

Let's consider an example - I've chosen an example involving a database and a service, but equivalent issues can apply to collaborating microservices too. In this example, let's say you want to rename a column in a table from ALICE to BRIAN.

You could do something like (each line represents a deployment):
  1. add the new column BRIAN
  2. make the service save to both columns ALICE and BRIAN
  3. copy all values from ALICE to BRIAN
  4. make the service use only the column BRIAN
  5. delete the column ALICE
We were using Flyway to make the schema changes and run scripts against the database.
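As an illustration of step 2 (the dual-write step), here is a minimal sketch in Java using plain JDBC. The table name CUSTOMERS, the class and method names, and the JDBC wiring are my assumptions for illustration; only the column names ALICE and BRIAN come from the example above.

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;

    // Step 2 of expand-contract: the service writes to both the old column
    // (ALICE) and the new column (BRIAN), so it behaves correctly whether or
    // not the step 3 backfill has happened yet.
    // Table/class/method names are illustrative assumptions.
    public class CustomerRepository {
        private final Connection connection;

        public CustomerRepository(Connection connection) {
            this.connection = connection;
        }

        public void saveName(long id, String name) throws SQLException {
            String sql = "UPDATE CUSTOMERS SET ALICE = ?, BRIAN = ? WHERE ID = ?";
            try (PreparedStatement statement = connection.prepareStatement(sql)) {
                statement.setString(1, name); // old column: still read by the old version
                statement.setString(2, name); // new column: read by the next version
                statement.setLong(3, id);
                statement.executeUpdate();
            }
        }
    }

Once step 5 has been deployed everywhere, the dual write collapses back to a single write to BRIAN.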

The developers were experienced enough to know how to do this (with the occasional mistake, but that wasn't the interesting problem), but we still had problems during deployments even when the developers got everything right.

Manual release approval gates

The problem was that due to manual release approval gates, deployments were being batched up, so there could easily be one deployment to production which represented several deployments to the manual testing environment. Although the individual deployments to the manual testing environment worked just fine, when batched up there were sometimes problems during deployments.

Consider the earlier example. During deployment of a service, the schema changes are applied and once that's finished the newly deployed version of the app starts to receive traffic.

If those individually safe deployments end up being batched together, the database schema changes to add the new column BRIAN and delete the column ALICE will happen while the old version of the
service is still running. If that old (currently running) version of the service tries to write to column ALICE before the new version of the service starts handling traffic, then it'll fail because the old column has been deleted.
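Concretely, the old (still-running) version of the service is executing something like the following while the batched migrations run. This is a sketch under the same illustrative assumptions as before (table and class names are mine):

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;

    // The OLD version of the service, still serving traffic while the batched
    // migrations (which include "delete the column ALICE") are applied.
    public class OldCustomerRepository {
        private final Connection connection;

        public OldCustomerRepository(Connection connection) {
            this.connection = connection;
        }

        public void saveName(long id, String name) throws SQLException {
            // Each deployment was individually safe, but once the batched
            // migrations have dropped ALICE this throws an SQLException
            // (e.g. "column does not exist") until the new version takes over.
            String sql = "UPDATE CUSTOMERS SET ALICE = ? WHERE ID = ?";
            try (PreparedStatement statement = connection.prepareStatement(sql)) {
                statement.setString(1, name);
                statement.setLong(2, id);
                statement.executeUpdate();
            }
        }
    }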

Possible solutions

  • Do continuous deployments - i.e. don't have manual release approval gates
    • there are other benefits than just avoiding this problem; beyond the scope of this article
    • not an option if you can't trust your automated testing enough (which itself is a massive problem)
  • Do not make the next change in the expand-contract sequence until the previous one is in production
    • this is what I did, but it was frustratingly slow because often there would only be one production release per day (see the sketch after this list for one way to check that the previous change has reached production)
  • Run the same separate (intermediate) deployments that went into the manual testing environment one at a time, sequentially, to production
    • not easy to set up, and not an option in our case: problems found in the manual testing environment meant that some intermediate deployments would not have been safe in production, even if immediately followed by a fix in the next deployment
  • Reset the manual testing environment to having the same versions of everything (including database schemas) as production before doing a deployment so you can test the deployment that will actually happen
    • not easy because setting the manual testing database back to the schema version in production while preserving the data related to the manual testing was impractical
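As a sketch of the second option ("do not make the next change until the previous one is in production"), one way to check is to query Flyway's default schema history table, flyway_schema_history, in the production database. The JDBC URL, credentials, and the assumption that step 1 was migration version 1 are all illustrative:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    // Checks whether a given Flyway migration version has been applied
    // successfully, using Flyway's default history table. Connection details
    // and the version number are illustrative assumptions.
    public class MigrationGate {

        public static boolean isApplied(Connection connection, String version)
                throws SQLException {
            String sql = "SELECT success FROM flyway_schema_history WHERE version = ?";
            try (PreparedStatement statement = connection.prepareStatement(sql)) {
                statement.setString(1, version);
                try (ResultSet results = statement.executeQuery()) {
                    return results.next() && results.getBoolean("success");
                }
            }
        }

        public static void main(String[] args) throws SQLException {
            try (Connection production = DriverManager.getConnection(
                    "jdbc:postgresql://prod-db.example.com/app", "reader", "secret")) {
                // Only start on step 2 (dual writes) once step 1 (add BRIAN) is live.
                if (!isApplied(production, "1")) {
                    throw new IllegalStateException(
                            "Previous expand-contract step is not yet in production");
                }
            }
        }
    }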
I'm sure other solutions are possible - please add your ideas in the comments.

Copyright © 2020 Ivan Moore

2 comments:

Hilverd Reker said...

Hi Ivan! We currently have a similar CI/CD setup but our database migrations run in their own pipeline (and there is no manual testing stage). That was done for unrelated technical reasons but it seems like this may be accidentally preventing the sort of issue you describe -- which I had not realised before. Referring to your example, our separate database migration pipeline would have practically forced the developers (who are aware that there is no guarantee which pipeline gets to deploy to an environment first) to push step 1 as a separate commit and wait for it to go live before pushing any further changes.

Maybe another solution is to configure the CI/CD platform not to batch any commits together (I think Concourse lets you do that, not sure about other platforms) although that would make things very slow. Or ideally there should be a way to annotate a commit to tell the CI/CD platform not to batch it together with any subsequent commits.

By the way, the links within your blog post appear to have gone missing (they all point to https://www.blogger.com/#).

Ivan Moore said...

Hi Hilverd, many thanks for pointing out the problem with the links - now fixed.

Running the migrations in their own pipeline seems to me to be a poor solution to the problem. It is a variant of "Do not make the next change in the expand-contract sequence until the previous one is in production" as in the article. It is not nearly as good as just doing continuous deployments (which requires those migrations to be done in a deterministic order with the code). I.e. my favoured solution (in general) is not possible if you have unrelated db migration pipelines (i.e. even if the pipelines push to prod automatically, developers cannot push their changes continuously, so for me it isn't continuous deployment).

Configuring the CI/CD platform not to batch any commits together was not possible in the situation I was writing about, because some of the intermediate commits (between one manual release gate and another) would have caused problems (even if only for a short time) in production.