Thursday, December 31, 2020

A problem with manual release approval gates we didn't realise we had

Arghhh! There are problems during deployment - what's gone wrong? I thought we had it sorted.

This article describes a source of problems during deployments which I witnessed at a previous client.
I hope this story helps you avoid this problem with your CI/CD pipeline setup too.

Background

We had a "CI/CD" pipeline involving:
  • build/run tests/generate docs etc
  • zero downtime deployment to a manual testing environment
  • once manual testing showed the system was OK, zero downtime deployment to production
Sounds quite standard and maybe quite safe?

Zero downtime deployments

If you are running a system/service which has to keep running without downtime, you'll need zero downtime deployments which cope with the switch-over from an old version of a service to a new one. This involves making backward compatible changes to database schemas or APIs, so that the system isn't broken while a deployment is in progress and can continue to be used throughout.

See this excellent workshop for more about this subject, and there are plenty of articles about it.
The main technique is often called "expand-contract" because there's an "expand" phase where you add something while keeping backward compatibility, and a "contract" phase where you delete the left-overs once the relevant change has been made.

Let's consider an example - I've chosen an example involving a database and a service, but equivalent issues can apply to collaborating microservices too. In this example, let's say you want to rename a column in a table from ALICE to BRIAN.

You could do something like (each line represents a deployment):
  1. add the new column BRIAN
  2. make the service save to both columns ALICE and BRIAN
  3. copy all values from ALICE to BRIAN
  4. make the service use only the column BRIAN
  5. delete the column ALICE
We were using Flyway to make the schema changes and run scripts against the database.
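As an illustration of step 2 (the dual-write step), here is a minimal sketch in Java using plain JDBC. The table name CUSTOMERS, the class and method names, and the JDBC wiring are my assumptions for illustration; only the column names ALICE and BRIAN come from the example above.

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;

    // Step 2 of expand-contract: the service writes to both the old column
    // (ALICE) and the new column (BRIAN), so it behaves correctly whether or
    // not the step 3 backfill has happened yet.
    // Table/class/method names are illustrative assumptions.
    public class CustomerRepository {
        private final Connection connection;

        public CustomerRepository(Connection connection) {
            this.connection = connection;
        }

        public void saveName(long id, String name) throws SQLException {
            String sql = "UPDATE CUSTOMERS SET ALICE = ?, BRIAN = ? WHERE ID = ?";
            try (PreparedStatement statement = connection.prepareStatement(sql)) {
                statement.setString(1, name); // old column: still read by the old version
                statement.setString(2, name); // new column: read by the next version
                statement.setLong(3, id);
                statement.executeUpdate();
            }
        }
    }

Once step 5 has been deployed everywhere, the dual write collapses back to a single write to BRIAN.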

The developers were experienced enough to know how to do this (with the occasional mistake, but that wasn't the interesting problem), but we still had problems during deployments even when the developers got everything right.

Manual release approval gates

The problem was that due to manual release approval gates, deployments were being batched up, so there could easily be one deployment to production which represented several deployments to the manual testing environment. Although the individual deployments to the manual testing environment worked just fine, when batched up there were sometimes problems during deployments.

Consider the earlier example. During deployment of a service, the schema changes are applied and once that's finished the newly deployed version of the app starts to receive traffic.

If those individually safe deployments end up being batched together, the database schema changes to add the new column BRIAN and delete the column ALICE will happen while the old version of the
service is still running. If that old (currently running) version of the service tries to write to column ALICE before the new version of the service starts handling traffic, then it'll fail because the old column has been deleted.
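Concretely, the old (still-running) version of the service is executing something like the following while the batched migrations run. This is a sketch under the same illustrative assumptions as before (table and class names are mine):

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;

    // The OLD version of the service, still serving traffic while the batched
    // migrations (which include "delete the column ALICE") are applied.
    public class OldCustomerRepository {
        private final Connection connection;

        public OldCustomerRepository(Connection connection) {
            this.connection = connection;
        }

        public void saveName(long id, String name) throws SQLException {
            // Each deployment was individually safe, but once the batched
            // migrations have dropped ALICE this throws an SQLException
            // (e.g. "column does not exist") until the new version takes over.
            String sql = "UPDATE CUSTOMERS SET ALICE = ? WHERE ID = ?";
            try (PreparedStatement statement = connection.prepareStatement(sql)) {
                statement.setString(1, name);
                statement.setLong(2, id);
                statement.executeUpdate();
            }
        }
    }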

Possible solutions

  • Do continuous deployments - i.e. don't have manual release approval gates
    • there are other benefits than just avoiding this problem; beyond the scope of this article
    • not an option if you can't trust your automated testing enough (which itself is a massive problem)
  • Do not make the next change in the expand-contract sequence until the previous one is in production
    • this is what I did, but it was frustratingly slow because often there would only be one production release per day (see the sketch after this list for one way to check that the previous change has reached production)
  • Run the same separate (intermediate) deployments that went into the manual testing environment one at a time, sequentially, to production
    • not easy to set up, and not an option in our case: problems found in the manual testing environment meant that some intermediate deployments would not have been safe in production, even if immediately followed by a fix in the next deployment
  • Reset the manual testing environment to having the same versions of everything (including database schemas) as production before doing a deployment so you can test the deployment that will actually happen
    • not easy because setting the manual testing database back to the schema version in production while preserving the data related to the manual testing was impractical
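As a sketch of the second option ("do not make the next change until the previous one is in production"), one way to check is to query Flyway's default schema history table, flyway_schema_history, in the production database. The JDBC URL, credentials, and the assumption that step 1 was migration version 1 are all illustrative:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    // Checks whether a given Flyway migration version has been applied
    // successfully, using Flyway's default history table. Connection details
    // and the version number are illustrative assumptions.
    public class MigrationGate {

        public static boolean isApplied(Connection connection, String version)
                throws SQLException {
            String sql = "SELECT success FROM flyway_schema_history WHERE version = ?";
            try (PreparedStatement statement = connection.prepareStatement(sql)) {
                statement.setString(1, version);
                try (ResultSet results = statement.executeQuery()) {
                    return results.next() && results.getBoolean("success");
                }
            }
        }

        public static void main(String[] args) throws SQLException {
            try (Connection production = DriverManager.getConnection(
                    "jdbc:postgresql://prod-db.example.com/app", "reader", "secret")) {
                // Only start on step 2 (dual writes) once step 1 (add BRIAN) is live.
                if (!isApplied(production, "1")) {
                    throw new IllegalStateException(
                            "Previous expand-contract step is not yet in production");
                }
            }
        }
    }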
I'm sure other solutions are possible - please add your ideas in the comments.

Copyright © 2020 Ivan Moore

2 comments:

Hilverd Reker said...

Hi Ivan! We currently have a similar CI/CD setup but our database migrations run in their own pipeline (and there is no manual testing stage). That was done for unrelated technical reasons but it seems like this may be accidentally preventing the sort of issue you describe -- which I had not realised before. Referring to your example, our separate database migration pipeline would have practically forced the developers (who are aware that there is no guarantee which pipeline gets to deploy to an environment first) to push step 1 as a separate commit and wait for it to go live before pushing any further changes.

Maybe another solution is to configure the CI/CD platform not to batch any commits together (I think Concourse lets you do that, not sure about other platforms) although that would make things very slow. Or ideally there should be a way to annotate a commit to tell the CI/CD platform not to batch it together with any subsequent commits.

By the way, the links within your blog post appear to have gone missing (they all point to https://www.blogger.com/#).

Ivan Moore said...

Hi Hilverd, many thanks for pointing out the problem with the links - now fixed.

Running the migrations in their own pipeline seems to me to be a poor solution to the problem. It is a variant of "Do not make the next change in the expand-contract sequence until the previous one is in production" as in the article. It is not nearly as good as just doing continuous deployments (which requires those migrations to be done in a deterministic order with the code). I.e. my favoured solution (in general) is not possible if you have unrelated db migration pipelines (i.e. even if the pipelines push to prod automatically, developers cannot push their changes continuously, so for me it isn't continuous deployment).

Configuring the CI/CD platform not to batch any commits together was not possible in the situation I was writing about, because some of the intermediate commits (between one manual release gate and another) would have caused problems (even if only for a short time) in production.