Putting the tea into team: December 2020

Arghhh! There are problems during deployment - what's gone wrong? I thought we had it sorted.

This article describes a source of problems during deployments which I witnessed at a previous client.
I hope this story helps you avoid this problem with your CI/CD pipeline setup too.

Background

We had a "CI/CD" pipeline involving:

build/run tests/generate docs etc
zero downtime deployment to a manual testing environment
once manual testing showed the system was OK, zero downtime deployment to production

Sounds quite standard and maybe quite safe?

Zero downtime deployments

If you are running a system or service which has to keep running without downtime, you'll need zero downtime deployments which cope with the switch-over from the old version of a service to the new one. This will involve making backward compatible changes to database schemas or APIs, so that the system isn't broken while a deployment is in progress and can stay in use throughout.

See this excellent workshop for more about this subject, and there are plenty of articles about it.
The main technique is often called "expand-contract" because there's an "expand" phase, where you add the new thing alongside the old one so that both versions of the service keep working, and a "contract" phase, where you delete the left-overs once nothing uses the old thing any more.

Let's consider an example - I've chosen to do an example involving a database and a service, but the equivalent issues can apply for collaborating microservices too. In this example, let's say you want to rename a column in a table from ALICE to BRIAN.

You could do something like (each line represents a deployment):

add the new column BRIAN
make the service save to both columns ALICE and BRIAN
copy all values from ALICE to BRIAN
make the service use only the column BRIAN
delete the column ALICE
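
To make the sequence concrete, here is a minimal sketch of the five deployments in Python against an in-memory SQLite database. The table name `people`, the function names and the choice of SQLite are all assumptions for illustration, not what our real system looked like. The point is only that a save from whichever service version is live keeps working after every single step.

```python
import sqlite3


def columns(db):
    """Return the current column names of the (hypothetical) people table."""
    return [row[1] for row in db.execute("PRAGMA table_info(people)")]


def drop_column(db, name):
    """Drop a column by rebuilding the table (works on any SQLite version)."""
    keep = ", ".join(c for c in columns(db) if c != name)
    db.execute(f"CREATE TABLE people_new AS SELECT {keep} FROM people ORDER BY rowid")
    db.execute("DROP TABLE people")
    db.execute("ALTER TABLE people_new RENAME TO people")


def save_v1(db, value):
    """Original service: only knows about ALICE."""
    db.execute("INSERT INTO people (ALICE) VALUES (?)", (value,))


def save_v2(db, value):
    """Service after deployment 2: writes both columns."""
    db.execute("INSERT INTO people (ALICE, BRIAN) VALUES (?, ?)", (value, value))


def save_v3(db, value):
    """Service after deployment 4: only knows about BRIAN."""
    db.execute("INSERT INTO people (BRIAN) VALUES (?)", (value,))


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE people (ALICE TEXT)")
save_v1(db, "before")

db.execute("ALTER TABLE people ADD COLUMN BRIAN TEXT")             # deployment 1
save_v1(db, "during-1")                                            # old service still fine
save_v2(db, "after-2")                                             # deployment 2
db.execute("UPDATE people SET BRIAN = ALICE WHERE BRIAN IS NULL")  # deployment 3
save_v3(db, "after-4")                                             # deployment 4
drop_column(db, "ALICE")                                           # deployment 5
save_v3(db, "after-5")

values = [row[0] for row in db.execute("SELECT BRIAN FROM people ORDER BY rowid")]
print(columns(db), values)
```

Every row written along the way ends up in BRIAN, and at no point does a running version of the service touch a column that isn't there.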

We were using Flyway to make the schema changes and run scripts against the database.

The developers were experienced enough to know how to do this (with the occasional mistake, but that wasn't the interesting problem), but we still had problems during deployments even when the developers got everything right.

Manual release approval gates

The problem was that due to manual release approval gates, deployments were being batched up, so there could easily be one deployment to production which represented several deployments to the manual testing environment. Although the individual deployments to the manual testing environment worked just fine, when batched up there were sometimes problems during deployments.

Consider the earlier example. During deployment of a service, the schema changes are applied and once that's finished the newly deployed version of the app starts to receive traffic.

If those individually safe deployments end up being batched together, the database schema changes to add the new column BRIAN and delete the column ALICE will happen while the old version of the service is still running. If that old (currently running) version of the service tries to write to column ALICE before the new version of the service starts handling traffic, then it'll fail because the old column has been deleted.
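
The failure can be reproduced in a small sketch. As before, the table name `people`, the SQL and the use of an in-memory SQLite database are assumptions for illustration only. All the pending schema migrations are applied in one go, and then the old service (which only knows about ALICE) handles one more request before the new version takes over.

```python
import sqlite3

# Schema changes from three separately tested deployments (hypothetical SQL).
MIGRATIONS = [
    ["ALTER TABLE people ADD COLUMN BRIAN TEXT"],             # expand
    ["UPDATE people SET BRIAN = ALICE WHERE BRIAN IS NULL"],  # backfill
    [   # contract: a table rebuild stands in for dropping column ALICE
        "CREATE TABLE people_new AS SELECT BRIAN FROM people ORDER BY rowid",
        "DROP TABLE people",
        "ALTER TABLE people_new RENAME TO people",
    ],
]


def old_service_save(db, value):
    """The version still serving traffic: it only knows about ALICE."""
    db.execute("INSERT INTO people (ALICE) VALUES (?)", (value,))


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE people (ALICE TEXT)")
old_service_save(db, "written before the release")

# One batched production release: every pending migration runs first...
for migration in MIGRATIONS:
    for statement in migration:
        db.execute(statement)

# ...and only afterwards does the new service start taking traffic.
# A request that reaches the old service in that window fails.
try:
    old_service_save(db, "written during the release")
    outcome = "ok"
except sqlite3.OperationalError as error:
    outcome = f"failed: {error}"

print(outcome)
```

Run one migration per release instead, with the matching service version going live in between, and the same write never hits a missing column.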

Possible solutions

Do continuous deployments - i.e. don't have manual release approval gates

there are other benefits than just avoiding this problem, but they are beyond the scope of this article
not an option if you can't trust your automated testing enough (which itself is a massive problem)

Do not make the next change in the expand-contract sequence until the previous one is in production

this is what I did, but it was frustratingly slow because there would often be only one production release per day

Run the same separate (intermediate) deployments which went into the manual testing environment against production too, one at a time and in the same order

not easy to set up, and not an option in our case: problems were sometimes found in the manual testing environment, which meant that some intermediate deployments would not have been safe to deploy to production, even if immediately followed by a fix in the next deployment
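
A rough sketch of what such a sequential release could look like is below. The integer version numbers, the `deploy` callback and the `known_bad` set are all hypothetical names invented for this example. It also shows why this option didn't work for us: as soon as one intermediate version is known to be bad, the plan has to be refused.

```python
def plan_sequential_release(production_version, approved_version, known_bad=()):
    """Return every version to deploy to production, in order.

    Versions are hypothetical integers, one per deployment that reached
    the manual testing environment.
    """
    if approved_version < production_version:
        raise ValueError("approved version is older than production")
    plan = list(range(production_version + 1, approved_version + 1))
    # Everything before the final, approved version is an intermediate step.
    unsafe = [v for v in plan[:-1] if v in known_bad]
    if unsafe:
        raise ValueError(f"intermediate versions failed manual testing: {unsafe}")
    return plan


def release(production_version, approved_version, deploy, known_bad=()):
    """Deploy each planned version fully before starting the next one."""
    for version in plan_sequential_release(production_version, approved_version, known_bad):
        deploy(version)  # hypothetical: apply schema changes, then switch traffic
        production_version = version
    return production_version


deployed = []
final_version = release(3, 6, deployed.append)
print(deployed, final_version)
```

Calling `plan_sequential_release(3, 6, known_bad={5})` raises a `ValueError`, which is the situation we kept ending up in.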

Reset the manual testing environment to the same versions of everything (including database schemas) as production before doing a deployment, so that you test the deployment that will actually happen

not easy because setting the manual testing database back to the schema version in production while preserving the data related to the manual testing was impractical

Thursday, December 31, 2020

A problem with manual release approval gates we didn't realise we had
