Thursday, December 31, 2020

A problem with manual release approval gates we didn't realise we had

Arghhh! There are problems during deployment - what's gone wrong? I thought we had it sorted.

This article describes a source of problems during deployments which I witnessed at a previous client.
I hope this story helps you avoid this problem with your CI/CD pipeline setup too.


We had a "CI/CD" pipeline involving:
  • build/run tests/generate docs etc
  • zero downtime deployment to a manual testing environment
  • once manual testing showed system was OK, zero downtime deployment to production
Sounds quite standard and maybe quite safe?

Zero downtime deployments

If you are running a system/service which has to keep running without downtime, you'll need zero downtime deployments which cope with the switch-over from an old version of a service to a new one. This involves making backward compatible changes to database schemas and APIs, so that the system isn't broken while a deployment is in progress and can continue to be used throughout.

See this excellent workshop for more about this subject, and there are plenty of articles about it.
The main technique is often called "expand-contract": there's an "expand" phase, where you add something new while keeping everything backward compatible, and a "contract" phase, where you delete the left-overs once the relevant change has been made.

Let's consider an example - I've chosen to do an example involving a database and a service, but the equivalent issues can apply for collaborating microservices too. In this example, let's say you want to rename a column in a table from ALICE to BRIAN.

You could do something like (each line represents a deployment):
  1. add the new column BRIAN
  2. make the service save to both columns ALICE and BRIAN
  3. copy all values from ALICE to BRIAN
  4. make the service use only the column BRIAN
  5. delete the column ALICE
We were using Flyway to make the schema changes and run scripts against the database.
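As a sketch, the five deployments above might look like this (the table name and SQL are illustrative, not from the original system; steps 2 and 4 are service-code changes, so they have no SQL):

```python
# One entry per deployment in the expand-contract sequence for renaming
# column ALICE to BRIAN. Table name and SQL are hypothetical.
deployments = [
    ("1: expand",   "ALTER TABLE accounts ADD COLUMN brian VARCHAR(255)"),
    ("2: service",  "write to both ALICE and BRIAN"),
    ("3: backfill", "UPDATE accounts SET brian = alice WHERE brian IS NULL"),
    ("4: service",  "read and write only BRIAN"),
    ("5: contract", "ALTER TABLE accounts DROP COLUMN alice"),
]
for step, action in deployments:
    print(step, "->", action)
```

Each entry is safe on its own because the deployments on either side of it only ever see a schema they can cope with.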

The developers were experienced enough to know how to do this (with the occasional mistake, but that wasn't the interesting problem), but we still had problems during deployments even when the developers got everything right.

Manual release approval gates

The problem was that due to manual release approval gates, deployments were being batched up, so there could easily be one deployment to production which represented several deployments to the manual testing environment. Although the individual deployments to the manual testing environment worked just fine, when batched up there were sometimes problems during deployments.

Consider the earlier example. During deployment of a service, the schema changes are applied and once that's finished the newly deployed version of the app starts to receive traffic.

If those individually safe deployments end up being batched together, the database schema changes to add the new column BRIAN and delete the column ALICE will happen while the old version of the
service is still running. If that old (currently running) version of the service tries to write to column ALICE before the new version of the service starts handling traffic, then it'll fail because the old column has been deleted.
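The failure mode can be sketched with a toy model (nothing here is the real system; the "schema" is just a Python set of column names):

```python
# Toy model: a write fails if its column no longer exists in the schema.
schema = {"ALICE"}

def write(column, row, value):
    if column not in schema:
        raise LookupError("no such column: " + column)
    row[column] = value

row = {}
write("ALICE", row, 1)   # the old service version works fine before deployment

# A batched deployment applies *all* pending migrations before traffic switches:
schema.add("BRIAN")      # expand (from one intermediate deployment)
schema.remove("ALICE")   # contract (from a later one) - applied in the same batch

try:
    write("ALICE", row, 2)  # the still-running old version writes the old column
    broke = False
except LookupError:
    broke = True
print(broke)  # True: the old service fails mid-deployment
```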

Possible solutions

  • Do continuous deployments - i.e. don't have manual release approval gates
    • there are other benefits beyond just avoiding this problem, which are beyond the scope of this article
    • not an option if you can't trust your automated testing enough (which itself is a massive problem)
  • Do not make the next change in the expand-contract sequence until the previous one is in production
    • this is what I did, but was frustratingly slow because often there would only be one production release per day
  • Deploy to production, sequentially and one at a time, the same separate (intermediate) deployments that went into the manual testing environment
    • not easy to set up, and not an option in our case where problems were found in the manual testing environment which meant that some intermediate deployments would not have been safe to deploy to production even if immediately followed by a fix in the next deployment
  • Reset the manual testing environment to having the same versions of everything (including database schemas) as production before doing a deployment so you can test the deployment that will actually happen
    • not easy because setting the manual testing database back to the schema version in production while preserving the data related to the manual testing was impractical
I'm sure other solutions are possible - please add your ideas in the comments.

Copyright © 2020 Ivan Moore

Monday, May 7, 2018

"As soon as you can" integration

These days, everyone says they are doing "continuous integration" - but are you really?

Do you build your pull requests and feature branches on a CI server? Do you have individually built company-internal libraries each in their own repository (or built as separately versioned artifacts)? If so, you probably aren't doing "as soon as you can" integration (i.e. what "continuous integration" originally meant).

I ran a session at CITCON with Thierry de Pauw about "as soon as you can" integration. It's a name for a style of "trunk based development", but that term can also be taken to have a less extreme meaning than what I'm describing here.

The following practices all go together to support "as soon as you can" integration:
  • "trunk based development" (an extreme version - just one branch; not even short lived branches)
  • "monorepos" or "multi-app repos" (building against all your source, i.e. no separately versioned, company specific, shared libraries/code)
  • pair programming or mob programming (and/or after-the-fact code reviews if you must, but not code reviews that delay integration)
  • separation of deployment and release of software, i.e. your software should always be deployable even if it isn't all releasable, e.g. using feature toggles
It is the combination of these which enables "as soon as you can" integration.
The most effective teams I've worked on have used this combination of practices.


What I'm suggesting here isn't suitable for every team, but has been suitable for almost every team I've worked on. Maybe your situation is special (in particular, the problems identified here are smaller for smaller teams), but please consider whether you could integrate sooner and what benefits that would give you.


Here are some problems that I've seen in teams thinking they are doing continuous integration - but they aren't really:
  • merge conflicts
    • which is a problem because resolving merge conflicts requires human intervention which leads to mistakes
  • people holding off doing a refactoring in order to avoid causing anyone else merge conflicts
  • large refactorings causing other people merge conflicts
  • waste from working on code that has been refactored by someone else but you don't know it yet
  • waste making the same improvement as someone else in parallel
  • not making small improvements to the code unrelated to what you are "working on" due to overheads
  • not suggesting small improvements to a pull request because it would delay merging of an otherwise good change
  • suggesting small improvements to a pull request which the author considers annoying nit-picking
  • having to chase someone to review, or merge, a pull request, delaying its integration
  • having to update version numbers (in several places) when modifying a company-internal library
  • lack of refactoring tool support when modifying a company-internal library (for the code that depends on it)
  • the "version update dance" for company-internal library code:
    • change library code
    • wait until it has built to get a new version number
    • update code that uses library to refer to new version number
    • manually update code to match changes made to library code
  • multiple versions of company-internal library code being used by different projects
  • diamond dependencies and difficulty getting a good set of versions of company-internal libraries
  • having to have a special "release" build - so being unable to deploy code for every commit
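The diamond-dependency problem from the list above can be sketched like this (library names and version pins are made up for illustration):

```python
# An app depends on libA and libB, which pin different versions of a shared
# company-internal libC - there is no single consistent version to pick.
requirements = {
    "app":  ["libA", "libB"],
    "libA": ["libC==1.2"],
    "libB": ["libC==2.0"],
}

def pinned_versions_of(lib):
    pins = set()
    for deps in requirements.values():
        for dep in deps:
            if dep.startswith(lib + "=="):
                pins.add(dep.split("==")[1])
    return sorted(pins)

print(pinned_versions_of("libC"))  # ['1.2', '2.0'] - conflicting pins
```

Building everything from source in one repository dissolves this: there is only ever one version of libC, the one at HEAD.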


These problems are caused by practices that delay integration:
  • feature branches
  • pull requests
  • code reviews that block integration
  • multiple repositories for different parts of the same application, e.g. different repository for company-internal library compared to code using it
    • using fully qualified versions of company-internal libraries in order to have a deterministic build:
      • the "version update dance" described earlier
      • delayed integration of library code
      • multiple versions of library code
      • diamond dependencies
    • using "snapshot" dependencies or equivalent ("semantic versioning" is a better equivalent but is still not fully deterministic)
      • transient errors due to indeterministic builds
      • need for a special "release build" meaning integration is only really tested then, hence not continuous integration


The practices in more detail:
  • "trunk based development", or at least this version of it:
    • integrate any changes the rest of the team has made into the source on your machine, e.g. git pull -r
    • do some work (e.g. run the build locally, then git commit -am "done some work")
    • integrate any changes the rest of the team has made in the meantime, e.g. git pull -r
    • share your work with the rest of the team if it all works, e.g. run the build again, then git push
    • i.e. everyone work on master all the time. No branches, not even short lived. You already have the equivalent of a branch on your machine - any code that isn't already pushed.
  • "monorepos", or at least this version of it:
    • have all code that has build-time dependencies on each other in the same repository, and build from source rather than against versioned artifacts
      • some people use the term monorepo to mean a single repo for an entire company - I don't know of an accepted term for what I mean. I tend to call them "multi-app repos". Suggestions in comments please.
      • other benefits beyond the scope of this article
  • pair programming, or mob programming (there are other reasons why pair programming is good; but I've limited this to what is relevant to "ASAP integration"):
    • constant code review
    • no need to create, review and merge a pull request, so reduced overhead
    • sooner integration, meaning less chance of people working on code that has already been changed by someone else
    • code review is in the context of the whole codebase rather than just the diff, meaning tool support for seeing why the code is like it is
    • even the smallest improvement can be made with reduced chance of annoying someone and with no overhead
    • other benefits beyond the scope of this article
  • Separation of deployment and release
    • Push your code to master every time you've done any work that does not break your codebase. It doesn't have to be "finished", it just has to be deployable without breaking anything that has been released.
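A minimal sketch of separating deployment from release with a feature toggle (the flag name and checkout logic are hypothetical):

```python
# Half-finished work is deployed "dark", behind a flag that is off.
FLAGS = {"new_checkout": False}  # deployed but not released

def old_checkout(basket):
    return sum(basket)

def new_checkout(basket):
    raise NotImplementedError("still being built")

def checkout(basket):
    if FLAGS["new_checkout"]:
        return new_checkout(basket)  # unfinished, but unreachable until released
    return old_checkout(basket)

print(checkout([1, 2, 3]))  # 6 - released behaviour is unchanged
```

Because the unfinished path is unreachable, the commit is safe to push to master and deploy even though the feature isn't done.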

Side effects

Using this approach, it is possible to break the master branch (note that this doesn't mean you will push anything broken into production, just that the master branch can contain some "bad" commits); for some people this is unacceptable. It is an "optimistic" approach. In almost all teams I've worked in, the benefits of this approach outweigh this potential problem, but it does require discipline and a way of working that not everyone is used to:
  • you have to work incrementally
  • you have to have good enough test coverage for it to be a sensible way to work
  • any code you have pushed must be deployable (but not necessarily released) without breaking anything currently released
  • you may have to have a mechanism to release changes independently of the code being deployed, e.g. feature flags
  • you have to try not to break the build
  • if you do push a commit which breaks the build, you have to fix it (or revert it) immediately
  • you should run your build locally before pushing, so your build (including tests) needs to be fast


Benefits

  • Much reduced chance of merge conflicts
    • merge conflicts are a problem because a human resolving a merge conflict is much more likely to make a mistake, e.g. accidentally undo someone else's change, than an automated merge
  • Better support for refactoring
  • Efficient way of working; less to do:
    • no creating a branch
    • no creating a pull request
    • no chasing someone to review your pull request
    • no doing a code review of some code you don't have the context of
    • no merging the pull request
    • no version update dance
It doesn't suit everyone or every team, but if your situation is suitable for you to try it, give it a go and let me know how it goes.

Copyright © 2018 Ivan Moore

Friday, July 22, 2016

Automated pipeline support for Consumer Driven Contracts

When developing a system composed of services (maybe microservices) some services will depend on other services in order to work. In this article I use the terminology "consumer"[1] to mean a service which depends on another service, and "provider" to mean a service being depended upon. This article only addresses consumers and providers developed within the same company. I'm not considering external consumers here.

This article is about what we did at Springer Nature to make it easy to run CDCs - there is more written about CDCs elsewhere.

CDCs - the basics

CDCs (Consumer Driven Contracts) can show the developers of a provider service that they haven't broken any of their consumer services.

Consider a provider service called Users which has two consumers, called Accounting and Tea. Accounting sends bills to users, and Tea delivers cups of tea to users.

The Users service provides various endpoints, and over time the requirements of its consumers change. Each consumer team writes tests (called CDCs) which check whether the provider service (Users in this case) understands the message sent to it and responds with a message the consumer understands (e.g. using JSON over HTTP).
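A consumer's CDC can be as small as a test asserting that the fields it relies on are present in the provider's response. A sketch (the HTTP call is faked so this is self-contained; all names are hypothetical):

```python
# Stands in for a GET against the deployed Users service; in a real CDC this
# would be an HTTP request to the provider in the relevant environment.
def fake_users_service(path):
    return {"id": 42, "name": "Ada", "email": "ada@example.com"}

def test_users_contract_for_tea():
    # Tea only asserts on the fields *it* depends on; extra fields are fine
    response = fake_users_service("/users/42")
    assert "id" in response
    assert "name" in response  # needed to address the cup of tea

test_users_contract_for_tea()
print("contract satisfied")
```

Accounting would have its own suite asserting on the fields it needs (e.g. a billing address), so Users learns exactly which parts of its responses each consumer actually uses.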

How we used to run CDCs

We used to have the consumer team send the provider team an executable (e.g. an executable jar) containing the consumer's CDCs. It was then up to the provider team to run those CDCs as necessary, e.g. by adding a stage to their CD (Continuous Delivery) pipeline. A problem with this was that it required manual effort by the provider team to set up such a stage in the CD pipeline, and more effort each time a consumer team wanted to update its CDCs.

How we run them now

Our automated pipeline system allows consumers to define CDCs in their own repository and declare which providers they depend upon in their pipeline metadata file. Using this information, the automated pipeline system adds a stage to the consumer's pipeline to run its CDCs against its providers, and a stage to the provider's pipeline to run its consumers' CDCs against itself. In our simple example earlier, this means the pipelines for Users, Accounting and Tea will be something like this:


(diagram: the pipelines for Users, Accounting and Tea, showing their CDC stages)

i.e. Users runs Accounting and Tea CDCs against itself (in parallel) after it has been deployed. Accounting and Tea run their CDCs against Users before they deploy.

This means that:
  • when a change is made to a consumer (e.g. Tea), its pipeline checks that its providers are still providing what is needed. This is quite standard and easy to arrange.
  • when a change is made to a provider (e.g. Users), its pipeline checks that it still provides what its consumers require. This is the clever bit that is harder to arrange. This is the point of CDCs.

Benefits of automation

By automating this setup, providers don't need to do anything in order to incorporate their consumers' CDCs into their pipeline. The providers also don't have to do anything in order to get updated versions of their consumers' CDCs.

The effort of setting up CDCs rests with the teams who have the dependency, i.e. the consumers. The consumers need to declare their provider (dependency) in their metadata file and define and maintain their CDCs.


There are a few subtleties involved in this system as it is currently implemented.
  • the consumer runs its CDCs against the provider in the same environment it is about to deploy into. There may be different versions of providers in different environments and this approach checks that the provider works for the version of the consumer that is about to be deployed, and will prevent the deployment if it is incompatible.
  • the provider runs the version of each consumer's CDCs corresponding to the version of the consumer in the same environment that the provider has just deployed into. There may be different versions of consumers in different environments and this approach checks that the provider works for the versions of consumers that are using the provider.
  • the system deploys the provider before running the consumer CDCs because the consumer CDCs need to run against it. It would be better for the system to deploy a new version of the provider without replacing the current version, run its consumers' CDCs and then only switch over to the new version if the CDCs all pass.
  • because the consumer's CDCs need to run against the provider in the appropriate environment, the system sets an environment variable with the host name of the provider in that environment. Because we only have one executable per consumer for all its CDCs, if a consumer has multiple providers, it needs to use those environment variables in order to determine which of its CDCs to execute.
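The last point can be sketched as follows (the environment-variable and suite names are hypothetical, not the real convention):

```python
import os

# A consumer with several providers decides which of its CDC suites to run
# based on which provider-host environment variables the pipeline has set.
os.environ["USERS_HOST"] = "users.test.example.com"  # set by the pipeline

PROVIDER_SUITES = {"USERS_HOST": "users_cdcs", "BILLING_HOST": "billing_cdcs"}

def cdcs_to_run():
    return [suite for var, suite in PROVIDER_SUITES.items() if var in os.environ]

print(cdcs_to_run())  # only the suite for the provider being checked runs
```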

Implementation notes

The implementation of a consumer running its CDCs against its provider is relatively straightforward. The difficulties are when a provider runs its consumers' CDCs against itself.

In order for a provider to run its consumers' CDCs the system clones each consumer's repository at the appropriate commit and then runs the appropriate executable in a Docker container. (The implementation doesn't clone every time, just if the repository hasn't been cloned on that build agent before.) Using Docker for running the CDCs means that consumers can implement their CDCs using whatever technology they want, as long as it runs in Docker.

In our system, all services are required to implement an endpoint which returns the git hash of the commit that they are built from. This is used to work out which version of consumer's CDCs to run in the case when they are run in a provider's pipeline.
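Conceptually the convention is tiny; a sketch (the endpoint path and wiring are assumptions, not the original code):

```python
# The commit hash is baked into the artifact at build time; the pipeline can
# then ask a deployed service exactly which commit it was built from.
BUILT_FROM = "1a2b3c4"  # hypothetical git hash

def handle_request(path):
    if path == "/commit-hash":
        return 200, BUILT_FROM
    return 404, "not found"

status, body = handle_request("/commit-hash")
print(status, body)
```

The pipeline uses the returned hash to check out the consumer's repository at that same commit before running its CDCs.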

Automating the running of CDCs in our automated pipelines required either making providers know who their consumers are, or consumers know who their providers are. If a provider doesn't provide what a consumer requires, it causes more problems for the consumer than for the provider. Therefore we made it the responsibility of the consumer to define that it depends on the provider rather than the other way around.

1 Other terminology in use for consumer is "downstream" and for provider is "upstream". A consumer is a dependant of a provider. A provider is a dependency of a consumer. I sometimes use the word producer instead of provider.

Copyright © 2016 Ivan Moore

Saturday, April 4, 2015

Example of automatically setting up pipelines

To explain automatically setting up pipelines, I've prepared example code (available here - see the license) for GoCD using gomatic and made a video showing the code running - which I will refer to throughout the article. I could possibly have done this all by narrating the video - maybe I will in the future.

Inception creation

The example includes a script which creates the pipeline that will create the pipelines (you run this only once):

from gomatic import GoCdConfigurator, HostRestClient, ExecTask, GitMaterial

configurator = GoCdConfigurator(HostRestClient("localhost:8153"))

pipeline = configurator\
    .ensure_pipeline_group("automated")\
    .ensure_replacement_of_pipeline("inception")\
    .set_timer("0 0 * * * ?")  # run on the hour, every hour; group name hypothetical
inception_job = pipeline.ensure_stage("inception").ensure_job("inception")
inception_job.ensure_task(ExecTask(["python", ""]))  # script name elided in the original

configurator.save_updated_config()


This creates a pipeline in GoCD (here running on localhost) which runs on a timer.

The video starts with this script being run, and the "inception" pipeline has been created by time 0:19.


The script creates a pipeline for every repo of a particular github user. For a real system, you might want to do something more sophisticated; this example has been kept deliberately simple.

from gomatic import GoCdConfigurator, HostRestClient, ExecTask
from github import Github, GithubException

github = Github()
me = github.get_user("teamoptimization")
for repo in me.get_repos():
    try:
        print "configuring", repo.name
        configurator = GoCdConfigurator(HostRestClient("localhost:8153"))

        pipeline = configurator\
            .ensure_pipeline_group("automated")\
            .ensure_pipeline(repo.name)\
            .set_git_url(repo.clone_url)  # group name hypothetical
        job = pipeline\
            .ensure_stage("bootstrap")\
            .ensure_job("bootstrap")
        bootstrap_file_url = ""  # URL of the bootstrap script (elided in the original)
        job.ensure_task(ExecTask(["bash", "-c", "curl -fSs " + bootstrap_file_url + " | python - " + repo.name]))

        configurator.save_updated_config()
    except GithubException:
        print 'ignoring', repo.name

This script creates a pipeline with a "bootstrap" stage for each repo (unless it already exists). As long as nothing else creates pipelines with the same names, the "bootstrap" stage will end up being the first stage. The bootstrap stage runs the script described later, passing it the name of the repo/pipeline as an argument.

In the video, the "inception" pipeline is triggered manually at time 0:20 (rather than waiting for the timer) and has finished by time 1:03 (and has no effect yet as the relevant user has no repositories).

The bootstrap stage

In this example, the script creates a stage for every line of a file (called commands.txt); this makes it easy to demonstrate one of the key features of the approach - the ability to keep the pipeline in sync with the repo it is for. One of the subtleties is that the bootstrap has to make sure that it doesn't remove itself, but does need to remove all other stages so that if stages are removed from commands.txt then they will be removed from the pipeline. Note that because of how gomatic works, if there is no difference as a result of removing then re-adding the stages, then no POST request will be sent to GoCD, i.e. it would be entirely unaffected.
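For illustration, here is a hypothetical commands.txt and how each line is parsed (the stage-name=command format is inferred from the bootstrap script below; the commands themselves are made up):

```python
# Each line of commands.txt becomes one stage: "stage-name=command".
commands_txt = """unit-tests=python -m pytest
docs=make docs"""

for line in commands_txt.splitlines():
    command_name, thing_to_execute = line.strip().split('=')
    print(command_name, "->", thing_to_execute.split(" "))
```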

import sys
from gomatic import GoCdConfigurator, HostRestClient, ExecTask

configurator = GoCdConfigurator(HostRestClient("localhost:8153"))

pipeline_name = sys.argv[1]

pipeline = configurator\
    .ensure_pipeline_group("automated")\
    .ensure_pipeline(pipeline_name)  # group name hypothetical; must match the inception script

# remove all stages except the first (i.e. this bootstrap stage itself)
for stage in pipeline.stages()[1:]:
    pipeline.remove_stage(stage.name)  # removal call approximate; lost from the original

commands = open("commands.txt").readlines()
for command in commands:
    command_name, thing_to_execute = command.strip().split('=')
    pipeline\
        .ensure_stage(command_name)\
        .ensure_job(command_name)\
        .ensure_task(ExecTask(thing_to_execute.split(" ")))

configurator.save_updated_config()


A real bootstrap script might be much more sophisticated, for example, creating a build stage automatically for any repo which contains a certain file (e.g. build.xml or maven.pom) and creating deployment stage(s) automatically. The example script is as short as I could make it for the purposes of demonstrating the approach.

In the video, the user creates a repository (from time 1:04 - 1:39) and then creates a commands.txt file, commits and pushes (up to time 2:18). Rather than waiting for the timer, the "inception" pipeline is manually triggered at time 2:22 and by 2:43 the pipeline is created for "project1". Rather than wait for GoCD to run the new pipeline (which it would after a minute or so) it is manually triggered at time 2:54, and when it runs, it creates the stage defined in commands.txt.

In the video, at time 3:54 the user adds another line to commands.txt and commits and pushes. Rather than wait for GoCD to run the pipeline (which it would after a minute or so) it is manually triggered at time 4:27, and when it runs, it adds the new stage defined in commands.txt.

Copyright © 2015 Ivan Moore

Saturday, February 7, 2015

Automatically setting up pipelines

Scripting the set-up of your continuous integration (CI) server is better than clicky-clicky, but it might be possible to do even better. If you have many pipelines that are very similar then you might be able to fully automate their set up. A bit like having your own, in-house version of Travis CI.

This article will use the GoCD terms "pipeline" and "stage" (a pipeline is somewhat like a "job" in Jenkins, and a pipeline comprises one or more stages).

This article describes (at a very high level) the system my colleague Hilverd Reker and I have set up to automatically create pipelines. This has built on experience I gained doing something similar with Ran Fan at a previous client, and being the "customer" of an automated CI configuration system at another previous client.


We have a pipeline in our CI server to automatically create the pipelines we want. We have called this "inception", after the film - I think Ran Fan came up with the name.

The inception pipeline looks for new things to build in new repositories, and sub directories within existing repositories, and creates pipelines as appropriate (using gomatic). (The inception process that Ran Fan and I wrote previously, looked for new things to build within "one large repo" (maybe the subject of a future blog article), and new branches of that repository).

The advantage of having this fully automated, compared to having to run a script to get the pipeline set up, is that it ensures that all pipelines get set up: none are forgotten and no effort is required.

Our inception job sets up a pipeline with only one stage, the bootstrap stage, which configures the rest of the pipeline. This keeps the inception job simple.

The bootstrap stage

Some of the configuration of a pipeline depends upon the files in the repository that the pipeline is for. By making the first stage in the pipeline the bootstrap stage, it can configure the pipeline accurately for the files as they exist when the pipeline runs. If a pipeline is configured by the inception job, or by running a script, rather than by a bootstrap stage, then its configuration will not reflect the files in the repository when they change, but rather how they were at the time the inception job, or script, ran. This would result in pipelines failing because they are not set up correctly for the files they are trying to use; hence we make the bootstrap part of the pipeline itself, to solve that problem.

Implementation notes

Our bootstrap stage only alters the configuration of the pipeline if it needs to: it runs very quickly if no changes are needed. GoCD handles changes to the configuration of a pipeline well. After the bootstrap stage has run, the subsequent stages run in the newly configured, or reconfigured, pipeline as expected. GoCD also handles the history of a pipeline reasonably well (though not always getting it right), even when its configuration changes over time.


What would help right now would be an example - but that'll take time to prepare; watch this space (patiently) ...

Copyright © 2015 Ivan Moore

Wednesday, January 14, 2015

Scripting the configuration of your CI server

How do you configure your CI server?

Most people configure their CI server using a web based UI. You can confirm this by searching for "setting up Jenkins job", "setting up TeamCity build configuration", "setup ThoughtWorks Go pipeline" etc. The results will tell you to configure the appropriate CI server through a web based UI, probably with no mention that this is not the only way.

One of my serial ex-colleagues, Nick Pomfret, describes using these web based UIs as "clicky-clicky". In this article I will use the Jenkins term "job" (aka "project") to also mean TeamCity build configuration or GoCD pipeline. In this article, I'm calling GoCD a CI server; get over it.

What is wrong with clicky-clicky? 

Clicky-clicky can be useful for quick experiments, or maybe if you only have one job to set up, but has some serious drawbacks. 

It works - don't change it

Once a job has been set up using clicky-clicky, one problem is that it is difficult to manage changes to it. It can be difficult to see who has changed what, and to restore a job to a previous configuration. Just version controlling the complete CI server configuration file (which some people do) does not do this well, because such files are difficult to diff, particularly when there are changes to other jobs.

Lovingly hand crafted, each one unique

Another problem with clicky-clicky is that when you have a lot of jobs you would like to set up in the same way, it is time consuming, and inevitably leads to unintended inconsistencies between jobs. These can cause jobs to behave in slightly different ways, causing confusion and making problems take longer to diagnose.

Can't see the wood for the tabs 

Furthermore, web UIs often don't make it easy to see everything about the configuration of a job in a compact format - some CI servers are better than others for that.

The right way - scripting

If you script the setup of jobs, then you can version control the scripts. You can then safely change jobs, knowing that you can recreate them in the current or previous states, and you can see who changed what. If you need to move the CI server to a new machine, you can just rerun the scripts.

In some cases a script for setting up a job can be much more readable than the UI because it is often more compact and everything is together rather than spread over one or more screens.

Fully automated configuration of jobs

It can be very useful to script the setup of jobs so it is totally automatic; i.e. when a new project is created (e.g. a new repo, or a new directory containing a particular file, such as a build.gradle), a job is created automatically. This saves time because nobody needs to set up jobs manually, it means that every project that needs a job gets one and none are forgotten, and it keeps jobs consistent so it is easy to know what they do.

There are some subtleties about setting up fully automated jobs which I won't go into here - maybe a future blog article.

Tools for scripting

For GoCD, see gomatic. For other CI servers, please add a comment if you know of anything that is any good!

Copyright © 2015 Ivan Moore

Wednesday, December 24, 2014

Gomatic - scripting of GoCD configuration

Gomatic has been released - it is a Python API for configuring ThoughtWorks GoCD. I worked on it with my colleague Hilverd Reker. There isn't any documentation yet - we'll add some. For the moment, I thought I'd just post a very brief article here to announce it and to show a simple example of using it.


We wrote it for our purposes and find it very useful; however, it has limitations (e.g. only really supports "Custom Command" task type) and allows you to try to configure GoCD incorrectly (which GoCD will refuse to allow). We will continue to work on it and will address its current limitations.

It has only been tested using GoCD version 14.2.0-377 - I think it doesn't yet work with other versions.


We've written it using Python 2 (for the moment - should be simple to port to Python 3 - which we might do in the future). You can install it using "pip":

sudo pip install gomatic

Create a pipeline

If you wanted to configure a pipeline something like that shown in the GoCD documentation then you could run the following script:

#!/usr/bin/env python
from gomatic import *

go_server = GoServerConfigurator(HostRestClient("localhost:8153"))
pipeline = go_server \
    .ensure_pipeline_group("Group") \
    .ensure_replacement_of_pipeline("first_pipeline")
stage = pipeline.ensure_stage("a_stage")
job = stage.ensure_job("a_job")

go_server.save_updated_config()  # actually apply the configuration change


Reverse engineer a pipeline

Gomatic can reverse engineer a gomatic script for an existing pipeline.

If you run the following (we will make it easier to run this rather than having to write a script):

#!/usr/bin/env python
from gomatic import *
go_server = GoServerConfigurator(HostRestClient("localhost:8153"))
pipeline = go_server\
    .ensure_pipeline_group("Group")\
    .find_pipeline("first_pipeline")  # continuation lines lost from the original; calls approximate
print go_server.as_python(pipeline)

this will print out the following text:

#!/usr/bin/env python
from gomatic import *

go_server_configurator = GoServerConfigurator(HostRestClient("localhost:8153"))
pipeline = go_server_configurator\
    .ensure_pipeline_group("Group")\
    .ensure_replacement_of_pipeline("first_pipeline")
stage = pipeline.ensure_stage("a_stage")
job = stage.ensure_job("a_job")

go_server_configurator.save_updated_config(save_config_locally=True, dry_run=True)

This produces a script which does a "dry run" so you can run it to see what changes it will make before running it for real.

So what?

I don't have time to write about why this is a Good Idea, or the consequences of being able to script the configuration of your CI server - but will do soon.

[This article slightly updated as a result of a new release of gomatic]

Copyright © 2014 Ivan Moore