Sunday, November 23, 2008

The problem with conventional continuous integration servers

I recently presented a talk on continuous integration at Agile North.

There was one topic in the talk that I think is so important that I decided to write an article about it. Surprisingly, it's a topic that many users of continuous integration (CI) servers (and even the developers of them in some cases!) don't seem to have thought much about.

Continuous integration

This article assumes you already know what continuous integration is.

The problem with conventional continuous integration servers

Imagine there are three developers, Tom, Dick and Harriet, and a continuous integration server. There is some code in a source code repository and these three developers start with the code cleanly checked out.

  1. Tom makes some changes and commits his changes.
  2. The CI server starts running a build.
  3. Harriet makes some changes and commits her changes.
  4. Dick makes some changes and commits his changes.
  5. The CI server reports that the build is OK.
  6. The CI server starts running a build (because there are new changes for it to run the build on).
  7. Tom makes some changes and commits his changes.
  8. The CI server reports that the build is broken.
Here's a diagram representing that:

[Timeline diagram: Tom, Dick and Harriet's commits interleaved with the two build runs]

The question is - who broke the build? (Left as an exercise for the reader.) The problem with conventional continuous integration servers is that they can't tell you. A situation like this is inevitable with the vast majority of continuous integration server installations that actually exist (I know you could install your favourite CI server differently if you had enough money - read on ...).

BTW - step 7 is a bit of a red herring. It isn't necessary for the main point of this article, but it's a common enough situation: Tom thinks he's committed on a green build, but the committed code is already broken - the CI server just hasn't said so yet.

Why it is a problem

The problem with not knowing which commit broke the build is that it takes longer to work out who should look at the problem and how they can fix it. If you know which commit broke the build, you know who should look at it, and they can review their changes to work out why it broke. Furthermore, if you know which commit broke the build (even if you can't work out why), you can revert that change set while the problem is fixed "off line" from the rest of the team.

I am convinced (from years of using CI servers on many teams) that not knowing which commit broke the build is a major contributor to sloppy CI practice - builds staying red for ages, nobody taking responsibility, reduced commit frequency, and so on. Just knowing which test failed, or "why" the build broke, isn't enough. The symptoms don't always tell you the cause. Knowing the cause - i.e. which commit - is what you need.

When do you get this problem?

You suffer this problem more as the team gets larger, commits become more frequent, and the length of the build goes up. (Oh, and if developers get sloppier).

You often can't do anything about the size of the team, you'd like to encourage people to commit more frequently, and making the build faster can be really hard. (And you might be able to get rid of sloppy developers, but that's quite a different blog post). Ideally the build should be fast, but that is often easier said than done, and even if the build is fast, it doesn't completely eliminate the problem described in this article.

Solutions

Most continuous integration servers don't solve this problem, but some do. These are the solutions that I know about. The CI server can:

a) run the build for the commits (revisions) between the last known good and the first known bad (provided there is enough capacity in the build farm, e.g. because people stop committing while the build is broken).
b) check that the build passes (on a build agent) before committing changes.
c) run multiple (preferably all) commits in parallel on different build agents in a build farm.

build-o-matic does (a) automatically, using a binary search. TeamCity (version 4 EAP) and Pulse (and maybe others) let you run a build against a previous revision, so you can do the equivalent manually (maybe someone will write a plug-in for TeamCity to do what build-o-matic does? There is a precedent).
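
For the curious, here's a rough sketch of the kind of binary search (a) implies - it's not build-o-matic's actual code; checkout_revision and run_build are hypothetical stand-ins for whatever your VCS and build really provide, and it assumes the build stays broken from the breaking commit onwards:

```python
# Sketch of option (a): binary-search the revisions committed between the last
# known good build and the first known bad build to find the breaking commit.
def find_breaking_revision(revisions, checkout_revision, run_build):
    """revisions: suspect revision ids, oldest first; the revision before
    revisions[0] is known good and revisions[-1] is known bad."""
    low, high = 0, len(revisions) - 1          # revisions[high] is known bad
    while low < high:
        mid = (low + high) // 2
        checkout_revision(revisions[mid])      # get a workspace at that revision
        if run_build():                        # green at mid...
            low = mid + 1                      # ...so the breakage came later
        else:                                  # red at mid...
            high = mid                         # ...so the breakage is mid or earlier
    return revisions[low]                      # first revision where the build breaks
```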

TeamCity and Pulse do (b); build-o-matic doesn't. It's a cool feature - I've used it in TeamCity and it works well, but you do need a lot of build agents.
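
To make (b) concrete, here's a minimal sketch of the idea (not TeamCity's or Pulse's actual implementation); export_clean_workspace, apply_changes, run_build and commit are hypothetical helpers standing in for what the CI server does internally:

```python
import shutil

# Sketch of option (b), pre-tested commit: build the developer's change on an
# agent first, and only let it reach the repository if the build is green.
def pretested_commit(change, export_clean_workspace, apply_changes, run_build, commit):
    workspace = export_clean_workspace()           # fresh checkout of the current head
    try:
        apply_changes(workspace, change)           # overlay the developer's changes
        if run_build(workspace):                   # compile and run the tests on the agent
            commit(change)                         # only a green build gets committed
            return "committed"
        return "rejected: build failed"
    finally:
        shutil.rmtree(workspace, ignore_errors=True)   # tidy up the build agent
```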

build-o-matic does (c). I think TeamCity and possibly some others will too, if you have enough build agents. The problem with this approach is that you really might need a lot of build agents, particularly if the team is large, commits are frequent and the build is long.
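
The essence of (c) is simply one build per commit, farmed out in parallel. As a sketch (again, not any particular product's code), where build_revision is a hypothetical helper that checks out one revision on an idle agent and returns whether the build passed:

```python
from concurrent.futures import ThreadPoolExecutor

# Sketch of option (c): every commit gets its own build, run in parallel across
# the build farm, so each revision is reported green or red in its own right.
def build_every_commit(revisions, build_revision, num_agents=8):
    with ThreadPoolExecutor(max_workers=num_agents) as pool:
        results = list(pool.map(build_revision, revisions))   # one build per revision
    # The first revision reported broken is the one that broke the build.
    return {rev: ("OK" if ok else "broken") for rev, ok in zip(revisions, results)}
```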

My preferred solution is to buy enough build agents to do (c) - computers are very cheap. Note, however, that just because a CI server supports a build farm (even an infinitely large one), it doesn't necessarily mean it'll run the build for every commit - check your CI server's documentation for details. I believe Bamboo has something up its sleeve on this topic - but I'm not sure if it's public yet (I'll find out and add a comment as appropriate).

There is another solution, which is not to use a CI server at all but instead have a "build token" or an "integration machine" - i.e. serialize all commits so that you never commit while the build is running. That only works well for small co-located teams with a fast build. But when conditions are suitable, it really works well!

Conclusion

There are lots of CI servers to choose from. I consider working out which commit broke the build to be a basic minimum feature of a CI server installation, but surprisingly not all CI servers are capable of telling you - and you really need to understand this before you choose one, or you may end up with a broken build and no idea which commit caused it.

Copyright © 2008 Ivan Moore

6 comments:

Nat Pryce said...

The same effect as (b) can be achieved with distributed version control systems. Developers can submit patches from their working branch to the central integration branch. If the patch fails (conflicts, compile errors, test failures), the patch is rejected and they must resolve the problems and try submitting again.

You can do this by configuring post-commit hooks on the integration branch, or with a patch queue server, which is like a CI server but accepts patches by email.

Ivan Moore said...

Hi Nat - good point. Something I forgot to mention is that for that sort of scheme (I think the DVCS variety is called "automatic gatekeeper" (at least for Bazaar) and the TeamCity way is called "remote run and delayed (pre-tested) commit") you need lots of build agents (or "automatic gatekeeper" machines) - you don't get 'owt for nowt. It is well worth getting lots of build agents/"automatic gatekeepers", but if you don't and still try to use such a scheme, your integrations could get backed up - it could degenerate into delayed, rather than continuous, integration.

Jeffrey Fredrick said...

Great article Ivan. (And thanks to Julian for pointing me here.)

I think you touch on a key enabling point for all three of the strategies: computing power is cheap.

All three strategies are about trading off cheap computing time for expensive people time. It might seem expensive to have N build machines but when you think of the productivity savings it is a bargain.

Otoh I do fear that insufficient computing power is probably a common anti-pattern for all three of these approaches, yielding (as per your comment) delayed integration.

Option B is getting to be a popular feature, with most of the commercial tools supporting it, but I suspect lots of teams will actually end up with longer feedback cycles due to a lack of agents.

Unknown said...

Great article Ivan. Over at Urbancode we try to address this problem in AnthillPro with option C (enough build agents to build each commit as it comes in) or potentially B (personal/preflight/precommit builds).

If it's a common problem though, rather than a once in a blue moon kind of thing, I would really focus on your section "When do you get this problem?"

I'd agree that you don't want to discourage frequent commits, but I think that, while non-trivial to address, it's worthwhile to attack build speed and pursue strategies to shrink the number of developers who can break each other's builds - effectively having small development teams working in concert.

Ivan Moore said...

The Bamboo thing I was referring to is Elastic Bamboo, which dynamically creates build agents using Amazon EC2. (I didn't know whether it was publicly announced - it's on the web so it must be.) Sounds like a great way of scaling the number of build agents on demand. I don't know what the current status of this is.

Unknown said...

Why settle for either (B) or (C) when you could have both (B) and (C)? (For the record, this is what my company's product, Cascade, does.) Pre-commit tests to verify that you're not going to break anything, plus post-commit tests to verify that nothing was in fact broken (even if someone bypassed the pre-commit system and committed directly to the underlying repository).