Excellent news: Jon Pipitone has finished his MSc project on the software quality of climate models, and it makes fascinating reading. I quote his abstract here:

A climate model is an executable theory of the climate; the model encapsulates climatological theories in software so that they can be simulated and their implications investigated. Thus, in order to trust a climate model one must trust that the software it is built from is built correctly. Our study explores the nature of software quality in the context of climate modelling. We performed an analysis of the reported and statically discoverable defects in several versions of leading global climate models by collecting defect data from bug tracking systems, version control repository comments, and from static analysis of the source code. We found that the climate models all have very low defect densities compared to well-known, similarly sized open-source projects. As well, we present a classification of static code faults and find that many of them appear to be a result of design decisions to allow for flexible configurations of the model. We discuss the implications of our findings for the assessment of climate model software trustworthiness.

The idea for the project came from an initial back-of-the-envelope calculation we did of the Met Office Hadley’s Centre’s Unified Model, in which we estimated the number of defects per thousand lines of code (a common measure of defect density in software engineering) to be extremely low – of the order of 0.03 defects/KLoC. By comparison, the shuttle flight software, reputedly the most expensive software per line of code ever built, clocked in at 0.1 defects/KLoC; most of the software industry does worse than this.

This initial result was startling, because the climate scientists who build this software don’t follow any of the software processes commonly prescribed in the software literature. Indeed, when you talk to them, many climate modelers are a little apologetic about this, and have a strong sense they ought to be doing more rigorous job with their software engineering. However, as we documented in our paper, climate modeling centres such as the UK Met Office do have a excellent software processes, that they have developed over many years to suit their needs. I’ve come to the conclusion that it has to be very different from mainstream software engineering processes because the context is so very different.

Well, obviously we were skeptical (scientists are always skeptical, especially when results seem to contradict established theory). So Jon set about investigating this more thoroughly for his MSc project. He tackled the question in three ways: (1) measuring defect density by using bug repositories, version history and change logs to quantify bug fixes; (2) assessing the software directly using static analysis tools and (3) interviewing climate modelers to understand how they approach software development and bug fixing in particular.

I think there are two key results of Jon’s work:

  1. The initial results on defect density bear up. Although not quite as startlingly low as my back of the envelope calculation, Jon’s assessment of three major GCMs indicate they all fall in the range commonly regarded as good quality software by industry standards.
  2. There are a whole bunch of reasons why result #1 may well be meaningless, because the metrics for measuring software quality don’t really apply well to large scale scientific simulation models.

You’ll have to read Jon’s thesis to get all the details, but it will be well worth it. The conclusion? More research needed. It opens up plenty of questions for a PhD project….

3 Comments

  1. Steve,

    This is a very interesting result, and an excellent topic for an MSc thesis. Congratulations to Jon.

    It is common for computer scientists to be horrified at the spaghetti-FORTRAN that many scientists still use. There is a temptation to look at the code and say “this can’t possibly be right”. However, the real test is whether the code produces the correct answer (within reasonable error bounds), regardless of how many underflows, overflows, off-by-one errors, malloc faults, etc, there are.

  2. Pingback: Jim Graham : What is “Reproducibility,” Anyway?

  3. I didn’t have a chance to read the thesis yet, but […edited…]

    [Quit the trolling. If you’re genuinely interested, go read the thesis. – Steve]

  4. Pingback: I never said that! | Serendipity

  5. Pingback: Do Climate Models need Independent Verification and Validation? | Serendipity

  6. Different projects in different fields may have very different definitions of ‘defect’. For instance, I have worked on projects (such as CCC) in which an unclear comment is a defect. So comparing defect densities between projects is a minefield.

Join the discussion: