Sometime in the 1990’s, I drafted a frequently asked question list for NASA’s IV&V facility. Here’s what I wrote on the meaning of the terms “validation” and “verification”:

The terms Verification and Validation are commonly used in software engineering to mean two different types of analysis. The usual definitions are:

  • Validation: Are we building the right system?
  • Verification: Are we building the system right?

In other words, validation is concerned with checking that the system will meet the customer’s actual needs, while verification is concerned with whether the system is well-engineered, error-free, and so on. Verification will help to determine whether the software is of high quality, but it will not ensure that the system is useful.

The distinction between the two terms is largely to do with the role of specifications. Validation is the process of checking whether the specification captures the customer’s needs, while verification is the process of checking that the software meets the specification.

Verification includes all the activities associated with producing high-quality software: testing, inspection, design analysis, specification analysis, and so on. It is a relatively objective process, in that if the various products and documents are expressed precisely enough, no subjective judgements should be needed in order to verify software.

In contrast, validation is an extremely subjective process. It involves making subjective assessments of how well the (proposed) system addresses a real-world need. Validation includes activities such as requirements modelling, prototyping and user evaluation.

In a traditional phased software lifecycle, verification is often taken to mean checking that the products of each phase satisfy the requirements of the previous phase. Validation is relegated to just the beginning and ending of the project: requirements analysis and acceptance testing. This view is common in many software engineering textbooks, and is misguided. It assumes that the customer’s requirements can be captured completely at the start of a project, and that those requirements will not change while the software is being developed. In practice, the requirements change throughout a project, partly in reaction to the project itself: the development of new software makes new things possible. Therefore both validation and verification are needed throughout the lifecycle.

Finally, V&V is now regarded as a coherent discipline: “Software V&V is a systems engineering discipline which evaluates the software in a systems context, relative to all system elements of hardware, users, and other software” (from Software Verification and Validation: Its Role in Computer Assurance and Its Relationship with Software Project Management Standards, by Dolores R. Wallace and Roger U. Fujii, NIST Special Publication 500-165).

Having thus carefully distinguished the two terms, my advice to V&V practitioners was then to forget about the distinction, and think instead about V&V as a toolbox, which provides a wide range of tools for asking different kinds of questions about software. And to master the use of each tool and figure out when and how to use it. Here’s one of my attempts to visualize the space of tools in the toolbox:

A range of V&V techniques. Note that "modeling" and "model checking" refer to building and analyzing abstracted models of software behaviour, a very different kind of beast from scientific models used in the computational sciences

For climate models, the definitions that focus on specifications don’t make much sense, because there are no detailed specifications of climate models (nor can there be – they’re built by iterative refinement like agile software development). But no matter – the toolbox approach still works; it just means some of the tools are applied a little differently. An appropriate toolbox for climate modeling looks a little different from my picture above, because some of these tools are more appropriate for real-time control systems, applications software, etc, and there are some missing from the above picture that are particular to simulation software. I’ll draw a better picture when I’ve finished analyzing the data from my field studies of practices used at climate labs.

Many different V&V tools are already in use at most climate modelling labs, but there is room for adding more tools to the toolbox, and for sharpening the existing tools (what and how are the subjects of my current research). But the question of how best to do this must proceed from a detailed analysis of current practices and how effective they are. There seem to be plenty of people wandering into this space, claiming that the models are insufficiently verified, validated, or both. And such people like to pontificate about what climate modelers ought to do differently. But anyone who pontificates in this way without being able to give a detailed account of which V&V techniques climate modellers currently use is just blowing smoke. If you don’t know what’s in the toolbox already, then you can’t really make constructive comments about what’s missing.

A common cry from climate contrarians is that climate models need better verification and validation (V&V), and in particular, that they need Independent V&V (aka IV&V). George Crews has been arguing this for a while, and now Judith Curry has taken up the cry. Having spent part of the 1990’s as lead scientist at NASA’s IV&V facility, and the last few years studying climate model development processes, I think I can offer some good insights into this question.

The short answer is “no, they don’t”. The slightly longer answer is “if you have more money to spend to enhance the quality of climate models, spending it on IV&V is probably the least effective thing you could do”.

The full answer involves deconstructing the question, to show that it is based on three incorrect assumptions about climate models: (1) that there’s some significant risk to society associated with the use of climate models; (2) that the existing models are inadequately tested / verified / validated / whatevered; and (3) that trust in the models can be improved by using an IV&V process. I will demonstrate what’s wrong with each of these assumptions, but first I need to explain what IV&V is.

Independent Verification and Validation (IV&V) is a methodology developed primarily in the aerospace industry for reducing the risk of software failures, by engaging a separate team (separate from the software development team, that is) to perform various kinds of testing and analysis on the software as it is produced. NASA adopted IV&V for development of the flight software for the space shuttle in the 1970’s. Because IV&V is expensive (it typically adds 10%-20% to the cost of a software development contract), NASA tried to cancel the IV&V on the shuttle in the early 1980’s, once the shuttle was declared operational. Then, of course, the Challenger disaster occurred. Although software wasn’t implicated, a consequence of the investigation was the creation of the Leveson committee, to review the software risk. Leveson’s committee concluded that far from cancelling IV&V, NASA needed to adopt the practice across all of its space flight programs. As a result of the Leveson report, the NASA IV&V facility was established in the early 1990’s, as a centre of expertise for all of NASA’s IV&V contracts. In 1995, I was recruited as lead scientist at the facility, and while I was there, our team investigated the operational effectiveness of the IV&V contracts on the Space Shuttle, International Space Station, Earth Observing System, Cassini, as well as a few other smaller programs. (I also reviewed the software failures on NASA’s Mars missions in the 1990’s, and have a talk about the lessons learned.)

The key idea for IV&V is that when NASA puts out a contract to develop flight control software, it also creates a separate contract with a different company, to provide an ongoing assessment of software quality and risk as the development proceeds. One difficulty with IV&V contracts in the US aerospace industry is that it’s hard to achieve real independence, because industry consolidation has left very few aerospace companies available to take on such contracts, and they’re not sufficiently independent from one another.

NASA’s approach demands independence along three dimensions:

  • managerial independence (the IV&V contractor is free to determine how to proceed, and where to devote effort, independently of both the software development contractor and the customer);
  • financial independence (the funding for the IV&V contract is separate from the development contract, and cannot be raided if more resources are needed for development); and
  • technical independence (the IV&V contractor is free to develop its own criteria, and apply whatever V&V methods and tools it deems appropriate).

This has led to the development of a number of small companies who specialize only in IV&V (thus avoiding any contractual relationship with other aerospace companies), and who tend to recruit ex-NASA staff to provide them with the necessary domain expertise.

For the aerospace industry, IV&V has been demonstrated to be a cost-effective strategy to improve software quality and reduce risk. The problem is that the risks are extreme: software errors in the control software for a spacecraft or an aircraft are highly likely to cause loss of life, loss of the vehicle, and/or loss of the mission. There is a sharp distinction between the development phase and the operation phase for such software: it had better be correct when it’s launched. This means the risk mitigation has to be done during development, rather than during operation. In other words, iterative/agile approaches don’t work – you can’t launch with a beta version of the software. The goal is to detect and remove software defects before the software is ever used in an operational setting. An extreme example of this was the construction of the space station, where the only full end-to-end construction of the system was done in orbit; it wasn’t possible to put the hardware together on the ground in order to do a full systems test on the software.

IV&V is essential for such projects, because it overcomes the natural confirmation bias of software development teams. Even the NASA program managers overseeing the contracts suffer from this – we discovered one case where IV&V reports on serious risks were being systematically ignored by the NASA program office, because the program managers preferred to believe the project was going well. We fixed this by changing the reporting structure, and routing the IV&V reports directly to the Office of Safety and Mission Assurance at NASA headquarters. The IV&V teams developed their own emergency strategy too – if they encountered a risk that they considered mission-critical, and couldn’t get the attention of the program office to address it, they would go and have a quiet word with the astronauts, who would then ensure the problem got seen to!

But IV&V is very hard to do right, because much of it is a sociological problem rather than a technical problem. The two companies (developer and IV&V contractor) are naturally set up in an adversarial relationship, but if they act as adversaries, they cannot be effective: the developer will have a tendency to hide things, and the IV&V contractor will have a tendency to exaggerate the risks. Hence, we observed that the relationship is most effective where there is a good horizontal communication channel between the technical staff in each company, and where they come to respect one another’s expertise. The IV&V contractor has to be careful not to swamp the communication channels with spurious low-level worries, and the development contractor must be willing to respond positively to criticism. One way this works very well is for the IV&V team to give the developers advance warning of any issues they plan to report up the hierarchy to NASA, so that the development contractor can have a solution in place even before NASA asks for it. For a more detailed account of these coordination and communication issues, see:

Okay, let’s look at whether IV&V is applicable to climate modeling. Earlier, I identified three assumptions made by people advocating it. Let’s take them one at a time:

1) The assumption that there’s some significant risk to society associated with the use of climate models.

A large part of the mistake here is to misconstrue the role of climate models in policymaking. Contrarians tend to start from an assumption that proposed climate change mitigation policies (especially any attempt to regulate emissions) will wreck the economies of the developed nations (or specifically the US economy, if it’s an American contrarian). I prefer to think that a massive investment in carbon-neutral technologies will be a huge boon to the world’s economy, but let’s set aside that debate, and assume for the sake of argument that whatever policy path the world takes, it’s incredibly risky, with a non-negligible probability of global catastrophe if the policies are either too aggressive or not aggressive enough, i.e. if the scientific assessments are wrong.

The key observation is that software does not play the same role in this system that flight software does for a spacecraft. For a spacecraft, the software represents a single point of failure. An error in the control software can immediately cause a disaster. But climate models are not control systems, and they do not determine climate policy. They don’t even control it indirectly – policy is set by a laborious process of political manoeuvring and international negotiation, in which the impact of any particular climate model is negligible.

Here’s what happens: the IPCC committees propose a whole series of experiments for the climate modelling labs around the world to perform, as part of a Coupled Model Intercomparison Project. Each participating lab chooses those runs they are most able to do, given their resources. When they have completed their runs, they submit the data to a public data repository. Scientists around the world then have about a year to analyze this data, interpret the results, compare the performance of the models, discuss findings at conferences and workshops, and publish papers. This results in thousands of publications from across a number of different scientific disciplines. The publications that make use of model outputs take their place alongside other forms of evidence, including observational studies, studies of paleoclimate data, and so on. The IPCC reports are an assessment of the sum total of the evidence; the model results from many runs of many different models are just one part of that evidence. Jim Hansen rates models as the third most important source of evidence for understanding climate change, after (1) paleoclimate studies and (2) observed global changes.

The consequences of software errors in a model, in the worst case, are likely to extend to no more than a few published papers being retracted. This is a crucial point: climate scientists don’t blindly publish model outputs as truth; they use model outputs to explore assumptions and test theories, and then publish papers describing the balance of evidence. Further papers then come along that add more evidence, or contradict the earlier findings. The assessment reports then weigh up all these sources of evidence.

I’ve been asking around for a couple of years for examples of published papers that were subsequently invalidated by software errors in the models. I’ve found several cases where a version of the model used in the experiments reported in a published paper was later found to contain an important software bug. But in none of those cases did the bug actually invalidate the conclusions of the paper. So even this risk is probably overstated.

The other point to make is that around twenty different labs around the world participate in the Model Intercomparison Projects that provide data for the IPCC assessments. That’s a level of software redundancy that is simply impossible in the aerospace industry. It’s likely that these 20+ models are not quite as independent as they might be (e.g. see Knutti’s analysis of this), but even so, the ability to run many different models on the same set of experiments, and to compare and discuss their differences is really quite remarkable, and the Model Intercomparison Projects have been a major factor in driving the science forward in the last decade or so. It’s effectively a huge benchmarking effort for climate models, with all the benefits normally associated with software benchmarking (and worthy of a separate post – stay tuned).

So in summary, while there are huge risks to society of getting climate policy wrong, those risks are not software risks. A single error in the flight software for a spacecraft could kill the crew. A single error in a climate model can, at most, only affect a handful of the thousands of published papers on which the IPCC assessments are based. The actual results of a particular model run are far less important than the understanding the scientists gain about what the model is doing and why, and the nature of the uncertainties involved. The modellers know that the models are imperfect approximations of very complex physical, chemical and biological processes. Conclusions about key issues such as climate sensitivity are based not on particular model runs, but on many different experiments with many different models over many years, and the extent to which these experiments agree or disagree with other sources of evidence.

2) the assumption that the current models are inadequately tested / verified / validated / whatevered;

This is a common talking point among contrarians. Part of the problem is that while the modeling labs have evolved sophisticated processes for developing and testing their models, they rarely bother to describe these processes to outsiders – nearly all published reports focus on the science done with the models, rather than the modeling process itself. I’ve been working to correct this, with, first, my study of the model development processes at the UK Met Office, and more recently my comparative studies of other labs, and my accounts of the existing V&V processes. Some people have interpreted the latter as a proposal for what should be done, but it is not; it is an account of the practices currently in place across all of the labs I have studied.

A key point is that for climate models, unlike spacecraft flight controllers, there is no enforced separation between software development and software operation. A climate model is always an evolving, experimental tool; it’s never a finished product – even the prognostic runs done as input to the IPCC process are just experiments, requiring careful interpretation before any conclusions can be drawn. If the model crashes, or gives crazy results, the only damage is wasted time.

This means that an iterative development approach is the norm, which is far superior to the waterfall process used in the aerospace industry. Climate modeling labs have elevated the iterative development process to a new height: each change to the model is treated as a scientific experiment, where the change represents a hypothesis for how to improve the model, and a series of experiments is used to test whether the hypothesis was correct. This means that software development proceeds far more slowly than commercial software practices (at least in terms of lines of code per day), but that the models are continually tested and challenged by the people who know them inside out, and comparison with observational data is a daily activity.

The result is that climate models have very few bugs, compared to commercial software, when measured using industry standard defect density measures. However, although defect density is a standard IV&V metric, it’s probably a poor measure for this type of software – it’s handy for assessing risk of failure in a control system, but a poor way of assessing the validity and utility of a climate model. The real risk is that there may be latent errors in the model that mean it isn’t doing what the modellers designed it to do. The good news is that such errors are extremely rare: nearly all coding defects cause problems that are immediately obvious: the model crashes, or the simulation becomes unstable. Coding defects can only remain hidden if they have an effect that is small enough that it doesn’t cause significant perturbations in any of the diagnostic variables collected during a model run; in this case they are indistinguishable from the acceptable imperfections that arise as a result of using approximate techniques. The testing processes for the climate models (which in most labs include a daily build and automated test across all reference configurations) are sufficient that such problems are nearly always identified relatively early.
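
To make the point about hidden defects concrete, here is a minimal sketch of a diagnostic regression check of the kind an automated test might run against a reference configuration. It is only an illustration: the function name `check_diagnostics`, the diagnostic names and the tolerance are all hypothetical, and are not taken from any lab’s actual test suite.

```python
import math


def check_diagnostics(reference, candidate, rel_tol=1e-6):
    """Return the names of diagnostics that differ from the reference run.

    `reference` and `candidate` map a diagnostic name to a list of values
    (for example, one value per time step). The names and the tolerance
    are hypothetical; real test suites differ between labs.
    """
    failures = []
    for name, ref_values in reference.items():
        new_values = candidate.get(name)
        if new_values is None or len(new_values) != len(ref_values):
            failures.append(name)  # missing or truncated output
            continue
        for ref, new in zip(ref_values, new_values):
            # NaN or infinity signals an unstable simulation
            if not math.isfinite(new) or not math.isclose(ref, new, rel_tol=rel_tol):
                failures.append(name)
                break
    return failures


reference = {"global_mean_temp": [287.10, 287.12, 287.15],
             "total_precip": [2.90, 2.91, 2.89]}
# A change that makes the simulation blow up is caught immediately...
unstable = {"global_mean_temp": [287.10, float("nan"), float("nan")],
            "total_precip": [2.90, 2.91, 2.89]}
# ...but a perturbation smaller than the tolerance passes unnoticed.
tiny = {"global_mean_temp": [287.10, 287.12, 287.15 + 1e-9],
        "total_precip": [2.90, 2.91, 2.89]}

print(check_diagnostics(reference, unstable))  # ['global_mean_temp']
print(check_diagnostics(reference, tiny))      # []
```

The second case is the interesting one: a defect whose effect stays below the tolerance on every collected diagnostic looks exactly like the acceptable imperfection of an approximate numerical scheme.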

This means that there are really only two serious error types that can lead to misleading scientific results: (1) misunderstanding of what the model is actually doing by the scientists who conduct the model experiments, and (2) structural errors, where specific earth system processes are omitted or poorly captured in the model. In flight control software, these would correspond to requirements errors, and would be probed by an IV&V team through specification analysis. Catching these in control software is vital because you only get one chance to get it right. But in climate science, these are science errors, and are handled very well by the scientific process: making such mistakes, learning from them, and correcting them are all crucial parts of doing science. The normal scientific peer review process handles these kinds of errors very well. Model developers publish the details of their numerical algorithms and parameterization schemes, and these are reviewed and discussed in the community. In many cases, different labs will attempt to build their own implementations from these descriptions, and in the process subject them to critical scrutiny. In other words, there is already an independent expert review process for the most critical parts of the models, using the normal scientific route of replicating one another’s techniques. Similarly, experimental results are published, and the data is made available for other scientists to explore.

As a measure of how well this process works for building scientifically valid models, one senior modeller recently pointed out to me that it’s increasingly the case now that when the models diverge from the observations, it’s often the observational data that turns out to be wrong. The observational data is itself error prone, and software models turn out to be an important weapon in identifying and eliminating such errors.

However, there is another risk here that needs to be dealt with. Outside of the labs where the models are developed, there is a tendency for scientists who want to make use of the models to treat them as black box oracles. Proper use of the models depends on a detailed understanding of their strengths and weaknesses, and the ways in which uncertainties are handled. If we have some funding available to improve the quality of climate models, it would be far better spent on improving the user interfaces, and better training of the broader community of model users.

The bottom line is that climate models are subjected to very intensive system testing, and the incremental development process incorporates a sophisticated regression test process that’s superior to most industrial software practices. The biggest threat to validity of climate models is errors in the scientific theories on which they are based, but such errors are best investigated through the scientific process, rather than through an IV&V process. Which brings us to:

3) the assumption that our ability to trust in the models can be improved by an IV&V process;

IV&V is essentially a risk management strategy for safety-critical software for which an iterative development strategy is not possible – where the software has to work correctly the first (and every) time it is used in an operational setting. Climate models aren’t like this at all. They aren’t safety critical; they can be used even while they are being developed (and hence are built by iterative refinement); and they solve complex, wicked problems, for which there are no clear correctness criteria. In fact, as a species of software development process, I’ve come to the conclusion they are dramatically different from any of the commercial software development paradigms that have been described in the literature.

A common mistake in the software engineering community is to think that software processes can be successfully transplanted from one organisation to another. Our comparative studies of different software organizations show that this is simply not true, even for organisations developing similar types of software. There are few, if any, documented cases of a software development organisation successfully adopting a process model developed elsewhere, without very substantial tailoring. What usually happens is that ideas from elsewhere are gradually infused and re-fashioned to work in the local context. And the evidence shows that every software organisation evolves its own development processes that are highly dependent on local context, and on the constraints they operate under. Far more important than a prescribed process is the development of a shared understanding within the software team. The idea of taking a process model that was developed in the aerospace industry, and transplanting it wholesale into a vastly different kind of software development process (climate modeling) is quite simply ludicrous.

For example, one consequence of applying IV&V is that it reduces flexibility for the development team, as they have to set clearer milestones and deliver workpackages on schedule (otherwise the IV&V team cannot plan their efforts). Because the development of scientific codes is inherently unpredictable, it would be almost impossible to plan and resource an IV&V effort. The flexibility to explore new model improvements opportunistically, and to adjust schedules to match varying scientific rhythms, is crucial to the scientific mission – locking the development into more rigid schedules to permit IV&V would be a disaster.

If you wanted to set up an IV&V process for climate models, it would have to be done by domain experts; domain expertise is the single most important factor in successful use of IV&V in the aerospace industry. This means it would have to be done by other climate scientists. But other climate scientists already do this routinely – it’s built into the Model Intercomparison Projects, as well as the peer review process and through attempts to replicate one another’s results. In fact the Model Intercomparison Projects already achieve far more than an IV&V process would, because they are done in the open and involve a much broader community.

In other words, the available pool of talent for performing IV&V is already busy using a process that’s far more effective than IV&V ever can be: it’s called doing science. Actually, I suspect that those people calling for IV&V of climate models are really trying to say that climate scientists can’t be trusted to check each other’s work, and that some other (unspecified) group ought to do the IV&V for them. However, this argument can only be used by people who don’t understand what IV&V is. IV&V works in the aerospace industry not because of any particular process, but because it brings in the experts – the people with grey hair who understand the flight systems inside out, and understand all the risks.

And remember that IV&V is expensive. NASA’s rule of thumb was an additional 10%-20% of the development cost. This cannot be taken from the development budget – it’s strictly an additional cost. Given my estimate of the development cost of a climate model as somewhere in the ballpark of $350 million, we’ll need to find at least another $35 million for each climate modeling centre to fund their IV&V contract. And if we had such funds to add to their budgets, I would argue that IV&V is one of the least sensible ways of spending this money. Instead, I would:

  • Hire more permanent software support staff to work alongside the scientists;
  • Provide more training courses to give the scientists better software skills;
  • Do more research into modeling frameworks;
  • Experiment with incremental improvements to existing practices, such as greater use of testing tools and frameworks, pair programming and code sprints;
  • Provide more support to grow the user communities (e.g. user workshops and training courses), and more community building and beta testing;
  • Document the existing software development and V&V best practices so that different labs can share ideas and experiences, and the process of model building becomes more transparent to outsiders.

To summarize, IV&V would be an expensive mistake for climate modeling. It would divert precious resources (experts) away from existing modeling teams, and reduce their flexibility to respond to the science. IV&V isn’t appropriate because this isn’t safety-critical software, it doesn’t have distinct development and operational phases, and the risks of software error are minor. There’s no single point of failure, because many labs around the world build their own models, and the normal scientific processes of experimentation, peer-review, replication, and model inter-comparison already provide a sophisticated process to examine the scientific validity of the models. Virtually all coding errors are detected in routine testing, and science errors are best handled through the usual scientific process, rather than through an IV&V process. Furthermore, there is only a small pool of experts available to perform IV&V on climate models (namely, other climate modelers) and they are already hard at work improving their own models. Re-deploying them to do IV&V of each other’s models would reduce the overall quality of the science rather than improving it.

(BTW I shouldn’t have had to write this article at all…)

Following my post last week about Fortran coding standards for climate models, Tim reminded me of a much older paper that was very influential in the creation (and sharing) of coding standards across climate modeling centers:

The paper is the result of a series of discussions in the mid-1980s across many different modeling centres (the paper lists 11 labs) about how to facilitate sharing of code modules. To simplify things, the paper assumes what is being shared are parameterization modules that operate in a single column of the model. Of course, this was back in the 1980s, which means the models were primarily atmospheric models, rather than the more comprehensive earth system models of today. The dynamical core of the model handles most of the horizontal processes (e.g. wind), which means that most of the remaining physical processes (the subject of these parameterizations) affect what happens vertically within a single column, e.g. by affecting radiative or convective transfer of heat between the layers. Plugging in new parameterization modules becomes much easier if this assumption holds, because the new module needs to be called once per time step per column, and if it doesn’t interact with other columns, it doesn’t mess up the vectorization. The paper describes a number of coding conventions, effectively providing an interface specification for single-column parameterizations.
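
To give a feel for what such an interface specification buys you, here is a minimal sketch of a single-column calling convention. It is not the paper’s actual convention (the models in question are written in Fortran); the names `Column`, `Parameterization`, `toy_relaxation` and `step`, and the one-day relaxation timescale, are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Column:
    """State of one vertical column; the single field is a hypothetical example."""
    temperature: List[float]  # one value per layer, in kelvin


# A plug-compatible parameterization sees one column and the time step,
# and returns a temperature tendency (K/s) for each layer.
Parameterization = Callable[[Column, float], List[float]]


def toy_relaxation(column: Column, dt: float) -> List[float]:
    """Toy scheme: relax each layer towards the column mean temperature."""
    mean = sum(column.temperature) / len(column.temperature)
    timescale = 86400.0  # hypothetical relaxation time of one day, in seconds
    return [(mean - t) / timescale for t in column.temperature]


def step(columns: List[Column], schemes: List[Parameterization], dt: float) -> None:
    """Advance every column by one time step, calling each scheme once per column."""
    for column in columns:
        # Compute all tendencies from the same starting state before applying them.
        tendencies = [scheme(column, dt) for scheme in schemes]
        for tendency in tendencies:
            column.temperature = [t + dt * rate
                                  for t, rate in zip(column.temperature, tendency)]


columns = [Column([290.0, 270.0, 250.0]), Column([300.0, 280.0, 260.0])]
step(columns, [toy_relaxation], dt=1800.0)
print(columns[0].temperature)
```

Because each scheme only ever sees one column, swapping `toy_relaxation` for another function with the same signature requires no change to `step`. That is the appeal of the single-column assumption; the sketch deliberately ignores any interaction between schemes.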

An interesting point about this paper is that it popularized the term “plug compatibility” amongst the modeling community, along with the (implicit) broader goal of designing all models to be plug-compatible (although it cites Pielke & Arritt for the origin of the term). Unfortunately, the goal still seems to be very elusive. While most modelers will accept that plug-compatibility is desirable, a few people I’ve spoken to are very skeptical that it’s actually possible. Perhaps the strongest statement on this is from:

  • Randall DA. A University Perspective on Global Climate Modeling. Bulletin of the American Meteorological Society. 1996;77(11):2685-2690.
    p2687: “It is sometimes suggested that it is possible to make a plug-compatible global model so that an “outside” scientist can “easily make changes”. With a few exceptions (e.g. radiation codes), however, this is a fantasy, and I am surprised that such claims are not greeted with more skepticism.”

He goes on to describe instances where parameterizations have been transplanted from one model to another, but likens the process to a major organ transplant, only more painful. The problem is that the various processes of the earth system interact in complex ways, and these complex interactions have to be handled properly in the code. As Randall puts it: “…the reality is that a global model must have a certain architectural unity or it will fail”. In my interviews with climate modellers, I’ve heard many tales of it taking months, and sometimes years of effort to take a code module contributed by someone outside the main modeling group, and to make it work properly in the model.

So plug compatibility and code sharing sound great in principle. In practice, no amount of interface specification and coding standards can reduce the essential complexity of earth system processes.

Note: most of the above is about plug compatibility of parameterization modules (i.e. code packages that live within the green boxes on the Bretherton diagram). More progress has been made (especially in the last decade) in standardizing the interfaces between major earth system components (i.e. the arrows on the Bretherton diagram). That’s where standardized couplers come in – see my post on the high level architecture of earth system models for an introduction. The IS-ENES workshop on coupling technologies in December will be an interesting overview of the state of the art here, although I won’t be able to attend, as it clashes with the AGU meeting.

I had lunch last week with Gerhard Fischer at the University of Colorado. Gerhard is director of the center for lifelong learning and design, and his work focusses on technologies that help people to learn and design solutions to suit their own needs. We talked a lot about meta-design, especially how you create tools that help domain experts (who are not necessarily software experts) to design their own software solutions.

I was describing some of my observations about why climate scientists prefer to write their own code rather than delegating it to software professionals, when Gerhard put it into words brilliantly. He said “You can’t delegate ill-defined problems to software engineers”. And that’s the nub of it. Much (but not all) of the work of building a global climate model is an ill-defined problem. We don’t know at the outset what should go into the model, which processes are important, how to simulate complex physical, chemical and biological processes and their interactions. We don’t know what’s computationally feasible (until we try it). We don’t know what will be scientifically useful. So we can’t write a specification, nor explain the requirements to someone who doesn’t have a high level of domain expertise. The only way forward is to actively engage in the process of building a little, experimenting with it, reflecting on the lessons learnt, and then modifying and iterating.

So the process of building a climate model is a loop of build-explore-learn-build. If you put people into that loop who don’t have the necessary understanding of the science being done with the models, then you slow things down. And as the climate scientists (mostly) have the necessary technical skills, it’s quicker and easier to write their own code than to explain to a software engineer what is needed. But there’s a trade-off: the exploratory loop can be traversed quickly, but the resulting code might not be very robust or modifiable. Just as in agile software practices, the aim is to build something that works first, and worry about elegant design later. And that ‘later’ might never come, as the next scientific question is nearly always more alluring than a re-design. Which means the main role for software engineers in the process is to do cleanup operations. Several of the software people I’ve interviewed in the last few months at climate modeling labs described their role as mopping up after the parade (and some of them used more colourful terms than that).

The term meta-design is helpful here, because it specifically addresses the question of how to put better design tools directly into the hands of the climate scientists. Modeling frameworks fit into this space, as do domain-specific languages. But I’m convinced that there’s a lot more scope for tools that raise the level of abstraction, so that modelers can work directly with meaningful building blocks rather than lines of Fortran. And there’s another problem. Meta-design is hard. Too often it produces tools that just don’t do what the target users want. If we’re really going to put better tools into the hands of climate modelers, then we need a new kind of expertise to build such tools: a community of meta-designers who have both the software expertise and the domain expertise in earth sciences.

Which brings me to another issue that came up in the discussion. Gerhard provided me a picture that helps me explain the issue better (I hope he doesn’t mind me reproducing it here; it comes from his talk “Meta-Design and Social Creativity” given at IEMC 2007):

To create reflective design communities, the software professionals need to acquire some domain expertise, and the domain experts need to acquire some software expertise (diagram by Gerhard Fischer)

Clearly, collaboration between software experts and climate scientists is likely to work much better if each acquires a little of the other’s expertise, if only to enable them to share some vocabulary to talk about the problems. It reduces the distance between them.

At climate modeling labs, I’ve met a number of both kinds of people – i.e. climate scientists who have acquired good software knowledge, and software professionals who have acquired good climate science knowledge. But it seems to me that for climate modeling, one of these transitions is much easier than the other. It seems to be easier for climate scientists to acquire good software skills than it is for software professionals (with no prior background in the earth sciences) to acquire good climate science domain knowledge. That’s not to say it’s impossible, as I have met a few people who have followed this path (but they are rare). It seems to require many years of dedicated work. And there appears to be a big disincentive for many software professionals, as it turns them from generalists into specialists. If you dedicate several years to developing the necessary domain expertise in climate modeling, it probably means you’re committing the rest of your career to working in this space. But the pay is lousy, the programming language of choice is uncool, and mostly you’ll be expected to clean up after the parade rather than star in it.

Here are some climate model coding standards that I’ve collected over the last few months:

It’s encouraging that most modelling centres have developed detailed coding standards, but it’s a shame that most of them had to roll their own. The PRISM project is an exception – as many of the modelling labs across Europe were members of the PRISM project, some of these labs now use the PRISM coding rules.

Two followup tasks I hope to get to soon – (1) analyze how much these different standards overlap/differ, and (2) measure how much the model codes adhere to the standards.

16/11/2010 Update: The UK Met Office standard was an old version that was never publicly released, so I’ve removed the link, at the request of the UKMO. I’ll post a newer version if I can sort out the permissions. I’ve added MPI-M’s ICON standards to the list.

After an exciting sabbatical year spent visiting a number of climate modeling centres, I’ll be back to teaching in January. I’ll be introducing two brand new courses, both related to climate modeling. I already blogged about my new grad course on “Climate Change Informatics”, which will cover many current research issues to do with software and data in climate science.

But I haven’t yet mentioned my new undergrad course. I’ll be teaching a 199 course in January. 199 courses are first-year seminar courses, open to all new students across the faculty of arts and science, intended to encourage critical thinking, communication and research skills. They are run as small group seminar courses (enrolment is capped at 24 students). I’ve never taught one of these courses before, so I’ve no idea what to expect – I’m hoping for an interesting mix of students with different backgrounds, so we can spend some time attacking the theme of the course from different perspectives. Here’s my course description:

“Climate Change: Software, Science and Society”

This course will examine the role of computers and software in understanding climate change. We will explore the use of computer models to build simulations of the global climate, including a historical view of the use of computer models to understand weather and climate, and a detailed look at the current state of computer modelling, especially how global climate models are tested, what kinds of experiments are performed with them, how scientists know they can trust the models, and how they deal with uncertainty. The course will also explore the role of computer models in helping to shape society’s responses to climate change, in particular, what they can (and can’t) tell us about how to make effective decisions about government policy, international treaties, community action and the choices we make as individuals. The course will take a cross-disciplinary approach to these questions, looking at the role of computer models in the physical sciences, environmental science, politics, philosophy, sociology and economics of climate change. However, students are not expected to have any specialist knowledge in any of these fields prior to the course.

If all goes well, I plan to include some hands-on experimentation with climate models, perhaps using EdGCM (or even CESM if I can simplify the process of installing it and running it for them). We’ll also look at how climate models are perceived in the media and blogosphere (that will be interesting!) and compare these perceptions to what really goes on in climate modelling labs. Of course, the nice thing about a small seminar course is that I can be flexible about responding to the students’ own interests. I’m really looking forward to this…

Here’s a very nice video explaining the basics of how climate models work, produced by the folks at IPSL in Paris. This version is French with English subtitles – for the francophones out there, you’ll notice the narration is a little more detailed than the subtitles. I particularly like the bit where the earth grid is unpeeled and fed into the supercomputers:


The original (without the English subtitles) is here:

For many decades, computational speed has been the main limit on the sophistication of climate models. Climate modelers have become one of the most demanding groups of users for high performance computing, and access to faster and faster machines drives much of the progress, permitting higher resolution models and more earth system processes being explicitly resolved in the models. But from my visits to NCAR, MPI-M and IPSL this summer, I’m learning that growth in volumes of data handled is increasingly a dominant factor. The volume of data generated from today’s models has grown so much that supercomputer facilities find it hard to handle.

Currently, the labs are busy with the CMIP5 runs that will form one of the major inputs to the next IPCC assessment report. See here for a list of the data outputs required from the models (and note that the requirements were last changed on Sept 17, 2010 – well after most centers had started their runs; after all, it will take months to complete the runs, and the target date for submitting the data is the end of this year).

Climate modelers have requirements that are somewhat different from most other users of supercomputing facilities anyway:

  • very long runs – e.g. runs that take weeks or even months to complete;
  • frequent stop and restart of runs – e.g. the runs might be configured to stop once per simulated year, at which point they generate a restart file, and then automatically restart, so that intermediate results can be checked and analyzed, and because some experiments make use of multiple model variants, initialized from a restart file produced partway through a baseline run.
  • very high volumes of data generated – e.g. the CMIP5 runs currently underway at IPSL generate 6 terabytes per day, and in postprocessing, this goes up to 30 terabytes per day. Which is a problem, given that the NEC SX-9 being used for these runs has a 4 terabyte work disk and a 35 terabyte scratch disk. It’s getting increasingly hard to move the data to the tape archive fast enough.

Everyone seems to have underestimated the volumes of data generated from these CMIP5 runs. The implication is that data throughput rates are becoming a more important factor than processor speed, which may mean that climate computing centres require a different architecture than most high performance computing centres offer.
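
A bit of back-of-envelope arithmetic on the IPSL figures shows why throughput dominates. This is only a sketch: I’m assuming decimal terabytes, that each disk starts empty, and that nothing is moved off to tape in the meantime. Pairing the raw output rate with the work disk and the postprocessing rate with the scratch disk is also my assumption, made just to get a feel for the timescales.

```python
TB = 1e12               # bytes per terabyte (assuming decimal units)
SECONDS_PER_DAY = 86400

raw_rate_tb_per_day = 6     # output generated during the runs
post_rate_tb_per_day = 30   # volume during postprocessing
work_disk_tb = 4
scratch_disk_tb = 35

# How long until each disk is full if no data is moved off it?
hours_to_fill_work = work_disk_tb / raw_rate_tb_per_day * 24
hours_to_fill_scratch = scratch_disk_tb / post_rate_tb_per_day * 24

# Sustained transfer rate needed just to keep pace with postprocessing.
sustained_mb_per_s = post_rate_tb_per_day * TB / SECONDS_PER_DAY / 1e6

print(f"work disk full after about {hours_to_fill_work:.0f} hours")
print(f"scratch disk full after about {hours_to_fill_scratch:.0f} hours")
print(f"archive must sustain about {sustained_mb_per_s:.0f} MB/s")
```

Under those assumptions, both disks fill in roughly a day or less, and the archive has to absorb several hundred megabytes per second around the clock just to keep up.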

Anyway, I was going to write more about the infrastructure needed for this data handling problem, but Bryan Lawrence beat me to it, with his presentation to the NSF cyberinfrastructure “data task force”. He makes excellent points about the (lack of) scalability of the current infrastructure, and the social and cultural issues with questions of how people get credit for the work they put into this infrastructure, and the issues of data curation and trust. Which means the danger is we will create a WORN (write-once, read-never) archive with all this data…!

This will keep me occupied with good reads for the next few weeks – this month’s issue of the journal Studies in History and Philosophy of Modern Physics is a special on climate modeling. Here’s the table of contents:

Some very provocative titles there. I’m curious to see how much their observations cohere with my own…

I’ve been meaning to write a summary of the V&V techniques used for Earth System Models (ESMs) for ages, but never quite got round to it. However, I just had to put together a piece for a book chapter, and thought I would post it here to see if folks have anything to add (or argue with).

Verification and Validation for ESMs is hard because running the models is an expensive proposition (a fully coupled simulation run can take weeks to complete), and because there is rarely a “correct” result – expert judgment is needed to assess the model outputs.

However, it is helpful to distinguish between verification and validation, because the former can often be automated, while the latter cannot. Verification tests are objective tests of correctness. These include basic tests (usually applied after each code change) that the model will compile and run without crashing in each of its standard configurations, that a run can be stopped and restarted from the restart files without affecting the results, and that identical results are obtained when the model is run using different processor layouts. Verification would also include the built-in tests for conservation of mass and energy over the global system on very long simulation runs.
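
The restart test is a good example of a check that can be fully automated. Here’s a minimal sketch using a toy one-variable “model” (a logistic map, chosen only because it is nonlinear and sensitive to tiny differences). The function names and the byte-string “restart file” are hypothetical stand-ins, not how any real ESM does it.

```python
import struct


def step(state: float) -> float:
    """One time step of a toy nonlinear 'model' (a logistic map)."""
    return 3.9 * state * (1.0 - state)


def run(state: float, n_steps: int) -> float:
    """Advance the toy model by n_steps and return the final state."""
    for _ in range(n_steps):
        state = step(state)
    return state


def write_restart(state: float) -> bytes:
    """Serialize the full state as raw bytes, so no precision is lost."""
    return struct.pack("<d", state)


def read_restart(blob: bytes) -> float:
    """Recover the state from a restart 'file'."""
    return struct.unpack("<d", blob)[0]


initial = 0.2

# Run A: 100 steps straight through.
straight_through = run(initial, 100)

# Run B: 50 steps, stop and write a restart file, then restart for 50 more.
restart_file = write_restart(run(initial, 50))
restarted = run(read_restart(restart_file), 50)

# The test passes only if the two final states are bit-identical.
restart_test_passed = write_restart(straight_through) == write_restart(restarted)
print(restart_test_passed)  # True
```

The test only passes because the restart file captures the complete state at full precision. In a real model, the usual way to fail it is to leave some piece of state out of the restart file.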

In contrast, validation refers to science tests, where subjective judgment is needed. These include tests that the model simulates a realistic, stable climate, given stable forcings, that it matches the trends seen in observational data when subjected to historically accurate forcings, and that the means and variations (e.g. seasonal cycles) are realistic for the main climate variables (e.g. see Phillips et al, 2004).

While there is an extensive literature on the philosophical status of model validation in computational sciences (see for example, Oreskes et al (1994); Sterman (1994); Randall and Wielicki (1997); Stehr (2001)), much of it bears very little relation to practical techniques for ESM validation, and very little has been written on practical testing techniques for ESMs. In practice, testing strategies rely on a hierarchy of standard tests, starting with the simpler ones, and building up to the most sophisticated.

Pope and Davies (2002) give one such sequence for testing atmosphere models:

  • Simplified tests – e.g. reduce 3D equations of motion to 2D horizontal flow (e.g. a shallow water testbed). This is especially useful if the reduction has an analytical solution, or if a reference solution is available. It also permits assessment of relative accuracy and stability over a wide parameter space, and hence is especially useful when developing new numerical routines.
  • Dynamical core tests – test for numerical convergence of the dynamics with physical parameterizations replaced by a simplified physics model (e.g. no topography, no seasonal or diurnal cycle, simplified radiation).
  • Single-column tests – allows testing of individual physical parameterizations separately from the rest of the model. A single column of data is used, with horizontal forcing prescribed from observations or from idealized profiles. This is useful for understanding a new parameterization, and for comparing interaction between several parameterizations, but doesn’t cover interaction with large-scale dynamics, nor interaction with adjacent grid points. This type of test also depends on availability of observational datasets.
  • Idealized aquaplanet – test the fully coupled atmosphere-ocean model, but with idealized sea-surface temperatures at all grid points. This allows for testing of numerical convergence in the absence of complications of orography and coastal effects.
  • Uncoupled model components tested against realistic climate regimes – test each model component in stand-alone mode, with a prescribed set of forcings. For example, test the atmosphere on its own, with prescribed sea surface temperatures, sea-ice boundary conditions, solar forcings, and ozone distribution. Statistical tests are then applied to check for realistic mean climate and variability.
  • Double-call tests. Run the full coupled model, and test a new scheme by calling both the old and new scheme at each timestep, but with the new scheme’s outputs not fed back in to the model. This allows assessment of the performance of the new scheme in comparison with older schemes.
  • Spin-up tests. Run the full ESM for just a few days of simulation (typically between 1 and 5 days of simulation), starting from an observed state. Such tests are cheap enough that they can be run many times, sampling across the initial state uncertainty. Then the average of a large number of such tests can be analyzed (Pope and Davies suggest that 60 is enough for statistical significance). This allows the results from different schemes to be compared, to explore differences in short term tendencies.
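
Of the tests in this list, the double-call test is perhaps the easiest to show in code. Here’s a minimal sketch, with a single number standing in for the model state and two made-up relaxation schemes standing in for the old and new parameterizations (all names and coefficients are hypothetical). The key point is that the candidate scheme sees exactly the same state as the active one at every step, but only the active scheme’s output changes the state.

```python
from typing import Callable, List

Scheme = Callable[[float], float]  # maps the current state to a tendency per step


def old_scheme(temperature: float) -> float:
    """Existing scheme (toy): relax towards 280 with coefficient 0.10."""
    return 0.10 * (280.0 - temperature)


def new_scheme(temperature: float) -> float:
    """Candidate scheme (toy): same target, slightly stronger relaxation."""
    return 0.12 * (280.0 - temperature)


def double_call_run(state: float, n_steps: int,
                    active: Scheme, candidate: Scheme) -> List[float]:
    """Advance the model with `active`, calling `candidate` on the same
    state each step for diagnostics only. Returns candidate minus active."""
    differences = []
    for _ in range(n_steps):
        active_tendency = active(state)
        candidate_tendency = candidate(state)  # never fed back into the state
        differences.append(candidate_tendency - active_tendency)
        state += active_tendency
    return differences


diffs = double_call_run(270.0, 5, old_scheme, new_scheme)
print(diffs)
```

Because the model trajectory is unchanged, the recorded differences isolate the behaviour of the new scheme from any feedbacks it would cause if it were switched on.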

Whenever a code change is made to an ESM, in principle, an extensive set of simulation runs is needed to assess whether the change has a noticeable impact on the climatology of the model. This in turn requires a subjective judgment for whether minor variations constitute acceptable variations, or whether they add up to a significantly different climatology.

Because this testing is so expensive, a standard shortcut is to require exact reproducibility for minor changes, which can then be tested quickly through the use of bit comparison tests. These are automated checks over a short run (e.g. a few days of simulation time) that the outputs or restart files of two different model configurations are identical down to the least significant bits. This is useful for checking that a change didn’t break anything it shouldn’t, but requires that each change be designed so that it can be “turned off” (e.g. via run-time switches) to ensure previous experiments can be reproduced. Bit comparison tests can also check that different configurations give identical results. In effect, bit reproducibility over a short run is a proxy for testing that two different versions of the model will give the same climate over a long run. It’s much faster than testing the full simulations, and it catches most (but not all) errors that would affect the model climatology.

Bit comparison tests do have a number of drawbacks, however, in that they restrict the kinds of change that can be made to the model. Occasionally, bit reproducibility cannot be guaranteed from one version of the model to another, for example when there is a change of compiler, change of hardware, a code refactoring, or almost any kind of code optimization. The decision about whether to insist on bit reproducibility, or whether to allow it to be broken from one version of the model to the next, is a difficult trade-off between flexibility and ease of testing.
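
Both the test and its fragility are easy to demonstrate. The sketch below compares the raw IEEE 754 bytes of two sets of outputs (the helper names are hypothetical), and then shows how simply re-ordering a floating point sum, which is the kind of thing an optimizing compiler or a refactoring might do, breaks bit reproducibility even though the maths is unchanged.

```python
import struct
from typing import Sequence


def bits(values: Sequence[float]) -> bytes:
    """Raw IEEE 754 double-precision bytes of a sequence of floats."""
    return struct.pack(f"<{len(values)}d", *values)


def bit_identical(a: Sequence[float], b: Sequence[float]) -> bool:
    """True only if both outputs match down to the least significant bit."""
    return len(a) == len(b) and bits(a) == bits(b)


baseline = [(0.1 + 0.2) + 0.3]
switched_off = [(0.1 + 0.2) + 0.3]  # new code path disabled by a run-time switch
refactored = [0.1 + (0.2 + 0.3)]    # same maths, different order of operations

print(bit_identical(baseline, switched_off))  # True
print(bit_identical(baseline, refactored))    # False
```

The two sums differ only in the last bit, which is exactly the kind of difference a bit comparison test is designed to flag, whether or not it matters for the climatology.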

A number of simple practices can be used to help improve code sustainability and remove coding errors. These include running the code through multiple compilers, which is effective because different compilers give warnings about different language features, and some allow poor or ambiguous code which others will report. It’s better to identify and remove such problems when they are first inserted, rather than discover later on that it will take months of work to port the code to a new compiler.

Building conservation tests directly into the code also helps. These would typically be part of the coupler, and can check the global mass balance for carbon, water, salt, atmospheric aerosols, and so on. For example, the coupler needs to check that water flowing from rivers enters the ocean, and that the total mass of carbon is conserved as it cycles through atmosphere, oceans, ice, vegetation, and so on. Individual component models sometimes neglect such checks, as the balance isn’t necessarily conserved in a single component. However, for long runs of coupled models, such conservation tests are important.
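
As a rough illustration of what such a built-in check looks like, here’s a sketch of a coupler-style water balance test. The reservoir names, the masses (in arbitrary units) and the tolerance are all made up for the example, and the function names are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Flux = Tuple[str, str, float]  # (source, destination, mass transferred)


def apply_fluxes(reservoirs: Dict[str, float], fluxes: List[Flux]) -> None:
    """Move mass between reservoirs, as a coupler would at each exchange."""
    for source, destination, mass in fluxes:
        reservoirs[source] -= mass
        reservoirs[destination] += mass


def check_conservation(total_before: float, reservoirs: Dict[str, float],
                       rel_tol: float = 1e-12) -> None:
    """Raise an error if the global total has drifted beyond the tolerance."""
    total_after = math.fsum(reservoirs.values())
    if not math.isclose(total_before, total_after, rel_tol=rel_tol):
        raise RuntimeError(f"mass not conserved: {total_before} -> {total_after}")


# Arbitrary units; illustrative values only, not real inventories.
water = {"atmosphere": 100.0, "land": 2000.0, "ocean": 50000.0}
total_before = math.fsum(water.values())

apply_fluxes(water, [
    ("ocean", "atmosphere", 5.0),  # evaporation
    ("atmosphere", "land", 3.0),   # precipitation over land
    ("land", "ocean", 3.0),        # river runoff reaching the ocean
])
check_conservation(total_before, water)  # passes silently
```

If the river runoff were removed from the land but never added to the ocean, the global total would drop and `check_conservation` would raise an error, even though each component on its own might look perfectly reasonable.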

Another useful strategy is to develop a verification toolkit for each model component, and for the entire coupled system. These contain a series of standard tests which users of the model can run themselves, on their own platforms, to confirm that the model behaves in the way it should in the local computation environment. They also provide the users with a basic set of tests for local code modifications made for a specific experiment. This practice can help to overcome the tendency of model users to test only the specific physical process they are interested in, while assuming the rest of the model is okay.

During development of model components, informal comparisons with models developed by other research groups can often lead to insights into how to improve the model, and can also serve as a method for identifying and confirming suspected coding errors. But more importantly, over the last two decades, model intercomparisons have come to play a critical role in improving the quality of ESMs through a series of formally organised Model Intercomparison Projects (MIPs).

In the early days, these projects focussed on comparisons of the individual components of ESMs, for example, the Atmosphere Model Intercomparison Project (AMIP), which began in 1990 (Gates, 1992). But by the time of the IPCC second assessment report, there was a widespread recognition that a more systematic comparison of coupled models was needed, which led to the establishment of the Coupled Model Intercomparison Projects (CMIP), which now play a central role in the IPCC assessment process (Meehl et al, 2000).

For example, CMIP3, which was organized for the fourth IPCC assessment, involved a massive effort by 17 modeling groups from 12 countries with 24 models (Meehl et al, 2007). As of September 2010, the list of MIPs maintained by the World Climate Research Program included 44 different model intercomparison projects (Pirani, 2010).

Model Intercomparison Projects bring a number of important benefits to the modeling community. Most obviously, they bring the community together with a common purpose, and hence increase awareness and collaboration between different labs. More importantly, they require the participants to reach a consensus on a standard set of model scenarios, which often entails some deep thinking about what the models ought to be able to do. Likewise, they require the participants to define a set of standard evaluation criteria, which then act as benchmarks for comparing model skill. Finally, they also produce a consistent body of data representing a large ensemble of model runs, which is then available for the broader community to analyze.

The benefits of these MIPs are consistent with reports of software benchmarking efforts in other research areas. For example, Sim et al (2003) report that when a research community that builds software tools comes together to create benchmarks, it frequently experiences a leap forward in research progress, arising largely from the insights gained from the process of reaching consensus on the scenarios and evaluation criteria to be used in the benchmark. However, the definition of precise evaluation criteria is an important part of the benchmark – without this, the intercomparison project can become unfocussed, with uncertain outcomes and without the huge leap forward in progress (Bueler, 2008).

Another form of model intercomparison is the use of model ensembles (Collins, 2007), which increasingly provide a more robust prediction system than single model runs, but which also play an important role in model validation:

  • Multi-model ensembles – to compare models developed at different labs on a common scenario.
  • Multi-model ensembles using variants of a single model – to compare different schemes for parts of the model, e.g. different radiation schemes.
  • Perturbed physics ensembles – to explore probabilities of different outcomes, in response to systematically varying physical parameters in a single model.
  • Varied initial conditions within a single model – to test the robustness of the model, and to better quantify probabilities for predicted climate change signals.
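
The last two kinds of ensemble in this list differ only in what gets varied, which a toy example makes clear. In the sketch below, `toy_model` is a made-up relaxation model with a single “physics” parameter. The parameter values, the perturbation size and the seed are all arbitrary, and nothing here is meant to resemble a real climate model.

```python
import random
import statistics


def toy_model(param: float, initial_temp: float, n_steps: int = 50) -> float:
    """Relax towards an equilibrium of (14 + param); purely illustrative."""
    temp = initial_temp
    for _ in range(n_steps):
        temp += 0.1 * ((14.0 + param) - temp)
    return temp


rng = random.Random(42)  # fixed seed so the experiment is repeatable

# Perturbed physics ensemble: vary the parameter, keep the initial state.
perturbed_physics = [toy_model(p, 14.0) for p in (1.5, 2.0, 3.0, 4.5)]

# Initial condition ensemble: keep the parameter, vary the initial state.
initial_conditions = [toy_model(3.0, 14.0 + rng.gauss(0.0, 0.5)) for _ in range(20)]

print("perturbed physics spread:  ", statistics.stdev(perturbed_physics))
print("initial condition spread:  ", statistics.stdev(initial_conditions))
```

In this toy, the model forgets its initial state, so the initial condition ensemble has a tiny spread while the perturbed physics ensemble does not. Real models are far less tidy, which is why both kinds of ensemble are needed.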

Here’s a question I’ve been asking a few people lately, ever since I asserted that climate models are big expensive scientific instruments: How expensive are we talking about? Unfortunately, it’s almost impossible to calculate. The effort of creating a climate model is tangled up with the scientific research, such that you can’t even reliably determine how much of a particular scientist’s time is “model development” and how much is “doing science”. The problem is that you can’t build the model without a lot of that “doing science” part, because the model is the result of a lot of thinking, experimentation, theory building, testing hypotheses, analyzing simulation results, and discussions with other scientists. Many pieces of the model are based on the equations or empirical results in published research papers; even if you’re not doing the research yourself, you still have to keep up with the literature, understand the state-of-the-art, and know which bits of research are mature enough to incorporate into the model.

So, my first cut, which will be an over-estimation, is that *all* of the effort at a climate modeling lab is necessary to build the model. Labs vary in size, but a typical climate modeling lab is of the order of 200 people (including scientists, technicians, and admin support). And most of the models I’ve looked at have been under steady development for twenty years or more. So, that gives us a starting point of 200*20 = 4,000 person-years. Luckily, most scientists care more about science than salary, so they’re much cheaper than software professionals. Given we’ll have a mix of postdocs and senior scientists, let’s say average salary would be around $150,000 per year including benefits and other overheads. That’s $600 million.

Oh, and that doesn’t include the costs of equipping and operating a tier-2 supercomputing facility, as the climate model runs will easily keep such a facility fully loaded full time (and we’ll need to factor in the cost to replace the supercomputer every few years to take advantage of performance increases). In most cases, the supercomputing facilities are shared with other scientific uses of high performance computing. But there is one centre that’s dedicated to climate modeling, the DKRZ in Hamburg, which has an annual budget of around 30 million euro. Let’s pretend euros are dollars, and call that $30 million per year, which for 20 years gives us another $600 million. The latest supercomputer at DKRZ, Blizzard, cost 35 million euro. Let’s say we replace this every five years, and throw some more money in for many terabytes of data storage, that’ll get us to around $200 million for hardware.

Grand total: $1.4 billion.

Now, I said that’s an over-estimate. Over lunch today I quizzed some of the experts here at IPSL in Paris, and they thought that 1,000 person-years (50 people for 20 years) was a better estimate of the actual model development effort. This seems reasonable – it means that only 1/4 of the research at my 200-person research institute directly contributes to model development, the rest is science that uses the model but isn’t essential for developing it. So, that brings the salary figure down to $150 million. I’ve probably got to do the same conversion for the supercomputing facilities – let’s say about 1/4 of the supercomputing capacity is reserved for model development and testing. That also feels about right: 5-10% of the capacity is reserved for test processes (e.g. the ones that run automatically every day to do the automated build-and-test process), and a further 10%-20% might be used for validation runs on development versions of the model.

That brings the grand total down to $350 million.
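
For anyone who wants to play with the assumptions, here’s the whole calculation as a few lines of Python. The figures are just the rough guesses from the paragraphs above; the function and its parameter names are mine.

```python
SALARY = 150_000           # dollars per person-year, including overheads
YEARS = 20                 # development period
FACILITY_PER_YEAR = 30e6   # DKRZ-style annual budget, treating euros as dollars
HARDWARE = 200e6           # supercomputer replacements plus data storage


def cost_estimate(people: int, compute_share: float) -> float:
    """Staff cost plus the share of computing costs attributed to development."""
    staff = people * YEARS * SALARY
    computing = FACILITY_PER_YEAR * YEARS + HARDWARE
    return staff + compute_share * computing


upper = cost_estimate(people=200, compute_share=1.0)
revised = cost_estimate(people=50, compute_share=0.25)

print(f"${upper / 1e6:,.0f} million")    # $1,400 million
print(f"${revised / 1e6:,.0f} million")  # $350 million
```

The answer is far more sensitive to the headcount and the share of computing than to anything else, which is why the two estimates differ by a factor of four.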

Now, it has been done for less than this. For example, the Canadian Climate Centre, CCCma, has a modeling team one tenth this size, although they do share a lot of code with the Canadian Meteorological Service. And their model isn’t as full-featured as some of the other GCMs (it also has a much smaller user base). As with other software projects, the costs don’t scale linearly with functionality: a team of 5 software developers can achieve much more than 1/10th of what a team of 50 can (cf The Mythical Man Month). Oh, and the computing costs won’t come down much at all – the CCCma model is no more efficient than other models. So we’re still likely to be above the $100 million mark.

Now, there are probably other ways of figuring it – so far we’ve only looked at the total cumulative investment in one of today’s world leading climate models. What about replacement costs? If we had to build a new model from scratch, using what we already know (rather than doing all the research over again), how much would that cost? Well, nobody has ever done this, but there are a few experiences we could draw on. For example, the Max Planck Institute has been developing a new model from scratch, ICON, which uses an icosahedral grid and hence needs a new approach to the dynamics. The project has been going for 8 years. It started with just a couple of people, and has ramped up to about a dozen. But they’re still a long way from being done, and they’re re-using a lot of the physics code from their old model, ECHAM. On the other hand, it’s an entirely new approach to the grid structure, so a lot of the early work was pure research.

Where does that leave us? It’s really a complete guess, but I would suggest a team of 10 people (half of them scientists, half scientific programmers) could re-implement the old model from scratch (including all the testing and validation) in around 5 years. Unfortunately, climate science is a fast moving field. What we’d get at the end of 5 years is a model that, scientifically speaking, is 5 years out of date. Unless of course we also paid for a large research effort to bring the latest science into the model while we were constructing it, but then we’re back where we started. I think this means you can’t replace a state-of-the-art climate model for much less than the original development costs.

What’s the conclusion? The bottom line is that the development cost of a climate model is in the hundreds of millions of dollars.

I’m pleased to see that my recent paper, “Climate Change: A Software Grand Challenge” is getting some press attention. However, I’m horrified to see how it’s been distorted in the echo chamber of the media. Danny Bradbury, writing in the Guardian, gives his piece the headline “Climate scientists should not write their own software, says researcher“. Aaaaaaargh! Nooooo! That’s the exact opposite of what I would say!

Our research shows that earth system models, the workhorses of climate science, appear to have very few bugs, and produce remarkably good simulations of past climate. One of the most important success factors is that the code is written by the scientists themselves, as they understand the domain inside out. Now, of course, this leads to other problems, for instance the code is hard to understand, and hard to modify. And the job of integrating the various components of the models is really hard. But there are no obvious solutions to fix this without losing this hands-on relationship between the scientists and the code. Handing the code development over to software professionals is likely to be a disaster.

I’ve posted a comment on Bradbury’s article, but I have very little hope he’ll alter the headline, as it obviously plays into a storyline that’s popular with denialists right now (see update, below).

Some other reports:

Update (2/9/10): Well that’s a delight! I just got off the overnight train to Paris, and discover that Danny has commented here, and wants to put everything right, and has already corrected the headline in the BusinessGreen version. So, apologies to Danny for doubting him, and also, thanks for restoring my faith in journalism. As is clear in some of the comments, it’s easy to see how one might draw the conclusion that climate scientists shouldn’t write their own code from a reading of my paper. It’s a subtle point, so I probably need to write a longer piece on this to explain…

Update #2 (later that same day): And now the Guardian headline has been changed too. Victory for honest journalism!

I’ve pointed out a number of times that the software processes used to build the Earth System Models used in climate science don’t look anything like conventional software engineering practices. One very noticeable difference is the absence of detailed project plans, estimates, development phases, etc. While scientific steering committees do discuss long term strategy and set high level goals for the development of the model, the vast majority of model development work occurs bottom-up, through a series of open-ended, exploratory changes to the code. The scientists who work most closely with the models get together and decide what needs doing, typically on a week-to-week basis. Which is a little like agile planning, but without any of the agile planning techniques. Is this the best approach? Well, if the goal was to deliver working software to some external customer by a certain target date, then probably not. But that’s not the goal at all – the goal is to do good science. Which means that much of the work is exploratory and opportunistic.  It’s difficult to plan model development in any detail, because it’s never clear what will work, nor how long it will take to try out some new idea. Nearly everything that’s worth doing to improve the model hasn’t been done before.

This approach also favours a kind of scientific bricolage. Imagine we have sketched out a conceptual architecture for an earth system model. The conventional software development approach would be to draw up a plan to build each of the components on a given timeline, such that they would all be ready by some target date for integration. And it would fail spectacularly, because it would be impossible to estimate timelines for each component – each part involves significant new research. The best we can do is to get groups of scientists to go off and work on each subsystem, and wait to see what emerges. And to be willing to try incorporating new pieces of code whenever they seem to be mature enough, no matter where they came from.

So we might end up with a coupled earth system model where each of the major components was built at a different lab, each was incorporated into the model at a different stage in its development, and none of this was planned long in advance. And, as a consequence, each component has its own community of developers and users who have goals that often diverge from the goals of the overall earth system model. Typically, each community wants to run its component model in stand-alone mode, to pursue scientific questions specific to that subfield. For example, ocean models are built by oceanographers to study oceanography. Plant growth models are built by biologists to study the carbon cycle. And so on.

One problem is that if you take components from each of these communities to incorporate into a coupled model, you don’t want to fork the code. A fork would give you the freedom to modify the component to make it work in the coupled scheme. But, as with forking in open source projects, it is nearly always a mistake. It fragments the community, and means the forked copy no longer gets the ongoing improvements to the original software (or more precisely, it quickly becomes too costly to transplant such improvements into the forked code). Access to the relevant community of expertise and their ongoing model improvements is at least as important as any specific snapshot of their code, otherwise the coupled model will fail to keep up with the latest science. Which means a series of compromises must be made – some changes might be necessary to make the component work in a coupled scheme, but these must not detract from the ability of the community to continue working with the component as a stand-alone model.

So, building an earth system model means assembling a set of components that weren’t really designed to work together, and a continual process of negotiation between the requirements for the entire coupled model and the requirements of the individual modeling communities. The alternative, re-building each component from scratch, doesn’t make sense financially or scientifically. It would be expensive and time consuming, and you’d end up with untested software that, scientifically, is several years behind the state-of-the-art. [Actually, this might be true of any software: see this story of the Netscape rebuild].

Over the long term, a set of conventions has emerged that helps to make it easier to couple together components built by different communities. These include the basic data formatting and message passing standards, as well as standard couplers. And more recently, modeling frameworks, metadata standards and data sharing infrastructure. But as with all standardization efforts, it takes a long time (decades?) for these to be accepted across the various modeling communities, and there is always resistance, in part because meeting the standard incurs a cost and usually detracts from the immediate goals of each particular modeling community (with the benefits accruing elsewhere – specifically to those interested in working with coupled models). Remember: these models are expensive scientific instruments. Changes that limit the use of the component as a standalone model, or which tie it to a particular coupling scheme, can diminish its value to the community that built it.

So, we’re stuck with the problem of incorporating a set of independently developed component models, without the ability to impose a set of interface standards on the teams that build the components. The interface definitions have to be continually re-negotiated. Bryan Lawrence has some nice slides on the choices, which he characterizes as the “coupler approach” and the “framework approach” (I shamelessly stole his diagrams…)

The coupler approach leaves the models almost unchanged, with a communication library doing any necessary transformation on the data fields.

The framework approach splits the original code into smaller units, adapting their data structures and calling interfaces, allowing them to be recombined in a more appropriate calling hierarchy.

The advantage of the coupler approach is that it requires very little change to the original code, and allows the coupler itself to be treated as just another stand-alone component that can be re-used by other labs. However, it’s inefficient, and seriously limits the opportunities to optimize the run configuration: while the components can run in parallel, the coupler must still wait on each component to do its stuff.

The advantage of the framework approach is that it produces a much more flexible and efficient coupled model, with more opportunities to lay out the subcomponents across a parallel machine architecture, and a greater ability to plug other subcomponents in as desired. The disadvantage is that component models might need substantial re-factoring to work in the framework. The trick here is to get the framework accepted as a standard across a variety of different modeling communities. This is, of course, a bit of a chicken-and-egg problem, because its advantages have to be clearly demonstrated with some success stories before such acceptance can happen.
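
To make the contrast a little more concrete, here is a minimal sketch in Python (real models are mostly Fortran, and every class, function and number below is hypothetical, invented purely for illustration). In the coupler style, two unmodified toy components keep their own data and units, and a coupler does the translation between them. In the framework style, the same physics has been refactored into small units with a common calling interface, so that a driver is free to schedule them however it likes.

```python
# Toy components. All names and coefficients are made up for illustration.

class ToyAtmosphere:
    """Stand-alone component that keeps its own state, in kelvin."""
    def __init__(self):
        self.surface_temp_k = 288.0
        self.heat_flux = 0.0  # positive means heat flowing into the ocean

    def step(self, sst_k):
        # Relax air temperature toward the sea surface temperature.
        self.surface_temp_k += 0.5 * (sst_k - self.surface_temp_k)
        self.heat_flux = 10.0 * (self.surface_temp_k - sst_k)


class ToyOcean:
    """Stand-alone component that happens to work in degrees Celsius."""
    def __init__(self):
        self.sst_c = 10.0

    def step(self, heat_flux):
        self.sst_c += 0.01 * heat_flux


class Coupler:
    """Coupler approach: components are untouched, the coupler converts data."""
    def __init__(self, atmosphere, ocean):
        self.atmosphere = atmosphere
        self.ocean = ocean

    def step(self):
        # The coupler has to wait for each component before passing fields on.
        sst_k = self.ocean.sst_c + 273.15          # unit conversion
        self.atmosphere.step(sst_k)
        self.ocean.step(self.atmosphere.heat_flux)


# Framework approach: the same physics refactored into units that share
# one calling interface (a function taking a common state dictionary).

def atmosphere_unit(state):
    state["air_temp_k"] += 0.5 * (state["sst_k"] - state["air_temp_k"])
    state["heat_flux"] = 10.0 * (state["air_temp_k"] - state["sst_k"])


def ocean_unit(state):
    state["sst_k"] += 0.01 * state["heat_flux"]


def run_framework(units, state, n_steps):
    """The driver owns the calling hierarchy, so units can be reordered or swapped."""
    for _ in range(n_steps):
        for unit in units:
            unit(state)
    return state


coupled = Coupler(ToyAtmosphere(), ToyOcean())
coupled.step()

state = {"air_temp_k": 288.0, "sst_k": 283.15, "heat_flux": 0.0}
run_framework([atmosphere_unit, ocean_unit], state, n_steps=1)
```

Both versions compute the same thing here. The difference is in who had to change: the coupler version left the component classes alone, while the framework version required rewriting them to fit a shared interface and shared data structures.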

There is a third approach, adopted by some of the bigger climate modeling labs: build everything (or as much as possible) in house, and build ad hoc interfaces between various components as necessary. However, as earth system models become more complex, and incorporate more and more different physical, chemical and biological processes, doing it all in-house is getting harder and harder. This is not a viable long term strategy.

To get myself familiar with the models at each of the climate centers I’m visiting this summer, I’ve tried to find high level architectural diagrams of the software structure. Unfortunately, there seem to be very few such diagrams around. Climate scientists tend to think of their models in terms of a set of equations, and differentiate between models on the basis of which particular equations each implements. Hence, their documentation doesn’t contain the kinds of views on the software that a software engineer might expect. It presents the equations, often followed with comments about the numerical algorithms that implement them. This also means they don’t find automated documentation tools such as Doxygen very helpful, because they don’t want to describe their models in terms of code structure (the folks at MPI-M here do use Doxygen, but it doesn’t give them the kind of documentation they most want).

But for my benefit, as I’m a visual thinker, and perhaps to better explain to others what is in these huge hunks of code, I need diagrams. There are some schematics like this around (taken from an MPI-M project site):


But it’s not quite what I want. It shows the major components:

  • ECHAM – atmosphere dynamics and physics,
  • HAM – aerosols,
  • MESSy – atmospheric chemistry,
  • MPI-OM – ocean dynamics and physics,
  • HAMOCC – ocean biogeochemistry,
  • JSBACH – land surface processes,
  • HD – hydrology,
  • and the coupler, PRISM,

…but it only shows a few of the connectors, and many of the arrows are unlabeled. I need something that more clearly distinguishes the different kinds of connector, and perhaps shows where various subcomponents fit in (in part because I want to think about why particular compositional choices have been made).

The closest I can find to what I need is the Bretherton diagram, produced back in the mid 1980’s to explain what earth system science is all about:

The Bretherton Diagram of earth system processes

It’s not a diagram of an earth system model per se, but rather of the set of systems that such a model might simulate. There’s a lot of detail here, but it does clearly show the major systems (orange rectangles – these roughly correspond to model components) and subsystems (green rectangles), along with data sources and sinks (the brown ovals) and the connectors (pale blue rectangles, representing the data passed between components).

The diagram allows me to make a number of points. First, we can distinguish between two types of model:

  • a Global Climate Model, also known as a General Circulation Model (GCM), or Atmosphere-Ocean coupled model (AO-GCM), which only simulates the physical and dynamic processes in the atmosphere and ocean. Where a GCM does include parts of the other processes, it is typically only to supply appropriate boundary conditions.
  • an Earth System Model (ESM), which also includes the terrestrial and marine biogeochemical processes, snow and ice dynamics, atmospheric chemistry, aerosols, and so on – i.e. it includes simulations of most of the rest of the diagram.

Over the past decade, AO-GCMs have steadily evolved to become ESMs, although there are many intermediate forms around. In the last IPCC assessment, nearly all the models used for the assessment runs were AO-GCMs. For the next assessment, many of them will be ESMs.

Second, perhaps obviously, the diagram doesn’t show any infrastructure code. Some of this is substantial – for example an atmosphere-ocean coupler is a substantial component in its own right, often performing elaborate data transformations, such as re-gridding, interpolation, and synchronization. But this does reflect the way in which scientists often neglect the infrastructure code, because it is not really relevant to the science.

Third, the diagram treats all the connectors in the same way, because, at some level, they are all just data fields, representing physical quantities (mass, energy) that cross subsystem boundaries. However, there’s a wide range of different ways in which these connectors are implemented – in some cases binding the components tightly together with complex data sharing and control coupling, and in other cases keeping them very loose. The implementation choices are based on a mix of historical accident, expediency, program performance concerns, and the sheer complexity of the physical boundaries between the actual earth subsystems. For example, within an atmosphere model, the dynamical core (which computes the basic thermodynamics of air flow) is distinct from the radiation code (which computes how visible light, along with other parts of the spectrum, is scattered or absorbed by the various layers of air) and the moist processes (i.e. humidity and clouds). But the complexity of the interactions between these processes is sufficiently high that they are tightly bound together – it’s not currently possible to treat any of these parts as swappable components (at least in the current generation of models), although during development, some parts can be run in isolation for unit testing, e.g. the dynamical core is tested in isolation, but then most other subcomponents depend on it.

On the other hand, the interface between atmosphere and ocean is relatively simple – it’s the ocean surface – and as this also represents the interface between two distinct scientific disciplines (atmospheric physics and oceanography), atmosphere models and ocean models are always (?) loosely coupled. It’s common now for the two to operate on different grids (different resolution, or even different shape), and the translation of the various data to be passed between them is handled by a coupler. Some schematic diagrams do show how the coupler is connected:

Atmosphere-Ocean coupling via the OASIS coupler (source: Figure 4.2 in the MPI-Met PRISM Earth System Model Adaptation Guide)
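
As a rough illustration of the translation work a coupler does when the two models sit on different grids, the sketch below regrids a field from a coarse one-dimensional grid onto a finer one by linear interpolation. It’s a toy in plain Python with hypothetical names, and it is not how OASIS or any other real coupler is implemented: real couplers deal with 2-D and 3-D fields on a sphere, and typically need more careful remapping schemes than this.

```python
def regrid_linear(src_coords, src_values, dst_coords):
    """Linearly interpolate a 1-D field onto a different set of grid points.

    Assumes both coordinate lists are sorted in ascending order and that
    every destination point lies within the range of the source grid.
    """
    result = []
    j = 0
    for x in dst_coords:
        # Advance to the source interval [src_coords[j], src_coords[j + 1]]
        # that contains x.
        while j < len(src_coords) - 2 and x > src_coords[j + 1]:
            j += 1
        x0, x1 = src_coords[j], src_coords[j + 1]
        w = (x - x0) / (x1 - x0)
        result.append((1 - w) * src_values[j] + w * src_values[j + 1])
    return result


# A made-up temperature field on a coarse "atmosphere" grid (every 10 degrees)
# passed to a finer "ocean" grid (every 5 degrees).
atmos_lat = [0, 10, 20]
atmos_temp = [300.0, 290.0, 280.0]
ocean_lat = [0, 5, 10, 15, 20]

ocean_temp = regrid_linear(atmos_lat, atmos_temp, ocean_lat)
print(ocean_temp)  # [300.0, 295.0, 290.0, 285.0, 280.0]
```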


Other interfaces are harder to define than the atmosphere-ocean interface. For example, the atmosphere and the terrestrial processes are harder to decouple: Which parts of the water cycle should be handled by the atmosphere model and which should be handled by the land surface model? Which module should handle evaporation from plants and soil? In some models, such as ECHAM, the land surface is embedded within the atmosphere model, and called as a subroutine at each time step. In part this is historical accident – the original atmosphere model had no vegetation processes, but used soil heat and moisture parameterization as a boundary condition. The land surface model, JSBACH, was developed by pulling out as much of this code as possible, and developing it into a separate vegetation model, and this is sometimes run as a standalone model by the land surface community. But it still shares some of the atmosphere infrastructure code for data handling, so it’s not as loosely coupled as the ocean is. By contrast, in CESM, the land surface model is a distinct component, interacting with the atmosphere model only via the coupler. This facilitates the switching of different land and/or atmosphere components into the coupled scheme, and also allows the land surface model to have a different grid.

The interface between the ocean model and the sea ice model is also tricky, not least because the area covered by the ice varies with the seasonal cycle. So if you use a coupler to keep the two components separate, the coupler needs information about which grid points contain ice and which do not at each timestep, and it has to alter its behaviour accordingly. For this reason, the sea ice is often treated as a subroutine of the ocean model, which then avoids having to expose all this information to the coupler. But again we have the same trade-off. Working through the coupler ensures they are self-contained components and can be swapped for other compatible models as needed; but at the cost of increasing the complexity of the coupler interfaces, reducing information hiding, and making future changes harder.
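
Here’s a small Python sketch of that trade-off. Everything in it is a made-up illustration: the function names are hypothetical, the freezing threshold is only approximate, and the “physics” (ice completely shutting off the heat flux) is a deliberate oversimplification. The point is just to show where the ice mask has to travel in each design.

```python
FREEZING_C = -1.8  # rough freezing point of sea water, used here as an assumption


def ice_mask(sst_c):
    """True wherever a grid point is (crudely) considered ice covered."""
    return [t <= FREEZING_C for t in sst_c]


def _flux(icy, sst, air):
    # Toy physics: ice blocks the exchange entirely, open water exchanges heat.
    return 0.0 if icy else 10.0 * (air - sst)


def ocean_with_embedded_ice(sst_c, air_temp_c):
    """Design A: sea ice is a subroutine of the ocean model.

    The mask is computed and used internally, so it never appears in the
    interface that the coupler sees.
    """
    mask = ice_mask(sst_c)
    return [_flux(icy, sst, air) for icy, sst, air in zip(mask, sst_c, air_temp_c)]


def coupler_flux(sst_c, air_temp_c, mask):
    """Design B: sea ice is a separate component.

    The coupler needs the mask as an extra input at every time step, and
    has to change its behaviour point by point depending on it.
    """
    return [_flux(icy, sst, air) for icy, sst, air in zip(mask, sst_c, air_temp_c)]


sst = [5.0, -1.8, -2.0]
air = [7.0, -10.0, -15.0]

flux_a = ocean_with_embedded_ice(sst, air)
flux_b = coupler_flux(sst, air, ice_mask(sst))  # mask exported by the ice component
```

The two designs give the same numbers. What differs is the interface: in design B the mask is part of what the coupler must know about, which is exactly the loss of information hiding described above.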

Similar challenges occur for:

  • the coupling between the atmosphere and the atmospheric chemistry (which handles chemical processes as gases and various types of pollution are mixed up by atmospheric dynamics).
  • the coupling between the ocean and marine biogeochemistry (which handles the way ocean life absorbs and emits various chemicals while floating around on ocean currents).
  • the coupling between the land surface processes and terrestrial hydrology (which includes rivers, lakes, wetlands and so on). Oh, and between both of these and the atmosphere, as water moves around so freely. Oh, and the ocean as well, because we have to account for how outflows from rivers enter the ocean at coastlines all around the world.
  • …and so on, as we incorporate more and more of the earth system into the models.

Overall, it seems that the complexity of the interactions between the various earth system processes is so high that traditional approaches to software modularity don’t work. Information hiding is hard to do, because these processes are so tightly inter-twined. A full object-oriented approach would be a radical departure from how these models are built currently, with the classes built on the data objects (the pale blue boxes in the Bretherton diagram) rather than the processes (the green boxes). But the computational demands of the processes in the green boxes are so high that the only way to make them efficient is to give them full access to the low level data structures. So any attempt to abstract away these processes from the data objects they operate on will lead to a model that is too inefficient to be useful.

Which brings me back to the question of how to draw pictures of the architecture so that I can compare the coupling and modularity of different models. I’m thinking the best approach might be to start with the Bretherton diagram, and then overlay it to show how various subsystems are grouped into components, and which connectors are handled by a separate coupler.
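
One way to prototype that overlay before drawing anything is to write the architecture down as data. The Python sketch below does this for a hypothetical model layout, loosely inspired by the descriptions above rather than taken from any real model’s code. Subsystems are grouped into components, and each connector is then classified by whether it stays inside one component or has to cross a component boundary via the coupler.

```python
from collections import Counter

# Hypothetical grouping of subsystems into components (illustration only).
components = {
    "atmosphere": ["dynamics", "radiation", "moist processes", "land surface"],
    "ocean": ["ocean dynamics", "sea ice"],
}

# Connectors between subsystems, as (source, destination) pairs.
connectors = [
    ("dynamics", "radiation"),
    ("land surface", "dynamics"),
    ("ocean dynamics", "sea ice"),
    ("dynamics", "ocean dynamics"),
]


def owner(subsystem):
    """Return the name of the component that contains the given subsystem."""
    for name, subsystems in components.items():
        if subsystem in subsystems:
            return name
    raise KeyError(subsystem)


def classify(connector):
    """'internal' if both ends live in one component, otherwise 'coupler'."""
    source, destination = connector
    return "internal" if owner(source) == owner(destination) else "coupler"


summary = Counter(classify(c) for c in connectors)
print(dict(summary))  # {'internal': 3, 'coupler': 1}
```

Filling in the same two structures for each model would give a crude but comparable measure of how much of the coupling is hidden inside components and how much is exposed to the coupler.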

Postscript: While looking for good diagrams, I came across this incredible collection of visualizations of various aspects of sustainability, some of which are brilliant, while others are just kooky.

I had some interesting chats in the last few days with Christian Jakob, who’s visiting Hamburg at the same time as me. He’s just won a big grant to set up a new Australian Climate Research Centre, so we talked a lot about what models they’ll be using at the new centre, and the broader question of how to manage collaborations between academics and government research labs.

Christian has a paper coming out this month in BAMS on how to accelerate progress in climate model development. He points out that much of the progress now depends on the creation of new parameterizations for physical processes, but to do this more effectively requires better collaboration between the groups of people who run the coupled models and assess overall model skill, and the people who analyze observational data to improve our understanding (and simulation) of particular climate processes. The key point he makes in the paper is that process studies are often undertaken because they are interesting and/or because data is available, but without much idea of whether improving a particular process will have any impact on overall model skill; conversely, model skill is analyzed at modeling centers without much follow-through to identify which processes might be to blame for model weaknesses. Both activities lead to insights, but better coordination between them would help to push model development further and faster. Not that it’s easy of course: coupled models are now sufficiently complex that it’s notoriously hard to pin down the role of specific physical processes in overall model skill.

So we talked a lot about how the collaboration works. One problem seems to stem from the value of the models themselves. Climate models are like very large, very expensive scientific instruments. Only large labs (typically at government agencies) can now afford to develop and maintain fully fledged earth system models. And even then the full cost is never adequately accounted for in the labs’ funding arrangements. Funding agencies understand the costs of building and operating physical instruments, like large telescopes, or particle accelerators, as shared resources across a scientific community. But because software is invisible and abstract, they don’t think of it in the same way – there’s a tendency to think that it’s just part of the IT infrastructure, and can be developed by institutional IT support teams. But of course, the climate models need huge amounts of specialist expertise to develop and operate, and they really do need to be funded like other large scientific instruments.

The complexity of the models and the lack of adequate funding for model development means that the institutions that own the models are increasingly conservative in what they do with them. They work on small incremental changes to the models, and don’t undertake big revolutionary changes – they can’t afford to take the risk. There are some examples of labs taking such risks: for example in the early 1990’s ECMWF re-wrote their model from scratch, driven in part to make it more adaptable to new, highly parallel, hardware architectures. It took several years, and a big team of coders, bringing in the scientific experts as needed. At the end of it, they had a model that was much cleaner, and (presumably) more adaptable. But scientifically, it was no different from the model they had previously. Hence, lots of people felt this was not a good use of their time – they could have made better scientific progress during that time by continuing to evolve the old model. And that was years ago – the likelihood of labs making such radical changes these days is very low.

On the other hand, academics can try the big, revolutionary stuff – if it works, they get lots of good papers about how they’re pushing the frontiers, and if it doesn’t, they can write papers about why some promising new approach didn’t work as expected. But then getting their changes accepted into the models is hard. A key problem here is that there’s no real incentive for them to follow through. Academics are judged on papers, so once the paper is written they are done. But at that point, the contribution to the model is still a long way from being ready to incorporate for others to use. Christian estimates that it takes at least as long again to get a change ready to incorporate into a model as it does to develop it in the first place (and that’s consistent with what I’ve heard other modelers say). The academic has no incentive to continue to work on it to get it ready, and the institutions have no resources to take it and adopt it.

So again we’re back to the question of effective collaboration, beyond what any one lab or university group can do. And the need to start treating the models as expensive instruments, with much higher operation and maintenance costs than anyone has yet acknowledged. In particular, modeling centers need resources for a much bigger staff to support the efforts by the broader community to extend and improve the models.