climate modeling | Serendipity

Verifying Forecasting Systems

02. December 2010 · 1 comment · Categories: climate modeling

My post on validating climate models suggested that the key validation criteria is the extent to which the model captures (some aspect of) the current scientific theory, and is useful in exploring the theory. In effect, I’m saying that climate models are scientific tools, and should be validated as scientific tools. This makes them very different from, say numerical weather prediction (NWP) software, which are used in an operational setting to provide a service (predicting the weather).

What’s confusing is that both communities (climate modeling and weather modeling) use many of the same techniques both for the design of the models, and for comparing the models with observational data.

For NWP, forecast accuracy is the overriding objective, and the community has developed an extensive methodology for doing forecast verification. I pondered for a while whether this use of the term ‘verification’ here is consistent with my definitions, because surely we should be “validating” a forecast rather than “verifying it”. After thinking about it for a while, I concluded that the terminology is consistent, because forecast verification is like checking a program against it’s specification. In this case the specification states precisely what is being predicted, with what accuracy, and what would constitute a successful forecast (Bob Grumbine gives a recent example in verifying accuracy of seasonal sea ice forecasts). The verification procedure checks that the actual forecast was accurate, within the criteria set by this specification. Whether or not the forecast was useful is another question: that’s the validation question (and it’s a subjective question that requires some investigation of why people want forecasts in the first place).

An important point here is that forecast verification is not software verification: it doesn’t verify a particular piece of software. It’s also not simulation verification: it doesn’t verify a given run produced by that software. It’s verification of an entire forecasting system. A forecasting system makes use of computational models (often more than one), as well as a bunch of experts who interpret the model results.It also includes an extensive data collection system that gathers information about the current state of the world to use as input to the model. (And of course, some forecasting systems don’t use computational models at all). So:

If the forecast is inaccurate (according to the forecast criteria), it doesn’t necessarily mean there’s a flaw in the models – it might just as well be a flaw in the interpretation of the model outputs, or in the data collection process that provided it’s inputs. Oh, and of course, the verification might also fail because the specification is wrong, e.g. because there are flaws in the observational system used in the verification procedure too.
If the forecasting system persistently produces accurate forecasts (according to the forecast criteria), that doesn’t necessarily tell us anything about the quality of the software itself, it just means that the entire forecast system worked. It may well be that the model is very poor, but the meteorologists who interpret model outputs are brilliant at overcoming the weaknesses in the model (perhaps in the way they configure the runs, or perhaps in the way they filter model outputs), to produce accurate forecasts for their customers.

However, one effect of using this forecast verification approach day-in-day-out for weather forecasting systems over several decades (with an overall demand from customers for steady improvements in forecast accuracy) is that all parts of the forecasting system have improved dramatically over the last few decades, including the software. And climate modelling has benefited from this, as improvements in the modelling of processes needed for NWP can often also be used to improve the climate models (Senior et al have an excellent chapter on this in a forthcoming book, which I will review nearer to the publication date).

The question is, can we apply a similar forecast verification methodology to the “climate forecasting system”, despite the differences between weather and climate?

Note that the question isn’t about whether we can verify the accuracy of climate models this way, because the methodology doesn’t separate the models from the broader system in which they are used. So, if we take this route at all, we’re attempting to verify the forecast accuracy of the whole system: collection of observational data, creation of theories, use of these theories to develop models, choices for which model and which model configuration to use, choices for how to set up the runs, and interpretation of the results.

Climate models are not designed as forecasting tools, they are designed as tools to explore current theories about the climate system, and to investigate sources of uncertainty in these theories. However, the fact that they can be used to project potential future climate change (under various scenarios) is very handy. Of course, this is not the only way to produce quantified estimates of future climate change – you can do it using paper and pencil. It’s also a little unfortunate, because the IPCC process (or at least the end-users of IPCC reports) tend to over-emphasize the model projections at the expense of the science that went into them, and increasingly the funding for the science is tied to the production of such projections.

But some people (both within the climate modeling community and within the denialist community) would prefer that they not be used to project future climate change at all. (The argument from within the modelling community is that the results get over-interpreted or mis-interpreted by lay audiences; the argument from the denialist community is that models aren’t perfect. I think these two arguments are connected…). However, both arguments ignore reality: society demands of climate science that it provides its best estimates of the rate and size of future climate change, and (to the extent that they embody what we currently know about climate) the models are the best tool for this job. Not using them in the IPCC assessments would be like marching into the jungle with one eye closed.

So, back to the question: can we use NWP forecast verification for climate projections? I think the answer is ‘no’, because of the timescales involved. Projections of climate change really only make sense on the scale of decades to centuries. Waiting for decades to do the verification is pointless – by then the science will have moved on, and it will be way too late for policymaking purposes anyway.

If we can’t verify the forecasts on a timescale that’s actually useful, does this mean the models are invalid? Again the answer is ‘no’, for three reasons. First, we have plenty of other V&V techniques to apply to climate models. Second, the argument that climate models are a valid tool for creating future projections of climate change is based not on our ability to do forecast verification, but on how well the models capture the current state of the science. And third, because forecast verification wouldn’t necessarily say anything about the models themselves anyway, as it assesses the entire forecast system.

It would certainly be really, really useful to be able to verify the “climate forecast” system. But the fact that we can’t does not mean we cannot validate climate models.

Validating Climate Models

30. November 2010 · 16 comments · Categories: climate modeling

In my last two posts, I demolished the idea that climate models need Independent Verification and Validation (IV&V), and I described the idea of a toolbox approach to V&V. Both posts were attacking myths: in the first case, the myth that an independent agent should be engaged to perform IV&V on the models, and in the second, the myth that you can critique the V&V of climate models without knowing anything about how they are currently built and tested.

I now want to expand on the latter point, and explain how the day-to-day practices of climate modellers taken together constitute a robust validation process, and that the only way to improve this validation process is just to do more of it (i.e. give the modeling labs more funds to expand their current activities, rather than to do something very different).

The most common mistake made by people discussing validation of climate models is to assume that a climate model is a thing-in-itself, and that the goal of validation is to demonstrate that some property holds of this thing. And whatever that property is, the assumption is that such measurement of it can be made without reference to its scientific milieu, and in particular without reference to its history and the processes by which it was constructed.

This mistake leads people to talk of validation in terms of how well “the model” matches observations, or how well “the model” matches the processes in some real world system. This approach to validation is, as Oreskes et al pointed out, quite impossible. The models are numerical approximations of complex physical phenomena. You can verify that the underlying equations are coded correctly in a given version of the model, but you can never validate that a given model accurately captures real physical processes, because it never will accurately capture them. Or as George Box summed it up: “All models are wrong…” (we’ll come back to the second half of the quote later).

The problem is that there is no such thing as “the model”. The body of code that constitutes a modern climate model actually represents an enormous number of possible models, each corresponding to a different way of configuring that code for a particular run. Furthermore, this body of code isn’t a static thing. The code is changed on a daily basis, through a continual process of experimentation and model improvement. Often these changes are done in parallel, so that there are multiple version at any given moment, being developed along multiple lines of investigation. Sometimes these lines of evolution are merged, to bring a number of useful enhancements together into a single version. Occasionally, the lines diverge enough to cause a fork: a point at which they are different enough that it just becomes too hard to reconcile them (See for example, this visualization of the evolution of ocean models). A forked model might at some point be given a new name, but the process by which a model gets a new name is rather arbitrary.

Occasionally, a modeling lab will label a particular snapshot of this evolving body of code as an “official release”. An official release has typically been tested much more extensively, in a number of standard configurations for a variety of different platforms. It’s likely to be more reliable, and therefore easier for users to work with. By more reliable here, I mean relatively free from coding defects. In other words, it is better verified than other versions, but not necessarily better validated (I’ll explain why shortly). In many cases, official releases also contain some significant new science (e.g. new parameterizations), and these scientific enhancements will be described in a set of published papers.

However, an official release isn’t a single model either. Again it’s just a body of code that can be configured to run as any of a huge number of different models, and it’s not unchanging either – as with all software, there will be occasional bugfix releases applied to it. Oh, and did I mention that to run a model, you have to make use of a huge number of ancillary datafiles, which define everything from the shape of the coastlines and land surfaces, to the specific carbon emissions scenario to be used. Any change to these effectively gives a different model too.

So, if you’re hoping to validate “the model”, you have to say which one you mean: which configuration of which code version of which line of evolution, and with which ancillary files. I suppose the response from those clamouring for something different in the way of model validation would say “well, the one used for the IPCC projections, of course”. Which is a little tricky, because each lab produces a large number of different runs for the CMIP process that provides input to the IPCC, and each of these is a likely to involve a different model configuration.

But let’s say for sake of argument that we could agree on a specific model configuration that ought to be “validated”. What will we do to validate it? What does validation actually mean? The Oreskes paper I mentioned earlier already demonstrated that comparison with real world observations, while interesting, does not constitute “validation”. The model will never match the observations exactly, so the best we’ll ever get along these lines is an argument that, on balance, given the sum total of the places where there’s a good match and the places where there’s a poor match, that the model does better or worse than some other model. This isn’t validation, and furthermore it isn’t even a sensible way of thinking about validation.

At this point many commentators stop, and argue that if validation of a model isn’t possible, then the models can’t be used to support the science (or more usually, they mean they can’t be used for IPCC projections). But this is a strawman argument, based on a fundamental misconception of what validation is all about. Validation isn’t about checking that a given instance of a model satisfies some given criteria. Validation is about about fitness for purpose, which means it’s not about the model at all, but about the relationship between a model and the purposes to which it is put. Or more precisely, its about the relationship between particular ways of building and configuring models and the ways in which runs produced by those models are used.

Furthermore, the purposes to which models are put and the processes by which they are developed co-evolve. The models evolve continually, and our ideas about what kinds of runs we might use them for evolve continually, which means validation must take this ongoing evolution into account. To summarize, validation isn’t about a property of some particular model instance; its about the whole process of developing and using models, and how this process evolves over time.

Let’s take a step back a moment, and ask what is the purpose of a climate model. The second half of the George Box quote is “…but some models are useful”. Climate models are tools that allow scientists to explore their current understanding of climate processes, to build and test theories, and to explore the consequences of those theories. In other words we’re dealing with three distinct systems:

We're dealing with relationships between three different systems

There does not need to be any clear relationship between the calculational system and the observational system – I didn’t include such a relationship in my diagram. For example, climate models can be run in configurations that don’t match the real world at all: e.g. a waterworld with no landmasses, or a world in which interesting things are varied: the tilt of the pole, the composition of the atmosphere, etc. These models are useful, and the experiments performed with them may be perfectly valid, even though they differ deliberately from the observational system.

What really matters is the relationship between the theoretical system and the observational system: in other words, how well does our current understanding (i.e. our theories) of climate explain the available observations (and of course the inverse: what additional observations might we make to help test our theories). When we ask questions about likely future climate changes, we’re not asking this question of the the calculational system, we’re asking it of the theoretical system; the models are just a convenient way of probing the theory to provide answers.

By the way, when I use the term theory, I mean it in exactly the way it’s used in throughout all sciences: a theory is the best current explanation of a given set of phenomena. The word “theory” doesn’t mean knowledge that is somehow more tentative than other forms of knowledge; a theory is actually the kind of knowledge that has the strongest epistemological basis of any kind of knowledge, because it is supported by the available evidence, and best explains that evidence. A theory might not be capable of providing quantitative predictions (but it’s good when it does), but it must have explanatory power.

In this context, the calculational system is valid as long as it can offer insights that help to understand the relationship between the theoretical system and the observational system. A model is useful as long as it helps to improve our understanding of climate, and to further the development of new (or better) theories. So a model that might have been useful (and hence valid) thirty years ago might not be useful today. If the old approach to modelling no longer matches current theory, then it has lost some or all of its validity. The model’s correspondence (or lack of) to the observations hasn’t changed (*), nor has its predictive power. But its utility as a scientific tool has changed, and hence its validity has changed.

[(*) except that that accuracy of the observations may have changed in the meantime, due to the ongoing process of discovering and resolving anomalies in the historical record.]

The key questions for validation then, are to do with how well the current generation of models (plural) support the discovery of new theoretical knowledge, and whether the ongoing process of improving those models continues to enhance their utility as scientific tools. We could focus this down to specific things we could measure by asking whether each individual change to the model is theoretically justified, and whether each such change makes the model more useful as a scientific tool.

To do this requires a detailed study of day-to-day model development practices, the extent to which these are closely tied with the rest of climate science (e.g. field campaigns, process studies, etc). It also takes in questions such as how modeling centres decide on their priorities (e.g. which new bits of science to get into the models sooner), and how each individual change is evaluated. In this approach, validation proceeds by checking whether the individual steps taken to construct and test changes to the code add up to a sound scientific process, and how good this process is at incorporating the latest theoretical ideas. And we ought to be able to demonstrate a steady improvement in the theoretical basis for the model. An interesting quirk here is that sometimes an improvement to the model from a theoretical point of view reduces its skill at matching observations; this happens particularly when we’re replacing bits of the model that were based on empirical parameters with an implementation that has a stronger theoretical basis, because the empirical parameters were tuned to give a better climate simulation, without necessarily being well understood. In the approach I’m describing, this would be an indicator of an improvement in validity, even while reduces the correspondence with observations. If on the other hand we based our validation on some measure of correspondence with observations, such a step would reduce the validity of the model!

But what does all of this tell us about whether it’s “valid” to use the models to produce projections of climate change into the future? Well, recall that when we ask for projections of future climate change, we’re not asking the question of the calculational system, because all that would result in is a number, or range of numbers, that are impossible to interpret, and therefore meaningless. Instead we’re asking the question of the theoretical system: given the sum total of our current theoretical understanding of climate, what is likely to happen in the future, under various scenarios for expected emissions and/or concentrations of greenhouse gases? If the models capture our current theoretical understanding well, then running the scenario on the model is a valid thing to do. If the models do a poor job of capturing our theoretical understanding, then running the models on these scenarios won’t be very useful.

Note what is happening here: when we ask climate scientists for future projections, we’re asking the question of the scientists, not of their models. The scientists will apply their judgement to select appropriate versions/configurations of the models to use, they will set up the runs, and they will interpret the results in the light of what is known about the models’ strengths and weaknesses and about any gaps between the comptuational models and the current theoretical understanding. And they will add all sorts of caveats to the conclusions they draw from the model runs when they present their results.

And how do we know whether the models capture our current theoretical understanding? By studying the processes by which the models are developed (i.e. continually evolved) be the various modeling centres, and examining how good each centre is at getting the latest science into the models. And by checking that whenever there are gaps between the models and the theory, these are adequately described by the caveats in the papers published about experiments with the models.

Summary: It is a mistake to think that validation is a post-hoc process to be applied to an individual “finished” model to ensure it meets some criteria for fidelity to the real world. In reality, there is no such thing as a finished model, just many different snapshots of a large set of model configurations, steadily evolving as the science progresses. And fidelity of a model to the real world is impossible to establish, because the models are approximations. In reality, climate models are tools to probe our current theories about how climate processes work. Validity is the extent to which climate models match our current theories, and the extent to which the process of improving the models keeps up with theoretical advances.

The difference between Verification and Validation

29. November 2010 · 20 comments · Categories: climate modeling

Sometime in the 1990’s, I drafted a frequently asked question list for NASA’s IV&V facility. Here’s what I wrote on the meaning of the terms “validation” and “verification”:

The terms Verification and Validation are commonly used in software engineering to mean two different types of analysis. The usual definitions are:

Validation: Are we building the right system?

Verification: Are we building the system right?

In other words, validation is concerned with checking that the system will meet the customer’s actual needs, while verification is concerned with whether the system is well-engineered, error-free, and so on. Verification will help to determine whether the software is of high quality, but it will not ensure that the system is useful.

The distinction between the two terms is largely to do with the role of specifications. Validation is the process of checking whether the specification captures the customer’s needs, while verification is the process of checking that the software meets the specification.

Verification includes all the activities associated with the producing high quality software: testing, inspection, design analysis, specification analysis, and so on. It is a relatively objective process, in that if the various products and documents are expressed precisely enough, no subjective judgements should be needed in order to verify software.

In contrast, validation is an extremely subjective process. It involves making subjective assessments of how well the (proposed) system addresses a real-world need. Validation includes activities such as requirements modelling, prototyping and user evaluation.

In a traditional phased software lifecycle, verification is often taken to mean checking that the products of each phase satisfy the requirements of the previous phase. Validation is relegated to just the begining and ending of the project: requirements analysis and acceptance testing. This view is common in many software engineering textbooks, and is misguided. It assumes that the customer’s requirements can be captured completely at the start of a project, and that those requirements will not change while the software is being developed. In practice, the requirements change throughout a project, partly in reaction to the project itself: the development of new software makes new things possible. Therefore both validation and verification are needed throughout the lifecycle.

Finally, V&V is now regarded as a coherent discipline: ”Software V&V is a systems engineering discipline which evaluates the software in a systems context, relative to all system elements of hardware, users, and other software”. (from Software Verification and Validation: Its Role in Computer Assurance and Its Relationship with Software Project Management Standards, by Dolores R. Wallace and Roger U. Fujii, NIST Special Publication 500-165)

Having thus carefully distinguished the two terms, my advice to V&V practitioners was then to forget about the distinction, and think instead about V&V as a toolbox, which provides a wide range of tools for asking different kinds of questions about software. And to master the use of each tool and figure out when and how to use it. Here’s one of my attempts to visualize the space of tools in the toolbox:

A range of V&V techniques. Note that "modeling" and "model checking" refer to building and analyzing abstracted models of software behaviour, a very different kind of beast from scientific models used in the computational sciences

For climate models, the definitions that focus on specifications don’t make much sense, because there are no detailed specifications of climate models (nor can there be – they’re built by iterative refinement like agile software development). But no matter – the toolbox approach still works; it just means some of the tools are applied a little differently. An appropriate toolbox for climate modeling looks a little different from my picture above, because some of these tools are more appropriate for real-time control systems, applications software, etc, and there are some missing from the above picture that are particular for simulation software. I’ll draw a better picture when I’ve finished analyzing the data from my field studies of practices used at climate labs.

Many different V&V tools are already in use at most climate modelling labs, but there is room for adding more tools to the toolbox, and for sharpening the existing tools (what and how are the subjects of my current research). But the question of how best to do this must proceed from a detailed analysis of current practices and how effective they are. There seem to be plenty of people wandering into this space, claiming that the models are insufficiently verified, validated, or both. And such people like to pontificate about what climate modelers ought to do differently. But anyone who pontificates in this way, but is unable to give a detailed account of which V&V techniques climate modellers currently use, is just blowing smoke. If you don’t know what’s in the toolbox already, then you can’t really make constructive comments about what’s missing.

Do Climate Models need Independent Verification and Validation?

27. November 2010 · 29 comments · Categories: climate modeling

A common cry from climate contrarians is that climate models need better verification and validation (V&V), and in particular, that they need Independent V&V (aka IV&V). George Crews has been arguing this for a while, and now Judith Curry has taken up the cry. Having spent part of the 1990’s as lead scientist at NASA’s IV&V facility, and the last few years studying climate model development processes, I think I can offer some good insights into this question.

The short answer is “no, they don’t”. The slightly longer answer is “if you have more money to spend to enhance the quality of climate models, spending it on IV&V is probably the least effective thing you could do”.

The full answer involves deconstructing the question, to show that it is based on three incorrect assumptions about climate models: (1) that there’s some significant risk to society associated with the use of climate models; (2) that the existing models are inadequately tested / verified / validated / whatevered; and (3) that trust in the models can be improved by using an IV&V process. I will demonstrate what’s wrong with each of these assumptions, but first I need to explain what IV&V is.

Independent Verification and Validation (IV&V) is a methodology developed primarily in the aerospace industry for reducing the risk of software failures, by engaging a separate team (separate from the software development team, that is) to perform various kinds of testing and analysis on the software as it is produced. NASA adopted IV&V for development of the flight software for the space shuttle in the 1970’s. Because IV&V is expensive (it typically adds 10%-20% to the cost of a software development contract), NASA tried to cancel the IV&V on the shuttle in the early 1980’s, once the shuttle was declared operational. Then, of course the Challenger disaster occurred. Although software wasn’t implicated, a consequence of the investigation was the creation of the Leveson committee, to review the software risk. Leveson’s committee concluded that far from cancelling IV&V, NASA needed to adopt the practice across all of its space flight programs. As a result of the Leveson report, the NASA IV&V facility was established in the early 1990’s, as a centre of expertise for all of NASA’s IV&V contracts. In 1995, I was recruited as lead scientist at the facility, and while I was there, our team investigated the operational effectiveness of the IV&V contracts on the Space Shuttle, International Space Station, Earth Observation System, Cassini, as well as a few other smaller programs. (I also reviewed the software failures on NASA’s Mars missions in the 1990’s, and have a talk about the lessons learned)

The key idea for IV&V is that when NASA puts out a contract to develop flight control software, it also creates a separate contract with a different company, to provide an ongoing assessment of software quality and risk as the development proceeds. One difficulty with IV&V contracts in the US aerospace industry is that it’s hard to achieve real independence, because industry consolidation has left very few aerospace companies available to take on such contracts, and they’re not sufficiently independent from one another.

NASA’s approach demands independence along three dimensions:

managerial independence (the IV&V contractor is free to determine how to proceed, and where to devote effort, independently of either the software development contractor and the customer)
financial independence (the funding for the IV&V contract is separate from the development contract, and cannot be raided if more resources are needed for development); and
technical independence (the IV&V contractor is free to develop its own criteria, and apply whatever V&V methods and tools it deems appropriate).

This has led to the development of a number of small companies who specialize only in IV&V (thus avoiding any contractual relationship with other aerospace companies), and who tend to recruit ex-NASA staff to provide them with the necessary domain expertise.

For the aerospace industry, IV&V has been demonstrated to be a cost effective strategy to improve software quality and reduce risk. The problem is that the risks are extreme: software errors in the control software for a spacecraft or an aircraft are highly likely to cause loss of life, loss of the vehicle, and/or loss of the mission. There is a sharp distinction between the development phase and the operation phase for such software: it had better be correct when it’s launched. Which means the risk mitigation has to be done during development, rather than during operation. In other words, iterative/agile approaches don’t work – you can’t launch with a beta version of the software. The goal is to detect and remove software defects before the software is ever used in an operational setting. An extreme example of this was the construction of the space station, where the only full end-to-end construction of the system was done in orbit; it wasn’t possible to put the hardware together on the ground in order to do a full systems test on the software.

IV&V is essential for such projects, because it overcomes natural confirmation bias of software development teams. Even the NASA program managers overseeing the contracts suffer from this too – we discovered one case where IV&V reports on serious risks were being systematically ignored by the NASA program office, because the program managers preferred to believe the project was going well. We fixed this by changing the reporting structure, and routing the IV&V reports directly to the Office of Safety and Mission Assurance at NASA headquarters. The IV&V teams developed their own emergency strategy too – if they encountered a risk that they considered mission-critical, and couldn’t get the attention of the program office to address it, they would go and have a quiet word with the astronauts, who would then ensure the problem got seen to!

But IV&V is very hard to do right, because much of it is a sociological problem rather than a technical problem. The two companies (developer and IV&V contractor) are naturally set up in an adversarial relationship, but if they act as adversaries, they cannot be effective: the developer will have a tendency to hide things, and the IV&V contractor will have a tendency to exaggerate the risks. Hence, we observed that the relationship is most effective where there is a good horizontal communication channel between the technical staff in each company, and that they come to respect one another’s expertise. The IV&V contractor has to be careful not to swamp the communication channels with spurious low-level worries, and the development contractor must be willing to respond positively to criticism. One way this works very well is for the IV&V team to give the developers advance warning of any issues they planned to report up the hierarchy to NASA, so that the development contractor could have a solution in place as even before NASA asked for it. For a more detailed account of these coordination and communication issues, see:

Easterbrook S. The Role of Indepedent V&V in Upstream Software Development Processes. In: 2nd World Conference on Integrated Design and Process Technology (IDPT). Austin, Texas; 1996.

Okay, let’s look at whether IV&V is applicable to climate modeling. Earlier, I identified three assumptions made by people advocating it. Let’s take them one at a time:

1) The assumption there’s some significant risk to society associated with the use of climate models.

A large part of the mistake here is to misconstrue the role of climate models in policymaking. Contrarians tend to start from an assumption that proposed climate change mitigation policies (especially any attempt to regulate emissions) will wreck the economies of the developed nations (or specifically the US economy, if it’s an American contrarian). I prefer to think that a massive investment in carbon-neutral technologies will be a huge boon to the world’s economy, but let’s set aside that debate, and assume for sake of arguments that whatever policy path the world takes, it’s incredibly risky, with a non-neglibable probability of global catastrophe if the policies are either too aggressive or not aggressive enough, i.e. if the scientific assessments are wrong.

The key observation is that software does not play the same role in this system that flight software does for a spacecraft. For a spacecraft, the software represents a single point of failure. An error in the control software can immediately cause a disaster. But climate models are not control systems, and they do not determine climate policy. They don’t even control it indirectly – policy is set by a laborious process of political manoeuvring and international negotiation, in which the impact of any particular climate model is negligible.

Here’s what happens: the IPCC committees propose a whole series of experiments for the climate modelling labs around the world to perform, as part of a Coupled Model Intercomparison Project. Each participating lab chooses those runs they are most able to do, given their resources. When they have completed their runs, they submit the data to a public data repository. Scientists around the world then have about a year to analyze this data, interpret the results, to compare performance of the models, discuss findings at conferences and workshops, and publish papers. This results in thousands of publications from across a number of different scientific disciplines. The publications that make use of model outputs take their place alongside other forms of evidence, including observational studies, studies of paleoclimate data, and so on. The IPCC reports are an assessment of the sum total of the evidence; the model results from many runs of many different models are just one part of that evidence. Jim Hansen rates models as the third most important source of evidence for understanding climate change, after (1) paleoclimate studies and (2) observed global changes.

The consequences of software errors in a model, in the worst case, are likely to extend to no more than a few published papers being retracted. This is a crucial point: climate scientists don’t blindly publish model outputs as truth; they use model outputs to explore assumptions and test theories, and then publish papers describing the balance of evidence. Further papers then come along that add more evidence, or contradict the earlier findings. The assessment reports then weigh up all these sources of evidence.

I’ve been asking around for a couple of years for examples of published papers that were subsequently invalidated by software errors in the models. I’ve found several cases where a version of the model used in the experiments reported in a published paper was later found to contain an important software bug. But in none of those cases did the bug actually invalidate the conclusions of the paper. So even this risk is probably overstated.

The other point to make is that around twenty different labs around the world participate in the Model Intercomparison Projects that provide data for the IPCC assessments. That’s a level of software redundancy that is simply impossible in the aerospace industry. It’s likely that these 20+ models are not quite as independent as they might be (e.g. see Knutti’s analysis of this), but even so, the ability to run many different models on the same set of experiments, and to compare and discuss their differences is really quite remarkable, and the Model Intercomparison Projects have been a major factor in driving the science forward in the last decade or so. It’s effectively a huge benchmarking effort for climate models, with all the benefits normally associated with software benchmarking (and worthy of a separate post – stay tuned).

So in summary, while there are huge risks to society of getting climate policy wrong, those risks are not software risks. A single error in the flight software for a spacecraft could kill the crew. A single error in a climate model can, at most, only affect a handful of the thousands of published papers on which the IPCC assessments are based. The actual results of a particular model run are far less important than the understanding the scientists gain about what the model is doing and why, and the nature of the uncertainties involved. The modellers know that the models are imperfect approximations of very complex physical, chemical and biological processes. Conclusions about key issues such as climate sensitivity are based not on particular model runs, but on many different experiments with many different models over many years, and the extent to which these experiments agree or disagree with other sources of evidence.

2) the assumption that the current models are inadequately tested / verified / validated / whatevered;

This is a common talking point among contrarians. Part of the problem is that while the modeling labs have evolved sophisticated processes for developing and testing their models, they rarely bother to describe these processes to outsiders – nearly all published reports focus on the science done with the models, rather than the modeling process itself. I’ve been working to correct this, with, first, my study of the model development processes at the UK Met Office, and more recently my comparative studies of other labs, and my accounts of the existing V&V processes. Some people have interpreted the latter as a proposal for what should be done, but it is not; it is an account of the practices currently in place across all the of the labs I have studied.

A key point is that for climate models, unlike spacecraft flight controllers, there is no enforced separation between software development and software operation. A climate model is always an evolving, experimental tool, it’s never a finished product – even the prognostic runs done as input to the IPCC process are just experiments, requiring careful interpretation before any conclusions can be drawn. If the model crashes, or gives crazy results, the only damage is wasted time.

This means that an iterative development approach is the norm, which is far superior to the waterfall process used in the aerospace industry. Climate modeling labs have elevated the iterative development process to a new height: each change to the model is treated as a scientific experiment, where the change represents a hypothesis for how to improve the model, and a series of experiments is used to test whether the hypothesis was correct. This means that software development proceeds far more slowly than commercial software practices (at least in terms of lines of code per day), but that the models are continually tested and challenged by the people who know them inside out, and comparison with observational data is a daily activity.

The result is that climate models have very few bugs, compared to commercial software, when measured using industry standard defect density measures. However, although defect density is a standard IV&V metric, it’s probably a poor measure for this type of software – it’s handy for assessing risk of failure in a control system, but a poor way of assessing the validity and utility of a climate model. The real risk is that there may be latent errors in the model that mean it isn’t doing what the modellers designed it to do. The good news is that such errors are extremely rare: nearly all coding defects cause problems that are immediately obvious: the model crashes, or the simulation becomes unstable. Coding defects can only remain hidden if they have an effect that is small enough that it doesn’t cause significant perturbations in any of the diagnostic variables collected during a model run; in this case they are indistinguishable from the acceptable imperfections that arise as a result of using approximate techniques. The testing processes for the climate models (which in most labs include a daily build and automated test across all reference configurations) are sufficient that such problems are nearly always identified relatively early.

This means that there are really only two serious error types that can lead to misleading scientific results: (1) misunderstanding of what the model is actually doing by the scientists who conduct the model experiments, and (2) structural errors, where specific earth system processes are omitted or poorly captured in the model. In flight control software, these would correspond to requirements errors, and would be probed by an IV&V team through specification analysis. Catching these in control software is vital because you only get one chance to get it right. But in climate science, these are science errors, and are handled very well by the scientific process: making such mistakes, learning from them, and correcting them are all crucial parts of doing science. The normal scientific peer review process handles these kinds of errors very well. Model developers publish the details of their numerical algorithms and parameterization schemes, and these are reviewed and discussed in the community. In many cases, different labs will attempt to build their own implementations from these descriptions, and in the process subject them to critical scrutiny. In other words, there is already an independent expert review process for the most critical parts of the models, using the normal scientific route of replicating one another’s techniques. Similarly, experimental results are published, and the data is made available for other scientists to explore.

As a measure of how well this process works for building scientifically valid models, one senior modeller recently pointed out to me that it’s increasingly the case now that when the models diverge from the observations, it’s often the observational data that turns out to be wrong. The observational data is itself error prone, and software models turn out to be an important weapon in identifying and eliminating such errors.

However, there is another risk here that needs to be dealt with. Outside of the labs where the models are developed, there is a tendency for scientists who want to make use of the models to treat them as black box oracles. Proper use of the models depends on a detailed understanding of their strengths and weaknesses, and the ways in which uncertainties are handled. If we have some funding available to improve the quality of climate models, it would be far better spent on improving the user interfaces, and better training of the broader community of model users.

The bottom line is that climate models are subjected to very intensive system testing, and the incremental development process incorporates a sophisticated regression test process that’s superior to most industrial software practices. The biggest threat to validity of climate models is errors in the scientific theories on which they are based, but such errors are best investigated through the scientific process, rather than through an IV&V process. Which brings us to:

(3) the assumption that our ability to trust in the models can be improved by an IV&V process;

IV&V is essentially a risk management strategy for safety-critical software when which an iterative development strategy is not possible – where the software has to work correctly the first (and every) time it is used in an operational setting. Climate models aren’t like this at all. They aren’t safety critical, they can be used even while they are being developed (and hence are built by iterative refinement); and they solve complex, wicked problems, for which there’s no clear correctness criteria. In fact, as a species of software development process, I’ve come to the conclusion they are dramatically different from any of the commercial software development paradigms that have been described in the literature.

A common mistake in the software engineering community is to think that software processes can be successfully transplanted from one organisation to another. Our comparative studies of different software organizations show that this is simply not true, even for organisations developing similar types of software. There are few, if any, documented cases of a software development organisation successfully adopting a process model developed elsewhere, without very substantial tailoring. What usually happens is that ideas from elsewhere are gradually infused and re-fashioned to work in the local context. And the evidence shows that every software oganisation evolves its own development processes that are highly dependent on local context, and on the constraints they operate under. Far more important than a prescribed process is the development of a shared understanding within the software team. The idea of taking a process model that was developed in the aerospace industry, and transplanting it wholesale into a vastly different kind of software development process (climate modeling) is quite simply ludicrous.

For example, one consequence of applying IV&V is that it reduces flexibility for development team, as they have to set clearer milestones and deliver workpackages on schedule (otherwise IV&V team cannot plan their efforts). Because the development of scientific codes is inherently unpredictable, would be almost impossible to plan and resource an IV&V effort. The flexibility to explore new model improvements opportunistically, and to adjust schedules to match varying scientific rhythms, is crucial to the scientific mission – locking the development into more rigid schedules to permit IV&V would be a disaster.

If you wanted to set up an IV&V process for climate models, it would have to be done by domain experts; domain expertise is the single most important factor in successful use of IV&V in the aerospace industry. This means it would have to be done by other climate scientists. But other climate scientists already do this routinely – it’s built into the Model Intercomparison Projects, as well as the peer review process and through attempts to replicate one another’s results. In fact the Model Intercomparison Projects already achieve far more than an IV&V process would, because they are done in the open and involve a much broader community.

In other words, the available pool of talent for performing IV&V is already busy using a process that’s far more effective than IV&V ever can be: it’s called doing science. Actually, I suspect that those people calling for IV&V of climate models are really trying to say that climate scientists can’t be trusted to check each other’s work, and that some other (unspecified) group ought to do the IV&V for them. However, this argument can only be used by people who don’t understand what IV&V is. IV&V works in the aerospace industry not because of any particular process, but because it brings in the experts – the people with grey hair who understand the flight systems inside out, and understand all the risks.

And remember that IV&V is expensive. NASA’s rule of thumb was an additional 10%-20% of the development cost. This cannot be taken from the development budget – it’s strictly an additional cost. Given my estimate of the development cost of a climate model as somewhere in the ballpark of $350 million, then we’ll need to find another $35 million for each climate modeling centre to fund their IV&V contract. And if we had such funds to add to their budgets, I would argue that IV&V is one of the least sensible ways of spending this money. Instead, I would:

Hire more permanent software support staff to work alongside the scientists;
Provide more training courses to give the scientists better software skills;
Do more research into modeling frameworks;
Experiment with incremental improvements to existing practices, such as greater use of testing tools and frameworks, pair programming and code sprints;
More support to grow the user communities (e.g. user workshops and training courses), and more community building and beta testing;
Documenting the existing software development and V&V best practices so that different labs can share ideas and experiences, and the process of model building becomes more transparent to outsiders.

To summarize, IV&V would be an expensive mistake for climate modeling. It would divert precious resources (experts) away from existing modeling teams, and reduce their flexibility to respond to the science. IV&V isn’t appropriate because this isn’t ~~mission~~safety-critical software, it doesn’t have distinct development and operational phases, and the risks of software error are minor. There’s no single point of failure, because many labs around the world build their own models, and the normal scientific processes of experimentation, peer-review, replication, and model inter-comparison already provide a sophisticated process to examine the scientific validity of the models. Virtually all coding errors are detected in routine testing, and science errors are best handled through the usual scientific process, rather than through an IV&V process. Furthermore, there is only a small pool of experts available to perform IV&V on climate models (namely, other climate modelers) and they are already hard at work improving their own models. Re-deploying them to do IV&V of each other’s models would reduce the overall quality of the science rather than improving it.

(BTW I shouldn’t have had to write this article at all…)

Plug-compatibility and climate models

18. November 2010 · 7 comments · Categories: climate modeling

Following my post last week about Fortran coding standards for climate models, Tim reminded me of a much older paper that was very influential in the creation (and sharing) of coding standards across climate modeling centers:

Kalnay E, Kanamitsu M, Pfaendtner J. Rules for Interchange of Physical Parameterizations“. Bulletin of the American Meteorological Society. 1989;70(6):620-622.

The paper is the result of a series of discussions in the mid-1980s across many different modeling centres (the paper lists 11 labs) about how to facilitate sharing of code modules. To simplify things, the paper assumes what is being shared are parameterization modules that operate in a single column of the model. Of course, this was back in the 1980s, which means the models were primarily atmospheric models, rather than the more comprehensive earth system models of today. The dynamical core of the model handles most of the horizontal processes (e.g. wind), which means that most of the remaining physical processes (the subject of these parameterizations) affect what happens vertically within a single column, e.g. by affecting radiative or convective transfer of heat between the layers. Plugging in new parameterization modules becomes much easier if this assumption holds, because the new module needs to be called once per time step per column, and if it doesn’t interact with other columns, it doesn’t mess up the vectorization. The paper describes a number of coding conventions, effectively providing an interface specification for single-column parameterizations.

An interesting point about this paper is that popularized the term “plug compatibility” amongst the modeling community, along with the (implicit) broader goal of designing all models to be plug-compatible. (although it cites Pielke & Arrit for the origin of the term). Unfortunately, the goal seems to be still very elusive. While most modelers will agree accept that plug-compatibility is desirable, a few people I’ve spoken to are very skeptical that it’s actually possible. Perhaps the strongest statement on this is from:

Randall DA. A University Perspective on Global Climate Modeling. Bulletin of the American Meteorological Society. 1996;77(11):2685-2690.
p2687: “It is sometimes suggested that it is possible to make a plug-compatible global model so that an “outside” scientist can “easily make changes”. With a few exceptions (e.g. radiation codes), however, this is a fantasy, and I am surprised that such claims are not greeted with more skepticism.”

He goes on to describe instances where parameterizations have been transplanted from one model to another, but likens it to a major organ transplant, but more painful. The problem is that the various processes of the earth system interact in complex ways, and these complex interactions have to be handled properly in the code. As Randall puts it: “…the reality is that a global model must have a certain architectural unity or it will fail”. In my interviews with climate modellers, I’ve heard many tales of it taking months, and sometimes years of effort to take a code module contributed by someone outside the main modeling group, and to make it work properly in the model.

So plug compatibility and code sharing sound great in principle. In practice, no amount of interface specification and coding standards can reduce the essential complexity of earth system processes.

Note: most of the above is about plug compatibility of parameterization modules (i.e. code packages that live within the green boxes on the Bretherton diagram). More progress has been made (especially in the last decade) in standardizing the interfaces between major earth system components (i.e. the arrows on the Bretherton diagram). That’s where standardized couplers come in – see my post on the high level architecture of earth system models for an introduction. The IS-ENES workshop on coupling technologies in December will be an interesting overview of the state of the art here, although I won’t be able to attend, as it clashes with the AGU meeting.

You can’t delegate ill-defined problems to software engineers

16. November 2010 · 13 comments · Categories: climate modeling, collaborative science

I had lunch last week with Gerhard Fischer at the University of Colorado. Gerhard is director of the center for lifelong learning and design, and his work focusses on technologies that help people to learn and design solutions to suit their own needs. We talked a lot about meta-design, especially how you create tools that help domain experts (who are not necessarily software experts) to design their own software solutions.

I was describing some of my observations about why climate scientists prefer to write their own code rather than delegating it to software professionals, when Gerhard put it into words brilliantly. He said “You can’t delegate ill-defined problems to software engineers”. And that’s the nub of it. Much (but not all) of the work of building a global climate model is an ill-defined problem. We don’t know at the outset what should go into the model, which processes are important, how to simulate complex physical, chemical and biological processes and their interactions. We don’t know what’s computationally feasible (until we try it). We don’t know what will be scientifically useful. So we can’t write a specification, nor explain the requirements to someone who doesn’t have a high level of domain expertise. The only way forward is to actively engage in the process of building a little, experimenting with it, reflecting on the lessons learnt, and then modifying and iterating.

So the process of building a climate model is a loop of build-explore-learn-build. If you put people into that loop who don’t have the necessary understanding of the science being done with the models, then you slow things down. And as the climate scientists (mostly) have the necessary technical skills, it’s quicker and easier to write their own code than to explain to a software engineer what is needed. But there’s a trade-off: the exploratory loop can be traversed quickly, but the resulting code might not be very robust or modifiable. Just as in agile software practices, the aim is to build something that works first, and worry about elegant design later. And that ‘later’ might never come, as the next scientific question is nearly always more alluring than a re-design. Which means the main role for software engineers in the process is to do cleanup operations. Several of the software people I’ve interviewed in the last few months at climate modeling labs described their role as mopping up after the parade (and some of them used more colourful terms than that).

The term meta-design is helpful here, because it specifically addresses the question of how to put better design tools directly into the hands of the climate scientists. Modeling frameworks fit into this space, as do domain specific-languages. But I’m convinced that there’s a lot more scope for tools that raise the level of abstraction, so that modelers can work directly with meaningful building blocks than lines of Fortran. And there’s another problem. Meta-design is hard. Too often it produces tools that just don’t do what the target users want. If we’re really going to put better tools into the hands of climate modelers, then we need a new kind of expertise to build such tools: a community of meta-designers who have both the software expertise and the domain expertise in earth sciences.

Which brings me to another issue that came up in the discussion. Gerhard provided me a picture that helps me explain the issue better (I hope he doesn’t mind me reproducing it here; it comes from his talk “Meta-Design and Social Creativity” given at IEMC 2007):

To create reflective design communities, the software professionals need to acquire some domain expertise, and the domain experts need to acquire some software expertise (diagram by Gerhard Fischer)

Clearly, collaboration between software experts and climate scientists is likely to work much better if each acquires a little of the other’s expertise, if only to enable them to share some vocabulary to talk about the problems. It reduces the distance between them.

At climate modeling labs, I’ve met a number both kinds of people – i.e. climate scientists who have acquired good software knowledge, and software professionals who have acquired good climate science knowledge. But it seems to me that for climate modeling, one of these transitions is much easier than the other. It seems to be easier for climate scientists to acquire good software skills than it is for software professionals (with no prior background in the earth sciences) to acquire good climate science domain knowledge. That’s not to say it’s impossible, as I have met a few people who have followed this path (but they are rare). It seems to require many years of dedicated work. And there appears to be a big disincentive for many software professionals, as it turns them from generalists into specialists. If you dedicate several years to developing the necessary domain expertise in climate modeling, it probably means you’re committing the rest of your career to working in this space. But the pay is lousy, the programming language of choice is uncool, and mostly you’ll be expected to clean up after the parade rather than star in it.

Climate Model Coding Standards

07. November 2010 · 7 comments · Categories: climate modeling

Here are some climate model coding standards that I’ve collected over the last few months:

NASA GISS’s ModelE_Coding_Standards (dated June 2010)
NCAR’s CESM (previously known as CCSM) Coding Conventions (dated June 2001)
IPSL Ocean Model NEMO coding conventions (version 2, dated 2010)
The European Program for Integrated Earth System Modelling P rism Coding Rules (dated 2002)
The UK Met Office Unified Model Software Standards (link removed – see update below)
GFDL FMS coding conventions (dated 2002)
The Max-Planck-Institute’s Programming Guide for ICON (dated March 2006)

It’s encouraging that most modelling centres have developed detailed coding standards, but it’s a shame that most of them had to roll their own. The PRISM project is an exception – as many of the modelling labs across Europe were members of the PRISM project, some of these labs now use the PRISM coding rules.

Two followup tasks I hope to get to soon – (1) analyze how much these different standards overlap/differ, and (2) measure how much the model codes adhere to the standards.

16/11/2010 Update: The UK Met Office standard was an old version that was never publically released, so I’ve removed the link, at the request of the UKMO. I’ll post a newer version if I can sort out the permissions. I’ve added MPI-M’s ICON standards to the list.

New undergrad course on climate models

04. November 2010 · 1 comment · Categories: climate modeling, courses

After an exciting sabbatical year spent visiting a number of climate modeling centres, I’ll be back to teaching in January. I’ll be introducing two brand new courses, both related to climate modeling. I already blogged about my new grad course on “Climate Change Informatics”, which will cover many current research issues to do with software and data in climate science.

But I didn’t yet mention my new undergrad course. I’ll be teaching a 199 course in January, which I’ve never done before. 199 courses are first-year seminar courses, open to all new students across the faculty of arts and science, intended to encourage critical thinking, communication and research skills. They are run as small group seminar courses (enrolment is capped at 24 students). I’ve never taught one of these courses before, so I’ve no idea what to expect – I’m hoping for an interesting mix of students with different backgrounds, so we can spend some time attacking the theme of the course from different perspectives. Here’s my course description:

“Climate Change: Software, Science and Society”

This course will examine the role of computers and software in understanding climate change. We will explore the use of computer models to build simulations of the global climate, including a historical view of the use of computer models to understand weather and climate, and a detailed look at the current state of computer modelling, especially how global climate models are tested, what kinds of experiments are performed with them, how scientists know they can trust the models, and how they deal with uncertainty. The course will also explore the role of computer models in helping to shape society’s responses to climate change, in particular, what they can (and can’t) tell us about how to make effective decisions about government policy, international treaties, community action and the choices we make as individuals. The course will take a cross-disciplinary approach to these questions, looking at the role of computer models in the physical sciences, environmental science, politics, philosophy, sociology and economics of climate change. However, students are not expected to have any specialist knowledge in any of these fields prior to the course.

If all goes well, I plan to include some hands-on experimentation with climate models, perhaps using EdGCM (or even CESM if I can simplify the process of installing it and running it for them). We’ll also look at how climate models are perceived in the media and blogosphere (that will be interesting!) and compare these perceptions to what really goes on in climate modelling labs. Of course, the nice thing about a small seminar course is that I can be flexible about responding to the students’ own interests. I’m really looking forward to this…

What is a climate model?

03. November 2010 · 1 comment · Categories: climate modeling

Here’s a very nice video explaining the basics of how climate models work, produced by the folks at IPSL in Paris. This version is French with English subtitles – for the francophones out there, you’ll notice the narration is a little more detailed than the subtitles. I particularly like bit where the earth grid is unpeeled and fed into the supercomputers:

[Qt:http://www.cs.toronto.edu/~sme/movies/ipsl-modeling-med.mov 480 360]

The original (without the English subtitles) is here: http://www.youtube.com/user/CEADSMCOM

Data Limits vs. Computational Limits

30. September 2010 · 4 comments · Categories: climate modeling

For many decades, computational speed has been the main limit on the sophistication of climate models. Climate modelers have become one of the most demanding groups of users for high performance computing, and access to faster and faster machines drives much of the progress, permitting higher resolution models and more earth system processes being explicitly resolved in the models. But from my visits to NCAR, MPI-M and IPSL this summer, I’m learning that growth in volumes of data handled is increasingly a dominant factor. The volume of data generated from today’s models has grown so much that supercomputer facilities find it hard to handle.

Currently, the labs are busy with the CMIP5 runs that will form one of the major inputs to the next IPCC assessment report. See here for a list of the data outputs required from the models (and note that the requirements were last changed on Sept 17, 2010 -well after most centers have started their runs; after all it will take months to complete the runs, and the target date for submitting the data is the end of this year)

Climate modelers have requirements that are somewhat different from most other users of supercomputing facilities anyway:

very long runs – e.g. runs that take weeks or even months to complete;
frequent stop and restart of runs – e.g. the runs might be configured to stop once per simulated year, at which point they generate a restart file, and then automatically restart, so that intermediate results can be checked and analyzed, and because some experiments make use of multiple model variants, initialized from a restart file produced partway through a baseline run.
very high volumes of data generated – e.g. the CMIP5 runs currently underway at IPSL generate 6 terabytes per day, and in postprocessing, this goes up to 30 terabytes per day. Which is a problem, given that the NEC SX-9 being used for these runs has a 4 terabyte work disk and a 35 terabyte scratch disk. It’s getting increasingly hard to move the data to the tape archive fast enough.

Everyone seems to have underestimated the volumes of data generated from these CMIP5 runs. The implication is that data throughput rates are becoming a more important factor than processor speed, which may mean that climate computing centres require a different architecture than most high performance computing centres offer.

Anyway, I was going to write more about the infrastructure needed for this data handling problem, but Bryan Lawrence beat me to it, with his presentation to the NSF cyberinfrastructure “data task force”. He makes excellent points about the (lack of) scaleability of the current infrastructure, and the social and cultural issues with questions of how people get credit for the work they put into this infrastructure, and the issues of data curation and trust. Which means the danger is we will create a WORN (write-once, read-never) archive with all this data…!

Epistemology of Climate Modeling

28. September 2010 · 3 comments · Categories: climate modeling

This will keep me occupied with good reads for the next few weeks – this month’s issue of the Journal Studies in History and Philosophy of Modern Physics is a special on climate modeling. Here’s the table of contents:

Matthias Heymann (guest editor’s introduction): Understanding and misunderstanding computer simulation: The case of atmospheric and climate science—An introduction
Vladimir Jankovic: Climates as commodities: Jean Pierre Purry and the modelling of the best climate on Earth
Spencer Weart: The development of general circulation models of climate
Matthias Heymann: Lumping, testing, tuning: The invention of an artificial chemistry in atmospheric transport modeling
Gabriele Gramelsberger: Conceiving processes in atmospheric models—General equations, subscale parameterizations, and ‘superparameterizations’
Hélène Guillemot: Connections between simulations and observation in climate computer modeling. Scientist’s practices and “bottom-up epistemology” lessons
Johannes Lenhard, Eric Winsberg: Holism, entrenchment, and the future of climate model pluralism
Wendy S. Parker: Predicting weather and climate: Uncertainty, ensembles and probability
Mikaela Sundberg: Cultures of simulations vs. cultures of calculations? The development of simulation practices in meteorology and astrophysics
Amy Dahan: Putting the Earth System in a numerical box? The evolution from climate modeling toward global change

Some very provocative titles there. I’m curious to see how much their observations cohere with my own…

Verification and Validation of Earth System Models

18. September 2010 · 10 comments · Categories: climate modeling

I’ve been meaning to write a summary of the V&V techniques used for Earth System Models (ESMs) for ages, but never quite got round to it. However, I just had to put together a piece for a book chapter, and thought I would post it here to see if folks have anything to add (or argue with)).

Verification and Validation for ESMs is hard because running the models is an expensive proposition (a fully coupled simulation run can take weeks to complete), and because there is rarely a “correct” result – expert judgment is needed to assess the model outputs.

However, it is helpful to distinguish between verification and validation, because the former can often be automated, while the latter cannot. Verification tests are objective tests of correctness. These include basic tests (usually applied after each code change) that the model will compile and run without crashing in each of its standard configurations, that a run can be stopped and restarted from the restart files without affecting the results, and that identical results are obtained when the model is run using different processor layouts. Verification would also include the built-in tests for conservation of mass and energy over the global system on very long simulation runs.

In contrast, validation refers to science tests, where subjective judgment is needed. These include tests that the model simulates a realistic, stable climate, given stable forcings, that it matches the trends seen in observational data when subjected to historically accurate forcings, and that the means and variations (e.g. seasonal cycles) are realistic for the main climate variables (E.g. see Phillips et al, 2004).

While there is an extensive literature on the philosophical status of model validation in computational sciences (see for example, Oreskes et al (1994); Sterman (1994); Randall and Wielicki (1997); Stehr (2001)), much of it bears very little relation to practical techniques for ESM validation, and very little has been written on practical testing techniques for ESMs. In practice, testing strategies rely on a hierarchy of standard tests, starting with the simpler ones, and building up to the most sophisticated.

Pope and Davies (2002) give one such sequence for testing atmosphere models:

Simpliﬁed tests – e.g. reduce 3D equations of motion to 2D horizontal ﬂow (e.g. a shallow water testbed). This is especially useful if the reduction has an analytical solution, or if a reference solution is available. It also permits assessment of relative accuracy and stability over a wide parameter space, and hence is especially useful when developing new numerical routines.
Dynamical core tests – test for numerical convergence of the dynamics with physical parameterizations replaced by a simpliﬁed physics model (e.g. no topography, no seasonal or diurnal cycle, simpliﬁed radiation).
Single-column tests – allows testing of individual physical parameterizations separately from the rest of the model. A single column of data is used, with horizontal forcing prescribed from observations or from idealized proﬁles. This is useful for understanding a new parameterization, and for comparing interaction between several parameterizations, but doesn’t cover interaction with large-scale dynamics, nor interaction with adjacent grid points. This type of test also depends on availability of observational datasets.
Idealized aquaplanet – test the fully coupled atmosphere-ocean model, but with idealized sea-surface temperatures at all grid points. This allows for testing of numerical convergence in the absence of complications of orography and coastal effects.
Uncoupled model components tested against realistic climate regimes – test each model component in stand-alone mode, with a prescribed set of forcings. For example, test the atmosphere on its own, with prescribed sea surface temperatures, sea-ice boundary conditions, solar forcings, and ozone distribution. Statistical tests are then applied to check for realistic mean climate and variability.
Double-call tests. Run the full coupled model, and test a new scheme by calling both the old and new scheme at each timestep, but with the new scheme’s outputs not fed back in to the model. This allows assessment of the performance of new scheme in comparison with older schemes.
Spin-up tests. Run the full ESM for just a few days of simulation (typically between 1 and 5 days of simulation), starting from an observed state. Such tests are cheap enough that they can be run many times, sampling across the initial state uncertainty. Then the average of a large number of such tests can be analyzed (Pope and Davies suggest that 60 is enough for statistical signiﬁcance). This allows the results from different schemes to be compared, to explore differences in short term tendencies.

Whenever a code change is made to an ESM, in principle, an extensive set of simulation runs are needed to assess whether the change has a noticeable impact on the climatology of the model. This in turn requires a sub jective judgment for whether minor variations constitute acceptable variations, or whether they add up to a signiﬁcantly different climatology.

Because this testing is so expensive, a standard shortcut is to require exact reproducibility for minor changes, which can then be tested quickly through the use of bit comparison tests . These are automated checks over a short run (e.g. a few days of simulation time) that the outputs or restart ﬁles of two different model conﬁgurations are identical down to the least signiﬁcant bits. This is useful for checking that a change didn’t break anything it shouldn’t, but requires that each change be designed so that it can be “turned off” (e.g. via run-time switches) to ensure previous experiments can be reproduced. Bit comparison tests can also check that different conﬁgurations give identical results. In effect, bit reproducibility over a short run is a proxy for testing that two different versions of the model will give the same climate over a long run. It’s much faster than testing the full simulations, and it catches most (but not all) errors that would affect the model climatology.

Bit comparison tests do have a number of drawbacks, however, in that they restrict the kinds of change that can be made to the model. Occasionally, bit reproducibility cannot be guaranteed from one version of the model to another, for example when there is a change of compiler, change of hardware, a code refactoring, or almost any kind of code optimization. The decision about whether to insist on bit reproducibility, or whether to allow it to be broken from one version of the model to the next, is a difficult trade-off between ﬂexibility and ease of testing.

A number of simple practices can be used to help improve code sustainability and remove coding errors. These include running the code through multiple compilers, which is effective because different compilers give warnings about different language features, and some allow poor or ambiguous code which others will report. It’s better to identify and remove such problems when they are ﬁrst inserted, rather than discover later on that it will takes months of work to port the code to a new compiler.

Building conservation tests directly into the code also helps. These would typically be part of the coupler, and can check the global mass balance for carbon, water, salt, atmospheric aerosols, and so on. For example the coupler needs to check that water ﬂowing from rivers enters the ocean; that the total mass of carbon is conserved as it cycles through atmosphere, oceans, ice, vegetation, and so on. Individual component models sometimes neglect such checks, as the balance isn’t necessarily conserved in a single component. However, for long runs of coupled models, such conservation tests are important.

Another useful strategy is to develop a veriﬁcation toolkit for each model component, and for the entire coupled system. These contain a series of standard tests which users of the model can run themselves, on their own platforms, to conﬁrm that the model behaves in the way it should in the local computation environment. They also provide the users with a basic set of tests for local code modiﬁcations made for a speciﬁc experiment. This practice can help to overcome the tendency of model users to test only the speciﬁc physical process they are interested in, while assuming the rest of the model is okay.

During development of model components, informal comparisons with models developed by other research groups can often lead to insights in how to improve the model, and also as a method for conﬁrming and identifying suspected coding errors. But more importantly, over the last two decades, model intercomparisons have come to play a critical role in improving the quality of ESMs through a series of formally organised Model Intercomparison Projects (MIPs).

In the early days, these projects focussed on comparisons of the individual components of ESMs, for example, the Atmosphere Model Intercomparison Project (AMIP), which began in 1990 (Gates, 1992). But by the time of the IPCC second assessment report, there was a widespread recognition that a more systematic comparison of coupled models was needed, which led to the establishment of the Coupled Model Intercomparison Pro jects (CMIP), which now play a central role in the IPCC assessment process (Meehl et al, 2000).

For example, CMIP3, which was organized for the fourth IPCC assessment, involved a massive effort by 17 modeling groups from 12 countries with 24 models (Meehl et al, 2007). As of September 2010, the list of MIPs maintained by the World Climate Research Program included 44 different model intercomparison projects (Pirani, 2010).

Model Intercomparison Projects bring a number of important beneﬁts to the modeling community. Most obviously, they bring the community together with a common purpose, and hence increase awareness and collaboration between different labs. More importantly, they require the participants to reach a consensus on a standard set of model scenarios, which often entails some deep thinking about what the models ought to be able to do. Likewise, they require the participants to deﬁne a set of standard evaluation criteria, which then act as benchmarks for comparing model skill. Finally, they also produce a consistent body of data representing a large ensemble of model runs, which is then available for the broader community to analyze.

The beneﬁts of these MIPs are consistent with reports of software benchmarking efforts in other research areas. For example, Sim et al (2003) report that when a research community that builds software tools come together to create benchmarks, they frequently experience a leap forward in research progress, arising largely from the insights gained from the process of reaching consensus on the scenarios and evaluation criteria to be used in the benchmark. However, the deﬁnition of precise evaluation criteria is an important part of the benchmark – without this, the intercomparison pro ject can become unfocussed, with uncertain outcomes and without the huge leap forward in progress (Bueler, 2008).

Another form of model intercomparison is the use of model ensembles (Collins, 2007), which increasingly provide a more robust prediction system than single models runs, but which also play an important role in model validation:

Multi-model ensembles – to compare models developed at different labs on a common scenario.
Multi-model ensembles using variants of a single model – to compare different schemes for parts of the model, e.g. different radiation schemes.
Perturbed physics ensembles – to explore probabilities of different outcomes, in response to systematically varying physical parameters in a single model.
Varied initial conditions within a single model – to test the robustness of the model, and to better quantify probabilities for predicted climate change signals.

What’s the pricetag on a Global Climate Model?

03. September 2010 · 5 comments · Categories: climate modeling

Here’s a question I’ve been asking a few people lately, ever since I asserted that climate models are big expensive scientific instruments: How expensive are we talking about? Unfortunately, it’s almost impossible to calculate. The effort of creating a climate model is tangled up with the scientific research, such that you can’t even reliably determine how much of a particular scientist’s time is “model development” and how much is “doing science”. The problem is that you can’t build the model without a lot of that “doing science” part, because the model is the result of a lot of thinking, experimentation, theory building, testing hypotheses, analyzing simulation results, and discussions with other scientists. Many pieces of the model are based on the equations or empirical results in published research papers; even if you’re not doing the research yourself, you still have to keep up with the literature, understand the state-of-the-art, and know which bits of research are mature enough to incorporate into the model.

So, my first cut, which will be an over-estimation, is that *all* of the effort at a climate modeling lab is necessary to build the model. Labs vary in size, but a typical climate modeling lab is of the order of 200 people (including scientists, technicians, and admin support). And most of the models I’ve looked at have been under steady development for twenty years or more. So, that gives us starting point of 200*20 = 4,000 person-years. Luckily, most scientists care more about science than salary, so they’re much cheaper than software professionals. Given we’ll have a mix of postdocs and senior scientists, let’s say average salary would be around $150,000 per year including benefits and other overheads. Thats $600 million.

Oh, and that doesn’t including the costs of equipping and operating a tier-2 supercomputing facility, as the climate model runs will easily keep such a facility fully loaded full time (and we’ll need to factor in the cost to replace the supercomputer every few years to take advantage of performance increases). In most cases, the supercomputing facilities are shared with other scientific uses of high performance computing. But there is one centre that’s dedicated to climate modeling, the DKRZ in Hamburg, which has an annual budget of around 30 million euro. Let’s pretend euros are dollars, and call that $30 million per year, which for 20 years gives us another $600 million. The latest supercomputer at DKRZ, Blizzard, cost 35 million euro. Let’s say we replace this every five years, and throw some more money in for many terabytes of data storage, that’ll get us to around $200 million for hardware.

Grand total: $1.4 billion.

Now, I said that’s an over-estimate. Over lunch today I quizzed some of the experts here at IPSL in Paris, and they thought that 1,000 person-years (50 persons per year for 20 years) was a better estimate of the actual model development effort. This seems reasonable – it means that only 1/4 of the research at my 200 person research institute directly contributes to model development, the rest is science that uses the model but isn’t essential for developing it. So, that brings the salary figure down to $150 million. I’ve probably got to do the same conversion for the supercomputing facilities – let’s say about 1/4 of the supercomputing capacity is reserved for model development and testing. That also feels about right: 5-10% of the capacity is reserved for test processes (e.g. the ones that run automatically every day to do the automated build-and-test process), and a further 10%-20% might be used for validation runs on development versions of the model.

That brings the grand total down to $350 million.

Now, it has been done for less than this. For example, the Canadian Climate Centre, CCCma, has a modeling team one tenth this size, although they do share a lot of code with the Canadian Meteorological Service. And their model isn’t as full-featured as some of the other GCMs (it also has a much smaller user base). As with other software projects, the costs don’t scale linearly with functionality: a team of 5 software developers can achieve much more than 1/10th of what a team of 50 can (cf The Mythical Man Month). Oh, and the computing costs won’t come down much at all – the CCCma model is no more efficient than other models. So we’re still likely to be above the $100 million mark.

Now, there are probably other ways of figuring it – so far we’ve only looked at the total cumulative investment in one of today’s world leading climate models. What about replacement costs? If we had to build a new model from scratch, using what we already know (rather than doing all the research over again), how much would that cost? Well, nobody has ever done this, but there are few experiences we could draw on. For example, the Max Planck Institute has been developing a new model from scratch, ICON, which uses a icosahedral grid and hence needs a new approach to the dynamics. The project has been going for 8 years. It started with just a couple of people, and has ramped up to about a dozen. But they’re still a long way from being done, and they’re re-using a lot of the physics code from their old model, ECHAM. On the other hand, its an entirely new approach to the grid structure, so a lot of the early work was pure research.

Where does that leave us? It’s really a complete guess, but I would suggest a team of 10 people (half of them scientists, half scientific programmers) could re-implement the old model from scratch (including all the testing and validation) in around 5 years. Unfortunately, climate science is a fast moving field. What we’d get at the end of 5 years is a model that, scientifically speaking, is 5 years out of date. Unless of course we also paid for a large research effort to bring the latest science into the model while we were constructing it, but then we’re back where we started. I think this means you can’t replace a state-of-the-art climate model for much less than the original development costs.

What’s the conclusion? The bottom line is that the development cost of a climate model is in the hundreds of millions of dollars.

I never said that!

01. September 2010 · 6 comments · Categories: climate modeling

I’m pleased to see that my recent paper, “Climate Change: A Software Grand Challenge” is getting some press attention. However, I’m horrified to see how it’s been distorted in the echo chamber of the media. Danny Bradbury, writing in the Guardian, gives his piece the headline “Climate scientists should not write their own software, says researcher“. Aaaaaaargh! Nooooo! That’s the exact opposite of what I would say!

Our research shows that earth system models, the workhorses of climate science, appear to have very few bugs, and produce remarkably good simulations of past climate. One of the most important success factors is that the code is written by the scientists themselves, as they understand the domain inside out. Now, of course, this leads to other problems, for instance the code is hard to understand, and hard to modify. And the job of integrating the various components of the models is really hard. But there are no obvious solutions to fix this without losing this hands-on relationship between the scientists and the code. Handing the code development over to software professionals is likely to be a disaster.

I’ve posted a comment on Bradbury’s article, but ~~I have very little hope he’ll alter the headline, as it obviously plays into a storyline that’s popular with denialists right now~~ (see update, below).

Some other reports:

How Coders Can Help Fight Climate Change (MIT Technology Review) – very good!
BusinessGreen.com – a repost of the the Bradbury piece in the Guardian [but now with a corrected headline. Excellent!]

Update (2/9/10): Well that’s a delight! I just got off the overnight train to Paris, and discover that Danny has commented here, and wants to put everything right, and has already corrected the headline in the BusinessGreen version. So, apologies to Danny for doubting him, and also, thanks for restoring my faith in journalism. As is clear in some of the comments, it’s easy to see how one might draw the conclusion that climate scientists shouldn’t write their own code from a reading of my paper. It’s a subtle point, so I probably need to write a longer piece on this to explain…

Update #2 (later that same day): And now the Guardian headline has been changed too. Victory for honest journalism!

Scientific bricolage and what to do about it

19. August 2010 · 3 comments · Categories: climate modeling

I’ve pointed out a number of times that the software processes used to build the Earth System Models used in climate science don’t look anything like conventional software engineering practices. One very noticeable difference is the absence of detailed project plans, estimates, development phases, etc. While scientific steering committees do discuss long term strategy and set high level goals for the development of the model, the vast majority of model development work occurs bottom-up, through a series of open-ended, exploratory changes to the code. The scientists who work most closely with the models get together and decide what needs doing, typically on a week-to-week basis. Which is a little like agile planning, but without any of the agile planning techniques. Is this the best approach? Well, if the goal was to deliver working software to some external customer by a certain target date, then probably not. But that’s not the goal at all – the goal is to do good science. Which means that much of the work is exploratory and opportunistic. It’s difficult to plan model development in any detail, because it’s never clear what will work, nor how long it will take to try out some new idea. Nearly everything that’s worth doing to improve the model hasn’t been done before.

This approach also favours a kind of scientific bricolage. Imagine we have sketched out a conceptual architecture for an earth system model. The conventional software development approach would be to draw up a plan to build each of the components on a given timeline, such that they would all be ready by some target date for integration. And it would fail spectacularly, because it would be impossible to estimate timelines for each component – each part involves significant new research. The best we can do is to get groups of scientists to go off and work on each subsystem, and wait to see what emerges. And to be willing to try incorporating new pieces of code whenever they seem to be mature enough, no matter where they came from.

So we might end up with a coupled earth system model where each of the major components was built at a different lab, each was incorporated into the model at a different stage in its development, and none of this was planned long in advance. And, as a consequence, each component has its own community of developers and users who have goals that often diverge from the goals of the overall earth system model. Typically, each community wants to run its component model in stand-alone model, to pursue scientific questions specific to that subfield. For example, ocean models are built by oceanographers to study oceanography. Plant growth models are built by biologists to study the carbon cycle. And so on.

One problem is that if you take components from each of these communities to incorporate into a coupled model, you don’t want to fork the code. A fork would give you the freedom to modify the component to make it work in the coupled scheme. But, as with forking in open source projects, is nearly always a mistake. It fragments the community, and means the forked copy no longer gets the ongoing improvements to the original software (or more precisely, it quickly becomes too costly to transplant such improvements into the forked code). Access to the relevant community of expertise and their ongoing model improvements are at least as important as any specific snapshot of their code, otherwise the coupled model will fail to keep up with the latest science. Which means a series of compromises must be made – some changes might be necessary to make the component work in a coupled scheme, but these must not detract from the ability of the community to continue working with the component as a stand-alone model.

So, building an earth system model means assembling a set of components that weren’t really designed to work together, and a continual process of negotiation between the requirements for the entire coupled model and the requirements of the individual modeling communities. The alternative, re-building each component from scratch, doesn’t make sense financially or scientifically. It would be expensive and time consuming, and you’d end up with untested software, that scientifically, is several years behind the state-of-the-art. [Actually, this might be true of any software: see this story of the netscape rebuild].

Over the long term, a set of conventions have emerged that help to make it easier to couple together components built by different communities. These include the basic data formating and message passing standards, as well as standard couplers. And more recently, modeling frameworks, metadata standards and data sharing infrastructure. But as with all standardization efforts, it takes a long time (decades?) for these to be accepted across the various modeling communities, and there is always resistance, in part because meeting the standard incurs a cost and usually detracts from the immediate goals of each particular modeling community (with the benefits accruing elsewhere – specifically to those interested in working with coupled models). Remember: these models are expensive scientific instruments. Changes that limit the use of the component as a standalone model, or which tie it to a particular coupling scheme, can diminish its value to the community that built it.

So, we’re stuck with the problem of incorporating a set of independently developed component models, without the ability to impose a set of interface standards on the teams that build the components. The interface definitions have to be continually re-negotiated. Bryan Lawrence has some nice slides on the choices, which he characterizes as the “coupler approach” and the “framework approach” (I shamelessly stole his diagrams…)

The coupler approach leaves the models almost unchanged, with a communication library doing any necessary transformation on the data fields.

The framework approach splits the original code into smaller units, adapting their data structures and calling interfaces, allowing them to be recombined in a more appropriate calling hierarchy

The advantage of the coupler approach is that it requires very little change to the original code, and allows the coupler itself to be treated as just another stand-alone component that can be re-used by other labs. However, it’s inefficient, and seriously limits the opportunities to optimize the run configuration: while the components can run in parallel, the coupler must still wait on each component to do its stuff.

The advantage of the framework approach is that it produces a much more flexible and efficient coupled model, with more opportunities to lay out the subcomponents across a parallel machine architecture, and a greater ability to plug other subcomponents in as desired. The disadvantage is that component models might need substantial re-factoring to work in the framework. The trick here is to get the framework accepted as a standard across a variety of different modeling communities. This is, of course, a bit of a chicken-and-egg problem, because its advantages have to be clearly demonstrated with some success stories before such acceptance can happen.

There is a third approach, adopted by some of the bigger climate modeling labs: build everything (or as much as possible) in house, and build ad hoc interfaces between various components as necessary. However, as earth system models become more complex, and incorporate more and more different physical, chemical and biological processes, the ability to do it all in-house is getting harder and harder. This is not a viable long term strategy.