A common cry from climate contrarians is that climate models need better verification and validation (V&V), and in particular, that they need Independent V&V (aka IV&V). George Crews has been arguing this for a while, and now Judith Curry has taken up the cry. Having spent part of the 1990’s as lead scientist at NASA’s IV&V facility, and the last few years studying climate model development processes, I think I can offer some good insights into this question.

The short answer is “no, they don’t”. The slightly longer answer is “if you have more money to spend to enhance the quality of climate models, spending it on IV&V is probably the least effective thing you could do”.

The full answer involves deconstructing the question, to show that it is based on three incorrect assumptions about climate models: (1) that there’s some significant risk to society associated with the use of climate models; (2) that the existing models are inadequately tested / verified / validated / whatevered; and (3) that trust in the models can be improved by using an IV&V process. I will demonstrate what’s wrong with each of these assumptions, but first I need to explain what IV&V is.

Independent Verification and Validation (IV&V) is a methodology developed primarily in the aerospace industry for reducing the risk of software failures, by engaging a separate team (separate from the software development team, that is) to perform various kinds of testing and analysis on the software as it is produced. NASA adopted IV&V for development of the flight software for the space shuttle in the 1970’s. Because IV&V is expensive (it typically adds 10%-20% to the cost of a software development contract), NASA tried to cancel the IV&V on the shuttle in the early 1980’s, once the shuttle was declared operational. Then, of course, the Challenger disaster occurred. Although software wasn’t implicated, a consequence of the investigation was the creation of the Leveson committee, to review the software risk. Leveson’s committee concluded that far from cancelling IV&V, NASA needed to adopt the practice across all of its space flight programs. As a result of the Leveson report, the NASA IV&V facility was established in the early 1990’s, as a centre of expertise for all of NASA’s IV&V contracts. In 1995, I was recruited as lead scientist at the facility, and while I was there, our team investigated the operational effectiveness of the IV&V contracts on the Space Shuttle, International Space Station, Earth Observation System, Cassini, as well as a few other smaller programs. (I also reviewed the software failures on NASA’s Mars missions in the 1990’s, and have a talk about the lessons learned.)

The key idea for IV&V is that when NASA puts out a contract to develop flight control software, it also creates a separate contract with a different company, to provide an ongoing assessment of software quality and risk as the development proceeds. One difficulty with IV&V contracts in the US aerospace industry is that it’s hard to achieve real independence, because industry consolidation has left very few aerospace companies available to take on such contracts, and they’re not sufficiently independent from one another.

NASA’s approach demands independence along three dimensions:

  • managerial independence (the IV&V contractor is free to determine how to proceed, and where to devote effort, independently of both the software development contractor and the customer);
  • financial independence (the funding for the IV&V contract is separate from the development contract, and cannot be raided if more resources are needed for development); and
  • technical independence (the IV&V contractor is free to develop its own criteria, and apply whatever V&V methods and tools it deems appropriate).

This has led to the development of a number of small companies who specialize only in IV&V (thus avoiding any contractual relationship with other aerospace companies), and who tend to recruit ex-NASA staff to provide them with the necessary domain expertise.

For the aerospace industry, IV&V has been demonstrated to be a cost effective strategy to improve software quality and reduce risk. The problem is that the risks are extreme: software errors in the control software for a spacecraft or an aircraft are highly likely to cause loss of life, loss of the vehicle, and/or loss of the mission. There is a sharp distinction between the development phase and the operation phase for such software: it had better be correct when it’s launched. Which means the risk mitigation has to be done during development, rather than during operation. In other words, iterative/agile approaches don’t work – you can’t launch with a beta version of the software. The goal is to detect and remove software defects before the software is ever used in an operational setting. An extreme example of this was the construction of the space station, where the only full end-to-end construction of the system was done in orbit; it wasn’t possible to put the hardware together on the ground in order to do a full systems test on the software.

IV&V is essential for such projects, because it overcomes the natural confirmation bias of software development teams. Even the NASA program managers overseeing the contracts suffer from this – we discovered one case where IV&V reports on serious risks were being systematically ignored by the NASA program office, because the program managers preferred to believe the project was going well. We fixed this by changing the reporting structure, and routing the IV&V reports directly to the Office of Safety and Mission Assurance at NASA headquarters. The IV&V teams developed their own emergency strategy too – if they encountered a risk that they considered mission-critical, and couldn’t get the attention of the program office to address it, they would go and have a quiet word with the astronauts, who would then ensure the problem got seen to!

But IV&V is very hard to do right, because much of it is a sociological problem rather than a technical problem. The two companies (developer and IV&V contractor) are naturally set up in an adversarial relationship, but if they act as adversaries, they cannot be effective: the developer will have a tendency to hide things, and the IV&V contractor will have a tendency to exaggerate the risks. Hence, we observed that the relationship is most effective where there is a good horizontal communication channel between the technical staff in each company, and where they come to respect one another’s expertise. The IV&V contractor has to be careful not to swamp the communication channels with spurious low-level worries, and the development contractor must be willing to respond positively to criticism. One way this works very well is for the IV&V team to give the developers advance warning of any issues they plan to report up the hierarchy to NASA, so that the development contractor can have a solution in place even before NASA asks for it. For a more detailed account of these coordination and communication issues, see:

Okay, let’s look at whether IV&V is applicable to climate modeling. Earlier, I identified three assumptions made by people advocating it. Let’s take them one at a time:

1) The assumption that there’s some significant risk to society associated with the use of climate models.

A large part of the mistake here is to misconstrue the role of climate models in policymaking. Contrarians tend to start from an assumption that proposed climate change mitigation policies (especially any attempt to regulate emissions) will wreck the economies of the developed nations (or specifically the US economy, if it’s an American contrarian). I prefer to think that a massive investment in carbon-neutral technologies will be a huge boon to the world’s economy, but let’s set aside that debate, and assume for the sake of argument that whatever policy path the world takes, it’s incredibly risky, with a non-negligible probability of global catastrophe if the policies are either too aggressive or not aggressive enough, i.e. if the scientific assessments are wrong.

The key observation is that software does not play the same role in this system that flight software does for a spacecraft. For a spacecraft, the software represents a single point of failure. An error in the control software can immediately cause a disaster. But climate models are not control systems, and they do not determine climate policy. They don’t even control it indirectly – policy is set by a laborious process of political manoeuvring and international negotiation, in which the impact of any particular climate model is negligible.

Here’s what happens: the IPCC committees propose a whole series of experiments for the climate modelling labs around the world to perform, as part of a Coupled Model Intercomparison Project. Each participating lab chooses those runs they are most able to do, given their resources. When they have completed their runs, they submit the data to a public data repository. Scientists around the world then have about a year to analyze this data, interpret the results, compare the performance of the models, discuss findings at conferences and workshops, and publish papers. This results in thousands of publications from across a number of different scientific disciplines. The publications that make use of model outputs take their place alongside other forms of evidence, including observational studies, studies of paleoclimate data, and so on. The IPCC reports are an assessment of the sum total of the evidence; the model results from many runs of many different models are just one part of that evidence. Jim Hansen rates models as the third most important source of evidence for understanding climate change, after (1) paleoclimate studies and (2) observed global changes.
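
To make the analysis step a little more concrete, here’s a minimal sketch of the kind of multi-model comparison a scientist might run over the archived output. It’s written in Python with xarray; the file names, the variable name and the weighting scheme are illustrative assumptions, not a description of any particular lab’s workflow:

```python
# A minimal sketch of a multi-model comparison over archived model output.
# The file names, the variable name ("tas" = surface air temperature) and the
# cosine-latitude weighting are illustrative assumptions.
import glob
import numpy as np
import xarray as xr

def global_mean(da):
    """Area-weighted global mean of a (time, lat, lon) field."""
    weights = np.cos(np.deg2rad(da["lat"]))
    return da.weighted(weights).mean(dim=("lat", "lon"))

annual_series = {}
for path in glob.glob("cmip_archive/tas_*_historical.nc"):   # hypothetical files
    ds = xr.open_dataset(path)
    name = ds.attrs.get("model_id", path)
    # annual-mean, global-mean surface air temperature for this model
    annual_series[name] = global_mean(ds["tas"]).groupby("time.year").mean()

# Compare the models: multi-model mean and across-model spread, year by year
ensemble = xr.concat(list(annual_series.values()), dim="model")
print("Multi-model mean (K):", ensemble.mean(dim="model").round(2).values)
print("Across-model spread (K):", ensemble.std(dim="model").round(2).values)
```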

The consequences of software errors in a model, in the worst case, are likely to extend to no more than a few published papers being retracted. This is a crucial point: climate scientists don’t blindly publish model outputs as truth; they use model outputs to explore assumptions and test theories, and then publish papers describing the balance of evidence. Further papers then come along that add more evidence, or contradict the earlier findings. The assessment reports then weigh up all these sources of evidence.

I’ve been asking around for a couple of years for examples of published papers that were subsequently invalidated by software errors in the models. I’ve found several cases where a version of the model used in the experiments reported in a published paper was later found to contain an important software bug. But in none of those cases did the bug actually invalidate the conclusions of the paper. So even this risk is probably overstated.

The other point to make is that some twenty different labs around the world participate in the Model Intercomparison Projects that provide data for the IPCC assessments. That’s a level of software redundancy that is simply impossible in the aerospace industry. It’s likely that these 20+ models are not quite as independent as they might be (e.g. see Knutti’s analysis of this), but even so, the ability to run many different models on the same set of experiments, and to compare and discuss their differences, is really quite remarkable, and the Model Intercomparison Projects have been a major factor in driving the science forward in the last decade or so. It’s effectively a huge benchmarking effort for climate models, with all the benefits normally associated with software benchmarking (and worthy of a separate post – stay tuned).

So in summary, while there are huge risks to society of getting climate policy wrong, those risks are not software risks. A single error in the flight software for a spacecraft could kill the crew. A single error in a climate model can, at most, only affect a handful of the thousands of published papers on which the IPCC assessments are based. The actual results of a particular model run are far less important than the understanding the scientists gain about what the model is doing and why, and the nature of the uncertainties involved. The modellers know that the models are imperfect approximations of very complex physical, chemical and biological processes. Conclusions about key issues such as climate sensitivity are based not on particular model runs, but on many different experiments with many different models over many years, and the extent to which these experiments agree or disagree with other sources of evidence.

2) The assumption that the current models are inadequately tested / verified / validated / whatevered.

This is a common talking point among contrarians. Part of the problem is that while the modeling labs have evolved sophisticated processes for developing and testing their models, they rarely bother to describe these processes to outsiders – nearly all published reports focus on the science done with the models, rather than the modeling process itself. I’ve been working to correct this, with, first, my study of the model development processes at the UK Met Office, and more recently my comparative studies of other labs, and my accounts of the existing V&V processes. Some people have interpreted the latter as a proposal for what should be done, but it is not; it is an account of the practices currently in place across all of the labs I have studied.

A key point is that for climate models, unlike spacecraft flight controllers, there is no enforced separation between software development and software operation. A climate model is always an evolving, experimental tool; it’s never a finished product – even the prognostic runs done as input to the IPCC process are just experiments, requiring careful interpretation before any conclusions can be drawn. If the model crashes, or gives crazy results, the only damage is wasted time.

This means that an iterative development approach is the norm, which is far superior to the waterfall process used in the aerospace industry. Climate modeling labs have elevated the iterative development process to a new height: each change to the model is treated as a scientific experiment, where the change represents a hypothesis for how to improve the model, and a series of experiments is used to test whether the hypothesis was correct. This means that software development proceeds far more slowly than it does under commercial software practices (at least in terms of lines of code per day), but also that the models are continually tested and challenged by the people who know them inside out, and comparison with observational data is a daily activity.

The result is that climate models have very few bugs, compared to commercial software, when measured using industry standard defect density measures. However, although defect density is a standard IV&V metric, it’s probably a poor measure for this type of software – it’s handy for assessing risk of failure in a control system, but a poor way of assessing the validity and utility of a climate model. The real risk is that there may be latent errors in the model that mean it isn’t doing what the modellers designed it to do. The good news is that such errors are extremely rare: nearly all coding defects cause problems that are immediately obvious: the model crashes, or the simulation becomes unstable. Coding defects can only remain hidden if they have an effect that is small enough that it doesn’t cause significant perturbations in any of the diagnostic variables collected during a model run; in this case they are indistinguishable from the acceptable imperfections that arise as a result of using approximate techniques. The testing processes for the climate models (which in most labs include a daily build and automated test across all reference configurations) are sufficient to ensure that such problems are nearly always identified relatively early.
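
To give a flavour of what that automated testing can look like, here’s a minimal sketch of a tolerance-based regression check that compares key diagnostics from a short test run against a stored baseline. The file layout, variable list and tolerance are illustrative assumptions; each lab has its own test harness:

```python
# Minimal sketch of a tolerance-based regression check: compare key diagnostics
# from a short test run against a stored baseline from a trusted reference run.
# File names, variable list and tolerance are illustrative assumptions.
import sys
import xarray as xr

DIAGNOSTICS = ["tas", "pr", "psl"]   # e.g. surface temperature, precipitation, sea-level pressure
RELATIVE_TOLERANCE = 1e-10           # tight for bit-reproducible configurations, looser otherwise

def regression_check(test_file, baseline_file):
    test = xr.open_dataset(test_file)
    base = xr.open_dataset(baseline_file)
    failures = []
    for var in DIAGNOSTICS:
        diff = abs(test[var] - base[var]).max().item()
        scale = abs(base[var]).max().item() or 1.0
        if diff / scale > RELATIVE_TOLERANCE:
            failures.append(f"{var}: max relative difference {diff / scale:.3e}")
    return failures

if __name__ == "__main__":
    problems = regression_check("short_run/diagnostics.nc", "baseline/diagnostics.nc")
    if problems:
        print("Regression check FAILED:\n  " + "\n  ".join(problems))
        sys.exit(1)
    print("Regression check passed for all reference diagnostics.")
```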

This means that there are really only two serious error types that can lead to misleading scientific results: (1) misunderstanding of what the model is actually doing by the scientists who conduct the model experiments, and (2) structural errors, where specific earth system processes are omitted or poorly captured in the model. In flight control software, these would correspond to requirements errors, and would be probed by an IV&V team through specification analysis. Catching these in control software is vital because you only get one chance to get it right. But in climate science, these are science errors, and are handled very well by the scientific process: making such mistakes, learning from them, and correcting them are all crucial parts of doing science. The normal scientific peer review process handles these kinds of errors very well. Model developers publish the details of their numerical algorithms and parameterization schemes, and these are reviewed and discussed in the community. In many cases, different labs will attempt to build their own implementations from these descriptions, and in the process subject them to critical scrutiny. In other words, there is already an independent expert review process for the most critical parts of the models, using the normal scientific route of replicating one another’s techniques. Similarly, experimental results are published, and the data is made available for other scientists to explore.

As a measure of how well this process works for building scientifically valid models, one senior modeller recently pointed out to me that it’s increasingly the case now that when the models diverge from the observations, it’s often the observational data that turns out to be wrong. The observational data is itself error prone, and software models turn out to be an important weapon in identifying and eliminating such errors.

However, there is another risk here that needs to be dealt with. Outside of the labs where the models are developed, there is a tendency for scientists who want to make use of the models to treat them as black box oracles. Proper use of the models depends on a detailed understanding of their strengths and weaknesses, and the ways in which uncertainties are handled. If we have some funding available to improve the quality of climate models, it would be far better spent on improving the user interfaces, and better training of the broader community of model users.

The bottom line is that climate models are subjected to very intensive system testing, and the incremental development process incorporates a sophisticated regression test process that’s superior to most industrial software practices. The biggest threat to validity of climate models is errors in the scientific theories on which they are based, but such errors are best investigated through the scientific process, rather than through an IV&V process. Which brings us to:

3) The assumption that trust in the models can be improved by an IV&V process.

IV&V is essentially a risk management strategy for safety-critical software for which an iterative development strategy is not possible – where the software has to work correctly the first (and every) time it is used in an operational setting. Climate models aren’t like this at all. They aren’t safety critical; they can be used even while they are being developed (and hence are built by iterative refinement); and they solve complex, wicked problems, for which there are no clear correctness criteria. In fact, as a species of software development process, I’ve come to the conclusion they are dramatically different from any of the commercial software development paradigms that have been described in the literature.

A common mistake in the software engineering community is to think that software processes can be successfully transplanted from one organisation to another. Our comparative studies of different software organizations show that this is simply not true, even for organisations developing similar types of software. There are few, if any, documented cases of a software development organisation successfully adopting a process model developed elsewhere, without very substantial tailoring. What usually happens is that ideas from elsewhere are gradually infused and re-fashioned to work in the local context. And the evidence shows that every software organisation evolves its own development processes that are highly dependent on local context, and on the constraints they operate under. Far more important than a prescribed process is the development of a shared understanding within the software team. The idea of taking a process model that was developed in the aerospace industry, and transplanting it wholesale into a vastly different kind of software development process (climate modeling) is quite simply ludicrous.

For example, one consequence of applying IV&V is that it reduces flexibility for the development team, as they have to set clearer milestones and deliver workpackages on schedule (otherwise the IV&V team cannot plan its efforts). Because the development of scientific codes is inherently unpredictable, it would be almost impossible to plan and resource an IV&V effort. The flexibility to explore new model improvements opportunistically, and to adjust schedules to match varying scientific rhythms, is crucial to the scientific mission – locking the development into more rigid schedules to permit IV&V would be a disaster.

If you wanted to set up an IV&V process for climate models, it would have to be done by domain experts; domain expertise is the single most important factor in successful use of IV&V in the aerospace industry. This means it would have to be done by other climate scientists. But other climate scientists already do this routinely – it’s built into the Model Intercomparison Projects, as well as the peer review process and through attempts to replicate one another’s results. In fact the Model Intercomparison Projects already achieve far more than an IV&V process would, because they are done in the open and involve a much broader community.

In other words, the available pool of talent for performing IV&V is already busy using a process that’s far more effective than IV&V ever can be: it’s called doing science. Actually, I suspect that those people calling for IV&V of climate models are really trying to say that climate scientists can’t be trusted to check each other’s work, and that some other (unspecified) group ought to do the IV&V for them. However, this argument can only be used by people who don’t understand what IV&V is. IV&V works in the aerospace industry not because of any particular process, but because it brings in the experts – the people with grey hair who understand the flight systems inside out, and understand all the risks.

And remember that IV&V is expensive. NASA’s rule of thumb was an additional 10%-20% of the development cost. This cannot be taken from the development budget – it’s strictly an additional cost. Given my estimate of the development cost of a climate model as somewhere in the ballpark of $350 million, we’ll need to find another $35 million for each climate modeling centre to fund their IV&V contract. And if we had such funds to add to their budgets, I would argue that IV&V is one of the least sensible ways of spending this money. Instead, I would:

  • Hire more permanent software support staff to work alongside the scientists;
  • Provide more training courses to give the scientists better software skills;
  • Do more research into modeling frameworks;
  • Experiment with incremental improvements to existing practices, such as greater use of testing tools and frameworks, pair programming and code sprints;
  • Provide more support to grow the user communities (e.g. user workshops and training courses), and do more community building and beta testing;
  • Document the existing software development and V&V best practices, so that different labs can share ideas and experiences, and the process of model building becomes more transparent to outsiders.

To summarize, IV&V would be an expensive mistake for climate modeling. It would divert precious resources (experts) away from existing modeling teams, and reduce their flexibility to respond to the science. IV&V isn’t appropriate because this isn’t safety-critical software, it doesn’t have distinct development and operational phases, and the risks of software error are minor. There’s no single point of failure, because many labs around the world build their own models, and the normal scientific processes of experimentation, peer-review, replication, and model inter-comparison already provide a sophisticated process to examine the scientific validity of the models. Virtually all coding errors are detected in routine testing, and science errors are best handled through the usual scientific process, rather than through an IV&V process. Furthermore, there is only a small pool of experts available to perform IV&V on climate models (namely, other climate modelers) and they are already hard at work improving their own models. Re-deploying them to do IV&V of each other’s models would reduce the overall quality of the science rather than improving it.

(BTW I shouldn’t have had to write this article at all…)

29 Comments

  1. Pingback: Tweets that mention Do Climate Models need Independent Verification and Validation? | Serendipity -- Topsy.com

  2. Hi Steve,

    Great post, as usual for the technical content of your blog. Let me comment on just a couple of things. (I hope I get the time to comment further later. For example, I would like to comment later on the validation part of climate IV&V.)

    If the climate models are not “mission critical,” why devote the resources to it? Stop modeling the climate. Many climate stakeholders are just not that curious about the climate unless there is the potential for significant climate effects. They would rather spend that money elsewhere. Why can’t they? Because, IMHO, the climate models ARE mission critical. They are the only dynamic forecast tool we have.

    And why can’t stakeholders tell the modelers what will establish their confidence in the climate models rather than the other way around? Neither IV&V processes nor the science are settled. After all, that is part of the meaning of “independent.” All stakeholders get to decide the meta-issue of what constitutes appropriate IV&V processes. The risk to one stakeholder can vary by orders of magnitude when compared to another. That affects the choice of processes themselves.

    And which stakeholders got to decide the climate models are scientific instruments rather than engineering instruments or policy instruments? BTW, I completely disagree with the concept of the climate models as “instruments.” There is not one thing in the real world they actually measure. They are forecast tools. But I will accept your terminology here.

    Are the appropriate IV&V processes the same for scientific, engineering, or policy instruments? No. I developed the IV&V processes for a high-level nuclear waste facility with a million year design life. We used standard software engineering IV&V processes. We did not use aerospace’s specialized IV&V processes. Nuclear facilities can be dangerous, but they can’t fly. In fact, computer models were not used at all for forecasting the climate in Nevada a million years into the future. Their usage would have been inappropriate. Would their usage have been appropriate for even ten thousand years? No. One thousand? One hundred? Thirty? A week? (Part of the validation issue I hope to get time for.)

    The cost of climate model IV&V, which will be significant, should be weighed against the confidence it will provide. Do the modelers care about the confidence stakeholders should place in the climate models? Of course they do. So I and others are becoming more and more confident that the appropriate IV&V resources will be expended.

    [Ooops, I didn’t mean to say it’s not mission-critical. I’ve fixed that. (more substantive comments below…) – Steve]

  3. Reading this blog entry and George Crews’ comment was enlightening, to say the least.

    Crews’ last paragraph seems to be saying that the main goal of an IV&V effort is simply a political one — to make stakeholders feel secure about the models. Does anyone who is remotely familiar with current climate change discussions (and I use that term quite loosely) think that any IV&V effort could move the needle in acceptance of or confidence in models? I sure don’t. The overwhelming majority of stakeholders — the mainstream public — won’t ever know the IV&V effort happened. Those who need no further confidence boost won’t be moved, obviously. And those who lack confidence in models now (and again, that’s an exceedingly loose way of describing the often virulent attacks on model and modelers we’ve all seen), will insist on one more round of “proof” after another, as they have one goal only: To delay collective action on climate change.

    I am NOT accusing Crews of being a delayer/denier; I know too little of him to make such a judgment. But I’m quite confident that investing resources in an IV&V effort won’t change anything in the real world, from the models themselves, how they’re used, to how outsiders view them, which makes it a very bad investment.

  4. @Lou Grinzo
    I understand the problem and sympathize. By analogy, there are those who would delay/deny nuclear facilities by any means available, and this has driven up the cost of such facilities to the point they are not being built. The waste facility I worked on was canceled, for example. We now don’t even have a final plan for nuclear waste disposal here in the USA. Much less building a solution for future generations.

    However, the solution to the quality problem is not to simply say: “We are the experts, we know what we are doing. Trust us and have confidence in us.” IMHO, Steve and others have tried this approach in the blogosphere for climate science/software. Even to the point of stretching the bounds of decorum. But I don’t think the effort has made much progress. Or maybe things like Climategate ruin their efforts.

    But there is an alternative to an insular approach that, IMHO, would not be a waste of effort. It is to structure and invest resources such that IV&V experts outside the field of climate science reach a consensus about climate science IV&V. This would carry weight. There is nothing to be done about the fringe. But there are many reasonable, educated people outside the field of climate science and modeling who are uneasy with their approach to software V&V and who all would be considerably reassured by such an outside consensus.

    Therefore, I think the work like Steve is doing is very important to research and provide the necessary objective evidence to these independent domain (IV&V domain) experts. I encourage Steve and cohorts to continue. But Steve is not independent himself. I think their task is to put climate model software processes in the fullest, best and brightest light possible. For others to judge.

  5. “But other climate scientists already do this routinely – it’s built into the Model Intercomparison Projects, as well as the peer review process and through attempts to replicate one another’s results. In fact the Model Intercomparison Projects already achieve far more than an IV&V process would, because they are done in the open and involve a much broader community.”

    I don’t ever see myself saying, “Don’t sweat the V&V fellas, you guys took your code and played at the Drag Prediction workshop, so let’s cut metal!”

    Model comparison and benchmarking is certainly an important part of creating knowledge, but that is a fundamentally different activity than providing credible decision support. In fact, you’ve pointed out before on this site how doing IPCC runs is sometimes seen as an impediment to doing science by the guys in the modeling centers.

  6. Great article, Steve, and I would basically agree.

    The question I would have for those demanding IV&V is: what do you think IV&V means in the context of climate science, given what Steve has said? That is, given that model groups run standard regression tests where possible (Held-Suarez, etc), and that correctness is ill-defined?

    You are limited to finding coding errors. While finding and fixing these is good, Steve has been showing that this is effectively irrelevant to the scientific results. It gives deniers a stick to beat modellers with, but doesn’t improve the science for the most part.

    V&V in science happens primarily within the scientific process itself. Do you have suggestions on how this can be improved on?

  7. @Alastair McKinstry
    Climate software IV&V is irrelevant to the scientific results? Two comments on that concept.

    One. What scientific results? Specifically, what experiments? Or put another way, why does the IPCC use the term “projections” rather than “predictions” when describing the output of the climate models?

    Two. Who cares about the science? Many climate stakeholders do not. They only care about the ability or lack thereof of the software to make predictions with practical application, not about its usefulness in helping scientists understand the climate. Does the confidence the models provide effectively lower the specific risks of climate change mitigation or inaction?

    The climate models MUST be used for more than just scientific purposes. And when a diverse set of stakeholders are involved, this point cannot be ignored. So the issue is not primarily within the scientific process itself. Software engineering quality assurance processes are also important.

    Perhaps a good way to put the point I am trying to make would be to say one of the overall design requirements of the climate models is that they have the ability to provide confidence in their results. And this is the function climate model IV&V performs.

    Finally, verification and validation are two different things. Coding errors are a verification issue. The issue I am addressing in this comment, fitness for intended use, is a validation issue.

  8. Great article. It’s nice to read one by someone with highly relevant experience.

    I am afraid many of the suggestions for how climate models are done come from people without much relevant experience, just as those most vociferous about doing lots of R&D to achieve energy breakthroughs seem to lack relevant R&D experience, or any such experience.

  9. @George Crews:

    “Projections” are used vs “predictions” when the outcome depends on policy inputs: i.e. the term is used to signal that it’s not a case of “this will happen” but “this would happen if”, depending on policy choices.

    The point Steve made is that the IPCC process, etc. is a summary of the science, not the climate model output. That’s why we all should care about all the science.
    The validation happens within the context of science, not software development.

    A concrete example: the THOR experiments on Atlantic circulation and its potential reduction with climate change. (Some of the model runs for this are done with a model I’m working with, EC-Earth. EC-Earth is also involved in CMIP5 runs, but not directly as part of this).
    Now, the things that are most likely to give incorrect results in the model are parameterisations for mixing, etc. in the ocean. This is validated scientifically by comparisons to observations, to paleoclimate results: does it match the Younger Dryas, etc.? I know of no way of validating this work without comparison to paleo results, etc., i.e. science.

    The CMIP5 comparisons within the IPCC process are important for teasing out what features of the models are responsible for what features that are observed. They also help define a measure of model error.

    All this validation happens within a scientific context, not a software development one. The measure of model error comes from comparing ensemble runs, both between and within models, to observation; it does not come from defects/kloc.
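
    A minimal sketch of that kind of error measure, with synthetic numbers standing in for real ensemble output and observations:

```python
# Minimal sketch: model error measured by comparing ensemble runs to observations
# (ensemble-mean RMSE) alongside the spread across members, not defects per kloc.
# The arrays are synthetic stand-ins for real model output and observations.
import numpy as np

rng = np.random.default_rng(0)
observations = rng.normal(loc=288.0, scale=1.0, size=100)        # e.g. a temperature series (K)
ensemble = observations + rng.normal(scale=0.5, size=(10, 100))  # 10 imperfect ensemble members

ensemble_mean = ensemble.mean(axis=0)
rmse = np.sqrt(np.mean((ensemble_mean - observations) ** 2))     # error vs. observations
spread = ensemble.std(axis=0).mean()                             # typical within-ensemble spread

print(f"Ensemble-mean RMSE vs observations: {rmse:.3f} K")
print(f"Mean ensemble spread: {spread:.3f} K")
```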

    So “Climate change stakeholders” should be concerned at the quality of the science, and its output, not the climate model output directly. This is the case in the IPCC process. Policy is built on the science, not model output.

    So back to Validation of the models: Can you please explain how you would validate the climate model output, given the lack of clear definitions of correctness for the models?

  10. George Crews wrote : “One. What scientific results? Specifically, what experiments? Or put another way, why does the IPCC use the term “projections” rather than “predictions” when describing the output of the climate models?”

    Projection. The term “projection” is used in two senses in the climate change literature. In general usage, a projection can be regarded as any description of the future and the pathway leading to it.

    However, a more specific interpretation has been attached to the term “climate projection” by the IPCC when referring to model-derived estimates of future climate.

    Forecast/Prediction. When a projection is designated “most likely” it becomes a forecast or prediction. A forecast is often obtained using physically-based models, possibly a set of these, outputs of which can enable some level of confidence to be attached to projections.

    IPCC Guidelines

  11. @Alastair McKinstry

    I am afraid I do not understand.

    You say: “All this validation happens within a scientific context, not a software development one.” Would you also say: “All this validation happens within a scientific context, not a mathematical one?” To me, the first makes about as much sense as the second. They both seem quite beside the point.

    I also do not understand the several statements regarding “the lack of clear definitions of correctness for the models.” Doesn’t this argue for IV&V rather than against it? It’s one of the things IV&V does: establish confidence in the models’ fitness for their intended uses.

    It’s times like this that I realize how poor my communication skills are. 🙂 Pretty frustrating. But I’ll think about what you said Alastair.

  12. @Alastair McKinstry

    Alastair, let me try this. On a personal note, I am neither an alarmist nor a denier. I am an “OMG, Big Science is gonna screw this climate problem up!” person. I rank Big Science along with Big Government and Big Business. In decreasing order of likability, but all orders of magnitude below, say, fluffy kittens.

    So I am not trying to decrease your convictions about climate science – but to INCREASE them. Mine too. To increase our confidence in the fitness for use of climate model outputs. (Despite my fear of the shortcomings of Big Science, it is possible for flawed institutions to produce flawless works. How?)

    You may say, or somebody, that they are already completely convinced, one way or another, so they don’t need no stinkin IV&V. But a sense of conviction has nothing to do with reality. However, the results of the application of the scientific method have shown that the process by which such a sense of conviction is reached can have everything to do with reality.

    This is my philosophical frame of reference for the role of software IV&V. It is a consensus formalization of what scientists have been doing for several centuries applied to modern scientific software. And there is already a considerable body of knowledge out there in other scientific/engineering fields that use computational software.

  13. @Alastair McKinstry

    “Do you have suggestions on how this can be improved on?”

    Yes. Some folks are starting to try them out. That paper points to the relevant literature; you’ll find that

    “correctness is ill-defined”

    and

    “lack clear definitions of correctness for the models”

    and even

    “there’s no clear correctness criteria”

    are all oft-repeated confusions about the true state of affairs. The actual state of affairs is that as soon as you choose your governing equation set, you can check correctness in a very rigorous and well-defined way.
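
    A minimal sketch of that kind of rigorous check – an order-of-accuracy (grid convergence) test on a toy discretised operator, purely for illustration, not any climate model’s actual verification suite:

```python
# Minimal sketch of code verification via an order-of-accuracy test.
# Once the governing equation and discretisation are chosen, the observed
# convergence rate either matches the theoretical order or it doesn't.
# Toy example: a centred difference for d2u/dx2 should converge at order 2.
import numpy as np

def second_derivative(u, dx):
    """Centred second difference on the interior points."""
    return (u[:-2] - 2.0 * u[1:-1] + u[2:]) / dx**2

errors, spacings = [], []
for n in (32, 64, 128, 256):
    x = np.linspace(0.0, 1.0, n + 1)
    dx = x[1] - x[0]
    u = np.sin(2.0 * np.pi * x)
    exact = -(2.0 * np.pi) ** 2 * np.sin(2.0 * np.pi * x[1:-1])
    errors.append(np.max(np.abs(second_derivative(u, dx) - exact)))
    spacings.append(dx)

observed_orders = [
    np.log(errors[i - 1] / errors[i]) / np.log(spacings[i - 1] / spacings[i])
    for i in range(1, len(errors))
]
print("Observed convergence orders:", [round(p, 2) for p in observed_orders])  # ~2.0
```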

  14. @George Crews:

    Though there are technical points I might take up, I’ll go meta instead. For whom is this Independent Verification and Validation being done? I think for climate models, Steve’s estimate of 10-20% (from his NASA work) is low, but let’s take the 20% figure. Budget realities being as they are, that means 20% less science being done. The benefit being that someone like you would be more confident about the remaining 80%.

    The minor meta: Suppose the IV+V came back and said that the examined model(s) had indeed correctly implemented all stated algorithms and so forth (everything that you could ask an IV+V report to say, whatever that is). What conclusions of yours about the science would change, by how much? What conclusions about policy would change and by how much?

    Main meta: Who else is this 20% being spent for? 20-25% of the US agrees with John Shimkus that God has promised that there will be no bad climate change, ever. It wasn’t lack of V+V, or IV+V that got them there. On the other hand, something around 50% agree with Steve that current V+V does the job ok (whether that translates to a desire for policy or what sort is a different question). 25-30% in between, but almost all of them are committed (see incoming Republicans in the house and senate) that if there were to be a policy response, it could only be national or international, and they are unalterably opposed to any national or international response. (I think they’re wrong on both counts, but that’s me.)

    The population of those who currently think that the models aren’t useful but who would change their minds based on an IV+V (but not improved matches to, say, paleoclimate tests, i.e., improved V+V) seems to me to be awfully small. Why should 20% of the climate modeling budget be spent just on making you and a few (very few) like-minded people happy? You can change the mind of folks like Shimkus? Barton?

    It’s a serious question. There can be reasons to spend 20% of a budget to make 1% of the people happy. Harder to come by when that 1% does not include congress.

  15. @jstults:
    Thanks for the reference.

    @George:
    Ok, another hopefully clearer example. Suppose I start with the hypothesis that Kolmogorov’s -5/3 power law for turbulence applies in the ocean. I code this up in viscosity terms as Smagorinsky did. This in classical terms represents my theory, which I want to compare to observation. Simply by looking at this theory, I cannot tell what it predicts, in the way I could with Newton’s laws in school and college. I use numerical models to get predictions, which I then compare to observations: for climate work these are frequently paleoclimate records, given the timescales that climates work on.
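
    A minimal, purely illustrative sketch of what “coding this up in viscosity terms” can look like – a Smagorinsky-type eddy viscosity on a toy 2D velocity field, with made-up constants and grid rather than anything from a real ocean model:

```python
# Minimal sketch of a Smagorinsky-type eddy viscosity: nu_t = (Cs * delta)^2 * |S|,
# where |S| is the magnitude of the resolved strain rate. 2D, uniform grid,
# illustrative constants only.
import numpy as np

def smagorinsky_viscosity(u, v, dx, dy, cs=0.17):
    dudx, dudy = np.gradient(u, dx, dy, edge_order=2)
    dvdx, dvdy = np.gradient(v, dx, dy, edge_order=2)
    # strain-rate magnitude |S| = sqrt(2 Sij Sij)
    s_mag = np.sqrt(2.0 * (dudx**2 + dvdy**2) + (dudy + dvdx)**2)
    delta = np.sqrt(dx * dy)                      # grid filter scale
    return (cs * delta)**2 * s_mag

# toy shear flow on a 1 km grid
x = y = np.arange(0.0, 50_000.0, 1000.0)
X, Y = np.meshgrid(x, y, indexing="ij")
u = 0.1 * np.sin(2 * np.pi * Y / 50_000.0)        # m/s, varying in y
v = np.zeros_like(u)
print("max eddy viscosity (m^2/s):", smagorinsky_viscosity(u, v, 1000.0, 1000.0).max())
```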

    Now, the results either match or don’t. In practice given the large range of spatial and time scales involved, the match will be imperfect, and I will try to analyze why it matches in some cases but not others. Then, if the work looks good enough, I publish.

    Now there are two sources of error in the problem: my theory is wrong, or my coding of the theory is wrong. What Steve has been pointing out (I think) is that it is unlikely that errors in the computer programming of this problem make it through to publication. Either the error is so big the model crashes, or the difference from observation is strange enough that it won’t pass the analysis step. The error in the theory normally dominates, but typically the results help advance our understanding enough that it is worth publishing, warts and all.

    Now note the relative importance of the climate model in the above: the climate model helps us interpret paleo results, and understand observed results; when they disagree all are equally under scrutiny. But in policy terms you can justify climate change activities more definitively as “paleoclimate results show this will happen”; This is what Jim Hansen means by these being more important. The climate model results show it more graphically (literally in terms of temperature graphs), and make more headlines, but are not themselves the strongest evidence.

    Now in principle any improvement in catching programming errors (IV&V) would be worth it; good to know you’ve got the coding right, even if the model is wrong. In practice Steve is pointing out that the analysis step is good enough to make that almost irrelevant; any detriment to the manpower involved there (pulling people over to other tasks) is bad.
    This analysis is what I mean by science: both within the model group before publication, and then by the larger scientific community.

    As a professional programmer (with a physics background) coming into the climate field (doing a PhD in the subject, while working on scaling parallel codes), I agree with Steve; I would argue (and think many agree) that the real benefit external software engineers can bring to the field is better tools for productivity: clearer programming languages, tools for refactoring old Fortran 77 codes, etc., rather than code review. People who understand the models in depth are at a premium: code review demands pulling people off the scientific development cycle outlined above. Instead, speed up that cycle (and hence fix theory/model errors), pull more people in by making the code clearer with modern tools.

    I think people misunderstand how few skilled modellers there are. Two years ago, I had a position to fill at work; a year-long contract working on climate codes. I naively approached the profs leading climate groups I know around Europe; surely in the coming downturn they would be happy to see their students in employment. I was savagely attacked for poaching. As a professional programmer, reading the papers based on the code, it took a year or two to grasp the outline of big codes (1-2 M lines) such as IFS, COSMOS, EC-Earth. IV&V would simply not be possible without pulling people out of this small “gene pool”. (Becoming productive in OS internals, or database internals, at previous employers was trivial by comparison). In summary, it’s not that IV&V is inherently bad, it’s that there are better solutions to the same problem given very finite manpower.

  16. I work on FAA regulated products. We use a DO-178B methodology for software development. In our world, V&V basically doubles the price of our development for the highest level of design assurance. Despite paying this price, V&V has not prevented substantial bugs from going out with products, or more importantly, prevented the wrong thing from being made. By this I mean V&V tests to your requirements — if your requirements are wrong, who cares what V&V finds. From what I can see V&V in practice is an expensive CYA exercise.

    The idea that V&V on climate models would convince anyone who is not yet convinced is naive. The idea that V&V on climate models would improve their accuracy or utility is probably also naive in that the money would probably be spent on more researchers/developers and running more tests.

  17. George: You talk of stakeholders, but you don’t appear to have done your stakeholder analysis very well. I call climate models scientific instruments, because they are used by the scientists themselves to help gain insight into scientific questions. They’re not built for anyone else, and they’re certainly not run by anyone else. The results are sometimes quoted to help illustrate the findings of climate science to policymakers and the general public. But when you say we need to help these other stakeholders improve their confidence, you’re confusing “confidence in the models” with “confidence in the science”. Unfortunately, nearly everyone outside the scientific community makes up his/her mind about how much they trust the science based on factors that have nothing to do with the actual results of that science, nor how it was conducted (although some commentators like to use perceived weaknesses in either as a post-hoc justification of their rejection of the science). But this is just whack-a-mole. Adding IV&V will not change anyone’s mind, will not improve the quality of the models, and will divert precious resources from where they are most needed.

  18. Josh: Good point, but correctness of the implementation against the governing equations is only one part of what most people would understand as “correctness criteria”. I would also include questions like “how do we know we have the coupling between the different processes correct?” and “how do we know we’ve correctly chosen the relevant physical processes to resolve for the given scientific question?” Of course, these are validation criteria, whereas yours is a verification criterion. Probably I should avoid terms like “correctness” and use the V & V words instead.

  19. @steve
    Well Steve, I guess we will just have to disagree on the importance and effectiveness of documenting and demonstrating software development life-cycle process for climate model software.

    But I do enjoy your postings describing the processes and approaches used by the various modeling groups. And thank you for letting me comment at length.

    [George: If that first paragraph is a summary of your argument, then we don’t disagree at all, in fact that was one of my suggestions at the end of my post. – Steve]

  20. “Of course, these are validation criteria, whereas yours is a verification criterion. Probably I should avoid terms like ‘correctness’ and use the V & V words instead.”

    Yes; it’s tough to have a useful discussion if everyone is using different definitions without knowing it; we all just talk past each other.

    “I call climate models scientific instruments, because they are used by the scientists themselves to help gain insight into scientific questions. They’re not built for anyone else, and they’re certainly not run by anyone else.”

    This highlights a division of labor question that the climate modeling / policy support community needs to come to grips with. This goes back to the distinction between knowledge creation and decision support I mentioned earlier; in other areas there are often “production shops” and “research shops”; maybe you don’t think this distinction is a useful one for climate modelers?

  21. Alastair McKinstry: “@jstults: Thanks for the reference.”

    You are welcome; one thing that our host and I agree on is that blogs can be useful open notebooks (we sometimes even agree on the definition of certain words : – ).

    “Now note the relative importance of the climate model in the above: the climate model helps us interpret paleo results, and understand observed results; when they disagree all are equally under scrutiny. But in policy terms you can justify climate change activities more definitively as “paleoclimate results show this will happen”; This is what Jim Hansen means by these being more important. The climate model results show it more graphically (literally in terms of temperature graphs), and make more headlines, but are not themselves the strongest evidence.”

    Using the climate models in a diagnostic method like you describe certainly doesn’t diminish their importance as part of “the evidence”. I’d see this, rather, as support for Steve’s argument that they are scientific instruments. It would be more correct to say “paleoclimate results show this probably happened.” Does paleoclimate really tell you what will happen? Finding useful analogies in the data to use for prediction is a pretty hard thing.

  22. From my experience in industrial software development (custom business software) I’d say that IV&V usually cannot tell you if you are doing the right thing (validation: will the software do something that is useful to the stakeholders?) and usually is not very effective in telling you if you are doing things right, that is, in finding bugs in the software (verification: will the software do what the specification prescribes?).

    The specification should be written by people who are domain experts, and if they are, it is unlikely that an outsider will be able to teach them much about the domain 🙂 The software developers should be able to think hard enough about their code to be able to find any obvious bugs, which implies that someone doing a casual inspection hardly has a chance to find the really interesting ones.

    But, usually, independent external experts are very good at doing a project audit, a quick assessment of whether:

    * there is a structured software development process and, if not, what is missing;

    * the software is well structured (there is an appropriate software architecture) and documented, and all developers understand the architecture and comply with the coding standards.

    Instead of setting up an ongoing IV&V process like the one in the aerospace industry that Steve describes, maybe it would be better to think about short-term project audits by software engineers, who don’t have to be domain experts – although it would be easier for them if they were, of course 🙂

  23. Pingback: Validating Climate Models | Serendipity

  24. It seems to me that climate model development processes fit approximately into an ‘agile’ camp, in that there is a very rapid turnaround between code development and the ‘users’. This suggests that other ‘agile’ techniques are good candidates for improvements. For example, more unit testing, with unit test development running in parallel with, and in advance of, code development, and automated regression testing.
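
    A minimal sketch of what such a unit test might look like, using a hypothetical saturation vapour pressure routine (the formula, reference value and tolerance are illustrative, not taken from any model):

```python
# Minimal sketch of a unit test for a single parameterisation, in the spirit of
# test-first development. The routine and its reference values are hypothetical
# illustrations, not code from any actual climate model.
import math
import unittest

def saturation_vapour_pressure(temperature_k):
    """Toy Magnus-type formula for saturation vapour pressure over water, in Pa."""
    t_c = temperature_k - 273.15
    return 610.94 * math.exp(17.625 * t_c / (t_c + 243.04))

class TestSaturationVapourPressure(unittest.TestCase):
    def test_reference_value_at_freezing(self):
        # roughly 611 Pa at 0 degrees C
        self.assertAlmostEqual(saturation_vapour_pressure(273.15), 610.94, delta=1.0)

    def test_monotonically_increasing_with_temperature(self):
        temps = [250.0, 270.0, 290.0, 310.0]
        values = [saturation_vapour_pressure(t) for t in temps]
        self.assertEqual(values, sorted(values))

if __name__ == "__main__":
    unittest.main()
```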

  25. (and IV&V, of course, is more-or-less the opposite of ‘agile’, at least when considering the distance between development and use).

  26. Josh: “in other areas there are often ‘production shops’ and ‘research shops’”

    I had lunch today with David Randall at CSU, and we talked about this idea. It hasn’t happened in climate modeling, but we both agree that in principle, it would be good. The problem is that the community is too small – there aren’t enough people who understand how to put together a climate model as it is; bifurcating the effort will make this shortfall even worse. David points out that part of the problem is that climate models are now so complex that nobody really understands the entire model; the other problem is that our grad schools aren’t producing people with the aptitude and enthusiasm for climate modeling. So really, it comes down to some difficult questions about priorities: given the serious shortage of good modellers, do we push ahead with the current approach in which progress at the leading edge of the science is prioritized, or do we split the effort to create these production shops? What matters for the IPCC at the moment is a good assessment of the current science, not some separate climate forecasting service. If a commercial market develops for the latter (which is possible, once people really start to get serious about climate change), then someone will have to figure out how to channel the revenues into training a new generation of modellers.

  27. ‘channel’?

    [oops. Fixed. Thanks – Steve]

  28. Thanks for sharing that Steve; that’s interesting. I’m pretty surprised by the “aren’t producing people with aptitude and enthusiasm for climate modeling” part; seems like this is a hot / growth area (maybe that impression is just due to the press coverage).

  29. jstults: I’m pretty surprised by the “aren’t producing people with aptitude and enthusiasm for climate modeling” part; seems like this is a hot / growth area (maybe that impression is just due to the press coverage).

    Funding is weak and sporadic; the political visibility of these issues often causes revenge-taking at the top of the funding hierarchy. Recent news, for instance, seems to be of drastic cuts in Canadian funding for climate science. (Can you elaborate on this, Steve?)

    The limited budgets lead to attachment to awkward legacy codes, which drives away the most ambitious programmers. The nature of the problem stymies the most mathematically adept who are inclined to look for more purity. Software engineers take a back seat to physical scientists with little regard for software design as a profession. All in all, the work is drastically ill-rewarded in proportion to its importance, and it’s fair to say that while it attracts good people, it’s not hard to imagine a larger group of much higher productivity and greater computational science sophistication working on this problem.

  30. Pingback: Should science models be separate from production models? | Serendipity

  31. Pingback: Curious weather on Jupiter « Wott's Up With That?

  32. The climate models are based on flawed climate theory and I have no doubt that they properly do represent the flawed theory.
    Climate science leaves out albedo.
    I have searched site after site […snip]

    [Silliest comment I’ve seen in ages. The entire theory of heat-trapping greenhouse gases wouldn’t make much sense without albedo. Are you sure you even know how to use a search engine? – Steve]

  33. Pingback: Special Pleading Continues « Models Methods Software

  34. Herman A. Pope

    @Herman A. Pope
    I searched this whole blog and the only two times Albedo was mentioned was in my post and your Reply to my post. You did leave out Albedo.
    […snip]

    [Wow. Months have passed, and you’ve learned nothing. Go read an introductory textbook. – Steve]

  35. Pingback: Post of Record « Models Methods Software
