I’m at the AGU meeting in San Francisco this week. The internet connections in the meeting rooms suck, so I won’t be twittering much, but will try and blog any interesting talks. But first things first! I presented my poster in the session on “Methodologies of Climate Model Evaluation, Confirmation, and Interpretation” yesterday morning. Nice to get my presentation out of the way early, so I can enjoy the rest of the conference.

Here’s my poster, and the abstract is below (click for the full sized version at the AGU ePoster site):

A Hierarchical Systems Approach to Model Validation


Discussions of how climate models should be evaluated tend to rely on either philosophical arguments about the status of models as scientific tools, or on empirical arguments about how well runs from a given model match observational data. These lead to quantitative measures expressed in terms of model bias or forecast skill, and ensemble approaches where models are assessed according to the extent to which the ensemble brackets the observational data.

Such approaches focus the evaluation on models per se (or more specifically, on the simulation runs they produce), as if the models can be isolated from their context. Such approaches may overlook a number of important aspects of the use of climate models:

  • the process by which models are selected and configured for a given scientific question.
  • the process by which model outputs are selected, aggregated and interpreted by a community of expertise in climatology.
  • the software fidelity of the models (i.e. whether the running code is actually doing what the modellers think it’s doing).
  • the (often convoluted) history that begat a given model, along with the modelling choices long embedded in the code.
  • variability in the scientific maturity of different components within a coupled earth system model.

These omissions mean that quantitative approaches cannot assess whether a model produces the right results for the wrong reasons, or conversely, the wrong results for the right reasons (where, say, the observational data is problematic, or the model is configured to be unlike the earth system for a specific reason).

Furthermore, quantitative skill scores only assess specific versions of models, configured for specific ensembles of runs; they cannot reliably make any statements about other configurations built from the same code.

Quality as Fitness for Purpose

The problem is that there is no such thing as “the model”. The body of code that constitutes a modern climate model actually represents an enormous number of possible models, each corresponding to a different way of configuring that code for a particular run. Furthermore, this body of code isn’t a static thing. The code is changed on a daily basis, through a continual process of experimentation and model improvement. This applies even to any specific “official release”, which again is just a body of code that can be configured to run as any of a huge number of different models, and again, is not unchanging – as with all software, there will be occasional bugfix releases applied to it, along with improvements to the ancillary datasets.

Evaluation of climate models should not be about “the model”, but about the relationship between a modelling system and the purposes to which it is put. More precisely, it’s about the relationship between particular ways of building and configuring models and the ways in which the runs produced by those models are used.

What are the uses of a climate model? They vary tremendously:

  • To provide inputs to assessments of the current state of climate science;
  • To explore the consequences of a current theory;
  • To test a hypothesis about the observational system (e.g. forward modeling);
  • To test a hypothesis about the calculational system (e.g. to explore known weaknesses);
  • To provide homogenized datasets (e.g. re-analysis);
  • To conduct thought experiments about different climates;
  • To act as a comparator when debugging another model;

In general, we can distinguish three separate systems: the calculational system (the model code); the theoretical system (current understandings of climate processes) and the observational system. In the most general sense, climate models are developed to explore how well our current understanding (i.e. our theories) of climate explains the available observations. And of course the inverse: what additional observations might we make to help test our theories?

We’re dealing with relationships between three different systems

Validation of the Entire Modeling System

When we ask questions about likely future climate change, we don’t ask the question of the calculational system, we ask it of the theoretical system; the models are just a convenient way of probing the theory to provide answers.
When society asks climate scientists for future projections, the question is directed at climate scientists, not their models. Modellers apply their judgment to select appropriate versions & configurations of the models to use, set up the runs, and interpret the results in the light of what is known about the models’ strengths and weaknesses and about any gaps between the computational models and the current theoretical understanding. And they add all sorts of caveats to the conclusions they draw from the model runs when they present their results.

Validation is not a post-hoc process to be applied to an individual “finished” model, to ensure it meets some criteria for fidelity to the real world. In reality, there is no such thing as a finished model, just many different snapshots of a large set of model configurations, steadily evolving as the science progresses. Knowing something about the fidelity of a given model configuration to the real world is useful, but not sufficient to address fitness for purpose. For this, we have to assess the extent to which climate models match our current theories, and the extent to which the process of improving the models keeps up with theoretical advances.


Our approach to model validation extends current approaches:

  • down into the detailed codebase to explore the processes by which the code is built and tested. Thus, we build up a picture of the day-to-day practices by which modellers make small changes to the model and test the effect of such changes (both in isolated sections of code, and on the climatology of a full model). The extent to which these practices improve the confidence and understanding of the model depends on how systematically this testing process is applied, and how many of the broad range of possible types of testing are applied. We also look beyond testing to other software practices that improve trust in the code, including automated checking for conservation of mass across the coupled system, and various approaches to spin-up and restart testing.
  • up into the broader scientific context in which models are selected and used to explore theories and test hypotheses. Thus, we examine how features of the entire scientific enterprise improve (or impede) model validity, from the collection of observational data, creation of theories, use of these theories to develop models, choices for which model and which model configuration to use, choices for how to set up the runs, and interpretation of the results. We also look at how model inter-comparison projects provide a de facto benchmarking process, leading in turn to exchanges of ideas between modelling labs, and hence advances in the scientific maturity of the models.

This layered approach does not attempt to quantify model validity, but it can provide a systematic account of how the detailed practices involved in the development and use of climate models contribute to the quality of modelling systems and the scientific enterprise that they support. By making the relationships between these practices and model quality more explicit, we expect to identify specific strengths and weaknesses of the modelling systems, particularly with respect to structural uncertainty in the models, and better characterize the “unknown unknowns”.

I had several interesting conversations at WCRP11 last week about how different the various climate models are. The question is important because it gives some insight into how much an ensemble of different models captures the uncertainty in climate projections. Several speakers at WCRP suggested we need an international effort to build a new, best of breed climate model. For example, Christian Jakob argued that we need a “Manhattan project” to build a new, more modern climate model, rather than continuing to evolve our old ones (I’ve argued in the past that this is not a viable approach). There have also been calls for a new international climate modeling centre, with the resources to build much larger supercomputing facilities.

The counter-argument is that the current diversity in models is important, and re-allocating resources to a single centre would remove this benefit. Currently around 20 or so different labs around the world build their own climate models to participate in the model inter-comparison projects that form a key input to the IPCC assessments. Part of the argument for this diversity of models is that when different models give similar results, that boosts our confidence in those results, and when they give different results, the comparisons provide insights into how well we currently understand and can simulate the climate system. For assessment purposes, the spread of the models is often taken as a proxy for uncertainty, in the absence of any other way of calculating error bars for model projections.

But that raises a number of questions. How well do the current set of coupled climate models capture the uncertainty? How different are the models really? Do they all share similar biases? And can we characterize how model intercomparisons feed back into progress in improving the models? I think we’re starting to get interesting answers to the first two of these questions, while the last two are, I think, still unanswered.

First, then, is the question of representing uncertainty. There are, of course, a number of sources of uncertainty. [Note that ‘uncertainty’ here doesn’t mean ‘ignorance’ (a mistake often made by non-scientists); it means, roughly, how big should the error bars be when we make a forecast, or more usefully, what does the probability distribution look like for different climate outcomes?]. In climate projections, sources of uncertainty can be grouped into three types:

  • Internal variability: natural fluctuations in the climate (for example, the year-to-year differences caused by the El Niño Southern Oscillation, ENSO);
  • Scenario uncertainty: the uncertainty over future carbon emissions, land use changes, and other types of anthropogenic forcings. As we really don’t know how these will change year-by-year in the future (irrespective of whether any explicit policy targets are set), it’s hard to say exactly how much climate change we should expect.
  • Model uncertainty: the range of different responses to the same emissions scenario given by different models. Such differences arise, presumably, because we don’t understand all the relevant processes in the climate system perfectly. This is the kind of uncertainty that a large ensemble of different models ought to be able to assess.

Hawkins and Sutton analyzed the impact of these different types of uncertainty on projections of global temperature over the range of a century. Here, Fractional Uncertainty means the ratio of the model spread to the projected temperature change (against a 1971-2000 mean):

This analysis shows that for short term (decadal) projections, the internal variability is significant. Finding ways of reducing this (for example by better model initialization from the current state of the climate) is important for the kind of near-term regional projections needed by, for example, city planners, and utility and insurance companies, etc. Hawkins & Sutton indicate with dashed lines some potential to reduce this uncertainty for decadal projections through better initialization of the models.

For longer term (century) projections, internal variability is dwarfed by scenario uncertainty. However, if we’re clear about the nature of the scenarios used, we can put scenario uncertainty aside and treat model runs as “what-if” explorations – if the emissions follow a particular pathway over the 21st Century, what climate response might we expect?

Model uncertainty remains significant over both short and long term projections. The important question here for predicting climate change is how much of this range of different model responses captures the real uncertainties in the science itself. In the analysis above, the variability due to model differences is about 1/4 of the magnitude of the mean temperature rise projected for the end of the century. For example, if a given emissions scenario leads to a model mean of +4°C, the model spread would be about 1°C, yielding a projection of +4±0.5°C. So is that the right size for an error bar on our end-of-century temperature projections? Or, to turn the question around, what is the probability of a surprise – where the climate change turns out to fall outside the range represented by the current model ensemble?
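The arithmetic in that example can be written out as a small sketch. It assumes, as the example above does, that the spread is centred on the model mean; the function names are mine, not anything from Hawkins and Sutton's analysis.

```python
def fractional_uncertainty(model_spread, mean_change):
    """Ratio of the model spread to the projected mean temperature change."""
    if mean_change == 0:
        raise ValueError("mean_change must be non-zero")
    return model_spread / mean_change


def projection_range(mean_change, fraction):
    """Return (low, high), assuming the spread is centred on the model mean."""
    half_width = fraction * mean_change / 2.0
    return mean_change - half_width, mean_change + half_width


# The worked example from the text: model mean of +4 °C, spread of 1/4 of that
low, high = projection_range(4.0, 0.25)
print(f"+{low:.1f} to +{high:.1f} °C")  # prints "+3.5 to +4.5 °C"
```

Note that this only reproduces the size of the error bar; it says nothing about whether that error bar is the right one, which is the question that follows.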

Just as importantly, is the model ensemble mean the most likely outcome? Or do the models share certain biases so that the truth is somewhere other than the multi-model mean? Last year, James Annan demolished the idea that the models cluster around the truth, and in a paper with Julia Hargreaves, provides some evidence that the model ensembles do a relatively good job of bracketing the observational data, and, if anything, the ensemble spread is too broad. If the latter point is correct, then the model ensembles over-estimate the uncertainty.

This brings me to the question of how different the models really are. Over the summer, Kaitlin Alexander worked with me to explore the software architecture of some of the models that I’ve worked with from Europe and N. America. The first thing that jumped out at me when she showed me her diagrams was how different the models all look from one another. Here are six of them presented side-by-side. The coloured ovals indicate the size (in lines of code) of each major model component (relative to other components in the same model; the different models are not shown to scale), and the coloured arrows indicate data exchanges between the major components (see Kaitlin’s post for more details):

There are clearly differences in how the components are coupled together (for example, whether all data exchanges pass through a coupler, or whether components interact directly). In some cases, major subcomponents are embedded as subroutines within a model component, which makes the architecture harder to understand, but may make sense from a scientific point of view, when earth system processes themselves are tightly coupled. However, such differences in the code might just be superficial, as the choice of call structure should not, in principle, affect the climatology.

The other significant difference is in the relative sizes of the major components. Lines of code isn’t necessarily a reliable measure, but it usually offers a reasonable proxy for the amount of functionality. So a model with an atmosphere model dramatically bigger than the other components indicates a model for which far more work (and hence far more science) has gone into modeling the atmosphere than the other components.

Compare, for example, the relative sizes of the atmosphere and ocean components for HadGEM3 and IPSLCM5A, which, incidentally, both use the same ocean model, NEMO. HadGEM3 has a much bigger atmosphere model, representing more science, or at least many more options for different configurations. In part, this is because the UK Met Office is an operational weather forecasting centre, and the code base is shared between NWP and climate research. Daily use of this model for weather forecasting offers many opportunities to improve the skill of the model (although improvement in skill in short term weather forecasting doesn’t necessarily imply improvements in skill for climate simulations). However, the atmosphere model is the biggest beneficiary of this process, and, in fact, the UK Met Office does not have much expertise in ocean modeling. In contrast, the IPSL model is the result of a collaboration between several similarly sized research groups, representing different earth subsystems.

But do these architectural differences show up as scientific differences? I think they do, but was finding this hard to analyze. Then I had a fascinating conversation at WCRP last week with Reto Knutti, who showed me a recent paper that he published with D. Masson, in which they analyzed model similarity from across the CMIP3 dataset. The paper describes a cluster analysis over all the CMIP3 models (plus three re-analysis datasets, to represent observations), based on how well they capture the full spatial field for temperature (on the left) and precipitation (on the right). The cluster diagrams look like this (click for bigger):

In these diagrams, the models from the same lab are coloured the same. Observational data are in pale blue (three observational datasets were included for temperature, and two for precipitation). Some obvious things jump out: the different observational datasets are more similar to each other than they are to any other model, but as a cluster, they don’t look any different from the models. Interestingly, models from the same lab tend to be more similar to one another, even when these span different model generations. For example, for temperature, the UK Met Office models HadCM3 and HadGEM1 are more like each other than they are like any other models, even though they run at very different resolutions, and have different ocean models. For precipitation, all the GISS models cluster together and are quite different from all the other models.

The overall conclusion from this analysis is that using models from just one lab (even in very different configurations, and across model generations) gives you a lot less variability than using models from different labs. Which does suggest that there’s something in the architectural choices made at each lab that leads to a difference in the climatology. In the paper, Masson & Knutti go on to analyze perturbed physics ensembles, and show that the same effect shows up here too. Taking a single model, and systematically varying the parameters used in the model physics still gives you less variability than using models from different labs.
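To make the "same lab versus different labs" comparison concrete, here is a toy sketch. It is not Masson & Knutti's method: it simply measures a root mean squared distance between flattened spatial fields and compares the average distance within a lab to the average distance between labs. The model names, the four-cell "grid" and all the values are invented for illustration.

```python
import math
from itertools import combinations


def distance(field_a, field_b):
    """Root mean squared difference between two equal-length flattened fields."""
    squared = [(a - b) ** 2 for a, b in zip(field_a, field_b)]
    return math.sqrt(sum(squared) / len(squared))


def mean_distances(fields, lab_of):
    """Return (mean within-lab distance, mean between-lab distance)."""
    within, between = [], []
    for m1, m2 in combinations(sorted(fields), 2):
        d = distance(fields[m1], fields[m2])
        (within if lab_of[m1] == lab_of[m2] else between).append(d)
    return sum(within) / len(within), sum(between) / len(between)


# Hypothetical models from two hypothetical labs (invented values)
fields = {
    "labA_v1": [1.0, 2.0, 3.0, 4.0],
    "labA_v2": [1.1, 2.1, 2.9, 4.2],
    "labB_v1": [2.0, 1.0, 4.0, 3.0],
    "labB_v2": [2.2, 0.9, 4.1, 3.1],
}
lab_of = {name: name.split("_")[0] for name in fields}

within, between = mean_distances(fields, lab_of)
print(f"within-lab: {within:.2f}, between-lab: {between:.2f}")
```

With real data, the distances would be computed over full temperature or precipitation fields, and a proper cluster analysis would then be run on the resulting distance matrix.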

There’s another followup question that I would like to analyze: do models that share major components tend to cluster together? There’s a growing tendency for a given component (e.g. an ocean model, an atmosphere model) to show up in more than one lab’s GCM. It’s not yet clear how this affects variability in a multi-model ensemble.

So what are the lessons here? First, there is evidence that the use of multi-model ensembles is valuable and important, and that these ensembles capture the uncertainty much better than multiple runs of a single model (no matter how it is perturbed). The evidence suggests that models from different labs are significantly different from one another both scientifically and structurally, and at least part of the explanation for this is that labs tend to have different clusters of expertise across the full range of earth system processes. Studies that compare model results with observational data (e.g. Hargreaves & Annan; Masson & Knutti) show that the observations look no different from just another member of the multi-model ensemble (or to put it in Annan and Hargreaves’ terms, the truth is statistically indistinguishable from another model in the ensemble).

It would appear that the current arrangement of twenty or so different labs competing to build their own models is a remarkably robust approach to capturing the full range of scientific uncertainty with respect to climate processes. And hence it doesn’t make sense to attempt to consolidate this effort into one international lab.

One of the questions I’ve been chatting to people about this week at the WCRP Open Science Conference is whether climate modelling needs to be reorganized as an operational service, rather than as a scientific activity. The two respond to quite different goals, and hence would be organized very differently:

  • An operational modelling centre would prioritize stability and robustness of the code base, and focus on supporting the needs of (non-scientist) end-users who want models and model results.
  • A scientific modelling centre focusses on supporting scientists themselves as users. The key priority here is to support the scientists’ need to get their latest ideas into the code, to run experiments and get data ready to support publication of new results. (This is what most climate modeling centres do right now).

Both need good software practices, but those practices would look very different in the case when the scientists are building code for their own experiments, versus serving the needs of other communities. There are also very different resource implications: an operational centre that serves the needs of a much more diverse set of stakeholders would need a much larger engineering support team in relation to the scientific team.

The question seems very relevant to the conference this week, as one of the running themes has been the question of what “climate services” might look like. Many of the speakers call for “actionable science”, and there has been a lot of discussion of how scientists should work with various communities who need knowledge about climate to inform their decision-making.

And there’s clearly a gap here, with lots of criticism of how it works at the moment. For example, here’s a great quote from Bruce Hewitson on the current state of climate information:

“A proliferation of portals and data sets, developed with mixed motivations, with poorly articulated uncertainties and weakly explained assumptions and dependencies, the data implied as information, displayed through confusing materials, hard to find or access, written in opaque language, and communicated by interface organizations only semi‐aware of the nuances, to a user community poorly equipped to understand the information limitations”

I can’t argue with any of that. But it raises the question of whether solving this problem requires a reconceptualization of climate modeling activities to make them much more like operational weather forecasting centres.

Most of the people I spoke to this week think that’s the wrong paradigm. In weather forecasting, the numerical models play a central role, and become the workhorse for service provision. The models are run every day, to supply all sorts of different types of forecasts to a variety of stakeholders. Sure, a weather forecasting service also needs to provide expertise to interpret model runs (and of course, also needs a vast data collection infrastructure to feed the models with observations). But in all of this, the models are absolutely central.

In contrast, for climate services, the models are unlikely to play such a central role. Take for example, the century-long runs, such as those used in the IPCC assessments. One might think that these model runs represent an “operational service” provided to the IPCC as an external customer. But this is a fundamentally mistaken view of what the IPCC is and what it does. The IPCC is really just the scientific community itself, reviewing and assessing the current state of the science. The CMIP5 model runs currently being done in preparation for the next IPCC assessment report, AR5, are conducted by, and for, the science community itself. Hence, these runs have to come from science labs working at the cutting edge of earth system modelling. An operational centre one step removed from the leading science would not be able to provide what the IPCC needs.

One can criticize the IPCC for not doing enough to translate the scientific knowledge into something that’s “actionable” for different communities that need such knowledge. But that criticism isn’t really about the modeling effort (e.g. the CMIP5 runs) that contributes to the Working Group 1 reports. It’s about how the implications of the Working Group 1 reports translate into useful information in Working Groups 2 and 3.

The stakeholders who need climate services won’t be interested in century-long runs. At most they’re interested in decadal forecasts (a task that is itself still in its infancy, and a long way from being ready for operational forecasting). More often, they will want help interpreting observational data and trends, and assessing impacts on health, infrastructure, ecosystems, agriculture, water, etc. While such services might make use of data from climate model runs, they generally won’t involve running models regularly in an operational mode. Instead the needs would be more focussed on downscaling the outputs from existing model run datasets. And sitting somewhere between current weather forecasting and long term climate projections is the need for seasonal forecasts and regional analysis of trends, attribution of extreme events, and so on.

So I don’t think it makes sense for climate modelling labs to move towards an operational modelling capability. Climate modeling centres will continue to focus primarily on developing models for use within the scientific community itself. Organizations that provide climate services might need to develop their own modelling capability, focussed more on high resolution, short term (decadal or shorter) regional modelling, and of course, on assessment models that explore the interaction of socio-economic factors and policy choices. Such assessment models would make use of basic climate data from global circulation models (for example, calculations of climate sensitivity, and spatial distributions of temperature change), but don’t connect directly with climate modeling.

We’ve just announced a special issue of the Open Access Journal Geoscientific Model Development (GMD):

Call for Papers: Special Issue on Community software to support the delivery of CMIP5

CMIP5 represents the most ambitious and computer-intensive model inter-comparison project ever attempted. Integrating a new generation of Earth system models and sharing the model results with a broad community has brought with it many significant technical challenges, along with new community-wide efforts to provide the necessary software infrastructure. This special issue will focus on the software that supports the scientific enterprise for CMIP5, including: couplers and coupling frameworks for Earth system models; the Common Information Model and Controlled Vocabulary for describing models and data; The development of the Earth System Grid Federation; the development of new portals for providing data access to different end-user communities; the scholarly publishing of datasets, and studies of the software development and testing processes used for the CMIP5 models. We especially welcome papers that offer comparative studies of the software approaches taken by different groups, and lessons learnt from community efforts to create shareable software components and frameworks.

See here for submission instructions. The call is open ended, as we can keep adding papers to the special issue. We’ve solicited papers from some of the software projects involved in CMIP5, but welcome unsolicited submissions too.

GMD operates an open review process, whereby submitted papers are posted to the open discussion site (known as GMDD), so that both the invited reviewers and anyone else can make comments on the papers and then discuss such comments with the authors, prior to a final acceptance decision for the journal. I was appointed to the editorial board earlier this year, and am currently getting my first taste of how this works – I’m looking forward to applying this idea to our special issue.

Valdivino, who is working on a PhD in Brazil, on formal software verification techniques, is inspired by my suggestion to find ways to apply our current software research skills to climate science. But he asks some hard questions:

1.) If I want to Validate and Verify climate models should I forget all the things that I have learned so far in the V&V discipline? (e.g. Model-Based Testing (Finite State Machine, Statecharts, Z, B), structural testing, code inspection, static analysis, model checking)
2.) Among all V&V techniques, what can really be reused / adapted for climate models?

Well, I wish I had some good answers. When I started looking at the software development processes for climate models, I expected to be able to apply many of the formal techniques I’ve worked on in the past in Verification and Validation (V&V) and Requirements Engineering (RE). It turns out almost none of it seems to apply, at least in any obvious way.

Climate models are built through a long, slow process of trial and error, continually seeking to improve the quality of the simulations (See here for an overview of how they’re tested). As this is scientific research, it’s unknown, a priori, what will work, what’s computationally feasible, etc. Worse still, the complexity of the earth systems being studied means it’s often hard to know which processes in the model most need work, because the relationship between particular earth system processes and the overall behaviour of the climate system is exactly what the researchers are working to understand.

Which means that model development looks most like an agile software development process, where both the requirements and the set of techniques needed to implement them are unknown (and unknowable) up-front. So they build a little, and then explore how well it works. The closest they come to a formal specification is a set of hypotheses along the lines of:

“if I change <piece of code> in <routine>, I expect it to have <specific impact on model error> in <output variable> by <expected margin> because of <tentative theory about climatic processes and how they’re represented in the model>”

This hypothesis can then be tested by a formal experiment in which runs of the model with and without the altered code become two treatments, assessed against the observational data for some relevant period in the past. The expected improvement might be a reduction in the root mean squared error for some variable of interest, or just as importantly, an improvement in the variability (e.g. the seasonal or diurnal spread).
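As a rough sketch of what such an experiment boils down to, the comparison below treats the control run and the altered run as two treatments and checks whether the altered code reduces the root mean squared error by at least the expected margin. The function names and the numbers are made up for illustration; real experiments compare full model output fields against observational datasets, not four-element lists.

```python
import math


def rmse(simulated, observed):
    """Root mean squared error between two equal-length series."""
    if len(simulated) != len(observed) or not observed:
        raise ValueError("series must be non-empty and of equal length")
    squared = [(s - o) ** 2 for s, o in zip(simulated, observed)]
    return math.sqrt(sum(squared) / len(squared))


def change_reduces_error(control, altered, observed, expected_margin=0.0):
    """True if the altered run beats the control run by at least expected_margin."""
    return rmse(control, observed) - rmse(altered, observed) >= expected_margin


# Invented values for one output variable over some past period
observed = [14.0, 14.2, 14.1, 14.4]
control = [14.5, 14.7, 14.6, 14.9]  # run without the code change
altered = [14.2, 14.4, 14.3, 14.6]  # run with the code change

print(change_reduces_error(control, altered, observed, expected_margin=0.2))
```

The same pattern would apply to a variability measure (say, the seasonal spread) in place of the RMSE.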

The whole process looks a bit like this (although, see Jakob’s 2010 paper for a more sophisticated view of the process):

And of course, the central V&V technique here is full integration testing. The scientists build and run the full model to conduct the end-to-end tests that constitute the experiments.

So the closest thing they have to a specification would be a chart such as the following (courtesy of Tim Johns at the UK Met Office):

This chart shows how well the model is doing on 34 selected output variables (click the graph to see a bigger version, to get a sense of what the variables are). The scores for the previous model version have been normalized to 1.0, so you can quickly see whether the new model version did better or worse for each output variable – the previous model version is the line at “1.0” and the new model version is shown as the coloured dots above and below the line. The whiskers show the target skill level for each variable. If the coloured dots are within the whisker for a given variable, then the model is considered to be within the variability range for the observational data for that variable. Colour-coded dots then show how well the current version did: green dots mean it’s within the target skill range, yellow mean it’s outside the target range, but did better than the previous model version, and red means it’s outside the target and did worse than the previous model version.
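The colour coding can be expressed as a few lines of code. This is only a sketch of the rule as described above, with two assumptions of mine: that the normalized scores are error-like (lower is better), and that a score exactly equal to the previous version's counts as "not better". The variable names and values are hypothetical.

```python
def classify(score, whisker, previous=1.0):
    """Colour for one output variable, given its normalized score.

    whisker is the (low, high) target skill range for that variable.
    Assumes lower scores are better; a tie with the previous version is red.
    """
    low, high = whisker
    if low <= score <= high:
        return "green"   # within the target skill range
    if score < previous:
        return "yellow"  # outside the target, but better than before
    return "red"         # outside the target, and no better than before


# Hypothetical variables: (normalized score, target range)
results = {
    "variable_1": (0.70, (0.6, 0.8)),
    "variable_2": (0.90, (0.6, 0.8)),
    "variable_3": (1.20, (0.6, 0.8)),
}
for name, (score, whisker) in results.items():
    print(name, classify(score, whisker))
```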

Now, as we know, agile software practices aren’t really amenable to any kind of formal verification technique. If you don’t know what’s possible before you write the code, then you can’t write down a formal specification (the ‘target skill levels’ in the chart above don’t count – these are aspirational goals rather than specifications). And if you can’t write down a formal specification for the expected software behaviour, then you can’t apply formal reasoning techniques to determine if the specification was met.

So does this really mean, as Valdivino suggests, that we can’t apply any of our toolbox of formal verification methods? I think attempting to answer this would make a great research project. I have some ideas for places to look where such techniques might be applicable. For example:

  • One important built-in check in a climate model is ‘conservation of mass’. Some fluxes move mass between the different components of the model. Water is an obvious one – it’s evaporated from the oceans, to become part of the atmosphere, and is then passed to the land component as rain, thence to the rivers module, and finally back to the ocean. All the while, the total mass of water across all components must not change. Similar checks apply to salt, carbon (actually this does change due to emissions), and various trace elements. At present, such checks are built into the models as code assertions. In some cases, flux corrections were necessary because of imperfections in the numerical routines or the geometry of the grid, although the models have improved enough that most flux corrections have been removed. But I think you could automatically extract from the code an abstracted model capturing just the ways in which these quantities change, and then use a model checker to track down and reason about such problems.
  • A more general version of the previous idea: In some sense, a climate model is a giant state-machine, but the scientists don’t ever build abstracted versions of it – they only work at the code level. If we build more abstracted models of the major state changes in each component of the model, and then do compositional verification over a combination of these models, it *might* offer useful insights into how the model works and how to improve it. At the very least, it would be an interesting teaching tool for people who want to learn about how a climate model works.
  • Climate modellers generally don’t use unit testing. The challenge here is that they find it hard to write down correctness properties for individual code units. I’m not entirely clear how formal methods could help, but it seems like someone with experience of patterns for temporal logic properties might be able to contribute. Clune and Rood have a forthcoming paper on this in November’s IEEE Software. I suspect this is one of the easiest places to get started for software people new to climate models.
  • There’s one other kind of verification test that is currently done by inspection, but might be amenable to some kind of formalization: the check that the code correctly implements a given mathematical formula. I don’t think this will be a high value tool, as the Fortran code is close enough to the mathematics that simple inspection is already very effective. But occasionally a subtle bug slips through – for example, I came across a case where the modellers discovered they had used the wrong logarithm (log_e in place of log_10), although this was more due to lack of clarity in the original published paper, rather than a coding error.
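To give a flavour of the conservation check from the first bullet, here's a toy sketch in Python of a water budget with an assertion on the total. It is not taken from any real model: the component names, the amounts and the `transfer` and `assert_conserved` helpers are all assumptions for illustration.

```python
import math

def transfer(reservoirs, source, sink, amount):
    """Move `amount` of water from one component to another (a flux)."""
    reservoirs[source] -= amount
    reservoirs[sink] += amount

def assert_conserved(reservoirs, expected_total, rel_tol=1e-12):
    """Fail loudly if the total mass across all components has drifted."""
    total = sum(reservoirs.values())
    assert math.isclose(total, expected_total, rel_tol=rel_tol), (
        f"water not conserved: {total} != {expected_total}"
    )

# Hypothetical components and arbitrary mass units.
water = {"ocean": 1000.0, "atmosphere": 10.0, "land": 30.0, "rivers": 2.0}
initial_total = sum(water.values())

for _ in range(100):
    transfer(water, "ocean", "atmosphere", 0.5)  # evaporation
    transfer(water, "atmosphere", "land", 0.5)   # rain
    transfer(water, "land", "rivers", 0.5)       # runoff
    transfer(water, "rivers", "ocean", 0.5)      # river discharge
    assert_conserved(water, initial_total)       # checked every timestep

print("total water unchanged:", sum(water.values()))
```

An abstracted model for a model checker would look a lot like the `transfer` calls here with everything else stripped away, which is why I think extracting one automatically might be feasible.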

Feel free to suggest more ideas in the comments!

Over the next few years, you’re likely to see a lot of graphs like this (click for a bigger version):

This one is from a forthcoming paper by Meehl et al, and was shown by Jerry Meehl in his talk at the Annecy workshop this week. It shows the results for just a single model, CCSM4, so it shouldn’t be taken as representative yet. The IPCC assessment will use graphs taken from ensembles of many models, as model ensembles have been shown to be consistently more reliable than any single model (the models tend to compensate for each other’s idiosyncrasies).
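The compensation effect is easy to see with a toy calculation. The numbers below are invented, and deliberately chosen so that the errors have opposite signs. Real model errors only partly cancel, so treat this as a sketch of the mechanism and nothing more.

```python
# Hypothetical errors (model minus observation, in degrees) for four models.
model_errors = {"model_a": 0.4, "model_b": -0.3, "model_c": 0.2, "model_d": -0.5}

# Error of the ensemble mean is the mean of the individual errors.
ensemble_error = sum(model_errors.values()) / len(model_errors)
smallest_individual = min(abs(error) for error in model_errors.values())

print(f"ensemble mean error:       {ensemble_error:+.2f}")
print(f"best single model (|err|): {smallest_individual:.2f}")
```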

But as a first glimpse of the results going into IPCC AR5, I find this graph fascinating:

  • The extension of a higher emissions scenario out to three centuries shows much more dramatically how the choices we make in the next few decades can profoundly change the planet for centuries to come. For IPCC AR4, only the lower scenarios were run beyond 2100. Here, we see that a scenario that gives us 5 degrees of warming by the end of the century is likely to give us that much again (well over 9 degrees) over the next three centuries. In the past, people talked too much about temperature change at the end of this century, without considering that the warming is likely to continue well beyond that.
  • The explicit inclusion of two mitigation scenarios (RCP2.6 and RCP4.5) gives good reason for optimism about what can be achieved through a concerted global strategy to reduce emissions. It is still possible to keep warming below 2 degrees. But, as I discuss below, the optimism is bounded by some hard truths about how much adaptation will still be necessary – even in this wildly optimistic case, the temperature drops only slowly over the three centuries, and still ends up warmer than today, even at the year 2300.

As the approach to these model runs has changed so much since AR4, a few words of explanation might be needed.

First, note that the zero point on the temperature scale is the global average temperature for 1986-2005. That’s different from the baseline used in the previous IPCC assessment, so you have to be careful with comparisons. I’d much prefer they used a pre-industrial baseline – to get that, you have to add 1 (roughly!) to the numbers on the y-axis on this graph. I’ll do that throughout this discussion.
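The conversion is trivial, but it's easy to apply in the wrong direction, so here it is spelled out. The offset of 1.0 is the rough figure from the paragraph above, and the function names are my own.

```python
PREINDUSTRIAL_OFFSET = 1.0  # degrees; the rough ("roughly!") offset quoted above

def to_preindustrial(anomaly_vs_1986_2005):
    """Convert a reading from the graph's y-axis to a pre-industrial baseline."""
    return anomaly_vs_1986_2005 + PREINDUSTRIAL_OFFSET

def to_graph_axis(anomaly_vs_preindustrial):
    """Convert a pre-industrial anomaly back to the graph's 1986-2005 baseline."""
    return anomaly_vs_preindustrial - PREINDUSTRIAL_OFFSET

# Where does the 2 degree threshold sit on the graph's y-axis?
print(to_graph_axis(2.0))
```

So the 2 degree threshold corresponds to a reading of about 1 on this graph's y-axis.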

I introduced the RCPs (“Representative Concentration Pathways”) a little in my previous post. Remember, these RCPs were carefully selected from the work of the integrated assessment modelling community, who analyze interactions between socio-economic conditions, climate policy, and energy use. They are representative in the sense that they were selected to span the range of plausible emissions paths discussed in the literature, both with and without a coordinated global emissions policy. They are pathways, as they specify in detail how emissions of greenhouse gases and other pollutants would change, year by year, under each set of assumptions. The pathway matters a lot, because it is cumulative emissions (and the relative amounts of different types of emissions) that determine how much warming we get, rather than the actual emissions level in any given year. (See this graph for details on the emissions and concentrations in each RCP).
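Here's a tiny illustration of why the pathway matters more than the end point. The two pathways below are invented (arbitrary units per decade, not taken from any RCP): both finish at the same emissions level, but their cumulative totals are very different.

```python
# Two hypothetical emissions pathways, one value per decade (arbitrary units).
# Both end at the same level, but they get there differently.
early_cut = [10, 8, 6, 5, 4, 4]
late_cut = [10, 12, 14, 12, 8, 4]

cumulative_early = sum(early_cut)
cumulative_late = sum(late_cut)

print("final level:", early_cut[-1], "vs", late_cut[-1])
print("cumulative: ", cumulative_early, "vs", cumulative_late)
```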

By the way, you can safely ignore the meaning of the numbers used to label the RCPs – they’re really just to remind the scientists which pathway is which. Briefly, the numbers represent the approximate anthropogenic forcing, in W/m², at the year 2100.

RCP8.5 and RCP6 represent two different pathways for a world with no explicit climate policy. RCP8.5 is at about the 90th percentile of the full set of non-mitigation scenarios described in the literature. So it’s not quite a worst case scenario, but emissions much higher than this are unlikely. One scenario that follows this path is a world in which renewable power supply grows only slowly (to about 20% of the global power mix by 2070) while most of a growing demand for energy is still met from fossil fuels. Emissions continue to grow strongly, and don’t peak before the end of the century. Incidentally, RCP8.5 ends up in the year 2100 with a similar atmospheric concentration to the old A1FI scenario in AR4, at around 900ppm CO2.

RCP6 (which is only shown to the year 2100 in this graph) is in the lower quartile of likely non-mitigation scenarios. Here, emissions peak by mid-century and then stabilize at a little below double current annual emissions. This is possible without an explicit climate policy because under some socio-economic conditions, the world still shifts (slowly) towards cleaner energy sources, presumably because the price of renewables continues to fall while oil starts to run out.

The two mitigation pathways, RCP2.6 and RCP4.5 bracket a range of likely scenarios for a concerted global carbon emissions policy. RCP2.6 was explicitly picked as one of the most optimistic possible pathways – note that it’s outside the 90% confidence interval for mitigation scenarios. The expert group were cautious about selecting it, and spent extra time testing its assumptions before including it. But it was picked because there was interest in whether, in the most optimistic case, it’s possible to stay below 2°C of warming.

Most importantly, note that one of the assumptions in RCP2.6 is that the world goes carbon-negative by around 2070. Wait, what? Yes, that’s right – the pathway depends on our ability to find a way to remove more carbon from the atmosphere than we produce, and to be able to do this consistently on a global scale by 2070. So, the green line in the graph above is certainly possible, but it’s well outside the set of emissions targets currently under discussion in any international negotiations.

RCP4.5 represents a more mainstream view of global attempts to negotiate emissions reductions. On this pathway, emissions peak before mid-century, and fall to well below today’s levels by the end of the century. Of course, this is not enough to stabilize atmospheric concentrations until the end of the century.

The committee that selected the RCPs warns against over-interpretation. They deliberately selected an even number of pathways, to avoid any implication that a “middle” one is the most likely. Each pathway is the result of a different set of assumptions about how the world will develop over the coming century, either with, or without climate policies. Also:

  • The RCPs should not be treated as forecasts, nor bounds on forecasts. No RCP represents a “best guess”. The high and low scenarios were picked as representative of the upper and lower ends of the range described in the literature.
  • The RCPs should not be treated as policy prescriptions. They were picked to help answer scientific questions, not to offer specific policy choices.
  • There isn’t a unique socio-economic scenario driving each RCP – there are multiple sets of conditions that might be consistent with a particular pathway. Identifying these sets of conditions in more detail is an open question to be studied over the next few years.
  • There’s no consistent logic to the four RCPs, as each was derived from a different assessment model. So you can’t, for example, adjust individual assumptions to get from one RCP to another.
  • The translation from emissions profiles (which the RCPs specify) into atmospheric concentrations and radiative forcings is uncertain, and hence is also an open research question. The intent is to study these uncertainties explicitly through the modeling process.

So, we have a set of emissions pathways chosen because they represent “interesting” points in the space of likely global socio-economic scenarios covered in the literature. These are the starting point for multiple lines of research by different research communities. The climate modeling community will use them as inputs to climate simulations, to explore temperature response, regional variations, precipitation, extreme weather, glaciers, sea ice, and so on. The impacts and adaptation community will use them to explore the different effects on human life and infrastructure, and how much adaptation will be needed under each scenario. The mitigation community will use them to study the impacts of possible policy choices, and will continue to investigate the socio-economic assumptions underlying these pathways, to give us a clearer account of how each might come about, and to produce an updated set of scenarios for future assessments.

Okay, back to the graph. This represents one of the first available sets of temperature outputs from a Global Climate Model for the four RCPs. Over the next two years, other modeling groups will produce data from their own runs of these RCPs, to give us a more robust set of multi-model ensemble runs.

So the results in this graph are very preliminary, but if the results from other groups are consistent with them, here’s what I think it means. The upper path, RCP8.5, offers a glimpse of what happens if economic development and fossil fuel use continue to grow the way they have over the last few decades. It’s hard to imagine much of the human race surviving the next few centuries under this scenario. The lowest path, RCP2.6, keeps us below the symbolically important threshold of 2 degrees of warming, but then doesn’t bring us down much from that throughout the coming centuries. And that’s a pretty stark result: even if we do find a way to go carbon-negative by the latter part of this century, the following two centuries still end up hotter than it is now. All the while that we’re re-inventing the entire world’s industrial basis to make it carbon-negative, we also have to be adapting to a global climate that is warmer than any experienced since the human species evolved.

[By the way: the 2 degree threshold is probably more symbolic than it is scientific, although there’s some evidence that this is the point above which many scientists believe positive feedbacks would start to kick in. For a history of the 2 degree limit, see Randalls 2010].

I’m on my way back from a workshop on Computing in the Atmospheric Sciences, in Annecy, France. The opening keynote, by Gerald Meehl of NCAR, gave us a fascinating overview of the CMIP5 model experiments that will form a key part of the upcoming IPCC Fifth Assessment Report. I’ve been meaning to write about the CMIP5 experiments for ages, as the modelling groups were all busy getting their runs started when I visited them last year. As Jerry’s overview was excellent, this gives me the impetus to write up a blog post. The rest of this post is a summary of Jerry’s talk.

Jerry described CMIP5 as “the most ambitious and computer-intensive inter-comparison project ever attempted”, and having seen many of the model labs working hard to get the model runs started last summer, I think that’s an apt description. More than 20 modelling groups around the world are expected to participate, supplying a total estimated dataset of more than 2 petabytes.

It’s interesting to compare CMIP5 to CMIP3, the model intercomparison project for the last IPCC assessment. CMIP3 began in 2003, and was, at that time, itself an unprecedented set of coordinated climate modelling experiments. It involved 16 groups, from 11 countries with 23 models (some groups contributed more than one model). The resulting CMIP3 dataset, hosted at PCMDI, is 31 terabytes, is openly accessible, has been accessed by more than 1200 scientists, has generated hundreds of papers, and use of this data is still ongoing. The ‘iconic’ figures for future projections of climate change in IPCC AR4 are derived from this dataset (see for example, Figure 10.4 which I’ve previously critiqued).

Most of the CMIP3 work was based on the IPCC SRES “what if” scenarios, which offer different views on future economic development and fossil fuel emissions, but none of which include a serious climate mitigation policy.

By 2006, during the planning for the next IPCC assessment, it was already clear that a profound paradigm shift was in progress. The idea of climate services had emerged, with a growing demand from industry, government and other groups for detailed regional information about the impacts of climate change, and, of course, a growing need to explicitly consider mitigation and adaptation scenarios. And of course the questions are connected: With different mitigation choices, what are the remaining regional climate effects that adaptation will have to deal with?

So, CMIP5 represents a new paradigm for climate change prediction:

  1. Decadal prediction, with high resolution Atmosphere-Ocean General Circulation Models (AOGCMs), with, say, 50km grids, initialized to explore near-term climate change over the next three decades.
  2. First generation Earth System Models, which include a coupled carbon cycle and ice sheet models, typically run at intermediate resolution (100-150km grids) to study longer term feedbacks past mid-century, using a new set of scenarios that include both mitigation and non-mitigation emissions profiles.
  3. Stronger links between communities – e.g. WCRP, IGBP, and the weather prediction community, but most importantly, stronger interaction between the three working groups of the IPCC: WG1 (which looks at the physical science basis), WG2 (which looks at impacts, adaptation and vulnerability), and WG3 (integrated assessment modelling and scenario development). The lack of interaction between WG1 and the others has been a problem in the past, especially for WG2 and WG3, as they’re the ones trying to understand the impacts of different policy choices.

The model experiments for CMIP5 are not dictated by IPCC, but selected by the climate science community itself. A large set of experiments has been identified, intended to provide a 5-year framework (2008-2013) for climate change modelling. As not all modelling groups will be able to run all the experiments, they have been prioritized into three clusters: A core set that everyone will run, and two tiers of optional experiments. Experiments that are completed by early 2012 will be analyzed in the next IPCC assessment (due for publication in 2013).

The overall design for the set of experiments is broken out into two clusters (near-term, i.e. decadal runs; and long-term, i.e. century and longer), designed for different types of model (although for some centres, this really means different configurations of the same model code, if their models can be run at very different resolutions). In both cases, the core experiment set includes runs of both past and future climate. The past runs are used as hindcasts to assess model skill. Here are the decadal experiments, showing the core set in the middle, and tier 1 around the edge (there’s no tier 2 for these, as there aren’t so many decadal experiments):

These experiments include some very computationally-demanding runs at very high resolution, and include the first generation of global cloud-resolving models. For example, the prescribed SST time-slices experiments include two periods (1979-2008 and 2026-2035) where prescribed sea-surface temperatures taken from lower resolution, fully-coupled model runs will be used as a basis for very high resolution atmosphere-ocean runs. The intent of these experiments is to explore the local/regional effects of climate change, including on hurricanes and extreme weather events.

Here’s the set of experiments for the longer-term cluster, marked up to indicate three different uses: model evaluation (where the runs can be compared to observations to identify weaknesses in the models and explore reasons for divergences between the models); climate projections (to show what the models do on four representative scenarios, at least to the year 2100, and, for some runs, out to the year 2300); and understanding (including thought experiments, such as the Aqua planet with no land mass, and abrupt changes in GHG concentrations):

These experiments include a much wider range of scientific questions than earlier IPCC assessments (which is why there are so many more experiments this time round). Here’s another way of grouping the long-term runs, showing the collaborations with the many different research communities who are participating:

With these experiments, some crucial science questions will be addressed:

  • what are the time-evolving changes in regional climate change and extremes over the next few decades?
  • what are the size and nature of the carbon cycle and other feedbacks in the climate system, and what will be the resulting magnitude of change for different mitigation scenarios?

The long-term experiments are based on a new set of scenarios that represent a very different approach than was used in the last IPCC assessment. The new scenarios are called Representative Concentration Pathways (RCPs), although as Jerry points out, the name is a little confusing. I’ll write more about the RCPs in my next post, but here’s a brief summary…

The RCPs were selected after a long series of discussions with the integrated assessment modelling community. A large set of possible scenarios was whittled down to just four. For the convenience of the climate modelling community, they’re labelled with the expected anomaly in radiative forcing (in W/m²) by the year 2100, to give us the set {RCP2.6, RCP4.5, RCP6, RCP8.5}. For comparison, the current total radiative forcing due to anthropogenic greenhouse gases is about 2W/m². But really, the numbers are just to help remember which RCP is which. The term pathway is the important part – each of the four was chosen as an illustrative example of how greenhouse gas concentrations might change over the rest of the century, under different circumstances. They were generated from integrated assessment models that provide detailed emissions profiles for a wide range of different greenhouse gases and other variables (e.g. aerosols). Here’s what the pathways look like (the darker coloured lines are the chosen representative pathways, the thinner lines show others that were considered, and each cluster is labelled with the model that generated them; click for bigger):

Each RCP was produced by a different model, in part because no single model was capable of providing the detail needed for all four different scenarios, although this means that the RCPs cannot be directly compared, because they include different assumptions. The graph above shows the range of mitigation scenarios considered by the blue shading, and the range of non-mitigation scenarios with gray shading (the two areas overlap a little).

Here’s a rundown on the four scenarios:

  • RCP2.6 represents the lower end of possible mitigation strategies, where emissions peak in the next decade or so, and then decline rapidly. This scenario is only possible if the world has gone carbon-negative by the 2070s, presumably by developing wide-scale carbon-capture and storage (CCS) technologies. This might be possible with an energy mix by 2070 of at least 35% renewables, 45% fossil fuels with full CCS (and 20% without), along with use of biomass, tree planting, and perhaps some other air-capture technologies. [My interpretation: this is the most optimistic scenario, in which we manage to do everything short of geo-engineering, and we get started immediately].
  • RCP4.5 represents a less aggressive emissions mitigation policy, where emissions peak before mid-century, and then fall, but not to zero. Under this scenario, concentrations stabilize by the end of the century, but won’t start falling, so the extra radiative forcing at the year 2100 is still more than double what it is today, at 4.5W/m². [My interpretation: this is the compromise future in which most countries work hard to reduce emissions, with a fair degree of success, but where CCS turns out not to be viable for massive deployment].
  • RCP6 represents the more optimistic of the non-mitigation futures. [My interpretation: this scenario is a world without any coordinated climate policy, but where there is still significant uptake of renewable power, but not enough to offset fossil-fuel driven growth among developing nations].
  • RCP8.5 represents the more pessimistic of the non-mitigation futures. For example, by 2070, we would still be getting about 80% of the world’s energy needs from fossil fuels, without CCS, while the remaining 20% come from renewables and/or nuclear. [My interpretation: this is the closest to the “drill, baby, drill” scenario beloved of certain right-wing American politicians].
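For a quick sense of scale, here's the arithmetic behind the “more than double” remark, applied to all four labels. The only inputs are the numbers in the RCP names and the rough figure of 2W/m² for current forcing quoted earlier. The variable names are my own.

```python
CURRENT_FORCING = 2.0  # W/m², the approximate present-day figure quoted above

# Approximate forcing at 2100, read straight off the RCP labels (W/m²).
rcp_forcing_2100 = {"RCP2.6": 2.6, "RCP4.5": 4.5, "RCP6": 6.0, "RCP8.5": 8.5}

ratios = {name: forcing / CURRENT_FORCING for name, forcing in rcp_forcing_2100.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f} times current forcing")
```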

Jerry showed some early model results for these scenarios from the NCAR model, CCSM4, but I’ll save that for my next post. To summarize:

  • 24 modelling groups are expected to participate in CMIP5, and about 10 of these groups have fully coupled earth system models.
  • Data is currently in from 10 groups, covering 14 models. Here’s a live summary, currently showing 172TB, which is already more than 5 times all the model data for CMIP3. Jerry put the total expected data at 1-2 petabytes, although in a talk later in the afternoon, Gary Strand from NCAR pegged it at 2.2PB. [Given how much everyone seems to have underestimated the data volumes from the CMIP5 experiments, I wouldn’t be surprised if it’s even bigger. Sitting next to me during Jerry’s talk, Marie-Alice Foujols from IPSL came up with an estimate of 2PB just for all the data collected from the runs done at IPSL, of which she thought something like 20% would be submitted to the CMIP5 archive].
  • The model outputs will be accessed via the Earth System Grid, and will include much more extensive documentation than previously. The Metafor project has built a controlled vocabulary for describing models and experiments, and the Curator project has developed web-based tools for ingesting this metadata.
  • There’s a BAMS paper coming out soon describing CMIP5.
  • There will be a CMIP5 results session at the WCRP Open science conference next month, another at the AGU meeting in December, and another at a workshop in Hawaii in March.
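The data volume comparisons in the list above are simple arithmetic, sketched here using only figures quoted in this post. The one assumption is the unit conversion of 1 PB = 1000 TB.

```python
CMIP3_TB = 31           # final CMIP3 archive size
CMIP5_SO_FAR_TB = 172   # CMIP5 data in so far, per the live summary

growth = CMIP5_SO_FAR_TB / CMIP3_TB
print(f"CMIP5 so far is {growth:.1f} times the CMIP3 archive")

IPSL_TOTAL_PB = 2.0          # estimate for all data from runs done at IPSL
IPSL_SUBMITTED_SHARE = 0.20  # rough share expected to go to the CMIP5 archive
TB_PER_PB = 1000             # assumption: decimal units

ipsl_submitted_tb = IPSL_TOTAL_PB * TB_PER_PB * IPSL_SUBMITTED_SHARE
print(f"IPSL alone might submit about {ipsl_submitted_tb:.0f} TB")
```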

For the Computing in Atmospheric Sciences workshop next month, I’ll be giving a talk entitled “On the relationship between earth system models and the labs that build them”. Here’s the abstract:

In this talk I will discuss a number of observations from a comparative study of four major climate modeling centres:
– the UK Met Office Hadley Centre (UKMO), in Exeter, UK
– the National Center for Atmospheric Research (NCAR) in Boulder, Colorado
– the Max-Planck Institute for Meteorology (MPI-M) in Hamburg, Germany
– the Institut Pierre Simon Laplace (IPSL) in Paris, France.
The study focussed on the organizational structures and working practices at each centre with respect to earth system model development, and how these affect the history and current qualities of their models. While the centres share a number of similarities, including a growing role for software specialists and greater use of open source tools for managing code and the testing process, there are marked differences in how the different centres are funded, in their organizational structure and in how they allocate resources. These differences are reflected in the program code in a number of ways, including the nature of the coupling between model components, the portability of the code, and (potentially) the quality of the program code.

While all these modelling centres continually seek to refine their software development practices and the software quality of their models, they all struggle to manage the growth (in terms of size and complexity) in the models. Our study suggests that improvements to the software engineering practices at the centres have to take account of differing organizational constraints at each centre. Hence, there is unlikely to be a single set of best practices that works everywhere. Indeed, improvements in modelling practices usually come from local, grass-roots initiatives, in which new tools and techniques are adapted to suit the context at a particular centre. We suggest therefore that there is need for a stronger shared culture of describing current model development practices and sharing lessons learnt, to facilitate local adoption and adaptation.

Previously I posted on the first two sessions of the workshop on “A National Strategy for Advancing Climate Modeling” that was held at NCAR at the end of last month:

  1. What should go into earth system models;
  2. Challenges with hardware, software and human resources;

    The third session focussed on the relationship between models and data.

    Kevin Trenberth kicked off with a talk on Observing Systems. Unfortunately, I missed part of his talk, but I’ll attempt a summary anyway – apologies if it’s incomplete. His main points were that we don’t suffer from a lack of observational data, but from problems with quality, consistency, and characterization of errors. Continuity is a major problem, because much of the observational system was designed for weather forecasting, where consistency of measurement over years and decades isn’t required. Hence, there’s a need for reprocessing and reanalysis of past data, to improve calibration and assess accuracy, and we need benchmarks to measure the effectiveness of reprocessing tools.

    Kevin points out that it’s important to understand that models are used for much more than prediction. They are used:

    • for analysis of observational data, for example to produce global gridded data from the raw observations;
    • to diagnose climate & improve understanding of climate processes (and thence to improve the models);
    • for attribution studies, through experiments to determine climate forcing;
    • for projections and prediction of future climate change;
    • for downscaling to provide regional information about climate impacts;

    Confronting the models with observations is a core activity in earth system modelling. Obviously, it is essential for model evaluation. But observational data is also used to tune the models, for example to remove known systematic biases. Several people at the workshop pointed out that the community needs to do a better job of keeping the data used to tune the models distinct from the data used to evaluate them. For tuning, a number of fields are used – typically top-of-the-atmosphere data such as net shortwave and longwave radiation flux, cloud and clear sky forcing, and cloud fractions. Also, precipitation and surface wind stress, global mean surface temperature, and the period and amplitude of ENSO. Kevin suggests we need to do a better job of collecting information about model tuning from different modelling groups, and ensure model evaluations don’t use the same fields.
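The check Kevin is asking for is essentially a set comparison. Here's a minimal sketch: the tuning fields are the ones listed above, while the evaluation fields are a hypothetical list I made up for illustration.

```python
# Fields mentioned above as typically used for tuning.
tuning_fields = {
    "net shortwave radiation flux",
    "net longwave radiation flux",
    "cloud forcing",
    "clear sky forcing",
    "cloud fraction",
    "precipitation",
    "surface wind stress",
    "global mean surface temperature",
    "ENSO period and amplitude",
}

# Hypothetical fields a group might choose for an evaluation.
evaluation_fields = {
    "precipitation",
    "sea ice extent",
    "global mean surface temperature",
    "sea level pressure",
}

overlap = tuning_fields & evaluation_fields      # not an independent test
independent = evaluation_fields - tuning_fields  # safe to use for evaluation

print("already used for tuning:", sorted(overlap))
print("independent of tuning:  ", sorted(independent))
```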

    For model evaluation, a number of integrated score metrics have been proposed to summarize correlation, root-mean-squared (rms) error and variance ratios. See, for example, Taylor 2001; Boer and Lambert 2001; Murphy et al, 2004; Reichler & Kim 2008.
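For readers who haven't met these three ingredients, here's a minimal sketch of how each is computed for a single field, using only the Python standard library. It is not the method of any of the papers cited above, which combine such quantities in their own ways. The data are invented, and the function assumes neither series is constant.

```python
import math
from statistics import fmean, pstdev

def summary_metrics(model, obs):
    """Correlation, rms error and variance ratio between model and observations."""
    m_mean, o_mean = fmean(model), fmean(obs)
    m_std, o_std = pstdev(model), pstdev(obs)
    covariance = fmean([(m - m_mean) * (o - o_mean) for m, o in zip(model, obs)])
    return {
        "correlation": covariance / (m_std * o_std),
        "rms_error": math.sqrt(fmean([(m - o) ** 2 for m, o in zip(model, obs)])),
        "variance_ratio": (m_std / o_std) ** 2,  # model variance / observed variance
    }

# Made-up values for one field at five locations.
obs = [1.0, 2.0, 3.0, 4.0, 5.0]
model = [1.5, 2.5, 2.5, 4.5, 6.0]

for name, value in summary_metrics(model, obs).items():
    print(f"{name}: {value:.3f}")
```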

    But model evaluation and tuning aren’t the only ways in which models and data are brought together. Just as important is re-analysis, where multiple observational datasets are processed through a model to provide more comprehensive (model-like) data products. For this, data assimilation is needed, whereby observational data fields are used to nudge the model at each timestep as it runs.
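The simplest form of nudging can be sketched in a few lines. This is a toy with a single number as the model state, and a simple relaxation term standing in for the much more sophisticated assimilation schemes used in practice. The `step` function, the `strength` parameter and all the values are assumptions for illustration.

```python
def step(state, dt):
    """Toy model dynamics: relax toward 15.0 (a stand-in for real physics)."""
    return state + dt * 0.1 * (15.0 - state)

def nudged_run(initial_state, observations, dt=1.0, strength=0.5):
    """Run the toy model, pulling it toward an observation at each timestep."""
    state = initial_state
    history = []
    for obs in observations:
        state = step(state, dt)                 # free model step
        state += dt * strength * (obs - state)  # nudge toward the observation
        history.append(state)
    return history

observations = [16.0] * 20  # made-up observations, one per timestep

free = nudged_run(10.0, observations, strength=0.0)  # no assimilation
nudged = nudged_run(10.0, observations, strength=0.5)

print(f"free run ends at   {free[-1]:.2f}")
print(f"nudged run ends at {nudged[-1]:.2f}")
```

With `strength=0.0` the observations are ignored and the model does its own thing. Larger values pull the run closer to the data, at the cost of overriding the model's own dynamics.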

    Kevin also talked about forward modelling, a technique in which the model is used to reproduce the signal that a particular instrument would record, given certain climate conditions. Forward modelling is used for comparing models with ground observations and satellite data. In much of this work, there is an implicit assumption that the satellite data are correct, but in practice, all satellite data have biases, and need re-processing. For this work, the models need good emulation of instrument properties and thresholds. For examples, see: Chepfer, Bony et al, 2010; Stubenrauch & Kinne 2009.
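Here's a toy illustration of why instrument thresholds matter in forward modelling. It's my own sketch, not taken from any real observation simulator: the optical depth values and the detection threshold are invented, and a real simulator would emulate far more of the instrument than a single cutoff.

```python
def simulated_cloud_fraction(optical_depths, detection_threshold):
    """Fraction of grid cells a hypothetical instrument would flag as cloudy."""
    detected = [tau >= detection_threshold for tau in optical_depths]
    return sum(detected) / len(detected)

# Made-up cloud optical depths from a model, one per grid cell.
model_optical_depths = [0.0, 0.05, 0.2, 0.4, 1.5, 3.0, 0.1, 0.0]

# What the model itself counts as cloud: any cell with non-zero optical depth.
model_cloud_fraction = sum(tau > 0 for tau in model_optical_depths) / len(model_optical_depths)

# What an instrument that misses thin cloud would report for the same scene.
instrument_view = simulated_cloud_fraction(model_optical_depths, detection_threshold=0.3)

print(f"model's own cloud fraction: {model_cloud_fraction}")
print(f"as seen by the instrument:  {instrument_view}")
```

Comparing the second number, rather than the first, with the satellite product is the like-for-like comparison that forward modelling aims for.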

    He also talked about some of the problems with existing data and models:

    • nearly all satellite data sets contain large spurious variability associated with changing instruments and satellites, orbital decay/drift, calibration, and changing methods of analysis.
    • simulation of the hydrological cycle is poor, especially in the intertropical convergence zone (ITCZ). Tropical transients are too weak, runoff and recycling is not correct, and the diurnal cycle is poor.
    • there are large differences between datasets for low cloud (see Marchand et al 2010)
    • clouds are not well defined. Partly this is a problem of sensitivity of instruments, compounded by the difficulty of distinguishing between clouds and aerosols.
    • Most models have too much incoming solar radiation in the southern oceans, caused by too few clouds. This makes for warmer oceans and diminished poleward transport, which messes up storm tracking and analysis of ocean transports.

    What is needed to support modelling over the next twenty years? Kevin made the following recommendations:

    • Support observations and development into climate datasets.
    • Support reprocessing and reanalysis.
    • Unify NWP and climate models to exploit short term predictions and confront the models with data.
    • Develop more forward modelling and observation simulators, but with more observational input.
    • Targeted process studies such as GEWEX and analysis of climate extremes, for model evaluation.
    • Target problem areas such as monsoons and tropical precipitation.
    • Carry out a survey of fields used to tune models.
    • Design evaluation and model merit scoring based on fields other than those used in tuning.
    • Promote assessments of observational datasets so modellers know which to use (and not use).
    • Support existing projects, including GSICS, SCOPE-CM, CLARREO, GRUAN,

    Overall, there’s a need for a climate observing system. Process studies should not just be left to the observationists – we need the modellers to get involved.

    The second talk was by Ben Kirtman, on “Predictability, Credibility, and Uncertainty Quantification”. He began by pointing out that there is ongoing debate over what predictability means. Some treat it as an inherent property of the climate system, while others think of it as a model property. Ben distinguished two kinds of predictability:

    • Sensitivity of the climate system to initial conditions (predictability of the first kind);
    • Predictability of the boundary forcing (predictability of the second kind).

    Predictability is enhanced by ensuring specific processes are included. For example, you need to include the MJO if you want to predict ENSO. But model-based estimates of predictability are model dependent. If we want to do a better job of assessing predictability, we have to characterize model uncertainty, and we don’t know how to do this today.

    Good progress has been made on quantifying initial condition uncertainty. We have plenty of good ideas for how to probe this (stochastic optimals, bred vectors, etc.) using ensembles with perturbed initial conditions. But from our understanding of chaos theory (e.g. see the Lorenz attractor), predictability depends on which part of the regime you’re in, so we need to assess the predictability for each particular forecast.
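
    As a rough illustration of a perturbed initial condition ensemble, here is a sketch using the classic Lorenz (1963) system as a stand-in for a climate model. The helper names (`lorenz_step`, `ensemble_spread`), the perturbation size and the two starting points are my own choices for illustration, not anything Ben presented, and random Gaussian perturbations are a much cruder choice than bred vectors or stochastic optimals.

```python
import random

def lorenz_step(state, dt=0.01, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """Advance the Lorenz-63 system by one fourth-order Runge-Kutta step."""
    def tendency(s):
        x, y, z = s
        return (sigma * (y - x), x * (rho - z) - y, x * y - beta * z)

    def shifted(s, k, h):
        return tuple(si + h * ki for si, ki in zip(s, k))

    k1 = tendency(state)
    k2 = tendency(shifted(state, k1, dt / 2))
    k3 = tendency(shifted(state, k2, dt / 2))
    k4 = tendency(shifted(state, k3, dt))
    return tuple(
        s + dt / 6 * (a + 2 * b + 2 * c + d)
        for s, a, b, c, d in zip(state, k1, k2, k3, k4)
    )

def ensemble_spread(start, n_members=20, n_steps=500, eps=1e-3, seed=0):
    """Standard deviation of x across an ensemble started from tiny perturbations of `start`."""
    rng = random.Random(seed)
    members = [tuple(v + rng.gauss(0.0, eps) for v in start) for _ in range(n_members)]
    for _ in range(n_steps):
        members = [lorenz_step(m) for m in members]
    xs = [m[0] for m in members]
    mean_x = sum(xs) / len(xs)
    return (sum((x - mean_x) ** 2 for x in xs) / len(xs)) ** 0.5

# Compare how fast the ensemble spreads from two different (arbitrary) starting states.
for start in [(1.0, 1.0, 20.0), (-8.0, -8.0, 27.0)]:
    print(start, "spread in x after 500 steps:", ensemble_spread(start))
```

    Running the same ensemble from different starting states gives different spreads for the same lead time, which is the point about needing a predictability estimate for each particular forecast.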

    Uncertainties in external forcing include uncertainties in both the natural and anthropogenic forcing; however, this is becoming less of an issue in modelling, as these forcings are better understood. Therefore, the biggest challenge is in quantifying uncertainties in model formulation. These arise because of the discrete representation of the climate system, the use of parameterizations of subgrid processes, and missing processes. Current approaches can be characterized as:

    • a posteriori techniques, such as the multi-model ensembles of opportunity used in IPCC assessments, and perturbed parameters/parameterizations, as used in climateprediction.net.
    • a priori techniques, where we incorporate uncertainty as the model evolves. The idea is that uncertainty is in subscale processes and missing physics. Model this non-locally and stochastically. E.g. backscatter, interactive ensembles to incorporate uncertainty in the coupling.

    The term credibility is even less well defined. Ben asked his students what they understood by the term, and they came up with a simple answer: credibility is the extent to which you use the best available science [which corresponds roughly to my suggestion of what model validation ought to mean]. In the literature, there are a number of other ways of expressing credibility:

    • In terms of model bias. For example, Lenny Smith offers a Temporal (or spatial) credibility ratio, calculated as the ratio of the smallest timestep in the model to the smallest duration over which a variable has to be averaged before it compares favourably with observations. This expresses how much averaging over the temporal (or spatial) scale you have to do to make the model look like the data.
    • In terms of whether the ensembles bracket the observations. But the problem here is that you can always pump up an ensemble to do this, and it doesn’t really tell you about probabilistic forecast skill.
    • In terms of model skill. In numerical weather prediction, it’s usual to measure forecast quality using some specific skill metrics.
    • In terms of process fidelity – how well the processes represented in the model capture what is known about those processes in reality. This is a reductionist approach, and depends on the extent to which specific processes can be isolated (both in the model, and in the world).
    • In terms of faith – for example, the modellers’ subjective assessment of how good their model is.
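
    The credibility ratio in the first bullet lends itself to a small worked example. What follows is only my reading of the description above, not Lenny Smith's own formulation: in particular, “compares favourably” is replaced by an rms difference falling below a tolerance that I picked arbitrarily, and the function names are hypothetical.

```python
def block_means(series, window):
    """Average a series over consecutive non-overlapping blocks of `window` steps."""
    n_blocks = len(series) // window
    return [sum(series[i * window:(i + 1) * window]) / window for i in range(n_blocks)]

def rms_difference(a, b):
    """Root-mean-square difference between two equal-length series."""
    return (sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)) ** 0.5

def temporal_credibility_ratio(model, obs, timestep, tolerance):
    """Ratio of the model timestep to the shortest averaging period that matches the data.

    Returns None if no averaging window (leaving at least two blocks) meets the tolerance.
    The matching criterion is an assumption made for this sketch.
    """
    for window in range(1, len(model) // 2 + 1):
        if rms_difference(block_means(model, window), block_means(obs, window)) <= tolerance:
            return timestep / (window * timestep)
    return None

# Toy data: the model oscillates step to step around observations that stay flat.
obs = [0.0] * 8
model = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
print(temporal_credibility_ratio(model, obs, timestep=0.5, tolerance=0.1))
```

    A ratio of 1 would mean the model matches the data at its own timestep; the smaller the ratio, the more averaging is needed before model and data look alike.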

    In the literature, credibility is usually used in a qualitative way to talk about model bias, so model bias is roughly synonymous with the inverse of credibility. However, in these terms, the models currently have a major credibility gap. For example, Ben showed the annual mean rainfall from a long simulation of CESM1, showing bias with respect to GPCP observations. These show the model struggling to capture the spatial distribution of sea surface temperature (SST), especially in equatorial regions.

    Every climate model has a problem with equatorial sea surface temperatures (SST). A recent paper, Anagnostopoulos et al 2009, makes a big deal of this, and is clearly very hostile to climate modelling. They look at regional biases in temperature and precipitation, where the models are clearly not bracketing observations. I googled the Anagnostopoulos paper while Ben was talking – the first few pages of google hits are dominated by denialist websites proclaiming this as a major new study demonstrating the models are poor. It’s amusing that this is treated as news, given that such weaknesses in the models are well known within the modelling community, and discussed in the IPCC report. Meanwhile the hydrologists at the workshop tell me that it’s a third-rate journal, so none of them would pay any attention to this paper.

    Ben argues that these weaknesses need to be removed to increase model credibility. This argument seems a little weak to me. While improving model skill and removing biases are important goals for this community, they don’t necessarily help with model credibility in terms of using the best science (because replacing an empirically derived parameterization with one that’s more theoretically justified will often reduce model skill). More importantly, those outside the modeling community will have their own definitions of credibility, and they’re unlikely to correspond to those used within the community. Some attention to the ways in which other stakeholders understand model credibility would be useful and interesting.

    In summary, Ben identified a number of important tensions for climate modeling. For example, there are tensions between:

    • the desire to measure prediction skill vs. the desire to explore the limits of predictability;
    • the desire to quantify uncertainty, vs. the push for more resolution and complexity in the models;
    • a priori vs. a posteriori methods of assessing model uncertainty;
    • operational vs. research activities (Many modellers believe the IPCC effort is getting a little out of control – it’s a good exercise, but too demanding on resources);
    • weather vs climate modelling;
    • model diversity vs critical mass.

    Ben urged the community to develop a baseline for climate modelling, capturing best practices for uncertainty estimation.

    During a break in the workshop last week, Cecilia Bitz and I managed to compare notes on our undergrad courses. We’ve both been exploring how to teach ideas about climate modeling to students who are not majoring in earth sciences. Cecilia’s course on Weather and Climate Prediction is a little more advanced than mine, as she had a few math and physics pre-requisites, while mine was open to any first year students. For example, Cecilia managed to get the students to run CAM, and experiment with altering the earth’s orbit. It’s an interesting exercise, as it should lead to plenty of insights into connections between different processes in the earth’s system. One of the challenges is that earth system models aren’t necessarily geared up for this kind of tinkering, so you need good expertise in the model being used to understand which kinds of experiments are likely to make sense. But even so, I’m keen to explore this further, as I think the ability to tinker with such models could be an important tool for promoting a better understanding of how the climate system works, even for younger kids.

    Part of what’s missing is elegant user interfaces. EdGCM is better, but still very awkward to use. What I really want is something that’s as intuitive as Angry Birds. Okay, so I’m going to have to compromise somewhere – nonlinear dynamics are a bit more complicated than the flight trajectories of an avian slingshot.

    But that’s not all – someone needs to figure out what kinds of experiments students (and school kids) might want to perform, and provide the appropriate control widgets, so they don’t have to mess around with code and scripts. Rich Loft tells me there’s a project in the works to do something like this with CESM – I’m looking forward to that! In the meantime, Rich suggested two examples of simple simulations of dynamical systems that get closer to what I’m looking for:

    • The Shodor Disease model that lets you explore the dynamics of epidemics, with people in separate rooms, where you can adjust how much they can move between rooms, how the disease works, and whether immunization is available. Counter-intuitive lesson: crank up the mortality rate to 100% and (almost) everyone survives!
    • The Shodor Rabbits and Wolves simulation, which lets you explore population dynamics of predators and prey. Counter-intuitive lesson: double the lifespan of the rabbits and they all die out pretty quickly!

    In the last post, I talked about the opening session at the workshop on “A National Strategy for Advancing Climate Modeling”, which focussed on the big picture questions. In the second session, we focussed on the hardware, software and human resources challenges.

    To kick off, Jeremy Kepner from MIT called in via telecon to talk about software issues, from his perspective working on Matlab tools to support computational modeling. He made the point that it’s getting hard to make scientific code work on new architectures, because it’s increasingly hard to find anyone who wants to do the programming. There’s a growing gap between the software stacks used in current web and mobile apps, gaming, and so on, and that used in scientific software. Programmers are used to having new development environments and tools, for example for developing games for Facebook, and regard scientific software development tools as archaic. This means it’s hard to recruit talent from the software world.

    Jeremy quipped that software is an evil thing – the trick is to get people to write as little of it as possible (and he points out that programmers make mistakes at the rate of one per 1,000 lines of code). Hence, we need higher levels of abstraction, with code generated automatically from higher level descriptions. Hence, an important question is whether it’s time to abandon Fortran. He also points out that programmers believe they spend most of their time coding, but in fact, coding is a relatively small part of what they do. At least half of their time is testing, which means that effort to speed up the testing process gives you the most bang for the buck.

    Ricky Rood, Jack Fellows, and Chet Koblinsky then ran a panel on human resources issues. Ricky pointed out that if we are to identify shortages in human resources, we have to be clear about whether we mean for modeling vs. climate science vs. impacts studies vs. users of climate information, and so on. The argument can be made that in terms of absolute numbers there are enough people in the field, but the problems are in getting an appropriate expertise mix / balance, having people at the interfaces between different communities of expertise, a lack of computational people (and not enough emphasis on training our own), and management of fragmented resources.

    Chet pointed out that there’s been a substantial rise in the number of job postings using the term “climate modelling” over the last decade. But there’s still a widespread perception that there aren’t enough jobs (i.e. more grad students being trained than we have positions for). There are some countervailing voices – for example Pielke argues that universities will always churn out more than enough scientists to support their mission, and there’s a recent BAMS article that explored the question “are we training too many atmospheric scientists?”. The shortage isn’t in the number of people being trained, but in the skills mix.

    We covered a lot of ground in the discussions. I’ll cover just some of the highlights here.

    Several people observed that climate model software development has diverged from mainstream computing. Twenty years ago, academia was the centre of the computing world. Now most computing is in the commercial world, and computational scientists have much less leverage than we used to. This means that some tools we rely on might no longer be sustainable, e.g. Fortran compilers (and autogenerators?) – fewer users care about these, and so there is less support for transitioning them to new architectures. Climate modeling is a 10+ year endeavour, and we need a long-term basis to maintain continuity.

    Much of the discussion focussed on anticipated disruptive transitions in hardware architectures. Whereas in the past, modellers have relied on faster and faster processors to deliver new computing capacity, this is coming to an end. Advances in clock speed have tailed off, and now it’s massive parallelization that delivers the additional computing power. Unfortunately, this means the brute force approach of scaling up current GCM numerical methods on a uniform grid is a dead-end.

    As Bryan Lawrence pointed out, there’s a paradigm change here: computers no longer compute, they produce data. We’re entering an era where CPU time is essentially free, and it’s data wrangling that forms the bottleneck. Massive parallelization of climate models is hard because of the volume of data that must be passed around the system. We can anticipate 1-100 exabyte scale datasets (i.e. this is the size not of the archive, but of the data from a single run of an ensemble). It’s unlikely that any institution will have the ability to evolve their existing codes into this reality.

    The massive parallelization and data volumes also bring another problem. In the past, climate modellers have regarded bit-level reproducibility of climate runs as crucial, partly because reproducing a run exactly is considered good scientific practice, and partly because it allows many kinds of model test to be automated. The problem is, at the scales we’re talking about, exact bit reproducibility is getting hard to maintain. When we scale up to millions of processors, and terabyte data sets, bit-level failures are frequent enough that exact reproducibility can no longer be guaranteed – if a single bit is corrupted during a model run, it may not matter for the climatology of the run, but it will mean exact reproducibility is impossible. Add to this the fact that future CPUs are likely to be less deterministic, and, as Tim Palmer argued at the AGU meeting, we’ll be forced to fundamentally change our codes; so maybe we should take the opportunity to make the models probabilistic.

    One recommendation that came out of our discussions is to consider a two track approach for the software. Now that most modeling centres have finished their runs for the current IPCC assessment (AR5), we should plan to evolve current codes towards the next IPCC assessment (AR6), while starting now on developing entirely new software for AR7. The new codes will address i/o issues, new solvers, etc.

    One of the questions the committee posed to the workshop was the potential for hardware-software co-design. The general consensus was that it’s not possible in the current funding climate. But even if the funding were available, it’s not clear this is desirable, as the software has a much longer useful life than any hardware. Designing for specific hardware instantiations tends to bring major liabilities, and (as my own studies have indicated) there seems to be an inverse correlation between availability of dedicated computing resources and robustness/portability of the software. Things change in climate models all the time, and we need the flexibility to change algorithms, refactor software, etc. This means FPGAs might be a better solution. Dark silicon might push us in this direction anyway.

    Software sharing came up as an important topic, although we didn’t talk about this as much as I would have liked. There seems to be a tendency among modelers to assume that making the code available is sufficient. But as Cecelia DeLuca pointed out, from the ESMF experience, community feedback and participation is important. Adoption mandates are not constructive – you want people to adopt software because it works better. One of the big problems here is understandability of shared code. The learning curve is getting steeper, and code sharing between labs is really only possible with a lot of personal interaction. We did speculate that auto-generation of code might help here, because it forces the development of a higher level language to describe what’s in a climate model.

    For the human resources question, there was a widespread worry that we don’t have the skills and capacity to deal with anticipated disruptive changes in computational resources. There is a shortage of high quality applicants for model development positions, and many disincentives for people to pursue such a career: the long publication cycle, academic snobbery, and the demands of the IPCC all make model development an unattractive career for grad students and early career scientists. We need a different reward system, so that contributions to the model are rewarded.

    However, it’s also clear that we don’t have enough solid data on this – just lots of anecdotal evidence. We don’t know enough about talent development and capacity to say precisely where the problems are. We identified three distinct roles, which someone amusingly labelled: diagnosticians (who use models and model output in their science), perturbers (who explore new types of runs by making small changes to models) and developers (who do the bulk of model development). Universities produce most of the first, a few of the second, and very few of the third. Furthermore, model developers could be subdivided between people who develop new parameterizations and numerical analysts, although I would add a third category: developers of infrastructure code.

    As well as worrying about training of a new generation of modellers, we also worried about whether the other groups (diagnosticians and perturbers) would have the necessary skillsets. Students are energized by climate change as a societal problem, even if they’re not enticed by a career in earth sciences. Can we capitalize on this, through more interaction with work at the policy/science interface? We also need to make climate modelling more attractive to students, and to connect them more closely with the modeling groups. This could be done through certificate programs for undergrads to bring them into modelling groups, and by bringing grad students into modelling centres in their later grad years. To boost computational skills, we should offer training in earth system science to students in computer science, and expand training for earth system scientists in computational skills.

    Finally, let me end with a few of the suggestions that received a very negative response from many workshop attendees:

    • Should the US be offering only one center’s model to the IPCC for each CMIP round? Currently every major modeling center participates, and many of the centres complain that it dominates their resources during the CMIP exercise. However, participating brings many benefits, including visibility, detailed comparison with other models, and a pressure to improve model quality and model documentation.
    • Should we ditch Fortran and move to a higher level language? This one didn’t really even get much discussion. My own view is that it’s simply not possible – the community has too much capital tied up in Fortran, and it’s the only language everyone knows.
    • Can we incentivize a mass participation in climate modeling, like the “develop apps for the iphone”? This is an intriguing notion, but one that I don’t think will get much traction, because of the depth of knowledge needed to do anything useful at all in current earth system modeling. Oh, and we’d probably need a different answer to the previous question, too.

    This week I’ve been attending a workshop at NCAR in Boulder, to provide input to the US National Academies committee on “A National Strategy for Advancing Climate Modeling”. The overall goal is to:

    “…develop a strategy for improving the nation’s capability to accurately simulate climate and related Earth system changes on decadal to centennial timescales. The committee’s report is envisioned as a high level analysis, providing a strategic framework to guide progress in the nation’s climate modeling enterprise over the next 10-20 years”

    The workshop has been fascinating, addressing many of the issues I encountered on my visits to various modelling labs last year – how labs are organised, what goes into the models, how they are tested and validated, etc. I now have a stack of notes from the talks and discussions, so I’ll divide this into several blog posts, corresponding to the main themes of the workshop:

    The full agenda is available at the NRC website, and they’ll be posting meeting summaries soon.

    The first session of the workshop kicked off with the big picture questions: what earth system processes should (and shouldn’t) be included in future earth system models, and what relationship should there be with the models used for weather forecasting? To get the discussion started, we had two talks, from Andrew Weaver (University of Victoria) and Tim Palmer (U Oxford and ECMWF).

    Andrew Weaver’s talk drew on lessons learned from climate modelling in Canada to argue that we need flexibility to build different types of model for different kinds of questions. Central to his talk was a typology of modelling needs, based on the kinds of questions that need addressing:

    • Curiosity-driven research (mainly done in the universities). For example:
      • Paleo-climate (e.g. what are the effects of a Heinrich event on African climate, and how does this help our understanding of the migration of early humans from Africa?)
      • Polar climates (e.g. what is the future of the Greenland ice sheet?)
    • Policy and decision making questions (mainly done by the government agencies). These can be separated into:
      • Mitigation questions (e.g. how much can we emit and still stabilize at various temperature levels?)
      • Adaptation questions (e.g. infrastructure planning over things like water supplies, energy supply and demand)

    These diverse types of question place different challenges on climate modelling. From the policy-making point of view, there is a need for higher resolution and downscaling for regional assessments, along with challenges of modelling sub-grid scale processes (Dave Randall talked more about this in a later session). On the other hand, for paleo-climate studies, we need to bring additional things into the models, such as representing isotopes, ecosystems, and human activity (e.g. the effect on climate of the switch from hunter-gatherer to agrarian society).

    This means we need a range of different models for different purposes. Andrew argues that we should design a model to respond to a specific research question, rather than having one big model that we apply to any problem. He made a strong case for a hierarchy of models, describing his work with EMICs like the UVic model, which uses a simple energy/moisture balance model for the atmosphere, coupled with components from other labs for sea ice, ocean, vegetation, etc. He raised a chuckle when he pointed out that neither EMICs nor GCMs are able to get the Greenland ice sheet right, and therefore EMICs are superior because they get the wrong answer faster. There is a serious point here – for some types of question, all models are poor approximations, and hence, their role is to probe what we don’t know, rather than to accurately simulate what we do know.

    Some of the problems faced in the current generation of models will never go away: clouds and aerosols, ocean mixing, precipitation. And there are some paths we don’t want to take, particularly the idea of coupling general equilibrium economic models with earth system models. The latter may be very sexy, but it’s not clear what we’ll learn. The uncertainties in the economics models are so large that you can get any answer you like, which means this will soak up resources for very little knowledge gain. We should also resist calls to consolidate modelling activities at just one or two international centres.

    In summary, Andrew argued we are at a critical juncture for US and international modelling efforts. Currently the field is dominated by a race to build the biggest model with the most subcomponents for the SRES scenarios. But a future priority must be on providing information needed for adaptation planning – this has been neglected in the past in favour of mitigation studies.

    Tim Palmer’s talk covered some of the ideas on probabilistic modeling that he presented at the AGU meeting in December. He began with an exploration of why better prediction capability is important, adding a third category to Andrew’s set of policy-related questions:

    • mitigation – while the risk of catastrophic climate change is unequivocal, we must reduce the uncertainties if we are ever to tackle the indifference to significant emissions cuts; otherwise humanity is heading for utter disaster.
    • adaptation – spending wisely on new infrastructure depends on our ability to do reliable regional prediction.
    • geo-engineering – could we ever take this seriously without reliable bias-free models?

    Tim suggested two over-riding goals for the climate modeling community. We should aspire to develop ab initio models (based on first principles) whose outputs do not differ significantly from observations; and we should aspire to improve probabilistic forecasts by reducing uncertainties to the absolute minimum, but without making the probabilistic forecasts over-confident.

    The main barriers to progress are limited human and computational resources, and limited observational data. Earth System Models are complex – there is no more complex problem in computational science – and great amounts of ingenuity and creativity are needed. All of the directions we wish to pursue – great complexity, higher resolution, bigger ensembles, longer paleo runs, data assimilation – demand more computational resources. But just as much, we’re hampered by an inability to get the maximum value from observations. So how do we use observations to inform modelling in a reliable way?

    Are we constrained by our history? Ab initio climate models originally grew out of numerical weather prediction, but rapidly diverged. Few climate models today are viable for NWP. Models are developed at the institutional level, so we have diversity of models worldwide, and this is often seen as a virtue – the diversity permits the creation of model ensembles, and rivalry between labs ought to be good for progress. But do we want to maintain this? Are the benefits as great as they are often portrayed? This history, and the institutional politics of labs, universities and funding agencies provide an unspoken constraint on our thinking. We must be able to identify and separate these constraints if we’re going to give taxpayers best value for money.

    Uncertainty is a key issue. While the diversity of models is often cited as a measure of confidence in ensemble forecasts, this is very ad hoc. Models are not designed to span uncertainty in representation of physical processes in any systematic way, so the ensembles are ensembles of opportunity. So if the models agree, how can we be sure this is a measure of confidence? An obvious counter-example is in economics, where the financial crisis happened despite all economics models agreeing.

    So can we reduce uncertainty if we had better models? Tim argues that seamless prediction is important. He defines it as “bringing the constraints and insights of NWP into the climate arena”. The key idea is to be able to use the same modeling system to move seamlessly between forecasts at different scales, both temporally and spatially. In essence, it means unifying climate and weather prediction models. This unification brings a number of advantages:

    • It accelerates model development and reduces model biases, by exploring model skill on shorter, verifiable timescales.
    • It helps to bridge between observations and testing strategies, using techniques such as data assimilation.
    • It brings the weather and climate communities together.
    • It allows for cross-over of best practices.

    Many key climate amplifiers are associated with fast timescale processes, which are best explored in NWP mode. This approach also allows us to test the reliability of probabilistic forecasts, on weekly, monthly and seasonal timescales. For example, reliability diagrams can be used to explore the reliability of a whole season of forecasts. This is done by subsampling the forecasts, taking, for example, all the forecasts that said it would rain with 40% probability, and checking that it did rain 40% of the time for this subsample.
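
    The subsampling behind a reliability diagram is easy to sketch. The following is a minimal illustration with made-up forecasts, not data from Tim’s talk; the function name `reliability_table` and the choice to group forecasts by probability rounded to one decimal place are my own assumptions.

```python
from collections import defaultdict

def reliability_table(forecast_probs, outcomes, decimals=1):
    """Group forecasts by stated probability and report how often the event happened.

    Returns {stated probability: (observed frequency, number of forecasts)}.
    """
    counts = defaultdict(int)
    hits = defaultdict(int)
    for prob, happened in zip(forecast_probs, outcomes):
        key = round(prob, decimals)
        counts[key] += 1
        hits[key] += 1 if happened else 0
    return {key: (hits[key] / counts[key], counts[key]) for key in sorted(counts)}

# Invented season: ten forecasts of 40% rain (it rained on four of those days)
# and five forecasts of 80% rain (it rained on all five).
forecasts = [0.4] * 10 + [0.8] * 5
rained = [True] * 4 + [False] * 6 + [True] * 5
for prob, (freq, n) in reliability_table(forecasts, rained).items():
    print(f"forecast {prob:.1f}: observed frequency {freq:.2f} over {n} forecasts")
```

    Plotting observed frequency against stated probability gives the diagram itself: a reliable forecast system sits on the diagonal, as the 40% group does in this toy case, while the 80% group falls off it.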

    Such a unification also allows a pooling of research activity in universities, labs, and NWP centres to make progress on important ideas such as stochastic parameterizations and data assimilation. Stochastic parameterizations are proving more skilful in NWP than multi-model ensembles on monthly timescales, although the ideas are still in their infancy. Data assimilation provides a bridge between the modelling and observational worlds (as an example, Rodwell and Palmer were able to rule out high climate sensitivities in the climateprediction.net data using assimilation with the 6-hour ECMWF observational data).

    In contrast to Andrew’s argument to avoid consolidation, Tim argues that consolidation of effort at the continental (rather than national) level would bring many benefits. He cites the example of Airbus, where European countries pooled their efforts because aircraft design had become too complex for any individual nation to go it alone. The Airbus approach involves a consortium, allowing for specialization within each country.

    Tim closed with a recap of an argument he made in a recent Physics World article. If a group of nations can come together to fund the Large Hadron Collider, then why can’t we come together to fund a computing infrastructure for climate computing? If climate is the number one problem facing the world, then why don’t we have the number one supercomputer, rather than being around number 50 in the TOP500?

    Following these talks, we broke off into discussion groups to discuss a set of questions posed by the committee. I managed to capture some of the key points that emerged from these discussions – the committee report will present a much fuller account.

    • As we expand earth system models to include ever more processes, we should not lose sight of the old unsolved problems, such as clouds, ice sheets, etc.
    • We should be very cautious about introducing things into the models that are not tied back to basic physics. Many people in the discussions commented that they would resist human activities being incorporated into the models, as this breaks the ab initio approach. However, others argued that including social and ecosystem processes is inevitable, as some research questions demand it, and so we should focus on how it should be done. For example, we should start with places where the 2-way interaction is the largest, e.g. land use, energy feedbacks, renewable energy.
    • We should be able to develop an organized hierarchy of models, with traceable connections to one another.
    • Seamless prediction means working within the same framework/system, but not necessarily with the same model.
    • As models become more complex, we need to be able to disentangle different kinds of complexity – for example the complexity that derives from emergent behaviour vs. number of processes vs. number of degrees of freedom.
    • The scale gap between models and observations is disappearing. Model resolution is increasing exponentially, while observational capability is increasing only polynomially. This will improve our ability to test models against observations.
    • There is great ambiguity over what counts as an Earth System Model. Do they include regional models? What level of coupling defines an ESM?
    • There are huge challenges in bringing communities together to consolidate efforts, even on seemingly simple things like agreeing terminology.
    • There’s a complementary challenge to the improvement of ESMs: how can we improve the representation of climate processes in other kinds of model?
    • We should pay attention to user-generated questions. In particular, the general public doesn’t feel that climate modelling is relevant to them, which partly explains problems with lack of funding.

    Next up: challenges arising from hardware, software and human resource limitations…

    Last summer, when I was visiting NCAR, Warren Washington gave me some papers to read on the first successful numerical weather prediction, done on ENIAC in 1950, by a team of meteorologists put together by John von Neumann. von Neumann was very keen to explore new applications for ENIAC, and saw numerical weather prediction as an obvious thing to try, especially as he’d been working with atmospheric simulations for modeling nuclear weapons explosions. There was, of course, a military angle to this – the cold war was just getting going, and weather control was posited as a potentially powerful new weapon. Certainly that was enough to get the army to fund the project.

    The original forecast took about 24 hours to compute (including time for loading the punched cards), for a 24-hour forecast. This was remarkable, as it meant that with a faster machine, useful weather prediction was possible. There’s a very nice account of the ENIAC computations in:

    …and a slightly more technical account, with details of the algorithm in:

    So having read up on this, I thought it would be interesting to attempt to re-create the program in a modern programming language, as an exercise in modeling, and as a way of better understanding this historic milestone. At which point Warren pointed out to me that it’s already been done, by Peter Lynch at University College Dublin:

    And not only that, but Peter then went one better, and re-implemented it again on a mobile phone, as a demonstration of how far Moore’s law has brought us. And he calls it PHONIAC. It took less than a second to compute on PHONIAC (remember, the original computation needed a room-sized machine, a bunch of operators, and 24 hours).

    I mentioned this in the comment thread on my earlier post on model IV&V, but I’m elevating it to a full post here because it’s an interesting point of discussion.

    I had a very interesting lunch with David Randall at CSU yesterday, in which we talked about many of the challenges facing climate modelling groups as they deal with increasing complexity in the models. One topic that came up was the question of whether it’s time for the climate modeling labs to establish separate divisions for the science models (experimental tools for trying out new ideas) and production models (which would be used for assessments and policy support). This separation hasn’t happened in climate modelling, but may well be inevitable, if the anticipated market for climate services ever materializes.

    There are many benefits of such a separation. It would clarify the goals and roles within the modeling labs, and allow for a more explicit decision process that decides which ideas from the bleeding edge science models are mature enough for inclusion in the operational models. The latter would presumably only contain the more well-established science, would change less rapidly, and could be better engineered for robustness and usability. And better documented.

    But there’s a huge downside: the separation would effectively mean two separate models need to be developed and maintained (thus potentially doubling the effort), and the separation would make it harder to get the latest science transferred into the production model. Which in turn would mean a risk that assessments such as the IPCC’s become even more dated than they are now: there’s already a several year delay because of the time it takes to complete model runs, share the data, analyze it, peer-review and publish results, and then compile the assessment reports. Divorcing science models from production models would make this delay worse.

    But there’s an even bigger problem: the community is too small. There aren’t enough people who understand how to put together a climate model as it is; bifurcating the effort will make this shortfall even worse. David points out that part of the problem is that climate models are now so complex that nobody really understands the entire model; the other problem is that our grad schools aren’t producing many people who have both the aptitude and enthusiasm for climate modeling. There’s a risk that the best modellers will choose to stay in the science shops (because getting leading edge science into the models is much more motivating), leaving insufficient expertise to maintain quality in the production models.

    So really, it comes down to some difficult questions about priorities: given the serious shortage of good modellers, do we push ahead with the current approach in which progress at the leading edge of the science is prioritized, or do we split the effort to create these production shops? It seems to me that what matters for the IPCC at the moment is a good assessment of the current science, not some separate climate forecasting service. If a commercial market develops for the latter (which is possible, once people really start to get serious about climate change), then someone will have to figure out how to channel the revenues into training a new generation of modellers.

    When I mentioned this discussion on the earlier thread, Josh was surprised at the point that universities aren’t producing enough people with the aptitude and motivation:

    “seems like this is a hot / growth area (maybe that impression is just due to the press coverage).”

    Michael replied:

    “Funding is weak and sporadic; the political visibility of these issues often causes revenge-taking at the top of the funding hierarchy. Recent news, for instance, seems to be of drastic cuts in Canadian funding for climate science. …

    The limited budgets lead to attachment to awkward legacy codes, which drives away the most ambitious programmers. The nature of the problem stymies the most mathematically adept who are inclined to look for more purity. Software engineers take a back seat to physical scientists with little regard for software design as a profession. All in all, the work is drastically ill-rewarded in proportion to its importance, and it’s fair to say that while it attracts good people, it’s not hard to imagine a larger group of much higher productivity and greater computational science sophistication working on this problem.”

    And that is the nub of the problem. There’s plenty of scope for improving the quality of the models and the quality of the software in them. But if we can’t grow the pool of engaged talent, it won’t happen.

    My post on validating climate models suggested that the key validation criterion is the extent to which the model captures (some aspect of) the current scientific theory, and is useful in exploring the theory. In effect, I’m saying that climate models are scientific tools, and should be validated as scientific tools. This makes them very different from, say, numerical weather prediction (NWP) software, which is used in an operational setting to provide a service (predicting the weather).

    What’s confusing is that both communities (climate modeling and weather modeling) use many of the same techniques both for the design of the models, and for comparing the models with observational data.

    For NWP, forecast accuracy is the overriding objective, and the community has developed an extensive methodology for doing forecast verification. I pondered for a while whether this use of the term ‘verification’ is consistent with my definitions, because surely we should be “validating” a forecast rather than “verifying” it. After thinking about it for a while, I concluded that the terminology is consistent, because forecast verification is like checking a program against its specification. In this case the specification states precisely what is being predicted, with what accuracy, and what would constitute a successful forecast (Bob Grumbine gives a recent example in verifying accuracy of seasonal sea ice forecasts). The verification procedure checks that the actual forecast was accurate, within the criteria set by this specification. Whether or not the forecast was useful is another question: that’s the validation question (and it’s a subjective question that requires some investigation of why people want forecasts in the first place).
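The analogy with checking a program against its specification can be sketched in a few lines of Python. This is purely illustrative and not how any forecasting centre actually does it: the `ForecastSpec` fields (an error tolerance and a required hit rate), the function `verify_forecasts`, and the numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ForecastSpec:
    """A hypothetical forecast specification, agreed before the forecasts are made."""
    tolerance: float     # largest absolute error that still counts as a hit
    min_hit_rate: float  # fraction of forecasts that must be hits


def verify_forecasts(forecasts, observations, spec):
    """Check a set of forecasts against the spec.

    Returns (hit_rate, passed). This says whether the forecasts met the
    stated criteria; it says nothing about whether they were useful.
    """
    if not forecasts or len(forecasts) != len(observations):
        raise ValueError("need one observation per forecast, and at least one forecast")
    hits = sum(
        1 for f, o in zip(forecasts, observations) if abs(f - o) <= spec.tolerance
    )
    hit_rate = hits / len(forecasts)
    return hit_rate, hit_rate >= spec.min_hit_rate


if __name__ == "__main__":
    # Invented numbers: temperature forecasts judged to within 2 degrees,
    # with at least 75% required to be that close.
    spec = ForecastSpec(tolerance=2.0, min_hit_rate=0.75)
    hit_rate, passed = verify_forecasts([10, 12, 15, 20], [11, 12, 18, 19], spec)
    print(f"hit rate {hit_rate:.2f}, passed: {passed}")
```

Note that the check only ever sees forecasts and observations. Nothing in it refers to the model that helped produce the forecasts.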

    An important point here is that forecast verification is not software verification: it doesn’t verify a particular piece of software. It’s also not simulation verification: it doesn’t verify a given run produced by that software. It’s verification of an entire forecasting system. A forecasting system makes use of computational models (often more than one), as well as a bunch of experts who interpret the model results. It also includes an extensive data collection system that gathers information about the current state of the world to use as input to the model. (And of course, some forecasting systems don’t use computational models at all). So:

    • If the forecast is inaccurate (according to the forecast criteria), it doesn’t necessarily mean there’s a flaw in the models – it might just as well be a flaw in the interpretation of the model outputs, or in the data collection process that provided its inputs. Oh, and of course, the verification might also fail because the specification is wrong, e.g. because there are flaws in the observational system used in the verification procedure too.
    • If the forecasting system persistently produces accurate forecasts (according to the forecast criteria), that doesn’t necessarily tell us anything about the quality of the software itself, it just means that the entire forecast system worked. It may well be that the model is very poor, but the meteorologists who interpret model outputs are brilliant at overcoming the weaknesses in the model (perhaps in the way they configure the runs, or perhaps in the way they filter model outputs), to produce accurate forecasts for their customers.

    However, one effect of using this forecast verification approach day-in-day-out for weather forecasting systems over several decades (with an overall demand from customers for steady improvements in forecast accuracy) is that all parts of the forecasting system have improved dramatically, including the software. And climate modelling has benefited from this, as improvements in the modelling of processes needed for NWP can often also be used to improve the climate models (Senior et al. have an excellent chapter on this in a forthcoming book, which I will review nearer to the publication date).

    The question is, can we apply a similar forecast verification methodology to the “climate forecasting system”, despite the differences between weather and climate?

    Note that the question isn’t about whether we can verify the accuracy of climate models this way, because the methodology doesn’t separate the models from the broader system in which they are used. So, if we take this route at all, we’re attempting to verify the forecast accuracy of the whole system: collection of observational data, creation of theories, use of these theories to develop models, choices for which model and which model configuration to use, choices for how to set up the runs, and interpretation of the results.

    Climate models are not designed as forecasting tools; they are designed as tools to explore current theories about the climate system, and to investigate sources of uncertainty in these theories. However, the fact that they can be used to project potential future climate change (under various scenarios) is very handy. Of course, this is not the only way to produce quantified estimates of future climate change – you can do it using paper and pencil. It’s also a little unfortunate, because the IPCC process (or at least the end-users of IPCC reports) tends to over-emphasize the model projections at the expense of the science that went into them, and increasingly the funding for the science is tied to the production of such projections.

    But some people (both within the climate modeling community and within the denialist community) would prefer that they not be used to project future climate change at all. (The argument from within the modelling community is that the results get over-interpreted or mis-interpreted by lay audiences; the argument from the denialist community is that models aren’t perfect. I think these two arguments are connected…). However, both arguments ignore reality: society demands of climate science that it provides its best estimates of the rate and size of future climate change, and (to the extent that they embody what we currently know about climate) the models are the best tool for this job. Not using them in the IPCC assessments would be like marching into the jungle with one eye closed.

    So, back to the question: can we use NWP forecast verification for climate projections? I think the answer is ‘no’, because of the timescales involved. Projections of climate change really only make sense on the scale of decades to centuries. Waiting for decades to do the verification is pointless – by then the science will have moved on, and it will be way too late for policymaking purposes anyway.

    If we can’t verify the forecasts on a timescale that’s actually useful, does this mean the models are invalid? Again the answer is ‘no’, for three reasons. First, we have plenty of other V&V techniques to apply to climate models. Second, the argument that climate models are a valid tool for creating future projections of climate change is based not on our ability to do forecast verification, but on how well the models capture the current state of the science. And third, forecast verification wouldn’t necessarily say anything about the models themselves anyway, as it assesses the entire forecast system.

    It would certainly be really, really useful to be able to verify the “climate forecast” system. But the fact that we can’t does not mean we cannot validate climate models.