I had several interesting conversations at WCRP11 last week about how different the various climate models are. The question is important because it gives some insight into how much an ensemble of different models captures the uncertainty in climate projections. Several speakers at WCRP suggested we need an international effort to build a new, best of breed climate model. For example, Christian Jakob argued that we need a “Manhattan project” to build a new, more modern climate model, rather than continuing to evolve our old ones (I’ve argued in the past that this is not a viable approach). There have also been calls for a new international climate modeling centre, with the resources to build much larger supercomputing facilities.
The counter-argument is that the current diversity in models is important, and re-allocating resources to a single centre would remove this benefit. Currently around 20 or so different labs around the world build their own climate models to participate in the model inter-comparison projects that form a key input to the IPCC assessments. Part of the argument for this diversity of models is that when different models give similar results, that boosts our confidence in those results, and when they give different results, the comparisons provide insights into how well we currently understand and can simulate the climate system. For assessment purposes, the spread of the models is often taken as a proxy for uncertainty, in the absence of any other way of calculating error bars for model projections.
But that raises a number of questions. How well does the current set of coupled climate models capture the uncertainty? How different are the models really? Do they all share similar biases? And can we characterize how model intercomparisons feed back into progress in improving the models? I think we’re starting to get interesting answers to the first two of these questions, while the last two are, I think, still unanswered.
First, then, is the question of representing uncertainty. There are, of course, a number of sources of uncertainty. [Note that ‘uncertainty’ here doesn’t mean ‘ignorance’ (a mistake often made by non-scientists); it means, roughly, how big should the error bars be when we make a forecast, or more usefully, what does the probability distribution look like for different climate outcomes?]. In climate projections, sources of uncertainty can be grouped into three types:
- Internal variability: natural fluctuations in the climate (for example, the year-to-year differences caused by the El Niño Southern Oscillation, ENSO);
- Scenario uncertainty: the uncertainty over future carbon emissions, land use changes, and other types of anthropogenic forcings. As we really don’t know how these will change year-by-year in the future (irrespective of whether any explicit policy targets are set), it’s hard to say exactly how much climate change we should expect.
- Model uncertainty: the range of different responses to the same emissions scenario given by different models. Such differences arise, presumably, because we don’t understand all the relevant processes in the climate system perfectly. This is the kind of uncertainty that a large ensemble of different models ought to be able to assess.
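To make the three categories concrete, here is a minimal sketch of one simple way to split the variance of a set of projections into these three parts. It is not necessarily the procedure Hawkins and Sutton used; the function name `decompose`, the layout of the `data` dictionary, and all the numbers are assumptions invented for illustration.

```python
from statistics import mean, pvariance


def decompose(data):
    """Split projection variance into three illustrative parts.

    data[scenario][model] is a list of projected warming values (deg C)
    from runs of one model that differ only in their initial conditions.
    """
    # Internal variability: spread between runs of the same model
    # under the same scenario, averaged over all models and scenarios.
    internal = mean(
        pvariance(runs)
        for models in data.values()
        for runs in models.values()
    )
    # Model uncertainty: spread between the models' run-averages
    # within one scenario, averaged over scenarios.
    model = mean(
        pvariance([mean(runs) for runs in models.values()])
        for models in data.values()
    )
    # Scenario uncertainty: spread between the multi-model means
    # of the different scenarios.
    scenario = pvariance(
        [mean(mean(runs) for runs in models.values()) for models in data.values()]
    )
    return {"internal": internal, "model": model, "scenario": scenario}


# Synthetic toy numbers, NOT real model output.
data = {
    "low_emissions": {"M1": [1.9, 2.1], "M2": [2.4, 2.6]},
    "high_emissions": {"M1": [3.9, 4.1], "M2": [4.4, 4.6]},
}
parts = decompose(data)
print(parts)
```

In this toy case the scenario term dominates, the model term is smaller, and the internal term is smallest, which is roughly the ordering one would expect for end-of-century projections.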
Hawkins and Sutton analyzed the impact of these different types of uncertainty on projections of global temperature over the range of a century. Here, Fractional Uncertainty means the ratio of the model spread to the projected temperature change (against a 1971-2000 mean):
This analysis shows that for short term (decadal) projections, the internal variability is significant. Finding ways of reducing this (for example by better model initialization from the current state of the climate) is important for the kind of near-term regional projections needed by, for example, city planners, and utility and insurance companies. Hawkins & Sutton indicate with dashed lines some potential to reduce this uncertainty for decadal projections through better initialization of the models.
For longer term (century) projections, internal variability is dwarfed by scenario uncertainty. However, if we’re clear about the nature of the scenarios used, we can put scenario uncertainty aside and treat model runs as “what-if” explorations – if the emissions follow a particular pathway over the 21st Century, what climate response might we expect?
Model uncertainty remains significant over both short and long term projections. The important question here for predicting climate change is how much of this range of different model responses captures the real uncertainties in the science itself. In the analysis above, the variability due to model differences is about 1/4 of the magnitude of the mean temperature rise projected for the end of the century. For example, if a given emissions scenario leads to a model mean of +4°C, the model spread would be about 1°C, yielding a projection of +4±0.5°C. So is that the right size for an error bar on our end-of-century temperature projections? Or, to turn the question around, what is the probability of a surprise – where the climate change turns out to fall outside the range represented by the current model ensemble?
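The arithmetic in that example can be written out as a short sketch. The function names are hypothetical, and the code assumes, as the example above does, that the "spread" is the full width of the range, so the error bar is half the spread on either side of the mean.

```python
def fractional_uncertainty(spread, mean_change):
    """Ratio of the model spread to the projected mean change."""
    if mean_change == 0:
        raise ValueError("mean_change must be non-zero")
    return spread / mean_change


def error_bar(mean_change, fraction):
    """Return (low, high) for a given mean change and fractional uncertainty.

    Assumes the spread is the full width of the range, so half of it
    lies on each side of the mean.
    """
    spread = fraction * mean_change
    half = spread / 2
    return mean_change - half, mean_change + half


# A model mean of +4 deg C with a fractional uncertainty of about 1/4.
low, high = error_bar(4.0, 0.25)
print(low, high)  # 3.5 4.5
```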
Just as importantly, is the model ensemble mean the most likely outcome? Or do the models share certain biases so that the truth is somewhere other than the multi-model mean? Last year, James Annan demolished the idea that the models cluster around the truth, and, in a paper with Julia Hargreaves, provided some evidence that the model ensembles do a relatively good job of bracketing the observational data, and that, if anything, the ensemble spread is too broad. If the latter point is correct, then the model ensembles over-estimate the uncertainty.
This brings me to the question of how different the models really are. Over the summer, Kaitlin Alexander worked with me to explore the software architecture of some of the models that I’ve worked with from Europe and N. America. The first thing that jumped out at me when she showed me her diagrams was how different the models all look from one another. Here are six of them presented side-by-side. The coloured ovals indicate the size (in lines of code) of each major model component (relative to other components in the same model; the different models are not shown to scale), and the coloured arrows indicate data exchanges between the major components (see Kaitlin’s post for more details):
There are clearly differences in how the components are coupled together (for example, whether all data exchanges pass through a coupler, or whether components interact directly). In some cases, major subcomponents are embedded as subroutines within a model component, which makes the architecture harder to understand, but may make sense from a scientific point of view, when earth system processes themselves are tightly coupled. However, such differences in the code might just be superficial, as the choice of call structure should not, in principle, affect the climatology.
The other significant difference is in the relative sizes of the major components. Lines of code isn’t necessarily a reliable measure, but it usually offers a reasonable proxy for the amount of functionality. So a model with an atmosphere model dramatically bigger than the other components indicates a model for which far more work (and hence far more science) has gone into modeling the atmosphere than the other components.
Compare, for example, the relative sizes of the atmosphere and ocean components for HadGEM3 and IPSLCM5A, which, incidentally, both use the same ocean model, NEMO. HadGEM3 has a much bigger atmosphere model, representing more science, or at least many more options for different configurations. In part, this is because the UK Met Office is an operational weather forecasting centre, and the code base is shared between NWP and climate research. Daily use of this model for weather forecasting offers many opportunities to improve the skill of the model (although improvement in skill in short term weather forecasting doesn’t necessarily imply improvements in skill for climate simulations). However, the atmosphere model is the biggest beneficiary of this process, and, in fact, the UK Met Office does not have much expertise in ocean modeling. In contrast, the IPSL model is the result of a collaboration between several similarly sized research groups, representing different earth subsystems.
But do these architectural differences show up as scientific differences? I think they do, but I was finding this hard to analyze. Then I had a fascinating conversation at WCRP last week with Reto Knutti, who showed me a recent paper that he published with D. Masson, in which they analyzed model similarity from across the CMIP3 dataset. The paper describes a cluster analysis over all the CMIP3 models (plus three re-analysis datasets, to represent observations), based on how well they capture the full spatial field for temperature (on the left) and precipitation (on the right). The cluster diagrams look like this (click for bigger):
In these diagrams, the models from the same lab are coloured the same. Observational data are in pale blue (three observational datasets were included for temperature, and two for precipitation). Some obvious things jump out: the different observational datasets are more similar to each other than they are to any of the models, but as a cluster, they don’t look any different from the models. Interestingly, models from the same lab tend to be more similar to one another, even when these span different model generations. For example, for temperature, the UK Met Office models HadCM3 and HadGEM1 are more like each other than they are like any other models, even though they run at very different resolutions, and have different ocean models. For precipitation, all the GISS models cluster together and are quite different from all the other models.
The overall conclusion from this analysis is that using models from just one lab (even in very different configurations, and across model generations) gives you a lot less variability than using models from different labs. Which does suggest that there’s something in the architectural choices made at each lab that leads to a difference in the climatology. In the paper, Masson & Knutti go on to analyze perturbed physics ensembles, and show that the same effect shows up here too. Taking a single model, and systematically varying the parameters used in the model physics still gives you less variability than using models from different labs.
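The within-lab versus between-lab comparison can be illustrated with a small sketch. This is a simple stand-in using root-mean-square differences between fields, not the cluster analysis from the Masson & Knutti paper; the model names, lab labels, and field values are all synthetic and hypothetical.

```python
from itertools import combinations
from math import sqrt


def rmse(a, b):
    """Root-mean-square difference between two equally sized fields."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


def mean_pairwise_distance(fields, labs):
    """Return (within_lab_mean, between_lab_mean) over all model pairs."""
    within, between = [], []
    for i, j in combinations(sorted(fields), 2):
        d = rmse(fields[i], fields[j])
        (within if labs[i] == labs[j] else between).append(d)
    if not within or not between:
        raise ValueError("need at least one within-lab and one between-lab pair")
    return sum(within) / len(within), sum(between) / len(between)


# Synthetic toy "fields" (e.g. temperature at four grid points),
# NOT real model output. Two hypothetical labs, two models each.
fields = {
    "A1": [0.0, 1.0, 2.0, 3.0],
    "A2": [0.1, 1.1, 2.1, 3.1],
    "B1": [1.0, 2.0, 3.0, 4.0],
    "B2": [1.2, 2.2, 3.2, 4.2],
}
labs = {"A1": "A", "A2": "A", "B1": "B", "B2": "B"}

within, between = mean_pairwise_distance(fields, labs)
print(f"within-lab: {within:.2f}, between-lab: {between:.2f}")
```

With real data, a between-lab mean that is much larger than the within-lab mean would correspond to the pattern described above: models from one lab sit closer together than models from different labs.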
There’s another followup question that I would like to analyze: do models that share major components tend to cluster together? There’s a growing tendency for a given component (e.g. an ocean model, an atmosphere model) to show up in more than one lab’s GCM. It’s not yet clear how this affects variability in a multi-model ensemble.
So what are the lessons here? First, there is evidence that the use of multi-model ensembles is valuable and important, and that these ensembles capture the uncertainty much better than multiple runs of a single model (no matter how it is perturbed). The evidence suggests that models from different labs are significantly different from one another both scientifically and structurally, and at least part of the explanation for this is that labs tend to have different clusters of expertise across the full range of earth system processes. Studies that compare model results with observational data (e.g. Hargreaves & Annan; Masson & Knutti) show that the observations look no different from just another member of the multi-model ensemble (or to put it in Annan and Hargreaves’ terms, the truth is statistically indistinguishable from another model in the ensemble).
It would appear that the current arrangement of twenty or so different labs competing to build their own models is a remarkably robust approach to capturing the full range of scientific uncertainty with respect to climate processes. And hence it doesn’t make sense to attempt to consolidate this effort into one international lab.