This week, I start teaching a new grad course on computational models of climate change, aimed at computer science grad students with no prior background in climate science or meteorology. Here’s my brief blurb:

Detailed projections of future climate change are created using sophisticated computational models that simulate the physical dynamics of the atmosphere and oceans and their interaction with chemical and biological processes around the globe. These models have evolved over the last 60 years, along with scientists’ understanding of the climate system. This course provides an introduction to the computational techniques used in constructing global climate models, the engineering challenges in coupling and testing models of disparate earth system processes, and the scaling challenges involved in exploiting peta-scale computing architectures. The course will also provide a historical perspective on climate modelling, from the early ENIAC weather simulations created by von Neumann and Charney, through to today’s Earth System Models, and the role that these models play in the scientific assessments of the UN’s Intergovernmental Panel on Climate Change (IPCC). The course will also address the philosophical issues raised by the role of computational modelling in the discovery of scientific knowledge, the measurement of uncertainty, and a variety of techniques for model validation. Additional topics, based on interest, may include the use of multi-model ensembles for probabilistic forecasting, data assimilation techniques, and the use of models for re-analysis.

I’ve come up with a draft outline for the course, and some possible readings for each topic. Comments are very welcome:

  1. History of climate and weather modelling. Early climate science. Quick tour of range of current models. Overview of what we knew about climate change before computational modeling was possible.
  2. Calculating the weather. Bjerknes’ equations. ENIAC runs. What does a modern dynamical core do? [Includes basic introduction to thermodynamics of atmosphere and ocean]
  3. Chaos and complexity science. Key ideas: forcings, feedbacks, dynamic equilibrium, tipping points, regime shifts, systems thinking. Planetary boundaries. Potential for runaway feedbacks. Resilience & sustainability. (Way too many readings this week. Have to think about how to address this – maybe this is two weeks’ worth of material?)
    • Liepert, B. G. (2010). The physical concept of climate forcing. Wiley Interdisciplinary Reviews: Climate Change, 1(6), 786-802.
    • Manson, S. M. (2001). Simplifying complexity: a review of complexity theory. Geoforum, 32(3), 405-414.
    • Rind, D. (1999). Complexity and Climate. Science, 284(5411), 105-107.
    • Randall, D. A. (2011). The Evolution of Complexity In General Circulation Models. In L. Donner, W. Schubert, & R. Somerville (Eds.), The Development of Atmospheric General Circulation Models: Complexity, Synthesis, and Computation. Cambridge University Press.
    • Meadows, D. H. (2008). Chapter One: The Basics. Thinking In Systems: A Primer (pp. 11-34). Chelsea Green Publishing.
    • Randers, J. (2012). The Real Message of Limits to Growth: A Plea for Forward-Looking Global Policy, 2, 102-105.
    • Rockström, J., Steffen, W., Noone, K., Persson, Å., Chapin, F. S., Lambin, E., Lenton, T. M., et al. (2009). Planetary boundaries: exploring the safe operating space for humanity. Ecology and Society, 14(2), 32.
    • Lenton, T. M., Held, H., Kriegler, E., Hall, J. W., Lucht, W., Rahmstorf, S., & Schellnhuber, H. J. (2008). Tipping elements in the Earth’s climate system. Proceedings of the National Academy of Sciences of the United States of America, 105(6), 1786-93.
  4. Typology of climate models. Basic energy balance models. Adding a layered atmosphere. 3-D models. Coupling in other earth systems. Exploring dynamics of the socio-economic system. Other types of model: EMICs; IAMs.
  5. Earth System Modeling. Using models to study interactions in the earth system. Overview of key systems (carbon cycle, hydrology, ice dynamics, biogeochemistry).
  6. Overcoming computational limits. Choice of grid resolution; grid geometry, online versus offline; regional models; ensembles of simpler models; perturbed ensembles. The challenge of very long simulations (e.g. for studying paleoclimate).
  7. Epistemic status of climate models. E.g. what does a future forecast actually mean? How are model runs interpreted? Relationship between model and theory. Reproducibility and open science.
    • Shackley, S. (2001). Epistemic Lifestyles in Climate Change Modeling. In P. N. Edwards (Ed.), Changing the Atmosphere: Expert Knowledge and Environmental Government (pp. 107-133). MIT Press.
    • Sterman, J. D., Jr, E. R., & Oreskes, N. (1994). The Meaning of Models. Science, 264(5157), 329-331.
    • Randall, D. A., & Wielicki, B. A. (1997). Measurement, Models, and Hypotheses in the Atmospheric Sciences. Bulletin of the American Meteorological Society, 78(3), 399-406.
    • Smith, L. A. (2002). What might we learn from climate forecasts? Proceedings of the National Academy of Sciences of the United States of America, 99 Suppl 1, 2487-92.
  8. Assessing model skill - comparing models against observations, forecast validation, hindcasting. Validation of the entire modelling system. Problems of uncertainty in the data. Re-analysis, data assimilation. Model intercomparison projects.
  9. Uncertainty. Three different types: initial state uncertainty, scenario uncertainty and structural uncertainty. How well are we doing? Assessing structural uncertainty in the models. How different are the models anyway?
  10. Current Research Challenges. E.g. non-standard grids (non-rectangular, adaptive, etc.); probabilistic modelling, both fine grain (e.g. ECMWF work) and use of ensembles; petascale datasets; reusable couplers and software frameworks. (Need some more readings on different research challenges for this topic.)
  11. The future. Projecting future climates. Role of modelling in the IPCC assessments. What policymakers want versus what they get. Demands for actionable science and regional, decadal forecasting. The idea of climate services.
  12. Knowledge and wisdom. What the models tell us. Climate ethics. The politics of doubt. The understanding gap. Disconnect between our understanding of climate and our policy choices.

For a talk earlier this year, I put together a timeline of the history of climate modelling. I just updated it for my course, and now it’s up on Prezi, as a presentation you can watch and play with. Click the play button to follow the story, or just drag and zoom within the viewing pane to explore your own path.

Consider this a first draft though – if there are key milestones I’ve missed out (or misrepresented!) let me know!

In the talk I gave this week at the workshop on the CMIP5 experiments, I argued that we should do a better job of explaining how climate science works, especially the day-to-day business of working with models and data. I think we have a widespread problem that people outside of climate science have the wrong mental models about what a climate scientist does. As with any science, the day-to-day work might appear to be chaotic, with scientists dealing with the daily frustrations of working with large, messy datasets, having instruments and models not work the way they’re supposed to, and of course, the occasional mistake that you only discover after months of work. This doesn’t map onto the mental model that many non-scientists have of “how science should be done”, because the view presented in school, and in the media, is that science is about nicely packaged facts. In reality, it’s a messy process of frustrations, dead-end paths, and incremental progress exploring the available evidence.

Some climate scientists I’ve chatted to are nervous about exposing more of this messy day-to-day work. They already feel under constant attack, and they feel that allowing the public to peer under the lid (or if you prefer, to see inside the sausage factory) will only diminish people’s respect for the science. I take the opposite view – the more we present the science as a set of nicely polished results, the more potential there is for the credibility of the science to be undermined when people do manage to peek under the lid (e.g. by publishing internal emails). I think it’s vitally important that we work to clear away some of the incorrect mental models people have of how science is (or should be) done, and give people a better appreciation for how our confidence in scientific results gradually emerges from a slow, messy, collaborative process.

Giving people a better appreciation of how science is done would also help to overcome some of the games of ping pong you get in the media, where each new result in a published paper is presented as a startling new discovery, overturning previous research, and (if you’re in the business of selling newspapers, preferably) overturning an entire field. In fact, it’s normal for new published results to turn out to be wrong, and most of the interesting work in science is in reconciling apparently contradictory findings.

The problem is that these incorrect mental models of how science is done are often well entrenched, and the best that we can do is to try to chip away at them, by explaining at every opportunity what scientists actually do. For example, here’s a mental model I’ve encountered from time to time about how climate scientists build models to address the kinds of questions policymakers ask about the need for different kinds of climate policy:

This view suggests that scientists respond to a specific policy question by designing and building software models (preferably testing that the model satisfies its specification), and then running the model to answer the question. This is not the only (or even the most common?) layperson’s view of climate modelling, but the point is that there are many incorrect mental models of how climate models are developed and used, and one of the things we should strive to do is to work towards dislodging some of these by doing a better job of explaining the process.

With respect to climate model development, I’ve written before about how models slowly advance based on a process that roughly mimics the traditional view of “the scientific method” (I should acknowledge, for all the philosophy of science buffs, that there really isn’t a single, “correct” scientific method, but let’s keep that discussion for another day). So here’s how I characterize the day to day work of developing a model:

Most of the effort is spent identifying and diagnosing where the weaknesses in the current model are, and looking for ways to improve them. Each possible improvement then becomes an experiment, in which the experimental hypothesis might look like:

“if I change <piece of code> in <routine>, I expect it to have <specific impact on model error> in <output variable> by <expected margin> because of <tentative theory about climatic processes and how they’re represented in the model>”

The previous version of the model acts as a control, and the modified model is the experimental condition.
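To make the control-versus-experiment idea concrete, here is a minimal sketch of how such a hypothesis might be checked. Everything in it is an assumption for illustration: the function names, the choice of root-mean-square error as the error measure, and the numbers are all made up, and real model evaluation involves many more diagnostics than a single score.

```python
import math


def rmse(simulated, observed):
    """Root-mean-square error between a model output series and observations."""
    if len(simulated) != len(observed):
        raise ValueError("series must have the same length")
    squared = [(s - o) ** 2 for s, o in zip(simulated, observed)]
    return math.sqrt(sum(squared) / len(squared))


def evaluate_change(control, modified, observed, expected_margin):
    """Compare the modified model (experimental condition) with the control.

    Returns the reduction in error and whether it meets the margin
    stated in the hypothesis.
    """
    reduction = rmse(control, observed) - rmse(modified, observed)
    return reduction, reduction >= expected_margin


# Hypothetical values for one output variable (illustrative only).
observed = [14.0, 14.2, 14.1, 14.4]
control = [14.5, 14.7, 14.6, 14.9]   # previous model version
modified = [14.2, 14.4, 14.3, 14.6]  # model with the code change applied

reduction, supported = evaluate_change(control, modified, observed,
                                       expected_margin=0.2)
print(f"error reduced by {reduction:.2f}; hypothesis supported: {supported}")
```

The point of the sketch is only the shape of the comparison: the same observations, two versions of the model, and an expected margin stated in advance.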

But of course, this process isn’t just a random walk – it’s guided at the next level up by a number of influences, because the broader climate science community (and to some extent the meteorological community) are doing all sorts of related research, which then influences model development. In the paper we wrote about the software development processes at the UK Met Office, we portrayed it like this:

But I could go even broader and place this within a context in which a number of longer term observational campaigns (“process studies”) are collecting new types of observational data to investigate climate processes that are still poorly understood. This then involves the interaction of several distinct communities. Christian Jakob portrays it like this:

The point of Jakob’s paper, though, is to argue that the modelling and process studies communities don’t currently do enough of this kind of interaction, so there’s room for improvement in how the modelling influences the kinds of process studies needed, and how the results from process studies feed back into model development.

So, how else should we be explaining the day-to-day work of climate scientists?

I’m attending a workshop this week in which some of the initial results from the Fifth Coupled Model Intercomparison Project (CMIP5) will be presented. CMIP5 will form a key part of the next IPCC assessment report – it’s a coordinated set of experiments on the global climate models built by labs around the world. The experiments include hindcasts to compare model skill on pre-industrial and 20th Century climate, projections into the future for 100 and 300 years, shorter term decadal projections, paleoclimate studies, plus lots of other experiments that probe specific processes in the models. (For more explanation, see the post I wrote on the design of the experiments for CMIP5 back in September).

I’ve been looking at some of the data for the past CMIP exercises. CMIP1 originally consisted of one experiment – a control run with fixed forcings. The idea was to compare how each of the models simulates a stable climate. CMIP2 included two experiments, a control run like CMIP1, and a climate change scenario in which CO2 levels were increased by 1% per year. CMIP3 then built on these projects with a much broader set of experiments, and formed a key input to the IPCC Fourth Assessment Report.
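As a quick aside on what a 1% per year increase implies: it compounds, so the concentration doubles after roughly 70 years. A small sketch of the arithmetic follows; the function names are my own, and the starting concentration is left as a parameter rather than taken from the CMIP2 specification.

```python
import math


def co2_after(years, initial, annual_growth=0.01):
    """Concentration after compounding annual_growth for the given number of years."""
    return initial * (1.0 + annual_growth) ** years


def years_to_multiply(factor, annual_growth=0.01):
    """Years of compound growth needed to multiply the concentration by factor."""
    return math.log(factor) / math.log(1.0 + annual_growth)


doubling_time = years_to_multiply(2.0)
print(f"doubling takes about {doubling_time:.1f} years")
print(f"relative concentration after 70 years: {co2_after(70, 1.0):.3f}")
```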

There was no CMIP4, as the numbers were resynchronised to match the IPCC report numbers (also there was a thing called the Coupled Carbon Cycle Climate Model Intercomparison Project, which was nicknamed C4MIP, so it’s probably just as well!), so CMIP5 will feed into the fifth assessment report.

So here’s what I have found so far on the vital statistics of each project. Feel free to correct my numbers and help me to fill in the gaps!

| | CMIP (1996 onwards) | CMIP2 (1997 onwards) | CMIP3 (2005-2006) | CMIP5 (2010-2011) |
|---|---|---|---|---|
| Number of Experiments | 1 | 2 | 12 | 110 |
| Centres Participating | 16 | 18 | 15 | 24 |
| # of Distinct Models | 19 | 24 | 21 | 45 |
| # of Runs (Models X Expts) | 19 | 48 | 211 | 841 |
| Total Dataset Size | ?? | ?? | 36 TeraByte | 3.3 PetaByte |
| Total Downloads from archive | ?? | ?? | 1 PetaByte | |
| Number of Papers Published | | 47 | 595 | |
| Users | ?? | ?? | 6700 | |

[Update:] I’ve added a row for number of runs, i.e. the sum of the number of experiments run on each model (in CMIP3 and CMIP5, centres were able to pick a subset of the experiments to run, so you can’t just multiply models and experiments to get the number of runs). Also, I ought to calculate the total number of simulated years that represents (If a centre did all the CMIP5 experiments, I figure it would result in at least 12,000 simulated years).
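To illustrate the difference between the two ways of counting, here is a small sketch. The model names and experiment labels are hypothetical; only the counting logic matters. For comparison, straight multiplication of the figures in the table gives 21 × 12 = 252 for CMIP3 (against 211 actual runs) and 45 × 110 = 4950 for CMIP5 (against 841), whereas for CMIP1 and CMIP2 the product matches exactly (19 and 48).

```python
# Hypothetical record of which experiments each model was run on.
experiments_run = {
    "model_a": {"control", "historical", "one_percent_co2"},
    "model_b": {"control", "historical"},
    "model_c": {"control"},
}


def count_runs(experiments_by_model):
    """Number of runs: sum over models of the experiments each one ran."""
    return sum(len(expts) for expts in experiments_by_model.values())


def naive_product(experiments_by_model):
    """Models times experiments: only correct if every model ran everything."""
    all_experiments = set().union(*experiments_by_model.values())
    return len(experiments_by_model) * len(all_experiments)


total_runs = count_runs(experiments_run)
upper_bound = naive_product(experiments_run)
print(f"actual runs: {total_runs}, models x experiments: {upper_bound}")
```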

Oh, one more datapoint from this week. We came up with an estimate that by 2020, each individual experiment will generate an Exabyte of data. I’ll explain how we got this number once we’ve given the calculations a bit more of a thorough checking over.

Our paper on defect density analysis of climate models is now out for review at the journal Geoscientific Model Development (GMD). GMD is an open review / open access journal, which means the review process is publicly available (anyone can see the submitted paper, the reviews it receives during the process, and the authors’ response). If the paper is eventually accepted, the final version will also be freely available.

The way this works at GMD is that the paper is first published to Geoscientific Model Development Discussions (GMDD) as an un-reviewed manuscript. The interactive discussion is then open for a fixed period (in this case, 2 months). At that point the editors will make a final accept/reject decision, and, if accepted, the paper is then published to GMD itself. During the interactive discussion period, anyone can post comments on the paper, although in practice, discussion papers often only get comments from the expert reviewers commissioned by the editors.

One of the things I enjoy about the peer-review process is that a good, careful review can help improve the final paper immensely. As I’ve never submitted before to a journal that uses an open review process, I’m curious to see how the open reviewing will help – I suspect (and hope!) it will tend to make reviewers more constructive.

Anyway, here’s the paper. As it’s open review, anyone can read it and make comments (click the title to get to the review site):

Assessing climate model software quality: a defect density analysis of three models

J. Pipitone and S. Easterbrook
Department of Computer Science, University of Toronto, Canada

Abstract. A climate model is an executable theory of the climate; the model encapsulates climatological theories in software so that they can be simulated and their implications investigated. Thus, in order to trust a climate model one must trust that the software it is built from is built correctly. Our study explores the nature of software quality in the context of climate modelling. We performed an analysis of defect reports and defect fixes in several versions of leading global climate models by collecting defect data from bug tracking systems and version control repository comments. We found that the climate models all have very low defect densities compared to well-known, similarly sized open-source projects. We discuss the implications of our findings for the assessment of climate model software trustworthiness.

On Thursday, Kaitlin presented her poster at the AGU meeting, which shows the results of the study she did with us in the summer. Her poster generated a lot of interest, especially the visualizations she has of the different model architectures. Click on the thumbnail to see the full poster at the AGU site:

A few things to note when looking at the diagrams:

  • Each diagram shows the components of a model, scaled to their relative size by lines of code. However, the models are not to scale with one another, as the smallest, UVic’s, is only a tenth of the size of the biggest, CESM. Someone asked what accounts for that difference in size. Well, the UVic model is an EMIC rather than a GCM. It has a very simplified atmosphere model that does not include atmospheric dynamics, which makes it easier to run for very long simulations (e.g. to study paleoclimate). On the other hand, CESM is a community model, with a large number of contributors across the scientific community. (See Randall and Held’s point/counterpoint article in last month’s IEEE Software for a discussion of how these fit into different model development strategies).
  • The diagrams show the couplers (in grey), again sized according to number of lines of code. A coupler handles data re-gridding (when the scientific components use different grids), temporal aggregation (when the scientific components run on different time steps) along with other data handling. These are often invisible in diagrams the scientists create of their models, because they are part of the infrastructure code; however Kaitlin’s diagrams show how substantial they are in comparison with the scientific modules. The European models all use the same coupler, following a decade-long effort to develop this as a shared code resource.
  • Note that there are many different choices associated with the use of a coupler, as sometimes it’s easier to connect components directly rather than through the coupler, and the choice may be driven by performance impact, flexibility (e.g. ‘plug-and-play’ compatibility) and legacy code issues. Sea ice presents an interesting example, because its extent varies over the course of a model run. So somewhere there must be code that keeps track of which grid cells have ice, and then routes the fluxes from ocean and atmosphere to the sea ice component for these grid cells. This could be done in the coupler, or in any of the three scientific modules. In the GFDL model, sea ice is treated as an interface to the ocean, so all atmosphere-ocean fluxes pass through it, whether there’s ice in a particular cell or not.
  • The relative size of the scientific components is a reasonable proxy for functionality (or, if you like, scientific complexity/maturity). Hence, the diagrams give clues about where each lab has placed its emphasis in terms of scientific development, whether by deliberate choice, or because of availability (or unavailability) of different areas of expertise. The differences between the models from different labs show some strikingly different choices here, for example between models that are clearly atmosphere-centric, versus models that have a more balanced set of earth system components.
  • One comment we received in discussions around the poster was about the places where we have shown sub-components in some of the models. Some modeling groups are more explicit about naming the sub-components, and indicating them in the code. Hence, our ability to identify these might depend more on naming practices than on any fundamental architectural differences.
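For readers who haven’t seen inside a coupler, here is a toy sketch of two of the jobs mentioned above: temporal aggregation between components on different time steps, and routing fluxes to the sea ice component only for cells that currently have ice. This is not how any of the models on the poster actually do it; the function names, the dictionary-based grid, and the all-or-nothing ice mask are simplifying assumptions of mine (real couplers also handle re-gridding, fractional ice cover, and much else).

```python
def aggregate_in_time(fast_values, steps_per_exchange):
    """Average a fast component's values onto a slower component's time step."""
    if steps_per_exchange <= 0 or len(fast_values) % steps_per_exchange != 0:
        raise ValueError("fast steps must divide evenly into exchange steps")
    return [
        sum(fast_values[i:i + steps_per_exchange]) / steps_per_exchange
        for i in range(0, len(fast_values), steps_per_exchange)
    ]


def route_fluxes(atmosphere_flux, has_ice):
    """Send each cell's flux to sea ice if the cell has ice, else to the ocean."""
    to_sea_ice, to_ocean = {}, {}
    for cell, flux in atmosphere_flux.items():
        target = to_sea_ice if has_ice.get(cell, False) else to_ocean
        target[cell] = flux
    return to_sea_ice, to_ocean


# Hypothetical values: four fast atmosphere steps per two exchange steps.
exchanged = aggregate_in_time([100.0, 120.0, 80.0, 60.0], steps_per_exchange=2)

# Hypothetical per-cell fluxes and an ice mask that would change during a run.
flux_by_cell = {(0, 0): 110.0, (0, 1): 95.0, (1, 0): 70.0}
ice_mask = {(0, 0): True, (0, 1): False, (1, 0): False}
to_sea_ice, to_ocean = route_fluxes(flux_by_cell, ice_mask)
```

Where a function like `route_fluxes` lives (in the coupler or in one of the scientific components) is exactly the kind of design choice described in the list above.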

I’m sure Kaitlin will blog more of her reflections on the poster (and AGU in general) once she’s back home.

I’m at the AGU meeting in San Francisco this week. The internet connections in the meeting rooms suck, so I won’t be twittering much, but will try and blog any interesting talks. But first things first! I presented my poster in the session on “Methodologies of Climate Model Evaluation, Confirmation, and Interpretation” yesterday morning. Nice to get my presentation out of the way early, so I can enjoy the rest of the conference.

Here’s my poster, and the abstract is below (click for the full sized version at the AGU ePoster site):

A Hierarchical Systems Approach to Model Validation

Introduction

Discussions of how climate models should be evaluated tend to rely on either philosophical arguments about the status of models as scientific tools, or on empirical arguments about how well runs from a given model match observational data. These lead to quantitative measures expressed in terms of model bias or forecast skill, and ensemble approaches where models are assessed according to the extent to which the ensemble brackets the observational data.

Such approaches focus the evaluation on models per se (or more specifically, on the simulation runs they produce), as if the models can be isolated from their context. Such approaches may overlook a number of important aspects of the use of climate models:

  • the process by which models are selected and configured for a given scientific question.
  • the process by which model outputs are selected, aggregated and interpreted by a community of expertise in climatology.
  • the software fidelity of the models (i.e. whether the running code is actually doing what the modellers think it’s doing).
  • the (often convoluted) history that begat a given model, along with the modelling choices long embedded in the code.
  • variability in the scientific maturity of different components within a coupled earth system model.

These omissions mean that quantitative approaches cannot assess whether a model produces the right results for the wrong reasons, or conversely, the wrong results for the right reasons (where, say, the observational data is problematic, or the model is configured to be unlike the earth system for a specific reason).

Furthermore, quantitative skill scores only assess specific versions of models, configured for specific ensembles of runs; they cannot reliably make any statements about other configurations built from the same code.

Quality as Fitness for Purpose

The problem is that there is no such thing as “the model”. The body of code that constitutes a modern climate model actually represents an enormous number of possible models, each corresponding to a different way of configuring that code for a particular run. Furthermore, this body of code isn’t a static thing. The code is changed on a daily basis, through a continual process of experimentation and model improvement. This applies even to any specific “official release”, which again is just a body of code that can be configured to run as any of a huge number of different models, and again, is not unchanging – as with all software, there will be occasional bugfix releases applied to it, along with improvements to the ancillary datasets.

Evaluation of climate models should not be about “the model”, but about the relationship between a modelling system and the purposes to which it is put. More precisely, it’s about the relationship between particular ways of building and configuring models and the ways in which the runs produced by those models are used.

What are the uses of a climate model? They vary tremendously:

  • To provide inputs to assessments of the current state of climate science;
  • To explore the consequences of a current theory;
  • To test a hypothesis about the observational system (e.g. forward modeling);
  • To test a hypothesis about the calculational system (e.g. to explore known weaknesses);
  • To provide homogenized datasets (e.g. re-analysis);
  • To conduct thought experiments about different climates;
  • To act as a comparator when debugging another model;

In general, we can distinguish three separate systems: the calculational system (the model code); the theoretical system (current understandings of climate processes) and the observational system. In the most general sense, climate models are developed to explore how well our current understanding (i.e. our theories) of climate explains the available observations. And of course the inverse: what additional observations might we make to help test our theories.

We're dealing with relationships between three different systems

Validation of the Entire Modeling System

When we ask questions about likely future climate change, we don’t ask the question of the calculational system, we ask it of the theoretical system; the models are just a convenient way of probing the theory to provide answers.
When society asks climate scientists for future projections, the question is directed at climate scientists, not their models. Modellers apply their judgment to select appropriate versions & configurations of the models to use, set up the runs, and interpret the results in the light of what is known about the models’ strengths and weaknesses and about any gaps between the computational models and the current theoretical understanding. And they add all sorts of caveats to the conclusions they draw from the model runs when they present their results.

Validation is not a post-hoc process to be applied to an individual “finished” model, to ensure it meets some criteria for fidelity to the real world. In reality, there is no such thing as a finished model, just many different snapshots of a large set of model configurations, steadily evolving as the science progresses. Knowing something about the fidelity of a given model configuration to the real world is useful, but not sufficient to address fitness for purpose. For this, we have to assess the extent to which climate models match our current theories, and the extent to which the process of improving the models keeps up with theoretical advances.

Summary

Our approach to model validation extends current approaches:

  • down into the detailed codebase to explore the processes by which the code is built and tested. Thus, we build up a picture of the day-to-day practices by which modellers make small changes to the model and test the effect of such changes (both in isolated sections of code, and on the climatology of a full model). The extent to which these practices improve the confidence and understanding of the model depends on how systematically this testing process is applied, and how many of the broad range of possible types of testing are applied. We also look beyond testing to other software practices that improve trust in the code, including automated checking for conservation of mass across the coupled system, and various approaches to spin-up and restart testing.
  • up into the broader scientific context in which models are selected and used to explore theories and test hypotheses. Thus, we examine how features of the entire scientific enterprise improve (or impede) model validity, from the collection of observational data, creation of theories, use of these theories to develop models, choices for which model and which model configuration to use, choices for how to set up the runs, and interpretation of the results. We also look at how model inter-comparison projects provide a de facto benchmarking process, leading in turn to exchanges of ideas between modelling labs, and hence advances in the scientific maturity of the models.

This layered approach does not attempt to quantify model validity, but it can provide a systematic account of how the detailed practices involved in the development and use of climate models contribute to the quality of modelling systems and the scientific enterprise that they support. By making the relationships between these practices and model quality more explicit, we expect to identify specific strengths and weaknesses of the modelling systems, particularly with respect to structural uncertainty in the models, and better characterize the “unknown unknowns”.

I had several interesting conversations at WCRP11 last week about how different the various climate models are. The question is important because it gives some insight into how much an ensemble of different models captures the uncertainty in climate projections. Several speakers at WCRP suggested we need an international effort to build a new, best of breed climate model. For example, Christian Jakob argued that we need a “Manhattan project” to build a new, more modern climate model, rather than continuing to evolve our old ones (I’ve argued in the past that this is not a viable approach). There have also been calls for a new international climate modeling centre, with the resources to build much larger supercomputing facilities.

The counter-argument is that the current diversity in models is important, and re-allocating resources to a single centre would remove this benefit. Currently around 20 or so different labs around the world build their own climate models to participate in the model inter-comparison projects that form a key input to the IPCC assessments. Part of the argument for this diversity of models is that when different models give similar results, that boosts our confidence in those results, and when they give different results, the comparisons provide insights into how well we currently understand and can simulate the climate system. For assessment purposes, the spread of the models is often taken as a proxy for uncertainty, in the absence of any other way of calculating error bars for model projections.

But that raises a number of questions. How well does the current set of coupled climate models capture the uncertainty? How different are the models really? Do they all share similar biases? And can we characterize how model intercomparisons feed back into progress in improving the models? I think we’re starting to get interesting answers to the first two of these questions, while the last two are still unanswered.

First, then, is the question of representing uncertainty. There are, of course, a number of sources of uncertainty. [Note that 'uncertainty' here doesn't mean 'ignorance' (a mistake often made by non-scientists); it means, roughly, how big should the error bars be when we make a forecast, or more usefully, what does the probability distribution look like for different climate outcomes?]. In climate projections, sources of uncertainty can be grouped into three types:

  • Internal variability: natural fluctuations in the climate (for example, the year-to-year differences caused by the El Niño Southern Oscillation, ENSO);
  • Scenario uncertainty: the uncertainty over future carbon emissions, land use changes, and other types of anthropogenic forcings. As we really don’t know how these will change year-by-year in the future (irrespective of whether any explicit policy targets are set), it’s hard to say exactly how much climate change we should expect.
  • Model uncertainty: the range of different responses to the same emissions scenario given by different models. Such differences arise, presumably, because we don’t understand all the relevant processes in the climate system perfectly. This is the kind of uncertainty that a large ensemble of different models ought to be able to assess.

Hawkins and Sutton analyzed the impact of these different types of uncertainty on projections of global temperature over the range of a century. Here, Fractional Uncertainty means the ratio of the model spread to the projected temperature change (against a 1971-2000 mean):

This analysis shows that for short term (decadal) projections, the internal variability is significant. Finding ways of reducing this (for example by better model initialization from the current state of the climate) is important for the kind of near-term regional projections needed by, for example, city planners, and utility and insurance companies. Hawkins & Sutton indicate with dashed lines some potential to reduce this uncertainty for decadal projections through better initialization of the models.

For longer term (century) projections, internal variability is dwarfed by scenario uncertainty. However, if we’re clear about the nature of the scenarios used, we can put scenario uncertainty aside and treat model runs as “what-if” explorations – if the emissions follow a particular pathway over the 21st Century, what climate response might we expect?

Model uncertainty remains significant over both short and long term projections. The important question here for predicting climate change is how much of this range of different model responses captures the real uncertainties in the science itself. In the analysis above, the variability due to model differences is about 1/4 of the magnitude of the mean temperature rise projected for the end of the century. For example, if a given emissions scenario leads to a model mean of +4°C, the model spread would be about 1°C, yielding a projection of +4±0.5°C. So is that the right size for an error bar on our end-of-century temperature projections? Or, to turn the question around, what is the probability of a surprise – where the climate change turns out to fall outside the range represented by the current model ensemble?
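The arithmetic in that example can be written out as a small sketch. It follows the simplified reading used above, where the fractional uncertainty is the total model spread divided by the mean projected change, and the spread is split evenly either side of the mean. The function names are made up for illustration, and this is not how Hawkins and Sutton compute their figures.

```python
def fractional_uncertainty(spread_c, mean_change_c):
    """Ratio of the model spread to the mean projected temperature change."""
    if mean_change_c == 0:
        raise ValueError("mean projected change must be non-zero")
    return spread_c / mean_change_c


def error_bar(mean_change_c, fraction):
    """Turn a fractional uncertainty into a (low, high) range around the mean.

    Assumes the spread is symmetric about the model mean, which is a
    simplification made only for this illustration.
    """
    spread = mean_change_c * fraction
    half = spread / 2.0
    return (mean_change_c - half, mean_change_c + half)


# The example from the text: a model mean of +4 degrees C and a fractional
# uncertainty of about 1/4 gives a spread of 1 degree, i.e. 4 plus or minus 0.5.
low, high = error_bar(4.0, 0.25)
print(low, high)  # 3.5 4.5
```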

Just as importantly, is the model ensemble mean the most likely outcome? Or do the models share certain biases so that the truth is somewhere other than the multi-model mean? Last year, James Annan demolished the idea that the models cluster around the truth, and in a paper with Julia Hargreaves, provided some evidence that the model ensembles do a relatively good job of bracketing the observational data, and, if anything, the ensemble spread is too broad. If the latter point is correct, then the model ensembles over-estimate the uncertainty.

This brings me to the question of how different the models really are. Over the summer, Kaitlin Alexander worked with me to explore the software architecture of some of the models that I’ve worked with from Europe and N. America. The first thing that jumped out at me when she showed me her diagrams was how different the models all look from one another. Here are six of them presented side-by-side. The coloured ovals indicate the size (in lines of code) of each major model component (relative to other components in the same model; the different models are not shown to scale), and the coloured arrows indicate data exchanges between the major components (see Kaitlin’s post for more details):

There are clearly differences in how the components are coupled together (for example, whether all data exchanges pass through a coupler, or whether components interact directly). In some cases, major subcomponents are embedded as subroutines within a model component, which makes the architecture harder to understand, but may make sense from a scientific point of view, when earth system processes themselves are tightly coupled. However, such differences in the code might just be superficial, as the choice of call structure should not, in principle, affect the climatology.

The other significant difference is in the relative sizes of the major components. Lines of code isn’t necessarily a reliable measure, but it usually offers a reasonable proxy for the amount of functionality. So a model with an atmosphere model dramatically bigger than the other components indicates a model for which far more work (and hence far more science) has gone into modeling the atmosphere than the other components.

Compare for example, the relative sizes of the atmosphere and ocean components for HadGEM3 and IPSLCM5A, which, incidentally, both use the same ocean model, NEMO. HadGEM3 has a much bigger atmosphere model, representing more science, or at least many more options for different configurations. In part, this is because the UK Met Office is an operational weather forecasting centre, and the code base is shared between NWP and climate research. Daily use of this model for weather forecasting offers many opportunities to improve the skill of the model (although improvement in skill in short term weather forecasting doesn’t necessarily imply improvements in skill for climate simulations). However, the atmosphere model is the biggest beneficiary of this process, and, in fact, the UK Met Office does not have much expertise in ocean modeling. In contrast, the IPSL model is the result of a collaboration between several similarly sized research groups, representing different earth subsystems.

But do these architectural differences show up as scientific differences? I think they do, but I was finding this hard to analyze. Then I had a fascinating conversation at WCRP last week with Reto Knutti, who showed me a recent paper that he published with D. Masson, in which they analyzed model similarity from across the CMIP3 dataset. The paper describes a cluster analysis over all the CMIP3 models (plus three re-analysis datasets, to represent observations), based on how well they capture the full spatial field for temperature (on the left) and precipitation (on the right). The cluster diagrams look like this (click for bigger):

In these diagrams, the models from the same lab are coloured the same. Observational data are in pale blue (three observational datasets were included for temperature, and two for precipitation). Some obvious things jump out: the different observational datasets are more similar to each other than they are to any other model, but as a cluster, they don’t look any different from the models. Interestingly, models from the same lab tend to be more similar to one another, even when these span different model generations. For example, for temperature, the UK Met Office models HadCM3 and HadGEM1 are more like each other than they are like any other models, even though they run at very different resolutions, and have different ocean models. For precipitation, all the GISS models cluster together and are quite different from all the other models.

The overall conclusion from this analysis is that using models from just one lab (even in very different configurations, and across model generations) gives you a lot less variability than using models from different labs. Which does suggest that there’s something in the architectural choices made at each lab that leads to a difference in the climatology. In the paper, Masson & Knutti go on to analyze perturbed physics ensembles, and show that the same effect shows up here too. Taking a single model, and systematically varying the parameters used in the model physics still gives you less variability than using models from different labs.
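To give a feel for what a cluster analysis of this kind involves, here is a minimal sketch. It is not the method from the Masson & Knutti paper: the distance measure (root mean squared difference between flattened fields), the average-linkage merging rule, and the model names and numbers are all assumptions chosen to keep the example short.

```python
import math
from itertools import combinations


def rms_distance(a, b):
    """Root mean squared difference between two equally sized fields."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


def average_linkage_clusters(fields, n_clusters):
    """Agglomerative clustering: repeatedly merge the two closest clusters.

    `fields` maps a model name to a flattened spatial field (list of floats).
    The distance between two clusters is the mean pairwise distance between
    their members (average linkage).
    """
    clusters = [[name] for name in fields]

    def cluster_distance(c1, c2):
        dists = [rms_distance(fields[a], fields[b]) for a in c1 for b in c2]
        return sum(dists) / len(dists)

    while len(clusters) > n_clusters:
        # Find the pair of clusters that are closest to one another.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: cluster_distance(clusters[p[0]], clusters[p[1]]),
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return [sorted(c) for c in clusters]


# Hypothetical three-point "fields" for two models from each of two labs.
example_fields = {
    "labA_v1": [1.0, 2.0, 3.0],
    "labA_v2": [1.1, 2.1, 3.1],
    "labB_v1": [3.0, 1.0, 0.0],
    "labB_v2": [3.2, 0.9, 0.1],
}
print(sorted(average_linkage_clusters(example_fields, 2)))
```

With these made-up numbers, the two models from each lab end up in the same cluster, which is the pattern described above for the real CMIP3 models.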

There’s another followup question that I would like to analyze: do models that share major components tend to cluster together? There’s a growing tendency for a given component (e.g. an ocean model, an atmosphere model) to show up in more than one lab’s GCM. It’s not yet clear how this affects variability in a multi-model ensemble.

So what are the lessons here? First, there is evidence that the use of multi-model ensembles is valuable and important, and that these ensembles capture the uncertainty much better than multiple runs of a single model (no matter how it is perturbed). The evidence suggests that models from different labs are significantly different from one another both scientifically and structurally, and at least part of the explanation for this is that labs tend to have different clusters of expertise across the full range of earth system processes. Studies that compare model results with observational data (e.g. Hargreaves & Annan; Masson & Knutti) show that the observations look no different from just another member of the multi-model ensemble (or to put it in Annan and Hargreaves’ terms, the truth is statistically indistinguishable from another model in the ensemble).

It would appear that the current arrangement of twenty or so different labs competing to build their own models is a remarkably robust approach to capturing the full range of scientific uncertainty with respect to climate processes. And hence it doesn’t make sense to attempt to consolidate this effort into one international lab.

One of the questions I’ve been chatting to people about this week at the WCRP Open Science Conference is whether climate modelling needs to be reorganized as an operational service, rather than as a scientific activity. The two respond to quite different goals, and hence would be organized very differently:

  • An operational modelling centre would prioritize stability and robustness of the code base, and focus on supporting the needs of (non-scientist) end-users who want models and model results.
  • A scientific modelling centre focusses on supporting scientists themselves as users. The key priority here is to support the scientists’ need to get their latest ideas into the code, to run experiments and get data ready to support publication of new results. (This is what most climate modeling centres do right now).

Both need good software practices, but those practices would look very different in the case when the scientists are building code for their own experiments, versus serving the needs of other communities. There are also very different resource implications: an operational centre that serves the needs of a much more diverse set of stakeholders would need a much larger engineering support team in relation to the scientific team.

The question seems very relevant to the conference this week, as one of the running themes has been the question of what “climate services” might look like. Many of the speakers call for “actionable science”, and there has been a lot of discussion of how scientists should work with various communities who need knowledge about climate to inform their decision-making.

And there’s clearly a gap here, with lots of criticism of how it works at the moment. For example, here’s a great quote from Bruce Hewitson on the current state of climate information:

“A proliferation of portals and data sets, developed with mixed motivations, with poorly articulated uncertainties and weakly explained assumptions and dependencies, the data implied as information, displayed through confusing materials, hard to find or access, written in opaque language, and communicated by interface organizations only semi‐aware of the nuances, to a user community poorly equipped to understand the information limitations”

I can’t argue with any of that. But it raises the question of whether solving this problem requires a reconceptualization of climate modeling activities to make them much more like operational weather forecasting centres.

Most of the people I spoke to this week think that’s the wrong paradigm. In weather forecasting, the numerical models play a central role, and become the workhorse for service provision. The models are run every day, to supply all sorts of different types of forecasts to a variety of stakeholders. Sure, a weather forecasting service also needs to provide expertise to interpret model runs (and of course, also needs a vast data collection infrastructure to feed the models with observations). But in all of this, the models are absolutely central.

In contrast, for climate services, the models are unlikely to play such a central role. Take for example, the century-long runs, such as those used in the IPCC assessments. One might think that these model runs represent an “operational service” provided to the IPCC as an external customer. But this is a fundamentally mistaken view of what the IPCC is and what it does. The IPCC is really just the scientific community itself, reviewing and assessing the current state of the science. The CMIP5 model runs currently being done in preparation for the next IPCC assessment report, AR5, are conducted by, and for, the science community itself. Hence, these runs have to come from science labs working at the cutting edge of earth system modelling. An operational centre one step removed from the leading science would not be able to provide what the IPCC needs.

One can criticize the IPCC for not doing enough to translate the scientific knowledge into something that’s “actionable” for different communities that need such knowledge. But that criticism isn’t really about the modeling effort (e.g. the CMIP5 runs) that contributes to the Working Group 1 reports. It’s about how the implications of the Working Group 1 reports translate into useful information in Working Groups 2 and 3.

The stakeholders who need climate services won’t be interested in century-long runs. At most they’re interested in decadal forecasts (a task that is itself still in its infancy, and a long way from being ready for operational forecasting). More often, they will want help interpreting observational data and trends, and assessing impacts on health, infrastructure, ecosystems, agriculture, water, etc. While such services might make use of data from climate model runs, they generally won’t involve running models regularly in an operational mode. Instead the needs would be more focussed on downscaling the outputs from existing model run datasets. And sitting somewhere between current weather forecasting and long term climate projections is the need for seasonal forecasts and regional analysis of trends, attribution of extreme events, and so on.

So I don’t think it makes sense for climate modelling labs to move towards an operational modelling capability. Climate modeling centres will continue to focus primarily on developing models for use within the scientific community itself. Organizations that provide climate services might need to develop their own modelling capability, focussed more on high resolution, short term (decadal or shorter) regional modelling, and of course, on assessment models that explore the interaction of socio-economic factors and policy choices. Such assessment models would make use of basic climate data from global circulation models (for example, calculations of climate sensitivity, and spatial distributions of temperature change), but don’t connect directly with climate modeling.

12. October 2011 · Categories: climate modeling

We’ve just announced a special issue of the Open Access Journal Geoscientific Model Development (GMD):

Call for Papers: Special Issue on Community software to support the delivery of CMIP5

CMIP5 represents the most ambitious and computer-intensive model inter-comparison project ever attempted. Integrating a new generation of Earth system models and sharing the model results with a broad community has brought with it many significant technical challenges, along with new community-wide efforts to provide the necessary software infrastructure. This special issue will focus on the software that supports the scientific enterprise for CMIP5, including: couplers and coupling frameworks for Earth system models; the Common Information Model and Controlled Vocabulary for describing models and data; The development of the Earth System Grid Federation; the development of new portals for providing data access to different end-user communities; the scholarly publishing of datasets, and studies of the software development and testing processes used for the CMIP5 models. We especially welcome papers that offer comparative studies of the software approaches taken by different groups, and lessons learnt from community efforts to create shareable software components and frameworks.

See here for submission instructions. The call is open ended, as we can keep adding papers to the special issue. We’ve solicited papers from some of the software projects involved in CMIP5, but welcome unsolicited submissions too.

GMD operates an open review process, whereby submitted papers are posted to the open discussion site (known as GMDD), so that both the invited reviewers and anyone else can make comments on the papers and then discuss such comments with the authors, prior to a final acceptance decision for the journal. I was appointed to the editorial board earlier this year, and am currently getting my first taste of how this works – I’m looking forward to applying this idea to our special issue.

Valdivino, who is working on a PhD in Brazil, on formal software verification techniques, is inspired by my suggestion to find ways to apply our current software research skills to climate science. But he asks some hard questions:

1.) If I want to Validate and Verify climate models should I forget all the things that I have learned so far in the V&V discipline? (e.g. Model-Based Testing (Finite State Machine, Statecharts, Z, B), structural testing, code inspection, static analysis, model checking)
2.) Among all V&V techniques, what can really be reused / adapted for climate models?

Well, I wish I had some good answers. When I started looking at the software development processes for climate models, I expected to be able to apply many of the formal techniques I’ve worked on in the past in Verification and Validation (V&V) and Requirements Engineering (RE). It turns out almost none of it seems to apply, at least in any obvious way.

Climate models are built through a long, slow process of trial and error, continually seeking to improve the quality of the simulations (see here for an overview of how they’re tested). As this is scientific research, it’s unknown, a priori, what will work, what’s computationally feasible, etc. Worse still, the complexity of the earth systems being studied means it’s often hard to know which processes in the model most need work, because the relationship between particular earth system processes and the overall behaviour of the climate system is exactly what the researchers are working to understand.

Which means that model development looks most like an agile software development process, where both the requirements and the set of techniques needed to implement them are unknown (and unknowable) up-front. So they build a little, and then explore how well it works. The closest they come to a formal specification is a set of hypotheses along the lines of:

“if I change <piece of code> in <routine>, I expect it to have <specific impact on model error> in <output variable> by <expected margin> because of <tentative theory about climatic processes and how they’re represented in the model>”

This hypothesis can then be tested by a formal experiment in which runs of the model with and without the altered code become two treatments, assessed against the observational data for some relevant period in the past. The expected improvement might be a reduction in the root mean squared error for some variable of interest, or just as importantly, an improvement in the variability (e.g. the seasonal or diurnal spread).
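As a rough illustration of that kind of experiment, the sketch below scores two treatments (the model with and without the code change) against the same observations, using root mean squared error. The function names and the idea of reducing each run to a simple list of values for one output variable are assumptions made for the example; real evaluations work on full spatial and temporal fields and also look at variability.

```python
import math


def rmse(simulated, observed):
    """Root mean squared error of a simulated series against observations."""
    if len(simulated) != len(observed):
        raise ValueError("series must have the same length")
    squared = [(s - o) ** 2 for s, o in zip(simulated, observed)]
    return math.sqrt(sum(squared) / len(squared))


def compare_treatments(control, modified, observed):
    """Score both treatments against the same observational data.

    A positive `improvement` means the modified code reduced the error
    for this variable; a negative value means it made things worse.
    """
    control_err = rmse(control, observed)
    modified_err = rmse(modified, observed)
    return {
        "control_rmse": control_err,
        "modified_rmse": modified_err,
        "improvement": control_err - modified_err,
    }


# Made-up values for a single output variable, purely for illustration.
observed = [14.0, 14.2, 14.1, 14.4]
control_run = [14.5, 14.7, 14.6, 14.9]   # run without the code change
modified_run = [14.1, 14.3, 14.2, 14.5]  # run with the code change
print(compare_treatments(control_run, modified_run, observed))
```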

The whole process looks a bit like this (although, see Jakob’s 2010 paper for a more sophisticated view of the process):

And of course, the central V&V technique here is full integration testing. The scientists build and run the full model to conduct the end-to-end tests that constitute the experiments.

So the closest thing they have to a specification would be a chart such as the following (courtesy of Tim Johns at the UK Met Office):

This chart shows how well the model is doing on 34 selected output variables (click the graph to see a bigger version, to get a sense of what the variables are). The scores for the previous model version have been normalized to 1.0, so you can quickly see whether the new model version did better or worse for each output variable – the previous model version is the line at “1.0” and the new model version is shown as the coloured dots above and below the line. The whiskers show the target skill level for each variable. If the coloured dots are within the whisker for a given variable, then the model is considered to be within the variability range for the observational data for that variable. Colour-coded dots then show how well the current version did: green dots mean it’s within the target skill range, yellow means it’s outside the target range, but did better than the previous model version, and red means it’s outside the target and did worse than the previous model version.
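The colour-coding rule can be captured in a few lines. This is only a sketch of the rule as described here, not the Met Office's code, and it assumes that the normalized score is an error measure, so that a value below the previous version's 1.0 counts as "better". The function name and the example target range are hypothetical.

```python
def classify_score(new_score, target_low, target_high, previous_score=1.0):
    """Return the dot colour for one output variable.

    Assumption: scores are normalized errors, so lower is better and the
    previous model version sits at 1.0.
    """
    if target_low <= new_score <= target_high:
        return "green"   # within the target skill range (the whisker)
    if new_score < previous_score:
        return "yellow"  # outside the target, but better than before
    return "red"         # outside the target, and no better than before


# Hypothetical target range of 0.6 to 0.8 for some variable.
print(classify_score(0.7, 0.6, 0.8))  # green
print(classify_score(0.9, 0.6, 0.8))  # yellow
print(classify_score(1.2, 0.6, 0.8))  # red
```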

Now, as we know, agile software practices aren’t really amenable to any kind of formal verification technique. If you don’t know what’s possible before you write the code, then you can’t write down a formal specification (the ‘target skill levels’ in the chart above don’t count – these are aspirational goals rather than specifications). And if you can’t write down a formal specification for the expected software behaviour, then you can’t apply formal reasoning techniques to determine if the specification was met.

So does this really mean, as Valdivino suggests, that we can’t apply any of our toolbox of formal verification methods? I think attempting to answer this would make a great research project. I have some ideas for places to look where such techniques might be applicable. For example:

  • One important built-in check in a climate model is ‘conservation of mass’. Some fluxes move mass between the different components of the model. Water is an obvious one – it’s evaporated from the oceans, to become part of the atmosphere, and is then passed to the land component as rain, thence to the rivers module, and finally back to the ocean. All the while, the total mass of water across all components must not change. Similar checks apply to salt, carbon (actually this does change due to emissions), and various trace elements. At present, such checks are built into the models as code assertions. In some cases, flux corrections were necessary because of imperfections in the numerical routines or the geometry of the grid, although in most cases, the models have improved enough that most flux corrections have been removed. But I think you could automatically extract from the code an abstracted model capturing just the ways in which these quantities change, and then use a model checker to track down and reason about such problems.
  • A more general version of the previous idea: In some sense, a climate model is a giant state-machine, but the scientists don’t ever build abstracted versions of it – they only work at the code level. If we build more abstracted models of the major state changes in each component of the model, and then do compositional verification over a combination of these models, it *might* offer useful insights into how the model works and how to improve it. At the very least, it would be an interesting teaching tool for people who want to learn about how a climate model works.
  • Climate modellers generally don’t use unit testing. The challenge here is that they find it hard to write down correctness properties for individual code units. I’m not entirely clear how formal methods could help, but it seems like someone with experience of patterns for temporal logic properties might be able to contribute. Clune and Rood have a forthcoming paper on this in November’s IEEE Software. I suspect this is one of the easiest places to get started for software people new to climate models.
  • There’s one other kind of verification test that is currently done by inspection, but might be amenable to some kind of formalization: the check that the code correctly implements a given mathematical formula. I don’t think this will be a high value tool, as the Fortran code is close enough to the mathematics that simple inspection is already very effective. But occasionally a subtle bug slips through – for example, I came across an example where the modellers discovered they had used the wrong logarithm (log_e in place of log_10), although this was more due to lack of clarity in the original published paper, rather than a coding error.
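To make the first idea in this list a little more concrete, here is a toy version of a conservation check for water. Real models do this in Fortran over gridded fields; the component names, units, and the `transfer` and `assert_conserved` helpers below are hypothetical, and the tolerance is an arbitrary choice for the sketch.

```python
import math


def total_water(components):
    """Total mass of water summed over all model components."""
    return sum(components.values())


def transfer(components, source, destination, amount):
    """Return a new state with `amount` of water moved between components."""
    new_state = dict(components)
    new_state[source] -= amount
    new_state[destination] += amount
    return new_state


def assert_conserved(before, after, rel_tol=1e-12):
    """Fail loudly if the total mass of water has changed."""
    if not math.isclose(total_water(before), total_water(after), rel_tol=rel_tol):
        raise AssertionError(
            f"water not conserved: {total_water(before)} -> {total_water(after)}"
        )


# One trip around the cycle described above, in arbitrary units.
state = {"ocean": 1000.0, "atmosphere": 10.0, "land": 50.0, "rivers": 5.0}
step = transfer(state, "ocean", "atmosphere", 2.0)  # evaporation
step = transfer(step, "atmosphere", "land", 2.0)    # rain
step = transfer(step, "land", "rivers", 2.0)        # runoff
step = transfer(step, "rivers", "ocean", 2.0)       # river outflow
assert_conserved(state, step)
```

An abstracted model for a model checker would go further than this, since it would need to cover every code path that changes the quantity rather than one hand-written sequence of transfers.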

Feel free to suggest more ideas in the comments!

Over the next few years, you’re likely to see a lot of graphs like this (click for a bigger version):

This one is from a forthcoming paper by Meehl et al, and was shown by Jerry Meehl in his talk at the Annecy workshop this week. It shows the results for just a single model, CCSM4, so it shouldn’t be taken as representative yet. The IPCC assessment will use graphs taken from ensembles of many models, as model ensembles have been shown to be consistently more reliable than any single model (the models tend to compensate for each other’s idiosyncrasies).

But as a first glimpse of the results going into IPCC AR5, I find this graph fascinating:

  • The extension of a higher emissions scenario out to three centuries shows much more dramatically how the choices we make in the next few decades can profoundly change the planet for centuries to come. For IPCC AR4, only the lower scenarios were run beyond 2100. Here, we see that a scenario that gives us 5 degrees of warming by the end of the century is likely to give us that much again (well over 9 degrees) over the next three centuries. In the past, people talked too much about temperature change at the end of this century, without considering that the warming is likely to continue well beyond that.
  • The explicit inclusion of two mitigation scenarios (RCP2.6 and RCP4.5) gives good reason for optimism about what can be achieved through a concerted global strategy to reduce emissions. It is still possible to keep warming below 2 degrees. But, as I discuss below, the optimism is bounded by some hard truths about how much adaptation will still be necessary – even in this wildly optimistic case, the temperature drops only slowly over the three centuries, and still ends up warmer than today, even at the year 2300.

As the approach to these model runs has changed so much since AR4, a few words of explanation might be needed.

First, note that the zero point on the temperature scale is the global average temperature for 1986-2005. That’s different from the baseline used in the previous IPCC assessment, so you have to be careful with comparisons. I’d much prefer they used a pre-industrial baseline – to get that, you have to add 1 (roughly!) to the numbers on the y-axis on this graph. I’ll do that throughout this discussion.

I introduced the RCPs (“Representative Concentration Pathways”) a little in my previous post. Remember, these RCPs were carefully selected from the work of the integrated assessment modelling community, who analyze interactions between socio-economic conditions, climate policy, and energy use. They are representative in the sense that they were selected to span the range of plausible emissions paths discussed in the literature, both with and without a coordinated global emissions policy. They are pathways, as they specify in detail how emissions of greenhouse gases and other pollutants would change, year by year, under each set of assumptions. The pathway matters a lot, because it is cumulative emissions (and the relative amounts of different types of emissions) that determine how much warming we get, rather than the actual emissions level in any given year. (See this graph for details on the emissions and concentrations in each RCP).

By the way, you can safely ignore the meaning of the numbers used to label the RCPs – they’re really just to remind the scientists which pathway is which. Briefly, the numbers represent the approximate anthropogenic forcing, in W/m², at the year 2100.

RCP8.5 and RCP6 represent two different pathways for a world with no explicit climate policy. RCP8.5 is at about the 90th percentile of the full set of non-mitigation scenarios described in the literature. So it’s not quite a worst-case scenario, but emissions much higher than this are unlikely. One scenario that follows this path is a world in which renewable power supply grows only slowly (to about 20% of the global power mix by 2070) while most of a growing demand for energy is still met from fossil fuels. Emissions continue to grow strongly, and don’t peak before the end of the century. Incidentally, RCP8.5 ends up in the year 2100 with a similar atmospheric concentration to the old A1FI scenario in AR4, at around 900ppm CO2.

RCP6 (which is only shown to the year 2100 in this graph) is in the lower quartile of likely non-mitigation scenarios. Here, emissions peak by mid-century and then stabilize at a little below double current annual emissions. This is possible without an explicit climate policy because under some socio-economic conditions, the world still shifts (slowly) towards cleaner energy sources, presumably because the price of renewables continues to fall while oil starts to run out.

The two mitigation pathways, RCP2.6 and RCP4.5 bracket a range of likely scenarios for a concerted global carbon emissions policy. RCP2.6 was explicitly picked as one of the most optimistic possible pathways – note that it’s outside the 90% confidence interval for mitigation scenarios. The expert group were cautious about selecting it, and spent extra time testing its assumptions before including it. But it was picked because there was interest in whether, in the most optimistic case, it’s possible to stay below 2°C of warming.

Most importantly, note that one of the assumptions in RCP2.6 is that the world goes carbon-negative by around 2070. Wait, what? Yes, that’s right – the pathway depends on our ability to find a way to remove more carbon from the atmosphere than we produce, and to be able to do this consistently on a global scale by 2070. So, the green line in the graph above is certainly possible, but it’s well outside the set of emissions targets currently under discussion in any international negotiations.

RCP4.5 represents a more mainstream view of global attempts to negotiate emissions reductions. On this pathway, emissions peak before mid-century, and fall to well below today’s levels by the end of the century. Of course, this is not enough to stabilize atmospheric concentrations until the end of the century.

The committee that selected the RCPs warns against over-interpretation. They deliberately selected an even number of pathways, to avoid any implication that a “middle” one is the most likely. Each pathway is the result of a different set of assumptions about how the world will develop over the coming century, either with, or without climate policies. Also:

  • The RCPs should not be treated as forecasts, nor bounds on forecasts. No RCP represents a “best guess”. The high and low scenarios were picked as representative of the upper and lower ends of the range described in the literature.
  • The RCPs should not be treated as policy prescriptions. They were picked to help answer scientific questions, not to offer specific policy choices.
  • There isn’t a unique socio-economic scenario driving each RCP – there are multiple sets of conditions that might be consistent with a particular pathway. Identifying these sets of conditions in more detail is an open question to be studied over the next few years.
  • There’s no consistent logic to the four RCPs, as each was derived from a different assessment model. So you can’t, for example, adjust individual assumptions to get from one RCP to another.
  • The translation from emissions profiles (which the RCPs specify) into atmospheric concentrations and radiative forcings is uncertain, and hence is also an open research question. The intent is to study these uncertainties explicitly through the modeling process.

So, we have a set of emissions pathways chosen because they represent “interesting” points in the space of likely global socio-economic scenarios covered in the literature. These are the starting point for multiple lines of research by different research communities. The climate modeling community will use them as inputs to climate simulations, to explore temperature response, regional variations, precipitation, extreme weather, glaciers, sea ice, and so on. The impacts and adaptation community will use them to explore the different effects on human life and infrastructure, and how much adaptation will be needed under each scenario. The mitigation community will use them to study the impacts of possible policy choices, and will continue to investigate the socio-economic assumptions underlying these pathways, to give us a clearer account of how each might come about, and to produce an updated set of scenarios for future assessments.

Okay, back to the graph. This represents one of the first available sets of temperature outputs from a Global Climate Model for the four RCPs. Over the next two years, other modeling groups will produce data from their own runs of these RCPs, to give us a more robust set of multi-model ensemble runs.

So the results in this graph are very preliminary, but if the results from other groups are consistent with them, here’s what I think it means. The upper path, RCP8.5, offers a glimpse of what happens if economic development and fossil fuel use continue to grow the way they have over the last few decades. It’s hard to imagine much of the human race surviving the next few centuries under this scenario. The lowest path, RCP2.6, keeps us below the symbolically important threshold of 2 degrees of warming, but then doesn’t bring us down much from that throughout the coming centuries. And that’s a pretty stark result: even if we do find a way to go carbon-negative by the latter part of this century, the following two centuries still end up hotter than it is now. All the while that we’re re-inventing the entire world’s industrial basis to make it carbon-negative, we also have to be adapting to a global climate that is warmer than any experienced since the human species evolved.

[By the way: the 2 degree threshold is probably more symbolic than it is scientific, although there's some evidence that this is the point above which many scientists believe positive feedbacks would start to kick in. For a history of the 2 degree limit, see Randalls 2010].

I’m on my way back from a workshop on Computing in the Atmospheric Sciences, in Annecy, France. The opening keynote, by Gerald Meehl of NCAR, gave us a fascinating overview of the CMIP5 model experiments that will form a key part of the upcoming IPCC Fifth Assessment Report. I’ve been meaning to write about the CMIP5 experiments for ages, as the modelling groups were all busy getting their runs started when I visited them last year. As Jerry’s overview was excellent, this gives me the impetus to write up a blog post. The rest of this post is a summary of Jerry’s talk.

Jerry described CMIP5 as “the most ambitious and computer-intensive inter-comparison project ever attempted”, and having seen many of the model labs working hard to get the model runs started last summer, I think that’s an apt description. More than 20 modelling groups around the world are expected to participate, supplying a total estimated dataset of more than 2 petabytes.

It’s interesting to compare CMIP5 to CMIP3, the model intercomparison project for the last IPCC assessment. CMIP3 began in 2003, and was, at that time, itself an unprecedented set of coordinated climate modelling experiments. It involved 16 groups, from 11 countries with 23 models (some groups contributed more than one model). The resulting CMIP3 dataset, hosted at PCMDI, is 31 terabytes, is openly accessible, has been accessed by more than 1200 scientists, has generated hundreds of papers, and use of this data is still ongoing. The ‘iconic’ figures for future projections of climate change in IPCC AR4 are derived from this dataset (see for example, Figure 10.4 which I’ve previously critiqued).

Most of the CMIP3 work was based on the IPCC SRES “what if” scenarios, which offer different views on future economic development and fossil fuel emissions, but none of which include a serious climate mitigation policy.

By 2006, during the planning of the next IPCC assessment, it was already clear that a profound paradigm shift was in progress. The idea of climate services had emerged, with a growing demand from industry, government and other groups for detailed regional information about the impacts of climate change, and, of course, a growing need to explicitly consider mitigation and adaptation scenarios. And of course the questions are connected: With different mitigation choices, what are the remaining regional climate effects that adaptation will have to deal with?

So, CMIP5 represents a new paradigm for climate change prediction:

  1. Decadal prediction, with high resolution Atmosphere-Ocean General Circulation Models (AOGCMs), with say, 50km grids, initialized to explore near-term climate change over the next three decades.
  2. First generation Earth System Models, which include a coupled carbon cycle and ice sheet models, typically run at intermediate resolution (100-150km grids) to study longer term feedbacks past mid-century, using a new set of scenarios that include both mitigation and non-mitigation emissions profiles.
  3. Stronger links between communities – e.g. WCRP, IGBP, and the weather prediction community, but most importantly, stronger interaction between the three working groups of the IPCC: WG1 (which looks at the physical science basis), WG2 (which looks at impacts, adaptation and vulnerability), and WG3 (integrated assessment modelling and scenario development). The lack of interaction between WG1 and the others has been a problem in the past, especially for WG2 and WG3, as they’re the ones trying to understand the impacts of different policy choices.
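To get a feel for why the high resolution decadal runs are so much more demanding than the intermediate resolution Earth System Model runs, here’s a rough back-of-the-envelope sketch. It assumes a spherical Earth of radius 6371km tiled with uniform square cells (which real model grids are not), and it ignores vertical levels and timestep length entirely. The function name is my own, purely for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean radius; an assumption for this sketch


def approx_columns(grid_spacing_km):
    """Rough count of horizontal grid columns for a given grid spacing."""
    surface_area = 4.0 * math.pi * EARTH_RADIUS_KM ** 2
    return surface_area / grid_spacing_km ** 2


for spacing in (50, 100, 150):
    print(f"{spacing:>4} km grid: ~{approx_columns(spacing):,.0f} columns")

ratio = approx_columns(50) / approx_columns(150)
print(f"50 km vs 150 km: {ratio:.0f}x as many columns")
```

In practice the cost gap is likely to be larger than the column count suggests, because a finer grid usually also needs a shorter timestep.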

The model experiments for CMIP5 are not dictated by IPCC, but selected by the climate science community itself. A large set of experiments have been identified, intended to provide a 5-year framework (2008-2013) for climate change modelling. As not all modelling groups will be able to run all the experiments, they have been prioritized into three clusters: A core set that everyone will run, and two tiers of optional experiments. Experiments that are completed by early 2012 will be analyzed in the next IPCC assessment (due for publication in 2013).

The overall design for the set of experiments is broken out into two clusters (near-term, i.e. decadal runs; and long-term, i.e. century and longer), designed for different types of model (although for some centres, this really means different configurations of the same model code, if their models can be run at very different resolutions). In both cases, the core experiment set includes runs of both past and future climate. The past runs are used as hindcasts to assess model skill. Here’s the decadal experiments, showing the core set in the middle, and tier 1 around the edge (there’s no tier 2 for these, as there aren’t so many decadal experiments):

These experiments include some very computationally-demanding runs at very high resolution, and include the first generation of global cloud-resolving models. For example, the prescribed SST time-slices experiments include two periods (1979-2008 and 2026-2035) where prescribed sea-surface temperatures taken from lower resolution, fully-coupled model runs will be used as a basis for very high resolution atmosphere-ocean runs. The intent of these experiments is to explore the local/regional effects of climate change, including on hurricanes and extreme weather events.

Here’s the set of experiments for the longer-term cluster, marked up to indicate three different uses: Model evaluation (where the runs can be compared to observations to identify weaknesses in the models and explore reasons for divergences between the models); climate projections (to show what the models do on four representative scenarios, at least to the year 2100, and, for some runs, out to the year 2300); and understanding (including thought experiments, such as the Aqua planet with no land mass, and abrupt changes in GHG concentrations):

These experiments include a much wider range of scientific questions than earlier IPCC assessments (which is why there are so many more experiments this time round). Here’s another way of grouping the long-term runs, showing the collaborations with the many different research communities who are participating:

With these experiments, some crucial science questions will be addressed:

  • what are the time-evolving changes in regional climate change and extremes over the next few decades?
  • what are the size and nature of the carbon cycle and other feedbacks in the climate system, and what will be the resulting magnitude of change for different mitigation scenarios?

The long-term experiments are based on a new set of scenarios that represent a very different approach than was used in the last IPCC assessment. The new scenarios are called Representative Concentration Pathways (RCPs), although as Jerry points out, the name is a little confusing. I’ll write more about the RCPs in my next post, but here’s a brief summary…

The RCPs were selected after a long series of discussions with the integrated assessment modelling community. A large set of possible scenarios were whittled down to just four. For the convenience of the climate modelling community, they’re labelled with the expected anomaly in radiative forcing (in W/m²) by the year 2100, to give us the set {RCP2.6, RCP4.5, RCP6, RCP8.5}. For comparison, the current total radiative forcing due to anthropogenic greenhouse gases is about 2W/m². But really, the numbers are just to help remember which RCP is which. Really, the term pathway is the important part – each of the four was chosen as an illustrative example of how greenhouse gas concentrations might change over the rest of the century, under different circumstances. They were generated from integrated assessment models that provide detailed emissions profiles for a wide range of different greenhouse gases and other variables (e.g. aerosols). Here’s what the pathways look like (the darker coloured lines are the chosen representative pathways, the thinner lines show others that were considered, and each cluster is labelled with the model that generated them; click for bigger):

Each RCP was produced by a different model, in part because no single model was capable of providing the detail needed for all four different scenarios, although this means that the RCPs cannot be directly compared, because they include different assumptions. The graph above shows the range of mitigation scenarios considered by the blue shading, and the range of non-mitigation scenarios with gray shading (the two areas overlap a little).

Here’s a rundown on the four scenarios:

  • RCP2.6 represents the lower end of possible mitigation strategies, where emissions peak in the next decade or so, and then decline rapidly. This scenario is only possible if the world has gone carbon-negative by the 2070s, presumably by developing wide-scale carbon-capture and storage (CCS) technologies. This might be possible with an energy mix by 2070 of at least 35% renewables, 45% fossil fuels with full CCS (and 20% without), along with use of biomass, tree planting, and perhaps some other air-capture technologies. [My interpretation: this is the most optimistic scenario, in which we manage to do everything short of geo-engineering, and we get started immediately].
  • RCP4.5 represents a less aggressive emissions mitigation policy, where emissions peak before mid-century, and then fall, but not to zero. Under this scenario, concentrations stabilize by the end of the century, but won’t start falling, so the extra radiative forcing at the year 2100 is still more than double what it is today, at 4.5W/m². [My interpretation: this is the compromise future in which most countries work hard to reduce emissions, with a fair degree of success, but where CCS turns out not to be viable for massive deployment].
  • RCP6 represents the more optimistic of the non-mitigation futures. [My interpretation: this scenario is a world without any coordinated climate policy, but where there is still significant uptake of renewable power, but not enough to offset fossil-fuel driven growth among developing nations].
  • RCP8.5 represents the more pessimistic of the non-mitigation futures. For example, by 2070, we would still be getting about 80% of the world’s energy needs from fossil fuels, without CCS, while the remaining 20% come from renewables and/or nuclear. [My interpretation: this is the closest to the "drill, baby, drill" scenario beloved of certain right-wing American politicians].
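As a quick sanity check on the labels, here’s a small sketch comparing each pathway’s nominal forcing at 2100 with the rough present-day figure of about 2W/m² quoted above. The numbers come straight from the RCP names; the variable and function names are mine, and the present-day value is only approximate.

```python
# Nominal radiative forcing (W/m^2) at 2100 implied by each RCP label.
RCP_FORCING_2100 = {"RCP2.6": 2.6, "RCP4.5": 4.5, "RCP6": 6.0, "RCP8.5": 8.5}

# Approximate present-day anthropogenic greenhouse gas forcing (W/m^2).
CURRENT_FORCING = 2.0


def forcing_ratio(rcp):
    """Forcing at 2100 under the given RCP, relative to the present-day figure."""
    return RCP_FORCING_2100[rcp] / CURRENT_FORCING


for name in RCP_FORCING_2100:
    print(f"{name}: {forcing_ratio(name):.2f}x today's forcing")
```

This is where the “more than double” for RCP4.5 comes from: 4.5 divided by 2 is 2.25.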

Jerry showed some early model results for these scenarios from the NCAR model, CCSM4, but I’ll save that for my next post. To summarize:

  • 24 modelling groups are expected to participate in CMIP5, and about 10 of these groups have fully coupled earth system models.
  • Data is currently in from 10 groups, covering 14 models. Here’s a live summary, currently showing 172TB, which is already more than 5 times all the model data for CMIP3. Jerry put the total expected data at 1-2 petabytes, although in a talk later in the afternoon, Gary Strand from NCAR pegged it at 2.2PB. [Given how much everyone seems to have underestimated the data volumes from the CMIP5 experiments, I wouldn't be surprised if it's even bigger. Sitting next to me during Jerry's talk, Marie-Alice Foujols from IPSL came up with an estimate of 2PB just for all the data collected from the runs done at IPSL, of which she thought something like 20% would be submitted to the CMIP5 archive].
  • The model outputs will be accessed via the Earth System Grid, and will include much more extensive documentation than previously. The Metafor project has built a controlled vocabulary for describing models and experiments, and the Curator project has developed web-based tools for ingesting this metadata.
  • There’s a BAMS paper coming out soon describing CMIP5.
  • There will be a CMIP5 results session at the WCRP Open science conference next month, another at the AGU meeting in December, and another at a workshop in Hawaii in March.
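The data volume figures are easy to check with a few lines of arithmetic. This sketch uses only the numbers quoted in this post (31TB for CMIP3, 172TB so far for CMIP5, and the IPSL estimate); the one assumption I’ve added is decimal units, i.e. 1000TB per PB.

```python
TB_PER_PB = 1000  # decimal units; an assumption, the talks did not specify

cmip3_tb = 31                   # CMIP3 archive hosted at PCMDI
cmip5_so_far_tb = 172           # CMIP5 live summary at the time of writing
ipsl_total_pb = 2.0             # estimate for all output from the IPSL runs
ipsl_submitted_fraction = 0.20  # rough share expected to go to the archive

growth = cmip5_so_far_tb / cmip3_tb
ipsl_submitted_tb = ipsl_total_pb * ipsl_submitted_fraction * TB_PER_PB

print(f"CMIP5 so far is {growth:.1f}x the CMIP3 archive")
print(f"IPSL alone might submit ~{ipsl_submitted_tb:.0f} TB")
```

If the IPSL estimate is anywhere near right, that one centre’s submission would be more than twice what has arrived so far from all groups combined.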

For the Computing in Atmospheric Sciences workshop next month, I’ll be giving a talk entitled “On the relationship between earth system models and the labs that build them”. Here’s the abstract:

In this talk I will discuss a number of observations from a comparative study of four major climate modeling centres:
- the UK Met Office Hadley Centre (UKMO), in Exeter, UK;
- the National Center for Atmospheric Research (NCAR), in Boulder, Colorado;
- the Max-Planck Institute for Meteorology (MPI-M), in Hamburg, Germany;
- the Institut Pierre Simon Laplace (IPSL), in Paris, France.
The study focussed on the organizational structures and working practices at each centre with respect to earth system model development, and how these affect the history and current qualities of their models. While the centres share a number of similarities, including a growing role for software specialists and greater use of open source tools for managing code and the testing process, there are marked differences in how the different centres are funded, in their organizational structure and in how they allocate resources. These differences are reflected in the program code in a number of ways, including the nature of the coupling between model components, the portability of the code, and (potentially) the quality of the program code.

While all these modelling centres continually seek to refine their software development practices and the software quality of their models, they all struggle to manage the growth (in terms of size and complexity) in the models. Our study suggests that improvements to the software engineering practices at the centres have to take account of differing organizational constraints at each centre. Hence, there is unlikely to be a single set of best practices that works everywhere. Indeed, improvements in modelling practices usually come from local, grass-roots initiatives, in which new tools and techniques are adapted to suit the context at a particular centre. We suggest therefore that there is need for a stronger shared culture of describing current model development practices and sharing lessons learnt, to facilitate local adoption and adaptation.

18. May 2011

Previously I posted on the first two sessions of the workshop on “A National Strategy for Advancing Climate Modeling” that was held at NCAR at the end of last month:

  1. What should go into earth system models;
  2. Challenges with hardware, software and human resources;

    The third session focussed on the relationship between models and data.

    Kevin Trenberth kicked off with a talk on Observing Systems. Unfortunately, I missed part of his talk, but I’ll attempt a summary anyway – apologies if it’s incomplete. His main points were that we don’t suffer from a lack of observational data, but from problems with quality, consistency, and characterization of errors. Continuity is a major problem, because much of the observational system was designed for weather forecasting, where consistency of measurement over years and decades isn’t required. Hence, there’s a need for reprocessing and reanalysis of past data, to improve calibration and assess accuracy, and we need benchmarks to measure the effectiveness of reprocessing tools.

    Kevin points out that it’s important to understand that models are used for much more than prediction. They are used:

    • for analysis of observational data, for example to produce global gridded data from the raw observations;
    • to diagnose climate & improve understanding of climate processes (and thence to improve the models);
    • for attribution studies, through experiments to determine climate forcing;
    • for projections and prediction of future climate change;
    • for downscaling to provide regional information about climate impacts.

    Confronting the models with observations is a core activity in earth system modelling. Obviously, it is essential for model evaluation. But observational data is also used to tune the models, for example to remove known systematic biases. Several people at the workshop pointed out that the community needs to do a better job of keeping the data used to tune the models distinct from the data used to evaluate them. For tuning, a number of fields are used – typically top-of-the-atmosphere data such as net shortwave and longwave radiation flux, cloud and clear sky forcing, and cloud fractions. Also, precipitation and surface wind stress, global mean surface temperature, and the period and amplitude of ENSO. Kevin suggests we need to do a better job of collecting information about model tuning from different modelling groups, and ensure model evaluations don’t use the same fields.
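The separation Kevin is asking for is essentially a set operation: given the fields a group used for tuning, which of the fields in a proposed evaluation are genuinely independent? Here’s a minimal sketch. The field names are hypothetical (loosely based on the tuning targets listed above), as is the function, and a real survey would need to track datasets and time periods too.

```python
# Hypothetical field names, loosely based on the tuning targets listed above.
tuning_fields = {
    "toa_net_shortwave", "toa_net_longwave", "cloud_fraction",
    "precipitation", "surface_wind_stress", "global_mean_surface_temp",
    "enso_period", "enso_amplitude",
}

# Hypothetical set of fields an evaluation study proposes to score against.
evaluation_fields = {
    "precipitation", "sea_ice_extent", "ocean_heat_content", "cloud_fraction",
}


def split_evaluation_fields(evaluation, tuning):
    """Separate fields that are independent of tuning from those that are not."""
    independent = evaluation - tuning
    contaminated = evaluation & tuning
    return independent, contaminated


independent, contaminated = split_evaluation_fields(evaluation_fields, tuning_fields)
print("Independent of tuning:", sorted(independent))
print("Also used for tuning: ", sorted(contaminated))
```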

    For model evaluation, a number of integrated score metrics have been proposed to summarize correlation, root-mean-squared (rms) error and variance ratios – see for example, Taylor 2001; Boer and Lambert 2001; Murphy et al, 2004; Reichler & Kim 2008.
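To make those three ingredients concrete, here’s a minimal sketch that computes them for a pair of series, using only the Python standard library. This is not the scoring scheme from any of the papers cited; the function name is mine, the data are invented, and a real evaluation would use area-weighted gridded fields rather than a short list of numbers.

```python
import math


def taylor_statistics(model, obs):
    """Return (correlation, centred RMS difference, std-dev ratio model/obs)."""
    n = len(model)
    if n != len(obs) or n < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    mean_m = sum(model) / n
    mean_o = sum(obs) / n
    dm = [m - mean_m for m in model]  # model anomalies
    do = [o - mean_o for o in obs]    # observed anomalies
    std_m = math.sqrt(sum(d * d for d in dm) / n)
    std_o = math.sqrt(sum(d * d for d in do) / n)
    corr = sum(a * b for a, b in zip(dm, do)) / (n * std_m * std_o)
    crmsd = math.sqrt(sum((a - b) ** 2 for a, b in zip(dm, do)) / n)
    return corr, crmsd, std_m / std_o


# Made-up numbers purely for illustration, not real model output.
obs = [14.1, 14.3, 14.2, 14.6, 14.8, 14.7]
model = [14.0, 14.4, 14.1, 14.5, 15.0, 14.9]
corr, crmsd, ratio = taylor_statistics(model, obs)
print(f"R = {corr:.3f}, centred RMS = {crmsd:.3f}, sigma ratio = {ratio:.3f}")
```

The three numbers aren’t independent: the squared centred RMS difference equals the sum of the two variances minus twice the product of the two standard deviations and the correlation. That relationship is what allows a Taylor diagram to show all three on a single plot.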

    But model evaluation and tuning aren’t the only ways in which models and data are brought together. Just as important is re-analysis, where multiple observational datasets are processed through a model to provide more comprehensive (model-like) data products. For this, data assimilation is needed, whereby observational data fields are used to nudge the model at each timestep as it runs.
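The simplest way to picture “nudging” is as a relaxation term: at each timestep the model takes its own step, and is then pulled part of the way towards whatever observation is available. Here’s a toy sketch of that idea for a single scalar variable. Everything in it (the function names, the toy model, the relaxation coefficient) is an assumption for illustration; the assimilation schemes used in real reanalyses, such as variational methods and ensemble Kalman filters, are considerably more elaborate.

```python
def run_with_nudging(x0, observations, step, relaxation=0.2):
    """Advance a scalar model state, relaxing it towards each observation.

    step        -- function advancing the model state by one timestep
    relaxation  -- fraction of the model-observation gap closed per timestep
                   (0 = free-running model, 1 = overwrite with observation)
    """
    x = x0
    trajectory = []
    for obs in observations:
        x = step(x)                         # free model forecast
        if obs is not None:                 # None marks a timestep with no data
            x = x + relaxation * (obs - x)  # nudge towards the observation
        trajectory.append(x)
    return trajectory


def toy_step(x):
    """Toy 'model': relaxes towards 5.0, so it is biased against obs near 3."""
    return 0.9 * x + 0.5


obs = [3.0, None, 3.2, 3.1, None, 3.3]
print(run_with_nudging(1.0, obs, toy_step))
```

The output is “model-like” in the sense that it has a value at every timestep, including those with no observation, which is the main attraction of a reanalysis product.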

    Kevin also talked about forward modelling, a technique in which the model is used to reproduce the signal that a particular instrument would record, given certain climate conditions. Forward modelling is used for comparing models with ground observations and satellite data. In much of this work, there is an implicit assumption that the satellite data are correct, but in practice, all satellite data have biases, and need re-processing. For this work, the models need good emulation of instrument properties and thresholds. For examples, see: Chepfer, Bony et al, 2010; Stubenrauch & Kinne 2009.

    He also talked about some of the problems with existing data and models:

    • nearly all satellite data sets contain large spurious variability associated with changing instruments and satellites, orbital decay/drift, calibration, and changing methods of analysis.
    • simulation of the hydrological cycle is poor, especially in the intertropical convergence zone (ITCZ). Tropical transients are too weak, runoff and recycling are not correct, and the diurnal cycle is poor.
    • there are large differences between datasets for low cloud (see Marchand et al 2010).
    • clouds are not well defined. Partly this is a problem of sensitivity of instruments, compounded by the difficulty of distinguishing between clouds and aerosols.
    • Most models have too much incoming solar radiation in the southern oceans, caused by too few clouds. This makes for warmer oceans and diminished poleward transport, which messes up storm tracking and analysis of ocean transports.

    What is needed to support modelling over the next twenty years? Kevin made the following recommendations:

    • Support observations and development into climate datasets.
    • Support reprocessing and reanalysis.
    • Unify NWP and climate models to exploit short term predictions and confront the models with data.
    • Develop more forward modelling and observation simulators, but with more observational input.
    • Targeted process studies such as GEWEX and analysis of climate extremes, for model evaluation.
    • Target problem areas such as monsoons and tropical precipitation.
    • Carry out a survey of fields used to tune models.
    • Design evaluation and model merit scoring based on fields other than those used in tuning.
    • Promote assessments of observational datasets so modellers know which to use (and not use).
    • Support existing projects, including GSICS, SCOPE-CM, CLARREO, GRUAN.

    Overall, there’s a need for a climate observing system. Process studies should not just be left to the observationists – we need the modellers to get involved.

    The second talk was by Ben Kirtman, on “Predictability, Credibility, and Uncertainty Quantification”. He began by pointing out that there is ongoing debate over what predictability means. Some treat it as an inherent property of the climate system, while others think of it as a model property. Ben distinguished two kinds of predictability:

    • Sensitivity of the climate system to initial conditions (predictability of the first kind);
    • Predictability of the boundary forcing (predictability of the second kind).

    Predictability is enhanced by ensuring specific processes are included. For example, you need to include the MJO if you want to predict ENSO. But model-based estimates of predictability are model dependent. If we want to do a better job of assessing predictability, we have to characterize model uncertainty, and we don’t know how to do this today.

    Good progress has been made on quantifying initial condition uncertainty. We have plenty of good ideas for how to probe this (stochastic optimals, bred vectors, etc.) using ensembles with perturbed initial conditions. But from our understanding of chaos theory (e.g. see the Lorenz attractor), predictability depends on which part of the regime you’re in, so we need to assess the predictability for each particular forecast.
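Sensitivity to initial conditions is easy to demonstrate with the Lorenz system itself. The sketch below runs two copies of the classic three-variable Lorenz model (with the standard parameter values 10, 28 and 8/3) from initial states that differ by one part in a hundred million, and tracks how far apart they drift. The integrator, timestep, starting point and function names are my own choices for illustration; this is a “twin run”, the simplest possible perturbed initial condition ensemble, not one of the techniques Ben mentioned.

```python
def lorenz_rhs(state, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """Right-hand side of the Lorenz-63 equations."""
    x, y, z = state
    return (sigma * (y - x), x * (rho - z) - y, x * y - beta * z)


def rk4_step(state, dt):
    """One fourth-order Runge-Kutta step of the Lorenz-63 system."""
    def shifted(s, k, h):
        return tuple(si + h * ki for si, ki in zip(s, k))
    k1 = lorenz_rhs(state)
    k2 = lorenz_rhs(shifted(state, k1, dt / 2))
    k3 = lorenz_rhs(shifted(state, k2, dt / 2))
    k4 = lorenz_rhs(shifted(state, k3, dt))
    return tuple(
        s + dt / 6 * (a + 2 * b + 2 * c + d)
        for s, a, b, c, d in zip(state, k1, k2, k3, k4)
    )


def separation(a, b):
    """Euclidean distance between two model states."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5


def twin_run(perturbation=1e-8, dt=0.01, steps=3000):
    """Integrate two runs differing only by a tiny initial perturbation."""
    control = (1.0, 1.0, 1.0)
    perturbed = (1.0 + perturbation, 1.0, 1.0)
    history = [separation(control, perturbed)]
    for _ in range(steps):
        control = rk4_step(control, dt)
        perturbed = rk4_step(perturbed, dt)
        history.append(separation(control, perturbed))
    return history


history = twin_run()
for step in (0, 600, 1200, 1800, 2400, 3000):
    print(f"t = {step * 0.01:5.1f}  separation = {history[step]:.3e}")
```

The growth in separation is not steady: it depends on where on the attractor the two runs happen to be, which is the point about needing to assess predictability for each particular forecast.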

    Uncertainty in external forcing includes uncertainties in both the natural and anthropogenic forcing; however this is becoming less of an issue in modelling, as these forcings are better understood. Therefore, the biggest challenge is in quantifying uncertainties in model formulation. These arise because of the discrete representation of the climate system, the use of parameterization of subgrid processes, and because of missing processes. Current approaches can be characterized as:

    • a posteriori techniques, such as the multi-model ensembles of opportunity used in IPCC assessments, and perturbed parameters/parameterizations, as used in climateprediction.net.
    • a priori techniques, where we incorporate uncertainty as the model evolves. The idea is that uncertainty is in subscale processes and missing physics. Model this non-locally and stochastically. E.g. backscatter, interactive ensembles to incorporate uncertainty in the coupling.

    The term credibility is even less well defined. Ben asked his students what they understood by the term, and they came up with a simple answer: credibility is the extent to which you use the best available science [which corresponds roughly to my suggestion of what model validation ought to mean]. In the literature, there are a number of other ways of expressing credibility:

    • In terms of model bias. For example, Lenny Smith offers a Temporal (or spatial) credibility ratio, calculated as the ratio of the smallest timestep in the model to the smallest duration over which a variable has to be averaged before it compares favourably with observations. This expresses how much averaging over the temporal (or spatial) scale you have to do to make the model look like the data.
    • In terms of whether the ensembles bracket the observations. But the problem here is that you can always pump up an ensemble to do this, and it doesn’t really tell you about probabilistic forecast skill.
    • In terms of model skill. In numerical weather prediction, it’s usual to measure forecast quality using some specific skill metrics.
    • In terms of process fidelity – how well the processes represented in the model capture what is known about those processes in reality. This is a reductionist approach, and depends on the extent to which specific processes can be isolated (both in the model, and in the world).
    • In terms of faith – for example, the modellers’ subjective assessment of how good their model is.
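Two of these measures are simple enough to sketch in code. The first function computes a temporal credibility ratio as described above; it takes the averaging period as an input, because what counts as comparing “favourably” with observations isn’t defined here. The second counts how often observations fall inside the ensemble envelope, and the third shows why that’s a weak test: inflating the spread improves the score without improving any individual run. All names and numbers are invented for illustration.

```python
def temporal_credibility_ratio(model_timestep, min_averaging_period):
    """Smallest model timestep divided by the shortest averaging period
    over which the model compares favourably with observations.
    Both arguments must use the same time unit."""
    return model_timestep / min_averaging_period


def bracketing_fraction(ensemble, observations):
    """Fraction of observations lying within the ensemble min-max envelope.

    ensemble      -- list of member runs, each a list of values per time
    observations  -- list of observed values, one per time
    """
    hits = 0
    for t, obs in enumerate(observations):
        values = [member[t] for member in ensemble]
        if min(values) <= obs <= max(values):
            hits += 1
    return hits / len(observations)


def inflate(ensemble, factor):
    """Scale each member's departure from the ensemble mean by `factor`."""
    n = len(ensemble)
    means = [sum(m[t] for m in ensemble) / n for t in range(len(ensemble[0]))]
    return [[mu + factor * (v - mu) for v, mu in zip(m, means)] for m in ensemble]


# Invented numbers: a 30-minute timestep, favourable only as 30-day means
# (both expressed in hours).
print(temporal_credibility_ratio(0.5, 30 * 24))

ensemble = [[1.0, 2.0, 3.0], [1.2, 2.1, 3.3], [0.9, 1.8, 3.1]]
obs = [1.5, 2.0, 2.5]
print(bracketing_fraction(ensemble, obs))               # original spread
print(bracketing_fraction(inflate(ensemble, 10), obs))  # pumped-up spread
```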

    In the literature, credibility is usually used in a qualitative way to talk about model bias. Hence, in the literature, model bias is roughly synonymous with the inverse of credibility. However, in these terms, the models currently have a major credibility gap. For example, Ben showed the annual mean rainfall from a long simulation of CESM1, showing bias with respect to GPCP observations. These show the model struggling to capture the spatial distribution of sea surface temperature (SST), especially in equatorial regions.

    Every climate model has a problem with equatorial sea surface temperatures (SST). A recent paper, Anagnostopoulos et al 2009, makes a big deal of this, and is clearly very hostile to climate modelling. They look at regional biases in temperature and precipitation, where the models are clearly not bracketing observations. I googled the Anagnostopoulos paper while Ben was talking – the first few pages of google hits are dominated by denialist websites proclaiming this as a major new study demonstrating the models are poor. It’s amusing that this is treated as news, given that such weaknesses in the models are well known within the modelling community, and discussed in the IPCC report. Meanwhile the hydrologists at the workshop tell me that it’s a third-rate journal, so none of them would pay any attention to this paper.

    Ben argues that these weaknesses need to be removed to increase model credibility. This argument seems a little weak to me. While improving model skill and removing biases are important goals for this community, they don’t necessarily help with model credibility in terms of using the best science (because replacing an empirically derived parameterization with one that’s more theoretically justified will often reduce model skill). More importantly, those outside the modeling community will have their own definitions of credibility, and they’re unlikely to correspond to those used within the community. Some attention to the ways in which other stakeholders understand model credibility would be useful and interesting.

    In summary, Ben identified a number of important tensions for climate modeling. For example, there are tensions between:

    • the desire to measure prediction skill vs. the desire to explore the limits of predictability;
    • the desire to quantify uncertainty, vs. the push for more resolution and complexity in the models;
    • a priori vs. a posteriori methods of assessing model uncertainty;
    • operational vs. research activities (many modellers believe the IPCC effort is getting a little out of control – it’s a good exercise, but too demanding on resources);
    • weather vs climate modelling;
    • model diversity vs critical mass.

    Ben urged the community to develop a baseline for climate modelling, capturing best practices for uncertainty estimation.