This is an excerpt from the draft manuscript of my forthcoming book, Computing the Climate.

While models are used throughout the sciences, the word ‘model’ can mean something very different to scientists from different fields. This can cause great confusion. I often encounter scientists from outside of climate science who think climate models are statistical models of observed data, and that future projections from these models must be just extrapolations of past trends. And just to confuse things further, some of the models used in climate policy analysis are like this. But the physical climate models that underpin our knowledge of why climate change occurs are fundamentally different from statistical models.

A useful distinction made by philosophers of science is between models of phenomena, and models of data. The former include models developed by physicists and engineers to capture cause-and-effect relationships. Such models are derived from theory and experimentation, and have explanatory power: the model captures the reasons why things happen. Models of data, on the other hand, describe patterns in observed data, such as correlations and trends over time, without reference to why they occur. Statistical models, for example, describe common patterns (distributions) in data, without saying anything about what caused them. This simplifies the job of describing and analyzing patterns: if you can find a statistical model that matches your data, you can reduce the data to a few parameters (sometimes just two: a mean and a standard deviation). For example, the heights of any large group of people tend to follow a normal distribution (the bell-shaped curve), but this model doesn’t explain why heights vary in that way, nor whether they always will in the future. New techniques from machine learning have extended the power of these kinds of models in recent years, by “training” an algorithm to discover more complex kinds of pattern.
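
To make the “few parameters” point concrete, here is a minimal Python sketch. The heights are synthetic, drawn from an assumed normal distribution with a mean of 170 cm and a standard deviation of 10 cm (both numbers are illustrative assumptions, not measurements). The fit uses NormalDist from Python’s standard statistics module.

```python
import random
from statistics import NormalDist

# Synthetic "observations": the 170 cm mean and 10 cm spread are made-up
# illustrative values, not real survey data.
random.seed(42)
heights_cm = [random.gauss(170.0, 10.0) for _ in range(10_000)]

# A model of data: 10,000 numbers summarised by just two parameters.
fitted = NormalDist.from_samples(heights_cm)
print(f"mean = {fitted.mean:.1f} cm, standard deviation = {fitted.stdev:.1f} cm")

# The fitted model can describe the pattern, e.g. the share of people
# shorter than 180 cm...
share_under_180 = fitted.cdf(180.0)
print(f"estimated share under 180 cm: {share_under_180:.2f}")
# ...but nothing in these two numbers says why heights vary this way.
```

The two fitted numbers are enough to reproduce the shape of the data, which is exactly what a model of data is for; any explanation of the pattern has to come from somewhere else.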

Statistical techniques and machine learning algorithms are good at discovering patterns in data (e.g. “A and B always seem to change together”), but hopeless at explaining why those patterns occur. To get over this, many branches of science use statistical methods together with controlled experiments, so that if we find a pattern in the data after we’ve carefully manipulated the conditions, we can argue that the changes we introduced in the experiment caused that pattern. The ability to identify a causal relationship in a controlled experiment has nothing to do with the statistical model used; it comes from the logic of the experimental design. Only if the experiment is designed properly will statistical analysis of the results provide any insights into cause and effect.

Unfortunately, for some scientific questions, experimentation is hard, or even impossible. Climate change is a good example. Even though it’s possible to manipulate the climate (as indeed we are currently doing, by adding more greenhouse gases), we can’t set up a carefully controlled experiment, because we only have one planet to work with. Instead, we use numerical models, which simulate the causal factors—a kind of virtual experiment. An experiment conducted in a causal model won’t necessarily tell us what will happen in the real world, but it often gives a very useful clue. If we run the virtual experiment many times in our causal model, under slightly varied conditions, we can then turn back to a statistical model to help analyze the results. But without the causal model to set up the experiment, a statistical analysis won’t tell us much.

Both traditional statistical models and modern machine learning techniques are brittle, in the sense that they struggle when confronted with new situations not captured in the data from which the models were derived. An observed statistical trend projected into the future is only useful as a predictor if the future is like the past; it will be a very poor predictor if the conditions that cause the trend change. Climate change in particular is likely to make a mess of all of our statistical models, because the future will be very unlike the past. In contrast, a causal model based on the laws of physics will continue to give good predictions, as long as the laws of physics still hold.

Modern climate models contain elements of both types of model. The core elements of a climate model capture cause-and-effect relationships from basic physics, such as the thermodynamics and radiative properties of the atmosphere. But these elements are supplemented by statistical models of phenomena such as clouds, which are less well understood. To a large degree, our confidence in future predictions from climate models comes from the parts that are causal models based on physical laws, and the uncertainties in these predictions derive from the parts that are statistical summaries of less well-understood phenomena. Over the years, many of the improvements in climate models have come from removing a component that was based on a statistical model, and replacing it with a causal model. And our confidence in the causal components in these models comes from our knowledge of the laws of physics, and from running a very large number of virtual experiments in the model to check whether we’ve captured these laws correctly in the model, and whether they really do explain climate patterns that have been observed in the past.

This week I’m reading my way through three biographies, which neatly capture the work of three key scientists who laid the foundation for modern climate modeling: Arrhenius, Bjerknes and Callendar.


Crawford, E. (1996). Arrhenius: From Ionic Theory to the Greenhouse Effect. Science History Publications.
A biography of Svante Arrhenius, the Swedish scientist who, in 1895, created the first computational climate model, and spent almost a full year calculating by hand the likely temperature changes across the planet for increased and decreased levels of carbon dioxide. The term “greenhouse effect” hadn’t been coined back then, and Arrhenius was more interested in the question of whether the ice ages might have been caused by reduced levels of CO2. But nevertheless, his model was a remarkably good first attempt, and produced the first quantitative estimate of the warming expected from humanity’s ongoing use of fossil fuels.
Friedman, R. M. (1993). Appropriating the Weather: Vilhelm Bjerknes and the Construction of a Modern Meteorology. Cornell University Press.
A biography of Vilhelm Bjerknes, the Norwegian scientist, who, in 1904, identified the primitive equations, a set of differential equations that form the basis of modern computational weather forecasting and climate models. The equations are, in essence, an adaptation of the equations of fluid flow and thermodynamics to represent the atmosphere as a fluid on a rotating sphere in a gravitational field. At the time, the equations were little more than a theoretical exercise, and we had to wait half a century for the early digital computers, before it became possible to use them for quantitative weather forecasting.
Fleming, J. R. (2009). The Callendar Effect: The Life and Work of Guy Stewart Callendar (1898-1964). University of Chicago Press.
A biography of Guy S. Callendar, the British scientist, who, in 1938, first compared long-term observations of temperatures with measurements of rising carbon dioxide in the atmosphere, to demonstrate a warming trend as predicted by Arrhenius’ theory. It was several decades before his work was taken seriously by the scientific community. Some now argue that we should use the term “Callendar Effect” to describe the warming from increased emissions of carbon dioxide, because the term “greenhouse effect” is too confusing – greenhouse gases were keeping the planet warm long before we started adding more, and anyway, the analogy with the way that glass traps heat in a greenhouse is a little inaccurate.

Not only do the three form a neat ABC, they also represent the three crucial elements you need for modern climate modelling: a theoretical framework to determine which physical processes are likely to matter, a set of detailed equations that allow you to quantify the effects, and comparison with observations as a first step in validating the calculations.

It’s been a while since I’ve written about the question of climate model validation, but I regularly get asked about it when I talk about the work I’ve been doing studying how climate models are developed. There’s an upcoming conference organized by the Rotman Institute of Philosophy, in London, Ontario, on Knowledge and Models in Climate Science, at which many of my favourite thinkers on this topic will be speaking. So I thought it was a good time to get philosophical about this again, and define some terms that I think help frame the discussion (at least in the way I see it!).

Here’s my abstract for the conference:

Constructive and External Validity for Climate Modeling

Discussions of the validity of scientific computational models tend to treat “the model” as a unitary artifact, and ask questions about its fidelity with respect to observational data, and its predictive power with respect to future situations. For climate modeling, both of these questions are problematic, because of long timescales and inhomogeneities in the available data. Our ethnographic studies of the day-to-day practices of climate modelers suggest an alternative framework for model validity, focusing on a modeling system rather than any individual model. Any given climate model can be configured for a huge variety of different simulation runs, and only ever represents a single instance of a continually evolving body of program code. Furthermore, its execution is always embedded in a broader social system of scientific collaboration which selects suitable model configurations for specific experiments, and interprets the results of the simulations within the broader context of the current body of theory about earth system processes.

We propose that the validity of a climate modeling system should be assessed with respect to two criteria: Constructive Validity, which refers to the extent to which the day-to-day practices of climate model construction involve the continual testing of hypotheses about the ways in which earth system processes are coded into the models, and External Validity, which refers to the appropriateness of claims about how well model outputs ought to correspond to past or future states of the observed climate system. For example, a typical feature of the day-to-day practice of climate model construction is the incremental improvement of the representation of specific earth system processes in the program code, via a series of hypothesis-testing experiments. Each experiment begins with a hypothesis (drawn from current or emerging theories about the earth system) that a particular change to the model code ought to result in a predictable change to the climatology produced by various runs of the model. Such a hypothesis is then tested empirically, using the current version of the model as a control, and the modified version of the model as the experimental case. Such experiments are then replicated for various configurations of the model, and results are evaluated in a peer review process via the scientific working groups who are responsible for steering the ongoing model development effort.

Assessment of constructive validity for a modeling system would take account of how well the day-to-day practices in a climate modeling laboratory adhere to rigorous standards for such experiments, and how well they routinely test the assumptions that are built into the model in this way. Similarly, assessment of the external validity of the modeling system would take account of how well knowledge of the strengths and weaknesses of particular instances of the model is taken into account when making claims about the scope of applicability of model results. We argue that such an approach offers a more coherent way to address questions of model validity, as it corresponds more directly with the way in which climate models are developed and used.

For more background, see:

Imagine for a moment if Microsoft had 24 competitors around the world, each building their own version of Microsoft Word. Imagine further that every few years, they all agreed to run their software through the same set of very demanding tests of what a word processor ought to be able to do in a large variety of different conditions. And imagine that all these competing companies agreed that all the results from these tests would be freely available on the web, for anyone to see. Then, people who want to use a word processor can explore the data and decide for themselves which one best serves their purpose. People who have concerns about the reliability of word processors can analyze the strengths and weaknesses of each company’s software. Then think about what such a process would do to the reliability of word processors. Wouldn’t that be a great world to live in?

Well, that’s what climate modellers do, through a series of model inter-comparison projects. There are around 25 major climate modelling labs around the world developing fully integrated global climate models, and hundreds of smaller labs building specialized models of specific components of the earth system. The fully integrated models are compared in detail every few years through the Coupled Model Intercomparison Projects. And there are many other model inter-comparison projects for various specialist communities within climate science.

Have a look at how this process works, via this short paper on the planning process for CMIP6.

What’s the difference between forecasting the weather and predicting future climate change? A few years ago, I wrote a long post explaining that weather forecasting is an initial value problem, while climate is a boundary value problem. This is a much shorter explanation:

Imagine I were to throw a water balloon at you. If you could measure precisely how I threw it, and you understand the laws of physics correctly, you could predict precisely where it will go. If you could calculate it fast enough, you would know whether you’re going to get wet, or whether I’ll miss. That’s an initial value problem. The less precise your measurements of the initial value (how I throw it), the less accurate your prediction will be. Also, the longer the throw, the more the errors grow. This is how weather forecasting works – you measure the current conditions (temperature, humidity, wind speed, and so on) as accurately as possible, put them into a model that simulates the physics of the atmosphere, and run it to see how the weather will evolve. But the further into the future that you want to peer, the less accurate your forecast, because the errors on the initial value get bigger. It’s really hard to predict the weather more than about a week into the future:

Weather as an initial value problem
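
The water balloon analogy can be turned into a few lines of Python. This is only a sketch under simplifying assumptions: the balloon is treated as a projectile launched from ground level with no air resistance, and the function names and the numbers (speeds, angles, measurement errors) are made up for illustration.

```python
import math

G = 9.81  # gravitational acceleration, m/s^2


def landing_distance(speed, angle_deg, g=G):
    """Horizontal distance travelled by an idealised projectile (no drag)."""
    return speed ** 2 * math.sin(2 * math.radians(angle_deg)) / g


def landing_error(speed, angle_deg, angle_error_deg):
    """How far off the predicted landing point is if the launch angle
    was mis-measured by angle_error_deg."""
    predicted = landing_distance(speed, angle_deg + angle_error_deg)
    actual = landing_distance(speed, angle_deg)
    return abs(predicted - actual)


# A less precise measurement of the initial value gives a worse prediction...
small_measurement_error = landing_error(speed=10.0, angle_deg=30.0, angle_error_deg=1.0)
large_measurement_error = landing_error(speed=10.0, angle_deg=30.0, angle_error_deg=2.0)

# ...and the same measurement error matters more on a longer throw.
short_throw = landing_error(speed=5.0, angle_deg=30.0, angle_error_deg=1.0)
long_throw = landing_error(speed=10.0, angle_deg=30.0, angle_error_deg=1.0)

print(f"1 degree off: {small_measurement_error:.2f} m, 2 degrees off: {large_measurement_error:.2f} m")
print(f"short throw: {short_throw:.2f} m, long throw: {long_throw:.2f} m")
```

In a real weather model the errors grow much faster than in this toy case, because the atmosphere is chaotic, but the basic dependence on the initial measurement is the same.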

Now imagine I release a helium balloon into the air flow from a desk fan, and the balloon is on a string that’s tied to the fan casing. The balloon will reach the end of its string, and bob around in the stream of air. It doesn’t matter how exactly I throw the balloon into the airstream – it will keep on bobbing about in the same small area. I could leave it there for hours and it will do the same thing. This is a boundary value problem. I won’t be able to predict exactly where the balloon will be at any moment, but I will be able to tell you fairly precisely the boundaries of the space in which it will be bobbing. If anything affects these boundaries (e.g. because I move the fan a little), I should also be able to predict how this will shift the area in which the balloon will bob. This is how climate prediction works. You start off with any (reasonable) starting state, and run your model for as long as you like. If your model gets the physics right, it will simulate a stable climate indefinitely, no matter how you initialize it:

Climate as a boundary value problem

But if the boundary conditions change, because, for example, we alter the radiative balance of the planet, the model should also be able to predict fairly accurately how this will shift the boundaries on the climate:

Climate change as a change in boundary conditions


We cannot predict what the weather will do on any given day far into the future. But if we understand the boundary conditions and how they are altered, we can predict fairly accurately how the range of possible weather patterns will be affected. Climate change is a change in the boundary conditions on our weather systems.
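
The boundary value idea can be illustrated with a toy chaotic system. The sketch below uses the Lorenz (1963) equations as a stand-in for the weather; this is my choice of illustration, not a climate model. The parameter rho plays the role of a boundary condition, and the long-run average of the variable z plays the role of the “climate”. The function names, starting states, run lengths, and the alternative value rho = 35 are all assumptions picked for the example.

```python
def lorenz_step(state, dt, rho, sigma=10.0, beta=8.0 / 3.0):
    """Advance the Lorenz-63 system by one 4th-order Runge-Kutta step."""

    def deriv(s):
        x, y, z = s
        return (sigma * (y - x), x * (rho - z) - y, x * y - beta * z)

    def nudge(s, k, h):
        return tuple(si + h * ki for si, ki in zip(s, k))

    k1 = deriv(state)
    k2 = deriv(nudge(state, k1, dt / 2))
    k3 = deriv(nudge(state, k2, dt / 2))
    k4 = deriv(nudge(state, k3, dt))
    return tuple(
        s + dt / 6 * (a + 2 * b + 2 * c + d)
        for s, a, b, c, d in zip(state, k1, k2, k3, k4)
    )


def long_run_mean_z(initial_state, rho, steps=40_000, spinup=2_000, dt=0.01):
    """Average z over a long run, discarding an initial spin-up period."""
    state = initial_state
    total = 0.0
    for i in range(spinup + steps):
        state = lorenz_step(state, dt, rho)
        if i >= spinup:
            total += state[2]
    return total / steps


# Same "boundary condition", two very different starting states.
baseline_a = long_run_mean_z((1.0, 1.0, 1.0), rho=28.0)
baseline_b = long_run_mean_z((-8.0, 7.0, 27.0), rho=28.0)

# Same starting state as the first run, but a changed "boundary condition".
changed = long_run_mean_z((1.0, 1.0, 1.0), rho=35.0)

print(f"rho=28, start A: {baseline_a:.1f}")
print(f"rho=28, start B: {baseline_b:.1f}")
print(f"rho=35, start A: {changed:.1f}")
```

The two runs with the same rho should settle on nearly the same long-run average even though their moment-to-moment trajectories differ, while changing rho shifts that average. That is the sense in which the statistics are predictable even when the individual states are not.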

A few weeks ago, Mark Higgins, from EUMETSAT, posted this wonderful video of satellite imagery of planet earth for the whole of the year 2013. The video superimposes the aggregated satellite data from multiple satellites on the top of NASA’s ‘Blue Marble Next Generation’ ground maps, to give a consistent picture of large scale weather patterns (Original video here – be sure to listen to Mark’s commentary):

When I saw the video, it reminded me of something. Here’s the output from CAM3, the atmospheric component of the global climate model CESM, run at very high resolution (Original video here):

I find it fascinating to play these two videos at the same time, and observe how the model captures the large scale weather patterns of the planet. The comparison isn’t perfect, because the satellite data measures the cloud temperature (the colder the clouds, the whiter they are shown), while the climate model output shows total water vapour & rain (i.e. warmer clouds are a lot more visible, and precipitation is shown in orange). This means the tropical regions look much drier in the satellite imagery than they do in the model output.

But even so, there are some remarkable similarities. For example, both videos clearly show the westerlies, the winds that flow from west to east at the top and bottom of the map (e.g. pushing rain across the North Atlantic to the UK), and they both show the trade winds, which flow from east to west, closer to the equator. Both videos also show how cyclones form in the regions between these wind patterns. For example, in both videos, you can see the typhoon season ramp up in the Western Pacific in August and September – the model has two hitting Japan in August, and the satellite data shows several hitting China in September. The curved tracks of these storms are similar in both videos. If you look closely, you can also see the daily cycle of evaporation and rain over South America and Central Africa in both videos – watch how these regions appear to pulse each day.

I find these similarities remarkable, because none of these patterns are coded into the climate model – they all emerge as a consequence of getting the basic thermodynamic properties of the atmosphere right. Remember also that a climate model is not intended to forecast the particular weather of any given year (that would be impossible, due to chaos theory). Instead, the model simulates a “typical” year on planet earth. So the specifics of where and when each storm forms do not correspond to anything that actually happened in any given year. But when the model gets the overall patterns about right, that’s a pretty impressive achievement.

We now have a fourth paper added to our special issue of the journal Geoscientific Model Development, on Community software to support the delivery of CMIP5. All papers are open access:

  • M. Stockhause, H. Höck, F. Toussaint, and M. Lautenschlager, Quality assessment concept of the World Data Center for Climate and its application to CMIP5 data, Geosci. Model Dev., 5, 1023-1032, 2012.
    Describes the distributed quality control concept that was developed for handling the terabytes of data generated from CMIP5, and the challenges in ensuring data integrity (also includes a useful glossary in an appendix).
  • B. N. Lawrence, V. Balaji, P. Bentley, S. Callaghan, C. DeLuca, S. Denvil, G. Devine, M. Elkington, R. W. Ford, E. Guilyardi, M. Lautenschlager, M. Morgan, M.-P. Moine, S. Murphy, C. Pascoe, H. Ramthun, P. Slavin, L. Steenman-Clark, F. Toussaint, A. Treshansky, and S. Valcke, Describing Earth system simulations with the Metafor CIM, Geosci. Model Dev., 5, 1493-1500, 2012.
    Explains the Common Information Model, which was developed to describe climate model experiments in a uniform way, including the model used, the experimental setup and the resulting simulation.
  • S. Valcke, V. Balaji, A. Craig, C. DeLuca, R. Dunlap, R. W. Ford, R. Jacob, J. Larson, R. O’Kuinghttons, G. D. Riley, and M. Vertenstein, Coupling technologies for Earth System Modelling, Geosci. Model Dev., 5, 1589-1596, 2012.
    An overview paper that compares different approaches to model coupling used by different earth system models in the CMIP5 ensemble.
  • S. Valcke, The OASIS3 coupler: a European climate modelling community software, Geosci. Model Dev., 6, 373-388, 2013 (See also the Supplement)
    A detailed description of the OASIS3 coupler, which is used in all the European models contributing to CMIP5. The OASIS User Guide is included as a supplement to this paper.

(Note: technically speaking, the call for papers for this issue is still open – if there are more software aspects of CMIP5 that you want to write about, feel free to submit them!)

This week, I start teaching a new grad course on computational models of climate change, aimed at computer science grad students with no prior background in climate science or meteorology. Here’s my brief blurb:

Detailed projections of future climate change are created using sophisticated computational models that simulate the physical dynamics of the atmosphere and oceans and their interaction with chemical and biological processes around the globe. These models have evolved over the last 60 years, along with scientists’ understanding of the climate system. This course provides an introduction to the computational techniques used in constructing global climate models, the engineering challenges in coupling and testing models of disparate earth system processes, and the scaling challenges involved in exploiting peta-scale computing architectures. The course will also provide a historical perspective on climate modelling, from the early ENIAC weather simulations created by von Neumann and Charney, through to today’s Earth System Models, and the role that these models play in the scientific assessments of the UN’s Intergovernmental Panel on Climate Change (IPCC). The course will also address the philosophical issues raised by the role of computational modelling in the discovery of scientific knowledge, the measurement of uncertainty, and a variety of techniques for model validation. Additional topics, based on interest, may include the use of multi-model ensembles for probabilistic forecasting, data assimilation techniques, and the use of models for re-analysis.

I’ve come up with a draft outline for the course, and some possible readings for each topic. Comments are very welcome:

  1. History of climate and weather modelling. Early climate science. Quick tour of range of current models. Overview of what we knew about climate change before computational modeling was possible.
  2. Calculating the weather. Bjerknes’ equations. ENIAC runs. What does a modern dynamical core do? [Includes basic introduction to thermodynamics of atmosphere and ocean]
  3. Chaos and complexity science. Key ideas: forcings, feedbacks, dynamic equilibrium, tipping points, regime shifts, systems thinking. Planetary boundaries. Potential for runaway feedbacks. Resilience & sustainability. (way too many readings this week. Have to think about how to address this – maybe this is two weeks’ worth of material?)
    • Liepert, B. G. (2010). The physical concept of climate forcing. Wiley Interdisciplinary Reviews: Climate Change, 1(6), 786-802.
    • Manson, S. M. (2001). Simplifying complexity: a review of complexity theory. Geoforum, 32(3), 405-414.
    • Rind, D. (1999). Complexity and Climate. Science, 284(5411), 105-107.
    • Randall, D. A. (2011). The Evolution of Complexity In General Circulation Models. In L. Donner, W. Schubert, & R. Somerville (Eds.), The Development of Atmospheric General Circulation Models: Complexity, Synthesis, and Computation. Cambridge University Press.
    • Meadows, D. H. (2008). Chapter One: The Basics. Thinking In Systems: A Primer (pp. 11-34). Chelsea Green Publishing.
    • Randers, J. (2012). The Real Message of Limits to Growth: A Plea for Forward-Looking Global Policy, 2, 102-105.
    • Rockström, J., Steffen, W., Noone, K., Persson, Å., Chapin, F. S., Lambin, E., Lenton, T. M., et al. (2009). Planetary boundaries: exploring the safe operating space for humanity. Ecology and Society, 14(2), 32.
    • Lenton, T. M., Held, H., Kriegler, E., Hall, J. W., Lucht, W., Rahmstorf, S., & Schellnhuber, H. J. (2008). Tipping elements in the Earth’s climate system. Proceedings of the National Academy of Sciences of the United States of America, 105(6), 1786-93.
  4. Typology of climate Models. Basic energy balance models. Adding a layered atmosphere. 3-D models. Coupling in other earth systems. Exploring dynamics of the socio-economic system. Other types of model: EMICS; IAMS.
  5. Earth System Modeling. Using models to study interactions in the earth system. Overview of key systems (carbon cycle, hydrology, ice dynamics, biogeochemistry).
  6. Overcoming computational limits. Choice of grid resolution; grid geometry, online versus offline; regional models; ensembles of simpler models; perturbed ensembles. The challenge of very long simulations (e.g. for studying paleoclimate).
  7. Epistemic status of climate models. E.g. what does a future forecast actually mean? How are model runs interpreted? Relationship between model and theory. Reproducibility and open science.
    • Shackley, S. (2001). Epistemic Lifestyles in Climate Change Modeling. In P. N. Edwards (Ed.), Changing the Atmosphere: Expert Knowledge and Environmental Government (pp. 107-133). MIT Press.
    • Sterman, J. D., Rykiel, E. J., Jr., & Oreskes, N. (1994). The Meaning of Models. Science, 264(5157), 329-331.
    • Randall, D. A., & Wielicki, B. A. (1997). Measurement, Models, and Hypotheses in the Atmospheric Sciences. Bulletin of the American Meteorological Society, 78(3), 399-406.
    • Smith, L. A. (2002). What might we learn from climate forecasts? Proceedings of the National Academy of Sciences of the United States of America, 99 Suppl 1, 2487-92.
  8. Assessing model skill – comparing models against observations, forecast validation, hindcasting. Validation of the entire modelling system. Problems of uncertainty in the data. Re-analysis, data assimilation. Model intercomparison projects.
  9. Uncertainty. Three different types: initial state uncertainty, scenario uncertainty and structural uncertainty. How well are we doing? Assessing structural uncertainty in the models. How different are the models anyway?
  10. Current Research Challenges. E.g.: Non-standard grids – e.g. non-rectangular, adaptive, etc; Probabilistic modelling – both fine grain (e.g. ECMWF work) and use of ensembles; Petascale datasets; Reusable couplers and software frameworks. (need some more readings on different research challenges for this topic)
  11. The future. Projecting future climates. Role of modelling in the IPCC assessments. What policymakers want versus what they get. Demands for actionable science and regional, decadal forecasting. The idea of climate services.
  12. Knowledge and wisdom. What the models tell us. Climate ethics. The politics of doubt. The understanding gap. Disconnect between our understanding of climate and our policy choices.
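
As a taste of the “basic energy balance models” in topic 4, here is a rough sketch of the simplest kind: a zero-dimensional model that balances absorbed sunlight against emitted infrared, S(1 − α)/4 = εσT⁴, and solves for the temperature T. The solar constant and albedo are approximate textbook values, and the effective emissivity of 0.612 is a tuning assumption that stands in for the greenhouse effect rather than something derived from first principles.

```python
SIGMA = 5.670374419e-8   # Stefan-Boltzmann constant, W m^-2 K^-4
SOLAR_CONSTANT = 1361.0  # sunlight at the top of the atmosphere, W m^-2 (approximate)
ALBEDO = 0.3             # fraction of sunlight reflected back to space (approximate)


def equilibrium_temperature(emissivity=1.0, solar_constant=SOLAR_CONSTANT, albedo=ALBEDO):
    """Temperature (K) at which emitted infrared balances absorbed sunlight."""
    # Divide by 4 because the planet intercepts sunlight over a disc
    # but radiates from the whole sphere.
    absorbed = solar_constant * (1.0 - albedo) / 4.0
    return (absorbed / (emissivity * SIGMA)) ** 0.25


# A planet that radiates as a perfect black body, with no greenhouse effect.
bare_planet = equilibrium_temperature()

# Assumed effective emissivity below 1: a crude stand-in for an atmosphere
# that traps some outgoing infrared.
with_greenhouse = equilibrium_temperature(emissivity=0.612)

print(f"no greenhouse effect: {bare_planet:.1f} K")
print(f"with assumed effective emissivity: {with_greenhouse:.1f} K")
```

Even this one-line model is a model of phenomena in the sense used above: change the albedo or the emissivity and it says what should happen to the temperature, and why.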

For a talk earlier this year, I put together a timeline of the history of climate modelling. I just updated it for my course, and now it’s up on Prezi, as a presentation you can watch and play with. Click the play button to follow the story, or just drag and zoom within the viewing pane to explore your own path.

Consider this a first draft though – if there are key milestones I’ve missed out (or misrepresented!) let me know!

In the talk I gave this week at the workshop on the CMIP5 experiments, I argued that we should do a better job of explaining how climate science works, especially the day-to-day business of working with models and data. I think we have a widespread problem that people outside of climate science have the wrong mental models about what a climate scientist does. As with any science, the day-to-day work might appear to be chaotic, with scientists dealing with the daily frustrations of working with large, messy datasets, having instruments and models not work the way they’re supposed to, and of course, the occasional mistake that you only discover after months of work. This doesn’t map onto the mental model that many non-scientists have of “how science should be done”, because the view presented in school, and in the media, is that science is about nicely packaged facts. In reality, it’s a messy process of frustrations, dead-end paths, and incremental progress exploring the available evidence.

Some climate scientists I’ve chatted to are nervous about exposing more of this messy day-to-day work. They already feel under constant attack, and they feel that allowing the public to peer under the lid (or if you prefer, to see inside the sausage factory) will only diminish people’s respect for the science. I take the opposite view – the more we present the science as a set of nicely polished results, the more potential there is for the credibility of the science to be undermined when people do manage to peek under the lid (e.g. by publishing internal emails). I think it’s vitally important that we work to clear away some of the incorrect mental models people have of how science is (or should be) done, and give people a better appreciation for how our confidence in scientific results gradually emerges from a slow, messy, collaborative process.

Giving people a better appreciation of how science is done would also help to overcome some of the games of ping pong you get in the media, where each new result in a published paper is presented as a startling new discovery, overturning previous research, and (if you’re in the business of selling newspapers, preferably) overturning an entire field. In fact, it’s normal for new published results to turn out to be wrong, and most of the interesting work in science is in reconciling apparently contradictory findings.

The problem is that these incorrect mental models of how science is done are often well entrenched, and the best that we can do is to try to chip away at them, by explaining at every opportunity what scientists actually do. For example, here’s a mental model I’ve encountered from time to time about how climate scientists build models to address the kinds of questions policymakers ask about the need for different kinds of climate policy:

This view suggests that scientists respond to a specific policy question by designing and building software models (preferably testing that the model satisfies its specification), and then running the model to answer the question. This is not the only (or even the most common?) layperson’s view of climate modelling, but the point is that there are many incorrect mental models of how climate models are developed and used, and one of the things we should strive to do is to work towards dislodging some of these by doing a better job of explaining the process.

With respect to climate model development, I’ve written before about how models slowly advance based on a process that roughly mimics the traditional view of “the scientific method” (I should acknowledge, for all the philosophy of science buffs, that there really isn’t a single, “correct” scientific method, but let’s keep that discussion for another day). So here’s how I characterize the day-to-day work of developing a model:

Most of the effort is spent identifying and diagnosing where the weaknesses in the current model are, and looking for ways to improve them. Each possible improvement then becomes an experiment, in which the experimental hypothesis might look like:

“if I change <piece of code> in <routine>, I expect it to have <specific impact on model error> in <output variable> by <expected margin> because of <tentative theory about climatic processes and how they’re represented in the model>”

The previous version of the model acts as a control, and the modified model is the experimental condition.
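
To make the shape of such an experiment concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the function names, the toy numbers, and the choice of root-mean-square error as the measure of model error are my own illustrative assumptions, not the procedure used at any particular lab.

```python
import math

def rmse(simulated, observed):
    """Root-mean-square error between a model output and observations."""
    if len(simulated) != len(observed):
        raise ValueError("series must have the same length")
    return math.sqrt(
        sum((s - o) ** 2 for s, o in zip(simulated, observed)) / len(observed)
    )

def evaluate_change(control, experiment, observed, expected_reduction):
    """Compare the modified model (experiment) with the previous
    version (control) and test the hypothesised reduction in error."""
    control_error = rmse(control, observed)
    experiment_error = rmse(experiment, observed)
    reduction = control_error - experiment_error
    return {
        "control_error": control_error,
        "experiment_error": experiment_error,
        "reduction": reduction,
        "hypothesis_supported": reduction >= expected_reduction,
    }

# Invented toy data: one output variable at four locations.
observed = [14.0, 15.0, 16.0, 17.0]
control = [15.0, 16.0, 17.0, 18.0]      # previous version, biased by +1.0
experiment = [14.5, 15.5, 16.5, 17.5]   # modified version, biased by +0.5

result = evaluate_change(control, experiment, observed, expected_reduction=0.4)
print(result)
```

In practice the interesting part is the "because" clause of the hypothesis: a drop in error only counts as progress if it happens for the reason the modeller expected.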

But of course, this process isn’t just a random walk – it’s guided at the next level up by a number of influences, because the broader climate science community (and to some extent the meteorological community) are doing all sorts of related research, which then influences model development. In the paper we wrote about the software development processes at the UK Met Office, we portrayed it like this:

But I could go even broader and place this within a context in which a number of longer term observational campaigns (“process studies”) are collecting new types of observational data to investigate climate processes that are still poorly understood. This then involves the interaction of several distinct communities. Christian Jakob portrays it like this:

The point of Jakob’s paper, though, is to argue that the modelling and process studies communities don’t currently do enough of this kind of interaction, so there’s room for improvement in how the modelling influences the kinds of process studies needed, and how the results from process studies feed back into model development.

So, how else should we be explaining the day-to-day work of climate scientists?

I’m attending a workshop this week in which some of the initial results from the Fifth Coupled Model Intercomparison Project (CMIP5) will be presented. CMIP5 will form a key part of the next IPCC assessment report – it’s a coordinated set of experiments on the global climate models built by labs around the world. The experiments include hindcasts to compare model skill on pre-industrial and 20th Century climate, projections into the future for 100 and 300 years, shorter term decadal projections, paleoclimate studies, plus lots of other experiments that probe specific processes in the models. (For more explanation, see the post I wrote on the design of the experiments for CMIP5 back in September).

I’ve been looking at some of the data for the past CMIP exercises. CMIP1 originally consisted of one experiment – a control run with fixed forcings. The idea was to compare how each of the models simulates a stable climate. CMIP2 included two experiments, a control run like CMIP1, and a climate change scenario in which CO2 levels were increased by 1% per year. CMIP3 then built on these projects with a much broader set of experiments, and formed a key input to the IPCC Fourth Assessment Report.

There was no CMIP4, as the numbers were resynchronised to match the IPCC report numbers (also there was a thing called the Coupled Carbon Cycle Climate Model Intercomparison Project, which was nicknamed C4MIP, so it’s probably just as well!), so CMIP5 will feed into the fifth assessment report.

So here’s what I have found so far on the vital statistics of each project. Feel free to correct my numbers and help me to fill in the gaps!

Project | CMIP1 (1996 onwards) | CMIP2 (1997 onwards) | CMIP3 | CMIP5
Number of Experiments | 1 | 2 | 12 | 110
Centres Participating | 16 | 18 | 15 | 24
# of Distinct Models | 19 | 24 | 21 | 45
# of Runs (Models X Expts) | 19 | 48 | 211 | 841
Total Dataset Size | 1 Gigabyte | 500 Gigabytes | 36 Terabytes | 3.3 Petabytes
Total Downloads from archive | ?? | ?? | 1.2 Petabytes
Number of Papers Published | 47 | 595
Users | ?? | ?? | 6700

[Update:] I’ve added a row for number of runs, i.e. the sum of the number of experiments run on each model (in CMIP3 and CMIP5, centres were able to pick a subset of the experiments to run, so you can’t just multiply models and experiments to get the number of runs). Also, I ought to calculate the total number of simulated years that represents (If a centre did all the CMIP5 experiments, I figure it would result in at least 12,000 simulated years).
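
The difference between the two ways of counting is easy to see in a small sketch. The participation record below is invented purely for illustration (the model and experiment names are placeholders, not real CMIP entries):

```python
# Hypothetical record of which experiments each model actually ran.
experiments_run = {
    "model_a": {"expt_1", "expt_2", "expt_3"},
    "model_b": {"expt_1", "expt_2"},
    "model_c": {"expt_1"},
}

all_experiments = set().union(*experiments_run.values())

# Naive estimate: assume every model ran every experiment.
naive_runs = len(experiments_run) * len(all_experiments)

# Actual count: add up the experiments each model really ran.
actual_runs = sum(len(expts) for expts in experiments_run.values())

print(naive_runs, actual_runs)
```

With three models and three experiments the naive product is 9, but only 6 runs were actually performed, which is the kind of gap the table shows for CMIP3 and CMIP5.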

Oh, one more datapoint from this week. We came up with an estimate that by 2020, each individual experiment will generate an Exabyte of data. I’ll explain how we got this number once we’ve given the calculations a bit more of a thorough checking over.

Our paper on defect density analysis of climate models is now out for review at the journal Geoscientific Model Development (GMD). GMD is an open review / open access journal, which means the review process is publicly available (anyone can see the submitted paper, the reviews it receives during the process, and the authors’ response). If the paper is eventually accepted, the final version will also be freely available.

The way this works at GMD is that the paper is first published to Geoscientific Model Development Discussions (GMDD) as an un-reviewed manuscript. The interactive discussion is then open for a fixed period (in this case, 2 months). At that point the editors will make a final accept/reject decision, and, if accepted, the paper is then published to GMD itself. During the interactive discussion period, anyone can post comments on the paper, although in practice, discussion papers often only get comments from the expert reviewers commissioned by the editors.

One of the things I enjoy about the peer-review process is that a good, careful review can help improve the final paper immensely. As I’ve never submitted before to a journal that uses an open review process, I’m curious to see how the open reviewing will help – I suspect (and hope!) it will tend to make reviewers more constructive.

Anyway, here’s the paper. As it’s open review, anyone can read it and make comments (click the title to get to the review site):

Assessing climate model software quality: a defect density analysis of three models

J. Pipitone and S. Easterbrook
Department of Computer Science, University of Toronto, Canada

Abstract. A climate model is an executable theory of the climate; the model encapsulates climatological theories in software so that they can be simulated and their implications investigated. Thus, in order to trust a climate model one must trust that the software it is built from is built correctly. Our study explores the nature of software quality in the context of climate modelling. We performed an analysis of defect reports and defect fixes in several versions of leading global climate models by collecting defect data from bug tracking systems and version control repository comments. We found that the climate models all have very low defect densities compared to well-known, similarly sized open-source projects. We discuss the implications of our findings for the assessment of climate model software trustworthiness.

On Thursday, Kaitlin presented her poster at the AGU meeting, which shows the results of the study she did with us in the summer. Her poster generated a lot of interest, especially the visualizations she has of the different model architectures. Click on the thumbnail to see the full poster at the AGU site:

A few things to note when looking at the diagrams:

  • Each diagram shows the components of a model, scaled to their relative size by lines of code. However, the models are not to scale with one another, as the smallest, UVic’s, is only a tenth of the size of the biggest, CESM. Someone asked what accounts for that difference in size. Well, the UVic model is an EMIC rather than a GCM. It has a very simplified atmosphere model that does not include atmospheric dynamics, which makes it easier to run for very long simulations (e.g. to study paleoclimate). On the other hand, CESM is a community model, with a large number of contributors across the scientific community. (See Randall and Held’s point/counterpoint article in last month’s IEEE Software for a discussion of how these fit into different model development strategies).
  • The diagrams show the couplers (in grey), again sized according to number of lines of code. A coupler handles data re-gridding (when the scientific components use different grids), temporal aggregation (when the scientific components run on different time steps) along with other data handling. These are often invisible in diagrams the scientists create of their models, because they are part of the infrastructure code; however Kaitlin’s diagrams show how substantial they are in comparison with the scientific modules. The European models all use the same coupler, following a decade-long effort to develop this as a shared code resource.
  • Note that there are many different choices associated with the use of a coupler, as sometimes it’s easier to connect components directly rather than through the coupler, and the choice may be driven by performance impact, flexibility (e.g. ‘plug-and-play’ compatibility) and legacy code issues. Sea ice presents an interesting example, because its extent varies over the course of a model run. So somewhere there must be code that keeps track of which grid cells have ice, and then routes the fluxes from ocean and atmosphere to the sea ice component for these grid cells. This could be done in the coupler, or in any of the three scientific modules. In the GFDL model, sea ice is treated as an interface to the ocean, so all atmosphere-ocean fluxes pass through it, whether there’s ice in a particular cell or not.
  • The relative size of the scientific components is a reasonable proxy for functionality (or, if you like, scientific complexity/maturity). Hence, the diagrams give clues about where each lab has placed its emphasis in terms of scientific development, whether by deliberate choice, or because of availability (or unavailability) of different areas of expertise. The differences between the models from different labs show some strikingly different choices here, for example between models that are clearly atmosphere-centric, versus models that have a more balanced set of earth system components.
  • One comment we received in discussions around the poster was about the places where we have shown sub-components in some of the models. Some modeling groups are more explicit about naming the sub-components, and indicating them in the code. Hence, our ability to identify these might depend more on naming practices than on any fundamental architectural differences.
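
For readers who haven’t seen inside a coupler, here is a toy sketch of two of the jobs mentioned above: temporal aggregation between components that run on different time steps, and routing fluxes to the sea ice component only where there is ice. It is a deliberately simplified illustration under my own assumptions. The function names are hypothetical, the numbers are invented, and a real coupler works on fractional ice cover and multi-dimensional grids rather than a yes/no mask over a short list of cells.

```python
def aggregate_in_time(fluxes, steps_per_exchange):
    """Average a flux over consecutive groups of fast time steps, so that
    a slower component receives one value per coupling exchange."""
    if len(fluxes) % steps_per_exchange != 0:
        raise ValueError("number of steps must be a multiple of steps_per_exchange")
    return [
        sum(fluxes[i:i + steps_per_exchange]) / steps_per_exchange
        for i in range(0, len(fluxes), steps_per_exchange)
    ]

def route_fluxes(flux_per_cell, ice_mask):
    """Send each cell's flux to the sea ice component where the cell is
    ice-covered, and to the ocean component otherwise."""
    to_ice = [f if icy else 0.0 for f, icy in zip(flux_per_cell, ice_mask)]
    to_ocean = [0.0 if icy else f for f, icy in zip(flux_per_cell, ice_mask)]
    return to_ice, to_ocean

# One grid cell, four atmosphere steps for every two ocean steps.
atmosphere_steps = [100.0, 110.0, 90.0, 100.0]
ocean_steps = aggregate_in_time(atmosphere_steps, steps_per_exchange=2)

# Three grid cells at a single exchange; the first and last have ice.
flux_per_cell = [105.0, 95.0, 80.0]
ice_mask = [True, False, True]
to_ice, to_ocean = route_fluxes(flux_per_cell, ice_mask)
print(ocean_steps, to_ice, to_ocean)
```

Because the ice mask changes during a run, the routing has to be recomputed at every exchange, which is one reason the question of where this code lives matters architecturally.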

I’m sure Kaitlin will blog more of her reflections on the poster (and AGU in general) once she’s back home.

I’m at the AGU meeting in San Francisco this week. The internet connections in the meeting rooms suck, so I won’t be twittering much, but will try and blog any interesting talks. But first things first! I presented my poster in the session on “Methodologies of Climate Model Evaluation, Confirmation, and Interpretation” yesterday morning. Nice to get my presentation out of the way early, so I can enjoy the rest of the conference.

Here’s my poster, and the abstract is below (click for the full sized version at the AGU ePoster site):

A Hierarchical Systems Approach to Model Validation


Discussions of how climate models should be evaluated tend to rely on either philosophical arguments about the status of models as scientific tools, or on empirical arguments about how well runs from a given model match observational data. These lead to quantitative measures expressed in terms of model bias or forecast skill, and ensemble approaches where models are assessed according to the extent to which the ensemble brackets the observational data.

Such approaches focus the evaluation on models per se (or more specifically, on the simulation runs they produce), as if the models can be isolated from their context. Such approaches may overlook a number of important aspects of the use of climate models:

  • the process by which models are selected and configured for a given scientific question.
  • the process by which model outputs are selected, aggregated and interpreted by a community of expertise in climatology.
  • the software fidelity of the models (i.e. whether the running code is actually doing what the modellers think it’s doing).
  • the (often convoluted) history that begat a given model, along with the modelling choices long embedded in the code.
  • variability in the scientific maturity of different components within a coupled earth system model.

These omissions mean that quantitative approaches cannot assess whether a model produces the right results for the wrong reasons, or conversely, the wrong results for the right reasons (where, say, the observational data is problematic, or the model is configured to be unlike the earth system for a specific reason).

Furthermore, quantitative skill scores only assess specific versions of models, configured for specific ensembles of runs; they cannot reliably make any statements about other configurations built from the same code.

Quality as Fitness for Purpose

The problem is that there is no such thing as “the model”. The body of code that constitutes a modern climate model actually represents an enormous number of possible models, each corresponding to a different way of configuring that code for a particular run. Furthermore, this body of code isn’t a static thing. The code is changed on a daily basis, through a continual process of experimentation and model improvement. This applies even to any specific “official release”, which again is just a body of code that can be configured to run as any of a huge number of different models, and again, is not unchanging – as with all software, there will be occasional bugfix releases applied to it, along with improvements to the ancillary datasets.

Evaluation of climate models should not be about “the model”, but about the relationship between a modelling system and the purposes to which it is put. More precisely, it’s about the relationship between particular ways of building and configuring models and the ways in which the runs produced by those models are used.

What are the uses of a climate model? They vary tremendously:

  • To provide inputs to assessments of the current state of climate science;
  • To explore the consequences of a current theory;
  • To test a hypothesis about the observational system (e.g. forward modeling);
  • To test a hypothesis about the calculational system (e.g. to explore known weaknesses);
  • To provide homogenized datasets (e.g. re-analysis);
  • To conduct thought experiments about different climates;
  • To act as a comparator when debugging another model;

In general, we can distinguish three separate systems: the calculational system (the model code); the theoretical system (current understandings of climate processes) and the observational system. In the most general sense, climate models are developed to explore how well our current understanding (i.e. our theories) of climate explain the available observations. And of course the inverse: what additional observations might we make to help test our theories.

We’re dealing with relationships between three different systems

Validation of the Entire Modeling System

When we ask questions about likely future climate change, we don’t ask the question of the calculational system, we ask it of the theoretical system; the models are just a convenient way of probing the theory to provide answers.

When society asks climate scientists for future projections, the question is directed at climate scientists, not their models. Modellers apply their judgment to select appropriate versions & configurations of the models to use, set up the runs, and interpret the results in the light of what is known about the models’ strengths and weaknesses and about any gaps between the computational models and the current theoretical understanding. And they add all sorts of caveats to the conclusions they draw from the model runs when they present their results.

Validation is not a post-hoc process to be applied to an individual “finished” model, to ensure it meets some criteria for fidelity to the real world. In reality, there is no such thing as a finished model, just many different snapshots of a large set of model configurations, steadily evolving as the science progresses. Knowing something about the fidelity of a given model configuration to the real world is useful, but not sufficient to address fitness for purpose. For this, we have to assess the extent to which climate models match our current theories, and the extent to which the process of improving the models keeps up with theoretical advances.


Our approach to model validation extends current approaches:

  • down into the detailed codebase to explore the processes by which the code is built and tested. Thus, we build up a picture of the day-to-day practices by which modellers make small changes to the model and test the effect of such changes (both in isolated sections of code, and on the climatology of a full model). The extent to which these practices improve the confidence and understanding of the model depends on how systematically this testing process is applied, and how many of the broad range of possible types of testing are applied. We also look beyond testing to other software practices that improve trust in the code, including automated checking for conservation of mass across the coupled system, and various approaches to spin-up and restart testing.
  • up into the broader scientific context in which models are selected and used to explore theories and test hypotheses. Thus, we examine how features of the entire scientific enterprise improve (or impede) model validity, from the collection of observational data, creation of theories, use of these theories to develop models, choices for which model and which model configuration to use, choices for how to set up the runs, and interpretation of the results. We also look at how model inter-comparison projects provide a de facto benchmarking process, leading in turn to exchanges of ideas between modelling labs, and hence advances in the scientific maturity of the models.
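
Two of the software practices mentioned above, conservation checks and restart testing, are simple enough to sketch on a toy example. The two-box “model” below is entirely made up for illustration, as are the function names and the tolerance; the only point is the shape of the tests. A conservation check asserts that a quantity that should be conserved across the coupled system really is, and a restart test asserts that a run which is stopped, saved and restarted gives exactly the same answer as one that runs straight through.

```python
import json

def step(state, exchange_rate=0.1):
    """Advance the toy model one step: move a fraction of the difference
    in mass between the two boxes from the atmosphere box to the ocean box
    (the transfer is negative if the ocean box holds more)."""
    transfer = exchange_rate * (state["atmosphere"] - state["ocean"])
    return {
        "atmosphere": state["atmosphere"] - transfer,
        "ocean": state["ocean"] + transfer,
    }

def run(state, n_steps):
    """Run the toy model for n_steps from the given state."""
    for _ in range(n_steps):
        state = step(state)
    return state

def total_mass(state):
    return state["atmosphere"] + state["ocean"]

def conserves_mass(initial, final, tolerance=1e-9):
    """Conservation check: total mass should not drift during a run."""
    return abs(total_mass(final) - total_mass(initial)) <= tolerance

def restart_test(initial, n_steps):
    """Restart test: a run interrupted half-way, written to a restart
    file and resumed must match an uninterrupted run exactly."""
    straight_through = run(initial, n_steps)
    half = n_steps // 2
    restart_file = json.dumps(run(initial, half))   # save the state
    resumed = run(json.loads(restart_file), n_steps - half)
    return straight_through == resumed

initial = {"atmosphere": 10.0, "ocean": 90.0}
final = run(initial, 10)
mass_ok = conserves_mass(initial, final)
restart_ok = restart_test(initial, 10)
print(mass_ok, restart_ok)
```

In a real model the restart test is much more demanding than it looks here, because every piece of state that affects the next time step has to be captured in the restart file.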

This layered approach does not attempt to quantify model validity, but it can provide a systematic account of how the detailed practices involved in the development and use of climate models contribute to the quality of modelling systems and the scientific enterprise that they support. By making the relationships between these practices and model quality more explicit, we expect to identify specific strengths and weaknesses of the modelling systems, particularly with respect to structural uncertainty in the models, and better characterize the “unknown unknowns”.

I had several interesting conversations at WCRP11 last week about how different the various climate models are. The question is important because it gives some insight into how much an ensemble of different models captures the uncertainty in climate projections. Several speakers at WCRP suggested we need an international effort to build a new, best of breed climate model. For example, Christian Jakob argued that we need a “Manhattan project” to build a new, more modern climate model, rather than continuing to evolve our old ones (I’ve argued in the past that this is not a viable approach). There have also been calls for a new international climate modeling centre, with the resources to build much larger supercomputing facilities.

The counter-argument is that the current diversity in models is important, and re-allocating resources to a single centre would remove this benefit. Currently around 20 or so different labs around the world build their own climate models to participate in the model inter-comparison projects that form a key input to the IPCC assessments. Part of the argument for this diversity of models is that when different models give similar results, that boosts our confidence in those results, and when they give different results, the comparisons provide insights into how well we currently understand and can simulate the climate system. For assessment purposes, the spread of the models is often taken as a proxy for uncertainty, in the absence of any other way of calculating error bars for model projections.

But that raises a number of questions. How well does the current set of coupled climate models capture the uncertainty? How different are the models really? Do they all share similar biases? And can we characterize how model intercomparisons feed back into progress in improving the models? I think we’re starting to get interesting answers to the first two of these questions, while the last two are, I think, still unanswered.

First, then, is the question of representing uncertainty. There are, of course, a number of sources of uncertainty. [Note that ‘uncertainty’ here doesn’t mean ‘ignorance’ (a mistake often made by non-scientists); it means, roughly, how big should the error bars be when we make a forecast, or more usefully, what does the probability distribution look like for different climate outcomes?]. In climate projections, sources of uncertainty can be grouped into three types:

  • Internal variability: natural fluctuations in the climate (for example, the year-to-year differences caused by the El Niño Southern Oscillation, ENSO);
  • Scenario uncertainty: the uncertainty over future carbon emissions, land use changes, and other types of anthropogenic forcings. As we really don’t know how these will change year-by-year in the future (irrespective of whether any explicit policy targets are set), it’s hard to say exactly how much climate change we should expect.
  • Model uncertainty: the range of different responses to the same emissions scenario given by different models. Such differences arise, presumably, because we don’t understand all the relevant processes in the climate system perfectly. This is the kind of uncertainty that a large ensemble of different models ought to be able to assess.

Hawkins and Sutton analyzed the impact of these different types of uncertainty on projections of global temperature over the range of a century. Here, Fractional Uncertainty means the ratio of the model spread to the projected temperature change (against a 1971-2000 mean):

This analysis shows that for short term (decadal) projections, the internal variability is significant. Finding ways of reducing this (for example by better model initialization from the current state of the climate) is important for the kind of near-term regional projections needed by, for example, city planners, and utility and insurance companies. Hawkins & Sutton indicate with dashed lines some potential to reduce this uncertainty for decadal projections through better initialization of the models.

For longer term (century) projections, internal variability is dwarfed by scenario uncertainty. However, if we’re clear about the nature of the scenarios used, we can put scenario uncertainty aside and treat model runs as “what-if” explorations – if the emissions follow a particular pathway over the 21st Century, what climate response might we expect?

Model uncertainty remains significant over both short and long term projections. The important question here for predicting climate change is how much of this range of different model responses captures the real uncertainties in the science itself. In the analysis above, the variability due to model differences is about 1/4 of the magnitude of the mean temperature rise projected for the end of the century. For example, if a given emissions scenario leads to a model mean of +4°C, the model spread would be about 1°C, yielding a projection of +4±0.5°C. So is that the right size for an error bar on our end-of-century temperature projections? Or, to turn the question around, what is the probability of a surprise – where the climate change turns out to fall outside the range represented by the current model ensemble?
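
The arithmetic in that example is worth spelling out. In the sketch below the function name is my own, and I’m assuming (as the example in the text does) that the model spread is the full width of the range and sits symmetrically around the model mean:

```python
def projection_range(mean_change, fractional_uncertainty):
    """Turn a model-mean temperature change and a fractional uncertainty
    (model spread divided by mean change) into a spread and a range."""
    spread = fractional_uncertainty * mean_change   # full width of the range
    half_width = spread / 2                         # the +/- value
    return spread, mean_change - half_width, mean_change + half_width

# A model mean of +4 degrees C with a fractional uncertainty of 1/4.
spread, low, high = projection_range(mean_change=4.0, fractional_uncertainty=0.25)
print(spread, low, high)
```

So a fractional uncertainty of 1/4 on a +4°C mean gives a spread of 1°C, i.e. a range from +3.5°C to +4.5°C.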

Just as importantly, is the model ensemble mean the most likely outcome? Or do the models share certain biases so that the truth is somewhere other than the multi-model mean? Last year, James Annan demolished the idea that the models cluster around the truth, and in a paper with Julia Hargreaves, provides some evidence that the model ensembles do a relatively good job of bracketing the observational data, and, if anything, the ensemble spread is too broad. If the latter point is correct, then the model ensembles over-estimate the uncertainty.
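
One standard diagnostic for this kind of question is a rank histogram: at each location, count how many ensemble members fall below the observed value, then tally those ranks. If the observations behave like just another ensemble member, the ranks come out roughly evenly spread; if the ensemble is too broad they pile up in the middle, and if it is too narrow they pile up at the ends. The sketch below is my own minimal illustration with invented numbers and hypothetical function names, not a reproduction of the analysis in the paper.

```python
from collections import Counter

def rank_of_observation(members, observed):
    """Number of ensemble members that fall below the observation."""
    return sum(1 for value in members if value < observed)

def is_bracketed(members, observed):
    """True if the observation lies within the ensemble range."""
    return min(members) <= observed <= max(members)

def rank_histogram(ensemble_by_location, observations):
    """Tally the rank of the observation at each location.
    With N members there are N + 1 possible ranks (0 to N)."""
    ranks = [
        rank_of_observation(members, observed)
        for members, observed in zip(ensemble_by_location, observations)
    ]
    counts = Counter(ranks)
    n_members = len(ensemble_by_location[0])
    return [counts.get(rank, 0) for rank in range(n_members + 1)]

# Invented data: three ensemble members at each of four locations.
ensemble_by_location = [[1.0, 2.0, 3.0]] * 4
observations = [0.5, 1.5, 2.5, 3.5]

histogram = rank_histogram(ensemble_by_location, observations)
bracketed = [
    is_bracketed(members, observed)
    for members, observed in zip(ensemble_by_location, observations)
]
print(histogram, bracketed)
```

With only four locations this tells you nothing statistically, of course; the real analyses use the full spatial field.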

This brings me to the question of how different the models really are. Over the summer, Kaitlin Alexander worked with me to explore the software architecture of some of the models that I’ve worked with from Europe and N. America. The first thing that jumped out at me when she showed me her diagrams was how different the models all look from one another. Here are six of them presented side-by-side. The coloured ovals indicate the size (in lines of code) of each major model component (relative to other components in the same model; the different models are not shown to scale), and the coloured arrows indicate data exchanges between the major components (see Kaitlin’s post for more details):

There are clearly differences in how the components are coupled together (for example, whether all data exchanges pass through a coupler, or whether components interact directly). In some cases, major subcomponents are embedded as subroutines within a model component, which makes the architecture harder to understand, but may make sense from a scientific point of view, when earth system processes themselves are tightly coupled. However, such differences in the code might just be superficial, as the choice of call structure should not, in principle, affect the climatology.

The other significant difference is in the relative sizes of the major components. Lines of code isn’t necessarily a reliable measure, but it usually offers a reasonable proxy for the amount of functionality. So a model with an atmosphere model dramatically bigger than the other components indicates a model for which far more work (and hence far more science) has gone into modeling the atmosphere than the other components.

Compare, for example, the relative sizes of the atmosphere and ocean components for HadGEM3 and IPSLCM5A, which, incidentally, both use the same ocean model, NEMO. HadGEM3 has a much bigger atmosphere model, representing more science, or at least many more options for different configurations. In part, this is because the UK Met Office is an operational weather forecasting centre, and the code base is shared between NWP and climate research. Daily use of this model for weather forecasting offers many opportunities to improve the skill of the model (although improvement in skill in short term weather forecasting doesn’t necessarily imply improvements in skill for climate simulations). However, the atmosphere model is the biggest beneficiary of this process, and, in fact, the UK Met Office does not have much expertise in ocean modeling. In contrast, the IPSL model is the result of a collaboration between several similarly sized research groups, representing different earth subsystems.

But do these architectural differences show up as scientific differences? I think they do, but I was finding this hard to analyze. Then I had a fascinating conversation at WCRP last week with Reto Knutti, who showed me a recent paper that he published with D. Masson, in which they analyzed model similarity from across the CMIP3 dataset. The paper describes a cluster analysis over all the CMIP3 models (plus three re-analysis datasets, to represent observations), based on how well they capture the full spatial field for temperature (on the left) and precipitation (on the right). The cluster diagrams look like this (click for bigger):

In these diagrams, the models from the same lab are coloured the same. Observational data are in pale blue (three observational datasets were included for temperature, and two for precipitation). Some obvious things jump out: the different observational datasets are more similar to each other than they are to any other model, but as a cluster, they don’t look any different from the models. Interestingly, models from the same lab tend to be more similar to one another, even when these span different model generations. For example, for temperature, the UK Met Office models HadCM3 and HadGEM1 are more like each other than they are like any other models, even though they run at very different resolutions, and have different ocean models. For precipitation, all the GISS models cluster together and are quite different from all the other models.

The overall conclusion from this analysis is that using models from just one lab (even in very different configurations, and across model generations) gives you a lot less variability than using models from different labs. Which does suggest that there’s something in the architectural choices made at each lab that leads to a difference in the climatology. In the paper, Masson & Knutti go on to analyze perturbed physics ensembles, and show that the same effect shows up here too. Taking a single model, and systematically varying the parameters used in the model physics still gives you less variability than using models from different labs.
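
To give a feel for what “more similar” means here, the sketch below measures the distance between pairs of spatial fields and reports each model’s nearest neighbour. This is a much cruder stand-in for the hierarchical clustering in the paper: the root-mean-square distance, the function names, the model labels and the three-cell “fields” are all my own invented assumptions for illustration.

```python
import math

def field_distance(field_a, field_b):
    """Root-mean-square difference between two spatial fields,
    each given as a flat list of grid-cell values."""
    return math.sqrt(
        sum((a - b) ** 2 for a, b in zip(field_a, field_b)) / len(field_a)
    )

def nearest_neighbours(fields):
    """For each model, find the other model whose field is closest."""
    nearest = {}
    for name, field in fields.items():
        candidates = [
            (field_distance(field, other), other_name)
            for other_name, other in fields.items()
            if other_name != name
        ]
        nearest[name] = min(candidates)[1]
    return nearest

# Invented temperature fields for two generations of model from two labs.
fields = {
    "lab1_old": [10.0, 12.0, 14.0],
    "lab1_new": [10.5, 12.5, 14.5],
    "lab2_old": [13.0, 12.0, 11.0],
    "lab2_new": [13.5, 12.5, 11.5],
}

nearest = nearest_neighbours(fields)
print(nearest)
```

In this made-up case each model’s nearest neighbour is the other model from the same lab, which is the pattern the cluster diagrams show for the real models.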

There’s another followup question that I would like to analyze: do models that share major components tend to cluster together? There’s a growing tendency for a given component (e.g. an ocean model, an atmosphere model) to show up in more than one lab’s GCM. It’s not yet clear how this affects variability in a multi-model ensemble.

So what are the lessons here? First, there is evidence that the use of multi-model ensembles is valuable and important, and that these ensembles capture the uncertainty much better than multiple runs of a single model (no matter how it is perturbed). The evidence suggests that models from different labs are significantly different from one another both scientifically and structurally, and at least part of the explanation for this is that labs tend to have different clusters of expertise across the full range of earth system processes. Studies that compare model results with observational data (e.g. Hargreaves & Annan; Masson & Knutti) show that the observations look no different from just another member of the multi-model ensemble (or to put it in Annan and Hargreaves’ terms, the truth is statistically indistinguishable from another model in the ensemble).

It would appear that the current arrangement of twenty or so different labs competing to build their own models is a remarkably robust approach to capturing the full range of scientific uncertainty with respect to climate processes. And hence it doesn’t make sense to attempt to consolidate this effort into one international lab.