One of the questions I’ve been chatting to people about at the WCRP Open Science Conference this week is whether climate modelling needs to be reorganized as an operational service, rather than as a scientific activity. The two respond to quite different goals, and hence would be organized very differently:

  • An operational modelling centre would prioritize stability and robustness of the code base, and focus on supporting the needs of (non-scientist) end-users who want models and model results.
  • A scientific modelling centre focusses on supporting scientists themselves as users. The key priority here is to support the scientists’ need to get their latest ideas into the code, to run experiments and get data ready to support publication of new results. (This is what most climate modeling centres do right now).

Both need good software practices, but those practices would look very different depending on whether the scientists are building code for their own experiments, or serving the needs of other communities. There are also very different resource implications: an operational centre that serves the needs of a much more diverse set of stakeholders would need a much larger engineering support team relative to the scientific team.

The question seems very relevant to the conference this week, as one of the running themes has been the question of what “climate services” might look like. Many of the speakers call for “actionable science”, and there has been a lot of discussion of how scientists should work with various communities who need knowledge about climate to inform their decision-making.

And there’s clearly a gap here, with lots of criticism of how it works at the moment. For example, here’s a great quote from Bruce Hewitson on the current state of climate information:

“A proliferation of portals and data sets, developed with mixed motivations, with poorly articulated uncertainties and weakly explained assumptions and dependencies, the data implied as information, displayed through confusing materials, hard to find or access, written in opaque language, and communicated by interface organizations only semi‐aware of the nuances, to a user community poorly equipped to understand the information limitations”

I can’t argue with any of that. But it raises the question of whether solving this problem requires a reconceptualization of climate modeling activities to make them much more like operational weather forecasting centres.

Most of the people I spoke to this week think that’s the wrong paradigm. In weather forecasting, the numerical models play a central role, and become the workhorse for service provision. The models are run every day, to supply all sorts of different types of forecasts to a variety of stakeholders. Sure, a weather forecasting service also needs to provide expertise to interpret model runs (and of course, also needs a vast data collection infrastructure to feed the models with observations). But in all of this, the models are absolutely central.

In contrast, for climate services, the models are unlikely to play such a central role. Take for example, the century-long runs, such as those used in the IPCC assessments. One might think that these model runs represent an “operational service” provided to the IPCC as an external customer. But this is a fundamentally mistaken view of what the IPCC is and what it does. The IPCC is really just the scientific community itself, reviewing and assessing the current state of the science. The CMIP5 model runs currently being done in preparation for the next IPCC assessment report, AR5, are conducted by, and for, the science community itself. Hence, these runs have to come from science labs working at the cutting edge of earth system modelling. An operational centre one step removed from the leading science would not be able to provide what the IPCC needs.

One can criticize the IPCC for not doing enough to translate the scientific knowledge into something that’s “actionable” for different communities that need such knowledge. But that criticism isn’t really about the modeling effort (e.g. the CMIP5 runs) that contributes to the Working Group 1 reports. It’s about how the implications of the Working Group 1 science translate into useful information in Working Groups 2 and 3.

The stakeholders who need climate services won’t be interested in century-long runs. At most they’re interested in decadal forecasts (a task that is itself still in its infancy, and a long way from being ready for operational forecasting). More often, they will want help interpreting observational data and trends, and assessing impacts on health, infrastructure, ecosystems, agriculture, water, etc. While such services might make use of data from climate model runs, they generally don’t involve running models regularly in an operational mode. Instead, the needs would be more focussed on downscaling the outputs from existing model run datasets. And sitting somewhere between current weather forecasting and long term climate projections is the need for seasonal forecasts and regional analysis of trends, attribution of extreme events, and so on.

So I don’t think it makes sense for climate modelling labs to move towards an operational modelling capability. Climate modeling centres will continue to focus primarily on developing models for use within the scientific community itself. Organizations that provide climate services might need to develop their own modelling capability, focussed more on high resolution, short term (decadal or shorter) regional modelling, and of course, on assessment models that explore the interaction of socio-economic factors and policy choices. Such assessment models would make use of basic climate data from general circulation models (for example, calculations of climate sensitivity, and spatial distributions of temperature change), but they wouldn’t connect directly with climate modeling.

We’ve just announced a special issue of the Open Access Journal Geoscientific Model Development (GMD):

Call for Papers: Special Issue on Community software to support the delivery of CMIP5

CMIP5 represents the most ambitious and computer-intensive model inter-comparison project ever attempted. Integrating a new generation of Earth system models and sharing the model results with a broad community has brought with it many significant technical challenges, along with new community-wide efforts to provide the necessary software infrastructure. This special issue will focus on the software that supports the scientific enterprise for CMIP5, including: couplers and coupling frameworks for Earth system models; the Common Information Model and Controlled Vocabulary for describing models and data; the development of the Earth System Grid Federation; the development of new portals for providing data access to different end-user communities; the scholarly publishing of datasets; and studies of the software development and testing processes used for the CMIP5 models. We especially welcome papers that offer comparative studies of the software approaches taken by different groups, and lessons learnt from community efforts to create shareable software components and frameworks.

See here for submission instructions. The call is open ended, as we can keep adding papers to the special issue. We’ve solicited papers from some of the software projects involved in CMIP5, but welcome unsolicited submissions too.

GMD operates an open review process, whereby submitted papers are posted to the open discussion site (known as GMDD), so that both the invited reviewers and anyone else can make comments on the papers and then discuss such comments with the authors, prior to a final acceptance decision for the journal. I was appointed to the editorial board earlier this year, and am currently getting my first taste of how this works – I’m looking forward to applying this idea to our special issue.

Valdivino, who is working on a PhD in Brazil on formal software verification techniques, is inspired by my suggestion to find ways to apply our current software research skills to climate science. But he asks some hard questions:

1.) If I want to Validate and Verify climate models should I forget all the things that I have learned so far in the V&V discipline? (e.g. Model-Based Testing (Finite State Machine, Statecharts, Z, B), structural testing, code inspection, static analysis, model checking)
2.) Among all V&V techniques, what can really be reused / adapted for climate models?

Well, I wish I had some good answers. When I started looking at the software development processes for climate models, I expected to be able to apply many of the formal techniques I’ve worked on in the past in Verification and Validation (V&V) and Requirements Engineering (RE). It turns out almost none of it seems to apply, at least in any obvious way.

Climate models are built through a long, slow process of trial and error, continually seeking to improve the quality of the simulations (see here for an overview of how they’re tested). As this is scientific research, it’s unknown, a priori, what will work, what’s computationally feasible, etc. Worse still, the complexity of the earth systems being studied means it’s often hard to know which processes in the model most need work, because the relationship between particular earth system processes and the overall behaviour of the climate system is exactly what the researchers are working to understand.

Which means that model development looks most like an agile software development process, where both the requirements and the set of techniques needed to implement them are unknown (and unknowable) up-front. So they build a little, and then explore how well it works. The closest they come to a formal specification is a set of hypotheses along the lines of:

“if I change <piece of code> in <routine>, I expect it to have <specific impact on model error> in <output variable> by <expected margin> because of <tentative theory about climatic processes and how they’re represented in the model>”

This hypothesis can then be tested by a formal experiment in which runs of the model with and without the altered code become two treatments, assessed against the observational data for some relevant period in the past. The expected improvement might be a reduction in the root mean squared error for some variable of interest, or just as importantly, an improvement in the variability (e.g. the seasonal or diurnal spread).
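To make this concrete, here’s a minimal sketch (in Python, with made-up array names and data) of what such an experiment boils down to: compute an error metric for the control and modified runs against the same observations, and check whether the improvement meets the expected margin. Real evaluation pipelines are far more elaborate, so treat this purely as an illustration of the logic.

```python
import numpy as np

def rmse(model, obs):
    """Root mean squared error between a model time series and observations."""
    return np.sqrt(np.mean((model - obs) ** 2))

# Hypothetical monthly means for one output variable, all for the same grid box
# and period: observations, the control run, and the run with the altered code.
rng = np.random.default_rng(0)
obs = rng.normal(2.8, 1.1, size=240)
control = obs + rng.normal(0.5, 0.6, size=240)    # stand-in for the unmodified model
modified = obs + rng.normal(0.2, 0.6, size=240)   # stand-in for the altered model

expected_margin = 0.1   # the improvement the hypothesis predicts

err_control = rmse(control, obs)
err_modified = rmse(modified, obs)

# Variability matters too: an RMSE improvement that flattens the seasonal or
# interannual spread is not really an improvement.
spread_control = control.std() / obs.std()
spread_modified = modified.std() / obs.std()

print(f"RMSE: control={err_control:.3f}, modified={err_modified:.3f}")
print(f"std ratio vs obs: control={spread_control:.2f}, modified={spread_modified:.2f}")
print("hypothesis supported:", err_control - err_modified >= expected_margin)
```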

The whole process looks a bit like this (although, see Jakob’s 2010 paper for a more sophisticated view of the process):

And of course, the central V&V technique here is full integration testing. The scientists build and run the full model to conduct the end-to-end tests that constitute the experiments.

So the closest thing they have to a specification would be a chart such as the following (courtesy of Tim Johns at the UK Met Office):

This chart shows how well the model is doing on 34 selected output variables (click the graph to see a bigger version, to get a sense of what the variables are). The scores for the previous model version have been normalized to 1.0, so you can quickly see whether the new model version did better or worse for each output variable – the previous model version is the line at 1.0, and the new model version is shown as the coloured dots above and below the line. The whiskers show the target skill level for each variable: if a dot falls within the whisker for a given variable, the model is considered to be within the variability range of the observational data for that variable. The colour coding then shows how well the current version did: green dots mean it’s within the target skill range, yellow means it’s outside the target range but did better than the previous model version, and red means it’s outside the target and did worse than the previous model version.
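To make the colour coding concrete, here’s a rough sketch of the classification logic, assuming (my assumption, not something stated with the chart) that each variable’s score is an error metric for the new version normalized by the previous version, and that the whisker defines a target range around 1.0:

```python
# Hypothetical normalized scores: an error metric for the new model version divided
# by the same metric for the previous version, so the previous version sits at 1.0
# and lower is better. The "whisker" is a target range for each variable.
normalized_score = {"precip": 0.96, "toa_sw_flux": 1.12, "cloud_fraction": 0.75}
target_range = {"precip": (0.90, 1.05), "toa_sw_flux": (0.95, 1.05), "cloud_fraction": (0.85, 1.10)}

def classify(score, lo, hi):
    if lo <= score <= hi:
        return "green"    # within the target skill range
    if score < 1.0:
        return "yellow"   # outside the target, but better than the previous version
    return "red"          # outside the target, and worse than the previous version

for var, score in normalized_score.items():
    lo, hi = target_range[var]
    print(f"{var}: score={score:.2f} -> {classify(score, lo, hi)}")
```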

Now, as we know, agile software practices aren’t really amenable to any kind of formal verification technique. If you don’t know what’s possible before you write the code, then you can’t write down a formal specification (the ‘target skill levels’ in the chart above don’t count – these are aspirational goals rather than specifications). And if you can’t write down a formal specification for the expected software behaviour, then you can’t apply formal reasoning techniques to determine if the specification was met.

So does this really mean, as Valdivino suggests, that we can’t apply any of our toolbox of formal verification methods? I think attempting to answer this would make a great research project. I have some ideas for places to look where such techniques might be applicable. For example:

  • One important built-in check in a climate model is ‘conservation of mass’. Some fluxes move mass between the different components of the model. Water is an obvious one – it’s evaporated from the oceans, to become part of the atmosphere, and is then passed to the land component as rain, thence to the rivers module, and finally back to the ocean. All the while, the total mass of water across all components must not change. Similar checks apply to salt, carbon (actually this does change due to emissions), and various trace elements. At present, such checks are built into the models as code assertions (the sketch after this list gives a flavour of the basic check). In some cases, flux corrections were necessary because of imperfections in the numerical routines or the geometry of the grid, although in most cases, the models have improved enough that most flux corrections have been removed. But I think you could automatically extract from the code an abstracted model capturing just the ways in which these quantities change, and then use a model checker to track down and reason about such problems.
  • A more general version of the previous idea: In some sense, a climate model is a giant state-machine, but the scientists don’t ever build abstracted versions of it – they only work at the code level. If we build more abstracted models of the major state changes in each component of the model, and then do compositional verification over a combination of these models, it *might* offer useful insights into how the model works and how to improve it. At the very least, it would be an interesting teaching tool for people who want to learn about how a climate model works.
  • Climate modellers generally don’t use unit testing. The challenge here is that they find it hard to write down correctness properties for individual code units. I’m not entirely clear how formal methods could help here, but it seems like someone with experience of patterns for temporal logic properties might be able to contribute. Clune and Rood have a forthcoming paper on this in November’s IEEE Software. I suspect this is one of the easiest places to get started for software people new to climate models.
  • There’s one other kind of verification test that is currently done by inspection, but might be amenable to some kind of formalization: the check that the code correctly implements a given mathematical formula. I don’t think this will be a high value tool, as the fortran code is close enough to the mathematics that simple inspection is already very effective. But occasionally a subtle bug slips through – for example, I came across an example where the modellers discovered they had used the wrong logarithm (loge in place of log10), although this was more due to lack of clarity in the original published paper, rather than a coding error.
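To give a flavour of the conservation check mentioned in the first bullet, here’s a minimal sketch; the component names and numbers are invented, and a real model tracks these budgets per gridpoint and per flux rather than as four scalars:

```python
# Toy global water budget, in arbitrary mass units, split across model components.
water = {"atmosphere": 12.7, "ocean": 1_335_000.0, "land": 23_400.0, "rivers": 2.1}

def step(state, evaporation=0.5, rainfall_to_land=0.3, runoff=0.2):
    """One toy timestep: move water between components without creating or destroying any."""
    new = dict(state)
    new["ocean"] -= evaporation
    new["atmosphere"] += evaporation - rainfall_to_land
    new["land"] += rainfall_to_land - runoff
    new["rivers"] += runoff
    # (a real model would also return river water to the ocean, rain onto the ocean, etc.)
    return new

total_before = sum(water.values())
water = step(water)
total_after = sum(water.values())

# The conservation assertion: total water mass must be unchanged, to within roundoff.
assert abs(total_after - total_before) < 1e-6 * total_before, "water mass not conserved!"
```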

Feel free to suggest more ideas in the comments!

Over the next few years, you’re likely to see a lot of graphs like this (click for a bigger version):

This one is from a forthcoming paper by Meehl et al, and was shown by Jerry Meehl in his talk at the Annecy workshop this week. It shows the results for just a single model, CCSM4, so it shouldn’t be taken as representative yet. The IPCC assessment will use graphs taken from ensembles of many models, as model ensembles have been shown to be consistently more reliable than any single model (the models tend to compensate for each other’s idiosyncrasies).

But as a first glimpse of the results going into IPCC AR5, I find this graph fascinating:

  • The extension of a higher emissions scenario out to three centuries shows much more dramatically how the choices we make in the next few decades can profoundly change the planet for centuries to come. For IPCC AR4, only the lower scenarios were run beyond 2100. Here, we see that a scenario that gives us 5 degrees of warming by the end of the century is likely to give us that much again (well over 9 degrees) over the next three centuries. In the past, people talked too much about temperature change at the end of this century, without considering that the warming is likely to continue well beyond that.
  • The explicit inclusion of two mitigation scenarios (RCP2.6 and RCP4.5) gives good reason for optimism about what can be achieved through a concerted global strategy to reduce emissions. It is still possible to keep warming below 2 degrees. But, as I discuss below, the optimism is bounded by some hard truths about how much adaptation will still be necessary – even in this wildly optimistic case, the temperature drops only slowly over the three centuries, and still ends up warmer than today, even at the year 2300.

As the approach to these model runs has changed so much since AR4, a few words of explanation might be needed.

First, note that the zero point on the temperature scale is the global average temperature for 1986-2005. That’s different from the baseline used in the previous IPCC assessment, so you have to be careful with comparisons. I’d much prefer they used a pre-industrial baseline – to get that, you have to add 1 (roughly!) to the numbers on the y-axis on this graph. I’ll do that throughout this discussion.

I introduced the RCPs (“Representative Concentration Pathways”) a little in my previous post. Remember, these RCPs were carefully selected from the work of the integrated assessment modelling community, who analyze interactions between socio-economic conditions, climate policy, and energy use. They are representative in the sense that they were selected to span the range of plausible emissions paths discussed in the literature, both with and without a coordinated global emissions policy. They are pathways, as they specify in detail how emissions of greenhouse gases and other pollutants would change, year by year, under each set of assumptions. The pathway matters a lot, because it is cumulative emissions (and the relative amounts of different types of emissions) that determine how much warming we get, rather than the actual emissions level in any given year. (See this graph for details on the emissions and concentrations in each RCP).

By the way, you can safely ignore the meaning of the numbers used to label the RCPs – they’re really just to remind the scientists which pathway is which. Briefly, the numbers represent the approximate anthropogenic forcing, in W/m², at the year 2100.

RCP8.5 and RCP6 represent two different pathways for a world with no explicit climate policy. RCP8.5 is at about the 90th percentile of the full set of non-mitigation scenarios described in the literature. So it’s not quite a worst case scenario, but emissions much higher than this are unlikely. One scenario that follows this path is a world in which renewable power supply grows only slowly (to about 20% of the global power mix by 2070) while most of a growing demand for energy is still met from fossil fuels. Emissions continue to grow strongly, and don’t peak before the end of the century. Incidentally, RCP8.5 ends up in the year 2100 with a similar atmospheric concentration to the old A1FI scenario in AR4, at around 900ppm CO2.
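As a back-of-the-envelope check on how a concentration like that relates to the forcing label, here’s a sketch using the standard simplified expression for CO2 forcing from Myhre et al (1998), ΔF ≈ 5.35 ln(C/C₀). This is my illustration rather than anything from the RCP documentation, and the RCP labels include all anthropogenic forcing agents, not just CO2:

```python
import math

def co2_forcing(concentration_ppm, baseline_ppm=280.0):
    """Simplified expression for CO2 radiative forcing (W/m^2), Myhre et al. 1998."""
    return 5.35 * math.log(concentration_ppm / baseline_ppm)

# RCP8.5 reaches roughly 900 ppm CO2 by 2100 (per the post); pre-industrial is ~280 ppm.
print(f"CO2-only forcing at ~900 ppm: {co2_forcing(900):.1f} W/m^2")
# Roughly 6.2 W/m^2 from CO2 alone; the rest of the "8.5" label comes from the
# scenario's other greenhouse gases and forcing agents.
```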

RCP6 (which is only shown to the year 2100 in this graph) is in the lower quartile of likely non-mitigation scenarios. Here, emissions peak by mid-century and then stabilize at a little below double current annual emissions. This is possible without an explicit climate policy because under some socio-economic conditions, the world still shifts (slowly) towards cleaner energy sources, presumably because the price of renewables continues to fall while oil starts to run out.

The two mitigation pathways, RCP2.6 and RCP4.5, bracket a range of likely scenarios for a concerted global carbon emissions policy. RCP2.6 was explicitly picked as one of the most optimistic possible pathways – note that it’s outside the 90% confidence interval for mitigation scenarios. The expert group were cautious about selecting it, and spent extra time testing its assumptions before including it. But it was picked because there was interest in whether, in the most optimistic case, it’s possible to stay below 2°C of warming.

Most importantly, note that one of the assumptions in RCP2.6 is that the world goes carbon-negative by around 2070. Wait, what? Yes, that’s right – the pathway depends on our ability to find a way to remove more carbon from the atmosphere than we produce, and to be able to do this consistently on a global scale by 2070. So, the green line in the graph above is certainly possible, but it’s well outside the set of emissions targets currently under discussion in any international negotiations.

RCP4.5 represents a more mainstream view of global attempts to negotiate emissions reductions. On this pathway, emissions peak before mid-century, and fall to well below today’s levels by the end of the century. Of course, this is not enough to stabilize atmospheric concentrations before the end of the century.

The committee that selected the RCPs warns against over-interpretation. They deliberately selected an even number of pathways, to avoid any implication that a “middle” one is the most likely. Each pathway is the result of a different set of assumptions about how the world will develop over the coming century, either with, or without climate policies. Also:

  • The RCPs should not be treated as forecasts, nor bounds on forecasts. No RCP represents a “best guess”. The high and low scenarios were picked as representative of the upper and lower ends of the range described in the literature.
  • The RCPs should not be treated as policy prescriptions. They were picked to help answer scientific questions, not to offer specific policy choices.
  • There isn’t a unique socio-economic scenario driving each RCP – there are multiple sets of conditions that might be consistent with a particular pathway. Identifying these sets of conditions in more detail is an open question to be studied over the next few years.
  • There’s no consistent logic to the four RCPs, as each was derived from a different assessment model. So you can’t, for example, adjust individual assumptions to get from one RCP to another.
  • The translation from emissions profiles (which the RCPs specify) into atmospheric concentrations and radiative forcings is uncertain, and hence is also an open research question. The intent is to study these uncertainties explicitly through the modeling process.

So, we have a set of emissions pathways chosen because they represent “interesting” points in the space of likely global socio-economic scenarios covered in the literature. These are the starting point for multiple lines of research by different research communities. The climate modeling community will use them as inputs to climate simulations, to explore temperature response, regional variations, precipitation, extreme weather, glaciers, sea ice, and so on. The impacts and adaptation community will use them to explore the different effects on human life and infrastructure, and how much adaptation will be needed under each scenario. The mitigation community will use them to study the impacts of possible policy choices, and will continue to investigate the socio-economic assumptions underlying these pathways, to give us a clearer account of how each might come about, and to produce an updated set of scenarios for future assessments.

Okay, back to the graph. This represents one of the first available sets of temperature outputs from a Global Climate Model for the four RCPs. Over the next two years, other modeling groups will produce data from their own runs of these RCPs, to give us a more robust set of multi-model ensemble runs.

So the results in this graph are very preliminary, but if the results from other groups are consistent with them, here’s what I think it means. The upper path, RCP8.5, offers a glimpse of what happens if economic development and fossil fuel use continue to grow the way they have over the last few decades. It’s hard to imagine much of the human race surviving the next few centuries under this scenario. The lowest path, RCP2.6, keeps us below the symbolically important threshold of 2 degrees of warming, but then doesn’t bring us down much from that throughout the coming centuries. And that’s a pretty stark result: even if we do find a way to go carbon-negative by the latter part of this century, the following two centuries still end up hotter than it is now. All the while that we’re re-inventing the entire world’s industrial basis to make it carbon-negative, we also have to be adapting to a global climate that is warmer than any experienced since the human species evolved.

[By the way: the 2 degree threshold is probably more symbolic than it is scientific, although there’s some evidence that this is the point above which many scientists believe positive feedbacks would start to kick in. For a history of the 2 degree limit, see Randalls 2010].

I’m on my way back from a workshop on Computing in the Atmospheric Sciences, in Annecy, France. The opening keynote, by Gerald Meehl of NCAR, gave us a fascinating overview of the CMIP5 model experiments that will form a key part of the upcoming IPCC Fifth Assessment Report. I’ve been meaning to write about the CMIP5 experiments for ages, as the modelling groups were all busy getting their runs started when I visited them last year. As Jerry’s overview was excellent, this gives me the impetus to write up a blog post. The rest of this post is a summary of Jerry’s talk.

Jerry described CMIP5 as “the most ambitious and computer-intensive inter-comparison project ever attempted”, and having seen many of the model labs working hard to get the model runs started last summer, I think that’s an apt description. More than 20 modelling groups around the world are expected to participate, supplying a total estimated dataset of more than 2 petabytes.

It’s interesting to compare CMIP5 to CMIP3, the model intercomparison project for the last IPCC assessment. CMIP3 began in 2003, and was, at that time, itself an unprecedented set of coordinated climate modelling experiments. It involved 16 groups from 11 countries, with 23 models (some groups contributed more than one model). The resulting CMIP3 dataset, hosted at PCMDI, is 31 terabytes; it is openly accessible, has been accessed by more than 1200 scientists, has generated hundreds of papers, and use of the data is still ongoing. The ‘iconic’ figures for future projections of climate change in IPCC AR4 are derived from this dataset (see, for example, Figure 10.4, which I’ve previously critiqued).

Most of the CMIP3 work was based on the IPCC SRES “what if” scenarios, which offer different views on future economic development and fossil fuel emissions, but none of which include a serious climate mitigation policy.

By 2006, during the planning of the next IPCC assessment, it was already clear that a profound paradigm shift was in progress. The idea of climate services had emerged, with a growing demand from industry, government and other groups for detailed regional information about the impacts of climate change, and, of course, a growing need to explicitly consider mitigation and adaptation scenarios. And the questions are connected: with different mitigation choices, what are the remaining regional climate effects that adaptation will have to deal with?

So, CMIP5 represents a new paradigm for climate change prediction:

  1. Decadal prediction, with high resolution Atmosphere-Ocean General Circulation Models (AOGCMs), with, say, 50km grids, initialized to explore near-term climate change over the next three decades.
  2. First generation Earth System Models, which include a coupled carbon cycle and ice sheet models, typically run at intermediate resolution (100-150km grids) to study longer term feedbacks past mid-century, using a new set of scenarios that include both mitigation and non-mitigation emissions profiles.
  3. Stronger links between communities – e.g. WCRP, IGBP, and the weather prediction community, but most importantly, stronger interaction between the three working groups of the IPCC: WG1 (which looks at the physical science basis), WG2 (which looks at impacts, adaptation and vulnerability), and WG3 (integrated assessment modelling and scenario development). The lack of interaction between WG1 and the others has been a problem in the past, especially for WG2 and WG3, as they’re the ones trying to understand the impacts of different policy choices.

The model experiments for CMIP5 are not dictated by the IPCC, but selected by the climate science community itself. A large set of experiments has been identified, intended to provide a 5-year framework (2008-2013) for climate change modelling. As not all modelling groups will be able to run all the experiments, they have been prioritized into three clusters: a core set that everyone will run, and two tiers of optional experiments. Experiments that are completed by early 2012 will be analyzed in the next IPCC assessment (due for publication in 2013).

The overall design for the set of experiments is broken out into two clusters (near-term, i.e. decadal runs; and long-term, i.e. century and longer), designed for different types of model (although for some centres, this really means different configurations of the same model code, if their models can be run at very different resolutions). In both cases, the core experiment set includes runs of both past and future climate. The past runs are used as hindcasts to assess model skill. Here are the decadal experiments, showing the core set in the middle, and tier 1 around the edge (there’s no tier 2 for these, as there aren’t so many decadal experiments):

These experiments include some very computationally-demanding runs at very high resolution, and include the first generation of global cloud-resolving models. For example, the prescribed SST time-slices experiments include two periods (1979-2008 and 2026-2035) where prescribed sea-surface temperatures taken from lower resolution, fully-coupled model runs will be used as a basis for very high resolution atmosphere-ocean runs. The intent of these experiments is to explore the local/regional effects of climate change, including on hurricanes and extreme weather events.

Here’s the set of experiments for the longer-term cluster, marked up to indicate three different uses: model evaluation (where the runs can be compared to observations to identify weaknesses in the models and explore reasons for divergences between the models); climate projections (to show what the models do on four representative scenarios, at least to the year 2100, and, for some runs, out to the year 2300); and understanding (including thought experiments, such as an aqua planet with no land mass, and abrupt changes in GHG concentrations):

These experiments cover a much wider range of scientific questions than earlier IPCC assessments (which is why there are so many more experiments this time round). Here’s another way of grouping the long-term runs, showing the collaborations with the many different research communities who are participating:

With these experiments, some crucial science questions will be addressed:

  • what are the time-evolving changes in regional climate change and extremes over the next few decades?
  • what are the size and nature of the carbon cycle and other feedbacks in the climate system, and what will be the resulting magnitude of change for different mitigation scenarios?

The long-term experiments are based on a new set of scenarios that represent a very different approach than was used in the last IPCC assessment. The new scenarios are called Representative Concentration Pathways (RCPs), although as Jerry points out, the name is a little confusing. I’ll write more about the RCPs in my next post, but here’s a brief summary…

The RCPs were selected after a long series of discussions with the integrated assessment modelling community. A large set of possible scenarios was whittled down to just four. For the convenience of the climate modelling community, they’re labelled with the expected anomaly in radiative forcing (in W/m²) by the year 2100, to give us the set {RCP2.6, RCP4.5, RCP6, RCP8.5}. For comparison, the current total radiative forcing due to anthropogenic greenhouse gases is about 2W/m². But really, the numbers are just to help remember which RCP is which – the term pathway is the important part. Each of the four was chosen as an illustrative example of how greenhouse gas concentrations might change over the rest of the century, under different circumstances. They were generated from integrated assessment models that provide detailed emissions profiles for a wide range of different greenhouse gases and other variables (e.g. aerosols). Here’s what the pathways look like (the darker coloured lines are the chosen representative pathways, the thinner lines show others that were considered, and each cluster is labelled with the model that generated it; click for bigger):

Each RCP was produced by a different model, in part because no single model was capable of providing the detail needed for all four different scenarios, although this means that the RCPs cannot be directly compared, because they include different assumptions. The graph above shows the range of mitigation scenarios considered by the blue shading, and the range of non-mitigation scenarios with gray shading (the two areas overlap a little).

Here’s a rundown on the four scenarios:

  • RCP2.6 represents the lower end of possible mitigation strategies, where emissions peak in the next decade or so, and then decline rapidly. This scenario is only possible if the world has gone carbon-negative by the 2070s, presumably by developing wide-scale carbon-capture and storage (CCS) technologies. This might be possible with an energy mix by 2070 of at least 35% renewables, 45% fossil fuels with full CCS (and 20% without), along with use of biomass, tree planting, and perhaps some other air-capture technologies. [My interpretation: this is the most optimistic scenario, in which we manage to do everything short of geo-engineering, and we get started immediately].
  • RCP4.5 represents a less aggressive emissions mitigation policy, where emissions peak before mid-century, and then fall, but not to zero. Under this scenario, concentrations stabilize by the end of the century, but won’t start falling, so the extra radiative forcing at the year 2100 is still more than double what it is today, at 4.5W/m². [My interpretation: this is the compromise future in which most countries work hard to reduce emissions, with a fair degree of success, but where CCS turns out not to be viable for massive deployment].
  • RCP6 represents the more optimistic of the non-mitigation futures. [My interpretation: this scenario is a world without any coordinated climate policy, but one where there is still significant uptake of renewable power – just not enough to offset fossil-fuel driven growth among developing nations].
  • RCP8.5 represents the more pessimistic of the non-mitigation futures. For example, by 2070, we would still be getting about 80% of the world’s energy needs from fossil fuels, without CCS, while the remaining 20% come from renewables and/or nuclear. [My interpretation: this is the closest to the “drill, baby, drill” scenario beloved of certain right-wing American politicians].

Jerry showed some early model results for these scenarios from the NCAR model, CCSM4, but I’ll save that for my next post. To summarize:

  • 24 modelling groups are expected to participate in CMIP5, and about 10 of these groups have fully coupled earth system models.
  • Data is currently in from 10 groups, covering 14 models. Here’s a live summary, currently showing 172TB, which is already more than 5 times all the model data for CMIP3. Jerry put the total expected data at 1-2 petabytes, although in a talk later in the afternoon, Gary Strand from NCAR pegged it at 2.2PB. [Given how much everyone seems to have underestimated the data volumes from the CMIP5 experiments, I wouldn’t be surprised if it’s even bigger. Sitting next to me during Jerry’s talk, Marie-Alice Foujols from IPSL came up with an estimate of 2PB just for all the data collected from the runs done at IPSL, of which she thought something like 20% would be submitted to the CMIP5 archive].
  • The model outputs will be accessed via the Earth System Grid, and will include much more extensive documentation than previously. The Metafor project has built a controlled vocabulary for describing models and experiments, and the Curator project has developed web-based tools for ingesting this metadata.
  • There’s a BAMS paper coming out soon describing CMIP5.
  • There will be a CMIP5 results session at the WCRP Open science conference next month, another at the AGU meeting in December, and another at a workshop in Hawaii in March.

For the Computing in Atmospheric Sciences workshop next month, I’ll be giving a talk entitled “On the relationship between earth system models and the labs that build them”. Here’s the abstract:

In this talk I will discuss a number of observations from a comparative study of four major climate modeling centres:
– the UK Met Office Hadley Centre (UKMO), in Exeter, UK
- the National Center for Atmospheric Research (NCAR) in Boulder, Colorado,
– the Max-Planck Institute for Meteorology (MPI-M) in Hamburg, Germany
- the Institut Pierre Simon Laplace (IPSL) in Paris, France.
The study focussed on the organizational structures and working practices at each centre with respect to earth system model development, and how these affect the history and current qualities of their models. While the centres share a number of similarities, including a growing role for software specialists and greater use of open source tools for managing code and the testing process, there are marked differences in how the different centres are funded, in their organizational structure and in how they allocate resources. These differences are reflected in the program code in a number of ways, including the nature of the coupling between model components, the portability of the code, and (potentially) the quality of the program code.

While all these modelling centres continually seek to refine their software development practices and the software quality of their models, they all struggle to manage the growth (in terms of size and complexity) of the models. Our study suggests that improvements to the software engineering practices at the centres have to take account of the differing organizational constraints at each centre. Hence, there is unlikely to be a single set of best practices that works everywhere. Indeed, improvements in modelling practices usually come from local, grass-roots initiatives, in which new tools and techniques are adapted to suit the context at a particular centre. We suggest therefore that there is a need for a stronger shared culture of describing current model development practices and sharing lessons learnt, to facilitate local adoption and adaptation.

Previously I posted on the first two sessions of the workshop on “A National Strategy for Advancing Climate Modeling” that was held at NCAR at the end of last month:

  1. What should go into earth system models;
  2. Challenges with hardware, software and human resources;

    The third session focussed on the relationship between models and data.

    Kevin Trenberth kicked off with a talk on Observing Systems. Unfortunately, I missed part of his talk, but I’ll attempt a summary anyway – apologies if it’s incomplete. His main points were that we don’t suffer from a lack of observational data, but from problems with quality, consistency, and characterization of errors. Continuity is a major problem, because much of the observational system was designed for weather forecasting, where consistency of measurement over years and decades isn’t required. Hence, there’s a need for reprocessing and reanalysis of past data, to improve calibration and assess accuracy, and we need benchmarks to measure the effectiveness of reprocessing tools.

    Kevin points out that it’s important to understand that models are used for much more than prediction. They are used:

    • for analysis of observational data, for example to produce global gridded data from the raw observations;
    • to diagnose climate & improve understanding of climate processes (and thence to improve the models);
    • for attribution studies, through experiments to determine climate forcing;
    • for projections and prediction of future climate change;
    • for downscaling to provide regional information about climate impacts;

    Confronting the models with observations is a core activity in earth system modelling. Obviously, it is essential for model evaluation. But observational data is also used to tune the models, for example to remove known systematic biases. Several people at the workshop pointed out that the community needs to do a better job of keeping the data used to tune the models distinct from the data used to evaluate them. For tuning, a number of fields are used – typically top-of-the-atmosphere data such as net shortwave and longwave radiation flux, cloud and clear sky forcing, and cloud fractions. Also, precipitation and surface wind stress, global mean surface temperature, and the period and amplitude of ENSO. Kevin suggests we need to do a better job of collecting information about model tuning from different modelling groups, and ensure model evaluations don’t use the same fields.

    For model evaluation, a number of integrated score metrics have been proposed to summarize correlation, root-mean-squared (rms) error and variance ratios – see, for example, Taylor 2001; Boer and Lambert 2001; Murphy et al 2004; Reichler & Kim 2008.
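To give a flavour of the ingredients that go into such metrics, here’s a small sketch (my own, loosely following the quantities used in Taylor diagrams) that computes the pattern correlation, the ratio of standard deviations, and the centred RMS difference between a model field and a reference field:

```python
import numpy as np

def taylor_stats(model, ref):
    """Pattern correlation, std-dev ratio, and centred RMS difference (cf. Taylor 2001)."""
    m = model - model.mean()
    r = ref - ref.mean()
    corr = np.sum(m * r) / (np.sqrt(np.sum(m ** 2)) * np.sqrt(np.sum(r ** 2)))
    std_ratio = model.std() / ref.std()
    centred_rms = np.sqrt(np.mean((m - r) ** 2))
    return corr, std_ratio, centred_rms

# Stand-in fields: a "reference" (observations) and a "model" that is a noisy,
# slightly damped version of it.
rng = np.random.default_rng(0)
ref = rng.normal(size=(90, 180))                    # e.g. a gridded annual-mean field
model = 0.9 * ref + 0.3 * rng.normal(size=ref.shape)

corr, std_ratio, centred_rms = taylor_stats(model, ref)
print(f"correlation={corr:.2f}, std ratio={std_ratio:.2f}, centred RMS diff={centred_rms:.2f}")

# The three quantities are tied together by the law-of-cosines identity behind the
# Taylor diagram: E'^2 = sigma_m^2 + sigma_r^2 - 2*sigma_m*sigma_r*corr
check = model.std() ** 2 + ref.std() ** 2 - 2 * model.std() * ref.std() * corr
print("identity holds:", np.isclose(centred_rms ** 2, check))
```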

    But model evaluation and tuning aren’t the only ways in which models and data are brought together. Just as important is re-analysis, where multiple observational datasets are processed through a model to provide more comprehensive (model-like) data products. For this, data assimilation is needed, whereby observational data fields are used to nudge the model at each timestep as it runs.
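As a cartoon of what nudging means here, the sketch below (a toy of my own, not any real assimilation scheme) adds a Newtonian relaxation term that pulls a drifting model state back towards the observations at each timestep, with a relaxation timescale controlling how strong the nudge is:

```python
import numpy as np

dt = 1.0        # toy timestep
tau = 10.0      # relaxation timescale: smaller means a stronger nudge
n_steps = 100

rng = np.random.default_rng(0)
truth = np.sin(0.1 * np.arange(n_steps))        # the "real" evolution
obs = truth + rng.normal(0, 0.05, n_steps)      # noisy observations of it

x_free = 0.5      # free-running toy model state (starts with an error)
x_nudged = 0.5    # same model, but nudged towards the observations

for t in range(1, n_steps):
    tendency = 0.1 * np.cos(0.1 * t) + 0.02   # toy dynamics with a small spurious drift
    x_free += dt * tendency
    # Nudged run: the same dynamics, plus relaxation towards the current observation.
    x_nudged += dt * (tendency + (obs[t] - x_nudged) / tau)

print(f"final error, free-running model: {abs(x_free - truth[-1]):.2f}")
print(f"final error, nudged model:       {abs(x_nudged - truth[-1]):.2f}")
```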

    Kevin also talked about forward modelling, a technique in which the model is used to reproduce the signal that a particular instrument would record, given certain climate conditions. Forward modelling is used for comparing models with ground observations and satellite data. In much of this work, there is an implicit assumption that the satellite data are correct, but in practice, all satellite data have biases, and need re-processing. For this work, the models need good emulation of instrument properties and thresholds. For examples, see Chepfer, Bony et al 2010; Stubenrauch & Kinne 2009.

    He also talked about some of the problems with existing data and models:

    • nearly all satellite data sets contain large spurious variability associated with changing instruments and satellites, orbital decay/drift, calibration, and changing methods of analysis.
    • simulation of the hydrological cycle is poor, especially in the intertropical convergence zone (ITCZ). Tropical transients are too weak, runoff and recycling is not correct, and the diurnal cycle is poor.
    • there are large differences between datasets for low cloud (see Marchand et al 2010)
    • clouds are not well defined. Partly this is a problem of sensitivity of instruments, compounded by the difficulty of distinguishing between clouds and aerosols.
    • Most models have too much incoming solar radiation in the southern oceans, caused by too few clouds. This makes for warmer oceans and diminished poleward transport, which messes up storm tracking and analysis of ocean transports.

    What is needed to support modelling over the next twenty years? Kevin made the following recommendations:

    • Support observations and their development into climate datasets.
    • Support reprocessing and reanalysis.
    • Unify NWP and climate models to exploit short term predictions and confront the models with data.
    • Develop more forward modelling and observation simulators, but with more observational input.
    • Targeted process studies such as GEWEX and analysis of climate extremes, for model evaluation.
    • Target problem areas such as monsoons and tropical precipitation.
    • Carry out a survey of fields used to tune models.
    • Design evaluation and model merit scoring based on fields other than those used in tuning.
    • Promote assessments of observational datasets so modellers know which to use (and not use).
    • Support existing projects, including GSICS, SCOPE-CM, CLARREO, GRUAN,

    Overall, there’s a need for a climate observing system. Process studies should not just be left to the observationists – we need the modellers to get involved.

    The second talk was by Ben Kirtman, on “Predictability, Credibility, and Uncertainty Quantification”. He began by pointing out that there is ongoing debate over what predictability means. Some treat it as an inherent property of the climate system, while others think of it as a model property. Ben distinguished two kinds of predictability:

    • Sensitivity of the climate system to initial conditions (predictability of the first kind);
    • Predictability of the boundary forcing (predictability of the second kind).

    Predictability is enhanced by ensuring specific processes are included. For example, you need to include the MJO if you want to predict ENSO. But model-based estimates of predictability are model dependent. If we want to do a better job of assessing predictability, we have to characterize model uncertainty, and we don’t know how to do this today.

    Good progress has been made on quantifying initial condition uncertainty. We have plenty of good ideas for how to probe this (stochastic optimals, bred vectors, etc.) using ensembles with perturbed initial conditions. But from our understanding of chaos theory (e.g. see the Lorenz attractor), predictability depends on which part of the regime you’re in, so we need to assess the predictability for each particular forecast.
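To illustrate the point about regime-dependent predictability, here’s a toy sketch (mine, using the classic Lorenz-63 system rather than a climate model) that launches a small perturbed-initial-condition ensemble from two different starting states and compares the ensemble spread after the same lead time:

```python
import numpy as np

def lorenz_step(state, dt=0.01, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """One forward-Euler step of the Lorenz-63 system."""
    x, y, z = state
    dx = sigma * (y - x)
    dy = x * (rho - z) - y
    dz = x * y - beta * z
    return state + dt * np.array([dx, dy, dz])

def ensemble_spread(initial_state, n_members=20, n_steps=500, perturbation=1e-3):
    """Average spread of a perturbed-initial-condition ensemble after n_steps."""
    rng = np.random.default_rng(42)
    members = [np.array(initial_state, dtype=float) + perturbation * rng.normal(size=3)
               for _ in range(n_members)]
    for _ in range(n_steps):
        members = [lorenz_step(m) for m in members]
    return np.array(members).std(axis=0).mean()

# Two different starting points: the growth of the ensemble spread depends on where
# in the regime you start, even though the perturbations and lead time are identical.
for start in ([1.0, 1.0, 1.0], [-5.0, -8.0, 22.0]):
    print(f"start={start}: ensemble spread after 500 steps = {ensemble_spread(start):.2f}")
```

The two numbers differ because local error growth rates vary across the attractor, which is the point about model-based estimates of predictability: the answer depends on where (and with which model) you ask the question.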

    Uncertainty in external forcing includes uncertainties in both the natural and anthropogenic forcing; however, this is becoming less of an issue in modelling, as these forcings are better understood. Therefore, the biggest challenge is in quantifying uncertainties in model formulation. These arise because of the discrete representation of the climate system, the use of parameterization of subgrid processes, and because of missing processes. Current approaches can be characterized as:

    • a posteriori techniques, such as the multi-model ensembles of opportunity used in IPCC assessments, and perturbed parameters/parameterizations, as used in climateprediction.net.
    • a priori techniques, where we incorporate uncertainty as the model evolves. The idea is that the uncertainty lies in subscale processes and missing physics, which can be modelled non-locally and stochastically – e.g. stochastic backscatter, or interactive ensembles to incorporate uncertainty in the coupling.

    The term credibility is even less well defined. Ben asked his students what they understood by the term, and they came up with a simple answer: credibility is the extent to which you use the best available science [which corresponds roughly to my suggestion of what model validation ought to mean]. In the literature, there are a number of other ways of expressing credibility:

    • In terms of model bias. For example, Lenny Smith offers a temporal (or spatial) credibility ratio, calculated as the ratio of the smallest timestep in the model to the smallest duration over which a variable has to be averaged before it compares favourably with observations. This expresses how much averaging over the temporal (or spatial) scale you have to do to make the model look like the data (see the sketch after this list).
    • In terms of whether the ensembles bracket the observations. But the problem here is that you can always pump up an ensemble to do this, and it doesn’t really tell you about probabilistic forecast skill.
    • In terms of model skill. In numerical weather prediction, it’s usual to measure forecast quality using some specific skill metrics.
    • In terms of process fidelity – how well the processes represented in the model capture what is known about those processes in reality. This is a reductionist approach, and depends on the extent to which specific processes can be isolated (both in the model, and in the world).
    • In terms of faith – for example, the modellers’ subjective assessment of how good their model is.
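Here’s a rough sketch of how the temporal credibility ratio in the first bullet might be computed; this is my reading of the definition, with an arbitrary RMSE threshold standing in for “compares favourably with observations”:

```python
import numpy as np

dt = 1.0                 # model timestep, in hours (say)
n = 24 * 365
t = np.arange(n)
rng = np.random.default_rng(0)

# Stand-in hourly series: observations with a diurnal cycle, and a model that gets
# the daily mean roughly right but misrepresents the sub-daily variability.
obs = 10 + 5 * np.sin(2 * np.pi * t / 24) + rng.normal(0, 1, n)
model = 10 + 2 * np.sin(2 * np.pi * t / 24 + 1.0) + rng.normal(0, 1, n)

def block_average(x, window):
    m = (len(x) // window) * window
    return x[:m].reshape(-1, window).mean(axis=1)

threshold = 1.0          # arbitrary "compares favourably" criterion: RMSE below this

for window in [1, 3, 6, 12, 24, 48, 168]:    # candidate averaging windows, in timesteps
    rmse = np.sqrt(np.mean((block_average(model, window) - block_average(obs, window)) ** 2))
    if rmse < threshold:
        print(f"smallest acceptable averaging window: {window} timesteps")
        print(f"temporal credibility ratio = dt / averaging duration = {1.0 / window:.3f}")
        break
```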

    In the literature, credibility is usually used in a qualitative way to talk about model bias – in effect, model bias is treated as roughly the inverse of credibility. However, in these terms, the models currently have a major credibility gap. For example, Ben showed the annual mean rainfall from a long simulation of CESM1, showing bias with respect to GPCP observations. These plots show the model struggling to capture the spatial distribution of sea surface temperature (SST), especially in equatorial regions.

    Every climate model has a problem with equatorial sea surface temperatures (SST). A recent paper, Anagnostopoulos et al 2009, makes a big deal of this, and is clearly very hostile to climate modelling. They look at regional biases in temperature and precipitation, where the models are clearly not bracketing observations. I googled the Anagnostopoulos paper while Ben was talking – the first few pages of google hits are dominated by denialist websites proclaiming this as a major new study demonstrating the models are poor. It’s amusing that this is treated as news, given that such weaknesses in the models are well known within the modelling community, and discussed in the IPCC report. Meanwhile, the hydrologists at the workshop tell me that it appeared in a third-rate journal, so none of them would pay any attention to this paper.

    Ben argues that these weaknesses need to be removed to increase model credibility. This argument seems a little weak to me. While improving model skill and removing biases are important goals for this community, they don’t necessarily help with model credibility in terms of using the best science (because replacing an empirically derived parameterization with one that’s more theoretically justified will often reduce model skill). More importantly, those outside the modeling community will have their own definitions of credibility, and they’re unlikely to correspond to those used within the community. Some attention to the ways in which other stakeholders understand model credibility would be useful and interesting.

    In summary, Ben identified a number of important tensions for climate modeling. For example, there are tensions between:

    • the desire to measure prediction skill vs. the desire to explore the limits of predictability;
    • the desire to quantify uncertainty, vs. the push for more resolution and complexity in the models;
    • a priori vs. a posteriori methods of assessing model uncertainty.
    • operational vs. research activities (Many modellers believe the IPCC effort is getting a little out of control – it’s a good exercise, but too demanding on resources);
    • weather vs climate modelling;
    • model diversity vs critical mass;

    Ben urged the community to develop a baseline for climate modelling, capturing best practices for uncertainty estimation.

    During a break in the workshop last week, Cecilia Bitz and I managed to compare notes on our undergrad courses. We’ve both been exploring how to teach ideas about climate modeling to students who are not majoring in earth sciences. Cecilia’s course on Weather and Climate Prediction is a little more advanced than mine, as hers had a few math and physics pre-requisites, while mine was open to any first year students. For example, Cecilia managed to get the students to run CAM, and experiment with altering the earth’s orbit. It’s an interesting exercise, as it should lead to plenty of insights into connections between different processes in the earth system. One of the challenges is that earth system models aren’t necessarily geared up for this kind of tinkering, so you need good expertise in the model being used to understand which kinds of experiments are likely to make sense. But even so, I’m keen to explore this further, as I think the ability to tinker with such models could be an important tool for promoting a better understanding of how the climate system works, even for younger kids.

    Part of what’s missing is elegant user interfaces. EdGCM is better, but it’s still very awkward to use. What I really want is something that’s as intuitive as Angry Birds. Okay, so I’m going to have to compromise somewhere – nonlinear dynamics are a bit more complicated than the flight trajectories of slingshotted birds.

    But that’s not all – someone needs to figure out what kinds of experiments students (and school kids) might want to perform, and provide the appropriate control widgets, so they don’t have to mess around with code and scripts. Rich Loft tells me there’s a project in the works to do something like this with CESM – I’m looking forward to that! In the meantime, Rich suggested two examples of simple simulations of dynamical systems that get closer to what I’m looking for:

    • The Shodor Disease model that lets you explore the dynamics of epidemics, with people in separate rooms, where you can adjust how much they can move between rooms, how the disease works, and whether immunization is available. Counter-intuitive lesson: crank up the mortality rate to 100% and (almost) everyone survives!
    • The Shodor Rabbits and Wolves simulation, which lets you explore population dynamics of predators and prey. Counter-intuitive lesson: double the lifespan of the rabbits and they all die out pretty quickly!

    In the last post, I talked about the opening session at the workshop on “A National Strategy for Advancing Climate Modeling”, which focussed on the big picture questions. In the second session, we focussed on the hardware, software and human resources challenges.

    To kick off, Jeremy Kepner from MIT called in via telecon to talk about software issues, from his perspective working on Matlab tools to support computational modeling. He made the point that it’s getting hard to make scientific code work on new architectures, because it’s increasingly hard to find anyone who wants to do the programming. There’s a growing gap between the software stacks used in current web and mobile apps, gaming, and so on, and that used in scientific software. Programmers are used to having new development environments and tools, for example for developing games for Facebook, and regard scientific software development tools as archaic. This means it’s hard to recruit talent from the software world.

    Jeremy quipped that software is an evil thing – the trick is to get people to write as little of it as possible (and he points out that programmers make mistakes at the rate of one per 1,000 lines of code). Hence, we need higher levels of abstraction, with code generated automatically from higher level descriptions – and so an important question is whether it’s time to abandon Fortran. He also points out that programmers believe they spend most of their time coding, but in fact, coding is a relatively small part of what they do. At least half of their time is spent testing, which means that effort to speed up the testing process gives you the most bang for the buck.

    Ricky Rood, Jack Fellows, and Chet Koblinsky then ran a panel on human resources issues. Ricky pointed out that if we are to identify shortages in human resources, we have to be clear about whether we mean for modeling vs. climate science vs. impacts studies vs. users of climate information, and so on. The argument can be made that in terms of absolute numbers there are enough people in the field, but the problems are in getting an appropriate expertise mix / balance, having people at the interfaces between different communities of expertise, a lack of computational people (and not enough emphasis on training our own), and management of fragmented resources.

    Chet pointed out that there’s been a substantial rise in the number of job postings using the term “climate modelling” over the last decade. But there’s still a widespread perception that there aren’t enough jobs (i.e. more grad students being trained than we have positions for). There are some countervailing voices – for example Pielke argues that universities will always churn out more than enough scientists to support their mission, and there’s a recent BAMS article that explored the question “are we training too many atmospheric scientists?“. The shortage isn’t in the number of people being trained, but in the skills mix.

    We covered a lot of ground in the discussions. I’ll cover just some of the highlights here.

    Several people observed that climate model software development has diverged from mainstream computing. Twenty years ago, academia was the centre of the computing world. Now most computing is in the commercial world, and computational scientists have much less leverage than we used to. This means that some of the tools we rely on might no longer be sustainable, for example Fortran compilers (and auto-generators?) – fewer users care about these, and so there is less support for transitioning them to new architectures. Climate modeling is a 10+ year endeavour, and we need a long-term basis to maintain continuity.

    Much of the discussion focussed on anticipated disruptive transitions in hardware architectures. Whereas in the past, modellers have relied on faster and faster processors to deliver new computing capacity, this is coming to an end. Advances in clock speed have tailed off, and now it’s massive parallelization that delivers the additional computing power. Unfortunately, this means the brute force approach of scaling up current GCM numerical methods on a uniform grid is a dead-end.
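    A quick back-of-the-envelope calculation shows why brute force won’t work (the numbers here are mine and purely illustrative): à la Amdahl’s law, even a small fraction of the code that doesn’t parallelize puts a hard cap on the speedup you can get from adding cores.

```python
# Back-of-envelope Amdahl's law: speedup from N cores when a fraction of
# the work doesn't parallelize. Numbers are illustrative, not measured.

def amdahl_speedup(cores, serial_fraction):
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)

for serial in (0.01, 0.05, 0.10):          # 1%, 5%, 10% of the work is serial
    print(f"serial fraction {serial:.0%}:")
    for cores in (100, 10_000, 1_000_000):
        s = amdahl_speedup(cores, serial)
        print(f"  {cores:>9,} cores -> speedup ~{s:,.0f}x")
```

    With just 1% of the work stuck in serial code, a million cores buys you a speedup of barely 100x.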

    As Bryan Lawrence pointed out, there’s a paradigm change here: computers no longer compute, they produce data. We’re entering an era where CPU time is essentially free, and it’s data wrangling that forms the bottleneck. Massive parallelization of climate models is hard because of the volume of data that must be passed around the system. We can anticipate 1-100 exabyte scale datasets (i.e. this is the size not of the archive, but of the data from a single run of an ensemble). It’s unlikely that any institution will have the ability to evolve their existing codes into this reality.
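    To get a feel for where those numbers come from, here’s some rough arithmetic (all the figures below are my own assumptions, for illustration only – not numbers quoted at the workshop):

```python
# Rough data-volume arithmetic for high-resolution model output.
# All the numbers below are assumptions for illustration only.

grid_points   = 3600 * 1800 * 100      # ~0.1 degree grid, 100 vertical levels
variables     = 50                     # fields written to the archive
bytes_per_val = 4                      # single precision
writes_per_yr = 365 * 8                # 3-hourly output
years         = 100
ensemble      = 50

per_snapshot = grid_points * variables * bytes_per_val
per_run      = per_snapshot * writes_per_yr * years
per_ensemble = per_run * ensemble

for label, size in [("one snapshot", per_snapshot),
                    ("one century-long run", per_run),
                    ("a 50-member ensemble", per_ensemble)]:
    print(f"{label:22s} ~ {size / 1e15:10.3f} PB")
```

    On these made-up assumptions, a single century-long run is already tens of petabytes, and a 50-member ensemble pushes into exabyte territory.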

    The massive parallelization and data volumes also bring another problem. In the past, climate modellers have regarded bit-level reproducibility of climate runs as crucial, partly because reproducing a run exactly is considered good scientific practice, and partly because it allows many kinds of model test to be automated. The problem is that at the scales we’re talking about, exact bit reproducibility is getting hard to maintain. When we scale up to millions of processors and terabyte data sets, bit-level failures are frequent enough that exact reproducibility can no longer be guaranteed – if a single bit is corrupted during a model run, it may not matter for the climatology of the run, but it will mean exact reproducibility is impossible. Add to this the fact that future CPUs are likely to be less deterministic, and, as Tim Palmer argued at the AGU meeting, we’ll be forced to change our codes fundamentally – so maybe we should take the opportunity to make the models probabilistic.
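    To illustrate the distinction (this is my own sketch, not any modelling centre’s actual test harness): a bitwise comparison of two runs fails as soon as a single value differs, whereas a statistical check asks whether the two runs are climatologically indistinguishable within some tolerance.

```python
# Sketch of two ways to compare model runs: exact bitwise reproducibility
# versus a tolerance on climatological statistics. Illustrative only.
import struct

def bitwise_identical(run_a, run_b):
    """Fails if even one value differs in its binary representation."""
    return len(run_a) == len(run_b) and all(
        struct.pack("<d", a) == struct.pack("<d", b)
        for a, b in zip(run_a, run_b))

def climatologically_close(run_a, run_b, tolerance=0.05):
    """Compare time-mean statistics rather than individual values."""
    mean_a = sum(run_a) / len(run_a)
    mean_b = sum(run_b) / len(run_b)
    return abs(mean_a - mean_b) <= tolerance

# A "corrupted" value in one timestep breaks bitwise reproducibility, but
# barely moves the climatology of the run.
run_a = [288.0 + 0.01 * (i % 7) for i in range(10_000)]  # fake temperature series
run_b = list(run_a)
run_b[1234] += 1e-9                                      # single perturbed value

print("bitwise identical:      ", bitwise_identical(run_a, run_b))       # False
print("climatologically close: ", climatologically_close(run_a, run_b))  # True
```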

    One recommendation that came out of our discussions is to consider a two track approach for the software. Now that most modeling centres have finished their runs for the current IPCC assessment (AR5), we should plan to evolve current codes towards the next IPCC assessment (AR6), while starting now on developing entirely new software for AR7. The new codes will address i/o issues, new solvers, etc.

    One of the questions the committee posed to the workshop was the potential for hardware-software co-design. The general consensus was that it’s not possible in the current funding climate. But even if the funding were available, it’s not clear this is desirable, as the software has a much longer useful life than any hardware. Designing for specific hardware instantiations tends to bring major liabilities, and (as my own studies have indicated) there seems to be an inverse correlation between availability of dedicated computing resources and robustness/portability of the software. Things change in climate models all the time, and we need the flexibility to change algorithms, refactor software, etc. This means FPGAs might be a better solution. Dark silicon might push us in this direction anyway.

    Software sharing came up as an important topic, although we didn’t talk about this as much as I would have liked. There seems to be a tendency among modelers to assume that making the code available is sufficient. But as Cecelia Deluca pointed out, from the ESMF experience, community feedback and participation are important. Adoption mandates are not constructive – you want people to adopt software because it works better. One of the big problems here is the understandability of shared code. The learning curve is getting bigger, and code sharing between labs is really only possible with a lot of personal interaction. We did speculate that auto-generation of code might help here, because it forces the development of a higher-level language to describe what’s in a climate model.

    For the human resources question, there was a widespread worry that we don’t have the skills and capacity to deal with anticipated disruptive changes in computational resources. There is a shortage of high quality applicants for model development positions, and many disincentives for people to pursue such a career: the long publication cycle, academic snobbery, and the demands of the IPCC all make model development an unattractive career for grad students and early career scientists. We need a different reward system, so that contributions to the model are rewarded.

    However, it’s also clear that we don’t have enough solid data on this – just lots of anecdotal evidence. We don’t know enough about talent development and capacity to say precisely where the problems are. We identified three distinct roles, which someone amusingly labelled: diagnosticians (who use models and model output in their science), perturbers (who explore new types of runs by making small changes to models) and developers (who do the bulk of model development). Universities produce most of the first, a few of the second, and very few of the third. Furthermore, model developers could be subdivided into those who develop new parameterizations and numerical analysts, although I would add a third category: developers of infrastructure code.

    As well as worrying about training of a new generation of modellers, we also worried about whether the other groups (diagnosticians and perturbers) would have the necessary skillsets. Students are energized by climate change as a societal problem, even if they’re not enticed by a career in earth sciences. Can we capitalize on this, through more interaction with work at the policy/science interface? We also need to make climate modelling more attractive to students, and to connect them more closely with the modeling groups. This could be done through certificate programs for undergrads to bring them into modelling groups, and by bringing grad students into modelling centres in their later grad years. To boost computational skills, we should offer training in earth system science to students in computer science, and expand training for earth system scientists in computational skills.

    Finally, let me end with a few of the suggestions that received a very negative response from many workshop attendees:

    • Should the US be offering only one center’s model to the IPCC for each CMIP round? Currently every major modeling center participates, and many of the centres complain that it dominates their resources during the CMIP exercise. However, participating brings many benefits, including visibility, detailed comparison with other models, and a pressure to improve model quality and model documentation.
    • Should we ditch Fortran and move to higher-level languages? This one didn’t really even get much discussion. My own view is that it’s simply not possible – the community has too much capital tied up in Fortran, and it’s the only language everyone knows.
    • Can we incentivize mass participation in climate modeling, along the lines of “develop apps for the iPhone”? This is an intriguing notion, but one that I don’t think will get much traction, because of the depth of knowledge needed to do anything useful at all in current earth system modeling. Oh, and we’d probably need a different answer to the previous question, too.

    This week I’ve been attending a workshop at NCAR in Boulder, to provide input to the US National Academies committee on “A National Strategy for Advancing Climate Modeling”. The overall goal is to:

    “…develop a strategy for improving the nation’s capability to accurately simulate climate and related Earth system changes on decadal to centennial timescales. The committee’s report is envisioned as a high level analysis, providing a strategic framework to guide progress in the nation’s climate modeling enterprise over the next 10-20 years”

    The workshop has been fascinating, addressing many of the issues I encountered on my visits to various modelling labs last year – how labs are organised, what goes into the models, how they are tested and validated, etc. I now have a stack of notes from the talks and discussions, so I’ll divide this into several blog posts, corresponding to the main themes of the workshop.

    The full agenda is available at the NRC website, and they’ll be posting meeting summaries soon.

    The first session of the workshop kicked off with the big picture questions: what earth system processes should (and shouldn’t) be included in future earth system models, and what relationship should there be with the models used for weather forecasting? To get the discussion started, we had two talks, from Andrew Weaver (University of Victoria) and Tim Palmer (U Oxford and ECMWF).

    Andrew Weaver’s talk drew on lessons learned from climate modelling in Canada to argue that we need flexibility to build different types of model for different kinds of questions. Central to his talk was a typology of modelling needs, based on the kinds of questions that need addressing:

    • Curiosity-driven research (mainly done in the universities). For example:
      • Paleo-climate (e.g. what are the effects of a Heinrich event on African climate, and how does this help our understanding of the migration of early humans from Africa?)
      • Polar climates (e.g. what is the future of the Greenland ice sheet?)
    • Policy and decision making questions (mainly done by the government agencies). These can be separated into:
      • Mitigation questions (e.g. how much can we emit and still stabilize at various temperature levels?)
      • Adaptation questions (e.g. infrastructure planning over things like water supplies, energy supply and demand)

    These diverse types of question place different challenges on climate modelling. From the policy-making point of view, there is a need for higher resolution and downscaling for regional assessments, along with challenges of modelling sub-grid scale processes (Dave Randall talked more about this in a later session). On the other hand, for paleo-climate studies, we need to bring additional things into the models, such as representing isotopes, ecosystems, and human activity (e.g. the effect on climate of the switch from hunter-gatherer to agrarian society).

    This means we need a range of different models for different purposes. Andrew argues that we should design a model to respond to a specific research question, rather than having one big model that we apply to any problem. He made a strong case for a hierarchy of models, describing his work with EMICs like the UVic model, which uses a simple energy/moisture balance model for the atmosphere, coupled with components from other labs for sea ice, ocean, vegetation, etc. He raised a chuckle when he pointed out that neither EMICs nor GCMs are able to get the Greenland ice sheet right, and therefore EMICs are superior because they get the wrong answer faster. There is a serious point here – for some types of question, all models are poor approximations, and hence their role is to probe what we don’t know, rather than to accurately simulate what we do know.

    Some of the problems faced in the current generation of models will never go away: clouds and aerosols, ocean mixing, precipitation. And there are some paths we don’t want to take, particularly the idea of coupling general equilibrium economic models with earth system models. The latter may be very sexy, but it’s not clear what we’ll learn. The uncertainties in the economic models are so large that you can get any answer you like, which means this will soak up resources for very little knowledge gain. We should also resist calls to consolidate modelling activities at just one or two international centres.

    In summary, Andrew argued we are at a critical juncture for US and international modelling efforts. Currently the field is dominated by a race to build the biggest model with the most subcomponents for the SRES scenarios. But a future priority must be on providing information needed for adaptation planning – this has been neglected in the past in favour of mitigation studies.

    Tim Palmer’s talk covered some of the ideas on probabilistic modeling that he presented at the AGU meeting in December. He began with an exploration of why better prediction capability is important, adding a third category to Andrew’s set of policy-related questions:

    • mitigation – while the risk of catastrophic climate change is unequivocal, we must reduce the uncertainties if we are ever to tackle the indifference to significant emissions cuts; otherwise humanity is heading for utter disaster.
    • adaptation – spending wisely on new infrastructure depends on our ability to do reliable regional prediction.
    • geo-engineering – could we ever take this seriously without reliable bias-free models?

    Tim suggested two over-riding goals for the climate modeling community. We should aspire to develop ab initio models (based on first principles) whose outputs do not differ significantly from observations; and we should aspire to improve probabilistic forecasts by reducing uncertainties to the absolute minimum, but without making the probabilistic forecasts over-confident.

    The main barriers to progress are limited human and computational resources, and limited observational data. Earth System Models are complex – there is no more complex problem in computational science – and great amounts of ingenuity and creativity are needed. All of the directions we wish to pursue – greater complexity, higher resolution, bigger ensembles, longer paleo runs, data assimilation – demand more computational resources. But just as much, we’re hampered by an inability to get the maximum value from observations. So how do we use observations to inform modelling in a reliable way?

    Are we constrained by our history? Ab initio climate models originally grew out of numerical weather prediction, but rapidly diverged. Few climate models today are viable for NWP. Models are developed at the institutional level, so we have diversity of models worldwide, and this is often seen as a virtue – the diversity permits the creation of model ensembles, and rivalry between labs ought to be good for progress. But do we want to maintain this? Are the benefits as great as they are often portrayed? This history, and the institutional politics of labs, universities and funding agencies provide an unspoken constraint on our thinking. We must be able to identify and separate these constraints if we’re going to give taxpayers best value for money.

    Uncertainty is a key issue. While the diversity of models is often cited as a measure of confidence in ensemble forecasts, this is very ad hoc. Models are not designed to span the uncertainty in representation of physical processes in any systematic way, so the ensembles are ensembles of opportunity. So if the models agree, how can we be sure this is a measure of confidence? An obvious counter-example is in economics, where the financial crisis happened despite all the economic models agreeing.

    So can we reduce uncertainty if we had better models? Tim argues that seamless prediction is important. He defines it as “bringing the constraints and insights of NWP into the climate arena”. The key idea is to be able to use the same modeling system to move seamlessly between forecasts at different scales, both temporally and spatially. In essence, it means unifying climate and weather prediction models. This unification brings a number of advantages:

    • It accelerates model development and reduces model biases, by exploring model skill on shorter, verifiable timescales.
    • It helps to bridge between observations and testing strategies, using techniques such as data assimilation.
    • It brings the weather and climate communities together.
    • It allows for cross-over of best practices.

    Many key climate amplifiers are associated with fast timescale processes, which are best explored in NWP mode. This approach also allows us to test the reliability of probabilistic forecasts, on weekly, monthly and seasonal timescales. For example, reliability diagrams can be used to explore the reliability of a whole season of forecasts. This is done by subsampling the forecasts, taking, for example, all the forecasts that said it would rain with 40% probability, and checking that it did rain 40% of the time for this subsample.
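    Here’s a minimal sketch of that subsampling idea (my own toy code, using fabricated forecast/observation pairs rather than real ones): bin the probabilistic forecasts, and for each bin compare the stated probability with the frequency at which the event actually occurred.

```python
# Minimal reliability-diagram calculation: for each forecast-probability bin,
# compare the stated probability with the observed frequency of the event.
# The forecast/observation pairs here are fabricated for illustration.
import random

random.seed(42)

# Fake a season of probabilistic rain forecasts and outcomes. The forecasts
# are made slightly over-confident on purpose, to give the diagram some shape.
pairs = []
for _ in range(5000):
    p_forecast = random.random()
    p_true = 0.1 + 0.8 * p_forecast          # the "real" probability of rain
    rained = random.random() < p_true
    pairs.append((p_forecast, rained))

bins = 10
for b in range(bins):
    lo, hi = b / bins, (b + 1) / bins
    subsample = [rained for p, rained in pairs if lo <= p < hi]
    if subsample:
        observed = sum(subsample) / len(subsample)
        print(f"forecast {lo:.1f}-{hi:.1f}: observed frequency {observed:.2f} "
              f"(n={len(subsample)})")
```

    A perfectly reliable forecast system would show the observed frequency in each bin matching the forecast probability; the deliberately miscalibrated fake forecasts here drift away from that at the extremes.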

    Such a unification also allows a pooling of research activity across universities, labs, and NWP centres to make progress on important ideas such as stochastic parameterizations and data assimilation. Stochastic parameterizations are proving more skilful than multi-model ensembles in NWP on monthly timescales, although the ideas are still in their infancy. Data assimilation provides a bridge between the modelling and observational worlds (as an example, Rodwell and Palmer were able to rule out high climate sensitivities in the climateprediction.net data using assimilation with the 6-hour ECMWF observational data).

    In contrast to Andrew’s argument to avoid consolidation, Tim argues that consolidation of effort at the continental (rather than national) level would bring many benefits. He cites the example of Airbus, where European countries pooled their efforts because aircraft design had become too complex for any individual nation to go it alone. The Airbus approach involves a consortium, allowing for specialization within each country.

    Tim closed with a recap of an argument he made in a recent Physics World article. If a group of nations can come together to fund the Large Hadron Collider, then why can’t we come together to fund a computing infrastructure for climate computing? If climate is the number one problem facing the world, then why don’t we have the number one supercomputer, rather than sitting at around 50 in the TOP500?

    Following these talks, we broke off into discussion groups to discuss a set of questions posed by the committee. I managed to capture some of the key points that emerged from these discussions – the committee report will present a much fuller account.

    • As we expand earth system models to include ever more processes, we should not lose sight of the old unsolved problems, such as clouds, ice sheets, etc.
    • We should be very cautious about introducing things into the models that are not tied back to basic physics. Many people in the discussions commented that they would resist human activities being incorporated into the models, as this breaks the ab initio approach. However, others argued that including social and ecosystem processes is inevitable, as some research questions demand it, and so we should focus on how it should be done – starting with the places where the two-way interaction is largest, e.g. land use, energy feedbacks, renewable energy.
    • We should be able to develop an organized hierarchy of models, with traceable connections to one another.
    • Seamless prediction means working within the same framework/system, but not necessarily using the same model.
    • As models become more complex, we need to be able to disentangle different kinds of complexity – for example the complexity that derives from emergent behaviour vs. the number of processes vs. the number of degrees of freedom.
    • The scale gap between models and observations is disappearing. Model resolution is increasing exponentially, while observational capability is increasing only polynomially. This will improve our ability to test models against observations.
    • There is a great ambiguity over what counts as an Earth System Model. Do they include regional models? What level of coupling defines an ESM?
    • There are huge challenges in bringing communities together to consolidate efforts, even on seemingly simple things like agreeing terminology.
    • There’s a complementary challenge to the improvement of ESMs: how can we improve the representation of climate processes in other kinds of model?
    • We should pay attention to user-generated questions. In particular, the general public doesn’t feel that climate modelling is relevant to them, which partly explains problems with lack of funding.

    Next up: challenges arising from hardware, software and human resource limitations…

    Last summer, when I was visiting NCAR, Warren Washington gave me some papers to read on the first successful numerical weather prediction, done on ENIAC in 1950, by a team of meteorologists put together by John von Neumann. von Neumann was very keen to explore new applications for ENIAC, and saw numerical weather prediction as an obvious thing to try, especially as he’d been working with atmospheric simulations for modeling nuclear weapons explosions. There was, of course, a military angle to this – the cold war was just getting going, and weather control was posited as a potentially powerful new weapon. Certainly that was enough to get the army to fund the project.

    The original computation took about 24 hours (including time for loading the punched cards) to produce a 24-hour forecast. This was remarkable, as it meant that with a faster machine, useful weather prediction was possible. There’s a very nice account of the ENIAC computations in:

    …and a slightly more technical account, with details of the algorithm in:

    So having read up on this, I thought it would be interesting to attempt to re-create the program in a modern programming language, as an exercise in modeling, and as way of better understanding this historic milestone. At which point Warren pointed out to me that it’s already been done, by Peter Lynch at University College Dublin:

    And not only that, but Peter then went one better, and re-implemented it again on a mobile phone, as a demonstration of how far Moore’s law has brought us. And he calls it PHONIAC. It took less than a second to compute on PHONIAC (remember, the original computation needed a room-sized machine, a bunch of operators, and 24 hours).

    I mentioned this in the comment thread on my earlier post on model IV&V, but I’m elevating it to a full post here because it’s an interesting point of discussion.

    I had a very interesting lunch with David Randall at CSU yesterday, in which we talked about many of the challenges facing climate modelling groups as they deal with increasing complexity in the models. One topic that came up was the question of whether it’s time for the climate modeling labs to establish separate divisions for the science models (experimental tools for trying out new ideas) and production models (which would be used for assessments and policy support). This separation hasn’t happened in climate modelling, but may well be inevitable, if the anticipated market for climate services ever materializes.

    There are many benefits of such a separation. It would clarify the goals and roles within the modeling labs, and allow for a more explicit decision process that decides which ideas from the bleeding edge science models are mature enough for inclusion in the operational models. The latter would presumably only contain the more well-established science, would change less rapidly, and could be better engineered for robustness and usability. And better documented.

    But there’s a huge downside: the separation would effectively mean two separate models need to be developed and maintained (thus potentially doubling the effort), and the separation would make it harder to get the latest science transferred into the production model. Which in turn would mean a risk that assessments such as the IPCC’s become even more dated than they are now: there’s already a several year delay because of the time it takes to complete model runs, share the data, analyze it, peer-review and publish results, and then compile the assessment reports. Divorcing science models from production models would make this delay worse.

    But there’s an even bigger problem: the community is too small. There aren’t enough people who understand how to put together a climate model as it is; bifurcating the effort will make this shortfall even worse. David points out that part of the problem is that climate models are now so complex that nobody really understands the entire model; the other problem is that our grad schools aren’t producing many people who have both the aptitude and enthusiasm for climate modeling. There’s a risk that the best modellers will choose to stay in the science shops (because getting leading edge science into the models is much more motivating), leaving insufficient expertise to maintain quality in the production models.

    So really, it comes down to some difficult questions about priorities: given the serious shortage of good modellers, do we push ahead with the current approach in which progress at the leading edge of the science is prioritized, or do we split the effort to create these production shops? It seems to me that what matters for the IPCC at the moment is a good assessment of the current science, not some separate climate forecasting service. If a commercial market develops for the latter (which is possible, once people really start to get serious about climate change), then someone will have to figure out how to channel the revenues into training a new generation of modellers.

    When I mentioned this discussion on the earlier thread, Josh was surprised at the point that universities aren’t producing enough people with the aptitude and motivation:

    “seems like this is a hot / growth area (maybe that impression is just due to the press coverage).”

    Michael replied:

    “Funding is weak and sporadic; the political visibility of these issues often causes revenge-taking at the top of the funding hierarchy. Recent news, for instance, seems to be of drastic cuts in Canadian funding for climate science. …

    The limited budgets lead to attachment to awkward legacy codes, which drives away the most ambitious programmers. The nature of the problem stymies the most mathematically adept who are inclined to look for more purity. Software engineers take a back seat to physical scientists with little regard for software design as a profession. All in all, the work is drastically ill-rewarded in proportion to its importance, and it’s fair to say that while it attracts good people, it’s not hard to imagine a larger group of much higher productivity and greater computational science sophistication working on this problem.

    And that is the nub of the problem. There’s plenty of scope for improving the quality of the models and the quality of the software in them. But if we can’t grow the pool of engaged talent, it won’t happen.

    My post on validating climate models suggested that the key validation criterion is the extent to which the model captures (some aspect of) the current scientific theory, and is useful in exploring the theory. In effect, I’m saying that climate models are scientific tools, and should be validated as scientific tools. This makes them very different from, say, numerical weather prediction (NWP) software, which is used in an operational setting to provide a service (predicting the weather).

    What’s confusing is that both communities (climate modeling and weather modeling) use many of the same techniques both for the design of the models, and for comparing the models with observational data.

    For NWP, forecast accuracy is the overriding objective, and the community has developed an extensive methodology for doing forecast verification. I pondered for a while whether the use of the term ‘verification’ here is consistent with my definitions, because surely we should be “validating” a forecast rather than “verifying” it. After thinking about it for a while, I concluded that the terminology is consistent, because forecast verification is like checking a program against its specification. In this case the specification states precisely what is being predicted, with what accuracy, and what would constitute a successful forecast (Bob Grumbine gives a recent example in verifying the accuracy of seasonal sea ice forecasts). The verification procedure checks that the actual forecast was accurate, within the criteria set by this specification. Whether or not the forecast was useful is another question: that’s the validation question (and it’s a subjective question that requires some investigation of why people want forecasts in the first place).
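    As a sketch of what such a specification-driven verification check might look like (the numbers and the acceptance threshold below are invented for illustration – they are not Bob Grumbine’s actual criteria):

```python
# Sketch of forecast verification against an explicit specification:
# the spec says what is predicted, with what accuracy, over what period.
# Forecast values and the acceptance threshold are invented for illustration.

SPEC = {
    "quantity": "September sea ice extent (million km^2)",
    "max_mean_abs_error": 0.5,     # forecast counts as verified if MAE <= 0.5
}

forecasts    = [4.6, 4.9, 5.2, 4.3, 4.8]   # issued each year
observations = [4.3, 4.6, 5.6, 4.7, 4.9]   # what actually happened

mae = sum(abs(f - o) for f, o in zip(forecasts, observations)) / len(forecasts)
verified = mae <= SPEC["max_mean_abs_error"]

print(f"{SPEC['quantity']}: MAE = {mae:.2f}, "
      f"verified = {verified} (threshold {SPEC['max_mean_abs_error']})")
```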

    An important point here is that forecast verification is not software verification: it doesn’t verify a particular piece of software. It’s also not simulation verification: it doesn’t verify a given run produced by that software. It’s verification of an entire forecasting system. A forecasting system makes use of computational models (often more than one), as well as a bunch of experts who interpret the model results. It also includes an extensive data collection system that gathers information about the current state of the world to use as input to the model. (And of course, some forecasting systems don’t use computational models at all). So:

    • If the forecast is inaccurate (according to the forecast criteria), it doesn’t necessarily mean there’s a flaw in the models – it might just as well be a flaw in the interpretation of the model outputs, or in the data collection process that provided its inputs. Oh, and of course, the verification might also fail because the specification is wrong, e.g. because there are flaws in the observational system used in the verification procedure too.
    • If the forecasting system persistently produces accurate forecasts (according to the forecast criteria), that doesn’t necessarily tell us anything about the quality of the software itself, it just means that the entire forecast system worked. It may well be that the model is very poor, but the meteorologists who interpret model outputs are brilliant at overcoming the weaknesses in the model (perhaps in the way they configure the runs, or perhaps in the way they filter model outputs), to produce accurate forecasts for their customers.

    However, one effect of using this forecast verification approach day-in-day-out for weather forecasting systems over several decades (with an overall demand from customers for steady improvements in forecast accuracy) is that all parts of the forecasting system have improved dramatically over the last few decades, including the software. And climate modelling has benefited from this, as improvements in the modelling of processes needed for NWP can often also be used to improve the climate models (Senior et al have an excellent chapter on this in a forthcoming book, which I will review nearer to the publication date).

    The question is, can we apply a similar forecast verification methodology to the “climate forecasting system”, despite the differences between weather and climate?

    Note that the question isn’t about whether we can verify the accuracy of climate models this way, because the methodology doesn’t separate the models from the broader system in which they are used. So, if we take this route at all, we’re attempting to verify the forecast accuracy of the whole system: collection of observational data, creation of theories, use of these theories to develop models, choices for which model and which model configuration to use, choices for how to set up the runs, and interpretation of the results.

    Climate models are not designed as forecasting tools, they are designed as tools to explore current theories about the climate system, and to investigate sources of uncertainty in these theories. However, the fact that they can be used to project potential future climate change (under various scenarios) is very handy. Of course, this is not the only way to produce quantified estimates of future climate change – you can do it using paper and pencil. It’s also a little unfortunate, because the IPCC process (or at least the end-users of IPCC reports) tend to over-emphasize the model projections at the expense of the science that went into them, and increasingly the funding for the science is tied to the production of such projections.

    But some people (both within the climate modeling community and within the denialist community) would prefer that they not be used to project future climate change at all. (The argument from within the modelling community is that the results get over-interpreted or mis-interpreted by lay audiences; the argument from the denialist community is that models aren’t perfect. I think these two arguments are connected…). However, both arguments ignore reality: society demands of climate science that it provides its best estimates of the rate and size of future climate change, and (to the extent that they embody what we currently know about climate) the models are the best tool for this job. Not using them in the IPCC assessments would be like marching into the jungle with one eye closed.

    So, back to the question: can we use NWP forecast verification for climate projections? I think the answer is ‘no’, because of the timescales involved. Projections of climate change really only make sense on the scale of decades to centuries. Waiting for decades to do the verification is pointless – by then the science will have moved on, and it will be way too late for policymaking purposes anyway.

    If we can’t verify the forecasts on a timescale that’s actually useful, does this mean the models are invalid? Again the answer is ‘no’, for three reasons. First, we have plenty of other V&V techniques to apply to climate models. Second, the argument that climate models are a valid tool for creating future projections of climate change is based not on our ability to do forecast verification, but on how well the models capture the current state of the science. And third, because forecast verification wouldn’t necessarily say anything about the models themselves anyway, as it assesses the entire forecast system.

    It would certainly be really, really useful to be able to verify the “climate forecast” system. But the fact that we can’t does not mean we cannot validate climate models.

    In my last two posts, I demolished the idea that climate models need Independent Verification and Validation (IV&V), and I described the idea of a toolbox approach to V&V. Both posts were attacking myths: in the first case, the myth that an independent agent should be engaged to perform IV&V on the models, and in the second, the myth that you can critique the V&V of climate models without knowing anything about how they are currently built and tested.

    I now want to expand on the latter point, and explain how the day-to-day practices of climate modellers taken together constitute a robust validation process, and that the only way to improve this validation process is just to do more of it (i.e. give the modeling labs more funds to expand their current activities, rather than to do something very different).

    The most common mistake made by people discussing validation of climate models is to assume that a climate model is a thing-in-itself, and that the goal of validation is to demonstrate that some property holds of this thing. And whatever that property is, the assumption is that such measurement of it can be made without reference to its scientific milieu, and in particular without reference to its history and the processes by which it was constructed.

    This mistake leads people to talk of validation in terms of how well “the model” matches observations, or how well “the model” matches the processes in some real world system. This approach to validation is, as Oreskes et al pointed out, quite impossible. The models are numerical approximations of complex physical phenomena. You can verify that the underlying equations are coded correctly in a given version of the model, but you can never validate that a given model accurately captures real physical processes, because it never will accurately capture them. Or as George Box summed it up: “All models are wrong…” (we’ll come back to the second half of the quote later).

    The problem is that there is no such thing as “the model”. The body of code that constitutes a modern climate model actually represents an enormous number of possible models, each corresponding to a different way of configuring that code for a particular run. Furthermore, this body of code isn’t a static thing. The code is changed on a daily basis, through a continual process of experimentation and model improvement. Often these changes are done in parallel, so that there are multiple versions at any given moment, being developed along multiple lines of investigation. Sometimes these lines of evolution are merged, to bring a number of useful enhancements together into a single version. Occasionally, the lines diverge enough to cause a fork: a point at which they are different enough that it just becomes too hard to reconcile them (see, for example, this visualization of the evolution of ocean models). A forked model might at some point be given a new name, but the process by which a model gets a new name is rather arbitrary.

    Occasionally, a modeling lab will label a particular snapshot of this evolving body of code as an “official release”. An official release has typically been tested much more extensively, in a number of standard configurations for a variety of different platforms. It’s likely to be more reliable, and therefore easier for users to work with. By more reliable here, I mean relatively free from coding defects. In other words, it is better verified than other versions, but not necessarily better validated (I’ll explain why shortly). In many cases, official releases also contain some significant new science (e.g. new parameterizations), and these scientific enhancements will be described in a set of published papers.

    However, an official release isn’t a single model either. Again it’s just a body of code that can be configured to run as any of a huge number of different models, and it’s not unchanging either – as with all software, there will be occasional bugfix releases applied to it. Oh, and did I mention that to run a model, you have to make use of a huge number of ancillary datafiles, which define everything from the shape of the coastlines and land surfaces, to the specific carbon emissions scenario to be used. Any change to these effectively gives a different model too.
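    To put the point in (deliberately over-simplified) code: the “model” that actually gets run is the product of a code version plus a stack of configuration choices and ancillary files, so even a single official release corresponds to a large space of distinct models. The option names below are invented, and bear no resemblance to any real model’s namelists.

```python
# A deliberately over-simplified sketch: one body of code, many "models".
# Every distinct combination of configuration choices and ancillary files is,
# in effect, a different model. Names and options are invented.
from itertools import product

code_versions   = ["release-7.3", "release-7.3-bugfix2"]
resolutions     = ["N96", "N216"]
ocean_couplings = ["coupled", "prescribed-SST"]
aerosol_schemes = ["scheme-A", "scheme-B"]
ancillary_sets  = ["RCP4.5 emissions", "RCP8.5 emissions", "pre-industrial"]

configurations = list(product(code_versions, resolutions, ocean_couplings,
                              aerosol_schemes, ancillary_sets))

print(f"{len(configurations)} distinct model configurations from one code base")
print("e.g.", configurations[0])
```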

    So, if you’re hoping to validate “the model”, you have to say which one you mean: which configuration of which code version of which line of evolution, and with which ancillary files. I suppose the response from those clamouring for something different in the way of model validation would be “well, the one used for the IPCC projections, of course”. Which is a little tricky, because each lab produces a large number of different runs for the CMIP process that provides input to the IPCC, and each of these is likely to involve a different model configuration.

    But let’s say for sake of argument that we could agree on a specific model configuration that ought to be “validated”. What will we do to validate it? What does validation actually mean? The Oreskes paper I mentioned earlier already demonstrated that comparison with real world observations, while interesting, does not constitute “validation”. The model will never match the observations exactly, so the best we’ll ever get along these lines is an argument that, on balance, given the sum total of the places where there’s a good match and the places where there’s a poor match, that the model does better or worse than some other model. This isn’t validation, and furthermore it isn’t even a sensible way of thinking about validation.

    At this point many commentators stop, and argue that if validation of a model isn’t possible, then the models can’t be used to support the science (or more usually, they mean they can’t be used for IPCC projections). But this is a strawman argument, based on a fundamental misconception of what validation is all about. Validation isn’t about checking that a given instance of a model satisfies some given criteria. Validation is about fitness for purpose, which means it’s not about the model at all, but about the relationship between a model and the purposes to which it is put. Or more precisely, it’s about the relationship between particular ways of building and configuring models and the ways in which runs produced by those models are used.

    Furthermore, the purposes to which models are put and the processes by which they are developed co-evolve. The models evolve continually, and our ideas about what kinds of runs we might use them for evolve continually, which means validation must take this ongoing evolution into account. To summarize, validation isn’t about a property of some particular model instance; it’s about the whole process of developing and using models, and how this process evolves over time.

    Let’s take a step back a moment, and ask what is the purpose of a climate model. The second half of the George Box quote is “…but some models are useful”. Climate models are tools that allow scientists to explore their current understanding of climate processes, to build and test theories, and to explore the consequences of those theories. In other words we’re dealing with three distinct systems:

    [Figure: the relationships between the three systems – the theoretical system, the calculational system, and the observational system]

    There does not need to be any clear relationship between the calculational system and the observational system – I didn’t include such a relationship in my diagram. For example, climate models can be run in configurations that don’t match the real world at all: e.g. a waterworld with no landmasses, or a world in which interesting things are varied: the tilt of the pole, the composition of the atmosphere, etc. These models are useful, and the experiments performed with them may be perfectly valid, even though they differ deliberately from the observational system.

    What really matters is the relationship between the theoretical system and the observational system: in other words, how well does our current understanding (i.e. our theories) of climate explain the available observations (and of course the inverse: what additional observations might we make to help test our theories). When we ask questions about likely future climate changes, we’re not asking this question of the calculational system, we’re asking it of the theoretical system; the models are just a convenient way of probing the theory to provide answers.

    By the way, when I use the term theory, I mean it in exactly the way it’s used throughout all the sciences: a theory is the best current explanation of a given set of phenomena. The word “theory” doesn’t mean knowledge that is somehow more tentative than other forms of knowledge; a theory is actually the kind of knowledge that has the strongest epistemological basis of any kind of knowledge, because it is supported by the available evidence, and best explains that evidence. A theory might not be capable of providing quantitative predictions (but it’s good when it does), but it must have explanatory power.

    In this context, the calculational system is valid as long as it can offer insights that help to understand the relationship between the theoretical system and the observational system. A model is useful as long as it helps to improve our understanding of climate, and to further the development of new (or better) theories. So a model that might have been useful (and hence valid) thirty years ago might not be useful today. If the old approach to modelling no longer matches current theory, then it has lost some or all of its validity. The model’s correspondence (or lack of) to the observations hasn’t changed (*), nor has its predictive power. But its utility as a scientific tool has changed, and hence its validity has changed.

    [(*) except that the accuracy of the observations may have changed in the meantime, due to the ongoing process of discovering and resolving anomalies in the historical record.]

    The key questions for validation then, are to do with how well the current generation of models (plural) support the discovery of new theoretical knowledge, and whether the ongoing process of improving those models continues to enhance their utility as scientific tools. We could focus this down to specific things we could measure by asking whether each individual change to the model is theoretically justified, and whether each such change makes the model more useful as a scientific tool.

    To do this requires a detailed study of day-to-day model development practices, and the extent to which these are closely tied to the rest of climate science (e.g. field campaigns, process studies, etc). It also takes in questions such as how modeling centres decide on their priorities (e.g. which new bits of science to get into the models sooner), and how each individual change is evaluated. In this approach, validation proceeds by checking whether the individual steps taken to construct and test changes to the code add up to a sound scientific process, and how good this process is at incorporating the latest theoretical ideas. And we ought to be able to demonstrate a steady improvement in the theoretical basis for the model.

    An interesting quirk here is that sometimes an improvement to the model from a theoretical point of view reduces its skill at matching observations; this happens particularly when we’re replacing bits of the model that were based on empirical parameters with an implementation that has a stronger theoretical basis, because the empirical parameters were tuned to give a better climate simulation, without necessarily being well understood. In the approach I’m describing, this would be an indicator of an improvement in validity, even while it reduces the correspondence with observations. If, on the other hand, we based our validation on some measure of correspondence with observations, such a step would reduce the validity of the model!

    But what does all of this tell us about whether it’s “valid” to use the models to produce projections of climate change into the future? Well, recall that when we ask for projections of future climate change, we’re not asking the question of the calculational system, because all that would result in is a number, or range of numbers, that are impossible to interpret, and therefore meaningless. Instead we’re asking the question of the theoretical system: given the sum total of our current theoretical understanding of climate, what is likely to happen in the future, under various scenarios for expected emissions and/or concentrations of greenhouse gases? If the models capture our current theoretical understanding well, then running the scenario on the model is a valid thing to do. If the models do a poor job of capturing our theoretical understanding, then running the models on these scenarios won’t be very useful.

    Note what is happening here: when we ask climate scientists for future projections, we’re asking the question of the scientists, not of their models. The scientists will apply their judgement to select appropriate versions/configurations of the models to use, they will set up the runs, and they will interpret the results in the light of what is known about the models’ strengths and weaknesses and about any gaps between the computational models and the current theoretical understanding. And they will add all sorts of caveats to the conclusions they draw from the model runs when they present their results.

    And how do we know whether the models capture our current theoretical understanding? By studying the processes by which the models are developed (i.e. continually evolved) by the various modeling centres, and examining how good each centre is at getting the latest science into the models. And by checking that whenever there are gaps between the models and the theory, these are adequately described by the caveats in the papers published about experiments with the models.

    Summary: It is a mistake to think that validation is a post-hoc process to be applied to an individual “finished” model to ensure it meets some criteria for fidelity to the real world. In reality, there is no such thing as a finished model, just many different snapshots of a large set of model configurations, steadily evolving as the science progresses. And fidelity of a model to the real world is impossible to establish, because the models are approximations. In reality, climate models are tools to probe our current theories about how climate processes work. Validity is the extent to which climate models match our current theories, and the extent to which the process of improving the models keeps up with theoretical advances.

    Sometime in the 1990s, I drafted a frequently asked questions list for NASA’s IV&V facility. Here’s what I wrote on the meaning of the terms “validation” and “verification”:

    The terms Verification and Validation are commonly used in software engineering to mean two different types of analysis. The usual definitions are:

    • Validation: Are we building the right system?
    • Verification: Are we building the system right?

    In other words, validation is concerned with checking that the system will meet the customer’s actual needs, while verification is concerned with whether the system is well-engineered, error-free, and so on. Verification will help to determine whether the software is of high quality, but it will not ensure that the system is useful.

    The distinction between the two terms is largely to do with the role of specifications. Validation is the process of checking whether the specification captures the customer’s needs, while verification is the process of checking that the software meets the specification.

    Verification includes all the activities associated with producing high quality software: testing, inspection, design analysis, specification analysis, and so on. It is a relatively objective process, in that if the various products and documents are expressed precisely enough, no subjective judgements should be needed in order to verify software.

    In contrast, validation is an extremely subjective process. It involves making subjective assessments of how well the (proposed) system addresses a real-world need. Validation includes activities such as requirements modelling, prototyping and user evaluation.

    In a traditional phased software lifecycle, verification is often taken to mean checking that the products of each phase satisfy the requirements of the previous phase. Validation is relegated to just the beginning and end of the project: requirements analysis and acceptance testing. This view is common in many software engineering textbooks, and is misguided. It assumes that the customer’s requirements can be captured completely at the start of a project, and that those requirements will not change while the software is being developed. In practice, the requirements change throughout a project, partly in reaction to the project itself: the development of new software makes new things possible. Therefore both validation and verification are needed throughout the lifecycle.

    Finally, V&V is now regarded as a coherent discipline: “Software V&V is a systems engineering discipline which evaluates the software in a systems context, relative to all system elements of hardware, users, and other software”. (from Software Verification and Validation: Its Role in Computer Assurance and Its Relationship with Software Project Management Standards, by Dolores R. Wallace and Roger U. Fujii, NIST Special Publication 500-165)
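    A trivial example of where that line between the two falls (my own, not from the original FAQ): a test that checks code against its specification is verification; asking whether the specification itself captures what the user actually needed is validation, and no amount of testing will answer that.

```python
# Verification vs validation, in miniature. The example is mine, not from the FAQ.

def monthly_mean(daily_values):
    """Spec: return the arithmetic mean of the daily values."""
    return sum(daily_values) / len(daily_values)

# Verification: does the code meet the spec? A test can answer this.
assert monthly_mean([1.0, 2.0, 3.0]) == 2.0

# Validation: was the spec right? Perhaps the user actually needed an
# area-weighted mean, or a median that is robust to missing days. No test of
# monthly_mean against its own spec will ever reveal that.
```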

    Having thus carefully distinguished the two terms, my advice to V&V practitioners was then to forget about the distinction, and think instead about V&V as a toolbox, which provides a wide range of tools for asking different kinds of questions about software. And to master the use of each tool and figure out when and how to use it. Here’s one of my attempts to visualize the space of tools in the toolbox:

    [Figure: a range of V&V techniques. Note that "modeling" and "model checking" here refer to building and analyzing abstracted models of software behaviour – a very different kind of beast from the scientific models used in the computational sciences.]

    For climate models, the definitions that focus on specifications don’t make much sense, because there are no detailed specifications of climate models (nor can there be – they’re built by iterative refinement, like agile software development). But no matter – the toolbox approach still works; it just means some of the tools are applied a little differently. An appropriate toolbox for climate modeling looks a little different from my picture above, because some of these tools are more appropriate for real-time control systems, applications software, etc, and there are some missing from the picture that are particular to simulation software. I’ll draw a better picture when I’ve finished analyzing the data from my field studies of practices used at climate labs.

    Many different V&V tools are already in use at most climate modelling labs, but there is room for adding more tools to the toolbox, and for sharpening the existing tools (what and how are the subjects of my current research). But the question of how best to do this must proceed from a detailed analysis of current practices and how effective they are. There seem to be plenty of people wandering into this space, claiming that the models are insufficiently verified, validated, or both. And such people like to pontificate about what climate modelers ought to do differently. But anyone who pontificates in this way without being able to give a detailed account of which V&V techniques climate modellers currently use is just blowing smoke. If you don’t know what’s in the toolbox already, then you can’t really make constructive comments about what’s missing.