Search Results for: climate models

At the beginning of March, I was invited to give a talk at TEDxUofT. Colleagues tell me the hardest part of giving these talks is deciding what to talk about. I decided to see if I could answer the question of whether we can trust climate models. It was a fascinating and nerve-wracking experience, quite unlike any talk I’ve given before. Of course, I’d love to do another one, as I now know more about what works and what doesn’t.

Here’s the video and a transcript of my talk. [The bits in square brackets in are things I intended to say but forgot!] 

Computing the Climate: How Can a Computer Model Forecast the Future? TEDxUofT, March 1, 2014.

Talking about the weather forecast is a great way to start a friendly conversation. The weather forecast matters to us. It tells us what to wear in the morning; it tells us what to pack for a trip. We also know that weather forecasts can sometimes be wrong, but we’d be foolish to ignore them when they tell us a major storm is heading our way.

[Unfortunately, talking about climate forecasts is often a great way to end a friendly conversation!] Climate models tell us that by the end of this century, if we carry on burning fossil fuels at the rate we have been doing, and we carry on cutting down forests at the rate we have been doing, the planet will warm by somewhere between 5 to 6 degrees centigrade. That might not seem much, but, to put it into context, in the entire history of human civilization, the average temperature of the planet has not varied by more than 1 degree. So that forecast tells us something major is coming, and we probably ought to pay attention to it.

But on the other hand, we know that weather forecasts don’t work so well the longer into the future we peer. Tomorrow’s forecast is usually pretty accurate. Three day and five day forecasts are reasonably good. But next week? They always change their minds before next week comes. So how can we peer 100 years into the future and look at what is coming with respect to the climate? Should we trust those forecasts? Should we trust the climate models that provide them to us?

Six years ago, I set out to find out. I’m a professor of computer science. I study how large teams of software developers can put together complex pieces of software. I’ve worked with NASA, studying how NASA builds the flight software for the Space Shuttle and the International Space Station. I’ve worked with large companies like Microsoft and IBM. My work focusses not so much on software errors, but on the reasons why people make those errors, and how programmers then figure out they’ve made an error, and how they know how to fix it.

To start my study, I visited four major climate modelling labs around the world: in the UK, in Paris, Hamburg, Germany and in Colorado. Each of these labs have typically somewhere between 50-100 scientists who are contributing code to their climate models. And although I only visited four of these labs, there are another twenty or so around the world, all doing similar things. They run these models on some of the fastest supercomputers in the world, and many of the models have been in construction, the same model, for more than 20 years.

When I started this study, I asked one of my students to attempt to measure how many bugs there are in a typical climate model. We know from our experience with software there are always bugs. Sooner or later the machine crashes. So how buggy are climate models? More specifically, what we set out to measure is what we call “defect density” – How many errors are there per thousand lines of code. By this measure, it turns out climate models are remarkably high quality. In fact, they’re better than almost any commercial software that’s ever been studied. They’re about the same level of quality as the Space Shuttle flight software. Here’s my results (For the actual results you’ll have to read the paper):


We know it’s very hard to build a large complex piece of software without making mistakes.  Even the space shuttle’s software had errors in it. So the question is not “is the software perfect for predicting the future?”. The question is “Is it good enough?” Is it fit for purpose?

To answer that question, we’d better understand what that purpose of a climate model is. First of all, I’d better be clear what a climate model is not. A climate model is not a projection of trends we’ve seen in the past extrapolated into the future. If you did that, you’d be wrong, because you haven’t accounted for what actually causes the climate to change, and so the trend might not continue. They are also not decision-support tools. A climate model cannot tell us what to do about climate change. It cannot tell us whether we should be building more solar panels, or wind farms. It can’t tell us whether we should have a carbon tax. It can’t tell us what we ought to put into an international treaty.

What it does do is tell us how the physics of planet earth work, and what the consequences are of changing things, within that physics. I could describe it as “computational fluid dynamics on a rotating sphere”. But computational fluid dynamics is complex.

I went into my son’s fourth grade class recently, and I started to explain what a climate model is, and the first question they asked me was “is it like Minecraft?”. Well, that’s not a bad place to start. If you’re not familiar with Minecraft, it divides the world into blocks, and the blocks are made of stuff. They might be made of wood, or metal, or water, or whatever, and you can build things out of them. There’s no gravity in Minecraft, so you can build floating islands and it’s great fun.

Climate models are a bit like that. To build a climate model, you divide the world into a number of blocks. The difference is that in Minecraft, the blocks are made of stuff. In a climate model, the blocks are really blocks of space, through which stuff can flow. At each timestep, the program calculates how much water, or air, or ice is flowing into, or out of, each block, and if so, in which directions? It calculates changes in temperature, density, humidity, and so on. And whether stuff such as dust, salt, and pollutants are passing through or accumulating in each block. We have to account for the sunlight passing down through the block during the day. Some of what’s in each block might filter some of the the incoming sunlight, for example if there are clouds or dust, so some of the sunlight doesn’t get down to the blocks below. There’s also heat escaping upwards through the blocks, and again, some of what is in the block might trap some of that heat — for example clouds and greenhouse gases.

As you can see from this diagram, the blocks can be pretty large. The upper figure shows blocks of 87km on a side. If you want more detail in the model, you have to make the blocks smaller. Some of the fastest climate models today look more like the lower figure:


Ideally, you want to make the blocks as small as possible, but then you have many more blocks to keep track of, and you get to the point where the computer just can’t run fast enough. A typical run of a climate model, to simulate a century’s worth of climate, you might have to wait a couple of weeks on some of the fastest supercomputers for that run to complete. So the speed of the computer limits how small we can make the blocks.

Building models this way is remarkably successful. Here’s video of what a climate model can do today. This simulation shows a year’s worth of weather from a climate model. What you’re seeing is clouds and, in orange, that’s where it’s raining. Compare that to a year’s worth of satellite data for the year 2013. If you put them side by side, you can see many of the same patterns. You can see the westerlies, the winds at the top and bottom of the globe, heading from west to east, and nearer the equator, you can see the trade winds flowing in the opposite direction. If you look very closely, you might even see a pulse over South America, and a similar one over Africa in both the model and the satellite data. That’s the daily cycle as the land warms up in the morning and the moisture evaporates from soils and plants, and then later on in the afternoon as it cools, it turns into rain.

Note that the bottom is an actual year, 2013, while the top, the model simulation is not a real year at all – it’s a typical year. So the two don’t correspond exactly. You won’t get storms forming at the same time, because it’s not designed to be an exact simulation; the climate model is designed to get the patterns right. And by and large, it does. [These patterns aren’t coded into this model. They emerge as a consequences of getting the basic physics of the atmosphere right].

So how do you build a climate model like this? The answer is “very slowly”. It takes a lot of time, and a lot of failure. One of the things that surprised me when I visited these labs is that the scientists don’t build these models to try and predict the future. They build these models to try and understand the past. They know their models are only approximations, and they regularly quote the statistician, George Box, who said “All models are wrong, but some are useful”. What he meant is that any model of the world is only an approximation. You can’t get all the complexity of the real world into a model. But even so, even a simple model is a good way to test your theories about the world.

So the way that modellers work, is they spend their time focussing on places where the model does isn’t quite right. For example, maybe the model isn’t getting the Indian monsoon right. Perhaps it’s getting the amount of rain right, but it’s falling in the wrong place. They then form a hypothesis. They’ll say, I think I can improve the model, because I think this particular process is responsible, and if I improve that process in a particular way, then that should fix the simulation of the monsoon cycle. And then they run a whole series of experiments, comparing the old version of the model, which is getting it wrong, with the new version, to test whether the hypothesis is correct. And if after a series of experiments, they believe their hypothesis is correct, they have to convince the rest of the modelling team that this really is an improvement to the model.

In other words, to build the models, they are doing science. They are developing hypotheses, they are running experiments, and using peer review process to convince their colleagues that what they have done is correct:


Climate modellers also have a few other weapons up their sleeves. Imagine for a moment if Microsoft had 25 competitors around the world, all of whom were attempting to build their own versions of Microsoft Word. Imagine further that every few years, those 25 companies all agreed to run their software on a very complex battery of tests, designed to test all the different conditions under which you might expect a word processor to work. And not only that, but they agree to release all the results of those tests to the public, on the internet, so that anyone who wanted to use any of that software can pore over all the data and find out how well each version did, and decide which version they want to use for their own purposes. Well, that’s what climate modellers do. There is no other software in the world for which there are 25 teams around the world trying to build the same thing, and competing with each other.

Climate modellers also have some other advantages. In some sense, climate modelling is actually easier than weather forecasting. I can show you what I mean by that. Imagine I had a water balloon (actually, you don’t have to imagine – I have one here):


I’m going to throw it at the fifth row. Now, you might want to know who will get wet. You could measure everything about my throw: Will I throw underarm, or overarm? Which way am I facing when I let go of it? How much swing do I put in? If you could measure all of those aspects of my throw, and you understand the physics of how objects move, you could come up with a fairly accurate prediction of who is going to get wet.

That’s like weather forecasting. We have to measure the current conditions as accurately as possible, and then project forward to see what direction it’s moving in:


If I make any small mistakes in measuring my throw, those mistakes will multiply as the balloon travels further. The further I attempt to throw it, the more room there is for inaccuracy in my estimate. That’s like weather forecasting. Any errors in the initial conditions multiply up rapidly, and the current limit appears to be about a week or so. Beyond that, the errors get so big that we just cannot make accurate forecasts.

In contrast, climate models would be more like releasing a balloon into the wind, and predicting where it will go by knowing about the wind patterns. I’ll make some wind here using a fan:


Now that balloon is going to bob about in the wind from the fan. I could go away and come back tomorrow and it will still be doing about the same thing. If the power stays on, I could leave it for a hundred years, and it might still be doing the same thing. I won’t be able to predict exactly where that balloon is going to be at any moment, but I can predict, very reliably, the space in which it will move. I can predict the boundaries of its movement. And if the things that shape those boundaries change, for example by moving the fan, and I know what the factors are that shape those boundaries, I can tell you how the patterns of its movements are going to change – how the boundaries are going to change. So we call that a boundary problem:


The initial conditions are almost irrelevant. It doesn’t matter where the balloon started, what matters is what’s shaping its boundary.

So can these models predict the future? Are they good enough to predict the future? The answer is “yes and no”. We know the models are better at some things than others. They’re better at simulating changes in temperature than they are at simulating changes in rainfall. We also know that each model tends to be stronger in some areas and weaker in others. If you take the average of a whole set of models, you get a much better simulation of how the planet’s climate works than if you look at any individual model on its own. What happens is that the weaknesses in any one model are compensated for by other models that don’t have those weaknesses.

But the results of the models have to be interpreted very carefully, by someone who knows what the models are good at, and what they are not good at – you can’t just take the output of a model and say “that’s how it’s going to be”.

Also, you don’t actually need a computer model to predict climate change. The first predictions of what would happen if we keep on adding carbon dioxide to the atmosphere were produced over 120 years ago. That’s fifty years before the first digital computer was invented. And those predictions were pretty accurate – what has happened over the twentieth century has followed very closely what was predicted all those years ago. Scientists also predicted, for example, that the arctic would warm faster than the equatorial regions, and that’s what happened. They predicted night time temperatures would rise faster than day time temperatures, and that’s what happened.

So in many ways, the models only add detail to what we already know about the climate. They allow scientists to explore “what if” questions. For example, you could ask of a model, what would happen if we stop burning all fossil fuels tomorrow. And the answer from the models is that the temperature of the planet will stay at whatever temperature it was when you stopped. For example, if we wait twenty years, and then stopped, we’re stuck with whatever temperature we’re at for tens of thousands of years. You could ask a model what happens if we dig up all known reserves of fossil fuels, and burn them all at once, in one big party? Well, it gets very hot.

More interestingly, you could ask what if we tried blocking some of the incoming sunlight to cool the planet down, to compensate for some of the warming we’re getting from adding greenhouse gases to the atmosphere? There have been a number of very serious proposals to do that. There are some who say we should float giant space mirrors. That might be hard, but a simpler way of doing it is to put dust up in the stratosphere, and that blocks some of the incoming sunlight. It turns out that if you do that, you can very reliably bring the average temperature of the planet back down to whatever level you want, just by adjusting the amount of the dust. Unfortunately, some parts of the planet cool too much, and others not at all. The crops don’t grow so well, and everyone’s weather gets messed up. So it seems like that could be a solution, but when you study the model results in detail, there are too many problems.

Remember that we know fairly well what will happen to the climate if we keep adding CO2, even without using a computer model, and the computer models just add detail to what we already know. If the models are wrong, they could be wrong in either direction. They might under-estimate the warming just as much as they might over-estimate it. If you look at how well the models can simulate the past few decades, especially the last decade, you’ll see some of both. For example, the models have under-estimated how fast the arctic sea ice has melted. The models have underestimated how fast the sea levels have risen over the last decade. On the other hand, they over-estimated the rate of warming at the surface of the planet. But they underestimated the rate of warming in the deep oceans, so some of the warming ends up in a different place from where the models predicted. So they can under-estimate just as much as they can over-estimate. [The less certain we are about the results from the models, the bigger the risk that the warming might be much worse than we think.]

So when you see a graph like this, which comes from the latest IPCC report that just came out last month, it doesn’t tell us what to do about climate change, it just tells us the consequences of what we might choose to do. Remember, humans aren’t represented in the models at all, except in terms of us producing greenhouse gases and adding them to the atmosphere.


If we keep on increasing our use of fossil fuels — finding more oil, building more pipelines, digging up more coal, we’ll follow the top path. And that takes us to a planet that by the end of this century, is somewhere between 4 and 6 degrees warmer, and it keeps on getting warmer over the next few centuries. On the other hand, the bottom path, in dark blue, shows what would happen if, year after year from now onwards, we use less fossil fuels than we did the previous year, until about mid-century, when we get down to zero emissions, and we invent some way to start removing that carbon dioxide from the atmosphere before the end of the century, to stay below 2 degrees of warming.

The models don’t tell us which of these paths we should follow. They just tell us that if this is what we do, here’s what the climate will do in response. You could say that what the models do is take all the data and all the knowledge we have about the climate system and how it works, and put them into one neat package, and its our job to take that knowledge and turn it into wisdom. And to decide which future we would like.

I’ve been collecting examples of different types of climate model that students can use in the classroom to explore different aspects of climate science and climate policy. In the long run, I’d like to use these to make the teaching of climate literacy much more hands-on and discovery-based. My goal is to foster more critical thinking, by having students analyze the kinds of questions people ask about climate, figure out how to put together good answers using a combination of existing data, data analysis tools, simple computational models, and more sophisticated simulations. And of course, learn how to critique the answers based on the uncertainties in the lines of evidence they have used.

Anyway, as a start, here’s a collection of runnable and not-so-runnable models, some of which I’ve used in the classroom:

Simple Energy Balance Models (for exploring the basic physics)

General Circulation Models (for studying earth system interactions)

  • EdGCM – an educational version of the NASA GISS general circulation model (well, an older version of it). EdGCM provides a simplified user interface for setting up model runs, but allows for some fairly sophisticated experiments. You typically need to let the model run overnight for a century-long simulation.
  • Portable University Model of the Atmosphere (PUMA) – a planet Simulator designed by folks at the University of Hamburg for use in the classroom to help train students interested in becoming climate scientists.

Integrated Assessment Models (for policy analysis)

  • C-Learn, a simple policy analysis tool from Climate Interactive. Allows you to specify emissions trajectories for three groups of nations, and explore the impact on global temperature. This is a simplified version of the C-ROADS model, which is used to analyze proposals during international climate treaty negotiations.
  • Java Climate Model (JVM) – a detailed desktop assessment model that offers detailed controls over different emissions scenarios and regional responses.

Systems Dynamics Models (to foster systems thinking)

  • Bathtub Dynamics and Climate Change from John Sterman at MIT. This simulation is intended to get students thinking about the relationship between emissions and concentrations, using the bathtub metaphor. It’s based on Sterman’s work on mental models of climate change.
  • The Climate Challenge: Our Choices, also from Sterman’s team at MIT. This one looks fancier, but gives you less control over the simulation – you can just pick one of three emissions paths: increasing, stabilized or reducing. On the other hand, it’s very effective at demonstrating the point about emissions vs. concentrations.
  • Carbon Cycle Model from Shodor, originally developed using Stella by folks at Cornell.
  • And while we’re on systems dynamics, I ought to mention toolkits for building your own systems dynamics models, such as Stella from ISEE Systems (here’s an example of it used to teach the global carbon cycle).

Other Related Models

  • A Kaya Identity Calculator, from David Archer at U Chicago. The Kaya identity is a way of expressing the interaction between the key drivers of carbon emissions: population growth, economic growth, energy efficiency, and the carbon intensity of our energy supply. Archer’s model allows you to play with these numbers.
  • An Orbital Forcing Calculator, also from David Archer. This allows you to calculate what the effect changes in the earth’s orbit and the wobble on its axis have on the solar energy that the earth receives, in any year in the past of future.

Useful readings on the hierarchy of climate models

For a talk earlier this year, I put together a timeline of the history of climate modelling. I just updated it for my course, and now it’s up on Prezi, as a presentation you can watch and play with. Click the play button to follow the story, or just drag and zoom within the viewing pane to explore your own path.

Consider this a first draft though – if there are key milestones I’ve missed out (or misrepresented!) let me know!

On Thursday, Kaitlin presented her poster at the AGU meeting, which shows the results of the study she did with us in the summer. Her poster generated a lot of interest, especially the visualizations she has of the different model architectures. Click on thumbnail to see the full poster at the AGU site:

A few things to note when looking at the diagrams:

  • Each diagram shows the components of a model, scale to their relative size by lines of code. However, the models are not to scale with one another, as the smallest, UVic’s, is only a tenth of the size of the biggest, CESM. Someone asked what accounts for that size. Well, the UVic model is an EMIC rather than a GCM. It has a very simplified atmosphere model that does not include atmospheric dynamics, which makes it easier to run for very long simulations (e.g. to study paleoclimate). On the other hand, CESM is a community model, with a large number of contributors across the scientific community. (See Randall and Held’s point/counterpoint article in last months IEEE Software for a discussion of how these fit into different model development strategies).
  • The diagrams show the couplers (in grey), again sized according to number of lines of code. A coupler handles data re-gridding (when the scientific components use different grids), temporal aggregation (when the scientific components run on different time steps) along with other data handling. These are often invisible in diagrams the scientists create of their models, because they are part of the infrastructure code; however Kaitlin’s diagrams show how substantial they are in comparison with the scientific modules. The European models all use the same coupler, following a decade-long effort to develop this as a shared code resource.
  • Note that there are many different choices associated with the use of a coupler, as sometimes it’s easier to connect components directly rather through the coupler, and the choice may be driven by performance impact, flexibility (e.g. ‘plug-and-play’ compatibility) and legacy code issues. Sea ice presents an interesting example, because its extent varies over the course of a model run. So somewhere there must be code that keeps track of which grid cells have ice, and then routes the fluxes from ocean and atmosphere to the sea ice component for these grid cells. This could be done in the coupler, or in any of the three scientific modules. In the GFDL model, sea ice is treated as an interface to the ocean, so all atmosphere-ocean fluxes pass through it, whether there’s ice in a particular cell or not.
  • The relative size of the scientific components is a reasonable proxy for functionality (or, if you like, scientific complexity/maturity). Hence, the diagrams give clues about where each lab has placed its emphasis in terms of scientific development, whether by deliberate choice, or because of availability (or unavailability) of different areas of expertise. The differences between the models from different labs show some strikingly different choices here, for example between models that are clearly atmosphere-centric, versus models that have a more balanced set of earth system components.
  • One comment we received in discussions around the poster was about the places where we have shown sub-components in some of the models. Some modeling groups are more explicit about naming the sub-components, and indicating them in the code. Hence, our ability to identify these might be more dependent on naming practices rather than any fundamental architectural differences.

I’m sure Kaitlin will blog more of her reflections on the poster (and AGU in general) once she’s back home.

Valdivino, who is working on a PhD in Brazil, on formal software verification techniques, is inspired by my suggestion to find ways to apply our current software research skills to climate science. But he asks some hard questions:

1.) If I want to Validate and Verify climate models should I forget all the things that I have learned so far in the V&V discipline? (e.g. Model-Based Testing (Finite State Machine, Statecharts, Z, B), structural testing, code inspection, static analysis, model checking)
2.) Among all V&V techniques, what can really be reused / adapted for climate models?

Well, I wish I had some good answers. When I started looking at the software development processes for climate models, I expected to be able to apply many of the [edit] formal techniques I’ve worked on in the past in Verification and Validation (V&V) and Requirements Engineering (RE). It turns out almost none of it seems to apply, at least in any obvious way.

Climate models are built through a long, slow process of trial and error, continually seeking to improve the quality of the simulations (See here for an overview of how they’re tested). As this is scientific research, it’s unknown, a priori, what will work, what’s computationally feasible, etc. Worse still, the complexity of the earth systems being studied means its often hard to know which processes in the model most need work, because the relationship between particular earth system processes and the overall behaviour of the climate system is exactly what the researchers are working to understand.

Which means that model development looks most like an agile software development process, where the both the requirements and the set of techniques needed to implement them are unknown (and unknowable) up-front. So they build a little, and then explore how well it works. The closest they come to a formal specification is a set of hypotheses along the lines of:

“if I change <piece of code> in <routine>, I expect it to have <specific impact on model error> in <output variable> by <expected margin> because of <tentative theory about climactic processes and how they’re represented in the model>”

This hypothesis can then be tested by a formal experiment in which runs of the model with and without the altered code become two treatments, assessed against the observational data for some relevant period in the past. The expected improvement might be a reduction in the root mean squared error for some variable of interest, or just as importantly, an improvement in the variability (e.g. the seasonal or diurnal spread).

The whole process looks a bit like this (although, see Jakob’s 2010 paper for a more sophisticated view of the process):

And of course, the central V&V technique here is full integration testing. The scientists build and run the full model to conduct the end-to-end tests that constitute the experiments.

So the closest thing they have to a specification would be a chart such as the following (courtesy of Tim Johns at the UK Met Office):

This chart shows how well the model is doing on 34 selected output variables (click the graph to see a bigger version, to get a sense of what the variables are). The scores for the previous model version have been normalized to 1.0, so you can quickly see whether the new model version did better or worse for each output variable – the previous model version is the line at “1.0” and the new model version is shown as the coloured dots above and below the line. The whiskers show the target skill level for each variable. If the coloured dots are within the whisker for a given variable, then the model is considered to be within the variability range for the observational data for that variable. Colour-coded dots then show how well the current version did: green dots mean it’s within the target skill range, yellow mean it’s outside the target range, but did better than the previous model version, and red means it’s outside the target and did worse than the previous model version.

Now, as we know, agile software practices aren’t really amenable to any kind of formal verification technique. If you don’t know what’s possible before you write the code, then you can’t write down a formal specification (the ‘target skill levels’ in the chart above don’t count – these aspirational goals rather than specifications). And if you can’t write down a formal specification for the expected software behaviour, then you can’t apply formal reasoning techniques to determine if the specification was met.

So does this really mean, as Valdivino suggests, that we can’t apply any of our toolbox of formal verification methods? I think attempting to answer this would make a great research project. I have some ideas for places to look where such techniques might be applicable. For example:

  • One important built-in check in a climate model is ‘conservation of mass’. Some fluxes move mass between the different components of the model. Water is an obvious one – it’s evaporated from the oceans, to become part of the atmosphere, and is then passed to the land component as rain, thence to the rivers module, and finally back to the ocean. All the while, the total mass of water across all components must not change. Similar checks apply to salt, carbon (actually this does change due to emissions), and various trace elements. At present, such checks are built into the models as code assertions. In some cases, flux corrections were necessary because of imperfections in the numerical routines or the geometry of the grid, although in most cases, the models have improved enough that most flux corrections have been removed. But I think you could automatically extract from the code an abstracted model capturing just the ways in which these quantities change, and then use a model checker to track down and reason about such problems.
  • A more general version of the previous idea: In some sense, a climate model is a giant state-machine, but the scientists don’t ever build abstracted versions of it – they only work at the code level. If we build more abstracted models of the major state changes in each component of the model, and then do compositional verification over a combination of these models, it *might* offer useful insights into how the model works and how to improve it. At the very least, it would be an interesting teaching tool for people who want to learn about how a climate model works.
  • Climate modellers generally don’t use unit testing. The challenge is that they find it hard to write down correctness properties for individual code units. I’m not entirely clear how formal methods could help, but someone with experience of patterns for temporal logic properties might be able to contribute. Clune and Rood have a forthcoming paper on this in November’s IEEE Software. I suspect this is one of the easiest places to get started for software people new to climate models.
  • There’s one other kind of verification test that is currently done by inspection, but might be amenable to some kind of formalization: the check that the code correctly implements a given mathematical formula. I don’t think this will be a high-value tool, as the Fortran code is close enough to the mathematics that simple inspection is already very effective. But occasionally a subtle bug slips through – for example, I came across a case where the modellers discovered they had used the wrong logarithm (loge in place of log10), although this was due more to a lack of clarity in the original published paper than to a coding error.
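As an illustration of the first idea above, a conservation-of-mass assertion can be caricatured as follows (the component names and flat dictionary structure are hypothetical; in a real model the check runs over full 3-D Fortran fields):

```python
def total_water(components):
    """Total water inventory (kg) summed over all model components."""
    return sum(components.values())

def apply_fluxes(components, fluxes, tol=1e-9):
    """Move water between components, then assert global conservation.

    'fluxes' maps (source, destination) pairs to a mass of water in kg.
    The assertion mirrors the kind of check built into climate model code.
    """
    before = total_water(components)
    for (src, dst), kg in fluxes.items():
        components[src] -= kg
        components[dst] += kg
    after = total_water(components)
    # total water must be unchanged (to within rounding tolerance)
    assert abs(after - before) <= tol * max(abs(before), 1.0), \
        "conservation of water mass violated"
    return components
```

An abstracted model for a model checker would capture just these increments and decrements, stripped of the numerical detail, so that conservation could be reasoned about symbolically rather than checked at runtime.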

Feel free to suggest more ideas in the comments!

Occasionally I come across blog posts that I wish I’d written myself, because they capture so well some of the ideas I’ve been thinking about. Such is the case with Ricky Rood’s series on open climate models, over at Weather Underground (which is itself an excellent resource – particularly Jeff Masters’ Wunderblog):

  1. Greening of the Desert: Open Climate Models
  2. Stickiness and Climate Models
  3. Open Source Communities, What are the problems?

I’ve nothing really to add, other than to note that the points Ricky makes in the third post, on the need for governance, are crucial. Wikipedia is a huge success, but not because the technology is right (quite frankly, wikis rather suck from a usability point of view), nor because people are inherently good at massive collaborative projects. Wikipedia is a success because they got the social processes right that govern editing and quality control. Open source communities do the same. They’re not really as open as most people think – an inner core of people impose tight control over the vision for the project and the quality control of the code. And they sometimes struggle to keep the clueless newbies out, to stop them messing things up.

On Thursday, Tim Palmer of the University of Oxford and the European Centre for Medium-Range Weather Forecasts (ECMWF) gave the Bjerknes lecture, with a talk entitled “Towards a Community-Wide Prototype Probabilistic Earth-System Model”. For me, it was definitely the best talk of this year’s AGU meeting. [Update: the video of the talk is now up at the AGU webcasts page]

I should note of course, that this year’s Bjerknes lecture was originally supposed to have been given by Stephen Schneider, who sadly died this summer. Stephen’s ghost seems to hover over the entire conference, with many sessions beginning and ending with tributes to him. His photo was on the screens as we filed into the room, and the session began with a moment of silence for him. I’m disappointed that I never had a chance to see one of Steve’s talks, but I’m delighted they chose Tim Palmer as a replacement. And of course, he’s eminently qualified. As the introduction said: “Tim is a fellow of pretty much everything worth being a fellow of”, and one of the few people to have won both the Rossby and the Charney awards.

Tim’s main theme was the development of climate and weather forecasting models, especially the issue of probability and uncertainty. He began by reminding us that the name Bjerknes is iconic for this. Vilhelm Bjerknes set weather prediction on its current scientific course, by posing it as a problem in mathematical physics. His son, Jacob Bjerknes, pioneered our understanding of the mechanisms that underpin seasonal forecasting, particularly air-sea coupling.

If there’s one fly in the ointment though, it’s the issue of determinism. Lorenz put a stake into the heart of determinism, through his description of the butterfly effect. As an example, Tim showed the weather forecast for the UK for 13 Oct 1987, shortly before the “great storm” that turned the town of Sevenoaks [where I used to live!] into “No-oaks”. The forecast models pointed to a ridge moving in, whereas what developed was really a very strong vortex causing a serious storm.

Nowadays the forecast models are run many hundreds of times per day, to capture the inherent uncertainty in the initial conditions. A (retrospective) ensemble forecast for 13 Oct 1987 shows this was an inherently unpredictable set of circumstances. The approach now taken is to convert a large number of runs into a probabilistic forecast. This gives a tool for decision-making across a range of sectors that takes the uncertainty into account. And then, if you know your cost function, you can use the probabilities from the weather forecast to decide what to do. For example, if you were setting out to sail in the English Channel on 15 October 1987, you’d need both the probabilistic forecast *and* some measure of the cost/benefit of your voyage.
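The cost/loss reasoning in the sailing example can be made concrete. This is the standard textbook decision rule (not something spelled out in the talk itself): take protective action whenever the cost of protecting is less than the probability of the event times the loss it would cause.

```python
def event_probability(ensemble_members, event):
    """Fraction of ensemble members in which 'event' (a predicate) occurs."""
    hits = sum(1 for m in ensemble_members if event(m))
    return hits / len(ensemble_members)

def worth_protecting(ensemble_members, event, protection_cost, loss_if_event):
    """Classic cost/loss rule: protect iff C < p * L."""
    p = event_probability(ensemble_members, event)
    return protection_cost < p * loss_if_event
```

For the sailing example: feed in forecast wind speeds from each ensemble run, define the event as gale-force winds, and weigh the cost of postponing the voyage against the loss if you sail into the storm.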

The same probabilistic approach is used in seasonal forecasting, for example for the current forecasts of the progress of El Niño.

Moving on to the climate arena, what are the key uncertainties in climate predictions? The three key sources are: initial uncertainty, future emissions, and model uncertainty. As we go for longer and longer timescales, model uncertainty dominates – it becomes the paramount issue in assessing reliability of predictions.

Back in the 1970’s, life was simple. Since then, the models have grown dramatically in complexity as new earth system processes have been added. But at the heart of the models, the essential paradigm hasn’t changed. We believe we know the basic equations of fluid motion, expressed as differential equations. It’s quite amazing that 23 mathematical symbols are sufficient to express virtually all aspects of motion in air and oceans. But the problem comes in how to solve them. The traditional approach is to project them (e.g. onto a grid), to convert them into a large number of ordinary differential equations. And then the other physical processes have to be represented in a computationally tractable way. Some of this is empirical, based on observations, along with plausible assumptions on how these processes work.

These deterministic, bulk-parameter parameterizations are based on the presumption of a large ensemble of subgrid processes (e.g. deep convective cloud systems) within each grid box, which then means we can represent them by their overall statistics. Deterministic closures have a venerable history in fluid dynamics, and we can incorporate these subgrid closures into the climate models.

But there’s a problem. Observations indicate a shallow power law for atmospheric energy wavenumber spectra. In other words, there’s no scale separation between the resolved and unresolved scales in weather and climate. The power law is consistent with what one would deduce from the scaling symmetries of the Navier-Stokes equations, but it’s violated by conventional deterministic parameterizations.

But does it matter? Surely if we can do a half-decent job on the subgrid scales, it will be okay? Tim showed a lovely cartoon from Schertzer and Lovejoy, 1993:

As pointed out in the IPCC WG1 Chp8:

“Nevertheless, models still show significant errors. Although these are generally greater at smaller scales, important large-scale problems also remain. For example, deficiencies remain in the simulation of tropical precipitation, the El Niño-Southern Oscillation and the Madden-Julian Oscillation (an observed variation in tropical winds and rainfall with a time scale of 30 to 90 days). The ultimate source of most such errors is that many important small-scale processes cannot be represented explicitly in models, and so must be included in approximate form as they interact with larger-scale features.”

The figures from the IPCC report show the models doing a good job over the 20thC. But what’s not made clear is that each model has had its bias subtracted out before this was plotted, so you’re looking at anomalies relative to the model’s own climatology. In fact, there is an enormous spread of the models against reality.

At present, we don’t know how to close these equations, and a major part of the uncertainty is in these equations. So, a missing box on the diagram of the processes in Earth System Models is “UNCERTAINTY”.

What does the community do to estimate model uncertainty? The state of the art is the multi-model ensemble (e.g. CMIP5). The idea is to poll across the models to assess how broad the distribution is. But as everyone involved in the process understands, there are problems that are common to all of the models, because they are all based on the same basic approach to the underlying equations. And they also typically have similar resolutions.
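Mechanically, polling across a multi-model ensemble amounts to computing, for each quantity of interest, the mean and spread across the models – with the caveat from the paragraph above that shared structural errors make the spread an underestimate of the true uncertainty. A minimal sketch:

```python
def ensemble_stats(projections):
    """Mean and spread (sample standard deviation) across an ensemble
    of model projections for a single output quantity."""
    n = len(projections)
    mean = sum(projections) / n
    var = sum((x - mean) ** 2 for x in projections) / (n - 1)
    return mean, var ** 0.5
```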

Another pragmatic approach, to overcome the limitation of the number of available models, is to use perturbed physics ensembles – take a single model and perturb the parameters systematically. But this approach is blind to structural errors, because only one model is used as the basis.

A third approach is to use stochastic closure schemes for climate models. You replace the deterministic formulae with stochastic formulae. Potentially, we have a range of scales at which we can try this. For example, Tim has experimented with cellular automata to capture missing processes, which is attractive because it can also capture how the subgrid processes move from one grid box to another. These ideas have been implemented in the ECMWF models (and are described in the book Stochastic Physics and Climate Modelling).
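The stochastic closure idea can be caricatured in a few lines: replace a deterministic parameterized tendency with one multiplied by random noise, so that an ensemble of runs samples the subgrid uncertainty. This sketch loosely resembles perturbed-tendency schemes of the kind used at ECMWF, but the closure itself and all the numbers are purely illustrative:

```python
import random

def deterministic_closure(grid_state):
    """Hypothetical bulk parameterization: a tendency computed from
    the grid-box mean state (a stand-in for e.g. a convection scheme)."""
    return 0.1 * grid_state

def stochastic_closure(grid_state, sigma, rng):
    """The same closure with multiplicative Gaussian noise on the tendency."""
    return deterministic_closure(grid_state) * (1.0 + rng.gauss(0.0, sigma))

def ensemble_of_tendencies(grid_state, sigma=0.3, n=1000, seed=42):
    """Sample the stochastic closure many times, as an ensemble would."""
    rng = random.Random(seed)
    return [stochastic_closure(grid_state, sigma, rng) for _ in range(n)]
```

The ensemble is centred on the deterministic value but spreads around it, which is the point: the spread is an explicit representation of the subgrid uncertainty that a deterministic closure hides.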

So where do we go from here? Tim identified a number of reasons he’s convinced stochastic-dynamic parameterizations make sense:

1) More accurate accounts of uncertainty. For example, Weisheimer et al 2009 assessed the skill of seasonal forecasts made with various different types of ensemble, scoring each ensemble according to how well it captured the uncertainty – stochastic physics ensembles did slightly better than the other types.
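One standard way to score how well a probabilistic forecast captures uncertainty is the Brier score (a generic probabilistic skill measure – I’m not claiming it’s the exact metric Weisheimer et al used):

```python
def brier_score(forecast_probs, outcomes):
    """Mean squared difference between forecast probabilities and 0/1 outcomes.

    0 is a perfect score; a permanently 50/50 forecast earns 0.25.
    """
    pairs = list(zip(forecast_probs, outcomes))
    return sum((p - o) ** 2 for p, o in pairs) / len(pairs)
```

A sharper, better-calibrated ensemble earns a lower score than a vague one, so measures like this let you compare stochastic physics ensembles against multi-model or perturbed-physics ensembles on an equal footing.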

2) Stochastic closures could be more accurate. For example, Berner et al 2009 experimented with adding stochastic backscatter up the spectrum, imposed on the resolved scales. To evaluate it, they looked at model bias. Using the ECMWF model, they increased the resolution by a factor of 5, which is computationally very expensive, but reduces the bias in the model. They showed the backscatter scheme reduces the bias of the model in a way that’s not dissimilar to the increased-resolution model. It’s like adding symmetric noise, but it means that the model on average does the right thing.

3) Taking advantage of exascale computing. Tim recently attended a talk by Don Grice, an IBM chief engineer, about getting ready for exascale computing. Grice said “There will be a tension between energy efficiency and error detection”. What he meant was that if you insist on bit-reproducibility you will pay an enormous premium in energy use. So the end of bit-reproducibility might be in sight for High Performance Computing.

To Tim, this is music to his ears, as he thinks stochastic approaches will be the solution to this. He gave the example of Lyric Semiconductor, who are launching a new type of computer with 1000 times the performance, but at the cost of some accuracy – in other words, probabilistic computing.

4) More efficient use of human resources. The additional complexity in earth system models comes at a price – huge demands on human resources. For many climate modelling labs, the demands are too great. So perhaps we should pool our development teams, so that we’re not all busy trying to replicate each other’s codes.

Could we move to a more community-wide approach? It happened in the aerospace industry in Europe, when the various countries got together to form Airbus. Is it a good idea for climate modelling? Institutional directors take a dogmatic view that it’s a bad idea. The argument is that we need model diversity to have good estimates of uncertainty. Tim doesn’t want to argue against this, but points out that once we have a probabilistic modelling capability, we can test this statement objectively – in other words, we can test whether the multi-model ensemble really does better than a stochastic approach.

When we talk about modelling, it covers a large spectrum, from idealized mathematically tractable models through to comprehensive mathematical models. But this has led to a separation of the communities. The academic community develops the idealized models, while the software engineering groups in the met offices build the brute-force models.

Which brings Tim to the grand challenge: the academic community should help develop prototype probabilistic Earth System Models, based on innovative and physically robust stochastic-dynamics models. The effort has started already, at the Isaac Newton Institute. They are engaging mathematicians and climate modellers, looking at stochastic approaches to climate modelling. They have already set up a network, and Tim encouraged people who are interested to subscribe.

Finally, Tim commented on the issue of how to communicate the science in this post-Cancun, post-Climategate world. He went to a talk about how climate scientists should become much more emotional in communicating climate [presumably the authors’ session the previous day]. Tim wanted to give his own read on this. There is a wide body of opinion that the cost of major emissions cuts is not justified given current levels of uncertainty in climate predictions (and this body of opinion has strong political traction). Repeatedly appealing to the precautionary principle, and to our grandchildren, is not an effective approach: sceptics can bring out pictures of their grandchildren too, saying they don’t want them to grow up in a country bankrupted by bad climate policies.

We might not be able to move forward from the current stalemate without improving the accuracy of climate predictions. And are we (as scientists and government) doing all we possibly can to assess whether climate change will be disastrous, or something we can adapt to? Tim gives us 7/10 at present.

One thing we could do is to integrate NWP and seasonal-to-interannual prediction into this idea of seamless prediction. NWP and climate diverged in the 1960s, and need to come together again. If he had more time, he would talk about how data assimilation can be used as a powerful tool to test and improve the models. NWP models run at much finer resolution than climate models, but are enormously computationally expensive. So are governments giving the scientists all the tools they need? In Europe, they’re not getting enough computing resources to put onto this problem. So why aren’t we doing all we possibly can to reduce these uncertainties?

Update: John Baez has a great in-depth interview with Tim over at Azimuth.

In my last two posts, I demolished the idea that climate models need Independent Verification and Validation (IV&V), and I described the idea of a toolbox approach to V&V. Both posts were attacking myths: in the first case, the myth that an independent agent should be engaged to perform IV&V on the models, and in the second, the myth that you can critique the V&V of climate models without knowing anything about how they are currently built and tested.

I now want to expand on the latter point, and explain how the day-to-day practices of climate modellers taken together constitute a robust validation process, and that the only way to improve this validation process is just to do more of it (i.e. give the modeling labs more funds to expand their current activities, rather than to do something very different).

The most common mistake made by people discussing validation of climate models is to assume that a climate model is a thing-in-itself, and that the goal of validation is to demonstrate that some property holds of this thing. And whatever that property is, the assumption is that such measurement of it can be made without reference to its scientific milieu, and in particular without reference to its history and the processes by which it was constructed.

This mistake leads people to talk of validation in terms of how well “the model” matches observations, or how well “the model” matches the processes in some real world system. This approach to validation is, as Oreskes et al pointed out, quite impossible. The models are numerical approximations of complex physical phenomena. You can verify that the underlying equations are coded correctly in a given version of the model, but you can never validate that a given model accurately captures real physical processes, because it never will accurately capture them. Or as George Box summed it up: “All models are wrong…” (we’ll come back to the second half of the quote later).

The problem is that there is no such thing as “the model”. The body of code that constitutes a modern climate model actually represents an enormous number of possible models, each corresponding to a different way of configuring that code for a particular run. Furthermore, this body of code isn’t a static thing. The code is changed on a daily basis, through a continual process of experimentation and model improvement. Often these changes are done in parallel, so that there are multiple versions at any given moment, being developed along multiple lines of investigation. Sometimes these lines of evolution are merged, to bring a number of useful enhancements together into a single version. Occasionally, the lines diverge enough to cause a fork: a point at which they are different enough that it just becomes too hard to reconcile them (See for example, this visualization of the evolution of ocean models). A forked model might at some point be given a new name, but the process by which a model gets a new name is rather arbitrary.

Occasionally, a modeling lab will label a particular snapshot of this evolving body of code as an “official release”. An official release has typically been tested much more extensively, in a number of standard configurations for a variety of different platforms. It’s likely to be more reliable, and therefore easier for users to work with. By more reliable here, I mean relatively free from coding defects. In other words, it is better verified than other versions, but not necessarily better validated (I’ll explain why shortly). In many cases, official releases also contain some significant new science (e.g. new parameterizations), and these scientific enhancements will be described in a set of published papers.

However, an official release isn’t a single model either. Again it’s just a body of code that can be configured to run as any of a huge number of different models, and it’s not unchanging either – as with all software, there will be occasional bugfix releases applied to it. Oh, and did I mention that to run a model, you have to make use of a huge number of ancillary datafiles, which define everything from the shape of the coastlines and land surfaces, to the specific carbon emissions scenario to be used. Any change to these effectively gives a different model too.

So, if you’re hoping to validate “the model”, you have to say which one you mean: which configuration of which code version of which line of evolution, and with which ancillary files. I suppose those clamouring for something different in the way of model validation would say “well, the one used for the IPCC projections, of course”. Which is a little tricky, because each lab produces a large number of different runs for the CMIP process that provides input to the IPCC, and each of these is likely to involve a different model configuration.

But let’s say for sake of argument that we could agree on a specific model configuration that ought to be “validated”. What will we do to validate it? What does validation actually mean? The Oreskes paper I mentioned earlier already demonstrated that comparison with real world observations, while interesting, does not constitute “validation”. The model will never match the observations exactly, so the best we’ll ever get along these lines is an argument that, on balance, given the sum total of the places where there’s a good match and the places where there’s a poor match, that the model does better or worse than some other model. This isn’t validation, and furthermore it isn’t even a sensible way of thinking about validation.

At this point many commentators stop, and argue that if validation of a model isn’t possible, then the models can’t be used to support the science (or more usually, they mean they can’t be used for IPCC projections). But this is a strawman argument, based on a fundamental misconception of what validation is all about. Validation isn’t about checking that a given instance of a model satisfies some given criteria. Validation is about fitness for purpose, which means it’s not about the model at all, but about the relationship between a model and the purposes to which it is put. Or more precisely, it’s about the relationship between particular ways of building and configuring models and the ways in which runs produced by those models are used.

Furthermore, the purposes to which models are put and the processes by which they are developed co-evolve. The models evolve continually, and our ideas about what kinds of runs we might use them for evolve continually, which means validation must take this ongoing evolution into account. To summarize, validation isn’t about a property of some particular model instance; it’s about the whole process of developing and using models, and how this process evolves over time.

Let’s take a step back a moment, and ask what is the purpose of a climate model. The second half of the George Box quote is “…but some models are useful”. Climate models are tools that allow scientists to explore their current understanding of climate processes, to build and test theories, and to explore the consequences of those theories. In other words we’re dealing with three distinct systems:

We're dealing with relationships between three different systems

There does not need to be any clear relationship between the calculational system and the observational system – I didn’t include such a relationship in my diagram. For example, climate models can be run in configurations that don’t match the real world at all: e.g. a waterworld with no landmasses, or a world in which interesting things are varied: the tilt of the pole, the composition of the atmosphere, etc. These models are useful, and the experiments performed with them may be perfectly valid, even though they differ deliberately from the observational system.

What really matters is the relationship between the theoretical system and the observational system: in other words, how well does our current understanding (i.e. our theories) of climate explain the available observations (and of course the inverse: what additional observations might we make to help test our theories). When we ask questions about likely future climate changes, we’re not asking this question of the calculational system, we’re asking it of the theoretical system; the models are just a convenient way of probing the theory to provide answers.

By the way, when I use the term theory, I mean it in exactly the way it’s used throughout the sciences: a theory is the best current explanation of a given set of phenomena. The word “theory” doesn’t mean knowledge that is somehow more tentative than other forms of knowledge; a theory is actually the kind of knowledge that has the strongest epistemological basis of any kind of knowledge, because it is supported by the available evidence, and best explains that evidence. A theory might not be capable of providing quantitative predictions (but it’s good when it does), but it must have explanatory power.

In this context, the calculational system is valid as long as it can offer insights that help to understand the relationship between the theoretical system and the observational system. A model is useful as long as it helps to improve our understanding of climate, and to further the development of new (or better) theories. So a model that might have been useful (and hence valid) thirty years ago might not be useful today. If the old approach to modelling no longer matches current theory, then it has lost some or all of its validity. The model’s correspondence (or lack of) to the observations hasn’t changed (*), nor has its predictive power. But its utility as a scientific tool has changed, and hence its validity has changed.

[(*) except that the accuracy of the observations may have changed in the meantime, due to the ongoing process of discovering and resolving anomalies in the historical record.]

The key questions for validation then, are to do with how well the current generation of models (plural) support the discovery of new theoretical knowledge, and whether the ongoing process of improving those models continues to enhance their utility as scientific tools. We could focus this down to specific things we could measure by asking whether each individual change to the model is theoretically justified, and whether each such change makes the model more useful as a scientific tool.

To do this requires a detailed study of day-to-day model development practices, and the extent to which these are closely tied to the rest of climate science (e.g. field campaigns, process studies, etc). It also takes in questions such as how modeling centres decide on their priorities (e.g. which new bits of science to get into the models sooner), and how each individual change is evaluated. In this approach, validation proceeds by checking whether the individual steps taken to construct and test changes to the code add up to a sound scientific process, and how good this process is at incorporating the latest theoretical ideas. And we ought to be able to demonstrate a steady improvement in the theoretical basis for the model. An interesting quirk here is that sometimes an improvement to the model from a theoretical point of view reduces its skill at matching observations; this happens particularly when we’re replacing bits of the model that were based on empirical parameters with an implementation that has a stronger theoretical basis, because the empirical parameters were tuned to give a better climate simulation, without necessarily being well understood. In the approach I’m describing, this would be an indicator of an improvement in validity, even while it reduces the correspondence with observations. If on the other hand we based our validation on some measure of correspondence with observations, such a step would reduce the validity of the model!

But what does all of this tell us about whether it’s “valid” to use the models to produce projections of climate change into the future? Well, recall that when we ask for projections of future climate change, we’re not asking the question of the calculational system, because all that would result in is a number, or range of numbers, that are impossible to interpret, and therefore meaningless. Instead we’re asking the question of the theoretical system: given the sum total of our current theoretical understanding of climate, what is likely to happen in the future, under various scenarios for expected emissions and/or concentrations of greenhouse gases? If the models capture our current theoretical understanding well, then running the scenario on the model is a valid thing to do. If the models do a poor job of capturing our theoretical understanding, then running the models on these scenarios won’t be very useful.

Note what is happening here: when we ask climate scientists for future projections, we’re asking the question of the scientists, not of their models. The scientists will apply their judgement to select appropriate versions/configurations of the models to use, they will set up the runs, and they will interpret the results in the light of what is known about the models’ strengths and weaknesses and about any gaps between the computational models and the current theoretical understanding. And they will add all sorts of caveats to the conclusions they draw from the model runs when they present their results.

And how do we know whether the models capture our current theoretical understanding? By studying the processes by which the models are developed (i.e. continually evolved) by the various modeling centres, and examining how good each centre is at getting the latest science into the models. And by checking that whenever there are gaps between the models and the theory, these are adequately described by the caveats in the papers published about experiments with the models.

Summary: It is a mistake to think that validation is a post-hoc process to be applied to an individual “finished” model to ensure it meets some criteria for fidelity to the real world. In reality, there is no such thing as a finished model, just many different snapshots of a large set of model configurations, steadily evolving as the science progresses. And fidelity of a model to the real world is impossible to establish, because the models are approximations. In reality, climate models are tools to probe our current theories about how climate processes work. Validity is the extent to which climate models match our current theories, and the extent to which the process of improving the models keeps up with theoretical advances.

A common cry from climate contrarians is that climate models need better verification and validation (V&V), and in particular, that they need Independent V&V (aka IV&V). George Crews has been arguing this for a while, and now Judith Curry has taken up the cry. Having spent part of the 1990’s as lead scientist at NASA’s IV&V facility, and the last few years studying climate model development processes, I think I can offer some good insights into this question.

The short answer is “no, they don’t”. The slightly longer answer is “if you have more money to spend to enhance the quality of climate models, spending it on IV&V is probably the least effective thing you could do”.

The full answer involves deconstructing the question, to show that it is based on three incorrect assumptions about climate models: (1) that there’s some significant risk to society associated with the use of climate models; (2) that the existing models are inadequately tested / verified / validated / whatevered; and (3) that trust in the models can be improved by using an IV&V process. I will demonstrate what’s wrong with each of these assumptions, but first I need to explain what IV&V is.

Independent Verification and Validation (IV&V) is a methodology developed primarily in the aerospace industry for reducing the risk of software failures, by engaging a separate team (separate from the software development team, that is) to perform various kinds of testing and analysis on the software as it is produced. NASA adopted IV&V for development of the flight software for the space shuttle in the 1970’s. Because IV&V is expensive (it typically adds 10%-20% to the cost of a software development contract), NASA tried to cancel the IV&V on the shuttle in the early 1980’s, once the shuttle was declared operational. Then, of course, the Challenger disaster occurred. Although software wasn’t implicated, a consequence of the investigation was the creation of the Leveson committee, to review the software risk. Leveson’s committee concluded that, far from cancelling IV&V, NASA needed to adopt the practice across all of its space flight programs. As a result of the Leveson report, the NASA IV&V facility was established in the early 1990’s, as a centre of expertise for all of NASA’s IV&V contracts. In 1995, I was recruited as lead scientist at the facility, and while I was there, our team investigated the operational effectiveness of the IV&V contracts on the Space Shuttle, the International Space Station, the Earth Observation System, and Cassini, along with a few other smaller programs. (I also reviewed the software failures on NASA’s Mars missions in the 1990’s, and have a talk about the lessons learned.)

The key idea for IV&V is that when NASA puts out a contract to develop flight control software, it also creates a separate contract with a different company, to provide an ongoing assessment of software quality and risk as the development proceeds. One difficulty with IV&V contracts in the US aerospace industry is that it’s hard to achieve real independence, because industry consolidation has left very few aerospace companies available to take on such contracts, and they’re not sufficiently independent from one another.

NASA’s approach demands independence along three dimensions:

  • managerial independence (the IV&V contractor is free to determine how to proceed, and where to devote effort, independently of both the software development contractor and the customer);
  • financial independence (the funding for the IV&V contract is separate from the development contract, and cannot be raided if more resources are needed for development); and
  • technical independence (the IV&V contractor is free to develop its own criteria, and apply whatever V&V methods and tools it deems appropriate).

This has led to the development of a number of small companies who specialize only in IV&V (thus avoiding any contractual relationship with other aerospace companies), and who tend to recruit ex-NASA staff to provide them with the necessary domain expertise.

For the aerospace industry, IV&V has been demonstrated to be a cost effective strategy to improve software quality and reduce risk. The problem is that the risks are extreme: software errors in the control software for a spacecraft or an aircraft are highly likely to cause loss of life, loss of the vehicle, and/or loss of the mission. There is a sharp distinction between the development phase and the operation phase for such software: it had better be correct when it’s launched. Which means the risk mitigation has to be done during development, rather than during operation. In other words, iterative/agile approaches don’t work – you can’t launch with a beta version of the software. The goal is to detect and remove software defects before the software is ever used in an operational setting. An extreme example of this was the construction of the space station, where the only full end-to-end construction of the system was done in orbit; it wasn’t possible to put the hardware together on the ground in order to do a full systems test on the software.

IV&V is essential for such projects, because it overcomes natural confirmation bias of software development teams. Even the NASA program managers overseeing the contracts suffer from this too – we discovered one case where IV&V reports on serious risks were being systematically ignored by the NASA program office, because the program managers preferred to believe the project was going well. We fixed this by changing the reporting structure, and routing the IV&V reports directly to the Office of Safety and Mission Assurance at NASA headquarters. The IV&V teams developed their own emergency strategy too – if they encountered a risk that they considered mission-critical, and couldn’t get the attention of the program office to address it, they would go and have a quiet word with the astronauts, who would then ensure the problem got seen to!

But IV&V is very hard to do right, because much of it is a sociological problem rather than a technical problem. The two companies (developer and IV&V contractor) are naturally set up in an adversarial relationship, but if they act as adversaries, they cannot be effective: the developer will have a tendency to hide things, and the IV&V contractor will have a tendency to exaggerate the risks. Hence, we observed that the relationship is most effective where there is a good horizontal communication channel between the technical staff in each company, and where they come to respect one another’s expertise. The IV&V contractor has to be careful not to swamp the communication channels with spurious low-level worries, and the development contractor must be willing to respond positively to criticism. One way this works very well is for the IV&V team to give the developers advance warning of any issues they plan to report up the hierarchy to NASA, so that the development contractor can have a solution in place even before NASA asks for it. For a more detailed account of these coordination and communication issues, see:

Okay, let’s look at whether IV&V is applicable to climate modeling. Earlier, I identified three assumptions made by people advocating it. Let’s take them one at a time:

1) The assumption there’s some significant risk to society associated with the use of climate models.

A large part of the mistake here is to misconstrue the role of climate models in policymaking. Contrarians tend to start from an assumption that proposed climate change mitigation policies (especially any attempt to regulate emissions) will wreck the economies of the developed nations (or specifically the US economy, if it’s an American contrarian). I prefer to think that a massive investment in carbon-neutral technologies will be a huge boon to the world’s economy, but let’s set aside that debate, and assume for the sake of argument that whatever policy path the world takes, it’s incredibly risky, with a non-negligible probability of global catastrophe if the policies are either too aggressive or not aggressive enough, i.e. if the scientific assessments are wrong.

The key observation is that software does not play the same role in this system that flight software does for a spacecraft. For a spacecraft, the software represents a single point of failure. An error in the control software can immediately cause a disaster. But climate models are not control systems, and they do not determine climate policy. They don’t even control it indirectly – policy is set by a laborious process of political manoeuvring and international negotiation, in which the impact of any particular climate model is negligible.

Here’s what happens: the IPCC committees propose a whole series of experiments for the climate modelling labs around the world to perform, as part of a Coupled Model Intercomparison Project. Each participating lab chooses those runs they are most able to do, given their resources. When they have completed their runs, they submit the data to a public data repository. Scientists around the world then have about a year to analyze this data, interpret the results, to compare performance of the models, discuss findings at conferences and workshops, and publish papers. This results in thousands of publications from across a number of different scientific disciplines. The publications that make use of model outputs take their place alongside other forms of evidence, including observational studies, studies of paleoclimate data, and so on. The IPCC reports are an assessment of the sum total of the evidence; the model results from many runs of many different models are just one part of that evidence. Jim Hansen rates models as the third most important source of evidence for understanding climate change, after (1) paleoclimate studies and (2) observed global changes.

The consequences of software errors in a model, in the worst case, are likely to extend to no more than a few published papers being retracted. This is a crucial point: climate scientists don’t blindly publish model outputs as truth; they use model outputs to explore assumptions and test theories, and then publish papers describing the balance of evidence. Further papers then come along that add more evidence, or contradict the earlier findings. The assessment reports then weigh up all these sources of evidence.

I’ve been asking around for a couple of years for examples of published papers that were subsequently invalidated by software errors in the models. I’ve found several cases where a version of the model used in the experiments reported in a published paper was later found to contain an important software bug. But in none of those cases did the bug actually invalidate the conclusions of the paper. So even this risk is probably overstated.

The other point to make is that around twenty different labs around the world participate in the Model Intercomparison Projects that provide data for the IPCC assessments. That’s a level of software redundancy that is simply impossible in the aerospace industry. It’s likely that these 20+ models are not quite as independent as they might be (e.g. see Knutti’s analysis of this), but even so, the ability to run many different models on the same set of experiments, and to compare and discuss their differences is really quite remarkable, and the Model Intercomparison Projects have been a major factor in driving the science forward in the last decade or so. It’s effectively a huge benchmarking effort for climate models, with all the benefits normally associated with software benchmarking (and worthy of a separate post – stay tuned).
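The intercomparison analyses described above come down to comparing the same diagnostic across many independently developed models. Here is a minimal sketch of that kind of analysis, computing a multi-model mean and inter-model spread; the model names and numbers are invented for illustration, not real CMIP output:

```python
# Hypothetical sketch of one model-intercomparison analysis step: given the
# same diagnostic (say, an annual global temperature anomaly for one scenario)
# from several models, compute the multi-model mean and the inter-model range.

def multi_model_stats(runs):
    """runs: dict mapping model name -> list of annual anomalies (degrees C).
    Returns (per-year multi-model mean, per-year min-max spread)."""
    years = len(next(iter(runs.values())))
    assert all(len(series) == years for series in runs.values())
    means, spreads = [], []
    for y in range(years):
        vals = [series[y] for series in runs.values()]
        means.append(sum(vals) / len(vals))
        spreads.append(max(vals) - min(vals))
    return means, spreads

runs = {
    "model_a": [0.1, 0.3, 0.5],   # made-up anomalies, degrees C
    "model_b": [0.2, 0.2, 0.6],
    "model_c": [0.0, 0.4, 0.7],
}
means, spreads = multi_model_stats(runs)
print(means)    # per-year ensemble means
print(spreads)  # per-year inter-model ranges
```

Where the models disagree (a large spread), that disagreement is itself informative: it points analysts at the processes where the models differ structurally.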

So in summary, while there are huge risks to society of getting climate policy wrong, those risks are not software risks. A single error in the flight software for a spacecraft could kill the crew. A single error in a climate model can, at most, only affect a handful of the thousands of published papers on which the IPCC assessments are based. The actual results of a particular model run are far less important than the understanding the scientists gain about what the model is doing and why, and the nature of the uncertainties involved. The modellers know that the models are imperfect approximations of very complex physical, chemical and biological processes. Conclusions about key issues such as climate sensitivity are based not on particular model runs, but on many different experiments with many different models over many years, and the extent to which these experiments agree or disagree with other sources of evidence.

2) the assumption that the current models are inadequately tested / verified / validated / whatevered;

This is a common talking point among contrarians. Part of the problem is that while the modeling labs have evolved sophisticated processes for developing and testing their models, they rarely bother to describe these processes to outsiders – nearly all published reports focus on the science done with the models, rather than the modeling process itself. I’ve been working to correct this, with, first, my study of the model development processes at the UK Met Office, and more recently my comparative studies of other labs, and my accounts of the existing V&V processes. Some people have interpreted the latter as a proposal for what should be done, but it is not; it is an account of the practices currently in place across all of the labs I have studied.

A key point is that for climate models, unlike spacecraft flight controllers, there is no enforced separation between software development and software operation. A climate model is always an evolving, experimental tool; it’s never a finished product. Even the prognostic runs done as input to the IPCC process are just experiments, requiring careful interpretation before any conclusions can be drawn. If the model crashes, or gives crazy results, the only damage is wasted time.

This means that an iterative development approach is the norm, which is far superior to the waterfall process used in the aerospace industry. Climate modeling labs have elevated the iterative development process to a new height: each change to the model is treated as a scientific experiment, where the change represents a hypothesis for how to improve the model, and a series of experiments is used to test whether the hypothesis was correct. This means that software development proceeds far more slowly than commercial software practices (at least in terms of lines of code per day), but that the models are continually tested and challenged by the people who know them inside out, and comparison with observational data is a daily activity.

The result is that climate models have very few bugs, compared to commercial software, when measured using industry standard defect density measures. However, although defect density is a standard IV&V metric, it’s probably a poor measure for this type of software – it’s handy for assessing risk of failure in a control system, but a poor way of assessing the validity and utility of a climate model. The real risk is that there may be latent errors in the model that mean it isn’t doing what the modellers designed it to do. The good news is that such errors are extremely rare: nearly all coding defects cause problems that are immediately obvious: the model crashes, or the simulation becomes unstable. Coding defects can only remain hidden if they have an effect that is small enough that it doesn’t cause significant perturbations in any of the diagnostic variables collected during a model run; in this case they are indistinguishable from the acceptable imperfections that arise as a result of using approximate techniques. The testing processes for the climate models (which in most labs include a daily build and automated test across all reference configurations) are sufficient that such problems are nearly always identified relatively early.
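The daily build-and-test process described above can be sketched in miniature: re-run the reference configurations after each change, and flag any diagnostic variable that drifts from its stored baseline. The variable names and tolerance below are illustrative assumptions, not any lab’s actual test harness:

```python
# Hedged sketch of a regression check over a model's diagnostic variables.
# A code change that is intended to be answer-neutral should produce zero
# drift against the baseline; a small tolerance accommodates accepted
# answer-changing commits.

def check_diagnostics(new, baseline, rtol=1e-12):
    """Compare a new run's diagnostics against a stored baseline run.
    new, baseline: dicts mapping diagnostic name -> value.
    Returns a list of (name, reason) for every flagged diagnostic."""
    failures = []
    for name, ref in baseline.items():
        val = new.get(name)
        if val is None:
            failures.append((name, "missing"))
        elif abs(val - ref) > rtol * max(abs(ref), 1.0):
            failures.append((name, f"drift: {val} vs {ref}"))
    return failures

# Invented example values for two diagnostics from a reference configuration:
baseline = {"global_mean_temp": 287.4, "toa_net_flux": 0.8}
good_run = {"global_mean_temp": 287.4, "toa_net_flux": 0.8}
bad_run  = {"global_mean_temp": 291.0, "toa_net_flux": 0.8}
print(check_diagnostics(good_run, baseline))  # []
print(check_diagnostics(bad_run, baseline))   # flags the temperature drift
```

The point of such a harness is exactly the one made above: a coding defect large enough to matter will perturb the collected diagnostics, and so gets caught early.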

This means that there are really only two serious error types that can lead to misleading scientific results: (1) misunderstanding of what the model is actually doing by the scientists who conduct the model experiments, and (2) structural errors, where specific earth system processes are omitted or poorly captured in the model. In flight control software, these would correspond to requirements errors, and would be probed by an IV&V team through specification analysis. Catching these in control software is vital because you only get one chance to get it right. But in climate science, these are science errors, and are handled very well by the scientific process: making such mistakes, learning from them, and correcting them are all crucial parts of doing science. The normal scientific peer review process handles these kinds of errors very well. Model developers publish the details of their numerical algorithms and parameterization schemes, and these are reviewed and discussed in the community. In many cases, different labs will attempt to build their own implementations from these descriptions, and in the process subject them to critical scrutiny. In other words, there is already an independent expert review process for the most critical parts of the models, using the normal scientific route of replicating one another’s techniques. Similarly, experimental results are published, and the data is made available for other scientists to explore.

As a measure of how well this process works for building scientifically valid models, one senior modeller recently pointed out to me that it’s increasingly the case now that when the models diverge from the observations, it’s often the observational data that turns out to be wrong. The observational data is itself error prone, and software models turn out to be an important weapon in identifying and eliminating such errors.

However, there is another risk here that needs to be dealt with. Outside of the labs where the models are developed, there is a tendency for scientists who want to make use of the models to treat them as black box oracles. Proper use of the models depends on a detailed understanding of their strengths and weaknesses, and the ways in which uncertainties are handled. If we have some funding available to improve the quality of climate models, it would be far better spent on improving the user interfaces, and better training of the broader community of model users.

The bottom line is that climate models are subjected to very intensive system testing, and the incremental development process incorporates a sophisticated regression test process that’s superior to most industrial software practices. The biggest threat to validity of climate models is errors in the scientific theories on which they are based, but such errors are best investigated through the scientific process, rather than through an IV&V process. Which brings us to:

(3) the assumption that trust in the models can be improved by an IV&V process;

IV&V is essentially a risk management strategy for safety-critical software for which an iterative development strategy is not possible – where the software has to work correctly the first (and every) time it is used in an operational setting. Climate models aren’t like this at all. They aren’t safety critical; they can be used even while they are being developed (and hence are built by iterative refinement); and they solve complex, wicked problems, for which there are no clear correctness criteria. In fact, as a species of software development process, I’ve come to the conclusion they are dramatically different from any of the commercial software development paradigms that have been described in the literature.

A common mistake in the software engineering community is to think that software processes can be successfully transplanted from one organisation to another. Our comparative studies of different software organisations show that this is simply not true, even for organisations developing similar types of software. There are few, if any, documented cases of a software development organisation successfully adopting a process model developed elsewhere without very substantial tailoring. What usually happens is that ideas from elsewhere are gradually infused and re-fashioned to work in the local context. And the evidence shows that every software organisation evolves its own development processes, highly dependent on local context and on the constraints it operates under. Far more important than a prescribed process is the development of a shared understanding within the software team. The idea of taking a process model that was developed in the aerospace industry, and transplanting it wholesale into a vastly different kind of software development (climate modeling), is quite simply ludicrous.

For example, one consequence of applying IV&V is that it reduces flexibility for the development team, as they have to set clearer milestones and deliver workpackages on schedule (otherwise the IV&V team cannot plan their efforts). Because the development of scientific codes is inherently unpredictable, it would be almost impossible to plan and resource an IV&V effort. The flexibility to explore new model improvements opportunistically, and to adjust schedules to match varying scientific rhythms, is crucial to the scientific mission – locking the development into more rigid schedules to permit IV&V would be a disaster.

If you wanted to set up an IV&V process for climate models, it would have to be done by domain experts; domain expertise is the single most important factor in successful use of IV&V in the aerospace industry. This means it would have to be done by other climate scientists. But other climate scientists already do this routinely – it’s built into the Model Intercomparison Projects, as well as the peer review process and through attempts to replicate one another’s results. In fact the Model Intercomparison Projects already achieve far more than an IV&V process would, because they are done in the open and involve a much broader community.

In other words, the available pool of talent for performing IV&V is already busy using a process that’s far more effective than IV&V ever can be: it’s called doing science. Actually, I suspect that those people calling for IV&V of climate models are really trying to say that climate scientists can’t be trusted to check each other’s work, and that some other (unspecified) group ought to do the IV&V for them. However, this argument can only be used by people who don’t understand what IV&V is. IV&V works in the aerospace industry not because of any particular process, but because it brings in the experts – the people with grey hair who understand the flight systems inside out, and understand all the risks.

And remember that IV&V is expensive. NASA’s rule of thumb was an additional 10%-20% of the development cost. This cannot be taken from the development budget – it’s strictly an additional cost. Given my estimate of the development cost of a climate model as somewhere in the ballpark of $350 million, we’d need to find another $35 million for each climate modeling centre to fund its IV&V contract. And if we had such funds to add to their budgets, I would argue that IV&V is one of the least sensible ways of spending this money. Instead, I would:

  • Hire more permanent software support staff to work alongside the scientists;
  • Provide more training courses to give the scientists better software skills;
  • Do more research into modeling frameworks;
  • Experiment with incremental improvements to existing practices, such as greater use of testing tools and frameworks, pair programming and code sprints;
  • Provide more support to grow the user communities (e.g. user workshops and training courses), along with more community building and beta testing;
  • Document the existing software development and V&V best practices, so that different labs can share ideas and experiences, and the process of model building becomes more transparent to outsiders.

To summarize, IV&V would be an expensive mistake for climate modeling. It would divert precious resources (experts) away from existing modeling teams, and reduce their flexibility to respond to the science. IV&V isn’t appropriate because this isn’t mission- or safety-critical software, it doesn’t have distinct development and operational phases, and the risks of software error are minor. There’s no single point of failure, because many labs around the world build their own models, and the normal scientific processes of experimentation, peer review, replication, and model inter-comparison already provide a sophisticated way to examine the scientific validity of the models. Virtually all coding errors are detected in routine testing, and science errors are best handled through the usual scientific process, rather than through an IV&V process. Furthermore, there is only a small pool of experts available to perform IV&V on climate models (namely, other climate modelers), and they are already hard at work improving their own models. Re-deploying them to do IV&V of each other’s models would reduce the overall quality of the science rather than improving it.

(BTW I shouldn’t have had to write this article at all…)

Following my post last week about Fortran coding standards for climate models, Tim reminded me of a much older paper that was very influential in the creation (and sharing) of coding standards across climate modeling centers:

The paper is the result of a series of discussions in the mid-1980s across many different modeling centres (the paper lists 11 labs) about how to facilitate sharing of code modules. To simplify things, the paper assumes what is being shared are parameterization modules that operate in a single column of the model. Of course, this was back in the 1980s, which means the models were primarily atmospheric models, rather than the more comprehensive earth system models of today. The dynamical core of the model handles most of the horizontal processes (e.g. wind), which means that most of the remaining physical processes (the subject of these parameterizations) affect what happens vertically within a single column, e.g. by affecting radiative or convective transfer of heat between the layers. Plugging in new parameterization modules becomes much easier if this assumption holds, because the new module needs to be called once per time step per column, and if it doesn’t interact with other columns, it doesn’t mess up the vectorization. The paper describes a number of coding conventions, effectively providing an interface specification for single-column parameterizations.
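The single-column convention can be made concrete with a toy sketch (in Python rather than the Fortran of real models, with an invented column state and an invented relaxation scheme): each parameterization is a routine called once per timestep per column, and it reads and updates only that column’s state, so it can be swapped for another scheme with the same signature without disturbing the rest of the model:

```python
# Illustrative sketch of a plug-compatible single-column parameterization.
# The Column layout and the "radiation" scheme are hypothetical; the point is
# the interface: one call per timestep per column, no cross-column access.

from dataclasses import dataclass

@dataclass
class Column:
    temp: list      # temperature of each vertical layer (K)
    pressure: list  # pressure of each layer (hPa)

def simple_radiation(col, dt):
    """Toy parameterization: relax each layer toward 250 K on a 10-day
    timescale (numbers are illustrative only). Touches only this column,
    so it doesn't break vectorization across columns."""
    tau = 10 * 86400.0
    col.temp = [t + dt * (250.0 - t) / tau for t in col.temp]

def physics_step(columns, dt, parameterizations):
    # The driver loops over columns and applies each plug-in scheme in turn.
    for col in columns:
        for scheme in parameterizations:
            scheme(col, dt)

cols = [Column(temp=[220.0, 260.0, 288.0], pressure=[250.0, 500.0, 1000.0])]
physics_step(cols, dt=1800.0, parameterizations=[simple_radiation])
print(cols[0].temp)  # each layer nudged slightly toward 250 K
```

Under this discipline, adding a new scheme is just adding another function to the `parameterizations` list – which is exactly the plug-compatibility the coding conventions were meant to enable, and exactly what breaks down when a process needs to interact with neighbouring columns or with other components.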

An interesting point about this paper is that it popularized the term “plug compatibility” amongst the modeling community, along with the (implicit) broader goal of designing all models to be plug-compatible (although it cites Pielke & Arrit for the origin of the term). Unfortunately, the goal still seems very elusive. While most modelers will accept that plug-compatibility is desirable, a few people I’ve spoken to are very skeptical that it’s actually possible. Perhaps the strongest statement on this is from:

  • Randall DA. A University Perspective on Global Climate Modeling. Bulletin of the American Meteorological Society. 1996;77(11):2685-2690.
    p2687: “It is sometimes suggested that it is possible to make a plug-compatible global model so that an “outside” scientist can “easily make changes”. With a few exceptions (e.g. radiation codes), however, this is a fantasy, and I am surprised that such claims are not greeted with more skepticism.”

He goes on to describe instances where parameterizations have been transplanted from one model to another, likening the process to a major organ transplant, only more painful. The problem is that the various processes of the earth system interact in complex ways, and these complex interactions have to be handled properly in the code. As Randall puts it: “…the reality is that a global model must have a certain architectural unity or it will fail”. In my interviews with climate modellers, I’ve heard many tales of it taking months, and sometimes years, of effort to take a code module contributed by someone outside the main modeling group and make it work properly in the model.

So plug compatibility and code sharing sound great in principle. In practice, no amount of interface specification and coding standards can reduce the essential complexity of earth system processes.

Note: most of the above is about plug compatibility of parameterization modules (i.e. code packages that live within the green boxes on the Bretherton diagram). More progress has been made (especially in the last decade) in standardizing the interfaces between major earth system components (i.e. the arrows on the Bretherton diagram). That’s where standardized couplers come in – see my post on the high level architecture of earth system models for an introduction. The IS-ENES workshop on coupling technologies in December will be an interesting overview of the state of the art here, although I won’t be able to attend, as it clashes with the AGU meeting.
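To make the coupler idea concrete, here is a deliberately minimal, hypothetical sketch: each component exposes named export fields, and the coupler copies them to the importing component at each coupling interval. Real couplers also handle regridding between component grids, time interpolation, and parallel data movement; every name below is invented for illustration:

```python
# Hypothetical sketch of a standardized coupling interface between major
# earth system components (the arrows on the Bretherton diagram).

class Component:
    def __init__(self, name, exports):
        self.name = name
        self.exports = dict(exports)  # fields this component provides
        self.imports = {}             # fields delivered by the coupler

class Coupler:
    def __init__(self, links):
        # links: list of (source component, field name, target component)
        self.links = links

    def exchange(self):
        for src, fieldname, dst in self.links:
            # A real coupler would regrid/interpolate here; this sketch
            # just copies the field across the component boundary.
            dst.imports[fieldname] = src.exports[fieldname]

atmos = Component("atmosphere", {"precip": 2.5, "wind_stress": 0.1})
ocean = Component("ocean", {"sst": 288.0})
coupler = Coupler([(atmos, "precip", ocean),
                   (atmos, "wind_stress", ocean),
                   (ocean, "sst", atmos)])
coupler.exchange()
print(atmos.imports["sst"])     # 288.0
print(ocean.imports["precip"])  # 2.5
```

Standardizing at this level works better than plug-compatible parameterizations because the exchanged fields at component boundaries are relatively few and well understood, whereas parameterizations are entangled with the model’s internal architecture.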

After an exciting sabbatical year spent visiting a number of climate modeling centres, I’ll be back to teaching in January. I’ll be introducing two brand new courses, both related to climate modeling. I already blogged about my new grad course on “Climate Change Informatics”, which will cover many current research issues to do with software and data in climate science.

But I didn’t yet mention my new undergrad course: I’ll be teaching a 199 course in January. 199 courses are first-year seminar courses, open to all new students across the faculty of arts and science, intended to encourage critical thinking, communication and research skills. They are run as small group seminar courses (enrolment is capped at 24 students). I’ve never taught one of these courses before, so I’ve no idea what to expect – I’m hoping for an interesting mix of students with different backgrounds, so we can spend some time attacking the theme of the course from different perspectives. Here’s my course description:

“Climate Change: Software, Science and Society”

This course will examine the role of computers and software in understanding climate change. We will explore the use of computer models to build simulations of the global climate, including a historical view of the use of computer models to understand weather and climate, and a detailed look at the current state of computer modelling, especially how global climate models are tested, what kinds of experiments are performed with them, how scientists know they can trust the models, and how they deal with uncertainty. The course will also explore the role of computer models in helping to shape society’s responses to climate change, in particular, what they can (and can’t) tell us about how to make effective decisions about government policy, international treaties, community action and the choices we make as individuals. The course will take a cross-disciplinary approach to these questions, looking at the role of computer models in the physical sciences, environmental science, politics, philosophy, sociology and economics of climate change. However, students are not expected to have any specialist knowledge in any of these fields prior to the course.

If all goes well, I plan to include some hands-on experimentation with climate models, perhaps using EdGCM (or even CESM if I can simplify the process of installing it and running it for them). We’ll also look at how climate models are perceived in the media and blogosphere (that will be interesting!) and compare these perceptions to what really goes on in climate modelling labs. Of course, the nice thing about a small seminar course is that I can be flexible about responding to the students’ own interests. I’m really looking forward to this…

I had some interesting chats in the last few days with Christian Jakob, who’s visiting Hamburg at the same time as me. He’s just won a big grant to set up a new Australian Climate Research Centre, so we talked a lot about what models they’ll be using at the new centre, and the broader question of how to manage collaborations between academics and government research labs.

Christian has a paper coming out this month in BAMS on how to accelerate progress in climate model development. He points out that much of the progress now depends on the creation of new parameterizations for physical processes, but doing this more effectively requires better collaboration between the groups of people who run the coupled models and assess overall model skill, and the people who analyze observational data to improve our understanding (and simulation) of particular climate processes. The key point he makes in the paper is that process studies are often undertaken because they are interesting, or because data is available, but without much idea of whether improving a particular process will have any impact on overall model skill; conversely, model skill is analyzed at modeling centers without much follow-through to identify which processes might be to blame for model weaknesses. Both activities lead to insights, but better coordination between them would help to push model development further and faster. Not that it’s easy, of course: coupled models are now sufficiently complex that it’s notoriously hard to pin down the role of specific physical processes in overall model skill.

So we talked a lot about how the collaboration works. One problem seems to stem from the value of the models themselves. Climate models are like very large, very expensive scientific instruments. Only large labs (typically at government agencies) can now afford to develop and maintain fully fledged earth system models. And even then the full cost is never adequately accounted for in the labs’ funding arrangements. Funding agencies understand the costs of building and operating physical instruments, like large telescopes, or particle accelerators, as shared resources across a scientific community. But because software is invisible and abstract, they don’t think of it in the same way – there’s a tendency to think that it’s just part of the IT infrastructure, and can be developed by institutional IT support teams. But of course, the climate models need huge amounts of specialist expertise to develop and operate, and they really do need to be funded like other large scientific instruments.

The complexity of the models and the lack of adequate funding for model development means that the institutions that own the models are increasingly conservative in what they do with them. They work on small incremental changes to the models, and don’t undertake big revolutionary changes – they can’t afford to take the risk. There are some examples of labs taking such risks: for example, in the early 1990s ECMWF re-wrote their model from scratch, driven in part by the need to make it more adaptable to new, highly parallel, hardware architectures. It took several years, and a big team of coders, bringing in the scientific experts as needed. At the end of it, they had a model that was much cleaner, and (presumably) more adaptable. But scientifically, it was no different from the model they had previously. Hence, lots of people felt this was not a good use of their time – they could have made better scientific progress during that time by continuing to evolve the old model. And that was years ago – the likelihood of labs making such radical changes these days is very low.

On the other hand, academics can try the big, revolutionary stuff – if it works, they get lots of good papers about how they’re pushing the frontiers, and if it doesn’t, they can write papers about why some promising new approach didn’t work as expected. But then getting their changes accepted into the models is hard. A key problem here is that there’s no real incentive for them to follow through. Academics are judged on papers, so once the paper is written they are done. But at that point, the contribution to the model is still a long way from being ready to incorporate for others to use. Christian estimates that it takes at least as long again to get a change ready to incorporate into a model as it does to develop it in the first place (and that’s consistent with what I’ve heard other modelers say). The academic has no incentive to continue to work on it to get it ready, and the institutions have no resources to take it and adopt it.

So again we’re back to the question of effective collaboration, beyond what any one lab or university group can do. And the need to start treating the models as expensive instruments, with much higher operation and maintenance costs than anyone has yet acknowledged. In particular, modeling centers need resources for a much bigger staff to support the efforts by the broader community to extend and improve the models.

Excellent news: Jon Pipitone has finished his MSc project on the software quality of climate models, and it makes fascinating reading. I quote his abstract here:

A climate model is an executable theory of the climate; the model encapsulates climatological theories in software so that they can be simulated and their implications investigated. Thus, in order to trust a climate model one must trust that the software it is built from is built correctly. Our study explores the nature of software quality in the context of climate modelling. We performed an analysis of the reported and statically discoverable defects in several versions of leading global climate models by collecting defect data from bug tracking systems, version control repository comments, and from static analysis of the source code. We found that the climate models all have very low defect densities compared to well-known, similarly sized open-source projects. As well, we present a classification of static code faults and find that many of them appear to be a result of design decisions to allow for flexible configurations of the model. We discuss the implications of our findings for the assessment of climate model software trustworthiness.

The idea for the project came from an initial back-of-the-envelope calculation we did of the Met Office Hadley Centre’s Unified Model, in which we estimated the number of defects per thousand lines of code (a common measure of defect density in software engineering) to be extremely low – of the order of 0.03 defects/KLoC. By comparison, the shuttle flight software, reputedly the most expensive software per line of code ever built, clocked in at 0.1 defects/KLoC; most of the software industry does worse than this.
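For anyone who wants to check the arithmetic, defect density is just the count of reported defects divided by the size of the codebase in thousands of lines. A quick sketch (the numbers here are illustrative, chosen to roughly match the figure quoted above, not a new measurement):

```python
# Defect density = reported defects per thousand lines of code (KLoC).
# The example numbers are made up for illustration only.

def defect_density(defects, lines_of_code):
    """Return defects per thousand lines of code (defects/KLoC)."""
    return defects / (lines_of_code / 1000.0)

# e.g. a hypothetical model of ~830 KLoC with 25 recorded defect fixes:
print(round(defect_density(25, 830_000), 3))  # ≈ 0.03 defects/KLoC
```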

This initial result was startling, because the climate scientists who build this software don’t follow any of the software processes commonly prescribed in the software literature. Indeed, when you talk to them, many climate modelers are a little apologetic about this, and have a strong sense they ought to be doing a more rigorous job with their software engineering. However, as we documented in our paper, climate modeling centres such as the UK Met Office do have excellent software processes, which they have developed over many years to suit their needs. I’ve come to the conclusion that their approach has to be very different from mainstream software engineering processes because the context is so very different.

Well, obviously we were skeptical (scientists are always skeptical, especially when results seem to contradict established theory). So Jon set about investigating this more thoroughly for his MSc project. He tackled the question in three ways: (1) measuring defect density, by using bug repositories, version history and change logs to quantify bug fixes; (2) assessing the software directly using static analysis tools; and (3) interviewing climate modelers to understand how they approach software development, and bug fixing in particular.
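The details of how Jon mined the repositories are in the thesis; as a rough illustration of approach (1), the core idea is to count commits whose log messages look like bug fixes. The keyword heuristic below is my own guess at the flavour of it, not Jon’s actual method:

```python
import re

# Hypothetical heuristic: count commits whose messages suggest a bug fix.
# The keyword list is illustrative; a real study would calibrate it against
# the project's bug tracker and interview data.
FIX_PATTERN = re.compile(r"\b(fix(es|ed)?|bug|defect|fault)\b", re.IGNORECASE)

def count_bug_fixes(commit_messages):
    """Count commit messages that match the bug-fix keyword pattern."""
    return sum(1 for msg in commit_messages if FIX_PATTERN.search(msg))

log = [
    "Add new sea-ice albedo scheme",
    "Fix array bounds error in radiation code",
    "Bug in coupler field regridding corrected",
    "Update documentation",
]
print(count_bug_fixes(log))  # 2
```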

I think there are two key results of Jon’s work:

  1. The initial results on defect density bear up. Although not quite as startlingly low as my back-of-the-envelope calculation, Jon’s assessment of three major GCMs indicates they all fall in the range commonly regarded as good quality software by industry standards.
  2. There are a whole bunch of reasons why result #1 may well be meaningless, because the metrics for measuring software quality don’t really apply well to large scale scientific simulation models.

You’ll have to read Jon’s thesis to get all the details, but it will be well worth it. The conclusion? More research needed. It opens up plenty of questions for a PhD project….

This week I attended a Dagstuhl seminar on New Frontiers for Empirical Software Engineering. It was a select gathering, with many great people, which meant lots of fascinating discussions, and not enough time to type up all the ideas we’ve been bouncing around. I was invited to run a working group on the challenges to empirical software engineering posed by climate change. I started off with a quick overview of the three research themes we identified at the Oopsla workshop in the fall:

  • Climate Modeling, which we could characterize as a kind of end-user software development, embedded in a scientific process;
  • Global collective decision-making, which involves creating the software infrastructure for collective curation of sources of evidence in a highly charged political atmosphere;
  • Green Software Engineering, including carbon accounting for the software systems lifecycle (development, operation and disposal), but where we have no existing measurement framework, and a tendency to make unsupported claims (aka greenwashing).

Inevitably, we spent most of our time this week talking about the first topic – software engineering of computational models, as that’s the closest to the existing expertise of the group, and the most obvious place to start.

So, here’s a summary of our discussions. The bright ideas are due to the group (Vic Basili, Lionel Briand, Audris Mockus, Carolyn Seaman and Claes Wohlin), while the mistakes in presenting them here are all mine.

A lot of our discussion was focussed on the observation that climate modeling (and software for computational science in general) is a very different kind of software engineering from most of what’s discussed in the SE literature. It’s as though we’ve identified a new species of software engineering, which appears to be an outlier (perhaps an entirely new phylum?). This discovery (and the resulting comparisons) seems to tell us a lot about the other species that we thought we already understood.

The SE research community hasn’t really tackled the question of how the different contexts in which software development occurs might affect software development practices, nor when and how it’s appropriate to attempt to generalize empirical observations across different contexts. In our discussions at the workshop, we came up with many insights for mainstream software engineering, which means this is a two-way street: plenty of opportunity for re-examination of mainstream software engineering, as well as learning how to study SE for climate science. I should also say that many of our comparisons apply to computational science in general, not just climate science, although we used climate modeling for many specific examples.

We ended up discussing three closely related issues:

  1. How do we characterize/distinguish different points in this space (different species of software engineering)? We focussed particularly on how climate modeling is different from other forms of SE, but we also attempted to identify factors that would distinguish other species of SE from one another. We identified lots of contextual factors that seem to matter. We looked for external and internal constraints on the software development project that seem important. External constraints are things like resource limitations, or particular characteristics of customers or the environment where the software must run. Internal constraints are those that are imposed on the software team by itself, for example, choices of working style, project schedule, etc.
  2. Once we’ve identified what we think are important distinguishing traits (or constraints), how do we investigate whether these are indeed salient contextual factors? Do these contextual factors really explain observed differences in SE practices, and if so how? We need to consider how we would determine this empirically. What kinds of study are needed to investigate these contextual factors? How should the contextual factors be taken into account in other empirical studies?
  3. Now imagine we have already characterized this space of species of SE. What measures of software quality attributes (e.g. defect rates, productivity, portability, changeability…) are robust enough to allow us to make valid comparisons between species of SE? Which metrics can be applied in a consistent way across vastly different contexts? And if none of the traditional software engineering metrics (e.g. for quality, productivity, …) can be used for cross-species comparison, how can we do such comparisons?

In my study of the climate modelers at the UK Met Office Hadley Centre, I had identified a list of potential success factors that might explain why the climate modelers appear to be successful (i.e. to the extent that we are able to assess it, they appear to build good quality software with low defect rates, without following a standard software engineering process). My list was:

  • Highly tailored software development process – software development is tightly integrated into scientific work;
  • Single Site Development – virtually all coupled climate models are managed and coordinated at a single site, once they become sufficiently complex [edited – see Bob’s comments below], usually a government lab, as universities don’t have the resources;
  • Software developers are domain experts – they do not delegate programming tasks to programmers, which means they avoid the misunderstandings of the requirements common in many software projects;
  • Shared ownership and commitment to quality, which means that the software developers are more likely to make contributions to the project that matter over the long term (in contrast to, say, offshored software development, where developers are only likely to do the tasks they are immediately paid for);
  • Openness – the software is freely shared with a broad community, which means that there are plenty of people examining it and identifying defects;
  • Benchmarking – there are many groups around the world building similar software, with regular, systematic comparisons on the same set of scenarios, through model inter-comparison projects (this trait could be unique – we couldn’t think of any other type of software for which this is done so widely).
  • Unconstrained Release Schedule – as there is no external customer, software releases are unhurried, and occur only when the software is considered stable and tested enough.

At the workshop we identified many more distinguishing traits, any of which might be important:

  • A stable architecture, defined by physical processes: atmosphere, ocean, sea ice, land scheme,…. All GCMs have the same conceptual architecture, and it is unchanged since modeling began, because it is derived from the natural boundaries in physical processes being simulated [edit: I mean the top level organisation of the code, not the choice of numerical methods, which do vary across models – see Bob’s comments below]. This is used as an organising principle both for the code modules, and also for the teams of scientists who contribute code. However, the modelers don’t necessarily derive some of the usual benefits of stable software architectures, such as information hiding and limiting the impacts of code changes, because the modules have very complex interfaces between them.
  • The modules and integrated system each have independent lives, owned by different communities. For example, a particular ocean model might be used uncoupled by a large community, and also be integrated into several different coupled climate models at different labs. The communities who care about the ocean model on its own will have different needs and priorities than each of the communities who care about the coupled models. Hence, the inter-dependence has to be continually re-negotiated. Some other forms of software have this feature too: Audris mentioned voice response systems in telecoms, which can be used stand-alone, and also in integrated call centre software; Lionel mentioned some types of embedded control systems onboard ships, where the modules are used independently on some ships, and as part of a larger integrated command and control system on others.
  • The software has huge societal importance, but the impact of software errors is very limited. First, a contrast: for automotive software, a software error can immediately lead to death, or huge expense, legal liability, etc., as cars are recalled. What would be the impact of software errors in climate models? An error may affect some of the experiments performed on the model, with perhaps the most serious consequence being the need to withdraw published papers (although I know of no cases where this has happened because of software errors rather than methodological errors). Because there are many other modeling groups, and scientific results are filtered through processes of replication, and systematic assessment of the overall scientific evidence, the impact of software errors on, say, climate policy is effectively nil. I guess it is possible that systematic errors are being made by many different climate modeling groups in the same way, but these wouldn’t be coding errors – they would be errors in the understanding of the physical processes and how best to represent them in a model.
  • The programming language of choice is Fortran, and this is unlikely to change, for very good reasons. The reasons are simple: there is a huge body of legacy Fortran code, everyone in the community knows and understands Fortran (and for many of them, only Fortran), and Fortran is ideal for much of the work of coding up the mathematical formulae that represent the physics. Oh, and performance matters enough that the overhead of object oriented languages makes them unattractive. Several climate scientists have pointed out to me that it probably doesn’t matter what language they use, the bulk of the code would look pretty much the same – long chunks of sequential code implementing a series of equations. Which means there’s really no push to discard Fortran.
  • Existence and use of shared infrastructure and frameworks. An example used by pretty much every climate model is MPI. However, unlike Fortran, which is generally liked (if not loved), everyone universally hates MPI. If there was something better they would use it. [OpenMP doesn’t seem to have any bigger fanclub]. There are also frameworks for structuring climate models and coupling the different physics components (more on these in a subsequent post). Use of frameworks is an internal constraint that will distinguish some species of software engineering, although I’m really not clear how it will relate to choices of software development process. More research needed.
  • The software developers are very smart people. Typically with PhDs in physics or related geosciences. When we discussed this in the group, we all agreed this is a very significant factor, and that you don’t need much (formal) process with very smart people. But we couldn’t think of any existing empirical evidence to support such a claim. So we speculated that we needed a multi-case case study, with some cases representing software built by very smart people (e.g. climate models, the Linux kernel, Apache, etc), and other cases representing software built by …. stupid people. But we felt we might have some difficulty recruiting subjects for such a study (unless we concealed our intent), and we would probably get into trouble once we tried to publish the results 🙂
  • The software is developed by users for their own use, and this software is mission-critical for them. I mentioned this above, but want to add something here. Most open source projects are built by people who want a tool for their own use, but that others might find useful too. The tools are built on the side (i.e. not part of the developers’ main job performance evaluations) but most such tools aren’t critical to the developers’ regular work. In contrast, climate models are absolutely central to the scientific work on which the climate scientists’ job performance depends. Hence, we described them as mission-critical, but only in a personal kind of way. If that makes sense.
  • The software is used to build a product line, rather than an individual product. All the main climate models have a number of different model configurations, representing different builds from the codebase (rather than say just different settings). In the extreme case, the UK Met Office produces several operational weather forecasting models and several research climate models from the same unified codebase, although this is unusual for a climate modeling group.
  • Testing focuses almost exclusively on integration testing. In climate modeling, there is very little unit testing, because it’s hard to specify an appropriate test for small units in isolation from the full simulation. Instead the focus is on very extensive integration tests, with daily builds, overnight regression testing, and a rigorous process of comparing the output from runs before and after each code change. In contrast, most other types of software engineering focus instead on unit testing, with elaborate test harnesses to test pieces of the software in isolation from the rest of the system. In embedded software, the testing environment usually needs to simulate the operational environment; the most extreme case I’ve seen is the software for the international space station, where the only end-to-end software integration was the final assembly in low earth orbit.
  • Software development activities are completely entangled with a wide set of other activities: doing science. This makes it almost impossible to assess software productivity in the usual way, and impossible even to estimate the total development cost of the software. We tried this as a thought experiment at the Hadley Centre, and quickly gave up: there is no sensible way of drawing a boundary to distinguish some set of activities that could be regarded as contributing to the model development, from other activities that could not. The only reasonable path to assessing productivity that we can think of must focus on time-to-results, or time-to-publication, rather than on software development and delivery.
  • Optimization doesn’t help. This is interesting, because one might expect climate modelers to put a huge amount of effort into optimization, given that century-long climate simulations still take weeks/months on some of the world’s fastest supercomputers. In practice, optimization, where it is done, tends to be an afterthought. The reason is that the model is changed so frequently that hand optimization of any particular model version is not useful. Plus the code has to remain very understandable, so very clever designed-in optimizations tend to be counter-productive.
  • There are very few resources available for software infrastructure. Most of the funding is concentrated on the frontline science (and the costs of buying and operating supercomputers). It’s very hard to divert any of this funding to software engineering support, so development of the software infrastructure is sidelined and sporadic.
  • …and last but not least, A very politically charged atmosphere. A large number of people actively seek to undermine the science, and to discredit individual scientists, for political (ideological) or commercial (revenue protection) reasons. We discussed how much this directly impacts the climate modellers, and I have to admit I don’t really know. My sense is that all of the modelers I’ve interviewed are shielded to a large extent from the political battles (I never asked them about this). Those scientists who have been directly attacked (e.g. Mann, Jones, Santer) tend to be scientists more involved in creation and analysis of datasets, rather than GCM developers. However, I also think the situation is changing rapidly, especially in the last few months, and climate scientists of all types are starting to feel more exposed.
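To make the testing point in the list above a bit more concrete, the before-and-after comparison amounts to diffing output fields between a baseline run and a trial run, with bit-for-bit checking as the zero-tolerance case. A minimal sketch (real centres compare full output and restart files; the field names and values here are made up):

```python
# Sketch of a before/after regression check on model output fields.
# A tolerance of 0.0 corresponds to a bit-for-bit comparison: any
# difference at all gets flagged for investigation.

def compare_runs(baseline, trial, tolerance=0.0):
    """Return the fields whose values differ by more than tolerance."""
    diffs = []
    for field, base_val in baseline.items():
        if abs(trial[field] - base_val) > tolerance:
            diffs.append(field)
    return diffs

baseline = {"surface_temp": 288.15, "sea_ice_area": 12.4e6}
trial    = {"surface_temp": 288.15, "sea_ice_area": 12.5e6}

print(compare_runs(baseline, trial))  # ['sea_ice_area']
```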

We also speculated about some other contextual factors that might distinguish different software engineering species, not necessarily related to our analysis of computational science software. For example:

  • Existence of competitors;
  • Whether software is developed for single-person-use versus intended for broader user base;
  • Need for certification (and different modes by which certification might be done, for example where there are liability issues, and the need to demonstrate due diligence)
  • Whether software is expected to tolerate and/or compensate for hardware errors. For example, for automotive software, much of the complexity comes from building fault-tolerance into the software, because correcting hardware problems introduced in design or manufacture is prohibitively expensive. We pondered how often hardware errors occur in supercomputer installations, and whether, if they did, they would affect the software. I’ve no idea of the answer to the first question, but the second is readily handled by the checkpoint and restart features built into all climate models. Audris pointed out that given the volumes of data being handled (terabytes per day), there are almost certainly errors introduced in storage and retrieval (i.e. bits getting flipped), and enough that standard error correction would still miss a few. However, there’s enough noise in the data that in general, such things probably go unnoticed, although we speculated what would happen when the most significant bit gets flipped in some important variable.
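On that last speculation, it’s easy to see in code why the position of a flipped bit matters so much: in an IEEE-754 double, the most significant bit is the sign, the next eleven are the exponent, and the rest are mantissa, so a high-bit flip changes the value catastrophically while a low mantissa-bit flip disappears into the noise. A small sketch (the temperature value is just an illustrative number):

```python
import struct

def flip_bit(x, bit):
    """Flip one bit (0 = least significant) of a float64's IEEE-754 encoding."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    (y,) = struct.unpack("<d", struct.pack("<Q", bits ^ (1 << bit)))
    return y

temp = 288.15  # an illustrative surface temperature in Kelvin

print(flip_bit(temp, 63))  # bit 63 is the sign: -288.15
print(flip_bit(temp, 62))  # top exponent bit: astronomically wrong magnitude
print(flip_bit(temp, 0))   # lowest mantissa bit: lost in the noise
```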

More interestingly, we talked about what happens when these contextual factors change over time. For example, the emergence of a competitor where there was none previously, or the creation of a new regulatory framework where none existed. Or even, in the case of health care, when change in the regulatory framework relaxes a constraint – such as the recent US healthcare bill, under which it (presumably) becomes easier to share health records among medical professionals if knowledge of pre-existing conditions is no longer a critical privacy concern. An example from climate modeling: software that was originally developed as part of a PhD project intended for use by just one person eventually grows into a vast legacy system, because it turns out to be a really useful model for the community to use. And another: the move from single site development (which is how nearly all climate models were developed) to geographically distributed development, now that it’s getting increasingly hard to get all the necessary expertise under one roof, because of the increasing diversity of science included in the models.

We think there are lots of interesting studies to be done of what happens to the software development processes for different species of software when such contextual factors change.

Finally, we talked a bit about the challenge of finding metrics that are valid across the vastly different contexts of the various software engineering species we identified. Experience with trying to measure defect rates in climate models suggests that it is much harder to make valid comparisons than is generally presumed in the software literature. There really has not been any serious consideration of these various contextual factors and their impact on software practices in the literature, and hence we might need to re-think a lot of the ways in which claims for generality are handled in empirical software engineering studies. We spent some time talking about the specific case of defect measurements, but I’ll save that for a future post.

This week I’m visiting the Max Planck Institute for Meteorology (MPI-M) in Hamburg. I gave my talk yesterday on the Hadley study, and it led to some fascinating discussions about software practices used for model building. One of the topics that came up in the discussion afterwards was how this kind of software development compares with agile software practices, and in particular the reliance on face-to-face communication, rather than documentation. Like many software projects, climate modellers struggle to keep good, up-to-date documentation, but generally feel they should be doing better. The problem of course, is that traditional forms of documentation (e.g. large, stand-alone descriptions of design and implementation details) are expensive to maintain, and of questionable value – the typical experience is that you wade through the documentation and discover that despite all the details, it never quite answers your question. Such documents are often produced in a huge burst of enthusiasm for the first release of the software, but then never touched again through subsequent releases. And as the code in the climate models evolves steadily over decades, the chances of any stand-alone documentation keeping up are remote.

An obvious response is that the code itself should be self-documenting. I’ve looked at a lot of climate model code, and readability is somewhat variable (to put it politely). This could be partially addressed with more attention to coding standards, although it’s not clear how familiar you would have to be with the model already to be able to read the code, even with very good coding standards. Initiatives like Clear Climate Code intend to address this problem, by re-implementing climate tools as open source projects in Python, with a strong focus on making the code as understandable as possible. Michael Tobis and I have speculated recently about how we’d scale up this kind of initiative to the development of coupled GCMs.

But readable code won’t fill the need for a higher level explanation of the physical equations and their numerical approximations used in the model, along with rationale for algorithm choices. These are often written up in various forms of (short) white papers when the numerical routines are first developed, and as these core routines rarely change, this form of documentation tends to remain useful. The problem is that these white papers tend to have no official status (or perhaps at best, they appear as technical reports), and are not linked in any usable way to distributions of the source code. The idea of literate programming was meant to solve this problem, but it never took off, probably because it demands that programmers must tear themselves away from using programming languages as their main form of expression, and start thinking about how to express themselves to other human beings. Given that most programmers define themselves in terms of the programming languages they are fluent in, the tyranny of the source code is unlikely to disappear anytime soon. In this respect, climate modelers have a very different culture from most other kinds of software development teams, so perhaps this is an area where the ideas of literate programming could take root.

Lack of access to these white papers could also be solved by publishing them as journal papers (thus instantly making them citeable objects). However, scientific journals tend not to publish descriptions of the designs of climate models, unless they are accompanied by new scientific results from the models. There are occasional exceptions (e.g. see the special issue of the Journal of Climate devoted to the MPI-M models). But things are changing, with the recent appearance of two new journals:

  • Geoscientific Model Development, an open access journal that accepts technical descriptions of the development and evaluation of the models;
  • Earth Science Informatics, a Springer Journal with a broader remit than GMD, but which does cover descriptions of the development of computational tools for climate science.

The problem is related to another dilemma in climate modeling groups: acknowledgement for the contributions of those who devote themselves more to model development rather than doing “publishable science”. Most of the code development is done by scientists whose performance is assessed by their publication record. Some modeling centres have created job positions such as “programmers” or “systems staff”, although most people hired into these roles have a very strong geosciences background. A growing recognition of the importance of their contributions represents a major culture change in the climate modeling community over the last decade.