To get myself familiar with the models at each of the climate centers I’m visiting this summer, I’ve tried to find high level architectural diagrams of the software structure. Unfortunately, there seem to be very few such diagrams around. Climate scientists tend to think of their models in terms of a set of equations, and differentiate between models on the basis of which particular equations each implements. Hence, their documentation doesn’t contain the kinds of views on the software that a software engineer might expect. It presents the equations, often followed with comments about the numerical algorithms that implement them. This also means they don’t find automated documentation tools such as Doxygen very helpful, because they don’t want to describe their models in terms of code structure (the folks at MPI-M here do use Doxygen, but it doesn’t give them the kind of documentation they most want).

But for my benefit, as I’m a visual thinker, and perhaps to better explain to others what is in these huge hunks of code, I need diagrams. There are some schematics like this around (taken from an MPI-M project site):

0c5e5ca1d3

But it’s not quite what I want. It shows the major components:

  • ECHAM – atmosphere dynamics and physics,
  • HAM – aerosols,
  • MESSy – atmospheric chemistry,
  • MPI-OM – ocean dynamics and physics,
  • HAMOCC – ocean biogeochemistry,
  • JSBACH – land surface processes,
  • HD – hydrology,
  • and the coupler, PRISM,

…but it only shows a few of the connectors, and many of the arrows are unlabeled. I need something that more clearly distinguishes the different kinds of connector, and perhaps shows where various subcomponents fit in (in part because I want to think about why particular compositional choices have been made).

The closest I can find to what I need is the Bretherton diagram, produced back in the mid 1980’s to explain what earth system science is all about:

The Bretherton Diagram of earth system processes (click to see bigger, as this is probably not readable!)

It’s not a diagram of an earth system model per se, but rather of the set of systems that such a model might simulate. There’s a lot of detail here, but it does clearly show the major systems (orange rectangles – these roughly correspond to model components) and subsystems (green rectangles), along with data sources and sinks (the brown ovals) and the connectors (pale blue rectangles, representing the data passed between components).

The diagram allows me to make a number of points. First, we can distinguish between two types of model:

  • a Global Climate Model, also known as a General Circulation Model (GCM), or Atmosphere-Ocean coupled model (AO-GCM), which only simulates the physical and dynamic processes in the atmosphere and ocean. Where a GCM does include parts of the other processes, it it typically only to supply appropriate boundary conditions.
  • an Earth System Model (ESM), which also includes the terrestrial and marine biogeochemical processes, snow and ice dynamics, atmospheric chemistry, aerosols, and so on – i.e. it includes simulations of most of the rest of the diagram.

Over the past decade, AO-GCMs have steadily evolved to become ESMs, although there are many intermediate forms around. In the last IPCC assessment, nearly all the models used for the assessment runs were AO-GCMs. For the next assessment, many of them will be ESMs.

Second, perhaps obviously, the diagram doesn’t show any infrastructure code. Some of this is substantial – for example an atmosphere-ocean coupler is a substantial component in its own right, often performing elaborate data transformations, such as re-gridding, interpolation, and synchronization. But this does reflect the way in which scientists often neglect the infrastructure code, because it is not really relevant to the science.

Third, the diagram treats all the connectors in the same way, because, at some level, they are all just data fields, representing physical quantities (mass, energy) that cross subsystem boundaries. However, there’s a wide range of different ways in which these connectors are implemented – in some cases binding the components tightly together with complex data sharing and control coupling, and in other cases keeping them very loose. The implementation choices are based on a mix of historical accident, expediency, program performance concerns, and the sheer complexity of the physical boundaries between the actual earth subsystems. For example, within an atmosphere model, the dynamical core (which computes the basic thermodynamics of air flow) is distinct from the radiation code (which computes how visible light, along with other parts of the spectrum, are scattered or absorbed by the various layers of air) and the moist processes (i.e. humidity and clouds). But the complexity of the interactions between these processes is sufficiently high that they are tightly bound together – it’s not currently possible to treat any of these parts as swappable components (at least in the current generation of models), although during development, some parts can be run in isolation for unit testing e.g. the dynanamical core is tested in isolation, but then most other subcomponents depend on it.

On the other hand, the interface between atmosphere and ocean is relatively simple — it’s the ocean surface — and as this also represents the interface between two distinct scientific disciplines (atmospheric physics and oceanography), atmosphere models and ocean model are always (?) loosely coupled. It’s common now for the two to operate on different grids (different resolution, or even different shape), and the translation of the various data to be passed between them is handled by a coupler. Some schematic diagrams do show how the coupler is connected:

Atmosphere-Ocean coupling via the OASIS coupler (source: Figure 4.2 in the MPI-Met PRISM Earth System Model Adaptation Guide)

Atmosphere-Ocean coupling via the OASIS coupler (source: Figure 4.2 in the MPI-Met PRISM Earth System Model Adaptation Guide)

Other interfaces are harder to define than the atmosphere-ocean interface. For example, the atmosphere and the terrestrial processes are harder to decouple: Which parts of the water cycle should be handled by the atmosphere model and which should be handled by the land surface model? Which module should handle evaporation from plants and soil? In some models, such as ECHAM, the land surface is embedded within the atmosphere model, and called as a subroutine at each time step. In part this is historical accident – the original atmosphere model had no vegetation processes, but used soil heat and moisture parameterization as a boundary condition. The land surface model, JSBACH, was developed by pulling out as much of this code as possible, and developing it into a separate vegetation model, and this is sometimes run as a standalone model by the land surface community. But it still shares some of the atmosphere infrastructure code for data handling, so its not as loosely coupled as the ocean is. By contrast, in CESM, the land surface model is a distinct component, interacting with the atmosphere model only via the coupler. This facilitates the switching of different land and/or atmosphere components into the coupled scheme, and also allows the land surface model to have a different grid.

The interface between the ocean model and the sea ice model is also tricky, not least because the area covered by the ice varies with the seasonal cycle. So if you use a coupler to keep the two components separate, the coupler needs information about which grid points contain ice and which do not at each timestep, and it has to alter its behaviour accordingly. For this reason, the sea ice is often treated as a subroutine of the ocean model, which then avoids having to expose all this information to the coupler. But again we have the same trade-off. Working through the coupler ensures they are self-contained components and can be swapped for other compatible models as needed; but at the cost of increasing the complexity of the coupler interfaces, reducing information hiding, and making future changes harder.

Similar challenges occur for:

  • the coupling between the atmosphere and the atmospheric chemistry (which handles chemical processes as gases and various types of pollution are mixed up by atmospheric dynamics).
  • the coupling between the ocean and marine biogeochemistry (which handles the way ocean life absorbs and emits various chemicals while floating around on ocean currents).
  • the coupling between the land surface processes and terrestrial hydrology (which includes rivers, lakes, wetlands and so on). Oh, and between both of these and the atmosphere, as water moves around so freely. Oh, and the ocean as well, because we have to account for how outflows from rivers enter the ocean at coastlines all around the world.
  • …and so on, as we account for more and more of the earth’s system into the models.

Overall, it seems that the complexity of the interactions between the various earth system processes is so high that traditional approaches to software modularity don’t work. Information hiding is hard to do, because these processes are so tightly inter-twined. A full object-oriented approach would be a radical departure from how these models are built currently, with the classes built on the data objects (the pale blue boxes in the Bretherton diagram) rather than the processes (the green boxes). But the computational demands of the processes in the green boxes is so high that the only way to make them efficient is to give them full access to the low level data structures. So any attempt to abstract away these processes from the data objects they operate on will lead to a model that is too inefficient to be useful.

Which brings me back to the question of how to draw pictures of the architecture so that I can compare the coupling and modularity of different models. I’m thinking the best approach might be to start with the Bretherton diagram, and then overlay it to show how various subsystems are grouped into components, and which connectors are handled by a separate coupler.

Postscript: While looking for good diagrams, I came across this incredible collection of visualizations of various aspects of sustainability, some of which are brilliant, while others are just kooky.

I had some interesting chats in the last few days with Christian Jakob, who’s visiting Hamburg at the same time as me. He’s just won a big grant to set up a new Australian Climate Research Centre, so we talked a lot about what models they’ll be using at the new centre, and the broader question of how to manage collaborations between academics and government research labs.

Christian has a paper coming out this month in BAMS on how to accelerate progress in climate model development. He points out that much of the progress now depends on the creation of new parameterizations for physical processes, but to do this more effectively requires better collaboration between the groups of people who run the coupled models and assess overall model skill, and the people who analyze observational data to improve our understanding (and simulation) of particular climate processes. The key point he makes in the paper is that process studies are often undertaken because they are interesting and or because data is available, but without much idea on whether improving a particular process will have any impact on overall model skill; conversely model skill is analyzed at modeling centers without much follow-through to identify which processes might be to blame for model weaknesses. Both activities lead to insights, but better coordination between them would help to push model development further and faster. Not that it’s easy of course: coupled models are now sufficiently complex that it’s notoriously hard to pin down the role of specific physical processes in overall model skill.

So we talked a lot about how the collaboration works. One problem seems to stem from the value of the models themselves. Climate models are like very large, very expensive scientific instruments. Only large labs (typically at government agencies) can now afford to develop and maintain fully fledged earth system models. And even then the full cost is never adequately accounted for in the labs’ funding arrangements. Funding agencies understand the costs of building and operating physical instruments, like large telescopes, or particle accelerators, as shared resources across a scientific community. But because software is invisible and abstract, they don’t think of it in the same way – there’s a tendency to think that it’s just part of the IT infrastructure, and can be developed by institutional IT support teams. But of course, the climate models need huge amounts of specialist expertise to develop and operate, and they really do need to be funded like other large scientific instruments.

The complexity of the models and the lack of adequate funding for model development means that the institutions that own the models are increasingly conservative in what they do with them. They work on small incremental changes to the models, and don’t undertake big revolutionary changes – they can’t afford to take the risk. There are some examples of labs taking such risks: for example in the early 1990’s ECMWF re-wrote their model from scratch, driven in part to make it more adaptable to new, highly parallel, hardware architectures. It took several years, and a big team of coders, bringing in the scientific experts as needed. At the end of it, they had a model that was much cleaner, and (presumably) more adaptable. But scientifically, it was no different from the model they had previously. Hence, lots of people felt this was not a good use of their time – they could have made better scientific progress during that time by continuing to evolve the old model. And that was years ago – the likelihood of labs making such radical changes these days is very low.

On the other hand, academics can try the big, revolutionary stuff – if it works, they get lots of good papers about how they’re pushing the frontiers, and if it doesn’t, they can write papers about why some promising new approach didn’t work as expected. But then getting their changes accepted into the models is hard. A key problem here is that there’s no real incentive for them to follow through. Academics are judged on papers, so once the paper is written they are done. But at that point, the contribution to the model is still a long way from being ready to incorporate for others to use. Christian estimates that it takes at least as long again to get a change ready to incorporate into a model as it does to develop it in the first place (and that’s consistent with what I’ve heard other modelers say). The academic has no incentive to continue to work on it to get it ready, and the institutions have no resources to take it and adopt it.

So again we’re back to the question of effective collaboration, beyond what any one lab or university group can do. And the need to start treating the models as expensive instruments, with much higher operation and maintenance costs than anyone has yet acknowledged. In particular, modeling centers need resources for a much bigger staff to support the efforts by the broader community to extend and improve the models.

One of the exercises I set myself while visiting NCAR this month is to try porting the climate model CESM1 to my laptop (a MacBook Pro). Partly because I wanted to understand what makes it tick, and partly because I thought it would be neat to be able to run it anywhere. At first I thought the idea was crazy – these things are designed to be run on big supercomputers. But CESM is also intended to be portable, as part of a mission to support a broader community of model users. So, porting it to my laptop is a simple proof of concept test – if it ports easily, that’s a good sign that the code is robust.

It took me several days of effort to complete the port, but most of that time was spent on two things that have very little to do with the model itself. The first was a compiler bug that I tripped over (yeah, yeah, blame the compiler, right?) and the second was the issue of getting all the necessary third party packages installed. But in the end I was successful. I’ve just completed two very basic test runs of the model. The first is what’s known as an ‘X’ component set, in which all the major components (atmosphere, ocean, land, ice, etc) don’t actually do anything – this just tests that the framework code builds and runs. The second is an “A” compset at a low resolution, in which all the components are static data models (this ran for five days of simulation time in about 1.5 minutes). If I was going on to test the port correctly, there’s a whole sequence of port validation tests that I ought to perform, for example to check that my runs are consistent with the benchmark runs, that I can stop and restart the model from the data files, that I get the same results in different model configurations, etc. And then eventually there’s the scientific validation tests – checks that the simulated climate in my ported model is realistic.

But for now, I just want to reflect on the process of getting the model to build and run on a new (unsupported) platform. I’ll describe some of the issues I encountered, and then reflect on what I’ve learned about the model. First, some stats. The latest version of the model, CESM1.0 was released on June 25, 2010. It contains just under 1 million lines of code. Three quarters of this is Fortran (mainly Fortran 90), the rest is a mix of shell scripts (of several varieties), XML and HTML:

Lines of Code count for CESM v1.0 (not including build scripts), as calculated by cloc.sourceforge.net v1.51

In addition to the model itself, there are another 12,000 lines of perl and shell script that handle the installing, configuring, building and running the model.

The main issues that tripped me up were

  • The compiler. I decided to use the gnu compiler package (gfortran, included in the gcc package), because it’s free. But it’s not one of the compilers that’s supported for CESM, because in general CESM is used with commercial compilers (e.g. IBM’s) on the supercomputers. I grabbed the newest version of gcc that I could find a pre-built Mac binary for (v4.3.0), but it turned out not to be new enough – I spent a few hours diagnosing what turned out to be a (previously undiscovered?) bug in gfortran v4.3.0 that’s fixed in newer versions (I switched to v4.4.1). And then there’s a whole bunch of compiler flags (mainly to do with compatibility for certain architectures and language versions) that are not compatible with the commercial compilers, which I needed to track down.
  • Third party packages such as MPI (the message passing interface used for exchanging data between model components) and NetCDF (the data formating standard used for geoscience data). It turns out that the Mac already has MPI installed, but without Fortran and Parallel IO support, so I had to rebuild it. And it took me a few rebuilds to get both these packages installed with all the right options.

Once I’d got these sorted, and figured out which compiler flags I needed, the build went pretty smoothly, and I’ve had no problems so far running it. Which leads me to draw a number of (tentative) conclusions about portability. First, CESM is a little unusual compared to most climate models, because it is intended as a community effort, and hence portability is a high priority. It has already been ported to around 30 different platforms, including a variety IBM and Cray supercomputers, and various Linux clusters. Just the process of running the code through many different compilers shakes out not just portability issues, but good coding practices too, as different compilers tend to be picky about different language constructs.

Second, in the process of building the model, it’s quite easy to see that it consists of a number of distinct components, written by different communities, to different coding standards. Most obviously, CESM itself is built from five different component models (atmosphere, ocean, sea ice, land ice, land surface), along with a coupler that allows them to interact. There’s a tension between the needs of scientists who develop code just for a particular component model (run as a standalone model) versus scientists who want to use a fully coupled model. These communities overlap, but not completely, and coordinating the different needs takes considerable effort. Sometimes code that makes sense in a standalone module will break the coupling scheme.

But there’s another distinction that wasn’t obvious to me previously:

  • Scientific code – the bulk of the Fortran code in the component modules. This includes the core numerical routines, radiation schemes, physics parameterizations, and so on. This code is largely written by domain experts (scientists), for whom scientific validity is the over-riding concern (and hence they tend to under-estimate the importance of portability, readability, maintainability, etc).
  • Infrastructure code – including the coupler that allows the components to interact, the shared data handling routines, and a number of shared libraries. Most of this I could characterize as a modeling framework – it provides an overall architecture for a coupled model, and calls the scientific code as and when needed. This code is written jointly by the software engineering team and the various scientific groups.
  • Installation code – including configuration and build scripts. These are distinct from the model itself, but intended to provide flexibility to the community to handle a huge variety of target architectures and model configurations. These are written exclusively by the software engineering team (I think!), and tend to suffer from a serious time crunch: making this code clean and maintainable is difficult, given the need to get a complex and ever-changing model working in reasonable time.

In an earlier post, I described the rapid growth of complexity in earth system models as a major headache. This growth of complexity can be seen in all three types of software, but the complexity growth is compounded in the latter two: modeling frameworks need to support a growing diversity of earth system component models, which then leads to exponential growth in the number of possible model configurations that the build scripts have to deal with. Handling the growing complexity of the installation code is likely to be one of the biggest software challenges for the earth system modeling community in the next few years.

Gavin beat me to posting the best quote from the CCSM workshop last week – the Uncertainty Prayer. Uncertainty cropped up as a theme throughout the workshop. In discussions about the IPCC process, one issue came up several times: the likelihood that the spread of model projections in the next IPCC assessment will be larger than in AR4. The models are significantly more complex than they were five years ago, incorporating a broader set of earth system phenomena and resolving finer grain processes. The uncertainties in a more complex earth system model have a tendency to multiply, leading to a broader spread.

There is a big concern here about how to communicate this. Does this mean the science is going backwards – that we know less now than we did five years ago (imagine the sort of hay that some of the crazier parts of the blogosphere will make of that)? Well, there has been all sorts of progress in the past five years, much of it to do with understanding the uncertainties. And one result is the realization that the previous generations of models have under-represented uncertainty in the physical climate system – i.e. the previous projections for future climate change were more precise than they should have been. The implications are very serious for policymaking, not because there is any weaker case now for action, but precisely the opposite – the case for urgent action is stronger because the risks are worse, and good policy must be based on sound risk assessment. A bigger model spread means there’s now a bigger risk of more extreme climate responses to anthropogenic emissions. This problem was discussed at a fascinating session at the AGU meeting last year on validating model uncertainty (See: “How good are predictions from climate models?“).

At the CCSM meeting last week, Julia Slingo, chief scientist at the UK Met Office put the problem of dealing with uncertainty into context, by reviewing the current state of the art in short and long term forecasting, in a fascinating talk “Uncertainty in Weather and Climate Prediction”.

She began with the work of Ed Lorenz. The Lorenz attractor is the prototype chaotic model. A chaotic system is not random, and the non-linear equations of a chaotic system demonstrate some very interesting behaviours. If it’s not random, then it must be predictable, but this predictability is flow dependent – where you are in the attractor will determine where you will go, but some starting points lead to a much more tightly constrained set of behaviours than others. Hence, the spread of possible outcomes depends on the initial state, and some states have more predictable outcomes than others.

Why stochastic forecasting is better than deterministic forecasting

Much of the challenge in weather forecasting is to sample the initial condition uncertainty. Rather than using a single (deterministic) forecast run, modern weather forecasting makes use of ensemble forecasts, which probe the space of possible outcomes from a given (uncertain) initial state. This then allows the forecasters to assess possible outcomes, estimate risks and possibilities, and communicate risks to the users. Note the phrase “to allow the forecasters to…” – the role of experts in interpreting the forecasts and explaining the risks is vital.

As an example, Julia showed two temperature forecasts for London, using initial conditions for 26 June on two consecutive years, 1994 and 1995. The red curves show the individual members of an ensemble forecast. The ensemble spread is very different in each case, demonstrating that some initial conditions are more predictable than others: one has very high spread of model forecasts, and the other doesn’t (although note that in both cases the actual observations lie within the forecast spread):

Ensemble forecasts for two different initial states (click for bigger)

The problem is that in ensemble forecasting, the root mean squared (rms) error of the ensemble mean often grows faster than the spread, which indicates that the forecast is under-dispersive; in other words, the models don’t capture enough of the internal variability in the system. In such cases, improving the models (by eliminating modeling errors) will lead to increased internal variability, and hence larger ensemble spread.

One response to this problem is the work on stochastic parameterizations. Essentially, this introduces noise into the model to simulate variability in the sub-grid processes. This can then reduce the systematic model error if it better captures the chaotic behaviour of the system. Julia mentioned three schemes that have been explored for doing this:

  • Random Parameters (RP), in which some of the tunable model parameters are varied randomly. This approach is not very convincing as it indicates we don’t really know what’s going on in the model.
  • Stochastic Convective Vorticity (SCV)
  • Stochastic Kinetic Energy Backscatter (SKEB)

The latter two approaches tackle known weaknesses in the models, at the boundaries between resolved physical processes and sub-scale parameterizations. There is plenty of evidence in recent years that there are upscale energy cascades from unresolved scales, and that parametrizations don’t capture this. For example, in the backscatter scheme, some fraction of dissipated energy is scattered upscale and acts as a forcing for the resolved-scale flow. By including this in the ensemble prediction system, the forecast is no longer under-dispersive.

The other major approach is to increase the resolution of the model. Higher resolutions models will explicitly resolve more of the moist processes in sub-kilometer scale, and (presumably) remove this source of model error, although it’s not yet clear how successful this will be.

But what about seasonal forecasting – surely this growth of uncertainty prevents any kind of forecasting? People frequently ask “If we can’t predict weather beyond the next week, why is it possible to make seasonal forecasts?” The reason is that for longer term forecasts, the boundary forcings start to matter more. For example, if you add a boundary forcing to the Lorenz attractor, it changes the time in which the system stays in some part of the attractor, without changing the overall behaviour of the chaotic system. For a weak forcing, the frequency of occurrence of different regimes is changed, but the number and spatial patterns are unchanged. Under strong forcing, even the patterns of regimes are modified as the system goes through bifurcation points. So if we know something about the forcing, we can forecast the general statistics of weather, even if it’s not possible to say what the weather will be at a particular location at a particular time.

Of course, there’s still a communication problem: people feel weather, not the statistics of climate.

Building on the early work of Charney and Shukla (e.g. see their 1981 paper on monsoon predictability), seasonal to decadal prediction using coupled atmosphere-ocean systems does work, whereas 20 years ago, we would never have believed it. But again, we get the problem that some parts of the behaviour space are easier to predict than others. For example, the onset of El Niño is much harder to predict than the decay.

In a fully coupled system, systematic and model-specific errors grow much more strongly. Because the errors can grow quickly, and bias the probability distribution of outcomes, seasonal and decadal forecasts may not be reliable. So we assess reliability of a given model using hindcasts. Every time you change the model, you have to redo the hindcasts to check reliability. This gives a reasonable sanity check for seasonal forecasting, but for decadal prediction, it is challenging has we have very limited observational base.

And now, we have another problem: climate change is reducing the suitability of observations from the recent past to validate the models, even for seasonal prediction:

Climate Change shifts the climatology, so that models tuned to 20th century climate might no longer give good forecasts

Hence, a 40-year hindcast set might no longer be useful for validating future forecasts. As an example, the UK Met Office got into trouble for failing to predict the cold winter in the UK for 2009-2010. Re-analysis of the forecasts indicates why: Models that are calibrated on a 40-year hindcast gave only 20% probability of cold winter (and this was what was used for the seasonal forecast last year). However, models that are calibrated on just the past 20-years gave a 45% probability. Which indicates that the past 40 years might no longer be a good indicator of future seasonal weather. Climate change makes seasonal forecasting harder!

Today, the state-of-the-art for longer term forecasts is multi-model ensembles, but it’s not clear this is really the best approach, it just happens to be where we are today. Multi-model ensembles have a number of strengths: Each model is extensively tested by its own community and a large pool of alternative components provides some sampling across structural assumptions. But they are still an ensemble of opportunity – they do not systematically sample uncertainties. Also the set is rather small – e.g. 21 different models. So the sample is too small for determining the distribution of possible changes, and the ensembles are especially weak for predicting extreme events.

There has been a major effort on quantifying uncertainty over last few years at the Hadley Centre, using a perturbed physics ensemble. This allows for a larger sample: 100s (or even 10,000s in climateprediction.net) of variants of the same model. The poorly constrained model parameters are systematically perturbed, within expert-suggested ranges. But this still doesn’t sample the structural uncertainty in the models, because all the variants are from a single base model. As an example of this work, the UKCP09 project was an attempt to move from uncertainty ranges (as in AR4) to a probability density function (pdf) for likely change. UKCP uses over 400 model projections to compute the pdf. Although there are many problems with the UKCP (see the AGU discussion for a critique), but they were a step forward in understanding how to quantify uncertainty. [Note: Julia acknowledged weaknesses in both CP.net and the UKCP projects, but pointed out that they are mainly interesting as examples of how forecasting methodology is changing]

Another approach is to show which factors tend to dominate the uncertainty. For example, a pie chart showing impact of different sources of uncertainty (model weaknesses, carbon cycle, natural variability, downscaling uncertainty) on the forecast for rainfall in 2020s vs 2080s is interesting – for the 2020s, the uncertainty about the carbon cycle is relatively small factor, whereas for the 2080s it’s a much bigger factor.

Julia suggests it’s time for a coordinated study of the effects of model resolution on uncertainty. Every modeling group is looking at this, but they are not doing standardized experiments, so comparisons are hard.

Here is an example from Tim Palmer. In AR4, WG1 chapter 11 gave an assessment of regional patterns of change in precipitation. For some regions, it was impossible to give a prediction (the white areas), whereas for others, the models appear to give highly confident predictions. But the confidence might be misplaced because many of the models have known weaknesses that are relevant to future precipitation. For example, the models don’t simulate persistent blocking anticyclones very well. Which means that it’s wrong to assume that if most models agree, we can be confident in the prediction. For example, the Athena experiments with very high resolution models (T1259) showed much better blocking behaviour against the observational dataset ERA40. This implies we need to be more careful about selecting models for a multi-model ensemble for certain types of forecast.

The real butterfly effect raises some fundamental unanswered questions about convergence of climate simlations with increasing resoltion. Maybe there is an irreducible level of uncertainty in climate change. And if so, what is it? How much will increased resolution reduce the uncertainty? Will things be much better when we can resolve processes at  20km, 2km, or even 0.2km? compared to say 200km? Once we reach a certain resolution (e.g. 20km) is it just as good to represent small scale motions using stochastic equations? And what’s the most effective way to use the available computing resources as we increase the resolution? [There’s an obvious trade-off between increasing the size of the ensemble, and increasing the resolution of individual ensemble members]

Julia’s main conclusion is that Lorenz’ theory of chaotic systems now pervades all aspects of weather and climate prediction. Estimating and reducing uncertainty requires better multi-scale physics, higher resolution models, and more complete observations.

Some of the questions after the talk probed these issues a little more. For example, Julia was asked  how to handle policymakers demanding better decadal prediction, when we’re not ready to deliver it. Her response was that she believes higher resolution modeling will help, but that we haven’t proved this yet, so we have to manage expectations very carefully. She was also asked about the criteria to use to use for including different models in an ensemble – e.g. should we exclude models that don’t conserve physical quantities, that don’t do blocking, etc? For UKCP09, the criteria were global in nature, but this isn’t sufficient – we need criteria that test for skill with specific phenomena such as El Nino. Because the inclusion criteria aren’t clear enough yet, the UKCP project couldn’t give advice on wind in the projections. In the long run, the focus should be on building the best model we can, rather than putting effort into exploring perturbed physics, but we have to balance needs of users for better probablistic predictions against need to get on and develop better phyiscs in the models.

Finally, on the question of interpretation, Julia was asked what if users (of the forecasts) can’t understand or process probablistic forecasts? Julia pointed out that some users can process probablistic forecasts, and indeed that’s exactly what they need. For example, the insurance industry. Others use it as input for risk assessment – e.g. water utilities. So we do have to distinguish the needs of different types of users.

Of all the global climate models, the Community Earth System Model, CESM, seems to come closest to the way an open source community works. The annual CESM workshop, this week in Breckenridge, Colorado, provides an example of how the community works. There are about 350 people attending, and much of the meeting is devoted to detailed discussion of the science and modeling issues across a set of working groups: Atmosphere model, Paleoclimate, Polar Climate, Ocean model, Chemistry-climate, Land model, Biogeochemistry, Climate Variability, Land Ice, Climate Change, Software Engineering, and Whole Atmosphere.

In the opening plenary on Monday, Mariana Vertenstein (who is hosting my visit to NCAR this month), was awarded the 2010 CESM distinguished achievement award for her role in overseeing the software engineering of the CESM. This is interesting for a number of reasons, not least because it demonstrates how much the CESM community values the role of the software engineering team, and the advances that the software engineering working group has made improving the software infrastructure over the last few years.

Earth system models are generally developed in a manner that’s very much like agile development. Getting the science working in the model is prioritized, with issues such as code structure, maintainability and portability worked in later, as needed. To some extent, this is appropriate – getting the science right is the most important thing, and it’s not clear how much a big upfront design effort would payoff, especially in the early stages of model development, when it’s not clear whether the model will become anything more than an interesting research idea. The downside of this strategy, is that as the model grows in sophistication, the software architecture ends up being a mess. As Mariana explained in her talk, coupled models like the CESM have reached a point in their development where this approach no longer works. In effect, a massive refactoring effort is needed to clean up the software infrastructure to permit future maintainability.

Mariana’s talk was entitled “Better science through better software”. She identified a number of major challenges facing the current generation of earth system models, and described some of the changes in the software infrastructure that have been put in place for the CESM to address them.

The challenges are:

1) New system complexity, as new physics, and new grids are incorporated into the models. For example, the CESM now has a new land ice model, which along with the atmosphere, ocean, land surface, and sea ice components brings the total to five distinct geophysical component models, each operating on different grids, and each with its own community of users. These component models exchange boundary information via the coupler, and the entire coupled model now runs to about 1.2 million lines of code (compare with the previous generation model, CCSM3, now six years old, which had about 330KLoC).

The increasing number of component models increases the complexity of the coupler. It now has to handle regridding (where data such as energy and mass is exchanged between component models with different grids), data merging, atmosphere-ocean fluxes, and conservation diagnostics (e.g. to ensure the entire model conserves energy and mass). Note: Older versions of the model were restricted, for example with the atmosphere, ocean and land surface schemes all required to use the same grid.

Users also want to be able to swap in different versions of each major component. For example, a particular run might demand a fully prognostic atmosphere model, coupled with a prescribed ocean parameterization (taken from observational data, for example). Then, within each major component, users might want different configurations:  multiple dynamic cores, multiple chemistry modes, etc.

Another source of complexity comes from resolutions. Model components now run over a much wider range of resolutions, and the re-gridding challenges are substantial. And finally, whereas the old model used rectangular latitude-longitude grids, now people want to accommodate many different types of grid.

2) Ultra-high resolution. The trend towards higher resolution grids poses serious challenges for scalability, especially given the massive increase in volume of data being handled. All components (and the coupler) need to be scalable in terms of both memory and performance.

Higher resolution increases the need for more parallelism, and there has been tremendous progress on this in the last few years. A few years back, as part of the DOE/LLNL grand challenge, CCSM3 managed 0.5 simulation years per day, running on 4,000 cores, and this was considered a great achievement. This year, the new version of CESM has successfully run on 80,000 cores, to give 3 simyears per day in a very high resolution model: 0.125° grid for the atmosphere, 0.25° for the land and 0.1° for the ocean.

Interestingly, in these highly parallel configurations, the ocean model, POP, is no longer dominant for processing time; the sea ice and atmosphere models start to dominate because the two of them are coupled sequentially. Hence the ocean model scales more readily.

3) Data assimilation. For weather forecasting models, this has long been standard analysis practice. Briefly, the model state and the observational data are combined at each timestep to give a detailed analysis of the current state of the system, which helps to overcome limitations in both the model and the data, and to better understand the physical processes underlying the observational data. It’s also useful in forecasting, as it allows you to arrive at a more accurate initial state for a forecast run.

In climate modeling, data assimilation is a relatively new capability. The current version of the CESM can do data assimilation in both the atmosphere and ocean. The new framework also supports experiments where multiple versions of the same component are used within a run. For example, the model might have multiple atmosphere components in a single simulation, each coupled with its own instance of the ocean, where one is an assimilation module and the other a prognostic model.

4) The needs of the user community. Supporting a broad community of model users adds complexity, especially as the community becomes more diverse. The community needs more frequent releases of the model (e.g. more often than every six years!), and people ned to be able to merge new releases more easily into their own sandboxes.

These challenges have inspired a number of software infrastructure improvements in the CESM. Mariana described a number of advances.

The old model, CCSM3 was run as multiple executables, one for each major component, exchanging data with a coupler via MPI. And each component used to have its own way of doing coupling. But this kills efficiency – processors end up idling when a component has to wait on data from the others. It’s also very hard in this scheme to understand the time evolution as the model runs, which then also makes it very hard to debug. And the old approach was notoriously hard to port to different platforms.

The new framework has a top level driver that controls time evolution, with all coupling done at the top level. Then the component models can be laid out across the available processors, either all in parallel, or in a hybrid parallel-sequential mode. For example, atmosphere, land scheme and sea ice modules might be called in sequence, with the ocean model running in parallel with the whole set. The chosen architecture is specified in a single XML file. This brings a number of benefits:

  • Better flexibility for very different platforms;
  • Facilitates model configurations with huge amounts of parallelism across a very large number of processors;
  • Allows the coupler & components to be ESMF compliant, so the model can can couple with other ESMF compliant models;
  • Integrated release cycle – it’s now all one model, whereas in the past each component model had it’s own separate releases.
  • Much easier to debug, as it’s easier to follow the time evolution.

The new infrastructure also includes scripting tools that support the process of setting up an experiment, and making sure it runs with optimal performance on a particular platform. For example, the current release includes script to create wide variety of out-of-the-box experiments. It also includes a load balancing tool, to check how much time each component is idle during a run, and new scripts with hints for porting to new platforms, based on a set of generic machine templates.

The model also has a new parallel I/O library (PIO), which adds a layer of abstraction between the data structures used in each model component and the arrangement of the data when written to disk.

The new versions of the model are now being released via the subversion repository (rather than a .tar file, as used in the past). Hence, users can use an svn merge to get the latest release. There have been three model releases since January:

  • CCSM Alpha, released in January 2010;
  • CCSM 4.0 full release, in April 2010;
  • CESM 1.0 released June 2010.

Mariana ended her talk with a summary of the future work – complete the CMIP5 runs for the next round of the IPCC assessment process; regional refinement with scalable grids; extend the data assimilation capability; handle super-parameterization (e.g. include cloud resolving models); add hooks for human dimensions within the models (e.g. to support the DOE program on integrated assessment); and improved validation metrics.

Note: the CESM is the successor to CCSM – the community climate system model. The name change recognises the wider set of earth systems now incorporated into the model.

Short notice, but an interesting talk tomorrow by Balaji of Princeton University and NOAA/GFDL. Balaji is head of the Modeling Systems Group at NOAA/GFDL. The talk is scheduled for 4 p.m., in the Physics building, room MP408.

Climate Computing: Computational, Data, and Scientific Scalability

V. Balaji
Princeton University

Climate modeling, in particular the tantalizing possibility of making projections of climate risks that have predictive skill on timescales of many years, is a principal science driver for high-end computing. It will stretch the boundaries of computing along various axes:

  • resolution, where computing costs scale with the 4th power of problem size along each dimension
  • complexity, as new subsystems are added to comprehensive earth system models with feedbacks
  • capacity, as we build ensembles of simulations to sample uncertainty, both in our knowledge and representation, and of that inherent in the chaotic system. In particular, we are interested in characterizing the “tail” of the pdf (extreme weather) where a lot of climate risk resides.

The challenge probes the limits of current computing in many ways. First, there is the problem of computational scalability, where the community is adapting to an era where computational power increases are dependent on concurrency of computing and no longer on raw clock speed. Second, we increasingly depend on experiments coordinated across many modeling centres which result in petabyte-scale distributed archives. The analysis of results from distributed archives poses the problem of data scalability.

Finally, while climate research is still performed by dedicated research teams, its potential customers are many: energy policy, insurance and re-insurance, and most importantly the study of climate
change impacts — on agriculture, migration, international security, public health, air quality, water resources, travel and trade — are all domains where climate models are increasingly seen as tools that
could be routinely applied in various contexts. The results of climate research have engendered entire fields of “downstream” science as societies try to grapple with the consequences of climate change. This poses the problem of scientific scalability: how to enable the legions of non-climate scientists, vastly outnumbering the climate research community, to benefit from climate data.

The talks surveys some aspects of current computational climate research as it rises to meet the simultaneous challenges of computational, data and scientific scalability.

Update: Neil blogged a summary of Balaji’s talk.

Here are two upcoming conferences, both relevant to the overlap of computer science and climate science:

…and I won’t make it to either as I’ll be doing my stuff at NCAR. I will get to attend this though:

I guess I’ll have to send some of my grad students off to the other conferences (hint, hint).

Excellent news: Jon Pipitone has finished his MSc project on the software quality of climate models, and it makes fascinating reading. I quote his abstract here:

A climate model is an executable theory of the climate; the model encapsulates climatological theories in software so that they can be simulated and their implications investigated. Thus, in order to trust a climate model one must trust that the software it is built from is built correctly. Our study explores the nature of software quality in the context of climate modelling. We performed an analysis of the reported and statically discoverable defects in several versions of leading global climate models by collecting defect data from bug tracking systems, version control repository comments, and from static analysis of the source code. We found that the climate models all have very low defect densities compared to well-known, similarly sized open-source projects. As well, we present a classification of static code faults and find that many of them appear to be a result of design decisions to allow for flexible configurations of the model. We discuss the implications of our findings for the assessment of climate model software trustworthiness.

The idea for the project came from an initial back-of-the-envelope calculation we did of the Met Office Hadley’s Centre’s Unified Model, in which we estimated the number of defects per thousand lines of code (a common measure of defect density in software engineering) to be extremely low – of the order of 0.03 defects/KLoC. By comparison, the shuttle flight software, reputedly the most expensive software per line of code ever built, clocked in at 0.1 defects/KLoC; most of the software industry does worse than this.

This initial result was startling, because the climate scientists who build this software don’t follow any of the software processes commonly prescribed in the software literature. Indeed, when you talk to them, many climate modelers are a little apologetic about this, and have a strong sense they ought to be doing more rigorous job with their software engineering. However, as we documented in our paper, climate modeling centres such as the UK Met Office do have a excellent software processes, that they have developed over many years to suit their needs. I’ve come to the conclusion that it has to be very different from mainstream software engineering processes because the context is so very different.

Well, obviously we were skeptical (scientists are always skeptical, especially when results seem to contradict established theory). So Jon set about investigating this more thoroughly for his MSc project. He tackled the question in three ways: (1) measuring defect density by using bug repositories, version history and change logs to quantify bug fixes; (2) assessing the software directly using static analysis tools and (3) interviewing climate modelers to understand how they approach software development and bug fixing in particular.

I think there are two key results of Jon’s work:

  1. The initial results on defect density bear up. Although not quite as startlingly low as my back of the envelope calculation, Jon’s assessment of three major GCMs indicate they all fall in the range commonly regarded as good quality software by industry standards.
  2. There are a whole bunch of reasons why result #1 may well be meaningless, because the metrics for measuring software quality don’t really apply well to large scale scientific simulation models.

You’ll have to read Jon’s thesis to get all the details, but it will be well worth it. The conclusion? More research needed. It opens up plenty of questions for a PhD project….

This week I attended a Dagstuhl seminar on New Frontiers for Empirical Software Engineering. It was a select gathering, with many great people, which meant lots of fascinating discussions, and not enough time to type up all the ideas we’ve been bouncing around. I was invited to run a working group on the challenges to empirical software engineering posed by climate change. I started off with a quick overview of the three research themes we identified at the Oopsla workshop in the fall:

  • Climate Modeling, which we could characterize as a kind of end-user software development, embedded in a scientific process;
  • Global collective decision-making, which involves creating the software infrastructure for collective curation of sources of evidence in a highly charged political atmosphere;
  • Green Software Engineering, including carbon accounting for the software systems lifecycle (development, operation and disposal), but where we have no existing no measurement framework, and tendency to to make unsupported claims (aka greenwashing).

Inevitably, we spent most of our time this week talking about the first topic – software engineering of computational models, as that’s the closest to the existing expertise of the group, and the most obvious place to start.

So, here’s a summary of our discussions. The bright ideas are due to the group (Vic Basili, Lionel Briand, Audris Mockus, Carolyn Seaman and Claes Wohlin), while the mistakes in presenting them here are all mine.

A lot of our discussion was focussed on the observation that climate modeling (and software for computational science in general) is a very different kind of software engineering than most of what’s discussed in the SE literature. It’s like we’ve identified a new species of software engineering, which appears to be a an outlier (perhaps an entirely new phylum?). This discovery (and the resulting comparisons) seems to tell us a lot about the other species that we thought we already understood.

The SE research community hasn’t really tackled the question of how the different contexts in which software development occurs might affect software development practices, nor when and how it’s appropriate to attempt to generalize empirical observations across different contexts. In our discussions at the workshop, we came up with many insights for mainstream software engineering, which means this is a two-way street: plenty of opportunity for re-examination of mainstream software engineering, as well as learning how to study SE for climate science. I should also say that many of our comparisons apply to computational science in general, not just climate science, although we used climate modeling for many specific examples.

We ended up discussing three closely related issues:

  1. How do we characterize/distinguish different points in this space (different species of software engineering)? We focussed particularly on how climate modeling is different from other forms of SE, but we also attempted to identify factors that would distinguish other species of SE from one another. We identified lots of contextual factors that seem to matter. We looked for external and internal constraints on the software development project that seem important. External constraints are things like resource limitations, or particular characteristics of customers or the environment where the software must run. Internal constraints are those that are imposed on the software team by itself, for example, choices of working style, project schedule, etc.
  2. Once we’ve identified what we think are important distinguishing traits (or constraints), how do we investigate whether these are indeed salient contextual factors? Do these contextual factors really explain observed differences in SE practices, and if so how? We need to consider how we would determine this empirically. What kinds of study are needed to investigate these contextual factors? How should the contextual factors be taken into account in other empirical studies?
  3. Now imagine we have already characterized this space of species of SE. What measures of software quality attributes (e.g. defect rates, productivity, portability, changeability…) are robust enough to allow us to make valid comparisons between species of SE? Which metrics can be applied in a consistent way across vastly different contexts? And if none of the traditional software engineering metrics (e.g. for quality, productivity, …) can be used for cross-species comparison, how can we do such comparisons?

In my study of the climate modelers at the UK Met Office Hadley centre, I had identified a list of potential success factors that might explain why the climate modelers appear to be successful (i.e. to the extent that we are able to assess it, they appear to build good quality software with low defect rates, without following a standard software engineering process). My list was:

  • Highly tailored software development process – software development is tightly integrated into scientific work;
  • Single Site Development – virtually all coupled climate models are developed at a single site, managed and coordinated at a single site, once they become sufficiently complex [edited – see Bob’s comments below], usually a government lab as universities don’t have the resources;
  • Software developers are domain experts – they do not delegate programming tasks to programmers, which means they avoid the misunderstandings of the requirements common in many software projects;
  • Shared ownership and commitment to quality, which means that the software developers are more likely to make contributions to the project that matter over the long term (in contrast to, say, offshored software development, where developers are only likely to do the tasks they are immediately paid for);
  • Openness – the software is freely shared with a broad community, which means that there are plenty of people examining it and identifying defects;
  • Benchmarking – there are many groups around the world building similar software, with regular, systematic comparisons on the same set of scenarios, through model inter-comparison projects (this trait could be unique – we couldn’t think of any other type of software for which this is done so widely).
  • Unconstrained Release Schedule – as there is no external customer, software releases are unhurried, and occur only when the software is considered stable and tested enough.

At the workshop we identified many more distinguishing traits, any of which might be important:

  • A stable architecture, defined by physical processes: atmosphere, ocean, sea ice, land scheme,…. All GCMs have the same conceptual architecture, and it is unchanged since modeling began, because it is derived from the natural boundaries in physical processes being simulated [edit: I mean the top level organisation of the code, not the choice of numerical methods, which do vary across models – see Bob’s comments below]. This is used as an organising principle both for the code modules, and also for the teams of scientists who contribute code. However, the modelers don’t necessarily derive some of the usual benefits of stable software architectures, such as information hiding and limiting the impacts of code changes, because the modules have very complex interfaces between them.
  • The modules and integrated system each have independent lives, owned by different communities. For example, a particular ocean model might be used uncoupled by a large community, and also be integrated into several different coupled climate models at different labs. The communities who care about the ocean model on its own will have different needs and priorities than each of communities who care about the coupled models. Hence, the inter-dependence has to be continually re-negotiated. Some other forms of software have this feature too: Audris mentioned voice response systems in telecoms, which can be used stand-alone, and also in integrated call centre software; Lionel mentioned some types of embedded control systems onboard ships, where the modules are used indendently on some ships, and as part of a larger integrated command and control system on others.
  • The software has huge societal importance, but the impact of software errors is very limited. First, a contrast: for automotive software, a software error can immediately lead to death, or huge expense, legal liability, etc,  as cars are recalled. What would be the impact of software errors in climate models? An error may affect some of the experiments performed on the model, with perhaps the most serious consequence being the need to withdraw published papers (although I know of no cases where this has happened because of software errors rather than methodological errors). Because there are many other modeling groups, and scientific results are filtered through processes of replication, and systematic assessment of the overall scientific evidence, the impact of software errors on, say, climate policy is effectively nil. I guess it is possible that systematic errors are being made by many different climate modeling groups in the same way, but these wouldn’t be coding errors – they would be errors in the understanding of the physical processes and how best to represent them in a model.
  • The programming language of choice is Fortran, and is unlikely to change for very good reasons. The reasons are simple: there is a huge body of legacy Fortran code, everyone in the community knows and understands Fortran (and for many of them, only Fortran), and Fortran is ideal for much of the work of coding up the mathematical formulae that represent the physics. Oh, and performance matters enough that the overhead of object oriented languages makes them unattractive. Several climate scientists have pointed out to me that it probably doesn’t matter what language they use, the bulk of the code would look pretty much the same – long chunks of sequential code implementing a series of equations. Which means there’s really no push to discard Fortran.
  • Existence and use of shared infrastructure and frameworks. An example used by pretty much every climate model is MPI. However, unlike Fortran, which is generally liked (if not loved), everyone universally hates MPI. If there was something better they would use it. [OpenMP doesn’t seem to have any bigger fanclub]. There are also frameworks for structuring climate models and coupling the different physics components (more on these in a subsequent post). Use of frameworks is an internal constraint that will distinguish some species of software engineering, although I’m really not clear how it will relate to choices of software development process. More research needed.
  • The software developers are very smart people. Typically with PhDs in physics or related geosciences. When we discussed this in the group, we all agreed this is a very significant factor, and that you don’t need much (formal) process with very smart people. But we couldn’t think of any existing empirical evidence to support such a claim. So we speculated that we needed a multi-case case study, with some cases representing software built by very smart people (e.g. climate models, the Linux kernel, Apache, etc), and other cases representing software built by …. stupid people. But we felt we might have some difficulty recruiting subjects for such a study (unless we concealed our intent), and we would probably get into trouble once we tried to publish the results 🙂
  • The software is developed by users for their own use, and this software is mission-critical for them. I mentioned this above, but want to add something here. Most open source projects are built by people who want a tool for their own use, but that others might find useful too. The tools are built on the side (i.e. not part of the developers’ main job performance evaluations) but most such tools aren’t critical to the developers’ regular work. In contrast, climate models are absolutely central to the scientific work on which the climate scientists’ job performance depends. Hence, we described them as mission-critical, but only in a personal kind of way. If that makes sense.
  • The software is used to build a product line, rather than an individual product. All the main climate models have a number of different model configurations, representing different builds from the codebase (rather than say just different settings). In the extreme case, the UK Met Office produces several operational weather forecasting models and several research climate models from the same unified codebase, although this is unusual for a climate modeling group.
  • Testing focuses almost exclusively on integration testing. In climate modeling, there is very little unit testing, because it’s hard to specify an appropriate test for small units in isolation from the full simulation. Instead the focus is on very extensive integration tests, with daily builds, overnight regression testing, and a rigorous process of comparing the output from runs before and after each code change. In contrast, most other types of software engineering focus instead on unit testing, with elaborate test harnesses to test pieces of the software in isolation from the rest of the system. In embedded software, the testing environment usually needs to simulate the operational environment; the most extreme case I’ve seen is the software for the international space station, where the only end-to-end software integration was the final assembly in low earth orbit.
  • Software development activities are completely entangled with a wide set of other activities: doing science. This makes it almost impossible to assess software productivity in the usual way, and even impossible to estimate the total development cost of the software. We tried this as a thought experiment at the Hadley Centre, and quickly gave up: there is no sensible way of drawing a boundary to distinguish some set of activities that could be regarded as contributing to the model development, from other activities that could not. The only reasonable path to assessing productivity that we can think of must focus on time-to-results, or time-to-publication, rather than on software development and delivery.
  • Optimization doesn’t help. This is interesting, because one might expect climate modelers to put a huge amount of effort into optimization, given that century-long climate simulations still take weeks/months on some of the world’s fastest supercomputers. In practice, optimization, where it is done, tends to be an afterthought. The reason is that the model is changed so frequently that hand optimization of any particular model version is not useful. Plus the code has to remain very understandable, so very clever designed-in optimizations tend to be counter-productive.
  • There are very few resources available for software infrastructure. Most of the funding is concentrated on the frontline science (and the costs of buying and operating supercomputers). It’s very hard to divert any of this funding to software engineering support, so development of the software infrastructure is sidelined and sporadic.
  • …and last but not least, A very politically charged atmosphere. A large number of people actively seek to undermine the science, and to discredit individual scientists, for political (ideological) or commercial (revenue protection) reasons. We discussed how much this directly impacts the climate modellers, and I have to admit I don’t really know. My sense is that all of the modelers I’ve interviewed are shielded to a large extend from the political battles (I never asked them about this). Those scientists who have been directly attacked (e.g. MannJonesSanter) tend to be scientists more involved in creation and analysis of datasets, rather than GCM developers. However, I also think the situation is changing rapidly, especially in the last few months, and climate scientists of all types are starting to feel more exposed.

We also speculated about some other contextual factors that might distinguish different software engineering species, not necessarily related to our analysis of computational science software. For example:

  • Existence of competitors;
  • Whether software is developed for single-person-use versus intended for broader user base;
  • Need for certification (and different modes by which certification might be done, for example where there are liability issues, and the need to demonstrate due diligence)
  • Whether software is expected to tolerate and/or compensate for hardware errors. For example, for automotive software, much of the complexity comes from building fault-tolerance into the software because correcting hardware problems introduced in design or manufacture is prohibitively expense. We pondered how often hardware errors occur in supercomputer installations, and whether if they did it would affect the software. I’ve no idea of the answer to the first question, but the second is readily handled by the checkpoint and restart features built into all climate models. Audris pointed out that given the volumes of data being handled (terrabytes per day), there are almost certainly errors introduced in storage and retrieval (i.e. bits getting flipped), and enough that standard error correction would still miss a few. However, there’s enough noise in the data that in general, such things probably go unnoticed, although we speculated what would happen when the most significant bit gets flipped in some important variable.

More interestingly, we talked about what happens when these contextual factors change over time. For example, the emergence of a competitor where there was none previously, or the creation of a new regulatory framework where none existed. Or even, in the case of health care, when change in the regulatory framework relaxes a constraint – such as the recent US healthcare bill, under which it (presumably) becomes easier to share health records among medical professionals if knowledge of pre-existing conditions is no longer a critical privacy concern. An example from climate modeling: software that was originally developed as part of a PhD project intended for use by just one person eventually grows into a vast legacy system, because it turns out to be a really useful model for the community to use. And another: the move from single site development (which is how nearly all climate models were developed) to geographically distributed development, now that it’s getting increasingly hard to get all the necessary expertise under one roof, because of the increasing diversity of science included in the models.

We think there are lots of interesting studies to be done of what happens to the software development processes for different species of software when such contextual factors change.

Finally, we talked a bit about the challenge of finding metrics that are valid across the vastly different contexts of the various software engineering species we identified. Experience with trying to measure defect rates in climate models suggests that it is much harder to make valid comparisons than is generally presumed in the software literature. There really has not been any serious consideration of these various contextual factors and their impact on software practices in the literature, and hence we might need to re-think a lot of the ways in which claims for generality are handled in empirical software engineering studies. We spent some time talking about the specific case of defect measurements, but I’ll save that for a future post.

Nature news runs some very readable articles on climate science, but is unfortunately behind a paywall. Which is a shame because they really should be widely read. Here’s a couple of recent beauties:

The Real Holes in Climate Science, (published 21 Jan 2010) points out that climate change denialists keep repeating long debunked myths about things they believe undermine the science. Meanwhile, in the serious scientific literature, there are some important open questions over real uncertainties in the science (h/t to AH). These are discussed openly in the IPCC reports (see for example, the 59 robust findings and 55 uncertainties listed in section 6 of the Technical Summary for WG1). None of these uncertainties pose a serious challenge to our basic understanding of climate change, but they do prevent absolute certainty about any particular projection. Not only that, many of these uncertainties suggest a strong application of the precautionary principle, because many of them suggest the potential for the IPCC to be underestimating the seriousness of climate change. The Nature News article identifies the following as particularly relevant:

  • Regional predictions. While the global models do a good job of simulating global trends in temperature, they often do poorly on fine-grained regional projections. Geographic features, such as mountain ridges, which mark the boundary of different climatic zones, occur at scales much smaller than the typical grids in GCMs, which means the GCMs get these zonal boundaries wrong, especially when coarse-grain predictions are downscaled.
  • Precipitation. As the IPCC report made clear, many of the models disagree even on the sign of the change in rainfall over much of the globe, especially for winter projections. The differences are due to uncertainties over convection processes. Worryingly, studies of recent trends (published after the IPCC report was compiled)  indicate the models are underestimating precipitation changes, such as the drying of the subtropics.
  • Aerosols. Estimates of the effect on climate from airborne particles (mainly from industrial pollution) vary by an order of magnitude. Some aerosols (e.g. suphates) induce a cooling effect by reflecting sunlight, while others (e.g. black carbon) produce a warming effect by absorbing sunlight. The extent to which these aerosols are masking the warming we’re already ‘owed’ from increased greenhouse gases is hard to determine.
  • Temperature reconstructions prior to the 20th century. The Nature News article discusses at length the issues in the tree ring data used as one of the proxies for reconstructing past temperature records, prior to the instrumental data from the last 150 years. The question of what causes the tree ring data to diverge from instrumental records in recent decades is obviously an interesting question, but to me it seems to be of marginal importance to climate science.

The Climate Machine, (published  24 Feb 2010) describes the Hadley Centre’s HadGEM-2 as an example of the current generation of earth system models, and discusses the challenges of capturing more and more earth systems into the models (h/t to JH). The article quotes many of the modelers I’ve been interviewing about their software development processes. Of particular interest is the discussion about the growing complexity of these models, once other earth systems processes are added: clouds, trees, tundra, land ice, and … pandas (the inclusion of pandas in the models is an in-joke in the modeling community) . There is likely to be a limit to the growth of this complexity, simply because the task of managing the contributions of a growing (and diversifying) group of experts gets harder and harder. The article also points out that one interesting result is likely to be an increase in some uncertainty ranges from these models in the next IPCC report, due to the additional variability introduced from these additional earth system processes.

I would post copies of the full articles, but I’m bound to get takedown emails from Macmillan publishing. But I guess they’re unlikely to object if I respond to emails requesting copies from me for research and education purposes…

I’ve just ordered the book “A Vast Machine: Computer Models, Climate Data, and the Politics of Global Warming” by Paul Edwards. It’s out next month, and I’m looking forward to reading it. I found out about the book a couple of weeks ago, while idly browsing Paul’s website while on the phone with him. What I didn’t realise, until today, is that Spencer Weart’s wonderful account of the history of general circulation models (an absolute must read!), which I’ve dipped into many times, is based originally on Paul’s work. Small world, huh?

This week I’m visiting NCAR, in Colorado.

As it’s my first visit, I’m still blown away by the beauty of the place – both the location and the building itself (which was designed by I M Pei, the architect better known for the Louvre pyramid). I’m hoping to get some time today to visit the hiking trails around the facility.

Anyway, I gave my talk yesterday, and have had many interesting chats with the scientists and the software engineering team working on the Community Climate System Model (CCSM).

My plan is to do a detailed comparison of the software development practices at NCAR with what I saw in my Hadley study. There seem to be more similarities than differences, but three differences that have struck me so far are:

  • the much greater use of multi-site development (which I expected – it is a community model, after all);
  • the fact that each part of the coupled model (ocean, atmosphere, land, sea ice,…) has a distinct stand-alone identity, each with its own release cycle, which means there are some interesting challenges negotiating the (sometimes) conflicting needs for the stand-alone models, versus the CCSM coupled model;
  • a much longer release cycle – years between official releases, compared to the Hadley’ Centre’s four month release cycle.

We speculated that the length of the release cycles might be largely to do with the major uses of the model. At the Hadley Centre, the climate and weather forecasting models are unified in a single code base. The weather forecasters need regular model improvements to meet annual targets for forecast improvements, and also need to make sure they are using a stable, robust model version (never a pre-release experimental one). Hence a short release cycle makes sense. At NCAR, the main driver for official releases is the IPCC assessment process, which operates on a 5-year cycle. Hence, carefully maintained official releases are only needed every few years. Meanwhile, scientists who want a more up-to-date model can play with unreleased experimental versions at their own risk, if they choose. Creating and supporting an official release takes a large software engineering overhead, and the resources just aren’t available to do it very often, in part because funding agencies much prefer to fund the science, rather than the software infrastructure needed to support that science. The lack of resources for software support seems to be a consistent problem across all the modeling centres I’ve visited so far.

A reader writes to me from New Zealand, arguing that climate science isn’t a science at all because there is no possibility to conduct experiments. This misconception appears to be common, even among some distinguished scientists, who presumably have never taken the time to read many published papers in climatology. The misconception arises because people assume that climate science is all about predicting future climate change, and because such predictions are for decades/centuries into the future, and we only have one planet to work with, we can’t check to see if these predictions are correct until it’s too late to be useful.

In fact, predictions of future climate are really only a by-product of climate science. The science itself concentrates on improving our understanding of the processes that shape climate, by analyzing observations of past and present climate, and testing how well we understand them. For example, detection/attribution studies focus on the detection of changes in climate that are outside the bounds of natural variability (using statistical techniques), and determining how much of the change can be attributed to each of a number of possible forcings (e.g. changes in: greenhouse gases, land use, aerosols, solar variation, etc). Like any science, the attribution is done by creating hypotheses about possible effects of each forcing, and then testing those hypotheses. Such hypotheses can be tested by looking for contradictory evidence (e.g. other episodes in the past where the forcing was present or absent, to test how well the hypothesis explains these too). They can also be tested by encoding each hypothesis in a climate model, and checking how well it simulates the observed data.

I’m not a climate modeler, but I have conducted anthropological studies of how how climate modelers work. Climate models are developed slowly and carefully over many years, as scientific instruments. One of the most striking aspects of climate model development is that it is an experimental science in the strongest sense. What do I mean?

Well, a climate model is a detailed theory of some subset of the earth’s physical processes. Like all theories, it is a simplification that focusses on those processes that are salient to a particular set of scientific questions, and approximates or ignores those processes that are less salient. Climate modelers use their models as experimental instruments. They compare the model run with the observational record for some relevant historical period. They then come up with a hypothesis to explain any divergences between the run and the observational record, and make a small improvement to the model that the hypothesis predicts will reduce the divergence. They then run an experiment in which the old version of the model acts as a control, and the new version is the experimental case. By comparing the two runs with the observational record, they determine whether the predicted improvement was achieved (and whether the change messed anything else up in the process). After a series of such experiments, the modelers will eventually either accept the change to the model as an improvement to be permanently incorporated into the model code, or they discard it because the experiments failed (i.e. they failed to give the expected improvement). By doing this day after day, year after year, the models get steadily more sophisticated, and steadily better at simulating real climactic processes.

This experimental approach has another interesting effect: the software appears to be tested much more thoroughly than most commercial software. Whether this actually delivers higher quality code is an interesting question; however, it is clear that the approach is much more thorough than most industry practices for software regression testing.

Interesting article by Andrew Jones entitled Are we taking supercomputing code seriously?:

Part of the problem is that in their rush to do science, scientists fail to spot the software for what it is: the analogue of the experimental instrument. Consequently, it needs to be treated with the same respect that a physical experiment would receive.

Any reputable physical experiment would ensure the instruments are appropriate to the job and have been tested. They would be checked for known error behaviour in the parameter regions of study, and chosen for their ability to give a satisfactory result within a useful timeframe and budget. Those same principles should apply to a software model.

Weather and climate are different. Weather varies tremendously from day to day, week to week, season to season. Climate, on the other hand is average weather over a period of years; it can be thought of as the boundary conditions on the variability of weather. We might get an extreme cold snap, or a heatwave at a particular location, but our knowledge of the local climate tells us that these things are unusual, temporary phenomena, and sooner or later things will return to normal. Forecasting the weather is therefore very different from forecasting changes in the climate. One is an initial value problem, and the other is a boundary value problem. Let me explain.

Good weather forecasts depend upon an accurate knowledge of the current state of the weather system. You gather as much data you can about current temperatures, winds, clouds, etc., feed them all into a simulation model and then run it forward to see what happens. This is hard because the weather is an incredibly complex system. The amount of information needed is huge: both the data and the models are incomplete and error-prone. Despite this, weather forecasting has come a long way over the past few decades. Through a daily process of generating forecasts, comparing them with what happened, and thinking about how to reduce errors, we have incredibly accurate 1- and 3- day temperature forecasts. Accurate forecasts of rain, snow, and so on for a specific location is a little harder because of the chance that the rainfall will be in a slightly different place (e.g a few kilometers away) or a slightly different time than the model forecasts, even if the overall amount of precipitation is right. Hence, daily forecasts give fairly precise temperatures, but put probabilistic values on things like rain (Probability of Precipitation, PoP), based on knowledge of the uncertainty factors in the forecast. The probabilities are known because we have a huge body of previous forecasts to compare with.

The limit on useful weather forecasts seems to be about one week. There are inaccuracies and missing information in the inputs, and the models are only approximations of the real physical processes. Hence, the whole process is error prone. At first these errors tend to be localized, which means the forecast for the short term (a few days) might be wrong in places, but is good enough in most of the region we’re interested in to be useful. But the longer we run the simulation for, the more these errors multiply, until they dominate the computation. At this point, running the simulation for longer is useless. 1-day forecasts are much more accurate than 3-day forecasts, which are better than 5-day forecasts, and beyond that it’s not much better than guessing. However, steady improvements mean that 3-day forecasts are now as accurate as 2-day forecasts were a decade ago. Weather forecasting centres are very serious about reviewing the accuracy of their forecasts, and set themselves annual targets for accuracy improvements.

A number of things help in this process of steadily improving forecasting accuracy. Improvements to the models help, as we get better and better at simulating physical processes in the atmosphere and oceans. Advances in high performance computing help too – faster supercomputers mean we can run the models at a higher resolution, which means we get more detail about where exactly energy (heat) and mass (winds, waves) are moving. But all of these improvements are dwarfed by the improvements we get from better data gathering. If we had more accurate data on current conditions, and could get it into the models faster, we could get big improvements in the forecast quality. In other words, weather forecasting is an “initial value” problem. The biggest uncertainty is knowledge of the initial conditions.

One result of this is that weather forecasting centres (like the UK Met Office) can get an instant boost to forecasting accuracy whenever they upgrade to a faster supercomputer. This is because the weather forecast needs to be delivered to a customer (e.g. a newspaper or TV station) by a fixed deadline. If the models can be made to run faster, the start of the run can be delayed, giving the meteorologists more time to collect newer data on current conditions, and more time to process this data to correct for errors, and so on. For this reason, the national weather forecasting services around the world operate many of the world’s fastest supercomputers.

Hence weather forecasters are strongly biased towards data collection as the most important problem to tackle. They tend to regard computer models as useful, but of secondary importance to data gathering. Of course, I’m generalizing – developing the models is also a part of meteorology, and some meteorologists devote themselves to modeling, coming up with new numerical algorithms, faster implementations, and better ways of capturing the physics. It’s quite a specialized subfield.

Climate science has the opposite problem. Using pretty much the same model as for numerical weather prediction, climate scientists will run the model for years, decades or even centuries of simulation time. After the first few days of simulation, the similarity to any actual weather conditions disappears. But over the long term, day-to-day and season-to-season variability in the weather is constrained by the overall climate. We sometimes describe climate as “average weather over a long period”, but in reality it is the other way round – the climate constrains what kinds of weather we get.

For understanding climate, we no longer need to worry about the initial values, we have to worry about the boundary values. These are the conditions that constraint the climate over the long term: the amount of energy received from the sun, the amount of energy radiated back into space from the earth, the amount of energy absorbed or emitted from oceans and land surfaces, and so on. If we get these boundary conditions right, we can simulate the earth’s climate for centuries, no matter what the initial conditions are. The weather itself is a chaotic system, but it operates within boundaries that keep the long term averages stable. Of course, a particularly weird choice of initial conditions will make the model behave strangely for a while, at the start of a simulation. But if the boundary conditions are right, eventually the simulation will settle down into a stable climate. (This effect is well known in chaos theory: the butterfly effect expresses the idea that the system is very sensitive to initial conditions, and attractors are what cause a chaotic system to exhibit a stable pattern over the long term)

To handle this potential for initial instability, climate modellers create “spin-up” runs: pick some starting state, run the model for say 30 years of simulation, until it has settled down to a stable climate, and then use the state at the end of the spin-up run as the starting point for science experiments. In other words, the starting state for a climate model doesn’t have to match real weather conditions at all; it just has to be a plausible state within the bounds of the particular climate conditions we’re simulating.

To explore the role of these boundary values on climate, we need to know whether a particular combination of boundary conditions keep the climate stable, or tend to change it. Conditions that tend to change it are known as forcings. But the impact of these forcings can be complicated to assess because of feedbacks. Feedbacks are responses to the forcings that then tend to amplify or diminish the change. For example, increasing the input of solar energy to the earth would be a forcing. If this then led to more evaporation from the oceans, causing increased cloud cover, this could be a feedback, because clouds have a number of effects: they reflect more sunlight back into space (because they are whiter than the land and ocean surfaces they cover) and they trap more of the surface heat (because water vapour is a strong greenhouse gas). The first of these is a negative feedback (it reduces the surface warming from increased solar input) and the second is a positive feedback (it increases the surface warming by trapping heat). To determine the overall effect, we need to set the boundary conditions to match what we know from observational data (e.g. from detailed measurements of solar input, measurements of greenhouse gases, etc). Then we run the model and see what happens.

Observational data is again important, but this time for making sure we get the boundary values right, rather than the initial values. Which means we need different kinds of data too – in particular, longer term trends rather than instantaneous snapshots. But this time, errors in the data are dwarfed by errors in the model. If the algorithms are off even by a tiny amount, the simulation will drift over a long climate run, and it stops resembling the earth’s actual climate. For example, a tiny error in calculating where the mass of air leaving one grid square goes could mean we lose a tiny bit of mass on each time step. For a weather forecast, the error is so small we can ignore it. But over a century long climate run, we might end up with no atmosphere left! So a basic test for climate models is that they conserve mass and energy over each timestep.

Climate models have also improved in accuracy steadily over the last few decades. We can now use the known forcings over the last century to obtain a simulation that tracks the temperature record amazingly well. These simulations demonstrate the point nicely. They don’t correspond to any actual weather, but show patterns in both small and large scale weather systems that mimic what the planet’s weather systems actually do over the year (look at August – see the the daily bursts of rainfall in the Amazon, the gulf stream sending rain to the UK all summer long, and the cyclones forming off the coast of Japan by the middle of the month). And these patterns aren’t programmed into the model – it is all driven by sets of equations derived from the basic physics. This isn’t a weather forecast, because on any given day, the actual weather won’t look anything like this. But it is an accurate simulation of typical weather over time (i.e. climate). And, as was the case with weather forecasts, some bits are better than others – for example the Indian monsoons tend to be less well-captured than the North Atlantic Oscillation.

At first sight, numerical weather prediction and climate models look very similar. They model the same phenomena (e.g. how energy moves around the planet via airflows in the atmosphere and currents in the ocean), using the same computational techniques (e.g., three dimensional models of fluid flow on a rotating sphere). And quite often they use the same program code. But the problems are completely different: one is an initial value problem, and one is a boundary value problem.

Which also partly explains why a small minority of (mostly older, mostly male) meteorologists end up being climate change denialists. They fail to understand the difference in the two problems, and think that climate scientists are misusing the models. They know that the initial value problem puts serious limits on our ability to predict the weather, and assume the same limit must prevent the models being used for studying climate. Their experience tells them that weaknesses in our ability to get detailed, accurate, and up-to-date data about current conditions is the limiting factor for weather forecasting, and they assume this limitation must be true of climate simulations too.

Ultimately, such people tend to suffer from “senior scientist” syndrome: a lifetime of immersion in their field gives them tremendous expertise in that field, which in turn causes them to over-estimate how well their expertise transfers to a related field. They can become so heavily invested in a particular scientific paradigm that they fail to understand that a different approach is needed for different problem types. This isn’t the same as the Dunning-Kruger effect, because the people I’m talking about aren’t incompetent. So perhaps we need a new name. I’m going to call it the Dyson-effect, after one of it’s worst sufferers.

I should clarify that I’m certainly not stating that meteorologists in general suffer from this problem (the vast majority quite clearly don’t), nor am I claiming this is the only reason why a meteorologist might be skeptical of climate research. Nor am I claiming that any specific meteorologists (or physicists such as Dyson) don’t understand the difference between initial value and boundary value problems. However, I do think that some scientists’ ideological beliefs tend to bias them to be dismissive of climate science because they don’t like the societal implications, and the Dyson-effect disinclines them to finding out what climate science actually does.

I am, however, arguing that if more people understood this distinction between the two types of problem, we could get past silly soundbites about “we can’t even forecast the weather…” and “climate models are garbage in garbage out”, and have a serious conversation about how climate science works.

Update: Zeke has a more detailed post on the role of parameterizations climate models.