For many decades, computational speed has been the main limit on the sophistication of climate models. Climate modelers have become one of the most demanding groups of users of high performance computing, and access to ever faster machines drives much of the progress, permitting higher resolution models and allowing more earth system processes to be explicitly resolved. But from my visits to NCAR, MPI-M and IPSL this summer, I’m learning that the growth in data volumes is increasingly the dominant factor: today’s models generate so much data that supercomputing facilities find it hard to handle.

Currently, the labs are busy with the CMIP5 runs that will form one of the major inputs to the next IPCC assessment report. See here for a list of the data outputs required from the models (and note that the requirements were last changed on Sept 17, 2010 – well after most centers had started their runs; after all, it will take months to complete the runs, and the target date for submitting the data is the end of this year).

Climate modelers have requirements that are somewhat different from those of most other users of supercomputing facilities anyway:

  • very long runs – e.g. runs that take weeks or even months to complete;
  • frequent stopping and restarting of runs – e.g. a run might be configured to stop once per simulated year, write a restart file, and then restart automatically. This allows intermediate results to be checked and analyzed, and it supports experiments that use multiple model variants initialized from a restart file produced partway through a baseline run;
  • very high volumes of data generated – e.g. the CMIP5 runs currently underway at IPSL generate 6 terabytes per day, rising to 30 terabytes per day during postprocessing. That’s a problem, given that the NEC SX-9 being used for these runs has a 4 terabyte work disk and a 35 terabyte scratch disk, and it’s getting increasingly hard to move the data to the tape archive fast enough (the quick arithmetic below shows why).
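To see why the tape archive becomes the bottleneck, here’s some quick back-of-envelope arithmetic. The daily volume and disk sizes are the IPSL figures quoted above; the sustained tape bandwidth is purely an assumed number for illustration:

```python
# Back-of-envelope: how fast must data be drained to tape to keep up?
TB = 1e12                      # bytes in a terabyte (decimal)

daily_output = 30 * TB         # postprocessed CMIP5 output per day (IPSL figure above)
scratch_disk = 35 * TB         # scratch space on the NEC SX-9
tape_bandwidth = 200e6         # ASSUMED sustained rate to tape: 200 MB/s

seconds_per_day = 86400
rate_needed = daily_output / seconds_per_day
print(f"Sustained rate needed just to keep up: {rate_needed / 1e6:.0f} MB/s")
print(f"Scratch disk fills in {scratch_disk / daily_output:.1f} days if nothing is drained")

days_to_archive_one_day = daily_output / (tape_bandwidth * seconds_per_day)
print(f"At {tape_bandwidth / 1e6:.0f} MB/s, one day's output takes "
      f"{days_to_archive_one_day:.1f} days to archive")
```

In other words, at 30 terabytes per day you need several hundred megabytes per second of sustained archive bandwidth just to stay even, and the scratch disk buys you barely a day of slack.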

Everyone seems to have underestimated the volumes of data generated from these CMIP5 runs. The implication is that data throughput rates are becoming a more important factor than processor speed, which may mean that climate computing centres require a different architecture than most high performance computing centres offer.

Anyway, I was going to write more about the infrastructure needed for this data handling problem, but Bryan Lawrence beat me to it, with his presentation to the NSF cyberinfrastructure “data task force”. He makes excellent points about the (lack of) scalability of the current infrastructure, the social and cultural questions around how people get credit for the work they put into this infrastructure, and the issues of data curation and trust. Which means the danger is that we will create a WORN (write-once, read-never) archive with all this data…!

This will keep me occupied with good reads for the next few weeks – this month’s issue of the journal Studies in History and Philosophy of Modern Physics is a special issue on climate modeling. Here’s the table of contents:

Some very provocative titles there. I’m curious to see how much their observations cohere with my own…

I’ve been meaning to write a summary of the V&V techniques used for Earth System Models (ESMs) for ages, but never quite got round to it. However, I just had to put together a piece for a book chapter, and thought I would post it here to see if folks have anything to add (or argue with).

Verification and Validation for ESMs is hard because running the models is an expensive proposition (a fully coupled simulation run can take weeks to complete), and because there is rarely a “correct” result – expert judgment is needed to assess the model outputs.

However, it is helpful to distinguish between verification and validation, because the former can often be automated, while the latter cannot. Verification tests are objective tests of correctness. These include basic tests (usually applied after each code change) that the model will compile and run without crashing in each of its standard configurations, that a run can be stopped and restarted from the restart files without affecting the results, and that identical results are obtained when the model is run using different processor layouts. Verification would also include the built-in tests for conservation of mass and energy over the global system on very long simulation runs.

In contrast, validation refers to science tests, where subjective judgment is needed. These include tests that the model simulates a realistic, stable climate, given stable forcings, that it matches the trends seen in observational data when subjected to historically accurate forcings, and that the means and variations (e.g. seasonal cycles) are realistic for the main climate variables (e.g. see Phillips et al, 2004).

While there is an extensive literature on the philosophical status of model validation in computational sciences (see for example, Oreskes et al (1994); Sterman (1994); Randall and Wielicki (1997); Stehr (2001)), much of it bears very little relation to practical techniques for ESM validation, and very little has been written on practical testing techniques for ESMs. In practice, testing strategies rely on a hierarchy of standard tests, starting with the simpler ones, and building up to the most sophisticated.

Pope and Davies (2002) give one such sequence for testing atmosphere models:

  • Simplified tests – e.g. reduce 3D equations of motion to 2D horizontal flow (e.g. a shallow water testbed). This is especially useful if the reduction has an analytical solution, or if a reference solution is available. It also permits assessment of relative accuracy and stability over a wide parameter space, and hence is especially useful when developing new numerical routines.
  • Dynamical core tests – test for numerical convergence of the dynamics with physical parameterizations replaced by a simplified physics model (e.g. no topography, no seasonal or diurnal cycle, simplified radiation).
  • Single-column tests – allow testing of individual physical parameterizations in isolation from the rest of the model. A single column of data is used, with horizontal forcing prescribed from observations or from idealized profiles. This is useful for understanding a new parameterization, and for studying the interactions between several parameterizations, but it doesn’t cover interaction with the large-scale dynamics, nor interaction with adjacent grid points. This type of test also depends on the availability of suitable observational datasets.
  • Idealized aquaplanet – test the full atmosphere model on a water-covered planet, with idealized sea-surface temperatures prescribed at all grid points. This allows testing of numerical convergence in the absence of the complications of orography and coastal effects.
  • Uncoupled model components tested against realistic climate regimes – test each model component in stand-alone mode, with a prescribed set of forcings. For example, test the atmosphere on its own, with prescribed sea surface temperatures, sea-ice boundary conditions, solar forcings, and ozone distribution. Statistical tests are then applied to check for realistic mean climate and variability.
  • Double-call tests – run the full coupled model, and test a new scheme by calling both the old and the new scheme at each timestep, with the new scheme’s outputs not fed back into the model. This allows the performance of the new scheme to be assessed in comparison with the old one (a minimal sketch of this pattern follows the list).
  • Spin-up tests – run the full ESM for just a few days of simulation (typically between 1 and 5), starting from an observed state. Such tests are cheap enough that they can be run many times, sampling across the uncertainty in the initial state. The average of a large number of such tests can then be analyzed (Pope and Davies suggest that 60 runs are enough for statistical significance). This allows the results from different schemes to be compared, to explore differences in short-term tendencies.
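To make the double-call pattern a little more concrete, here’s a minimal sketch. Everything in it (the scheme functions, the toy one-variable state) is a hypothetical stand-in; in a real ESM these would be Fortran parameterization routines called from inside the timestep loop:

```python
from dataclasses import dataclass

@dataclass
class State:
    temperature: float          # stand-in for the full model state

def old_scheme(state: State) -> float:
    """Existing parameterization: returns a tendency (hypothetical)."""
    return 0.10 * (300.0 - state.temperature)

def new_scheme(state: State) -> float:
    """Candidate replacement scheme (hypothetical)."""
    return 0.12 * (300.0 - state.temperature)

def run(state: State, n_steps: int = 10):
    diffs = []
    for _ in range(n_steps):
        old_tendency = old_scheme(state)
        new_tendency = new_scheme(state)      # called for diagnosis only
        diffs.append(new_tendency - old_tendency)
        # Only the old scheme's output is fed back, so the model trajectory
        # is unchanged while the new scheme is evaluated alongside it.
        state.temperature += old_tendency
    return diffs

print(run(State(temperature=280.0)))
```

The key point is that the new scheme is evaluated at every timestep under exactly the same conditions as the old one, but its output never changes the trajectory, so the two can be compared cleanly.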

Whenever a code change is made to an ESM, in principle an extensive set of simulation runs is needed to assess whether the change has a noticeable impact on the climatology of the model. This in turn requires a subjective judgment about whether minor variations constitute acceptable differences, or whether they add up to a significantly different climatology.

Because this testing is so expensive, a standard shortcut is to require exact reproducibility for minor changes, which can then be tested quickly through the use of bit comparison tests. These are automated checks over a short run (e.g. a few days of simulation time) that the outputs or restart files of two different model configurations are identical down to the least significant bits. This is useful for checking that a change didn’t break anything it shouldn’t, but requires that each change be designed so that it can be “turned off” (e.g. via run-time switches) to ensure previous experiments can be reproduced. Bit comparison tests can also check that different configurations give identical results. In effect, bit reproducibility over a short run is a proxy for testing that two different versions of the model will give the same climate over a long run. It’s much faster than testing the full simulations, and it catches most (but not all) errors that would affect the model climatology.
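The mechanics of such a check are trivial – the hard part is the discipline around run-time switches and reference runs. Here’s a minimal sketch; the file paths are hypothetical:

```python
import filecmp

def bit_identical(file_a: str, file_b: str) -> bool:
    """True if two model output/restart files are identical byte for byte."""
    return filecmp.cmp(file_a, file_b, shallow=False)

# Hypothetical usage: a short control run versus the same executable with the
# new scheme compiled in but switched off at run time. If the switch really
# does disable the change, the restart files should match exactly.
# assert bit_identical("control/restart_day05", "candidate/restart_day05")
```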

Bit comparison tests do have a number of drawbacks, however, in that they restrict the kinds of change that can be made to the model. Occasionally, bit reproducibility cannot be guaranteed from one version of the model to another, for example when there is a change of compiler, change of hardware, a code refactoring, or almost any kind of code optimization. The decision about whether to insist on bit reproducibility, or whether to allow it to be broken from one version of the model to the next, is a difficult trade-off between flexibility and ease of testing.

A number of simple practices can be used to help improve code sustainability and remove coding errors. These include running the code through multiple compilers, which is effective because different compilers give warnings about different language features, and some allow poor or ambiguous code which others will report. It’s better to identify and remove such problems when they are first introduced, rather than discover later on that it will take months of work to port the code to a new compiler.
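As a rough illustration of the multi-compiler practice, here’s a sketch of a script that compiles the same source file with several compilers and collects their diagnostics. The compiler names, flags and file name are just examples, not any particular lab’s setup:

```python
import subprocess

SOURCE = "radiation_scheme.f90"   # hypothetical source file

# Example compilers and warning flags -- adjust to whatever is installed.
COMPILERS = {
    "gfortran": ["-c", "-Wall", "-Wextra", "-std=f2008", SOURCE],
    "ifort":    ["-c", "-warn", "all", SOURCE],
}

for compiler, args in COMPILERS.items():
    try:
        result = subprocess.run([compiler] + args, capture_output=True, text=True)
    except FileNotFoundError:
        print(f"{compiler}: not installed here, skipping")
        continue
    print(f"=== {compiler} (exit code {result.returncode}) ===")
    print(result.stderr.strip() or "no diagnostics")
```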

Building conservation tests directly into the code also helps. These would typically be part of the coupler, and can check the global mass balance for carbon, water, salt, atmospheric aerosols, and so on. For example, the coupler needs to check that water flowing out of rivers ends up in the ocean, and that the total mass of carbon is conserved as it cycles through atmosphere, oceans, ice, vegetation, and so on. Individual component models sometimes neglect such checks, as the balance isn’t necessarily conserved within a single component. However, for long runs of coupled models, such conservation tests are important.
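Here’s a toy illustration of the kind of budget check a coupler might apply at the end of each coupling step. The variable names, units and tolerance are all assumptions made for the sake of the sketch; real couplers track many more terms in each budget:

```python
import numpy as np

def global_total(field: np.ndarray, cell_area: np.ndarray) -> float:
    """Area-weighted global total of a per-unit-area quantity (e.g. kg/m^2)."""
    return float(np.sum(field * cell_area))

def check_water_budget(water_before, water_after, cell_area,
                       river_inflow, precip_minus_evap, rel_tol=1e-9):
    """Check that the change in total ocean water mass over one coupling step
    is explained by the fluxes passed through the coupler (all in kg)."""
    change = (global_total(water_after, cell_area)
              - global_total(water_before, cell_area))
    expected = river_inflow + precip_minus_evap
    error = abs(change - expected) / max(abs(expected), 1.0)
    if error > rel_tol:
        raise RuntimeError(f"Water budget closes only to {error:.2e}; "
                           "possible leak in the coupling")
```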

Another useful strategy is to develop a verification toolkit for each model component, and for the entire coupled system. These contain a series of standard tests that users of the model can run themselves, on their own platforms, to confirm that the model behaves as it should in their local computational environment. They also provide users with a basic set of tests for local code modifications made for a specific experiment. This practice can help to overcome the tendency of model users to test only the specific physical process they are interested in, while assuming the rest of the model is okay.

During development of model components, informal comparisons with models developed by other research groups can often lead to insights into how to improve the model, and can also help to confirm or pinpoint suspected coding errors. But more importantly, over the last two decades, model intercomparisons have come to play a critical role in improving the quality of ESMs, through a series of formally organised Model Intercomparison Projects (MIPs).

In the early days, these projects focussed on comparisons of the individual components of ESMs, for example, the Atmosphere Model Intercomparison Project (AMIP), which began in 1990 (Gates, 1992). But by the time of the IPCC second assessment report, there was widespread recognition that a more systematic comparison of coupled models was needed, which led to the establishment of the Coupled Model Intercomparison Projects (CMIP), which now play a central role in the IPCC assessment process (Meehl et al, 2000).

For example, CMIP3, which was organized for the fourth IPCC assessment, involved a massive effort by 17 modeling groups from 12 countries with 24 models (Meehl et al, 2007). As of September 2010, the list of MIPs maintained by the World Climate Research Program included 44 different model intercomparison projects (Pirani, 2010).

Model Intercomparison Projects bring a number of important benefits to the modeling community. Most obviously, they bring the community together with a common purpose, and hence increase awareness and collaboration between different labs. More importantly, they require the participants to reach a consensus on a standard set of model scenarios, which often entails some deep thinking about what the models ought to be able to do. Likewise, they require the participants to define a set of standard evaluation criteria, which then act as benchmarks for comparing model skill. Finally, they also produce a consistent body of data representing a large ensemble of model runs, which is then available for the broader community to analyze.

The benefits of these MIPs are consistent with reports of software benchmarking efforts in other research areas. For example, Sim et al (2003) report that when a research community that builds software tools comes together to create benchmarks, it frequently experiences a leap forward in research progress, arising largely from the insights gained from the process of reaching consensus on the scenarios and evaluation criteria to be used in the benchmark. However, the definition of precise evaluation criteria is an important part of the benchmark – without this, the intercomparison project can become unfocussed, with uncertain outcomes and without the huge leap forward in progress (Bueler, 2008).

Another form of model intercomparison is the use of model ensembles (Collins, 2007), which increasingly provide a more robust prediction system than single model runs, but which also play an important role in model validation:

  • Multi-model ensembles – to compare models developed at different labs on a common scenario.
  • Multi-model ensembles using variants of a single model – to compare different schemes for parts of the model, e.g. different radiation schemes.
  • Perturbed physics ensembles – to explore probabilities of different outcomes, in response to systematically varying physical parameters in a single model.
  • Varied initial conditions within a single model – to test the robustness of the model, and to better quantify probabilities for predicted climate change signals.

Last week I attended the workshop in Exeter to lay out the groundwork for building a new surface temperature record. My head is still buzzing with all the ideas we kicked around, and it was a steep learning curve for me because I wasn’t familiar with many of the details (and difficulties) of research in this area. In many ways it epitomizes what Paul Edwards terms “Data Friction” – the sheer complexity of moving data around in the global observing system means there are many points where it needs to be transformed from one form to another, each of which requires people’s energy and time, and, just like real friction, generates waste and slows down the system. (Oh, and some of these data transformations seem to generate a lot of heat too, which rather excites the atoms of the blogosphere).

Which brings us to the reasons the workshop existed in the first place. In many ways, it’s a necessary reaction to the media frenzy over the last year or so around alleged scandals in climate science, in which scientists are supposed to be hiding or fabricating data, and which has allowed the ignoranti to pretend that the whole of climate science is discredited. However, while the nature and pace of the surface temperatures initiative has clearly been given a shot in the arm by this media frenzy, the roots of the workshop go back several years, and have a strong scientific foundation. Quite simply, scientists have recognized for years that we need a more complete and consistent surface temperature record, with much higher temporal resolution than currently exists. Current long-term climatological records are mainly based on monthly summary data, which is inadequate to meet the needs of current climate assessment, particularly the need for a better understanding of the impact of climate change on extreme weather. Most weather extremes don’t show up in the monthly data, because they are shorter term – lasting a few days or even just a few hours. This is not always true, of course; Albert Klein Tank pointed out in his talk that this summer’s heatwave in Moscow occurred mainly within a single calendar month, and hence shows up strongly in the monthly record. But such cases are unusual, and so the worry is that monthly records tend to mask the occurrence of extremes (and hence may conceal trends in extremes).

The opening talks at the workshop also pointed out that the intense public scrutiny puts us in a whole new world, and one that many of the workshop attendees are clearly still struggling to come to terms with. Now, it’s clear that any new temperature record needs to be entirely open and transparent, so that every piece of research based on it could (in principle) be traced all the way back to the basic observational records. To echo the way John Christy put it at the workshop – every step of the research now has to be available as admissible evidence that could stand up in a court of law, because that’s the kind of scrutiny we’re being subjected to. Of course, the problem is that not only isn’t science ready for this (no field of science is anywhere near that transparent), it’s also not currently feasible, given the huge array of data sources being drawn on, the complexities of ownership and access rights, and the expectation that much of the data has high commercial value.

I’ll attempt a summary, but it will be rather long, as I don’t have time to make it any shorter. The slides from the workshop are now all available, and the outcomes from the workshop will be posted soon. The main goals were summarized in Peter Thorne’s opening talk: to create a (longish) list of principles, a roadmap for how to proceed, an identification of any overlapping initiatives so that synergies can be exploited, an agreed method for engaging with broader audiences (including the general public), and an initial governance model.

Did we achieve that? Well, you can skip to the end and see the summary slides, and judge for yourself. Personally, I thought the results were mixed. One obvious problem is that there is no funding on the table for this initiative, and it’s being launched at a time when everyone is cutting budgets, especially in the UK. Which meant that occasionally it felt like we were putting together a Heath Robinson device (Rube Goldberg to you Americans) – cobbling it together out of whatever we could find lying around. Which is ironic, really, given that the major international bodies (e.g. WMO) seem to fully appreciate the importance of this, and the fact that it will be a vital part of our ability to assess the impacts of climate change over the next few decades.

Another problem is that the workshop attendees struggled to reach consensus on some of the most important principles. For example, should the databank be entirely open, or does it need a restricted section? The argument for the latter is that large parts of the source data are not currently open: the various national weather services that collect it charge a fee on a cost-recovery basis, and wish to restrict access to non-commercial uses, because commercial applications are (in some cases) a significant portion of their operating budgets. The problem is that while the monthly data has been shared freely with international partners for many years, the daily and sub-daily records have not, because these are the basis for commercial weather forecasting services. So an insistence on full openness might mean a very incomplete dataset, which then defeats the purpose, as researchers will continue to use other (private) sources for more complete records.

And what about an appropriate licensing model? Some people argued that the data must be restricted to non-commercial uses, because that’s likely to make negotiations with national weather services easier. But others argued that unrestricted licenses should be used, so that the databank can help to lay the foundation for the development of a climate services industry (which would create jobs, and therefore please governments). [Personally, I felt that if governments really want to foster the creation of such an industry, then they ought to show more willingness to invest in this initiative, and until they do, we shouldn’t pander to them. I’d go for a CC BY-NC-SA license myself, but I think I was outvoted]. Again, existing agreements are likely to get in the way: 70% of the European data would not be available if the research-only clause was removed.

There was also some serious disagreement about timelines. Peter outlined a cautious roadmap that focussed on building momentum, and delivering occasional reports and white papers over the next year or so. The few industry folks in the audience (most notably, Amy Luers from Google) nearly choked on their cookies – they’d be rolling out a beta version of the software within a couple of weeks if they were running the project. Quite clearly, as Amy urged in her talk, the project needs to plan for software needs right from the start, release early, prepare for iteration and flexibility, and invest in good visualizations.

Oh, and there wasn’t much agreement on open source software either. The more software-oriented participants (most notably, Nick Barnes from the Climate Code Foundation) argued strongly that all software, including every tool used to process the data at every step of the way, should be available as open source. But for many of the scientists, this represented a huge culture change. There was even some confusion about what open source means (e.g. that ‘open’ and ‘free’ aren’t necessarily the same thing).

On the other hand, some great progress was made in many areas, including identifying many important data services, building on lessons learnt from other large climate and weather data curation efforts, and securing offers of help from many of the international partners (including offers of data from NCDC, NCAR and EURO4M, from across Europe and North America, as well as from Russia, China, Indonesia, and Argentina). There was clear agreement that version control and good metadata are vital, and need to be planned for right from the start, but also that providing full provenance for each data item is an important long-term goal that cannot be a rule from the start, as we will have to build on existing data sources that come with little or no provenance information. Oh, and I was very impressed with the deep thinking and planning around benchmarking for homogenization tools (I’ll blog more on this soon, as it fascinates me).

Oh, and on the size of the task. Estimates of the number of undigitized paper records in the basements of various weather services ran to hundreds of millions of pages. But I still didn’t get a sense of the overall size of the planned databank…

Things I learnt:

  • Steve Worley from NCAR, reflecting on lessons from running ICOADS, pointed out that no matter how careful you think you’ve been, people will end up mis-using the data because they ignore or don’t understand the flags in the metadata.
  • Steve also pointed out that a drawback with open datasets is the proliferation of secondary archives, which then tend to get out of date and mislead users (as they rarely direct users back to the authoritative source).
  • Oh, and the scope of the uses of such data is usually surprisingly large and diverse.
  • Jay Lawrimore, reflecting on lessons from NCDC, pointed out that monthly data and daily and sub-daily data are collected and curated along independent routes, which makes it hard to reconcile them. The station names sometimes don’t match, the lat/long coordinates don’t match (e.g. because of differences in rounding), and the summarized values are similar but don’t match exactly.
  • Another problem is that it’s not always clear exactly which 24-hour period a daily summary refers to (e.g. did they use a local or UTC midnight?). Oh, and this also means that 3- and 6-hour synoptic readings might not match the daily summaries either.
  • Some data doesn’t get transmitted, and so has to be obtained later, even to the point of having to re-key it from emails. Long delays in obtaining some of the data mean the datasets frequently have to be re-released.
  • Personal contacts and workshops in different parts of the world play a surprisingly important role in tracking down some of the harder to obtain data.
  • NCDC runs a service called Datzilla (similar to Bugzilla for software) for recording and tracking reported defects in the dataset.
  • Albert Klein Tank, describing the challenges in regional assessment of climate change and extremes, pointed out that the data requirements for analyzing extreme events are much higher than for assessing global temperature change. For example, we might need to know not just how many days were above 25°C compared to normal, but also how much did it cool off overnight (because heat stress and human health depend much more on overnight relief from the heat).
  • John Christy, introducing the breakout group on data provenance, had some nice examples in his slides of the kinds of paper records they have to deal with, and a fascinating example of a surface station that’s now under a lake, and hence old maps are needed to pinpoint its location.
  • From Michael de Podesta, who insisted on a healthy dose of serious metrology (not to be confused with meteorology): All measurements ought to come with an estimation of uncertainty, and people usually make a mess of this because they confuse accuracy and precision.
  • Uncertainty information isn’t metadata, it’s data. [Oh, and for that matter anything that’s metadata to one community is likely to be data to another. But that’s probably confusing things too much]
  • Oh, and of course, we have to distinguish Type A and Type B uncertainty. Type A is where the uncertainty is describable using statistics, so that collecting bigger samples will reduce it. Type B is where you just don’t know, so that collecting more data cannot reduce the uncertainty.
  • Matt Menne, reflecting on lessons from the GHCN dataset, explained the need for homogenization (which is climatology jargon for removing errors in the observational data that arise because of changes over time in the way the data was measured). Some of the inhomogeneities are due to abrupt changes (e.g. because a recording station was moved, or got a new instrument), and others to gradual changes (e.g. because the environment around a recording station slowly changes, such as gradual urbanization of its location). A toy sketch of detecting such step changes appears just after this list.
  • Matt has lots of interesting examples of inhomogeneities in his slides, including some really nasty ones. For example, a station in Reno, Nevada, was originally in town, and then moved to the airport. There’s a gradual upwards trend in the early part of the record, from an urban heat island effect, and another similar trend in the latter part, after the move, as the airport was also eventually encroached on by urbanisation. But if you correct for both of these, as well as the step change when the station moved, you’re probably over-correcting…
  • which led Matt to suggest the Climate Scientist’s version of the Hippocratic Oath: First, do not flag good data as bad; Then do not make bias adjustments where none are warranted.
  • While criticism from non-standard sources (that’s polite-speak for crazy denialists) is coming faster than any small group can respond to (that’s code for the CRU), useful allies are beginning to emerge, also from the blogosphere, in the form of serious citizen scientists (such as Zeke Hausfather) who do their own careful reconstructions, and help address some of the crazier accusations from denialists. So there’s an important role in building community with such contributors.
  • John Kennedy, talking about homogenization for Sea Surface Temperatures, pointed out that Sea Surface and Land Surface data are entirely different beasts, requiring totally different approaches to homogenization. Why? Because SSTs are collected from buckets on ships, engine intakes on ships, drifting buoys, fixed buoys, and so on. Which means you don’t have long series of observations from a fixed site like you do with land data – every observation might be from a different location!
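Since homogenization came up so often (and since I promised to come back to it), here’s a toy sketch of what detecting a single step change in a station record might look like. Real homogenization algorithms, such as the pairwise neighbour comparisons Matt described, are far more sophisticated; the synthetic data and the crude scoring here are entirely made up for illustration:

```python
import numpy as np

def detect_step_change(series: np.ndarray):
    """Find the split point that maximizes the difference in segment means,
    scaled by the pooled standard deviation. A deliberately crude stand-in
    for real homogenization tests."""
    best_k, best_score = None, 0.0
    for k in range(12, len(series) - 12):          # keep a year on each side
        left, right = series[:k], series[k:]
        pooled_sd = np.sqrt((left.var() + right.var()) / 2.0)
        score = abs(left.mean() - right.mean()) / max(pooled_sd, 1e-9)
        if score > best_score:
            best_k, best_score = k, score
    return best_k, best_score

# Synthetic monthly anomalies with an artificial 0.8 degree jump at month 120,
# standing in for something like the Reno station move described above.
rng = np.random.default_rng(42)
anomalies = rng.normal(0.0, 0.5, 240)
anomalies[120:] += 0.8
print(detect_step_change(anomalies))
```

Even in this toy version you can see Matt’s dilemma: the detector will happily flag a breakpoint, but deciding whether to adjust for it (and by how much) without also removing genuine trends is the hard part.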

Things I hope I managed to inject into the discussion:

  • “solicitation of input from the community at large” is entirely the wrong set of terms for white paper #14. It should be about community building and engagement. It’s never a one-way communication process.
  • Part of the community building should be the support for a shared set of open source software tools for analysis and visualization, contributed by the various users of the data. The aim would be for people to share their tools, and help build on what’s in the collection, rather than having everyone re-invent their own software tools. This could be as big a service to the research community as the data itself.
  • We desperately need a clear set of use cases for the planned data service (e.g. who wants access to which data product, and what other information will they be needing and why?). Such use cases should illustrate what kinds of transparency and traceability will be needed by users.
  • Nobody seems to understand just how much user support will need to be supplied (I think it will be easy for whatever resources are put into this to be overwhelmed, given the scrutiny that temperature records are subjected to these days)…
  • The rate of change in this dataset is likely to be much higher than has been seen in past data curation efforts, given the diversity of sources, and the difficulty of recovering complete data records.
  • Nobody (other than Bryan) seemed to understand that version control will need to be done at a much finer level of granularity than whole datasets, and that really every single data item needs to have a unique label so that it can be referred to in bug reports, updates, etc. Oh and that the version management plan should allow for major and minor releases, given how often even the lowest data products will change, as more data and provenance information is gradually recovered.
  • And of course, the change process itself will be subjected to ridiculous levels of public scrutiny, so the rationale for accepting/rejecting changes and scheduling new releases needs to be clear and transparent. Which means far more attention to procedures and formal change control boards than past efforts have used.
  • I had lots of suggestions about how to manage the benchmarking effort, including planning for the full lifecycle: making sure the creation of the benchmark is really a community consensus-building effort, and planning for the retirement of each benchmark, to avoid the problems of overfitting. Susan Sim wrote an entire PhD on this.
  • I think the databank will need to come with a regularly updated blog, to provide news about what’s happening with the data releases, highlight examples of how it’s being used, explain interesting anomalies, interpret published papers based on the data, etc. A bit like RealClimate. Oh, and with serious moderation of the comment threads to weed out the crazies. Which implies some serious effort is needed.
  • …and I almost but not quite entirely learned how to pronounce the word ‘inhomogeneities’ without tripping over my tongue. I’m just going to call them ‘bugs’.

Update Sept 21, 2010: Some other reports from the workshop.

I’ve mentioned the Clear Climate Code project before, but it’s time to give them an even bigger shout out, as the project is a great example of the kind of thing I’m calling for in my grand challenge paper. The project is building an open source community around the data processing software used in climate science. Their showcase project is an open source Python re-implementation of gistemp, and very impressive it is too.

Now they’ve gone one better, and launched the Climate Code Foundation, a non-profit organisation aimed at “improving the public understanding of climate science through the improvement and publication of climate science software”. The idea is for it to become an umbrella body that will nurture many more open source projects, and promote greater openness of the software tools and data used for the science.

I had a long chat with Nick Barnes, one of the founders of CCF, on the train to Exeter last night, and was very impressed with his enthusiasm and energy. He’s actively seeking more participants, more open source projects for the foundation to support, and of course, for funding to keep the work going. I think this could be the start of something beautiful.

Here’s a question I’ve been asking a few people lately, ever since I asserted that climate models are big expensive scientific instruments: How expensive are we talking about? Unfortunately, it’s almost impossible to calculate. The effort of creating a climate model is tangled up with the scientific research, such that you can’t even reliably determine how much of a particular scientist’s time is “model development” and how much is “doing science”. The problem is that you can’t build the model without a lot of that “doing science” part, because the model is the result of a lot of thinking, experimentation, theory building, testing hypotheses, analyzing simulation results, and discussions with other scientists. Many pieces of the model are based on the equations or empirical results in published research papers; even if you’re not doing the research yourself, you still have to keep up with the literature, understand the state-of-the-art, and know which bits of research are mature enough to incorporate into the model.

So, my first cut, which will be an over-estimate, is that *all* of the effort at a climate modeling lab is necessary to build the model. Labs vary in size, but a typical climate modeling lab is of the order of 200 people (including scientists, technicians, and admin support). And most of the models I’ve looked at have been under steady development for twenty years or more. So, that gives us a starting point of 200*20 = 4,000 person-years. Luckily, most scientists care more about science than salary, so they’re much cheaper than software professionals. Given we’ll have a mix of postdocs and senior scientists, let’s say the average salary would be around $150,000 per year including benefits and other overheads. That’s $600 million.

Oh, and that doesn’t include the costs of equipping and operating a tier-2 supercomputing facility, as the climate model runs will easily keep such a facility fully loaded full time (and we’ll need to factor in the cost of replacing the supercomputer every few years to take advantage of performance increases). In most cases, the supercomputing facilities are shared with other scientific uses of high performance computing. But there is one centre that’s dedicated to climate modeling, the DKRZ in Hamburg, which has an annual budget of around 30 million euros. Let’s pretend euros are dollars, and call that $30 million per year, which over 20 years gives us another $600 million. The latest supercomputer at DKRZ, Blizzard, cost 35 million euros. Let’s say we replace it every five years, and throw in some more money for many terabytes of data storage, and that gets us to around $200 million for hardware.

Grand total: $1.4 billion.

Now, I said that’s an over-estimate. Over lunch today I quizzed some of the experts here at IPSL in Paris, and they thought that 1,000 person-years (50 people per year for 20 years) was a better estimate of the actual model development effort. This seems reasonable – it means that only 1/4 of the research at my 200-person research institute directly contributes to model development; the rest is science that uses the model but isn’t essential for developing it. So, that brings the salary figure down to $150 million. I probably need to do the same conversion for the supercomputing facilities – let’s say about 1/4 of the supercomputing capacity is reserved for model development and testing. That also feels about right: 5-10% of the capacity is reserved for test processes (e.g. the ones that run automatically every day to do the automated build-and-test process), and a further 10-20% might be used for validation runs on development versions of the model.

That brings the grand total down to $350 million.
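For anyone who wants to check my arithmetic, here’s the whole back-of-envelope calculation in one place, using the rough figures quoted above (none of these are audited numbers):

```python
# All inputs are the rough figures from the text above, not audited numbers.
person_years_upper = 200 * 20       # whole lab for 20 years
person_years_revised = 50 * 20      # ~1/4 of the lab doing model development
cost_per_person_year = 150_000      # loaded cost in USD

operations = 30e6 * 20              # DKRZ-style running costs over 20 years
hardware = 200e6                    # machines plus storage over 20 years

upper_bound = person_years_upper * cost_per_person_year + operations + hardware
revised = person_years_revised * cost_per_person_year + 0.25 * (operations + hardware)

print(f"Upper bound: ${upper_bound / 1e9:.1f} billion")    # ~$1.4 billion
print(f"Revised:     ${revised / 1e6:.0f} million")        # ~$350 million
```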

Now, it has been done for less than this. For example, the Canadian Climate Centre, CCCma, has a modeling team one tenth this size, although they do share a lot of code with the Canadian Meteorological Service. And their model isn’t as full-featured as some of the other GCMs (it also has a much smaller user base). As with other software projects, the costs don’t scale linearly with functionality: a team of 5 software developers can achieve much more than 1/10th of what a team of 50 can (cf The Mythical Man Month). Oh, and the computing costs won’t come down much at all – the CCCma model is no more efficient than other models. So we’re still likely to be above the $100 million mark.

Now, there are probably other ways of figuring it – so far we’ve only looked at the total cumulative investment in one of today’s world-leading climate models. What about replacement costs? If we had to build a new model from scratch, using what we already know (rather than doing all the research over again), how much would that cost? Well, nobody has ever done this, but there are a few experiences we could draw on. For example, the Max Planck Institute has been developing a new model from scratch, ICON, which uses an icosahedral grid and hence needs a new approach to the dynamics. The project has been going for 8 years. It started with just a couple of people, and has ramped up to about a dozen. But they’re still a long way from being done, and they’re re-using a lot of the physics code from their old model, ECHAM. On the other hand, it’s an entirely new approach to the grid structure, so a lot of the early work was pure research.

Where does that leave us? It’s really a complete guess, but I would suggest a team of 10 people (half of them scientists, half scientific programmers) could re-implement the old model from scratch (including all the testing and validation) in around 5 years. Unfortunately, climate science is a fast moving field. What we’d get at the end of 5 years is a model that, scientifically speaking, is 5 years out of date. Unless of course we also paid for a large research effort to bring the latest science into the model while we were constructing it, but then we’re back where we started. I think this means you can’t replace a state-of-the-art climate model for much less than the original development costs.

What’s the conclusion? The bottom line is that the development cost of a climate model is in the hundreds of millions of dollars.

Here’s a whole set of things I can’t make it to. The great thing about being on sabbatical is the ability to travel, visit different labs, and so on. The downside is that there are far more interesting places and events than I can possibly make it to, and many of them clash. Here’s some I won’t be able to make it to this fall:

I’m pleased to see that my recent paper, “Climate Change: A Software Grand Challenge” is getting some press attention. However, I’m horrified to see how it’s been distorted in the echo chamber of the media. Danny Bradbury, writing in the Guardian, gives his piece the headline “Climate scientists should not write their own software, says researcher“. Aaaaaaargh! Nooooo! That’s the exact opposite of what I would say!

Our research shows that earth system models, the workhorses of climate science, appear to have very few bugs, and produce remarkably good simulations of past climate. One of the most important success factors is that the code is written by the scientists themselves, as they understand the domain inside out. Now, of course, this leads to other problems, for instance the code is hard to understand, and hard to modify. And the job of integrating the various components of the models is really hard. But there are no obvious solutions to fix this without losing this hands-on relationship between the scientists and the code. Handing the code development over to software professionals is likely to be a disaster.

I’ve posted a comment on Bradbury’s article, but I have very little hope he’ll alter the headline, as it obviously plays into a storyline that’s popular with denialists right now (see update, below).

Some other reports:

Update (2/9/10): Well that’s a delight! I just got off the overnight train to Paris, and discover that Danny has commented here, and wants to put everything right, and has already corrected the headline in the BusinessGreen version. So, apologies to Danny for doubting him, and also, thanks for restoring my faith in journalism. As is clear in some of the comments, it’s easy to see how one might draw the conclusion that climate scientists shouldn’t write their own code from a reading of my paper. It’s a subtle point, so I probably need to write a longer piece on this to explain…

Update #2 (later that same day): And now the Guardian headline has been changed too. Victory for honest journalism!