I took a break from blogging for the last few weeks to take a vacation with the family in Europe. We fell in love with Venice, a city full of charming alleyways and canals, with no wheeled transport of any kind. Part of the charm is the dilapidated, medieval feel to the place – the buildings are subsiding, their facades are crumbling, and most of the city’s infrastructure doesn’t work very well. In fact, given what a sorry state the whole city is in, I’m surprised how much I fell in love with the place.

But one thing we didn’t expect was that Venice flooded while we were there. It turns out that several times a year, particularly during the high spring and autumn tides, the meteorological conditions are such that more water than usual is driven into the lagoon, and the high tide washes over the canal sides, across the sidewalks, and into the houses and shops:

The locals all take this in their stride, don their long boots, and get to work pumping it all out of the buildings again. The tourists stand there looking bewildered. But the kids loved it:

High tide in Venice, Oct 5th, 2010

In Venice, it’s just another Acqua Alta. I’d heard about Venice sinking, given that the buildings sit on wooden rafts, which in turn are supported by wooden pillars driven deep into the soft mud on the lagoon bottom. And of course, I know that sea level rise due to global warming threatens many of the world’s coastal cities. But I didn’t realise just how low Venice really is, and the flooding we saw got me thinking again about whether the future is already here.

The last IPCC report forecasts a rise of up to 59cm in sea level by the end of this century, due to thermal expansion and melting glaciers. And as we know, the IPCC numbers exclude the contribution of the Greenland and Antarctic ice sheets, which together could add considerably more, and the report also fudges the point that sea level rise won’t magically stop in 2100. Which means that Venice, a city that’s around 1500 years old, is very unlikely to survive into the twenty-second century.

But like all attempts to pin down the impacts of climate change, it gets complicated. It turns out that Acqua Alta isn’t a recent thing – it has occurred throughout Venice’s history. Technically, Acqua Alta occurs when the high tide is more than 90cm above the average sea level (actually, the average as measured in 1897, according to Wikipedia). In the foreign media, floods in Venice are typically portrayed in breathless terms as a disaster (see HuffPost for some dramatic photos from last winter). The locals don’t see it that way at all, and get furious at these media reports, as they damage the tourist trade on which Venice depends almost entirely.

One problem is that the media reports confuse the measures. As I said, the floods are measured in terms of height above an 1897 sea level average. These days, even the low tides are often above this baseline. Here’s the forecast for the next 48 hours:

Venice tides for Oct 29-31, 2010, from http://www.comune.venezia.it/flex/cm/pages/ServeBLOB.php/L/IT/IDPagina/1748

As you can see, the sea level is expected to vary between low tides around 0cm and high tides in the range 50-75cm, which is classified as normal for Venice. A high tide of up to 90cm causes almost no flooding, while one of +150cm floods about 2/3 of the city – this happens once every few years. The confusion in the media is that +150cm is about 5 feet, so the papers duly report Venice as being under 5 feet of water. But really the water is rarely more than ankle deep, as the flood depth is only the difference between the height of the canal sides and the high water level. On the day we took these photos (5th Oct 2010), the high tide reached about 107cm, enough to flood about 14% of the city, but as you can see, the actual flood is only a few centimeters deep.
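To make the relationship between tide heights and flooding a little more concrete, here’s a rough sketch that interpolates between the three data points quoted above (~90cm, ~107cm and ~150cm); the curve between those points is purely illustrative:

```python
# A rough sketch, interpolating between the flood levels quoted above:
# ~90cm -> almost no flooding, ~107cm -> ~14% of the city, ~150cm -> ~2/3.
# Only these three anchor points come from the post; the curve between them
# is purely illustrative.
import numpy as np

tide_cm = np.array([90.0, 107.0, 150.0])        # tide height above the 1897 baseline
flooded_fraction = np.array([0.01, 0.14, 0.67]) # approximate fraction of the city flooded

def estimate_flooded_fraction(tide):
    """Crude linear interpolation between the quoted data points."""
    return float(np.interp(tide, tide_cm, flooded_fraction))

for tide in (95, 107, 120, 150):
    print(f"+{tide}cm ({tide / 30.48:.1f} ft above the 1897 baseline): "
          f"~{estimate_flooded_fraction(tide):.0%} of the city flooded")
```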

But a sea level rise of +50cm due to climate change shifts things so that every high tide will flood a significant proportion of the city. Flooding twice a day throughout the year is a very different proposition from a little light flooding a few times in the spring and fall.

Can Venice be saved? MOSE, a large and controversial flood barrier project, has been under construction for the last few years, and is anticipated to be ready by 2012. It aims to protect Venice with automatic flood barriers around the entrance to the lagoon. The project has been severely criticized for its high cost, for its impact on the lagoon ecosystems, and because it doesn’t provide an incremental solution – if sea levels continue to rise, they will overwhelm the barriers, and there’s no obvious way to extend them. The design for the barriers is based on the IPCC projections of up to 60cm of sea level rise (although I haven’t been able to find any detailed specifications of exactly what height of tide they will work for). The problem is, if the IPCC reports underestimate sea level rise (and increasingly it looks like they do), then a vast multi-billion dollar project will only buy Venice a few more decades. The techno-optimism of the engineers who designed MOSE seems to be symptomatic of a broader mindset when it comes to climate change, which says we can just invent our way out of the problem. It would be nice if that were correct, but based on the science, I wouldn’t bet on it.

For many decades, computational speed has been the main limit on the sophistication of climate models. Climate modelers have become one of the most demanding groups of users of high performance computing, and access to faster and faster machines drives much of the progress, permitting higher resolution models and allowing more earth system processes to be explicitly resolved. But from my visits to NCAR, MPI-M and IPSL this summer, I’m learning that growth in the volume of data handled is increasingly the dominant factor. The volume of data generated by today’s models has grown so much that supercomputing facilities find it hard to handle.

Currently, the labs are busy with the CMIP5 runs that will form one of the major inputs to the next IPCC assessment report. See here for a list of the data outputs required from the models (and note that the requirements were last changed on Sept 17, 2010 – well after most centers had started their runs; after all, it takes months to complete the runs, and the target date for submitting the data is the end of this year).

Climate modelers have requirements that are somewhat different from most other users of supercomputing facilities anyway:

  • very long runs – e.g. runs that take weeks or even months to complete;
  • frequent stop and restart of runs – e.g. the runs might be configured to stop once per simulated year, at which point they generate a restart file, and then automatically restart, so that intermediate results can be checked and analyzed, and because some experiments make use of multiple model variants, initialized from a restart file produced partway through a baseline run.
  • very high volumes of data generated – e.g. the CMIP5 runs currently underway at IPSL generate 6 terabytes per day, and in postprocessing this goes up to 30 terabytes per day. Which is a problem, given that the NEC SX-9 being used for these runs has a 4 terabyte work disk and a 35 terabyte scratch disk. It’s getting increasingly hard to move the data to the tape archive fast enough (see the back-of-envelope sketch below).
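Here’s the back-of-envelope arithmetic, using the IPSL figures above:

```python
# Back-of-envelope calculation using the IPSL CMIP5 figures quoted above.
TB = 1.0  # work in terabytes

postprocessed_per_day = 30 * TB   # post-processed output per day
scratch_disk = 35 * TB            # NEC SX-9 scratch disk

# How long before the scratch disk fills if nothing is moved to tape?
days_to_fill = scratch_disk / postprocessed_per_day
print(f"Scratch disk fills in about {days_to_fill:.1f} days")

# Sustained bandwidth needed just to keep up with post-processed output:
bandwidth_MB_per_s = postprocessed_per_day * 1e6 / (24 * 3600)  # 1 TB ~ 1e6 MB
print(f"Need to sustain roughly {bandwidth_MB_per_s:.0f} MB/s to the tape archive")
```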

Everyone seems to have underestimated the volumes of data generated from these CMIP5 runs. The implication is that data throughput rates are becoming a more important factor than processor speed, which may mean that climate computing centres require a different architecture than most high performance computing centres offer.

Anyway, I was going to write more about the infrastructure needed for this data handling problem, but Bryan Lawrence beat me to it, with his presentation to the NSF cyberinfrastructure “data task force”. He makes excellent points about the (lack of) scalability of the current infrastructure, the social and cultural questions of how people get credit for the work they put into it, and the issues of data curation and trust. Which means the danger is that we will create a WORN (write-once, read-never) archive with all this data…!

This will keep me occupied with good reads for the next few weeks – this month’s issue of the journal Studies in History and Philosophy of Modern Physics is a special issue on climate modeling. Here’s the table of contents:

Some very provocative titles there. I’m curious to see how well their observations cohere with my own…

I’ve been meaning to write a summary of the V&V techniques used for Earth System Models (ESMs) for ages, but never quite got round to it. However, I just had to put together a piece for a book chapter, and thought I would post it here to see if folks have anything to add (or argue with).

Verification and Validation for ESMs is hard because running the models is an expensive proposition (a fully coupled simulation run can take weeks to complete), and because there is rarely a “correct” result – expert judgment is needed to assess the model outputs.

However, it is helpful to distinguish between verification and validation, because the former can often be automated, while the latter cannot. Verification tests are objective tests of correctness. These include basic tests (usually applied after each code change) that the model will compile and run without crashing in each of its standard configurations, that a run can be stopped and restarted from the restart files without affecting the results, and that identical results are obtained when the model is run using different processor layouts. Verification would also include the built-in tests for conservation of mass and energy over the global system on very long simulation runs.
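To give a flavour of how the first two of these checks might be automated, here’s a minimal sketch written as a pytest-style test module. The driver script, configuration names and file names are hypothetical stand-ins, not taken from any real model:

```python
# A minimal sketch of automating two basic verification checks. The driver
# script "./run_model.sh", the configuration names, and the file names are
# hypothetical stand-ins.
import filecmp
import subprocess
from pathlib import Path

STANDARD_CONFIGS = ["atmos_only", "ocean_only", "fully_coupled"]  # made-up names

def run(config: str, outdir: Path, days: int = 5, restart_from: Path = None):
    """Run the model for a few simulated days; raises if the run crashes."""
    cmd = ["./run_model.sh", config, "--days", str(days), "--outdir", str(outdir)]
    if restart_from is not None:
        cmd += ["--restart-from", str(restart_from)]
    subprocess.run(cmd, check=True)

def test_compiles_and_runs(tmp_path: Path):
    # The model should complete a short run in each standard configuration.
    for config in STANDARD_CONFIGS:
        run(config, tmp_path / config)

def test_restart_reproducibility(tmp_path: Path):
    # A 10-day run should be bit-identical to 5 days + restart + 5 more days.
    run("fully_coupled", tmp_path / "continuous", days=10)
    run("fully_coupled", tmp_path / "first_half", days=5)
    run("fully_coupled", tmp_path / "second_half", days=5,
        restart_from=tmp_path / "first_half" / "restart.nc")
    assert filecmp.cmp(tmp_path / "continuous" / "output.nc",
                       tmp_path / "second_half" / "output.nc", shallow=False)
```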

In contrast, validation refers to science tests, where subjective judgment is needed. These include tests that the model simulates a realistic, stable climate given stable forcings, that it matches the trends seen in observational data when subjected to historically accurate forcings, and that the means and variations (e.g. seasonal cycles) are realistic for the main climate variables (e.g. see Phillips et al., 2004).
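Although the final judgment is subjective, the comparison usually rests on simple objective statistics. Here’s a sketch of the kind of metrics involved, comparing a simulated seasonal cycle against an observational climatology; the numbers are invented purely for illustration:

```python
# Compare a simulated seasonal cycle against an observational climatology.
# The monthly values below are invented purely to illustrate the metrics.
import numpy as np

model_clim = np.array([12.1, 12.3, 12.9, 13.8, 14.9, 15.8,
                       16.2, 16.0, 15.2, 14.1, 13.0, 12.3])  # deg C, Jan-Dec
obs_clim   = np.array([12.0, 12.1, 12.7, 13.6, 14.8, 15.9,
                       16.4, 16.2, 15.3, 14.2, 13.0, 12.2])

bias = float(np.mean(model_clim - obs_clim))
rmse = float(np.sqrt(np.mean((model_clim - obs_clim) ** 2)))
amplitude_error = float((model_clim.max() - model_clim.min())
                        - (obs_clim.max() - obs_clim.min()))

print(f"annual-mean bias: {bias:+.2f} C")
print(f"seasonal-cycle RMSE: {rmse:.2f} C")
print(f"seasonal amplitude error: {amplitude_error:+.2f} C")
```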

While there is an extensive literature on the philosophical status of model validation in computational sciences (see for example, Oreskes et al (1994); Sterman (1994); Randall and Wielicki (1997); Stehr (2001)), much of it bears very little relation to practical techniques for ESM validation, and very little has been written on practical testing techniques for ESMs. In practice, testing strategies rely on a hierarchy of standard tests, starting with the simpler ones, and building up to the most sophisticated.

Pope and Davies (2002) give one such sequence for testing atmosphere models:

  • Simplified tests – e.g. reduce 3D equations of motion to 2D horizontal flow (e.g. a shallow water testbed). This is especially useful if the reduction has an analytical solution, or if a reference solution is available. It also permits assessment of relative accuracy and stability over a wide parameter space, and hence is especially useful when developing new numerical routines.
  • Dynamical core tests – test for numerical convergence of the dynamics with physical parameterizations replaced by a simplified physics model (e.g. no topography, no seasonal or diurnal cycle, simplified radiation).
  • Single-column tests – allow testing of individual physical parameterizations separately from the rest of the model. A single column of data is used, with horizontal forcing prescribed from observations or from idealized profiles. This is useful for understanding a new parameterization, and for comparing interactions between several parameterizations, but doesn’t cover interaction with large-scale dynamics, nor interaction with adjacent grid points. This type of test also depends on the availability of observational datasets.
  • Idealized aquaplanet – test the fully coupled atmosphere-ocean model, but with idealized sea-surface temperatures at all grid points. This allows for testing of numerical convergence in the absence of complications of orography and coastal effects.
  • Uncoupled model components tested against realistic climate regimes – test each model component in stand-alone mode, with a prescribed set of forcings. For example, test the atmosphere on its own, with prescribed sea surface temperatures, sea-ice boundary conditions, solar forcings, and ozone distribution. Statistical tests are then applied to check for realistic mean climate and variability.
  • Double-call tests. Run the full coupled model, and test a new scheme by calling both the old and new schemes at each timestep, but with the new scheme’s outputs not fed back into the model. This allows assessment of the performance of the new scheme in comparison with the older one (a minimal sketch of this pattern follows this list).
  • Spin-up tests. Run the full ESM for just a few days (typically between 1 and 5 days of simulation time), starting from an observed state. Such tests are cheap enough that they can be run many times, sampling across the initial state uncertainty. Then the average of a large number of such tests can be analyzed (Pope and Davies suggest that 60 is enough for statistical significance). This allows the results from different schemes to be compared, to explore differences in short term tendencies.
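As promised above, here’s a toy sketch of the double-call pattern. The two “schemes” are trivial stand-ins for real parameterizations; the point is just that the new scheme is evaluated every timestep without being allowed to affect the evolving model state:

```python
# Toy illustration of the double-call pattern: evaluate the new scheme each
# timestep alongside the old one, record the difference, but only feed the
# old scheme's tendency back into the model state. Both "schemes" here are
# trivial stand-ins for real parameterization code.
import numpy as np

def old_scheme(state):
    return -0.05 * (state - 280.0)   # relax toward 280 K (stand-in physics)

def new_scheme(state):
    return -0.052 * (state - 280.0)  # candidate replacement, slightly stronger

state = np.full(100, 300.0)          # toy temperature field (K)
diffs = []
for step in range(240):
    tendency_old = old_scheme(state)
    tendency_new = new_scheme(state)             # computed, but not fed back
    diffs.append(float(np.mean(np.abs(tendency_new - tendency_old))))
    state = state + tendency_old                 # only the old scheme drives the run

print(f"mean absolute difference between schemes: {np.mean(diffs):.4f} K/step")
```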

Whenever a code change is made to an ESM, in principle an extensive set of simulation runs is needed to assess whether the change has a noticeable impact on the climatology of the model. This in turn requires a subjective judgment about whether minor variations constitute acceptable variations, or whether they add up to a significantly different climatology.

Because this testing is so expensive, a standard shortcut is to require exact reproducibility for minor changes, which can then be tested quickly through the use of bit comparison tests. These are automated checks over a short run (e.g. a few days of simulation time) that the outputs or restart files of two different model configurations are identical down to the least significant bits. This is useful for checking that a change didn’t break anything it shouldn’t, but requires that each change be designed so that it can be “turned off” (e.g. via run-time switches) to ensure previous experiments can be reproduced. Bit comparison tests can also check that different configurations give identical results. In effect, bit reproducibility over a short run is a proxy for testing that two different versions of the model will give the same climate over a long run. It’s much faster than testing the full simulations, and it catches most (but not all) errors that would affect the model climatology.
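For concreteness, here’s a sketch of what such a check might look like when the model output is written as NetCDF, reporting which variables (if any) fail to match exactly, rather than just giving a pass/fail answer. The file names are hypothetical:

```python
# Bit comparison of two short runs: report which variables differ at all.
# Assumes NetCDF output; the file names are hypothetical.
import numpy as np
from netCDF4 import Dataset

def bit_compare(file_a, file_b):
    """Return the names of variables whose values are not exactly identical."""
    differing = []
    with Dataset(file_a) as a, Dataset(file_b) as b:
        for name in a.variables:
            if name not in b.variables:
                differing.append(name)
                continue
            va = np.asarray(a.variables[name][:])
            vb = np.asarray(b.variables[name][:])
            # array_equal demands exact equality, so any difference in the
            # least significant bits is caught here.
            if va.shape != vb.shape or not np.array_equal(va, vb):
                differing.append(name)
    return differing

diffs = bit_compare("control_run.nc", "modified_run.nc")
print("bit-identical" if not diffs else f"differences in: {diffs}")
```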

Bit comparison tests do have a number of drawbacks, however, in that they restrict the kinds of change that can be made to the model. Occasionally, bit reproducibility cannot be guaranteed from one version of the model to another, for example when there is a change of compiler, change of hardware, a code refactoring, or almost any kind of code optimization. The decision about whether to insist on bit reproducibility, or whether to allow it to be broken from one version of the model to the next, is a difficult trade-off between flexibility and ease of testing.

A number of simple practices can be used to help improve code sustainability and remove coding errors. These include running the code through multiple compilers, which is effective because different compilers give warnings about different language features, and some allow poor or ambiguous code which others will report. It’s better to identify and remove such problems when they are first introduced, rather than discover later on that it will take months of work to port the code to a new compiler.

Building conservation tests directly into the code also helps. These would typically be part of the coupler, and can check the global mass balance for carbon, water, salt, atmospheric aerosols, and so on. For example, the coupler needs to check that water flowing from rivers enters the ocean, and that the total mass of carbon is conserved as it cycles through atmosphere, oceans, ice, vegetation, and so on. Individual component models sometimes neglect such checks, as the quantity isn’t necessarily conserved within a single component. However, for long runs of coupled models, such conservation tests are important.
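Here’s a minimal sketch of what such a check might look like for the global water budget. The reservoir names, numbers and tolerance are illustrative, not taken from any particular coupler:

```python
# Sketch of a global conservation check a coupler might apply after each
# coupling step: total water across all reservoirs should be unchanged,
# apart from any net flux across the system boundary. All names, numbers,
# and the tolerance are illustrative.
def check_water_conservation(before, after, boundary_flux=0.0, tol=1e-9):
    """before/after: dicts of reservoir name -> total water mass (kg)."""
    total_before = sum(before.values())
    total_after = sum(after.values())
    imbalance = total_after - (total_before + boundary_flux)
    relative_error = abs(imbalance) / total_before
    if relative_error > tol:
        raise RuntimeError(f"water not conserved: imbalance {imbalance:.3e} kg "
                           f"(relative error {relative_error:.2e})")
    return relative_error

before = {"atmosphere": 1.3e16, "ocean": 1.4e21, "land": 2.3e16, "ice": 2.6e19}
after  = {"atmosphere": 1.3e16, "ocean": 1.4e21, "land": 2.3e16, "ice": 2.6e19}
print("relative error:", check_water_conservation(before, after))
```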

Another useful strategy is to develop a verification toolkit for each model component, and for the entire coupled system. These contain a series of standard tests which users of the model can run themselves, on their own platforms, to confirm that the model behaves in the way it should in the local computation environment. They also provide the users with a basic set of tests for local code modifications made for a specific experiment. This practice can help to overcome the tendency of model users to test only the specific physical process they are interested in, while assuming the rest of the model is okay.

During development of model components, informal comparisons with models developed by other research groups can often lead to insights into how to improve the model, and can also help confirm or identify suspected coding errors. But more importantly, over the last two decades, model intercomparisons have come to play a critical role in improving the quality of ESMs, through a series of formally organised Model Intercomparison Projects (MIPs).

In the early days, these projects focussed on comparisons of the individual components of ESMs, for example, the Atmosphere Model Intercomparison Project (AMIP), which began in 1990 (Gates, 1992). But by the time of the IPCC second assessment report, there was a widespread recognition that a more systematic comparison of coupled models was needed, which led to the establishment of the Coupled Model Intercomparison Projects (CMIP), which now play a central role in the IPCC assessment process (Meehl et al, 2000).

For example, CMIP3, which was organized for the fourth IPCC assessment, involved a massive effort by 17 modeling groups from 12 countries with 24 models (Meehl et al, 2007). As of September 2010, the list of MIPs maintained by the World Climate Research Program included 44 different model intercomparison projects (Pirani, 2010).

Model Intercomparison Projects bring a number of important benefits to the modeling community. Most obviously, they bring the community together with a common purpose, and hence increase awareness and collaboration between different labs. More importantly, they require the participants to reach a consensus on a standard set of model scenarios, which often entails some deep thinking about what the models ought to be able to do. Likewise, they require the participants to define a set of standard evaluation criteria, which then act as benchmarks for comparing model skill. Finally, they also produce a consistent body of data representing a large ensemble of model runs, which is then available for the broader community to analyze.

The benefits of these MIPs are consistent with reports of software benchmarking efforts in other research areas. For example, Sim et al (2003) report that when a research community that builds software tools comes together to create benchmarks, it frequently experiences a leap forward in research progress, arising largely from the insights gained from the process of reaching consensus on the scenarios and evaluation criteria to be used in the benchmark. However, the definition of precise evaluation criteria is an important part of the benchmark – without this, the intercomparison project can become unfocussed, with uncertain outcomes and without the huge leap forward in progress (Bueler, 2008).

Another form of model intercomparison is the use of model ensembles (Collins, 2007), which increasingly provide a more robust prediction system than single model runs, but which also play an important role in model validation:

  • Multi-model ensembles – to compare models developed at different labs on a common scenario.
  • Multi-model ensembles using variants of a single model – to compare different schemes for parts of the model, e.g. different radiation schemes.
  • Perturbed physics ensembles – to explore probabilities of different outcomes, in response to systematically varying physical parameters in a single model (see the sketch after this list).
  • Varied initial conditions within a single model – to test the robustness of the model, and to better quantify probabilities for predicted climate change signals.
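As a rough sketch of the perturbed physics idea mentioned above, here’s how an ensemble of model configurations might be generated by sampling a few uncertain parameters. The parameter names and ranges are purely illustrative:

```python
# Generate a small perturbed physics ensemble by sampling uncertain
# parameters within plausible ranges. The parameter names and ranges here
# are purely illustrative.
import random

PARAM_RANGES = {
    "entrainment_rate": (0.5, 2.0),
    "ice_fall_speed":   (0.5, 2.0),
    "critical_rh":      (0.6, 0.9),
}

def make_ensemble(n_members, seed=42):
    rng = random.Random(seed)   # fixed seed so the ensemble is reproducible
    members = []
    for i in range(n_members):
        params = {name: round(rng.uniform(lo, hi), 3)
                  for name, (lo, hi) in PARAM_RANGES.items()}
        members.append({"member_id": f"pp{i:03d}", **params})
    return members

for member in make_ensemble(5):
    print(member)
```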

Last week I attended the workshop in Exeter to lay out the groundwork for building a new surface temperature record. My head is still buzzing with all the ideas we kicked around, and it was a steep learning curve for me because I wasn’t familiar with many of the details (and difficulties) of research in this area. In many ways it epitomizes what Paul Edwards terms “Data Friction” – the sheer complexity of moving data around in the global observing system means there are many points where it needs to be transformed from one form to another, each of which requires people’s energy and time, and, just like real friction, generates waste and slows down the system. (Oh, and some of these data transformations seem to generate a lot of heat too, which rather excites the atoms of the blogosphere).

Which brings us to the reasons the workshop existed in the first place. In many ways, it’s a necessary reaction to the media frenzy over the last year or so around alleged scandals in climate science, in which scientists are supposed to be hiding or fabricating data, which has allowed the ignoranti to pretend that the whole of climate science is discredited. However, while the nature and pace of the surface temperatures initiative has clearly been given a shot in the arm by this media frenzy, the roots of the workshop go back several years, and have a strong scientific foundation. Quite simply, scientists have recognized for years that we need a more complete and consistent surface temperature record, with a much higher temporal resolution than currently exists. Current long term climatological records are mainly based on monthly summary data, which is inadequate to meet the needs of current climate assessment, particularly the need for better understanding of the impact of climate change on extreme weather. Most weather extremes don’t show up in the monthly data, because they are shorter term – lasting for a few days or even just a few hours. This is not always true of course; Albert Klein Tank pointed out in his talk that this summer’s heatwave in Moscow occurred mainly within a single calendar month, and hence shows up strongly in the monthly record. But in general that is unusual, and so the worry is that monthly records tend to mask the occurrence of extremes (and hence may conceal trends in extremes).

The opening talks at the workshop also pointed out that the intense public scrutiny puts us in a whole new world, and one that many of the workshop attendees are clearly still struggling to come to terms with. Now, it’s clear that any new temperature record needs to be entirely open and transparent, so that every piece of research based on it could (in principle) be traced all the way back to the basic observational records. To echo the way John Christy put it at the workshop – every step of the research now has to be available as admissible evidence that could stand up in a court of law, because that’s the kind of scrutiny we’re being subjected to. Of course, the problem is that not only isn’t science ready for this (no field of science is anywhere near that transparent), it’s also not currently feasible, given the huge array of data sources being drawn on, the complexities of ownership and access rights, and the expectation that much of the data has high commercial value.

I’ll attempt a summary, but it will be rather long, as I don’t have time to make it any shorter. The slides from the workshop are now all available, and the outcomes from the workshop will be posted soon. The main goals were summarized in Peter Thorne’s opening talk: to create a (longish) list of principles, a roadmap for how to proceed, an identification of any overlapping initiatives so that synergies can be exploited, an agreed method for engaging with broader audiences (including the general public), and an initial governance model.

Did we achieve that? Well, you can skip to the end and see the summary slides, and judge for yourself. Personally, I thought the results were mixed. One obvious problem is that there is no funding on the table for this initiative, and it’s being launched at a time when everyone is cutting budgets, especially in the UK. Which meant that occasionally it felt like we were putting together a Heath Robinson device (Rube Goldberg to you Americans) – cobbling it together out of whatever we could find lying around. Which is ironic, really, given that the major international bodies (e.g. WMO) seem to fully appreciate the importance of this – and, of course, the fact that it will be a vital part of our ability to assess the impacts of climate change over the next few decades.

Another problem is that the workshop attendees struggled to reach consensus on some of the most important principles. For example, should the databank be entirely open, or does it need a restricted section? The argument for the latter is that large parts of the source data are not currently open, as the various national weather services that collect it charge a fee on a cost recovery basis, and wish to restrict access to non-commercial uses, because commercial applications provide (in some cases) a significant portion of their operating budgets. The problem is that while the monthly data has been shared freely with international partners for many years, the daily and sub-daily records have not, because these are the basis for commercial weather forecasting services. So an insistence on full openness might mean a very incomplete dataset, which then defeats the purpose, as researchers will continue to use other (private) sources for more complete records.

And what about an appropriate licensing model? Some people argued that the data must be restricted to non-commercial uses, because that’s likely to make negotiations with national weather services easier. But others argued that unrestricted licenses should be used, so that the databank can help lay the foundation for the development of a climate services industry (which would create jobs, and therefore please governments). [Personally, I felt that if governments really want to foster the creation of such an industry, then they ought to show more willingness to invest in this initiative, and until they do, we shouldn’t pander to them. I’d go for a cc by-nc-sa license myself, but I think I was outvoted]. Again, existing agreements are likely to get in the way: 70% of the European data would not be available if the research-only clause were removed.

There was also some serious disagreement about timelines. Peter outlined a cautious roadmap that focussed on building momentum, and delivering occasional reports and white papers over the next year or so. The few industrial folks in the audience (most notably, Amy Luers from Google) nearly choked on their cookies – they’d be rolling out a beta version of the software within a couple of weeks if they were running the project. Quite clearly, as Amy urged in her talk, the project needs to plan for software needs right from the start, release early, prepare for iteration and flexibility, and invest in good visualizations.

Oh, and there wasn’t much agreement on open source software either. The more software oriented participants (most notably Nick Barnes, from the Climate Code Foundation) argued strongly that all software, including every tool used to process the data at every step of the way, should be available as open source. But for many of the scientists, this represented a huge culture change. There was even some confusion about what open source means (e.g. that ‘open’ and ‘free’ aren’t necessarily the same thing).

On the other hand, some great progress was made in many areas, including the identification of many important data services, lessons learnt from other large climate and weather data curation efforts, and offers of help from many of the international partners (including offers of data from NCDC, NCAR and EURO4M, from across Europe and North America, as well as Russia, China, Indonesia, and Argentina). There was clear agreement that version control and good metadata are vital and need to be planned for right from the start, and also that providing full provenance for each data item is an important long term goal, but cannot be a rule from the start, as we will have to build on existing data sources that come with little or no provenance information. Oh, and I was very impressed with the deep thinking and planning around benchmarking for homogenization tools (I’ll blog more on this soon, as it fascinates me).

Oh, and on the size of the task. Estimates of the number of undigitized paper records in the basements of various weather services ran to hundreds of millions of pages. But I still didn’t get a sense of the overall size of the planned databank…

Things I learnt:

  • Steve Worley from NCAR, reflecting on lessons from running ICOADS, pointed out that no matter how careful you think you’ve been, people will end up mis-using the data because they ignore or don’t understand the flags in the metadata.
  • Steve also pointed out that a drawback with open datasets is the proliferation of secondary archives, which then tend to get out of date and mislead users (as they rarely direct users back to the authoritative source).
  • Oh, and the scope of the uses of such data is usually surprisingly large and diverse.
  • Jay Lawrimore, reflecting on lessons from NCDC, pointed out that monthly data and daily and sub-daily data are collected and curated along independent routes, which then makes it hard to reconcile them. The station names sometimes don’t match, the lat/long coords don’t match (e.g. because of differences in rounding), and the summarized data are similar but not exact.
  • Another problem is that it’s not always clear exactly which 24-hour period a daily summary refers to (e.g. did they use a local or UTC midnight?). Oh, and this also means that 3- and 6-hour synoptic readings might not match the daily summaries either.
  • Some data doesn’t get transmitted, and so has to be obtained later, even to the point of having to re-key it from emails. Long delays in obtaining some of the data mean the datasets frequently have to be re-released.
  • Personal contacts and workshops in different parts of the world play a surprisingly important role in tracking down some of the harder to obtain data.
  • NCDC runs a service called Datzilla (similar to Bugzilla for software) for recording and tracking reported defects in the dataset.
  • Albert Klein Tank, describing the challenges in regional assessment of climate change and extremes, pointed out that the data requirements for analyzing extreme events are much higher than for assessing global temperature change. For example, we might need to know not just how many days were above 25°C compared to normal, but also how much did it cool off overnight (because heat stress and human health depend much more on overnight relief from the heat).
  • John Christy, introducing the breakout group on data provenance, had some nice examples in his slides of the kinds of paper records they have to deal with, and a fascinating example of a surface station that’s now under a lake, and hence old maps are needed to pinpoint its location.
  • From Michael de Podesta, who insisted on a healthy dose of serious metrology (not to be confused with meteorology): All measurements ought to come with an estimation of uncertainty, and people usually make a mess of this because they confuse accuracy and precision.
  • Uncertainty information isn’t metadata, it’s data. [Oh, and for that matter anything that’s metadata to one community is likely to be data to another. But that’s probably confusing things too much]
  • Oh, and of course, we have to distinguish Type A and Type B uncertainty. Type A is where the uncertainty is describable using statistics, so that collecting bigger samples will reduce it. Type B is where you just don’t know, so that collecting more data cannot reduce the uncertainty.
  • From Matt Menne, reflecting on lessons from the GHCN dataset, explaining the need for homogenization (which is climatology jargon for getting rid of errors in the observational data that arise because of changes over time in the way the data was measured). Some of the inhomogeneities are due to abrupt changes (e.g. because a recording station was moved, or got a new instrument), and some to gradual changes (e.g. because the environment of a recording station slowly changes, such as gradual urbanization of its location). See the sketch after this list for a toy illustration of the detection problem.
  • Matt has lots of interesting examples of inhomogeneities in his slides, including some really nasty ones. For example, a station in Reno, Nevada, that was originally in town, and then moved to the airport. There’s a gradual upwards trend in the early part of the record, from an urban heat island effect, and another similar trend in the latter part, after it moved to the airport, as the airport was also eventually encroached on by urbanisation. But if you correct for both of these, as well as the step change when the station moved, you’re probably over-correcting….
  • which led Matt to suggest the Climate Scientist’s version of the Hippocratic Oath: First, do not flag good data as bad; Then do not make bias adjustments where none are warranted.
  • While criticism from non-standard sources (that’s polite-speak for crazy denialists) is coming faster than any small group can respond to (that’s code for the CRU), useful allies are beginning to emerge, also from the blogosphere, in the form of serious citizen scientists (such as Zeke Hausfather) who do their own careful reconstructions, and help address some of the crazier accusations from denialists. So there’s an important role in building community with such contributors.
  • John Kennedy, talking about homogenization for Sea Surface Temperatures, pointed out that Sea Surface and Land Surface data are entirely different beasts, requiring totally different approaches to homogenization. Why? because SSTs are collected from buckets on ships, engine intakes on ships, drifting buoys, fixed buoys, and so on. Which means you don’t have long series of observations from a fixed site like you do with land data – every observation might be from a different location!
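As promised above (in the item on Matt Menne’s talk), here’s a toy illustration of the simplest kind of inhomogeneity: a single step change introduced into a synthetic station series, and a brute-force search for the most likely breakpoint. Real homogenization algorithms are far more sophisticated than this; it just shows the shape of the problem:

```python
# Toy illustration of detecting a single step-change inhomogeneity in a
# synthetic station series: compare means before and after each candidate
# breakpoint. Real homogenization algorithms (e.g. pairwise comparison with
# neighbouring stations) are far more sophisticated than this.
import numpy as np

rng = np.random.default_rng(0)
years = np.arange(1950, 2010)
series = 15.0 + 0.01 * (years - 1950) + rng.normal(0.0, 0.3, years.size)
series[years >= 1980] -= 0.8     # synthetic station move: a -0.8 C step in 1980

def best_breakpoint(x, min_segment=10):
    """Return (index, step size) for the split with the largest mean shift."""
    best_i, best_step = None, 0.0
    for i in range(min_segment, len(x) - min_segment):
        step = x[i:].mean() - x[:i].mean()
        if abs(step) > abs(best_step):
            best_i, best_step = i, step
    return best_i, best_step

i, step = best_breakpoint(series)
print(f"most likely break at {years[i]}, estimated step of {step:+.2f} C")
```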

Things I hope I managed to inject into the discussion:

  • “solicitation of input from the community at large” is entirely the wrong set of terms for white paper #14. It should be about community building and engagement. It’s never a one-way communication process.
  • Part of the community building should be the support for a shared set of open source software tools for analysis and visualization, contributed by the various users of the data. The aim would be for people to share their tools, and help build on what’s in the collection, rather than having everyone re-invent their own software tools. This could be as big a service to the research community as the data itself.
  • We desperately need a clear set of use cases for the planned data service (e.g. who wants access to which data product, and what other information will they be needing and why?). Such use cases should illustrate what kinds of transparency and traceability will be needed by users.
  • Nobody seems to understand just how much user support will need to be supplied (I think it will be easy for whatever resources are put into this to be overwhelmed, given the scrutiny that temperature records are subjected to these days)…
  • The rate of change in this dataset is likely to be much higher than has been seen in past data curation efforts, given the diversity of sources, and the difficulty of recovering complete data records.
  • Nobody (other than Bryan) seemed to understand that version control will need to be done at a much finer level of granularity than whole datasets, and that really every single data item needs to have a unique label so that it can be referred to in bug reports, updates, etc. (see the sketch after this list for what that might look like). Oh, and the version management plan should allow for major and minor releases, given how often even the lowest level data products will change, as more data and provenance information is gradually recovered.
  • And of course, the change process itself will be subjected to ridiculous levels of public scrutiny, so the rationale for accepting/rejecting changes and scheduling new releases needs to be clear and transparent. Which means far more attention to procedures and formal change control boards than past efforts have used.
  • I had lots of suggestions about how to manage the benchmarking effort, including planning for the full lifecycle: making sure the creation of the benchmark is really a community consensus-building effort, and planning for the retirement of each benchmark, to avoid the problems of overfitting. Susan Sim wrote an entire PhD on this.
  • I think the databank will need to come with a regularly updated blog, to provide news about what’s happening with the data releases, highlight examples of how it’s being used, explain interesting anomalies, interpret published papers based on the data, etc. A bit like RealClimate. Oh, and with serious moderation of the comment threads to weed out the crazies. Which implies some serious effort is needed.
  • …and I almost but not quite entirely learned how to pronounce the word ‘inhomogeneities’ without tripping over my tongue. I’m just going to call them ‘bugs’.
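And, as promised in the item on fine-grained version control, here’s a sketch of what a citable, versioned data item might look like. The fields are illustrative, not a proposed schema:

```python
# Sketch of item-level versioning: every observation carries a stable
# identifier, so bug reports and corrections can refer to a single value
# rather than to a whole dataset release. The fields are illustrative only.
from dataclasses import dataclass

@dataclass(frozen=True)
class ObservationRecord:
    station_id: str          # e.g. a station identifier
    timestamp: str           # time of the observation (ISO 8601)
    variable: str            # e.g. "tmax"
    value: float
    version: int = 1         # bumped whenever this value is corrected
    provenance: tuple = ()   # chain of sources/transformations, oldest first

    @property
    def item_id(self) -> str:
        """A stable, citable identifier for this single data item."""
        return f"{self.station_id}/{self.variable}/{self.timestamp}#v{self.version}"

rec = ObservationRecord("10637", "1923-07-14T06:00Z", "tmax", 29.4,
                        provenance=("paper ledger scan", "keyed 2011-03-02"))
print(rec.item_id)   # -> 10637/tmax/1923-07-14T06:00Z#v1
```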

Update Sept 21, 2010: Some other reports from the workshop.

I’ve mentioned the Clear Climate Code project before, but it’s time to give them an even bigger shout out, as the project is a great example of the kind of thing I’m calling for in my grand challenge paper. The project is building an open source community around the data processing software used in climate science. Their showcase project is an open source Python re-implementation of gistemp, and very impressive it is too.

Now they’ve gone one better, and launched the Climate Code Foundation, a non-profit organisation aimed at “improving the public understanding of climate science through the improvement and publication of climate science software”. The idea is for it to become an umbrella body that will nurture many more open source projects, and promote greater openness of the software tools and data used for the science.

I had a long chat with Nick Barnes, one of the founders of CCF, on the train to Exeter last night, and was very impressed with his enthusiasm and energy. He’s actively seeking more participants, more open source projects for the foundation to support, and of course, for funding to keep the work going. I think this could be the start of something beautiful.

Here’s a question I’ve been asking a few people lately, ever since I asserted that climate models are big expensive scientific instruments: How expensive are we talking about? Unfortunately, it’s almost impossible to calculate. The effort of creating a climate model is tangled up with the scientific research, such that you can’t even reliably determine how much of a particular scientist’s time is “model development” and how much is “doing science”. The problem is that you can’t build the model without a lot of that “doing science” part, because the model is the result of a lot of thinking, experimentation, theory building, testing hypotheses, analyzing simulation results, and discussions with other scientists. Many pieces of the model are based on the equations or empirical results in published research papers; even if you’re not doing the research yourself, you still have to keep up with the literature, understand the state-of-the-art, and know which bits of research are mature enough to incorporate into the model.

So, my first cut, which will be an over-estimate, is that *all* of the effort at a climate modeling lab is necessary to build the model. Labs vary in size, but a typical climate modeling lab is of the order of 200 people (including scientists, technicians, and admin support). And most of the models I’ve looked at have been under steady development for twenty years or more. So, that gives us a starting point of 200*20 = 4,000 person-years. Luckily, most scientists care more about science than salary, so they’re much cheaper than software professionals. Given we’ll have a mix of postdocs and senior scientists, let’s say the average salary would be around $150,000 per year, including benefits and other overheads. That’s $600 million.

Oh, and that doesn’t include the costs of equipping and operating a tier-2 supercomputing facility, as the climate model runs will easily keep such a facility fully loaded full time (and we’ll need to factor in the cost of replacing the supercomputer every few years to take advantage of performance increases). In most cases, the supercomputing facilities are shared with other scientific uses of high performance computing. But there is one centre that’s dedicated to climate modeling, the DKRZ in Hamburg, which has an annual budget of around 30 million euro. Let’s pretend euros are dollars, and call that $30 million per year, which for 20 years gives us another $600 million. The latest supercomputer at DKRZ, Blizzard, cost 35 million euro. Let’s say we replace it every five years, and throw in some more money for many terabytes of data storage, and that’ll get us to around $200 million for hardware.

Grand total: $1.4 billion.

Now, I said that’s an over-estimate. Over lunch today I quizzed some of the experts here at IPSL in Paris, and they thought that 1,000 person-years (50 persons per year for 20 years) was a better estimate of the actual model development effort. This seems reasonable – it means that only 1/4 of the research at my 200 person research institute directly contributes to model development, the rest is science that uses the model but isn’t essential for developing it. So, that brings the salary figure down to $150 million. I’ve probably got to do the same conversion for the supercomputing facilities – let’s say about 1/4 of the supercomputing capacity is reserved for model development and testing. That also feels about right: 5-10% of the capacity is reserved for test processes (e.g. the ones that run automatically every day to do the automated build-and-test process), and a further 10%-20% might be used for validation runs on development versions of the model.

That brings the grand total down to $350 million.
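For anyone who wants to check my arithmetic, here are the two estimates spelled out; all the inputs are the rough figures quoted above:

```python
# The two estimates above, spelled out. All inputs are the rough figures
# quoted in the text.
salary = 150_000                  # $ per person-year, including overheads
ops_per_year = 30_000_000         # DKRZ annual budget (treating euros as dollars)
hardware = 200_000_000            # supercomputer replacements + storage over 20 years

# Upper estimate: the whole 200-person lab, for 20 years, plus all the computing.
upper = 200 * 20 * salary + ops_per_year * 20 + hardware

# Revised estimate: ~1/4 of the lab does model development, and ~1/4 of the
# computing is used for development and testing.
lower = 50 * 20 * salary + (ops_per_year * 20 + hardware) / 4

print(f"upper estimate: ${upper / 1e6:,.0f} million")   # ~ $1,400 million
print(f"lower estimate: ${lower / 1e6:,.0f} million")   # ~ $350 million
```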

Now, it has been done for less than this. For example, the Canadian Climate Centre, CCCma, has a modeling team one tenth this size, although they do share a lot of code with the Canadian Meteorological Service. And their model isn’t as full-featured as some of the other GCMs (it also has a much smaller user base). As with other software projects, the costs don’t scale linearly with functionality: a team of 5 software developers can achieve much more than 1/10th of what a team of 50 can (cf The Mythical Man Month). Oh, and the computing costs won’t come down much at all – the CCCma model is no more efficient than other models. So we’re still likely to be above the $100 million mark.

Now, there are probably other ways of figuring it – so far we’ve only looked at the total cumulative investment in one of today’s world leading climate models. What about replacement costs? If we had to build a new model from scratch, using what we already know (rather than doing all the research over again), how much would that cost? Well, nobody has ever done this, but there are a few experiences we could draw on. For example, the Max Planck Institute has been developing a new model from scratch, ICON, which uses an icosahedral grid and hence needs a new approach to the dynamics. The project has been going for 8 years. It started with just a couple of people, and has ramped up to about a dozen. But they’re still a long way from being done, and they’re re-using a lot of the physics code from their old model, ECHAM. On the other hand, it’s an entirely new approach to the grid structure, so a lot of the early work was pure research.

Where does that leave us? It’s really a complete guess, but I would suggest a team of 10 people (half of them scientists, half scientific programmers) could re-implement the old model from scratch (including all the testing and validation) in around 5 years. Unfortunately, climate science is a fast moving field. What we’d get at the end of 5 years is a model that, scientifically speaking, is 5 years out of date. Unless of course we also paid for a large research effort to bring the latest science into the model while we were constructing it, but then we’re back where we started. I think this means you can’t replace a state-of-the-art climate model for much less than the original development costs.

What’s the conclusion? The bottom line is that the development cost of a climate model is in the hundreds of millions of dollars.

Here’s a whole set of things I can’t make it to. The great thing about being on sabbatical is the ability to travel, visit different labs, and so on. The downside is that there are far more interesting places and events than I can possibly make it to, and many of them clash. Here’s some I won’t be able to make it to this fall:

I’m pleased to see that my recent paper, “Climate Change: A Software Grand Challenge” is getting some press attention. However, I’m horrified to see how it’s been distorted in the echo chamber of the media. Danny Bradbury, writing in the Guardian, gives his piece the headline “Climate scientists should not write their own software, says researcher“. Aaaaaaargh! Nooooo! That’s the exact opposite of what I would say!

Our research shows that earth system models, the workhorses of climate science, appear to have very few bugs, and produce remarkably good simulations of past climate. One of the most important success factors is that the code is written by the scientists themselves, as they understand the domain inside out. Now, of course, this leads to other problems, for instance the code is hard to understand, and hard to modify. And the job of integrating the various components of the models is really hard. But there are no obvious solutions to fix this without losing this hands-on relationship between the scientists and the code. Handing the code development over to software professionals is likely to be a disaster.

I’ve posted a comment on Bradbury’s article, but I have very little hope he’ll alter the headline, as it obviously plays into a storyline that’s popular with denialists right now (see update, below).

Some other reports:

Update (2/9/10): Well that’s a delight! I just got off the overnight train to Paris, and discover that Danny has commented here, and wants to put everything right, and has already corrected the headline in the BusinessGreen version. So, apologies to Danny for doubting him, and also, thanks for restoring my faith in journalism. As is clear in some of the comments, it’s easy to see how one might draw the conclusion that climate scientists shouldn’t write their own code from a reading of my paper. It’s a subtle point, so I probably need to write a longer piece on this to explain…

Update #2 (later that same day): And now the Guardian headline has been changed too. Victory for honest journalism!

Here’s an appalling article by Andy Revkin on dotEarth which epitomizes everything that is wrong with media coverage of climate change. Far from using his position to educate and influence the public by seeking the truth, journalists like Revkin now seem to have taken to just making shit up, reporting what he reads in blogs as the truth, rather than investigating for himself what scientists actually do.

Revkin kicks off by citing a Harvard cognitive scientist found guilty of academic misconduct, and connecting it with “assertions that climate research suffered far too much from group think, protective tribalism and willingness to spin findings to suit an environmental agenda”. Note the juxtaposition. On the one hand, a story of a lone scientist who turned out to be corrupt (which is rare, but does happen from time to time). On the other hand, a set of insinuations about thousands of climate scientists, with no evidence whatsoever. Groupthink? Tribalism? Spin? Can Revkin substantiate these allegations? Does he even try? Of course not. He just repeats a lot of gossip from a bunch of politically motivated blogs, and demonstrates his own total ignorance of how scientists work.

He does offer two pieces of evidence to back up his assertion of bias. The first is the well-publicized mistake in the IPCC report on the retreat of the Himalayan glaciers. Unfortunately, the quotes from the IPCC authors in the very article Revkin points to show it was the result of an honest mistake, despite an entire cadre of journalists and bloggers trying to spin it into some vast conspiracy theory. The second is about a paper on the connection between vanishing frogs and climate change, cited in the IPCC report. The IPCC report quite correctly cites the paper, and gives a one sentence summary of it. Somehow or other, Revkin seems to think this is bias or spin. It must have entirely escaped his notice that the IPCC report is supposed to summarize the literature in order to assess our current understanding of the science. Some of that literature is tentative, and some less so. Now, maybe Revkin has evidence that there is absolutely no connection between the vanishing frogs and climate change. If so, he completely fails to mention it. Which means that the IPCC is merely reporting on the best information we have on the subject. Come on Andy, if you want to demonstrate a pattern of bias in the IPCC reports, you’re gonna have to work a damn sight harder than that. Oh, but I forgot. You’re just repeating a bunch of conspiracy theories to pretend you have something useful to say, rather than actually, say, investigating a story.

From here, Revkin weaves a picture of climate science as “done by very small tribes (sea ice folks, glacier folks, modelers, climate-ecologists, etc)”, and hence suggests they must be guilty of groupthink and confirmation bias. Does he offer any evidence for this tribalism? No he does not, for there is none. He merely repeats the allegations of a bunch of people like Steve McIntyre, who, working on the fringes of science, clearly do belong to a minor tribe – one that does not interact in any meaningful way with real climate scientists. So, I guess we’re meant to conclude that because McIntyre and a few others have formed a little insular tribe, this must mean mainstream climate scientists are tribal too? Such reasoning would be laughable, if this wasn’t such a serious subject.

Revkin claims to have been “following the global warming saga – science and policy – for nearly a quarter century”. Unfortunately, in all that time, he doesn’t appear to have actually educated himself about how the science is done. If he’d spent any time in a climate science research institute, he’d know this allegation of tribalism is about as far from the truth as it’s possible to get. Oh, but of course, actually going and observing scientists in action would require some effort. That seems to be just a little too much to ask.

So, to educate Andy, and to save him the trouble of finding out for himself, let me explain. First, a little bit of history. The modern concern about the potential impacts of climate change probably dates back to the 1957 Revelle and Suess paper, in which they reported that the oceans absorb far less anthropogenic carbon emissions than was previously thought. Revelle was trained in geology and oceanography. Suess was a nuclear physicist, who studied the distribution of carbon-14 in the atmosphere. Their collaboration was inspired by discussions with Libby, a physical chemist famous for the development of radio-carbon dating. As head of the Scripps Institute, Revelle brought together oceanographers with atmospheric physicists (including initiating the Mauna Loa measurements of carbon dioxide concentrations in the atmosphere), atomic physicists studying the dispersal of radioactive particles, and biologists studying the biological impacts of radiation. Tribalism? How about some truly remarkable inter-disciplinary research?

I suppose Revkin might argue that those were the old days, and maybe things have gone downhill since then. But again, the evidence says otherwise. In the 1970s, the idea of earth system science began to emerge, and in the last decade it has become central to the efforts to build climate simulation models that improve our understanding of the connections between the various earth subsystems: atmosphere, ocean, atmospheric chemistry, ocean biogeochemistry, biology, hydrology, glaciology and meteorology. If you visit any of the major climate research labs today, you’ll find a collection of scientists from many of these different disciplines working alongside one another, collaborating on the development of integrated models, and discussing the connections between the different earth subsystems. For example, when I visited the UK Met Office two years ago, I was struck by their use of cross-disciplinary teams to investigate specific problems in the simulation models. When I visited, they had just formed such a cross-disciplinary team to investigate how to improve the simulation of the Indian monsoons in their earth system models. This week, I’m just wrapping up a month long visit to the Max Planck Institute for Meteorology in Hamburg, where I’ve also regularly sat in on meetings between scientists from the various disciplines, sharing ideas about, for example, the relationships between atmospheric radiative transfer and ocean plankton models.

The folks in Hamburg have been kind enough to allow me to sit in on their summer school this week, in which they’re training the next generation of earth science PhD students in how to work with earth system models. The students are from a wide variety of disciplines: some study glaciers, some clouds, some oceanography, some biology, and so on. The set of experiments we’ve been given to try out the model includes: changing the cloud top mass flux, altering the rate of decomposition in soils, changing the ocean mixing ratio, altering the ocean albedo, and changing the shape of the earth. Oh, and they’ve mixed up the students, so they have to work in pairs with people from another discipline. Tribalism? No, right from the get go, PhD training includes the encouragement of cross-disciplinary thinking and cross-disciplinary working.

Of course, if Revkin ever did wander into a climate science research institute he would see this for himself. But no, he prefers pontificating from the comfort of his armchair, repeating nonsense allegations he reads on the internet. And this is the standard that journalists hold for themselves? No wonder the general public is confused about climate change. Instead of trying to pick holes in a science they clearly don’t understand, maybe people like Revkin ought to do some soul searching and investigate the gaping holes in journalistic coverage of climate change. Then finally we might find out where the real biases lie.

So, here’s a challenge for Andy Revkin: Do not write another word about climate science until you have spent one whole month as a visitor in a climate research institute. Attend the seminars, talk to the PhD students, sit in on meetings, find out what actually goes on in these places. If you can’t be bothered to do that, then please STFU [about this whole bias, groupthink and tribalism meme].

Update: On reflection, I think I was too generous to Revkin when I accused him of making stuff up, so I deleted that bit. He’s really just parroting other people who make stuff up.

Update #2: Oh, did I mention that I’m a computer scientist? I’ve been welcomed into various climate research labs, invited to sit in on meetings and observe their working practices, and to spend my time hanging out with all sorts of scientists from all sorts of disciplines. Because obviously they’re a bunch of tribalists who are trying to hide what they do. NOT.

Update #3: I’ve added a clarifying rider to my last paragraph  – I don’t mean to suggest Andy should shut up altogether, just specifically about these ridiculous memes about tribalism and so on.

Nearly everything we ever do depends on vast social and technical infrastructures, which, when they work, are largely invisible. Science is no exception – modern science is only possible because we have built the infrastructure to support it: classification systems, international standards, peer review, funding agencies, and, most importantly, systems for the collection and curation of vast quantities of data about the world. Star and Ruhleder point out that the infrastructure supporting scientific work is embedded inside other social and technical systems, and becomes invisible when we come to rely on it. Indeed, the process of learning how to make use of a particular infrastructure is, to a large extent, what defines membership in a particular community of practice. They also observe that our infrastructures are closely intertwined with our conventions and standards. As a simple example, they point to the QWERTY keyboard, which, despite its limitations, shapes much of our interaction with computers (even the design of office furniture!), such that learning to use the keyboard is a crucial part of learning to use a computer. And once you can type, you cease to be aware of the keyboard itself, except when it breaks down. This invisibility-in-use is similar to Heidegger’s notion of tools that are ready-to-hand; the key difference is that tools are local to the user, while infrastructures have vast spatial and/or temporal extent.

A crucial point is that what counts as infrastructure depends on the nature of the work that it supports. What is invisible infrastructure for one community might not be for another. The internet is a good example – most users just accept that it exists and make use of it, without asking how it works. However, to computer scientists, a detailed understanding of its inner workings is vital. A refusal to treat the internet as invisible infrastructure is a condition of entry into certain geek cultures.

In their book Sorting Things Out, Star and Bowker introduced the term infrastructural inversion, for a process of focusing explicitly on the infrastructure itself, in order to expose and study its inner workings. It’s a rather cumbersome phrase for a very interesting process, kind of like a switch of figure and ground. In their case, infrastructural inversion is a research strategy that allows them to explore how things like classification systems and standards are embedded in so much of scientific practice, and to understand how these things evolve with the science itself.

Paul Edwards applies infrastructural inversion to climate science in his book A Vast Machine, where he examines the history of attempts by meteorologists to create a system for collecting global weather data, and for sharing that data with the international weather forecasting community. He points out that climate scientists also come to rely on that same infrastructure, but that it doesn’t serve their needs so well, and hence there is a difference between weather data and climate data. As an example, meteorologists tolerate changes in the nature and location of a particular surface temperature station over time, because they are only interested in forecasting over the short term (days or weeks). But to a climate scientist trying to study long-term trends in climate, such changes (known as inhomogeneities) are crucial. In this case, the infrastructure breaks down, as it fails to serve the needs of this particular community of scientists.

Hence, as Edwards points out, climate scientists also perform infrastructural inversion regularly themselves, as they dive into the details of the data collection system, trying to find and correct inhomogeneities. In the process, almost any aspect of how this vast infrastructure works might become important, revealing clues about which parts of the data can be used and which parts must be reconsidered. One of the key messages in Paul’s book is that the usual distinction between data and models is now almost completely irrelevant in meteorology and climate science. The data collection depends on a vast array of models to turn raw instrumental readings into useful data, while the models themselves can be thought of as sophisticated data reconstructions. Even GCMs, which now have the ability to do data assimilation and re-analysis, can be thought of as large amounts of data made executable through a set of equations that define spatial and temporal relationships within that data.

As an example, Edwards describes the analysis performed by Christy and Spencer at UAH on the MSU satellite data, from which they extracted measurements of the temperature of the upper atmosphere. In various congressional hearings, Spencer and Christy frequently touted their work, which showed a slight cooling trend in the upper atmosphere, as superior to other work that showed a warming trend, because they were able to “actually measure the temperature of the free atmosphere” whereas other work was merely “estimation” from models (Edwards, p414). However, this completely neglects the fact that the MSU data doesn’t measure temperature in the lower troposphere directly at all: it measures radiance at the top of the atmosphere. Temperature readings for the lower troposphere are constructed from these readings via a complex set of models that take into account the chemical composition of the atmosphere, the trajectory of the satellite, and the position of the sun, among other factors. More importantly, a series of corrections to these models over several years gradually removed the apparent cooling trend, finally revealing a warming trend, as predicted by the theory (see Karl et al for a more complete account). The key point is that the data needed for meteorology and climate science is so vast and so complex that it’s no longer possible to disentangle models from data. The data depends on models to make it useful, and the models are sophisticated tools for turning one kind of data into another.
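To make that concrete, here’s a deliberately simplified sketch (my own illustration, not the actual MSU retrieval chain) of what it means to say that a satellite “temperature” is already the output of a model: the instrument records radiance, and a physical model – here just an inverted Planck function, ignoring the weighting functions, atmospheric composition and orbital corrections that the real retrieval needs – turns that radiance into a temperature:

```python
import numpy as np

# Physical constants (SI units)
H = 6.626e-34   # Planck constant, J s
C = 2.998e8     # speed of light, m/s
K = 1.381e-23   # Boltzmann constant, J/K

def brightness_temperature(radiance, frequency):
    """Invert the Planck function: recover a brightness temperature (K) from a
    spectral radiance (W m^-2 sr^-1 Hz^-1) at the given frequency (Hz)."""
    return (H * frequency / K) / np.log(1.0 + (2.0 * H * frequency**3) / (C**2 * radiance))

# Hypothetical radiance value near the ~57 GHz oxygen band that MSU samples
# (the number is made up for illustration, not a real MSU reading).
print(brightness_temperature(2.5e-16, 57.0e9))   # roughly 250 K
```

Every revision to a model like this changes the “data”, which is precisely why the measurement/model distinction breaks down.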

While the vast infrastructure for collecting and sharing data has become largely invisible to many working meteorologists, it must be continually inverted by climate scientists in order to use it for analysis of longer-term trends. The project to develop a new global surface temperature record that I described yesterday is one example of such inversion – it will involve a painstaking process of search and rescue on original data records dating back more than a century, because of the need for a more complete, higher resolution temperature record than is currently available.

So far, I’ve only described constructive uses of infrastructural inversion, performed in the pursuit of science, to improve our understanding of how things work, and to allow us to re-adapt an infrastructure for new purposes. But there’s another use of infrastructural inversion, applied as a rhetorical technique to undermine scientific research. It has been applied increasingly in recent years in an attempt to slow down progress on enacting climate change mitigation policies, by sowing doubt and confusion about the validity of our knowledge about climate change. The technique is to dig down into the vast infrastructure that supports climate science, identify weaknesses in this infrastructure, and tout them as reasons to mistrust scientists’ current understanding of the climate system. And it’s an easy game to play, for two reasons: (1) all infrastructures are constructed through a series of compromises (e.g. standards are never followed exactly), and communities of practice develop workarounds that naturally correct for infrastructural weaknesses; and (2) as described above, the data collection for weather forecasting frequently does fail to serve the needs of climate scientists. The climate scientists are painfully aware of these infrastructural weaknesses and have to deal with them every day, while those playing this rhetorical game ignore this, and pretend instead that there’s a vast conspiracy to lie about the science.

The problem is that, at first sight, many of these attempts at infrastructural inversion look like honest citizen-scientist attempts to increase transparency and improve the quality of the science (e.g. see Edwards, p421-427). For example, Anthony Watts’ SurfaceStations.org project is an attempt to document the site details of a large number of surface weather measuring stations, to understand how problems in their siting (e.g. growth of surrounding buildings) and placement of instruments might create biases in the long term trends constructed from their data. At face value, this looks like a valuable citizen-science exercise in infrastructural inversion. However, Watts wraps the whole exercise in the rhetoric of conspiracy theory, frequently claiming that climate scientists are dishonest, that they are covering up these problems, and that climate change itself is a myth. This not only ignores the fact that climate scientists themselves routinely examine such weaknesses in the temperature record, but also has the effect of biasing the entire exercise, as Watts’ followers are increasingly motivated to report only those problems that would cause a warming bias, and ignore those that do not. Recent independent studies that have examined the data collected by the SurfaceStations.org project demonstrate that the corrections demanded by Watts are irrelevant.

The recent project launched by the UK Met Office might look to many people like it’s a desperate response to “ClimateGate”, a mea culpa, an attempt to claw back some credibility. But, put into the context of the history of continual infrastructural inversion performed by climate scientists throughout the history of the field, it is nothing of the sort. It’s just one more in a long series of efforts to build better and more complete datasets to allow climate scientists to answer new research questions. This is what climate scientists do all the time. In this case, it is an attempt to move from monthly to daily temperature records, to improve our ability to understand the regional effects of climate change, and especially to address the growing need to understand the effect of climate change on extreme weather events (which are largely invisible in monthly averages).

So, infrastructural inversion is a fascinating process, used by at least three different groups:

  • Researchers who study scientific work (e.g. Star, Bowker, Edwards) use it to understand the interplay between the infrastructure and the scientific work that it supports;
  • Climate scientists use it all the time to analyze and improve the weather data collection systems that they need to understand longer term climate trends;
  • Climate change denialists use it to sow doubt and confusion about climate science, to further a political agenda of delaying regulation of carbon emissions.

And unfortunately, sorting out constructive uses of infrastructural inversion from its abuses is hard, because in all cases, it looks like legitimate questions are being asked.

Oh, and I can’t recommend Edwards’ book highly enough. As Myles Allen writes in his review: “A Vast Machine […] should be compulsory reading for anyone who now feels empowered to pontificate on how climate science should be done.”

I’ve been invited to a workshop at the UK Met Office in a few weeks’ time, to brainstorm a plan to create (and curate) a new global surface temperature data archive. Probably the best introduction to this is the article by Stott and Thorne in Nature, back in May.

There’s now a series of white papers, to set out some of the challenges, and to solicit input from a broad range of stakeholders prior to the workshop. The white papers are available at http://www.surfacetemperatures.org/ and there’s a moderated blog to collect comments, which is open until Sept 1st (yes, I know that’s real soon now – I’m a little slow blogging this).

I’ll blog some of my reflections on what I think is missing from the white papers over the next few days. For now, here’s a quick summary of the white papers and the issues they cover (yes, the numbering starts at 3 – don’t worry about it!)

Paper #3, on Retrieval of Historical Data is a good place to start, as it sets out the many challenges in reconstructing a fully traceable archive of the surface temperature data. It offers the following definitions of the data products:

  • Level 0: original raw instrumental readings, or digitized images of logs;
  • Level 1: data as originally keyed in, typically converted to some local (native) format;
  • Level 2: data converted to common format;
  • Level 3: data consolidated into a databank;
  • Level 4: quality controlled derived product (e.g. corrected for station biases, etc.)
  • Level 5: homogenized derived product (e.g. regridded, interpolated, etc.)

The central problem is that most existing temperature records are level 3 data or above, and traceability to lower levels has not been maintained. The original records are patchy, and sometimes only higher level products have been archived. Also, there are multiple ways of deriving higher level products, in some cases because of improved techniques that supersede previous approaches, and in other cases because of multiple valid methodologies suited to different analysis purposes.
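To illustrate the kind of traceability being asked for, here’s a minimal sketch (the names and structure are mine, not the white paper’s) in which every product above level 0 records which product it was derived from and by what method, so that an audit trail can always be reconstructed:

```python
from dataclasses import dataclass
from typing import List, Optional

# Sketch only: the idea is that every product above level 0 carries a reference
# to the lower-level product it was derived from, plus the processing method,
# so higher-level data stays traceable back to the original readings.

@dataclass
class DataProduct:
    level: int                              # 0 = raw readings ... 5 = homogenized product
    description: str
    source: Optional["DataProduct"] = None  # lower-level product this was derived from
    processing: Optional[str] = None        # reference to the (published) method used

    def lineage(self) -> List[str]:
        """Walk back down the levels to reconstruct the audit trail."""
        trail, node = [], self
        while node is not None:
            step = f"level {node.level}: {node.description}"
            if node.processing:
                step += f" [via {node.processing}]"
            trail.append(step)
            node = node.source
        return trail

raw = DataProduct(0, "scanned ship log pages, 1923")
keyed = DataProduct(1, "keyed-in records, native format", raw, "manual digitization")
common = DataProduct(2, "records in common exchange format", keyed, "format converter v1.2")
print("\n".join(common.lineage()))
```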

The effort to recover the original source data will be expensive, and hence will need some prioritization criteria. It will often be hard to tell whether peripheral information will turn out to be important, e.g. comments in ships’ log books may provide important context to explain anomalies in the data. The paper suggests prioritizing records that add substantially to the existing datasets – e.g. under-represented regions, especially in cases where it’s likely to be easy to get agreement from the organizations (e.g. national centres) that hold the data records.

Scoping decisions will be hard too. The focus is on surface air temperature records, but it might be cost-effective to include related data, such as all parameters from land stations, and, anticipating an interest in extremes, perhaps hydrological data too… And so on. Also, original, paper-based records are important as historical documents, for purposes beyond meteorology. Hence, scanned images may be important, in addition to the digital data extraction.

Data records exist at various temporal resolutions (hourly, daily, monthly, seasonal, etc), but availability of each type is variable. By retrieving the original records, it may be possible to backfill the various records at these different resolutions, but this won’t necessarily produce consistent records, due to differences in the techniques used to produce aggregates. Furthermore, differences occur anyway between regions, and even between different eras in the same series. Hence, homogenization is tricky. Full traceability between the different data levels and the processing techniques that link them is therefore an important goal, but will be very hard to achieve given the size and complexity of the data, and the patchiness of the metadata. In many cases the metadata is poor or non-existent. This includes descriptions of the stations themselves, the instruments used, calibration, precision, and even the units and timings of readings.

Then of course there is the problem of ownership. Much of the data was originally collected by national meteorological services, some of which depend on revenues from this data for their very operations, and some are keen to protect their interests in using this data to provide commercial forecasting services. Hence, it won’t always be possible to release all the lower level data publicly.

Suitable policies will be needed to decide what to do when lower levels from which level 3 data was derived are no longer available. We probably don’t want to exclude such data, but do need to clearly flag it. We need to give end users full flexibility in deciding how to filter the products they want to use.

Finally, the paper takes pains to point out how large an effort it will take to recover, digitize and make traceable all the level 0, 1 and 2 data. Far more paper based records exist than there is effort available to digitize them. The authors speculate about crowd sourcing the digitization, but that brings quality control issues. Also some of the paper records are fragile, and deteriorating (which might also imply some urgency).

(The paper also lists a number of current global and national databanks, with some notes on what each contains, along with some recent efforts to recover lower level data for similar datasets.)

Paper #4 on Near Real-Time Updates describes the existing Global Telecommunications System (GTS) used by the international meteorological community, which is probably easiest to describe via a couple of pictures:

Data Collection by the National Meteorological Services (NMS)

National Meteorological Centers (NMC) and Regional Telecommunications Hubs (RTH) in the WMO's Global Telecommunication System

The existing global telecommunications system is good for collecting low time-resolution (e.g. monthly) data, but hasn’t kept pace with the need for rapid transmission of daily and sub-daily data, nor does it do a particularly good job with metadata. The paper mentions a target of 24 hours for transmission of daily and sub-daily data, and within 5 days of the end of the month for monthly data, but points out that the target is rarely met. And it describes some of the weaknesses in the existing system:

  • The system depends on a set of catalogues that define the station metadata and routing tables (list of who publishes and subscribes to each data stream), which allow the data transmission to be very terse. But these catalogues aren’t updated frequently enough, leading to many apparent inconsistencies in the data, which can be hard to track down.
  • Some nations lack the resources to transmit their data in a timely manner (or in some cases, at all)
  • Some nations are slow to correct errors in the data record (e.g. when the wrong month’s data is transmitted)
  • Attempts to fill gaps and correct errors often yield data via email and/or parcel post, which therefore bypasses the GTS, so availability isn’t obvious to all subscribers.
  • The daily and sub-daily data often isn’t shared via the GTS, which means the historical record is incomplete.
  • There is no mechanism for detecting and correcting errors in the daily data.
  • The daily data also contains many errors, due to differences in defining the 24-hour reporting period (it’s supposed to be midnight to midnight UTC time, but often isn’t)
  • The international agreements aren’t in place for use of the daily data (although there is a network of bi-lateral agreements), and it is regarded as commercially valuable by many of the national meteorological services.

Paper #5 on Data Policy describes the current state of surface temperature records (e.g. those held at CRU and NOAA-NCDC), which contain just monthly averages for a subset of the available stations. These archives don’t store any of the lower level data sources, and differ where they’ve used different ways of computing the monthly averages (e.g. mean of the 3-hourly observations, versus mean of the daily minima and maxima). While in theory, the World Meteorological Organization (WMO) is committed to free exchange of the data collected by the national meteorological services, in practice there is a mix of different restrictions on data from different providers. For example, some is restricted to academic use only, while other providers charge fees for the data to enable them to fund their operations. In both cases, handing the data on to third parties is therefore not permitted.
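That point about differing monthly averages is easy to demonstrate. Here’s a small synthetic example (purely illustrative) showing that the mean of all 3-hourly observations and the mean of the daily (Tmax+Tmin)/2 values give systematically different answers whenever the diurnal cycle is asymmetric:

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic month of 3-hourly temperatures (30 days x 8 readings per day, deg C)
# with an asymmetric diurnal cycle (warm afternoon peak) plus observation noise.
hours = np.arange(0, 24, 3)
diurnal = 10 + 6 * np.exp(-((hours - 15) / 4.0) ** 2)
temps = diurnal + rng.normal(0, 0.5, size=(30, 8))

mean_of_obs = temps.mean()                                               # mean of all 3-hourly obs
mean_of_extremes = ((temps.max(axis=1) + temps.min(axis=1)) / 2).mean()  # mean of daily (Tmax+Tmin)/2

print(f"monthly mean from 3-hourly obs:  {mean_of_obs:.2f} C")
print(f"monthly mean from daily max/min: {mean_of_extremes:.2f} C")      # noticeably warmer here
```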

One response to this problem has been to run a series of workshops in various remote parts of the world, in which local datasets are processed to produce high quality derived products, even where the low level data cannot be released. These workshops have the benefit of engaging the local meteorological services in analyzing regional climate change (often for the first time), and raising awareness of the importance of data sharing.

Paper #6 on Data provenance, version control, configuration management is a first attempt at identifying the requirements for curating the proposed data archive (I wish they’d use the term ‘curating’ in the white papers). The paper starts by making a very important point: the aim is not “to assess derived products as to whether they meet higher standards required by specific communities (i.e. scientific, legal, etc.)” but rather it’s “to archive and disseminate derived products as long as the homogenization algorithm is documented by the peer review process”. Which is important, because it means the goal is to support the normal process of doing science, rather than to constrain it.

Some of the identified requirements are:

  • The need for a process (the paper suggests a certification panel) to rate the authenticity of source material and its relationship to primary sources; and that this process must be dynamic, because of the potential for new information to cast doubt on material previously rated as authentic.
  • The need for version control, and the difficult question of what counts as a configuration unit for versioning. E.g. temporal blocks (decade-by-decade?), individual surface stations, regional datasets, etc?
  • The need for a pre-authentication database to hold potential updates prior to certification
  • The need to limit the frequency of version changes on the basic (level 2 and below) data, due to the vast amount of work that will be invested into science based on these.
  • The need to version control all the software used for producing the data, along with the test cases too.
  • The likelihood that there will be multiple versions of a station record at level 1, with varying levels of confidence rating.

Papers 8 (Creation of quality controlled homogenised datasets from the databank), 9 (Benchmarking homogenisation algorithm performance against test cases) and 10 (Dataset algorithm performance assessment based upon all efforts) go into detail about the processes used by this community for detecting bugs (inhomogeneities) in the data, and for fixing them. Such bugs arise most often because of changes over time in some aspect of the data collection at a particular station, or in the algorithms used to process the data. A particularly famous example is urbanization: a recording station that was originally in a rural environment gradually ends up in an urban one, and hence may suffer from the urban heat island effect.

I won’t go into detail here on these problems (read the papers!) except to note that the whole problem looks to me very similar to code debugging: there are an unknown number of inhomogeneities in the dataset, we’re unlikely to find them all, and some of them have been latent for so long, with so much subsequent work overlaid on them, that they might end up being treated as features if we can establish that they don’t impact the validity of that work. Also, the process of creating benchmarks to test the skill of homogenisation algorithms looks very much like bug seeding techniques – we insert deliberate errors into a realistic dataset and check how many are detected.
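Here’s a toy version of that bug seeding analogy (purely illustrative – the real benchmarking exercise uses far more realistic synthetic data and far better algorithms): seed an artificial step change into a synthetic temperature series, as if the station had been moved, and check whether a naive detector finds it:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 50-year monthly anomaly series: noise plus a seeded step change of
# +0.8 C at month 300, standing in for an undocumented station move.
n, true_break = 600, 300
series = rng.normal(0, 0.5, n)
series[true_break:] += 0.8

def detect_breakpoint(x, margin=24):
    """Naive change-point detector: pick the split that maximises the difference
    in means between the two segments. (Real homogenisation algorithms, e.g.
    pairwise comparison against neighbouring stations, are far more subtle.)"""
    scores = [abs(x[:i].mean() - x[i:].mean()) for i in range(margin, len(x) - margin)]
    return margin + int(np.argmax(scores))

found = detect_breakpoint(series)
print(f"seeded break at month {true_break}, detected at month {found}")
```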

Paper 11 (Spatial and temporal interpolation) covers interpolation techniques used to fill in missing data, and/or to convert the messy real data to a regularly spaced grid. The paper also describes the use of reanalysis techniques, whereby a climate model, constrained by whatever observational data is available over a period of time, is run and its values used to fill in the blanks, iterating on this process until a best fit with the real data is achieved.
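For the simpler interpolation case, here’s a minimal inverse-distance-weighting sketch (my own toy example, not one of the methods the paper discusses) that fills a regular grid from a handful of scattered station values:

```python
import numpy as np

def idw_grid(st_lon, st_lat, st_temp, grid_lon, grid_lat, power=2.0):
    """Interpolate scattered station temperatures onto a regular lon/lat grid
    using inverse-distance weighting (treating degrees as flat coordinates,
    which is fine for a toy example but not for real work)."""
    glon, glat = np.meshgrid(grid_lon, grid_lat)
    grid = np.zeros_like(glon, dtype=float)
    for j in range(glon.shape[0]):
        for i in range(glon.shape[1]):
            d = np.hypot(st_lon - glon[j, i], st_lat - glat[j, i])
            if np.any(d < 1e-6):                  # grid point sits on a station
                grid[j, i] = st_temp[np.argmin(d)]
            else:
                w = 1.0 / d**power
                grid[j, i] = np.sum(w * st_temp) / np.sum(w)
    return grid

# Three hypothetical stations and a coarse 1-degree grid (values are made up)
lons = np.array([10.0, 12.5, 11.0])
lats = np.array([50.0, 51.0, 52.5])
temps = np.array([14.2, 13.1, 11.8])
print(idw_grid(lons, lats, temps, np.arange(10, 13), np.arange(50, 53)))
```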

Paper 13 (Publication, collation of results, presentation of audit trails) gets into the issue of how the derived products (levels 4 and 5 data) will be described in publications, and how to ensure reproducibility of results. Most importantly, publication of papers describing each derived product is an important part of making the dataset available to the community, and documenting it. Published papers need to give detailed version information for all data that was used, to allow others to retrieve the same source data. Any homogenisation algorithms that are applied ought to have also been described in the peer reviewed literature, and tested against the standard benchmarks (and presumably the version details will be given for these algorithms too). To ensure audit trails are available, all derived products in the databank must include details on the stations and periods used, quality control flags, breakpoint locations and adjustment factors, any ancillary datasets, and any intermediate steps especially for iterative homogenization procedures. Oh, and the databank should provide templates for the acknowledgements sections of published papers.

As an aside, I can’t help but think this imposes a set of requirements on the scientific community (or at least the publication process) that contradicts the point made in paper 6 about not being in the game of assessing whether higher level products meet certain scientific standards.

Paper 14 (Solicitation of input from the community at large including non-climate fields and discussion of web presence) tackles the difficult question of how to manage communication with broader audiences, including non-specialists and the general public. However, it narrows the scope of the discussion, treating as useful input from this broader audience only contributions to data collection, analysis and visualization (although it does acknowledge the role of broader feedback about the project as a whole and the consequences of the work).

Three distinct groups of stakeholders are identified: (i) the scientific community who already work with this type of data, (ii) active users of derived products, but who are unlikely to make contributions directly to the datasets and (iii) the lay audience who may need to understand and trust the work that is done by the other two groups.

The paper discusses the role of various communication channels (email, blogs, wikis, the peer reviewed literature, workshops, etc) for each of these stakeholder groups. There’s some discussion of the risks associated with making the full datasets completely open, for example the potential for users to misunderstand the metadata and data quality fields, leading to confused analyses and time-consuming discussions to clarify such issues.

The paper also suggests engaging with schools and with groups of students, for example by proposing small experiments with the data, and hosting networks of schools doing their own data collection and comparison.

Paper 15 (Governance) is a very short discussion, giving some ideas for appropriate steering committees and reporting mechanisms. The project has been endorsed by the various international bodies WMO, WCRP and GCOS, and therefore will be jointly owned by them. Funding will be pursued from the European Framework program, NSF, Google.org, etc. Finally, Paper 16 (Interactions with other activities) describes other related projects, which may partially overlap with this effort, although none of them are directly tackling the needs outlined in this project.

Great news – I’ve had my paper accepted for the 2010 FSE/SDP Workshop on the Future of Software Engineering Research, in Santa Fe, in November! The workshop sounds very interesting – 2 days intensive discussion on where we as a research community should be going. Here’s my contribution:

Climate Change: A Grand Software Challenge

Abstract

Software is a critical enabling technology in nearly all aspects of climate change, from the computational models used by climate scientists to improve our understanding of the impact of human activities on earth systems, through to the information and control systems needed to build an effective carbon-neutral society. Accordingly, we, as software researchers and software practitioners, have a major role to play in responding to the climate crisis. In this paper we map out the space in which our contributions are likely to be needed, and suggest a possible research agenda.

Introduction

Climate change is likely to be the defining issue of the 21st century. The science is unequivocal – concentrations of greenhouse gases are rising faster than in any previous era in the earth’s history, and the impacts are already evident [1]. Future impacts are likely to include a reduction of global food and water supplies, more frequent extreme weather events, sea level rise, ocean acidification, and mass extinctions [10]. In the next few decades, serious impacts are expected on human health from heat stress and vector-borne diseases [2].

Unfortunately, the scale of the systems involved makes the problem hard to understand, and hard to solve. For example, the additional carbon in greenhouse gases tends to remain in atmosphere-ocean circulation for centuries, which means past emissions commit us to further warming throughout this century, even if new emissions are dramatically reduced [12]. The human response is also very slow – it will take decades to complete a worldwide switch to carbon-neutral energy sources, during which time atmospheric concentrations of greenhouse gases will continue to rise. These lags in the system mean that further warming is inevitable, and catastrophic climate disruption is likely on the business-as-usual scenario.

Hence, we face a triple challenge: mitigation to avoid the worst climate change effects by rapidly transitioning the world to a low-carbon economy; adaptation to re-engineer the infrastructure of modern society so that we can survive and flourish on a hotter planet; and education to improve public understanding of the inter-relationships of the planetary climate system and human activity systems, and of the scale and urgency of the problem.

These challenges are global in nature, and pervade all aspects of society. To address them, researchers, engineers, policymakers, and educators from many different disciplines need to come to the table and ask what they can contribute. In the short term, we need to deploy, as rapidly as possible, existing technology to produce renewable energy [8], and design government policies and international treaties to bring greenhouse gas emissions under control. In the longer term, we need to complete the transition to a global carbon-neutral society by the latter half of this century [1]. Meeting these challenges will demand the mobilization of entire communities of expertise.

Software plays a major role, both as part of the problem and as part of the solution. A large part of the massive growth of energy consumption in the past few decades is due to the manufacture and use of computing and communication technologies, and the technological advances they make possible. Energy efficiency has never been a key requirement in the development of software-intensive technologies, and so there is a very large potential for efficiency improvements [16].

But software also provides the critical infrastructure that supports the scientific study of climate change, and the use of that science by society. Software allows us to process vast amounts of geoscientific data, to simulate earth system processes, to assess the implications, and to explore possible policy responses. Software models allow scientists, activists and policymakers to share data, explore scenarios, and validate assumptions. The extent of this infrastructure is often invisible, both to those who rely on it, and to the general public [6]. Yet weaknesses in this software (whether real or imaginary) will impede our ability to make progress in tackling climate change. We need to solve hard problems to improve the way that society finds, assesses, and uses knowledge to support collective decision-making.

In this paper, we explore the role of the software community in addressing these challenges, and the potential for software infrastructure to bridge the gaps between scientific disciplines, policymakers, the media, and public opinion. We also identify critical weaknesses in our ability to develop and validate this software infrastructure, particularly as traditional software engineering methods are poorly adapted to the construction of such a vast, evolving knowledge-intensive software infrastructure.

Now read the full paper here (don’t worry, it’s only four pages, and you’ve now already read the first one!)

Oh, and many thanks to everyone who read drafts of this and sent me comments!

Over at Only in it for the Gold, Michael Tobis has been joining the dots about recent climate disruption in Russia and Pakistan, and asking some hard questions. I think it’s probably too early to treat this as a symptom that we’ve entered a new climate regime, but it does help to clarify a few things. Like the fact that a few degrees average temperature rise isn’t really the thing we should worry about – a change in the global average temperature is just a symptom of the real problem. The real problem is the disruption to existing climates in unpredictable ways at unpredictable times, caused by a massive injection of extra energy into the Earth’s systems. Sure, this leads to a measurable rise in the global average temperature, but it’s all that extra energy slopping around, disrupting existing climate regimes, that should scare us witless.

Look at this pattern of temperature anomalies for July, and consider the locations of both Moscow and the headwaters of the rivers of Pakistan (from NASA). The world’s climate system has developed a new pattern. This specific pattern is probably temporary, but the likelihood of more weird patterns in different parts of the world will only grow:

Global temperature anomalies, July 2010 (NASA)

As I said, the future is already here, it’s just not evenly distributed.

Which means that for much of this year, the North American media has been telling the wrong story. They were obsessed with an oil spill in the gulf, and the environmental damage it caused. Only one brave media outlet realised this wasn’t the real story – the real story is the much bigger environmental disaster that occurs when the oil doesn’t spill but makes it safely to port. Trust the Onion to tell it like it is.

I’ve pointed out a number of times that the software processes used to build the Earth System Models used in climate science don’t look anything like conventional software engineering practices. One very noticeable difference is the absence of detailed project plans, estimates, development phases, etc. While scientific steering committees do discuss long term strategy and set high level goals for the development of the model, the vast majority of model development work occurs bottom-up, through a series of open-ended, exploratory changes to the code. The scientists who work most closely with the models get together and decide what needs doing, typically on a week-to-week basis. Which is a little like agile planning, but without any of the agile planning techniques. Is this the best approach? Well, if the goal was to deliver working software to some external customer by a certain target date, then probably not. But that’s not the goal at all – the goal is to do good science. Which means that much of the work is exploratory and opportunistic.  It’s difficult to plan model development in any detail, because it’s never clear what will work, nor how long it will take to try out some new idea. Nearly everything that’s worth doing to improve the model hasn’t been done before.

This approach also favours a kind of scientific bricolage. Imagine we have sketched out a conceptual architecture for an earth system model. The conventional software development approach would be to draw up a plan to build each of the components on a given timeline, such that they would all be ready by some target date for integration. And it would fail spectacularly, because it would be impossible to estimate timelines for each component – each part involves significant new research. The best we can do is to get groups of scientists to go off and work on each subsystem, and wait to see what emerges. And to be willing to try incorporating new pieces of code whenever they seem to be mature enough, no matter where they came from.

So we might end up with a coupled earth system model where each of the major components was built at a different lab, each was incorporated into the model at a different stage in its development, and none of this was planned long in advance. And, as a consequence, each component has its own community of developers and users whose goals often diverge from the goals of the overall earth system model. Typically, each community wants to run its component model in stand-alone mode, to pursue scientific questions specific to that subfield. For example, ocean models are built by oceanographers to study oceanography. Plant growth models are built by biologists to study the carbon cycle. And so on.

One problem is that if you take components from each of these communities to incorporate into a coupled model, you don’t want to fork the code. A fork would give you the freedom to modify the component to make it work in the coupled scheme. But, as with forking in open source projects, this is nearly always a mistake. It fragments the community, and means the forked copy no longer gets the ongoing improvements to the original software (or more precisely, it quickly becomes too costly to transplant such improvements into the forked code). Access to the relevant community of expertise and their ongoing model improvements are at least as important as any specific snapshot of their code, otherwise the coupled model will fail to keep up with the latest science. Which means a series of compromises must be made – some changes might be necessary to make the component work in a coupled scheme, but these must not detract from the ability of the community to continue working with the component as a stand-alone model.

So, building an earth system model means assembling a set of components that weren’t really designed to work together, and a continual process of negotiation between the requirements for the entire coupled model and the requirements of the individual modeling communities. The alternative, re-building each component from scratch, doesn’t make sense financially or scientifically. It would be expensive and time consuming, and you’d end up with untested software, that scientifically, is several years behind the state-of-the-art. [Actually, this might be true of any software: see this story of the netscape rebuild].

Over the long term, a set of conventions has emerged that helps to make it easier to couple together components built by different communities. These include the basic data formatting and message passing standards, as well as standard couplers. And more recently, modeling frameworks, metadata standards and data sharing infrastructure. But as with all standardization efforts, it takes a long time (decades?) for these to be accepted across the various modeling communities, and there is always resistance, in part because meeting the standard incurs a cost and usually detracts from the immediate goals of each particular modeling community (with the benefits accruing elsewhere – specifically to those interested in working with coupled models). Remember: these models are expensive scientific instruments. Changes that limit the use of the component as a standalone model, or which tie it to a particular coupling scheme, can diminish its value to the community that built it.

So, we’re stuck with the problem of incorporating a set of independently developed component models, without the ability to impose a set of interface standards on the teams that build the components. The interface definitions have to be continually re-negotiated. Bryan Lawrence has some nice slides on the choices, which he characterizes as the “coupler approach” and the “framework approach” (I shamelessly stole his diagrams…)

The coupler approach leaves the models almost unchanged, with a communication library doing any necessary transformation on the data fields.

The framework approach splits the original code into smaller units, adapting their data structures and calling interfaces, allowing them to be recombined in a more appropriate calling hierarchy.

The advantage of the coupler approach is that it requires very little change to the original code, and allows the coupler itself to be treated as just another stand-alone component that can be re-used by other labs. However, it’s inefficient, and seriously limits the opportunities to optimize the run configuration: while the components can run in parallel, the coupler must still wait on each component to do its stuff.
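To make the contrast concrete, here’s a toy sketch of the coupler pattern (purely illustrative, not modelled on any real coupler): each component keeps its own grid, data structures and time stepping, and the coupler’s only job is to exchange and regrid boundary fields at each coupling interval:

```python
import numpy as np

rng = np.random.default_rng(1)

class ToyAtmosphere:
    """Stands in for an atmosphere model with its own (finer) grid."""
    def __init__(self):
        self.sst = np.zeros(16)          # sea surface temperature received from the ocean
    def step(self):
        return 0.1 * rng.random(16)      # export a made-up surface wind stress field

class ToyOcean:
    """Stands in for an ocean model with its own (coarser) grid."""
    def __init__(self):
        self.stress = np.zeros(8)        # wind stress received from the atmosphere
    def step(self):
        return 285 + rng.random(8)       # export a made-up sea surface temperature field

def regrid(field, n_target):
    """Crude regridding by linear interpolation between the two grids."""
    return np.interp(np.linspace(0, 1, n_target), np.linspace(0, 1, field.size), field)

atm, ocn = ToyAtmosphere(), ToyOcean()
for _ in range(3):                        # the coupler drives both components...
    stress, sst = atm.step(), ocn.step()  # ...waiting on each to finish its step
    ocn.stress = regrid(stress, ocn.stress.size)   # transform fields between grids
    atm.sst = regrid(sst, atm.sst.size)
print(atm.sst[:4])
```

In the framework approach, by contrast, the components would be refactored into units called directly by a single driver, rather than exchanging whole fields through an intermediary.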

The advantage of the framework approach is that it produces a much more flexible and efficient coupled model, with more opportunities to lay out the subcomponents across a parallel machine architecture, and a greater ability to plug other subcomponents in as desired. The disadvantage is that component models might need substantial re-factoring to work in the framework. The trick here is to get the framework accepted as a standard across a variety of different modeling communities. This is, of course, a bit of a chicken-and-egg problem, because its advantages have to be clearly demonstrated with some success stories before such acceptance can happen.

There is a third approach, adopted by some of the bigger climate modeling labs: build everything (or as much as possible) in house, and build ad hoc interfaces between various components as necessary. However, as earth system models become more complex, and incorporate more and more different physical, chemical and biological processes, the ability to do it all in-house is getting harder and harder. This is not a viable long term strategy.