Here’s an announcement for a Workshop on Climate Knowledge Discovery, to be held at the Supercomputing 2011 Conference in Seattle on 13 November 2011.

Numerical simulation-based science follows a new paradigm: its knowledge discovery process rests on massive amounts of data. We are entering the age of data-intensive science. Climate scientists generate data faster than it can be interpreted, and need to prepare for further exponential data increases. Current analysis approaches rely primarily on traditional methods, best suited to large-scale phenomena and coarse-resolution data sets. Tools that combine high-performance analytics with algorithms motivated by network science, nonlinear dynamics and statistics, as well as data mining and machine learning, could provide unique insights into challenging features of the Earth system, including extreme events and chaotic regimes. The breakthroughs needed to address these challenges will come from collaborative efforts involving several disciplines, including end-user scientists, computer and computational scientists, computing engineers, and mathematicians.

The SC11 CKD workshop will bring together experts from various domains to investigate the use and application of large-scale graph analytics, semantic technologies and knowledge discovery algorithms in climate science. The workshop is the second in a series of planned workshops to discuss the design and development of methods and tools for knowledge discovery in climate science.

Proposed agenda topics include:

  • Science Vision for Advanced Climate Data Analytics
  • Application of massive scale data analytics to large-scale distributed interdisciplinary environmental data repositories
  • Application of networks and graphs to spatio-temporal climate data, including computational implications
  • Application of semantic technologies in climate data information models, including RDF and OWL
  • Enabling technologies for massive-scale data analytics, including graph construction, graph algorithms, graph-oriented computing and user interfaces

The first Climate Knowledge Discovery (CKD) workshop was hosted by the German Climate Computing Center (DKRZ) in Hamburg, Germany from 30 March to 1 April 2011. This workshop brought together climate and computer scientists from major US and European laboratories, data centers and universities, as well as representatives from the industry, the broader academic, and the Semantic Web communities. Papers and presentations are available online.

We hope that you will be able to participate and look forward to seeing you in Seattle. For further information or questions, please do not hesitate to contact any of the co-organizers:

  • Reinhard Budich – MPI für Meteorologie (MPI-M)
  • John Feo – Pacific Northwest National Laboratory (PNNL)
  • Per Nyberg – Cray Inc.
  • Tobias Weigel – Deutsches Klimarechenzentrum GmbH (DKRZ)

I’ve been invited to a workshop at the UK Met Office in a few weeks’ time, to brainstorm a plan to create (and curate) a new global surface temperature data archive. Probably the best introduction to this is the article by Stott and Thorne in Nature, back in May.

There’s now a series of white papers, setting out some of the challenges and soliciting input from a broad range of stakeholders prior to the workshop. The white papers are available online, and there’s a moderated blog to collect comments, which is open until Sept 1st (yes, I know that’s real soon now – I’m a little slow blogging this).

I’ll blog some of my reflections on what I think is missing from the white papers over the next few days. For now, here’s a quick summary of the white papers and the issues they cover (yes, the numbering starts at 3 – don’t worry about it!)

Paper #3, on Retrieval of Historical Data is a good place to start, as it sets out the many challenges in reconstructing a fully traceable archive of the surface temperature data. It offers the following definitions of the data products:

  • Level 0: original raw instrumental readings, or digitized images of logs;
  • Level 1: data as originally keyed in, typically converted to some local (native) format;
  • Level 2: data converted to common format;
  • Level 3: data consolidated into a databank;
  • Level 4: quality-controlled derived product (e.g. corrected for station biases, etc.);
  • Level 5: homogenized derived product (e.g. regridded, interpolated, etc.)

The central problem is that most existing temperature records are level 3 data or above, and traceability to lower levels has not been maintained. The original records are patchy, and sometimes only higher level products have been archived. Also, there are multiple ways of deriving higher level products, in some cases because of improved techniques that supersede previous approaches, and in other cases because of multiple valid methodologies suited to different analysis purposes.
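
These level definitions amount to a chain of derivations, and “full traceability” means being able to walk that chain back to the raw readings. Here’s a minimal sketch of the idea in Python (the class and field names are my own invention, not from the white papers):

```python
# A toy model of level-to-level traceability: each product links to the
# lower-level product it was derived from, and records the processing step.
from dataclasses import dataclass
from typing import Optional

@dataclass
class DataProduct:
    level: int                              # 0 = raw readings ... 5 = homogenized product
    description: str
    method: Optional[str] = None            # processing step that produced this level
    source: Optional["DataProduct"] = None  # link to the lower-level input

    def lineage(self):
        """Walk back through the chain of source products."""
        product = self
        while product is not None:
            yield product
            product = product.source

raw = DataProduct(0, "scanned ship's log, voyage 1887")
keyed = DataProduct(1, "keyed-in readings", method="manual transcription", source=raw)
common = DataProduct(2, "converted to common format", method="format conversion", source=keyed)
databank = DataProduct(3, "consolidated databank entry", method="merge", source=common)

# Full traceability means we can recover every step back to level 0:
levels = [p.level for p in databank.lineage()]
print(levels)  # [3, 2, 1, 0]
```

The problem described above is precisely that, for most existing records, the `source` links below level 3 are missing.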

The effort to recover the original source data will be expensive, and hence some prioritization criteria will be needed. It will often be hard to tell whether peripheral information will turn out to be important; e.g. comments in ships’ log books may provide important context to explain anomalies in the data. The paper suggests prioritizing records that add substantially to the existing datasets – e.g. under-represented regions, especially cases where it’s likely to be easy to get agreement from the bodies (e.g. national centres) that hold the data records.

Scoping decisions will be hard too. The focus is on surface air temperature records, but it might be cost-effective to include related data, such as all parameters from land stations, and, anticipating an interest in extremes, perhaps hydrological data too… And so on. Also, the original paper-based records are important as historical documents, for purposes beyond meteorology; hence, scanned images may be important, in addition to the digital data extraction.

Data records exist at various temporal resolutions (hourly, daily, monthly, seasonal, etc.), but the availability of each type is variable. By retrieving the original records, it may be possible to backfill the records at these different resolutions, but this won’t necessarily produce consistent records, due to differences in the techniques used to produce aggregates. Furthermore, differences occur anyway between regions, and even between different eras in the same series. Hence, homogenization is tricky. Full traceability between different data levels and the processing techniques that link them is therefore an important goal, but will be very hard to achieve given the size and complexity of the data, and the patchiness of the metadata. In many cases the metadata is poor or non-existent; this includes descriptions of the stations themselves, the instruments used, calibration, precision, and even the units and timings of readings.

Then of course there is the problem of ownership. Much of the data was originally collected by national meteorological services, some of which depend on revenues from this data for their very operations, and some are keen to protect their interests in using this data to provide commercial forecasting services. Hence, it won’t always be possible to release all the lower level data publicly.

Suitable policies will be needed to decide what to do when the lower level data from which level 3 data was derived is no longer available. We probably don’t want to exclude such data, but we do need to flag it clearly. We need to give end users full flexibility in deciding how to filter the products they want to use.

Finally, the paper takes pains to point out how large an effort it will take to recover, digitize and make traceable all the level 0, 1 and 2 data. Far more paper-based records exist than there is effort available to digitize them. The authors speculate about crowdsourcing the digitization, but that brings quality control issues. Also, some of the paper records are fragile and deteriorating (which might also imply some urgency).

(The paper also lists a number of current global and national databanks, with some notes on what each contains, along with some recent efforts to recover lower level data for similar datasets.)

Paper #4 on Near Real-Time Updates describes the existing Global Telecommunications System (GTS) used by the international meteorological community, which is probably easiest to describe via a couple of pictures:

Data Collection by the National Meteorological Services (NMS)

National Meteorological Centers (NMC) and Regional Telecommunications Hubs (RTH) in the WMO's Global Telecommunication System

The existing global telecommunications system is good for collecting low time-resolution (e.g. monthly) data, but hasn’t kept pace with the need for rapid transmission of daily and sub-daily data, nor does it do a particularly good job with metadata. The paper mentions a target of 24 hours for transmission of daily and sub-daily data, and within 5 days of the end of the month for monthly data, but points out that the target is rarely met. And it describes some of the weaknesses in the existing system:

  • The system depends on a set of catalogues that define the station metadata and routing tables (lists of who publishes and subscribes to each data stream), which allow the data transmission to be very terse. But these catalogues aren’t updated frequently enough, leading to many apparent inconsistencies in the data, which can be hard to track down.
  • Some nations lack the resources to transmit their data in a timely manner (or in some cases, at all)
  • Some nations are slow to correct errors in the data record (e.g. when the wrong month’s data is transmitted)
  • Attempts to fill gaps and correct errors often yield data via email and/or parcel post, which therefore bypasses the GTS, so availability isn’t obvious to all subscribers.
  • The daily and sub-daily data often isn’t shared via the GTS, which means the historical record is incomplete.
  • There is no mechanism for detecting and correcting errors in the daily data.
  • The daily data also contains many errors, due to differences in defining the 24-hour reporting period (it’s supposed to be midnight to midnight UTC time, but often isn’t)
  • The international agreements aren’t in place for use of the daily data (although there is a network of bi-lateral agreements), and it is regarded as commercially valuable by many of the national meteorological services.
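
To see why the choice of 24-hour reporting window matters (the last-but-one point above), here’s a toy illustration with synthetic temperatures and a hypothetical station at UTC-6: whenever there is any trend in the data, aggregating midnight-to-midnight UTC and midnight-to-midnight local time gives different “daily” means for the same nominal day.

```python
import math
import statistics

# 48 synthetic hourly temperatures: a diurnal cycle plus a slow warming trend
temps = [15 + 8 * math.sin(2 * math.pi * (h - 9) / 24) + 0.05 * h for h in range(48)]

# The same nominal "day", aggregated under two reporting conventions
utc_day = statistics.mean(temps[0:24])    # midnight-to-midnight UTC
local_day = statistics.mean(temps[6:30])  # midnight-to-midnight local time at UTC-6

# The diurnal cycle averages out over any full 24-hour window,
# but the trend makes the two conventions disagree:
print(round(local_day - utc_day, 3))  # → 0.3
```

With real station data the disagreement is messier still, since the offset itself is often undocumented in the metadata.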

Paper #5 on Data Policy describes the current state of surface temperature records (e.g. those held at CRU and NOAA-NCDC), which contain just monthly averages for a subset of the available stations. These archives don’t store any of the lower level data sources, and differ where they’ve used different ways of computing the monthly averages (e.g. mean of the 3-hourly observations, versus mean of the daily minima and maxima). While in theory, the World Meteorological Organization (WMO) is committed to free exchange of the data collected by the national meteorological services, in practice there is a mix of different restrictions on data from different providers. For example, some is restricted to academic use only, while other providers charge fees for the data to enable them to fund their operations. In both cases, handing the data on to third parties is therefore not permitted.
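
The two averaging conventions mentioned really do diverge on the same data; a toy illustration with synthetic numbers (not real observations):

```python
import statistics

# One day of 3-hourly temperature observations (degrees C):
# cool overnight, warm in the afternoon
day = [8.0, 7.0, 9.0, 14.0, 19.0, 21.0, 16.0, 11.0]

mean_of_obs = statistics.mean(day)            # mean of the 3-hourly readings
mean_of_extremes = (min(day) + max(day)) / 2  # mean of daily min and max

print(round(mean_of_obs, 2))       # → 13.12
print(round(mean_of_extremes, 2))  # → 14.0
```

Neither method is wrong; they are simply different statistics, which is why archives built on different conventions can’t be merged without traceability back to the underlying observations.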

One response to this problem has been to run a series of workshops in various remote parts of the world, in which local datasets are processed to produce high quality derived products, even where the low level data cannot be released. These workshops have the benefit of engaging the local meteorological services in analyzing regional climate change (often for the first time), and raising awareness of the importance of data sharing.

Paper #6 on Data provenance, version control, configuration management is a first attempt at identifying the requirements for curating the proposed data archive (I wish they’d use the term ‘curating’ in the white papers). The paper starts by making a very important point: the aim is not “to assess derived products as to whether they meet higher standards required by specific communities (i.e. scientific, legal, etc.)” but rather it’s “to archive and disseminate derived products as long as the homogenization algorithm is documented by the peer review process”. Which is important, because it means the goal is to support the normal process of doing science, rather than to constrain it.

Some of the identified requirements are:

  • The need for a process (the paper suggests a certification panel) to rate the authenticity of source material and its relationship to primary sources; and that this process must be dynamic, because of the potential for new information to cast doubt on material previously rated as authentic.
  • The need for version control, and the difficult question of what counts as a configuration unit for versioning. E.g. temporal blocks (decade-by-decade?), individual surface stations, regional datasets, etc?
  • The need for a pre-authentication database to hold potential updates prior to certification
  • The need to limit the frequency of version changes on the basic (level 2 and below) data, due to the vast amount of work that will be invested into science based on these.
  • The need to version control all the software used for producing the data, along with the test cases too.
  • The likelihood that there will be multiple versions of a station record at level 1, with varying levels of confidence rating.

Papers 8 (Creation of quality controlled homogenised datasets from the databank), 9 (Benchmarking homogenisation algorithm performance against test cases) and 10 (Dataset algorithm performance assessment based upon all efforts) go into detail about the processes used by this community for detecting bugs (inhomogeneities) in the data, and for fixing them. Such bugs arise most often because of changes over time in some aspect of the data collection at a particular station, or in the algorithms used to process the data. A particularly famous example is urbanization: a recording station that was originally in a rural environment ends up in an urban one, and hence may suffer from the urban heat island effect.

I won’t go into detail here on these problems (read the papers!) except to note that the whole problem looks to me very similar to code debugging: there are an unknown number of inhomogeneities in the dataset, we’re unlikely to find them all, and some of them have been latent for so long, with so much subsequent work overlaid on them, that they might end up being treated as features if we can establish that they don’t impact the validity of that work. Also, the process of creating benchmarks to test the skill of homogenisation algorithms looks very much like bug seeding techniques – we insert deliberate errors into a realistic dataset and check how many are detected.
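
The bug-seeding analogy can be made concrete. Here’s a toy sketch, with synthetic data and a deliberately naive detector (real benchmarking exercises use far more realistic series and algorithms): we seed known step changes into a clean series, then check how many the detector recovers.

```python
import random
import statistics

random.seed(42)

N = 240  # 20 years of monthly temperature anomalies
clean = [random.gauss(0.0, 0.15) for _ in range(N)]

# Seed two known step changes (e.g. a station move, an instrument change)
seeded_breaks = {80: 1.0, 160: -0.8}
series = clean[:]
for pos, shift in seeded_breaks.items():
    for i in range(pos, N):
        series[i] += shift

def detect_breaks(x, window=24, threshold=0.4):
    """Flag points where the mean shifts by more than `threshold`
    between the preceding and the following `window` months."""
    hits = []
    for i in range(window, len(x) - window):
        before = statistics.mean(x[i - window:i])
        after = statistics.mean(x[i:i + window])
        if abs(after - before) > threshold:
            hits.append(i)
    return hits

detected = detect_breaks(series)
# Score the detector: a seeded break counts as recovered if any hit lands nearby
recovered = sum(any(abs(d - pos) <= 6 for d in detected) for pos in seeded_breaks)
print(f"seeded {len(seeded_breaks)} breaks, recovered {recovered}")
```

The benchmarking papers essentially run this game at scale: the seeded inhomogeneities are hidden from the algorithm developers, and each homogenisation algorithm is scored on how many it finds (and how many false alarms it raises).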

Paper 11 (Spatial and temporal interpolation) covers interpolation techniques used to fill in missing data, and/or to convert the messy real data to a regularly spaced grid. The paper also describes the use of reanalysis techniques, whereby a climate model, constrained by whatever observational data is available over a period of time, is run and its values are used to fill in the blanks, iterating until a best fit with the real data is achieved.
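
For the gridding side, here’s a minimal sketch of one classic technique among many, inverse-distance weighting (the station coordinates and values below are synthetic; I’m not claiming this is the method the paper recommends):

```python
# Estimate values on a regular grid from scattered station readings,
# weighting each station by the inverse square of its distance.

# Scattered "stations": (x, y, temperature) in arbitrary units
stations = [(0.1, 0.2, 14.0), (0.9, 0.3, 18.0), (0.5, 0.8, 12.0), (0.2, 0.9, 11.0)]

def idw(x, y, points, power=2):
    """Inverse-distance-weighted estimate at (x, y)."""
    num = den = 0.0
    for sx, sy, value in points:
        d2 = (x - sx) ** 2 + (y - sy) ** 2
        if d2 == 0:
            return value  # exactly at a station: return its reading
        w = 1.0 / d2 ** (power / 2)
        num += w * value
        den += w
    return num / den

# Fill a regular 4x4 grid over the unit square
grid = [[idw(i / 3, j / 3, stations) for i in range(4)] for j in range(4)]
for row in grid:
    print([round(v, 1) for v in row])
```

One property worth noting: the estimate is always a weighted average of the station values, so it can never produce a gridded value outside the observed range — unlike reanalysis, where the model physics can (legitimately) extrapolate.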

Paper 13 (Publication, collation of results, presentation of audit trails) gets into the issue of how the derived products (levels 4 and 5 data) will be described in publications, and how to ensure reproducibility of results. Most importantly, publication of papers describing each derived product is an important part of making the dataset available to the community, and documenting it. Published papers need to give detailed version information for all data that was used, to allow others to retrieve the same source data. Any homogenisation algorithms that are applied ought to have also been described in the peer reviewed literature, and tested against the standard benchmarks (and presumably the version details will be given for these algorithms too). To ensure audit trails are available, all derived products in the databank must include details on the stations and periods used, quality control flags, breakpoint locations and adjustment factors, any ancillary datasets, and any intermediate steps especially for iterative homogenization procedures. Oh, and the databank should provide templates for the acknowledgements sections of published papers.

As an aside, I can’t help but think this imposes a set of requirements on the scientific community (or at least the publication process) that contradicts the point made in paper 6 about not being in the game of assessing whether higher level products meet certain scientific standards.

Paper 14 (Solicitation of input from the community at large including non-climate fields and discussion of web presence) tackles the difficult question of how to manage communication with broader audiences, including non-specialists and the general public. However, it narrows the scope of the discussion, treating as useful inputs from this broader audience only contributions to data collection, analysis and visualization (although it does acknowledge the role of broader feedback about the project as a whole and the consequences of the work).

Three distinct groups of stakeholders are identified: (i) the scientific community who already work with this type of data, (ii) active users of derived products who are unlikely to contribute directly to the datasets, and (iii) the lay audience who may need to understand and trust the work done by the other two groups.

The paper discusses the role of various communication channels (email, blogs, wikis, the peer-reviewed literature, workshops, etc.) for each of these stakeholder groups. There’s some discussion of the risks of making the full datasets completely open, for example the potential for users to misunderstand the metadata and data quality fields, leading to confused analyses and to time-consuming discussions with users to clarify such issues.

The paper also suggests engaging with schools and with groups of students, for example by proposing small experiments with the data, and hosting networks of schools doing their own data collection and comparison.

Paper 15 (Governance) is a very short discussion, giving some ideas for appropriate steering committees and reporting mechanisms. The project has been endorsed by the various international bodies (WMO, WCRP and GCOS), and therefore will be jointly owned by them. Funding will be pursued from the European Framework program, NSF, etc. Finally, Paper 16 (Interactions with other activities) describes other related projects, which may partially overlap with this effort, although none of them directly tackles the needs outlined in this project.

I like playing with data. One of my favourite tools is Gapminder, which allows you to plot graphs with any of a large number of country-by-country indicators, and even animate the graphs to see how they change over time. For example, looking at their CO2 emissions data, I could plot CO2 emissions against population (notice the yellow and red dots at the top: the US and China respectively – both with similar total annual emissions, but the US much worse on emissions per person). Press the ‘play’ button to see everyone’s emissions grow year-by-year, and play around with different indicators.

Gapminder looks good, but it’s lacking a narrative – these various graphs are only really interesting when used to tell a story. You get some sense of how to add narrative with the videos of presentations based on Gapminder, for example, this gapcast, which creates a narrative around the CO2 emissions data for the US and China.

But narrative on its own isn’t enough. We also need a way to challenge such narratives. For example, the gapcast above makes it clear that China’s gross annual emissions caught up with the US in the last couple of years, largely because of China’s reliance on coal as a cheap source of electricity. But what it doesn’t tell you is that a significant chunk (one fifth) of China’s emissions are due to carbon outsourcing: creation of goods and services exported to the west. In other words, one fifth of China’s emissions really ought to be counted as belonging to the US and Europe, because it’s our desire for cheap stuff that leads to all that coal being burnt. Without this information, the Gapminder graphs are misleading.

The only tool I’ve come across so far for challenging narratives in this way is: the blog. Many of my favourite blog posts are written as reactions (challenges) to someone else’s narrative. Which leads me to suggest that the primary value of a blog isn’t so much the contents per se, but the way each post creates new links between existing chunks of information, and adds commentary to those links. Now if only I had a tool for visualizing those links, so I could get an overview of who’s commenting on what, without having to read through thousands of blog posts…

Here’s a challenge for the requirements modelling experts. I’ve phrased it as an exam question for my graduate course on requirements engineering (the course is on hiatus, which is lucky, because it would be a long exam…):

Q: The governments of all the nations on a small blue planet want to fix a problem with the way their reliance on fossil fuels is altering the planet’s climate. Draw a goal model (using any appropriate goal modeling notation) showing the key stakeholders, their interdependencies, and their goals. Be sure to show how the set of solutions they are considering contribute to satisfying their goals. The attached documents may be useful in answering this question: (a) An outline of the top-level goals; (b) A description of the available solutions, characterized as a set of Stabilization Wedges; (c) A domain expert’s view of the feasibility of the solutions.

Update: Someone’s done the initial identification of actors already.