I’ve been using the term Climate Informatics informally for a few years to capture the kind of research I do, at the intersection of computer science and climate science. So I was delighted to be asked to give a talk at the second annual workshop on Climate Informatics at NCAR, in Boulder this week. The workshop has been fascinating – an interesting mix of folks doing various kinds of analysis on (often huge) climate datasets, mixing up techniques from Machine Learning and Data Mining with the more traditional statistical techniques used by field researchers, and the physics-based simulations used in climate modeling.

I was curious to see how this growing community defines itself – i.e. what does the term “climate informatics” really mean? Several of the speakers offered definitions, largely drawing on the idea of the Fourth Paradigm, a term coined by Jim Gray, who explained it as follows. Originally, science was purely empirical. In the last few centuries, theoretical science came along, using models and generalizations, and in the latter half of the twentieth century, computational simulations. Now, with the advent of big data, we can see a fourth scientific research paradigm emerging, sometimes called eScience, focussed on extracting new insights from vast collections of data. By this view, climate informatics could be defined as data-driven inquiry, and hence offers a complement to existing approaches to climate science.

However, there’s still some confusion, in part because the term is new, and crosses disciplinary boundaries. For example, some people expected that Climate Informatics would encompass the problems of managing and storing big data (e.g. the 3 petabytes generated by the CMIP5 project, or the exabytes of observational data that is now taxing the resources of climate data archivists). However, that’s not what this community does. So, I came up with my own attempt to define the term:

I like this definition for three reasons. First, by bringing Information Science into the mix, we can draw a distinction between climate informatics and other parts of computer science that are relevant to climate science (e.g. the work of building the technical infrastructure for exascale computing, designing massively parallel machines, data management, etc). Secondly, information science brings with it a concern for the broader societal and philosophical questions of the nature of information and why people need it, a concern that’s often missing from computer science. Oh, and I also like this definition because I also work at the intersection of the three fields, even though I don’t really do data-driven inquiry (although I did, many years ago, write an undergraduate thesis on machine learning). Hence, it creates a slightly broader definition than just associating the term with the ‘fourth paradigm’.

Defining the field this way immediately suggests that climate informatics should also concern itself with the big picture of how we get beyond mere information, and start to produce knowledge and (hopefully) wisdom:

This diagram is adapted from a classic paper by Russ Ackoff “From Data to Wisdom”, Journal of Applied Systems Analysis, Volume 16, 1989 p 3-9. Ackoff originally had Understanding as one of the circles, but subsequent authors have pointed out that it makes more sense as one of two dimensions you move along as you make sense of the data, the other being ‘context’ or ‘connectedness’.

The machine learning community offers a number of tools primarily directed at moving from Data towards Knowledge, by finding patterns in complex datasets. The output of a machine learner is a model, but it’s a very different kind of model from the computational models used in climate science: it’s a mathematical model that describes the discovered relationships in the data. In contrast, the physics-based computational models that climate scientists build are more geared towards moving in the opposite direction, from knowledge (in the form of physical theories of climatic processes) towards data, as a way of exploring how well current theory explains the data. Of course, you can also run a climate model to project the future (and, presumably, help society choose a wise path into the future), but only once you’ve demonstrated it really does explain the data we already have about the past. Clearly the two approaches are complementary, and ought to be used together to build a strong bridge between data and wisdom.
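To make the contrast between the two kinds of model concrete, here is a toy sketch in Python. It is only an illustration under my own assumptions, not anything presented at the workshop: the function names are hypothetical, the data-driven side is an ordinary least-squares line fit standing in for a machine learner, and the physics-based side is the textbook zero-dimensional energy balance, with assumed round values for the solar constant and planetary albedo.

```python
# Data -> knowledge: learn a relationship from observations alone.
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Knowledge -> data: start from physical theory and predict a value
# that can then be compared against observations.
STEFAN_BOLTZMANN = 5.670374419e-8  # W m^-2 K^-4


def equilibrium_temperature(solar_constant=1361.0, albedo=0.3):
    """Zero-dimensional energy balance: absorbed sunlight equals emitted
    black-body radiation. Default inputs are assumed illustrative values."""
    absorbed = solar_constant * (1.0 - albedo) / 4.0  # W m^-2, averaged over the sphere
    return (absorbed / STEFAN_BOLTZMANN) ** 0.25      # kelvin


if __name__ == "__main__":
    # Made-up numbers, purely to exercise the fit; not real climate data.
    slope, intercept = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
    print("fitted relationship:", slope, intercept)
    print("predicted temperature (K):", equilibrium_temperature())
```

With those assumed defaults the energy balance gives roughly 255 K, which is the sort of prediction you would then check against observations. The fitted line, by contrast, has no physics in it at all: it only summarizes whatever pattern the data happened to contain.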

Note: You can see all the other invited talks (including mine), at the workshop archive. You can also explore the visuals I used (with no audio) at Prezi (hint: use full screen for best viewing).

As today is the deadline for proposing sessions for the AGU fall meeting in December, we’ve submitted a proposal for a session to explore open climate modeling and software quality. If we get the go-ahead for the session, we’ll be soliciting abstracts over the summer. I’m hoping we’ll get a lively session going with lots of different perspectives.

I especially want to cover the difficulties of openness as well as the benefits, as we often hear a lot of idealistic talk on how open science would make everything so much better. While I think we should always strive to be more open, it’s not a panacea. There’s evidence that open source software isn’t necessarily better quality, and of course, there’re plenty of people using lack of openness as a political weapon, without acknowledging just how many hard technical problems there are to solve along the way, not least because there’s a lack of consensus over the meaning of openness among its advocates.

Anyway, here’s our session proposal:

TITLE: Climate modeling in an open, transparent world

AUTHORS (FIRST NAME INITIAL LAST NAME): D. A. Randall (1), S. M. Easterbrook (4), V. Balaji (2), M. Vertenstein (3)

INSTITUTIONS (ALL): 1. Atmospheric Science, Colorado State University, Fort Collins, CO, United States. 2. Geophysical Fluid Dynamics Laboratory, Princeton, NJ, United States. 3. National Center for Atmospheric Research, Boulder, CO, United States. 4. Computer Science, University of Toronto, Toronto, ON, Canada.

Description: This session deals with climate-model software quality and transparent publication of model descriptions, software, and results. The models are based on physical theories but implemented as software systems that must be kept bug-free, readable, and efficient as they evolve with climate science. How do open source and community-based development affect software quality? What are the roles of publication and peer review of the scientific and computational designs in journals or other curated online venues? Should codes and datasets be linked to journal articles? What changes in journal submission standards and infrastructure are needed to support this? We invite submissions including experience reports, case studies, and visions of the future.

Here’s an announcement for a Workshop on Climate Knowledge Discovery, to be held at the Supercomputing 2011 Conference in Seattle on 13 November 2011.

Numerical simulation based science follows a new paradigm: its knowledge discovery process rests upon massive amounts of data. We are entering the age of data intensive science. Climate scientists generate data faster than it can be interpreted and need to prepare for further exponential data increases. Current analysis approaches are primarily focused on traditional methods, best suited for large-scale phenomena and coarse-resolution data sets. Tools that employ a combination of high-performance analytics, with algorithms motivated by network science, nonlinear dynamics and statistics, as well as data mining and machine learning, could provide unique insights into challenging features of the Earth system, including extreme events and chaotic regimes. The breakthroughs needed to address these challenges will come from collaborative efforts involving several disciplines, including end-user scientists, computer and computational scientists, computing engineers, and mathematicians.

The SC11 CKD workshop will bring together experts from various domains to investigate the use and application of large-scale graph analytics, semantic technologies and knowledge discovery algorithms in climate science. The workshop is the second in a series of planned workshops to discuss the design and development of methods and tools for knowledge discovery in climate science.

Proposed agenda topics include:

  • Science Vision for Advanced Climate Data Analytics
  • Application of massive scale data analytics to large-scale distributed interdisciplinary environmental data repositories
  • Application of networks and graphs to spatio-temporal climate data, including computational implications
  • Application of semantic technologies in climate data information models, including RDF and OWL
  • Enabling technologies for massive-scale data analytics, including graph construction, graph algorithms, graph oriented computing and user interfaces.

The first Climate Knowledge Discovery (CKD) workshop was hosted by the German Climate Computing Center (DKRZ) in Hamburg, Germany from 30 March to 1 April 2011. This workshop brought together climate and computer scientists from major US and European laboratories, data centers and universities, as well as representatives from the industry, the broader academic, and the Semantic Web communities. Papers and presentations are available online.

We hope that you will be able to participate and look forward to seeing you in Seattle. For further information or questions, please do not hesitate to contact any of the co-organizers:

  • Reinhard Budich – MPI für Meteorologie (MPI-M)
  • John Feo – Pacific Northwest National Laboratory (PNNL)
  • Per Nyberg – Cray Inc.
  • Tobias Weigel – Deutsches Klimarechenzentrum GmbH (DKRZ)

There’s an excellent article in the inaugural issue of Nature Climate Change this month, written by Kurt Kleiner, entitled Data on Demand. Kurt interviewed many of the people who are active in making climate code and data more open: Gavin Schmidt from NASA GISS, Nick Barnes, of the Climate Code Foundation, John Wilbanks of Creative Commons, Peter Murray-Rust at Cambridge University, David Randall, at Colorado State U, David Carlson, Director of the International Polar Year Programme, Mark Parsons of the National Snow and Ice Data Center (NSIDC), Cameron Neylon, of the UK Science and Technology Facilities Council, Greg Wilson of Software Carpentry, and me.

This post contains lots of questions and few answers. Feel free to use the comment thread to help me answer them!

Just about every scientist I know would agree that being more open is a good thing. But in practice, being fully open is fraught with difficulty, and most scientists fall a long way short of the ideal. We’ve discussed some of the challenges for openness in computational sciences before: the problem of hostile people deliberately misinterpreting or misusing anything you release, the problem of ontological drift, so that even honest collaborators won’t necessarily interpret what you release in the way you intended. And for software, all the extra effort it takes to make code ready for release, and the fact that there is no reward system in place for those who put in such effort.

Community building is a crucial success factor for open source software (and presumably, by extension, for open science). The vast majority of open source projects never build a community, so, while we often think of the major successes of open source (after all, that’s how the internet was built), these successes are vastly outnumbered by the detritus of open source projects that never took off.

Meanwhile, any lack of openness (whether real or perceived) is a stick with which to beat climate scientists, and those wielding this stick remain clueless about the technical and institutional challenges of achieving openness.

Where am I going with this? Well, I mentioned the Climate Code Foundation a while back, and I’m delighted to be serving as a member of the advisory board. We had the first meeting of the advisory board in the fall (in the open), and talked at length about organisational and funding issues, and how to get the foundation off the ground. But we didn’t get much time to brainstorm ideas for new modes of operation – for what else the foundation can do.

The foundation does have a long list of existing initiatives, and a couple of major successes, most notably, a re-implementation of GISTEMP as open source Python, which helped to validate the original GISTEMP work, and provide an open platform for new research. Moving forward, things to be done include:

  • outreach, lobbying, etc. to spread the message about the benefits of open source climate code;
  • more open source re-implementations of existing code (building on the success of ccc-GISTEMP);
  • directories / repositories of open source codes;
  • advice – e.g. white papers offering guidance to scientists on how to release their code, benefits, risks, licensing models, pitfalls to avoid, tools and resources;
  • training – e.g. workshops, tutorials etc at scientific conferences;
  • support – e.g. code sprints, code reviews, process critiques, etc.

All of which are good ideas. But I feel there’s a bit of a chicken-and-egg problem here. Once the foundation is well-known and respected, people will want all of the above, and will seek out the foundation for these things. But the climate science community is relatively conservative, and doesn’t talk much about software, coding practices, sharing code, etc in any systematic way.

To convince people, we need some high profile demonstration projects. Each such project should showcase a particular type of climate software, or a particular route to making it open source, and offer lessons learnt, especially on how to overcome some of the challenges I described above. I think such demonstration projects are likely to be relatively easy (!?) to find among the smaller data analysis tools (ccc-GISTEMP is only a few thousand lines of code).

But I can’t help but feel the biggest impact is likely to come with the GCMs. Here, it’s not clear yet what CCF can offer. Some of the GCMs are already open source, in the sense that the code is available free on the web, at least to those willing to sign a basic license agreement. But simply being available isn’t the same as being a fully fledged open source project. Contributions to the code are tightly controlled by the modeling centres, because they have to be – the models are so complex, and run in so many different configurations, that deep expertise is needed to successfully contribute to the code and test the results. So although some centres have developed broad communities of users of their models, there is very little in the way of broader communities of code contributors. And one of the key benefits of open source code is definitely missing: the code is not designed for understandability.

So where do we start? What can an organisation like the Climate Code Foundation offer to the GCM community? Are there pieces of the code that are ripe for re-implementation as clear code? Even better, are there pieces for which an open source re-implementation could be useful to many different modeling centres (rather than just one)? And would such re-implementations have to come from within the existing GCM community (as is the case with all the code at the moment), or could outsiders accomplish this? Is re-implementation even the right approach for tackling the GCMs?

I should mention that there are already several examples of shared, open source projects in the GCM community, typically concerned with infrastructure code: couplers (e.g. OASIS) and frameworks (e.g. the ESMF). Such projects arose when people from different modelling labs got together and realized they could benefit from a joint software development project. Is this the right approach for opening up more of the GCMs? And if so, how can we replicate these kinds of projects more widely? And, again, how can the Climate Code Foundation help?

Call for Papers:
IEEE Software Special Issue on Climate Change: Software, Science and Society

Submission Deadline: 8 April 2011
Publication (tentative): Nov/Dec 2011

A vast software infrastructure underpins our ability to understand climate change, assess the implications, and form suitable policy responses. This software infrastructure allows large teams of scientists to construct very complex models out of many interlocking parts, and further allows scientists, activists and policymakers to share data, explore scenarios, and validate assumptions. The extent of this infrastructure is often invisible (as infrastructure often is, until it breaks down), both to those who rely on it, and to interested observers, such as politicians, journalists, and the general public. Yet weaknesses in this software (whether real or imaginary) will impede our ability to make progress on what may be the biggest challenge faced by humanity in the 21st Century.

This special issue of IEEE Software will explore the challenges in developing the software infrastructure for understanding and responding to climate change. Our aim is to help bridge the gap between the software community and the climate science community, by soliciting a collection of articles that explain the nature and extent of this software infrastructure, the technical challenges it poses, and the current state-of-the-art.

We invite papers covering any of the software challenges involved in creating this technical infrastructure, but please note that we are not soliciting papers that discuss the validity of the science itself, or which take sides in the policy debate on climate change.

We especially welcome review papers, which explain the current state-of-the-art in some specific aspect of climate software in an accessible way, and roadmap papers, which describe the challenges in the construction and validation of this software. Suitable topics for the special issue include (but are not restricted to):

  • Construction, verification and validation of computational models and data analysis tools used in climate science;
  • Frameworks, coupling strategies and software integration issues for earth system modeling;
  • Challenges of scale and complexity in climate software, including high data volumes and throughputs, massive parallelization and performance issues, numerical complexity, and coupling complexity;
  • Challenges of longevity and evolution of climate model codes, including legacy code, backwards compatibility, and computational reproducibility;
  • Experiences with model ensembles and model inter-comparison projects, particularly as these relate to software verification and validation;
  • Meta-data standards and data management for earth system data, including the challenge of making models and data self-describing;
  • Coordination of cross-disciplinary teams in the development of integrated assessment and decision support systems;
  • The role of open science and usable simulation tools in increasing public accessibility of climate science and public participation in climate policy discussions;
  • Case studies and lessons learned from application of software engineering techniques within climate science.

Manuscripts must not exceed 4,700 words including figures and tables, which count for 200 words each. Submissions in excess of these limits may be rejected without refereeing. The articles we deem within the theme’s scope will be peer-reviewed and are subject to editing for magazine style, clarity, organization, and space. Be sure to include the name of the theme you are submitting for.
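As a quick way to check a draft against that limit, the arithmetic can be sketched in a few lines of Python. This is just my reading of the rule as stated above (each figure or table counts as 200 words towards the 4,700 total); the function names are hypothetical, and the magazine's own count is what matters in the end.

```python
WORD_LIMIT = 4700                # maximum length stated in the call
WORDS_PER_FIGURE_OR_TABLE = 200  # each figure or table counts as this many words


def effective_word_count(text_words, figures, tables):
    """Words of text plus the word-equivalent of all figures and tables."""
    return text_words + WORDS_PER_FIGURE_OR_TABLE * (figures + tables)


def within_limit(text_words, figures, tables):
    """True if the manuscript fits within the stated limit."""
    return effective_word_count(text_words, figures, tables) <= WORD_LIMIT


if __name__ == "__main__":
    # Hypothetical draft: 3,900 words of text, 3 figures, 1 table.
    print(effective_word_count(3900, 3, 1), within_limit(3900, 3, 1))
```

So a draft with three figures and one table has room for at most 3,900 words of text.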

Articles should have a practical orientation, and be written in a style accessible to software practitioners. Overly complex, purely research-oriented or theoretical treatments are not appropriate. Articles should be novel. IEEE Software does not republish material published previously in other venues, including other periodicals and formal conference/workshop proceedings, whether previous publication was in print or in electronic form.

Questions?

For more information about the special issue, contact the Guest Editors:

  • Steve Easterbrook, University of Toronto, Canada (sme@cs.toronto.edu)
  • Reinhard Budich, Max Planck Institute for Meteorology, Germany (reinhard.budich@zmaw.de)
  • Paul N. Edwards, University of Michigan, USA (pne@umich.edu)
  • V. Balaji, NOAA Geophysical Fluid Dynamics Laboratory, USA (balaji@princeton.edu)

For general author guidelines: www.computer.org/software/author.htm

For submission details: software@computer.org

To submit an article: https://mc.manuscriptcentral.com/sw-cs