I’ve been using the term Climate Informatics informally for a few years to capture the kind of research I do, at the intersection of computer science and climate science. So I was delighted to be asked to give a talk at the second annual workshop on Climate Informatics at NCAR, in Boulder this week. The workshop has been fascinating – an interesting mix of folks doing various kinds of analysis on (often huge) climate datasets, mixing up techniques from Machine Learning and Data Mining with the more traditional statistical techniques used by field researchers, and the physics-based simulations used in climate modeling.

I was curious to see how this growing community defines itself – i.e. what does the term “climate informatics” really mean? Several of the speakers offered definitions, largely drawing on the idea of the Fourth Paradigm, a term coined by Jim Gray, who explained it as follows. Originally, science was purely empirical. In the last few centuries, theoretical science came along, using models and generalizations, and in the latter half of the twentieth century, computational simulations. Now, with the advent of big data, we can see a fourth scientific research paradigm emerging, sometimes called eScience, focussed on extracting new insights from vast collections of data. By this view, climate informatics could be defined as data-driven inquiry, and hence offers a complement to existing approaches to climate science.

However, there’s still some confusion, in part because the term is new, and crosses disciplinary boundaries. For example, some people expected that Climate Informatics would encompass the problems of managing and storing big data (e.g. the 3 petabytes generated by the CMIP5 project, or the exabytes of observational data that are now taxing the resources of climate data archivists). But that’s not what this community does. So, I came up with my own attempt to define the term:

I like this definition for three reasons. First, by bringing Information Science into the mix, we can draw a distinction between climate informatics and other parts of computer science that are relevant to climate science (e.g. the work of building the technical infrastructure for exascale computing, designing massively parallel machines, data management, etc.). Second, information science brings with it a concern for the broader societal and philosophical questions of the nature of information and why people need it, a concern that’s often missing from computer science. Oh, and third, I like this definition because I work at the intersection of the three fields myself, even though I don’t really do data-driven inquiry (although I did, many years ago, write an undergraduate thesis on machine learning). So the definition is slightly broader than one that just associates the term with the ‘fourth paradigm’.

Defining the field this way immediately suggests that climate informatics should also concern itself with the big picture of how we get beyond mere information, and start to produce knowledge and (hopefully) wisdom:

This diagram is adapted from a classic paper by Russ Ackoff, “From Data to Wisdom”, Journal of Applied Systems Analysis, Volume 16, 1989, pp. 3-9. Ackoff originally had Understanding as one of the circles, but subsequent authors have pointed out that it makes more sense as one of two dimensions you move along as you make sense of the data, the other being ‘context’ or ‘connectedness’.

The machine learning community offers a number of tools primarily directed at moving from Data towards Knowledge, by finding patterns in complex datasets. The output of a machine learner is a model, but it’s a very different kind of model from the computational models used in climate science: it’s a mathematical model that describes the discovered relationships in the data. In contrast, the physics-based computational models that climate scientists build are more geared towards moving in the opposite direction, from knowledge (in the form of physical theories of climatic processes) towards data, as a way of exploring how well current theory explains the data. Of course, you can also run a climate model to project the future (and, presumably, help society choose a wise path into the future), but only once you’ve demonstrated it really does explain the data we already have about the past. Clearly the two approaches are complementary, and ought to be used together to build a strong bridge between data and wisdom.
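To make the contrast between the two directions concrete, here is a toy sketch in Python. Everything in it is an assumption made for illustration: the "observations" are synthetic, the linear response and the value 0.8 are made-up numbers rather than real climate quantities, and the function names (`make_observations`, `fit_line`, `simulate`, `rmse`) are hypothetical. It is nowhere near a real machine learner or a real climate model; it only shows one function going from data to a fitted model, and another going from an assumed theory to predicted data that can be checked against what was observed.

```python
import math
import random


def make_observations(n=40, true_slope=0.8, noise=0.05, seed=0):
    """Synthetic stand-in for observed data: a linear response plus noise."""
    rng = random.Random(seed)
    forcing = [0.1 * i for i in range(n)]
    response = [true_slope * f + rng.gauss(0.0, noise) for f in forcing]
    return forcing, response


def fit_line(xs, ys):
    """Data -> model: ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def simulate(forcing, sensitivity):
    """Theory -> data: predict the response from an assumed parameter."""
    return [sensitivity * f for f in forcing]


def rmse(predicted, observed):
    """Root-mean-square difference between predictions and observations."""
    squared = sum((p - o) ** 2 for p, o in zip(predicted, observed))
    return math.sqrt(squared / len(observed))


if __name__ == "__main__":
    forcing, observed = make_observations()

    # Direction 1: discover a relationship from the data alone.
    slope, intercept = fit_line(forcing, observed)
    print(f"learned from data: slope={slope:.2f}, intercept={intercept:.2f}")

    # Direction 2: start from an assumed theory and test it against the data.
    predicted = simulate(forcing, sensitivity=0.8)
    print(f"theory vs. data: RMSE={rmse(predicted, observed):.3f}")
```

The fitted line is a description of patterns in the data, while the simulation encodes a prior claim about how the system works and is judged by how well it reproduces the data. Using both on the same dataset is one small way of seeing why the approaches are complementary.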

Note: You can see all the invited talks (including mine) at the workshop archive. You can also explore the visuals I used (with no audio) at Prezi (hint: use full screen for best viewing).