First proper day of the AGU conference, and I managed to get to the (free!) breakfast for Canadian members, which was so well attended that the food ran out early. Do I read this as a great showing for Canadians at AGU, or just that we’re easily tempted with free food?

Anyway, on to the first of three poster sessions we’re involved in this week. This first poster was on TracSNAP, the tool that Ainsley and Sarah worked on over the summer:

Our TracSNAP poster for the AGU meeting.

The key idea in this project is that large teams of software developers find it hard to maintain an awareness of one another’s work, and cannot easily identify the appropriate experts for different sections of the software they are building. In our observations of how large climate models are built, we noticed it’s often hard to keep up to date with what changes other people are working on, and how those changes will affect things. TracSNAP builds on previous research that attempts to visualize the social network of a large software team (e.g. who talks to whom), and relate it to the couplings between the code modules that team members are working on. Information about intra-team communication patterns (e.g. emails, chat sessions, bug reports, etc.) can be extracted automatically from project repositories, as can information about dependencies in the code. TracSNAP uses this data to answer questions such as “Who else recently worked on the module I am about to start editing?” and “Who else should I talk to before starting a task?”. The tool surfaces hidden connections in the software by examining modules that were checked into the repository together (even though they don’t necessarily refer to each other), and offers advice on how to approach key experts by identifying intermediaries in the social network. It’s still a very early prototype, but I think it has huge potential. Ainsley is continuing to work on evaluating it on some existing climate models, to check that we can extract from their repositories the data we think we can.
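
To make the co-change idea a little more concrete, here’s a minimal sketch (in Python, and emphatically not TracSNAP’s actual code) of how logically coupled modules and candidate experts might be mined from commit records. The commit data, file names, and author names below are invented for illustration:

```python
# Hypothetical sketch of co-change mining, in the spirit of what TracSNAP does
# (not its actual implementation). Assumes commit records with an author and
# the set of files touched in that commit.
from collections import Counter, defaultdict
from itertools import combinations

commits = [
    {"author": "alice", "files": {"ocean/mixing.f90", "ocean/tracers.f90"}},
    {"author": "bob",   "files": {"ice/thermo.f90", "ocean/mixing.f90"}},
    {"author": "carol", "files": {"ice/thermo.f90", "ice/dynamics.f90"}},
]

# Count how often each pair of files changes in the same commit: a proxy for
# "hidden" coupling even when neither file explicitly references the other.
cochange = Counter()
experts = defaultdict(Counter)
for c in commits:
    for f in c["files"]:
        experts[f][c["author"]] += 1
    for a, b in combinations(sorted(c["files"]), 2):
        cochange[(a, b)] += 1

def who_to_ask(filename, top=3):
    """People who recently touched this file, most active first."""
    return [author for author, _ in experts[filename].most_common(top)]

print(cochange.most_common(3))
print(who_to_ask("ocean/mixing.f90"))
```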

The poster session we were in, “IN11D. Management and Dissemination of Earth and Space Science Models”, seemed a little disappointing as there were only three posters (a fourth poster presenter hadn’t made it to the meeting). But what we lacked in quantity, we made up for in quality. Next to my poster was David Bailey’s: “The CCSM4 Sea Ice Component and the Challenges and Rewards of Community Modeling”. I was intrigued by the second part of his title, so we got chatting about this. Supporting a broader community in climate modeling has a cost, and we talked about how university labs just cannot afford this overhead. However, it also comes with a number of benefits, particularly the existence of a group of people from different backgrounds who all take on some ownership of model development, and who can come together to develop a consensus on how the model should evolve. With the CCSM, most of this happens in face-to-face meetings, particularly the twice-yearly user meetings. We also talked a little about the challenges of integrating the CICE sea ice model from Los Alamos with CCSM, especially given that CICE is also used in the Hadley model. Making it work in both models required some careful thinking about the interface, and hence more focus on modularity. David also mentioned that people are starting to use the term kernelization as a label for the process of taking physics routines and packaging them so that they can be interchanged more easily.
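
For readers who haven’t met the term, here’s a toy sketch of what kernelization might look like in practice: the physics sits behind a narrow, well-defined interface, so that different host models can drive the same component. This is purely illustrative (in Python, with made-up state variables and placeholder physics), not the real CICE/CCSM coupling interface:

```python
# Illustrative sketch of "kernelization": wrapping a physics routine behind a
# narrow interface so different host models can drive it interchangeably.
# This is NOT the real CICE/CCSM interface, just the general idea.
from abc import ABC, abstractmethod
from dataclasses import dataclass

@dataclass
class IceState:
    thickness: float      # mean ice thickness (m) -- invented state variable
    concentration: float  # fractional ice cover (0-1)

class SeaIceKernel(ABC):
    @abstractmethod
    def step(self, state: IceState, air_temp: float, dt: float) -> IceState:
        """Advance the sea ice state by one timestep of length dt (seconds)."""

class ToySeaIce(SeaIceKernel):
    def step(self, state, air_temp, dt):
        # Crude placeholder physics: ice grows when it's cold, melts when warm.
        growth = -0.001 * air_temp * (dt / 86400.0)
        return IceState(max(state.thickness + growth, 0.0), state.concentration)

# A CCSM-like or Hadley-like driver would only need the SeaIceKernel
# interface, not the kernel's internals.
state = IceState(thickness=1.5, concentration=0.9)
state = ToySeaIce().step(state, air_temp=-10.0, dt=86400.0)
print(state)
```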

Dennis Shea’s poster, “Processing Community Model Output: An Approach to Community Accessibility”, was also interesting. To tackle the problem of making output from the CCSM more accessible to the broader CCSM community, the decision was taken to standardize on netCDF for the data, and to develop and support a standard data analysis toolset based on the NCAR Command Language. NCAR runs regular workshops on the use of these data formats and tools as part of its broader community support efforts (and of course, this illustrates David’s point about universities not being able to afford to provide such support).
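
To give a flavour of what standardizing on a self-describing format buys you, here’s a small sketch using the Python netCDF4 library (rather than NCL, which is what the poster actually describes); the variable names, units, and values are made up:

```python
# Minimal sketch of the kind of standardization described in the poster: model
# output written as self-describing netCDF, which any community analysis tool
# (NCL, or here the Python netCDF4 library) can then read. Variable names,
# units, and values below are invented for illustration.
import numpy as np
from netCDF4 import Dataset

# Write a tiny "model output" file with named dimensions, units, and metadata.
with Dataset("toy_output.nc", "w") as ds:
    ds.createDimension("time", None)
    ds.createDimension("lat", 4)
    ts = ds.createVariable("TS", "f4", ("time", "lat"))
    ts.units = "K"
    ts.long_name = "surface temperature"
    ts[0, :] = np.array([250.0, 260.0, 270.0, 280.0])

# A downstream tool can discover the structure and metadata from the file itself.
with Dataset("toy_output.nc") as ds:
    ts = ds.variables["TS"]
    print(ts.long_name, ts.units, ts[:].mean())
```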

The missing poster also looked interesting: Charles Zender from UC Irvine, comparing climate modeling practices with open source software practices. Judging from his abstract, Charles makes many of the same observations we made in our CiSE paper, so I was looking forward to comparing notes with him. Next time, I guess.

Poster sessions at these meetings are both wonderful and frustrating. Wonderful because you can wander down aisles of posters and very quickly sample a large slice of research, and chat to the poster owners in a freeform format (which is usually much better than sitting through a talk). Frustrating because poster owners don’t stay near their posters very long (I certainly didn’t – too much to see and do!), which means you get to note an interesting piece of work, and then never manage to track down the author to chat (and if you’re like me, you also forget to write down contact details for the posters you noticed). However, I did manage to make notes on two to follow up on:

  • Joe Galewsky caught my attention with a provocative title: “Integrating atmospheric and surface process models: Why software engineering is like having weasels rip your flesh”
  • I briefly caught Brendan Billingsley of NSIDC as he was taking his poster down. It caught my eye because it was a reflection on software reuse in the Searchlight tool.

I wasn’t going to post anything about the CRU emails story (apart from my attempt at humour), because I think it’s a non-story. I’ve read a few of the emails, and it looks no different to the studies I’ve done of how climate science works. It’s messy. It involves lots of observational data sets, many of which contain errors that have to be corrected. Luckily, we have a process for that – it’s called science. It’s carried out by a very large number of people, any of whom might make mistakes. But it’s self-correcting, because they all review each other’s work, and one of the best ways to get noticed as a scientist is to identify and correct a problem in someone else’s work. There is, of course, a weakness in the way such corrections to a previously published paper are usually made: they appear as letters and other papers in various journals, and don’t connect up well to the original paper. Which means you have to know an area well to understand which papers have stood the test of time and which have been discredited. Outsiders to a particular subfield won’t be able to tell the difference. They’ll also have a tendency to seize on what they think are errors, but which have actually already been addressed in the literature.

If you want all the gory details about the CRU emails, read the RealClimate posts (here and here) and the lengthy discussion threads that follow from them. Many of the scientists involved in the CRU emails comment in these threads, and the resulting picture of the complexity of the data and analysis that they have to deal with is very interesting. But of course, none of this will change anyone’s mind. If you’re convinced global warming is a huge conspiracy, you’ll just see scientists trying to circle the wagons and spin the story their way. If you’re convinced that AGW is now accepted as scientific fact, then you’ll see scientists tearing their hair out at the ignorance of their detractors. If you’re not sure about this global warming stuff, you’ll see a lively debate about the details of the science, none of which makes much sense on its own. Don’t venture in without a good map and compass. (Update: I’ve no idea who’s running this site, but it’s a brilliant deconstruction of the allegations about the CRU emails.)

But one issue has arisen in many discussions about the CRU emails that touches strongly on the research we’re doing in our group. Many people have looked at the emails and concluded that if the CRU had been fully open with its data and code right from the start, there would be no issue now. This is, of course, a central question in Alicia’s research on open science. While in principle open science is a great idea, in practice there are many hurdles: the fear of being “scooped”, the need to give appropriate credit, the problems of metadata definition and data provenance, the cost of curation, the fact that software has a very short shelf-life, and so on. For the CRU dataset at the centre of the current kerfuffle, someone would have to go back to all the data sources and re-negotiate agreements about how the data can be used. Of course, the anti-science crowd just think that’s an excuse.

However, for climate scientists there is another problem, which is the workload involved in being open. Gavin, at RealClimate, raises this issue in response to a comment that just putting the model code online is not sufficient. For example, if someone wanted to reproduce a graph that appears in a published paper, they’ll need much more: the script files, the data sets, the parameter settings, the spinup files, and so on. Michael Tobis argues that much of this can be solved with good tools and lots of automated scripts. Which would be fine if we were talking about how to help other scientists replicate the results.
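
For what it’s worth, here’s the kind of thing I take that argument to mean in practice: a run script that captures, alongside each figure, everything needed to regenerate it. This is just an illustrative Python sketch with invented file names and parameters, not anyone’s actual workflow:

```python
# Toy illustration of the "automate everything" argument: record the provenance
# of a figure (parameters, input data checksums, timestamp) alongside the
# figure itself. All file names and parameter values are invented.
import datetime
import hashlib
import json

# Write a stand-in data file so the example is self-contained.
with open("toy_timeseries.csv", "w") as f:
    f.write("year,temp_anomaly\n1998,0.55\n1999,0.42\n")

params = {"start_year": 1850, "end_year": 2000, "smoothing_window": 5}
data_files = ["toy_timeseries.csv"]

def sha256(path):
    """Checksum an input file so the exact data version used is recorded."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

provenance = {
    "generated": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    "parameters": params,
    "data_checksums": {p: sha256(p) for p in data_files},
    # If the analysis code lives in version control, the commit id would be
    # recorded here too (e.g. the output of `git rev-parse HEAD`).
}

with open("figure1_provenance.json", "w") as f:
    json.dump(provenance, f, indent=2)
```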

Unfortunately, in the special case of climate science, that’s not what we’re talking about. A significant factor in the reluctance of climate scientists to release code and data is the need to protect themselves from denial-of-service attacks. There is a very well-funded and PR-savvy campaign to discredit climate science. Most scientists just don’t understand how to respond to this. Firing off hundreds of requests to CRU to release data under the Freedom of Information Act, despite each such request being denied for good legal reasons, is the equivalent of filing frivolous lawsuits. But even worse, once datasets and code are released, it is very easy for an anti-science campaign to tie the scientists up in knots trying to respond to attempts to poke holes in the data. If the denialists were engaged in an honest attempt to push the science ahead, this would be fine (although many scientists would still get frustrated – they are human too).

But in reality, the denialists don’t care about the science at all; their aim is a PR campaign to sow doubt in the minds of the general public. In the process, they effect a denial-of-service attack on the scientists – the scientists can’t get on with doing their science because their time is taken up responding to frivolous queries (and criticisms) about specific features of the data. And their failure to respond to each and every such query will be trumpeted as an admission that an alleged error is indeed an error. In such an environment, it is perfectly rational not to release data and code – it’s better to pull up the drawbridge and get on with the drudgery of real science in private. That way the only attacks are complaints about lack of openness. Such complaints are bothersome, but much better than the alternative.

In this case, because the science is vitally important for all of us, it’s actually in the public interest that climate scientists be allowed to withhold their data. Which is really a tragic state of affairs. The forces of anti-science have a lot to answer for.

Update: Robert Grumbine has a superb post this morning on why openness and reproducibility are intrinsically hard in this field.

Brad points out that much of my discussion of a research agenda in climate change informatics focusses heavily on strategies for emissions reduction (aka Mitigation) and neglects the equally important topic of ensuring communities can survive the climate changes that are inevitable (aka Adaptation). Which is an important point. When I talk about the goal of keeping temperatures to below a 2°C rise, it’s equally important to acknowledge that we’ve almost certainly already lost any chance of keeping peak temperature rise much below 2°C.

Which means, of course, that we have some serious work to do, both in understanding the impact of climate change on existing infrastructure, and in integrating an awareness of the likely climate change issues into new planning and construction projects. This is, of course, what Brad’s Adaptation and Impacts research division focusses on. There are some huge challenges in how we take the data we have (e.g. see the datasets in the CCCSN), downscale it to provide more localized forecasts, and then figure out how to incorporate the results into decision making.
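
As a trivial illustration of one downscaling step, here’s the “delta method” in miniature: apply a coarse model’s projected change to a higher-resolution observed baseline. The numbers and station names below are invented, and real downscaling is of course far more sophisticated:

```python
# Toy "delta method" downscaling sketch: add a coarse grid cell's projected
# change to finer-scale observed baselines. Values and names are invented.

# Projected change for one coarse grid cell, relative to the model's baseline period.
projected_change_degC = 2.3

# Observed climatological means at stations inside that grid cell.
station_baseline_degC = {"StationA": 8.1, "StationB": 6.4, "StationC": 9.0}

downscaled = {name: base + projected_change_degC
              for name, base in station_baseline_degC.items()}
print(downscaled)  # localized first-guess projections for each station
```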

One existing tool to point out is the World Bank’s ADAPT, which is intended to help analyze projects in the planning stage, and identify risks related to climate change adaptation. This is quite a different decision-making task from the emissions reduction decision tools I’ve been looking at. But just as important.

Here’s an interesting competition (with cash prizes), organized by the Usability Professionals’ Association, to develop, using user-centred design principles, a new concept or product that aims to cut energy consumption or reduce pollution.

And here’s another: A Video Game Creation Contest to create a playable video game that uses earth observations to help address environmental problems.