This post contains lots of questions and few answers. Feel free to use the comment thread to help me answer them!

Just about every scientist I know would agree that being more open is a good thing. But in practice, being fully open is fraught with difficulty, and most scientists fall a long way short of the ideal. We’ve discussed some of the challenges for openness in computational sciences before: hostile people who deliberately misinterpret or misuse anything you release; ontological drift, which means that even honest collaborators won’t necessarily interpret what you release in the way you intended; and, for software, all the extra effort it takes to make code ready for release, along with the fact that there is no reward system in place for those who put in such effort.

Community building is a crucial success factor for open source software (and presumably, by extension, for open science). The vast majority of open source projects never build a community: we tend to remember the major successes of open source (after all, that’s how the internet was built), but these successes are vastly outnumbered by the detritus of projects that never took off.

Meanwhile, any lack of openness (whether real or perceived) is a stick with which to beat climate scientists, and those wielding the stick remain clueless about the technical and institutional challenges of achieving openness.

Where am I going with this? Well, I mentioned the Climate Code Foundation a while back, and I’m delighted to be serving as a member of the advisory board. We held the first advisory board meeting in the fall (in the open), and talked at length about organisational and funding issues, and how to get the foundation off the ground. But we didn’t get much time to brainstorm ideas for new modes of operation – for what else the foundation can do.

The foundation does have a long list of existing initiatives, and a couple of major successes, most notably, a re-implementation of GISTEMP as open source Python, which helped to validate the original GISTEMP work, and provide an open platform for new research. Moving forward, things to be done include:

  • outreach, lobbying, etc. to spread the message about the benefits of open source climate code;
  • more open source re-implementations of existing code (building on the success of ccc-GISTEMP);
  • directories / repositories of open source codes;
  • advice – e.g. white papers offering guidance to scientists on how to release their code, benefits, risks, licensing models, pitfalls to avoid, tools and resources;
  • training – e.g. workshops, tutorials, etc. at scientific conferences;
  • support – e.g. code sprints, code reviews, process critiques, etc.

All of which are good ideas. But I feel there’s a bit of a chicken-and-egg problem here. Once the foundation is well-known and respected, people will want all of the above, and will seek out the foundation for these things. But the climate science community is relatively conservative, and doesn’t talk much about software, coding practices, sharing code, etc. in any systematic way.

To convince people, we need some high profile demonstration projects. Each such project should showcase a particular type of climate software, or a particular route to making it open source, and offer lessons learnt, especially on how to overcome some of the challenges I described above. I think such demonstration projects are likely to be relatively easy (!?) to find among the smaller data analysis tools (ccc-GISTEMP is only a few thousand lines of code).

But I can’t help but feel the biggest impact is likely to come with the GCMs. Here, it’s not clear yet what CCF can offer. Some of the GCMs are already open source, in the sense that the code is available free on the web, at least to those willing to sign a basic license agreement. But simply being available isn’t the same as being a fully fledged open source project. Contributions to the code are tightly controlled by the modelling centres, because they have to be – the models are so complex, and run in so many different configurations, that deep expertise is needed to successfully contribute to the code and test the results. So although some centres have developed broad communities of users of their models, there is very little in the way of broader communities of code contributors. And one of the key benefits of open source code is definitely missing: the code is not designed for understandability.

So where do we start? What can an organisation like the Climate Code Foundation offer to the GCM community? Are there pieces of the code that are ripe for re-implementation as clear code? Even better, are there pieces for which an open source re-implementation would be useful to many different modelling centres (rather than just one)? And would such re-implementations have to come from within the existing GCM community (as is the case with all the code at the moment), or could outsiders accomplish this? Is re-implementation even the right approach for tackling the GCMs?

I should mention that there are already several examples of shared, open source projects in the GCM community, typically concerned with infrastructure code: couplers (e.g. OASIS) and frameworks (e.g. the ESMF). Such projects arose when people from different modelling labs got together and realized they could benefit from a joint software development project. Is this the right approach for opening up more of the GCMs? And if so, how can we replicate these kinds of projects more widely? And, again, how can the Climate Code Foundation help?

8 Comments

  1. I have to say that I don’t think GCMs are going to be a good starting project – they are too big and too fast-moving for any outside programmers to be able to contribute in a short time period.

    However, your suggestion of tackling smaller-scale problems is a good one. How about an open source paleo-reconstruction suite – all the methods, as much data as you can find, maximal flexibility in selecting it, some standard metrics, pseudo-proxy tests, etc.?

    Since that topic is apparently the source of all the kerfuffle, and plentiful code already exists online, it would seem to be a no-brainer. Start with the McShane and Wyner discussion code base, add in Mann et al., Ammann and Wahl, Gerd Bürger, etc., and soon you’d have an open source library that could actually serve as a base for all sorts of people to get involved.
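    To give a flavour of what such a suite might standardise, here is a minimal sketch of a pseudo-proxy test in Python – the toy data, the names, and the simple composite-plus-scale method are all illustrative assumptions, not code from any of the projects mentioned above:

    ```python
    # A minimal pseudo-proxy test sketch (hypothetical throughout): derive
    # noisy "proxies" from a known temperature series, reconstruct it with
    # a simple composite-plus-scale (CPS) method, and score the result
    # outside the calibration window.
    import numpy as np

    rng = np.random.default_rng(42)

    # Synthetic "truth": red noise plus a slow warming trend, standing in
    # for the model-simulated temperature used in real pseudo-proxy work.
    n_years = 500
    truth = np.cumsum(rng.normal(0.0, 0.1, n_years)) + np.linspace(0.0, 0.5, n_years)

    def make_pseudoproxies(truth, n_proxies=20, snr=0.5):
        """Each pseudo-proxy is the truth plus white noise at a chosen SNR."""
        noise_sd = truth.std() / snr
        return truth[None, :] + rng.normal(0.0, noise_sd, (n_proxies, truth.size))

    def cps_reconstruct(proxies, truth, calib):
        """Composite-plus-scale: average the standardised proxies, then
        rescale the composite to the mean/variance of truth over the
        calibration window."""
        z = (proxies - proxies.mean(axis=1, keepdims=True)) / proxies.std(axis=1, keepdims=True)
        composite = z.mean(axis=0)
        c, t = composite[calib], truth[calib]
        return (composite - c.mean()) / c.std() * t.std() + t.mean()

    calib = slice(400, 500)   # the "instrumental" calibration period
    valid = slice(0, 400)     # the pre-instrumental validation period

    proxies = make_pseudoproxies(truth)
    recon = cps_reconstruct(proxies, truth, calib)
    rmse = np.sqrt(np.mean((recon[valid] - truth[valid]) ** 2))
    print(f"validation RMSE: {rmse:.3f} (truth std: {truth.std():.3f})")
    ```

    The point of standardising this scaffolding is that swapping in a different reconstruction method, proxy network, or noise model then becomes a one-function change, which is exactly what makes method comparisons reproducible.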

    It would be high profile and challenging in some respects, but I think quite rewarding (well, as long as you are ok being called names when you get the same results as everyone else!).

  2. There are many initiatives to make important data available to everyone (e.g. http://data.worldbank.org/), but so far climate data isn’t as easy to access, even though it’s often available for free. So more tools to support the sharing of climate data with other disciplines would be nice.
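    As an illustration of the kind of bridging tool this might mean – a minimal, hypothetical sketch, assuming a CF-style NetCDF file of surface temperature and the xarray library; the file and variable names are made up:

    ```python
    # Turn a NetCDF climate file into a plain CSV time series that any
    # other discipline's tools can read. "tas_monthly.nc" and the
    # variable name "tas" are illustrative examples.
    import numpy as np
    import xarray as xr

    ds = xr.open_dataset("tas_monthly.nc")
    tas = ds["tas"]  # surface air temperature, dims (time, lat, lon)

    # Area-weighted global mean: weight each latitude band by cos(latitude).
    weights = np.cos(np.deg2rad(ds["lat"]))
    global_mean = tas.weighted(weights).mean(dim=["lat", "lon"])

    # Export a flat time series that a spreadsheet or a worldbank-style
    # data portal could ingest directly.
    global_mean.to_dataframe(name="tas_global_mean").to_csv("tas_global_mean.csv")
    ```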

  3. What kind of people would you like to get interested in climate models? And what kind of people who don’t care today would participate in an open source project around a GCM?

    To answer these questions for myself: I’m interested in this kind of stuff because I like to think about mathematics and physics as a hobby, and I develop software as a profession, and working on a GCM would combine both. The fact that most GCMs are not open source is not the main problem that alienates me; it is the use of a programming language – FORTRAN – that I don’t know and have no desire to learn and master, and the coding and documentation style, which makes the existing code impenetrable (at least I think so).

    Open source projects are successful – if they are successful – partly because working on them is a positive item on one’s CV; working on an obscure GCM using even more obscure tools would be more of a negative item on mine. And I don’t like to spend a lot of time pondering a piece of code just because the author didn’t care in the least to explain what he was doing. Last but not least, I don’t know how I could help at all, so a list of the most pressing problems would be helpful, too.

  4. Eli’s suggestion is hopelessly naive, but he has been playing with the idea for about a year: a GCM built for GPU (NVIDIA Tesla) workstations might make a significant dent. It would be excellent to have fluid dynamics simulations on such a system.

  5. Eli: such work is seriously non-trivial (I should know, I’m doing it at work). With GCMs weighing in at 1-2 million lines of code, getting them to run on GPUs is several years’ work, with no guarantee that they will be any faster (i.e. that you’re not just moving the bottlenecks around).

    GCMs aren’t really fluid dynamics in the CFD sense (look at the parameterisations needed for turbulence, etc.), and just throwing cores at the problem won’t necessarily work (see the presentations from http://www.ecmwf.int/newsevents/meetings/workshops/2010/high_performance_computing_14th/index.html, for example). Just because you can make a small GCM work on a Tesla card on your desktop doesn’t mean we can scale the code to 1000 Tesla cards and get a 1000x speedup.
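    To make that scaling point concrete, here is a back-of-envelope Amdahl’s-law calculation (an illustration added here, not the commenter’s numbers; the parallel fractions are guesses, not measurements from any real GCM):

    ```python
    # Amdahl's law: if even a small fraction of a model's runtime stays
    # serial (I/O, coupling, load imbalance), piling on cores cannot
    # deliver anything close to linear speedup.
    def amdahl_speedup(parallel_fraction, n_cores):
        """Overall speedup when only parallel_fraction of the work scales."""
        serial = 1.0 - parallel_fraction
        return 1.0 / (serial + parallel_fraction / n_cores)

    for p in (0.90, 0.98, 0.999):
        print(f"parallel fraction {p:.3f}: {amdahl_speedup(p, 1000):6.1f}x on 1000 cores")

    # parallel fraction 0.900:    9.9x on 1000 cores
    # parallel fraction 0.980:   47.7x on 1000 cores
    # parallel fraction 0.999:  500.3x on 1000 cores
    ```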

    But a useful starting point is tools: visualisation and analysis of GCM data, for example. I’m currently working on putting GrADS, Ferret, CDAT, etc. into Debian Linux. Just making the analysis tools more robust, better documented, and available in everyday Linux would be a great advance. Add to that processing such data into OGC formats for viewing in geospatial tools, etc.
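    As a sketch of that last step – exporting GCM output in an OGC-friendly format – here is a hypothetical example, assuming the rioxarray extension to xarray is installed; file and variable names are made up:

    ```python
    # Export one time slice of a GCM temperature file as a GeoTIFF that
    # standard geospatial tools (QGIS, GeoServer, etc.) can open.
    import xarray as xr
    import rioxarray  # noqa: F401 -- registers the .rio accessor on import

    ds = xr.open_dataset("tas_monthly.nc")
    snapshot = ds["tas"].isel(time=0)  # first month, dims (lat, lon)

    # Tell the geospatial layer which dims are x/y and what the CRS is,
    # then write a single-band raster.
    snapshot = snapshot.rio.set_spatial_dims(x_dim="lon", y_dim="lat")
    snapshot = snapshot.rio.write_crs("EPSG:4326")
    snapshot.rio.to_raster("tas_first_month.tif")
    ```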

  6. “The Economist” has a review of a new book on the open source movement, “The Comingled Code: Open Source and Economic Development”, that looks interesting.

  7. A list of bite-sized problems sure would be useful. That might enable a GNU style approach to building a toolset that gets used widely.

  8. Thanks for the blog post, Steve, and I’m obviously very interested in any feedback generated. Today I’m at the Peter Murray-Rust symposium in Cambridge: “Visions of a semantic molecular future”, which is mostly about open science, one of PMR’s main interests. It’s good to see that the challenges of openness – open access, open data, open bibliography, and open source – are being addressed across various scientific fields. The Blue Obelisk unproject brings together many examples of this sort of work in Chemistry.

    There is a particular problem regarding professional recognition for software contributions. The history is not a proud one. Particular pieces of software are often essential to a field, or even to creating a whole new field of science. Yet the scientists who have devoted years to developing this software may be lost in obscurity, while colleagues using the software in their research may generate impressive publication records.

    We need to encourage the production and sharing of software, to recognise it as an essential contribution, and to change the systems and attitudes of professional assessment and recognition, across science, to better account for it.

    I am sitting in the back of a talk by Professor Tom Blundell, a distinguished chemist, who has been building, contributing to, and leading development of crucial open-source chemistry software for about 30 years. He has just referred to a paper of his from the mid-80s, describing a program, which now has more than 4,000 citations.

  9. Pingback: Climate modeling in an open, transparent world | Serendipity
