As a follow-on from yesterday’s post on making climate software open source, I’d like to pick up on the oft-repeated slogan “Many eyeballs make all bugs shallow”. This is sometimes referred to as Linus’ Law (after Linus Torvalds, creator of Linux), although the phrase was actually coined by Eric Raymond (Torvalds himself would prefer “Linus’s Law” to refer to something completely different). Judging from the number of times this slogan is repeated in the blogosphere, there must be lots of very credulous people out there. (Where are the real skeptics when you need them?)

Robert Glass tears this one apart as a myth in his book “Facts and Fallacies about Software Engineering”, on the basis of three points: it’s self-evidently not true (the depth of a bug has nothing to do with how many people are looking for it); there’s plenty of empirical evidence that the utility of adding more reviewers to a review team tails off very quickly beyond around 3-4 reviewers; and finally, there is no empirical evidence that open source software is any less buggy than its alternatives.

More interestingly, companies like Coverity, which specialize in static analysis tools, love to run their tools over open source software and boast about the number of bugs they find (it shows off what their tools can do). For example, their 2009 study found 38,453 bugs in 60 million lines of source code (a defect density of about 0.64 defects/KLOC). Quite clearly, there are many types of bugs that you need automated tools to find, no matter how many eyeballs have looked at the code.
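
Just to make that arithmetic explicit, here’s a trivial back-of-the-envelope check (a Python sketch of my own; the only inputs are the two figures quoted from the Coverity report):

```python
# Back-of-the-envelope check of the defect density quoted above.
# Inputs are the figures cited from Coverity's 2009 open source scan.
defects = 38_453              # total defects reported
lines_of_code = 60_000_000    # total lines of source code scanned

defects_per_kloc = defects / (lines_of_code / 1_000)
print(f"defect density: {defects_per_kloc:.2f} defects/KLOC")   # -> about 0.64
```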

Part of the problem is that the “many eyeballs” part isn’t actually true anyway. In a 2005 study of the SourceForge community, Xu et al. found that participation in projects follows the power-law distribution well known from social network theory: a few open source projects have a very large number of participants, while a very large number of projects have very few. Similarly, a very small number of open source developers participate in lots of projects; the majority participate in just one or two:

SourceForge project and developer community scale-free degree distributions (Figure 7d from Xu et al. 2005)

For example, the data shown in these graphs include all developers and active users for about 160,000 SourceForge projects. Of these projects, 25% had only a single person involved (as either developer or user!), and a further 10% had only 2-3 people involved. Clearly, a significant number of open source projects never manage to build a community of any size.
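
To get a feel for what a scale-free distribution like this implies, here’s a toy simulation (my own sketch, not Xu et al.’s data or method): it draws hypothetical “participants per project” counts from a Zipf distribution with an arbitrarily chosen exponent, so the exact percentages won’t match the SourceForge figures, but the qualitative shape is the same: a huge pile of one- and two-person projects, and a tiny handful of very large ones.

```python
# Toy illustration only: sample "participants per project" from a power-law
# (Zipf) distribution. The exponent and project count are made-up assumptions,
# not fitted to the Xu et al. data, so the exact percentages will differ.
import numpy as np

rng = np.random.default_rng(42)
team_sizes = rng.zipf(a=2.0, size=160_000)   # one draw per hypothetical project

for k in (1, 2, 3):
    print(f"projects with exactly {k} participant(s): {np.mean(team_sizes == k):.0%}")
print(f"largest project: {team_sizes.max()} participants")
```

The exact numbers don’t matter; the point is that under any power law the typical project has almost nobody looking at it, while the eyeballs concentrate on a tiny handful of projects.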

This is relevant to the climate science community because many of the tens of thousands of scientists actively pursuing research relevant to our understanding of climate change build software. If all of them release their software as open source, there’s no reason to expect a different distribution from the graphs above. So most of this software will never attract any participants beyond the handful of scientists who wrote it, because there simply aren’t enough eyeballs or enough interest to go around. The kind of software described in the famous “Harry” files at the CRU is exactly of this nature – if it hadn’t been picked out in the stolen CRU emails, nobody other than “Harry” would ever have taken the time to look at it. And even if lots of people’s attention is drawn to this particular piece of software (as it has been), there are still thousands of other scraps of similar software out there that would remain single-person projects, just like those on SourceForge. In contrast, a very small number of projects will attract hundreds of developers/users.

The thing is, this is exactly how the climate science community operates already. A small number of projects (like the big GCMs, listed here) already have a large number of developers and users – for example, CCSM and Hadley’s UM have hundreds of active developers and a very mature review process. Meanwhile, a very large number of custom data analysis tools are built by a single person for his/her own use. Declaring all of these projects to be open source will not magically bring “many eyeballs” to bear on them. And indeed, as Cameron Neylon argues, those that do attract attention will immediately have to protect themselves from a large number of clueless newbies by doing exactly what many successful open source projects do: the inner clique closes ranks, refuses to deal with outsiders, ignores questions on the mailing lists, etc. Isn’t that supposed to be the problem we were trying to solve?

The argument that making climate software open source will somehow magically make it higher quality is therefore specious. The big climate models already have many eyeballs, and the small data handling tools will never attract large numbers of eyeballs. So, if any of the people screaming about openness are truly interested in improving software quality, they’ll argue for something that is actually likely to make a difference.