Here’s a letter I’ve sent to the Guardian newspaper. I wonder if they’ll print it? [Update - I've marked a few corrections since sending it. Darn]

Professor Darrel Ince, writing in the Guardian on February 5th, reflects on lessons from the emails and documents stolen from the Climatic Research Unit at the University of East Anglia. Prof Ince uses an example from the stolen emails to argue that there are serious concerns about software quality and openness in climate science, and goes on to suggest that this perceived lack of openness is unscientific. Unfortunately, Prof Ince makes a serious error of science himself – he bases his entire argument on a single data point, without asking whether the example is in any way representative.

The emails and files from the CRU that were released to the public are quite clearly a carefully chosen selection, where the selection criterion appears to have been maximum potential embarrassment to the climate scientists. I’m quite sure that I could find equally embarrassing examples of poor software on the computers of Prof Ince and his colleagues. The Guardian has been conducting a careful study of the claims that have been made about these emails, and has shown that the allegations of defects in the climate science are unfounded. However, these investigations haven’t covered the issues that Prof Ince raises, so it is worth examining them in more detail.

The Harry README file does appear to be a long struggle by a junior scientist to get some poor quality software to work. Does this indicate that there is a systemic problem of software quality in climate science? To answer that question, we would need more data. Let me offer one more data point, representing the other end of the spectrum. Two years ago I carried out a careful study of the software development methods used for the main climate simulation models developed at the UK Met Office. I was expecting to see many of the problems Prof Ince describes, because such problems are common across the entire software industry. However, I was extremely impressed with the care and rigor with which the climate models are constructed, and the extensive testing they are subjected to. In many ways, this process produces higher quality code than the vast majority of commercial software that I have studied, which includes the spacecraft flight control code developed by NASA’s contractors. [My results were published here: http://dx.doi.org/10.1109/MCSE.2009.193].

The climate models are developed over many years, by a large team of scientists, through a process of scientific experimentation. The scientists understand that their models are approximations of complex physical processes in the Earth’s atmosphere and oceans. They build their models through a process of iterative refinement. They run the models, and compare them with observational data, to look for the places where the models perform poorly. They then create hypotheses for how to improve the model, and then run experiments: using the previous version of the model as a control, and the new version as the experimental case, they compare both runs with the observational data to determine whether the hypothesis was correct. By a continual process of making small changes, and experimenting with the results, they end up testing their models far more effectively than most commercial software developers. And through careful use of tools to keep track of this process, they can reproduce past experiments on old versions of the model whenever necessary. The main climate models are also subjected to extensive model intercomparison tests, as part of the IPCC assessment process. Models from different labs are run on the same scenarios, and the results compared in detail, to explore the strengths and weaknesses of each model.
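The control-versus-experiment comparison described above can be sketched in a few lines. The function names and synthetic data here are hypothetical, and RMS error stands in for whatever skill scores a modeling centre actually uses:

```python
import numpy as np

np.random.seed(0)  # deterministic illustration

def rms_error(run, observations):
    """Root-mean-square difference between a model run and observations."""
    return np.sqrt(np.mean((run - observations) ** 2))

def experiment_wins(control_run, experimental_run, observations):
    """True if the candidate model version fits the observations better than the control."""
    return rms_error(experimental_run, observations) < rms_error(control_run, observations)

# Synthetic stand-ins for model output and observational data.
obs = np.sin(np.linspace(0, 10, 100))
control = obs + np.random.normal(0, 0.5, 100)     # older model version, larger error
candidate = obs + np.random.normal(0, 0.2, 100)   # proposed change, smaller error
print(experiment_wins(control, candidate, obs))   # True
```

The point is not the arithmetic but the discipline: every candidate change is scored against the same observations as the control before it is accepted.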

Like many software industries, different types of climate software are verified to different extents, representing choices of where to apply limited resources. The main climate models are tested extensively, as I described above. But often scientists need to develop other programs for occasional data analysis tasks. Sometimes, they do this rather haphazardly (which appears to be the case with the Harry file). Many of these tasks are tentative in nature, and correspond to the way software engineers regularly throw a piece of code together to try out an idea. What matters is that, if the idea matures, and leads to results that are published or shared with other scientists, the results are checked carefully by other scientists. Getting hold of the code and re-running it is usually a poor way of doing this (I’ve found over the years that replicating someone else’s experiment is fraught with difficulties, and not primarily because of problems with code quality). A much better approach is for other scientists to write their own code, and check independently whether the results are confirmed. This avoids the problem of everyone relying on one particular piece of software, as we can never be sure any software is entirely error-free.

The claim that many climate scientists have refused to publish their computer programs is also specious. I compiled a list last summer of how to access the code for the 23 main models used in the IPCC report. Although only a handful are fully open source, most are available free under fairly light licensing arrangements. For our own research we have asked for and obtained the full code, version histories, and bug databases from several centres, with no difficulties (other than the need for a little patience as the appropriate licensing agreements were sorted out). Climate and weather forecasting code has a number of potential commercial applications, so the modeling centres use a license agreement that permits academic research, but prohibits commercial use. This is no different from what would be expected when we obtain code from any commercial organization.

Professor Ince mentions Hatton’s work, which is indeed an impressive study, and one of the few that have been carried out on scientific code. And it is quite correct that there is a lot of shoddy scientific software out there. We’ve applied some of Hatton’s research methods to climate model software, and have found that, by standard software quality metrics, the climate models are consistently good quality code. Unfortunately, it is not clear that standard software engineering quality metrics apply well to this code. Climate models aren’t built to satisfy a specification, but to address a scientific problem where the answer is not known in advance, and where only approximate solutions are possible. Many standard software testing techniques don’t work in this domain, and it is a shame that the software engineering research community has almost completely ignored this problem – we desperately need more research into this.
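To give a flavour of what a standard quality metric looks like, here is a toy approximation of cyclomatic complexity, which simply counts branch keywords. This is an illustration only, not Hatton's actual method nor the metrics used in our study:

```python
import re

def crude_complexity(source: str) -> int:
    """Very rough proxy for cyclomatic complexity: 1 + number of branch keywords."""
    branches = re.findall(r'\b(if|elif|for|while|and|or|except)\b', source)
    return 1 + len(branches)

snippet = """
def clip(x, lo, hi):
    if x < lo:
        return lo
    if x > hi:
        return hi
    return x
"""
print(crude_complexity(snippet))  # 3: one straight-line path plus two ifs
```

A real tool parses the code rather than pattern-matching keywords, but the idea is the same: count independent paths through each routine, and flag routines whose count is high as harder to test and maintain.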

Prof Ince also echoes a belief that seems to be common across the academic software community that releasing the code will solve the quality problems seen in the specific case of the Harry file. This is a rather dubious claim. There is no evidence that, in general, open source software is any less buggy than closed source software. Dr Xu at the University of Notre Dame studied thousands of open source software projects, and found that the majority had nobody other than the original developer using them, while a very small number of projects had attracted a big community of developers. The same pattern is likely true of scientific software: the problem isn’t lack of openness, it’s lack of time – most of the code thrown together to test out an idea by a particular scientist is only of interest to that one scientist. If a result is published and other scientists think it’s interesting and novel, they attempt to replicate the result themselves. Sometimes they ask for the original code (and in my experience, are nearly always given it). But in general, they write their own versions, because what matters isn’t independent verification of the code, but independent verification of the scientific results.

I am encouraged that my colleagues in the software engineering research community are starting to take an interest in studying the methods by which climate science software is developed. I fully agree that this is an important topic, and have been urging my colleagues to address it for a number of years. I do hope that they take the time to study the problem more carefully though, before drawing conclusions about overall software quality of climate code.

Prof Steve Easterbrook, University of Toronto

Update: The Guardian never published my letter, but I did find a few other rebuttals to Ince’s article in various blogs. Davec’s is my favourite!

22 Comments

  1. They build their models through a process of iterative refinement. They run the models, and compare them with observational data, to look for the places where the models perform poorly. They then create hypotheses for how to improve the model, and then run experiments: using the previous version of the model as a control, and the new version as the experimental case, they compare both runs with the observational data to determine whether the hypothesis was correct.

    This process is exactly why the climate modeling community gets criticisms from outside their field about post hocery (or as a statistician friend of mine called it ‘rummaging through the residuals’):

    practitioners frequently engage in post hoc model modification when confronted with models exhibiting unacceptable fit. Little is currently known about the extent to which such procedures capitalize on sampling error.

    Perhaps those criticisms are unfounded, but it’s an honest criticism. It’s one that could be answered by demonstrating the stability of this approach by validation against future predictions (cross validation on hindcasts is another way, but is less convincing).

    By a continual process of making small changes, and experimenting with the results, they end up testing their models far more effectively than most commercial software developers.

    I believe your claims about software correctness; examining your code methodically with quantitative techniques is bound to catch the important errors (ones that prevent you from solving the equations correctly).

    Hopefully they’ll publish your letter.

  2. Josh: If “rummaging through the residuals” was what they are doing, then they would deserve that criticism. However, that’s not what’s happening. The models are used as part of a theory-building process in which the goal is to improve understanding; improving the model is a by-product. The changes to the models aren’t a random process of seeking a better fit, it’s a systematic process of developing theories to explain the lack of fit. I’ve plenty of data to support this. For example, often the changes they want to make to the models to improve the realism of the physics have bad effects: the new scheme does worse (in terms of rms error) than the old scheme, especially where the old scheme had been tuned to compensate for known inaccuracies. Often the new scheme is slower too. At the UK Met Office, the models are also used for weather forecasting, where speed and skill are at a premium, so these conflicts have to be resolved before the changes are accepted, usually by bundling such changes together with others that improve speed and skill in other parts of the model. This concern for “getting the physics right” over and above improving model speed and skill demonstrates that the climate scientists aren’t doing post hoc model modification; they are working to improve their understanding of the processes they are studying. To an outsider, if you just looked at versions of the models over time, you probably couldn’t tell the difference. However, in my studies, I spend months observing their day-to-day work practices, and listening in on their conversations. You need to see what they actually do with the models to understand the nature of this science.

  3. You’re right, calling what they do ‘rummaging’ is unfair. Their process is susceptible to the same failings though, because physics that’s ok to neglect today won’t be ok to neglect tomorrow (I understand there is a wealth of observational data, covering a wide range of climate conditions).

    The models are used as part of a theory-building process in which the goal is to improve understanding

    I buy that, I completely agree that the process you describe will result in a set of physics based models that accurately describes what has been observed. The connection to predictive capability is my hang-up.

  4. Yep, most scientists I’ve spoken to share your hang-up – the predictive capability is nowhere near what we need to drive planning and policymaking. However, it’s the best we’ve got, and the model predictions corroborate (more or less) with assessments of equilibrium climate sensitivity from paleoclimate studies. Which means if the projections summarized in the IPCC reports are wrong, they could be wrong in either direction. Worse still, recent studies of feedbacks suggest that we are likely to be significantly under-estimating the positive feedbacks.

    I think you have preferred to assume that we shouldn’t set new policy until we have better predictive capability. In contrast, I prefer the precautionary principle – the best knowledge available today tells us we’re heading for a disaster, and that knowledge comes from many lines of evidence, not just the models. And that evidence also indicates that the longer we wait to do anything about it, the more we compound the problem. So, you can label this response as “alarmist” if you like, but I think the IPCC have given us a rational assessment of the risks and uncertainties, and we damn well ought to act on it.

  5. Darn. I noticed some minor wording errors in the letter I sent off (corrected above). And I should have pointed out that nobody has replicated Hatton’s studies either. Oh well.

  6. Steve you make some fair points: I am happy to admit that Met Office code and also the NASA GISS code is of a high quality. My focus was on code generated outside government supported environments where there is a structured development that supports the production of good code.

    Let me say also that I am fairly liberal about the release of code: programming a model is hugely intensive work and I think that it is reasonable to keep your code for a period in order to make as much hay as possible—Steve Schneider suggests two years; I think that’s probably too long. There are also problems with intellectual property rights; the sort of thing that delayed Prof Mann’s release. I am also liberal about the past. There is the problem that it’s only over the last few years that the computer has really impinged on science in a way that it didn’t until fairly recently–not just in terms of what it does to the data but how it helps collect the data.

    I also take your point that developing different programs from a published model provides a good validation, but what do you do when somebody generates a different result from the same description of an algorithm? Check this New Scientist article out as an example of what I am saying.

    http://www.newscientist.com/article/dn18307-sceptical-climate-researcher-wont-divulge-key-program.html

    Here’s a thought experiment. A researcher who is sceptical about my work develops a computer climate model that is described in a paper which claims different results to mine. What would I do? Wait for some further validations to support me and have my work regarded as invalid? No, I would take that as a scientific criticism of my work and I would ask the authors for their data and their code. If they refused to release either or both, then I would say that they were behaving unscientifically.

    I am hoping to persuade The Guardian, the THES and my university to sponsor a conference on this. However, it will not take place yet; things are too clouded by the UEA stuff for it to be entirely dispassionate. If I am successful I hope that I might persuade you to present something. I shall be asking Les Hatton along as well. Incidentally he used to work at the Met Office.

  7. Pingback: Two Views « Software Carpentry

  8. Darrel: Thanks for taking the time to respond. To be fair, not all the code at the big government labs is as good quality as the main climate models. I think the distinction between code that’s central to the science and “side experiments” is just as big a factor in determining quality as where the code originated. Of course problems occur when those “side experiments” mature into important science. But the resulting problems with poor legacy code are everywhere. We can (and should) teach better coding practices to all computational scientists, but we should also be wary of blind prescriptions of commercial software engineering practices and tools, many of which simply don’t suit the needs of these scientists.

    The complete refusal to release code that was used in published research, to others working in the same field, is inexcusable. But I really don’t see any evidence that this happens very often. The handful of examples that are widely trumpeted in the blogosphere seem to be the tail wagging the dog.

    Anyway, I’d be delighted to see a conference on this – as I said, there are several issues here where a serious concerted research effort would help (and I don’t think the UK’s eScience initiative, nor Grey’s 4th paradigm work, are addressing the important software issues).

  9. Thanks Steve. This is a huge area. For example, let’s say some company wishes to attack the global warming community and hires some hack scientists to do some research and they publish a research report, maybe even just as a free commercial report but certainly as a scholarly output in, say, a vanity journal–there are more and more of these. The existence of that report will put a spoke into any attempts to awaken the world to the problems that we are facing. As far as I’m concerned, data and code should be released; without this scientific principle, important research will get hindered by those who have agendas outside science. Outside climate science this is academic; research into the mating habits of swans will never attract this level of interest and vitriol. However, it’s a good principle to hang onto. Thanks for your offer to come to London.

    You are right: the e-science agenda and the 4th paradigm stuff ignore this aspect, although as examples of a culture change they are valuable, as are the recent statements by the chair of the NSA.

    I agree about not foisting the sort of bondage and discipline methods on scientists that are often used in industry. I wouldn’t foist them on many software developers. I’m currently working on a case study of a failed project which was caused by this, to the point where British social workers who are responsible for children in need can spend up to 80% of their time sitting at a computer acting as data entry clerks. I can send a draft of a paper to anyone who wants it.

    The whole problem seems to me bound up with the grubby world of politics. A world I am so unfamiliar with.

    Darrel

  10. Steve:
    You raise some good points, one of which I’ll modify some and perhaps open a can of software worms. I’m not following jstults, in part, I think, because of a cultural issue (software engineering vs. climate modeling). This leads to both the modification and the can of worms.

    If I’ve made the right interpretation, jstults is looking for climate models to be written to specification, and then verified/validated/… against how well they meet that specification. In a sense this could be done. We know the laws of dynamics and thermodynamics, and the radiative transfer, and a pretty fair amount of the cloud physics and other things involved in climate.

    If we were to write a climate model to specification, this would be the specification. I could probably write such a model myself — it’s actually a good bit easier to write this way than the way that we actually do.

    The reason we don’t do it that way is because for just the dynamics, we would need computers something like 10^30 times more powerful than currently exist. I’m not sure how bad it is for doing all radiation line by line, at every time step. Probably several (tens?) of orders of magnitude worse. But we do know how to do it. The physics are understood.

    This is where the iteration in model building comes from. Since our computers are egregiously underpowered compared to what would let us build the model to specification, we have to develop parameterizations to represent the processes we know are happening (turbulence, for instance) but for which we don’t have the computing power to represent the way we know it fundamentally happens. Each approximation (parameterization) gets assessed against how well it does in representing the thing it’s aimed at. And then we stick it into the model to see how the overall performance changes. Because the rest of the model also has simplifications which, as you note, were tuned to give good results in spite of the bad old parameterization, not infrequently we wind up getting worse overall results with our spiffy new parameterization.

  11. I’m not following jstults, in part, I think, because of a cultural issue (software engineering vs. climate modeling)

    I think you are right, it’s mostly different communities use terms differently. Here’s a good description of some relevant terms and the definitions they’ve been given by different folks:

    Institute of Electrical and Electronics Engineers (IEEE) defined verification as follows [167, 168]:
    Verification: The process of evaluating the products of a software development phase to provide assurance that they meet the requirements defined for them by the previous phase.
    [...]
    IEEE defined validation as follows [167, 168]:
    Validation: The process of testing a computer program and evaluating the results to ensure compliance with specific requirements.
    [...]
    The AIAA definitions are given as follows [12]:
    Verification: The process of determining that a model implementation accurately represents the developer’s conceptual description of the model and the solution to the model.
    Validation: The process of determining the degree to which a model is an accurate representation of the real world from the perspective of the intended uses of the model.
    – Verification and Validation in Computational Fluid Dynamics

    IEEE-like usage is common in software engineering, while AIAA-like usage is common in computational physics.

    I think your can of worms is about parameterizations for unresolved flow features / physics. That’s related to validation (AIAA version), but probably our main source of confusion is just definitions of terms. Steve’s description of where he and I differ is pretty accurate I think.

  12. Hi Steve,

    From your 2nd paragraph: “Unfortunately, Prof Ince makes a serious error of science himself – he bases his entire argument on a single data point, without asking whether the example is in any way representative.”

    However, IMHO, Darrel’s argument is just: “Many climate scientists have refused to publish their computer programs. I suggest that this is both unscientific behaviour and, equally importantly, ignores a major problem: that scientific software has got a poor reputation for error.”

    IMHO, his argument would not be a serious error in the appropriate application of the scientific method even if his suggestion applied to only one climate scientist. Further, his remark on the poor reputation of scientific software is justified by an authoritative source: Les Hatton.

    In response to Darrel’s description of some of Les Hatton’s earlier work (critical of the quality of scientific code), you write that you’ve “applied some of Hatton’s research methods to climate model software, and have found that, by standard software quality metrics, the climate models are consistently good quality code.” This is good news. However, your published results seem to be behind a pay-wall. So most of us can’t get to it.

    And even if we could, you then immediately state that: “it is not clear that standard software engineering quality metrics apply well to this code.” So how can we think that your work helps to dispel scientific software’s poor reputation for error?

    But besides all this, the main reason that I favor Darrel’s argument is that, IMHO, it is the only way to combat the (actually quite rational) divergence of viewpoints over the climate change issue – a divergence that makes climate software V&V much more difficult. For more on this, see the last paragraph of this post and E.T. Jaynes’s understanding of the issue of rational divergence here.

    George

  13. It’s not as simple as one might think to share code.

    I once had a student who spent four years building a new physics parameterisation for a model. Four years. Lots of validation work, lots of fun. One thesis, but no papers … (at that particular time). He was asked by a number of folk who wanted to try and use it. We said fine, but he needed to be a co-author on any initial papers using the code. They said no thanks. It’s not an uncommon story.

    Serious analysis code (as opposed to trivial algorithm reconstruction) takes serious time. Folks need academic credit for that time. Giving my code to Joe Average when I’m Sarah Special doesn’t make sense unless either a) they have to cite my code, or b) I’m a coauthor, or c) I’ve gotten a better algorithm up my sleeve.

    I think two years is long enough to think I’ve gotten onto something better. However, less than two years is a bit marginal … especially for a doctoral student who might (unlike my example above) get a paper out at the end of their first year …

    … and I’d love to come to the same workshop!

  14. Jstults:
    Definitely in line with my thinking. There’s a disconcerting aspect, not really mentioned in that note. The direct numerical simulation (DNS — I first thought domain name service, which was confusing) had 210 million grid points. Now multiply that by the 1 mm grid spacing I mentioned. It means the volume that could be directly simulated is on the order of 0.2 m^3 — 200 kg of water, 0.2 kg of air. You’re looking at a moderately large fish tank (50 gallons).

  15. Hi Robert,
    Things aren’t quite as bad as that; here’s a paper that discusses the relevant scaling for an atmospheric boundary layer. The direct numerical simulation is still at much lower Reynolds number than a real atmosphere (simulation Re on the order of 10^3, observed Re on the order of 10^8), but eventually you can establish what the right statistics look like for your sub-grid parameterizations in the limit of infinite Reynolds number, so you can then use them for large eddy simulations (LES) with realistic Reynolds numbers.

  16. jstults:
    The direct simulation was, and is, an issue with respect to what I saw as your requirement that climate models be written to specification. The a priori specification that can be made is the Navier-Stokes equations, and direct numerical solution thereof. I’m perfectly aware of the LES and other means of working out sub-grid parameterizations. But all such efforts run afoul of your complaint against modelers going back and forth between parameterizations and results, and retuning parameterizations after seeing the results.

  17. Robert,

    what I saw as your requirement that climate models be written to specification. The a priori specification that can be made is the Navier-Stokes equations, and direct numerical solution thereof.

    You are misunderstanding me. The specification can be any governing equation set and set of sub-grid scale parameterizations, no need to require DNS (esp. since climate models fall well short of that sort of resolution and they make analytical simplifications to the conservation laws as well, so they aren’t even solving the Navier Stokes equations).

    But all such efforts run afoul of your complaint against modelers going back and forth between parameterizations and results, and retuning parameterizations after seeing the results.

    That’s not true; the parameterizations should be based on sound physical reasoning rather than post-hoc tuning (see this post for an example of what I mean). There’s nothing wrong with sub-grid scale models or empirical closures, but the predictive capability you can expect depends strongly on how they were developed and calibrated. This is not a new or controversial idea; there’s tons of stuff in the computational physics literature about this (the folks out at Sandia have a lot of their stuff easily available on the net). The early work in turbulence modeling in the fluid dynamics community is an instructive case study as well (model parameters that were originally thought to be universally valid turned out to depend on the type of flow, but experimental results were limited early on).

    It would be nice if I could see some grid-independent climate model results. I’m not asking for DNS, just grid-convergence of the functionals we care about with the chosen governing equation set and parameterizations. Right now the accepted state of the practice in climate modeling seems to be ‘validation’ of a choice of parameters on a specific grid resolution, that process does not engender hope of significant predictive capability.
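A grid-convergence check of the sort requested here is commonly done with Richardson extrapolation: compute the functional of interest on three systematically refined grids, estimate the observed order of accuracy, and extrapolate to the zero-grid-spacing limit. The numbers below are a manufactured illustration, not climate model output:

```python
import numpy as np

def observed_order(f_coarse, f_medium, f_fine, refinement_ratio=2.0):
    """Estimate the observed order of accuracy from solutions on three grids."""
    return np.log(abs(f_coarse - f_medium) / abs(f_medium - f_fine)) / np.log(refinement_ratio)

def richardson_extrapolate(f_medium, f_fine, p, refinement_ratio=2.0):
    """Estimate the grid-converged value from the two finest solutions."""
    return f_fine + (f_fine - f_medium) / (refinement_ratio ** p - 1.0)

# Manufactured example: a functional converging to 1.0 at second order.
h = np.array([0.4, 0.2, 0.1])     # grid spacings, refined by a factor of 2
f = 1.0 + 0.5 * h ** 2            # computed functional on each grid
p = observed_order(f[0], f[1], f[2])
print(round(p, 6))                                       # 2.0
print(round(richardson_extrapolate(f[1], f[2], p), 6))   # 1.0
```

If the observed order matches the scheme's formal order and the extrapolated value stabilizes under further refinement, the functional can be reported as grid-independent for the chosen equation set and parameterizations.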

  18. Robert,
    I thought of a better example. Suppose you need to use an empirical closure for, say, the viscosity of your fluid or the equation of state. Usually you develop this sort of thing with some physical insight based on kinetic theory and lab tests of various types to get fits over a useful range of temperatures and pressures, then you use this relation in your code (generally without modification based on the code’s output). An alternative way to approach this closure problem would be to run your code with variations in viscosity models and parameter values and pick the set that gave you outputs in good agreement with high-entropy functionals (like an average solution state: there are many ways to get the same answer, and nothing to choose between them) for a particular set of flows; this would be a sort of inverse modeling approach. Either way gives you an answer that can demonstrate consistency with your data, but there’s probably a big difference in the predictive capability of the models so developed.
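The "direct" calibration route described here can be sketched as a least-squares fit of a power-law viscosity model to lab data that isolates the property being modeled. The data below are synthetic, generated from the model itself, so the fit recovers the exponent exactly; real lab data and coefficients would differ:

```python
import numpy as np

# Hypothetical lab measurements of viscosity vs temperature (the "direct" route:
# calibrate the closure against experiments that isolate the property).
T0 = 273.15                                    # reference temperature [K]
T = np.array([250.0, 300.0, 350.0, 400.0])     # temperatures [K]
mu = 1.716e-5 * (T / T0) ** 1.5                # synthetic power-law viscosity data

# Fit log(mu) = log(A) + n * log(T/T0): a linear least-squares problem.
X = np.column_stack([np.ones_like(T), np.log(T / T0)])
coef, *_ = np.linalg.lstsq(X, np.log(mu), rcond=None)
A, n = np.exp(coef[0]), coef[1]
print(round(n, 3))  # 1.5 -- the exponent is recovered from the data
```

The inverse-modeling alternative would instead vary `A` and `n` until some global output of the full flow code matched observations, which is exactly the practice that risks compensating errors.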

  19. Josh,

    This example allows for discussion of a couple of nitty-gritty points about tuning parameters. Let me assume that when you say viscosity you mean the actual viscosity of the fluid; a property of the fluid. Almost none of the parameterizations used in any codes refer to properties of the material. Instead they refer to states of the material that are attained under various conditions. More specifically, they very frequently are used to replace gradients of driving potentials for mass, momentum, and energy exchanges both internal to a material and at the interface between different materials.

    As you mentioned in another comment, the original turbulence modeling approaches based on adding descriptive equations to the continuous system were initially claimed to need only parameters that were somewhat universal in nature. Those kinds of statements were soon displaced by more realistic assessments after it was realized that in fact the parameters reflected states of the material, not the material. I recall that among the first of these models, testing showed that it was basically limited to parallel shear flows in which no recirculation was present; basically flows described by parabolic equations.

    In the case of this viscosity example, and in my opinion for all parameterizations, the parameters must be tuned strictly by looking at responses directly related to the phenomena the parameter has been designed to represent. In the case of viscosity, and let’s assume this is not some kind of turbulent viscosity, the calculations must reflect a situation for which this parameter is the sole controlling parameter. That is, in this case, the calculation must represent the kinds of experiments that are used to determine the physical property of the fluid. Equally important, the experiments must be strictly designed to reflect this requirement. For the case of fluid viscosity, these experimental devices can be found in the literature and that is the flow that must be calculated by the code.

    If this guideline is not followed, the classic right-answer-for-the-wrong-reason will obtain. Worse, the numerical value for the parameter will eventually be found to not reflect reality. A feature of the state of the material will be used to determine what should be a property of the fluid.

    This problem is greatly compounded in the case of models and codes that describe a multitude of inherently complex multi-scale physical phenomena and processes. Looking at global solution functionals in attempts to tune parameters under such conditions can frequently result in getting the right answer for the wrong reason and introducing compensating errors into the modeling. The GCMs are a good example, I think. There are few meta-global solution functionals as detail-annihilating as some kind of average temperature of some parts of the system over the entire solution domain and for an entire year.

  20. Just so I can keep track, there’s more on Josh’s blog relevant to this discussion. Testing of parameterizations came up a lot in my discussions at NCAR last week; I plan to make it a central question if I manage to get back there for a longer visit in the fall.

  21. Thanks for the mention! I confess that I haven’t been keeping up with developments (or lack thereof) related to Darrel’s article in the Guardian.

    However, it is refreshing to stumble across an actual reasoned discussion of an aspect – any aspect – of climate science (by yourself, Darrel and others). I think I’ve lost too many brain cells recently courtesy of the Australian media’s often-horrendous treatment of the subject. (You may be familiar with Tim Lambert’s running series on the Australian’s War on Science.)

    I’m merely a layperson regarding climate change itself, but Darrel’s article did at least provoke some contemplation of the connection between climate change and software engineering.

  22. Up the thread a ways I said:

    The specification can be any governing equation set and set of sub-grid scale parameterizations, no need to require DNS (esp. since climate models fall well short of that sort of resolution and they make analytical simplifications to the conservation laws as well, so they aren’t even solving the Navier Stokes equations).

    This comment on another thread talks a bit to the ‘grid convergence’ idea specifically for the governing equation set used commonly in GCMs.
