Here’s the first of a series of posts from the American Geophysical Union (AGU) Fall meeting, which is happening this week in San Francisco. The meeting is huge – they’re expecting 19,000 scientists to attend, making it the largest such meeting in the physical sciences.

The most interesting session today was a new session for the AGU:  IN14B “Software Engineering for Climate Modeling”. And I’m not just saying that because it included my talk – all the talks were fascinating. (I’ve posted the slides for my talk, “Do Over or Make Do: Climate Models as a Software Development Challenge“).

After my talk, the next speaker was Cecelia DeLuca of NOAA, with a talk entitled “Emergence of a Common Modeling Architecture for Earth System Science”. Cecelia gave a great overview of the Earth System Modelling Framework. She began by pointing out that climate models don’t just contain science code – they consist of a number of different kinds of software. Lots of the code is infrastructure code, which doesn’t necessarily need to be written by scientists. Around ten years ago, a number of projects started up with the aim of building shared, standards-based infrastructure code. These projects needed to develop the technical and mathematical expertise to build infrastructure code. But the advantages of separating this code development from the science code were clear: the teams building infrastructure code could prioritize best practices, run the nightly testing process, etc., whereas typically the scientists would not do this.

ESMF provides a common modelling architecture. Native model data structures (modules, fields, grids, timekeeping) are wrapped into ESMF standard data structures, which conform to relevant standards (e.g. ISO standards, the CF conventions, and the Metafor Common Information Model). The framework also offers runtime compliance checking (e.g. to check that timekeeping behaviour is correct), and automated documentation (e.g. the ability to write out model metadata in a standard XML format).
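To make the wrapping idea concrete, here is a minimal sketch in Python (this is not the real ESMF API; the class, field names, and metadata keys are invented for illustration) of a native model field wrapped into a standard structure that can also emit its metadata automatically:

```python
# Hypothetical sketch, not the real ESMF API: a native model field is
# wrapped in a standard structure, so the framework can treat all fields
# uniformly and export their metadata without scientist intervention.
from dataclasses import dataclass, field

@dataclass
class StandardField:
    name: str        # CF-style standard name (assumed convention)
    units: str
    data: list       # the native model array, wrapped as-is, not rewritten
    metadata: dict = field(default_factory=dict)

    def to_xml(self) -> str:
        """Automated documentation: write the metadata in a simple XML form."""
        attrs = "".join(f'<attr key="{k}">{v}</attr>'
                        for k, v in sorted(self.metadata.items()))
        return f'<field name="{self.name}" units="{self.units}">{attrs}</field>'

# Native model data gets wrapped rather than restructured:
sst = StandardField("sea_surface_temperature", "K", [271.3, 274.8],
                    metadata={"grid": "gaussian", "source": "native_ocean_module"})
print(sst.to_xml())
```

The point of the pattern is that the science code keeps its native data structures; only a thin standard wrapper is added around them.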

Because of these efforts, in the US, earth system models are converging on a common architecture. It’s built on standardized component interfaces, and creates a layer of structured information within Earth system codes. The lesson here is that if you can take legacy code and express it in a standard way, you get tremendous power.

The next speaker was Amy Langenhorst from GFDL, “Making sense of complexity with the FRE climate modelling workflow system”. Amy explained the organisational setup at GFDL: there are approximately 300 people organized into groups: six science-based groups, plus a technical services group, and a modelling services group. The latter consists of 15 people, with one of them acting as a liaison for each of the science groups. This group provides the software engineering support for the science teams.

The Flexible Modeling System (FMS) is a software framework that provides a coupler and infrastructure support. FMS releases happen about once per year; it provides an extensive testing framework that currently includes 209 different model configurations.

One of the biggest challenges for modelling groups like GFDL is the IPCC cycle. Providing the model runs for the IPCC assessments involves massive, complex data processing, for which a good workflow manager is needed. FRE is the workflow manager for FMS. Development of FRE was started in 2002 by Amy, at a time when the modelling services group didn’t yet exist.

FRE includes version control, configuration management, tools for building executables, control of execution, etc. It also provides facilities for creating XML model description files, model configuration (using a component-based approach), and integrated model testing (e.g. basic tests, restarts, scaling). It also allows for experiment inheritance, so that it’s possible to set up new model configurations based on variants of previous runs, which is useful for perturbation studies.
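Experiment inheritance of the kind described above can be sketched as follows. This is a hypothetical Python rendering (the real FRE uses XML model description files, and the experiment names and settings here are invented): a perturbation experiment states only what differs from its parent, and the workflow tool merges the rest in.

```python
# Hypothetical sketch of FRE-style experiment inheritance (invented
# names; the real FRE expresses this in XML description files).
def resolve(experiments, name):
    """Merge an experiment's settings over those of its ancestors."""
    exp = experiments[name]
    settings = {}
    if "inherits" in exp:
        # Recursively pick up everything the parent chain defines...
        settings.update(resolve(experiments, exp["inherits"]))
    # ...then let this experiment override just the values it changes.
    settings.update(exp.get("settings", {}))
    return settings

experiments = {
    "control": {"settings": {"years": 100, "co2_ppm": 280}},
    # A perturbation study only states what differs from its parent:
    "double_co2": {"inherits": "control", "settings": {"co2_ppm": 560}},
}
print(resolve(experiments, "double_co2"))  # → {'years': 100, 'co2_ppm': 560}
```

This is why inheritance is so useful for perturbation studies: the variant experiment is a few lines, not a full copy of the configuration.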

Next up was Rob Burns from NASA GSFC, talking about “Software Engineering Practices in the Development of NASA Unified Weather Research and Forecasting (NU-WRF) Model“. WRF is a weather forecasting model originally developed at NCAR, but widely used across the NWP community. NU-WRF is an attempt to unify variants of NCAR WRF and to facilitate better use of WRF. NU-WRF is built from versions of NCAR’s WRF, with a separate process for folding in enhancements.

As is common with many modelling efforts, there were challenges arising from multiple science teams, each with its own goals, interests and expertise, and from scientists who don’t consider software engineering their first priority. At NASA, the Software Integration and Visualization Office (SIVO) provides software engineering support for the scientific modelling teams. SIVO helps to drive, but not to lead, the scientific modelling efforts. They help with full software lifecycle management, assisting with all software processes from requirements to release, but with domain experts still making the scientific decisions. The code is under full version control, using Subversion, and the software engineering team coordinates the effort to get the codes into version control.

The experience with NU-WRF shows that this kind of partnership between science teams and a software support team can work well. Leadership and active engagement with the science teams is needed. However, involvement of the entire science team for decisions is too slow, so a core team was formed to do this.

The next speaker was Thomas Clune from NASA GISS, with a talk “Constraints and Opportunities in GCM Model Development“. Thomas began with the question: How did we end up with the software we have today? From a software quality perspective, we wrote the wrong software. Over the years, improvements in fidelity in the models have driven a disproportionate growth in complexity of implementations.

One important constraint is that model codes change relatively slowly, in part because of the model validation processes – it’s important to be able to validate each code change individually – they can’t be bundled together. But also because code familiarity is important – the scientists have to understand their code, and if it changes too fast, they lose this familiarity.

However, the problem now is that software quality is incommensurate with the growing socioeconomic role for our models in understanding climate change. There’s a great quote from Ward Cunningham: “Shipping first time code is like going into debt. A little debt speeds development so long as it is paid back promptly with a rewrite… The danger occurs when the debt is not repaid. Every minute spent on not-quite-right code counts as interest on that debt. Entire engineering organizations can be brought to a stand-still under the debt load of an unconsolidated implementation, object-oriented or otherwise…” Examples of this debt in climate models include long procedures, kludges, cut-and-paste duplication, short/ambiguous names, and inconsistent style.

The opportunities then are to exploit advances in software engineering from elsewhere to systematically and incrementally improve the software quality of climate models. For example:

  • Coding standards – these improve productivity through familiarity, reduce some types of bugs, and help newcomers. But they must be adopted from within the community by negotiation.
  • Abandon CVS. It has too many liabilities for managing legacy code, e.g. it makes directory structures permanent. The community needs version control systems that handle branching and merging. NASA GISS is planning to switch to Git in the new year, as soon as the IPCC runs are out of the way.
  • Unit testing. There’s a great quote from Michael Feathers: “The main thing that distinguishes legacy code from non-legacy code is tests. Or rather lack of tests”. Lack of tests leads to fear of introducing subtle bugs. Elsewhere, unit testing frameworks have caused a major shift in how commercial software development works, particularly in enabling test-driven development. Tom has been experimenting with pFUnit, a testing framework with support for parallel Fortran and MPI. The existence of such testing frameworks removes some of the excuses for not using unit testing for climate models (in most cases, the modeling community relies on regression testing in preference to unit testing). Some of the reasons commonly given for not doing unit testing seem to represent some confusion about what unit testing is for: e.g. that some constraints are unknown, that tests would just duplicate implementation, or that it’s impossible to test emergent behaviour. These kinds of excuses indicate that modelers tend to conflate scientific validation with the verification offered by unit testing.
  • Clone Detection. Tools now exist to detect code clones (places where code has been copied, sometimes with minor modifications, across different parts of the software). Tom has experimented with some of these on the NASA modelE, with promising results.
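A minimal sketch of what a unit test for one small model component might look like, written with Python’s built-in unittest rather than pFUnit (which targets parallel Fortran). The saturation-pressure routine below is invented for illustration, not taken from any real model; the point is that the routine is verified in isolation, without running the whole model:

```python
# Illustrative only: a small, self-contained physics routine and a unit
# test for it. Python's unittest stands in here for pFUnit.
import math
import unittest

def saturation_vapor_pressure(t_kelvin):
    """Simplified Clausius-Clapeyron-style formula (invented for this sketch)."""
    t_c = t_kelvin - 273.15
    return 611.2 * math.exp(17.67 * t_c / (t_c + 243.5))  # Pa

class TestSaturationPressure(unittest.TestCase):
    def test_reference_value_at_freezing(self):
        # Verifies one routine in isolation -- no model run required.
        self.assertAlmostEqual(saturation_vapor_pressure(273.15), 611.2, places=3)

    def test_monotonic_in_temperature(self):
        # A physical constraint that must hold regardless of tuning.
        self.assertLess(saturation_vapor_pressure(280.0),
                        saturation_vapor_pressure(290.0))

# Run with: python -m unittest <this_module>
```

Note the contrast with regression testing: these tests pin down the behaviour of a single unit, so a failure points directly at the component responsible rather than at the model as a whole.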

The next talk was by John Krasting from GFDL, on “NOAA-GFDL’s Workflow for CMIP5/IPCC AR5 Experiments”. I didn’t take many notes, mainly because the subject was very familiar to me, having visited several modeling labs over the summer, all of whom were in the middle of the frantic process of generating their IPCC CMIP5 runs (or in some cases struggling to get started).

John explained that CMIP5 is somewhat different from the earlier CMIP projects, because it is much more comprehensive, with a much larger set of model experiments, and much larger set of model variables requested. CMIP1 focussed on pre-industrial control runs, while CMIP2 added some idealized climate change scenario experiments. For CMIP3, the entire archive (from all modeling centres) was 36 terabytes. For CMIP5, this is expected to be at least two orders of magnitude bigger. Because of the larger number of experiments, CMIP5 has a tiered structure, so that some kinds of experiments are prioritized (e.g. see the diagram from Taylor et al).

GFDL is expecting to generate around 15,000 model years of simulation, yielding around 10 petabytes of data, of which around 10%-15% will be released to the public, distributed via the ESG Gateway. The remainder of the data represents some redundancy, and some diagnostic data that’s intended for internal analysis.

The final speaker in the session was Archer Batcheller, from University of Michigan, with a talk entitled “Programming Makes Software; Support Makes Users“. Archer was reporting on the results of a study he has been conducting of several software infrastructure projects in the earth system modeling community. His main observation is that e-Science is about growing socio-technical systems, and that people are a key part of these systems. It takes effort to nurture communities of users, and that effort is crucial for building the scientific cyberinfrastructure.

From his studies, Archer found that most people developing modeling infrastructure software divide their time about 50:50 between coding and other activities, including:

  • “selling” – explaining/promoting the software in publications, at conferences, and at community meetings (even though the software is free, it still has to be “marketed”)
  • support – helping users, which in turn helps with identifying new requirements
  • training – including 1-on-1 sessions, workshops, online tutorials, etc.


  1. Pingback: Tweets that mention AGU session on Software Engineering for Climate Modeling | Serendipity

  2. “Shipping first time code is like going into debt.”

    Big software systems can evolve into a state of entropy where, on average, every bugfix introduces a bug comparable in effect to the one it fixes. In this state the whole developer team is busy fixing bugs, while the software isn’t improved a bit 🙂 That’s the state where whole legacy systems are thrown into the trash can (or rather deployed to the null device 🙂

    “The opportunities then are to exploit advances in software engineering from elsewhere to systematically and incrementally improve the software quality of climate models.”

    I’m missing the buzzword “refactoring” in the following paragraph 🙂

  3. Some of the reasons commonly given for not doing unit testing seem to represent some confusion about what unit testing is for: e.g. that some constraints are unknown, that tests would just duplicate implementation, or that it’s impossible to test emergent behaviour. These kinds of excuse indicate that modelers tend to conflate scientific validation with the verification offered by unit testing.

    I think it would be valuable if you went in to this unit test / verification test distinction more. For example, what parts of the “correctness checking” activities that are already being done would work better as unit tests?

  4. Until Steve’s return I hope I can sensibly reply to jstults. Unfortunately

    * I am Just Another Coder, not @ SME’s level
    * terminology in this domain (i.e., software testing) is not particularly standard
    * this @#$%^&! blog doesn’t do ul/ol, making list markup poorer

    OTOH I can rely on our Wikipedian pals (some of whom are me : – ) for more detailed explanations.

    SME: modelers tend to conflate scientific validation with the verification offered by unit testing.

    jstults: unit test / verification test distinction

    Software engineering, especially of artifacts as large/complex as GCMs, is about composition and layering. Modern software systems are typically composed of many named parts (e.g., functions, libraries, methods, objects, procedures), typically written or maintained by different people, or by the same person at widely varying times, or both. Composite systems are valorized as “modular”; others are deprecated as “monolithic.”

    Software tests can therefore differ in scope. A “verification test” (aka “acceptance test,” “blackbox test,” “function test,” “system test“) targets the entire system: e.g. one runs the entire software system (e.g., a model) on an input set and assesses the “correctness” of the output set. Unfortunately, “correctness” in this domain (i.e. earth-system modeling) is (IIUC) often difficult to specify, much less to automate, and often devolves to “eyeballing” (e.g. “does it ENSO?”).

    A unit test targets one, typically small, code component. A unit test must therefore be written and maintained in tandem with the unit it targets, but must be runnable in (at least relative) isolation from the larger system, which makes unit testing both rewarding and difficult. (A subject for a larger post, probably already written by Kent Beck.)

    For the maintenance of software quality there is no substitute for global verification test(s). But what is necessary is often not sufficient:

    what parts of the “correctness checking” activities that are already being done would work better as unit tests?

    Short answer: if your debugging process, or your integration testing, relies exclusively on system level tests (i.e. if ya gotta run the model to do either of the above), you need to start writing/running/maintaining unit tests. Long answer:

    Unit tests should (IMHO) be viewed as complements to verification tests, but complements which will very often provide better ROI (i.e. a greater increase in software quality for a given investment) than will an equal investment in additional verification test. (And I do mean “investment”: tests and their infrastructure are capital, not overhead.) In addition to the benefits cited here (of which the second, preventing “big bangs,” cannot be overstressed), I will add merely that unit tests add value whether or not the code passes verification test:

    * if the code fails verification: unit tests can help

    ** direct debugging. Some use the term “reductionism” to deprecate, but it is very often the case that system failures are due to one or a few component failures. These can be difficult to identify without good unit test coverage: to paraphrase Franklin, one unit test can save a lotta log slog.

    ** (typically) speed debugging. In addition to saving the effort described in the previous item, it is also typically true for large systems that a large bucket of unit tests can be run faster than a single system test.

    * if the code verifies: unit tests can detect countervailing errors. (I tend to doubt these happen much in practice, but my experience is low in this domain.)

    * in either case: unit tests help

    ** prevent regression. The terms “informatics” and “software engineering” are greatly to be preferred to the phrase “computer science,” since the latter is almost always either normative or mathematical. However there are a few empirical results in CS, including that (ceteris paribus) individual coders (like other humans) tend to make the same errors repeatedly. The SE implications are,

    **1 when you detect a bug in your code, write a test for it

    **2 run your test bucket often (notably, before committing)

    ** maintain coder sanity. Unit tests improve one’s understanding of, and therefore one’s confidence in using or modifying, code, e.g. to refactor. (Thanks to van Beek above for using the R word.) Conversely, fear of regression can induce “codertonia,” where exposure to a given piece of code (e.g., for extension, maintenance, or reuse) causes, e.g., keyboard paralysis, nausea, sweating, and occasionally death.

    : – ) HTH, Tom Roche

  5. The other SE implication:

    – process improvement: when you detect a bug in your code, change your process so that *bugs of this kind* will be avoided in future. A test or set of tests may be part of this, but this is more than just regression testing. For instance, it might include adding to the guidelines which you follow when writing unit tests (“test with under-sized, over-sized, and unusually-sized input datasets”).

  6. Refactoring is rarely used on the science code, because it messes up reproducibility. The scientists find it extremely useful to be able to get an exact (down to the least significant bit) reproduction of an old experiment from a newer version of the model. It’s often used as a testing trick to check that changes didn’t affect the climatology: you do a series of short tests (e.g. 5 days of simulation) and check that the results compare bit-for-bit, in place of having to run century-long simulations (which can take weeks of wallclock time) and eyeball them. Given the nature of optimizing compilers, almost any refactoring breaks bit-for-bit reproducibility.
    The issue came up in Tim Palmer’s talk yesterday, which I’ll blog as soon as I get my notes together…
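    The bit-for-bit comparison trick can be sketched as follows (a Python illustration with invented data; in practice the comparison runs over model restart and history files, not in-memory lists):

```python
# Sketch of a bit-for-bit comparison: two short runs are compared
# exactly at the bit level, not within a tolerance. Data values here
# are invented for illustration.
import struct

def bit_for_bit_equal(values_a, values_b):
    """Compare two sequences of floats at the bit level."""
    if len(values_a) != len(values_b):
        return False
    return all(struct.pack("<d", a) == struct.pack("<d", b)
               for a, b in zip(values_a, values_b))

old_run = [271.3, 274.8000000000001]
new_run = [271.3, 274.8]  # differs only in the last bits

print(bit_for_bit_equal(old_run, old_run))  # True
print(bit_for_bit_equal(old_run, new_run))  # False: caught here, where a
                                            # tolerance-based test would pass
```

    This is why the check is so sensitive to refactoring: any change that alters the order of floating-point operations, or how the compiler optimizes them, shows up as a last-bit difference.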

  7. Pingback: xkcd: How to Write Good Code | Serendipity

  8. Lack of respect for programming as an endeavor was the main hurdle I encountered. Unrelenting arrogance, and the assumption that PhD smarts lead to good coding, killed many an effort to improve software in this business. Culture is culture. Like trying to discuss philosophy in Tehran. There is no point and there is no future in such an endeavor. And the young ones come in knowing that the approach is wrong; but to survive they adapt, and over the years adaptation becomes acceptance, acceptance becomes advocacy.
    I loved software too much to stay in the business longer than I had to…
