This week I attended a Dagstuhl seminar on New Frontiers for Empirical Software Engineering. It was a select gathering, with many great people, which meant lots of fascinating discussions, and not enough time to type up all the ideas we’ve been bouncing around. I was invited to run a working group on the challenges posed to empirical software engineering by climate change. I started off with a quick overview of the three research themes we identified at the OOPSLA workshop in the fall:

  • Climate Modeling, which we could characterize as a kind of end-user software development, embedded in a scientific process;
  • Global collective decision-making, which involves creating the software infrastructure for collective curation of sources of evidence in a highly charged political atmosphere;
  • Green Software Engineering, including carbon accounting for the software systems lifecycle (development, operation and disposal), but where we have no existing measurement framework, and a tendency to make unsupported claims (aka greenwashing).

Inevitably, we spent most of our time this week talking about the first topic – software engineering of computational models, as that’s the closest to the existing expertise of the group, and the most obvious place to start.

So, here’s a summary of our discussions. The bright ideas are due to the group (Vic Basili, Lionel Briand, Audris Mockus, Carolyn Seaman and Claes Wohlin), while the mistakes in presenting them here are all mine.

A lot of our discussion was focussed on the observation that climate modeling (and software for computational science in general) is a very different kind of software engineering from most of what’s discussed in the SE literature. It’s like we’ve identified a new species of software engineering, which appears to be an outlier (perhaps an entirely new phylum?). This discovery (and the resulting comparisons) seems to tell us a lot about the other species that we thought we already understood.

The SE research community hasn’t really tackled the question of how the different contexts in which software development occurs might affect software development practices, nor when and how it’s appropriate to attempt to generalize empirical observations across different contexts. In our discussions at the workshop, we came up with many insights for mainstream software engineering, which means this is a two-way street: plenty of opportunity for re-examination of mainstream software engineering, as well as learning how to study SE for climate science. I should also say that many of our comparisons apply to computational science in general, not just climate science, although we used climate modeling for many specific examples.

We ended up discussing three closely related issues:

  1. How do we characterize/distinguish different points in this space (different species of software engineering)? We focussed particularly on how climate modeling is different from other forms of SE, but we also attempted to identify factors that would distinguish other species of SE from one another. We identified lots of contextual factors that seem to matter. We looked for external and internal constraints on the software development project that seem important. External constraints are things like resource limitations, or particular characteristics of customers or the environment where the software must run. Internal constraints are those that are imposed on the software team by itself, for example, choices of working style, project schedule, etc.
  2. Once we’ve identified what we think are important distinguishing traits (or constraints), how do we investigate whether these are indeed salient contextual factors? Do these contextual factors really explain observed differences in SE practices, and if so how? We need to consider how we would determine this empirically. What kinds of study are needed to investigate these contextual factors? How should the contextual factors be taken into account in other empirical studies?
  3. Now imagine we have already characterized this space of species of SE. What measures of software quality attributes (e.g. defect rates, productivity, portability, changeability…) are robust enough to allow us to make valid comparisons between species of SE? Which metrics can be applied in a consistent way across vastly different contexts? And if none of the traditional software engineering metrics (e.g. for quality, productivity, …) can be used for cross-species comparison, how can we do such comparisons?

In my study of the climate modelers at the UK Met Office Hadley Centre, I had identified a list of potential success factors that might explain why the climate modelers appear to be successful (i.e. to the extent that we are able to assess it, they appear to build good quality software with low defect rates, without following a standard software engineering process). My list was:

  • Highly tailored software development process – software development is tightly integrated into scientific work;
  • Single Site Development – virtually all coupled climate models are developed at a single site, or at least managed and coordinated from a single site once they become sufficiently complex [edited – see Bob’s comments below], usually a government lab, as universities don’t have the resources;
  • Software developers are domain experts – they do not delegate programming tasks to programmers, which means they avoid the misunderstandings of the requirements common in many software projects;
  • Shared ownership and commitment to quality, which means that the software developers are more likely to make contributions to the project that matter over the long term (in contrast to, say, offshored software development, where developers are only likely to do the tasks they are immediately paid for);
  • Openness – the software is freely shared with a broad community, which means that there are plenty of people examining it and identifying defects;
  • Benchmarking – there are many groups around the world building similar software, with regular, systematic comparisons on the same set of scenarios, through model inter-comparison projects (this trait could be unique – we couldn’t think of any other type of software for which this is done so widely);
  • Unconstrained Release Schedule – as there is no external customer, software releases are unhurried, and occur only when the software is considered stable and tested enough.

At the workshop we identified many more distinguishing traits, any of which might be important:

  • A stable architecture, defined by physical processes: atmosphere, ocean, sea ice, land scheme, …. All GCMs have the same conceptual architecture, and it has remained unchanged since modeling began, because it is derived from the natural boundaries of the physical processes being simulated [edit: I mean the top level organisation of the code, not the choice of numerical methods, which do vary across models – see Bob’s comments below]. This is used as an organising principle both for the code modules and for the teams of scientists who contribute code (a minimal sketch of this organisation appears after this list). However, the modelers don’t necessarily get some of the usual benefits of a stable software architecture, such as information hiding and limiting the impact of code changes, because the modules have very complex interfaces between them.
  • The modules and the integrated system each have independent lives, owned by different communities. For example, a particular ocean model might be used uncoupled by a large community, and also be integrated into several different coupled climate models at different labs. The communities who care about the ocean model on its own will have different needs and priorities from each of the communities who care about the coupled models. Hence, the inter-dependence has to be continually re-negotiated. Some other forms of software have this feature too: Audris mentioned voice response systems in telecoms, which can be used stand-alone, and also in integrated call centre software; Lionel mentioned some types of embedded control systems onboard ships, where the modules are used independently on some ships, and as part of a larger integrated command and control system on others.
  • The software has huge societal importance, but the impact of software errors is very limited. First, a contrast: in automotive software, a software error can immediately lead to death, or to huge expense and legal liability as cars are recalled. What would be the impact of software errors in climate models? An error may affect some of the experiments performed on the model, with perhaps the most serious consequence being the need to withdraw published papers (although I know of no cases where this has happened because of software errors rather than methodological errors). Because there are many other modeling groups, and because scientific results are filtered through processes of replication and systematic assessment of the overall scientific evidence, the impact of software errors on, say, climate policy is effectively nil. I guess it is possible that systematic errors are being made by many different climate modeling groups in the same way, but these wouldn’t be coding errors – they would be errors in the understanding of the physical processes and how best to represent them in a model.
  • The programming language of choice is Fortran, and this is unlikely to change, for very good reasons. The reasons are simple: there is a huge body of legacy Fortran code, everyone in the community knows and understands Fortran (and many of them know only Fortran), and Fortran is ideal for much of the work of coding up the mathematical formulae that represent the physics. Oh, and performance matters enough that the overhead of object-oriented languages makes them unattractive. Several climate scientists have pointed out to me that it probably doesn’t matter what language they use: the bulk of the code would look pretty much the same – long chunks of sequential code implementing a series of equations (the sketch after this list gives a flavour of this style, albeit not in Fortran). Which means there’s really no push to discard Fortran.
  • Existence and use of shared infrastructure and frameworks. An example used by pretty much every climate model is MPI. However, unlike Fortran, which is generally liked (if not loved), MPI is universally hated. If there were something better, they would use it. [OpenMP doesn’t seem to have a bigger fan club.] There are also frameworks for structuring climate models and coupling the different physics components (more on these in a subsequent post); a small halo-exchange sketch after this list gives a flavour of the kind of low-level MPI code involved. Use of frameworks is an internal constraint that will distinguish some species of software engineering, although I’m really not clear how it relates to choices of software development process. More research needed.
  • The software developers are very smart people. Typically with PhDs in physics or related geosciences. When we discussed this in the group, we all agreed this is a very significant factor, and that you don’t need much (formal) process with very smart people. But we couldn’t think of any existing empirical evidence to support such a claim. So we speculated that we needed a multi-case case study, with some cases representing software built by very smart people (e.g. climate models, the Linux kernel, Apache, etc), and other cases representing software built by …. stupid people. But we felt we might have some difficulty recruiting subjects for such a study (unless we concealed our intent), and we would probably get into trouble once we tried to publish the results 🙂
  • The software is developed by users for their own use, and this software is mission-critical for them. I mentioned this above, but want to add something here. Most open source projects are built by people who want a tool for their own use, but that others might find useful too. The tools are built on the side (i.e. not part of the developers’ main job performance evaluations) but most such tools aren’t critical to the developers’ regular work. In contrast, climate models are absolutely central to the scientific work on which the climate scientists’ job performance depends. Hence, we described them as mission-critical, but only in a personal kind of way. If that makes sense.
  • The software is used to build a product line, rather than an individual product. All the main climate models have a number of different model configurations, representing different builds from the codebase (rather than say just different settings). In the extreme case, the UK Met Office produces several operational weather forecasting models and several research climate models from the same unified codebase, although this is unusual for a climate modeling group.
  • Testing focuses almost exclusively on integration testing. In climate modeling, there is very little unit testing, because it’s hard to specify an appropriate test for small units in isolation from the full simulation. Instead the focus is on very extensive integration tests, with daily builds, overnight regression testing, and a rigorous process of comparing the output from runs before and after each code change (see the comparison sketch after this list). In contrast, most other types of software engineering focus instead on unit testing, with elaborate test harnesses to test pieces of the software in isolation from the rest of the system. In embedded software, the testing environment usually needs to simulate the operational environment; the most extreme case I’ve seen is the software for the International Space Station, where the only end-to-end software integration was the final assembly in low earth orbit.
  • Software development activities are completely entangled with a wide set of other activities: doing science. This makes it almost impossible to assess software productivity in the usual way, or even to estimate the total development cost of the software. We tried this as a thought experiment at the Hadley Centre, and quickly gave up: there is no sensible way of drawing a boundary between the activities that could be regarded as contributing to model development and those that could not. The only reasonable path to assessing productivity that we can think of must focus on time-to-results, or time-to-publication, rather than on software development and delivery.
  • Optimization doesn’t help. This is interesting, because one might expect climate modelers to put a huge amount of effort into optimization, given that century-long climate simulations still take weeks/months on some of the world’s fastest supercomputers. In practice, optimization, where it is done, tends to be an afterthought. The reason is that the model is changed so frequently that hand optimization of any particular model version is not useful. Plus the code has to remain very understandable, so very clever designed-in optimizations tend to be counter-productive.
  • There are very few resources available for software infrastructure. Most of the funding is concentrated on the frontline science (and the costs of buying and operating supercomputers). It’s very hard to divert any of this funding to software engineering support, so development of the software infrastructure is sidelined and sporadic.
  • …and last but not least, A very politically charged atmosphere. A large number of people actively seek to undermine the science, and to discredit individual scientists, for political (ideological) or commercial (revenue protection) reasons. We discussed how much this directly impacts the climate modelers, and I have to admit I don’t really know. My sense is that all of the modelers I’ve interviewed are shielded to a large extent from the political battles (I never asked them about this). The scientists who have been directly attacked (e.g. Mann, Jones, Santer) tend to be those more involved in the creation and analysis of datasets, rather than GCM developers. However, I also think the situation is changing rapidly, especially in the last few months, and climate scientists of all types are starting to feel more exposed.
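
To make two of the traits above more concrete (the stable, physics-derived architecture, and the “long chunks of sequential code implementing a series of equations”), here is a minimal sketch. It is hypothetical, written in Python rather than Fortran for brevity, and the component names, equations and coefficients are placeholders rather than anything from a real model; the point is only the shape of the code: one module per physical domain, a thin coupler on top, and step routines that are straight-line sequences of equation updates.

```python
import numpy as np

# Hypothetical sketch: each physical domain is a component with its own state
# and a step routine; the coupler's structure mirrors the physical processes.

class Atmosphere:
    def __init__(self, shape):
        self.temperature = np.full(shape, 288.0)   # K (placeholder values)
        self.humidity = np.zeros(shape)            # kg/kg

    def step(self, dt, surface_flux):
        # In a real model this is a long, sequential series of equation updates;
        # the coefficients here are made up purely for illustration.
        self.temperature += dt * 1e-5 * surface_flux
        self.humidity += dt * 1e-10 * self.temperature
        return self.temperature

class Ocean:
    def __init__(self, shape):
        self.sst = np.full(shape, 287.0)           # sea surface temperature, K

    def step(self, dt, air_temperature):
        self.sst += dt * 1e-4 * (air_temperature - self.sst)
        return self.sst

class CoupledModel:
    """The stable top-level architecture: one component per physical domain
    (sea ice, land scheme, etc. would be further components here)."""
    def __init__(self, shape=(90, 144)):
        self.atmosphere = Atmosphere(shape)
        self.ocean = Ocean(shape)

    def run(self, n_steps, dt=1800.0):
        for _ in range(n_steps):
            air_t = self.atmosphere.step(dt, surface_flux=self.ocean.sst - 287.0)
            self.ocean.step(dt, air_t)

CoupledModel().run(n_steps=48)   # one simulated day at a 30-minute timestep
```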
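
On the shared-infrastructure point, here is a flavour of the low-level MPI code such models rely on: exchanging halo rows between neighbouring sub-domains of a decomposed grid. This is a minimal sketch assuming mpi4py and a simple 1-D domain decomposition; the array sizes and neighbour logic are illustrative, not taken from any real model.

```python
# Run with e.g.: mpirun -n 4 python halo_sketch.py  (hypothetical file name)
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

nlocal = 10                                       # interior rows on this rank
field = np.full((nlocal + 2, 64), float(rank))    # two extra rows of halo

left = rank - 1 if rank > 0 else MPI.PROC_NULL
right = rank + 1 if rank < size - 1 else MPI.PROC_NULL

# Send my last interior row to the right neighbour; receive the matching row
# from my left neighbour into my lower halo.
comm.Sendrecv(sendbuf=field[nlocal, :].copy(), dest=right,
              recvbuf=field[0, :], source=left)
# Send my first interior row to the left neighbour; receive into my upper halo.
comm.Sendrecv(sendbuf=field[1, :].copy(), dest=left,
              recvbuf=field[nlocal + 1, :], source=right)
```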
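
And on the testing point, the heart of the regression process described above is a before-versus-after comparison of model output. Here is a sketch of what such a check might look like; it is not taken from any real model’s test harness, and the variable names and stand-in data are invented for illustration.

```python
import numpy as np

def compare_runs(baseline, candidate, variables, rtol=0.0, atol=0.0):
    """Return the list of output fields that differ between two runs.
    rtol=atol=0.0 demands bit-for-bit identical output, the usual expectation
    for a code change that is not supposed to alter the model's results."""
    return [var for var in variables
            if not np.allclose(baseline[var], candidate[var], rtol=rtol, atol=atol)]

# Stand-ins for the saved output of the control run and the run with the change:
control = {"surface_temperature": np.full((90, 144), 288.0),
           "precipitation": np.zeros((90, 144))}
test = {"surface_temperature": control["surface_temperature"] + 1e-7,  # tiny drift
        "precipitation": control["precipitation"].copy()}

diffs = compare_runs(control, test, ["surface_temperature", "precipitation"])
print("FAIL:" if diffs else "PASS", diffs)   # the drift makes this a FAIL
```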

We also speculated about some other contextual factors that might distinguish different software engineering species, not necessarily related to our analysis of computational science software. For example:

  • Existence of competitors;
  • Whether software is developed for single-person-use versus intended for broader user base;
  • Need for certification (and different modes by which certification might be done, for example where there are liability issues, and the need to demonstrate due diligence);
  • Whether software is expected to tolerate and/or compensate for hardware errors. For example, in automotive software, much of the complexity comes from building fault-tolerance into the software, because correcting hardware problems introduced in design or manufacture is prohibitively expensive. We pondered how often hardware errors occur in supercomputer installations, and whether, if they did, they would affect the software. I’ve no idea of the answer to the first question, but the second is readily handled by the checkpoint and restart features built into all climate models (sketched below). Audris pointed out that given the volumes of data being handled (terabytes per day), there are almost certainly errors introduced in storage and retrieval (i.e. bits getting flipped), and enough of them that standard error correction would still miss a few. However, there’s enough noise in the data that such things probably go unnoticed, although we did speculate about what would happen if the most significant bit got flipped in some important variable.
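
For what it’s worth, the checkpoint-and-restart pattern mentioned in that last item looks roughly like the following. This is a sketch of the general idea only, not any model’s actual restart machinery; the state variables, file name and checkpoint interval are all invented for illustration.

```python
# Periodically dump the full model state, so that after a node failure (or a
# corrupted run) the simulation can resume from the last good checkpoint
# instead of starting over.
import os
import numpy as np

CHECKPOINT = "restart.npz"   # hypothetical restart file name

def save_checkpoint(step, state):
    np.savez(CHECKPOINT, step=step, **state)

def load_checkpoint():
    if not os.path.exists(CHECKPOINT):
        return 0, {"temperature": np.full((90, 144), 288.0)}  # cold start
    data = np.load(CHECKPOINT)
    return int(data["step"]), {"temperature": data["temperature"]}

step, state = load_checkpoint()          # resume if a checkpoint exists
while step < 1000:
    state["temperature"] += 0.001        # stand-in for one model timestep
    step += 1
    if step % 100 == 0:
        save_checkpoint(step, state)     # e.g. checkpoint every 100 steps
```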

More interestingly, we talked about what happens when these contextual factors change over time. For example, the emergence of a competitor where there was none previously, or the creation of a new regulatory framework where none existed. Or even, in the case of health care, when a change in the regulatory framework relaxes a constraint – such as the recent US healthcare bill, under which it (presumably) becomes easier to share health records among medical professionals, because knowledge of pre-existing conditions is no longer a critical privacy concern. An example from climate modeling: software that was originally developed as part of a PhD project, intended for use by just one person, eventually grows into a vast legacy system, because it turns out to be a really useful model for the community to use. And another: the move from single site development (which is how nearly all climate models were developed) to geographically distributed development, now that it’s getting increasingly hard to gather all the necessary expertise under one roof, because of the growing diversity of science included in the models.

We think there are lots of interesting studies to be done of what happens to the software development processes for different species of software when such contextual factors change.

Finally, we talked a bit about the challenge of finding metrics that are valid across the vastly different contexts of the various software engineering species we identified. Experience with trying to measure defect rates in climate models suggests that it is much harder to make valid comparisons than is generally presumed in the software literature. The literature has not given any serious consideration to these contextual factors and their impact on software practices, and hence we might need to re-think a lot of the ways in which claims of generality are handled in empirical software engineering studies. We spent some time talking about the specific case of defect measurements, but I’ll save that for a future post.