This week I attended a Dagstuhl seminar on New Frontiers for Empirical Software Engineering. It was a select gathering, with many great people, which meant lots of fascinating discussions, and not enough time to type up all the ideas we’ve been bouncing around. I was invited to run a working group on the challenges to empirical software engineering posed by climate change. I started off with a quick overview of the three research themes we identified at the Oopsla workshop in the fall:
- Climate Modeling, which we could characterize as a kind of end-user software development, embedded in a scientific process;
- Global collective decision-making, which involves creating the software infrastructure for collective curation of sources of evidence in a highly charged political atmosphere;
- Green Software Engineering, including carbon accounting for the software systems lifecycle (development, operation and disposal), but where we have no existing measurement framework, and a tendency to make unsupported claims (aka greenwashing).
Inevitably, we spent most of our time this week talking about the first topic – software engineering of computational models, as that’s the closest to the existing expertise of the group, and the most obvious place to start.
So, here’s a summary of our discussions. The bright ideas are due to the group (Vic Basili, Lionel Briand, Audris Mockus, Carolyn Seaman and Claes Wohlin), while the mistakes in presenting them here are all mine.
A lot of our discussion was focussed on the observation that climate modeling (and software for computational science in general) is a very different kind of software engineering than most of what’s discussed in the SE literature. It’s like we’ve identified a new species of software engineering, which appears to be an outlier (perhaps an entirely new phylum?). This discovery (and the resulting comparisons) seems to tell us a lot about the other species that we thought we already understood.
The SE research community hasn’t really tackled the question of how the different contexts in which software development occurs might affect software development practices, nor when and how it’s appropriate to attempt to generalize empirical observations across different contexts. In our discussions at the workshop, we came up with many insights for mainstream software engineering, which means this is a two-way street: plenty of opportunity for re-examination of mainstream software engineering, as well as learning how to study SE for climate science. I should also say that many of our comparisons apply to computational science in general, not just climate science, although we used climate modeling for many specific examples.
We ended up discussing three closely related issues:
- How do we characterize/distinguish different points in this space (different species of software engineering)? We focussed particularly on how climate modeling is different from other forms of SE, but we also attempted to identify factors that would distinguish other species of SE from one another. We identified lots of contextual factors that seem to matter. We looked for external and internal constraints on the software development project that seem important. External constraints are things like resource limitations, or particular characteristics of customers or the environment where the software must run. Internal constraints are those that are imposed on the software team by itself, for example, choices of working style, project schedule, etc.
- Once we’ve identified what we think are important distinguishing traits (or constraints), how do we investigate whether these are indeed salient contextual factors? Do these contextual factors really explain observed differences in SE practices, and if so how? We need to consider how we would determine this empirically. What kinds of study are needed to investigate these contextual factors? How should the contextual factors be taken into account in other empirical studies?
- Now imagine we have already characterized this space of species of SE. What measures of software quality attributes (e.g. defect rates, productivity, portability, changeability…) are robust enough to allow us to make valid comparisons between species of SE? Which metrics can be applied in a consistent way across vastly different contexts? And if none of the traditional software engineering metrics (e.g. for quality, productivity, …) can be used for cross-species comparison, how can we do such comparisons?
In my study of the climate modelers at the UK Met Office Hadley centre, I had identified a list of potential success factors that might explain why the climate modelers appear to be successful (i.e. to the extent that we are able to assess it, they appear to build good quality software with low defect rates, without following a standard software engineering process). My list was:
- Highly tailored software development process – software development is tightly integrated into scientific work;
- Single Site Development – virtually all coupled climate models are developed, managed and coordinated at a single site once they become sufficiently complex [edited – see Bob’s comments below], usually a government lab, as universities don’t have the resources;
- Software developers are domain experts – they do not delegate programming tasks to programmers, which means they avoid the misunderstandings of the requirements common in many software projects;
- Shared ownership and commitment to quality, which means that the software developers are more likely to make contributions to the project that matter over the long term (in contrast to, say, offshored software development, where developers are only likely to do the tasks they are immediately paid for);
- Openness – the software is freely shared with a broad community, which means that there are plenty of people examining it and identifying defects;
- Benchmarking – there are many groups around the world building similar software, with regular, systematic comparisons on the same set of scenarios, through model inter-comparison projects (this trait could be unique – we couldn’t think of any other type of software for which this is done so widely; a minimal sketch of such a comparison appears after this list).
- Unconstrained Release Schedule – as there is no external customer, software releases are unhurried, and occur only when the software is considered stable and tested enough.
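To make the benchmarking point a little more concrete, here’s a minimal sketch of the kind of comparison an inter-comparison project boils down to: run every model on the same scenario, then compare a summary statistic across the ensemble. The file names, the NumPy layout and the unweighted mean are all hypothetical simplifications, not how any real project (e.g. CMIP) actually does it.

```python
# Illustrative sketch only: comparing a summary statistic across several models
# run on the same scenario. File names and data layout are hypothetical.
import numpy as np

def global_mean_series(path):
    """Load a (time, lat, lon) temperature array, return its global mean per timestep."""
    data = np.load(path)            # hypothetical: one .npy dump per model run
    return data.mean(axis=(1, 2))   # crude unweighted mean; real studies area-weight by latitude

models = {"model_A": "model_A_scenario1.npy",
          "model_B": "model_B_scenario1.npy",
          "model_C": "model_C_scenario1.npy"}

series = {name: global_mean_series(path) for name, path in models.items()}
ensemble = np.vstack(list(series.values()))

# Assuming monthly output, the last 120 samples are the final decade of each run.
print("ensemble mean over final decade:", ensemble[:, -120:].mean())
print("inter-model spread (std dev):   ", ensemble[:, -120:].mean(axis=1).std())
```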
At the workshop we identified many more distinguishing traits, any of which might be important:
- A stable architecture, defined by physical processes: atmosphere, ocean, sea ice, land scheme,…. All GCMs have the same conceptual architecture, and it is unchanged since modeling began, because it is derived from the natural boundaries in physical processes being simulated [edit: I mean the top level organisation of the code, not the choice of numerical methods, which do vary across models – see Bob’s comments below]. This is used as an organising principle both for the code modules, and also for the teams of scientists who contribute code. However, the modelers don’t necessarily derive some of the usual benefits of stable software architectures, such as information hiding and limiting the impacts of code changes, because the modules have very complex interfaces between them.
- The modules and integrated system each have independent lives, owned by different communities. For example, a particular ocean model might be used uncoupled by a large community, and also be integrated into several different coupled climate models at different labs. The communities who care about the ocean model on its own will have different needs and priorities than each of the communities who care about the coupled models. Hence, the inter-dependence has to be continually re-negotiated. Some other forms of software have this feature too: Audris mentioned voice response systems in telecoms, which can be used stand-alone, and also in integrated call centre software; Lionel mentioned some types of embedded control systems onboard ships, where the modules are used independently on some ships, and as part of a larger integrated command and control system on others.
- The software has huge societal importance, but the impact of software errors is very limited. First, a contrast: for automotive software, a software error can immediately lead to death, or huge expense, legal liability, etc, as cars are recalled. What would be the impact of software errors in climate models? An error may affect some of the experiments performed on the model, with perhaps the most serious consequence being the need to withdraw published papers (although I know of no cases where this has happened because of software errors rather than methodological errors). Because there are many other modeling groups, and scientific results are filtered through processes of replication, and systematic assessment of the overall scientific evidence, the impact of software errors on, say, climate policy is effectively nil. I guess it is possible that systematic errors are being made by many different climate modeling groups in the same way, but these wouldn’t be coding errors – they would be errors in the understanding of the physical processes and how best to represent them in a model.
- The programming language of choice is Fortran, and is unlikely to change for very good reasons. The reasons are simple: there is a huge body of legacy Fortran code, everyone in the community knows and understands Fortran (and for many of them, only Fortran), and Fortran is ideal for much of the work of coding up the mathematical formulae that represent the physics. Oh, and performance matters enough that the overhead of object oriented languages makes them unattractive. Several climate scientists have pointed out to me that it probably doesn’t matter what language they use, the bulk of the code would look pretty much the same – long chunks of sequential code implementing a series of equations. Which means there’s really no push to discard Fortran.
- Existence and use of shared infrastructure and frameworks. An example used by pretty much every climate model is MPI. However, unlike Fortran, which is generally liked (if not loved), everyone universally hates MPI. If there was something better they would use it. [OpenMP doesn’t seem to have any bigger fanclub]. There are also frameworks for structuring climate models and coupling the different physics components (more on these in a subsequent post). Use of frameworks is an internal constraint that will distinguish some species of software engineering, although I’m really not clear how it will relate to choices of software development process. More research needed.
- The software developers are very smart people. Typically with PhDs in physics or related geosciences. When we discussed this in the group, we all agreed this is a very significant factor, and that you don’t need much (formal) process with very smart people. But we couldn’t think of any existing empirical evidence to support such a claim. So we speculated that we needed a multi-case case study, with some cases representing software built by very smart people (e.g. climate models, the Linux kernel, Apache, etc), and other cases representing software built by …. stupid people. But we felt we might have some difficulty recruiting subjects for such a study (unless we concealed our intent), and we would probably get into trouble once we tried to publish the results 🙂
- The software is developed by users for their own use, and this software is mission-critical for them. I mentioned this above, but want to add something here. Most open source projects are built by people who want a tool for their own use, but that others might find useful too. The tools are built on the side (i.e. not part of the developers’ main job performance evaluations) but most such tools aren’t critical to the developers’ regular work. In contrast, climate models are absolutely central to the scientific work on which the climate scientists’ job performance depends. Hence, we described them as mission-critical, but only in a personal kind of way. If that makes sense.
- The software is used to build a product line, rather than an individual product. All the main climate models have a number of different model configurations, representing different builds from the codebase (rather than say just different settings). In the extreme case, the UK Met Office produces several operational weather forecasting models and several research climate models from the same unified codebase, although this is unusual for a climate modeling group.
- Testing focuses almost exclusively on integration testing. In climate modeling, there is very little unit testing, because it’s hard to specify an appropriate test for small units in isolation from the full simulation. Instead the focus is on very extensive integration tests, with daily builds, overnight regression testing, and a rigorous process of comparing the output from runs before and after each code change (a minimal sketch of this kind of before/after comparison appears after this list). In contrast, most other types of software engineering focus instead on unit testing, with elaborate test harnesses to test pieces of the software in isolation from the rest of the system. In embedded software, the testing environment usually needs to simulate the operational environment; the most extreme case I’ve seen is the software for the international space station, where the only end-to-end software integration was the final assembly in low earth orbit.
- Software development activities are completely entangled with a wide set of other activities: doing science. This makes it almost impossible to assess software productivity in the usual way, or even to estimate the total development cost of the software. We tried this as a thought experiment at the Hadley Centre, and quickly gave up: there is no sensible way of drawing a boundary to distinguish some set of activities that could be regarded as contributing to the model development, from other activities that could not. The only reasonable path to assessing productivity that we can think of must focus on time-to-results, or time-to-publication, rather than on software development and delivery.
- Optimization doesn’t help. This is interesting, because one might expect climate modelers to put a huge amount of effort into optimization, given that century-long climate simulations still take weeks/months on some of the world’s fastest supercomputers. In practice, optimization, where it is done, tends to be an afterthought. The reason is that the model is changed so frequently that hand optimization of any particular model version is not useful. Plus the code has to remain very understandable, so very clever designed-in optimizations tend to be counter-productive.
- There are very few resources available for software infrastructure. Most of the funding is concentrated on the frontline science (and the costs of buying and operating supercomputers). It’s very hard to divert any of this funding to software engineering support, so development of the software infrastructure is sidelined and sporadic.
- …and last but not least, A very politically charged atmosphere. A large number of people actively seek to undermine the science, and to discredit individual scientists, for political (ideological) or commercial (revenue protection) reasons. We discussed how much this directly impacts the climate modellers, and I have to admit I don’t really know. My sense is that all of the modelers I’ve interviewed are shielded to a large extent from the political battles (I never asked them about this). Those scientists who have been directly attacked (e.g. Mann, Jones, Santer) tend to be scientists more involved in creation and analysis of datasets, rather than GCM developers. However, I also think the situation is changing rapidly, especially in the last few months, and climate scientists of all types are starting to feel more exposed.
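The bullet on testing above mentions comparing the output of runs before and after each code change. Here’s a minimal sketch of what such a comparison might look like; the .npz dump format, field names and zero tolerance are my own hypothetical simplifications, since each modeling centre has its own tooling for this.

```python
# Minimal sketch of a before/after regression comparison, assuming each run dumps
# its prognostic fields to a NumPy .npz file. Names and tolerances are hypothetical.
import numpy as np

def compare_runs(baseline_path, trial_path, rtol=0.0, atol=0.0):
    """Compare two model dumps field by field; zero tolerance means an exact match."""
    baseline = np.load(baseline_path)
    trial = np.load(trial_path)
    return [name for name in baseline.files
            if not np.allclose(baseline[name], trial[name], rtol=rtol, atol=atol)]

diffs = compare_runs("run_before_change.npz", "run_after_change.npz")
if diffs:
    print("Fields differing from the baseline run:", diffs)
else:
    print("Output matches the baseline run.")
```

Of course, some changes are expected to alter the results, and deciding whether the new output is acceptable is then a scientific judgement rather than something a script can make.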
We also speculated about some other contextual factors that might distinguish different software engineering species, not necessarily related to our analysis of computational science software. For example:
- Existence of competitors;
- Whether software is developed for single-person-use versus intended for broader user base;
- Need for certification (and different modes by which certification might be done, for example where there are liability issues, and the need to demonstrate due diligence)
- Whether software is expected to tolerate and/or compensate for hardware errors. For example, for automotive software, much of the complexity comes from building fault-tolerance into the software because correcting hardware problems introduced in design or manufacture is prohibitively expensive. We pondered how often hardware errors occur in supercomputer installations, and whether, if they did, it would affect the software. I’ve no idea of the answer to the first question, but the second is readily handled by the checkpoint and restart features built into all climate models (a minimal sketch of that idea follows this list). Audris pointed out that given the volumes of data being handled (terabytes per day), there are almost certainly errors introduced in storage and retrieval (i.e. bits getting flipped), and enough that standard error correction would still miss a few. However, there’s enough noise in the data that in general, such things probably go unnoticed, although we speculated what would happen when the most significant bit gets flipped in some important variable.
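Here is the minimal checkpoint/restart sketch promised above. The state layout and file name are hypothetical; the point is just that a failed node or a flipped bit costs at most the work done since the last checkpoint.

```python
# Illustrative checkpoint/restart sketch; state layout and file name are hypothetical.
import numpy as np

def checkpoint(state, step, path="checkpoint.npz"):
    """Dump the full model state so a run can resume after a crash."""
    np.savez(path, step=step, **state)

def restart(path="checkpoint.npz"):
    """Reload the most recent checkpoint and the step it was written at."""
    saved = np.load(path)
    state = {name: saved[name] for name in saved.files if name != "step"}
    return state, int(saved["step"])

state = {"temperature": np.zeros((90, 180)), "salinity": np.zeros((90, 180))}
for step in range(1000):
    # ... advance the model state by one timestep here ...
    if step % 100 == 0:
        checkpoint(state, step)   # after a crash: state, step = restart()
```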
More interestingly, we talked about what happens when these contextual factors change over time. For example, the emergence of a competitor where there was none previously, or the creation of a new regulatory framework where none existed. Or even, in the case of health care, when change in the regulatory framework relaxes a constraint – such as the recent US healthcare bill, under which it (presumably) becomes easier to share health records among medical professionals if knowledge of pre-existing conditions is no longer a critical privacy concern. An example from climate modeling: software that was originally developed as part of a PhD project intended for use by just one person eventually grows into a vast legacy system, because it turns out to be a really useful model for the community to use. And another: the move from single site development (which is how nearly all climate models were developed) to geographically distributed development, now that it’s getting increasingly hard to get all the necessary expertise under one roof, because of the increasing diversity of science included in the models.
We think there are lots of interesting studies to be done of what happens to the software development processes for different species of software when such contextual factors change.
Finally, we talked a bit about the challenge of finding metrics that are valid across the vastly different contexts of the various software engineering species we identified. Experience with trying to measure defect rates in climate models suggests that it is much harder to make valid comparisons than is generally presumed in the software literature. There really has not been any serious consideration of these various contextual factors and their impact on software practices in the literature, and hence we might need to re-think a lot of the ways in which claims for generality are handled in empirical software engineering studies. We spent some time talking about the specific case of defect measurements, but I’ll save that for a future post.
Very timely Steve. Thanks. I’m writing a talk for next week on software infrastructures for earth system modelling. I think you’ve just given me a few slides – which will of course get due attribution.
I think you miss by a mile on a couple of points. The first is
“Single Site Development – virtually all climate models are developed at a single site, usually a government lab as universities don’t have the resources;”
Although I am not directly involved in climate modeling, I am involved in mesoscale modeling (1000 -> 300 meter horizontal resolution). Like the climate modeling community, the mesoscale model development community works at both local universities AND national labs. Many of the mesoscale models AND climate models were initially developed at universities then migrated to the larger computing facilities over time. MIROC3.2, INGV-SXG, ECHAM5/MPI-OM are examples of university generated models. PCM from NCAR is a merger of multiple model types from universities into a single framework model. MM5, RAMS and WRF are examples of mesoscale models that began at universities.
A second miss is
“A stable architecture, defined by physical processes: atmosphere, ocean, sea ice, land scheme,…. All GCMs have the same conceptual architecture, and it is unchanged since modeling began, because it is derived from the natural boundaries in physical processes being simulated.”
You are correct that the models are constrained by physical processes, but the models DO NOT have the same conceptual architecture and certainly have changed significantly. Indeed the primary point of having multiple models is that they don’t have the same conceptual architecture (unless you mean simulate the climate). The underlying methods are very different. Finite differencing vs spectral methods vs finite element vs analytical solutions all have significantly different methods for solving the same problem.
Thirdly
“The programming language of choice is Fortran, and is unlikely to change for very good reasons. The reasons are simple: there is a huge body of legacy Fortran code, everyone in the community knows and understands Fortran (and for many of them, only Fortran), and Fortran is ideal for much of the work of coding up the mathematical formulae that represent the physics.”
Fortran is the language of choice and the reason has nothing to do with legacy code. Nearly all modelers that I know are fluent not only in Fortran, but C, C++, and Perl as well. Fortran is the language used because it allows you to express the mathematics and physics in a very clear succinct fashion. The idea here is that a craftsman has many tools in his tool chest the amateur believes everything is a nail. The only common feature in terms of programming tools amongst modelers is a universal HATRED of object-oriented programming languages, particularly python.
Object-oriented programming is the answer to a question that nobody has ever felt the need to ask. Programming in an object-oriented language is like kicking a dead whale down the beach
Bob: Thanks for your comments – that’s very helpful feedback.
The single site development thing is interesting, and I grossly oversimplified. I was trying to say that it’s an important success factor at the UK Met Office; I’m now exploring this assertion by looking at other models for which development is more distributed. I think I could claim that in general, once a coupled GCM becomes sufficiently complex, there is a tendency to migrate to a single site. This is not universally true, but where it is not, it causes coordination problems. The UK Met Office is an extreme example of single site development; other places have a more federated approach. For example, NCAR’s CCSM is a genuine community model, with some of the submodels managed at other sites. NCAR coordinates how these multiple communities contribute to the coupled model, with a team dedicated to managing coordination of the various community contributions, folding externally contributed changes into the coupled model.
On architectures, we appear to have different definitions of the word “architecture”. I’m talking about the top level structure of the code in a coupled model, and the corresponding organisation of teamwork. You’re talking about the implementation choices for the core numerical equations in an atmospheric and/or ocean model. Software architecture (at least the way software engineers use the term) is different from choice of implementation algorithm.
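To make the distinction concrete, here’s a deliberately over-simplified structural sketch of what I mean by the top-level organisation (Python only for brevity; real models are Fortran, and none of the names below come from any actual model):

```python
# Over-simplified sketch of the top-level structure of a coupled model: one
# component per physical domain, with a coupler exchanging boundary fluxes.
# Each component's internal numerics (finite difference, spectral, ...) are
# hidden behind its step() method; those choices differ between models, but
# the top-level decomposition does not.

class Atmosphere:
    def step(self, fluxes): pass

class Ocean:
    def step(self, fluxes): pass

class SeaIce:
    def step(self, fluxes): pass

class LandScheme:
    def step(self, fluxes): pass

class Coupler:
    """Regrids and exchanges boundary fluxes (heat, moisture, momentum)."""
    def exchange(self, components):
        return {}   # placeholder for the regridded flux fields

def run(n_steps):
    components = [Atmosphere(), Ocean(), SeaIce(), LandScheme()]
    coupler = Coupler()
    for _ in range(n_steps):
        fluxes = coupler.exchange(components)
        for component in components:
            component.step(fluxes)

run(10)
```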
On Fortran, I completely agree with the reasons you give for Fortran being used; I disagree with the comment that it has nothing to do with legacy code. For climate modeling centers, it would be almost impossible to discard the existing models, or to port them to another language. Both this, and the years of experience in using Fortran in the community act as a strong constraint on language choice. We could debate the relative impact of the reasons each of us gives, but it would be pointless – the fact is that Fortran is the language of choice, and there are many good reasons why this is so.
I like python, but my python tends to look like Fortran, and I mostly call stuff written in Fortran and wrapped with F2Py (Like Bob, I’m not a climate modeler either; my entire domain might reach 300 meters, but the software doesn’t care about the scale, everything is non-dimensionalized anyway).
The other thing to realize is that Fortran (modern 90/95 etc) is a high level language for scientific computing. Write down pseudo-code for an algorithm involving lots of operations on vectors and it looks pretty much the same after you translate it into Fortran90 or Matlab.
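For instance, a toy upwind advection update written with whole-array operations (a made-up fragment, purely to illustrate the point) maps almost line-for-line onto Fortran 90 array syntax or Matlab:

```python
# Toy fragment only: one first-order upwind advection update, written with
# whole-array operations. The same lines translate almost one-to-one into
# Fortran 90 array syntax or Matlab.
import numpy as np

nx, c, dx, dt = 200, 1.0, 0.01, 0.004
x = np.linspace(0.0, 2.0, nx)
u = np.exp(-((x - 0.5) ** 2) / 0.01)           # initial Gaussian bump

for _ in range(100):
    u = u - c * dt / dx * (u - np.roll(u, 1))  # periodic upwind difference, CFL = 0.4
```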
Hi Steve,
Thanks for your insights into the minds of climate modelers. As a person who has spent decades doing technical/engineering programming in areas unrelated to climate modeling, I find the post very interesting.
You ask what makes software engineering for climate models different. The answer should be — nothing. Consensus software engineering practices should be entirely sufficient when developing climate software. That this is not the case (e.g., “without following a standard software engineering process”) is, IMHO, a significant issue. I come away from the post with the feeling the climate software community is insular. This is all the more odd to me since the science upon which climate software is based is mostly multidisciplinary.
You mention that for the climate models the “developers are domain experts – they do not delegate programming tasks to programmers, which means they avoid the misunderstandings of the requirements common in many software projects”.
Great idea. Maybe we should have farmers design and build tractors. Farmers are domain experts too.
Other “success factors” you mention, such as: highly tailored process, single site development, shared ownership, commitment to quality, openness, and benchmarking are all attributes commonly found in successful software projects, regardless of ilk. A question would be, how are these common attributes measured and documented for climate software? Are they real or imagined?
You mentioned an “unconstrained release schedule” for the climate software. There is a relationship between software schedule and the software’s functionality. Arbitrary schedules merely mean arbitrary functionality. This luxury has nothing to do with quality per se.
Then you mention distinguishing traits such as stable architecture and a modular/integrated/framework (component?, OO?) design. Again, these are often features of modern software of all types. So I fail to understand what is so distinguishing about such traits.
Then there is the statement: “The software has huge societal importance, but the impact of software errors is very limited.” I don’t see how it can be both ways. How can something be of great importance whether or not it is correct? IMHO, the most serious consequence of a climate software being defective would be to then use it to make a defective political decision costing trillions of dollars to society.
Then there is: “The software developers are very smart people.” That’s great, since software development is the most complicated thing people do. That’s because, in general, if people could program anything any more complicated, and the software still work, they would. But you see the problem with this? Being smart is only an advantage if you design and build simple software. But such is not the case with the climate software. It’s as complicated as people can make it.
I could go on, but I think I’ve made my point. IMHO, climate software developers are not as different as they may think.
George
[George: Read the post again. This is a set of hypotheses from a workshop brainstorming – the whole point is that we haven’t yet studied which of these dimensions matter – we’re trying to figure out ways of measuring them. You, however, seem to be proceeding from an assumption that if they don’t use standard SE processes, then there must be something wrong with their software. I would recommend studying the domain before making such grand assumptions. – Steve]
This comment interests me. I haven’t used Fortran, but have used Python and R for data analysis in my own work. Indeed, I’ve used Python for modeling; not a climate-related model, but a relatively complex model in the software engineering domain itself (as yet unpublished).
I did consider building my model in R, but ultimately chose Python, almost entirely on the basis that it is object-oriented. Though Python lacks the specialised libraries of R, it makes architectural design easier. It seems to have just enough in the way of functional language structures that succinctly writing model equations is not too much of a chore. I do make use of polymorphism, which I think has helped in achieving a reasonably clean design in my case. (In another language, I’m sure my code would run a lot faster, but that’s not my principal concern.)
I wonder if this preference is due to differences in the domain, or in problem solving conventions shaped by the languages themselves.
The dimensions that matter for the correctness of PDE solvers are pretty well established (see Roache’s early work, also more recently Oberkampf/Trucano/et al out of Sandia and a host of various others scattered about); the dimensions that matter for usefulness in decision support are probably a little more of an open research area (but there’s still plenty of work in this area that you guys seem to be ignoring, maybe this is too far away from the development process to interest you?). I don’t think anyone is jumping to the conclusion that “there’s necessarily something wrong with the code” (were you George?), but the conclusion that “we don’t know if there’s anything wrong (and neither do you)” is pretty well supported based on the sort of process you describe which does not include formal code verification, nor calculation verification (maybe you’re leaving those parts out because they are just a given? if so, man I’d really like to see some reports on grid convergence results for the models used in the IPCC’s write-ups). Validation will remain a fundamental problem for climate modeling, but that’s probably another discussion (and a fruitful area for new research btw). Rather than compare the practice of climate modelers to standard software development practice, it might be more instructive to compare their procedures to those of other computational physicists. There are glimmers of hope, but the results were not encouraging.
Maybe George and I are just interested in slightly different questions than your research community so we’re talking past each other a bit.
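In case “calculation verification” sounds exotic, the core of a grid convergence check is only a few lines: run the same case on three systematically refined grids and back out the observed order of accuracy (the numbers below are made up, refinement ratio r = 2; see Roache for the full procedure and its caveats).

```python
# Sketch of the observed-order-of-accuracy calculation used in grid convergence
# studies (after Roache). The three values are hypothetical results of the same
# functional computed on coarse, medium and fine grids with refinement ratio r.
import math

f_coarse, f_medium, f_fine = 1.0250, 1.0063, 1.0016   # hypothetical functional values
r = 2.0                                               # grid refinement ratio

p = math.log(abs(f_coarse - f_medium) / abs(f_medium - f_fine)) / math.log(r)
f_extrapolated = f_fine + (f_fine - f_medium) / (r ** p - 1)   # Richardson extrapolation

print(f"observed order of accuracy: {p:.2f}")
print(f"extrapolated value:         {f_extrapolated:.4f}")
```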
In my experience (when I was a climate modeller, talking to climate modellers), many who had had *enough* exposure to object orientation, would cheerfully admit that O-O codes could well have made some model tasks easier, but given that the job was already done in Fortran …
… so my experience is precisely the opposite of Bob’s. Further, many of the folk I know choose to use O-O languages, particularly python, to do data analysis, so far from hating it … it’s an analysis language of choice.
Which is all to say that any generalisations are just that. It’d be surprising if there weren’t exceptions. I guess the interesting question is whether these variations are clumped in particular communities or whether there is a broad distribution of behaviours. I look forward to Steve telling us (because most of us down our own rabbit holes have no time to pop our heads up and survey the landscape).
To some extent yes. I will get on to the kinds of correctness/validation questions you’re asking about, but need to do some more groundwork first. For now, I’m still exploring the broader context in which the coding takes place, and how activities are organised.
I’ve some papers on my “to read” pile that might answer some of your questions, Josh. I haven’t read ’em yet, though, so no guarantees:
http://dx.doi.org/10.1109/MCISE.2002.1032431
http://www.gfdl.noaa.gov/bibliography/related_files/ih9401.pdf
http://www.osti.gov/energycitations/product.biblio.jsp?osti_id=518379
http://ams.allenpress.com/perlserv/?doi=10.1175%2FMWR2788.1&request=get-document&ct=1
The first looks like a good overview; the last gets into the guts of testing the numerical routines. The middle two papers are a little older, and seem to date from the time when atmosphere models still needed flux adjustments. (Edit: oops, I just noticed the last of these is what you already linked to; maybe you should contact the authors and see what they’ve done more recently).
BTW Bryan points out my paper is behind a paywall. I’ve added a link here (http://www.easterbrook.ca/steve/?p=974) to the draft version which is a little longer (we had to condense it for page limits).
Grid convergence. I read the links, quickly, so sorry if I go off on the wrong tack. (Disclosure: I still consider myself a dynamicist, my day job used to be the science of gravity waves and their parameterisation in models. Why tell you this, because a lot of what I did was because we knew the grids didn’t converge, and I wouldn’t want you to think I was hiding that).
There is *NO WAY* we can run models at high enough resolution to have grids converge in the way I think was suggested as required. Even if we could do that, we could only afford to run one instance of the model anyway, and that wouldn’t be very interesting … certainly not for decadal and near term prediction, and frankly, for longer term, the climate problem just doesn’t seem to be susceptible to upscaling issues (at least for the global mean). It is absolutely an issue for the regional scale. But this isn’t about software engineering, it’s about the science that we are encoding in our models.
(parameterisations do have “fudge” factors in them, but we like to call those empirical adjustments. The key point being that they are empirical.)
Generally the resolution required to show ‘convergence’ depends on the functional; for functionals like the global mean (of anything), I’d think that you could show convergence (in fact one of those papers I linked mentions that some things converge and some things don’t). So the answer is “it depends” (isn’t that always the answer?). I’ve been meaning to do a post on the different convergence behavior of a weather-like functional compared to a climate-like functional using the Lorenz ’63 system, but it keeps getting pushed down my queue…
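The bare bones of that experiment fit in a few lines (a rough sketch: fixed-step RK4, the standard parameters, and the point of actually running it is to see whether the end state diverges between step sizes while the long-time mean of z barely moves):

```python
# Rough sketch of the Lorenz '63 comparison: a "weather-like" functional (the end
# state) versus a "climate-like" functional (the long-time mean of z), at two
# step sizes. Fixed-step RK4, standard parameters sigma=10, rho=28, beta=8/3.
import numpy as np

def lorenz(s, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    x, y, z = s
    return np.array([sigma * (y - x), x * (rho - z) - y, x * y - beta * z])

def integrate(dt, t_end):
    state = np.array([1.0, 1.0, 1.0])
    n = int(t_end / dt)
    zs = np.empty(n)
    for i in range(n):
        k1 = lorenz(state)
        k2 = lorenz(state + 0.5 * dt * k1)
        k3 = lorenz(state + 0.5 * dt * k2)
        k4 = lorenz(state + dt * k3)
        state = state + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)
        zs[i] = state[2]
    return state, zs[n // 10:].mean()   # discard the first 10% as spin-up

for dt in (0.01, 0.005):
    end_state, mean_z = integrate(dt, t_end=200.0)
    print(f"dt={dt}: end state {np.round(end_state, 2)}, long-time mean of z {mean_z:.2f}")
```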
Sure; I understand that, nobody’s (at least I’m not) asking for DNS or molecular dynamics simulations of atmospheric and ocean flows over centuries, but how solid is the empirical basis for your parameter choice when it is used as a fix for a grid you know is under-resolved? It seems reasonable, as you say, that the changes only matter for more regional things, but we really don’t know until we converge a solution (and aren’t the regional things where we can make a much more direct connection to the things people care about?).
Steve thanks for sharing from your reading list.
@George Crews
George you say
You mention that for the climate models the “developers are domain experts – they do not delegate programming tasks to programmers, which means they avoid the misunderstandings of the requirements common in many software projects”.
Great idea. Maybe we should have farmers design and build tractors. Farmers are domain experts too.
You haven’t talked to many meteorologists/physicists/engineers have you? My meteorology majors will graduate with a second degree in either mathematics or computer science. A meteorology graduate student without a math or CS background won’t be able to graduate in a reasonable length of time (M.S. <= 4 years) because they will have to take the time to fill in the missing pieces. How can you write the code necessary to process the data in real-time coming from fore and aft pointing aircraft Doppler radars without knowing how interrupts are processed? How can you write the code to process the data in real-time from a ship-borne radar that is rolling, pitching and yawing?
A more appropriate analogy would be the chief engineer at John Deere retiring to a farm and designing a new tractor for his hilly farm.
@Bob Pasken
Hi Bob,
You say:
Unfortunately, I have encountered the following argument a distressing number of times:
1. As a good scientist, I am automatically a good engineer.
2. As a good engineer, I am automatically a good programmer.
3. As a good programmer, I am automatically a good Software Quality Assurance Analyst (whatever that is, nothing significant I would guess).
What hubris. Programming is a domain in its own right. It’s the most complicated domain there is because if people could make their programs any more complicated they would. Programming is also an art. Even a computer science degree does not make you a good programmer.
Anecdotal though it may be, it has been my experience that only about 5% of “smart people” could ever actually write great software. Or, as I like to put it, I am 95% confident nobody can write great software.
BTW, Python is my favorite programming language. It has a particularly simple object model, that you can completely ignore if you want to. And it comes with “batteries included.” Even things like SciPy. So I fail to understand the “universal HATRED” modelers have about it. Guess it means I haven’t ever really modeled anything using Python?
George
A good parallel to climate models are computational chemistry codes such as Gaussian, Spartan, Gamess, Molpro, etc. Again, usually a strong FORTRAN base, written by domain experts and increasingly commercialized.
As someone who at least has a foot (well maybe a toe) in both communities, I agree with many of the observations about the unique aspects of developing scientific models. However, I strongly disagree that the current situation with regard to software engineering is anywhere near an optimum for this community. We scientists have become all too accustomed to 1000+ line procedures with short variable names, and are generally unaware of what clean, understandable code can look like. Yes, there are limits to what can be done with a complex mathematical relationship expressed in source code, but there is little excuse for much of the implementation quality in the remaining portions of the code.
Even in the hard-core numerical portions, much can often be done to improve the situation. I once worked on a project where I was helping some scientists translate some IDL code into Fortran (for performance and portability). I ran across an RK routine and could tell that a Runge-Kutta scheme was lurking in the 200-300 lines of code. As it turns out, there was also a fair bit of spherical geometry and other indirectly related code that was one “long chunk of sequential code implementing a series of equations”. I decided to refactor the Fortran implementation, and was very pleased by the results. In the end, the top level RK routine looked almost like something you would see in a text book. The spherical geometry was in a separate module, as were the bits of logic for accessing files to get offline wind fields and implementing periodicity. All far easier to understand. And why does it matter? At the very least the next developer would have a much easier time introducing for example an adaptive RK scheme. But I also was able to discover that the treatment of the lat-lon grid was terribly inaccurate near the poles and was able to introduce a more correct geometric treatment that allowed 10x larger step sizes. I never would have even attempted that change in the original code. I strongly doubt that this particular bit of code is unique in terms of the opportunity it provided.
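To give a flavour of what I mean by the top-level routine ending up looking like the textbook formula, here’s a toy Python rendering (nothing like the actual Fortran; the right-hand side is just a placeholder standing in for the spherical geometry and wind-field modules):

```python
# Toy rendering of the refactoring: the RK4 step reads like the textbook formula,
# and everything domain-specific lives behind the right-hand-side function.
import numpy as np

def rk4_step(f, t, y, dt):
    """One classical Runge-Kutta step for dy/dt = f(t, y)."""
    k1 = f(t, y)
    k2 = f(t + dt / 2, y + dt / 2 * k1)
    k3 = f(t + dt / 2, y + dt / 2 * k2)
    k4 = f(t + dt, y + dt * k3)
    return y + dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)

def rhs(t, y):
    # placeholder: in the real code this is where the spherical geometry and
    # offline wind-field access would be called
    return np.array([y[1], -y[0]])   # simple harmonic oscillator

y, t, dt = np.array([1.0, 0.0]), 0.0, 0.01
for _ in range(100):
    y = rk4_step(rhs, t, y, dt)
    t += dt
```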
Interestingly, in my experience, most modelers are to some degree aware that there is a problem, and are even open to assistance in reducing the source code “entropy”. They merely lack the skills and/or time to improve the code through refactoring, and are understandably apprehensive about change. But they continue to bear the costs of this entropy (often called “code debt”) through (1) steep learning curve for new scientists, (2) increased difficulty in extending the model, and (3) increased difficulty in debugging the model when it breaks. I really wish that these costs could be quantified so that we could judge whether it was in the long term interests of the organization to hire more professional software developers to improve long term scientific productivity. I’m more than prepared to admit that the costs would be counterproductive, but I want to be convinced.
The whole language choice seems always to degenerate into a religious war, and like all religious wars, the sides tend to be chosen on the basis of where one grew up.
My experience is that different scientific communities have made different choices, even though they are all coding up equations. The AMR framework, CHOMBO, is all C++, as is the VORPAL EM/accelerator/plasma framework. The Synergia beam modeling framework is Python linking in modules in Fortran and C++. Most fusion codes are Fortran, but many are C, and the FACETS framework is C++ that links in modules written in C++, Fortran, Python, and C. Much viz work and data analysis is now done with Python/matplotlib/scipy, while other is done within the VisIt C++ application.
With over 30 years of scientific programming behind me, I really don’t see any universal hatred of any language or methodology such as OO. I see more of “one should use what one is most productive in, whether that is because of familiarity or technique.” I have also seen that some communities are more conservative, others more avant garde. People in any of these communities are doing good work.
My two cents:
Mostly concur with George Crews, Tom Clune, and John Cary. I’ve seen a very smart hydrologist write code that made me cringe (As in, please let me have that for a day or two; it will run faster and I can see about 3 bugs that can be fixed in the process). Writing code is an art/domain in and of itself; knowing the mechanics of interrupts or whatever does not mean that you know how to write good code.
What is good code? First, it is simple and easy to read. Any idiot can write complicated code; it is an art, or at least requires some expertise, to be able to make something simple, or at least be able to break a complicated process down into simple bits. Code spends some small fraction of its life in development, and the remainder in maintenance. It is a heavy burden on productivity when someone finds that a piece of code has a problem and has to guess what the intentions of the author were 5, 10, or 20 years ago. (Being able to read and understand what they wrote is not the same as knowing what they were trying to do.) Or, where in this spaghetti the problem lies, or what else will be affected by a change. Being able to write a program that solves a task well enough for you in no way implies that your code is well-suited for others to utilize.
“Software developers are domain experts”
There is certainly a benefit to reducing errors introduced through communication steps, but I think you will find that it is not without costs. Scientific knowledge is an expert knowledge domain, and so is software design. I would expect that individuals who are experts in both are very rare. I suspect that in software projects, the general optimum would be a lead developer with some domain knowledge of the science. (At least enough to know to question whether the variable should be x or mean of x, if you get the reference.)
Re: “…we all agreed this is a very significant factor, and that you don’t need much (formal) process with very smart people.” Yes and no, the highly skilled people I’ve worked with benefit from a formal process less than the less skilled, but you are kidding yourselves if you think there is little benefit to tasks like code review and unit testing, even for the highly skilled.
Unit testing? Got to do it; if you don’t know if the pieces are working correctly, how can you have any confidence in the whole? Most of the time that I’ve seen push-back on unit testing had more to do with the developer not knowing how to get that done than with the inappropriateness of that type of testing to the algorithm.
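A unit test for a small numerical routine doesn’t have to be elaborate. A purely illustrative sketch (the Magnus-type saturation vapour pressure formula and the pytest style are my own choices here, not anything taken from the models under discussion): check a couple of known values and a basic property.

```python
# Sketch of a minimal unit test for one small numerical routine: a Magnus-type
# saturation vapour pressure formula (Bolton 1980). Written for pytest, but
# plain asserts work the same way.
import math

def saturation_vapour_pressure(temp_c):
    """Saturation vapour pressure in hPa for a temperature in degrees Celsius."""
    return 6.112 * math.exp(17.67 * temp_c / (temp_c + 243.5))

def test_known_values():
    assert abs(saturation_vapour_pressure(0.0) - 6.112) < 1e-6
    assert abs(saturation_vapour_pressure(20.0) - 23.4) < 0.1

def test_monotonically_increasing():
    values = [saturation_vapour_pressure(t) for t in range(-40, 41, 5)]
    assert all(a < b for a, b in zip(values, values[1:]))
```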
“In practice, optimization, where it is done, tends to be an afterthought.”
Well, OK, even in industry, there is a tendency to focus on optimization over new features only when performance is a point of pain. However, anecdotally, I can tell you that I increased the throughput of an overall system from O(N squared) to O(N) simply by changing one line in one low-level library function. It would have taken me a lot longer to find the problem if the code base was a mash of 1000-line routines. As it was, it was fairly easy to find because the code was fairly modular, but even at that, I had to explain why my change fixed the problem to the reviewer simply because they were not aware of the processing taking place in the OS library. My point is that without good software domain knowledge, it is easy to make simple mistakes, and without good code design, mistakes are harder to find and fix.
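To illustrate the flavour of that kind of one-line fix (not the actual code, obviously): a membership test against a list inside a loop costs O(N) per lookup, so the whole pass is O(N squared); switching the container to a set makes each lookup effectively constant time.

```python
# Illustrative only: the same loop before and after a one-line change that takes
# the whole pass from O(N^2) to roughly O(N).

def find_new_ids_slow(incoming_ids, known_ids):
    # 'x not in known_ids' scans the whole list: O(N) per lookup
    return [x for x in incoming_ids if x not in known_ids]

def find_new_ids_fast(incoming_ids, known_ids):
    known = set(known_ids)               # the one-line change: hash-based lookups
    return [x for x in incoming_ids if x not in known]
```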
Release schedule?
Yeah, scheduled releases have more influence on feature content than quality, but a) I have seen the pressure to meet the schedule cause some shortcuts to be taken, and b) not sure what it is like in climate research, but I do have some experience at a “publish or perish” university where there could be similar pressure to produce a “release”. The general need to get something out before your competition is similar, though not necessarily the same.
Re: “A more appropriate analogy would be the chief engineer at John Deere retiring to a farm and designing a new tractor for his hilly farm”
Ermm, we are talking about two different fields of expertise. No doubt the engineer could design a tractor ideally suited for his terrain and soil type, but there is more to farming than having a tractor. Soil chemistry, water availability, and temperature and which of the latest cultivars is best suited for the expected conditions come to mind, as well as the economics of a multitude of other choices to be made.
Understanding the nature of the climate system is the problem, and designing tools to help with that is a means to that end, in the same way that growing food is the problem, and designing a tractor is a means to that end. Your reversal of the analogy puts the software engineer in the role of retiring to become a climatologist. I make no claim to being able to do your job as well as you can; I find ‘hubris’ to be an appropriate term for you thinking that you can design code as well as a specialist in the field.
I’ve also worked on research project code that became commercialized. It was a nightmare. It was full of inconsistent logic patterns, redundant code (so that a fix for one problem had to be applied in multiple locations, and finding that sometimes that had happened and sometimes not), single variables with multiple meanings depending on where it was accessed, and in general, a failure of good design principles that made extending capabilities and fixing defects much more difficult than it should have been.
Oh, “Green Software Engineering”? What the heck is that? You are sitting at a desk, in front of a computer, and banging on a keyboard. You have to be air-conditioned or the computer will overheat. You might as well talk about Green Office Work; the general mechanics are the same.
“Highly tailored software development process – software development is tightly integrated into scientific work;”
Well, that is the same as working in other areas where the specs are changing frequently. It requires a flexible process, agile if you will, but does not sound that different from others that exist. If anything, it means that the code should be well-designed, with an emphasis on modularity, so that changes can be made more easily.
Resource constraints. I think you will find that there is no software project manager who has all the resources they think they need to accomplish what they’d like.
Re: “The software is used to build a product line”
I think you will find that this is not much different than industrial software.
Re: “Instead the focus is on very extensive integration tests, with daily builds, overnight regression testing, and a rigorous process of comparing the output from runs before and after each code change. ”
I work in industry, and this is pretty much what we do. I suspect one difference is that in our realm, the expected results are (or should be) well-defined; there is only one set of values that is the “right” answer. I suspect this is less so for climate models. We might be talking about different things; when I say unit test, I mean something on the scale of, is the output of the implementation of this function correct over the range of input values. Those kinds of tests are run during development, and ideally, when the function is changed, but not on a regular schedule.
Re: “The reason is that the model is changed so frequently that hand optimization of any particular model version is not useful. ”
There exist tools for measuring time within a function and number of calls to that function. These might make it easier to find bottlenecks. I’ve used them before to identify areas within the code base that are likely to yield the most bang for the buck in terms of developer time to product performance time. I don’t think you want to try to approach the problem as optimizing an entire model, but rather as identifying where improvements can be made. I don’t quite get what you are saying; surely later versions inherit whatever optimizations you make.
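For example, Python ships with cProfile, and gprof or vendor profilers play the same role for Fortran/C code. A rough sketch of the workflow (run_model here is just a stand-in for whatever top-level driver you want to measure):

```python
# Rough sketch of profiling a driver routine with Python's built-in cProfile.
# run_model and expensive_kernel are stand-ins, not real model code.
import cProfile
import pstats

def expensive_kernel(n):
    return sum(i * i for i in range(n))

def run_model(steps):
    for _ in range(steps):
        expensive_kernel(10_000)

cProfile.run("run_model(200)", "profile.out")
stats = pstats.Stats("profile.out")
stats.sort_stats("cumulative").print_stats(10)   # time per function plus call counts
```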
An interesting discussion. I particularly liked: “There are very few resources available for software infrastructure. ” I fully believe this is worse for you guys than me, but you might be surprised at what takes place, even if the main business is software development.
In short, I think you are going to have to go outside of your field to find the answers to your questions. Staying within your field is likely to yield group-think results, which I can kind of detect already.
From your paper: “Scientists have additional requirements for managing scientific code: they need to keep track of exactly which version of the code was used in a particular experiment, they need to re-run experiments with precisely repeatable results, and they need to build alternative versions of the software for different kinds of experiments.”
Repeatability of a build and a test run is exactly the same kind of requirement for the software projects I work on in commercial industry. Probably a little easier for us to get access to the tools to do that though; I’m betting they aren’t giving away Rational Team Concert for free, for instance.
“…older programming languages, for which the latest software development tools are not available.”
Can’t guarantee that it meets all needs, but there exists a modern software development tool that works with FORTRAN.
http://www.eclipse.org/photran/
Eclipse also has modules that work with the more common source code management systems, which makes it nice.
“…the developers will gradually evolve a set of processes that are highly customized to their context, irrespective of the advice of the software engineering literature.”
I think you will find this to be true regardless of the circumstances. If the shop management requires them to have a formal process, they will simply write down what they are doing anyway.
“Similarly, there is likely to be a difference in how the scientists perceive defects and bug fixes, compared to the use in the software engineering literature, because model imperfections are accepted as inevitable. Hence, any measure of defect density may not be comparable with that of other types of software.”
It should be easy enough to draw a distinction using the rule of ‘Did the software do what it was designed/intended to do?’ If the answer is ‘no’, a change to correct is a defect fix; if the answer is ‘yes’, it is a feature enhancement. Sometimes this devolves into an argument that I have had repeatedly, in that some problem in the software is result of a defect in the design. But that requires that there be some requirement/specification at a larger scope that can’t be met with the current design, and I don’t know how that would occur in the context of climate models.
Climate models might be used for forecasting critical weather events such as cyclones, hurricanes, etc., in order to increase the safety and survivability of the human population, by serving as a basis for ordering evacuations, relocations, fire bans, tying down heavy objects, shuttering windows, moving to higher ground, etc., and as a basis for built environment planning, i.e. avoiding flood-prone areas, etc. Therefore a high level of accuracy is needed in climate models to ensure correct warnings are issued.