Gavin beat me to posting the best quote from the CCSM workshop last week – the Uncertainty Prayer. Uncertainty cropped up as a theme throughout the workshop. In discussions about the IPCC process, one issue came up several times: the likelihood that the spread of model projections in the next IPCC assessment will be larger than in AR4. The models are significantly more complex than they were five years ago, incorporating a broader set of earth system phenomena and resolving finer-grained processes. The uncertainties in a more complex earth system model have a tendency to multiply, leading to a broader spread.

There is a big concern here about how to communicate this. Does this mean the science is going backwards – that we know less now than we did five years ago (imagine the sort of hay that some of the crazier parts of the blogosphere will make of that)? Well, there has been all sorts of progress in the past five years, much of it to do with understanding the uncertainties. And one result is the realization that the previous generations of models have under-represented uncertainty in the physical climate system – i.e. the previous projections for future climate change were more precise than they should have been. The implications are very serious for policymaking, not because there is any weaker case now for action, but precisely the opposite – the case for urgent action is stronger because the risks are worse, and good policy must be based on sound risk assessment. A bigger model spread means there’s now a bigger risk of more extreme climate responses to anthropogenic emissions. This problem was discussed at a fascinating session at the AGU meeting last year on validating model uncertainty (See: “How good are predictions from climate models?“).

At the CCSM meeting last week, Julia Slingo, chief scientist at the UK Met Office, put the problem of dealing with uncertainty into context by reviewing the current state of the art in short- and long-term forecasting, in a fascinating talk, “Uncertainty in Weather and Climate Prediction”.

She began with the work of Ed Lorenz. The Lorenz attractor is the prototype chaotic model. A chaotic system is not random, and the non-linear equations of a chaotic system demonstrate some very interesting behaviours. If it’s not random, then it must be predictable, but this predictability is flow dependent – where you are in the attractor will determine where you will go, but some starting points lead to a much more tightly constrained set of behaviours than others. Hence, the spread of possible outcomes depends on the initial state, and some states have more predictable outcomes than others.
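To make flow-dependent predictability concrete, here is a minimal Python sketch, assuming the classic Lorenz (1963) parameters (σ = 10, ρ = 28, β = 8/3) and a plain fourth-order Runge-Kutta integrator. The starting points, time step and perturbation size are arbitrary illustrative choices, not anything from the talk. It nudges x by a tiny amount and measures how far the two runs have separated after a fixed time; repeating this from different points on the attractor typically gives quite different growth.

```python
import math

SIGMA, RHO, BETA = 10.0, 28.0, 8.0 / 3.0  # classic Lorenz (1963) parameters


def lorenz_rhs(state):
    """Time derivative (dx/dt, dy/dt, dz/dt) of the Lorenz-63 system."""
    x, y, z = state
    return (SIGMA * (y - x), x * (RHO - z) - y, x * y - BETA * z)


def rk4_step(state, dt):
    """Advance the state by one step of the classical fourth-order Runge-Kutta scheme."""
    def shifted(s, k, h):
        return tuple(si + h * ki for si, ki in zip(s, k))
    k1 = lorenz_rhs(state)
    k2 = lorenz_rhs(shifted(state, k1, dt / 2))
    k3 = lorenz_rhs(shifted(state, k2, dt / 2))
    k4 = lorenz_rhs(shifted(state, k3, dt))
    return tuple(s + dt / 6 * (a + 2 * b + 2 * c + d)
                 for s, a, b, c, d in zip(state, k1, k2, k3, k4))


def integrate(state, t_end, dt=0.01):
    """Integrate from `state` for a time t_end and return the final state."""
    for _ in range(int(round(t_end / dt))):
        state = rk4_step(state, dt)
    return state


def separation_after(start, t_end, eps=1e-6, dt=0.01):
    """Distance after t_end between two runs whose initial x differs by eps."""
    a = integrate(start, t_end, dt)
    b = integrate((start[0] + eps, start[1], start[2]), t_end, dt)
    return math.dist(a, b)


if __name__ == "__main__":
    # Spin up onto the attractor, then compare two different starting points on it.
    p1 = integrate((1.0, 1.0, 1.0), 20.0)
    p2 = integrate(p1, 1.3)
    for name, p in (("point A", p1), ("point B", p2)):
        print(name, separation_after(p, 2.0))
```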

Why stochastic forecasting is better than deterministic forecasting

Much of the challenge in weather forecasting is to sample the initial condition uncertainty. Rather than using a single (deterministic) forecast run, modern weather forecasting makes use of ensemble forecasts, which probe the space of possible outcomes from a given (uncertain) initial state. This then allows the forecasters to assess possible outcomes, estimate risks and possibilities, and communicate risks to the users. Note the phrase “allows the forecasters to…” – the role of experts in interpreting the forecasts and explaining the risks is vital.
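The ensemble idea can be sketched the same way: perturb an uncertain analysis with small random errors, run each member forward, and track the ensemble spread. This is a toy illustration on the Lorenz-63 system, assuming Gaussian initial perturbations of a hypothetical size init_sd; operational systems construct their perturbations far more carefully.

```python
import math
import random


def lorenz_rhs(s, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """Time derivative of the Lorenz-63 system."""
    x, y, z = s
    return (sigma * (y - x), x * (rho - z) - y, x * y - beta * z)


def rk4_step(s, dt):
    """One classical fourth-order Runge-Kutta step."""
    def add(a, k, h):
        return tuple(ai + h * ki for ai, ki in zip(a, k))
    k1 = lorenz_rhs(s)
    k2 = lorenz_rhs(add(s, k1, dt / 2))
    k3 = lorenz_rhs(add(s, k2, dt / 2))
    k4 = lorenz_rhs(add(s, k3, dt))
    return tuple(si + dt / 6 * (a + 2 * b + 2 * c + d)
                 for si, a, b, c, d in zip(s, k1, k2, k3, k4))


def ensemble_forecast(analysis, n_members=20, init_sd=0.01,
                      n_steps=300, dt=0.01, seed=0):
    """Run an ensemble from randomly perturbed copies of an uncertain analysis.

    Returns the ensemble spread (sample standard deviation of x across the
    members) at every step, so you can see how fast the uncertainty grows.
    """
    rng = random.Random(seed)
    members = [tuple(v + rng.gauss(0.0, init_sd) for v in analysis)
               for _ in range(n_members)]
    spreads = []
    for _ in range(n_steps):
        members = [rk4_step(m, dt) for m in members]
        xs = [m[0] for m in members]
        mean = sum(xs) / n_members
        spreads.append(math.sqrt(sum((x - mean) ** 2 for x in xs) / (n_members - 1)))
    return spreads
```

Running it from two different analyses and comparing the spread at the same lead time mimics the London example below: identical initial uncertainty can stay tight from one starting point and grow rapidly from another.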

As an example, Julia showed two temperature forecasts for London, using initial conditions for 26 June on two consecutive years, 1994 and 1995. The red curves show the individual members of an ensemble forecast. The ensemble spread is very different in each case, demonstrating that some initial conditions are more predictable than others: one has a very high spread of model forecasts, and the other doesn’t (although note that in both cases the actual observations lie within the forecast spread):

Ensemble forecasts for two different initial states

The problem is that in ensemble forecasting, the root-mean-square (rms) error of the ensemble mean often grows faster than the ensemble spread, which indicates that the forecast is under-dispersive; in other words, the models don’t capture enough of the internal variability in the system. In such cases, improving the models (by eliminating modeling errors) will lead to increased internal variability, and hence a larger ensemble spread.
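A minimal sketch of the spread versus error check, assuming a list of past forecast cases, each with an ensemble of scalar member values and one verifying observation (all hypothetical inputs). It compares the average ensemble spread with the rms error of the ensemble mean; a ratio well below 1 is the under-dispersive signature. For simplicity it ignores the small finite-ensemble correction factor, sqrt((n+1)/n), that a careful verification would apply.

```python
import math


def spread_and_rmse(ensembles, observations):
    """Average ensemble spread and rms error of the ensemble mean.

    ensembles: list of forecast cases, each a list of member values
    observations: the verifying observation for each case
    """
    var_sum = 0.0
    sq_err_sum = 0.0
    for members, obs in zip(ensembles, observations):
        n = len(members)
        mean = sum(members) / n
        var_sum += sum((m - mean) ** 2 for m in members) / (n - 1)  # sample variance
        sq_err_sum += (mean - obs) ** 2                            # error of the mean
    n_cases = len(observations)
    return math.sqrt(var_sum / n_cases), math.sqrt(sq_err_sum / n_cases)


def is_under_dispersive(ensembles, observations, tolerance=1.0):
    """True when the spread/error ratio falls below `tolerance`."""
    spread, rmse = spread_and_rmse(ensembles, observations)
    if rmse == 0.0:
        return False  # a perfect ensemble mean cannot be called under-dispersive
    return spread / rmse < tolerance
```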

One response to this problem is the work on stochastic parameterizations. Essentially, this introduces noise into the model to simulate variability in the sub-grid processes. This can then reduce the systematic model error if it better captures the chaotic behaviour of the system. Julia mentioned three schemes that have been explored for doing this:

  • Random Parameters (RP), in which some of the tunable model parameters are varied randomly. This approach is not very convincing as it indicates we don’t really know what’s going on in the model.
  • Stochastic Convective Vorticity (SCV)
  • Stochastic Kinetic Energy Backscatter (SKEB)
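The talk did not go into how the Random Parameters scheme is implemented, so the following is only a hypothetical illustration: one tunable parameter wanders around its default value as a first-order autoregressive (AR(1)) process in time and is clipped to an expert range. The function name, decorrelation time and amplitude are all assumptions.

```python
import math
import random


def ar1_parameter_series(default, low, high, n_steps, tau_steps=50.0,
                         rel_sd=0.2, seed=0):
    """Hypothetical 'random parameters' perturbation of one tunable parameter.

    The relative anomaly follows an AR(1) process with decorrelation time
    tau_steps (in model steps) and stationary standard deviation rel_sd;
    the perturbed parameter is clipped to the range [low, high].
    """
    rng = random.Random(seed)
    phi = math.exp(-1.0 / tau_steps)               # lag-1 autocorrelation
    innov_sd = rel_sd * math.sqrt(1.0 - phi ** 2)  # keeps the stationary sd at rel_sd
    anomaly = 0.0
    values = []
    for _ in range(n_steps):
        anomaly = phi * anomaly + rng.gauss(0.0, innov_sd)
        values.append(min(high, max(low, default * (1.0 + anomaly))))
    return values
```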

The latter two approaches tackle known weaknesses in the models, at the boundaries between resolved physical processes and sub-grid-scale parameterizations. There has been plenty of evidence in recent years that energy cascades upscale from unresolved scales, and that parameterizations don’t capture this. For example, in the backscatter scheme, some fraction of the dissipated energy is scattered upscale and acts as a forcing for the resolved-scale flow. By including this in the ensemble prediction system, the forecast is no longer under-dispersive.
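The backscatter idea can be illustrated with a deliberately tiny model; this is an assumption for illustration and not the actual SKEB scheme, which works on spatial patterns of dissipation in a full model. A single resolved variable u is damped at rate nu, so its energy E = u²/2 is dissipated at rate D = nu*u²; a fraction b of that is fed back in as random forcing. With Itô noise of amplitude sqrt(2*b*D), the expected net energy loss becomes (1 - b)*D.

```python
import math
import random


def run_with_backscatter(u0, nu, backscatter_frac, dt, n_steps, seed=0):
    """Toy scalar model: du = -nu*u dt + sqrt(2*b*D) dW, with D = nu*u**2.

    Integrated with the Euler-Maruyama scheme. D is the rate at which the
    damping removes energy E = u**2/2; the noise re-injects a fraction
    b = backscatter_frac of it on average.
    """
    rng = random.Random(seed)
    u = u0
    path = [u]
    for _ in range(n_steps):
        dissipation = nu * u * u
        amplitude = math.sqrt(2.0 * backscatter_frac * dissipation)
        dw = rng.gauss(0.0, math.sqrt(dt))  # Brownian increment over dt
        u = u - nu * u * dt + amplitude * dw
        path.append(u)
    return path
```

With b = 0 this reduces to plain damped decay; with b > 0 the same damping is partly offset by stochastic kicks, so the resolved variable retains more variability.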

The other major approach is to increase the resolution of the model. Higher-resolution models will explicitly resolve more of the moist processes at sub-kilometer scales, and (presumably) remove this source of model error, although it’s not yet clear how successful this will be.

But what about seasonal forecasting – surely this growth of uncertainty prevents any kind of forecasting? People frequently ask “If we can’t predict weather beyond the next week, why is it possible to make seasonal forecasts?” The reason is that for longer-term forecasts, the boundary forcings start to matter more. For example, if you add a boundary forcing to the Lorenz attractor, it changes the amount of time the system spends in each part of the attractor, without changing the overall behaviour of the chaotic system. For a weak forcing, the frequency of occurrence of different regimes is changed, but the number and spatial patterns of the regimes are unchanged. Under strong forcing, even the patterns of regimes are modified as the system goes through bifurcation points. So if we know something about the forcing, we can forecast the general statistics of weather, even if it’s not possible to say what the weather will be at a particular location at a particular time.
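The forcing experiment described above can be sketched by adding a constant forcing to the Lorenz equations and counting how often the trajectory sits in each of its two lobes (x > 0 versus x < 0). The forcing form, a constant push of strength f in the x-y plane at angle theta, and the run lengths are assumptions chosen for illustration.

```python
import math


def forced_lorenz_rhs(s, f, theta, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """Lorenz-63 with an assumed constant forcing of strength f in the x-y plane."""
    x, y, z = s
    return (sigma * (y - x) + f * math.cos(theta),
            x * (rho - z) - y + f * math.sin(theta),
            x * y - beta * z)


def rk4_step(s, f, theta, dt):
    """One classical fourth-order Runge-Kutta step of the forced system."""
    def add(a, k, h):
        return tuple(ai + h * ki for ai, ki in zip(a, k))
    k1 = forced_lorenz_rhs(s, f, theta)
    k2 = forced_lorenz_rhs(add(s, k1, dt / 2), f, theta)
    k3 = forced_lorenz_rhs(add(s, k2, dt / 2), f, theta)
    k4 = forced_lorenz_rhs(add(s, k3, dt), f, theta)
    return tuple(si + dt / 6 * (a + 2 * b + 2 * c + d)
                 for si, a, b, c, d in zip(s, k1, k2, k3, k4))


def regime_fraction(f, theta=0.0, t_spinup=10.0, t_run=200.0, dt=0.01):
    """Fraction of time steps the trajectory spends in the x > 0 regime."""
    s = (1.0, 1.0, 1.0)
    for _ in range(int(round(t_spinup / dt))):  # discard the initial transient
        s = rk4_step(s, f, theta, dt)
    n = int(round(t_run / dt))
    positive = 0
    for _ in range(n):
        s = rk4_step(s, f, theta, dt)
        if s[0] > 0.0:
            positive += 1
    return positive / n
```

Comparing regime_fraction(0.0) with, say, regime_fraction(2.5) is a quick way to check whether a weak forcing shifts the occupancy of the regimes while the system stays chaotic; long runs are needed because the fraction converges slowly.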

Of course, there’s still a communication problem: people feel weather, not the statistics of climate.

Building on the early work of Charney and Shukla (e.g. see their 1981 paper on monsoon predictability), seasonal to decadal prediction using coupled atmosphere-ocean systems does work, something we would never have believed 20 years ago. But again, some parts of the behaviour space are easier to predict than others. For example, the onset of El Niño is much harder to predict than its decay.

In a fully coupled system, systematic and model-specific errors grow much more strongly. Because the errors can grow quickly and bias the probability distribution of outcomes, seasonal and decadal forecasts may not be reliable. So we assess the reliability of a given model using hindcasts. Every time you change the model, you have to redo the hindcasts to check reliability. This gives a reasonable sanity check for seasonal forecasting, but for decadal prediction it is challenging, as we have a very limited observational base.
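Reliability over a hindcast set is often summarized with the Brier score and a reliability table. A minimal sketch, assuming a binary event (for example, a cold winter) with one forecast probability and one 0/1 outcome per hindcast year; the number of probability bins is an arbitrary choice. In a reliable system, the observed frequency in each bin is close to the mean forecast probability in that bin.

```python
def brier_score(probs, outcomes):
    """Mean squared difference between forecast probabilities and 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def reliability_table(probs, outcomes, n_bins=5):
    """For each non-empty probability bin: (mean forecast prob, observed frequency, count)."""
    bins = [[] for _ in range(n_bins)]
    for p, o in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the top bin
        bins[idx].append((p, o))
    table = []
    for b in bins:
        if b:
            table.append((sum(p for p, _ in b) / len(b),
                          sum(o for _, o in b) / len(b),
                          len(b)))
    return table
```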

And now, we have another problem: climate change is reducing the suitability of observations from the recent past to validate the models, even for seasonal prediction:

Climate Change shifts the climatology, so that models tuned to 20th century climate might no longer give good forecasts

Hence, a 40-year hindcast set might no longer be useful for validating future forecasts. As an example, the UK Met Office got into trouble for failing to predict the cold UK winter of 2009-2010. Re-analysis of the forecasts indicates why: models calibrated on a 40-year hindcast gave only a 20% probability of a cold winter (and this was what was used for the seasonal forecast last year). However, models calibrated on just the past 20 years gave a 45% probability. This suggests that the past 40 years might no longer be a good indicator of future seasonal weather. Climate change makes seasonal forecasting harder!
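To see why the length of the calibration window matters, here is a hypothetical sketch of one simple calibration approach: shift each ensemble member by the mean model bias estimated over the last `window` years of hindcasts, then count the members below a cold threshold. This mean-bias correction is an assumption for illustration and not necessarily the Met Office’s method.

```python
def calibrated_cold_probability(members, model_hindcasts, observed,
                                threshold, window):
    """Probability that the winter mean falls below `threshold`.

    members: raw ensemble forecasts for the coming winter
    model_hindcasts, observed: past model and observed winter means, oldest first
    Each member is shifted by the mean bias (model minus observed) estimated
    over the last `window` years of the hindcast record.
    """
    bias = (sum(model_hindcasts[-window:]) - sum(observed[-window:])) / window
    corrected = [m - bias for m in members]
    return sum(c < threshold for c in corrected) / len(corrected)
```

If the model bias drifts over time, for instance because the climate itself is shifting, a 40-year window and a 20-year window yield different bias estimates and hence different probabilities from the same raw ensemble.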

Today, the state of the art for longer-term forecasts is multi-model ensembles, but it’s not clear this is really the best approach; it just happens to be where we are today. Multi-model ensembles have a number of strengths: each model is extensively tested by its own community, and a large pool of alternative components provides some sampling across structural assumptions. But they are still an ensemble of opportunity – they do not systematically sample uncertainties. Also, the set is rather small (e.g. 21 different models), so the sample is too small for determining the distribution of possible changes, and the ensembles are especially weak for predicting extreme events.
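The sample-size problem can be made concrete with a bootstrap, a standard resampling technique rather than anything proposed in the talk. Given a small set of per-model projections (say 21 values, which would be hypothetical here), it resamples them with replacement and reports a 90% interval for an upper percentile; for extremes, that interval is typically wide.

```python
import random


def percentile(sorted_vals, q):
    """Linear-interpolation percentile of an already sorted list (q in [0, 100])."""
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac


def bootstrap_percentile_interval(values, q=95, n_boot=2000, seed=0):
    """90% bootstrap interval for the q-th percentile of a small sample."""
    rng = random.Random(seed)
    estimates = sorted(
        percentile(sorted(rng.choices(values, k=len(values))), q)
        for _ in range(n_boot))
    return percentile(estimates, 5), percentile(estimates, 95)
```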

There has been a major effort on quantifying uncertainty over the last few years at the Hadley Centre, using a perturbed physics ensemble. This allows for a larger sample: 100s (or even 10,000s in climateprediction.net) of variants of the same model. The poorly constrained model parameters are systematically perturbed, within expert-suggested ranges. But this still doesn’t sample the structural uncertainty in the models, because all the variants are derived from a single base model. As an example of this work, the UKCP09 project was an attempt to move from uncertainty ranges (as in AR4) to a probability density function (pdf) for likely change. UKCP uses over 400 model projections to compute the pdf. There are many problems with the UKCP (see the AGU discussion for a critique), but it was a step forward in understanding how to quantify uncertainty. [Note: Julia acknowledged weaknesses in both the CP.net and UKCP projects, but pointed out that they are mainly interesting as examples of how forecasting methodology is changing]
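The text only says that parameters are perturbed systematically within expert-suggested ranges. One common way to do that, shown here as an assumption rather than the Hadley Centre’s actual design, is Latin hypercube sampling: each parameter’s range is divided into equal strata and each stratum is used exactly once. Parameter names such as "param_a" in a call would be purely hypothetical.

```python
import random


def latin_hypercube(ranges, n_samples, seed=0):
    """Latin hypercube sample of parameter vectors.

    ranges: dict of parameter name -> (low, high), e.g. expert-elicited bounds.
    Each parameter's range is split into n_samples equal strata; one point is
    drawn uniformly in each stratum, and the strata are shuffled independently
    per parameter before being combined into parameter vectors.
    """
    rng = random.Random(seed)
    columns = {}
    for name, (low, high) in ranges.items():
        width = (high - low) / n_samples
        points = [low + (i + rng.random()) * width for i in range(n_samples)]
        rng.shuffle(points)
        columns[name] = points
    return [{name: columns[name][j] for name in ranges} for j in range(n_samples)]
```

For example, latin_hypercube({"param_a": (0.5, 2.0), "param_b": (1e-4, 1e-3)}, 400) would give 400 model variants that together cover each range evenly.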

Another approach is to show which factors tend to dominate the uncertainty. For example, a pie chart showing the impact of different sources of uncertainty (model weaknesses, carbon cycle, natural variability, downscaling uncertainty) on the forecast for rainfall in the 2020s vs the 2080s is interesting – for the 2020s, the uncertainty about the carbon cycle is a relatively small factor, whereas for the 2080s it’s a much bigger factor.
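A minimal sketch of such a breakdown, assuming each source’s contribution is estimated from projections in which only that source is varied, and that the sources are independent so their variances simply add (a simplification; real studies also have to handle interactions between sources).

```python
from statistics import pvariance


def uncertainty_fractions(samples_by_source):
    """Fraction of total projection variance attributed to each source.

    samples_by_source maps a source name to projections obtained by varying
    only that source (holding the others fixed). Assumes independent
    sources, so the total variance is the sum of the per-source variances.
    """
    variances = {name: pvariance(vals) for name, vals in samples_by_source.items()}
    total = sum(variances.values())
    return {name: v / total for name, v in variances.items()}
```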

Julia suggests it’s time for a coordinated study of the effects of model resolution on uncertainty. Every modeling group is looking at this, but they are not doing standardized experiments, so comparisons are hard.

Here is an example from Tim Palmer. In AR4, WG1 chapter 11 gave an assessment of regional patterns of change in precipitation. For some regions, it was impossible to give a prediction (the white areas), whereas for others, the models appear to give highly confident predictions. But the confidence might be misplaced because many of the models have known weaknesses that are relevant to future precipitation. For example, the models don’t simulate persistent blocking anticyclones very well. This means it’s wrong to assume that if most models agree, we can be confident in the prediction. The Athena experiments with very high resolution models (T1259), for instance, showed much better blocking behaviour when compared against the ERA40 observational dataset. This implies we need to be more careful about selecting models for a multi-model ensemble for certain types of forecast.

The real butterfly effect raises some fundamental unanswered questions about the convergence of climate simulations with increasing resolution. Maybe there is an irreducible level of uncertainty in climate change. And if so, what is it? How much will increased resolution reduce the uncertainty? Will things be much better when we can resolve processes at 20km, 2km, or even 0.2km, compared to, say, 200km? Once we reach a certain resolution (e.g. 20km), is it just as good to represent small-scale motions using stochastic equations? And what’s the most effective way to use the available computing resources as we increase the resolution? [There’s an obvious trade-off between increasing the size of the ensemble and increasing the resolution of individual ensemble members]
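The ensemble-size versus resolution trade-off can be roughed out with a rule of thumb: halving the horizontal grid spacing doubles the number of points in each horizontal direction and, through the CFL condition, roughly halves the time step, so cost grows about as the cube of the refinement. This ignores vertical levels, I/O and changes in physics cost, so treat it as an order-of-magnitude sketch.

```python
def relative_cost(dx_km, reference_dx_km=200.0):
    """Cost of one run relative to a run at reference_dx_km.

    Rule of thumb: two horizontal dimensions plus a CFL-limited time step
    give cost proportional to (reference_dx / dx) ** 3.
    """
    return (reference_dx_km / dx_km) ** 3


def affordable_members(budget, dx_km, reference_dx_km=200.0):
    """Ensemble size a fixed budget buys (budget in units of reference-resolution runs)."""
    return int(budget // relative_cost(dx_km, reference_dx_km))
```

With a budget equivalent to 1,000 runs at 200km, this suggests about 125 members at 100km and a single member at 20km.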

Julia’s main conclusion is that Lorenz’ theory of chaotic systems now pervades all aspects of weather and climate prediction. Estimating and reducing uncertainty requires better multi-scale physics, higher resolution models, and more complete observations.

Some of the questions after the talk probed these issues a little more. For example, Julia was asked how to handle policymakers demanding better decadal prediction when we’re not ready to deliver it. Her response was that she believes higher resolution modeling will help, but that we haven’t proved this yet, so we have to manage expectations very carefully. She was also asked about the criteria to use for including different models in an ensemble – e.g. should we exclude models that don’t conserve physical quantities, that don’t do blocking, etc? For UKCP09, the criteria were global in nature, but this isn’t sufficient – we need criteria that test for skill with specific phenomena such as El Niño. Because the inclusion criteria aren’t clear enough yet, the UKCP project couldn’t give advice on wind in the projections. In the long run, the focus should be on building the best model we can, rather than putting effort into exploring perturbed physics, but we have to balance the needs of users for better probabilistic predictions against the need to get on and develop better physics in the models.

Finally, on the question of interpretation, Julia was asked: what if users (of the forecasts) can’t understand or process probabilistic forecasts? Julia pointed out that some users can process probabilistic forecasts, and indeed that’s exactly what they need; the insurance industry is one example. Others use them as input for risk assessment – e.g. water utilities. So we do have to distinguish the needs of different types of users.

The IPCC schedule impacts nearly all aspects of climate science. At the start of this week’s CCSM workshop, Thomas Stocker from the University of Bern, and co-chair of working group 1 of the IPCC, gave an overview of the road toward the fifth assessment report (AR5), due to be released in 2013.

First, Thomas reminded us that the IPCC does not perform science (its job is to assess the current state of the science), but increasingly it stimulates science. This causes some tension, though, as curiosity-driven research must remain the priority for the scientific community.

The highly politicized environment also poses a huge risk. There are some groups actively seeking to discredit climate science and damage the IPCC, which means that the rigor of the IPCC procedures is now particularly important. One important lesson from the last year is that there is no procedure for correcting serious errors in the assessment reports. Minor errors are routine, and are handled by releasing errata. But this process broke down for bigger issues such as the Himalayan glacier error.

Despite the critics, climate science is about as transparent as a scientific field can be. Anyone can download a climate model and see what’s in there. The IPCC process is founded on four key values (thanks to the advocacy of Susan Solomon): Rigor, Robustness, Transparency, and Comprehensiveness. However, there are clearly practical limits to transparency. For example, it’s not possible to open up lead author meetings, because the scientists need to be able to work together in a constructive atmosphere, rather than “having miscellaneous bloggers in the room”!

The structure of the IPCC remains the same: three working groups (WG1 on the physical science basis, WG2 on impacts and adaptation, and WG3 on mitigation), along with a task force on GHG inventories.

The most important principles for the IPCC are in articles 2 and 3:

2. “The role of the IPCC is to assess on a comprehensive, objective, open and transparent basis the scientific, technical and socio-economic information relevant to understanding the scientific basis of risk of human-induced climate change, its potential impacts and options for adaptation and mitigation. IPCC reports should be neutral with respect to policy, although they may need to deal objectively with scientific, technical and socio-economic factors relevant to the application of particular policies.”

3. “Review is an essential part of the IPCC process. Since the IPCC is an intergovernmental body, review of IPCC documents should involve both peer review by experts and review by governments.”

A series of meetings have already occurred in preparation for AR5:

  • Mar 2009: An expert meeting on the science of alternative greenhouse gas metrics, which produced a report.
  • Sept 2009: An expert meeting on detection and attribution, which produced a report and a good practice guidance paper [which itself is a great introduction to how attribution studies are done].
  • Jan 2010: An expert meeting at NCAR on assessing and combining multi-model projections. The report from this meeting is due in a few weeks, and will also include a good practice guide.
  • Jun 2010: A workshop on sea level rise and ice sheet instability, which was needed because of the widespread recognition that AR4 was weak on this issue, perhaps too cautious.
  • And in a couple of weeks, in July 2010, a workshop on consistent treatment of uncertainties and risks. This is a cross-Working Group meeting, at which they hope to make progress on getting all three working groups to use the same approach. In AR4, WG1 developed a standardized language for describing uncertainty, but the other working groups have not yet adopted it.

Thomas then identified some important emerging questions leading up to AR5.

  1. Trends and rates of observed climate change, and in particular, whether climate change has accelerated. Many recent papers and reports indicate that it has; the IPCC needs to figure out how to assess this, especially as there are mixed signals. For example, the decadal trend in Arctic sea ice extent is accelerating, but the global temperature anomaly has not accelerated over this time period.
  2. Stability of the Western and Eastern Antarctic ice sheets (WAIS and EAIS). There has been much more dynamic change at the margins of these ice sheets, with accelerating mass loss, as observed by GRACE. The assessment needs to look into whether these really are accelerating trends, or if it’s just an artefact of the limited duration of the measurements.
  3. Irreversibilities and abrupt change: how robust and accurate is our understanding? For example, what long-term commitments to sea level rise have already been made? And what about commitments in the hydrological cycle? Some regions (Africa, Europe) might go beyond the range of observed drought within the next couple of decades, and this may be unavoidable.
  4. Clouds and Aerosols, which will have their own entire chapter in AR5. There are still big uncertainties here. For example, low-level clouds are a positive feedback in the north-east Pacific, yet all but one model are unable to simulate this.
  5. Carbon and other biogeochemical cycles. New ice core reconstructions were published just after AR4, and give us more insights into regional carbon cycle footprints caused by abrupt climate change in the past. For example, the ice cores show clear changes in soil moisture and total carbon stored in the Amazon region.
  6. Near-term and long-term projections, for example the question of how reliable the decadal projections are. This is a difficult area. Some people say we already have seamless prediction (from decades to centuries), but Thomas is not yet convinced. For example, there are alarming new results on the number of extreme hot days across southern Europe that need to be assessed – these appear to challenge assumptions about the decadal trends.
  7. Regional issues – e.g. frequency and severity of impacts. Traditionally, the IPCC reports have taken an encyclopedic approach: take each region, and list the impacts in each. Instead, for AR5, the plan is to start with the physical processes, and then say something about sensitivity within each region to these processes.
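One simple way to ask whether a record is accelerating is to compare least-squares trends over consecutive, non-overlapping windows. This is a sketch, not the IPCC’s assessment method; as item 2 notes, short records make such window trends noisy, so a rising sequence of trends is suggestive rather than conclusive.

```python
def ols_slope(ts, ys):
    """Ordinary least-squares slope of ys against ts."""
    n = len(ts)
    t_mean = sum(ts) / n
    y_mean = sum(ys) / n
    num = sum((t - t_mean) * (y - y_mean) for t, y in zip(ts, ys))
    den = sum((t - t_mean) ** 2 for t in ts)
    return num / den


def windowed_trends(years, values, window=10):
    """Trend in each consecutive, non-overlapping window of `window` years."""
    trends = []
    for start in range(0, len(years) - window + 1, window):
        trends.append(ols_slope(years[start:start + window],
                                values[start:start + window]))
    return trends
```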

Here’s an overview of the planned structure of the AR5 WG1 report:

  • Intro
  • 4 chapters on observations and paleoclimate
  • 2 chapters on process understanding (biogeochemistry and clouds/aerosols)
  • 3 chapters from forcing to attribution
  • 2 chapters on future climate change and predictability (near term and long term)
  • 2 integration chapters (one on sea level rise, and one on regional issues)

Some changes from AR4 are evident. Observations have become more important: they grew to 3 chapters in AR4, and will stay at 3 in AR5. There will be another crack at paleoclimate, and new chapters on: sea level rise (a serious omission in AR4); clouds and aerosols; the carbon cycle; and regional change. There is also a proposal to produce an atlas, which will include a series of maps summarizing the regional issues.

The final draft of the WG1 report is due in May 2013, with a final plenary in Sept 2013. WG2 will finish in March 2014, and WG3 in April 2014. Finally, the IPCC Synthesis Report is to be done no later than 12 months after the WG1 report, i.e. by September 2014. There has been pressure to create a process that incorporates new science throughout 2014 into the synthesis report; however, Thomas has successfully opposed this, on the basis that it will cause far more controversy if the synthesis report is not consistent with the WG reports.

The deadlines for published research to be included in the assessment are as follows. Papers need to be submitted for publication by 31 July 2012, and must be in press by 15 March 2013. The IPCC has to be very strict about this, because there are people out there who have nothing better to do than to wade through all the references in AR4 and check that all of them appeared before the cutoff date.

Of course, these dates are very relevant to the CCSM workshop audience. Thomas urged everyone not to leave this to the last minute; journal editors and reviewers will be swamped if everyone tries to get their papers published just prior to the deadline [although I suspect this is inevitable?].

Finally, there is a significant challenge in communication coming up. For AR5 we’re expecting to see a much broader model diversity than in previous assessments, partly because there are more models (and more variants), and partly because the models now include a broader range of earth system processes. This will almost certainly mean a bigger model spread, and hence a likely increase in uncertainty. It will be a significant challenge to communicate the reasons for this to policymakers and a lay audience. Thomas argues that we must not be ashamed to present how science works – that in some cases the uncertainties multiply, so that the spread of projections grows, and then, when the models become more constrained by observations, they converge again. But this also poses problems for how we do model elimination and model weighting in ensemble projections. For example, if a particular model shows no sea ice in the year 2000, it probably should be excluded, as this is clearly wrong. But how do we set clear criteria for this?
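Model elimination and weighting can be sketched as follows, assuming a single observable metric (for example, simulated sea ice extent around 2000), a hard exclusion rule, and Gaussian likelihood weights for the remaining models. The metric, tolerance and weighting form are all assumptions; choosing them well is exactly the open question Thomas raised.

```python
import math


def weight_models(models, obs_value, sigma, exclude_if=None):
    """Hypothetical skill-based weighting of ensemble members.

    models: dict name -> simulated value of some observable metric
    obs_value, sigma: observed value and a tolerance scale for that metric
    exclude_if: optional predicate; models for which it returns True get weight 0
    Remaining models get weights exp(-0.5 * ((value - obs) / sigma) ** 2),
    normalised so that all weights sum to 1.
    """
    raw = {}
    for name, value in models.items():
        if exclude_if is not None and exclude_if(value):
            raw[name] = 0.0
        else:
            raw[name] = math.exp(-0.5 * ((value - obs_value) / sigma) ** 2)
    total = sum(raw.values())
    if total == 0.0:
        raise ValueError("every model was excluded or had negligible weight")
    return {name: w / total for name, w in raw.items()}
```

For instance, exclude_if=lambda extent: extent <= 0.0 removes any model with no sea ice, while the remaining models are down-weighted the further they sit from the observed value.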

I’ve speculated before about the factors that determine the length of the release cycle for climate models. The IPCC assessment process, which operates on a 5-year cycle, tends to dominate everything. But there are clearly other rhythms that matter too. I had speculated that the 6-year gap between the release of CCSM3 and CCSM4 could largely be explained by the demands of the IPCC cycle; however, the NCAR folks might have blown holes in that idea by making three new releases in the last six months. Clearly other temporal cycles are at play.

In discussion over lunch yesterday, Archer pointed me to the paper “Exploring Collaborative Rhythm: Temporal Flow and Alignment in Collaborative Scientific Work” by Steven Jackson and co., who point out that while the role of space and proximity has been widely studied in collaborative work, the role of time and patterns of temporal constraints has not. They set out four different kinds of temporal rhythm that are relevant to scientific work:

  • phenomenal rhythms, arising from the objects of study – e.g. annual and seasonal cycles strongly affect when fieldwork can be done in biology/ecology; the development of a disease in an individual patient affects the flow of medical research;
  • institutional rhythms, such as the academic calendar, funding deadlines, the timing of conferences and paper deadlines, etc.
  • biographical rhythms, arising from individual needs – family time, career development milestones, illnesses and vacations, etc.
  • infrastructural rhythms, arising from the development of the buildings and equipment that scientific research depends on. Examples include the launch, operation and expected life of a scientific instrument on a satellite, the timing of software releases, and the development of classification systems and standards.

The paper gives two interesting examples of problems in aligning these rhythms. First, studying long-term phenomena such as river flow on short-term research grants led to mistakes: data collected during an unusually wet period in the early 20th century led to serious deficiencies in water management plans for the Colorado River. Second, for NASA’s Mars Exploration Rover (MER) mission, the decision was taken to put the support team on “Mars time”, as the Martian day is 2.7% longer than the Earth day. But as the team’s daily work cycle drifted away from the normal Earth day, serious tensions arose between the family and social needs of the project team and the demands of the project rhythm.
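The Mars-time drift is easy to quantify. Using the 2.7% figure above (the actual sol is about 24 hours 40 minutes), a shift that starts at a fixed Mars time moves roughly 39 minutes later on the Earth clock every sol, and comes all the way round after about 37 sols. The function names are just for illustration.

```python
def schedule_drift_hours(n_sols, sol_excess=0.027):
    """Hours (modulo one Earth day) by which a fixed Mars-time shift start
    has moved on the Earth clock after n_sols, if each sol is sol_excess
    longer than an Earth day."""
    return (n_sols * sol_excess * 24.0) % 24.0


def sols_until_full_cycle(sol_excess=0.027):
    """Sols until the shift start has drifted all the way round the Earth clock."""
    return 1.0 / sol_excess
```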

Here’s another example that fascinated me when I was at the NASA software verification lab in the 90s. The Cassini spacecraft took about six years to get to Saturn. Rather than develop all the mission software prior to launch, NASA took the decision to develop only the minimal software needed for launch and navigation, and delayed the start of development of the mission software until just prior to arrival at Saturn. The rationale was that they didn’t want a six-year gap between development and use of this software, during which time the software teams might disperse – they needed the teams in place, with recent familiarity with the code, at the point the main science missions started.

For climate science, the IPCC process is clearly a major institutional rhythm, but the infrastructural rhythms that arise in model development interact with this in complex ways. I need to spend time looking at the other rhythms as well.

Of all the global climate models, the Community Earth System Model, CESM, seems to come closest to the way an open source community works. The annual CESM workshop, this week in Breckenridge, Colorado, provides an example of how the community works. There are about 350 people attending, and much of the meeting is devoted to detailed discussion of the science and modeling issues across a set of working groups: Atmosphere model, Paleoclimate, Polar Climate, Ocean model, Chemistry-climate, Land model, Biogeochemistry, Climate Variability, Land Ice, Climate Change, Software Engineering, and Whole Atmosphere.

In the opening plenary on Monday, Mariana Vertenstein (who is hosting my visit to NCAR this month) was awarded the 2010 CESM distinguished achievement award for her role in overseeing the software engineering of the CESM. This is interesting for a number of reasons, not least because it demonstrates how much the CESM community values the role of the software engineering team, and the advances that the software engineering working group has made in improving the software infrastructure over the last few years.

Earth system models are generally developed in a manner that’s very much like agile development. Getting the science working in the model is prioritized, with issues such as code structure, maintainability and portability worked in later, as needed. To some extent, this is appropriate – getting the science right is the most important thing, and it’s not clear how much a big upfront design effort would pay off, especially in the early stages of model development, when it’s not clear whether the model will become anything more than an interesting research idea. The downside of this strategy is that as the model grows in sophistication, the software architecture ends up being a mess. As Mariana explained in her talk, coupled models like the CESM have reached a point in their development where this approach no longer works. In effect, a massive refactoring effort is needed to clean up the software infrastructure to permit future maintainability.

Mariana’s talk was entitled “Better science through better software”. She identified a number of major challenges facing the current generation of earth system models, and described some of the changes in the software infrastructure that have been put in place for the CESM to address them.

The challenges are:

1) New system complexity, as new physics, and new grids are incorporated into the models. For example, the CESM now has a new land ice model, which along with the atmosphere, ocean, land surface, and sea ice components brings the total to five distinct geophysical component models, each operating on different grids, and each with its own community of users. These component models exchange boundary information via the coupler, and the entire coupled model now runs to about 1.2 million lines of code (compare with the previous generation model, CCSM3, now six years old, which had about 330KLoC).

The increasing number of component models increases the complexity of the coupler. It now has to handle regridding (where data such as energy and mass is exchanged between component models with different grids), data merging, atmosphere-ocean fluxes, and conservation diagnostics (e.g. to ensure the entire model conserves energy and mass). Note: Older versions of the model were restricted, for example with the atmosphere, ocean and land surface schemes all required to use the same grid.
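Conservative regridding is the part of the coupler’s job that is easiest to show in miniature. Here is a one-dimensional toy, assuming both grids cover the same interval and store cell averages: each destination cell takes the overlap-weighted mean of the source cells it intersects, which keeps the integral unchanged. Real couplers do this on 2-D cells on the sphere, so this only illustrates the principle behind the conservation diagnostics.

```python
def conservative_remap(src_edges, src_values, dst_edges):
    """First-order conservative remapping of cell averages between 1-D grids.

    Each destination cell average is the overlap-length-weighted mean of the
    source cells it intersects, so the integral (sum of value * width) is
    preserved when both grids span the same interval.
    """
    dst_values = []
    for j in range(len(dst_edges) - 1):
        d_lo, d_hi = dst_edges[j], dst_edges[j + 1]
        acc = 0.0
        for i in range(len(src_edges) - 1):
            overlap = min(d_hi, src_edges[i + 1]) - max(d_lo, src_edges[i])
            if overlap > 0:
                acc += src_values[i] * overlap
        dst_values.append(acc / (d_hi - d_lo))
    return dst_values


def integral(edges, values):
    """Conservation diagnostic: total of value * cell width."""
    return sum(v * (edges[k + 1] - edges[k]) for k, v in enumerate(values))
```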

Users also want to be able to swap in different versions of each major component. For example, a particular run might demand a fully prognostic atmosphere model, coupled with a prescribed ocean parameterization (taken from observational data, for example). Then, within each major component, users might want different configurations: multiple dynamical cores, multiple chemistry modes, etc.

Another source of complexity comes from resolutions. Model components now run over a much wider range of resolutions, and the re-gridding challenges are substantial. And finally, whereas the old model used rectangular latitude-longitude grids, now people want to accommodate many different types of grid.

2) Ultra-high resolution. The trend towards higher resolution grids poses serious challenges for scalability, especially given the massive increase in volume of data being handled. All components (and the coupler) need to be scalable in terms of both memory and performance.

Higher resolution increases the need for more parallelism, and there has been tremendous progress on this in the last few years. A few years back, as part of the DOE/LLNL grand challenge, CCSM3 managed 0.5 simulation years per day, running on 4,000 cores, and this was considered a great achievement. This year, the new version of CESM has successfully run on 80,000 cores, to give 3 simyears per day in a very high resolution model: 0.125° grid for the atmosphere, 0.25° for the land and 0.1° for the ocean.
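A quick way to compare those two runs is core-hours per simulated year, cores × 24 / (simulated years per day). The arithmetic uses only the figures quoted above; note that the two runs are at very different resolutions, so this is not a like-for-like efficiency comparison.

```python
def core_hours_per_sim_year(cores, sim_years_per_day):
    """Core-hours spent per simulated year: cores * 24 h / (simulated years per wall-clock day)."""
    return cores * 24.0 / sim_years_per_day


ccsm3 = core_hours_per_sim_year(4_000, 0.5)     # 192,000 core-hours per simulated year
cesm_hr = core_hours_per_sim_year(80_000, 3.0)  # 640,000 core-hours per simulated year
```

So throughput went up sixfold while the cost per simulated year rose by about 3.3 times, for a far higher-resolution configuration.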

Interestingly, in these highly parallel configurations, the ocean model, POP, no longer dominates processing time. The ocean model scales more readily, so the sea ice and atmosphere models start to dominate instead, because the two of them run in sequence.

3) Data assimilation. For weather forecasting models, this has long been standard analysis practice. Briefly, the model state and the observational data are combined at each timestep to give a detailed analysis of the current state of the system, which helps to overcome limitations in both the model and the data, and to better understand the physical processes underlying the observational data. It’s also useful in forecasting, as it allows you to arrive at a more accurate initial state for a forecast run.
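The core idea of combining a model state with observations can be shown for a single scalar. This sketch is the one-variable special case of the Kalman filter update (often called optimal interpolation), assuming independent, unbiased Gaussian errors with known variances. Operational methods such as 3D-Var, 4D-Var and ensemble Kalman filters apply the same principle to millions of variables at once; nothing here is specific to the CESM's assimilation code, and the function name is hypothetical.

```python
def analysis_update(background, background_var, observation, obs_var):
    """Combine a model forecast (the background) with one observation.

    The gain weights the observation by how uncertain the background is
    relative to the observation, so the analysis leans towards whichever
    source has the smaller error variance.
    """
    gain = background_var / (background_var + obs_var)
    analysis = background + gain * (observation - background)
    # The analysis is more certain than the background alone
    analysis_var = (1.0 - gain) * background_var
    return analysis, analysis_var


analysis, analysis_var = analysis_update(10.0, 4.0, 12.0, 1.0)
```

With a background of 10 (variance 4) and an observation of 12 (variance 1), the gain is 0.8, so the analysis sits at 11.6 with variance 0.8: closer to the more trustworthy observation, and more certain than either source on its own.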

In climate modeling, data assimilation is a relatively new capability. The current version of the CESM can do data assimilation in both the atmosphere and ocean. The new framework also supports experiments where multiple versions of the same component are used within a run. For example, a single simulation might contain two atmosphere components, each coupled with its own instance of the ocean, where one is an assimilation module and the other a prognostic model.

4) The needs of the user community. Supporting a broad community of model users adds complexity, especially as the community becomes more diverse. The community needs more frequent releases of the model (e.g. more often than every six years!), and people need to be able to merge new releases more easily into their own sandboxes.

These challenges have inspired a number of improvements to the CESM's software infrastructure, and Mariana described several of them.

The old model, CCSM3, was run as multiple executables, one for each major component, exchanging data with a coupler via MPI. And each component used to have its own way of doing coupling. But this kills efficiency – processors end up idling when a component has to wait on data from the others. It’s also very hard in this scheme to understand the time evolution as the model runs, which in turn makes it very hard to debug. And the old approach was notoriously hard to port to different platforms.

The new framework has a top level driver that controls time evolution, with all coupling done at the top level. Then the component models can be laid out across the available processors, either all in parallel, or in a hybrid parallel-sequential mode. For example, atmosphere, land scheme and sea ice modules might be called in sequence, with the ocean model running in parallel with the whole set. The chosen architecture is specified in a single XML file. This brings a number of benefits:

  • Better flexibility for very different platforms;
  • Facilitates model configurations with huge amounts of parallelism across a very large number of processors;
  • Allows the coupler & components to be ESMF compliant, so the model can couple with other ESMF compliant models;
  • Integrated release cycle – it’s now all one model, whereas in the past each component model had its own separate releases;
  • Much easier to debug, as it’s easier to follow the time evolution.
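Here's a sketch of why the processor layout matters, using a made-up XML schema (it is not the format CESM actually uses) and made-up per-step timings. Groups of components run concurrently on separate sets of processors, while components within a group run one after another. The coupled step therefore takes as long as the slowest group, and every other group sits idle for the difference, which is the kind of idle time a load-balancing tool would report. The function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical layout file: invented schema and timings, for illustration only.
# Group A runs atmosphere, land and sea ice in sequence; group B (the ocean)
# runs concurrently with the whole of group A on its own processors.
LAYOUT_XML = """<layout>
  <group name="A">
    <component name="atm" seconds="4.0"/>
    <component name="lnd" seconds="0.5"/>
    <component name="ice" seconds="1.5"/>
  </group>
  <group name="B">
    <component name="ocn" seconds="5.0"/>
  </group>
</layout>
"""


def coupled_step_time(xml_text):
    """Return (wall time per coupled step, idle time per group)."""
    root = ET.fromstring(xml_text)
    group_times = {}
    for group in root.findall("group"):
        # Components inside a group run sequentially, so their times add up
        group_times[group.get("name")] = sum(
            float(c.get("seconds")) for c in group.findall("component")
        )
    # Groups run concurrently, so the slowest one sets the pace
    step = max(group_times.values())
    idle = {name: step - t for name, t in group_times.items()}
    return step, idle


step, idle = coupled_step_time(LAYOUT_XML)
```

With these invented numbers the sequential atmosphere/land/ice group takes 6 seconds per step while the ocean takes 5, so the ocean's processors idle for a second every step; rebalancing would mean shifting cores from the ocean to the other group until the two times roughly match.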

The new infrastructure also includes scripting tools that support the process of setting up an experiment and making sure it runs with optimal performance on a particular platform. For example, the current release includes scripts to create a wide variety of out-of-the-box experiments. It also includes a load balancing tool, to check how much time each component is idle during a run, and new scripts with hints for porting to new platforms, based on a set of generic machine templates.

The model also has a new parallel I/O library (PIO), which adds a layer of abstraction between the data structures used in each model component and the arrangement of the data when written to disk.
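The idea behind that abstraction layer can be sketched in a few lines: each process holds a scattered subset of a global field (its compute decomposition), and the I/O layer rearranges those pieces into the contiguous order the file expects. This is a serial toy with hypothetical names, not PIO's API; real parallel I/O libraries do the rearrangement across processes, typically routing data to a subset of I/O tasks that then write in parallel.

```python
def gather_to_disk_order(local_blocks, global_size):
    """Reassemble a distributed field into the contiguous order used on disk.

    local_blocks maps a (hypothetical) process id to a pair
    (global_indices, values): each process owns a scattered subset of the
    global array, as in a typical compute decomposition.
    """
    on_disk = [None] * global_size
    for indices, values in local_blocks.values():
        for gidx, value in zip(indices, values):
            on_disk[gidx] = value
    # Every global index must be supplied by some process
    missing = [i for i, v in enumerate(on_disk) if v is None]
    if missing:
        raise ValueError(f"no process supplied global indices {missing}")
    return on_disk


# Two processes holding interleaved (round-robin) pieces of a 6-element field
blocks = {0: ([0, 2, 4], [10, 12, 14]), 1: ([1, 3, 5], [11, 13, 15])}
field = gather_to_disk_order(blocks, 6)
```

Since the on-disk order depends only on the global indices, a component can change its decomposition without changing what ends up in the file.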

The new versions of the model are now being released via the Subversion repository (rather than as a .tar file, as in the past), so users can use svn merge to get the latest release. There have been three model releases since January:

  • CCSM Alpha, released in January 2010;
  • CCSM 4.0 full release, in April 2010;
  • CESM 1.0 released June 2010.

Mariana ended her talk with a summary of the future work: complete the CMIP5 runs for the next round of the IPCC assessment process; regional refinement with scalable grids; extend the data assimilation capability; handle super-parameterization (e.g. include cloud resolving models); add hooks for human dimensions within the models (e.g. to support the DOE program on integrated assessment); and improve validation metrics.

Note: the CESM is the successor to CCSM – the community climate system model. The name change recognises the wider set of earth systems now incorporated into the model.

I had a bit of a gap in blogging over the last few weeks, as we scrambled to pack up our house (we’re renting it out while we’re away), and then of course, the road trip to Colorado to start the first of my three studies of software development processes at climate modeling centres. This week, I’m at the CCSM workshop, and will post some notes about the workshop in the next few days. But first, a chance for some reflection.

Ten years ago, when I quit NASA, I was offered a faculty position in Toronto with immediate tenure. The offer was too good to turn down: it’s a great department, with a bunch of people I really wanted to work with. I was fed up of the NASA bureaucracy, the short-termism of the annual budget cycle, and (most importantly) a new boss I couldn’t work with. A tenured academic post was the perfect antidote – I could focus on long-term research problems that interested me most, without anyone telling me what to study.

(Note: Lest any non-academics think this is an easy life, think again. I spend far more time chasing research funding than actually doing research, and I’m in constant competition with an entire community of workaholics with brilliant minds. It’s bloody hard work.)

Tenure is an interesting beast. It’s designed to protect a professor’s independence and ability to pursue long term research objectives. It also preserves the integrity of academic researchers: if university administrators, politicians, funders, etc. find a particular set of research results inconvenient, they cannot fire, or threaten to fire, the professors responsible. But it’s also limited. While it ought to protect curiosity-driven research from the whims of political fashions, it only protects the professor’s position (and salary), not the research funding needed for equipment, travel, students, etc. But the important thing is that tenure gives the professor the freedom to direct her own research programme and the freedom to decide what research questions to tackle.

Achieving tenure is often a trial by fire, especially in the top universities. After demonstrating your research potential by getting a PhD, you then compete with other PhDs to get a tenure-track position. You have to maintain a sustained research program over six to seven years as a junior professor, publishing regularly in the top journals in your field, and gaining the attention of the top people in your field who might be asked to write letters of support for your tenure case. In judging tenure cases, the trajectory and sustainability of the research programme is taken into account – a publication record that appears to be slowing down over the pre-tenure period is a big problem; if you have several papers in a row rejected, especially towards the end of the pre-tenure period, it might be hard to put together a strong tenure case. The least risky route is to stick with the same topic you studied in your PhD, where you already have the necessary background and where you presumably have also ‘found’ your community.

The ‘finding your community’ part is crucial. Scientific research is very much a community endeavor; the myth of the lone scientist in the lab is dead wrong. You have to figure out early in your research career which subfield you belong in, and get to know the other researchers in that subfield, in order to have your own research achievements recognized. Moving around between communities, or having research results scattered across different communities might mean there is no-one who is familiar enough with your entire body of research to write you a strong letter of support for tenure.

The problem is, of course, that this system trains professors to pick a subfield and stick with it. It tends to stifle innovation, and means that many professors then just continue to work on the same problems throughout the rest of their careers. There’s a positive side to this: some hard scientific problems really do need decades of study to master. On the other hand, most of the good ideas come from new researchers – especially grad students and postdocs; many distinguished scientists did their best work when they were in their twenties, when they were new to the field, and were willing to try out new approaches and question conventions.

To get the most value out of tenure, professors should really use it to take risks: to change fields, to tackle new problems, and especially to do research that they couldn’t do when they were chasing tenure. A good example is inter-disciplinary research. It’s hard to do work that spans several recognizable disciplines when you’re chasing tenure – you have to get tenure in a single university department, which usually means you have to be well established in a single discipline. Junior researchers interested in inter-disciplinary research are always at a disadvantage compared to their mono-disciplinary colleagues. But once you make tenure, this shouldn’t matter any more.

The problem is that changing your research direction once you’re an established professor is incredibly hard. This was my experience when I decided a few years ago to switch my research from traditional software engineering questions to the issue of climate change. It meant walking away from an established set of research funding sources, and an established research community, and most especially from an established set of collaborative relationships. The latter I think was particularly hard – colleagues with whom I’ve worked closely for many years still assume I’m interested in the same problems that we’ve always worked on (and, in many ways I still am – I’m trained to be interested in them!). I’m continually invited to co-author papers, to review papers and research proposals, to participate in grant proposals, and to join conference committees in my old field. But to give myself the space to do something very different, I’ve had to be hardheaded and say no to nearly all such invitations. It’s hard to do this without also offending people (“what do you mean you’re no longer interested in this work we’ve devoted our careers to?”). And it’s hard to start over, especially as I need to find new sources of funding, and new collaborators.

One of the things I’ve had to think carefully about is how to change research areas without entirely cutting off my previous work. After many years working on the same set of problems, I believe I know a lot about them, and that knowledge and experience ought to be useful. So I’ve tried to carve out a new research area that allows me to apply ideas that I’ve studied before to an entirely new challenge problem – a change of direction if you like, rather than a complete jump. But it’s enough of a change that I’ve had to find a new community to collaborate with. And different venues to publish in.

Personally, I think this is what the tenure system is made for. Tenured professors should make use of the protection that tenure offers to take risks, and to change their research direction from time to time. And most importantly, to take the opportunity to tackle societal grand challenge problems – the big issues where inter-disciplinary research is needed.

And unfortunately, just about everything about the tenure system and the way university departments and scientific communities operate discourages such moves. I’ve been trying to get many of my old colleagues to apply themselves to climate change, as I believe we need many more brains devoted to the problem. But very few of my colleagues are interested in switching direction like this. Tenure should facilitate it, but in practice, the tenure system actively discourages it.

Congratulations to Jorge, who passed the first part of his PhD thesis defense yesterday with flying colours. Jorge’s thesis is based on a whole series of qualitative case studies of different software development teams (links go to ones he’s already published):

  • 7 successful small companies (under 50 employees) in the Toronto region;
  • 9 scientific software development groups, in an academic environment;
  • 2 studies of large companies (IBM and Microsoft);
  • 1 detailed comparative study of a company using Extreme Programming (XP) versus a similar-sized company that uses a more traditional development process (both building similar types of software for similar customers).

We don’t have anywhere near enough detailed case studies in software engineering – most claims for the effectiveness of various approaches to software development rest on little more than marketing hype and anecdotal evidence. There has been a push in the last decade or so for laboratory experiments, which are usually conducted along the lines of experiments in psychology: recruit a set of subjects, assign them a programming task, and measure the difference in variables like productivity or software quality when half of them are given some new tool or technique. While these experiments are sometimes useful for insights into how individual programmers work on small tasks, they really don’t tell us much about software development in the wild, where, as Parnas puts it, the interesting challenges are in multi-person development of multi-version software over long time scales. In his thesis, Jorge cites the example of a controlled study of pair programming, which purports to show that pair programming lowers productivity. Except that it shows no such thing – any claimed benefits of pair programming are unlikely to emerge with subjects who are put together for a single day, but who otherwise have no connection with one another, and no shared context (like, for example, a project they are both committed to).

Each of Jorge’s case studies is interesting, but to me, the theory he uses them to develop is even more interesting. He starts by identifying three different traditions in the study of software development:

  • The process view, in which software construction is treated like a production line, and the details of the individuals and teams who do the construction are abstracted away, allowing researchers to talk about processes and process models, which, it is assumed, can be applied in any organizational context to achieve a predictable result. This view is predominant in the SE literature. The problem, of course, is that the experience and skills of individuals and teams do matter, and the focus on processes is a poor way to understand how software development works.
  • The information flow view, in which much of software development is seen as a problem in sharing information across software teams. This view has become popular recently, as it enables the study of electronic repositories of team communications as evidence of interaction patterns across the team, and leads to a set of theories about how well patterns of communication acts match the technical dependencies in the software. The view is appealing because it connects well with what we know about interdependencies within the software, where clean interfaces and information hiding are important. Jorge argues that the problem with this view is that it fails to distinguish between successful and unsuccessful acts of communication. It assumes that communication is all about transmitting and receiving information, and it ignores problems in reconstructing the meaning of a message, which is particularly hard when the recipient is in a remote location, or is reading it months or years later.
  • The third view is that software development is largely about the development of a shared understanding within teams. This view is attractive because it takes seriously the intensive cognitive effort of software construction, and emphasizes the role of coordination, and the way that different forms of communication can impact coordination. It should be no surprise that Jorge and I both prefer this view.
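To make the information flow view concrete, here's a sketch of one simple way such a match between communication and technical dependencies is often quantified (sometimes called socio-technical congruence): work out which pairs of developers need to coordinate because they work on technically dependent modules, then measure what fraction of those pairs actually communicated. The data structures, names and example team are all hypothetical. Note that the measure only records whether communication happened, not whether it succeeded, which is exactly the blind spot Jorge points to.

```python
def congruence(assignments, dependencies, communication):
    """Fraction of coordination needs matched by recorded communication.

    assignments:   dict developer -> set of modules they work on
    dependencies:  set of (module_a, module_b) technical dependencies
    communication: set of frozenset({dev_x, dev_y}) pairs who communicated
    """
    needs = set()
    devs = sorted(assignments)
    for i, a in enumerate(devs):
        for b in devs[i + 1:]:
            for m1, m2 in dependencies:
                # a and b must coordinate if they own opposite ends of a dependency
                if (m1 in assignments[a] and m2 in assignments[b]) or (
                    m2 in assignments[a] and m1 in assignments[b]
                ):
                    needs.add(frozenset((a, b)))
    if not needs:
        return 1.0  # nothing needed coordinating, so nothing was missed
    return len(needs & communication) / len(needs)


team = {"ana": {"parser"}, "ben": {"lexer"}, "cy": {"ui"}}
deps = {("parser", "lexer"), ("ui", "parser")}
talked = {frozenset(("ana", "ben"))}
score = congruence(team, deps, talked)
```

Here ana needs to coordinate with both ben and cy, but only talked to ben, so the score is 0.5; a log of messages would say nothing about whether ana and ben actually understood each other.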

Then comes the most interesting part. Jorge points out that software teams need to develop a shared understanding of goals, plans, status and context, and that four factors strongly affect their success in this: proximity (how close the team members are to each other – being in the same room is much more useful than being in different cities); synchrony (talking to each other in (near) real time is much more useful than writing documents to be read at some later time); symmetry (coordination and information sharing are best done by the people whom they most concern, rather than imposed by, say, managers); and maturity (it really helps if a team has an established set of working relationships and a shared culture).

This theory leads to a reconceptualization of many aspects of software development, such as the role of tools, the layout of physical space, the value of documentation, and the impact of growth on software teams. But you’ll have to read the thesis to get the scoop on all these…

A wonderful little news story spread quickly around a number of contrarian climate blogs earlier this week, and of course was then picked up by several major news aggregators: a 4th grader in Beeville, Texas had won the National Science Fair competition with a project entitled “Disproving Global Warming”. Denialists rubbed their hands in glee. Even more deliciously, the panel of judges included Al Gore.

Wait, what? Surely that can’t be right? Now, anyone who considers herself a skeptic would have been immediately, well, skeptical. But apparently that word no longer means what it used to mean. It took a real scientist to ask the critical questions and investigate the source of the story: Michael Tobis took the time to drive to Beeville to investigate, as the story made no sense. And sure enough, there’s a letter on clearly fake National Science Foundation letterhead, with no signature, and the NSF has no knowledge of it. Oh, and of course, a quick Google search shows that there is no such thing as a national science fair. Someone faked the whole thing (and the good folks at Reddit then dug up plenty of evidence about who).

So, huge kudos to MT for doing what journalists are supposed to do. And kudos to Sarah Taylor, the journalist who wrote the original story, for doing a full follow-up once she found out it was a hoax. But this story raises the question: how come, now that we live in such an information rich age, so few people can be bothered to check out the evidence about anything any more? Traditional investigative journalism is almost completely dead. The steady erosion of revenue from print journalism means most newspapers do little more than reprint press releases – most of them no longer retain science correspondents at all. And if traditional journalism isn’t doing investigative reporting any more, who will? Bloggers? Many bloggers like to think of themselves as “citizen journalists”. But few bloggers do anything more than repeat stuff they found on the internet, along with strident opinion on it. As Balbulican puts it: Are You A “Citizen Journalist”, or Just An Asshole?

Oh, and paging all climate denialists. Go take some science courses and learn what skepticism really means.

Short notice, but an interesting talk tomorrow by Balaji of Princeton University and NOAA/GFDL. Balaji is head of the Modeling Systems Group at NOAA/GFDL. The talk is scheduled for 4 p.m., in the Physics building, room MP408.

Climate Computing: Computational, Data, and Scientific Scalability

V. Balaji
Princeton University

Climate modeling, in particular the tantalizing possibility of making projections of climate risks that have predictive skill on timescales of many years, is a principal science driver for high-end computing. It will stretch the boundaries of computing along various axes:

  • resolution, where computing costs scale with the 4th power of problem size along each dimension
  • complexity, as new subsystems are added to comprehensive earth system models with feedbacks
  • capacity, as we build ensembles of simulations to sample uncertainty, both in our knowledge and representation, and of that inherent in the chaotic system. In particular, we are interested in characterizing the “tail” of the pdf (extreme weather) where a lot of climate risk resides.
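One standard way to read the “4th power” is the usual back-of-the-envelope argument, sketched here under stated assumptions: refining the grid spacing by a factor r multiplies the number of cells by r in each refined spatial dimension, and a CFL-type stability limit forces the time step to shrink by the same factor r. Refining all three spatial dimensions therefore costs r⁴; if the vertical grid is left alone, the cost grows as r³. The function name and parameters are my own.

```python
def relative_cost(refinement, refined_dims=3, cfl_limited=True):
    """Relative compute cost when the grid spacing is divided by `refinement`.

    Each refined spatial dimension multiplies the cell count by `refinement`;
    a CFL-type stability limit shrinks the time step by the same factor,
    adding one more power. Ignores memory, I/O and communication costs.
    """
    exponent = refined_dims + (1 if cfl_limited else 0)
    return refinement ** exponent


doubling = relative_cost(2)                         # all three dims refined
horizontal_only = relative_cost(2, refined_dims=2)  # vertical grid unchanged
tenfold = relative_cost(10)
```

So doubling the resolution everywhere costs about 16 times as much computing, and a tenfold refinement about 10,000 times.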

The challenge probes the limits of current computing in many ways. First, there is the problem of computational scalability, where the community is adapting to an era where computational power increases are dependent on concurrency of computing and no longer on raw clock speed. Second, we increasingly depend on experiments coordinated across many modeling centres which result in petabyte-scale distributed archives. The analysis of results from distributed archives poses the problem of data scalability.

Finally, while climate research is still performed by dedicated research teams, its potential customers are many: energy policy, insurance and re-insurance, and most importantly the study of climate change impacts (on agriculture, migration, international security, public health, air quality, water resources, travel and trade) are all domains where climate models are increasingly seen as tools that could be routinely applied in various contexts. The results of climate research have engendered entire fields of “downstream” science as societies try to grapple with the consequences of climate change. This poses the problem of scientific scalability: how to enable the legions of non-climate scientists, vastly outnumbering the climate research community, to benefit from climate data.

The talk surveys some aspects of current computational climate research as it rises to meet the simultaneous challenges of computational, data and scientific scalability.

Update: Neil blogged a summary of Balaji’s talk.

I thought I wouldn’t blog any more about the CRU emails story, but this one is very close to my heart, so I can’t pass it up. Brian Angliss, over at Scholars and Rogues, has written an excellent piece on the lack of context in the stolen emails, and the reliability of any conclusions that might be based on them. To support his analysis, he quotes extensively from the paper “The Secret Life of Bugs” by Jorge Aranda and Gena Venolia from last year’s ICSE, in which they convincingly demonstrated that electronic records of discussions about software bugs are frequently unreliable, and that there is a big difference between the recorded discussions and what you find when you actually track down the participants and ask them directly.

BTW Jorge will be defending his PhD thesis in a couple of weeks, and it’s full of interesting ideas about how software teams develop a shared understanding of the software they build, and the implications this has for team organisation. I’ll be mining it for ideas to explore in my own studies of climate modellers later this year…

Take a look at this recent poll from Nanos on priorities for the upcoming G8/G20 meetings. Canadians ranked Global Warming and Economic Recovery as the top two priorities for the meetings, but note that global warming beats economic recovery for the top response across nearly all categories of Canadians (with the exception of the old fogeys, in the 50+ age group, and westerners, who I guess are busy getting rich from the oil sands). Overall, 33.7% of Canadians ranked Global Warming as the top priority, while 27.2% named Economic Recovery.

There are some other interesting results in the poll. In the breakdown by party voting preferences, Bloc Québécois and NDP supporters seem much more worried about Global Warming than Green Party supporters: 59.3% of BQ voters and 41.5% of NDP voters ranked it first, while only 33.8% of Green Party voters did. So much for the myth that the Green Party is a single-issue party, eh?

Oh, and if you look at the results for the later questions, Global Warming is clearly the issue on which Canada is perceived to be doing worst in terms of its place in the world.

31. May 2010 · 5 comments · Categories: psychology

I’m fascinated by the cognitive biases that affect people’s perceptions of climate change. I’ve previously written about the Dunning-Kruger effect (the least competent people tend to vastly over-rate their competence), and Kahan and Braman’s studies on social epistemology (people tend to ignore empirical evidence if its conclusions contradict their existing worldview).

Now comes a study by Nyhan and Reifler in the journal Political Behavior entitled “When Corrections Fail: The Persistence of Political Misperceptions”. N&R point out that in the literature, there is plenty of evidence that people often “base their policy preferences on false, misleading or unsubstantiated information that they believe to be true”. Studies have shown that providing people with correct factual information doesn’t necessarily affect their beliefs. However, different studies disagree about the latter point, partly because it’s often not clear in these studies whether the subject’s beliefs changed at all, and partly because previous studies have differed over how the factual information is presented (and even what counts as ‘factual’).

So N&R set out to directly study whether corrective information does indeed change erroneous beliefs. Most importantly, they were interested in what happens when this corrective information isn’t presented directly as an authoritative account of the truth, but rather (as happens more often) when it is presented as part of a larger, more equivocal set of stories in the media. One obvious way people preserve erroneous beliefs is through selective reading – people tend to seek out information that supports their existing beliefs; hence they often don’t encounter information that corrects their misperceptions. And even when people do encounter corrective information, they are more likely to reject it (e.g. by thinking up counter-arguments) if it contradicts their prior beliefs. It is this latter process that N&R investigated, and in particular, whether this process of thinking up counter-arguments can actually reinforce the misperception; they dub this a “correction backfire”.

Four studies were conducted. In each case, the subjects were presented with a report of a speech by a well-known public figure. When a factual correction was added to the article, those subjects who were most likely to agree with the contents of the speech were unmoved by the factual correction, and in several of the studies, the correction actually strengthened their belief in the erroneous information (i.e. a clear ‘correction backfire’):

  • Study 1, conducted in the fall of 2005, examined beliefs about whether Iraq had weapons of mass destruction prior to the US invasion. Subjects were presented with a newspaper article describing a speech by President Bush in which he talks about the risk of Iraq passing these weapons on to terrorist networks. In the correction treatment, the article goes on to describe the results of the Duelfer report, which concluded there were, in fact, no weapons of mass destruction. The result shows a clear correction backfire for conservatives – the correction significantly increased their belief that Iraq really did have such weapons, while for liberals, the correction clearly decreased their belief.
  • Study 2, conducted in the spring of 2006, repeated study 1 with some variation in the wording. This study again showed that the correction was ineffective for conservatives – it didn’t decrease their belief in the existence of the weapons. However, unlike study 1, it didn’t show a correction backfire, although a re-analysis of the results indicated that there was such a backfire among those conservatives who most strongly supported the Iraq war. This study also attempted to test the effect of the source of the newspaper report – i.e. does it matter if it’s presented as being from the New York Times (perceived by many conservatives to have a liberal bias) or Fox News (perceived as being conservative)? In this case, the source of the article made no significant difference.
  • Study 3, also conducted in 2006, examined the belief that the Bush tax cuts paid for themselves by stimulating enough economic growth to actually increase government revenues. Subjects were presented with an article in which President Bush indicated the tax cuts had helped to increase revenues. In the correction treatment, the article goes on to present the actual revenues, showing that tax revenues declined sharply (both absolutely, and as a proportion of GDP) in the years after the tax cuts were enacted. Again there was a clear correction backfire among conservatives – those receiving the article presenting the actual revenues actually increased their belief that the tax cuts paid for themselves.
  • Study 4, also from 2006, examined the belief that Bush banned stem cell research. Subjects were presented with an article describing speeches by Senators Edwards and Kerry in which they suggest such a ban exists. In the corrective treatment, a paragraph was added to explain that Bush didn’t actually ban stem cell research, because his restrictions didn’t apply to privately funded research. The correction did not change liberals’ belief that there was such a ban, but there was no correction backfire (i.e. it didn’t increase their belief in the ban).

In summary, factual corrections in newspaper articles don’t appear to work for those who are ideologically motivated to hold the misperception, and in two out of the four studies, it actually strengthened the misperception. So, fact-checking on its own is not enough to overcome ideologically-driven beliefs. (h/t to Ben Goldacre for this)

How does this relate to climate change? Well, most media reports on climate change don’t even attempt any fact-checking anyway – they ignore the vast body of assessment reports by authoritative scientific bodies, and present a “he-said-she-said” slugfest between denialists and climate scientists. The sad thing is that adding fact-checking won’t, on its own, make any difference to those whose denial of climate change is driven by their ideological leanings. If anything, such fact-checking will make them even more entrenched…

Dear God, I would like to file a bug report

(clickety-click for the full xkcd cartoon)

I’ve been working with a group of enthusiastic parents in our neighbourhood over the past year on a plan to make our local elementary school a prototype for low-energy buildings. As our discussions evolved, we ended up with a much more ambitious vision: to use the building and grounds of the school for renewable power generation projects (using solar and geothermal energy) that could potentially power many of the neighbouring houses and condos – i.e. make the school a community energy hub. And of course, engage the kids in the whole process, so that they learn about climate and energy, even as we attempt to build solutions.

In parallel with our discussions, the school board has been beefing up its ambitions too, and has recently adopted a new Climate Change Action Plan. It makes for very interesting reading. I like the triple goal: mitigation, adaptation and education, largely because the last of these, education, is often missing from discussions about how to respond to climate change, and I firmly believe that the other two goals depend on it. The body of the report is a set of ten proposed actions to cut carbon emissions from the buildings and transportation operated by the school board, funded from a variety of sources (government grants, the feed-in tariff program, operational savings, carbon credits, etc). The report still needs some beefing up on the education front, but it’s a great start!

Here are two upcoming conferences, both relevant to the overlap of computer science and climate science:

…and I won’t make it to either as I’ll be doing my stuff at NCAR. I will get to attend this though:

I guess I’ll have to send some of my grad students off to the other conferences (hint, hint).

I’ve been busy the last few weeks setting up the travel details for my sabbatical. My plan is to visit three different climate modeling centers, to do a comparative study of their software practices. The goal is to understand how the software engineering culture and practices vary across different centers, and how the differences affect the quality and flexibility of the models. The three centers I’ll be visiting are:

I’ll spend 4 weeks at each center, starting in July and running through to October, after which I’ll spend some time analyzing the data and writing up my observations. Here’s my research plan…

Our previous studies at the UK Met Office Hadley Centre suggest that there are many features of software development for earth system modeling that make it markedly different from other types of software development, and which therefore affect the applicability of standard software engineering tools and techniques. Tools developed for commercial software tend not to cater for the demands of working with high performance code for parallel architectures, and usually do not fit well with the working practices of scientific teams. Scientific code development has challenges that don’t apply to other forms of software: the need to keep track of exactly which version of the program code was used in a particular experiment, the need to re-run experiments with precisely repeatable results, the need to build alternative versions of the software from a common code base for different kinds of experiments. Checking software “correctness” is hard because frequently the software must calculate approximate solutions to numerical problems for which there is no analytical solution. Because the overall goal is to build code to explore a theory, there is no oracle for what the outputs should be, and therefore conventional approaches to testing (and perhaps code quality in general) don’t apply.
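In practice a version-control revision number usually does the job of recording which code was used in an experiment. As a rough illustration of the bookkeeping involved, here’s a minimal Python sketch that fingerprints an experiment’s source files and run configuration, so that two runs can later be checked to have used exactly the same code and settings. The function names and the record layout are hypothetical, not taken from any modeling center’s tools.

```python
import hashlib
import json
from pathlib import Path


def fingerprint_sources(paths):
    """Return a SHA-256 digest over the contents of the given source files.

    Files are processed in sorted path order, so the digest does not depend on
    the order in which the paths were listed. Only each file's name (not its
    full path) is hashed, so the same code checked out in a different
    directory gives the same fingerprint.
    """
    digest = hashlib.sha256()
    for path in sorted(Path(p) for p in paths):
        data = path.read_bytes()
        digest.update(path.name.encode("utf-8") + b"\0")
        # A length prefix keeps the boundaries between files unambiguous.
        digest.update(len(data).to_bytes(8, "big"))
        digest.update(data)
    return digest.hexdigest()


def provenance_record(source_paths, config):
    """Hypothetical record tying one experiment to its exact code and settings."""
    # sort_keys makes the serialized config independent of dict ordering.
    config_text = json.dumps(config, sort_keys=True)
    return {
        "code_sha256": fingerprint_sources(source_paths),
        "config_sha256": hashlib.sha256(config_text.encode("utf-8")).hexdigest(),
        "config": config,
    }
```

Storing a record like this alongside each run’s output makes it possible to confirm later that a re-run used exactly the same code and configuration as the original.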

Despite this potential mismatch, the earth system modeling community has adopted (and sometimes adapted) many tools and practices from mainstream software engineering. These include version control, bug tracking, automated build and test processes, release planning, code reviews, frequent regression testing, and so on. Such tools may offer a number of potential benefits:

  • they may increase productivity by speeding up the development cycle, so that scientists can get their ideas into working code much faster;
  • they may improve verification, for example using code analysis tools to identify and remove (or even prevent) software errors;
  • they may improve the understandability and modifiability of computational models (making it easier to continue to evolve the models);
  • they may improve coordination, allowing a broader community to contribute to and make use of a shared code base for a wider variety of experiments;
  • they may improve scalability and performance, allowing code to be configured and optimized for a wider variety of high performance architectures (including massively parallel machines), and for a wider variety of grid resolutions.

This study will investigate which tools and practices have been adopted at the different centers, identify differences and similarities in how they are applied, and, as far as is possible, assess the effectiveness of these practices. We will also attempt to characterize the remaining challenges, and identify opportunities where additional tools and techniques might be adopted.

Specific questions for the study include:

  1. Verification – What techniques are used to ensure that the code matches the scientists’ understanding of what it should do? In traditional software engineering, this is usually taken to be a question of correctness (does the code do what it is supposed to?); however, for exploratory modeling it is just as often a question of understanding (have we adequately understood what happens when the model runs?). We will investigate the practices used to test the code, to validate it against observational data, and to compare different model runs against one another, and assess how effective these are at eliminating errors of correctness and errors of understanding.
  2. Coordination – How are the contributions from across the modeling community coordinated? In particular, we will examine the challenges of synchronizing the development processes for coupled models with the development processes of their component models, and how the differences in the priorities of different, overlapping communities of users affect this coordination.
  3. Division of responsibility – How are the responsibilities for coding, verification, and coordination distributed between different roles in the organization? In particular, we will examine how these responsibilities are divided across the scientists and other support roles such as ‘systems’ or ‘software engineering’ personnel. We will also explore expectations about the quality of code contributed by end-user scientists, and the potential for testing and review practices to affect its quality.
  4. Planning and release processes – How do modelers decide on priorities for model development, how do they decide which changes to tackle in a particular release of the model, and how do they navigate between computational feasibility and scientific priorities? We will also investigate how the change process is organized, and how changes are propagated to different sub-communities.
  5. Debugging – How do scientists currently debug the models, what types of bugs do they find in their code, and how do they find them? In particular, we will develop a categorization of model errors, to use as a basis for subsequent studies into new techniques for detecting and/or eliminating such errors.
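The verification question mentions comparing different model runs against one another. One common form of this is a bit-for-bit reproducibility check: re-running an experiment, or running it after a change that shouldn’t alter the answers, and confirming that every output value is identical. Here’s a minimal Python sketch of such a comparison, assuming (hypothetically) that each run’s output has already been loaded as a mapping from field names to lists of floats; the field names and values below are toy examples.

```python
import struct


def bit_identical(x, y):
    """True if two floats have exactly the same 64-bit IEEE 754 encoding."""
    return struct.pack("<d", x) == struct.pack("<d", y)


def compare_runs(run_a, run_b):
    """Compare two model runs field by field, bit for bit.

    run_a and run_b map a field name to a list of floats. Returns a pair
    (identical, diffs) where diffs maps every field that is not bit-for-bit
    identical to the largest absolute difference found in that field.
    """
    if run_a.keys() != run_b.keys():
        raise ValueError("runs have different sets of output fields")
    diffs = {}
    for name in sorted(run_a):
        a, b = run_a[name], run_b[name]
        if len(a) != len(b):
            raise ValueError(f"field {name!r} has different lengths in the two runs")
        if not all(bit_identical(x, y) for x, y in zip(a, b)):
            diffs[name] = max(abs(x - y) for x, y in zip(a, b))
    return (not diffs, diffs)


# Toy example: the hypothetical re-run differs in one precipitation value.
baseline = {"surface_temperature": [288.1, 287.9], "precipitation": [2.5, 3.0]}
rerun = {"surface_temperature": [288.1, 287.9], "precipitation": [2.5, 3.5]}
print(compare_runs(baseline, rerun))  # prints: (False, {'precipitation': 0.5})
```

Comparing raw bit patterns rather than using `==` means the check also notices differences such as `0.0` versus `-0.0`, which compare equal as numbers; reporting the largest absolute difference helps distinguish round-off-level changes from substantive ones.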

The study will be conducted through a mix of interviews and observational studies, focusing on particular changes to the model codes developed at each center. The proposed methodology is to identify a number of candidate code changes, including recently completed changes and current work-in-progress, and to build a “life story” for each such change, covering how each change was planned and conducted, what techniques were applied, and what problems were encountered. This will lead to a more detailed description of the current software development practices, which can then be compared and contrasted with studies of practices used for other types of software. The end result will be an identification of opportunities where existing tools and techniques can be readily adapted (with some clear indication of the potential benefits), along with a longer-term research agenda for problem areas where no suitable solutions currently exist.

This week we’re demoing Inflo at the Ontario Centres of Excellence Discovery Conference 2010. It’s given me a chance to play a little more with the demo, and create some new sample calculations (with Jonathan valiantly adding new features on the fly in response to my requests!). The idea of Inflo is that it should be an open source calculation tool – one that supports a larger community of people discussing and reaching consensus on the best way to calculate the answer to some (quantifiable) question.

For the demo this week, I re-did the calculation on how much of the remaining global fossil fuel reserves we can burn and still keep global warming within the target threshold of a +2°C rise over pre-industrial levels. I first did this calculation in a blog post back in the fall, but I’ve been keen to see if Inflo would provide a better way of sharing the calculation. Creating the model is still a little clunky (it is, after all, a very preliminary prototype), but I’m pleased with the results. Here’s a screenshot:

And here’s a live link to try it out. A few tips: the little grey circles under a node indicate there are some hidden subtrees. Double-clicking on one of these will expand it, while double-clicking on an expanded node will collapse everything below it, so you can explore the basis for each step in the calculation. The Node Editor tool bar on the left shows you the formula for the selected node, and any notes. Some of the comments in the “Description” field are hotlinks to data sources – mouseover the text to find them. Oh, and the arrows don’t always update properly when you change views – selecting a node in the graph should force them to update. Also, the units are propagated (and scaled for readability) automatically, which is why they sometimes look a little odd, e.g. “tonne of carbon” rather than “tonnes”. One of our key design decisions is to make the numbers as human-readable as possible, and always ensure correct units are displayed.
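To make the idea of a calculation graph with automatic unit propagation a bit more concrete, here’s a minimal Python sketch. It is not Inflo’s actual implementation; the `Quantity`, `Node` and `readable` names, the unit representation, and the thousand/million/billion scaling rule are all hypothetical choices for illustration. Each leaf node holds a value with units, each formula node combines its children, and units follow along automatically because the formulas operate on unit-carrying quantities.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

Dims = Tuple[Tuple[str, int], ...]  # sorted (unit name, exponent) pairs


def _normalise(units: dict) -> Dims:
    """Drop zero exponents and sort, so equal units always compare equal."""
    return tuple(sorted((name, exp) for name, exp in units.items() if exp != 0))


def _combine(a: Dims, b: Dims, sign: int) -> Dims:
    """Add (sign=1) or subtract (sign=-1) the exponents of b to/from those of a."""
    merged = dict(a)
    for name, exp in b:
        merged[name] = merged.get(name, 0) + sign * exp
    return _normalise(merged)


def unit_label(dims: Dims) -> str:
    num = [n if e == 1 else f"{n}^{e}" for n, e in dims if e > 0]
    den = [n if e == -1 else f"{n}^{-e}" for n, e in dims if e < 0]
    label = " ".join(num)
    if den:
        label = (label or "1") + "/" + " ".join(den)
    return label


@dataclass(frozen=True)
class Quantity:
    value: float
    dims: Dims = ()

    @staticmethod
    def of(value: float, units: dict) -> "Quantity":
        return Quantity(value, _normalise(units))

    def __mul__(self, other: "Quantity") -> "Quantity":
        return Quantity(self.value * other.value, _combine(self.dims, other.dims, 1))

    def __truediv__(self, other: "Quantity") -> "Quantity":
        return Quantity(self.value / other.value, _combine(self.dims, other.dims, -1))

    def __add__(self, other: "Quantity") -> "Quantity":
        if self.dims != other.dims:
            raise ValueError(f"cannot add {unit_label(self.dims)!r} to {unit_label(other.dims)!r}")
        return Quantity(self.value + other.value, self.dims)


def readable(q: Quantity) -> str:
    """Scale large values to thousand/million/billion so they are easy to read."""
    for factor, word in ((1e9, "billion"), (1e6, "million"), (1e3, "thousand")):
        if abs(q.value) >= factor:
            return f"{q.value / factor:.3g} {word} {unit_label(q.dims)}".rstrip()
    return f"{q.value:.3g} {unit_label(q.dims)}".rstrip()


@dataclass
class Node:
    """A calculation node: either a leaf holding a value, or a formula over its children."""
    name: str
    value: Optional[Quantity] = None
    formula: Optional[Callable[..., Quantity]] = None
    children: List["Node"] = field(default_factory=list)

    def evaluate(self) -> Quantity:
        if self.formula is None:
            return self.value
        # Units propagate because the formula operates on Quantity objects.
        return self.formula(*(child.evaluate() for child in self.children))


# Hypothetical two-input calculation: distance times an emission rate.
distance = Node("distance", value=Quantity.of(120, {"km": 1}))
rate = Node("emission rate", value=Quantity.of(0.2, {"kg": 1, "km": -1}))
trip = Node("trip emissions", formula=lambda d, r: d * r, children=[distance, rate])
print(readable(trip.evaluate()))  # prints: 24 kg
```

Note that a naive scaler like this one keeps the unit name as given, so a value of 2.5 million in units of “tonne of carbon” is displayed as “2.5 million tonne of carbon”, much like the slightly odd labels mentioned above.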

The demo should get across some of what we’re trying to do. The idea is to create a visual, web-based calculator that can be edited and shared; eventually we hope to build Wikipedia-like communities who will curate the calculations, to ensure that the appropriate sources of data are used, and that the results can be trusted. We’ll need to add more facilities for version management of calculations, and for linking discussions to (portions of) the graphs.

Here’s another example: Jono’s carbon footprint analysis of whether you should print a document or read it on the screen (double click the top node to expand the calculation).