I’ve been meaning to write a summary of the V&V techniques used for Earth System Models (ESMs) for ages, but never quite got round to it. However, I just had to put together a piece for a book chapter, and thought I would post it here to see if folks have anything to add (or argue with).
Verification and Validation for ESMs is hard because running the models is an expensive proposition (a fully coupled simulation run can take weeks to complete), and because there is rarely a “correct” result – expert judgment is needed to assess the model outputs.
However, it is helpful to distinguish between verification and validation, because the former can often be automated, while the latter cannot. Verification tests are objective tests of correctness. These include basic tests (usually applied after each code change) that the model will compile and run without crashing in each of its standard configurations, that a run can be stopped and restarted from the restart files without affecting the results, and that identical results are obtained when the model is run using different processor layouts. Verification would also include the built-in tests for conservation of mass and energy over the global system on very long simulation runs.
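The restart test described above can be sketched in a few lines. This is a minimal illustration using a toy model (a logistic map standing in for the real dynamics); the function names `step`, `write_restart`, `read_restart` and `restart_test` are hypothetical, not taken from any actual ESM.

```python
import struct

def step(state):
    """Advance the toy model one timestep (a logistic map stands in for real dynamics)."""
    x, t = state
    return (3.9 * x * (1.0 - x), t + 1)

def run(state, nsteps):
    for _ in range(nsteps):
        state = step(state)
    return state

def write_restart(state):
    """Serialize the full state losslessly: a 64-bit float plus a 64-bit step counter."""
    return struct.pack("<dq", *state)

def read_restart(blob):
    return struct.unpack("<dq", blob)

def restart_test(initial, nsteps, stop_at):
    """Return True if a stopped-and-restarted run is bit-identical to a continuous one."""
    continuous = run(initial, nsteps)
    halfway = read_restart(write_restart(run(initial, stop_at)))
    restarted = run(halfway, nsteps - stop_at)
    return write_restart(continuous) == write_restart(restarted)

print(restart_test((0.2, 0), nsteps=1000, stop_at=400))
```

The test only passes if the restart file captures every piece of state the model needs. If the restart were written in reduced precision, or left out a variable, the comparison would fail, and that is exactly the kind of error this test is meant to catch.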
In contrast, validation refers to science tests, where subjective judgment is needed. These include tests that the model simulates a realistic, stable climate, given stable forcings; that it matches the trends seen in observational data when subjected to historically accurate forcings; and that the means and variations (e.g. seasonal cycles) are realistic for the main climate variables (e.g. see Phillips et al, 2004).
While there is an extensive literature on the philosophical status of model validation in computational sciences (see for example, Oreskes et al (1994); Sterman (1994); Randall and Wielicki (1997); Stehr (2001)), much of it bears very little relation to practical techniques for ESM validation, and very little has been written on practical testing techniques for ESMs. In practice, testing strategies rely on a hierarchy of standard tests, starting with the simpler ones, and building up to the most sophisticated.
Pope and Davies (2002) give one such sequence for testing atmosphere models:
- Simplified tests – e.g. reduce 3D equations of motion to 2D horizontal flow (e.g. a shallow water testbed). This is especially useful if the reduction has an analytical solution, or if a reference solution is available. It also permits assessment of relative accuracy and stability over a wide parameter space, and hence is especially useful when developing new numerical routines.
- Dynamical core tests – test for numerical convergence of the dynamics with physical parameterizations replaced by a simplified physics model (e.g. no topography, no seasonal or diurnal cycle, simplified radiation).
- Single-column tests – allow testing of individual physical parameterizations separately from the rest of the model. A single column of data is used, with horizontal forcing prescribed from observations or from idealized profiles. This is useful for understanding a new parameterization, and for comparing interactions between several parameterizations, but doesn’t cover interaction with large-scale dynamics, nor interaction with adjacent grid points. This type of test also depends on the availability of observational datasets.
- Idealized aquaplanet – test the fully coupled atmosphere-ocean model, but with idealized sea-surface temperatures at all grid points. This allows for testing of numerical convergence in the absence of complications of orography and coastal effects.
- Uncoupled model components tested against realistic climate regimes – test each model component in stand-alone mode, with a prescribed set of forcings. For example, test the atmosphere on its own, with prescribed sea surface temperatures, sea-ice boundary conditions, solar forcings, and ozone distribution. Statistical tests are then applied to check for realistic mean climate and variability.
- Double-call tests. Run the full coupled model, and test a new scheme by calling both the old and new scheme at each timestep, but with the new scheme’s outputs not fed back into the model. This allows assessment of the performance of the new scheme in comparison with older schemes.
- Spin-up tests. Run the full ESM for just a few days of simulation (typically between 1 and 5 days of simulation), starting from an observed state. Such tests are cheap enough that they can be run many times, sampling across the initial state uncertainty. Then the average of a large number of such tests can be analyzed (Pope and Davies suggest that 60 is enough for statistical significance). This allows the results from different schemes to be compared, to explore differences in short term tendencies.
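To make the last item in this sequence concrete, here is a rough sketch of a spin-up ensemble comparison. Everything in it is made up for illustration: the two "schemes" are toy relaxation terms, the observed state and its uncertainty are arbitrary numbers, and only the ensemble size of 60 comes from the text above.

```python
import random
import statistics

def scheme_old(temp):
    """Hypothetical parameterization: relax temperature toward 288 K."""
    return -0.10 * (temp - 288.0)

def scheme_new(temp):
    """Hypothetical replacement with a slightly stronger relaxation."""
    return -0.12 * (temp - 288.0)

def short_run(temp0, scheme, ndays=5):
    """Integrate a few days with forward Euler (one step per day); return the net change."""
    temp = temp0
    for _ in range(ndays):
        temp += scheme(temp)
    return temp - temp0

def ensemble_tendency(scheme, observed=290.0, sigma=0.5, members=60, seed=1):
    """Mean and standard error of the 5-day change over perturbed initial states."""
    rng = random.Random(seed)
    changes = [short_run(rng.gauss(observed, sigma), scheme) for _ in range(members)]
    mean = statistics.mean(changes)
    stderr = statistics.stdev(changes) / len(changes) ** 0.5
    return mean, stderr

old_mean, old_err = ensemble_tendency(scheme_old)
new_mean, new_err = ensemble_tendency(scheme_new)
print(f"old scheme: {old_mean:+.3f} +/- {old_err:.3f} K over 5 days")
print(f"new scheme: {new_mean:+.3f} +/- {new_err:.3f} K over 5 days")
```

Using the same seed for both schemes means each pair of runs starts from the same perturbed initial state, so any difference in the mean tendency comes from the scheme rather than from the sampling.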
Whenever a code change is made to an ESM, in principle, an extensive set of simulation runs is needed to assess whether the change has a noticeable impact on the climatology of the model. This in turn requires a subjective judgment about whether minor variations are acceptable, or whether they add up to a significantly different climatology.
Because this testing is so expensive, a standard shortcut is to require exact reproducibility for minor changes, which can then be tested quickly through the use of bit comparison tests. These are automated checks over a short run (e.g. a few days of simulation time) that the outputs or restart files of two different model configurations are identical down to the least significant bits. This is useful for checking that a change didn’t break anything it shouldn’t, but requires that each change be designed so that it can be “turned off” (e.g. via run-time switches) to ensure previous experiments can be reproduced. Bit comparison tests can also check that different configurations give identical results. In effect, bit reproducibility over a short run is a proxy for testing that two different versions of the model will give the same climate over a long run. It’s much faster than testing the full simulations, and it catches most (but not all) errors that would affect the model climatology.
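As a rough illustration of what a bit comparison involves, the sketch below compares two fields of floating-point values by their raw 64-bit patterns. The helper names `bit_pattern` and `bit_compare` are hypothetical, and the values are arbitrary; real tests would read the fields from model output or restart files.

```python
import struct

def bit_pattern(value):
    """Return the raw 64-bit IEEE 754 pattern of a float as an integer."""
    return struct.unpack("<Q", struct.pack("<d", value))[0]

def bit_compare(field_a, field_b):
    """Return the indices where two equal-length sequences of floats differ in any bit."""
    if len(field_a) != len(field_b):
        raise ValueError("fields have different sizes")
    return [i for i, (a, b) in enumerate(zip(field_a, field_b))
            if bit_pattern(a) != bit_pattern(b)]

reference = [288.15, 0.1 + 0.2, 1013.25]
candidate = [288.15, 0.3, 1013.25]
print(bit_compare(reference, reference))  # [] : identical
print(bit_compare(reference, candidate))  # [1] : 0.1 + 0.2 and 0.3 differ in the last bits
```

Comparing bit patterns rather than using `==` is deliberate: it treats `0.0` and `-0.0` as different, and it does not trip over NaN values, which never compare equal to themselves.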
Bit comparison tests do have a number of drawbacks, however, in that they restrict the kinds of change that can be made to the model. Occasionally, bit reproducibility cannot be guaranteed from one version of the model to another, for example when there is a change of compiler, change of hardware, a code refactoring, or almost any kind of code optimization. The decision about whether to insist on bit reproducibility, or whether to allow it to be broken from one version of the model to the next, is a difficult trade-off between flexibility and ease of testing.
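One reason optimizations tend to break bit reproducibility is that floating-point addition is not associative, so merely reordering a sum can change the last bits of the result. The toy example below (the function names are hypothetical) sums the same three numbers in two different orders, much as an optimizing compiler or a different parallel reduction might.

```python
def sum_left_to_right(values):
    total = 0.0
    for v in values:
        total += v
    return total

def sum_pairwise(values):
    """Sum by recursive halving, one way a reordered or parallel reduction might group terms."""
    if not values:
        return 0.0
    if len(values) == 1:
        return values[0]
    mid = len(values) // 2
    return sum_pairwise(values[:mid]) + sum_pairwise(values[mid:])

values = [0.1, 0.2, 0.3]
a = sum_left_to_right(values)  # (0.1 + 0.2) + 0.3
b = sum_pairwise(values)       # 0.1 + (0.2 + 0.3)
print(a == b)                  # False
print(abs(a - b))              # a difference in the last bit only
```

Both answers are equally valid roundings of the true sum, which is why a failed bit comparison after such a change does not by itself indicate a bug.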
A number of simple practices can be used to help improve code sustainability and remove coding errors. These include running the code through multiple compilers, which is effective because different compilers give warnings about different language features, and some allow poor or ambiguous code which others will report. It’s better to identify and remove such problems when they are first inserted, rather than discover later on that it will take months of work to port the code to a new compiler.
Building conservation tests directly into the code also helps. These would typically be part of the coupler, and can check the global mass balance for carbon, water, salt, atmospheric aerosols, and so on. For example, the coupler needs to check that water flowing from rivers enters the ocean, and that the total mass of carbon is conserved as it cycles through atmosphere, oceans, ice, vegetation, and so on. Individual component models sometimes neglect such checks, as the balance isn’t necessarily conserved in a single component. However, for long runs of coupled models, such conservation tests are important.
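A built-in conservation check might look something like the following sketch. The reservoir names, sizes and flux rates are purely illustrative (arbitrary units, not real data), and the functions `total_mass`, `apply_fluxes` and `check_conservation` are hypothetical; a real coupler would sum over gridded fields rather than a handful of boxes.

```python
import math

def total_mass(reservoirs):
    """Global total over all reservoirs, using an accurate sum to limit rounding error."""
    return math.fsum(reservoirs.values())

def apply_fluxes(reservoirs, fluxes, dt):
    """Move mass between reservoirs; each flux is (source, sink, rate per unit time)."""
    updated = dict(reservoirs)
    for source, sink, rate in fluxes:
        updated[source] -= rate * dt
        updated[sink] += rate * dt
    return updated

def check_conservation(before, after, rel_tol=1e-12):
    """Raise if the global total drifted by more than rel_tol (relative)."""
    m0, m1 = total_mass(before), total_mass(after)
    if not math.isclose(m0, m1, rel_tol=rel_tol):
        raise ArithmeticError(f"mass not conserved: {m0!r} -> {m1!r}")
    return m1 - m0

# Illustrative water reservoirs in arbitrary units (not real data).
water = {"atmosphere": 13.0, "land": 60.0, "rivers": 2.0, "ocean": 1335.0}
fluxes = [("ocean", "atmosphere", 0.4), ("atmosphere", "land", 0.1),
          ("land", "rivers", 0.04), ("rivers", "ocean", 0.04)]
state = water
for _ in range(1000):
    new_state = apply_fluxes(state, fluxes, dt=0.01)
    check_conservation(state, new_state)  # run the check at every coupling step
    state = new_state
```

Because every flux removes from one reservoir exactly what it adds to another, the global total should change only by rounding error. A flux that is applied to the source but never arrives at the sink (river water that never reaches the ocean, say) would trip the check at once.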
Another useful strategy is to develop a verification toolkit for each model component, and for the entire coupled system. These contain a series of standard tests which users of the model can run themselves, on their own platforms, to confirm that the model behaves in the way it should in the local computation environment. They also provide the users with a basic set of tests for local code modifications made for a specific experiment. This practice can help to overcome the tendency of model users to test only the specific physical process they are interested in, while assuming the rest of the model is okay.
During development of model components, informal comparisons with models developed by other research groups can often lead to insights into how to improve the model, and can also serve as a method for identifying and confirming suspected coding errors. But more importantly, over the last two decades, model intercomparisons have come to play a critical role in improving the quality of ESMs through a series of formally organised Model Intercomparison Projects (MIPs).
In the early days, these projects focussed on comparisons of the individual components of ESMs, for example, the Atmosphere Model Intercomparison Project (AMIP), which began in 1990 (Gates, 1992). But by the time of the IPCC second assessment report, there was a widespread recognition that a more systematic comparison of coupled models was needed, which led to the establishment of the Coupled Model Intercomparison Projects (CMIP), which now play a central role in the IPCC assessment process (Meehl et al, 2000).
For example, CMIP3, which was organized for the fourth IPCC assessment, involved a massive effort by 17 modeling groups from 12 countries with 24 models (Meehl et al, 2007). As of September 2010, the list of MIPs maintained by the World Climate Research Program included 44 different model intercomparison projects (Pirani, 2010).
Model Intercomparison Projects bring a number of important benefits to the modeling community. Most obviously, they bring the community together with a common purpose, and hence increase awareness and collaboration between different labs. More importantly, they require the participants to reach a consensus on a standard set of model scenarios, which often entails some deep thinking about what the models ought to be able to do. Likewise, they require the participants to define a set of standard evaluation criteria, which then act as benchmarks for comparing model skill. Finally, they also produce a consistent body of data representing a large ensemble of model runs, which is then available for the broader community to analyze.
The benefits of these MIPs are consistent with reports of software benchmarking efforts in other research areas. For example, Sim et al (2003) report that when a research community that builds software tools comes together to create benchmarks, it frequently experiences a leap forward in research progress, arising largely from the insights gained from the process of reaching consensus on the scenarios and evaluation criteria to be used in the benchmark. However, the definition of precise evaluation criteria is an important part of the benchmark – without this, the intercomparison project can become unfocussed, with uncertain outcomes and without the huge leap forward in progress (Bueler, 2008).
Another form of model intercomparison is the use of model ensembles (Collins, 2007), which increasingly provide a more robust prediction system than single model runs, but which also play an important role in model validation:
- Multi-model ensembles – to compare models developed at different labs on a common scenario.
- Multi-model ensembles using variants of a single model – to compare different schemes for parts of the model, e.g. different radiation schemes.
- Perturbed physics ensembles – to explore probabilities of different outcomes, in response to systematically varying physical parameters in a single model.
- Varied initial conditions within a single model – to test the robustness of the model, and to better quantify probabilities for predicted climate change signals.
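To give a flavour of the perturbed physics idea, here is a toy sketch using a zero-dimensional energy balance model in place of a real ESM. The parameter ranges for albedo and effective emissivity are assumptions chosen for illustration, as is the solar constant of 1361 W/m²; the function names are hypothetical.

```python
import random
import statistics

SIGMA = 5.670374419e-8  # Stefan-Boltzmann constant, W m^-2 K^-4

def equilibrium_temperature(albedo, emissivity, solar_constant=1361.0):
    """Equilibrium temperature (K) of a zero-dimensional energy balance model."""
    absorbed = solar_constant * (1.0 - albedo) / 4.0
    return (absorbed / (emissivity * SIGMA)) ** 0.25

def perturbed_physics_ensemble(members=200, seed=0):
    """Sample uncertain parameters and collect the resulting equilibrium temperatures."""
    rng = random.Random(seed)
    results = []
    for _ in range(members):
        albedo = rng.uniform(0.28, 0.32)      # assumed range, for illustration only
        emissivity = rng.uniform(0.60, 0.63)  # assumed effective emissivity range
        results.append(equilibrium_temperature(albedo, emissivity))
    return results

temps = perturbed_physics_ensemble()
print(f"mean {statistics.mean(temps):.1f} K, spread (stdev) {statistics.stdev(temps):.2f} K")
```

The spread of the resulting temperatures gives a crude picture of how parameter uncertainty maps onto outcome uncertainty. In a real perturbed physics ensemble each member is a full model run, so the choice of which parameters to vary, and over what range, matters a great deal.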