{"id":1917,"date":"2010-09-18T13:11:43","date_gmt":"2010-09-18T17:11:43","guid":{"rendered":"http:\/\/www.easterbrook.ca\/steve\/?p=1917"},"modified":"2010-09-19T03:20:30","modified_gmt":"2010-09-19T07:20:30","slug":"verification-and-validation-of-earth-system-models","status":"publish","type":"post","link":"http:\/\/www.easterbrook.ca\/steve\/2010\/09\/verification-and-validation-of-earth-system-models\/","title":{"rendered":"Verification and Validation of Earth System Models"},"content":{"rendered":"<p>I&#8217;ve been meaning to write a summary of the V&amp;V techniques used for Earth System Models (ESMs) for ages, but never quite got round to it. However, I just had to put together a piece for a book chapter, and thought I would post it here to see if folks have anything to add (or argue with).<\/p>\n<p>Verification and Validation for ESMs is hard because running the models is an expensive proposition (a fully coupled simulation run can take weeks to complete), and because there is rarely a &#8220;correct&#8221; result &#8211; expert judgment is needed to assess the model outputs.<\/p>\n<p>However, it is helpful to distinguish between verification and validation, because the former can often be automated, while the latter cannot. Verification tests are objective tests of correctness. These include basic tests (usually applied after each code change) that the model will compile and run without crashing in each of its standard configurations, that a run can be stopped and restarted from the restart files without affecting the results, and that identical results are obtained when the model is run using different processor layouts. Verification would also include the built-in tests for conservation of mass and energy over the global system on very long simulation runs.<\/p>\n<p>In contrast, validation refers to science tests, where subjective judgment is needed. 
These include tests that the model simulates a realistic, stable climate, given stable forcings, that it matches the trends seen in observational data when subjected to historically accurate forcings, and that the means and variations (e.g. seasonal cycles) are realistic for the main climate variables (e.g. see\u00a0<a title=\"Phillips et al: Evaluating Parameterizations in Global Circulation Models\" href=\"http:\/\/198.128.245.140\/projects\/capt\/publications\/phillips_et_al_2004.pdf\" target=\"_blank\">Phillips et al, 2004<\/a>).<\/p>\n<p>While there is an extensive literature on the philosophical status of model\u00a0validation in computational sciences (see, for example, <a title=\"Oreskes et al: Verification and Validation of Numerical Models in the Earth Sciences\" href=\"http:\/\/www.sciencemag.org\/cgi\/content\/short\/263\/5147\/641\" target=\"_blank\">Oreskes et al<\/a> (1994);\u00a0<a title=\"Sterman: The Meaning of Models\" href=\"http:\/\/adsabs.harvard.edu\/abs\/1994Sci...264..329S\" target=\"_blank\">Sterman<\/a> (1994); <a title=\"Measurements, Models, and Hypotheses in the Atmospheric Sciences\" href=\"http:\/\/journals.ametsoc.org\/doi\/abs\/10.1175\/1520-0477%281997%29078%3C0399%3AMMOHIT%3E2.0.CO%3B2\" target=\"_blank\">Randall and Wielicki<\/a> (1997); <a title=\"Models as Focusing Tools: Linking Nature and the Social  World\" href=\"http:\/\/www.nhbs.com\/models_in_environmental_research_tefno_105434.html&amp;tab_tag=contents\" target=\"_blank\">Stehr<\/a> (2001)), much of it bears\u00a0very little relation to practical techniques for ESM validation, and very little\u00a0has been written on practical testing techniques for ESMs. 
In practice, testing\u00a0strategies rely on a hierarchy of standard tests, starting with the simpler ones,\u00a0and building up to the most sophisticated.<\/p>\n<p><a title=\"Testing and Evaluating Atmospheric Climate Models\" href=\"http:\/\/www.computer.org\/portal\/web\/csdl\/doi\/10.1109\/MCISE.2002.1032431\" target=\"_blank\">Pope and Davies<\/a> (2002) give one such sequence for testing atmosphere\u00a0models:<\/p>\n<ul>\n<li>Simplified tests &#8211; reduce the 3D equations of motion to 2D horizontal flow\u00a0(e.g. a shallow water testbed). This is especially useful if the reduction has\u00a0an analytical solution, or if a reference solution is available. It also permits\u00a0assessment of relative accuracy and stability over a wide parameter space,\u00a0and hence is especially useful when developing new numerical routines.<\/li>\n<li>Dynamical core tests &#8211; test for numerical convergence of the dynamics with\u00a0physical parameterizations replaced by a simplified physics model (e.g. no\u00a0topography, no seasonal or diurnal cycle, simplified radiation).<\/li>\n<li>Single-column tests &#8211; allow testing of individual physical parameterizations separately from the rest of the model. A single column of data is\u00a0used, with horizontal forcing prescribed from observations or from idealized profiles. This is useful for understanding a new parameterization, and\u00a0for comparing interactions between several parameterizations, but doesn\u2019t\u00a0cover interaction with large-scale dynamics, nor interaction with adjacent\u00a0grid points. This type of test also depends on the availability of observational\u00a0datasets.<\/li>\n<li>Idealized aquaplanet &#8211; test the fully coupled atmosphere-ocean model, but\u00a0with idealized sea-surface temperatures at all grid points. 
This allows for\u00a0testing of numerical convergence in the absence of complications from orography and coastal effects.<\/li>\n<li>Uncoupled model components tested against realistic climate regimes &#8211;\u00a0test each model component in stand-alone mode, with a prescribed set\u00a0of forcings. For example, test the atmosphere on its own, with prescribed\u00a0sea surface temperatures, sea-ice boundary conditions, solar forcings, and\u00a0ozone distribution. Statistical tests are then applied to check for realistic\u00a0mean climate and variability.<\/li>\n<li>Double-call tests &#8211; run the full coupled model, and test a new scheme by\u00a0calling both the old and new scheme at each timestep, but with the new\u00a0scheme\u2019s outputs not fed back into the model. This allows assessment of\u00a0the performance of the new scheme in comparison with older schemes.<\/li>\n<li>Spin-up tests &#8211; run the full ESM for just a few days of simulation (typically\u00a0between 1 and 5 days), starting from an observed state. Such\u00a0tests are cheap enough that they can be run many times, sampling across\u00a0the initial state uncertainty. Then the average of a large number of such\u00a0tests can be analyzed (Pope and Davies suggest that 60 is enough for\u00a0statistical significance). This allows the results from different schemes to\u00a0be compared, to explore differences in short-term tendencies.<\/li>\n<\/ul>\n<p>Whenever a code change is made to an ESM, in principle, an extensive set\u00a0of simulation runs is needed to assess whether the change has a noticeable\u00a0impact on the climatology of the model. 
This in turn requires a subjective\u00a0judgment about whether minor variations are acceptable, or\u00a0whether they add up to a significantly different climatology.<\/p>\n<p>Because this testing is so expensive, a standard shortcut is to require exact\u00a0reproducibility for minor changes, which can then be tested quickly through\u00a0the use of bit comparison tests. These are automated checks over a short run\u00a0(e.g. a few days of simulation time) that the outputs or restart files of two\u00a0different model configurations are identical down to the least significant bits.\u00a0This is useful for checking that a change didn\u2019t break anything it shouldn\u2019t,\u00a0but requires that each change be designed so that it can be \u201cturned off\u201d (e.g.\u00a0via run-time switches) to ensure previous experiments can be reproduced. Bit\u00a0comparison tests can also check that different configurations give identical\u00a0results. In effect, bit reproducibility over a short run is a proxy for testing\u00a0that two different versions of the model will give the same climate over a long\u00a0run. It\u2019s much faster than testing the full simulations, and it catches most\u00a0(but not all) errors that would affect the model climatology.<\/p>\n<p>Bit comparison tests do have a number of drawbacks, however, in that they\u00a0restrict the kinds of change that can be made to the model. Occasionally,\u00a0bit reproducibility cannot be guaranteed from one version of the model to\u00a0another, for example when there is a change of compiler, change of hardware, a\u00a0code refactoring, or almost any kind of code optimization. 
The decision about\u00a0whether to insist on bit reproducibility, or whether to allow it to be broken\u00a0from one version of the model to the next, is a difficult trade-off between\u00a0flexibility and ease of testing.<\/p>\n<p>A number of simple practices can be used to help improve code sustainability and remove coding errors. These include running the code through multiple\u00a0compilers, which is effective because different compilers give warnings about\u00a0different language features, and some allow poor or ambiguous code which\u00a0others will report. It\u2019s better to identify and remove such problems when they\u00a0are first inserted, rather than discover later on that it will take months of\u00a0work to port the code to a new compiler.<\/p>\n<p>Building conservation tests directly into the code also helps. These would\u00a0typically be part of the coupler, and can check the global mass balance for\u00a0carbon, water, salt, atmospheric aerosols, and so on. For example, the coupler\u00a0needs to check that water flowing from rivers enters the ocean, and that the total mass of carbon is conserved as it cycles through atmosphere, oceans, ice,\u00a0vegetation, and so on. Individual component models sometimes neglect such\u00a0checks, as the balance isn\u2019t necessarily conserved in a single component. However, for long runs of coupled models, such conservation tests are important.<\/p>\n<p>Another useful strategy is to develop a verification toolkit for each model\u00a0component, and for the entire coupled system. These toolkits contain a series of standard tests which users of the model can run themselves, on their own platforms, to confirm that the model behaves in the way it should in the local\u00a0computation environment. They also provide the users with a basic set of tests\u00a0for local code modifications made for a specific experiment. 
This practice can\u00a0help to overcome the tendency of model users to test only the specific physical\u00a0process they are interested in, while assuming the rest of the model is okay.<\/p>\n<p>During development of model components, informal comparisons with models developed by other research groups can often lead to insights into how to\u00a0improve the model, and can also serve as a method for identifying and confirming suspected coding errors. But more importantly, over the last two decades, model\u00a0intercomparisons have come to play a critical role in improving the quality of\u00a0ESMs through a series of formally organised Model Intercomparison Projects\u00a0(MIPs).<\/p>\n<p>In the early days, these projects focussed on comparisons of the individual\u00a0components of ESMs, for example, the Atmospheric Model Intercomparison\u00a0Project (AMIP), which began in 1990 (<a title=\"AMIP: The Atmospheric Model Intercomparison Project\" href=\"http:\/\/journals.ametsoc.org\/doi\/abs\/10.1175\/1520-0477%281992%29073%3C1962%3AATAMIP%3E2.0.CO%3B2\" target=\"_blank\">Gates, 1992<\/a>). But by the time of the\u00a0IPCC second assessment report, there was a widespread recognition that a\u00a0more systematic comparison of coupled models was needed, which led to the\u00a0establishment of the Coupled Model Intercomparison Projects (CMIP), which\u00a0now play a central role in the IPCC assessment process (<a title=\"The Coupled Model Intercomparison Project (CMIP).\" href=\"http:\/\/adsabs.harvard.edu\/abs\/2000BAMS...81..313M\" target=\"_blank\">Meehl et al, 2000<\/a>).<\/p>\n<p>For example, CMIP3, which was organized for the fourth IPCC assessment,\u00a0involved a massive effort by 17 modeling groups from 12 countries with 24\u00a0models (<a title=\"The WCRP CMIP3 multimodel dataset: A new ero in climate change research\" href=\"http:\/\/en.scientificcommons.org\/50070114\" target=\"_blank\">Meehl et al, 2007<\/a>). 
As of September 2010, the list of MIPs maintained by the World Climate Research Program included 44 different model\u00a0intercomparison projects (<a title=\"List of MIPS\" href=\"http:\/\/www.clivar.org\/organization\/wgcm\/pro jects.php\" target=\"_blank\">Pirani, 2010<\/a>).<\/p>\n<p>Model Intercomparison Projects bring a number of important benefits to\u00a0the modeling community. Most obviously, they bring the community together\u00a0with a common purpose, and hence increase awareness and collaboration between different labs. More importantly, they require the participants to reach\u00a0a consensus on a standard set of model scenarios, which often entails some\u00a0deep thinking about what the models ought to be able to do. Likewise, they\u00a0require the participants to define a set of standard evaluation criteria, which\u00a0then act as benchmarks for comparing model skill. Finally, they also produce\u00a0a consistent body of data representing a large ensemble of model runs, which\u00a0is then available for the broader community to analyze.<\/p>\n<p>The benefits of these MIPs are consistent with reports of software benchmarking efforts in other research areas. For example, <a title=\"Using benchmarking to advance research: a challenge to software engineering\" href=\"http:\/\/portal.acm.org\/citation.cfm?id=776826\" target=\"_blank\">Sim et al<\/a> (2003) report\u00a0that when a research community that builds software tools comes together\u00a0to create benchmarks, it frequently experiences a leap forward in research\u00a0progress, arising largely from the insights gained from the process of reaching\u00a0consensus on the scenarios and evaluation criteria to be used in the benchmark. 
However, the definition of precise evaluation criteria is an important\u00a0part of the benchmark &#8211; without this, the intercomparison project can become\u00a0unfocussed, with uncertain outcomes and without the huge leap forward in\u00a0progress (<a title=\"Lessons from the short history of ice sheet model intercomparison\" href=\"http:\/\/www.the-cryosphere-discuss.net\/2\/399\/2008\/tcd-2-399-2008.html\" target=\"_blank\">Bueler, 2008<\/a>).<\/p>\n<p>Another form of model intercomparison is the use of model ensembles (<a title=\"Ensembles and probabilities: a new era in the prediction of climate change\" href=\"http:\/\/rsta.royalsocietypublishing.org\/content\/365\/1857\/1957\" target=\"_blank\">Collins,\u00a02007<\/a>), which increasingly provide a more robust prediction system than single\u00a0model runs, but which also play an important role in model validation:<\/p>\n<ul>\n<li>Multi-model ensembles \u2013 to compare models developed at different labs\u00a0on a common scenario.<\/li>\n<li>Multi-model ensembles using variants of a single model \u2013 to compare different schemes for parts of the model, e.g. different radiation schemes.<\/li>\n<li>Perturbed physics ensembles \u2013 to explore probabilities of different outcomes, in response to systematically varying physical parameters in a single model.<\/li>\n<li>Varied initial conditions within a single model \u2013 to test the robustness\u00a0of the model, and to better quantify probabilities for predicted climate\u00a0change signals.<\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>I&#8217;ve been meaning to write a summary of the V&amp;V techniques used for Earth System Models (ESMs) for ages, but never quite got round to it. 
However, I just had to put together a piece for a book chapter, and thought I would post it here to see if folks have anything to add (or [&hellip;]<\/p>\n","protected":false},"author":392,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[10],"tags":[],"aioseo_notices":[],"jetpack_sharing_enabled":true,"jetpack_featured_media_url":"","_links":{"self":[{"href":"http:\/\/www.easterbrook.ca\/steve\/wp-json\/wp\/v2\/posts\/1917"}],"collection":[{"href":"http:\/\/www.easterbrook.ca\/steve\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/www.easterbrook.ca\/steve\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/www.easterbrook.ca\/steve\/wp-json\/wp\/v2\/users\/392"}],"replies":[{"embeddable":true,"href":"http:\/\/www.easterbrook.ca\/steve\/wp-json\/wp\/v2\/comments?post=1917"}],"version-history":[{"count":2,"href":"http:\/\/www.easterbrook.ca\/steve\/wp-json\/wp\/v2\/posts\/1917\/revisions"}],"predecessor-version":[{"id":1920,"href":"http:\/\/www.easterbrook.ca\/steve\/wp-json\/wp\/v2\/posts\/1917\/revisions\/1920"}],"wp:attachment":[{"href":"http:\/\/www.easterbrook.ca\/steve\/wp-json\/wp\/v2\/media?parent=1917"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/www.easterbrook.ca\/steve\/wp-json\/wp\/v2\/categories?post=1917"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/www.easterbrook.ca\/steve\/wp-json\/wp\/v2\/tags?post=1917"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}