In my last two posts, I demolished the idea that climate models need Independent Verification and Validation (IV&V), and I described the idea of a toolbox approach to V&V. Both posts were attacking myths: in the first case, the myth that an independent agent should be engaged to perform IV&V on the models, and in the second, the myth that you can critique the V&V of climate models without knowing anything about how they are currently built and tested.

I now want to expand on the latter point, and explain how the day-to-day practices of climate modellers taken together constitute a robust validation process, and that the only way to improve this validation process is just to do more of it (i.e. give the modeling labs more funds to expand their current activities, rather than to do something very different).

The most common mistake made by people discussing validation of climate models is to assume that a climate model is a thing-in-itself, and that the goal of validation is to demonstrate that some property holds of this thing. And whatever that property is, the assumption is that it can be measured without reference to the model’s scientific milieu, and in particular without reference to the model’s history and the processes by which it was constructed.

This mistake leads people to talk of validation in terms of how well “the model” matches observations, or how well “the model” matches the processes in some real world system. This approach to validation is, as Oreskes et al pointed out, quite impossible. The models are numerical approximations of complex physical phenomena. You can verify that the underlying equations are coded correctly in a given version of the model, but you can never validate that a given model accurately captures real physical processes, because it never will accurately capture them. Or as George Box summed it up: “All models are wrong…” (we’ll come back to the second half of the quote later).

The problem is that there is no such thing as “the model”. The body of code that constitutes a modern climate model actually represents an enormous number of possible models, each corresponding to a different way of configuring that code for a particular run. Furthermore, this body of code isn’t a static thing. The code is changed on a daily basis, through a continual process of experimentation and model improvement. Often these changes are done in parallel, so that there are multiple versions at any given moment, being developed along multiple lines of investigation. Sometimes these lines of evolution are merged, to bring a number of useful enhancements together into a single version. Occasionally, the lines diverge enough to cause a fork: a point at which they are different enough that it just becomes too hard to reconcile them (see, for example, this visualization of the evolution of ocean models). A forked model might at some point be given a new name, but the process by which a model gets a new name is rather arbitrary.

Occasionally, a modeling lab will label a particular snapshot of this evolving body of code as an “official release”. An official release has typically been tested much more extensively, in a number of standard configurations for a variety of different platforms. It’s likely to be more reliable, and therefore easier for users to work with. By more reliable here, I mean relatively free from coding defects. In other words, it is better verified than other versions, but not necessarily better validated (I’ll explain why shortly). In many cases, official releases also contain some significant new science (e.g. new parameterizations), and these scientific enhancements will be described in a set of published papers.

However, an official release isn’t a single model either. Again it’s just a body of code that can be configured to run as any of a huge number of different models, and it’s not unchanging either – as with all software, there will be occasional bugfix releases applied to it. Oh, and did I mention that to run a model, you have to make use of a huge number of ancillary datafiles, which define everything from the shape of the coastlines and land surfaces, to the specific carbon emissions scenario to be used? Any change to these effectively gives a different model too.

So, if you’re hoping to validate “the model”, you have to say which one you mean: which configuration of which code version of which line of evolution, and with which ancillary files. I suppose the response from those clamouring for something different in the way of model validation would be “well, the one used for the IPCC projections, of course”. Which is a little tricky, because each lab produces a large number of different runs for the CMIP process that provides input to the IPCC, and each of these is likely to involve a different model configuration.
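To make the point concrete, here is a minimal sketch of what it takes to name one specific model. It is purely illustrative: the class, the field names, and every label in it are made up for this example, and it simplifies things by assuming every code version exists on every line of evolution, which real model histories don't guarantee.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class ModelIdentity:
    """One concrete 'model': everything that has to be pinned down to name it.

    All fields are hypothetical labels, not a real lab's naming scheme.
    """
    line_of_evolution: str   # which development line (branch) of the code
    code_version: str        # which revision along that line
    configuration: str       # how the code is configured for the run
    ancillary_files: tuple   # which input datasets the run uses


def enumerate_models(lines, versions, configurations, ancillary_sets):
    """Yield every distinct model identity implied by the given choices."""
    for line, version, config, ancil in product(
        lines, versions, configurations, ancillary_sets
    ):
        yield ModelIdentity(line, version, config, tuple(ancil))


# Entirely made-up labels, just to show how quickly the space grows.
models = list(
    enumerate_models(
        lines=["trunk", "ocean-branch"],
        versions=["r100", "r101", "r102"],
        configurations=["atmosphere-only", "coupled"],
        ancillary_sets=[
            ("coastlines-a", "scenario-1"),
            ("coastlines-a", "scenario-2"),
        ],
    )
)

# Two lines x three versions x two configurations x two ancillary sets.
print(len(models))
```

Even with only a couple of options along each dimension, the count multiplies up quickly, and changing a single ancillary file is enough to give a different identity.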

But let’s say for the sake of argument that we could agree on a specific model configuration that ought to be “validated”. What will we do to validate it? What does validation actually mean? The Oreskes paper I mentioned earlier already demonstrated that comparison with real world observations, while interesting, does not constitute “validation”. The model will never match the observations exactly, so the best we’ll ever get along these lines is an argument that, on balance, given the sum total of the places where there’s a good match and the places where there’s a poor match, the model does better or worse than some other model. This isn’t validation, and furthermore it isn’t even a sensible way of thinking about validation.

At this point many commentators stop, and argue that if validation of a model isn’t possible, then the models can’t be used to support the science (or more usually, they mean they can’t be used for IPCC projections). But this is a strawman argument, based on a fundamental misconception of what validation is all about. Validation isn’t about checking that a given instance of a model satisfies some given criteria. Validation is about fitness for purpose, which means it’s not about the model at all, but about the relationship between a model and the purposes to which it is put. Or more precisely, it’s about the relationship between particular ways of building and configuring models and the ways in which runs produced by those models are used.

Furthermore, the purposes to which models are put and the processes by which they are developed co-evolve. The models evolve continually, and our ideas about what kinds of runs we might use them for evolve continually, which means validation must take this ongoing evolution into account. To summarize, validation isn’t about a property of some particular model instance; it’s about the whole process of developing and using models, and how this process evolves over time.

Let’s take a step back a moment, and ask what is the purpose of a climate model. The second half of the George Box quote is “…but some models are useful”. Climate models are tools that allow scientists to explore their current understanding of climate processes, to build and test theories, and to explore the consequences of those theories. In other words we’re dealing with three distinct systems:

[Figure: We're dealing with relationships between three different systems: the theoretical system, the calculational system, and the observational system.]

There does not need to be any clear relationship between the calculational system and the observational system – I didn’t include such a relationship in my diagram. For example, climate models can be run in configurations that don’t match the real world at all: e.g. a waterworld with no landmasses, or a world in which interesting things are varied: the tilt of the pole, the composition of the atmosphere, etc. These models are useful, and the experiments performed with them may be perfectly valid, even though they differ deliberately from the observational system.

What really matters is the relationship between the theoretical system and the observational system: in other words, how well does our current understanding (i.e. our theories) of climate explain the available observations (and of course the inverse: what additional observations might we make to help test our theories). When we ask questions about likely future climate changes, we’re not asking this question of the calculational system, we’re asking it of the theoretical system; the models are just a convenient way of probing the theory to provide answers.

By the way, when I use the term theory, I mean it in exactly the way it’s used throughout the sciences: a theory is the best current explanation of a given set of phenomena. The word “theory” doesn’t mean knowledge that is somehow more tentative than other forms of knowledge; a theory is actually the kind of knowledge that has the strongest epistemological basis of any kind of knowledge, because it is supported by the available evidence, and best explains that evidence. A theory might not be capable of providing quantitative predictions (but it’s good when it does), but it must have explanatory power.

In this context, the calculational system is valid as long as it can offer insights that help to understand the relationship between the theoretical system and the observational system. A model is useful as long as it helps to improve our understanding of climate, and to further the development of new (or better) theories. So a model that might have been useful (and hence valid) thirty years ago might not be useful today. If the old approach to modelling no longer matches current theory, then it has lost some or all of its validity. The model’s correspondence (or lack of) to the observations hasn’t changed (*), nor has its predictive power. But its utility as a scientific tool has changed, and hence its validity has changed.

[(*) except that the accuracy of the observations may have changed in the meantime, due to the ongoing process of discovering and resolving anomalies in the historical record.]

The key questions for validation then, are to do with how well the current generation of models (plural) support the discovery of new theoretical knowledge, and whether the ongoing process of improving those models continues to enhance their utility as scientific tools. We could focus this down to specific things we could measure by asking whether each individual change to the model is theoretically justified, and whether each such change makes the model more useful as a scientific tool.

Doing this requires a detailed study of day-to-day model development practices, and of the extent to which these are closely tied with the rest of climate science (e.g. field campaigns, process studies, etc). It also takes in questions such as how modeling centres decide on their priorities (e.g. which new bits of science to get into the models sooner), and how each individual change is evaluated. In this approach, validation proceeds by checking whether the individual steps taken to construct and test changes to the code add up to a sound scientific process, and how good this process is at incorporating the latest theoretical ideas. And we ought to be able to demonstrate a steady improvement in the theoretical basis for the model. An interesting quirk here is that sometimes an improvement to the model from a theoretical point of view reduces its skill at matching observations; this happens particularly when we’re replacing bits of the model that were based on empirical parameters with an implementation that has a stronger theoretical basis, because the empirical parameters were tuned to give a better climate simulation, without necessarily being well understood. In the approach I’m describing, this would be an indicator of an improvement in validity, even while it reduces the correspondence with observations. If on the other hand we based our validation on some measure of correspondence with observations, such a step would reduce the validity of the model!

But what does all of this tell us about whether it’s “valid” to use the models to produce projections of climate change into the future? Well, recall that when we ask for projections of future climate change, we’re not asking the question of the calculational system, because all that would result in is a number, or range of numbers, that are impossible to interpret, and therefore meaningless. Instead we’re asking the question of the theoretical system: given the sum total of our current theoretical understanding of climate, what is likely to happen in the future, under various scenarios for expected emissions and/or concentrations of greenhouse gases? If the models capture our current theoretical understanding well, then running the scenario on the model is a valid thing to do. If the models do a poor job of capturing our theoretical understanding, then running the models on these scenarios won’t be very useful.

Note what is happening here: when we ask climate scientists for future projections, we’re asking the question of the scientists, not of their models. The scientists will apply their judgement to select appropriate versions/configurations of the models to use, they will set up the runs, and they will interpret the results in the light of what is known about the models’ strengths and weaknesses and about any gaps between the computational models and the current theoretical understanding. And they will add all sorts of caveats to the conclusions they draw from the model runs when they present their results.

And how do we know whether the models capture our current theoretical understanding? By studying the processes by which the models are developed (i.e. continually evolved) by the various modeling centres, and examining how good each centre is at getting the latest science into the models. And by checking that whenever there are gaps between the models and the theory, these are adequately described by the caveats in the papers published about experiments with the models.

Summary: It is a mistake to think that validation is a post-hoc process to be applied to an individual “finished” model to ensure it meets some criteria for fidelity to the real world. In reality, there is no such thing as a finished model, just many different snapshots of a large set of model configurations, steadily evolving as the science progresses. And fidelity of a model to the real world is impossible to establish, because the models are approximations. In reality, climate models are tools to probe our current theories about how climate processes work. Validity is the extent to which climate models match our current theories, and the extent to which the process of improving the models keeps up with theoretical advances.


16 Comments

  1. Hi Steve. I didn’t get past the first paragraph, where you said: “I demolished the … myth that you can critique the V&V of climate models without knowing anything about how they are currently built and tested.”

    Why is Climate Science so insular? IMHO, its base of authority must be broadened. I know for a fact that if you have a new first-of-its-kind nuclear facility, the consensus is to NOT assume that conventional software quality assurance (SQA) processes (IV&V included) are appropriate. Unconventional SQA is expected. So the basis is broadened. You wind up with two levels of software IV&V.

    The first is IV&V on the scientific and engineering software itself, performed by the vendors or national labs on their own software that is to be used for the design and analysis of the new facility. Their established SQA procedures are followed, tailored appropriately for the software’s new usage.

    The second level of SQA (and an actual, “independent” organization at the first-of-its-kind nuclear facility I worked at) performs, as part of its function, IV&V on the SQA processes used in the first level of IV&V. (It also helps write the procedures used for the first level IV&V.) The requirements that this second SQA organization follows when performing IV&V on IV&V are NOT the software’s design requirements, but the SQA requirements that, by Federal regulations, all nuclear facilities MUST meet in order to assure that the public is not being put at risk. This gives the public confidence that the approach, scope, level of effort, and rigor used to perform the first IV&V are appropriate for the software’s usage at the new nuclear facility.

    Notice that the necessary domain expertise required at the second level IV&V does not entirely overlap the expertise required at the first level IV&V, which does not entirely overlap the domain expertise required for design and analysis. I guess I could put it another way: I demolish the myth that you can critique the V&V of climate models without knowing anything about everything. But I won’t. :-)

    Seriously, why does Climate Science not take a similar approach? Where is the SQA documentation that provides the required objective evidence supporting the unconventional SQA Climate Science wants to perform on their models? What I have outlined is a proven approach to unconventional SQA IV&V, and has been shown in real life to work well. (As well as could be expected for something as controversial as nuclear facilities). You could even get nuclear SQA experts’ consensus on the climate model SQA procedures.

    [George: if you can't get past the first paragraph of my posts, then no wonder you've nothing constructive to add. I suggest actually reading the post and responding to the arguments therein, rather than repeating your standard mantra ad nauseam - Steve]

  2. In 1994, Oreskes and others published an article in Science entitled “Verification, Validation, and Confirmation of Numerical Models in the Earth Sciences.” In this paper they argue, from a philosophical point of view, that verification and validation of numerical models in the earth sciences is impossible due to the fact that natural systems are not closed, so we can only know about them incompletely and by approximation. Although surfacing significant philosophical issues for verification, validation and truth in numerical models, unfortunately their arguments to support this thesis fail frequently due to conflation and misuse of the very terms and concepts the paper is intending to elucidate. They correctly argue that numerical accuracy of the mathematical components of a model can be verified. They also hold that the models that use such mathematical constructs (the algorithms and code) represent descriptions of systems that are not closed, and therefore use input parameters and fields that are not and cannot be completely known. But this is not a successful argument against the possibility of model verification; rather, it is an issue of whether the application of a numerical model appropriately represents the problem being studied. Hence, this is a problem in model validation or model legitimacy, not one in verification.

    In terms of model validation, Oreskes and others (1994) discuss establishing the “legitimacy” of the model as akin to validation. As far as it goes, this representation is accurate; but they claim further that a model that is internally consistent and evinces no detectable flaw can be considered as valid. This is a necessary, but not sufficient, condition for validation of models, and it derives from the philosophical usage of establishing valid logical arguments, not from practices in computational physics. Hence, once again their arguments are using terms from the philosophical literature that carry different technical meaning and reference from those same terms in the scientific literature. Such misuse is clear when they say that it is “misleading to suggest that a model is an accurate representation of physical reality” (Oreskes, et al., p.642). In point of fact, the intent of the scientific model is to represent reality or a restricted and defined simplification of physical processes, and validation is specifically the process that demonstrates the degree to which the conceptual model (the mathematical representation) actually maps to reality. The fact that a model is an approximation to reality does not mean such representation is “not even a theoretical possibility” (ibid, p.642) due to incompleteness of the datasets. Rather, such analysis and mapping dictates exactly what validation of the model means, and what the limits of applicability of the model are. Establishing validation means establishing the degree to which the conceptual model is even supposed to encompass physical reality.

    Verification, Validation, and Solution Quality in Computational Physics: CFD Methods applied to Ice Sheet Physics
    I’d have to agree with this gentleman doing Ice Sheet modeling and Roache writing about computational physics V&V: Oreskes’ effete philosophizing wasn’t only wrong, it’s not even useful.

    I can completely agree with this:

    In this context, the calculational system is valid as long as it can offer insights that help to understand the relationship between the theoretical system and the observational system. A model is useful as long as it helps to improve our understanding of climate, and to further the development of new (or better) theories. So a model that might have been useful (and hence valid) thirty years ago might not be useful today. If the old approach to modelling no longer matches current theory, then it has lost some or all of its validity. The model’s correspondence (or lack of) to the observations hasn’t changed (*), nor has its predictive power. But its utility as a scientific tool has changed, and hence its validity has changed.

    You’ve defined your intended use to provide the proper context for the descriptor “valid.”

    Here’s the conflation I was looking for (I knew you wouldn’t let me down ;-)),

    By this approach, the validity of a model-based projection of future climate change is a measure of how representative the particular model configuration is of current theory.

    You can understand why other folks might have different validation metrics for that particular use right?

  3. Josh “You can understand why other folks might have different validation metrics for that particular use right?”

    Yes, of course. But really, a future projection of climate change should be nothing more than “here’s what the best science available tells us is likely to happen”. If people want to attack such projections, they could:
    – attack the science (i.e. the theory) underlying it;
    – attack the relationship between the theory and the models (e.g. by showing the models aren’t up to date with the latest science);
    – attack the verification of the models (i.e. by showing they’re not adequately tested, don’t implement the equations correctly, contain latent coding bugs, etc).

    Any of these would be much better than vague “the models aren’t validated” comments.

  4. Any of these would be much better than vague “the models aren’t validated” comments.

    Agree; “models aren’t valid”, absent a technical understanding of what that means, is just a talking point.

  5. “a theory is the best current explanation of a given set of phenomena”

    This definition is woeful. It suggests, for instance, that an explanation which is a theory one day stops being a theory the next day, when a better explanation is devised. Which is obviously not the case.

    A theory is an explanation of some phenomena. A good theory has some evidential support and can make some testable predictions. There are many more metrics than that for the goodness of a theory (for instance, how much evidence supports it? how accurate are its predictions? how easy is it to use? how easy is it to understand or explain? how elegant is it?), so there’s no single basis for choosing one theory over another (for instance, GR is simpler, more elegant, better supported, and more accurate than Newtonian gravity, but it is a right bugger to use or to explain).

  6. Nick – erm, good point. I think.

    Actually, I think my definition still works, as long as we allow for some leeway on what “best” means, and who gets to decide. It takes a long time and a lot of discussion for a new theory to displace an old one. Kuhn characterized them as paradigm shifts, although later philosophers (e.g. Lakatos) pointed out that Kuhn’s characterization is too simple, and that in any field there are usually multiple theories competing with each other for supremacy.
    If a theory is shown conclusively to be wrong, then I would argue it stops being a theory (or if you prefer, it stops being a “scientific theory”). More often, when a new theory offers better insights and better explanations, it still doesn’t completely replace the old theory, because there are still places where the old theory is a better explanation (Larry Laudan explains this well – see the Wikipedia entry for an overview).

    How about if I slightly modify my definition to be “A theory is the best explanation, or at least a candidate for best explanation, of a given set of phenomena”? It’s a little more cumbersome…

  7. Josh: I agree with the overall criticism of the Oreskes paper. Terminologically, it’s a complete mess, and the conclusions are rather pathetic. However, the point that you can’t validate a model by comparing with the real world still stands, and is, I think, consistent with the Thompson paper you quote. Thompson says “Establishing validation means establishing the degree to which the conceptual model is even supposed to encompass physical reality”. That’s largely my point too: you have to establish the correspondence between the model and the theory, to understand what the model is supposed to do. Comparing the model to the observations doesn’t validate the model, but the results of such comparisons may form part of the argument that the model does adequately capture (at least some part of) the theory. The theory will guide what kinds of comparisons make sense, and what level of correspondence should be expected.

  8. Pingback: V&V For a Forecasting System | Serendipity

  9. However, the point that you can’t validate a model by comparing with the real world still stands, and is, I think, consistent with the Thompson paper you quote.

    That seems absurd on its face to me (both your point and the claim that it is consistent with what Thompson wrote), but maybe we are just using terminology a bit differently. Here,

    Thompson says “Establishing validation means establishing the degree to which the conceptual model is even supposed to encompass physical reality”. That’s largely my point too: you have to establish the correspondence between the model and the theory

    You seem to be saying that there is a correspondence between your use of “model” and “theory” and Thompson’s use of “conceptual model” and “physical reality”. Can you elaborate on these terms as you are using them? Does your use of “theory” correspond to Thompson’s “conceptual model,” and just plain “model” means something else? Would you say “theory” or “conceptual model” is the governing equations you choose, and “model” is the implementation? Or is “theory” something like Continuum Mechanics, and “conceptual model” is the particular simplifications we make for our problem and our limited computing resources? Any or none of the above?

  10. Sorry, I wasn’t very clear. The key word in the quote from Thompson is the word “supposed”. The only way to know the degree to which the model is *supposed* to encompass physical reality is to explore the relationship between the model and the theory.

  11. The only way to know the degree to which the model is *supposed* to encompass physical reality is to explore the relationship between the model and the theory.

    I think I disagree with this too. I’d expect that you can get broad, qualitative outlines of “supposed to encompass physical reality” from examining the theory. For example, I wouldn’t expect a Continuum Mechanics code to predict continuum breakdown very well, but that’s not even looking at the relationship between model and theory. To me, model and theory are near synonyms separated only by our inadvertent errors or limited capabilities, but I have a suspicion that I’m using those words differently than you are. When you say “model” are you talking about the actual implementation in software?

  12. Great post, Steve. Lots to think about.

  13. I don’t get most of what is being said here.

    Second bit first: The “theory” doesn’t make predictions that can be tested, because, generally, you can’t directly compute the theory – it’s too complicated. This is the reason you’re writing the model in the first place. Therefore, as I see it, the diagram has the three blocks in the wrong order. It should be

    Theory <-> Model <-> Reality

    You prove that the theory is right by building a model from it that you can actually compute. Then you see if the thing you compute matches the real world. If it does, then it provides support for the theory.

    Therefore, what you need to check about the model is
    (a) Has it implemented the correct facets of the theory to actually compute the correct answer (or a good, controlled approximation to it) for a particular physical measurement?
    (b) Is the calculation correct (in the case of a computer program, contains no bugs)?

    Regarding the first half of the post: I am not sure I understand what is being claimed. All that is being said is that there are a large number of versions that all go under the same name. But from a scientific point of view, they’re different models if they don’t give the same answers to the same questions. This we fix by a careful convention for naming them, with version numbers or build numbers or whatever. It doesn’t seem like a big deal from a philosophical point of view.

  14. A few questions: 1) If I understand correctly – a model is ‘valid’ (is that a formal term?) if the code is written to correctly represent the best theoretical science at the time – so then what do the results tell you? What are you modeling for – or what are the possible results or output of the model? If the model tells you something you weren’t expecting, does that mean it’s invalid? When would you get a result or output that conflicts with theory and then assess whether the theory needs to be reconsidered? 2) Then is it the theory and not the model that is the best tool for understanding what will happen in the future? Is the best we can say about what will happen that we have a theory that adheres to what we know about the field and that makes sense based on that knowledge? 3) What then is the protection or assurance that the theory is accurate? How can one ‘check’ predictions without simply waiting to see if they come true or not come true? I think myself and a large part of the general public that follows this stuff to some degree is under the impression that the models can support the theory – but if the models are only valid if they adhere to the best available theoretical science at the time, are they any check for the theory’s correctness at all?

  15. @Steve
    Or if it’s easier – do projections of temperature increase due to CO2 come from the models’ output or from the theory?

  16. Pingback: Your Model Is Verified, But Not Valid! Huh? « Azimuth

  17. Hi Steve,

    I’m LM, I was pointed here by a comment from Dr Curry’s blog (I’ve long been searching for more information on climate model Verification + Validation as the process confuses the hell out of me).

    I liked your piece and it’s filled a few holes in my knowledge, but I think what you’re describing is NOT actually V+V proper- rather climate science’s interpretation of it. I’m an industry based scientist with a smattering of engineering experience (the ‘sharp’ end, not officially taught) and while I think I can ‘allow’ (!) and understand the verification part of climate models (while immediately qualifying this by highlighting the entirely subjective nature of this process), I do not agree with your assessment of validation: at least as I recognise the term (and I’m the first to admit that this may be where my issues arise).

    You state that real world validation of a climate model is impractical and further, should not be expected. I follow your logic, but I’ve then got to question the use of the modelling exercise at all. Specifically, I understand how ‘black box’ mechanics can be used to draw conclusions about real-world processes and how this sort of ‘experimentation’ (though it is emphatically NOT an experiment) can be used to further our knowledge of said system, but it does rely on a rather large assumption: that our understandings of this system are accurate.

    You alluded to as much in your post when you stated that the validation is based on our current understandings and as such can change completely in 30 years’ time depending on scientific and theoretical advancements. This to me sets alarm bells off. If the validation process is such that the very nature of the model can change drastically over time, then what you are performing is in fact NOT validation- but rather some sort of in process operational qualification.

    Or to put it another way, without validating the models against real world data, you DO NOT know that what you’re modelling (regardless of how well it matches your current theories) bears any resemblance to reality. In fact, you could disappear so far down this particular rabbit hole that you could literally spend years researching a dead-end.

    Now of course, validation against real-world data will not be exact. The climate system is FAR too complex to currently model accurately; though a working validation guideline or framework could definitely be set-up and would, I submit humbly, work far better than the current method (as you describe).

    Finally, for the ‘rabbit hole’ reason I outlined above I am very wary when you describe the current iterative process surrounding model ‘building’, testing and release. The ensemble process on the face of it seems correct- but it is not being carried out to its logical conclusion (an iterative model reduction) and I think the constant tweaking/modifying and retrofitting is actually holding the model development back rather than helping it progress. It’s certainly not an approach that would match an industry or engineering level V+V.

    I guess I find the whole process to be somewhat counter intuitive and I cannot shake the impression that this whole process hinges on the assumption that the theoretical aspects used to evaluate the models are accurate.

    Though, feel free to completely shoot down my conclusions and points, this is a subject I’m very interested in and I’m really keen to learn more on it.

  18. Pingback: Why trust climate models? It’s a matter of simple science
