This is an excerpt from the draft manuscript of my forthcoming book, Computing the Climate.
While models are used throughout the sciences, the word ‘model’ can mean something very different to scientists from different fields. This can cause great confusion. I often encounter scientists from outside of climate science who think climate models are statistical models of observed data, and that future projections from these models must be just extrapolations of past trends. And just to confuse things further, some of the models used in climate policy analysis are like this. But the physical climate models that underpin our knowledge of why climate change occurs are fundamentally different from statistical models.
A useful distinction made by philosophers of science is between models of phenomena, and models of data. The former include models developed by physicists and engineers to capture cause-and-effect relationships. Such models are derived from theory and experimentation, and have explanatory power: the model captures the reasons why things happen. Models of data, on the other hand, describe patterns in observed data, such as correlations and trends over time, without reference to why they occur. Statistical models, for example, describe common patterns (distributions) in data, without saying anything about what caused them. This simplifies the job of describing and analyzing patterns: if you can find a statistical model that matches your data, you can reduce the data to a few parameters (sometimes just two: a mean and a standard deviation). For example, the heights of any large group of people tend to follow a normal distribution—the bell-shaped curve—but this model doesn’t explain why heights vary in that way, nor whether they always will in the future. New techniques from machine learning have extended the power of these kinds of models in recent years, allowing more complex patterns to be discovered by “training” an algorithm to find more complex kinds of pattern.
Statistical techniques and machine learning algorithms are good at discovering patterns in data (eg “A and B always seems to change together”), but hopeless at explaining why those patterns occur. To get over this, many branches of science use statistical methods together with controlled experiments, so that if we find a pattern in the data after we’ve carefully manipulated the conditions, we can argue that the changes we introduced in the experiment caused that pattern. The ability to identify a causal relationship in a controlled experiment has nothing to do with the statistical model used—it comes from the logic of the experimental design. Only if the experiment is designed properly will statistical analysis of the results provide any insights into cause and effect.
Unfortunately, for some scientific questions, experimentation is hard, or even impossible. Climate change is a good example. Even though it’s possible to manipulate the climate (as indeed we are currently doing, by adding more greenhouse gases), we can’t set up a carefully controlled experiment, because we only have one planet to work with. Instead, we use numerical models, which simulate the causal factors—a kind of virtual experiment. An experiment conducted in a causal model won’t necessarily tell us what will happen in the real world, but it often gives a very useful clue. If we run the virtual experiment many times in our causal model, under slightly varied conditions, we can then turn back to a statistical model to help analyze the results. But without the causal model to set up the experiment, a statistical analysis won’t tell us much.
Both traditional statistical models and modern machine learning techniques are brittle, in the sense that they struggle when confronted with new situations not captured in the data from which the models were derived. An observed statistical trend projected into the future is only useful as a predictor if the future is like the past; it will be a very poor predictor if the conditions that cause the trend change. Climate change in particular is likely to make a mess of all of our statistical models, because the future will be very unlike the past. In contrast, a causal model based on the laws of physics will continue to give good predictions, as long as the laws of physics still hold.
Modern climate models contain elements of both types of model. The core elements of a climate model capture cause-and-effect relationships from basic physics, such as the thermodynamics and radiative properties of the atmosphere. But these elements are supplemented by statistical models of phenomena such as clouds, which are less well understood. To a large degree, our confidence in future predictions from climate models comes from the parts that are causal models based on physical laws, and the uncertainties in these predictions derive from the parts that are statistical summaries of less well-understood phenomena. Over the years, many of the improvements in climate models have come from removing a component that was based on a statistical model, and replacing it with a causal model. And our confidence in the causal components in these models comes from our knowledge of the laws of physics, and from running a very large number of virtual experiments in the model to check whether we’ve captured these laws correctly in the model, and whether they really do explain climate patterns that have been observed in the past.