My department is busy revising the set of milestones our PhD students need to meet in the course of their studies. The milestones are intended to ensure each student is making steady progress, and to identify (early!) any problems. At the moment they don’t really do this well, in part because the faculty all seem to have different ideas about what we should expect at each milestone. (This is probably a special case of the general rule that if you gather n professors together, they will express at least n+1 mutually incompatible opinions). As a result, the students don’t really know what’s expected of them, and hence spend far longer in the PhD program than they would need to if they received clear guidance.

Anyway, in order to be helpful, I wrote down what I think are the set of skills that a PhD student needs to demonstrate early in the program, as a prerequisite for becoming a successful researcher:

  1. The ability to select a small number of significant research contributions from a larger set of published papers, and justify that selection.
  2. The ability to articulate a rationale for selection of these papers, on the basis of significance of the results, novelty of the approach, etc.
  3. The ability to relate the papers to one another, and to other research in the literature.
  4. The ability to critique the research methods used in these papers, the strengths and weaknesses of these methods, and likely threats to validity, whether acknowledged in the papers or not.
  5. The ability to suggest alternative approaches to answering the research questions posed in these papers.
  6. The ability to identify limitations on the results reported in the papers, along with their implications.
  7. The ability to identify and prioritize lines of investigation for further research, based on limitations of the research described in the papers and/or important open problems that the papers fail to answer.

My suggestion is that at the end of the first year of the PhD program, each student should demonstrate development of these skills by writing a short report that selects and critiques a handful (4-6) of papers in a particular subfield. If a student can’t do this well, they’re probably not going to succeed in the PhD program.

My proposal has now gone to the relevant committee (“where good ideas go to die™”), so we’ll see what happens…

This week, I presented our poster on Benchmarking and Assessment of Homogenisation Algorithms for the International Surface Temperature Initiative (ISTI) at the WCRP Open Science Conference.

This work is part of the International Surface Temperature Initiative (ISTI) that I blogged about last year. The intent is to create a new open access database for historical surface temperature records at a much higher resolution than has previously been available. In the past, only monthly averages were widely available; daily and sub-daily observations collected by meteorological services around the world are often considered commercially valuable, and hence tend to be hard to obtain. And if you go back far enough, much of the data was never digitized and some is held in deteriorating archives.

The goal of the benchmarking part of the project is to assess the effectiveness of the tools used to remove data errors from the raw temperature records. My interest in this part of the project stems from the work that my student, Susan Sim, did a few years ago on the role of benchmarking to advance research in software engineering. Susan’s PhD thesis described a theory that explains why benchmarking efforts tend to accelerate progress within a research community. The main idea is that creating a benchmark brings the community together to build consensus on what the key research problem is, what sample tasks are appropriate to show progress, and what metrics should be used to measure that progress. The benchmark then embodies this consensus, allowing different research groups to do detailed comparisons of their techniques, and facilitating sharing of approaches that work well.

Of course, it’s not all roses. Developing a benchmark in the first place is hard, and requires participation from across the community; a benchmark put forward by a single research group is unlikely to be accepted as unbiased by other groups. This also means that a research community has to be sufficiently mature in terms of their collaborative relationships and consensus on common research problems (in Kuhnian terms, they must be in the normal science phase). Also, note that a benchmark is anchored to a particular stage of the research, as it captures problems that are currently challenging; continued use of a benchmark after a few years can lead to a degeneration of the research, with groups over-fitting to the benchmark, rather than moving on to harder challenges. Hence, it’s important to retire a benchmark every few years and replace it with a new one.

The benchmarks we’re exploring for the ISTI project are intended to evaluate homogenization algorithms. These algorithms detect and remove artifacts in the data that are due to things that have nothing to do with climate – for example when instruments designed to collect short-term weather data don’t give consistent results over the long-term record. The technical term for these is inhomogeneities, but I’ll try to avoid the word, not least because I find it hard to say. I’d like to call them anomalies, but that word is already used in this field to mean differences in temperature due to climate change. Which means that anomalies and inhomogeneities are, in some ways, opposites: anomalies are the long term warming signal that we’re trying to assess, and inhomogeneities represent data noise that we have to get rid of first. I think I’ll just call them bad data.

Bad data arise for a number of reasons, usually isolated to changes at individual recording stations: a change of instruments, an instrument drifting out of calibration, a re-siting, a slow encroachment of urbanization which changes the local micro-climate. Because these problems tend to be localized, they can often be detected by statistical algorithms that compare individual stations with their neighbours. In essence, the algorithms look for step changes and spurious trends in the data.
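
To make the neighbour-comparison idea a bit more concrete, here’s a minimal sketch in Python. It’s purely my own illustration (none of the actual homogenization algorithms works quite like this): subtract the mean of the neighbouring stations to remove the shared climate signal, then look for the single breakpoint with the largest mean shift in what’s left.

```python
import numpy as np

def detect_step_change(candidate, neighbours, threshold=3.0, min_seg=12):
    """Toy neighbour-comparison test for a single step change.

    candidate:  1-D array of monthly temperatures for the station being checked
    neighbours: 2-D array (n_neighbours x n_months) of nearby station records

    Subtracting the neighbour mean removes the shared climate signal, so any
    large, sustained shift left in the difference series is suspect.
    """
    diff = candidate - neighbours.mean(axis=0)
    best_k, best_score = None, 0.0
    for k in range(min_seg, len(diff) - min_seg):   # keep a year of data on each side
        left, right = diff[:k], diff[k:]
        shift = right.mean() - left.mean()
        stderr = np.sqrt(left.var(ddof=1) / len(left) + right.var(ddof=1) / len(right))
        score = abs(shift) / stderr
        if score > best_score:
            best_k, best_score = k, score
    return (best_k, best_score) if best_score > threshold else (None, best_score)
```

Real algorithms handle multiple breakpoints, missing data, seasonality and autocorrelation, but the core idea of testing each station against its neighbours is the same.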

These bad data are a serious problem in climate science – for a recent example, see the post yesterday at RealClimate, which discusses how homogenization algorithms might have gotten in the way of understanding the relationship between climate change and the Russian heatwave of 2010. Unhelpfully, they’re also used by deniers to beat up climate scientists, as some people latched onto the idea of blaming warming trends on bad data rather than, say, actual warming. Of course, this ignores two facts: (1) climate scientists already spend a lot of time assessing and removing such bad data and (2) independent analysis has repeatedly shown that the global warming signal is robust with respect to such data problems.

However, such problems in the data still matter for the detailed regional assessments that we’ll need in the near future for identifying vulnerabilities (e.g. to extreme weather), and, as the example at RealClimate shows, for attribution studies for localized weather events and hence for decision-making on local and regional adaptation to climate change.

The challenge is that it’s hard to test how well homogenization algorithms work, because we don’t have access to the truth – the actual temperatures that the observational records should have recorded. The ISTI benchmarking project aims to fill this gap by creating a data set that has been seeded with artificial errors. The approach reminds me of the software engineering technique of bug seeding (aka mutation testing), which deliberately introduces errors into software to assess how good the test suite is at detecting them.
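
In case the analogy isn’t familiar, here’s a toy illustration of bug seeding (mine, nothing to do with the ISTI project): flip a “+” into a “-” in a small function and check whether the test suite notices. Real mutation testing tools generate large numbers of such mutants automatically.

```python
import ast

SOURCE = """
def total(xs):
    return xs[0] + xs[1]
"""

class FlipAdd(ast.NodeTransformer):
    """The seeded bug: turn every '+' into a '-'."""
    def visit_BinOp(self, node):
        self.generic_visit(node)
        if isinstance(node.op, ast.Add):
            node.op = ast.Sub()
        return node

def passes_tests(tree):
    """Run a (tiny) test suite against the code; a mutant that passes has
    'survived', meaning the tests are too weak to catch that seeded bug."""
    namespace = {}
    exec(compile(ast.fix_missing_locations(tree), "<mutant>", "exec"), namespace)
    return namespace["total"]([2, 3]) == 5

print(passes_tests(ast.parse(SOURCE)))                   # True: the original code passes
print(passes_tests(FlipAdd().visit(ast.parse(SOURCE))))  # False: the tests catch the mutant
```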

The first challenge is where to get a “clean” temperature record to start with, because the assessment is much easier if the only bad data in the sample are the ones we deliberately seeded. The technique we’re exploring is to start with the output of a Global Climate Model (GCM), which is probably the closest we can get to a globally consistent temperature record. The GCM output is on a regular grid, and may not always match the observational temperature record in terms of means and variances. So to make it as realistic as possible, we have to downscale the gridded data to yield a set of “station records” that match the location of real observational stations, and adjust the means and variances to match the real world.
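
As a rough sketch of the moment-matching part of that step (my own simplification, and it assumes the gridded model output has already been interpolated to the station locations), you could standardize each pseudo-station series and rescale it to the observed station’s mean and standard deviation:

```python
import numpy as np

def match_moments(model_series, obs_series):
    """Rescale a pseudo-station series derived from GCM output so that its
    mean and standard deviation match those of the corresponding real
    station record (a crude stand-in for the full downscaling procedure)."""
    standardized = (model_series - model_series.mean()) / model_series.std()
    return obs_series.mean() + obs_series.std() * standardized
```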

Then we inject the errors. Of course, the error profile we use is based on what we currently know about typical kinds of bad data in surface temperature records. It’s always possible there are other types of error in the raw data that we don’t yet know about; that’s one of the reasons for planning to retire the benchmark periodically and replace it with a new one – it allows new findings about error profiles to be incorporated.
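
Here’s a toy version of the error-seeding step (illustrative only; the real error profile is much richer than a couple of random step changes, and all the numbers below are made up):

```python
import numpy as np

rng = np.random.default_rng(2011)   # fixed seed so the hidden "truth" is reproducible

def seed_errors(clean, n_breaks=2, max_shift=1.0, min_seg=12):
    """Inject artificial inhomogeneities into a clean pseudo-station series:
    a handful of abrupt shifts, as if instruments had been changed or the
    station re-sited. Returns the corrupted series and the hidden truth."""
    corrupted = clean.copy()
    truth = []
    for _ in range(n_breaks):
        k = int(rng.integers(min_seg, len(clean) - min_seg))   # breakpoint position
        shift = float(rng.uniform(-max_shift, max_shift))      # step size
        corrupted[k:] += shift
        truth.append((k, shift))
    return corrupted, truth
```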

Once the benchmark is created, it will be used within the community to assess different homogenization algorithms. Initially, the actual injected error profile will be kept secret, to ensure the assessment is honest. Towards the end of the 3-year benchmarking cycle, we will release the details about the injected errors, to allow different research groups to measure how well they did. Details of the results will then be included in the ISTI dataset for any data products that use the homogenization algorithms, so that users of these data products have more accurate estimates of uncertainty in the temperature record. Such estimates are important, because use of the processed data without a quantification of uncertainty can lead to misleading or incorrect research.
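
When the truth is released, each group will be able to score itself with something along these lines (my own toy scoring rule, not the ISTI assessment criteria): a seeded breakpoint counts as found if the algorithm reported a break within a few months of it, and anything else the algorithm reported counts as a false alarm.

```python
def score_detections(detected, truth, tolerance=6):
    """Compare the breakpoints reported by a homogenization algorithm with
    the seeded ones. `detected` and `truth` are lists of breakpoint indices;
    a detection within `tolerance` time steps of a seeded break is a hit."""
    hits = sum(1 for t in truth if any(abs(d - t) <= tolerance for d in detected))
    false_alarms = sum(1 for d in detected if not any(abs(d - t) <= tolerance for t in truth))
    return {"hits": hits, "misses": len(truth) - hits, "false_alarms": false_alarms}
```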

For more details of the project, see the Benchmarking and Assessment Working Group website, and the group blog.

Bad news today – we just had a major grant proposal turned down. It’s the same old story – they thought the research we were proposing (on decision support tools for sustainability) was excellent, but criticized, among other things, the level of industrial commitment and our commercialization plans. Seems we’re doomed to live in times where funding agencies expect universities to take on the role of industrial R&D. Oh well.

The three external reviews were very strong. Here’s a typical paragraph from the first review:

I found the overall project to be very compelling from a “need”, potential “payoff”, technical and team perspective. The linkage between seemingly disparate technology arenas–which are indeed connected and synergistic–is especially compelling. The team is clearly capable and has a proven track record of success in each of their areas and as leaders of large projects, overall. The linkage to regional and institutional strengths and partners, in both academic and industrial dimensions, is well done and required for success.

Sounds good huh? I’m reading it through, nodding, liking the sound of what this reviewer is saying. The problem is, this is the wrong review. It’s not a review of our proposal. It’s impossible to tell that from this paragraph, but later on, mixed in with a whole bunch more generic praise, are some comments on manufacturing processes, and polymer-based approaches. That’s definitely not us. Yet I’m named at the top of the form as the PI, along with the title of our proposal. So, this review made it all the way through the panel review process, and nobody noticed it was of the wrong proposal, because most of the review was sufficiently generic that it passed muster on a quick skim-read.

It’s not the first time I’ve seen this happen. It happens for paper reviews for journals and conferences. It happens for grant proposals. It even happens for tenure and promotion cases (including both of the last two tenure committees I sat on). Since we started using electronic review systems, it happens even more – software errors and human errors seem to conspire to ensure a worryingly large proportion of reviews get misfiled.

Which is why every review should start with a one paragraph summary of whatever is being reviewed, in the reviewer’s own words. This acts as a check that the reviewer actually understood what the paper or proposal was about. It allows the journal editor / review panel / promotions committee to immediately spot cases of mis-filed reviews. And it allows the authors, when they receive the reviews, to get the most important feedback of all: how well did they succeed in communicating the main message of the paper/proposal?

Unfortunately, in our case, correcting the mistake is unlikely to change the funding decision (they sunk us on other grounds). But at least I can hope to use it as an example to improve the general standard of reviewing in the future.

This week I attended a Dagstuhl seminar on New Frontiers for Empirical Software Engineering. It was a select gathering, with many great people, which meant lots of fascinating discussions, and not enough time to type up all the ideas we’ve been bouncing around. I was invited to run a working group on the challenges to empirical software engineering posed by climate change. I started off with a quick overview of the three research themes we identified at the Oopsla workshop in the fall:

  • Climate Modeling, which we could characterize as a kind of end-user software development, embedded in a scientific process;
  • Global collective decision-making, which involves creating the software infrastructure for collective curation of sources of evidence in a highly charged political atmosphere;
  • Green Software Engineering, including carbon accounting for the software systems lifecycle (development, operation and disposal), but where we have no existing measurement framework, and a tendency to make unsupported claims (aka greenwashing).

Inevitably, we spent most of our time this week talking about the first topic – software engineering of computational models, as that’s the closest to the existing expertise of the group, and the most obvious place to start.

So, here’s a summary of our discussions. The bright ideas are due to the group (Vic Basili, Lionel Briand, Audris Mockus, Carolyn Seaman and Claes Wohlin), while the mistakes in presenting them here are all mine.

A lot of our discussion was focussed on the observation that climate modeling (and software for computational science in general) is a very different kind of software engineering than most of what’s discussed in the SE literature. It’s like we’ve identified a new species of software engineering, which appears to be an outlier (perhaps an entirely new phylum?). This discovery (and the resulting comparisons) seems to tell us a lot about the other species that we thought we already understood.

The SE research community hasn’t really tackled the question of how the different contexts in which software development occurs might affect software development practices, nor when and how it’s appropriate to attempt to generalize empirical observations across different contexts. In our discussions at the workshop, we came up with many insights for mainstream software engineering, which means this is a two-way street: plenty of opportunity for re-examination of mainstream software engineering, as well as learning how to study SE for climate science. I should also say that many of our comparisons apply to computational science in general, not just climate science, although we used climate modeling for many specific examples.

We ended up discussing three closely related issues:

  1. How do we characterize/distinguish different points in this space (different species of software engineering)? We focussed particularly on how climate modeling is different from other forms of SE, but we also attempted to identify factors that would distinguish other species of SE from one another. We identified lots of contextual factors that seem to matter. We looked for external and internal constraints on the software development project that seem important. External constraints are things like resource limitations, or particular characteristics of customers or the environment where the software must run. Internal constraints are those that are imposed on the software team by itself, for example, choices of working style, project schedule, etc.
  2. Once we’ve identified what we think are important distinguishing traits (or constraints), how do we investigate whether these are indeed salient contextual factors? Do these contextual factors really explain observed differences in SE practices, and if so how? We need to consider how we would determine this empirically. What kinds of study are needed to investigate these contextual factors? How should the contextual factors be taken into account in other empirical studies?
  3. Now imagine we have already characterized this space of species of SE. What measures of software quality attributes (e.g. defect rates, productivity, portability, changeability…) are robust enough to allow us to make valid comparisons between species of SE? Which metrics can be applied in a consistent way across vastly different contexts? And if none of the traditional software engineering metrics (e.g. for quality, productivity, …) can be used for cross-species comparison, how can we do such comparisons?

In my study of the climate modelers at the UK Met Office Hadley centre, I had identified a list of potential success factors that might explain why the climate modelers appear to be successful (i.e. to the extent that we are able to assess it, they appear to build good quality software with low defect rates, without following a standard software engineering process). My list was:

  • Highly tailored software development process – software development is tightly integrated into scientific work;
  • Single Site Development – virtually all coupled climate models are developed at a single site, or at least managed and coordinated at a single site, once they become sufficiently complex [edited – see Bob’s comments below], usually a government lab, as universities don’t have the resources;
  • Software developers are domain experts – they do not delegate programming tasks to programmers, which means they avoid the misunderstandings of the requirements common in many software projects;
  • Shared ownership and commitment to quality, which means that the software developers are more likely to make contributions to the project that matter over the long term (in contrast to, say, offshored software development, where developers are only likely to do the tasks they are immediately paid for);
  • Openness – the software is freely shared with a broad community, which means that there are plenty of people examining it and identifying defects;
  • Benchmarking – there are many groups around the world building similar software, with regular, systematic comparisons on the same set of scenarios, through model inter-comparison projects (this trait could be unique – we couldn’t think of any other type of software for which this is done so widely).
  • Unconstrained Release Schedule – as there is no external customer, software releases are unhurried, and occur only when the software is considered stable and tested enough.

At the workshop we identified many more distinguishing traits, any of which might be important:

  • A stable architecture, defined by physical processes: atmosphere, ocean, sea ice, land scheme,…. All GCMs have the same conceptual architecture, and it is unchanged since modeling began, because it is derived from the natural boundaries in physical processes being simulated [edit: I mean the top level organisation of the code, not the choice of numerical methods, which do vary across models – see Bob’s comments below]. This is used as an organising principle both for the code modules, and also for the teams of scientists who contribute code. However, the modelers don’t necessarily derive some of the usual benefits of stable software architectures, such as information hiding and limiting the impacts of code changes, because the modules have very complex interfaces between them.
  • The modules and integrated system each have independent lives, owned by different communities. For example, a particular ocean model might be used uncoupled by a large community, and also be integrated into several different coupled climate models at different labs. The communities who care about the ocean model on its own will have different needs and priorities than each of the communities who care about the coupled models. Hence, the inter-dependence has to be continually re-negotiated. Some other forms of software have this feature too: Audris mentioned voice response systems in telecoms, which can be used stand-alone, and also in integrated call centre software; Lionel mentioned some types of embedded control systems onboard ships, where the modules are used independently on some ships, and as part of a larger integrated command and control system on others.
  • The software has huge societal importance, but the impact of software errors is very limited. First, a contrast: for automotive software, a software error can immediately lead to death, or huge expense, legal liability, etc,  as cars are recalled. What would be the impact of software errors in climate models? An error may affect some of the experiments performed on the model, with perhaps the most serious consequence being the need to withdraw published papers (although I know of no cases where this has happened because of software errors rather than methodological errors). Because there are many other modeling groups, and scientific results are filtered through processes of replication, and systematic assessment of the overall scientific evidence, the impact of software errors on, say, climate policy is effectively nil. I guess it is possible that systematic errors are being made by many different climate modeling groups in the same way, but these wouldn’t be coding errors – they would be errors in the understanding of the physical processes and how best to represent them in a model.
  • The programming language of choice is Fortran, and is unlikely to change for very good reasons. The reasons are simple: there is a huge body of legacy Fortran code, everyone in the community knows and understands Fortran (and for many of them, only Fortran), and Fortran is ideal for much of the work of coding up the mathematical formulae that represent the physics. Oh, and performance matters enough that the overhead of object oriented languages makes them unattractive. Several climate scientists have pointed out to me that it probably doesn’t matter what language they use, the bulk of the code would look pretty much the same – long chunks of sequential code implementing a series of equations. Which means there’s really no push to discard Fortran.
  • Existence and use of shared infrastructure and frameworks. An example used by pretty much every climate model is MPI. However, unlike Fortran, which is generally liked (if not loved), everyone universally hates MPI. If there was something better they would use it. [OpenMP doesn’t seem to have any bigger fanclub]. There are also frameworks for structuring climate models and coupling the different physics components (more on these in a subsequent post). Use of frameworks is an internal constraint that will distinguish some species of software engineering, although I’m really not clear how it will relate to choices of software development process. More research needed.
  • The software developers are very smart people. Typically with PhDs in physics or related geosciences. When we discussed this in the group, we all agreed this is a very significant factor, and that you don’t need much (formal) process with very smart people. But we couldn’t think of any existing empirical evidence to support such a claim. So we speculated that we needed a multi-case case study, with some cases representing software built by very smart people (e.g. climate models, the Linux kernel, Apache, etc), and other cases representing software built by …. stupid people. But we felt we might have some difficulty recruiting subjects for such a study (unless we concealed our intent), and we would probably get into trouble once we tried to publish the results 🙂
  • The software is developed by users for their own use, and this software is mission-critical for them. I mentioned this above, but want to add something here. Most open source projects are built by people who want a tool for their own use, but that others might find useful too. The tools are built on the side (i.e. not part of the developers’ main job performance evaluations) but most such tools aren’t critical to the developers’ regular work. In contrast, climate models are absolutely central to the scientific work on which the climate scientists’ job performance depends. Hence, we described them as mission-critical, but only in a personal kind of way. If that makes sense.
  • The software is used to build a product line, rather than an individual product. All the main climate models have a number of different model configurations, representing different builds from the codebase (rather than say just different settings). In the extreme case, the UK Met Office produces several operational weather forecasting models and several research climate models from the same unified codebase, although this is unusual for a climate modeling group.
  • Testing focuses almost exclusively on integration testing. In climate modeling, there is very little unit testing, because it’s hard to specify an appropriate test for small units in isolation from the full simulation. Instead the focus is on very extensive integration tests, with daily builds, overnight regression testing, and a rigorous process of comparing the output from runs before and after each code change (a minimal sketch of this kind of before-and-after comparison appears after this list). In contrast, most other types of software engineering focus instead on unit testing, with elaborate test harnesses to test pieces of the software in isolation from the rest of the system. In embedded software, the testing environment usually needs to simulate the operational environment; the most extreme case I’ve seen is the software for the international space station, where the only end-to-end software integration was the final assembly in low earth orbit.
  • Software development activities are completely entangled with a wide set of other activities: doing science. This makes it almost impossible to assess software productivity in the usual way, and even impossible to estimate the total development cost of the software. We tried this as a thought experiment at the Hadley Centre, and quickly gave up: there is no sensible way of drawing a boundary to distinguish some set of activities that could be regarded as contributing to the model development, from other activities that could not. The only reasonable path to assessing productivity that we can think of must focus on time-to-results, or time-to-publication, rather than on software development and delivery.
  • Optimization doesn’t help. This is interesting, because one might expect climate modelers to put a huge amount of effort into optimization, given that century-long climate simulations still take weeks/months on some of the world’s fastest supercomputers. In practice, optimization, where it is done, tends to be an afterthought. The reason is that the model is changed so frequently that hand optimization of any particular model version is not useful. Plus the code has to remain very understandable, so very clever designed-in optimizations tend to be counter-productive.
  • There are very few resources available for software infrastructure. Most of the funding is concentrated on the frontline science (and the costs of buying and operating supercomputers). It’s very hard to divert any of this funding to software engineering support, so development of the software infrastructure is sidelined and sporadic.
  • …and last but not least, A very politically charged atmosphere. A large number of people actively seek to undermine the science, and to discredit individual scientists, for political (ideological) or commercial (revenue protection) reasons. We discussed how much this directly impacts the climate modellers, and I have to admit I don’t really know. My sense is that all of the modelers I’ve interviewed are shielded to a large extent from the political battles (I never asked them about this). Those scientists who have been directly attacked (e.g. Mann, Jones, Santer) tend to be scientists more involved in creation and analysis of datasets, rather than GCM developers. However, I also think the situation is changing rapidly, especially in the last few months, and climate scientists of all types are starting to feel more exposed.
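
To illustrate the before-and-after output comparison mentioned in the testing bullet above, here’s a minimal sketch (mine, not the Met Office’s actual tooling): compare each output field from the two runs, insisting on bit-for-bit identity for changes that aren’t supposed to alter the science, and otherwise reporting the maximum relative difference so a scientist can judge whether it matters.

```python
import numpy as np

def compare_runs(before, after, rel_tol=1e-9):
    """Compare output fields from model runs made before and after a code
    change. `before` and `after` map field names (e.g. "surface_temp") to
    numpy arrays of the same shape."""
    report = {}
    for name, ref in before.items():
        new = after[name]
        if np.array_equal(ref, new):
            report[name] = "bit-for-bit identical"
        else:
            max_rel = float(np.max(np.abs(new - ref) / (np.abs(ref) + 1e-30)))
            status = "within tolerance" if max_rel <= rel_tol else "NEEDS SCIENTIFIC REVIEW"
            report[name] = f"{status} (max relative difference {max_rel:.2e})"
    return report
```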

We also speculated about some other contextual factors that might distinguish different software engineering species, not necessarily related to our analysis of computational science software. For example:

  • Existence of competitors;
  • Whether software is developed for single-person-use versus intended for broader user base;
  • Need for certification (and different modes by which certification might be done, for example where there are liability issues, and the need to demonstrate due diligence)
  • Whether software is expected to tolerate and/or compensate for hardware errors. For example, for automotive software, much of the complexity comes from building fault-tolerance into the software because correcting hardware problems introduced in design or manufacture is prohibitively expensive. We pondered how often hardware errors occur in supercomputer installations, and whether, if they did, it would affect the software. I’ve no idea of the answer to the first question, but the second is readily handled by the checkpoint and restart features built into all climate models. Audris pointed out that given the volumes of data being handled (terabytes per day), there are almost certainly errors introduced in storage and retrieval (i.e. bits getting flipped), and enough that standard error correction would still miss a few. However, there’s enough noise in the data that in general, such things probably go unnoticed, although we speculated what would happen when the most significant bit gets flipped in some important variable.
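
For what it’s worth, the checkpoint-and-restart idea in that last bullet amounts to something like this sketch (a toy only; real climate models write far more elaborate restart files, and the filename here is just made up):

```python
import os
import pickle

CHECKPOINT = "model_state.pkl"   # hypothetical filename, for illustration only

def step_model(state):
    """Stand-in for one timestep of a real model: just accumulate a number."""
    state["total"] = state.get("total", 0.0) + 1.0
    return state

def run(n_steps, checkpoint_every=100):
    """Advance the toy model, saving its state periodically so that a run
    killed partway through (node failure, corrupted memory, queue limit)
    can be resumed from the last checkpoint rather than restarted."""
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT, "rb") as f:
            state = pickle.load(f)            # resume from the last checkpoint
    else:
        state = {"step": 0}
    while state["step"] < n_steps:
        state = step_model(state)
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            with open(CHECKPOINT, "wb") as f:
                pickle.dump(state, f)         # write the restart file
    return state
```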

More interestingly, we talked about what happens when these contextual factors change over time. For example, the emergence of a competitor where there was none previously, or the creation of a new regulatory framework where none existed. Or even, in the case of health care, when change in the regulatory framework relaxes a constraint – such as the recent US healthcare bill, under which it (presumably) becomes easier to share health records among medical professionals if knowledge of pre-existing conditions is no longer a critical privacy concern. An example from climate modeling: software that was originally developed as part of a PhD project intended for use by just one person eventually grows into a vast legacy system, because it turns out to be a really useful model for the community to use. And another: the move from single site development (which is how nearly all climate models were developed) to geographically distributed development, now that it’s getting increasingly hard to get all the necessary expertise under one roof, because of the increasing diversity of science included in the models.

We think there are lots of interesting studies to be done of what happens to the software development processes for different species of software when such contextual factors change.

Finally, we talked a bit about the challenge of finding metrics that are valid across the vastly different contexts of the various software engineering species we identified. Experience with trying to measure defect rates in climate models suggests that it is much harder to make valid comparisons than is generally presumed in the software literature. There really has not been any serious consideration of these various contextual factors and their impact on software practices in the literature, and hence we might need to re-think a lot of the ways in which claims for generality are handled in empirical software engineering studies. We spent some time talking about the specific case of defect measurements, but I’ll save that for a future post.

Here’s the abstract for a paper (that I haven’t written) on how to write an abstract:

How to Write an Abstract

The first sentence of an abstract should clearly introduce the topic of the paper so that readers can relate it to other work they are familiar with. However, an analysis of abstracts across a range of fields shows that few follow this advice, nor do they take the opportunity to summarize previous work in their second sentence. A central issue is the lack of structure in standard advice on abstract writing, so most authors don’t realize the third sentence should point out the deficiencies of this existing research. To solve this problem, we describe a technique that structures the entire abstract around a set of six sentences, each of which has a specific role, so that by the end of the first four sentences you have introduced the idea fully. This structure then allows you to use the fifth sentence to elaborate a little on the research, explain how it works, and talk about the various ways that you have applied it, for example to teach generations of new graduate students how to write clearly. This technique is helpful because it clarifies your thinking and leads to a final sentence that summarizes why your research matters.

[I’m giving my talk on how to write a thesis to our grad students soon. Can you tell?]

Update 16 Oct 2011: This page gets lots of hits from people googling for “how to write an abstract”. So I should offer a little more constructive help for anyone still puzzling what the above really means. It comes from my standard advice for planning a PhD thesis (but probably works just as well for scientific papers, essays, etc.).

The key trick is to plan your argument in six sentences, and then use these to structure the entire thesis/paper/essay. The six sentences are:

  1. Introduction. In one sentence, what’s the topic? Phrase it in a way that your reader will understand. If you’re writing a PhD thesis, your readers are the examiners – assume they are familiar with the general field of research, so you need to tell them specifically what topic your thesis addresses. Same advice works for scientific papers – the readers are the peer reviewers, and eventually others in your field interested in your research, so again they know the background work, but want to know specifically what topic your paper covers.
  2. State the problem you tackle. What’s the key research question? Again, in one sentence. (Note: For a more general essay, I’d adjust this slightly to state the central question that you want to address) Remember, your first sentence introduced the overall topic, so now you can build on that, and focus on one key question within that topic. If you can’t summarize your thesis/paper/essay in one key question, then you don’t yet understand what you’re trying to write about. Keep working at this step until you have a single, concise (and understandable) question.
  3. Summarize (in one sentence) why nobody else has adequately answered the research question yet. For a PhD thesis, you’ll have an entire chapter, covering what’s been done previously in the literature. Here you have to boil that down to one sentence. But remember, the trick is not to try and cover all the various ways in which people have tried and failed; the trick is to explain that there’s this one particular approach that nobody else tried yet (hint: it’s the thing that your research does). But here you’re phrasing it in such a way that it’s clear it’s a gap in the literature. So use a phrase such as “previous work has failed to address…”. (if you’re writing a more general essay, you still need to summarize the source material you’re drawing on, so you can pull the same trick – explain in a few words what the general message in the source material is, but expressed in terms of what’s missing)
  4. Explain, in one sentence, how you tackled the research question. What’s your big new idea? (Again for a more general essay, you might want to adapt this slightly: what’s the new perspective you have adopted? or: What’s your overall view on the question you introduced in step 2?)
  5. In one sentence, how did you go about doing the research that follows from your big idea? Did you run experiments? Build a piece of software? Carry out case studies? This is likely to be the longest sentence, especially if it’s a PhD thesis – after all you’re probably covering several years’ worth of research. But don’t overdo it – we’re still looking for a sentence that you could read aloud without having to stop for breath. Remember, the word ‘abstract’ means a summary of the main ideas with most of the detail left out. So feel free to omit detail! (For those of you who got this far and are still insisting on writing an essay rather than signing up for a PhD, this sentence is really an elaboration of sentence 4 – explore the consequences of your new perspective).
  6. As a single sentence, what’s the key impact of your research? Here we’re not looking for the outcome of an experiment. We’re looking for a summary of the implications. What’s it all mean? Why should other people care? What can they do with your research? (Essay folks: all the same questions apply: what conclusions did you draw, and why would anyone care about them?)

The abstract I started with summarizes my approach to abstract writing as an abstract. But I suspect I might have been trying to be too clever. So here’s a simpler one:

(1) In widgetology, it’s long been understood that you have to glomp the widgets before you can squiffle them. (2) But there is still no known general method to determine when they’ve been sufficiently glomped. (3) The literature describes several specialist techniques that measure how wizzled or how whomped the widgets have become during glomping, but all of these involve slowing down the glomping, and thus risking a fracturing of the widgets. (4) In this thesis, we introduce a new glomping technique, which we call googa-glomping, that allows direct measurement of whifflization, a superior metric for assessing squiffle-readiness. (5) We describe a series of experiments on each of the five major types of widget, and show that in each case, googa-glomping runs faster than competing techniques, and produces glomped widgets that are perfect for squiffling. (6) We expect this new approach to dramatically reduce the cost of squiffled widgets without any loss of quality, and hence make mass production viable.

Whom do you believe: The Cato Institute, or the Hadley Centre? Both cannot be right. Yet both claim to be backed by real scientists.

First, to get this out of the way, the latest ad from Cato has been thoroughly debunked by RealClimate, including a critical look at whether the papers that Cato cites offer any support for Cato’s position (hint: they don’t), and a quick tour through related literature. So I won’t waste my time repeating their analysis.

The Cato folks attempted to answer back, but it’s largely by attacking red herrings. However, one point from this article jumped out at me:

“The fact that a scientist does not undertake original research on subject x does not have any bearing on whether that scientist can intelligently assess the scientific evidence forwarded in a debate on subject x”.

The thrust of this argument is an attempt to bury the idea of expertise, so that the opinions of the Cato institute’s miscellaneous collection of people with PhDs can somehow be equated with those of actual experts. Now, of course it is true that a (good) scientist in another field ought to be able to understand the basics of climate science, and know how to judge the quality of the research, the methods used, and the strength of the evidence, at least at some level. But unfortunately, real expertise requires a great deal of time and effort to acquire, no matter how smart you are.

If you want to publish in a field, you have to submit yourself to the peer-review process. The process is not perfect (incorrect results often do get published, and, on occasion, fabricated results too). But one thing it does do very well is to check whether authors are keeping up to date with the literature. That means that anyone who regularly publishes in good quality journals has to keep up to date with all the latest evidence. They cannot cherry pick.

Those who don’t publish in a particular field (either because they work in an unrelated field, or because they’re not active scientists at all) don’t have this obligation. Which means when they form opinions on a field other than their own, they are likely to be based on a very patchy reading of the field, and mixed up with a lot of personal preconceptions. They can cherry pick. Unfortunately, the more respected the scientist, the worse the problem. The most venerated (e.g. prize winners) enter a world in which so many people stroke their egos, they lose touch with the boundaries of their ignorance. I know this first hand, because some members of my own department have fallen into this trap: they allow their brilliance in one field to fool them into thinking they know a lot about other fields.

Hence, given two scientists who disagree with one another, it’s a useful rule of thumb to trust the one who is publishing regularly on the topic. More importantly, if there are thousands of scientists publishing regularly in a particular field and not one of them supports a particular statement about that field, you can be damn sure it’s wrong. Which is why the IPCC reviews of the literature are right, and Cato’s adverts are bullshit.

Disclaimer: I don’t publish in the climate science literature either (it’s not my field). I’ve spent enough time hanging out with climate scientists to have a good feel for the science, but I’ll also get it wrong occasionally. If in doubt, check with a real expert.

Well, this is a little off topic, but we (Janice, Dana, Peggy and I) have been invited to run this year’s International Advanced School of Empirical Software Engineering, in Florida in October. We’ve planned the day around the content of our book chapter on Selecting Empirical Research Methods for Software Engineering Research, which appeared in the book Guide to Advanced Empirical Software Engineering. It’s going to be a lot of fun!