For many decades, computational speed has been the main limit on the sophistication of climate models. Climate modelers have become one of the most demanding groups of users of high performance computing, and access to ever faster machines drives much of the progress, permitting higher resolution models and allowing more earth system processes to be explicitly resolved. But from my visits to NCAR, MPI-M and IPSL this summer, I’m learning that growth in the volume of data handled is increasingly a dominant factor. The volume of data generated by today’s models has grown so much that supercomputer facilities are struggling to handle it.
Currently, the labs are busy with the CMIP5 runs that will form one of the major inputs to the next IPCC assessment report. See here for a list of the data outputs required from the models (and note that the requirements were last changed on Sept 17, 2010 – well after most centers had started their runs; after all, it takes months to complete the runs, and the target date for submitting the data is the end of this year).
Climate modelers have requirements that are somewhat different from those of most other users of supercomputing facilities anyway:
- very long runs – e.g. runs that take weeks or even months to complete;
- frequent stop and restart of runs – e.g. the runs might be configured to stop once per simulated year, at which point they generate a restart file and then automatically resume, so that intermediate results can be checked and analyzed, and because some experiments use multiple model variants initialized from a restart file produced partway through a baseline run (see the sketch after this list);
- very high volumes of data generated – e.g. the CMIP5 runs currently underway at IPSL generate 6 terabytes per day, and in postprocessing this goes up to 30 terabytes per day. This is a problem, given that the NEC SX-9 being used for these runs has a 4 terabyte work disk and a 35 terabyte scratch disk, and it’s getting increasingly hard to move the data to the tape archive fast enough (the arithmetic is sketched below).
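To make the stop/restart pattern in the second bullet concrete, here’s a minimal sketch in Python of that kind of driver loop. It is only an illustration of the idea: the run_one_year function, the pickle-based restart files, and the directory layout are all invented for this example and don’t correspond to any centre’s actual run scripts.

```python
# Sketch of a stop/restart driver loop: run one simulated year, write a
# restart file, then resume from the most recent restart file next time.
# run_one_year() and the restart-file format are hypothetical placeholders.

import pickle
from pathlib import Path

RESTART_DIR = Path("restarts")
RESTART_DIR.mkdir(exist_ok=True)

def run_one_year(state: dict) -> dict:
    """Placeholder for one simulated year of model integration."""
    state = dict(state)
    state["year"] += 1
    return state

def restart_path(year: int) -> Path:
    return RESTART_DIR / f"restart_{year:04d}.pkl"

# Resume from the latest restart file if one exists, otherwise cold-start.
existing = sorted(RESTART_DIR.glob("restart_*.pkl"))
state = pickle.loads(existing[-1].read_bytes()) if existing else {"year": 0}

END_YEAR = 100
while state["year"] < END_YEAR:
    state = run_one_year(state)
    # One restart file per simulated year; a branch run (e.g. a perturbed
    # model variant) can be initialized from any of these files, and the
    # intermediate output can be checked between segments.
    restart_path(state["year"]).write_bytes(pickle.dumps(state))
```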
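And here is the back-of-the-envelope arithmetic behind the third bullet, assuming decimal terabytes: what those daily volumes mean as sustained transfer rates, and how quickly the scratch disk fills if the data can’t be drained to tape.

```python
# Back-of-the-envelope check on the I/O rates quoted above (decimal TB assumed).

SECONDS_PER_DAY = 24 * 60 * 60

def tb_per_day_to_mb_per_s(tb_per_day: float) -> float:
    """Convert a daily data volume (TB/day) to a sustained rate in MB/s."""
    return tb_per_day * 1e12 / SECONDS_PER_DAY / 1e6

raw_rate = tb_per_day_to_mb_per_s(6)     # model output: ~69 MB/s sustained
post_rate = tb_per_day_to_mb_per_s(30)   # post-processing: ~347 MB/s sustained

# Days until the 35 TB scratch disk fills if nothing is moved to tape:
days_to_fill = 35 / 30                   # ~1.2 days

print(f"raw output:       {raw_rate:.0f} MB/s sustained")
print(f"post-processing:  {post_rate:.0f} MB/s sustained")
print(f"scratch fills in: {days_to_fill:.1f} days at 30 TB/day")
```

In other words, the post-processed output has to stream to the tape archive at roughly 350 MB/s around the clock, or the scratch disk overflows in little more than a day.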
Everyone seems to have underestimated the volumes of data generated by these CMIP5 runs. The implication is that data throughput is becoming a more important factor than processor speed, which may mean that climate computing centres need a different architecture from the one most high performance computing centres offer.
Anyway, I was going to write more about the infrastructure needed for this data handling problem, but Bryan Lawrence beat me to it, with his presentation to the NSF cyberinfrastructure “data task force”. He makes excellent points about the (lack of) scalability of the current infrastructure, the social and cultural questions of how people get credit for the work they put into this infrastructure, and the issues of data curation and trust. The danger is that we will end up creating a WORN (write-once, read-never) archive with all this data…!