For many decades, computational speed has been the main limit on the sophistication of climate models. Climate modelers have become one of the most demanding groups of users of high performance computing, and access to ever faster machines drives much of the progress, permitting higher resolution models and allowing more earth system processes to be explicitly resolved. But from my visits to NCAR, MPI-M and IPSL this summer, I’m learning that growth in the volume of data handled is increasingly a dominant factor. The volume of data generated by today’s models has grown so much that supercomputer facilities are struggling to handle it.
Currently, the labs are busy with the CMIP5 runs that will form one of the major inputs to the next IPCC assessment report. See here for a list of the data outputs required from the models (and note that the requirements were last changed on Sept 17, 2010 – well after most centers had started their runs; after all, it takes months to complete the runs, and the target date for submitting the data is the end of this year).
Climate modelers have requirements that are somewhat different from those of most other users of supercomputing facilities anyway:
- very long runs – e.g. runs that take weeks or even months to complete;
- frequent stop and restart of runs – e.g. the runs might be configured to stop once per simulated year, at which point they generate a restart file and then automatically resume, so that intermediate results can be checked and analyzed, and because some experiments use multiple model variants initialized from a restart file produced partway through a baseline run (see the sketch after this list);
- very high volumes of data generated – e.g. the CMIP5 runs currently underway at IPSL generate 6 terabytes per day, and in postprocessing this goes up to 30 terabytes per day. This is a problem, given that the NEC SX-9 being used for these runs has a 4 terabyte work disk and a 35 terabyte scratch disk, and it’s getting increasingly hard to move the data to the tape archive fast enough (the arithmetic is sketched below).
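To make the stop/restart pattern in the second bullet concrete, here’s a minimal sketch in Python of that kind of driver loop. It is only an illustration of the idea: the run_one_year function, the pickle-based restart files, and the directory layout are all invented for this example and don’t correspond to any centre’s actual run scripts.

```python
# Sketch of a stop/restart driver loop: run one simulated year, write a
# restart file, then resume from the most recent restart file next time.
# run_one_year() and the restart-file format are hypothetical placeholders.

import pickle
from pathlib import Path

RESTART_DIR = Path("restarts")
RESTART_DIR.mkdir(exist_ok=True)

def run_one_year(state: dict) -> dict:
    """Placeholder for one simulated year of model integration."""
    state = dict(state)
    state["year"] += 1
    return state

def restart_path(year: int) -> Path:
    return RESTART_DIR / f"restart_{year:04d}.pkl"

# Resume from the latest restart file if one exists, otherwise cold-start.
existing = sorted(RESTART_DIR.glob("restart_*.pkl"))
state = pickle.loads(existing[-1].read_bytes()) if existing else {"year": 0}

END_YEAR = 100
while state["year"] < END_YEAR:
    state = run_one_year(state)
    # One restart file per simulated year; a branch run (e.g. a perturbed
    # model variant) can be initialized from any of these files, and the
    # intermediate output can be checked between segments.
    restart_path(state["year"]).write_bytes(pickle.dumps(state))
```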
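And here is the back-of-the-envelope arithmetic behind the third bullet, assuming decimal terabytes: what those daily volumes mean as sustained transfer rates, and how quickly the scratch disk fills if the data can’t be drained to tape.

```python
# Back-of-the-envelope check on the I/O rates quoted above (decimal TB assumed).

SECONDS_PER_DAY = 24 * 60 * 60

def tb_per_day_to_mb_per_s(tb_per_day: float) -> float:
    """Convert a daily data volume (TB/day) to a sustained rate in MB/s."""
    return tb_per_day * 1e12 / SECONDS_PER_DAY / 1e6

raw_rate = tb_per_day_to_mb_per_s(6)     # model output: ~69 MB/s sustained
post_rate = tb_per_day_to_mb_per_s(30)   # post-processing: ~347 MB/s sustained

# Days until the 35 TB scratch disk fills if nothing is moved to tape:
days_to_fill = 35 / 30                   # ~1.2 days

print(f"raw output:       {raw_rate:.0f} MB/s sustained")
print(f"post-processing:  {post_rate:.0f} MB/s sustained")
print(f"scratch fills in: {days_to_fill:.1f} days at 30 TB/day")
```

In other words, the post-processed output has to stream to the tape archive at roughly 350 MB/s around the clock, or the scratch disk overflows in little more than a day.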
Everyone seems to have underestimated the volumes of data generated by these CMIP5 runs. The implication is that data throughput is becoming a more important factor than processor speed, which may mean that climate computing centres need a different architecture from the one most high performance computing centres offer.
Anyway, I was going to write more about the infrastructure needed for this data handling problem, but Bryan Lawrence beat me to it, with his presentation to the NSF cyberinfrastructure “data task force”. He makes excellent points about the (lack of) scalability of the current infrastructure, the social and cultural questions of how people get credit for the work they put into this infrastructure, and the issues of data curation and trust. The danger is that we will end up creating a WORN (write-once, read-never) archive with all this data…!