One of the exercises I set myself while visiting NCAR this month is to try porting the climate model CESM1 to my laptop (a MacBook Pro). Partly because I wanted to understand what makes it tick, and partly because I thought it would be neat to be able to run it anywhere. At first I thought the idea was crazy – these things are designed to be run on big supercomputers. But CESM is also intended to be portable, as part of a mission to support a broader community of model users. So, porting it to my laptop is a simple proof of concept test – if it ports easily, that’s a good sign that the code is robust.

It took me several days of effort to complete the port, but most of that time was spent on two things that have very little to do with the model itself. The first was a compiler bug that I tripped over (yeah, yeah, blame the compiler, right?) and the second was the issue of getting all the necessary third party packages installed. But in the end I was successful. I’ve just completed two very basic test runs of the model. The first is what’s known as an ‘X’ component set, in which all the major components (atmosphere, ocean, land, ice, etc) don’t actually do anything – this just tests that the framework code builds and runs. The second is an ‘A’ compset at a low resolution, in which all the components are static data models (this ran for five days of simulation time in about 1.5 minutes). If I were going on to test the port properly, there’s a whole sequence of port validation tests that I ought to perform, for example to check that my runs are consistent with the benchmark runs, that I can stop and restart the model from the data files, that I get the same results in different model configurations, etc. And then eventually there are the scientific validation tests – checks that the simulated climate in my ported model is realistic.

But for now, I just want to reflect on the process of getting the model to build and run on a new (unsupported) platform. I’ll describe some of the issues I encountered, and then reflect on what I’ve learned about the model. First, some stats. The latest version of the model, CESM1.0, was released on June 25, 2010. It contains just under 1 million lines of code. Three quarters of this is Fortran (mainly Fortran 90), the rest is a mix of shell scripts (of several varieties), XML and HTML:

[Figure: lines of code count for CESM v1.0 (not including build scripts), as calculated by cloc.sourceforge.net v1.51]

In addition to the model itself, there are another 12,000 lines of Perl and shell script that handle installing, configuring, building and running the model.

The main issues that tripped me up were:

  • The compiler. I decided to use the GNU compiler package (gfortran, included in the gcc package), because it’s free. But it’s not one of the compilers that’s supported for CESM, because in general CESM is used with commercial compilers (e.g. IBM’s) on the supercomputers. I grabbed the newest version of gcc that I could find a pre-built Mac binary for (v4.3.0), but it turned out not to be new enough – I spent a few hours diagnosing what turned out to be a (previously undiscovered?) bug in gfortran v4.3.0 that’s fixed in newer versions (I switched to v4.4.1). And then there’s a whole bunch of compiler flags (mainly to do with compatibility for certain architectures and language versions) that differ between gfortran and the commercial compilers, which I needed to track down.
  • Third party packages such as MPI (the message passing interface used for exchanging data between model components) and NetCDF (the data formatting standard used for geoscience data). It turns out that the Mac already has MPI installed, but without Fortran and Parallel IO support, so I had to rebuild it. And it took me a few rebuilds to get both these packages installed with all the right options.

Once I’d got these sorted, and figured out which compiler flags I needed, the build went pretty smoothly, and I’ve had no problems so far running it. Which leads me to draw a number of (tentative) conclusions about portability. First, CESM is a little unusual compared to most climate models, because it is intended as a community effort, and hence portability is a high priority. It has already been ported to around 30 different platforms, including a variety of IBM and Cray supercomputers, and various Linux clusters. Just the process of running the code through many different compilers not only shakes out portability issues but also encourages good coding practices, as different compilers tend to be picky about different language constructs.

Second, in the process of building the model, it’s quite easy to see that it consists of a number of distinct components, written by different communities, to different coding standards. Most obviously, CESM itself is built from five different component models (atmosphere, ocean, sea ice, land ice, land surface), along with a coupler that allows them to interact. There’s a tension between the needs of scientists who develop code just for a particular component model (run as a standalone model) versus scientists who want to use a fully coupled model. These communities overlap, but not completely, and coordinating the different needs takes considerable effort. Sometimes code that makes sense in a standalone module will break the coupling scheme.

But there’s another distinction that wasn’t obvious to me previously:

  • Scientific code – the bulk of the Fortran code in the component modules. This includes the core numerical routines, radiation schemes, physics parameterizations, and so on. This code is largely written by domain experts (scientists), for whom scientific validity is the over-riding concern (and hence they tend to under-estimate the importance of portability, readability, maintainability, etc).
  • Infrastructure code – including the coupler that allows the components to interact, the shared data handling routines, and a number of shared libraries. Most of this I could characterize as a modeling framework – it provides an overall architecture for a coupled model, and calls the scientific code as and when needed. This code is written jointly by the software engineering team and the various scientific groups.
  • Installation code – including configuration and build scripts. These are distinct from the model itself, but intended to provide flexibility to the community to handle a huge variety of target architectures and model configurations. These are written exclusively by the software engineering team (I think!), and tend to suffer from a serious time crunch: making this code clean and maintainable is difficult, given the need to get a complex and ever-changing model working in reasonable time.

In an earlier post, I described the rapid growth of complexity in earth system models as a major headache. This growth of complexity can be seen in all three types of software, but the complexity growth is compounded in the latter two: modeling frameworks need to support a growing diversity of earth system component models, which then leads to exponential growth in the number of possible model configurations that the build scripts have to deal with. Handling the growing complexity of the installation code is likely to be one of the biggest software challenges for the earth system modeling community in the next few years.
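
To make the arithmetic behind that exponential growth concrete, here is a minimal sketch in Python. The three modes per component ("active", "data", "inactive") are an illustrative assumption loosely modelled on the component sets described earlier, not CESM's actual configuration space, and the function names are hypothetical.

```python
from itertools import product

# Illustrative assumption: the five component models named in the post, each
# of which can be run in one of three hypothetical modes.
COMPONENTS = ["atmosphere", "ocean", "sea_ice", "land_ice", "land_surface"]
MODES = ["active", "data", "inactive"]


def count_configurations(n_components, n_modes):
    """Number of ways to choose one mode for each component."""
    return n_modes ** n_components


def enumerate_configurations(components, modes):
    """Yield every assignment of one mode to each component, as a dict."""
    for combo in product(modes, repeat=len(components)):
        yield dict(zip(components, combo))


if __name__ == "__main__":
    # Each extra component multiplies the number of configurations by len(MODES).
    for n in range(1, len(COMPONENTS) + 2):
        print(n, "components:", count_configurations(n, len(MODES)), "configurations")
```

Under these assumptions, five components give 3^5 = 243 combinations and a sixth would give 729, before even counting resolutions, machines or compilers. That multiplication is what the build scripts have to cope with.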

17 Comments

  1. Interesting post, Steve! Does CESM use commercial compilers due to better performance in supercomputing environments?

    In your observations about scientific code, you write “scientific validity is the over-riding concern (and hence they tend to under-estimate the importance of portability, readability, maintainability, etc).” If scientific validity is the scientist’s equivalent to the software developer’s program correctness, should this rightfully not be the overriding concern? It doesn’t seem like their emphasis on scientific validity is what detracts from portability and all those good things — intuitively, they feel like separate concerns. Perhaps it is the scientists’ desire to move on to the next task that steers them away from these software quality issues. We probably shouldn’t be complaining about lack of documentation, portability, etc. (and I think you’re just making observations here, not complaining); after all, we’re trying to spur people into action, not have them sit around and design by committee!

    My impression of this community is that if something needs fixing, you go to the source, if possible. If something doesn’t need fixing for a while, perhaps it isn’t broken. I wonder if the quality of the code has evolved to require the minimum amount of effort of the programmer to maintain (that is, minimizing the total cost of communicating with others after writing code and writing the code in the first place).

  2. NASA has an effort (the Climate-in-a-Box) to simplify the process of getting up and running for scientists by providing hardware pre-packaged with modeling codes and frameworks. I believe the idea is that scientists could 1) not have to mess with a complex installation for their personal use; and 2) be able to run low-resolution scenarios to help determine what is worth submitting for high-res runs on a supercomputer. The Climate-in-a-Box uses a small Cray machine; the next step in accessibility may be to provide more conventional installers for Mac and Windows.

    I also wonder what effects increased accessibility of climate models would have on the public debate on climate change…

  3. Very cool to hear of your success!

    I had poked some last month at getting ccsm running on my Ubuntu vm. Ran into architecture specific makefile issues seemingly similar to what you described. I moved on for now but hope to return to the project.

    Any chance you can post your changes/makefiles?

    As to the ‘debate’, it will have minimal impact. The arguments against the models have more to do with lack of validation over time, complexity, “ignores everything except CO2”, and “tuning” of key parameters. I doubt that having a cadre of amateurs running their own versions of climate models will have much impact on those kinds of critiques/fallacies.

    What might be more fruitful is that an amateur cadre may pursue interesting project ‘experiments’ or educational demos for which the big shops might not have the time or inclination.

    -an amateur

  4. Ron,
    I wasn’t sure whether anyone would want the gory details, but here are my detailed notes from the port.

    0) Platform: A MacBook Pro laptop. I’m running Leopard, Mac OS X version 10.5.8, on an Intel Core Duo.

    1) Fortran Compiler. There isn’t one installed on the Mac, so I used the GNU compiler (gfortran, part of the gcc package) as it’s free. This site was very useful for getting both the compiler and the various libraries I needed:
    http://hpc.sourceforge.net/

    A GOTCHA: I originally downloaded a gfortran binary from http://www.macresearch.org/gfortran-leopard. This gave me gfortran v4.3.0, which turns out to have a bug in how it handles the symbol table for Fortran modules, which shows up when compiling mct/mpeu. This specific bug wasn’t reported anywhere in the gfortran bug database (so I reported it), but is fixed anyway in the latest version, 4.4.1.
    Hence, I’d recommend building the latest version of gcc from the sources, rather than relying on an older binary.

    2) The sourceforge HPC page says I need Apple’s developer tools installed, so I grabbed
    Xcode 3.1.4 Developer Tools (currently the highest release for Leopard – Xcode 3.2 and up are only for OS 10.6 and up). Available from http://developer.apple.com/mac/
    You have to register as an Apple developer, but it’s free.
    Incidentally, I never found out why I need the developer tools, so I don’t know what the dependencies are.

    3) MPI. I’m using MPICH2 v1.2.1p1 from
    http://www.mcs.anl.gov/research/projects/mpich2/
    I decided not to use the binaries, as the CESM docs suggest MPI should be compiled with the same compiler as CESM. So I downloaded the source, and built it myself.

    A GOTCHA: OS X version 10.5 onwards already has MPI installed in /usr/bin, but it’s built without Fortran support. And because /usr/bin is ahead of /usr/local/bin in the search path, CESM was picking up the wrong version. Eventually, I just replaced the old version completely in /usr/bin. There are some notes about the pre-installed version of MPI here:
    http://www.open-mpi.org/faq/?category=osx#osx-bundled-ompi
    …including the recommendation *not* to replace the pre-installed version. Too late for me – I hope I didn’t break anything!
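
The search-path problem described above can be spotted before building. The following is a minimal sketch, assuming Python 3 is available; `find_all_on_path` is a hypothetical helper, and `mpif90`, `mpicc` and `mpiexec` are the usual MPI wrapper and launcher names rather than anything specific to CESM. It lists every copy of a tool on the search path in lookup order, so the first entry is the one the build would pick up.

```python
import os


def find_all_on_path(name, path=None):
    """Return every executable called `name` on the search path, in lookup order."""
    if path is None:
        path = os.environ.get("PATH", "")
    hits = []
    for directory in path.split(os.pathsep):
        if not directory:
            continue  # skip empty PATH entries
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            hits.append(candidate)
    return hits


if __name__ == "__main__":
    for tool in ("mpif90", "mpicc", "mpiexec"):
        copies = find_all_on_path(tool)
        if len(copies) > 1:
            print(tool, "is shadowed; the first of these wins:", copies)
        else:
            print(tool, copies)
```

If a freshly built MPI in /usr/local/bin shows up second in such a list, putting its directory earlier in the search path is a gentler alternative to overwriting the system copy.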

    I used these instructions for building MPI:
    http://www.mcs.anl.gov/research/projects/mpich2/documentation/files/mpich2-1.2.1-installguide.pdf
    [Note to self: I put the sources in ~/Library/mpich2-1.2.1p1 and the build in ~/mpich2-build and installed in /usr]

    4) NetCDF. I used v4.1.1, which I got from here:
    http://www.unidata.ucar.edu/software/netcdf/
    And I used the installation instructions here:
    http://www.unidata.ucar.edu/software/netcdf/docs/netcdf-install/Building-on-Unix.html#Building-on-Unix
    [Note to self: I put the sources in ~/Library/netcdf-4.1.1 and the build in ~/netcdf-build and installed in /usr/local]

    A GOTCHA: NetCDF must be installed with support for Parallel IO, otherwise the cesm/pio libraries won’t compile. For this I also needed to install HDF5. I downloaded version 1.8.5 from here: http://www.hdfgroup.org/HDF5/release/obtain5.html
    This in turn needs SZIP (source code available from the same place).

    Another GOTCHA: I had great difficulty getting NetCDF to build, as it kept complaining about not finding HDF5. Eventually I just used the pre-compiled binaries available from the link above.

    5) libcurl. This isn’t installed on the Mac, and is needed to build the libraries in csm_share.
    I downloaded from here: http://curl.haxx.se/download.html
    And built from the sources according to instructions here: http://curl.haxx.se/docs/install.html

    A GOTCHA: I also needed to add compiler switches for linking: -lcurl -lssl -lcrypto -lz
    More on compiler switches in a moment…

    6) Okay, now ready to install CESM.
    a) I downloaded source for CESM1.0 from the release site:
    svn co --username guestuser https://svn-ccsm-release.cgd.ucar.edu/model_versions/cesm1_0
    (you have to register to get the password for the repository)

    b) For porting, I found the porting hints in the appendix to the tutorial to be very useful:
    http://www.cesm.ucar.edu/models/cesm1.0/cesm/cesm1_tutorial.pdf
    In particular, some of the tips in Appendix E aren’t in the user guide. Particularly the bit about MCT and PIO using their own build systems, so they don’t pick up the regular compiler flags. It took me a while to figure out that I needed to insert their compiler flags in the CONFIG_ARGS variable.
    Chapter 7 of the user guide was also useful for a step by step guide to porting:
    http://www.cesm.ucar.edu/models/cesm1.0/cesm/cesm_doc/c2161.html

    c) I started from the generic_linux_intel machine settings for my build. The commands to create and configure my first case are straight from chapter 7 of the user guide:
    % ./create_newcase -case ~/cesm1_0/cases/test1 -res f19_g16 -compset X -mach generic_linux_intel -scratchroot ~/cesm1_0/scratch -din_loc_root_csmdata ~/cesm1_0/inputdata -max_tasks_per_node 8
    % cd ../cases/test1
    % ./configure -case
    % ./test1.generic_linux_intel.build

    But the build fails without some customization. Here are all the things I found I needed to set (the changes to the compiler flags were found largely by trial and error):

    in Macros.generic_linux_intel:

    I changed the precompiler flags from -DLINUX to -DDarwin:
    CPPDEFS += -DDarwin -DSEQ_$(FRAMEWORK) -DFORTRANUNDERSCORE -DNO_R16 -DNO_SHR_VMATH

    I set the paths for netcdf and mpich (/usr/local and /usr respectively)

    I removed these compiler options:
    “-132 -fp-model precise -convert big_endian -assume byterecl -ftz -traceback”
    …as the gfortran compiler doesn’t understand any of them. I had to add -fno-range-check to get rid of some pesky gfortran errors. So I ended up with:
    FFLAGS := $(CPPDEFS) -g -fno-range-check
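
The flag surgery described above can be expressed as a small filter. This is only a sketch in Python under stated assumptions: the list of options to drop is copied from the note above, `strip_flags` is a hypothetical helper, and the starting flag string in the example is made up for illustration rather than taken from the real Macros file.

```python
import shlex

# Options listed above as not understood by gfortran. Three of them take a
# following argument, so each entry is a tuple of consecutive tokens.
INTEL_ONLY = [
    ("-132",),
    ("-fp-model", "precise"),
    ("-convert", "big_endian"),
    ("-assume", "byterecl"),
    ("-ftz",),
    ("-traceback",),
]


def strip_flags(flags, unwanted=INTEL_ONLY):
    """Remove each unwanted option (and its argument) from a flag string."""
    tokens = shlex.split(flags)
    kept = []
    i = 0
    while i < len(tokens):
        for group in unwanted:
            if tuple(tokens[i:i + len(group)]) == group:
                i += len(group)  # skip the option and any argument
                break
        else:
            kept.append(tokens[i])
            i += 1
    return " ".join(kept)


if __name__ == "__main__":
    # Hypothetical starting point, for illustration only.
    original = ("-g -132 -fp-model precise -convert big_endian "
                "-assume byterecl -ftz -traceback")
    print(strip_flags(original) + " -fno-range-check")
```

Matching whole option groups rather than single tokens avoids leaving a stray argument such as `precise` behind when its option is removed.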

    I added the libraries for libcurl to the linker flags (I used “curl-config --libs” to find out what flags I needed)
    LDFLAGS := -lcurl -lssl -lcrypto -lz
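
For anyone repeating this on a different machine, the lookup can be scripted. A minimal sketch, assuming Python 3.7 or later; both function names are hypothetical, and the only external command used is the `curl-config --libs` call mentioned above. The output of that command differs between installations, so the flags shown in these notes should not be taken as universal.

```python
import shlex
import shutil
import subprocess


def split_link_flags(output):
    """Split a linker flag string into library flags (-l...) and everything else."""
    tokens = shlex.split(output)
    libs = [t for t in tokens if t.startswith("-l")]
    other = [t for t in tokens if not t.startswith("-l")]
    return libs, other


def curl_link_flags():
    """Ask curl-config for its linker flags; return None if it is not installed."""
    exe = shutil.which("curl-config")
    if exe is None:
        return None
    result = subprocess.run([exe, "--libs"], capture_output=True,
                            text=True, check=True)
    return result.stdout.strip()


if __name__ == "__main__":
    flags = curl_link_flags()
    if flags is None:
        print("curl-config not found on the search path")
    else:
        libs, other = split_link_flags(flags)
        print("LDFLAGS :=", " ".join(other + libs))
```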

    I added the compiler flag -ffree-line-length-none to the PIO build, otherwise gfortran truncates some long lines in the PIO fortran code. I did this by adding this flag only to the line inside the ifeq ($(MODEL),pio) code:
    CONFIG_ARGS += CC="$(CC)" F90="$(FC)" NETCDF_PATH="$(NETCDF_PATH)" MPI_INC="-I$(INC_MPI)" FFLAGS="-ffree-line-length-none"
    There’s probably a cleaner way to do this, via the PIO_CONFIG_OPTIONS, but it works for now.

    …and after all that, it built successfully.

    d) To run the model, I just uncommented the following line in test1.generic_linux_intel.run:
    mpiexec -n 16 ./ccsm.exe >&! ccsm.log.$LID

    …and ran it by invoking the run script directly. And of course, nothing happened, because I forgot to start the MPI daemon (mpd) first. So, finally, to run it I did:

    % mpd &
    % ./test1.generic_linux_intel.run

    and off it goes.

    7) Some reflections:
    It took me somewhere between 2 and 3 days’ work to get all this installed and running (I worked on it for an hour or two at a time over two weeks). Three issues took the most time:
    a) getting MPI and NetCDF installed and built correctly (with all the correct options). I ended up building each of them several times over with different options until I had everything in place.
    b) diagnosing the compiler error in gfortran 4.3.0 (If only I’d started with the latest release!!). However, I did find out that the gfortran compiler team are remarkably responsive to bug reports. I had several replies to my bug report within a few hours! They never did find out what caused the problem, but closed the report because v4.4.1 worked fine.
    c) figuring out all the compiler and linker flags to get pio and csm_share to build correctly.

    There’s a line in the tutorial notes that says most of the problems are likely to be found trying to build MCT and PIO. I agree!

    8) POSTSCRIPT: Sometime during this process, I appear to have completely broken all the Microsoft Office applications (Word, Excel, Powerpoint…). They now all crash immediately on launch. I’ve no idea if this was something I did during installation (perhaps when I blew away the pre-installed MPI??). Or it might be completely unrelated. The only advice I can find on the web about how to fix this is to completely re-install MS Office from the disks. Well, this has finally persuaded me to switch to OpenOffice instead, and I’m loving it so far. How’s that for an unexpected side effect?

  5. Have you looked at climateprediction.net lately? (NB: am sitting in a French hotel room nearing the end of trying to develop a European earth system model strategy that, amongst other things, addresses planning to try to further decouple science code from what you call infrastructure code – since the multicore future is likely to make porting even more hazardous. To that end, you might be interested in a presentation (pdf) I gave on the topic some months ago, if you haven’t seen it already.)

  6. Fascinating. I might give this a go on FreeBSD some time in the next couple of months. Presumably all the various glue scripts (e.g. in different shells, and in awk, sed, m4, and Perl) just grew like that? A decent process would pick a glue language (or possibly 2) and stick with it.

  7. @steve
    You might start disliking OpenOffice when you fill out IRB forms… but that’s what the CSLab apps servers are for!

  8. Isn’t the HTML figure more for documentation than ‘code’? In that case I would be careful about making “faults per LOC” calculations based on the 1 million figure.

    Do you have any good idea about separating code that is for ‘portability’ reasons (e.g. flags, IFDEFs, makefiles) as opposed to scientific ones? It would be interesting to do some type of aspect-oriented analysis using automated machine learners.

  9. @steve
    “Another GOTCHA: I had great difficulty getting NetCDF to build, as it kept complaining about not finding HDF5. Eventually I just used the pre-compiled binaries available from the link above.”

    Gah! Tell me about it… and in the process of trying to get NetCDF to build, I think I may have broken something such that the pre-compiled binaries didn’t work for me either. (though maybe I will try again with your paragraph as a guide).

  10. What I’m most curious about is whether you got any help from the NCAR folks on this or had to sort it all out yourself.

  11. The code compiled successfully, but no results were produced after I commented this line

    mpiexec -n 16 ./ccsm.exe >&! ccsm.log.$LID

    I ran the following commands, and tested MPI with the example program from Wikipedia (http://en.wikipedia.org/wiki/Message_Passing_Interface) to make sure MPI is running. However, the program runs for a long time without producing any files in $RUNDIR.

    % mpd &
    % ./test1.generic_linux_intel.run

    Why? Thanks for your help.

  12. @steve
    Why do I get the following error when executing $CASE.run?

    Thu Sep 16 00:17:04 EDT 2010 — CSM EXECUTION BEGINS HERE
    (seq_comm_setcomm) initialize ID ( 7 GLOBAL ) pelist = 0 0 1 ( npes = 1) ( nthreads = 1)
    [zhu-d-debian:06606] *** An error occurred in MPI_Group_range_incl
    [zhu-d-debian:06606] *** on communicator MPI_COMM_WORLD
    [zhu-d-debian:06606] *** MPI_ERR_RANK: invalid rank
    [zhu-d-debian:06606] *** MPI_ERRORS_ARE_FATAL (goodbye)
    Thu Sep 16 00:17:05 EDT 2010 — CSM EXECUTION HAS FINISHED

    Thanks in advance.

  13. Matthias Demuzere

    @ Steve!

    Hi Steve! At the moment I am also trying to build cesm on my laptop, a Dell with Ubuntu 10.10 as OS. Your post has been extremely useful to me, as I had a lot of similar problems. Finally, I got most of the model building well, apart from the following issue, which appears during the build of lnd.

    Error: Conversion of an Infinity or Not-a-Number at (1) to INTEGER

    My hunch is that I should add something to the FFLAGS in Macros, but can’t find any documentation on what to put. For now I just have:
    FFLAGS := $(CPPDEFS) -g -fno-range-check -fconvert=big-endian

    I use gfortran v4.4.5

    If you would have any ideas, please let me know!
    Thanks,
    Matthias

  14. @steve

    Hello!

    I am porting cesm1.0 to a supercomputer. I used the default infilter compiler flags, the default netCDF-4.1.1 and so on, except for MPI: I installed mpich2-1.2.1p1 on the machine. I don’t know how to set the flags, but I successfully built a case on this machine and could submit it. The result is not ideal: it stops and I don’t know why. Do the flags influence the run? Thank you very much!

  15. Hi
    I am trying to port CESM1.0 to my Linux laptop…. While building it, an error occurred. The error file follows…

    Any help is greatly appreciated…..

    thanks in advance

    Kishore

    Thu Nov 3 16:40:24 IST 2011 /home/kishoreragi/scratch/dead/pio/pio.bldlog.111103-164002
    Copying source to CCSM EXEROOT…
    ignore pio file alloc_mod.F90
    ignore pio file box_rearrange.F90
    cp: omitting directory `configure.old’
    ignore pio file iompi_mod.F90
    ignore pio file mct_rearrange.F90
    ignore pio file piodarray.F90
    ignore pio file pionfatt_mod.F90
    ignore pio file pionfget_mod.F90
    ignore pio file pionfput_mod.F90
    ignore pio file pionfread_mod.F90
    ignore pio file pionfwrite_mod.F90
    ignore pio file pio_spmd_utils.F90
    ignore pio file pio_support.F90
    ignore pio file rearrange.F90
    New build of PIO
    Running configure…
    for OS=Linux MACH=generic_linux_intel
    cat: Filepath: No such file or directory
    cat: Srcfiles: No such file or directory
    /home/kishoreragi/Desktop/cesm/scripts/dead/Tools/mkSrcfiles > /home/kishoreragi/scratch/dead/pio/Srcfiles
    cp -f /home/kishoreragi/scratch/dead/pio/Filepath /home/kishoreragi/scratch/dead/pio/Deppath
    /home/kishoreragi/Desktop/cesm/scripts/dead/Tools/mkDepends Deppath Srcfiles > /home/kishoreragi/scratch/dead/pio/Depends
    ./configure --disable-mct --disable-timing CC="mpicc" F90="mpif90" NETCDF_PATH="/usr/local/netcdf-3.6.3-intel-3.2.02" MPI_INC="-I/usr/local/include"
    checking for C compiler default output file name… a.out
    checking whether the C compiler works… yes
    checking whether we are cross compiling… no
    checking for suffix of executables…
    checking for suffix of object files… o
    checking whether we are using the GNU C compiler… yes
    checking whether mpicc accepts -g… yes
    checking for mpicc option to accept ISO C89… none needed
    checking for mpicc… mpicc
    checking for MPI_Init… yes
    checking for mpi.h… yes
    checking how to run the C preprocessor… mpicc -E
    checking for grep that handles long lines and -e… /bin/grep
    checking for egrep… /bin/grep -E
    checking for ANSI C header files… yes
    checking for sys/types.h… yes
    checking for sys/stat.h… yes
    checking for stdlib.h… yes
    checking for string.h… yes
    checking for memory.h… yes
    checking for strings.h… yes
    checking for inttypes.h… yes
    checking for stdint.h… yes
    checking for unistd.h… yes
    checking for char… yes
    checking size of char… 1
    checking for int… yes
    checking size of int… 4
    checking for float… yes
    checking size of float… 4
    checking for double… yes
    checking size of double… 8
    checking for void *… yes
    checking size of void *… 4
    checking Fortran 90 filename extension… .F90
    checking whether we are using the GNU Fortran 90 compiler… yes
    checking for mpxlf90_r… no
    checking for mpxlf90… no
    checking for mpxlf95… no
    checking for mpif90… mpif90
    checking for mpif.h… yes
    checking MPI-IO support in MPI implementation… yes
    checking how to get the version output from mpif90… --version
    checking whether byte ordering is bigendian… no
    configure: WARNING: UNKNOWN FORTRAN 90 COMPILER
    checking whether fortran .mod file is uppercase… no
    checking for Fortran 90 name-mangling scheme… lower case, underscore
    checking for cpp… cpp
    checking if Fortran 90 compiler performs preprocessing… yes
    checking if C preprocessor can work with Fortran compiler… yes
    Full hostname= kishoreragi-L41II1
    Hostname=kishoreragi-L41II1
    Machine=i686
    OS=Linux
    using NETCDF_PATH from environment
    configure: error: netcdf.mod not found in NETCDF_PATH/include check the environment variable NETCDF_PATH
    gmake: *** [configure] Error 1
    Making dependencies for pio.F90 --> pio.d
    /bin/sh: AWK@: not found
    gmake: *** [pio.d] Error 127

    thank you

  16. I can’t download the data files. I have used the check_data_files script, but no data is downloaded.

    Thanks

  17. Hi Steve: thanks for the info. I’ve been an opteron user for a decade and want to get CESM up and running on a 48-core OpenSuSE desktop. How far do you think your flags will go with gfortran on linux? The Intel compiler is free for noncommercial use – maybe that would be easier…

    Thanks!
    Patricia
