Wednesday morning also saw the poster session “IN31B – Emerging Issues in e-Science: Collaboration, Provenance, and the Ethics of Data”. I was presenting Alicia’s poster on open science and reproducibility:

Identifying Communication Barriers to Scientific Collaboration

The poster summarizes Alicia’s master’s thesis work – a qualitative study of what scientists think about open science and reproducibility, and how they use these terms (Alicia’s thesis will be available real soon now). The most interesting outcome of the study, for me, was the realization that innocent-sounding terms such as “replication” mean very different things to different scientists. For example, when asked how many experiments in their field are replicated, and how many should be, scientists give answers that are all over the map. One reason is that the term “experiment” can mean vastly different things to different people, from a simple laboratory procedure that might take an hour or so, to a journal-paper-sized activity spanning many months. Another is that it’s not always clear what it means to “replicate” an experiment: to some people it means following the original experimental procedure exactly to try to generate the same results, while to others, replication includes different experiments intended to test the original result in a different way.

Once you’ve waded through the different meanings, there still seems to be a range of opinion on the desirability of frequent replication. In many fields (including my field, software engineering) there are frequent calls for more replication, along with complaints about the barriers (e.g. some journals won’t accept papers reporting replications because they’re not ‘original’ enough). However, on the specific question of how many published studies should be replicated, an answer other than “100%” is quite defensible: some published experiments are dead ends (research questions that should not be pursued further), and some are just bad experiments (experimental designs that in hindsight were deeply flawed). And then there’s the opportunity cost – instead of replicating an experiment for a very small knowledge gain, it’s often better to design a different experiment that probes new aspects of the same theory, for a much larger knowledge gain. We reflected on some of these issues in our ICSE 2008 paper On the Difficulty of Replicating Human Subjects Studies in Software Engineering.

Anyway, I digress. Alicia’s study also revealed a number of barriers to sharing data, suggesting that some of the stronger calls for open science and reproducibility standards are, at least currently, impractical. At a minimum, we need better tools for capturing data provenance and scientific workflows. But more importantly, we need to think more about the balance of effort – a scientist who has spent many years developing a dataset deserves appropriate credit for this effort (currently, we tend to credit only the published papers based on the data), and perhaps even some rights to exploit the dataset for their own research first, before sharing it. And for large, complex datasets, there’s the balance between ‘user support’, as other people try to use the data and have many questions about it, and getting on with your own research. I’ve already posted about an extreme case in climate science, where such questions can be used strategically in a kind of denial-of-service attack. The bottom line is that while in principle openness and reproducibility are important cornerstones of the scientific process, in practice there are all sorts of barriers, most of which are poorly understood.
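To make the provenance point concrete, here’s a minimal sketch of the kind of record-keeping that better tools would automate (the function and field names here are entirely made up for illustration – this is not the API of any real provenance system): every derived dataset gets a sidecar file identifying exactly which inputs, which code version, and which parameters produced it.

```python
import datetime
import hashlib
import json
from pathlib import Path


def file_checksum(path):
    """SHA-256 checksum of a file, so inputs can be identified exactly."""
    digest = hashlib.sha256()
    digest.update(Path(path).read_bytes())
    return digest.hexdigest()


def record_provenance(output_path, inputs, code_version, parameters):
    """Write a JSON provenance sidecar next to a derived dataset.

    All names here are hypothetical; a real tool would capture much more
    (environment, workflow graph, upstream provenance of each input).
    """
    record = {
        "output": str(output_path),
        "generated_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "code_version": code_version,      # e.g. a git commit hash
        "parameters": parameters,          # processing options actually used
        "inputs": [
            {"path": str(p), "sha256": file_checksum(p)} for p in inputs
        ],
    }
    sidecar = Path(str(output_path) + ".prov.json")
    sidecar.write_text(json.dumps(record, indent=2))
    return record
```

Even a lightweight convention like this would answer the most common questions people ask of a dataset’s producer – which inputs, which code, which settings – without a round trip to the original scientist.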

Alicia’s poster generated a huge amount of interest, and I ended up staying around the poster area for much longer than I expected, having all sorts of interesting conversations. Many people stopped by to ask questions about the results described on the poster, especially the tables (which seemed to catch everyone’s attention). I had a fascinating chat with Paulo Pinheiro da Silva, from UT El Paso, whose Cyber-Share project is probing many of these issues, especially the question of whether knowledge provenance and semantic web techniques can be used to help establish trust in scientific artefacts (e.g. datasets). We spent some time discussing what is good and bad about current metadata projects, and the greater challenge of capturing the tacit knowledge scientists have about their datasets. I also chatted briefly with Peter Fox, of Rensselaer, who has some interesting use cases where scientists need to search based on provenance rather than (or in addition to) content.

This also meant that I didn’t get anywhere near enough time to look at the other posters in the session. All looked interesting, so I’ll list them here to remind me to follow up on them:

  • IN31B-1001. Provenance Artifact Identification and Semantic Tagging in the Atmospheric Composition Processing System (ACPS), by Curt Tilmes of NASA GSFC.
  • IN31B-1002. Provenance-Aware Faceted Search, by Deborah McGuinness et al. of RPI (this was the work Peter Fox was telling me about).
  • IN31B-1003. Advancing Collaborative Climate Studies through Globally Distributed Geospatial Analysis, by Raj Singh of the Open Geospatial Consortium.
  • IN31B-1005. Ethics, Collaboration, and Presentation Methods for Local and Traditional Knowledge for Understanding Arctic Change, by Mark Parsons of the NSIDC.
  • IN31B-1006. Lineage management for on-demand data, by Mary Jo Brodzik, also of the NSIDC.
  • IN31B-1007. Experiences Developing a Collaborative Data Sharing Portal for FLUXNET, by Deb Agarwal of Lawrence Berkeley Labs.
  • IN31B-1008. Ignored Issues in e-Science: Collaboration, Provenance and the Ethics of Data, by Joe Hourclé, of NASA GSFC.
  • IN31B-1009. IsoMAP (Isoscape Modeling, Analysis, and Prediction), by Chris Miller, of Purdue.
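Peter Fox’s point about searching on provenance rather than content is easy to illustrate with a toy example (the catalogue, field names, and values below are invented for illustration, not any real schema): the query matches on facts about how a dataset was produced, not on what it contains.

```python
# Toy in-memory catalogue; two datasets hold the same variable,
# but were produced from different instruments.
catalogue = [
    {"id": "ds-001", "variable": "surface_temp",
     "provenance": {"instrument": "MODIS", "processing_version": "v5",
                    "derived_from": ["ds-000"]}},
    {"id": "ds-002", "variable": "surface_temp",
     "provenance": {"instrument": "AVHRR", "processing_version": "v5",
                    "derived_from": []}},
]


def search_by_provenance(catalogue, **criteria):
    """Return datasets whose provenance matches every given criterion."""
    def matches(record):
        prov = record["provenance"]
        return all(prov.get(key) == value for key, value in criteria.items())
    return [r for r in catalogue if matches(r)]
```

A content-based search for `surface_temp` returns both datasets, but `search_by_provenance(catalogue, instrument="MODIS")` returns only `ds-001` – the distinction that matters when you need to know not just what the data says, but where it came from.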


  1. This is related to a tricky problem we’ve had with distributing climate model output data to the community via the Earth System Grid project.

    How do we handle the translation from the raw model output to defined data requirements, à la CMIP3 and CMIP5? Can the model code itself be considered metadata of the output data? How do we handle, say, errors in the data processing that result in different answers, given that it’s possible some folks have already acquired the now-incorrect data? Tags? Versioning? Notifications via email?

    These are all nontrivial problems.

  2. Hi Steve – I’ve been following your blog for a month or two now and I’m a big fan. Keep up the great work!

    The points you bring up with respect to providing credit for making valuable data available, and for providing sufficient support so that it is actually useful, are really important. The best approach that I’ve seen in my field (ecology) is the creation of actual Data Papers by one of the major society journals. As actual papers, these datasets undergo review, both to determine whether the data itself is worthy of publication and to evaluate the metadata and data structures, making sure that all of the necessary information is provided with the data and that it is easily usable. Use of the data then requires citation of the data paper, so in addition to having the publication of the dataset count as a publication, it also accrues the citations that are increasingly important at more senior career stages. This approach provides both credit and thorough review of all aspects of the database. As such, I think it is a valuable model to build on when thinking about these challenges.

  3. Pingback: Open Questions about Open Source for Open Science | Serendipity
