This afternoon, I’m at the science 2.0 symposium, or “What every scientist needs to know about how the web is changing the way they work”. The symposium has been organised as part of Greg’s Software Carpentry course. There’s about 120 people here, good internet access, and I got here early enough to snag a power outlet. And a Timmie’s just around the corner for a supply of fresh coffee. All set.
1:05pm. Greg’s up, introducing the challenge: for global challenges (e.g. disease control, climate change) we need two things: Courage and Science. Most of the afternoon will be talking about the latter. Six speakers, 40 minutes each, wine and cheese to follow.
1:08pm. Titus Brown, from Michigan State U. Approaching Open Source Science: Tools Approaches. Aims to talk about two things: how to suck people into your open source project, and automated testing. Why open source? Ideologically: for reproducibility and open communication. Idealistically: can’t change the world by keeping what you do secret. Practical reason: other people might help. Oh and “Closed-source science” is an oxymoron. First, the choice of license probably doesn’t matter, because it’s unlikely anyone will ever download your software. Basics: every open source project should have a place to get the latest release, a mailing list, and an openly accessible version control system. Cute point: a wiki and issue tracker are useful if you have time and manpower, but you don’t, so they’re not.
Then he got into a riff about whether or not to use distributed version control (e.g. git). This is interesting because I’ve heard lots of people complain that tools like git can only be used by ubergeeks (“you have to be Linus Torvalds to use it”). Titus has been using it for 6 months, and says it has completely changed his life. Key advantages: decouples developers from the server, hence ability to work offline (on airplanes), but still do version control commits. Also, frees you from “permission” decisions – anyone can take the code and work on it independently (as long as they keep using the same version control system). But there are downsides – creates ‘effective forks’, which might then lead to code bombs – someone who wants to remerge a fork that has been developed independently for months, and which then affects large parts of the code base.
Open development is different to open source. The key question is do you want to allow others to take the code and do their own things with it, or do you want to keep control of everything (professors like to keep control!). Oh, and you open yourself up to “annoying questions” about design decisions, and frank (insulting) discussion of bugs. But the key idea is that these are the hallmarks of a good science project – a community of scientists thinking and discussing design decisions and looking for potential errors.
So, now for some of the core science issues. Titus has been working on Earthshine – measuring the albedo of the earth by measuring how much radiation from the earth lights up the (dark side of the) moon. He ended up looking through the PVwave source code, trying to figure out what the grad student working on the project was doing. By wading through the code, he discovered the student had been applying the same correction to the data multiple times, to try and get a particular smoothing. But the only people who understood how the code worked were the grad student and Titus. Which means there was no way, in general, to know that the code works. Quite clearly, “code working” should not be judged by whether it does what the PI thinks it should do. In practice the code is almost never right – more likely that the PI has the wrong mental model. Which led to the realization that we don’t teach young scientists how to think about software – including being suspicious of their code. And CS programs don’t really do this well either. And fear of failure doesn’t seem to be enough incentive – there are plenty of examples where software errors have led to scientific results being retracted.
Finally, he finished off with some thoughts about automated testing. E.g. regression testing is probably the most useful thing scientists can do with their code: run the changed code and compare the new results with the old ones. If there are unexpected changes, then you have a problem. Oh, and put assert statements in to check that things that should never occur don’t ever occur. Titus also suggests that code coverage tools can be useful for finding dead code, and continuous integration is handy if you’re building code that will be used on multiple platforms, so an automated process builds the code and tests it on multiple platforms, and reports when something broke. Bottom line: automated testing allows you to ‘lock down’ boring code (code that you understand), and allows you to focus on ‘interesting’ code.
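A minimal sketch of the two practices mentioned above (regression testing against stored results, plus assert statements for things that should never happen), assuming a toy simulation. The functions `simulate` and `matches_baseline` are hypothetical illustrations, not code from Titus’s talk.

```python
import json
import math


def simulate(n_steps, dt=0.1):
    """Toy 'scientific' code: integrate dx/dt = -x with forward Euler."""
    assert n_steps >= 0, "number of steps must be non-negative"
    assert dt > 0, "time step must be positive"
    x = 1.0
    trajectory = [x]
    for _ in range(n_steps):
        x = x - dt * x
        # Sanity check: this decay should never go negative or become NaN.
        assert x >= 0 and not math.isnan(x), "state left the valid range"
        trajectory.append(x)
    return trajectory


def matches_baseline(new, baseline, rel_tol=1e-9):
    """Return True if the new results agree with the stored ones."""
    if len(new) != len(baseline):
        return False
    return all(math.isclose(a, b, rel_tol=rel_tol) for a, b in zip(new, baseline))


# Pretend this JSON string was saved to disk by an earlier, trusted run.
baseline_json = json.dumps(simulate(5))

# Later, after editing the code, rerun and compare against the stored results.
baseline = json.loads(baseline_json)
regression_ok = matches_baseline(simulate(5), baseline)
print("regression test passed:", regression_ok)
```

In a real project the baseline would live in a file under version control, and an unexpected mismatch would be the signal to stop and investigate before trusting any new results.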
Questions: I asked whether he has ever encountered problems with the paranoia among some scientific communities, for example, fear of being scooped, or journals who refuse to accept papers if any part has already appeared on the web. Titus pointed out that he has had a paper rejected without review, because when he mentioned that many people were already using the software, the journal editor felt this meant it was not novel. Luckily, he did manage to publish it elsewhere. Journals have to take the lead by, for example, refusing to publish papers unless the software is open, because it’s not really science otherwise.
1:55pm. Next up Cameron Neylon, “A Web Native Research Record: Applying the Best of the Web to the Lab Notebook”. Cameron’s first slide is a permission to copy, share, blog, etc. the contents of the talk (note to self – I need this slide). So the web is great for mixing, mashups, syndicated feeds, etc. Scientists need to publish, subscribe, syndicate (e.g. updates to handbooks), remix (e.g. taking ideas from different disciplines and pull them together to get new advances). So quite clearly, the web is going to solve all our problems, right?
But our publication mechanism is dead, broken, disconnected. A PDF of a scientific paper is a dead end, when really it should be linked to data, sources, citations, etc. It’s the links between things that matter. Science is a set of loosely coupled chunks of knowledge; they need to be tightly wired to each other so that we understand their context and their links. A paper is too big a piece to be thought of as a typical “chunk of science”. A tweet (example was of MarsPhoenix team announcing they found ice on Mars) is too small, and too disconnected. A blog post seems about right. It includes embedded links (e.g. to detailed information about the procedures and materials used in an experiment). He then shows how his own research group is using blogs as online lab notebooks. Even better, some blog posts are generated automatically by the machines (when dealing with computational steps in the scientific process). Then if you look at the graph of the ‘web of objects’, you can tell certain things about them. E.g. an experiment that failed occupies a certain position in the graph; a set of related experiments appear as a cluster; a procedure that wasn’t properly written up might appear as a disconnected node; etc.
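To make the ‘web of objects’ idea concrete, here is a rough sketch of how clusters and disconnected items could be read off a graph of notebook posts. The post names and links are invented for illustration, and this is not how Cameron’s group actually analyses their notebooks; it just uses a plain connected-components search.

```python
# Hypothetical notebook: each pair is a link between two posts.
links = [
    ("sample-prep-1", "experiment-A"),
    ("experiment-A", "analysis-A"),
    ("sample-prep-1", "experiment-B"),
    ("buffer-recipe", "experiment-C"),
]
posts = {
    "sample-prep-1", "experiment-A", "analysis-A", "experiment-B",
    "buffer-recipe", "experiment-C", "orphan-procedure",
}


def build_graph(posts, links):
    """Build an undirected adjacency map from posts and link pairs."""
    graph = {post: set() for post in posts}
    for a, b in links:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def clusters(graph):
    """Return the connected components of the graph as a list of sets."""
    seen = set()
    result = []
    for start in sorted(graph):
        if start in seen:
            continue
        stack = [start]
        component = set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        result.append(component)
    return result


graph = build_graph(posts, links)
groups = clusters(graph)
# Posts with no links at all: candidates for "not properly written up".
isolated = sorted(post for post, neighbours in graph.items() if not neighbours)
print(len(groups), "clusters; isolated posts:", isolated)
```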
Now, how do we get all this to work? Social tagging (folksonomies) doesn’t work well because of inconsistent use of tags, not just across different people, but over time by the same person. Templates help, and the evolution of templates over time tells you a lot about the underlying ontology of the science (both the scientific process and the materials used). Cameron even points out places where the templates they have developed don’t fit well with established taxonomies of materials developed (over many years) within his field, and that these mismatches reveal problems in the taxonomies themselves, where they have ignored how materials are actually used.
So, now everything becomes a digital object: procedures, analyses, materials, data. What we’re left with is the links between them. So doing science becomes a process of creating new relationships, and what you really want to know about someone’s work is the (semantic) feed of relationships created. The big challenge is the semantic part – how do we start to understand the meaning of the links? Finally, a demonstration of how new tools like Google Wave can support this idea – e.g. a Wave plugin that automates the creation of citations within a shared document (Cameron has a compelling screen capture of someone using it).
Finally, how do we measure research impact? Eventually, something like pagerank. Which means scientists have to be wired into the network, which means everything we create has to be open and available. Cameron says he’s doing a lot less of the traditional “write papers and publish” and much more of this new “create open online links”. But how do we persuade research funding bodies to change their culture to acknowledge and encourage these kinds of contribution? Well, 70% of all research is basically unfunded – done on a shoestring.
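For readers who haven’t seen it, here is a minimal sketch of a pagerank-style score computed by power iteration over a small link graph. The research objects and their links are made up, and the damping factor of 0.85 is just the commonly quoted default; nothing here was shown in the talk.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Compute PageRank-style scores; links maps each node to the nodes it links to."""
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node gets a small baseline share ("random jump").
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                # Pass this node's rank on to the things it links to.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical research objects and the links between them.
links = {
    "dataset": [],
    "notebook-post": ["dataset"],
    "analysis": ["dataset", "notebook-post"],
    "paper": ["dataset", "analysis"],
}
ranks = pagerank(links)
for name, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{name}: {score:.3f}")
```

The point of the sketch is only that objects nobody can link to (because they are not openly available) cannot accumulate any score at all.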
2:40pm. Slight technical hitch getting the next speaker (Michael) set up, so a switch of speakers: Victoria Stodden, How Computational Science is Changing the Scientific Method. Victoria is particularly interested in reproducibility in scientific research, and how it can be facilitated. Massive computation changes what we can do in science, e.g. data mining for subtle patterns in vast databases, and large scale simulations of complex processes. Examples: climate modeling, high energy physics, astrophysics. Even mathematical proof is affected – e.g. use of a simulation to ‘prove’ a mathematical result. But is this really a valid proof? Is it even mathematics?
So, effectively this might be a third branch of science. (1) deductive method for theory development – e.g. mathematics and logic (2) inductive/empirical – the machinery of hypothesis testing. And now (3) large scale extrapolation and prediction. But there’s lots of contention about this third branch. E.g. Anderson “The End of Theory“, Hillis rebuttal – we look for patterns first, and then create hypotheses, just as we always have. Weinstein points out that simulation underlies the other branches – tools to build intuitions, and tools to test hypotheses. Scientific approach is primarily about the ubiquity of error, so that the main effort is to track down and understand sources of error.
Computational techniques are widely used now (e.g. in JASA, over the last decade, the share of papers using them has grown to more than half), but very few authors make their code open, and very little validation is going on, which means that there is increasingly a credibility crisis. Scientists make their papers available, but not their complete body of research. Changes are coming (e.g. Madagascar, Sweave,…), along with the push towards reproducibility pioneered by Jon Claerbout.
Victoria did a study of one particular subfield: Machine Learning. Surveyed academics attending one of the top conferences in the field (NIPS). Why did they not share? Top reason: time it takes to document and clean up the code and data. Then, not receiving attribution, possibility of patents, legal barriers such as copyright, and potential loss of future publications. Motivations to share are primarily communitarian (for the good of science/community), while most of the barriers are personal (worries about attribution, tenure and promotion, etc).
Idea: take the Creative Commons license model, and create a reproducible research standard. All media components get released under a CC BY license, code gets released under some form of BSD license. But what about data? Raw facts alone are not generally copyrightable, so this gets a little complicated. But the expression of facts in a particular way is.
So, what are the prospects for reproducibility? Simple case: small scripts and open data. But harder case: inscrutable code and organic programming. Really hard case: massive computing platforms and streaming data. But it’s not clear that readability of the code is essential, e.g. Wolfram Alpha – instead of making the code readable (because in practice nobody will read it), make it available for anyone to run it in any way they like.
Finally, there’s a downside to openness, in particular, a worry that science can be contaminated because anyone can come along, without the appropriate expertise, and create unvalidated science and results, and they will get cited and used.
3:40pm. David Rich. Using “Desktop” Languages for Big Problems. David starts off with an analogy of different types of drill – e.g. a hand drill – trivially easy to use, hard to hurt yourself, but slow; up to big industrial drills. He then compares these to different programming languages / frameworks. One particular class of tools, cordless electric drills, are interesting because they provide a balance between power and usability/utility. So what languages and tools do scientific programmers need? David presented the results of a survey of their userbase, to find out what tools they need. Much of the talk was about the need/potential for parallelization via GPUs. David’s company has a tool called Star-P which allows users of Matlab and NumPy to transform their code for parallel architectures.
4:10pm. Michael Nielsen. Doing Science in the Open: How Online Tools are Changing Scientific Discovery. Case study: Terry Tao‘s use of blogs to support community approaches to mathematics. In particular, he deconstructs one particular post: Why global regularity for Navier-Stokes is hard, which sets out a particular problem, identifies the approaches that have been used, and has attracted a large number of comments from some of the top mathematicians in the field, all of which helps to make progress on the problem. There are similar examples from other mathematicians, such as the polymath project, and a brand new blog for this: polymathprojects.org.
But these examples couldn’t be published in the conventional sense. They are more like the scaling up of a conversation that might occur in a workshop or conference, but allowing the scientific community to continue the conversation over a long period of time (e.g. several years in some cases), and across geographical distance.
These examples are pushing the boundaries of blog and wiki software. But blogs are just the beginning. Blogs and open notebooks enable filtered access to new information sources and new conversations. Essentially, they are restructuring expert attention – people focus on different things and in a different way than before. And this is important because expert attention is the critical limiting factor in scientific research.
So, here’s a radically different idea. Markets are a good way to efficiently allocate scarce resources. So can we create online markets in expert attention? For example Innocentive. One particular example: need in India to get hold of solar powered wireless routers to support a social project (ASSET India) helping women in India escape from exploitation and abuse. So this was set up as a challenge on Innocentive. A 31-yr old software engineer from Texas designed a solution, and it’s now being prototyped.
But, after all, isn’t all this a distraction? Shouldn’t you be writing papers and grant proposals rather than blogging and contributing to Wikipedia? When Galileo discovered the rings of Saturn (actually, that Saturn looked like three blobs), he sent an anagram to Kepler, which then allowed him to claim credit. The modern scientific publishing infrastructure was not available to him, and he couldn’t conceive of the idea of open sharing of discoveries. The point being that these technologies (blogs etc) are too new for us to understand their full impact and use, but we can see ways in which they are already changing the way science is done.
Some very interesting questions followed about attribution of contribution, especially for the massive collaboration examples such as polymath. In answer, Michael pointed to the fact that the record of the collaboration is open and available for inspection, and that letters of recommendation from senior people matter a lot, and junior people who contributed in a strong way to the collaboration will get great letters.
[An aside: I'm now trying to follow this on Friendfeed as well as liveblogging. It's going to be hard to do both at once]
4:55pm. Last but not least, Jon Udell. Collaborative Curation of Public Events. So, Jon claims that he can’t talk about science itself, because he’s not qualified, but will talk about other consequences of the technologies that we’re talking about. For example, in the discussions we’ve been having with the City of Toronto on its open data initiative, there’s a meme that governments sit on large bodies of data that people would like to get hold of. But in fact, citizens themselves are owners and creators of data, and that’s a more interesting thing to focus on than governments pushing data out to us. For example, posters advertising local community events on lampposts in neighbourhoods around the city. Jon makes the point that this form of community advertising is outperforming the web, which is shocking!
Key idea: syndication hubs. For example, an experiment to collate events in Keene, NH, in the summer of 2009. Takes in datafeeds from various events websites, calendar entries etc. Then aggregates them, and provides feeds out to various other websites. But not many people understand what this is yet – it’s not a destination, but a broker. Or another way of understanding it is as ‘curation’ – the site becomes a curator looking after information about public events, but in a way that distributes responsibility for curation to the individual sources of information, rather than say a person looking after an events diary.
Key principles: syndication is a two way process (need to both subscribe to things and publish your feeds). But tagging and data formatting conventions become critical. The available services form an ecosystem, and they co-evolve, and we’re now starting to understand the eco-system around RSS feeds – sites that are publishers, subscribers, and aggregators. Similar eco-system growing up around iCalendar feeds, but currently missing aggregators. iCalendar is interesting because the standard is 10 years old, but it’s only recently become possible to publish feeds from many tools. And people are still using RSS feeds to do this, when they are the wrong tool – an RSS feed doesn’t expose the data (calendar information) in a usable way.
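A rough sketch of what a tiny calendar aggregator might look like, to show why structured iCalendar data is easier to merge than free-text RSS entries. The two feeds are invented, and the parser is deliberately simplistic: it assumes unfolded lines with plain `DTSTART` and `SUMMARY` properties, and ignores time zones, recurrence rules and everything else in the real standard. It is not the system Jon demonstrated.

```python
from datetime import datetime

# Two hypothetical feeds, as they might be published by separate sources.
FEED_LIBRARY = """BEGIN:VCALENDAR
VERSION:2.0
BEGIN:VEVENT
DTSTART:20090714T190000
SUMMARY:Author reading
END:VEVENT
END:VCALENDAR
"""

FEED_DANCE = """BEGIN:VCALENDAR
VERSION:2.0
BEGIN:VEVENT
DTSTART:20090710T200000
SUMMARY:Square dance
END:VEVENT
BEGIN:VEVENT
DTSTART:20090717T200000
SUMMARY:Square dance
END:VEVENT
END:VCALENDAR
"""


def parse_events(ical_text, source):
    """Very simplified parser: handles only unfolded DTSTART and SUMMARY lines."""
    events = []
    current = None
    for line in ical_text.splitlines():
        line = line.strip()
        if line == "BEGIN:VEVENT":
            current = {"source": source}
        elif line == "END:VEVENT" and current is not None:
            events.append(current)
            current = None
        elif current is not None and ":" in line:
            key, value = line.split(":", 1)
            if key == "DTSTART":
                current["start"] = datetime.strptime(value, "%Y%m%dT%H%M%S")
            elif key == "SUMMARY":
                current["summary"] = value
    return events


def aggregate(feeds):
    """Merge events from all feeds into one list sorted by start time."""
    merged = []
    for source, text in feeds.items():
        merged.extend(parse_events(text, source))
    return sorted(merged, key=lambda event: event["start"])


calendar = aggregate({"library": FEED_LIBRARY, "dance-club": FEED_DANCE})
for event in calendar:
    print(event["start"].isoformat(), event["summary"], "(from " + event["source"] + ")")
```

Because each event carries a machine-readable start time, the hub can sort and merge entries without guessing at dates buried in prose, and each source stays responsible for its own data.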
So how do we manage the metadata for these feeds, and how do we handle the issue of trust (i.e. how do you know which feeds to trust for accuracy, authority, etc)? Jon talks a little about uses of tools like Delicious to bookmark feeds with appropriate metadata, and other tools for calendar aggregation. And the idea of guerilla feed creation – how to find implicit information about recurring events and make it explicit. Often the information is hard to scrape automatically – e.g. information about a regular square dance that is embedded in the image of a cartoon. But maybe this task could be farmed out to a service like Mechanical Turk.
And these are great examples of computational thinking. Indirection – instead of passing me your information, pass me a pointer to it, so that I can respect your authority over it. Abstraction – we can use any URL as a rendezvous for social information management, and can even invent imaginary ones just for this purpose.
Updates: The twitter tag is tosci20. Andrew Louis also blogged (part of) it, and has some great photos; Joey DeVilla has detailed blog posts on several of the speakers; Titus reflects on his own participation; and Jon Udell has a more detailed write up of the polymath project. Oh, and Greg has now posted the speakers’ slides.