I’ve finally managed to post the results of our workshop on Software Research and Climate Change, held at Onward!/OOPSLA last month. We did lots of brainstorming, and attempted to cluster the ideas, as you can see in the photos of our sticky notes.

After the workshop, I attempted to boil down the ideas even further, and came up with three clusters of research:

  1. Green IT (i.e. optimizing power consumption of software and all things controlled by software – also known as “make sure ICT is no longer part of the problem”). Examples of research in this space include:
    • Power aware computing (better management of power in all devices from mobile to massive installations).
    • Green controllers (smart software to optimize and balance power consumption in everything that consumes power).
    • Sustainability as a first class requirement in software system design.
  2. Computer-Supported Collaborative Science (also known as eScience – i.e. software to support and accelerate inter-disciplinary science in climatology and related disciplines). Examples of research in this space include:
    • Software engineering tools/techniques for climate modellers
    • Data management for data-intensive science
    • Open Notebook science (electronic notebooks)
    • Social network tools for knowledge finding and expertise mapping
    • Smart ontologies
  3. Software to improve global collective decision making (which includes everything from tools to improve public understanding of science through to decision support at multiple levels: individual, community, government, inter-governmental,…). Examples of research in this space include:
    • Simulations, games, educational software to support public understanding of the science (usable climate science)
    • Massive open collaborative decision support
    • Carbon accounting for corporate decision making
    • Systems analysis of sustainability in human activity systems (requires multi-level systems thinking)
    • Better understanding of the processes of social epistemology

My personal opinion is that (1) is getting to be a crowded field, which is great, but will only yield up to about 15% of the 100% reduction in carbon emissions we’re aiming for. (2) has been mapped out as part of several initiatives in the UK and US on eScience, but there’s still a huge amount to be done. (3) is pretty much a green field (no pun intended) at the moment. It’s this third area that fascinates me the most.

I’ve been invited to give a talk to the Toronto HCI chapter as part of World Usability Day, for which the theme is designing for a sustainable world. Here’s what I have come up with as an abstract for my talk, to be entitled “Usable Climate Science”:

Sustainability is usually defined as “the ability to meet present needs without compromising the ability of future generations to meet their needs”. The current interest in sustainability derives partly from a general concern about environmental degradation and resource depletion, and partly from an awareness of the threat of climate change. But to many people, climate change is only a vague problem, and to some people (e.g. about half the US population) it isn’t regarded as a problem at all. There is a widespread lack of understanding of the core scientific results of climate science, and the methodology by which those results are obtained, which in turn means that the public discourse is dominated by ignorance, polarization, and political point scoring. In this environment, lobbyists can propagate misinformation on behalf of various vested interests, and people decide what to believe based on their political worldviews, rather than what the scientific evidence actually says. The chances of getting sound, effective policy in such an environment are slim. In this talk, I will argue that we cannot properly address the challenge of climate change unless this situation is fixed. Furthermore, I’ll argue that the core problem is a usability challenge: how do we make the science itself accessible to the general public? The numerical simulations of climate developed by climatologists are usable only by people with PhDs in climatology. The infographics used to explain climate change in the popular press tend to be high design and low information. What is missing is a concerted attempt to get the core science across to a general audience using software tools and visualizations in which usability is the primary design principle. In short, how do we make climate science usable? Unless we do this, journalists, politicians and the public will be unable to judge whether proposed policy solutions are viable, and unable to distinguish sound science from misinformation. I will illustrate the talk with some suggestions of how we might meet this goal.

Update: talk details have now been announced. It’s on Nov 12 at 7:15pm, in BA1220.

Here’s the intro to a draft proposal I’m working on to set up a new research initiative in climate change informatics at U of T (see also: possible participants and ideas for a research agenda). Comments welcome.

Climate change is likely to be the defining issue of the 21st Century. The impacts of climate change include a dramatic reduction of food production and water supplies, more extreme weather events, the spread of disease, sea level rise, ocean acidification, and mass extinctions. We are faced with the twin challenges of mitigation (avoiding the worst climate change effects by rapidly transitioning the world to a low-carbon economy) and adaptation (re-engineering the infrastructure of modern society so that we can survive and flourish on a hotter planet).

These challenges are global in nature, and pervade all aspects of society. To address them, researchers, engineers, policymakers, and educators from many different disciplines need to come to the table and ask what they can contribute. There are both short-term challenges (such as how to deploy, as rapidly as possible, existing technology to produce renewable energy; how to design government policies and international treaties to bring greenhouse gas emissions under control) and long-term challenges (such as how to complete the transition to a global carbon-neutral society by the latter half of this century).

For Ontario, climate change is both a challenge and an opportunity. The challenge comes in understanding the impacts and adapting to rapid changes in public health, agriculture, management of water and energy resources, transportation, urban planning, and so on. The opportunity is the creation of green jobs through the rapid development of new alternative energy sources and energy conservation measures. Indeed, it is the opportunity to become a world leader in low-carbon technologies.

While many of these challenges and opportunities are already well understood, the role of digital media as both a critical enabling technology and a growing service industry is less well understood. Digital media is critical to effective decision making on climate change issues at all levels. For governmental planning, simulations and visualizations are essential tools for designing and communicating policy choices. For corporations large and small, effective data gathering and business intelligence tools are needed to enable a transition to low-carbon energy solutions. For communities, social networking and web 2.0 technologies are the key tools in bringing people together and enabling coordinated action, and tracking the effectiveness of that action.

Research on climate change has generally clustered around a number of research questions, each studied in isolation. In the physical sciences, the focus is on the physical processes in the atmosphere and biosphere that lead to climate change. In geography and environmental sciences, there is a strong focus on impacts and adaptation. In economics there is a focus on the trade-offs around various policy instruments. In various fields of engineering there is a push for development and deployment of new low-carbon technologies.

Yet climate change is a systemic problem, and effective action requires an inter-disciplinary approach and a clear understanding of how these various spheres of activity interact. We need the appropriate digital infrastructure for these diverse disciplines to share data and results. We need to understand better how social and psychological processes (human behaviour, peer pressure, the media, etc) interact with political processes (policymaking, leadership, voting patterns, etc), and how both are affected by our level of understanding of the physical processes of climate change. And we need to understand how information about all these processes can be factored into effective decision-making.

To address this challenge, we propose the creation of a major new initiative on Climate Change Informatics at the University of Toronto. This will build on existing work across the university on digital media and climate change, and act as a focus for inter-disciplinary research. We will investigate the use of digital media to bridge the gaps between scientific disciplines, policymakers, the media, and public opinion.

I’ve been tasked with identifying people and initiatives across campus that are involved in Digital Media and Climate Change/Environment. It’s part of a push by the University for greater funding for digital media research. And as everyone seems to interpret the term digital media differently, I’m going to give it the broadest possible interpretation: if it involves doing things with computers (either as a primary research tool or as an object of study), it counts as digital media. Here’s my list of faculty across the University who are doing relevant research. Feel free to suggest more people, or to rearrange my categories…

Understanding Climate Change through Earth Systems Modeling

Impacts of Climate Change and Adaptation

Earth Systems Management (as in: how we manage forests, water supplies, land use, etc)

Sustainable design and energy management (e.g. architectural design, urban planning, etc)

Sustainable Transportation Systems

Geographical Information Systems (GIS) and Environmental Informatics

Policy and Decision Making

For sociologists, a strong call to action in the report of an NSF-sponsored workshop on Sociological Perspectives on Global Climate Change. Like the APA report I wrote about earlier, it covers the key research challenges for the field, and addresses the barriers that might prevent researchers participating in such research. Among the recommendations are better data collection on organisational and community behaviour relevant to climate change, and better inter-disciplinary links:

“…social scientists are seldom consulted except as an afterthought in natural science and engineering research projects […and…] social scientists tend not to seek out collaborations with natural scientists and engineers and often are uninformed about major research programs on climate change. The result is that the research of each community does not tend to be informed by the insights and resources available from the others. This is true not only between the social sciences and the natural sciences, but among the social sciences themselves. For instance, sociological research projects seldom incorporate spatial processes, behavioral analyses, or economic models.”

For a short summary, read the article “The Wisdom of Crowds” in this week’s Nature Reports, and indeed, the editorial that goes with it.

This afternoon, I’m at the Science 2.0 symposium, or “What every scientist needs to know about how the web is changing the way they work”. The symposium has been organised as part of Greg’s Software Carpentry course. There’s about 120 people here, good internet access, and I got here early enough to snag a power outlet. And a Timmie’s just around the corner for a supply of fresh coffee. All set.

1:05pm. Greg’s up, introducing the challenge: for global challenges (e.g. disease control, climate change) we need two things: Courage and Science. Most of the afternoon will be talking about the latter. Six speakers, 40 minutes each, wine and cheese to follow.

1:08pm. Titus Brown, from Michigan State U. Approaching Open Source Science: Tools Approaches. Aims to talk about two things: how to suck people into your open source project, and automated testing. Why open source? Ideologically: for reproducibility and open communication. Idealistically: can’t change the world by keeping what you do secret. Practical reason: other people might help. Oh and “Closed-source science” is an oxymoron. First, the choice of license probably doesn’t matter, because it’s unlikely anyone will ever download your software. Basics: every open source project should have a place to get the latest release, a mailing list, and an openly accessible version control system. Cute point: a wiki and issue tracker are useful if you have time and manpower, but you don’t, so they’re not.

Then he got into a riff about whether or not to use distributed version control (e.g. git). This is interesting because I’ve heard lots of people complain that tools like git can only be used by ubergeeks (“you have to be Linus Torvalds to use it”). Titus has been using it for 6 months, and says it has completely changed his life. Key advantages: decouples developers from the server, hence the ability to work offline (on airplanes), but still do version control commits. Also, frees you from “permission” decisions – anyone can take the code and work on it independently (as long as they keep using the same version control system). But there are downsides – it creates ‘effective forks’, which might then lead to code bombs – someone who wants to remerge a fork that has been developed independently for months, and which then affects large parts of the code base.

Open development is different to open source. The key question is do you want to allow others to take the code and do their own things with it, or do you want to keep control of everything (professors like to keep control!). Oh, and you open yourself up to “annoying questions” about design decisions, and frank (insulting) discussion of bugs. But the key idea is that these are the hallmarks of a good science project – a community of scientists thinking and discussing design decisions and looking for potential errors.

So, now for some of the core science issues. Titus has been working on Earthshine – measuring the albedo of the earth by measuring how much radiation from the earth lights up the (dark side of the) moon. He ended up looking through the PVwave source code, trying to figure out what the grad student working on the project was doing. By wading through the code, he discovered the student had been applying the same correction to the data multiple times, to try and get a particular smoothing. But the only people who understood how the code worked were the grad student and Titus. Which means there was no way, in general, to know that the code works. Quite clearly, “code working” should not be judged by whether it does what the PI thinks it should do. In practice the code is almost never right – more likely that the PI has the wrong mental model. Which led to the realization that we don’t teach young scientists how to think about software – including being suspicious of their code. And CS programs don’t really do this well either. And fear of failure doesn’t seem to be enough incentive – there are plenty of examples where software errors have led to scientific results being retracted.

Finally, he finished off with some thoughts about automated testing. E.g. regression testing is probably the most useful thing scientists can do with their code: run the changed code and compare the new results with the old ones. If there are unexpected changes, then you have a problem. Oh, and put assert statements in to check that things that should never occur don’t ever occur. Titus also suggests that code coverage tools can be useful for finding dead code, and continuous integration is handy if you’re building code that will be used on multiple platforms, so an automated process builds the code and tests it on multiple platforms, and reports when something broke. Bottom line: automated testing allows you to ‘lock down’ boring code (code that you understand), and allows you to focus on ‘interesting’ code.
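
To make this concrete, here is a minimal sketch (my own illustration, not Titus’s code) of a regression test plus a guard assertion for a small piece of scientific code. The model function and the reference file name are invented for the example; the reference file is assumed to hold output saved from an earlier trusted run.

```python
# Regression-test sketch (illustrative only; run with pytest).
import json
import math

def run_model(initial, rate, timesteps):
    # Stand-in for the real computation; guard against inputs that should never occur.
    assert timesteps > 0, "timesteps must be positive"
    return [initial * (1.0 + rate) ** t for t in range(timesteps)]

def test_regression():
    new_results = run_model(initial=1.0, rate=0.05, timesteps=10)
    # 'reference_output.json' is assumed to contain results saved from a trusted earlier run.
    with open("reference_output.json") as f:
        old_results = json.load(f)
    assert len(new_results) == len(old_results)
    for new, old in zip(new_results, old_results):
        # Exact equality is too strict for floating point; compare within a tolerance.
        assert math.isclose(new, old, rel_tol=1e-9), f"unexpected change: {new} vs {old}"
```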

Questions: I asked whether he has ever encountered problems with the paranoia among some scientific communities, for example, fear of being scooped, or journals who refuse to accept papers if any part has already appeared on the web. Titus pointed out that he has had a paper rejected without review, because when he mentioned that many people were already using the software, the journal editor felt this meant it was not novel. Luckily, he did manage to publish it elsewhere. Journals have to take the lead by, for example, refusing to publish papers unless the software is open, because it’s not really science otherwise.

1:55pm. Next up Cameron Neylon, “A Web Native Research Record: Applying the Best of the Web to the Lab Notebook”. Cameron’s first slide is a permission to copy, share, blog, etc. the contents of the talk (note to self – I need this slide). So the web is great for mixing, mashups, syndicated feeds, etc. Scientists need to publish, subscribe, syndicate (e.g. updates to handbooks), remix (e.g. taking ideas from different disciplines and pull them together to get new advances). So quite clearly, the web is going to solve all our problems, right?

But our publication mechanism is dead, broken, disconnected. A PDF of a scientific paper is a dead end, when really it should be linked to data, sources, citations, etc. It’s the links between things that matter. Science is a set of loosely coupled chunks of knowledge; they need to be tightly wired to each other so that we understand their context, we understand their links. A paper is too big a piece to be thought of as a typical “chunk of science”. A tweet (the example was of the MarsPhoenix team announcing they found ice on Mars) is too small, and too disconnected. A blog post seems about right. It includes embedded links (e.g. to detailed information about the procedures and materials used in an experiment). He then shows how his own research group is using blogs as online lab notebooks. Even better, some blog posts are generated automatically by the machines (when dealing with computational steps in the scientific process). Then if you look at the graph of the ‘web of objects’, you can tell certain things about them. E.g. an experiment that failed occupies a certain position in the graph; a set of related experiments appear as a cluster; a procedure that wasn’t properly written up might appear as a disconnected note; etc.

Now, how do we get all this to work? Social tagging (folksonomies) doesn’t work well because of inconsistent use of tagging, not just across different people, but over time by the same person. Templates help, and the evolution of templates over time tells you a lot about the underlying ontology of the science (both the scientific process and the materials used). Cameron even points out places where the templates they have developed don’t fit well with established taxonomies of materials developed (over many years) within his field, and that these mismatches reveal problems in the taxonomies themselves, where they have ignored how materials are actually used.

So, now everything becomes a digital object: procedures, analyses, materials, data. What we’re left with is the links between them. So doing science becomes a process of creating new relationships, and what you really want to know about someone’s work is the (semantic) feed of relationships created. The big challenge is the semantic part – how do we start to understand the meaning of the links. Finally, a demonstration of how new tools like Google Wave can support this idea – e.g. a Wave plugin that automates the creation of citations within a shared document (Cameron has a compelling screen capture of someone using it).

Finally, how do we measure research impact? Eventually, something like pagerank. Which means scientists have to be wired into the network, which means everything we create has to be open and available. Cameron says he’s doing a lot less of the traditional “write papers and publish”, and much more of this new “create open online links”. But how do we persuade research funding bodies to change their culture to acknowledge and encourage these kinds of contribution? Well, 70% of all research is basically unfunded – done on a shoestring.

2:40pm. A slight technical hitch getting the next speaker (Michael) set up, so a switch of speakers: Victoria Stodden, How Computational Science is Changing the Scientific Method. Victoria is particularly interested in reproducibility in scientific research, and how it can be facilitated. Massive computation changes what we can do in science, e.g. data mining for subtle patterns in vast databases, and large scale simulations of complex processes. Examples: climate modeling, high energy physics, astrophysics. Even mathematical proof is affected – e.g. use of a simulation to ‘prove’ a mathematical result. But is this really a valid proof? Is it even mathematics?

So, effectively this might be a third branch of science: (1) the deductive method for theory development, e.g. mathematics and logic; (2) the inductive/empirical method, the machinery of hypothesis testing; and now (3) large-scale extrapolation and prediction. But there’s lots of contention about this third branch. E.g. Anderson’s “The End of Theory“, and Hillis’s rebuttal – we look for patterns first, and then create hypotheses, just as we always have. Weinstein points out that simulation underlies the other branches – tools to build intuitions, and tools to test hypotheses. The scientific approach is primarily about the ubiquity of error, so that the main effort is to track down and understand sources of error.

Computational techniques are now widely used (e.g. in JASA, the fraction of papers using them has grown to more than half over the last decade), but very few authors make their code open, and very little validation is going on, which means there is a growing credibility crisis. Scientists make their papers available, but not their complete body of research. Changes are coming (e.g. Madagascar, Sweave, …), building on the push towards reproducibility pioneered by Jon Claerbout.

Victoria did a study of one particular subfield: Machine Learning. Surveyed academics attending one of the top conferences in the field (NIPS). Why did they not share? Top reason: time it takes to document and clean up the code and data. Then, not receiving attribution, possibility of patents, legal barriers such as copyright, and potential loss of future publications. Motivations to share are primarily communitarian (for the good of science/community), while most of the barriers are personal (worries about attribution, tenure and promotion, etc).

Idea: take the creative commons license model, and create a reproducible research standard. All media components get released under a CC BY license, code gets released under some form of BSD license. But what about data? Raw facts alone are not generally copyrightable, so this gets a little complicated. But the expression of facts in a particular way is.

So, what are the prospects for reproducibility? Simple case: small scripts and open data. Harder case: inscrutable code and organic programming. Really hard case: massive computing platforms and streaming data. But it’s not clear that readability of the code is essential, e.g. Wolfram Alpha – instead of making the code readable (because in practice nobody will read it), make it available for anyone to run in any way they like.

Finally, there’s a downside to openness, in particular, a worry that science can be contaminated because anyone can come along, without the appropriate expertise, and create unvalidated science and results, and they will get cited and used.

3:40pm. David Rich. Using “Desktop” Languages for Big Problems. David starts off with an analogy of different types of drill – e.g. a hand drill – trivially easy to use, hard to hurt yourself, but slow; up to big industrial drills. He then compares these to different programming languages / frameworks. One particular class of tools, cordless electric drills, is interesting because it provides a balance between power and usability/utility. So what languages and tools do scientific programmers need? David presented the results of a survey of their userbase, to find out what tools they need. Much of the talk was about the need/potential for parallelization via GPUs. David’s company has a tool called Star-P which allows users of Matlab and NumPy to transform their code for parallel architectures.

4:10pm. Michael Nielsen. Doing Science in the Open: How Online Tools are Changing Scientific Discovery. Case study: Terry Tao‘s use of blogs to support community approaches to mathematics. In particular, he deconstructs one particular post: Why global regularity for Navier-Stokes is hard, which sets out a particular problem, identifies the approaches that have been used, and has attracted a large number of comments from some of the top mathematicians in the field, all of which helps to make progress on the problem. (similar examples from other mathematicians, such as the polymath project), and a brand new blog for this: polymathprojects.org.

But these examples couldn’t be published in the conventional sense. They are more like the scaling up of a conversation that might occur in a workshop or conference, but allowing the scientific community to continue the conversation over a long period of time (e.g. several years in some cases), and across geographical distance.

These examples are pushing the boundaries of blog and wiki software. But blogs are just the beginning. Blogs and open notebooks enable filtered access to new information sources and new conversations. Essentially, they are restructuring expert attention – people focus on different things and in a different way than before. And this is important because expert attention is the critical limiting factor in scientific research.

So, here’s a radically different idea. Markets are a good way to efficiently allocate scarce resources. So can we create online markets in expert attention? For example, Innocentive. One particular example: a need in India for solar-powered wireless routers to support a social project (ASSET India) helping women in India escape from exploitation and abuse. So this was set up as a challenge on Innocentive. A 31-year-old software engineer from Texas designed a solution, and it’s now being prototyped.

But, after all, isn’t all this a distraction? Shouldn’t you be writing papers and grant proposals rather than blogging and contributing to wikipedia? When Galileo discovered the rings of Saturn (actually, that Saturn looked like three blobs), he sent an anagram to Kepler, which then allowed him to claim credit. The modern scientific publishing infrastructure was not available to him, and he couldn’t conceive of the idea of open sharing of discoveries. The point being that these technologies (blogs etc) are too new to understand the full impact and use, but we can see ways in which they are already changing the way science is done.

Some very interesting questions followed about attribution of contribution, especially for the massive collaboration examples such as polymath. In answer, Michael pointed to the fact that the record of the collaboration is open and available for inspection, and that letters of recommendation from senior people matter a lot, and junior people who contributed in a strong way to the collaboration will get great letters.

[An aside: I’m now trying to follow this on Friendfeed as well as liveblogging. It’s going to be hard to do both at once]

4:55pm. Last but not least, Jon Udell. Collaborative Curation of Public Events. So, Jon claims that he can’t talk about science itself, because he’s not qualified, but will talk about other consequences of the technologies that we’re talking about. For example, in the discussions we’ve been having with the City of Toronto on its open data initiative, there’s a meme that governments sit on large bodies of data that people would like to get hold of. But in fact, citizens themselves are owners and creators of data, and that’s a more interesting thing to focus on than governments pushing data out to us. For example, posters advertising local community events on lampposts in neighbourhoods around the city. Jon makes the point that this form of community advertising is outperforming the web, which is shocking!

Key idea: syndication hubs. For example, an experiment to collate events in Keene, NH, in the summer of 2009. Takes in datafeeds from various events websites, calendar entries etc. Then aggregates them, and provides feeds out to various other websites. But not many people understand what this is yet – it’s not a destination, but a broker. Or another way of understanding it is as ‘curation’ – the site becomes a curator looking after information about public events, but in a way that distributes responsibility for curation to the individual sources of information, rather than say a person looking after an events diary.

Key principles: syndication is a two-way process (you need to both subscribe to things and publish your feeds), but tagging and data formatting conventions become critical. The available services form an ecosystem, and they co-evolve; we’re now starting to understand the ecosystem around RSS feeds – sites that are publishers, subscribers, and aggregators. A similar ecosystem is growing up around iCalendar feeds, but it is currently missing aggregators. iCalendar is interesting because the standard is 10 years old, but it’s only recently become possible to publish feeds from many tools. And people are still using RSS feeds to do this, when they are the wrong tool – an RSS feed doesn’t expose the data (calendar information) in a usable way.
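
To illustrate the difference (my sketch, not anything Jon showed): an iCalendar feed carries event fields as structured properties, so an aggregator can pull out dates and locations directly rather than scraping them out of RSS item text. The toy parser below is for illustration only, with an invented sample feed; a real aggregator would use a proper iCalendar library and handle line folding, time zones, and so on.

```python
# Toy parser for the VEVENT blocks in an iCalendar feed (illustration only).
sample_feed = """BEGIN:VCALENDAR
BEGIN:VEVENT
SUMMARY:Contra dance at the town hall
DTSTART:20090815T190000
LOCATION:Keene NH
END:VEVENT
END:VCALENDAR"""

def parse_events(ics_text):
    events, current = [], None
    for line in ics_text.splitlines():
        if line == "BEGIN:VEVENT":
            current = {}
        elif line == "END:VEVENT":
            events.append(current)
            current = None
        elif current is not None and ":" in line:
            key, value = line.split(":", 1)
            current[key] = value   # e.g. {'SUMMARY': ..., 'DTSTART': ..., 'LOCATION': ...}
    return events

print(parse_events(sample_feed))
```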

So how do we manage the metadata for these feeds, and how do we handle the issue of trust (i.e. how do you know which feeds to trust for accuracy, authority, etc)? Jon talks a little about uses of tools like Delicious to bookmark feeds with appropriate metadata, and other tools for calendar aggregation. And the idea of guerrilla feed creation – how to find implicit information about recurring events and make it explicit. Often the information is hard to scrape automatically – e.g. information about a regular square dance that is embedded in the image of a cartoon. But maybe this task could be farmed out to a service like Mechanical Turk.

And these are great examples of computational thinking. Indirection – instead of passing me your information, pass me a pointer to it, so that I can respect your authority over it. Abstraction – we can use any URL as a rendezvous for social information management, and can even invent imaginary ones just for this purpose.

Updates: The twitter tag is tosci20. Andrew Louis also blogged (part of) it, and has some great photos; Joey DeVilla has detailed blog posts on several of the speakers; Titus reflects on his own participation; and Jon Udell has a more detailed write up of the polymath project. Oh, and Greg has now posted the speakers’ slides.

Next Wednesday, we’re organising demos of our students’ summer projects, prior to the Science 2.0 conference. The demos will be in BA1200 (in the Bahen Centre), Wed July 29, 10am-12pm. All welcome!

Here are the demos to be included (running order hasn’t been determined yet – we’ll probably pull names out of hat…):

  • Basie (demo’d by Bill Konrad, Eran Henig and Florian Shkurti)
    Basie is a lightweight, web-based software project forge with an emphasis on inter-component communication. It integrates revision control, issue tracking, mailing lists, wikis, status dashboards, and other tools that developers need to work effectively in teams. Our mission is to make Basie simple enough for undergraduate students to master in ten minutes, but powerful enough to support large, distributed teams.
  • BreadCrumbs (demo’d by Brent Mombourquette).
    When researching, the context in which a relevant piece of information is found is often overlooked. However, the journey is as important as the destination. BreadCrumbs is a Firefox extension designed to capture this journey, and therefore the context, by maintaining a well structured and dynamic graph of an Internet browsing session. It keeps track of both the chronological order in which websites are visited and the link-by-link path. In addition, through providing simple tools to leave notes to yourself, an accurate record of your thought process and reasoning for browsing the documents that you did can be preserved with limited overhead. The resulting session can then be saved and revisited at a later date, with little to no time spent trying to recall the relevance or semantic relations of documents in an unordered bookmark folder, for example. It can also be used to provide information to a colleague, by not just pointing them to a series of web pages, but by providing them a trail to follow and embedded personal notes. BreadCrumbs maintains the context so that you can focus on the content.
  • Feature Diagram Tool (demo’d by Ebenezer Hailemariam)
    We present a software tool to assist software developers work with legacy code. The tool reverse engineers “dependency diagrams” from Java code through which developers can perform refactoring actions. The tool is a plug-in for the Eclipse integrated development environment.
  • MarkUs (demo’d by Severin Gehwolf, Nelle Varoquaux and Mike Conley)
    MarkUs is a Web application that recreates the ease and flexibility of grading assignments with pen on paper. Graders fill in a marking scheme and directly annotate students’ work. MarkUs also provides support for other aspects of assignment delivery and management. For example, it allows students or instructors to form groups for assignment collaboration, and allows students to upload their work for grading. Instructors can also create and manage group or solo assignments, and assign graders to mark and annotate the students’ work quickly and easily.
  • MyeLink: drawing connections between OpenScience lab notes (demo’d by Maria Yancheva)
    A MediaWiki extension which facilitates connections between related wiki pages, notes, and authors. Suitable for OpenScience research communities who maintain a wiki collection of experiment pages online. Provides search functionality on the basis of both structure and content of pages, as well as a user interface allowing the customization of options and displaying an embedded preview of results.
  • TracSNAP – Trac Social Network Analysis Plugin (demo’d by Ainsley Lawson and Sarah Strong)
    TracSNAP is a suite of simple tools to help contributors make use of information about the social aspect of their Trac coding project. It tries to help you to: Find out which other developers you should be talking to, by giving contact suggestions based on commonality of file edits; Recognize files that might be related to your current work, by showing you which files are often committed at the same time as your files; Get a feel for who works on similar pieces of functionality based on discussion in bug and feature tickets, and by edits in common; Visualize your project’s effective social network with graphs of who talks to who; Visualize coupling between files based on how often your colleagues edit them together.
  • VizExpress (demo’d by Samar Sabie)
    Graphs are effective visualizations because they present data quickly and easily. vizExpress is a MediaWiki extension that inserts user-customized tables and graphs in wiki pages without having to deal with complicated wiki syntax. When editing a wiki page, the extension adds a special toolbar icon for opening the vizExpress wizard. You can provide data to the wizard by browsing to a local Excel or CSV file, or by typing (or copying/pasting) data. You can choose from eight graph types and eight graph-coloring schemes, and apply further formatting such as titles, dimensions, limits, and legend position. Once a graph is inserted in a page, you can easily edit it by restarting the wizard or modifying a simple vizExpress tag.

[Update: the session was a great success, and some of the audience have blogged about it already: e.g. Cameron Neylon]

This morning, while doing some research on availability of code for climate models, I came across a set of papers published by the Royal Society in March 2009 reporting on a meeting on the Environmental eScience Revolution. This looks like the best collection of papers I’ve seen yet on the challenges in software engineering for environmental and climate science. These will keep me going for a while, but here are the papers that most interest me:

And I’ll probably have to read the rest as well. Interestingly, I’ve met many of these authors. I’ll have to check whether any followup meetings are planned…

I posted some initial ideas for projects for our summer students a while back. I’m pleased to say that the students have been making great progress in the last few weeks (despite, or perhaps because of, the fact that I haven’t been around much). Here’s what they’ve been up to:

Sarah Strong and Ainsley Lawson have been exploring how to take the ideas on visualizing the social network of a software development team (as embodied in tools such as Tesseract), and apply them as simple extensions to code browsers / version control tools. The aim is to see if we can add some value in the form of better awareness of who is working on related code, but without asking the scientists to adopt entirely new tools. Our initial target users are the climate scientists at the UK Met Office Hadley Centre, who currently use SVN/Trac as their code management environment.

Brent Mombourquette has been working on a Firefox extension that will capture the browsing history as a graph (pages and traversed links), which can then be visualized, saved, annotated, and shared with others. The main idea is to support the way in which scientists search/browse for resources (e.g. published papers on a particular topic), and to allow them to recall their exploration path to remember the context in which they obtained these resources. I should mention the key idea goes all the way back to Vannevar Bush’s memex.
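
As a rough sketch of the underlying data model (my assumption of how such a trail might be represented; the actual extension is browser code, and the class and URLs below are invented for illustration):

```python
# Hypothetical data model for a browsing trail: pages visited, links traversed, and notes.
class BrowsingTrail:
    def __init__(self):
        self.visits = []      # URLs in chronological order
        self.links = set()    # (from_url, to_url) pairs actually followed
        self.notes = {}       # url -> annotation

    def visit(self, url, came_from=None):
        self.visits.append(url)
        if came_from is not None:
            self.links.add((came_from, url))

    def annotate(self, url, text):
        self.notes[url] = text

trail = BrowsingTrail()
trail.visit("http://example.org/papers/smith2008")
trail.visit("http://example.org/data/runs", came_from="http://example.org/papers/smith2008")
trail.annotate("http://example.org/data/runs", "where the data for figure 3 came from")
```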

Maria Yancheva has been exploring the whole idea of electronic lab notebooks. She has been exploring the workflows used by the climate scientists when they configure and run their simulation models, and considering how a more structured form of wiki might help them. She has selected OpenWetWare as a good starting point, and is exploring how to add extensions to MediaWiki to make OWW more suitable for computational science, especially to keep track of model runs.

Samar Sabie has also been looking at MediaWiki extensions, specifically to find a way to add visualizations into wiki pages and blogs as simply as possible. The problem is that currently, adding something as simple as a table of data to a page requires extensive work with the markup language. The long-term aim is to make it possible to insert dynamic visualizations (such as those at ManyEyes), but the starting point is to make it ridiculously simple to insert a data table, link it to a graph, and select appropriate parameters to make the graph look good, with the idea that users can subsequently change the appearance in useful ways (which means cut and paste from Excel spreadsheets won’t be good enough).

Oh, and they’ve all been regularly blogging their progress, so we’re practicing the whole open notebook science thingy.

ICSE proper finished on Friday, but a few brave souls stayed around for more workshops on Saturday. There were two workshops in adjacent rooms that had a big topic overlap: SE Foundations for End-user programming (SEE-UP) and Software Engineering for Computational Science and Engineering (SECSE, pronounced “sexy”). I attended the latter, but chatted to some people attending the former during the breaks – seems we could have merged the two workshops for interesting effect. At SECSE, the first talk was by Greg Wilson, talking about the results of his survey of computational scientists. Some interesting comments about the qualitative data he showed, including the strong confidence exhibited in most of the responses (people who believe they are more effective at using computers than their colleagues). This probably indicates a self-selection bias, but it would be interesting to probe the extent of this. Also, many of them take a “toolbox” perspective – they treat the computer as a set of tools, and associate effectiveness with how well people understand the different tools, and how much they take the time to understand them. Oh and many of them mention that using a Mac makes them more effective. Tee Hee.

Next up: Judith Segal, talking about organisational and process issues – particularly the iterative, incremental approach they take to building software. Only cursory requirements analysis and only cursory testing. The model works because the programmers are the users – they build software for themselves, and because the software is developed (initially) only to solve a specific problem, they can ignore maintainability and usability. Of course, the software often does escape from the lab and gets used by others, which creates a large risk that incorrect, poorly designed software will lead to incorrect results. For the scientific communities Judith has been working with, there’s a cultural issue too – the scientists don’t value software skills, because they’re focussed on scientific skills and understanding. Also, openness is a problem because they are busy competing for publications and funding. But this is clearly not true of all scientific disciplines, as the climate scientists I’m familiar with are very different: for them computational skills are right at the core of their discipline, and they are much more collaborative than competitive.

Roscoe Bartlett, from Sandia Labs, presenting “Barely Sufficient Software Engineering: 10 Practices to Improve Your CSE Software”. It’s a good list: agile (incremental) development, code management, mailing lists, checklists, and making the source code the primary source of documentation. Most important was the idea of “barely sufficient”: mindless application of formal software engineering processes to computational science doesn’t make any sense.

Carlton Crabtree described a study design to investigate the role of agile and plan-driven development processes among scientific software development projects. They are particularly interested in exploring the applicability of the Boehm and Turner model as an analytical tool. They’re also planning to use grounded theory to explore the scientists’ own perspectives, although I don’t quite get how they will reconcile the constructivist stance of grounded theory (it’s intended as a way of exploring the participants’ own perspectives) with the use of a pre-existing theoretical framework, such as the Boehm and Turner model.

Jeff Overbey, on refactoring Fortran. First, he started with a few thoughts on the history of Fortran (the language that everyone keeps thinking will die out, but never does. Some reference to zombies in here…). Jeff pointed out that languages only ever accumulate features (because removing features breaks backwards compatibility), so they just get more complex and harder to use with each update to the language standard. So, he’s looking at whether you can remove old language features using refactoring tools. This is especially useful for the older language features that encourage bad software engineering practices. Jeff then demo’d his tool. It’s neat, but is currently only available as an Eclipse plugin. If there was an emacs version, I could get lots of climate scientists to use this. [note: In the discussion, Greg recommended the book Working effectively with legacy code].

Next up: Roscoe again, this time on integration strategies. The software integration issues he describes are very familiar to me, and he outlined an “almost” continuous integration process, which makes a lot of sense. However, some of the things he describes as challenges don’t seem to be problems in the environment I’m familiar with (the climate scientists at the Hadley Centre). I need to follow up on this.

Last talk before the break: Wen Yu, talking about the use of program families for scientific computation, including a specific application for finite element method computations.

After an infusion of coffee, Ritu Arora, talking about the application of generative programming for scientific applications. She used a checkpointing example as a proof-of-concept, and created a domain specific language for describing checkpointing needs. Checkpointing is interesting, because it tends to be a cross-cutting concern; generating code for this and automatically weaving it into the code is likely to be a significant benefit. Initial results are good: the automatically generated code had similar performance profiles to hand-written checkpointing code.
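
To see why checkpointing is cross-cutting, here is a minimal sketch (not Ritu’s DSL or generated code, just my illustration): the save/restore logic is wrapped around a long-running step rather than scattered through it, which is roughly the kind of thing a generative approach can weave in automatically. The function names and checkpoint file are invented for the example.

```python
import os
import pickle

def checkpointed(path):
    """Wrap a long-running step so its result is saved, and reused on restart."""
    def wrap(step_fn):
        def run(*args, **kwargs):
            if os.path.exists(path):              # resume: reuse the saved result
                with open(path, "rb") as f:
                    return pickle.load(f)
            result = step_fn(*args, **kwargs)     # do the real work
            with open(path, "wb") as f:           # checkpoint for the next restart
                pickle.dump(result, f)
            return result
        return run
    return wrap

@checkpointed("step1.ckpt")
def expensive_step(state):
    # Stand-in for an expensive computation.
    return {"iteration": state["iteration"] + 1000}

print(expensive_step({"iteration": 0}))
```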

Next: Daniel Hook on testing for code trustworthiness. He started with some nice definitions and diagrams that distinguish some of the key terminology e.g. faults (mistakes in the code) versus errors (outcomes that affect the results). Here’s a great story: he walked into a glass storefront window the other day, thinking it was a door. The fault was mistaking a window for a door, and the error was about three feet. Two key problems: the oracle problem (we often have only approximate or limited oracles for what answers we should get) and the tolerance problem (there’s no objective way to say that the results are close enough to the expected results so that we can say they are correct). Standard SE techniques often don’t apply. For example, the use of mutation testing to check the quality of a test set doesn’t work on scientific code because of the tolerance problem – the mutant might be closer to the expected result than the unmutated code. So, he’s exploring a variant and it’s looking promising. The project is called matmute.
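
A tiny numerical example of the tolerance problem (mine, not Daniel’s): with an approximate oracle, and a tolerance wide enough to accept the legitimate approximation, a mutant can land even closer to the oracle than the original code, so it survives mutation testing. The functions and tolerance below are invented for illustration.

```python
import math

def original(x):
    return 1 + x + x**2 / 2          # truncated series for exp(x)

def mutant(x):
    return 1 + x + 0.6 * x**2        # a mutation: coefficient 0.5 changed to 0.6

x, tol = 1.0, 0.25
oracle = math.exp(x)                  # the "true" answer, only approximately known in practice

print(abs(original(x) - oracle) < tol)  # True: original accepted (error ~0.22)
print(abs(mutant(x) - oracle) < tol)    # True: mutant also accepted (error ~0.12), so it isn't killed
```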

David Woollard, from JPL, talking about inserting architectural constraints into legacy (scientific) code. David has been doing some interesting work with assessing the applicability of workflow tools to computational science.

Parmit Chilana from U Washington. She’s working mainly with bioinformatics researchers, comparing the work practices of practitioners with researchers. The biologists understand the scientific relevance, but not the technical implementation; the computer scientists understand the tools and algorithms, but not the biological relevance. She’s clearly demonstrated the need for domain expertise during the design process, and explored several different ways to bring both domain expertise and usability expertise together (especially when the two types of expert are hard to get because they are in great demand).

After lunch, the last talk before we break out for discussion: Val Maxville, preparing scientists for scalable software development. Val gave a great overview of the challenges for software development at iVEC. AuScope looks interesting – an integration of geosciences data across Australia. For each of the different projects, Val assessed how much they have taken practices from the SWEBOK – how much have they applied them, and how much do they value them. And she finished with some thoughts on the challenges for software engineering education for this community, including balancing between generic and niche content, and the balance between ‘on demand’ versus a more planned skills development process.

And because this is a real workshop, we spent the rest of the afternoon in breakout groups having fascinating discussions. This was the best part of the workshop, but of course required me to put away the blogging tools and get involved (so I don’t have any notes…!). I’ll have to keep everyone in suspense.

Summer projects: I posted yesterday on social network tools for computational scientists. Greg has posted a whole list of additional suggestions.

Here, I will elaborate another of these ideas: the electronic lab notebook. For computational scientists, wiki pages are an obvious substitute for traditional lab notebooks, because each description of an experiment can then be linked directly with the corresponding datasets, configuration files, visualizations of results, scientific papers, related experiments, etc. (In the most radical version, Open Notebook Science, the lab notebook is completely open for anyone to see. But the toolset would be the same whether it was open to anyone, or just shared with select colleagues.)

In my study of the software practices at the UK Met Office last summer, I noticed that some of the scientists carefully document each experiment via a new wiki page, but the process is laborious in a standard wiki, involving a lot of cut-and-paste to create a suitable page structure. For this reason, many scientists don’t keep good records of their experiments. An obvious improvement would be to generate a basic wiki page automatically each time a model run is configured, and populate it with information about the run, and links to the relevant data files. The scientists could then add further commentary via a standard wiki editor.
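
A minimal sketch of what such a page generator might look like (the field names, file names, and wiki markup below are invented for illustration; a real version would read the model’s actual configuration files and write to the wiki via its API):

```python
from datetime import date

def run_to_wiki_page(run_id, config, output_files):
    """Build a starter lab-notebook wiki page (MediaWiki markup) for one model run."""
    lines = [f"== Model run {run_id} ({date.today().isoformat()}) ==", "", "=== Configuration ==="]
    for key, value in sorted(config.items()):
        lines.append(f"* '''{key}''': {value}")
    lines += ["", "=== Output files ==="]
    lines += [f"* [[Media:{name}]]" for name in output_files]
    lines += ["", "=== Commentary ===", "''(added by the scientist after the run)''"]
    return "\n".join(lines)

# Hypothetical usage: the run id, settings, and output file are made up for the example.
print(run_to_wiki_page("run_0421",
                       {"resolution": "N96", "timestep": "20 min", "scenario": "A1B"},
                       ["run_0421_surface_temp.nc"]))
```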

Of course, an even better solution is to capture all information about a particular run of the model (including subsequent commentary on the results) as meta-data in the configuration file, so that no wiki pages are needed: lab notebook pages are just user-friendly views of the configuration file. I think that’s probably a longer term project, and links in with the observation that existing climate model configuration tools are hard to use anyway and need to be re-invented. Let’s leave that one aside for the moment…

A related problem is better support for navigating and linking existing lab book pages. For example, in the process of writing up a scientific paper, a scientist might need to search for the descriptions of a number of individual experiments, select some of the data, create new visualizations for use in the paper, and so on. Recording this trail would improve reproducibility, by capturing the necessary links to source data in case the visualizations used in the paper need to be altered or recreated. Some of this requires a detailed analysis of the specific workflows used in a particular lab (which reminds me I need to write up what I know of the Met Office’s workflows), but I think some of this can be achieved by simple generic tools (e.g. browser plugins) that help capture the trail as it happens, and perhaps edit and annotate it afterwards.

I’m sure some of these tools must exist already, but I don’t know of them. Feel free to send me pointers…

Having talked with some of our graduate students about how to get a more inter-disciplinary education while they are in grad school, I’ve been collecting links to collaborative grad programs at U of T:

The Dynamics of Global Change Doctoral Program, housed in the Munk Centre. The core course, DGC1000H is very interesting – it starts with Malcolm Gladwell’s Tipping Point book, and then tours through money, religion, pandemics, climate change, the internet and ICTs, and development. What a wonderful journey.

The Centre for the Environment runs a Collaborative Graduate Program (MSc and PhD) in which students take some environmental science courses in addition to satisfying the degree requirements of their home department. The core course for this program is ENV1001, Environmental Decision Making, and it also includes an internship to get hands-on experience with environmental problem solving.

The Knowledge Media Design Institute (KMDI) also has a collaborative doctoral program, perfect for those interested in design and evaluation of new knowledge media, with a strong focus on knowledge creation, social change, and community.

Finally, the Centre for Global Change Science has a set of graduate student awards, to help fund grad students interested in global change science. Oh, and they have a fascinating seminar series, mainly focussed on climate science (all done for this year, but get on their mailing list for next year’s seminars).

Are there any more I missed?

Computer Science, as an undergraduate degree, is in trouble. Enrollments have dropped steadily throughout this decade: for example at U of T, our enrollment is about half what it was at the peak. The same is true across the whole of North America. There is some encouraging news: enrollments picked up a little this year (after a serious recruitment drive, ours is up about 20% from its nadir, while across the US it’s up 6.2%). But it’s way too early to assume they will climb back up to where they were. Oh, and the percentage of women students in CS now averages 12% – the lowest ever.

What happened? One explanation is career expectations. In the 80’s, it was common wisdom that a career in computers was an excellent move for anyone showing an aptitude for maths. In the 90’s, with the birth of the web, computer science even became cool for a while, and enrollments grew dramatically, with a steady improvement in gender balance too. Then came the dotcom boom and bust, and suddenly a computer science degree was no longer a sure bet. I’m told by our high school liaison team that parents of high school students haven’t got the message that the computer industry is short of graduates to recruit (although with the current recession that’s changing again anyway).

A more likely explanation is perceived relevance. In the 80’s, with the birth of the PC, and in the 90’s with the growth of the web, computer science seemed like the heart of an exciting revolution. But now computers are ubiquitous, they’re no longer particularly interesting. Kids take them for granted, and only a few über-geeks are truly interested in what’s inside the box. But computer science departments continue to draw boundaries around computer science and its subfields in a way that just encourages the fragmentation of knowledge that is so endemic in modern universities.

Which is why an experiment at Georgia Tech is particularly interesting. The College of Computing at Georgia Tech has managed to buck the enrollment trend, with enrollment numbers holding steady throughout this decade. The explanation appears to be a radical re-design of their undergraduate degree, into a set of eight threads. For a detailed explanation, there’s a white paper, but the basic aim is to get students to take more ownership of their degree programs (as opposed to waiting to be spoonfed), and to re-describe computer science in terms that make sense to the rest of the world (computer scientists often forget that the field is impenetrable to the outsider). The eight threads are: Modeling and simulation; Devices (embedded in the physical world); Theory; Information internetworks; Intelligence; Media (use of computers for more creative expression); People (human-centred design); and Platforms (computer architectures, etc). Students pick any two threads, and the program is designed so that any combination covers most of what you would expect to see in a traditional CS degree.

At first sight, it seems this is just a re-labeling effort, with the traditional subfields of CS (e.g. OS, networks, DB, HCI, AI, etc) mapping on to individual threads. But actually, it’s far more interesting than that. The threads are designed to re-contextualize knowledge. Instead of students picking from a buffet of CS courses, each thread is designed so that students see how the knowledge and skills they are developing can be applied in interesting ways. Most importantly, the threads cross many traditional disciplinary boundaries, weaving a diverse set of courses into a coherent theme, showing the students how their developing CS skills combine in intellectually stimulating ways, and preparing them for the connected thinking needed for inter-disciplinary problem solving.

For example, the People thread brings in psychology and sociology, examining the role of computers in the human activity systems that give them purpose. It explores the perceptual and cognitive abilities of people as well as design practices for practical socio-technical systems. The Modeling and Simulation thread explores how computational tools are used in a wide variety of sciences to help understand the world. Following this thread will require consideration of the epistemology of scientific knowledge, as well as mastery of the technical machinery by which we create models and simulations, and the underlying mathematics. The thread includes a big dose of both continuous and discrete math, data mining, and high performance computing. Just imagine what graduates of these two threads would be able to do for our research on SE and the climate crisis! The other thing I hope it will do is to help students to know their own strengths and passions, and be able to communicate effectively with others.

The good news is that our department decided this week to explore our own version of threads. Our aim is to learn from the experience at Georgia Tech and avoid some of the problems they have experienced (for example, by allowing every possible combination of 8 threads, it appears they have created too many constraints on timetabling and provisioning individual courses). I’ll blog this initiative as it unfolds.

I just spent the last two hours chewing the fat with Mark Klein at MIT and Mark Tovey at Carleton, talking about all sorts of ideas, but loosely focussed on how distributed collaborative modeling efforts can help address global change issues (e.g. climate, peak oil, sustainability).

MK has a project, Climate Interactive [update: Mark tells me I got the wrong project – it should be The Climate Collaboratorium; Climate Interactive is from a different group at MIT], which is exploring how climate simulation tools can be hooked up to discussions around decision making, which is one of the ideas we kicked around in our brainstorming sessions here.

MT has been exploring how you take ideas from distributed cognition and scale them up to much larger teams of people. He has put together a wonderful one-pager that summarizes many interesting ideas on how mass collaboration can be applied in this space.

This conversation is going to keep me going for days on stuff to explore and blog about:

And lots of interesting ideas for new projects…