Monday, February 6, 2012

The Data Provenance Project

It is a scientist's job to ask a lot of questions and to search for answers. Often this means collecting extensive data and studying it to generate meaning. Along the way, scientists may see an unusual output, such as the one described in the hydrology research project in my last post. To follow the trail leading eventually to our friend the moose, questions had to be asked: Which sensor produced the anomalous data? When did this happen: what day, what time? How long did the change last? Was there a similar rise in water level at nearby sensors in other streams? The answers come in the form of data: Provenance Data. Provenance comes from the French verb "provenir", meaning "to come from, to come forth".

But Provenance Data is not just for getting to the bottom of mysteries. Provenance Data is a key contributor to proving and justifying scientific conclusions. This is where the challenging concept of "raw data" comes into play. What data exactly are we talking about when we ask for the "raw data"? How do we present that data to others such that it has meaning, given that a bunch of numbers without any interpretation is often meaningless? But once we interpret (manipulate) it, is it still "raw"? It is easy to get trapped in a circular conundrum.

An example, returning to the hydrology project: is raw data the data about stream outflow at a given location? This outflow information has meaning, but it was generated by synthesizing and filtering other data. So, is raw data the average water weight generated every 15 minutes at various onshore loggers? Maybe. We can get yet more specific: is raw data the individual underwater sensor readings taken every few seconds? Maybe... but at this point would those readings make any sense to anyone other than a few highly trained specialists and engineers?

Probably not. So how helpful would those readings actually be in proving and justifying claims of stream outflow volume to a concerned external evaluator or critic? We haven't even discussed the enormous technological hurdles to maintaining every single sensor reading for any length of time. Not to mention that if you asked an ecologist, they would probably present even more alternatives for the title of "raw data".

More than ever, in this day and age of constant challenging and questioning of scientific claims, something is needed to assist with obtaining a full picture of where results come from and what they mean.

As explained to me by Barbara Lerner, computer science faculty at Mount Holyoke College, Provenance Data is useful for answering many questions related to understanding, validation and accountability. It provides tracking of data, enables the study of the interacting actions inherent to any complex process, and facilitates investigation of the deeper and broader questions generated by the data inherent to complex processes.

Barbara is part of the multi-institutional Data Provenance Project, which is developing a process system to aid scientists in collecting, storing and analyzing Provenance Data. She works with faculty at the University of Massachusetts at Amherst (Lee Osterweil) and at Harvard Forest (Emery Boose). The tool they are creating will provide a disciplined method to track how and when data was collected, and how it has been manipulated, all the way through to the development of descriptive models. There are applications in diverse domains; her focus is the Harvard Forest ecology project measuring stream volume outflow that we have been discussing. When the project is complete, the ecologists her team works with will be able to extensively query and manipulate their data - without having to learn a query language such as SQL. The current prototype is already able to produce Data Derivation Graphs (DDGs) for the scientists.

Here is a very simple example of a DDG describing the process for obtaining one stream discharge value, using a specialized processing language called Little-JIL:

Detailed explanations can be found in the team's published papers.*
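
For readers who think in code, the basic idea of a derivation graph can be sketched in a few lines of Python. To be clear, this is not Little-JIL and it is not the team's tool: the `Node` class, the `apply_step` and `lineage` helpers, the sample numbers and the "multiply by 2" conversion are all invented for illustration. The sketch only shows the principle that every derived value remembers the step and the inputs that produced it, so you can walk backwards from a stream discharge value to the sensor readings underneath it.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class Node:
    """One vertex in the graph: either a piece of data or a processing step."""
    name: str
    kind: str                                   # "data" or "step"
    value: object = None
    inputs: list = field(default_factory=list)  # nodes this one was derived from


def apply_step(step_name, func, output_name, *sources):
    """Run func on the source values and record how the output was derived."""
    step = Node(step_name, "step", inputs=list(sources))
    result = func(*(s.value for s in sources))
    return Node(output_name, "data", value=result, inputs=[step])


def lineage(node, depth=0):
    """Walk back from a node to everything it came from, as indented lines."""
    lines = ["  " * depth + f"{node.kind}: {node.name}"]
    for parent in node.inputs:
        lines.extend(lineage(parent, depth + 1))
    return lines


# Hypothetical numbers, made up purely for illustration.
readings = Node("sensor readings", "data", value=[10.0, 12.0, 14.0])
average = apply_step("average over 15 minutes", mean,
                     "15-minute average", readings)
# Placeholder conversion, NOT a real hydrology formula.
discharge = apply_step("convert to discharge", lambda level: level * 2.0,
                       "stream discharge", average)

print("\n".join(lineage(discharge)))
```

Printing the lineage of `discharge` lists the discharge value first, then the conversion step, the 15-minute average, the averaging step, and finally the sensor readings. Each of those levels is a candidate for the title of "raw data", which is exactly why keeping the whole chain matters.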

There are challenges on many levels to building a Data Provenance tool. One of the biggest concerns is balancing technical flexibility with ease of use for the non-computer scientist. For this reason the computer scientists work closely with the ecologists, who think this project is "cool" and are happy to provide ongoing feedback. There are other challenges too: those inherent to graph problems in general, and all sorts of challenges in developing process systems that will be functional across disciplines. Other areas of interest include processes tied to climate modeling, emergency room care, chemotherapy delivery and labor negotiations. Clearly, the long-term benefits extend far beyond the ecology project. Theoretically, any science process, research or otherwise, will be able to use this system once it is fully developed.

As Barbara Lerner says, it is extremely rewarding to do outward-looking things and obtain concrete results. It is inspiring to work with other scientists who think this work is exciting. The field of computer science benefits, the overall cause of science benefits, and society benefits. Hard to argue with any of that.



*Barbara Lerner, Emery Boose, Leon Osterweil, Aaron Ellison and Lori Clarke, "Provenance and Quality Control in Sensor Networks", Environmental Information Management 2011 Conference, Santa Barbara, California, September 2011.
