Wednesday, February 1, 2012

Preparing for the Unexpected Moose in Your Hydrology Research Study

Let's say you are an ecologist studying a watershed high in the mountains. You care about the water flow through several streams that feed a pristine lake. Ultimately you want to understand the stream discharge process in this region: what volume of water moves through these streams, and when? You need to collect a lot of complex data in order to build a realistic model of what happens day in and day out.

Water can enter the environment in several ways, including rain and inflow from another body of water; it can exit in several ways, including evaporation, flowing into another body of water, or seeping underground. So you create small dams and place sensors in the water at well-chosen locations. Each sensor measures the weight of the water (among other things) and feeds that data every few seconds to a data logger on the nearby shore. The data logger computes an average every 15 minutes and saves those values for you. Every so often you trek up the trail to your sensors, Palm Pilot in hand, download the data, and take it back to the lab.
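To make the data logger's job concrete, here is a minimal sketch of that 15-minute averaging step. It assumes each raw reading arrives as a (timestamp, value) pair; the function names `window_start` and `average_by_window` are made up for illustration and do not come from any real logger's software.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def window_start(ts: datetime) -> datetime:
    """Round a timestamp down to the start of its 15-minute window."""
    return ts - timedelta(
        minutes=ts.minute % 15,
        seconds=ts.second,
        microseconds=ts.microsecond,
    )


def average_by_window(readings):
    """Average raw readings per 15-minute window.

    readings: iterable of (datetime, float) pairs (an assumed format).
    Returns a dict mapping each window's start time to the mean value.
    """
    buckets = defaultdict(list)
    for ts, value in readings:
        buckets[window_start(ts)].append(value)
    return {start: sum(vals) / len(vals) for start, vals in sorted(buckets.items())}
```

Notice that once the averages are saved, the individual few-second readings are gone unless the logger also keeps them. A brief disturbance ends up blended into a single 15-minute value.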

Then (compressing the description of the scientific process for purposes of brevity) you run statistical analyses on the data, generate defensible behavioral models, write up the results, and publish them.

Until the day you notice a very strange reading. The water level is suddenly, unusually high. Why might this be?

By running a few standard checks and conducting a little investigation, you discover that a moose stepped in the water. If you are like me, when you first heard this all-too-real scenario, you almost fell off the chair laughing at the thought of a moose blithely wandering into the middle of a serious research project.
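What might one of those checks look like? Here is a rough sketch of a simple one: flag any reading that jumps far away from the median of the readings just before it. The function name `flag_spikes`, the window size, and the threshold are all assumptions chosen for illustration, not the checks an actual hydrology lab uses.

```python
from statistics import median


def flag_spikes(values, window=4, threshold=0.5):
    """Return the indices of readings that look like sudden spikes.

    A reading is flagged when it differs from the median of the
    preceding `window` readings by more than `threshold`
    (in the same units as the readings). Both defaults are arbitrary.
    """
    flagged = []
    for i in range(window, len(values)):
        baseline = median(values[i - window:i])
        if abs(values[i] - baseline) > threshold:
            flagged.append(i)
    return flagged
```

A check like this only tells you that something odd happened at a particular reading. It cannot tell you that the cause had antlers; that part still takes the investigation.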

One unexpected and undetected moose could really mess up your data-driven model of stream flow. Fortunately, the moose is reasonably easy to figure out. But other scenarios are a lot harder to get to the bottom of when you are dealing with complex natural phenomena and processes. What if you are measuring and modeling atmospheric carbon flow and sequestration in trees over that same expanse of forest? What if you include variables related to climate change, which is sure to bring in-depth scrutiny from peers and critics? You absolutely need to be able to explain and justify your conclusions to other scientists and perhaps even to the wider public.

What you need is Provenance Data: the data about the data, the metadata, whatever you want to call it. Provenance Data describes how those stream values were obtained, when they were obtained, and what was done to them afterward. It is the contextual information surrounding the so-called Raw Data.
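As a rough illustration of the idea, here is one way a reading could carry its provenance along with it. This is only a sketch: the class names, field names, and example values are invented for this post and do not reflect any particular provenance system or standard.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List


@dataclass
class ProvenanceRecord:
    """Hypothetical record of how, when, and by what a value was obtained."""
    sensor_id: str            # which sensor produced the value
    collected_at: datetime    # when the value was obtained
    method: str               # how the value was obtained
    processing_steps: List[str] = field(default_factory=list)  # what was done to it

    def add_step(self, description: str) -> None:
        """Append a note describing one more thing done to the data."""
        self.processing_steps.append(description)


@dataclass
class StreamReading:
    """A stream value bundled with the provenance that explains it."""
    value: float
    provenance: ProvenanceRecord
```

For example, a saved average could record "averaged over 15 minutes by data logger" as one step and "downloaded to handheld" as another, so that anyone questioning a strange value later can see its history.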

Computer Scientists are involved in a series of research projects to enable the gathering and clear presentation of Provenance Data. Next post, I will explain what they are doing, as well as why I said "so-called" Raw Data.
