Water can enter the environment in several ways, including rainfall and inflow from another body of water; it can exit in several ways too, including evaporation, flowing into another body of water, or seeping underground. So you create small dams and place sensors in the water at well-chosen locations. Each sensor measures the weight of the water (among other things) and feeds that data every few seconds to a data logger on the nearby shore. The data logger computes an average every 15 minutes and saves those values for you. Every so often you trek up the trail to your sensors, Palm Pilot in hand, download the data, and take it back to the lab.
Back in the lab (compressing the scientific process for brevity), you run statistical analyses on the data, generate defensible behavioral models, write up the results, and publish them.
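To make the logger's averaging step concrete, here is a minimal sketch of how readings taken every few seconds might be reduced to 15-minute averages. The function names and the (timestamp, value) input format are assumptions made for illustration; a real data logger does this in its own firmware and may align or weight its windows differently.

```python
from collections import defaultdict
from datetime import datetime, timedelta

WINDOW_SECONDS = 15 * 60  # the 15-minute averaging interval from the text


def window_start(ts):
    """Round a timestamp down to the start of its 15-minute window."""
    seconds_into_hour = ts.minute * 60 + ts.second
    offset = seconds_into_hour % WINDOW_SECONDS
    return ts.replace(microsecond=0) - timedelta(seconds=offset)


def average_by_window(readings):
    """Group (datetime, value) pairs by window and return {window start: mean}.

    The input format is hypothetical; it stands in for whatever the
    sensor actually sends to the logger.
    """
    buckets = defaultdict(list)
    for ts, value in readings:
        buckets[window_start(ts)].append(value)
    return {start: sum(vals) / len(vals) for start, vals in sorted(buckets.items())}
```

Notice what the averaging throws away: once only the 15-minute means are saved, the individual readings behind each value are gone.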
Until the day that you notice a very strange reading. The water level is suddenly unusually high. Why might this be...
One unexpected and undetected moose could really mess up your data-driven model of stream flow. Fortunately, the moose is reasonably easy to figure out. But other scenarios are a lot harder to get to the bottom of when you are dealing with complex natural phenomena and processes. What if you are measuring and modeling atmospheric carbon flow and sequestration in trees over that same expanse of forest? What if you include variables related to climate change, which is sure to bring in-depth scrutiny from peers and critics? You absolutely need to be able to explain and justify your conclusions to fellow scientists and perhaps even to the wider public.
What you need is Provenance Data: the data about the data, the metadata, whatever you want to call it. Provenance Data describes how those stream values were obtained, when they were obtained, and what was done to them afterward. It is the contextual information surrounding the so-called Raw Data.
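As a rough illustration of what "data about the data" could look like in practice, here is a minimal sketch of a provenance record attached to a single stream value. Every class and field name below is hypothetical, chosen only to mirror the three questions above (how, when, and what was done); it is not the format used by any particular research project.

```python
from dataclasses import dataclass, field


@dataclass
class ProcessingStep:
    """One thing that was done to the data after it was collected."""
    description: str
    performed_at: str  # ISO 8601 timestamp, kept as text for simplicity


@dataclass
class ProvenanceRecord:
    """How and when a value was obtained, plus its processing history."""
    sensor_id: str          # hypothetical identifier for the sensor
    collected_at: str       # when the value was obtained
    collection_method: str  # how the value was obtained
    steps: list = field(default_factory=list)

    def add_step(self, description, performed_at):
        """Append a processing step so the history stays in order."""
        self.steps.append(ProcessingStep(description, performed_at))


@dataclass
class Measurement:
    """A stream value that carries its own provenance."""
    value: float
    unit: str
    provenance: ProvenanceRecord
```

With something like this, the 15-minute averaging on the logger and any later cleanup in the lab would each be recorded with `add_step`, so a strange reading can be traced back through everything that happened to it.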
Computer Scientists are involved in a series of research projects to enable the gathering and clear presentation of Provenance Data. In my next post, I will explain what they are doing, as well as why I said "so-called" Raw Data.