Thursday, February 14, 2013

Data is like Scat

We (data people) like to talk about data as if it were the most important thing in the world when, in reality, it is exactly like turds. Yes, that's what I said: merde, gavno, scheiss, shit.

Of course, it could still be the most important thing in the world, but whether it is or is not depends on many things, not the least of which is what we're trying to do. Data is the byproduct of process, much as turds are the byproduct of digestion.

For naturalists and scientists, castings or animal feces provide important information on a wide range of subjects, including the animal's diet, where it has been, and its general health. For a tracker, a different set of information is deducible from the spoor. The tracker can tell which species left it, how recently, whether the animal is moving or is likely to be nearby, and other things that may help him earn a bonus.

No one says, "This is quality shit" or "We need a governance program to improve the quality of our shit." They simply learn what they can from it and move on. The kind of value available changes over time, but even petrified feces have a story to tell.

If we presume to be data experts, maybe we should be focused on extracting value from our data and understanding the kinds of value that can be extracted as well as how to recognize the data that will tell us what we need to know.

Even the most inconsistent set of data has a story to tell, and we should listen to that story instead of wailing about the story we wanted to hear. Because data is the byproduct of process, inconsistent data should tell us that the process that produced it is inconsistent. If this is the case, how can we expect consistent data unless we can create a consistent process?

The bottom line is that when we focus our efforts on data quality, we are misunderstanding the world we live and work in. We are creating additional and entirely unnecessary complexity. How does this happen? The cause lies in the storage of data for input to and output from computer systems. We have allowed ourselves to institutionalize the cart before the horse. What computer systems require is consistent input. A system can be designed to deal with quality issues as long as they are predictable. A system (process) will always produce output of a consistency equivalent to its input.
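To make the point about predictable issues concrete, here is a rough sketch in Python. It assumes a hypothetical feed of dates that arrive in a handful of known layouts; the names KNOWN_FORMATS, normalize_date and split_by_predictability are invented for illustration and do not come from any real system. Variation the system knows about is absorbed, and anything else is set aside as evidence about the process that produced it.

```python
from datetime import datetime

# Assumption: the layouts this hypothetical feed is known to produce.
KNOWN_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d %b %Y")


def normalize_date(raw):
    """Return an ISO date string, or None if the value fits no known layout."""
    text = raw.strip()
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            # Not this layout; try the next one.
            continue
    return None


def split_by_predictability(values):
    """Separate values the system can absorb from those it cannot."""
    handled, surprises = [], []
    for raw in values:
        iso = normalize_date(raw)
        if iso is None:
            # Unpredicted input: keep it as-is to learn about the process.
            surprises.append(raw)
        else:
            handled.append(iso)
    return handled, surprises


if __name__ == "__main__":
    sample = ["2013-02-14", "02/14/2013", "14 Feb 2013", "sometime in Feb"]
    handled, surprises = split_by_predictability(sample)
    print("handled:", handled)
    print("surprises:", surprises)
```

The three known layouts are "bad" only in the sense that they differ from each other, and the system copes with them because the differences are predictable. The fourth value is the one worth studying.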

If we were pursuing consistency instead of quality, our disagreements would be fewer, our stress would be lower and our impact would be greater. Consistency is a readily understood concept, while quality will always be the most elusive of quarries.