Search This Blog

Thursday, October 22, 2009

Programmer as Data Quality Champion

The programmer is the one who takes all the wish lists and turns them into something that a programmable logic device (computer) can execute to fulfill the wishes. Today, several additional roles have attached themselves to this process. The architects, designers, modelers, testers... all play an important part in the final product but it is important to remember that these roles have motivations other than the ability of the product to satisfy wishes. At best they satisfy a different set of wishes that have more to do with the process than the product.


In the not so long ago days when I started in the information systems industry, none of those other roles even existed. It was all about programming.

In talking with a programmer, you will detect a hint of pride and superiority based on the sure knowledge that none of "them" could produce a program that ran without error and produced a useful result. Other than the "end user" or simply "user", there may be no one lower on the respect totem pole than the "data" people. The programmer only needs to know what you want it to "do"; data is just something that you move from one place to another.


In those bygone days before information technology there were organizations known as data processing. I'm leaving out broad segments of programming known as systems programming because at the operating system level, the data really is a commodity consisting of groups of on/off bits known as bytes. In the very act of ignoring this segment of programming we stumble over the origins of our problem. In early computer systems, there really was no data as we think of data today.


A programmer could grant wishes by making a process that took large amounts of "data" values from one file, combined them with large amounts of "data" values from other files and deposited the resulting "data" values in a new file from which address labels or paychecks were printed. The programmer's responsibility was simply to make sure the program didn't unexpectedly halt.


At first, they just told the users what the data values had to look like in order to ensure that the program kept running. When the users proved incapable of guaranteeing the necessary consistency, programmers took matters into their own hands and created scrubbing programs that would for example guarantee that a file contained only values that looked like $nnnnnn.nn where the value of n is from the set (0-9). Now everyone was happy until one day a big order came in for $1,250,000.00 and was thrown out as erroneous. At the same time, someone figured out how to divert the fractional round-off amounts into a private account.


I'm leaving out some reasoning steps in an effort to keep this to an essay length. If you get lost, just drop me a note and I'll be happy to fill in any missing pieces.


Eventually it was realized that we don't have to store data in a form recognizable to humans--the computer could be taught to present a data value in any format that a human might care to see. This leap forward allowed programmers to distance themselves even more from the data. The idea to take away from this is that programmers may not have the same concept of data that you do.


When non-programmers talk about data, they are typically talking about instances rather than types. To a non-programmer, "Walgreens" is an example of a piece of data as is "sea foam green" and "$900 billion." To a programmer, these are all character strings or text values and may be of three different subrange "types". The subrange (store, color, gross revenue) determines how the value should be handled and the value may be acceptable if it fits the pattern defined for the type.


Today, there are many opportunities to enforce patterns on data values and most of them require no programming at all. The problem is that they all produce errors and error messages that the typical user could not hope to comprehend. In effect, they cause the program to terminate unexpectedly. So, despite all the advancements in technology, we are still having to scrub data files. The alternative is for the programmer to think like a human instead of like a programmable controller and the problem with this alternative is that it introduces orders of magnitude increases (x10, x100...) in complexity and corresponding increases in development costs.


So how can programmers become champions of data quality? One relatively simple way would be to avoid accepting text values as program input. This tactic is a favorite because it defers many decisions until a later time when "we know more" and the big problem here is that we never have to go back and change it. An example here might be useful. Imagine that you are programming a system that accepts input from nurses who are taking vital signs (temperature, BP, pulse, respiration, height and weight) in a patient exam room. You take the usual shortcut and implement all the fields on the screen as text.


Everybody is happy because the nurses don't ever have to go back and correct anything and the program runs without apparent error. One day, though, a health insurance company decides to reward its contractual clients by paying a slightly higher rate to those who document that they are doing a consistent job of collecting vitals at every patient visit. Now we're asked to verify that we do an acceptable job of collecting and recording vital signs. Since the values input to a screen go directly to a database, we should have no problem. It is, in fact, no problem to count the records for which there is or is not a value in those fields, however, when we attempt to aggregate those values to show the range of values or the average value, our query fails. the aggregation query must convert the text values in the pulse field to integers and the text values in the temperature field to floating point (real) numbers in order to compute an average.


We finally discover that pulse contains some values like "100.4", "98.5", "SAME"... that cause an error because they can't be converted to an integer value. When we look at this as a nurse or physician, we can see that the mind ignores the labels on the screen and simply produces a picture of the patient based on the values displayed. Our poor computer, though, is unable to continue. The database architect could have made pulse an integer type and the DBMS would have enforced that typing by not allowing these values to be stored in the database. Using a text type allows the DBMS to accept any value for storage. The programmer could enforce a text value that is guaranteed to convert to an integer or could enforce integer types directly but in order to do so he or she must handle resulting errors in a way that is understood and accepted by the nurses.


More often, though, the nurse managers show the incorrect data to the nurses and exhort them to pay more attention. Do you believe the nurses will respond better to blame and exhortation or to assistance from the program? Check out W. E. Deming's Red Bead Experiment to get your answer.


The programmer champion will be suspicious of a discrete valued field whose data type is text. A value that may be used in a computation or any other operation where a conversion must be done must be investigated carefully. Any value that may be used as a tag for identifying rolled-up aggregations, such as store name, must get additional attention if we don't want to see quarterly sales for "Walgreens" and "Walgreen's" and "Wlagreens". The time to catch and repair these data quality errors is the very first time they are captured by a computer program. That makes the programmer responsible. Other roles have a duty to identify situations where these problems might arise, but only the programmer is positioned to do anything about it.


I realize this is asking a lot. A programmer is only human and can't be expected to know everything (right?). This suggests another way in which the programmer can become a champion. Since it isn't possible for one person to know everything that must be known (hard though that may be to swallow), the programmer must develop enthusiasm for consultation and collaboration. Every role in your environment was created for a reason and each has its own goals and responsibilities. The programmer is accustomed to the data people coming with requests. The requests are nearly always framed in terms of something that the programmer should do to make the [modeler's, architect's, steward's...] life easier and improve overall quality.

It's easy to understand how this can get old in a hurry. The solution is for the programmer(s) to sit down with these other roles and get everyone's needs on the table. All of the other roles mentioned have a different view of data than you do and here's the thing--their view is much closer to that of the customer/user than yours is. You need each other.

Accept that you are a key member of a team and as such the team can't succeed without your commitment. The flip side is that you will not be able to enjoy the success you dream of without the commitment, skills and knowledge of the rest of the team. Be a Data Quality Champion--it's within your grasp.

Next we'll take a look at some forces that act to keep the team from being all they could be. Stay tuned for Disturbances in the Force.

No comments:

Post a Comment