Friday, October 23, 2009
Maybe you're old enough to have lived through the Watergate fiasco and can remember the facts coming to light one by one in the press until they eventually began to make a complete picture. Maybe you remember the Hollywood version, All The President's Men, in which the whole picture was produced in two+ hours rather than months, or maybe the whole thing is in the same category as the Crimean War for you and is nothing more than a question on a pop quiz in one of your least favorite subjects.
I'd like to suggest for your consideration that if we want to track down why we are having such a difficult time accomplishing something that we all claim to want, we need look no further than the paragraph above to get all the answers we need.
First, let's imagine that data quality is like truth in government. It's a good thing and we would like to assume that we have it. If, in fact we do not have truth in government (or data quality), who benefits? The answer is that it is in the interests of those who believe they can/will be blamed for the status quo to cover up the problems and subvert efforts to get at the facts that can provide the complete picture. This is especially true if they are responsible for the problems. Even if the only identifiable responsibility is that the person is the supervisor or manager of the function(s) that owns the troubled processes, they still may elect to resist and subvert in order to avoid becoming responsible for the fix.
If we want to avoid this situation, we should absolutely avoid any questions that seem directed at why or who or even how. We should avoid to the extent possible, any investigation into the past. Try to keep all discussions focused on process-based causes that might be producing the effects you are seeing. Do not zoom in on isolated instances but look for trends. Remember, your goal is not prosecution but consistent quality.
In the words of Bob Woodward's source, Deep Throat, "Follow the money." The programmer-champion will struggle against this repeatedly. There is a perception that implementing integrity checking at the point of input represents added cost. Like any other complex process, system development should seek to minimize total cost of ownership rather than any single cost line item. If it takes an extra day of programmer time to ensure that we get 99.99% consistency of integrity in the database and thereby avoid dedicating multiple full-time staff to data clean up, this is a net cost reduction.
Our system design and project management processes may not be mature enough to assign dollar values to this, but it should be easy to determine how much money we are spending on fixing poor quality data every month (or year) and then amending the design and development processes to devote a fraction of that amount to prevention.
The final perspective to be extracted from our example is that a short attention span provides little hope of even recognizing that a problem exists let alone understanding it enough to develop a mitigation strategy. Data quality (and truth in government) requires that everyone be involved. People are capable of recognizing self interest within the corporate interest and enough people will be motivated to act that the ball will be kept moving, but the media of the late 1960's is not the media of 2009. In the 60's there was an interest in the truth that perhaps doesn't exist today. In your corporate environment, you may find it easier to maintain a constant pressure of communication directed at a single theme. The widespread motivation will not be produced by a single appeal surrounded by banners and fanfare and free cake. A communications campaign must be designed for the long haul with continuous refreshing of the message.
You don't even have one percent of your employee workforce today who are ready to grapple with the issue of data quality. You are going to have to break it down in multiple variations and start with the concept of data itself. What is data? You'll be surprised at what you uncover when you go out to talk to people about their data. Stay tuned for some samples.
Thursday, October 22, 2009
In the not so long ago days when I started in the information systems industry, none of those other roles even existed. It was all about programming.
In talking with a programmer, you will detect a hint of pride and superiority based on the sure knowledge that none of "them" could produce a program that ran without error and produced a useful result. Other than the "end user" or simply "user", there may be no one lower on the respect totem pole than the "data" people. The programmer only needs to know what you want it to "do"; data is just something that you move from one place to another.
In those bygone days before information technology there were organizations known as data processing. I'm leaving out broad segments of programming known as systems programming because at the operating system level, the data really is a commodity consisting of groups of on/off bits known as bytes. In the very act of ignoring this segment of programming we stumble over the origins of our problem. In early computer systems, there really was no data as we think of data today.
A programmer could grant wishes by making a process that took large amounts of "data" values from one file, combined them with large amounts of "data" values from other files and deposited the resulting "data" values in a new file from which address labels or paychecks were printed. The programmer's responsibility was simply to make sure the program didn't unexpectedly halt.
At first, they just told the users what the data values had to look like in order to ensure that the program kept running. When the users proved incapable of guaranteeing the necessary consistency, programmers took matters into their own hands and created scrubbing programs that would for example guarantee that a file contained only values that looked like $nnnnnn.nn where the value of n is from the set (0-9). Now everyone was happy until one day a big order came in for $1,250,000.00 and was thrown out as erroneous. At the same time, someone figured out how to divert the fractional round-off amounts into a private account.
I'm leaving out some reasoning steps in an effort to keep this to an essay length. If you get lost, just drop me a note and I'll be happy to fill in any missing pieces.
Eventually it was realized that we don't have to store data in a form recognizable to humans--the computer could be taught to present a data value in any format that a human might care to see. This leap forward allowed programmers to distance themselves even more from the data. The idea to take away from this is that programmers may not have the same concept of data that you do.
When non-programmers talk about data, they are typically talking about instances rather than types. To a non-programmer, "Walgreens" is an example of a piece of data as is "sea foam green" and "$900 billion." To a programmer, these are all character strings or text values and may be of three different subrange "types". The subrange (store, color, gross revenue) determines how the value should be handled and the value may be acceptable if it fits the pattern defined for the type.
Today, there are many opportunities to enforce patterns on data values and most of them require no programming at all. The problem is that they all produce errors and error messages that the typical user could not hope to comprehend. In effect, they cause the program to terminate unexpectedly. So, despite all the advancements in technology, we are still having to scrub data files. The alternative is for the programmer to think like a human instead of like a programmable controller and the problem with this alternative is that it introduces orders of magnitude increases (x10, x100...) in complexity and corresponding increases in development costs.
So how can programmers become champions of data quality? One relatively simple way would be to avoid accepting text values as program input. This tactic is a favorite because it defers many decisions until a later time when "we know more" and the big problem here is that we never have to go back and change it. An example here might be useful. Imagine that you are programming a system that accepts input from nurses who are taking vital signs (temperature, BP, pulse, respiration, height and weight) in a patient exam room. You take the usual shortcut and implement all the fields on the screen as text.
Everybody is happy because the nurses don't ever have to go back and correct anything and the program runs without apparent error. One day, though, a health insurance company decides to reward its contractual clients by paying a slightly higher rate to those who document that they are doing a consistent job of collecting vitals at every patient visit. Now we're asked to verify that we do an acceptable job of collecting and recording vital signs. Since the values input to a screen go directly to a database, we should have no problem. It is, in fact, no problem to count the records for which there is or is not a value in those fields, however, when we attempt to aggregate those values to show the range of values or the average value, our query fails. the aggregation query must convert the text values in the pulse field to integers and the text values in the temperature field to floating point (real) numbers in order to compute an average.
We finally discover that pulse contains some values like "100.4", "98.5", "SAME"... that cause an error because they can't be converted to an integer value. When we look at this as a nurse or physician, we can see that the mind ignores the labels on the screen and simply produces a picture of the patient based on the values displayed. Our poor computer, though, is unable to continue. The database architect could have made pulse an integer type and the DBMS would have enforced that typing by not allowing these values to be stored in the database. Using a text type allows the DBMS to accept any value for storage. The programmer could enforce a text value that is guaranteed to convert to an integer or could enforce integer types directly but in order to do so he or she must handle resulting errors in a way that is understood and accepted by the nurses.
More often, though, the nurse managers show the incorrect data to the nurses and exhort them to pay more attention. Do you believe the nurses will respond better to blame and exhortation or to assistance from the program? Check out W. E. Deming's Red Bead Experiment to get your answer.
The programmer champion will be suspicious of a discrete valued field whose data type is text. A value that may be used in a computation or any other operation where a conversion must be done must be investigated carefully. Any value that may be used as a tag for identifying rolled-up aggregations, such as store name, must get additional attention if we don't want to see quarterly sales for "Walgreens" and "Walgreen's" and "Wlagreens". The time to catch and repair these data quality errors is the very first time they are captured by a computer program. That makes the programmer responsible. Other roles have a duty to identify situations where these problems might arise, but only the programmer is positioned to do anything about it.
I realize this is asking a lot. A programmer is only human and can't be expected to know everything (right?). This suggests another way in which the programmer can become a champion. Since it isn't possible for one person to know everything that must be known (hard though that may be to swallow), the programmer must develop enthusiasm for consultation and collaboration. Every role in your environment was created for a reason and each has its own goals and responsibilities. The programmer is accustomed to the data people coming with requests. The requests are nearly always framed in terms of something that the programmer should do to make the [modeler's, architect's, steward's...] life easier and improve overall quality.
It's easy to understand how this can get old in a hurry. The solution is for the programmer(s) to sit down with these other roles and get everyone's needs on the table. All of the other roles mentioned have a different view of data than you do and here's the thing--their view is much closer to that of the customer/user than yours is. You need each other.
Accept that you are a key member of a team and as such the team can't succeed without your commitment. The flip side is that you will not be able to enjoy the success you dream of without the commitment, skills and knowledge of the rest of the team. Be a Data Quality Champion--it's within your grasp.
Next we'll take a look at some forces that act to keep the team from being all they could be. Stay tuned for Disturbances in the Force.
- Meeting the spec[ification]
- Documented adherence to established [process] norms
- The product's effects are primarily positive (for example, it tastes good and doesn't make me ill)
- I'll know it when I see it
Multiple choice: which of the above (choose only one) is the definition you want applied to your new car?
Does the answer change if we apply the definition to your morning cup of coffee?
Last question: which definition do you apply to the next example of "business intelligence" that comes before you?
I have two points in mind. First, defining quality is not an exact science even within a specific context. Second, #4 may be the deciding factor regardless of #s 1-3. In the end, the consumer/customer merely has to say, "That's not what I was looking for" to relegate a product to the trash heap. We all know that it does no good to say, "This is what you asked for" (meeting the spec) or "I did it just like you told me" (followed established procedure) or "One won't hurt you" (tastes good and not sick--yet).
So what is quality and especially, what is data quality? How we obtain data quality is completely dependent on the answer to this question.
I'd like to suggest that we divide the question in order to produce at least one useful answer. If we examine data quality from the perspective of a computer and its logic, we can come up with an answer that will allow us to progress. The second perspective is obviously the consumer/customer or the human perspective.
Recently I received an email with what at first glance seemed like an innocuous statement full of typographic and/or spelling errors, but when I actually looked at it, it was a nearly perfect illustration of a principle I have been talking about for years.
Teh hmuan mnid is cpalbae of miankg snese of amslot atynhnig taht ftis smoe bisac ptratnes.
This is the principle that draws the line between computer logic and human "logic". It is also what makes programmers (an outmoded term, I know, but best suited to the point I'm making) so vitally important. There is only one role in the continuum of roles involved in producing an information system product that must bear the full weight of responsibility for the integrity of the data quality at the boundary between computer and human. That role is best thought of as programmer.
Unless you earn your living as a programmer, some alarms may be going off. In fact, if you are a programmer, some alarms should be going off. I earned a degree in computer science and made a living as a programmer. From there I moved into data modeling, data administration and database administration. Now I'm involved in data quality and governance. In all of that time, I have never come into contact with any training, education, book or even a job description that addressed my accountability for preserving data quality at the man/machine interface.
This may be a poor forum for this, but my intention is to change this situation right here. My next few posts will present some background on how a programmer might live up to this responsibility and some of the forces that will need to be fended off in order to make it a reality.
Next: Programmer as Data Quality Champion
Thursday, October 1, 2009
"We are compelled to drive toward total knowledge, right down to the levels of the neuron and the gene. When we have progressed enough to explain ourselves in these mechanistic terms...the result might be hard to accept." [Edward O. Wilson]
Unfortunately, though having answers readily available can get you around the game board ahead of everyone else, it can't necessarily produce the big win for a cooperative venture. I don't know what research exists on this subject, if any, but it seems to me that living in "the age of information" is vastly overrated. If you think it is good to have a library of information at your fingertips then you may want to think about some desireable properties of information, for example:
- Is it true?
- Is it accurate?
- Is it current?
- Is it useful?
A relatively recent recent development has been the creation of a new area of specialization around data quality. This is not a technical post, though, so I leave it up to the reader to go as deep as necessary. The point I would like to make is that the useful property trumps all the others.
The publicity that surrounded recent preseidential elections in the U.S. shows us that "information" need not be true or accurate or current in order to be useful. If nothing else, this should cause us to pause and consider our thirst for information.
The bottom line is that a command of facts is useless unless those facts enable us to accomplish something. This is the function of experience, education, training, practice... In truth, a person who commands an encyclopedia of facts and is fluent in acronymese (see previous post) may or may not be able to accomplish a goal. This is also why we tend to look for experience when we need help.
The catch is that a person who has no experience, training, or applicable knowledge has no way of recognizing experience that will be useful. This leads to insistence on things that may have little or no bearing on eventual achievement. For example, when we need a new roof on our house, we don't ask that the roofer have experience with a certain brand of shingle or a certain kind of fastener. We don't specify how the work will be done. We simply insist on straight lines and no leaks. It's inexplicably strange that we do not take the same approach to technology goals including those relating to information. Read some recent position descriptions or postings on Monster or Dice or CareerBuilder. They are chock full of brand names to the extent that the real goal is obscured.
The proof of my point is the extent to which the information management situation is becoming more complex, with dramatic increases in the amount of work and the specialization of that work. These increases in turn produce increases in costs. It's hard to point to achievements today. Micromanagement, based on a swarm of facts with no understanding produces a lot of activity and few results.