So, if we know how to make things better in terms of data quality and we're motivated to do so, what's stopping us? A word of caution: what you're about to read may be harmful to your health.
Maybe you're old enough to have lived through the Watergate fiasco and can remember the facts coming to light one by one in the press until they eventually began to make a complete picture. Maybe you remember the Hollywood version, All the President's Men, in which the whole picture came together in a little over two hours rather than months, or maybe the whole thing is in the same category as the Crimean War for you: nothing more than a question on a pop quiz in one of your least favorite subjects.
I'd like to suggest that if we want to track down why we are having such a difficult time accomplishing something we all claim to want, we need look no further than the paragraph above for all the answers we need.
First, let's imagine that data quality is like truth in government. It's a good thing, and we would like to assume that we have it. If, in fact, we do not have truth in government (or data quality), who benefits? The answer is that it is in the interest of those who believe they can or will be blamed for the status quo to cover up the problems and subvert efforts to get at the facts that could provide the complete picture. This is especially true if they are responsible for the problems. Even if their only identifiable responsibility is being the supervisor or manager of the function that owns the troubled processes, they may still elect to resist and subvert in order to avoid becoming responsible for the fix.
If we want to avoid this situation, we should absolutely avoid any questions that sound like a hunt for who did it, why they did it, or even how it was allowed to happen. We should avoid, to the extent possible, any investigation into the past. Try to keep all discussions focused on the process-based causes that might be producing the effects you are seeing. Do not zoom in on isolated instances; look for trends. Remember, your goal is not prosecution but consistent quality.
In the words of Bob Woodward's source, Deep Throat, "Follow the money." The programmer-champion will struggle against this repeatedly. There is a perception that implementing integrity checking at the point of input represents added cost. Like any other complex process, system development should seek to minimize total cost of ownership rather than any single cost line item. If it takes an extra day of programmer time to ensure 99.99% integrity in the database and thereby avoid dedicating multiple full-time staff to data cleanup, that is a net cost reduction.
Our system design and project management processes may not be mature enough to assign dollar values to this, but it should be easy to determine how much money we are spending on fixing poor-quality data every month (or year) and then to amend the design and development processes to devote a fraction of that amount to prevention.
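To make the trade-off concrete, here is a back-of-the-envelope sketch in Python. Every figure in it is a made-up placeholder, not a benchmark; plug in your own numbers for programmer time and cleanup staffing.

```python
# Back-of-the-envelope comparison of one-time prevention cost vs. ongoing cleanup cost.
# Every figure below is a hypothetical placeholder -- substitute your own numbers.

programmer_day_rate = 800.00        # assumed cost of one day of development time
extra_dev_days = 5                  # assumed days spent adding input integrity checks
cleanup_staff = 2                   # assumed full-time people currently fixing bad data
annual_cost_per_person = 60_000.00  # assumed fully loaded annual cost per cleanup person

prevention_cost = programmer_day_rate * extra_dev_days
annual_cleanup_cost = cleanup_staff * annual_cost_per_person

print(f"One-time prevention cost: ${prevention_cost:,.2f}")
print(f"Annual cleanup cost:      ${annual_cleanup_cost:,.2f}")
print(f"First-year net savings:   ${annual_cleanup_cost - prevention_cost:,.2f}")
```

Even with generous assumptions about the extra development effort, the one-time prevention cost tends to be a small fraction of what ongoing cleanup consumes.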
The final perspective to be extracted from our example is that a short attention span provides little hope of even recognizing that a problem exists, let alone understanding it well enough to develop a mitigation strategy. Data quality (and truth in government) requires that everyone be involved. People are capable of recognizing self-interest within the corporate interest, and enough people will be motivated to act that the ball will be kept moving. But the media of the late 1960s is not the media of 2009; in the '60s there was an interest in the truth that perhaps doesn't exist today. In your corporate environment, you may find it easier to maintain a constant pressure of communication directed at a single theme. Widespread motivation will not be produced by a single appeal surrounded by banners and fanfare and free cake. A communications campaign must be designed for the long haul, with continuous refreshing of the message.
Not even one percent of your employee workforce today is ready to grapple with the issue of data quality. You are going to have to break it down into multiple variations and start with the concept of data itself. What is data? You'll be surprised at what you uncover when you go out and talk to people about their data. Stay tuned for some samples.
This blog addresses all things related to the safe handling of data and information. It is not for the faint of heart. Draw nigh all ye searchers. Learn and teach.
Thursday, October 22, 2009
Programmer as Data Quality Champion
The programmer is the one who takes all the wish lists and turns them into something that a programmable logic device (a computer) can execute to fulfill the wishes. Today, several additional roles have attached themselves to this process. The architects, designers, modelers, testers... all play an important part in the final product, but it is important to remember that these roles are motivated by things other than the product's ability to satisfy wishes. At best they satisfy a different set of wishes, ones that have more to do with the process than the product.
In the not so long ago days when I started in the information systems industry, none of those other roles even existed. It was all about programming.
In talking with a programmer, you will detect a hint of pride and superiority based on the sure knowledge that none of "them" could produce a program that ran without error and produced a useful result. Other than the "end user" or simply "user", there may be no one lower on the respect totem pole than the "data" people. The programmer only needs to know what you want it to "do"; data is just something that you move from one place to another.
In those bygone days before "information technology," there were organizations known as "data processing." I'm leaving out broad segments of programming known as systems programming because at the operating system level, the data really is a commodity consisting of groups of on/off bits known as bytes. In the very act of ignoring this segment of programming we stumble over the origins of our problem. In early computer systems, there really was no data as we think of data today.
A programmer could grant wishes by making a process that took large amounts of "data" values from one file, combined them with large amounts of "data" values from other files and deposited the resulting "data" values in a new file from which address labels or paychecks were printed. The programmer's responsibility was simply to make sure the program didn't unexpectedly halt.
At first, they just told the users what the data values had to look like in order to ensure that the program kept running. When the users proved incapable of guaranteeing the necessary consistency, programmers took matters into their own hands and created scrubbing programs that would, for example, guarantee that a file contained only values that looked like $nnnnnn.nn, where each n is a digit from 0-9. Now everyone was happy, until one day a big order came in for $1,250,000.00 and was thrown out as erroneous. At about the same time, someone figured out how to divert the fractional round-off amounts into a private account.
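For the curious, here is a minimal sketch of that kind of scrubbing rule. The exact pattern is my own guess at the $nnnnnn.nn format described above; the point is only that a rigid pattern happily discards legitimate values.

```python
import re

# The scrubbing rule described above: a dollar sign, up to six digits, a decimal
# point, and two digits -- my reading of the $nnnnnn.nn format.
PATTERN = re.compile(r"^\$\d{1,6}\.\d{2}$")

for value in ["$001250.00", "$999999.99", "$1,250,000.00"]:
    status = "accepted" if PATTERN.match(value) else "thrown out as erroneous"
    print(f"{value:>14} -> {status}")

# The big order "$1,250,000.00" fails the pattern (commas, too many digits),
# so a perfectly legitimate record is discarded by the cleanup program.
```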
I'm leaving out some reasoning steps in an effort to keep this to an essay length. If you get lost, just drop me a note and I'll be happy to fill in any missing pieces.
Eventually it was realized that we don't have to store data in a form recognizable to humans--the computer could be taught to present a data value in any format that a human might care to see. This leap forward allowed programmers to distance themselves even more from the data. The idea to take away from this is that programmers may not have the same concept of data that you do.
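A tiny sketch of that idea: store the value in a machine-friendly form and let the program do the human-friendly formatting on the way out. The Decimal type and the sample value here are just illustrations.

```python
from decimal import Decimal

# Store the value in a machine-friendly form...
order_total = Decimal("1250000.00")

# ...and let the program present it however a human might care to see it.
print(f"${order_total:,.2f}")   # $1,250,000.00
print(f"{order_total:.2E}")     # 1.25E+6
```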
When non-programmers talk about data, they are typically talking about instances rather than types. To a non-programmer, "Walgreens" is an example of a piece of data as is "sea foam green" and "$900 billion." To a programmer, these are all character strings or text values and may be of three different subrange "types". The subrange (store, color, gross revenue) determines how the value should be handled and the value may be acceptable if it fits the pattern defined for the type.
Today, there are many opportunities to enforce patterns on data values, and most of them require no programming at all. The problem is that they all produce errors and error messages that the typical user could not hope to comprehend. In effect, they cause the program to terminate unexpectedly. So, despite all the advancements in technology, we are still scrubbing data files. The alternative is for the programmer to think like a human instead of like a programmable controller; the problem with this alternative is that it introduces order-of-magnitude increases (x10, x100...) in complexity and corresponding increases in development cost.
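As an example of "enforcement with no programming," a declarative database constraint will reject a bad value all by itself, but the message it produces is aimed at programmers, not users. The sketch below uses SQLite purely for illustration; the table and constraint are invented.

```python
import sqlite3

# A declarative constraint enforces the rule with no application code at all...
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE vitals (
        patient_id INTEGER,
        pulse      INTEGER CHECK (typeof(pulse) = 'integer')
    )
""")

# ...but when a value breaks the rule, the user is shown the raw machinery, not help.
try:
    conn.execute("INSERT INTO vitals (patient_id, pulse) VALUES (?, ?)", (1, "SAME"))
except sqlite3.IntegrityError as err:
    print(f"What the user sees: {err}")   # e.g. "CHECK constraint failed: ..."
```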
So how can programmers become champions of data quality? One relatively simple way would be to stop accepting raw text values as program input. Taking everything in as text is a favorite tactic because it defers many decisions until a later time when "we know more"; the big problem is that we never go back and change it. An example might be useful. Imagine that you are programming a system that accepts input from nurses who are taking vital signs (temperature, BP, pulse, respiration, height and weight) in a patient exam room. You take the usual shortcut and implement all the fields on the screen as text.
Everybody is happy because the nurses never have to go back and correct anything, and the program runs without apparent error. One day, though, a health insurance company decides to reward its contracted clients by paying a slightly higher rate to those who document that they are doing a consistent job of collecting vitals at every patient visit. Now we're asked to verify that we do an acceptable job of collecting and recording vital signs. Since the values input to a screen go directly to a database, we should have no problem. It is, in fact, no problem to count the records that do or do not have a value in those fields. However, when we attempt to aggregate those values to show the range or the average, our query fails: the aggregation query must convert the text values in the pulse field to integers, and the text values in the temperature field to floating-point (real) numbers, in order to compute an average.
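Here is roughly what the reporting query runs into, sketched in Python with made-up values; one stray entry is enough to break the whole aggregation.

```python
# What the reporting query effectively has to do: turn stored text into numbers.
pulse_values = ["72", "88", "100.4", "SAME"]   # illustrative stored values

try:
    average = sum(int(v) for v in pulse_values) / len(pulse_values)
    print(f"Average pulse: {average:.1f}")
except ValueError as err:
    # A single stray entry is enough to sink the whole aggregation.
    print(f"Aggregation failed: {err}")
```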
We finally discover that pulse contains some values like "100.4", "98.5", "SAME"... that cause an error because they can't be converted to an integer. When we look at this as a nurse or physician would, we can see that the mind ignores the labels on the screen and simply builds a picture of the patient from the values displayed. Our poor computer, though, is unable to continue. The database architect could have made pulse an integer type, and the DBMS would have enforced that typing by refusing to store these values; using a text type allows the DBMS to accept any value for storage. The programmer could enforce a text value that is guaranteed to convert to an integer, or could enforce integer types directly, but in order to do so he or she must handle the resulting errors in a way that is understood and accepted by the nurses.
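What capture-time checking might look like is sketched below. The field name, the plausible range, and the wording of the messages are all invented for illustration; the real versions would have to come from the nurses themselves.

```python
def check_pulse(raw):
    """Validate a pulse entry at capture time; return (value, message)."""
    text = raw.strip()
    if not text.isdigit():
        return None, "Pulse must be a whole number, for example 72. Please re-enter."
    value = int(text)
    if not 20 <= value <= 250:   # a plausible range, assumed here for illustration
        return None, f"{value} is outside the expected range (20-250). Please confirm."
    return value, None

for entry in ["72", "100.4", "SAME"]:
    value, message = check_pulse(entry)
    print(f"{entry!r}: {'stored as ' + str(value) if message is None else message}")
```

The important design choice is that the check runs while the nurse is still in the exam room and can fix the entry, not months later in a failed report.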
More often, though, the nurse managers show the incorrect data to the nurses and exhort them to pay more attention. Do you believe the nurses will respond better to blame and exhortation or to assistance from the program? Check out W. E. Deming's Red Bead Experiment to get your answer.
The programmer champion will be suspicious of any discrete-valued field whose data type is text. Any value that may be used in a computation, or in any other operation that requires a conversion, must be investigated carefully. Any value that may be used as a tag for identifying rolled-up aggregations, such as a store name, must get additional attention if we don't want to see separate quarterly sales for "Walgreens" and "Walgreen's" and "Wlagreens". The time to catch and repair these data quality errors is the very first time they are captured by a computer program. That makes the programmer responsible. Other roles have a duty to identify situations where these problems might arise, but only the programmer is positioned to do anything about it.
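One way a programmer champion might act on that suspicion is to validate tag values against a controlled list at the moment of capture. The store list and function below are hypothetical; the pattern is what matters.

```python
# Catch near-miss tag values at the point of capture, not in the quarterly report.
KNOWN_STORES = {"Walgreens", "CVS", "Rite Aid"}   # the controlled list is an assumption

def resolve_store(entered):
    """Accept only known store names; push anything else back to the person entering it."""
    candidate = entered.strip()
    if candidate in KNOWN_STORES:
        return candidate
    raise ValueError(f"Unknown store {candidate!r}. Choose one of: {sorted(KNOWN_STORES)}")

for name in ["Walgreens", "Walgreen's", "Wlagreens"]:
    try:
        print(f"{name!r} -> {resolve_store(name)}")
    except ValueError as err:
        print(err)
```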
I realize this is asking a lot. A programmer is only human and can't be expected to know everything (right?). This suggests another way in which the programmer can become a champion. Since it isn't possible for one person to know everything that must be known (hard though that may be to swallow), the programmer must develop enthusiasm for consultation and collaboration. Every role in your environment was created for a reason and each has its own goals and responsibilities. The programmer is accustomed to the data people coming with requests. The requests are nearly always framed in terms of something that the programmer should do to make the [modeler's, architect's, steward's...] life easier and improve overall quality.
It's easy to understand how this can get old in a hurry. The solution is for the programmer(s) to sit down with these other roles and get everyone's needs on the table. All of the other roles mentioned have a different view of data than you do and here's the thing--their view is much closer to that of the customer/user than yours is. You need each other.
Accept that you are a key member of a team and as such the team can't succeed without your commitment. The flip side is that you will not be able to enjoy the success you dream of without the commitment, skills and knowledge of the rest of the team. Be a Data Quality Champion--it's within your grasp.
Next we'll take a look at some forces that act to keep the team from being all they could be. Stay tuned for Disturbances in the Force.
Wednesday, April 8, 2009
Programmers Need Leadership
Many programmers are also musicians. Many have fluency in multiple spoken languages. Many have a gift for mathematics. It seems that these abilities are somehow related in the human brain. This has been apparent for decades to those who guide students into appropriate careers.
A bit of reflection reveals that these are all somewhat solitary pursuits involving individual dedication and a large degree of creativity. Of all of these career paths, software development may be the most accessible and the most remunerative. Pair this with the tendency of many socially gifted personalities to throw up their hands when confronted with technology or mathematics beyond what can be done on an adding machine, and with a corresponding inability or unwillingness to grasp qualitative differences in abstractions such as software, and you get a recipe for significant problems.
People who are intimidated by the technology of a personal computer or laptop are glad to grant credibility to the first person who can make the technology perform the desired tricks. The non-IT parts of the business have come to terms with this and simply compartmentalize I.T. so that as few as possible must have anything to do with those people.
Now add an additional dimension: the belief (common these days) that a good manager can manage anything. This may be true, but having worked with and come to know many developers (programmers), I can tell you that the credible manager is an absolute rarity in I.T. Just ask any developer (or network admin, server admin, DBA...). You'll find that a manager who enjoys the credibility and respect of the "troops" is the very definition of a rara avis.
The bottom line is that software developers and the associated technology disciplines comprise people who have had to figure things out for themselves and who know that their bosses don't understand what they do. The combination of these two (an ability and an awareness) produces people who have an approach to life that is similar to that of a cowboy or possibly a farmer. They are independent and like it that way. They won't turn from a challenge, even if that challenge is doing something that they believe is appropriate despite known management objections and even obstacles. They simply know that when it works, everything will be forgiven.
And it is.
I am disheartened when I follow discussion boards on the Internet. A provocative question is posed concerning methods and the discussion immediately goes to tools. Terms that have been around for decades are redefined without any acknowledgement of the accepted definition. Adding insult to injury, this is done even by data architects who, more than anyone else, should know better.
What I am seeing in these discussion boards is a playground full of gifted five-year-olds with absolutely no supervision. They are capable of amazing feats, but at what cost? If you are the CEO or the CIO of a company, do you have ANY idea what it costs--not in salary, but in uncontrolled complexity and corresponding maintenance costs--to allow this?
You can't blame the five-year-olds. They aren't the problem. They are doing exactly what they were put there to do. If you can't find managers who are capable of establishing some level of respect and control, then you must at least find leaders among the children and give them a mentor.
Even mathematicians respond to this. Remember the Manhattan Project? That project produced the atomic bomb that ended WWII. Dr. J. Robert Oppenheimer was the physicist who led the scientific work, and Gen. Leslie Groves of the Army Corps of Engineers was the administrative mentor. One mathematician, no matter how capable and creative, could not have developed the weapon in the time available. But 5,000 without leadership could not have done it either.
Data is available through the Software Engineering Institute at Carnegie Mellon University on the cost-benefit of a managed development team using defined processes. Once the processes have been defined (which takes leadership), the manager has only to believe in and rely on the processes in order to achieve predictable, low-cost results of known quality.
Whatever path you choose, it is you, the executive leader, who is responsible.