
Monday, February 14, 2011

Chapter One

This will be chapter one of the eventual book. Some things have to be laid out very explicitly. No one should ever be able to complain that they didn't get their money's worth or that they thought they were getting something different.

Some basic principles that will govern everything else that is said:
  • Data is part of language
  • All communication about data is, itself, data
  • Language consists of denoted meaning (denotation) and connoted meaning (connotation); on top of these are layered implication and inference, which involve human perspective
  • Nothing about language guarantees communication
  • Communication requires a minimum of two entities from the following set [human, machine, logical construct (e.g., software)]

If we are to get a grasp on data quality, we have the best chance of success if we restrict our discussion to ONLY that data that is part of communication between machines or between machine and logical construct. Of course this is neither exciting nor even very useful. We have many specification formats (Ethernet, etc.) that guarantee that communication on some level will take place between machines and logical constructs. ASCII and EBCDIC are among the most basic of communication specifications. It doesn't take very long for the alert observer to notice that unless a human is involved somewhere in the process, it doesn't really matter what the communication is.

"Matter" implies human involvement or at least we can infer human involvement from a statement that something does or does not matter. Matter is a value judgment couched in an emotional context. It's only when we start to peel back the layers, asking why or in what sense something matters that we begin to get to the idea of quality. Our exploration, then, will follow the trail of what matters.

As we follow this trail we're going to encounter the idea that what matters is, in many ways, distinct to the judge. What matters to the reader of a graphic novel (formerly the comic book) may not be the same things that matter to a reader of War and Peace or a viewer of the Mona Lisa. What matters to someone watching Wile E. Coyote fail in yet another attempt at catching the Road Runner is not the same as what matters to someone watching Being John Malkovich or Inception.

How then do we determine whose perspective to assume? Whose view matters?

The answer, of course, is that, where communication is concerned, everyone's perspective matters. It would be a great feat of communication if we could present a context-free (perspective-free) discussion of data quality. In fact, it would be such a feat that we're not likely to ever see it, and it certainly won't happen here. Our intent is to zero in (or home in but NOT hone in) on a very small number of perspectives to see what matters to them and then step back to see if there are any common themes that can be exploited. If we are successful in that, we may have created a springboard for the one who comes after.

In the next chapter we will nominate some key perspectives and begin to investigate what matters to them.

Saturday, February 12, 2011

DQ: More Than Meets The Eye

As we move forward toward a view of data quality that allows us to create and use a language specific to DQ issues, descriptions and solutions, let’s take a minute here to examine the behavior of data.

Certainly, one of the attributes of quality data is that it is well-behaved. In other words it consistently delivers value according to principles that are applicable because of its type, domain, range, relationships, maturity, purpose(s)…

It is useful at this point to differentiate between static and dynamic properties of data. Any DQL (data quality language) that we might define should work well where static properties are concerned. When we begin to consider dynamic properties, the task becomes much more complex. The greater the number of dynamic properties, the greater will be the complexity.

Our chances of designing a DQL will be significantly greater if we can restrict ourselves to static properties only. Before we can do that, we have to understand the dynamic properties and assess their relative importance. Can we carve them out of the discussion? Will excluding them compromise our DQL’s capabilities?
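To make the static side concrete, here is a minimal sketch in Python. The record layout, code set and range are invented for illustration; this is not a proposed DQL syntax, just the flavor of checks that static properties (type, domain, range) permit.

```python
# A minimal sketch of static-property checks: type, domain and range.
# Field names, the code set and the range are hypothetical illustrations.
from datetime import date

def check_static_properties(record: dict) -> list[str]:
    """Return a list of violations found in a single record."""
    violations = []

    # Type: admission_date must actually be a date, not a string that looks like one
    if not isinstance(record.get("admission_date"), date):
        violations.append("admission_date is not a date")

    # Domain: sex must come from an agreed code set
    if record.get("sex") not in {"F", "M", "U"}:
        violations.append("sex is outside the agreed domain {F, M, U}")

    # Range: length_of_stay must be a non-negative number of days
    los = record.get("length_of_stay")
    if not isinstance(los, int) or los < 0:
        violations.append("length_of_stay is not a non-negative integer")

    return violations

# Example record with one violation in each category
print(check_static_properties(
    {"admission_date": "2011-02-07", "sex": "X", "length_of_stay": -3}
))
```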

Looking back at the list in paragraph 2, the first three properties might be thought of as static. These are the focus of our modeling efforts or, if we only pretend to do modeling, of our programming efforts. There is a tangent here that we’ll resist for now, but at some point we have to come back to it. The question of how data is initially defined is huge and the effect of initial definition on the lifetime of a datum and in particular on its quality is not to be underestimated.

For now, though, we'll put that on the back burner. We expect the individual pieces of data to possess a definition (usually called a description), and our DBMS requires that we say what kind of data it is. Is it a variable-length text string, a fixed number of characters, an integer, floating point, money, date/time? It is surprising how many data are defined to the DBMS as varchar. It shouldn't be surprising, since all of our modeling tools allow us to set a default type and the default for the default is always varchar(n). This is popular because it guarantees that any value supplied will be accepted. Oops, another tangent almost sucked us in.
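Since we're on the subject, here is a small sketch of what saying what kind of data it is can look like in practice. The table and column names are invented, and SQLAlchemy is used only as a convenient way to show declared types; the point is the types, not the tool.

```python
# A sketch with hypothetical table and column names. Declaring Date and
# Numeric gives the DBMS something to enforce; defining everything as
# varchar guarantees only that whatever shows up will be stored.
from sqlalchemy import Column, Date, Integer, Numeric, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Admission(Base):
    __tablename__ = "admission"

    admission_id = Column(Integer, primary_key=True)
    admitted_on = Column(Date, nullable=False)   # a real date, not varchar
    total_charges = Column(Numeric(12, 2))       # money-like, not varchar
    notes = Column(String(2000))                 # varchar where it actually belongs
```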

The final three items in the list are dynamic in the sense that their values can and will change, sometimes rapidly and usually unexpectedly. Let’s take the last first. Purpose, as “fit for…,” will change whenever we’re not paying attention. We hope that our stewards will be on top of this but pragmatically (everyone likes pragmatism), they may be too close to the business itself so that changing business needs or drivers loom so large that defined purpose fades to insignificance.

Maturity is also dynamic. We expect maturity to change over time. When we think of data maturity (if we do) we include stability (of all the other properties), quality metrics that have flattened out, recognition within the enterprise and probably several other aspects.

Finally, we have to face relationships. We're not very good at relationship management. Some of us wouldn't recognize a relationship if it sent us a valentine. Others pile all sorts of unwarranted expectations on top of our relationships and then wonder where the quality has gone.

It all starts in the modeling phase. Chen, when he invented a graphical notation for describing data, gave equal weight to entities and relationships. Both had a two-dimensional symbol and the opportunity to possess attributes. For many reasons, not least perhaps that tool developers didn't grasp the importance of relationship, "data modeling" tools eventually turned a multi-dimensional, real thing into a single one-dimensional line that is present only as a clue to the schema generator to copy the identifier from one of the linked entities into the attribute list of the other and label it as a foreign key so that the database engine can build an index.

Although I find examples are often counter-productive in the discussion of data quality, one example may illustrate the role of relationship in completing the semantics of a data set. PATIENT is such a common entity in the health care marketplace that no one even bothers to define it. It is a set of "demographics," by which we mean the attributes, and it has relationship with PHYSICIAN or PROVIDER. It probably also has relationship with VISIT or ADMISSION, ORDER, PROCEDURE, PRESCRIPTION, SPECIMEN and other entities of specific interest to the enterprise such as EDUCATION_SESSION, CLAIM…

It doesn't take long to figure out that the relationship between patient and physician is more complex than can be accommodated by a single foreign key. A physician can "see" a patient, refer a patient, treat a patient, consult (with) a patient, admit a patient…the list goes on and on. Each of these relationships has real meaning or semantic value and may even be regulated by an outside body. Typically, these are implemented by a single foreign key attribute for each.
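To make that concrete, here is a rough sketch, with invented names, of promoting the patient-physician relationship to an entity in its own right, so that each kind of relationship ("sees," "refers," "admits"…) carries its own meaning instead of collapsing into a single foreign key.

```python
# A sketch only: names, types and attributes are illustrative, not a
# proposed health-care model.
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Optional

class RelationshipType(Enum):
    SEES = "sees"
    REFERS = "refers"
    TREATS = "treats"
    CONSULTS = "consults"
    ADMITS = "admits"

@dataclass
class PatientPhysicianRelationship:
    patient_id: int
    physician_id: int
    relationship_type: RelationshipType
    effective_date: date
    regulated_by: Optional[str] = None  # e.g., an outside body that governs this relationship

# One patient and one physician can be related in several distinct,
# individually meaningful ways.
relationships = [
    PatientPhysicianRelationship(101, 9, RelationshipType.REFERS, date(2011, 2, 1)),
    PatientPhysicianRelationship(101, 9, RelationshipType.ADMITS, date(2011, 2, 5)),
]
```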

Now, imagine a situation in which an in-utero procedure is scheduled on a fetus. You may be aware that transfusions, heart valve repair and a host of other medical procedures are actually performed on the fetus while it is still within the mother's womb. So, who is a patient? If the facility also terminates pregnancies for any reason, you can see the conundrum. Medicine doesn't allow for terminating the life of a patient (Dr. Kevorkian excepted). At the same time, we would sometimes like to treat the fetus as a patient, perhaps for reasons of safety. We also experience the lack of values for attributes that we may have viewed as mandatory, e.g., DOB, SSN.

It is only when we explicitly talk about relationships that these issues emerge. Relationships cast light on the entity from all angles.

Relationships also represent the business processes that inform the purpose of the data. Often, undocumented meaning gets attached to data. Two analysts will get together and agree that for the purpose of this analytic, this combination of attribute values will be included (or excluded). For a given ETL job, we decide that an attribute value that isn’t on the approved list will be replaced with “&”. The adjustments to business processes are constant and usually undocumented and unnoticed. Until we can point to a documented process/relationship, we have no way of capturing and dealing with changes.
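Here is a minimal sketch of the kind of undocumented rule just described. The field name and approved list are made up, but the pattern, and the quiet loss of meaning, should look familiar.

```python
# Hypothetical approved list for a hypothetical attribute; the substitution
# rule below exists only in this code, which is exactly the problem.
APPROVED_DISCHARGE_CODES = {"HOME", "SNF", "EXPIRED", "TRANSFER"}

def scrub_discharge_code(value: str) -> str:
    # Any value not on the approved list is silently replaced with "&".
    return value if value in APPROVED_DISCHARGE_CODES else "&"

print(scrub_discharge_code("HOME"))     # HOME
print(scrub_discharge_code("HOSPICE"))  # & -- the original meaning is quietly lost
```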

What’s the difference between an association and a relationship? Somewhere in there we’ll find clues about dynamic quality properties. One thing leaps out as a property of quality and a property of relationship—expectation. When we claim that something has quality, we establish an environment in which it is permitted to have certain kinds of expectations. The same is true of relationship. When two parties or entities enter into relationship they agree as to the expectations they will have of each other.

In our quest to define quality for data, we will be forced to document expectations and to monitor accountability with respect to those expectations.
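One closing sketch, with invented expectation names and sample records, of what it might look like to write expectations down and monitor them rather than merely assume them.

```python
# A sketch of treating expectations as documented, checkable artifacts.
# The expectations, thresholds and records are invented for illustration.
from typing import Callable

expectations: dict[str, Callable[[dict], bool]] = {
    "DOB is populated": lambda r: r.get("dob") is not None,
    "SSN is populated or patient is a fetus":
        lambda r: r.get("ssn") is not None or r.get("is_fetus", False),
}

def monitor(records: list[dict]) -> dict[str, float]:
    """Return, for each documented expectation, the fraction of records meeting it."""
    results = {}
    for name, rule in expectations.items():
        met = sum(1 for r in records if rule(r))
        results[name] = met / len(records) if records else 1.0
    return results

print(monitor([
    {"dob": "1980-05-01", "ssn": "000-00-0000"},
    {"dob": None, "is_fetus": True},
]))
```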

Monday, February 7, 2011

Data, Governance and Data Governance

There have been some great discussion threads on the IAIDQ LinkedIn group recently. One thread that attracted a lot of attention started with a question from a PhD candidate in Data Quality. It simply asked whether there is an accepted definition of data quality. 200 replies later, most people would say, "No." More recently a thread began by bemoaning the fact that there is no accepted definition of Data Governance. A lively discussion followed that continues even now. Yet another refers to an article giving five reasons to cleanse data downstream instead of preventing problems upstream.

I am about to shed some light, however feeble, on the subject. Allow me to start by admitting that I am a person who likes to do the analysis necessary to solve a problem. Though my patience is improving with practice, those who know me will back me up when I say that I want to solve a problem ONCE.

Previous posts here have explained the abstract nature of data and all that implies in terms of getting people on board when it comes to doing something about quality. People will listen to or read a horror story about some preventable data issue that cost Company ABC $umpteen million. They will nod sagely and say something like, "They should have seen that coming." They are simply unable to see that their own company is engaged in the exact same practices and completely at risk for the $umpteen million.

Friends, it isn't just our favorite whipping boy, Management. There is no more recognition within IT than there is in the boardroom. Our boxes-and-wires friends think of data in terms of DASD and RAID configurations or bandwidth and throughput. Our developer pals don't really think of data at all except as the fuel that activates their code. Architects appear to be concerned with the storage and throughput views overlaid with an access management filter. They seem more concerned with making developers and DBAs happy than with the quality of the asset.

Enter Data Governance, which in most instances wants to be about definitions, rules and "enforcement." Often Data Governance tries to heap another thick layer, called metadata, on top of all the data that is already being mismanaged in the organization. It's often the case that Data Governance fails to practice what it preaches.

Here's the revelation: Data Governance isn't about data. Data Governance is about process. It is the means to the Data Quality end. I have already said that Data Governance is that part of corporate governance that is dedicated to stewarding the corporation's data asset. It is exactly analogous to the role of Finance/Accounting with respect to the capital asset. Unfortunately, Finance has two things going for it that Data Governance doesn't have: GAAP and audits.

Generally Accepted Accounting Principles are a set of guidelines for money management processes that are, as the name implies, accepted and USED nationally and internationally. The use of these practices ensures that processes will be auditable. The audit process verifies that GAAP was followed and, if there were exceptions, that they were clearly noted with enough information to allow the results to be brought back into alignment with GAAP. The underlying theme is that if the processes were sound then the result is believable.

Imagine if every company of any size whatsoever were able to devise and use its own bookkeeping structure and process. There could never be a stock market. Equity trading would be too risky for anyone and all businesses would essentially be sole proprietorships. Moreover, there would be no chance of oversight by outside bodies (government).

This is a picture of the situation with respect to data today. When will it get better? Data Governance has no power to make the situation better. Without an externally defined data management framework and periodic audits by independent auditors, there will be no improvement. In the meantime, if data quality metrics improve, it's only because some particularly strong and charismatic personality is present.

No one questions the need for accounting or the rigor of accounting procedures. Actually, the same can be said for data governance and data management procedures. The difference is that in the case of money, the lack of question results in compliance, while in the case of data it results in apathy or confusion.

Does the data world have something like GAAP that could become the necessary process infrastructure to support data management audits? I don't see it. Data is still too personal, too subjective, too misunderstood to attract the attention of researchers. Data management is a black box to virtually everyone and they like it that way.

People prefer to cleanse downstream data because their customers feel their pain being relieved. Happy customers are the goal, after all. The bonus is that cleansing provides an unending source of employment for those doing the cleansing. It's win-win! People aren't going to be highly motivated to change a win-win scenario any time soon.