
Saturday, February 12, 2011

DQ: More Than Meets The Eye

As we move forward toward a view of data quality that allows us to create and use a language specific to DQ issues, descriptions and solutions, let’s take a minute here to examine the behavior of data.

Certainly, one of the attributes of quality data is that it is well-behaved. In other words, it consistently delivers value according to principles that apply because of its type, domain, range, relationships, maturity, purpose(s)…

It is useful at this point to differentiate between static and dynamic properties of data. Any DQL (data quality language) that we might define should work well where static properties are concerned. When we begin to consider dynamic properties, the task becomes much more complex. The greater the number of dynamic properties, the greater will be the complexity.

Our chances of designing a DQL will be significantly greater if we can restrict ourselves to static properties only. Before we can do that, we have to understand the dynamic properties and assess their relative importance. Can we carve them out of the discussion? Will excluding them compromise our DQL’s capabilities?

Looking back at the list in paragraph 2, the first three properties (type, domain and range) might be thought of as static. These are the focus of our modeling efforts or, if we only pretend to do modeling, of our programming efforts. There is a tangent here that we’ll resist for now, but at some point we have to come back to it. The question of how data is initially defined is huge, and the effect of that initial definition over the lifetime of a datum, and in particular on its quality, is not to be underestimated.

For now, though, we’ll put that on the back burner. We expect the individual pieces of data to possess a definition (usually called a description), and our DBMS requires that we say what kind of data it is. Is it a variable-length text string, a fixed number of characters, an integer, floating point, money, date/time, etc.? It is surprising how many data are defined to the DBMS as varchar. Or perhaps it shouldn’t be surprising, since all of our modeling tools allow us to set a default type and the default for the default is always varchar(n). This is popular because it guarantees that any value supplied will be accepted. Oops, another tangent almost sucked us in.
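To make the varchar point concrete, here is a minimal Python sketch of my own (the column names, formats and values are purely illustrative, not from any real schema): a typed column would reject each of these values, while a varchar(n) column accepts them all without complaint.

    from datetime import datetime

    # Hypothetical typed columns; each entry knows how to parse (and therefore reject) a value.
    TYPED_COLUMNS = {
        "admit_date": lambda v: datetime.strptime(v, "%Y-%m-%d"),   # date/time
        "length_of_stay": int,                                      # integer
        "total_charge": float,                                      # money, more or less
    }

    def check_row(row):
        """Return the columns whose values a typed definition would reject."""
        problems = []
        for column, parse in TYPED_COLUMNS.items():
            try:
                parse(row[column])
            except (ValueError, KeyError):
                problems.append(column)
        return problems

    # Every one of these values slides happily into a varchar(n) column.
    print(check_row({"admit_date": "02/12/2011",
                     "length_of_stay": "three",
                     "total_charge": "N/A"}))
    # -> ['admit_date', 'length_of_stay', 'total_charge']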

The final three items in the list are dynamic in the sense that their values can and will change, sometimes rapidly and usually unexpectedly. Let’s take the last first. Purpose, as “fit for…,” will change whenever we’re not paying attention. We hope that our stewards will be on top of this, but pragmatically (everyone likes pragmatism), they may be too close to the business itself, so that changing business needs or drivers loom so large that the defined purpose fades into insignificance.

Maturity is also dynamic. We expect maturity to change over time. When we think of data maturity (if we do) we include stability (of all the other properties), quality metrics that have flattened out, recognition within the enterprise and probably several other aspects.

Finally, we have to face relationships. We’re not very good at relationship management. Some of us wouldn’t recognize a relationship if it sent us a valentine. Others pile all sorts of unwarranted expectations on top of our relationships and then wonder where the quality has gone.

It all starts in the modeling phase. Chen, when he invented a graphical notation for describing data, gave equal weight to entities and relationships. Both had a two-dimensional symbol and the opportunity to possess attributes. For many reasons, not least perhaps that tool developers didn’t grasp the importance of the relationship, “data modeling” tools eventually reduced a multi-dimensional, real thing to a single one-dimensional line. That line is present at all only as a clue to the schema generator: copy the identifier from one of the linked entities into the attribute list of the other, label it as a foreign key, and let the database engine build an index.

Although I find examples are often counter-productive in the discussion of data quality, one example may illustrate the role of relationship in completing the semantics of a data set. PATIENT is such a common entity in the health care marketplace that no one even bothers to define it. It is a set of “demographics,” by which we mean the attributes, and it has a relationship with PHYSICIAN or PROVIDER. It probably also has relationships with VISIT or ADMISSION, ORDER, PROCEDURE, PRESCRIPTION, SPECIMEN and other entities of specific interest to the enterprise such as EDUCATION_SESSION, CLAIM…

It doesn’t take long to figure out that the relationship between patient and physician is more complex than can be accommodated by a single foreign key. A physician can “see” a patient, refer a patient, treat a patient, consult (with) a patient, admit a patient…the list goes on and on. Each of these relationships has real meaning, or semantic value, and may even be regulated by an outside body. Typically, each of them is implemented as nothing more than its own foreign key attribute.
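To see what gets lost, here is a small sketch of my own (the names and attributes are hypothetical, not the post’s data model). In the foreign-key style, the meaning of each physician involvement lives only in a column name; modeled as a record of its own, the relationship can carry attributes, including the outside body that regulates it, much as Chen’s notation allowed.

    from dataclasses import dataclass
    from typing import Optional

    # Foreign-key style: one column per kind of physician involvement.
    @dataclass
    class Patient:
        patient_id: str
        attending_physician_id: Optional[str] = None
        referring_physician_id: Optional[str] = None
        admitting_physician_id: Optional[str] = None

    # Relationship as a first-class record, free to carry its own attributes.
    @dataclass
    class PatientPhysicianRelationship:
        patient_id: str
        physician_id: str
        relationship_type: str              # "sees", "refers", "treats", "consults", "admits", ...
        effective_date: Optional[str] = None
        regulated_by: Optional[str] = None  # the outside body, if any, that governs this relationship

    rel = PatientPhysicianRelationship("P001", "DR042", "admits", "2011-02-12")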

Now, imagine a situation in which an in-utero procedure is scheduled on a fetus. You may be aware that transfusions, heart valve repair and a host of other medical procedures are actually performed on the fetus while it is still within the mother’s womb. So, who is the patient? If the facility also terminates pregnancies for any reason, you can see the conundrum. Medicine doesn’t allow for terminating the life of a patient (Dr. Kevorkian excepted). At the same time, we would sometimes like to treat the fetus as a patient, perhaps for reasons of safety. We also run into missing values for attributes that we may have viewed as mandatory, e.g., DOB, SSN.
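As a hedged sketch of that last point (the discriminator and the rule are mine, purely for illustration): a blanket “DOB and SSN are mandatory” check breaks for a fetal patient unless the rule itself is made conditional on what kind of patient the record describes.

    from dataclasses import dataclass
    from typing import List, Optional

    @dataclass
    class PatientRecord:
        patient_id: str
        patient_kind: str              # hypothetical discriminator: "person" or "fetus"
        dob: Optional[str] = None      # "mandatory"... until it isn't
        ssn: Optional[str] = None

    def missing_mandatory(rec: PatientRecord) -> List[str]:
        """Return the 'mandatory' attributes that are absent, given the kind of patient."""
        required = ["dob", "ssn"] if rec.patient_kind == "person" else []
        return [name for name in required if getattr(rec, name) is None]

    print(missing_mandatory(PatientRecord("P002", "fetus")))   # [] -- nothing sensible to require
    print(missing_mandatory(PatientRecord("P003", "person")))  # ['dob', 'ssn']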

It is only when we explicitly talk about relationships that these issues emerge. Relationships cast light on the entity from all angles.

Relationships also represent the business processes that inform the purpose of the data. Often, undocumented meaning gets attached to data. Two analysts will get together and agree that for the purpose of this analytic, this combination of attribute values will be included (or excluded). For a given ETL job, we decide that an attribute value that isn’t on the approved list will be replaced with “&”. The adjustments to business processes are constant and usually undocumented and unnoticed. Until we can point to a documented process/relationship, we have no way of capturing and dealing with changes.
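A minimal sketch of the kind of rule being described; the “&” sentinel comes from the paragraph above, while the approved list and the field it applies to are assumptions of mine.

    # Hypothetical ETL step: values not on the approved list are silently replaced with "&".
    APPROVED_VALUES = {"INPATIENT", "OUTPATIENT", "EMERGENCY"}   # illustrative list only

    def scrub(value: str) -> str:
        """Replace any value that is not on the approved list with the sentinel '&'."""
        return value if value in APPROVED_VALUES else "&"

    incoming = ["INPATIENT", "OBSERVATION", "OUTPATIENT", "obs"]
    print([scrub(v) for v in incoming])   # ['INPATIENT', '&', 'OUTPATIENT', '&']

Nothing in the data itself records that this substitution ever happened; the rule lives only in the job.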

What’s the difference between an association and a relationship? Somewhere in there we’ll find clues about dynamic quality properties. One thing leaps out as a property of both quality and relationship: expectation. When we claim that something has quality, we establish an environment in which certain kinds of expectations are permitted. The same is true of relationship. When two parties or entities enter into a relationship, they agree on the expectations they will have of each other.

In our quest to define quality for data, we will be forced to document expectations and to monitor accountability with respect to those expectations.
