Data Quality and Real World Things

This article started out as a white paper for a client on how to conceptualize data quality, or the lack thereof. At the time I could not find a taxonomy of the characteristics that make data poor in quality. It always seemed like something that “you just knew when you saw it,” but no objective standards that I knew of existed for what constituted low-quality data. I found there were two classes of poor quality data: incompleteness and inaccuracy. My thinking was that if I could define the ways data can deviate from “good” or “acceptable” quality, then we could construct solutions to correct each of those conditions.

The article “Data Quality and Real World Things” was first published in November 2006 on the DMReview.com website which is no longer available.

Overview

In general, data represents real-world things and is described by metadata. It needs to be “fit for its purpose”. Data that is not is considered invalid, or of low quality, and needs to be either cleansed or discarded. Uncleansed data can be seen as being incomplete, inaccurate, or both. Several conditions, each describable in terms of incompleteness or inaccuracy or both, constitute low quality data. These conditions include:

  • Inappropriate data – the metadata inaccurately describes the data.
  • Unknown data – the metadata describing the data is wholly or partially incomplete.
  • Inconsistent data – (a) a reference in the data to some other data in the system is either incomplete or inaccurate, or (b) data representing the same exact thing contains different values.
  • Duplicated data – a thing is unnecessarily represented more than once in the data.
  • Missing data – a required value in the data is incomplete, partially or wholly.
  • Wrong data – the values in the data inaccurately measure or describe the thing.

There are undoubtedly many other ways to express why certain data is not fit for its purpose[1], but I have found that these six types pretty well encompass what can be wrong with data and make it either wholly or partially (which can sometimes be worse than wholly) unfit for the purpose to which you want to put it.

Each type of poor quality data needs to be addressed in a different way. This is not to say that a complete methodology with its own framework, reference model, plan and templates is necessary for each type, but the root cause of each condition, and what can be done to correct it, will vary. Any unified approach to Data Quality (DQ) needs to take into account the different types of poor quality data.

Let’s take another look at the six types of poor data quality listed above. You will notice that they include problems found both in the content of the data and in the content of the metadata that describes the data. In many instances there is far less difference between data and metadata than contemporary data architecture tends to acknowledge. Essentially, metadata is data. Data, as the first sentence of this article states, represents real-world things. Data is an abstraction of certain aspects, or characteristics, of real-world things. It was invented (long before computers) to communicate information about real-world things. Metadata is an abstraction of certain aspects of data, primarily to turn the data into information so that consumers (human and otherwise) can extract “meaning” from it and use it to record transactions or in a decision-making context. As more and more real-world things themselves become abstract, the distinction between data and metadata tends to blur.

Inappropriate data

Data is inappropriate for its intended use when the metadata associated with it inaccurately describes it. This leads to a lack of confidence in the data by its consumers. For example, if a data element is defined as ‘Customer Home Address’ but contains the string “John Smith”, which is most likely a name, which do you trust? The metadata (‘Customer Home Address’), the data content (“John Smith”), or neither? At least one of them is inappropriate for the purpose at hand. If you assume the metadata is correct, then the data is inappropriate. If you assume the data is correct, then the metadata is inappropriate. Either way, the data as a whole, metadata and content alike, is of low quality.

It is possible to measure the extent of inappropriate data, for example, by counting the number of records that exhibit this anomaly. But how do you know which is inappropriate, the metadata or the data content? Unless you have a definitive answer, or reliable heuristics from data profiling or some source of domain knowledge outside of the DQ process itself, you do not know.
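As a sketch of how such a measurement might be automated, the check below counts records whose content fails a pattern implied by the metadata. The field name and the address pattern are illustrative assumptions, not from the original article:

```python
import re

# Crude illustrative pattern: an address starts with a street number
# followed by a word. Real profiling rules would be far richer.
ADDRESS_PATTERN = re.compile(r"\d+\s+\w+")

def count_inappropriate(rows, field, pattern):
    """Return how many rows have content that fails the metadata's pattern."""
    return sum(1 for row in rows if not pattern.search(row.get(field, "")))

rows = [
    {"customer_home_address": "123 Main Street"},
    {"customer_home_address": "John Smith"},     # a name, not an address
    {"customer_home_address": "45 Elm Avenue"},
]
bad = count_inappropriate(rows, "customer_home_address", ADDRESS_PATTERN)
```

A count like this tells you the scale of the anomaly; as the text notes, it does not tell you whether the metadata or the content is at fault.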

Any solution to the inappropriate data problem must focus on validating that the metadata is accurate, that it is appropriate for the purpose to which it is being put, and that the data content then represents instances of that metadata.

Unknown data

Unknown data differs from inappropriate data in that at least some of the necessary metadata is simply not there.  This seems like a metadata problem at first.  But looking at it from a perspective of what can be trusted, it may also be a data content problem.  If you trust that the metadata is complete and accurate, whether or not it looks like it is, then the problem may be a data content problem.  It might be that you have data that is irrelevant.  But without any other information, it is at best unknown data.  If you trust the data content then you may need to produce the metadata by reverse engineering it in order to describe it.

A solution for the unknown data problem focuses first on determining whether it is a metadata or a data content problem, and then on either discarding the irrelevant data or deriving accurate definitions for the data you have.
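One way to begin deriving definitions from trusted content is crude type inference over a column’s values. The sketch below is an illustrative assumption of how such reverse engineering might start, not a complete profiling tool:

```python
def infer_type(values):
    """Guess a column's type from its values -- a first step toward
    reverse-engineering missing metadata from trusted content."""
    def is_int(v):
        try:
            int(v)
            return True
        except ValueError:
            return False

    def is_float(v):
        try:
            float(v)
            return True
        except ValueError:
            return False

    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "unknown"            # no evidence either way
    if all(is_int(v) for v in non_empty):
        return "integer"
    if all(is_float(v) for v in non_empty):
        return "decimal"
    return "text"
```

Running it over a mystery column gives a starting definition, e.g. `infer_type(["12", "7", "305"])` suggests an integer column.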

Inconsistent data

Inconsistent data problems break down more cleanly between metadata and data content than either inappropriate data or unknown data. With inconsistent data, either a reference in the metadata is inaccurate, incomplete, or both; or the data content significantly differs between what should be instances of the same data content. Each of these calls for a different solution, but both are indicative of inconsistency in the body of data and thus a cause for distrust among the data’s consumers.

Metadata inconsistency, that is, references and definitions that are either incomplete or inaccurate, is the more insidious and difficult to find and correct. It is frequently caused by unreconciled semantic differences, the old “impedance mismatch” problem: one party calls it a “Car” and the other party calls it an “Automobile”. Now, we as humans know that in almost all cases cars and automobiles are the same type of thing. We know this because we have agreed on the metadata definition of which both “Car” and “Automobile” are instances[2]. In this case they are the data content of a more abstract thing. But wait a minute, “Car” and “Automobile” were examples of inconsistent metadata, weren’t they? Well yes, in the original context they were used as such. However, to reconcile them with each other we needed an even more abstract concept. We needed metadata for our metadata, in other words, meta-metadata. So the only reliable solution for inconsistent metadata is to raise the abstraction stakes one level and create consistent meta-metadata. Then the problem becomes a data content consistency problem and no longer a metadata problem.
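A minimal sketch of such meta-metadata is a mapping table that resolves inconsistent local terms to one shared concept. The concept name and synonyms below are illustrative assumptions:

```python
# Meta-metadata: local metadata terms mapped onto a single shared
# concept. Once this table exists, reconciling "Car" vs "Automobile"
# is a data content lookup rather than a metadata argument.
CONCEPT_MAP = {
    "car": "motor_vehicle",
    "automobile": "motor_vehicle",
    "auto": "motor_vehicle",
}

def reconcile_term(term):
    """Resolve a local metadata term to its shared concept, or None."""
    return CONCEPT_MAP.get(term.strip().lower())
```

With this in place, `reconcile_term("Car")` and `reconcile_term("Automobile")` resolve to the same abstract concept, and unmapped terms surface as `None` for a steward to classify.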

The solution to inconsistent data content, once the inconsistent metadata is no longer a problem, becomes one of identifying the “system of record”. This is not a problem to be taken lightly, but it is more of a business problem than a technical one. Of course the system of record may exhibit data quality problems of any of the types mentioned here but, in general, if data quality issues can be kept to those that typically plague data content, they are far easier to solve than those plaguing metadata.

Duplicated data

Not all duplicated data is bad. The operative phrase here is “unnecessarily duplicated”. Data that is duplicated because it is de-normalized for performance or security purposes is not necessarily of low quality (though it may be of poor quality for other reasons). As a matter of fact, de-normalizing the data, though it is a risky undertaking, may make it more “fit for purpose” when the purpose is narrowly defined in terms of speed of access to specific data sets. Also, from a network or enterprise point of view, data can be replicated across multiple systems, typically for performance, reporting or security purposes.

Where duplicated data is bad is when it is duplicated for no business or technical reason. This can happen as a result of poor or no entity integrity,[3] which is primarily the result of a lack of appropriate uniqueness constraints on the source data. It can also happen because of poor or lacking data stewardship. When two (or more) records are exactly the same, selecting any one of them will suffice; it is when data records are not quite perfectly duplicated that things become more problematic. This often happens because poor data stewardship has either allowed data content to be partially duplicated or allowed the duplicates to age and become out of sync with their master copy.
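The imperfect duplicates described above can often be caught by normalizing records before comparing them. This sketch, with hypothetical field names, groups records whose normalized forms collide:

```python
def normalize(record):
    """Canonical form for duplicate matching: lowercase and collapse
    whitespace. The matching fields chosen here are illustrative."""
    return tuple(
        " ".join(str(record.get(f, "")).lower().split())
        for f in ("name", "city")
    )

def find_duplicates(records):
    """Group records whose normalized forms collide -- this catches the
    'not quite perfectly duplicated' rows, not just exact copies."""
    seen = {}
    for rec in records:
        seen.setdefault(normalize(rec), []).append(rec)
    return [group for group in seen.values() if len(group) > 1]

records = [
    {"name": "John  Smith", "city": "Boston"},
    {"name": "john smith",  "city": "boston"},   # near-duplicate
    {"name": "Jane Doe",    "city": "Austin"},
]
dupes = find_duplicates(records)
```

Real master-data tools use fuzzier matching (phonetic codes, edit distance), but the principle is the same: compare canonical forms, then let a steward decide which survivor to keep.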

Missing data

Missing data may be the most common type of data quality infraction. It is quite often the result of data being allowed to enter a source system without constraints, domain integrity or default values. Allowing nulls in data can increase its flexibility and make it easier to make progress early in a project, but how many times have you defined a column as nullable, saying to yourself, “We’ll get back to this detail later; right now we need to finish the interface”? The interface gets finished and the domain integrity issue never gets addressed again. After all, it will get corrected upstream. Right? Maybe not.

Missing data can also be one of the most time-consuming poor data quality problems to fix. This is primarily because you do not always know whether the data was supposed to be there or not. In other words, is it missing data or is it merely absent data? That is, perhaps the correct value for a data instance really is “unknown”. But at least missing data is a data content error and not a metadata error. If the metadata is missing then we have unknown data, and that is another story (see above). The best way to handle missing data is to agree upon a missing data strategy, publish the strategy and apply it consistently.
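Such a strategy might be published as a per-field rule set stating whether an empty value is an error, receives a default, or is a legitimate “unknown”. The field names and rules below are illustrative assumptions:

```python
# A published missing-data strategy, expressed as per-field rules.
# "reject": the value is required and its absence is an error.
# "default": substitute an agreed default value.
# "unknown": absence is acceptable; record it explicitly.
MISSING_RULES = {
    "order_id":    {"action": "reject"},
    "quantity":    {"action": "default", "value": 1},
    "middle_name": {"action": "unknown"},
}

def apply_missing_rules(record):
    """Apply the strategy; return (cleaned_record, errors)."""
    cleaned, errors = dict(record), []
    for field, rule in MISSING_RULES.items():
        if cleaned.get(field) in (None, ""):
            if rule["action"] == "reject":
                errors.append(f"{field} is required but missing")
            elif rule["action"] == "default":
                cleaned[field] = rule["value"]
            else:  # "unknown": the value is genuinely, acceptably absent
                cleaned[field] = "UNKNOWN"
    return cleaned, errors

rec, errs = apply_missing_rules({"order_id": "A-17", "quantity": ""})
```

The point is less the code than the discipline: the rules are written down once, applied everywhere, and distinguish truly missing data from legitimately absent data.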

Wrong data

Wrong data is just that: data that has the wrong content. The metadata may be accurate and complete, but the data content, that is, the values in the data itself, is inaccurate. For example, if the number of items sold was actually 20 but the data recorded is 10, the data is wrong. This is probably the most difficult data quality issue to correct, or even to discover, because it requires knowledge usually outside of what is typically considered the art and science of data quality. However, data profiling can help. If enough data is profiled that the confidence level in the discovered content patterns is sufficiently high, then outliers encountered during the inspection or transformation process can raise an exception to alert users that something is wrong. Whether that something is caused by a system anomaly, by negligence or by a lack of process control can be irrelevant, in and of itself, to the DQ process. A statistical quality control process can be used to spot and, in many cases, correct wrong data.
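A simple form of such statistical control derives limits from a profiled baseline and flags new values that fall outside them. The baseline figures below are illustrative assumptions:

```python
import statistics

def control_limits(baseline, sigmas=3):
    """Derive lower/upper control limits (mean +/- sigmas * stdev)
    from a profiled baseline of trusted historical values."""
    mean = statistics.mean(baseline)
    stdev = statistics.pstdev(baseline)
    return mean - sigmas * stdev, mean + sigmas * stdev

def flag_wrong(values, limits):
    """Return incoming values outside the control limits -- candidates
    for the 'wrong data' exception the text describes."""
    lo, hi = limits
    return [v for v in values if not (lo <= v <= hi)]

# Hypothetical units-sold history used to set the limits.
baseline = [18, 20, 21, 19, 22, 20, 21, 19, 20, 21]
limits = control_limits(baseline)
suspect = flag_wrong([20, 19, 500, 21], limits)
```

Deriving the limits from a trusted baseline, rather than from the batch under inspection, matters: a gross error in the incoming batch would otherwise inflate the standard deviation enough to hide itself.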

Also important in preventing wrong data is a sound data auditing function, and even a good old-fashioned check-and-balance system whereby data is entered into the staging area from more than one source. The subject data is checked against what should be the same data, either on another channel or in a source outside the auspices of the project. Only when data from all the sources match, or can be reconciled, is a single “audited” copy of the data content persisted.
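That check-and-balance idea can be sketched as: persist a single audited copy only when every channel reports the same value. The channel names below are hypothetical:

```python
def reconcile(sources):
    """Given channel -> value for what should be the same datum, return
    (value, True) when all channels agree, else (None, False) to hold
    the record back for manual reconciliation."""
    values = set(sources.values())
    if len(values) == 1:
        return values.pop(), True
    return None, False

value, ok = reconcile({"edi_feed": "20", "web_portal": "20"})
bad_value, bad_ok = reconcile({"edi_feed": "20", "web_portal": "10"})
```

A production version would also record which channel disagreed and route mismatches to a steward, but the gate is the same: no agreement, no audited copy.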

Conclusion

As you can see, there are subtle differences between the types of poor quality data. In many cases, if you can profile a sufficient sample of the data for a sufficient amount of time, many types of poor data quality can be mitigated. However, data profiling is a relatively new technique within data transformation processes; it does not have mindshare outside the sometimes cloistered world of data professionals, and it is often difficult to sell to management as a necessary component of a Total Data Quality program. But if you believe, as I do, that understanding a problem is at least fifty percent of the battle to solve it, then including a data profiling step in your DQ efforts is always worthwhile.

Table 1 below summarizes the six conditions described above that can produce data of low quality, including whether each condition primarily affects the metadata, the data content or both, and the appropriate strategy for mitigating it.

| Data Condition | Condition Description | Type of Condition | Metadata or Data Content Affected? | Mitigation Strategy |
| --- | --- | --- | --- | --- |
| Inappropriate data | The metadata inaccurately describes the data. | Inaccuracy | Both metadata and data content | Profile data for appropriateness between metadata and content, and use domain knowledge to validate. |
| Unknown data | The metadata describing the data is wholly or partially incomplete. | Incompleteness | Both metadata and data content | If metadata is trusted, discard irrelevant data content; if data content is trusted, derive new metadata. |
| Inconsistent data | (a) A reference in the data to some other data in the system is either incomplete or inaccurate. | Inaccuracy or Incompleteness | Both metadata and data content | Create consistent meta-metadata and then treat the metadata as data content. |
| | (b) Data representing the same exact thing contains different values. | Inaccuracy | Data content | Identify the system of record on an element-by-element basis. |
| Duplicated data | A thing is unnecessarily represented more than once in the data. | Inaccuracy | Both metadata and data content | Rationalize unnecessary duplication of both metadata and data content; improve data stewardship. |
| Missing data | A required value in the data is incomplete, partially or wholly. | Incompleteness | Data content | Agree upon a missing data strategy, publish the strategy and apply it consistently. |
| Wrong data | The values in the data inaccurately measure or describe the thing. | Inaccuracy | Data content | Create a data auditing function to reconcile content against other sources. |

Table 1. Summary of low quality data conditions.


[1] For another taxonomy of data quality conditions see the views of Larry English at http://www.infoimpact.com/.

[2] This is if we speak the same language of course.

[3] There are three types of data integrity: Entity integrity, Referential integrity and Domain integrity, all of which, when ignored, contribute to poor data quality.
