Archive for the ‘Data Quality’ Category

Cross-mapping Business Data Elements with Physical Data Structures

September 12, 2018

Cross-mapping[i] business data elements (BDEs) from requirements or business glossaries all the way through to the physical data structures that store these data elements can be tricky.  This process is made easier if a logical data model (LDM) is implemented between the business side and the technology model upon which the physical implementation is based.

In this article I will show two styles of cross-mapping between the BDEs of the business glossary and the data storage structures defined in the physical data model (PDM).  The two styles are called “Column Style” mapping and “Cell Style” mapping.

Column Style can apply to the cross-mapping of BDEs with both base tables[ii] and reference tables.  Cell Style mapping takes Column Style a step further for reference data and maps the individual values of a reference BDE to individual cells, or row column coordinates, in the database.  If each reference value needs to be implemented in the technology separately a Cell Style mapping is needed.

Also some BDEs are atomic in that there is no inherent structure in the business data element.  I call these Simple BDEs.  Some BDEs have an inherent structure which depends on the scope of their business definition.  These structures are often in the form of a container and its contents.  I call these Compound BDEs.

Some examples of each style follow.  Figure 1 shows an example of a Column Style mapping of a Simple BDE[iii] to a single Column of a single base Table in the database.  Using the terms “Table” and “Column” gives the impression that these mapping styles apply only to relational databases.  This is not true.  The concepts here can apply to “NoSQL” databases as well, such as Key-Value Stores, Column Family Databases and Document Databases.  These types of databases have been referred to as Aggregate-Oriented Databases.[iv]

Figure 1.

Figure 2 shows a Column Style mapping of a Simple BDE to a single Column of a single reference Table in the database.  Note here that because it is a reference data mapping that the valid values of the reference BDE do not get specifically mapped to any structure or data in the database.  In this case the valid reference values may be defined and maintained in the business glossary in an appropriate manner but there is no explicit mapping link between them and any thing in the data model or database.  Also note that I am not saying that the reference values are not in the database.  They almost certainly are, since the applications would presumably not function properly if they were not.  But the values themselves (“Unknown”, “IFRS”, US GAAP”, etc.) are not considered BDEs.

Figure 2.

Figure 3 is a Column Style mapping of a Compound BDE to multiple Columns of a base Table in the database.  Note here that the BDE maps to two Columns in the same Table.  It could, of course, map to only one Column, or to many more Columns in many more Tables.  The point of this example is not how may times the BDE maps to various Columns but that it is a Compound BDE.  “Current Assets.Cash”[v] really identifies two business concepts, “Current Assets” which is a section of a financial statement, and “Cash” which is the most liquid kind of asset.  In terms of a financial statement, cash is a type of current asset.  For whatever reason, probably because cash itself can occur in multiple contexts, the business found it necessary to identify the financial statement use of cash in its own BDE.  Cash could also, for example, be a type of transaction, which would be another BDE.

Figure 3.

In Figure 4 we have an example of Cell Style mapping.  Here, the same Compound BDE that was used in Figure 3, “Current Cash.Assets” is broken apart into its two separate parts, and each is mapped with a different Column in a different reference Table in the database.  Both of these tables are reference Tables.  One contains the name of the types of financial statement templates used by the system (financial_statement_templates) to which “Current Assets” is mapped.  The other contains the names of the types of lines on those financial statements (financial_statement_line_templates) to which “Cash” is mapped.  It is unlikely that “Current Assets” would be mapped to multiple database Columns, but because of the multiple meanings of “Cash” there could be multiple mappings between it and multiple database Columns.

Figure 4.

Note that Cell Style mapping requires mapping to the data content of the database and not just its structure, or metadata.  This needs to be taken on with great caution and requires a consensus for how the words (e.g. “Cash”) used in the glossary are conveyed to and used in the Transaction Activity Data[vi] of the database to determine the actions of any application accessing the data.  Cash is simple, but any time you rely on reading a string to determine a code path you create an external dependency in the code.  However, if the business changes the name of a Cell Style mapped BDE there should be a documented path to where and how that value is used by any application that accesses it in the database.  This fact alone makes any system more maintainable and sustainable.

 

 

[i] The word “cross-mapping” is used to imply that the mapping goes both ways, from business to technology, and from technology to business.  This allows the documentation of the system to more accurately describe the entire enterprise architecture by creating valid and traceable links back and forth between the business architecture and the technical architecture.  By creating an interface between the two aspects of the enterprise architecture, business people can more confidently make decisions based on the data they see, knowing with confidence it is reliably stored and processed, and technical people can more confidently know the data they collect, manipulate and provide does in fact represent what business side intends for it to be.

[ii] I use the term “base table” here to refer to the realization of the bottom four of the six layers of data described by Malcolm Chisholm in his article “What is Master Data” originally published February 6, 2008.  http://www.b-eye-network.com/view/6758 .  These include Enterprise Structure Data, Transaction Structure Data, Transaction Activity Data and Transaction Audit Data.

[iii] The numbers (128, 600 and 1) that follow the name of each BDE in the examples are purely arbitrary and are there only to indicate that BDE’s should be numbered in the catalog for indexing purposes.

[iv] See Pramod J. Saladage and Martin Fowler, NOSQL Distilled, A Brief Guide to the Emerging World of Polyglot Persistence, Addison-Wesley, 2013.

[v] Note the “.” Between Current Assets and Cash.  This is just to make it easier the tell that the BDE is compound.  Don’t count on always having a separator between the parts of a Compound BDE.  The glossary tool may simply not support it or the in-depth analysis to identify the two parts may not have been made.  As a data modeler you may have to do this analysis yourself.

[vi] Please refer to note ii above.

 

Data Quality and Real World Things

October 15, 2010

This is an article that started out as a white paper for a client on how to conceptualize data quality, or the lack there of.  At the time I could not find a taxonomy of the types of  characteristics that made data have poor quality.   It always seemed like something that “You just knew it when you saw it”  but no objective standards existed that I knew of for what constituted low quality data.  I found there were two classes of poor quality data, incompleteness and inaccuracy.  My thinking was that if I could define the ways that data can deviate from “good” or “acceptable” quality then we could construct solutions to correct each of those conditions.

The article “Data Quality and Real World Things” was first published in November 2006 on the DMReview.com website which is no longer available.

Overview

In general, data represents real-world things and is described by metadata.  It needs to be “fit for its purpose”.  Data that is not, is considered invalid or of low quality and needs to be either cleansed or discarded.  Uncleansed data can be seen as being either incomplete or inaccurate or both.  There are several conditions that can be described in terms of either incompleteness or inaccuracy, or both, that constitute low quality data, these conditions include:

  • Inappropriate data – the metadata inaccurately describes the data.
  • Unknown data – the metadata describing the data is wholly or partially incomplete.
  • Inconsistent data – (a) a reference in the data to some other data in the system is either incomplete or inaccurate, or (b) data representing the same exact thing contains different values.
  • Duplicated data – a thing is unnecessarily represented more than once in the data.
  • Missing data – a required value in the data is incomplete, partially or wholly.
  • Wrong data – the values in the data inaccurately measure or describe the thing.

There are undoubtedly many other ways to express why certain data is not fit for its purpose[1], but I have found that these six types pretty well encompass what can be wrong with data and make it either wholly or partially (which can sometimes be worse than wholly) unfit for the purpose to which you want to put it.

Each type of poor quality data needs to be addressed in a different way.  This is not to say that a complete methodology with its own framework, reference model, plan and templates is necessary for each type, but the root cause of and what can be done to correct each condition will vary.  Any unified approach to Data Quality (DQ) needs to take into account the different types of poor quality data.

Let’s take another look at the six types of poor data quality listed above.  You will notice that they include problems that can be found in both the content of the data and in the content of the metadata that describes the data.  In many instances there is far less of a difference between data and metadata than contemporary data architecture tends to acknowledge.  Essentially metadata is data.  Data, as the first sentence of this article states, represents real-world things.  Data is an abstraction of certain aspects, or characteristics of real-world things.  It was invented (long before computers) to communicate information about real world things.  Metadata is an abstraction of certain aspects of data, primarily to turn the data into information so that consumers (human and otherwise) can extract “meaning” from it and use it to record transactions or in a decision making context.  As more and more real-world things themselves become abstract the distinction between data and metadata tends to become less distinct.

Inappropriate data

Data is inappropriate for its intended use when the metadata associated with it inaccurately describes it.  This leads to a lack of confidence in the data by its consumers. For example, if a data element is defined as being a ‘Customer Home Address’ and it contains the string “John Smith”, which is most likely a name, which do you trust?  The metadata (‘Customer Home Address’) or the data content “John Smith” or neither?  At least one of them is inappropriate for the purpose at hand.  If you assume the metadata is correct then the data is inappropriate.  If you assume the data is correct then the metadata is inappropriate.  Either way, the quality of the data as a whole, the metadata and the data content, is of low quality.

It is possible to measure the inappropriateness of the data.  For example, the number of records that have this anomaly.  We can determine that.  But how do you know which is inappropriate, the metadata or the data content?  Unless you have a definitive answer or reliable heuristics from data profiling or some source of domain knowledge outside of the DQ process itself, you do not know.

Any solution for solving the inappropriate data problem must focus on validating that the metadata in accurate.  That it is appropriate metadata for the purpose to which it is being put, and that the data content then represents instances of that metadata.

Unknown data

Unknown data differs from inappropriate data in that at least some of the necessary metadata is simply not there.  This seems like a metadata problem at first.  But looking at it from a perspective of what can be trusted, it may also be a data content problem.  If you trust that the metadata is complete and accurate, whether or not it looks like it is, then the problem may be a data content problem.  It might be that you have data that is irrelevant.  But without any other information, it is at best unknown data.  If you trust the data content then you may need to produce the metadata by reverse engineering it in order to describe it.

A solution for solving the unknown data problem focuses first on whether it’s a metadata or data content problem and then either discarding the irrelevant data or deriving accurate definitions for the data you have.

Inconsistent data

Inconsistent data problems break down more cleanly between metadata and data content than either inappropriate data or unknown data.  With inconsistent data either a reference in the metadata is inaccurate or incomplete, or both; or the data content significantly differs between what should be instances of the same data content.  Each of these calls for a different solution, but in general is indicative of inconsistency in the body of data and thus a cause for distrust among the data’s consumers.

Metadata inconsistency, that is, references and definitions that are either incomplete or inaccurate, is the more insidious and difficult to find and correct.  This inconsistency is frequently caused by unreconciled semantic differences, the old “impedance mismatch” problem. One party calls it a “Car” and the other party calls it an “Automobile”.  Now, we as humans know that in almost all cases cars and automobiles are the same type of thing.  We know this because we have agreed on the metadata definition of which both “Car” and “Automobile” are instances[2].  In this case they are the data content of a more abstract thing.  But wait a minute, “Car” and “Automobile” were examples of inconsistent metadata weren’t they?  Well yes, in the original context they were used as such.  However to reconcile them with each other we needed an even more abstract concept.  We needed metadata for our metadata, in other words, meta-metadata.  So the only reliable solution for inconsistent metadata is to raise the abstraction stakes one level and create consistent meta-metadata.  Then the problem becomes a data content consistency problem and no longer a metadata problem.

The solution to inconsistent data content, once the inconsistent metadata is no longer a problem, becomes one of identifying the “system of record”.  This is not a problem to be taken lightly, but it is more of a business problem than technical problem. Of course the system of record may exhibit data quality problems of any of the types mentioned here but, in general, if data quality issues can be kept to those problems that typically plague data content they are far easier to solve than those plaguing metadata.

Duplicated data

Not all duplicated data is bad.  The operative phrase here is “unnecessarily duplicated”.  Data that is duplicated because it is de-normalized for performance or security purposes is not necessarily of low quality (though it may be of poor quality for other reasons).  As a matter of fact de-normalizing the data, though it is a risky undertaking, may make it more “fit for purpose” when the purpose is narrowly defined in terms of speed of access to specific data sets.  Also, from a network or enterprise point-of-view, data can be replicated, across multiple systems, typically for either performance, reporting or security purposes.

Where duplicated data is bad is when it is duplicated for no business or technical reason.  This can happen as a result of poor or no entity integrity.[3]  Which is primarily the result of a lack of appropriate uniqueness constraints on the source data.  It can also happen because of poor or lacking data stewardship.  When two (or more) records are exactly the same, selecting any one of them will suffice, however, it is when data records are not quite perfectly duplicated that it becomes more problematic.  This often happens because poor data stewardship has either allowed data content to be partially duplicated or the duplicates have been allowed to age and become out of synch with their master copy.

Missing data

Missing data may be the most common type of data quality infraction.  It is quite often the result of data being allowed to enter a source system without constraints, domain integrity or default values.  Allowing nulls in data can increase its flexibility and make it easier to make progress early in a project, but how many times have you defined a column as null saying to yourself “We’ll get back to this detail later, now we need to finish the interface”?  The interface gets finished and the domain integrity issue never gets addressed again.  After all, it will get corrected up stream.  Right?  Maybe not.

Missing data can also be one of the most time consuming poor data quality problems to fix.  This is primarily because you do not always know whether the data was supposed to be there or not.  In other words, is it missing data or is it just absent data.  That is, perhaps the correct value for a data instance is “unknown”.  But at least missing data is a data content error and not a metadata error.  If the metadata is missing then we have unknown data and that’s another story (see above).  The best way to handle missing data is to agree upon a missing data strategy, publish the strategy and apply it consistently.

Wrong data

Wrong data is just that, it is data that has the wrong content.  The metadata may be accurate and complete but the data content, that is the values in the data itself, is inaccurate.  For example, if the number of items sold was actually 20 but the data recorded is 10, it is wrong.  This is probably the most difficult data quality issue to correct, or even to discover.  It requires knowledge usually outside of what is typically considered the art and science of data quality.  However data profiling can help.  If enough data is profiled that the confidence level of the discovered patterns in the content are sufficiently trusted then when outliers are encountered they can throw an exception during the inspection or transformation process to alert users that there is something wrong.  Whether that something is caused by a system anomaly or by negligence or by lack of process control can be irrelevant, in and of itself, to the DQ process.  A statistical quality control process can be used to spot and in many cases correct wrong data.

Also important in preventing wrong data is a sound data auditing function and even a good old fashioned check and balance system whereby data is entered into the staging area from more than one source.  The subject data is checked against what should be the same data either on another channel, or in a source outside the auspices of the project.  Only when data from all the sources match, or can be reconciled, is a single “audited” copy of the data content persisted.

Conclusion

As you can see there are subtle differences between the types of poor quality data.  In many cases if you can profile a sufficient sample of the data for a sufficient amount of time, many of the types of poor data quality can be mitigated.  However, data profiling is a relatively new technique within data transformation processes, does not have the mind share outside the sometimes cloistered world of data professionals, and is often difficult to sell to management as a necessary component of a Total Data Quality program.  But if you believe, as I do, that understanding a problem is at least fifty percent of the battle to solve it, then including a data profiling step in your DQ efforts is always worthwhile.

Table 1 below summarizes the six conditions described above that can produce data of low quality, including whether the condition primarily effects the metadata, the data content or both, and the appropriate strategy for mitigating each condition.

Data Condition Condition Description Type of Condition Metadata or Data Content Effected? Mitigation Strategy
Inappropriate data  The metadata inaccurately describes the data. Inaccuracy Both metadata and data content are effected. Profile data for appropriateness between metadata and content, and use domain knowledge to validate.
Unknown data  The metadata describing the data is wholly or partially incomplete. Incompleteness Both metadata and data content are effected. If metadata is trusted, then discard irrelevant data content;If data content is trusted, then derive new metadata.
Inconsistent data (a) A reference in the data to some other data in the system is either incomplete or inaccurate. Inaccuracy or Incompleteness Both metadata and data content are effected. Create consistent meta-metadata and then treat the metadata as content data.
(b) Data representing the same exact thing contains different values. Inaccuracy Data content is effected. Identify system of record on a metadata element by element basis.
Duplicated data A thing is unnecessarily represented more than once in the data. Inaccuracy Both metadata and data content are effected. Rationalize unnecessary duplication of both metadata and content data. Improve data stewardship.
Missing data A required value in the data is incomplete, partially or wholly. Incompleteness Data content is effected. Agree upon a missing data strategy, publish the strategy and apply it consistently.
Wrong data The values in the data inaccurately measure or describe the thing. Inaccuracy Data content is effected. Create a data auditing function to reconcile content against other sources.

Table 1. Summary of low quality data conditions.


[1] For another taxonomy of data quality conditions see the views of Larry English at http://www.infoimpact.com/.

[2] This is if we speak the same language of course.

[3] There are three types of data integrity: Entity integrity, Referential integrity and Domain integrity, all of which, when ignored, contribute to poor data quality.