Posts Tagged ‘Business Intelligence’

Machine Learning and Database Reverse Engineering

October 13, 2019

Artificial intelligence (AI) is based on the assumption that a computer program driven by a feedback loop can improve the accuracy of its results. Changing the values of the variables used in executing the code, called “parameters”, in the right way influences future executions of the code. Those future executions are then expected to produce results that are closer to a desired result than previous executions did. When this happens the AI is said to have “learned”.

Machine learning (ML) is a subset of AI. An ML execution is called an “activation”, and activations are what “train” the code to become more accurate. An ML activation is a two-step process. In the first step, input data is conceptualized into what are called “features”. These features are labeled and assigned weights based on assumptions about their relative influence on the output. The data is then processed by the selected algorithms to produce the output, which is compared to an expected output so that a difference can be calculated. This closes out the first step, often called “forward propagation”.

The second step, called “back propagation”, takes the difference between the output of the first step, called “y_hat”, and the expected output, called “y”, and, using a different but related set of algorithms, determines how the weights of the features should be modified to reduce the difference between y and y_hat. Activations are repeated until either the user is satisfied with the output or changing the weights makes no further difference. The trained and tested model can then be used to make predictions on similar data sets and, hopefully, create value for the owning party (either a person or an organization).
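To make the two steps concrete, here is a minimal, hypothetical sketch of a single-feature linear model trained by repeated activations; the terms y_hat, y, the feature weight and the weight update correspond to the description above, while the data values and learning rate are invented for illustration.

```python
import numpy as np

# Invented example data: one "feature" (x) and the expected output ("y").
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 4.0, 6.2, 8.1])

w, b = 0.0, 0.0          # the "parameters": a feature weight and a bias
learning_rate = 0.01

for activation in range(1000):
    # Step 1: forward propagation - produce y_hat and measure the difference.
    y_hat = w * x + b
    error = y_hat - y
    loss = np.mean(error ** 2)

    # Step 2: back propagation - determine how the weights should change
    # so that the next activation brings y_hat closer to y.
    grad_w = 2 * np.mean(error * x)
    grad_b = 2 * np.mean(error)
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(f"learned weight={w:.3f}, bias={b:.3f}, final loss={loss:.5f}")
```

Each pass of the loop is one activation; training stops here after a fixed number of passes, but it could just as well stop when the loss no longer improves.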

In a sense, ML is a bit like database reverse engineering (DRE). In DRE we have the data, which is the result of some set of processing rules, unknown to us[i], that have been applied to that data. We also have our assumptions about what a data model would have to look like to produce such data, and what it would need to look like to increase the value of the data. We iteratively apply various techniques, mostly based on data profiling, to try to decipher the data modeling rules. With each iteration we try to get closer to what we believe the original data model looked like. As with ML activations, we eventually stop, either because we are satisfied or because of resource limitations.

At that point we accept that we have produced a “good enough model” of the existing data.  We then move on to what we are going to do with the data, feeling confident that we have an adequate abstraction of the data model as it exists, how it was arrived at, and what we need to do to improve it.  This is true even if there was never any “formal” modeling process originally.

Let’s look at third normal form (3NF) as an example of a possible rule that might have been applied to the data. 3NF is the rule that every non-key column of a table must depend on the key, or identifier, of the table, and nothing else. If the data shows patterns of single-key dependencies we can assume that 3NF was applied in its construction. The application of the 3NF rule will create certain dependencies between the metadata and the data that represent business rules.
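As an illustration of the kind of data profiling this implies, the hypothetical sketch below checks whether a column ever takes two different values for the same key value, which is the evidence one looks for when deciding whether columns depend on the key alone. The table, rows and column names are invented.

```python
from collections import defaultdict

def depends_only_on(rows, determinant, column):
    """True if `column` never has two different values for the same
    `determinant` value, i.e. it appears functionally dependent on it."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[determinant]].add(row[column])
    return all(len(values) == 1 for values in seen.values())

# Invented sample of a legacy CUSTOMER table being profiled.
customer_rows = [
    {"customer_id": 1, "name": "Acme",  "region": "East", "region_manager": "Lee"},
    {"customer_id": 2, "name": "Bolt",  "region": "West", "region_manager": "Kim"},
    {"customer_id": 3, "name": "Crane", "region": "East", "region_manager": "Lee"},
]

for col in ("name", "region", "region_manager"):
    print(col, depends_only_on(customer_rows, "customer_id", col))

# A column that also depends on a non-key column hints at a transitive
# dependency, i.e. a likely departure from 3NF worth investigating.
print("region -> region_manager:", depends_only_on(customer_rows, "region", "region_manager"))
```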

These dependencies are critical to what we need to do to change the data model to more closely fit, and thus be more valuable for, changing organizational expectations.  It is also these dependencies that are discovered through both ML and DRE that enable, respectively, both artificial intelligence and business intelligence (BI).

It has been observed that the difference between AI and BI is that in BI we have the data and the rules, and we try to find the answers.  In AI we have the data and the answers, and we try to find the rules.  Whether results derived from either technology are answers to questions, or rules governing patterns, both AI and BI are tools for increasing the value of data.

These are important goals because attaining them, or at least approaching them, will allow a more efficient use of valuable resources, which in turn will allow a system to be more sustainable, support more consumers of those resources, and produce more value for the owners of the resources.

[i] If we knew what the original data model looked like we would have no need for reverse engineering.

Identifying Motivators (“Why”)

February 22, 2018

This is the sixth and final post in this series about how to identify entities in data sources that can readily be classified as belonging to each of the 6BI Business Object Categories (BOCs): Parties, Things, Activities, Locations, Events and Motivators. The fifth post in the series (about Events, the “When” aspect) can be found at https://birkdalecomputing.com/2018/01/30/identifying-events-when/ .

The Motivators BOC is probably the most nuanced and least understood BOC. I earlier devoted an entire article, “The Data Architecture of Business Plans”[i], to the metadata structure of motivators; it can be found at https://birkdalecomputing.com/6bi-home/the-data-architecture-of-business-plans/ .

The Motivators BOC identifies Why things get produced and consumed by parties.  Concepts and objects in this BOC capture data about the ends, means, influencers and assessments that provide the reasons why parties exchange things (products and money) at a particular time and place.  Ends and means are in general too abstract to be found in object names, but you will find names such as Strength, Weakness, Opportunity, Threat, and Key Performance Indicator (KPI), all of which are assessment elements.

Data element and data element collection names you may encounter that belong to the Motivators BOC include, but are not limited to, names in the following table[ii]. The list gives you a hint of what kind of names to look for in putting together a 6BI Analytic Schema for enabling your data to answer business questions.

In terms of identifying motivator data elements (i.e. attributes and columns) and motivator data element collections (i.e. entity types and tables), the most likely candidates are documents, or at least those objects that have the word Document in their name.  You need to consider documents because you will quite often find the means (missions and courses of action) of an enterprise described in document form, especially when the document name contains words such as Strategy/Strategic, Tactic, Enablement/Enabled, Directive, Policy or Rule.  The ends of an enterprise (visions and desired results) can also be described in a document, quite often one with a name like Goal or Objective.

As mentioned in the post about the Things BOC[iii], a document can also be considered a type of thing, such as a definition that is being assessed for accuracy, for example.  However, if its purpose is to contain text that describes means or ends, it also belongs to the Motivators BOC.  An event, such as an Appeal or a Campaign, can also be a motivator.  But as was mentioned in the Events BOC post, events are primarily differentiated from other concepts and objects by their inclusion of a time data element, either a point in time or a duration.

Another source of motivators is reference data.  Reference data can describe business functions (see the post on the Activities BOC) and often determines the choices that users make on user interfaces, which in turn determine the logic paths an application takes when processing data and thus explain why certain results are derived.  Example data element and data element collection names that often become the basis of reference data management (RDM) include: Code, Type, Tag, Status and Class/Classification.  You may often find these names in plural form as well.

So, if you are analyzing a legacy database and you come across a table with any of these words in its name, you need to study the content of the table and understand how its rows and columns affect, or are designed to affect, the motivation for actions taken by the parties in the organization.
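A minimal sketch of that kind of name scan, using an invented list of legacy table names and a keyword list drawn from the terms mentioned above (the function and schema are hypothetical):

```python
import re

# Keywords suggested above for the Motivators BOC; extend as needed.
MOTIVATOR_KEYWORDS = [
    "strategy", "strategic", "tactic", "directive", "policy", "rule",
    "goal", "objective", "code", "type", "tag", "status", "class",
    "authorization", "enforcement", "permission",
]

def motivator_candidates(table_names):
    """Return table names containing a motivator keyword, singular or plural."""
    pattern = re.compile(r"(" + "|".join(MOTIVATOR_KEYWORDS) + r")(e?s)?\b", re.IGNORECASE)
    return [name for name in table_names if pattern.search(name)]

# Invented legacy schema for illustration.
legacy_tables = ["CUSTOMER", "ORDER_STATUS", "PRICING_POLICY", "SHIPMENT", "ASSET_TYPES"]
print(motivator_candidates(legacy_tables))
# ['ORDER_STATUS', 'PRICING_POLICY', 'ASSET_TYPES']
```

Any table the scan flags still has to be studied, as described above; the name is only the starting point.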

The Motivators BOC is especially relevant to the type of NOSQL database known as a document database, MongoDB being a prime example.  It is one thing to structure and access the data in a document store in an effective and efficient manner, but in terms of answering business questions it is even more important to know what role the content of the document plays in the operation of the enterprise.  In other words, how does, or how should, the document provide the answer to “why” a business transaction took place between parties?

Another category of motivators deals with security and privacy, and is sometimes included in policies and procedures.  Names here include Authorization, Enforcement and Permission, among others.  The intersection between business motivation and security is ripe for further exploration.

This is the last post in this series.  I hope you find the posts worthwhile and useful.  To find each one, just click the link in the first paragraph of each post to go to the previous one.  The first in the series, about the Parties BOC, can be found at https://birkdalecomputing.com/2017/04/26/identifying-parties/ .

Thanks for reading them and best of luck in developing your 6BI Analytic Schemas.

 

[i] The title “The Data Architecture of Business Plans” is derived from the fact that Business Plans are the deliverable of the Motivation aspect (the “Why” interrogative) at the Business Management, or Conceptual perspective of the Zachman Framework for Enterprise Architecture.

[ii] As previously, I would like to thank Barry Williams and his excellent Database Answers website http://www.databaseanswers.org/data_models/ for providing many of the table name examples.

[iii] https://birkdalecomputing.com/2017/05/04/identifying-things/

Identifying Events (“When”)

January 30, 2018

This is the fifth in a series of posts about how to identify entities in data sources that can readily be classified as belonging to each of the 6BI Business Object Categories (BOCs): Parties, Things, Activities, Locations, Events and Motivators.  Entity types in the Events BOC identify When production and consumption of things by parties occurs. The fourth post in the series (on Locations, the “Where” aspect) can be found at https://birkdalecomputing.com/2017/08/23/identifying-locations/ .

Concepts and objects in this BOC capture data about a point in time or the duration of time over which products or payments flow from one party to another, or when an enterprise carries out its work. Data element and data element collection names you may encounter that belong to the Events BOC include, but are not limited to, names in the following table[i]. The list gives you a hint of what kind of names to look for in putting together a 6BI Analytic Schema for enabling your data to answer business questions.

Events break down into two major sub-types: (1) Occurrence types, which include EventAlert, Notification, and Incident from the list above; and (2) Duration types, which include Year, Month, Week, Day, Hour, Minute, Second, Date and Time from the list.  Duration type entities, as is no doubt obvious, are units of time and can be used to aggregate facts in a star schema across a temporal hierarchy.  Occurrence types are more like things.  Instead of being produced and consumed, they occur; each is something that can be referred back to and that, in addition to any other properties it may have, always carries an aspect of time, or “when”, about it, and this aspect is important for data analysis.
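As a hypothetical illustration of using Duration types to aggregate facts across a temporal hierarchy, the sketch below rolls an invented set of event-level facts up to the Day and Month levels:

```python
from collections import defaultdict
from datetime import datetime

# Invented occurrence-level facts: (timestamp, measured amount).
facts = [
    ("2018-01-29 09:15:00", 120.0),
    ("2018-01-29 16:40:00", 80.0),
    ("2018-01-30 11:05:00", 200.0),
    ("2018-02-02 10:00:00", 50.0),
]

def roll_up(facts, level):
    """Aggregate fact amounts to one level of the time hierarchy."""
    fmt = {"day": "%Y-%m-%d", "month": "%Y-%m", "year": "%Y"}[level]
    totals = defaultdict(float)
    for ts, amount in facts:
        key = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S").strftime(fmt)
        totals[key] += amount
    return dict(totals)

print(roll_up(facts, "day"))    # {'2018-01-29': 200.0, '2018-01-30': 200.0, '2018-02-02': 50.0}
print(roll_up(facts, "month"))  # {'2018-01': 400.0, '2018-02': 50.0}
```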

Unlike the other BOCs, the Events BOC has both dimensional and fact characteristics.  On the one hand, time is already defined as a hierarchy and is standard for everyone: an hour is always an hour (sixty minutes), a minute is always a minute (sixty seconds), and so on.  On the other hand, event occurrences are things that happen and can be measured and compared.  They are data, not metadata as the hierarchy of time is.  Events happen and then they are over, but there can be much to learn from their having occurred.  This BOC is conceived to capture the important perspective of when something happens in your data: not where, not who, not how, not why, not even what has happened, but when it happened, or will happen.

This BOC captures the characteristics of time that most influence results.  It is also important to understand how events differ from either locations or activities, two other previously covered BOCs, with which events are often confused.

A location is concrete.  It is a point in space, a place, even if that space is virtual. You can go away and come back to a location, and if most (not necessarily all) other factors are the same, or within tolerances, the location is still there.  Not so with an event.  An event, though all relevant data may be captured about it, once it occurs, is done and goes away forever.  Another instance of a particular class of events can subsequently occur, but each event is unique and has a time when it occurred.

Events and activities are closely related and co-dependent but are not the same.  Activities are event-driven.  They receive and react to events and create new events which are sent to other activities.  Each activity is an independent entity and can execute in parallel with other activities.  Coordination and synchronization is by means of events communicated between the activities.  Activities react to input events by changing state and creating output events[ii].

The important thing, from a 6BI perspective, is that an event provides a temporal association for a result.  Even if the parties, things, locations and motivators are known (or estimated), you still need to know when these aspects came together to create something of significance.

Another instance of the importance of the “When” aspect is in Big Data solutions.  Since system owners often cannot control when data becomes available to the solution, it is important to be able to record when each event occurs; there can be literally millions of events in a short span of time, and the time of occurrence provides a unique way to aggregate the resulting data.

[i] I would like to thank Barry Williams and his excellent Database Answers website http://www.databaseanswers.org/data_models/ for providing many of the table name examples.

[ii] David Luckham, various writings.

Identifying Locations (“Where”)

August 23, 2017

My apologies for the long delay since the last post in this series, but the real world got in the way, in the form of a job opportunity I just could not say “no” to.

This is the fourth in a series of posts about how to identify entities in data sources that can readily be classified as belonging to each of the 6BI Business Object Categories (BOCs): Parties, Things, Activities, Locations, Events and Motivators.  The third post in the series (on Activities, the “How” aspect) can be found at https://birkdalecomputing.com/2017/05/24/identifying-activities/.  Please note also that I changed the order of Locations and Events because I want to discuss Locations next and save until last what I consider to be the two most complex BOCs, Events and Motivators.

The Locations BOC identifies Where things get produced and consumed by parties.  Concepts and objects in this BOC capture data about, not only physical locations, but virtual locations as well.  Data element and data element collection names you may encounter that belong to the Locations BOC include, but are not limited to, names in the following table[i].  The list gives you a hint of what kind of names to look for in putting together a 6BI Analytic Schema for enabling your data to answer business questions. 

Usage often determines whether Location or Thing is the appropriate BOC for any given object or concept.  For example, Webpage, Portal, Database, and Dashboard are all objects that depend on context and can either be a thing or describe where a thing is located.  Also, Document, Source and Destination can designate places as well as being things.  The key to the Locations BOC is to include only those objects and/or concepts that actually refer to a Place or a Site, and not just the description of the place or the site.  Sites and Addresses are synonymous and can be either physical or virtual.

Locations exist whether we use them or not.  However, at least for business purposes, locations usually become relevant only when a member of one of the other BOCs, usually a party or a thing, is placed at a specific site as an activity or event occurs.  This can best be determined by asking yourself how important it is to know where some activity took place, where an event occurred, or where some party or thing is located.  Does being located at one site as opposed to any other site make a difference?  Is the measurement of performance impacted by the site of one or more of the objects or concepts under consideration?  If the answer is “yes” within the context being considered, then location is a factor to consider in our data analysis.

Locations often form nested hierarchies, as in Country, Region, State, County, City, Street, Postal Code, Zone, Building, etc.  This hierarchy impacts the level of aggregation of the data.  The larger the scope of the location, or the further apart the sites are within a given level of the hierarchy, the more parties and things are generally included.  The more of these objects and concepts (parties and things) are included in the measurement, the less likely it is that the details of any one specific party or thing will influence the data at that level.  This is an application of the law of large numbers, and it is one reason why it is important to be able to dis-aggregate your data when needed.  Enterprise performance is measured by buying and selling lots of things, but those large numbers are generated one item at a time as often as not.

[i] I would like to thank Barry Williams and his excellent Database Answers website http://www.databaseanswers.org/data_models/ for providing many of the table name examples.

 

 

Identifying Activities (“How”)

May 24, 2017

This is the third in a series of posts about how to identify entities in data sources that can readily be classified as belonging to each of the 6BI Business Object Categories (BOCs): Parties, Things, Activities, Events, Locations and Motivators.  The second post in the series (on Things, the “What” aspect) can be found at https://birkdalecomputing.com/2017/05/04/identifying-things/ .

The Activities BOC identifies How things get produced and consumed by parties. Concepts and objects in this BOC capture data about the means by which products or payments flow from one party to another, or how an enterprise carries out its work[i].  Data element and data element collection names you may encounter that belong to the Activities BOC include, but are not limited to, names in the following table[ii].  The list gives you a hint of what kind of names to look for in putting together a 6BI Analytic Schema for enabling your data to answer business questions.

An Activity is the most general super-type in this BOC, encompassing Function and Process[iii].  Functions are intended to describe how an organization’s mission is planned to be carried out, and Processes describe how the mission is made real.  In the design of the Analytic Schema, data that identifies and describes functions is almost always used in the source system as a type of reference data and will typically be brought into the Analytic Schema as text data in dimension members.  The maintenance of this data should be under the control of Data Governance.  The governance of data is itself both a function and a process, and as such its performance can also be measured.  If one were to design a schema for measuring the performance of the Data Governance function, a hierarchical collection of its sub-functions would be identified.  As we will see, functions also play a significant role in the Motivators BOC, but that will come later.

In data modeling, I have observed that we will more often model processes than functions.  A process can be either a Business_Process or a System_Process, but in either case the “process” is how something gets done.  This is accomplished by transforming either concepts or objects (or both) into different states.  It is the contribution of this transformation toward some goal that we need to measure.  Keep in mind it is “how” something gets done that we are measuring here, not “what” gets done.  This is vitally important to analytics and business intelligence because there is a lot of potential gain in improving how something is done, even if what is produced (or consumed) remains unchanged.  For example, decreasing processing time, reducing waste and realigning responses to demand are all readily actionable.  For marketing purposes, how a product is produced or provided[iv] disappears into the product itself, and so is quite often overlooked as a separate factor in measurement.  In business systems, the names we look for to identify activities quite often contain the word Transaction in some form.

Another feature of a process is that it transforms things, and these transformations usually take place via some Mechanism.  Mechanisms include Sales, Purchases, Receiving and Shipments.  A process can also be represented by a document such as a Request, an Order, an Invoice, a Manifest, or a Receipt.  It is the data about the transformation, perhaps recorded in a document, or perhaps not, that we want to measure.  We measure the impact on the parties and things participating in the transformation and not the parties and things themselves.  This is a subtle but important difference.  An activity’s quantities, values and description are the record of “How” the process produced a result.

An activity is often the source of one or more events, and an event is often the source of one or more activities, but activities and events are not exactly the same thing, and are not interchangeable. We will visit the Events BOC in a future post.

[i] David C. Hay, Data Model Patterns, A Metadata Map, 2006.

[ii] I would like to thank Barry Williams and his excellent Database Answers website http://www.databaseanswers.org/data_models/ for providing many of the table name examples.

[iii] David C. Hay, Data Model Patterns, A Metadata Map, 2006.

[iv]  The distinction between “produced” and “provided” is made to distinguish between, for example, manufacturing and retailing.

Identifying Things (“What”)

May 4, 2017

This is the second post in a series of posts about how to identify entities in data sources that can readily be classified as belonging to each of the 6BI Business Object Categories (BOCs): Parties, Things, Activities, Events, Locations and Motivators.  The first post in the series (on Parties, the “Who” aspect) can be found at https://birkdalecomputing.com/2017/04/26/identifying-parties/ .

The Things BOC identifies What the concepts and objects are that are produced and consumed by parties.  Data element and data element collection names you may encounter that belong to the Things BOC include but are not limited to names in the following table[i].  The list gives you a hint of what kind of names to look for in putting together a 6BI Analytic Schema for enabling your data to answer business questions.

For the purposes of 6BI the first decomposition of Things is between Product and Payment.  Products are further decomposed into Good and Service.  Goods are tangible material products for which consumers make payments, and services are products provided primarily by human or human-like labor.  Products are also quite often hierarchical, and the names used for each level are Things BOC names in their own right.  These names can be logical, such as Class, Category and Type, or physical, such as Assembly, Component and Container.  Look for these words, or ones like them, in conjunction with other words that more clearly designate them as classification levels of a product, such as Asset_Type or Vehicle_Assembly.  Products and Payments represent what is exchanged in a transaction and are differentiated by the direction in which they flow.  Products flow from Provider to Consumer, and Payments flow from Consumer to Provider.  Quite often the difference between a Product and a Payment is obvious, but sometimes it’s not.  This is especially true when transactions are “in kind” and it is not obvious which thing, if either, represents the “money”.  One rule of thumb is to always remember who the “first party”[ii] in your analysis is and which side of their ledger you are analyzing.  The first party is the party for “whom” the analysis is done or “who” the analysis is intended to benefit.  If you are analyzing their receivables side then the inflow is always a payment and the outflow a product.  If you are analyzing their payables side then the opposite is true: inflows are product types and outflows are payment types.
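The rule of thumb about the first party and the ledger side can be written down as a small classification function; the party names, transaction records and field names below are invented for illustration:

```python
def classify_flow(first_party, flow):
    """Classify a flow as a Product or a Payment from the first party's perspective.
    On the receivables side inflows are payments and outflows are products;
    on the payables side the opposite is true."""
    inbound = flow["to"] == first_party
    if flow["ledger_side"] == "receivables":
        return "Payment" if inbound else "Product"
    return "Product" if inbound else "Payment"

# Invented transactions analyzed from Acme's perspective (Acme is the first party).
flows = [
    {"from": "Acme",      "to": "Customer1", "ledger_side": "receivables", "item": "widget"},
    {"from": "Customer1", "to": "Acme",      "ledger_side": "receivables", "item": "cash"},
    {"from": "Supplier1", "to": "Acme",      "ledger_side": "payables",    "item": "parts"},
]

for f in flows:
    print(f["item"], "->", classify_flow("Acme", f))
# widget -> Product, cash -> Payment, parts -> Product
```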

Potentially the Things BOC can be identified by more data store names than any other BOC because we as humans often designate all phenomena as things.  In information systems, however, it is always more useful to refer to the instances of the Things BOC as products or payments.  We use the term “Things” for this category of business objects so that we remember to look at both sides of “what” is being exchanged in a transaction, and not be content to consider the product alone.  There are simply so many things in the real world that we must concentrate on “how” (see the Activities BOC post) they flow if we need to measure their value to a party and assess a party’s contribution to that value.

A Definition can also be a product.  This is true when it is used to represent the meaning of that which parties (individuals and organizations) produce or consume.  It doesn’t matter what is defined.  It can be a party, a location, an activity, an event, a motivator or anything else.  If the definition itself is manipulated (i.e. produced or consumed by a party) then it is a product, and thus a thing.  We can speak about the “definition of the customer”, for example.  Customer is clearly a member of the Parties BOC when it comes to analyzing data content for understanding performance.  But the definition itself (i.e. what a customer is) is a product of a metadata system.  If you need to analyze, normalize and rationalize the consistency of various definitions of customer, you need to treat these definitions as things and not as parties.  That is, they are products of the system associated with a provider and a consumer.

The customer can have multiple definitions, but each separate definition must be associated with the customer through some unique combination of location, event, activity and/or motivator.  Those for whom the consistency checking and improvement is performed are the parties.  However, and this is critical, the definition of what a customer is, which allows the term to be used consistently to mean the same role played by a party under some unique combination of activity, location, event, product and motivator, is itself a product.  As a product, its quality can be controlled and monitored, its accuracy and integrity assessed, and its use measured.

[i]  I would like to thank Barry Williams and his excellent Database Answers website http://www.databaseanswers.org/data_models/ for providing many of the table name examples.

[ii] The party from whose perspective the measurements are taken.

 

Identifying Parties (“Who”)

April 26, 2017

This is the first of a series about how to identify entities in data sources that can be readily classified as belonging to each of the 6BI Business Object Categories (BOCs): Parties, Things, Activities, Events, Locations and Motivators. I will start with Parties.

The Parties BOC (Business Object Category) identifies Who produces or consumes objects and concepts.  Examples of commonly used data element[i] and data element collection[ii] names you may encounter in diverse types of data stores, that can be categorized under the Parties BOC, include but are not limited to those in the table below.[iii] They contain names commonly used to identify entities (e.g. tables in an RDBMS and documents in a DDBMS) and attributes (e.g. columns in tables and fields in documents). The list gives you a hint of what kind of names to look for in putting together a 6BI Analytic Schema for enabling your data to answer business questions.

Telling providers from consumers is important because it tells you in which direction the product flows and in which direction the payment flows.  Knowing this provides the basis for profitability and resource-use efficiencies.  Producing parties include names such as Producer, Provider, Seller, Supplier, Broker and Vendor.  Examples of consuming party names include Consumer, Payer, Receiver, Buyer, Client and Customer.  A Parties BOC name can also identify third parties such as Obligee, Specialist and Agency.  Parties can be animate (Citizen, Patient) or inanimate (Bureau, Department).  They can be collective (Organization) or singular (Person).  These are some of the names you should look out for in analyzing source data systems.  They can be names of data element collections (e.g. tables) and/or names of data elements (e.g. columns).  Generally, they hold data about the object types within the Parties BOC, which can then be used in answering the Who interrogative component of any query.

Parties often come in pairs: the parties that produce or provide, and the parties that consume or pay.  Thus, it is important that when you discover one or more Parties BOC entities in your source data you ask the question, “Is this a producer or a consumer?”  Of course, a single entity can be both a producer and a consumer, depending primarily on the type of thing being measured and the type of activity that causes it to need to be measured.  If a “first party” (i.e. the party from whose perspective the measurements are taken) pays money in the transaction, it will be the “consumer” in the scope of the transaction being analyzed.  Otherwise the party is the “producer” in some manner, and its performance will be measured by some assessment of relative value between product outflow and money inflow.  Quite often both providers and consumers will be assessed in some manner, usually by an Assessor, as a result of the measurements.

With the advent of the Internet of Things (IoT), the role that a device plays in producing and consuming data means that the Parties BOC now includes non-traditional, non-human parties as well.  As a matter of fact, entities that previously were considered only part of the Things BOC can now be parties.  The concept of a party has gone beyond the traditional definitions of people and organizations, but a party still, at the end of the day, remains a producer and consumer of data.

[i] Wikipedia defines a data element as any unit of data defined for processing. Since 6BI deals with the meaning of data elements, it is generally non-productive to consider any unit smaller than an attribute or field.

[ii] A data element collection is a set of related attributes or fields. Depending on the degree of normalization they are usually co-located in a container called a table or a document.

[iii] I would like to thank Barry Williams and his excellent Database Answers website http://www.databaseanswers.org/data_models/ for providing many of the table name examples.

6BI and Aggregate-Orientation, Part 2

September 15, 2016

This is the second part of a two-part post.  The first part can be found at https://waynekurtz.wordpress.com/2016/09/06/introduction-to-6bi-and-aggregate-orientation/.

To transform a 6BI (“Six Basic Interrogatives”) derived logical data model into an aggregate-oriented data model there are several steps that need to be taken.  We will use the logical business data model shown in Figure 3 as our example to show this transformation.  It is very simplified but fairly typical of certain types of commercial enterprises.

Figure 3. A Business Data Model Example.

Semantically we can say the following about our business data model.  A Customer can have many Contacts, and a Contact can have many Customers.  The Customer_Contact entity[i] resolves this by linking a Customer to its Contacts through the “has contact” relationship and a Contact to its Customers through the “has customer” relationship.   A Customer “places” Orders, “receives” Order_Invoices, and “makes” Order_Payments.  A Provider “receives” Orders.  An Order “contains” Order_Items and a Product “identifies” Order_Items.  An Order_Invoice “is paid for by” Order_Payments and a Payment “is applied to” an Order_Payment.

Unlike relational data modeling, where the outcome, as mentioned in Part 1, is typically the creation of tables which separately store instances of entity types linked together by key-based references, the outcome of an aggregate-oriented data model is the creation of aggregates.  An aggregate is a complex data structure based on a “root entity”.  The root is determined by the business requirements and is identified as an entity type in an LDM.  This entity is often derived from a 6BI BOC (“Business Object Category”).  Aggregates are stored as a nested collection of entity instances whose relationship to the root is also determined by business requirements.  How deeply an entity type is nested within the aggregate structure is determined by where it is placed along a dependency gradient.

Dependency Gradients

A dependency gradient is a continuum of degrees of dependency where the amount of dependency flows from a higher concentration to a lower concentration.  It flows from entities that have a greater number of dependencies, both direct and indirect, to entities with a lesser number of dependencies.  An entity that consumes no other entity’s identifier (i.e. it has no foreign keys) is referred to as the “apex entity” of the continuum and has the least dependency.  An entity whose identifier is not consumed by any other entity along the continuum is referred to as the “base entity”.  The relationship between a base entity and an apex entity could be direct, however there are often many potential intermediate entities between the base and the apex entities.

Figure 4 shows two dependency gradients that flow from the same Base Entity Order_Payment (a type of Exchange).

  • Dependency Gradient-01 flows through the relationship “is applied to” directly to Apex Entity-01 Payment (a type of Thing).
  • Dependency Gradient-02 flows through the relationships “is paid for by” to Order_Invoice (a type of Exchange), “is billed thru” to Order (a type of Exchange), and finally “places” indirectly to Apex Entity-02 Customer (a type of Party).


Figure 4. Dependency gradients flowing from Order_Payment.

Dependency gradients form dependency trees when the flow from base to apex takes more than one path.  The term gradient[ii] is used to indicate the relative degree and direction of dependency along a tree.  Dependency always flows from a higher concentration (more foreign keys) to a lower concentration.  Aggregates in an aggregate-oriented data model are based on dependency trees.
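A hypothetical sketch of walking dependency gradients from a base entity toward apex entities follows. The entity and relationship structure is taken from the example model in the figures; the dictionary representation of foreign-key references is invented for illustration.

```python
# Each entity maps to the entities whose identifiers it consumes (its foreign keys).
# Dependency flows from entities with more foreign keys toward entities with none.
REFERENCES = {
    "Order_Payment":    ["Payment", "Order_Invoice", "Customer"],
    "Order_Invoice":    ["Order", "Customer"],
    "Order_Item":       ["Order", "Product"],
    "Order":            ["Customer", "Provider"],
    "Customer_Contact": ["Customer", "Contact"],
    "Customer": [], "Contact": [], "Provider": [], "Product": [], "Payment": [],
}

def gradients(entity, path=None):
    """Yield every dependency gradient (path) from `entity` up to an apex entity,
    i.e. an entity that consumes no other entity's identifier."""
    path = (path or []) + [entity]
    parents = REFERENCES[entity]
    if not parents:                       # apex entity reached
        yield path
    for parent in parents:
        yield from gradients(parent, path)

for g in gradients("Order_Payment"):
    print(" -> ".join(g))
# Includes, among others, Order_Payment -> Payment and
# Order_Payment -> Order_Invoice -> Order -> Customer, matching Figure 4.
```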

Steps for Transforming the LDM

The following ordered set of steps is required to transform an LDM into an aggregate-oriented data model.

  • Select the root entity for each aggregate.
  • Identify relationships to be eliminated by “rolling down” parent entity attributes into the child entity.
  • Identify relationships to be collapsed by “rolling up” child entity attributes into the parent entity.
  • Eliminate unneeded apex entities by rolling down their attributes to an entity higher on the dependency gradient.
  • Roll up attributes from base entities in the direction of the root entity along selected dependency gradients.
  • Create the aggregate-oriented data model.

Select the Root Entity for each Aggregate

The root entity of an aggregate, identified by the ROOT operator, can occur anywhere along the dependency gradient and is independent of the relational association (i.e. relationship) between itself and any other entity along the gradient.  The other entities in a dependency tree, called aggregate participants, are embedded into the aggregate root entity instance.  The embedding of participant entity attributes into the aggregate root occurs regardless of whether the participant is in the direction of the apex or in the direction of the base relative to the root.  Identification of the root entity will influence which relationships get eliminated and to a greater extent determine which relationships get collapsed.

Identify Relationships to be Eliminated

Relationships to be eliminated are tagged with the REFI[iii] tag.  To identify each individual action triggered by the REFI tag the tags are numbered starting with REFI-01.  In eliminated relationships the parent entity is removed and all or some of its attributes are rolled down into the child entity.

Which attributes to roll down is determined by how normalized the entity is.  In cases where an entity has attributes that are not part of one of the dependency gradients being aggregated, the non-dependent attributes would not be rolled down; the relationship, however, is still eliminated in transforming the logical data model into an aggregate-oriented data model.

Figure 5 shows the relationship between Product and Order_Item named “identifies” tagged as REFI-03.  In this case it was decided this relationship would be eliminated, and the parent (Product) attributes were rolled down into the child (Order_Item) entity, eliminating both the relationship and its parent entity.


Figure 5. A relationship tagged for elimination.
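A hypothetical sketch of what the REFI-03 roll-down looks like in code: the parent Product attributes are copied into each child Order_Item record, with the parent's key surviving only as a non-key attribute. The record contents are invented.

```python
# Invented instances of the parent and child entities before the roll-down.
products = {
    "P-100": {"product_name": "Widget", "unit_price": 9.95},
    "P-200": {"product_name": "Gadget", "unit_price": 24.50},
}
order_items = [
    {"order_id": "O-1", "line_no": 1, "product_id": "P-100", "quantity": 3},
    {"order_id": "O-1", "line_no": 2, "product_id": "P-200", "quantity": 1},
]

def roll_down(order_items, products):
    """Eliminate the "identifies" relationship (REFI-03) by embedding the parent
    Product attributes in each Order_Item; product_id remains as a non-key attribute."""
    return [{**item, **products[item["product_id"]]} for item in order_items]

for item in roll_down(order_items, products):
    print(item)
```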

Identify Relationships to be Collapsed

This step requires modelers to take an entity that has other entities referring to it, like Order in our example data model, and roll up the attributes of its selected child entities into the parent entity.  Order and its dependent entities, Order_Item through the “contains” relationship and Order_Invoice through the “is billed thru” relationship, are shown in Figure 6.  Only the relationship with Order_Item is tagged as EMBED-02.  The relationship with Order_Invoice does not need to be tagged for this transformation because Order_Invoice has a separate relationship with Customer named “receives” and its attributes are rolled up along that dependency gradient (see Figure 7).


Figure 6. A relationship tagged for collapse.

Figure 7 shows the whole example model tagged in preparation for transforming it into an aggregate-oriented data model with Customer as its root entity.


Figure 7. A logical data model tagged for transformation.

Eliminate Unneeded Apex Entities

Figure 8 shows what the physical data model (PDM) looks like after certain relationships, and their parent entities, are eliminated.  In this model we have eliminated the relationship “has customer” rolling Contact into Customer_Contact (REFI-01), “receives” rolling Provider into Order (REFI-02), and “identifies” rolling Product into Order_Item (REFI-03)[iv].  Please note that in the roll downs we have included the parent primary keys (now as non-key attributes).  Strictly speaking we would not always need to do this.  If we were to implement this data model in a relational database management system (RDBMS) the entity Provider would become a table and its attributes columns.

Again this is actually more of a business decision than strictly a technical decision, though I suspect it is rarely considered so.  This is another reason why I say selection of a NOSQL database is at least as much of a business decision as a technical decision.


Figure 8. Our example data model after eliminating unneeded apex entities.

Create the Aggregate-Oriented Data Model

Figure 9[v] shows an aggregate-oriented data model based on Customer as its root entity.  All the attributes of the entities that survive the roll down are rolled up along dependency gradients and collapsed into the Customer_Aggregate forming the nested collection of entity instances, which is called an aggregate.

We have the same issue here that we had in the previous section: what do we do with the entity identifiers (i.e. primary keys)?  One identifier that we have to keep, of course, is the primary key of the root entity, Customer_Aggregate.customer_id.  It is the identifier of the aggregate.  The answer for an aggregate is not the same as the answer for a table.  In an RDBMS we are optimizing the design of tables; in a NOSQL database we do not have tables.  All the attributes of all entity types included in the creation of the aggregate are considered attributes of the aggregate, no longer attributes of “stand-alone” entities or entity instances.  The identifiers of all aggregate participants are removed from the model.


Figure 9. Aggregate-oriented data model.
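As a hypothetical illustration of the end result, a single Customer_Aggregate instance might be persisted as a nested structure like the one below. The attribute values are invented, the nesting shown is only one plausible arrangement of the rolled-up entities, and only the root identifier (customer_id) is kept as a key.

```python
import json

customer_aggregate = {
    "customer_id": "C-042",              # identifier of the aggregate root
    "customer_name": "Acme Ltd",
    "contacts": [                        # Customer_Contact with Contact rolled down
        {"contact_name": "J. Lee", "contact_role": "Buyer"},
    ],
    "orders": [                          # Order, with Provider rolled down
        {"order_date": "2016-08-01",
         "provider_name": "Bolt Supply",
         "order_items": [                # Order_Item, with Product rolled down
             {"product_name": "Widget", "unit_price": 9.95, "quantity": 3},
         ]},
    ],
    "invoices": [                        # Order_Invoice rolled up via "receives"
        {"invoice_date": "2016-08-03", "amount": 29.85,
         "payments": [                   # Order_Payment, with Payment rolled down
             {"payment_date": "2016-08-10", "amount": 29.85, "method": "card"},
         ]},
    ],
}

print(json.dumps(customer_aggregate, indent=2))
```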

Not having tables as your “modeling target artifacts” may be a difficult and even disturbing concept for data modelers experienced in identifying and normalizing entities.  For this reason an aggregate-oriented data model is a new type of data model.  It is not a logical model, but it is also not a physical model in the strict sense that it can be used to generate pre-defined database schema structures (e.g. tables).  It is used to relate the entity types and relationships designed in a logical data model, and the business requirements they represent, to the chosen type of NOSQL database (e.g. key-value, column family, document, and others) and the unique metadata structures each type of NOSQL database has.

A final transformation is required to create code for the construction of the appropriate aggregate-oriented NOSQL database structures.  Because we typically see NOSQL data structures expressed as a mixed metadata-and-data structure, such as a JSON string, it is easy to forget that the JSON data still has a schema.  That schema is often maintained in the application code and not in the database engine.  The advantage of this is more flexibility in application development, but the cost is often a lack of the uniform constraints needed to maintain the re-useable data structures required for non-deterministic query construction and compliance with data governance standards.
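Because that schema lives in the application code rather than the database engine, any uniform constraints have to be enforced there as well. A hypothetical sketch using the jsonschema package (the schema fragment, field names and the pymongo-style collection are assumptions for illustration):

```python
from jsonschema import validate, ValidationError

# Application-side schema for the Customer aggregate; the database engine
# itself will not enforce any of these constraints.
CUSTOMER_AGGREGATE_SCHEMA = {
    "type": "object",
    "required": ["customer_id", "customer_name"],
    "properties": {
        "customer_id":   {"type": "string"},
        "customer_name": {"type": "string"},
        "orders":        {"type": "array", "items": {"type": "object"}},
    },
}

def save_aggregate(doc, collection):
    """Validate the aggregate before persisting it, since the store will not."""
    try:
        validate(instance=doc, schema=CUSTOMER_AGGREGATE_SCHEMA)
    except ValidationError as err:
        raise ValueError(f"aggregate rejected: {err.message}") from err
    collection.insert_one(doc)   # e.g. a pymongo collection (assumed here)
```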

In summary, the steps needed to transform an LDM into an aggregate-oriented data model are the following: first, identify the root entity for the aggregate in the LDM; then identify non-root parent entities to be eliminated by rolling down their attributes into their child entities; and finally identify child entities to be collapsed by rolling up their attributes into the root entity along dependency gradients that have the aggregate root as apex entity.

In a future paper I will describe how the concepts in this article can be applied to data targeted to be stored in popular NOSQL databases.

 

[i] It’s conventional for data modeling tools to call the rectangles in LDMs “entities”. They can also be called “types” or “entity types”. The physical form should be called an “entity instance”.  In this paper I will use “entity” and “entity type” interchangeably because most of the paper deals with logical data modeling and most of the tools used for logical data modeling call them entities.  Where I refer to physical modeling I use either “entity instance” or, if “entity” is understood, just “instance”.

[ii] In physics “gradient” is defined as an increase or decrease in the magnitude of a property, the change of which can usually be measured. The word seemed appropriate to name the concept of increasing or decreasing dependency.

[iii] The REFI and EMBED tags were identified by Jovanovic and Benson.  Their usage in this article is derived from the paper but may differ slightly from the original meaning, including the numbering of each instance of a tag type for easier identification of each individual roll-up and roll-down action.

[iv] It is important to understand that the three eliminated entities are not removed from the data model, neither logically nor physically.  They are simply not part of the aggregate-oriented data model.

[v] Any reader familiar with the Erwin Data Modeler tool will recognize that I’ve used the symbol for a View to represent the aggregate. This is not to say that an aggregate is the same as a view.  Neither Erwin nor any other data modeling tool I am familiar with yet supports aggregate-oriented modeling.  The COMN data modeling technique introduced by Theodore Hills is the closest I’ve seen to actually recognizing and describing a methodology for doing so.  Please see http://www.tewdur.com .

Introduction to 6BI and Aggregate-Orientation

September 6, 2016

I wrote this article after reading NoSQL Distilled, A Brief Guide to the Emerging World of Polyglot Persistence by Pramod J. Sadalage and Martin Fowler in the summer of 2016. I was prompted to read the book after seeing Martin Fowler present on NoSQL at a 2013 GoTo Conference, a presentation which can be viewed at https://www.youtube.com/watch?v=ASiU89Gl0F0.

Up to that point the whole concept of “NoSQL” was very fuzzy to me, but after experiencing his presentation and reading the book I realized that the “lynchpin” that held RDBMS and NoSQL together was a concept that many SQL database architects have been working with for years: the common de-normalized data structure. One problem (among several) with de-normalization has always been how to systematically de-normalize data so that two de-normalized data sets can be compared to each other in a consistent and repeatable manner, assuring reliable analysis by different parties, over time and in different locations.  In other words, how do we maintain the old “apples-to-apples” and “oranges-to-oranges” paradigm?

————————————————————————————————————

Six Basic Interrogatives, or 6BI for short, provides an approach for organizing and understanding the context of data used to support business decision making.  I first wrote about 6BI back in the 2004-2005 time frame when I was working full time as a Business Intelligence Data Architect and needed a quick way to get started analyzing existing data in order to organize it to answer business questions.  Since that time I’ve had other opportunities to put 6BI to use in situations not normally identified as business intelligence, certainly not the traditional type which focuses on a central RDBMS based data warehouse.  I have found it holds up well, as I thought it would, in various scenarios and has versatility and applicability in just about all the computing situations that I have found myself in.

With the increased popularity of distributed computing, fueled mostly by the overwhelming uptake of mobile devices for both business and personal use, new highly scalable technologies for persisting data have been developed.  These technologies tend to rely on an expanding network of smaller computing units, growing by scaling out to more units, as opposed to a small number of large and ever-growing computing units which grow capacity by scaling up.  Scaling out is most often referred to as “horizontal scaling” while scaling up is referred to as “vertical scaling”.  There is no doubt that horizontal scaling is more popular today than vertical scaling.  Going hand-in-hand with the horizontal scaling trend has been the introduction of alternative data stores now collectively known as NOSQL databases.

Three types of NOSQL databases that are designed to store a rich structure of closely related data that is accessed as a unit were identified by (Sadalage and Fowler 2013)[i].  They call this type of database an aggregate-oriented database because its basic storage structure is conceptualized differently from that of a traditional relational database management system (RDBMS).  An RDBMS stores its data in tables, or more precisely in relations.  Relations allow data to be normalized into units (tables) which contain data that is dependent only upon the primary key of the entity.  From a logical perspective we can say that each set of attributes (i.e. a row in a table) describes one and only one instance of an entity which in turn is differentiated from all other instances of all other entities by its unique identifier. The columns of the table store the attribute values of the entity instance where its row intersects with its columns.

In practice this quite often results in data about customers stored in a Customer table, data about products in a Product table, data about orders in an Order table, and so forth.  This is a very powerful design because it does not pre-suppose any fixed relationship between Customers, Products and Orders, but it lets designers and developers link them together in any way that meets business requirements, even when these requirements might not necessarily be known in advance.  This ad hoc linking is made possible by the SQL language[ii].
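For example, because the data stays normalized in separate tables, a question no one anticipated can still be answered with an ad hoc join. A minimal sketch using SQLite from Python, with invented table contents ("order" is quoted because it is a reserved word):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE product  (product_id  INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE "order"  (order_id INTEGER PRIMARY KEY,
                           customer_id INTEGER REFERENCES customer,
                           product_id  INTEGER REFERENCES product,
                           quantity INTEGER);
    INSERT INTO customer VALUES (1, 'Acme'), (2, 'Bolt');
    INSERT INTO product  VALUES (1, 'Widget'), (2, 'Gadget');
    INSERT INTO "order"  VALUES (1, 1, 2, 5), (2, 2, 1, 3);
""")

# An ad hoc question nobody designed for: which customers bought which products?
rows = con.execute("""
    SELECT c.name, p.name, o.quantity
    FROM "order" o
    JOIN customer c ON c.customer_id = o.customer_id
    JOIN product  p ON p.product_id  = o.product_id
""").fetchall()
print(rows)   # e.g. [('Acme', 'Gadget', 5), ('Bolt', 'Widget', 3)]
```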

NOSQL databases, in order to provide more performance across distributed data stores, even when the nodes are distributed globally, were conceptualized and built with a different set of priorities.  Non-deterministic query structure was not a priority, but instead just capturing and persisting the data as quickly as possible was the top priority.  There was no need to build SQL language interpreters into the architecture of these “new age” data stores.  As a result NOSQL databases end up sacrificing some of the flexibility needed to support non-deterministic query construction in favor of higher performance access in a distributed landscape, or cluster.

Inevitably this trade-off has led to the need for designers to know or predict, in advance, how the data will be accessed, which entities will need to be joined, and in what order, to answer their business questions.  The loss of the luxury of not needing to know in advance exactly how your data will be accessed is the price that is paid for better performance.  Fair enough!  Nothing comes without its cost, especially in the world of computing. The important thing to remember is that there will always be a trade-off[iii].

There are three aggregate-oriented NOSQL database categories that have emerged over the last several years:  Key-Value, Document, and Column Family.  Each is different but they share a conceptually similar data storage unit called an aggregate[iv].  A data modeler (or developer modeling an application’s data) using these new types of models needs to start thinking about how the data will be accessed earlier in the design and development process than ever before.  As (Sadalage and Fowler 2013) state, aggregate awareness needs to be made part of the design of the data model[v].  I can see how this could be the case when there is a lack of clarity between the role of database design and that of database development.  With the advent of aggregates and aggregate-orientation the job of a business intelligence data modeler has changed from being able to concentrate on business requirements to needing to be more aware of how the data will be accessed and stored.  This challenge is stated succinctly by (Jovanovic and Benson 2013) when they say NoSQL databases have opened a new non-trivial problem area for data modelers[vi].

If, however, we do the aggregate data modeling at a level independent of the style of the data store (i.e. Key-Value, Document or Column Family), we can create implementation-neutral and re-useable data models that reflect the requirements of the business as much as they reflect the requirements of the technology used to implement them.  Data organized with an aggregate-orientation needs to be modeled at a logical level just as much as data organized with a relational orientation ever did.  In both cases we need to start with the basic entities that describe business value for persisting and manipulating data.  The need for an organizing framework for data still exists.  We still need to know which real-world entities (regardless of how they are instantiated in the data store) the business cares about and how these entities relate to each other.

Business Object Categories

6BI is built on a framework of six universal and re-usable Business Object Categories (BOCs).  Each BOC represents a non-overlapping category of business data.  Each category corresponds to one of the six basic interrogatives: Who, What, Where, When, How and Why.  Figure 1 shows the six basic interrogatives and the business object category which is derived from each including example sub-categories which will be used in subsequent business data model examples.

Basic Interrogative and corresponding Business Object Category:

  • Who produces the data used to measure performance? Parties[vii] (e.g. Customer and Provider)
  • What is being manipulated or exchanged to produce measurable performance? Things[viii] (e.g. Product and Payment)
  • How are the data values that measure performance produced? Activities (e.g. Exchange and Process)
  • When does the activity take place, or for how long is the performance measured? Events (e.g. Point in Time and Period of Time)
  • Where does the activity used to measure performance take place? Locations (e.g. Address and Placement)
  • Why does the data actually measure performance? Motivators[ix] (e.g. End and Means)

Figure 1. Six Basic Interrogatives and Business Object Categories.

In the 6BI Framework, Parties and Things have no direct association with each other.  It is only through the other four Business Object Categories (Activities, Events, Locations, and Motivators) that Parties and Things are associated.  The left side of Figure 2 shows the relationship of Parties to Activities, Events, Locations and Motivators.  The right side shows the relationship of Things to the same four BOCs.  This makes sense because only when parties engage in activities, exchanging and manipulating things between them, do events get generated at certain locations and reasons why become relevant.  Figure 2 illustrates the relationships between business object categories, both direct and indirect.


Figure 2. The indirect association of Parties to Things.

These business object categories provide the framework and starting point for classifying the data that is needed to support non-deterministic query construction.  This design allows enough flexibility to support ad hoc reporting by focusing on the basic questions that all decision support data structures need to support, while not requiring a strictly relational storage structure underneath.

In the next part of this article we will show a method for transforming a 6BI derived logical data model into an aggregate-oriented data model.

 

[i] Pramod J. Sadalage and Martin Fowler, NOSQL Distilled, A Brief Guide to the Emerging World of Polyglot Persistence, Addison-Wesley, 2013. Sadalage and Fowler identify four (4) types of NOSQL databases, the fourth being Graph databases, which we will not be discussing in this article.

[ii] SQL is not the first relational database language.  In the 1980s each relational database product had its own language, each of which did pretty much the same thing.  SQL simply won the “database language war”.  It is often criticized for being, among other shortcomings, “less than perfect”. But like the human communication vehicles known as “natural languages” it too has evolved to fit its purpose and has become widely accepted as a result.

[iii] This is the reason why the choice of using a NOSQL-style database is, or should be, as much a business decision as a technical decision.  What always needs to be taken into consideration are the possible hidden costs of implementing a database solution that does not support SQL, especially in light of the fact that SQL has become a virtual “lingua franca” of data analysis and business intelligence. These costs may be hidden from the group making the NOSQL database decision, since it is quite likely they are not the same group that uses SQL to answer business questions.

[iv] Sadalage and Fowler, page 13.

[v] Sadalage and Fowler, page 17.

[vi] Vladan Jovanovic and Steven Benson, Aggregate Data Modeling Style, Proceedings of the Southern Association for Information Systems Conference, Savannah, GA, USA March 8th–9th, 2013, page 75.

[vii] Parties are often first specialized into “Persons” and “Organizations”, but whether a party is an individual or a group is not relevant to our discussion here.  Instead I use a sub-categorization that is more reflective of the role a party plays in an exchange.

[viii] In earlier articles on 6BI, the term “Product” was used as the name of the first-level BOC that corresponds with the “what” interrogative.  Product, however, is not universal enough.  For lack of a better term I use “Thing” to include Money along with Goods and Services.

[ix] In earlier articles on 6BI the terms “Plans” and “Rules” were used as BOCs that correspond with the “why” interrogative. But these are only Means, not Ends, and “Why” needs to include both Ends and Means.

Data Quality and Real World Things

October 15, 2010

This is an article that started out as a white paper for a client on how to conceptualize data quality, or the lack thereof.  At the time I could not find a taxonomy of the characteristics that make data poor in quality.  It always seemed like something you just knew when you saw it, but I knew of no objective standards for what constituted low quality data.  I found there were two classes of poor quality data: incompleteness and inaccuracy.  My thinking was that if I could define the ways that data can deviate from “good” or “acceptable” quality, then we could construct solutions to correct each of those conditions.

The article “Data Quality and Real World Things” was first published in November 2006 on the DMReview.com website which is no longer available.

Overview

In general, data represents real-world things and is described by metadata.  It needs to be “fit for its purpose”.  Data that is not is considered invalid or of low quality and needs to be either cleansed or discarded.  Uncleansed data can be seen as being either incomplete or inaccurate, or both.  There are several conditions, each describable in terms of incompleteness or inaccuracy or both, that constitute low quality data.  These conditions include:

  • Inappropriate data – the metadata inaccurately describes the data.
  • Unknown data – the metadata describing the data is wholly or partially incomplete.
  • Inconsistent data – (a) a reference in the data to some other data in the system is either incomplete or inaccurate, or (b) data representing the same exact thing contains different values.
  • Duplicated data – a thing is unnecessarily represented more than once in the data.
  • Missing data – a required value in the data is incomplete, partially or wholly.
  • Wrong data – the values in the data inaccurately measure or describe the thing.

There are undoubtedly many other ways to express why certain data is not fit for its purpose[1], but I have found that these six types pretty well encompass what can be wrong with data and make it either wholly or partially (which can sometimes be worse than wholly) unfit for the purpose to which you want to put it.
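For readers who prefer code to prose, the six types can be captured as a simple enumeration, with each condition classified as a matter of incompleteness, inaccuracy, or both, matching Table 1 at the end of this article.  This is only a sketch in Python; the identifier names are mine, not an industry standard.

```python
from enum import Enum

class DataQualityCondition(Enum):
    """The six low-quality-data conditions described above."""
    INAPPROPRIATE = "metadata inaccurately describes the data"
    UNKNOWN       = "metadata describing the data is wholly or partially incomplete"
    INCONSISTENT  = "references or supposedly identical values disagree"
    DUPLICATED    = "a thing is unnecessarily represented more than once"
    MISSING       = "a required value is partially or wholly absent"
    WRONG         = "values inaccurately measure or describe the thing"

# Whether each condition is primarily incompleteness, inaccuracy, or both (per Table 1 below).
CONDITION_CLASS = {
    DataQualityCondition.INAPPROPRIATE: {"inaccuracy"},
    DataQualityCondition.UNKNOWN:       {"incompleteness"},
    DataQualityCondition.INCONSISTENT:  {"inaccuracy", "incompleteness"},
    DataQualityCondition.DUPLICATED:    {"inaccuracy"},
    DataQualityCondition.MISSING:       {"incompleteness"},
    DataQualityCondition.WRONG:         {"inaccuracy"},
}

print(DataQualityCondition.MISSING.name, CONDITION_CLASS[DataQualityCondition.MISSING])
```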

Each type of poor quality data needs to be addressed in a different way.  This is not to say that a complete methodology with its own framework, reference model, plan and templates is necessary for each type, but the root cause of and what can be done to correct each condition will vary.  Any unified approach to Data Quality (DQ) needs to take into account the different types of poor quality data.

Let’s take another look at the six types of poor data quality listed above.  You will notice that they include problems that can be found in both the content of the data and in the content of the metadata that describes the data.  In many instances there is far less of a difference between data and metadata than contemporary data architecture tends to acknowledge.  Essentially, metadata is data.  Data, as the first sentence of this article states, represents real-world things.  Data is an abstraction of certain aspects, or characteristics, of real-world things.  It was invented (long before computers) to communicate information about real-world things.  Metadata is an abstraction of certain aspects of data, primarily to turn the data into information so that consumers (human and otherwise) can extract “meaning” from it and use it to record transactions or in a decision-making context.  As more and more real-world things themselves become abstract, the distinction between data and metadata continues to blur.

Inappropriate data

Data is inappropriate for its intended use when the metadata associated with it inaccurately describes it.  This leads to a lack of confidence in the data by its consumers.  For example, if a data element is defined as ‘Customer Home Address’ and it contains the string “John Smith”, which is most likely a name, which do you trust: the metadata (‘Customer Home Address’), the data content (“John Smith”), or neither?  At least one of them is inappropriate for the purpose at hand.  If you assume the metadata is correct then the data is inappropriate.  If you assume the data is correct then the metadata is inappropriate.  Either way, the data as a whole, metadata and content together, is of low quality.

It is possible to measure the inappropriateness of the data, for example by counting the number of records that exhibit this anomaly.  But how do you know which is inappropriate, the metadata or the data content?  Unless you have a definitive answer or reliable heuristics from data profiling, or some source of domain knowledge outside of the DQ process itself, you do not know.

Any solution for the inappropriate data problem must focus on validating that the metadata is accurate, that it is appropriate metadata for the purpose to which it is being put, and that the data content then represents instances of that metadata.
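As a rough illustration of that kind of validation, the following Python sketch profiles a column for agreement between its metadata definition and its content.  The ‘customer_home_address’ element and the deliberately naive address pattern are assumptions made purely for the example.

```python
import re

# Naive assumption: a home address starts with a house number followed by words.
ADDRESS_PATTERN = re.compile(r"\d+\s+\w+.*")

records = [
    {"customer_home_address": "42 Elm Street, Springfield"},
    {"customer_home_address": "John Smith"},          # a name where an address belongs
    {"customer_home_address": "17 Rue de la Paix"},
]

# Count the records whose content does not look like what the metadata claims.
anomalies = [r for r in records
             if not ADDRESS_PATTERN.match(r["customer_home_address"])]

# The count measures how inappropriate the data is; deciding whether the metadata
# or the content is at fault still requires domain knowledge from outside the DQ process.
print(f"{len(anomalies)} of {len(records)} records do not fit the metadata definition")
```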

Unknown data

Unknown data differs from inappropriate data in that at least some of the necessary metadata is simply not there.  This seems like a metadata problem at first.  But looking at it from a perspective of what can be trusted, it may also be a data content problem.  If you trust that the metadata is complete and accurate, whether or not it looks like it is, then the problem may be a data content problem.  It might be that you have data that is irrelevant.  But without any other information, it is at best unknown data.  If you trust the data content then you may need to produce the metadata by reverse engineering it in order to describe it.

A solution for the unknown data problem focuses first on determining whether it is a metadata or a data content problem, and then on either discarding the irrelevant data or deriving accurate definitions for the data you have.
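Deriving definitions for undescribed data usually starts with simple value profiling.  The sketch below uses an intentionally crude set of candidate domains to show the general idea; a real reverse engineering effort would test far more patterns.

```python
from datetime import datetime

def infer_domain(values):
    """Guess a domain for an undescribed column by trying progressively looser parsers."""
    def all_parse(parser):
        try:
            return all(parser(v) is not None for v in values)
        except (ValueError, TypeError):
            return False

    if all_parse(int):
        return "integer"
    if all_parse(float):
        return "decimal"
    if all_parse(lambda v: datetime.strptime(v, "%Y-%m-%d")):
        return "date (YYYY-MM-DD)"
    return "text"

# A hypothetical undocumented column pulled from a source system.
mystery_column = ["2016-08-26", "2016-09-01", "2016-09-15"]
print(infer_domain(mystery_column))   # -> "date (YYYY-MM-DD)"
```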

Inconsistent data

Inconsistent data problems break down more cleanly between metadata and data content than either inappropriate data or unknown data.  With inconsistent data, either a reference in the metadata is inaccurate or incomplete, or both; or the data content differs significantly between what should be instances of the same data content.  Each of these calls for a different solution, but both are indicative of inconsistency in the body of data and thus a cause for distrust among the data’s consumers.

Metadata inconsistency, that is, references and definitions that are either incomplete or inaccurate, is the more insidious and difficult to find and correct.  This inconsistency is frequently caused by unreconciled semantic differences, the old “impedance mismatch” problem.  One party calls it a “Car” and the other party calls it an “Automobile”.  Now, we as humans know that in almost all cases cars and automobiles are the same type of thing.  We know this because we have agreed on the metadata definition of which both “Car” and “Automobile” are instances[2].  In this case they are the data content of a more abstract thing.  But wait a minute, “Car” and “Automobile” were examples of inconsistent metadata, weren’t they?  Well yes, in the original context they were used as such.  However, to reconcile them with each other we needed an even more abstract concept.  We needed metadata for our metadata, in other words, meta-metadata.  So the only reliable solution for inconsistent metadata is to raise the abstraction stakes one level and create consistent meta-metadata.  Then the problem becomes a data content consistency problem and no longer a metadata problem.
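A minimal sketch of that meta-metadata idea, assuming a hand-built mapping from local terms to a shared, more abstract concept (the concept names here are my own assumptions, not a standard), might look like this in Python:

```python
# The local terms each system uses become data content classified under a shared concept.
CANONICAL_CONCEPT = {
    "Car":        "Motor Vehicle",
    "Automobile": "Motor Vehicle",
    "Lorry":      "Motor Vehicle",
}

def consistent(term_a: str, term_b: str) -> bool:
    """Two local terms are consistent when they map to the same meta-metadata concept."""
    concept_a = CANONICAL_CONCEPT.get(term_a)
    concept_b = CANONICAL_CONCEPT.get(term_b)
    return concept_a is not None and concept_a == concept_b

print(consistent("Car", "Automobile"))   # True: different labels, same concept
```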

The solution to inconsistent data content, once the inconsistent metadata is no longer a problem, becomes one of identifying the “system of record”.  This is not a problem to be taken lightly, but it is more of a business problem than a technical problem.  Of course the system of record may exhibit data quality problems of any of the types mentioned here but, in general, if data quality issues can be kept to those that typically plague data content, they are far easier to solve than those plaguing metadata.

Duplicated data

Not all duplicated data is bad.  The operative phrase here is “unnecessarily duplicated”.  Data that is duplicated because it is de-normalized for performance or security purposes is not necessarily of low quality (though it may be of poor quality for other reasons).  As a matter of fact de-normalizing the data, though it is a risky undertaking, may make it more “fit for purpose” when the purpose is narrowly defined in terms of speed of access to specific data sets.  Also, from a network or enterprise point-of-view, data can be replicated, across multiple systems, typically for either performance, reporting or security purposes.

Duplicated data is bad when it is duplicated for no business or technical reason.  This can happen as a result of poor or no entity integrity,[3] which is primarily the result of a lack of appropriate uniqueness constraints on the source data.  It can also happen because of poor or lacking data stewardship.  When two (or more) records are exactly the same, selecting any one of them will suffice; it is when data records are not quite perfectly duplicated that things become more problematic.  This often happens because poor data stewardship has either allowed data content to be partially duplicated or allowed duplicates to age and fall out of sync with their master copy.
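A simple way to surface both kinds of duplication is to group records by what should be a unique natural key, as in the following sketch.  The ‘tax_id’ key and the sample rows are illustrative assumptions only.

```python
from collections import defaultdict

customers = [
    {"tax_id": "111-22-3333", "name": "ACME Corp",  "city": "Savannah"},
    {"tax_id": "111-22-3333", "name": "ACME Corp.", "city": "Savannah"},  # near-duplicate
    {"tax_id": "444-55-6666", "name": "Globex",     "city": "Atlanta"},
]

# Group by the natural key that a uniqueness constraint should have protected.
by_key = defaultdict(list)
for row in customers:
    by_key[row["tax_id"]].append(row)

for key, rows in by_key.items():
    if len(rows) > 1:
        exact = all(r == rows[0] for r in rows)
        print(f"{key}: {len(rows)} rows, "
              f"{'exact duplicates' if exact else 'conflicting duplicates - needs stewardship'}")
```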

Missing data

Missing data may be the most common type of data quality infraction.  It is quite often the result of data being allowed to enter a source system without constraints, domain integrity or default values.  Allowing nulls in data can increase its flexibility and make it easier to make progress early in a project, but how many times have you defined a column as nullable, saying to yourself “We’ll get back to this detail later, now we need to finish the interface”?  The interface gets finished and the domain integrity issue never gets addressed again.  After all, it will get corrected upstream.  Right?  Maybe not.

Missing data can also be one of the most time consuming poor data quality problems to fix.  This is primarily because you do not always know whether the data was supposed to be there or not.  In other words, is it missing data or is it just absent data?  That is, perhaps the correct value for a data instance is “unknown”.  But at least missing data is a data content error and not a metadata error.  If the metadata is missing then we have unknown data, and that’s another story (see above).  The best way to handle missing data is to agree upon a missing data strategy, publish the strategy and apply it consistently.
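One possible shape for such a strategy, sketched in Python with hypothetical field names, is to audit every record against an agreed list of required fields while treating an explicit “UNKNOWN” sentinel as a deliberate, documented value rather than as missing data:

```python
UNKNOWN = "UNKNOWN"                     # agreed sentinel for "genuinely not knowable"
REQUIRED_FIELDS = {"order_id", "quantity", "ship_date"}

def audit_missing(record: dict):
    """Return the required fields that are absent, null, or empty in this record."""
    problems = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if value is None or value == "":
            problems.append(field)
    return problems

print(audit_missing({"order_id": "A-1", "quantity": 10}))                        # ['ship_date']
print(audit_missing({"order_id": "A-2", "quantity": 5, "ship_date": UNKNOWN}))   # []
```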

Wrong data

Wrong data is just that: data that has the wrong content.  The metadata may be accurate and complete but the data content, that is, the values in the data itself, is inaccurate.  For example, if the number of items sold was actually 20 but the data recorded is 10, it is wrong.  This is probably the most difficult data quality issue to correct, or even to discover.  It requires knowledge usually outside of what is typically considered the art and science of data quality.  However, data profiling can help.  If enough data is profiled that the discovered patterns in the content are sufficiently trusted, then outliers encountered during the inspection or transformation process can throw an exception to alert users that something is wrong.  Whether that something is caused by a system anomaly, by negligence or by a lack of process control can be irrelevant, in and of itself, to the DQ process.  A statistical quality control process can be used to spot, and in many cases correct, wrong data.
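As a toy example of that statistical approach, the following sketch flags new values that fall well outside the pattern established by previously profiled, trusted data.  The three-standard-deviation threshold is a common rule of thumb, not a prescription, and the sample quantities are made up for illustration.

```python
import statistics

# Values observed during a prior, trusted profiling run.
profiled_quantities = [18, 21, 19, 22, 20, 19, 21, 20, 18, 22]
mean = statistics.mean(profiled_quantities)
stdev = statistics.stdev(profiled_quantities)

def looks_wrong(value: float, threshold: float = 3.0) -> bool:
    """Flag a new value as a possible wrong-data case when it is a statistical outlier."""
    return abs(value - mean) > threshold * stdev

for observed in (20, 10, 200):
    if looks_wrong(observed):
        print(f"Possible wrong data: quantity {observed} is an outlier")
```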

Also important in preventing wrong data is a sound data auditing function and even a good old fashioned check and balance system whereby data is entered into the staging area from more than one source.  The subject data is checked against what should be the same data either on another channel, or in a source outside the auspices of the project.  Only when data from all the sources match, or can be reconciled, is a single “audited” copy of the data content persisted.
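The check and balance idea can be sketched as a simple reconciliation between two independently loaded channels, where only values that match within an agreed tolerance are persisted as the audited copy.  The channel names, amounts and tolerance below are assumptions made for illustration.

```python
# The same business facts loaded from two independent channels.
orders_channel_a = {"A-1": 100.00, "A-2": 250.00, "A-3": 75.50}
orders_channel_b = {"A-1": 100.00, "A-2": 255.00, "A-3": 75.50}

TOLERANCE = 0.01
audited = {}
for order_id, amount_a in orders_channel_a.items():
    amount_b = orders_channel_b.get(order_id)
    if amount_b is not None and abs(amount_a - amount_b) <= TOLERANCE:
        audited[order_id] = amount_a          # persist only the reconciled, audited copy
    else:
        print(f"Order {order_id} does not reconcile: {amount_a} vs {amount_b}")

print(f"Audited records: {audited}")
```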

Conclusion

As you can see there are subtle differences between the types of poor quality data.  In many cases, if you can profile a sufficient sample of the data for a sufficient amount of time, many of the types of poor data quality can be mitigated.  However, data profiling is a relatively new technique within data transformation processes, does not have much mind share outside the sometimes cloistered world of data professionals, and is often difficult to sell to management as a necessary component of a Total Data Quality program.  But if you believe, as I do, that understanding a problem is at least fifty percent of the battle to solve it, then including a data profiling step in your DQ efforts is always worthwhile.

Table 1 below summarizes the six conditions described above that can produce data of low quality, including whether each condition primarily affects the metadata, the data content or both, and the appropriate strategy for mitigating each condition.

| Data Condition | Condition Description | Type of Condition | Metadata or Data Content Affected? | Mitigation Strategy |
| --- | --- | --- | --- | --- |
| Inappropriate data | The metadata inaccurately describes the data. | Inaccuracy | Both metadata and data content are affected. | Profile data for appropriateness between metadata and content, and use domain knowledge to validate. |
| Unknown data | The metadata describing the data is wholly or partially incomplete. | Incompleteness | Both metadata and data content are affected. | If metadata is trusted, discard irrelevant data content; if data content is trusted, derive new metadata. |
| Inconsistent data (a) | A reference in the data to some other data in the system is either incomplete or inaccurate. | Inaccuracy or Incompleteness | Both metadata and data content are affected. | Create consistent meta-metadata and then treat the metadata as content data. |
| Inconsistent data (b) | Data representing the same exact thing contains different values. | Inaccuracy | Data content is affected. | Identify the system of record on a metadata element-by-element basis. |
| Duplicated data | A thing is unnecessarily represented more than once in the data. | Inaccuracy | Both metadata and data content are affected. | Rationalize unnecessary duplication of both metadata and content data. Improve data stewardship. |
| Missing data | A required value in the data is incomplete, partially or wholly. | Incompleteness | Data content is affected. | Agree upon a missing data strategy, publish the strategy and apply it consistently. |
| Wrong data | The values in the data inaccurately measure or describe the thing. | Inaccuracy | Data content is affected. | Create a data auditing function to reconcile content against other sources. |

Table 1. Summary of low quality data conditions.


[1] For another taxonomy of data quality conditions see the views of Larry English at http://www.infoimpact.com/.

[2] This is if we speak the same language of course.

[3] There are three types of data integrity: Entity integrity, Referential integrity and Domain integrity, all of which, when ignored, contribute to poor data quality.