Posts Tagged ‘Data Warehousing’

Data Quality and Real World Things

October 15, 2010

This article started out as a white paper for a client on how to conceptualize data quality, or the lack thereof.  At the time I could not find a taxonomy of the characteristics that make data quality poor.  It always seemed like something that “you just knew when you saw it,” but no objective standards that I knew of existed for what constituted low quality data.  I found there were two classes of poor quality data: incompleteness and inaccuracy.  My thinking was that if I could define the ways data can deviate from “good” or “acceptable” quality, then we could construct solutions to correct each of those conditions.

The article “Data Quality and Real World Things” was first published in November 2006 on a website that is no longer available.


In general, data represents real-world things and is described by metadata.  It needs to be “fit for its purpose”.  Data that is not is considered invalid or of low quality and needs to be either cleansed or discarded.  Uncleansed data can be seen as incomplete, inaccurate, or both.  Several conditions, each describable in terms of incompleteness or inaccuracy or both, constitute low quality data.  These conditions include:

  • Inappropriate data – the metadata inaccurately describes the data.
  • Unknown data – the metadata describing the data is wholly or partially incomplete.
  • Inconsistent data – (a) a reference in the data to some other data in the system is either incomplete or inaccurate, or (b) data representing the same exact thing contains different values.
  • Duplicated data – a thing is unnecessarily represented more than once in the data.
  • Missing data – a required value in the data is incomplete, partially or wholly.
  • Wrong data – the values in the data inaccurately measure or describe the thing.
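
These six conditions can be sketched as a small taxonomy.  The condition names come from the list above; the mapping of each to incompleteness, inaccuracy, or both follows the definitions given here.

```python
# The six low-quality data conditions, mapped to the class(es) of
# poor quality (incompleteness, inaccuracy) each one reflects.
CONDITIONS = {
    "inappropriate": {"inaccuracy"},
    "unknown":       {"incompleteness"},
    "inconsistent":  {"inaccuracy", "incompleteness"},
    "duplicated":    {"inaccuracy"},
    "missing":       {"incompleteness"},
    "wrong":         {"inaccuracy"},
}

def conditions_involving(quality_class):
    """Return, alphabetically, the conditions involving a quality class."""
    return sorted(name for name, classes in CONDITIONS.items()
                  if quality_class in classes)
```

For example, `conditions_involving("incompleteness")` returns the inconsistent, missing, and unknown conditions.  This is only an organizing sketch, not a data quality tool.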

There are undoubtedly many other ways to express why certain data is not fit for its purpose[1], but I have found that these six types pretty well encompass what can be wrong with data and make it either wholly or partially (which can sometimes be worse than wholly) unfit for the purpose to which you want to put it.

Each type of poor quality data needs to be addressed in a different way.  This is not to say that a complete methodology with its own framework, reference model, plan and templates is necessary for each type, but the root cause of and what can be done to correct each condition will vary.  Any unified approach to Data Quality (DQ) needs to take into account the different types of poor quality data.

Let’s take another look at the six types of poor data quality listed above.  You will notice that they include problems that can be found in both the content of the data and in the content of the metadata that describes the data.  In many instances there is far less of a difference between data and metadata than contemporary data architecture tends to acknowledge.  Essentially, metadata is data.  Data, as the first sentence of this article states, represents real-world things.  Data is an abstraction of certain aspects, or characteristics, of real-world things.  It was invented (long before computers) to communicate information about real-world things.  Metadata is an abstraction of certain aspects of data, primarily to turn the data into information so that consumers (human and otherwise) can extract “meaning” from it and use it to record transactions or in a decision-making context.  As more and more real-world things themselves become abstract, the distinction between data and metadata blurs.

Inappropriate data

Data is inappropriate for its intended use when the metadata associated with it inaccurately describes it.  This leads to a lack of confidence in the data among its consumers. For example, if a data element is defined as ‘Customer Home Address’ but contains the string “John Smith”, which is most likely a name, what do you trust?  The metadata (‘Customer Home Address’), the data content (“John Smith”), or neither?  At least one of them is inappropriate for the purpose at hand.  If you assume the metadata is correct, then the data is inappropriate.  If you assume the data is correct, then the metadata is inappropriate.  Either way, the data as a whole, metadata and data content together, is of low quality.

It is possible to measure the inappropriateness of the data, for example, by counting the number of records that exhibit this anomaly.  But how do you know which is inappropriate, the metadata or the data content?  Unless you have a definitive answer, or reliable heuristics from data profiling or some source of domain knowledge outside of the DQ process itself, you do not know.
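
One hedged sketch of such a measurement, continuing the ‘Customer Home Address’ example: the field name and the digit-based heuristic below are illustrative assumptions, not a general-purpose address detector.

```python
import re

# Hypothetical heuristic: a street address usually contains a digit,
# while a bare personal name ("John Smith") usually does not.
LOOKS_LIKE_ADDRESS = re.compile(r"\d")

def count_inappropriate(records, field="customer_home_address"):
    """Count records whose value contradicts the metadata description."""
    return sum(1 for r in records
               if r.get(field) and not LOOKS_LIKE_ADDRESS.search(r[field]))

rows = [
    {"customer_home_address": "12 Elm Street"},
    {"customer_home_address": "John Smith"},   # a name in an address field
    {"customer_home_address": "4 Main St"},
]
```

This measures *how many* records are anomalous; deciding *which side* (metadata or content) is wrong still requires the domain knowledge described above.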

Any solution to the inappropriate data problem must focus on validating that the metadata is accurate, that it is appropriate metadata for the purpose to which it is being put, and that the data content then represents instances of that metadata.

Unknown data

Unknown data differs from inappropriate data in that at least some of the necessary metadata is simply not there.  This seems like a metadata problem at first.  But looking at it from a perspective of what can be trusted, it may also be a data content problem.  If you trust that the metadata is complete and accurate, whether or not it looks like it is, then the problem may be a data content problem.  It might be that you have data that is irrelevant.  But without any other information, it is at best unknown data.  If you trust the data content then you may need to produce the metadata by reverse engineering it in order to describe it.

A solution for solving the unknown data problem focuses first on whether it’s a metadata or data content problem and then either discarding the irrelevant data or deriving accurate definitions for the data you have.

Inconsistent data

Inconsistent data problems break down more cleanly between metadata and data content than either inappropriate data or unknown data.  With inconsistent data, either a reference in the metadata is incomplete or inaccurate, or both; or the data content significantly differs between what should be instances of the same data content.  Each case calls for a different solution, but both are indicative of inconsistency in the body of data and thus a cause for distrust among the data’s consumers.

Metadata inconsistency, that is, references and definitions that are either incomplete or inaccurate, is the more insidious and difficult to find and correct.  This inconsistency is frequently caused by unreconciled semantic differences, the old “impedance mismatch” problem. One party calls it a “Car” and the other party calls it an “Automobile”.  Now, we as humans know that in almost all cases cars and automobiles are the same type of thing.  We know this because we have agreed on the metadata definition of which both “Car” and “Automobile” are instances[2].  In this case they are the data content of a more abstract thing.  But wait a minute, “Car” and “Automobile” were examples of inconsistent metadata weren’t they?  Well yes, in the original context they were used as such.  However to reconcile them with each other we needed an even more abstract concept.  We needed metadata for our metadata, in other words, meta-metadata.  So the only reliable solution for inconsistent metadata is to raise the abstraction stakes one level and create consistent meta-metadata.  Then the problem becomes a data content consistency problem and no longer a metadata problem.

The solution to inconsistent data content, once the inconsistent metadata is no longer a problem, becomes one of identifying the “system of record”.  This is not a problem to be taken lightly, but it is more of a business problem than technical problem. Of course the system of record may exhibit data quality problems of any of the types mentioned here but, in general, if data quality issues can be kept to those problems that typically plague data content they are far easier to solve than those plaguing metadata.
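
Once a system of record is agreed upon, applying it can be sketched as a simple precedence rule.  The system names, and the idea that billing outranks CRM, are hypothetical; agreeing on the actual precedence is the business problem described above.

```python
# Hypothetical precedence: the billing system is the agreed system of
# record for this data element; CRM and marketing are fallbacks.
PRECEDENCE = ["billing", "crm", "marketing"]

def resolve(values_by_system):
    """Pick the value from the highest-precedence system that has one."""
    for system in PRECEDENCE:
        if values_by_system.get(system):
            return values_by_system[system]
    return None

conflicting = {"crm": "12 Elm St", "marketing": "12 Elm Street Apt 2"}
```

Here `resolve(conflicting)` falls through to the CRM value because the system of record has none.  In practice precedence may differ per metadata element, as the table at the end of this article notes.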

Duplicated data

Not all duplicated data is bad.  The operative phrase here is “unnecessarily duplicated”.  Data that is duplicated because it is de-normalized for performance or security purposes is not necessarily of low quality (though it may be of poor quality for other reasons).  As a matter of fact de-normalizing the data, though it is a risky undertaking, may make it more “fit for purpose” when the purpose is narrowly defined in terms of speed of access to specific data sets.  Also, from a network or enterprise point-of-view, data can be replicated, across multiple systems, typically for either performance, reporting or security purposes.

Duplicated data is bad when it is duplicated for no business or technical reason.  This can happen as a result of poor or no entity integrity[3], which is primarily the result of a lack of appropriate uniqueness constraints on the source data.  It can also happen because of poor or lacking data stewardship.  When two (or more) records are exactly the same, selecting any one of them will suffice; it is when data records are not quite perfectly duplicated that things become more problematic.  This often happens because poor data stewardship has either allowed data content to be partially duplicated or allowed the duplicates to age and fall out of sync with their master copy.
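
A minimal sketch of finding both exact and “not quite perfect” duplicates is to group records on a normalized key (the field names and normalization rules here are illustrative):

```python
def find_near_duplicates(records, key_fields):
    """Group records by a normalized key; groups of size > 1 are
    candidate duplicates, exact or nearly so."""
    groups = {}
    for r in records:
        # Normalize: stringify, trim whitespace, lowercase.
        key = tuple(str(r.get(f, "")).strip().lower() for f in key_fields)
        groups.setdefault(key, []).append(r)
    return [g for g in groups.values() if len(g) > 1]

customers = [
    {"name": "John Smith", "city": "Springfield"},
    {"name": "john smith ", "city": "Springfield"},  # near duplicate
    {"name": "Jane Doe", "city": "Shelbyville"},
]
```

Real matching engines use fuzzier comparisons (phonetic codes, edit distance), but even this level of normalization surfaces duplicates that exact equality would miss.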

Missing data

Missing data may be the most common type of data quality infraction.  It is quite often the result of data being allowed to enter a source system without constraints, domain integrity or default values.  Allowing nulls in data can increase its flexibility and make it easier to make progress early in a project, but how many times have you defined a column as nullable, saying to yourself, “We’ll get back to this detail later, now we need to finish the interface”?  The interface gets finished and the domain integrity issue never gets addressed again.  After all, it will get corrected upstream.  Right?  Maybe not.

Missing data can also be one of the most time-consuming poor data quality problems to fix.  This is primarily because you do not always know whether the data was supposed to be there or not.  In other words, is it missing data or just absent data?  That is, perhaps the correct value for a data instance is “unknown”.  But at least missing data is a data content error and not a metadata error.  If the metadata is missing then we have unknown data, and that’s another story (see above).  The best way to handle missing data is to agree upon a missing data strategy, publish the strategy, and apply it consistently.
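
A published missing data strategy might be codified like the following sketch, where the required fields and the “UNKNOWN” sentinel for optional fields are assumptions that would have been agreed upon beforehand:

```python
# Agreed strategy (hypothetical): order_id and quantity are required;
# optional fields get a published sentinel rather than a silent null.
REQUIRED = {"order_id", "quantity"}
DEFAULTS = {"region": "UNKNOWN"}

def apply_missing_data_strategy(record):
    """Return a cleaned copy of the record plus a list of truly
    missing (required but absent) fields."""
    missing = sorted(f for f in REQUIRED if record.get(f) in (None, ""))
    cleaned = dict(record)
    for field, default in DEFAULTS.items():
        cleaned.setdefault(field, default)
    return cleaned, missing
```

The point of the sketch is the distinction drawn above: a required field with no value is *missing*, while an optional field consistently filled with the agreed sentinel is merely *absent*.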

Wrong data

Wrong data is just that: data that has the wrong content.  The metadata may be accurate and complete, but the data content, that is, the values in the data itself, is inaccurate.  For example, if the number of items sold was actually 20 but the data recorded is 10, the data is wrong.  This is probably the most difficult data quality issue to correct, or even to discover.  It requires knowledge usually outside of what is typically considered the art and science of data quality.  However, data profiling can help.  If enough data is profiled, and the discovered patterns in the content are sufficiently trusted, then outliers encountered during the inspection or transformation process can throw an exception to alert users that something is wrong.  Whether that something is caused by a system anomaly, by negligence, or by a lack of process control can be irrelevant, in and of itself, to the DQ process.  A statistical quality control process can be used to spot, and in many cases correct, wrong data.
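
A hedged sketch of that statistical control idea: profile a trusted sample first, then flag incoming values that fall outside the profiled range.  The three-sigma threshold and the sample values are illustrative, not a recommendation.

```python
import statistics

def flag_outliers(baseline, incoming, sigmas=3.0):
    """Flag incoming values more than `sigmas` standard deviations
    from the mean of a previously profiled, trusted baseline."""
    mean = statistics.fmean(baseline)
    stdev = statistics.stdev(baseline)
    return [v for v in incoming if abs(v - mean) > sigmas * stdev]

# Hypothetical profiled sample of 'items sold' per order.
profiled = [10, 12, 11, 9, 10, 11, 10, 12, 11, 10]
```

Note that the baseline is kept separate from the values being inspected; profiling the suspect batch itself would let a large error inflate the mean and standard deviation and hide itself.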

Also important in preventing wrong data is a sound data auditing function and even a good old fashioned check and balance system whereby data is entered into the staging area from more than one source.  The subject data is checked against what should be the same data either on another channel, or in a source outside the auspices of the project.  Only when data from all the sources match, or can be reconciled, is a single “audited” copy of the data content persisted.
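
The check-and-balance idea might look like this sketch, in which two hypothetical feeds carrying the same orders are reconciled and only matching records are persisted as “audited”:

```python
def audit(primary, secondary, key="order_id", field="amount"):
    """Persist only records whose values match across both channels;
    return the mismatches for manual reconciliation."""
    by_key = {r[key]: r for r in secondary}
    audited, mismatched = [], []
    for r in primary:
        other = by_key.get(r[key])
        if other is not None and other[field] == r[field]:
            audited.append(r)
        else:
            mismatched.append(r)
    return audited, mismatched

feed_a = [{"order_id": 1, "amount": 100}, {"order_id": 2, "amount": 55}]
feed_b = [{"order_id": 1, "amount": 100}, {"order_id": 2, "amount": 50}]
```

The field and feed names are illustrative; the shape of the process, hold back until sources agree or are reconciled, is the point.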


As you can see, there are subtle differences between the types of poor quality data.  In many cases, if you can profile a sufficient sample of the data for a sufficient amount of time, many of the types of poor data quality can be mitigated.  However, data profiling is a relatively new technique within data transformation processes, lacks mind share outside the sometimes cloistered world of data professionals, and is often difficult to sell to management as a necessary component of a Total Data Quality program.  But if you believe, as I do, that understanding a problem is at least fifty percent of the battle to solve it, then including a data profiling step in your DQ efforts is always worthwhile.

Table 1 below summarizes the six conditions described above that can produce data of low quality, including whether each condition primarily affects the metadata, the data content, or both, and the appropriate strategy for mitigating it.

Inappropriate data
  Description: The metadata inaccurately describes the data.
  Type of Condition: Inaccuracy
  Affected: Both metadata and data content.
  Mitigation Strategy: Profile data for appropriateness between metadata and content, and use domain knowledge to validate.

Unknown data
  Description: The metadata describing the data is wholly or partially incomplete.
  Type of Condition: Incompleteness
  Affected: Both metadata and data content.
  Mitigation Strategy: If metadata is trusted, discard irrelevant data content; if data content is trusted, derive new metadata.

Inconsistent data (a)
  Description: A reference in the data to some other data in the system is either incomplete or inaccurate.
  Type of Condition: Inaccuracy or Incompleteness
  Affected: Both metadata and data content.
  Mitigation Strategy: Create consistent meta-metadata and then treat the metadata as content data.

Inconsistent data (b)
  Description: Data representing the same exact thing contains different values.
  Type of Condition: Inaccuracy
  Affected: Data content.
  Mitigation Strategy: Identify the system of record on an element-by-element basis.

Duplicated data
  Description: A thing is unnecessarily represented more than once in the data.
  Type of Condition: Inaccuracy
  Affected: Both metadata and data content.
  Mitigation Strategy: Rationalize unnecessary duplication of both metadata and content data; improve data stewardship.

Missing data
  Description: A required value in the data is incomplete, partially or wholly.
  Type of Condition: Incompleteness
  Affected: Data content.
  Mitigation Strategy: Agree upon a missing data strategy, publish the strategy, and apply it consistently.

Wrong data
  Description: The values in the data inaccurately measure or describe the thing.
  Type of Condition: Inaccuracy
  Affected: Data content.
  Mitigation Strategy: Create a data auditing function to reconcile content against other sources.

Table 1. Summary of low quality data conditions.

[1] For another taxonomy of data quality conditions see the views of Larry English at

[2] This is if we speak the same language of course.

[3] There are three types of data integrity: Entity integrity, Referential integrity and Domain integrity, all of which, when ignored, contribute to poor data quality.

Six Basic Interrogatives – Part 3, 6BI and Business Performance Management

August 29, 2010

In this concluding part of the article about 6BI, I will develop its analysis potential further, show how it compares with architecture frameworks, and cover how it can be utilized for traceability back to the enterprise’s mission statements in order to lend credibility to business performance management (BPM) reporting.

This part was first published in February 2005 under the title “Six Basic Interrogatives – Part 3, 6BI and Business Performance Management” on a website that no longer exists.

Six Basic Interrogatives – Part 3

In two previous articles I described the Six Basic Interrogatives (6BI) Design Framework. In the first article, 6BI was described as an approach to designing databases that support business intelligence applications. In the second article I gave a very simplified example of a scenario, showed how instances of the business object categories can be identified, and introduced my color scheme. In this article I will develop the analysis potential of 6BI further, show how it compares with the Zachman Framework and other architecture frameworks, and cover how 6BI can be utilized for traceability back to the enterprise’s mission statements in order to lend credibility to BPM reporting.

The 6BI Design Framework is a pattern for designing data warehouse dimensions, but it is more than that. It abstracts all dimensions of business intelligence into the six categories of questions that a BI application user might ask. Combinations of questions from these six categories make up the requirements for the models and reports needed by management, analysts, and others within an organization to support decision making. As such, 6BI is an approach that can be used to gather and analyze requirements, and perform gap analysis between “as is” and “to be” models, as well as design the solutions to meet those requirements and close the gaps.

The six categories of the 6BI Design Framework align with the columns of the Zachman Framework for Enterprise Architecture. The Zachman Framework has been widely accepted throughout private industry and has been adopted as a basic building block for major enterprise architecture initiatives within the U.S. Federal Government. Two of these initiatives, in response to the Clinger-Cohen Act of 1996, are the Federal Enterprise Architecture Framework (FEAF) and the Department of Defense Architecture Framework (DoDAF)[1].

The first three rows of the Zachman Framework are Scope, Business Model, and System Model, and represent the Planner’s view, Owner’s view, and Designer’s view of the enterprise respectively[2]. Though originally developed for private sector enterprise information systems, the framework can be (and is) applied to the DoDAF as well, where Scope aligns with Mission Statement, Business Model with Operational Architecture (OA) View, and System Model with Systems Architecture (SA) View, see Figure 1 below. I am going to show how the 6BI Design Framework aligns with both the Zachman Framework and the DoDAF, through the six basic interrogatives. This alignment shows how 6BI can be used to analyze requirements and design solutions as part of both operational architectures and systems architectures that are in synch with an enterprise’s mission.

Figure 1. Zachman Framework and DoDAF alignment.

View Zachman Framework DoDAF
Planner Scope Mission Statement
Owner Business Model Operational Architecture View
Designer System Model System Architecture View

The applicability of 6BI to requirements analysis and solution design is especially true in the area of Business Performance Management (BPM). It is true whether the business is a for-profit enterprise, a not-for-profit organization, or a government agency. The six basic interrogatives and their corresponding business object categories define the context for aligning an enterprise’s Mission Statement with its OA and subsequently with its SA.

Figure 2 shows the six basic interrogatives and the business object categories to which each corresponds. Unlike in the Zachman Framework, however, the order of the interrogatives is significant. It has been my experience that beginning requirements activities by identifying the participants (animate and inanimate) that make things happen provides a sound starting point, so I typically begin with the “Who” interrogative. It is not absolutely essential that you identify all of the classes of objects in each category the first time through the framework. The initial pass is a sorting exercise[3] to categorize all of the data elements, classes, and objects and begin to discover common attributes and characteristics among and between them. It is these common elements that enable reuse at a level of abstraction above the purely technical.

Normalization across domains in shared spaces is also enhanced in this way. This means that 6BI gives you a high-level set of classifications in which to group definitions from different systems that describe common object classes. Typically I create a spreadsheet for each business object category (Parties, Products, Activities, etc.), color code[4] them so I can see at a glance which ones I am dealing with at any point, and add or remove rows as I go along. A spreadsheet can generally load most textual data with just a little manipulation and gives you a convenient way to sort and shuffle the entries. This is a working document and usually does not become a permanent artifact of the project.
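
The first-pass sorting exercise can itself be sketched programmatically. The keyword hints below are purely illustrative; in practice the sort is driven by domain knowledge and refined by hand in the spreadsheet.

```python
# Hypothetical keyword hints for a first-pass sort of data element
# names into the six business object categories.
CATEGORY_HINTS = {
    "Parties":    ["customer", "employee", "vendor", "agent"],
    "Products":   ["product", "item", "service", "sku"],
    "Activities": ["order", "sale", "shipment", "payment"],
    "Events":     ["event", "contact", "call", "delivery"],
    "Locations":  ["address", "city", "region", "site"],
    "Motivators": ["goal", "quota", "target", "campaign"],
}

def first_pass_sort(element_names):
    """Bucket data element names by the first matching category hint."""
    buckets = {cat: [] for cat in CATEGORY_HINTS}
    buckets["Unsorted"] = []
    for name in element_names:
        lowered = name.lower()
        for cat, hints in CATEGORY_HINTS.items():
            if any(h in lowered for h in hints):
                buckets[cat].append(name)
                break
        else:
            buckets["Unsorted"].append(name)
    return buckets
```

Anything left in the “Unsorted” bucket is exactly the material that needs a human pass, which is why the working spreadsheet, not the script, is the real artifact.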

Figure 2. Six Basic Interrogatives and Business Object Categories.

This is also the point at which you begin to identify the types of numbers, that is the measures and facts, that would most likely be described by the columns (or fields) of the tables you have categorized. For Parties, for example, depending on the business context, you might want to capture and aggregate numbers that derive from activities in which the parties participated, events that are responded to or initiated by the parties, locations where parties operate, or motivators that parties plan or react to. These, of course, are simple two-dimensional analyses but they form the building blocks for more complex queries which make use of more detailed or finer grained levels within each business object category in increasingly complex combinations. Remember the idea here is to provide a starting point for analyzing the source data.
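
A simple two-dimensional analysis of the kind described, rolling a measure up a single business object category, might look like this sketch (the fact rows and dimension names are hypothetical):

```python
# Hypothetical fact rows: quantities from activities in which
# parties participated, keyed by party and location.
facts = [
    {"party": "Acme",      "location": "East", "quantity": 5},
    {"party": "Acme",      "location": "West", "quantity": 3},
    {"party": "Bright Co", "location": "East", "quantity": 7},
]

def aggregate(rows, dimension, measure="quantity"):
    """Roll a measure up one dimension: the building block for the
    more complex multi-dimensional combinations described above."""
    totals = {}
    for row in rows:
        totals[row[dimension]] = totals.get(row[dimension], 0) + row[measure]
    return totals
```

Aggregating the same facts by `"party"` and then by `"location"` shows how one set of measures supports analysis along each business object category independently.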

In the 6BI Design Framework, Parties and Products have no direct association with each other. It is only through the other four business object categories (Activities, Events, Locations, and Motivators) that Parties and Products are associated. The left side of Figure 3 shows the relationship of Parties to Activities, Events, Locations and Motivators. The right side shows the relationship of Products to the same four business object categories. This makes sense because only when people engage in activities, exchanging and manipulating things, do frequencies, quantities, monetary amounts, metrics, and other measurements get created.

Figure 3. The indirect association of Parties to Products.

Zachman and DoDAF

Figure 4 is a view of how the corresponding concepts of the Zachman Framework and the DoDAF align with each other and with the six basic interrogatives on which the 6BI Design Framework is based.

Figure 4. Six Basic Interrogatives, Zachman Framework and DoDAF.

Who produces the data we are going to use to measure performance?
  Zachman: Business Model – Organizational Units and Work Products | DoDAF: Operational Architecture – Roles and their Means to Perform Activities
  Zachman: System Model – Roles and Deliverables | DoDAF: Systems Architecture – Systems and Sub-systems

What is being manipulated or exchanged to produce the measurable performance?
  Zachman: Business Model – Business Entities and Relationships | DoDAF: Operational Architecture – Information Items Input and Output through Activities
  Zachman: System Model – Data Entities and Relationships | DoDAF: Systems Architecture – Data Elements Input and Output through Functions

How are the data values that measure performance produced?
  Zachman: Business Model – Business Processes and their Resource Inputs and Outputs | DoDAF: Operational Architecture – Node-to-node Information Exchange Activities
  Zachman: System Model – Application Functions and User Views of Input and Output | DoDAF: Systems Architecture – System-to-system Data Transaction Functions

When does the activity take place and for how long is the performance measured?
  Zachman: Business Model – Business Events and Business Cycles | DoDAF: Operational Architecture – Timing and Sequencing of Activities
  Zachman: System Model – System Events and Processing Cycles | DoDAF: Systems Architecture – Timing and Sequencing of System Events

Where does the activity used to measure performance take place?
  Zachman: Business Model – Business Locations and Linkages | DoDAF: Operational Architecture – Operational Nodes
  Zachman: System Model – System Components and Links | DoDAF: Systems Architecture – System Nodes

Why do we believe the data we are using actually measure performance?
  Zachman: Business Model – Business Objectives and Strategies | DoDAF: Operational Architecture – Concept of Operations
  Zachman: System Model – Structural and Action Assertions | DoDAF: Systems Architecture – Design Strategy and Rules

Though rarely are all six interrogatives expressed in a Mission Statement, they are expressed in the OA and the SA. It is at the operational and systems level that BPM report consumers, decision makers at different levels of an organization, need to understand the numbers they are looking at. The types of questions that need to be answered include:

Who is responsible (persons, organizations, systems, etc.) for the numbers? Who are the parties (consumers, payers, providers, sellers, agents, etc.) among whom the exchange is made? Who else (competitors, partners, etc.) participates in this exchange? What is their affinity (participant, observer, etc.) to the market place? Who (government agency, media, etc.) influences the market place in which we exchange products? Who (sellers, producers, managers, others, etc.) need to, or want to, know this information? What is their affiliation (internal party or external party) with me, the first party?

What products (goods or services) are exchanged (bought, sold, produced or consumed) that produce the numbers I am looking at? What type of work effort participated in the exchange? What events influence the exchange? What are the data elements (out of all that are collected and stored) that give me the most relevant information, at the lowest cost, in the least amount of time?

How do the activities (sales, purchases, production runs, case work, intelligence gathering, analytical activity, etc.) engaged in by parties produce the results? How does the exchange take place? What are the functions and activities engaged in? How do activities produce events that influence the numbers? How do events spawn activities that influence the numbers?

When do the events that influence the numbers take place? Which events (contact, proposal, closing, order, delivery, support call, payment, etc.) influence the numbers? Which do not? At what points in time do the influencing events take place? What events take place at given points in time? How long is the time duration of an event? How long between events? What events outside of our control correlate with the numbers? What events cause a particular event? What particular event causes other events?

Where (geographic locations, virtual addresses, etc.) did the events or activities take place? Where is the location of the events that produce the numbers? Where is the placement of the production assets? Where are the participants? What is the spatial relationship between nodes? Between nodes and other dimensions?

Why are particular pieces of information relevant? Why do particular combinations of events and activities produce the results they do? How are ends associated with means? What are the motivators (strategy, tactics, programs, initiatives, rules, etc.) that correlate with the resultant numbers?

Each of the six basic interrogatives addresses an independent category of characteristics about the problem space, yet the categories are inter-related. In fact there are specific associations between each pair of business object categories. These associations can be expressed as a set of one-to-many relationships (see Figure 3) where the “one” side is a specific business object category (e.g. Parties) and the “many” side consists of one or more of the other business object categories that depend on it (i.e. Activities, Events, Locations, and/or Motivators), but not Products. In a 6BI design, Parties and Products are associated only through a linkage with one or more of the other dimensions.
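
The indirect association can be sketched as a schema. The table and column names below are illustrative, not from the original article; the point is that Parties and Products are only ever joined through one of the other categories (here, an Activities fact table).

```python
import sqlite3

# Minimal star-schema sketch: no direct Parties-to-Products link;
# an Activities fact table carries both foreign keys and the measure.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE parties  (party_id   INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE products (product_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE activities (
        activity_id INTEGER PRIMARY KEY,
        party_id    INTEGER REFERENCES parties(party_id),
        product_id  INTEGER REFERENCES products(product_id),
        quantity    INTEGER
    );
    INSERT INTO parties  VALUES (1, 'Acme');
    INSERT INTO products VALUES (1, 'Widget');
    INSERT INTO activities VALUES (1, 1, 1, 20);
""")

# Party and Product meet only by joining through the activity.
row = con.execute("""
    SELECT p.name, pr.name, a.quantity
    FROM activities a
    JOIN parties  p  ON p.party_id   = a.party_id
    JOIN products pr ON pr.product_id = a.product_id
""").fetchone()
```

Events, Locations, and Motivators would link Parties and Products the same way, as additional keys on the fact rows.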


6BI is also an approach to designing and building solutions that can be traced back to the characteristics and scope of the problem domain and the mission of the organization. If the use cases and requirements are based on the six basic interrogatives, and the classes, workflows, and events that constitute the design of the component parts of the solution (i.e. the System Architecture) are constructed to address those interrogatives, a clear path from the problem space to the solution space can be described. This path takes guidance from the context of the organization’s scope and mission as also defined in terms of the six basic interrogatives. This can be a basis for validity checking the business use cases used to trace business scenarios.

In other words, the characteristics of the business and the problems that it needs to solve define how the solution needs to be constructed. Each category of the problem is addressed by specific capabilities and components of the solution. Each problem category, and combination of the categories, can be scripted in testing scenarios to exercise those features of the solution designed and constructed to address those specific categories and combinations. Requirements compliance can be traced throughout the lifecycle of a project.

When new requirements need to be met, the analyst asks herself, in the discovery activities of the requirements workflow, “Have I accounted for all six of the basic interrogative categories?” If the answer is “yes” she then knows she has a comprehensive and useful framework for going forward and digging deeper for answers and solutions. The possible impact of changes on each business object category can be traced separately or together.

This type of thinking about requirements applies whether the implementation is a custom development or a packaged analytical application. As a matter of fact, 6BI can be the basis of a robust subject matter coverage analysis when evaluating commercial off-the-shelf (COTS) software.

Because business intelligence, especially designing the data stores for data warehouses and other analytical and decision support systems, has progressed over the last decade and a half to where it is a mature discipline, many data architects no longer actually create uniquely designed databases directly from requirements. With the emphasis on design patterns and the availability of domain specific templates and examples, the emphasis is now on gap analysis. Gap analysis is the process of closing the gap between a given design, which can be either an industry standard, a generic model, or built into a product, and the acknowledged data requirements of the problem at hand.

The 6BI Design Framework can provide a very thorough vehicle for this gap analysis. 6BI is a way to organize your thinking and help you know what types of questions to ask in order to “cover all the bases”, or make sure no gaps remain, in the requirements analysis process. This gap analysis is especially critical in designing databases to support modern business intelligence technologies such as Business Activity Monitoring (BAM), where visibility into business processes and the atomic activities that make them up is critical. It is critical because different sets of events and activities are important to individuals and groups at different levels of an organization. The wider an individual’s or group’s scope of responsibility, the more aggregated and wider their reporting needs become. In order to aggregate data from more atomic and concrete levels to broader and more abstract levels, the aggregation must take place along one or more dimensions of the data. It is also critical that this visibility into the business processes be as independent of the implementation technology as possible.

The 6BI approach helps to identify these dimensions by grouping them in six distinct categories, each of which is aligned with a type of question an end user might ask about the nature of the information that they need. As questions become more detailed and precise within each interrogative category, a 6BI-designed system allows the user to drill down through combinations of dimensions that correspond with the types of questions asked. 6BI identifies the six fundamental business object categories that define the dimensions of the answers to operational questions about an enterprise: its parties, its products, its activities, its events, its locations, and its motivators.


“A framework for information systems architecture” by John Zachman, IBM Systems Journal, Vol 26, No 3, 1987

“Zachman Framework Extensions: An Update” and “Rules for the Zachman Framework Architecture”, Database Newsletter, Vol 19, No. 4, July/August 1991

“DoD Architecture Framework Version 1.0”, U.S. Department of Defense Architecture Framework Working Group, 9 February 2004

[1] I like to think of this object sorting process as analogous to the Sorting Hat process that each first year student at the Hogwarts School of Witchcraft and Wizardry must go through in the Harry Potter books.

[2] The choice of colors is purely arbitrary but I typically use red for “who”, blue for “what”, green for “how”, aqua for “when”, maroon for “where”, and violet for “why”. Proforma ProVision, used for business vision, operations and process modeling, for example, uses a completely different color scheme.

[3] DoDAF was originally called the Command, Control, Communications, Computers, Intelligence, Surveillance, and Reconnaissance (C4ISR) Architecture Framework. It has recently undergone an update and standardization process, in response to user feedback and an executive mandate to provide common standards to improve information sharing, thus the name change.

[4] The additional Zachman Framework rows, Technology Model, Detailed Representation, and Functioning Enterprise, representing the Builder’s view, the Subcontractor’s view, and the Actual System view are not discussed in this article. The Technology Model aligns with the Technical Standards Architecture (TA) View in the DoDAF.

Six Basic Interrogatives – Part 1, 6BI Overview

August 29, 2010

This article is the first of a three part series that introduces a method for analyzing the data sources for a business intelligence solution within a framework that classifies the data entities into six (6) basic categories, sometimes called business object classes, but that I call the "Six Basic Interrogatives" or "6BI", a term taken from journalism (and, obviously, a play on the common abbreviation for business intelligence, BI, as well). It is based on the familiar Zachman Framework for Enterprise Architecture, and thus acts as a bridge between the disciplines of Business Intelligence and Enterprise Architecture (EA). These categories are the aspects (vertical columns) of the Zachman Framework: "Who", "What", "Where", "When", "Why", and "How".

The article was first published in January 2004, under the title "Six Basic Interrogatives – Part 1, 6BI Overview", on a website that no longer exists. The article owes much not only to John Zachman, but also to Dan Tasker and Len Silverston, from whom much inspiration was gained and, hopefully, refined in the process.

Six Basic Interrogatives – Part 1

One of the most difficult and yet most critical activities to get right when embarking on a new Business Intelligence / Data Warehousing (BI/DW) project is organizing and presenting the data requirements in such a way that they make sense to the persons responsible for designing and building the database that will support the decision support application. Two aspects of this activity make it very important.

First, when a significant amount of analytical ad hoc reporting is required, the database cannot be tuned to enhance the performance of any specific query. In ad hoc environments the details of every query are not known in advance (if they were, the reporting would not be ad hoc and would probably be of limited analytical value). Second, the people who design and build the database may not have personally participated in the often intense and sometimes contentious requirements gathering and analysis process. These two activities, business requirements analysis and database design, tend to attract people with different learning styles, experiences, and motivations, so they are quite frequently performed by different individuals.

The required business performance measurements need to be identified and agreed upon, and the algorithm for each defined. The available data sources that support these measurements must also be identified. Ideally, the activity of identifying candidate data sources will be performed in parallel with that of identifying business performance measurements. Once both of these "identifying" activities have reached an agreed-upon state of completeness, we then need to match the performance measurements to the data sources. That is, we need to know which data elements, when combined algorithmically, produce the required performance measurements. We need to answer these questions:

Who produces the data we are going to use to measure performance?
What task, item or service is producing the measurable performance?
Where does the activity used to measure performance take place?
When and for how long is the performance of the task, item or service measured?
Why do we believe the data we are using actually measure performance?
How are the data values that measure performance produced?

These questions, or others very similar to them, form what I call the "Six Basic Interrogatives," or 6BI, approach to database design for business intelligence. I use this term, borrowed from journalism, to describe a framework for thinking about what needs to go into the database to support decision making and ad hoc reporting, which is, after all, what needs to come out of a data warehouse or decision support application for it to be useful. In other words, we need to know all the factors that influence the data we use to measure performance.

To answer these questions, the structures within a business intelligence database must be designed to hold information about certain categories of business objects. These categories include:

Parties (Persons and Organizations) who produce the data we are going to use to measure performance.
Products (Goods and Services) that are represented by the data used to measure performance.
Locations (and Means of Communication) that facilitate the Goods and Services represented by the data used to measure performance.
Points in Time when the data is produced that is used to measure performance, and Durations over which the performance measures are produced.
Initiatives and Programs designed and executed to influence the data used to measure performance.
Contracts, Exchanges and Actions that represent the activities (participated in by the Parties) that produce the data used to measure performance.
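The correspondence between the six questions and these categories can be captured as a simple lookup table. This is a minimal sketch using only the category names from the list above:

```python
# The 6BI interrogatives mapped to the business object categories named in
# the article. The structure is illustrative; the names come from the text.
SIX_BI = {
    "who":   ["Party", "Person", "Organization"],
    "what":  ["Product", "Good", "Service"],
    "where": ["Location", "Means of Communication"],
    "when":  ["Point in Time", "Duration"],
    "why":   ["Initiative", "Program"],
    "how":   ["Contract", "Exchange", "Action"],
}

def interrogative_for(category):
    """Find which basic question a candidate business object answers."""
    for question, categories in SIX_BI.items():
        if category in categories:
            return question
    return None

print(interrogative_for("Organization"))  # who
print(interrogative_for("Exchange"))      # how
```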

These business object categories then provide the starting point for classifying the dimensions of a data warehouse database design. The design needs to allow enough flexibility to support ad hoc reporting by focusing on the basic questions that all decision support data structures need to support. However, good dimensional design also needs to address the unique decision making environment of the specific enterprise for which it is being built.

The next challenge, of course, is determining the actual measurements of performance themselves. These make up the facts or measures in the fact table that the dimensions describe. These are the data elements that, when combined algorithmically, produce the required performance measurements.
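As a sketch of how the dimensions and measures come together, the following hypothetical star-schema fact record carries one foreign key per 6BI category plus the raw measures, from which a performance measurement is then computed algorithmically. All names and values are invented for illustration:

```python
from dataclasses import dataclass

# A hypothetical fact record: one foreign key into each of the six
# dimension categories, plus the raw measures the dimensions describe.
@dataclass
class SalesFact:
    party_key: int      # who
    product_key: int    # what
    location_key: int   # where
    time_key: int       # when
    program_key: int    # why
    exchange_key: int   # how
    units: int          # measure
    revenue: float      # measure

facts = [
    SalesFact(1, 10, 100, 20031105, 5, 7, units=100, revenue=1000.0),
    SalesFact(2, 10, 101, 20031212, 5, 8, units=50,  revenue=525.0),
]

# A performance measurement produced by combining raw measures
# algorithmically, here a simple average unit price across the facts.
avg_unit_price = sum(f.revenue for f in facts) / sum(f.units for f in facts)
print(round(avg_unit_price, 3))  # 10.167
```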

The 6BI design framework provides an approach for organizing and understanding the context of business intelligence, and a way of classifying legacy data and metadata so that they make sense and can be built upon. It cannot and should not be followed slavishly, and there are quite often data structures that simply cannot be easily classified. However, if what you need is a place to get started with your database design, it can provide a platform to build upon.

In future articles I plan to discuss each of the 6BI business object categories in more detail, what to look for when classifying data elements, and examples from my own work experience. As the database administrator (DBA) function evolves from a solely technical specialty into more of a partnership with business domain experts, we will all need to quickly and effectively transform business requirements into working software solutions.

Six Basic Interrogatives – Part 2, 6BI Design Framework

August 29, 2010

This is the second part of the original three part article about the "Six Basic Interrogatives," or "6BI," for analyzing the source data for business intelligence solutions. In this part an example of how to apply the method is given.

This part was first published in March 2004, under the title "Six Basic Interrogatives – Part 2, 6BI Design Framework", on a website that no longer exists.

Six Basic Interrogatives – Part 2

In a previous article I discussed a conceptual framework for designing and building databases that support business intelligence and data warehousing applications. In the article I mentioned the six basic interrogatives, borrowed from journalism, that are the basis for our thinking and subsequent design work in building databases that support the performance measurement and decision support processes. An interrogative is of course a question and our goal is to answer questions.

Six corresponding categories of business objects were identified that align with these six interrogatives. These business object categories identify the dimensions of the answers to our business questions. Our answers come from algorithms based on combinations of measurements set in a context of these business objects.

Each business object category contributes to the production of performance measuring data. All the categories together provide the different aspects one uses to analyze business performance. These business object categories therefore form the ‘6BI Design Framework’ that can be used in designing and building databases for business intelligence.

Figure 1 shows the six basic interrogatives and the business object categories they correspond to:

Figure 1. Six Basic Interrogatives and Business Object Categories.

Let’s look at a simple scenario. Alpha Corporation receives an order on 30 October 2003, on their website, for 100 units of Alpha Super Stuff from Bravo Enterprises. On 5 November 2003 the order for the 100 units is fulfilled. The payment of $1,000 USD is made on 15 November 2003 via an electronic funds transfer. Carol Daniels, the customer representative who covers the sales territory, is responsible for writing the contract for the sale; she worked on it for one week. The activities took place in the fourth quarter of 2003. The business transaction is associated in time with an advertising campaign for Alpha Super Stuff.

Using the 6BI approach we can classify the business objects, each underlined and color-coded in the scenario above, into the categories that form the 6BI Design Framework. Let’s see how we do this:

First, Alpha Corporation and Bravo Enterprises are Organizations, a type of Party. Carol Daniels is a Person, another type of Party. This tells us who produces the data we are going to use for our performance measurements. If we are measuring performance from the perspective of Alpha Corporation we call it the first party in our example. To accommodate who produces the data we need to create dimensional entities and attributes that describe persons and organizations in our decision support data models.

Second, Alpha Super Stuff is a Product, in this case the Goods which were exchanged between Alpha Corporation and Bravo Enterprises, the second party in our example. It answers the question of what produced the measurement data. But Carol Daniels’ Work Effort also produced data, so we need to include that here as well. To describe what is produced we need to create entities and attributes that describe goods and services in our dimensional data models.

Third, the website tells us the Location of where the process was initiated, in this case in virtual space. The sales territory tells us where to Place the exchange geographically. We need to create dimensional entities and attributes to describe where the exchange occurred.

Fourth, the dates above record the Events occurring at different points in time that tell us when the exchange that produces the data occurred. The fact that Carol Daniels worked on it for one week in the fourth quarter of 2003 tells us the Duration over which some of the data is relevant. In our decision support data models we need to create dimensional entities and attributes that describe points in time and periods of time.

Fifth, the sales process (including the order, fulfillment and payment) is an Exchange between the parties and tells us how the data that we are going to use for our measurements were actually produced. The contract between Alpha Corporation and Bravo Enterprises gives us the means by which the data was produced. To accommodate this we need to create dimensional entities and attributes that describe business processes.

Sixth, the advertising campaign, which is a Program, gives us an idea of why the measurement data was produced. To accommodate this we need to create dimensional entities and attributes that describe motivators in our business intelligence data models.

Lastly, the bolded items in the scenario above (100 units, $1,000 USD) are, of course, instances of the measurements themselves. In future articles I will discuss measures and how they fit into the dimensional context we have outlined by the 6BI approach. I will also discuss color coding and how it helps to document the 6BI Design Framework.
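The classification steps above can be summarized in a single structure. Here is a sketch that sorts the scenario's objects into the 6BI categories; the objects and measures come from the walkthrough, while the grouping labels and layout are mine:

```python
# The worked scenario sorted into the 6BI Design Framework categories.
scenario_6bi = {
    "who":   {"Organizations": ["Alpha Corporation", "Bravo Enterprises"],
              "Persons": ["Carol Daniels"]},
    "what":  {"Goods": ["Alpha Super Stuff"],
              "Work Efforts": ["contract writing"]},
    "where": {"Locations": ["website (virtual)", "sales territory"]},
    "when":  {"Events": ["2003-10-30 order", "2003-11-05 fulfillment",
                         "2003-11-15 payment"],
              "Durations": ["one week", "Q4 2003"]},
    "why":   {"Programs": ["advertising campaign"]},
    "how":   {"Exchanges": ["sales process"],
              "Contracts": ["sales contract"]},
    "measures": {"units": 100, "payment_usd": 1000},
}

print(sorted(k for k in scenario_6bi if k != "measures"))
```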


At the logical level the 6BI Design Framework forms the basis for the reuse of entities, attributes, value domains, relationships and other logical objects in our data warehouse designs. When actually implemented in software the 6BI Design Framework is instantiated as tables, columns, constraints, references and other database objects with common features across implementations and across business domains.

This ability to reuse our decision support database objects as widely as possible across many areas of business intelligence reduces the amount of discovery work that is needed for each effort. It also frees up valuable human resources to concentrate on specializing these objects to meet the requirements of specific business domains. It is, after all, the problems of specific business domains that we need to solve with our business intelligence applications, and not theoretical or conceptual problems. Our business intelligence applications can then become more and more aligned with the goals of the enterprise and produce better results faster and more economically.
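One way to picture this reuse is a generic dimension defined once and then specialized for a specific business domain. This is a minimal sketch with invented names, not a prescription for any particular modeling tool:

```python
from dataclasses import dataclass

# A reusable "who" dimension, common across business domains.
@dataclass
class PartyDim:
    party_key: int
    name: str
    party_type: str        # e.g. "Person" or "Organization"

# A domain-specific specialization that inherits the common attributes
# and adds only what the sales domain needs.
@dataclass
class CustomerDim(PartyDim):
    sales_territory: str = ""

c = CustomerDim(1, "Bravo Enterprises", "Organization", "East")
print(c.name, c.sales_territory)  # Bravo Enterprises East
```

The discovery work for the common attributes is done once; each new business intelligence effort spends its time only on the domain-specific additions.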