6BI and Aggregate-Orientation

I wrote this article after reading NoSQL Distilled, A Brief Guide to the Emerging World of Polyglot Persistence by Pramod J. Sadalage and Martin Fowler in the summer of 2016. I was incented to read the book after seeing Martin Fowler give a presentation on NoSQL at a 2013 GoTo Conference presentation, which can be viewed at https://www.youtube.com/watch?v=ASiU89Gl0F0.

Up to that point the whole concept of “NoSQL” was very fuzzy to me, but after experiencing his presentation and reading the book I realized that the “lynchpin” that held RDBMS and NoSQL together was a concept that many SQL database architects have been working with for years, the common de-normalized data structure. One problem (among several) with de-normalization has always been how do you systematically de-normalize data so that two de-normalized data-sets can be compared to each other in a consistent and repeatable manner to assure reliable analysis by different parties, over time and in different locations. In other words how do we maintain the old “apples-to-apples” and “oranges-to-orange” paradigm.

The original was published in two blog posts, https://birkdalecomputing.com/2016/09/06/introduction-to-aggregate-orientation/ and https://birkdalecomputing.com/2016/09/15/6bi-and-aggregat…-modeling-part-2/ ‎. This page merges these two posts.

Six Basic Interrogatives, or 6BI for short, provides an approach for organizing and understanding the context of data used to support business decision making. I first wrote about 6BI back in the 2004-2005 time frame when I was working full time as a Business Intelligence Data Architect and needed a quick way to get started analyzing existing data in order to organize it to answer business questions. Since that time I’ve had other opportunities to put 6BI to use in situations not normally identified as business intelligence, certainly not the traditional type which focuses on a central RDBMS based data warehouse. I have found it holds up well, as I thought it would, in various scenarios and has versatility and applicability in just about all the computing situations that I have found myself in.

With the increased popularity of distributed computing, fueled mostly by the overwhelming up take of mobile devices both for business as well as personal use, new highly scalable technologies for persisting data have been developed. These technologies tend to rely more on an expanding network of smaller computing units as they grow by scaling out to more units, as opposed to a small number of large and ever growing computing units which grow capacity by scaling up. Scaling out is most often referred to as “horizontal scaling” while scaling up is referred to as “vertical scaling”. There is no doubt that horizontal scaling is more popular today than vertical scaling. Going hand-in-hand with the horizontal scaling trend has been the introduction of alternative data stores now collectively known as NOSQL databases.

Three types of NOSQL databases that are designed to store a rich structure of closely related data that is accessed as a unit were identified by (Sadalage and Fowler 2013)[i]. They call this type of database an aggregate-oriented database because its basic storage structure is conceptualized differently from that of a traditional relational database management system (RDBMS). An RDBMS stores its data in tables, or more precisely in relations. Relations allow data to be normalized into units (tables) which contain data that is dependent only upon the primary key of the entity. From a logical perspective we can say that each set of attribute values (i.e. a row in a table) describes one and only one instance of an entity which in turn is differentiated from all other instances of all other entities by its unique identifier. The columns of the table store the attribute values of the entity instance where its row intersects with its columns.

In practice this results quite often in data about customers stored in a Customer table, data about products in a Product table, data about Orders in an Order table and so forth. This is a very powerful design because it does not presuppose any fixed relationship between Customers, Products and Orders, but it let’s designers and developers link them together in any way that meets business requirements even when these requirements might not necessarily be known in advance. This adhoc linking is made possible by the SQL language[ii].

NOSQL databases, in order to provide more performance across distributed data stores, even when the nodes are distributed globally, were conceptualized and built with a different set of priorities. Non-deterministic query structure was not a priority, but instead just capturing and persisting the data as quickly as possible was the top priority. There was no need to build SQL language interpreters into the architecture of these “new age” data stores. As a result NOSQL databases end up sacrificing some of the flexibility needed to support non-deterministic query construction in favor of higher performance access in a distributed landscape, or cluster.

Inevitably this trade-off has lead to the need for designers to know or predict in advance how the data will be accessed, which entities will need to be joined, and in what order to answer their business questions. The loss of the luxury of not needing to know in advance exactly how your data will be accessed is the price that is paid for better performance. Fair enough! Nothing comes without its cost, especially in the world of computing. The important thing to remember is that there will always be a trade off[iii].

There are three aggregate-oriented NOSQL database categories that have emerged over the last several years: Key-Value, Document, and Column Family. Each is different but they share a conceptually similar data storage unit called an aggregate[iv]. A data modeler (or developer modeling an application’s data) using these new types of models needs to start thinking about how the data will be accessed earlier in the design and development process than ever before. As (Sadalage and Fowler 2013) state, aggregate awareness needs to be made part of the design of the data model[v]. I can see how this could be the case when there is a lack of clarity between the role of database design and that of database development. With the advent of aggregates and aggregate-orientation the job of a business intelligence data modeler has changed from being able to concentrate on business requirements to needing to be more aware of how the data will be accessed and stored. This challenge is stated succinctly by (Jovanovic and Benson 2013) when they say NoSQL databases have opened a new non-trivial problem area for data modelers[vi].

If we do the aggregate data modeling at a level independent of the style of the data store (i.e. Key Value, Document or Column Family) however we can create implementation neutral and re-useable data models that reflect the requirements of the business as much as they reflect the requirements of the technology used to implement it. Data organized with an aggregate-orientation needs to be modeled at a logical level just as much as data organized with a relational orientation ever did. In both cases we need to start with the basic entities that describe business value for persisting and manipulating data. The need for an organizing framework for data still exists. We still need to know which real world entities (regardless of how they are instantiated in the data store) the business cares about and how these entities relate to each other.

Business Object Categories

6BI is built on a framework of six universal and re-usable Business Object Categories (BOCs). Each BOC represents a non-overlapping category of business data. Each category corresponds to one of the six basic interrogatives: Who, What, Where, When, How and Why. Figure 1 shows the six basic interrogatives and the business object category which is derived from each including example sub-categories which will be used in subsequent business data model examples.

Basic Interrogative	Business Object Category
*Who* produces the data used to measure performance?	Parties[vii] (e.g. Customer and Provider)
*What* is being manipulated or exchanged to produce measurable performance?	Things[viii] (e.g. Product and Payment)
*How* are the data values that measure performance produced?	Activities (e.g. Exchange and Process)
*When* does the activity take place or for how long is the performance measured?	Events (e.g. Point in Time and Period of Time)
*Where* does the activity used to measure performance take place?	Locations (e.g. Address and Placement)
*Why* does the data actually measure performance?	Motivators[ix] (e.g. End and Means)

Figure 1. Six Basic Interrogatives and Business Object Categories.

In the 6BI Framework, Parties and Things have no direct association with each other. It is only through the other four Business Object Categories (Activities, Events, Locations, and Motivators) that Parties and Things are associated. The left side of Figure 2 shows the relationship of Parties to Activities, Events, Locations and Motivators. The right side shows the relationship of Things to the same four BOCs. This makes sense because only when parties engage in activities, exchanging and manipulating things between them, do events get generated at certain locations and reasons why become relevant. Figure 2 illustrates the relationships between business object categories, both direct and indirect.

Figure 2. The indirect association of Parties to Things.

These business object categories provide the framework and starting point for classifying data that are needed for non-deterministic query structure. This design allows enough flexibility to support adhoc reporting by focusing on the basic questions that all decision support data structures need to support, while not requiring a strictly relational storage structure underneath.

Model Transformation

To transform a 6BI (“Six Basic Interrogatives”) derived logical data model into an aggregate-oriented data model there are several steps that need to be taken. We will use the logical business data model shown in Figure 3 as our example to show this transformation. It’s very simplified but fairly typical of certain types for commercial enterprises.

Figure 3. A Business Data Model Example.

Semantically we can say the following about our business data model. A Customer can have many Contacts, and a Contact can have many Customers. The Customer_Contact entity[x] resolves this by linking a Customer to its Contacts through the “has contact” relationship and a Contact to its Customers through the “has customer” relationship. A Customer “places” Orders, “receives” Order_Invoices, and “makes” Order_Payments. A Provider “receives” Orders. An Order “contains” Order_Items and a Product “identifies” Order_Items. An Order_Invoice “is paid for by” Order_Payments and a Payment “is applied to” an Order_Payment.

Unlike data modeling where the outcome is typically the creation of tables which separately store instances of entity types linked together by key-based references, the outcome of an aggregate-oriented data model is the creation of aggregates. An aggregate is a complex data structure based on a “root entity”. The root is determined by the business requirements and is identified as an entity type in an LDM. This entity is often derived from a 6BI BOC (“Business Object Category”). Aggregates are stored as a nested collection of entity instances whose relationship to the root is also determined by business requirements. How deeply an entity type is nested within the aggregate structure is determined by where it is placed along a dependency gradient.

Dependency Gradients

A dependency gradient is a continuum of degrees of dependency where the amount of dependency flows from a higher concentration to a lower concentration. It flows from entities that have a greater number of dependencies, both direct and indirect, to entities with a lesser number of dependencies. An entity that consumes no other entity’s identifier (i.e. it has no foreign keys) is referred to as the apex entity of the dependency gradient and has no dependency on any other entity of the continuum. An entity whose identifier is not consumed by any other entity along the continuum is referred to as the base entity of the dependency gradient. A dependency gradient between a base entity and an apex entity can be the result of a direct reference from one entity to another where there are no other entities between the two or there can be intermediate entities between a base entity and its apex entity. In cases like this a dependency gradient can include all or some of the indirect references between the base and the apex. An example follows.

Figure 4 shows two dependency gradients that flow from the same Base Entity Order_Payment (a type of Exchange).

Dependency Gradient-01 flows (on a path from left to right) through the relationship “is applied to” which is a direct reference (no intermediate entities) from Order_Payment to Payment (a type of Thing). Payment is the apex entity of Dependency Gradient-01.
Dependency Gradient-02 flows (on a path from right to left, in three steps) through the relationship “is paid for by” as a direct reference from Order_Payment to Order_Invoice (a type of Exchange), then through the relationship “is billed thru” to Order (another type of Exchange), and finally through the relationship “places” as an indirect reference to Customer (a type of Party) which is the apex entity of Dependency Gradient-02.

Figure 4. Dependency gradients forming a dependency tree from the Order Payment entity.

Dependency gradients form dependency trees when the flow from a base entity (Order Payment above) takes more than one path, to more than one apex entity. The term gradient[xi] is used to indicate the relative degree and direction of dependency along a tree. Dependency always flows from a higher concentration (more foreign keys) to a lower concentration (fewer foreign keys).

An aggregate is a type of composition and in a data model entities are the components of the composition. The entities retain their integrity but separating them destroys the aggregate. An aggregate is a type of composition. Aggregates in an aggregate-oriented data model are based on dependency trees. Dependency gradients which form dependency trees determine the order of nesting between the entities participating in an aggregate. Because entities can be arbitrarily de-normalized, nesting can be done in either direction. However the apex entity of at least one the dependency gradients has to be the root entity of the aggregate. When the root entity is determined all other apex entities in the dependency tree are eliminated in the composition of the aggregate.

Steps for Transforming the LDM

The following ordered set of steps are required for transforming a LDM into an aggregate-oriented data model. The first three steps are part of logical data modeling, and the last two are part of physical data modeling.

Select the root entity for each aggregate.
Identify relationships to be eliminated by “rolling down” parent entity attributes into the child entity.
Identify relationships to be collapsed by “rolling up” child entity attributes into the parent entity.
Eliminate unneeded apex entities by rolling down their attributes to an entity higher on the dependency gradient.
Create the aggregate-oriented data model.

Select the Root Entity for Each Aggregate

The root entity of an aggregate, identified by the ROOT operator, can occur anywhere along the dependency gradient and is independent of the relational association (i.e. relationship) between itself and any other entity along the gradient. The other entities in a dependency tree, called aggregate participants, are embedded into the aggregate root entity instance. The embedding of participant entity attributes into the aggregate root occurs regardless of whether the participant is in the direction of the apex or in the direction of the base relative to the root. Identification of the root entity will influence which relationships get eliminated and which relationships get collapsed.

Identify Relationships to be Eliminated

Relationships to be eliminated are tagged with the REFI[xii] tag. To identify each individual action triggered by the REFI tag, the tags are numbered starting with REFI-01. In eliminated relationships the parent entity is removed and all or some of its attributes are rolled down into the child entity.

Which attributes to roll down is determined by how normalized the entity is. In cases where an entity has attributes that are not part of one of the dependency gradients being aggregated the non-dependent attributes would not be rolled down, however the relationship is still eliminated in transforming the logical data model into an aggregate-orientated data model.

Figure 5 shows the relationship between Product and Order Item named “identifies” tagged as REFI-03. In this case it was decided this relationship would be eliminated and the parent (Product) attributes were rolled down into the child (Order Item) entity eliminating the relationship and its parent entity.

Figure 5. A relationship tagged for elimination.

Identify Relationships to be Collapsed

This step requires modelers to take an entity, like Order in our example data model, which has other entities that refer to it and roll up the attributes of its selected child entities into the parent entity. Order and its dependent entities, Order_Item through the “contains” relationship, and Order_Invoice through the “is billed thru” relationship are shown in Figure 6. Only the relationship with Order_Item is tagged as EMBED-02. The relationship with Order_Invoice does not need to be tagged for this transformation because Order_Invoice has a separate relationship with Customer named “receives” and its attributes are rolled up along that dependency gradient (see Figure 7).

Figure 6. A relationship tagged for collapse.

Figure 7 shows the whole example model tagged in preparation for transforming it into an aggregate-oriented data model with Customer as its root entity.

Figure 7. A logical data model tagged for transformation.

Eliminate Unneeded Apex Entities

Figure 8 shows what the physical data model (PDM) looks like after certain relationships, and their parent entities, are eliminated. In this model we have eliminated the relationship “has customer” by rolling Contact into Customer_Contact (REFI-01), “receives” by rolling Provider into Order (REFI-02), and “identifies” by rolling Product into Order_Item (REFI-03)[xiii]. Please note that in the roll downs we have included the parent primary keys (now as non-key attributes). Strictly speaking we would not always need to do this. It would depend on whether an apex entity such as Provider, for example, is actually an entity and requires its identifier to be retained for entity integrity purposes, or whether its attributes are now considered attributes of its former child entity, Customer_Contact.

Again this is actually more of a business decision than strictly a technical decision, though I suspect it is rarely considered so. This is another reason why I say selection of a NOSQL database is at least as much of a business decision as a technical decision.

Figure 8. Our example data model after eliminating unneeded apex entities.

Create the Aggregate-Oriented Data Model

Figure 9[xiv] shows an aggregate-oriented data model based on Customer as its root entity. All the attributes of the entities that survive the roll down are rolled up along dependency gradients and collapsed into the Customer_Aggregate forming the nested collection of entity instances, which is called an aggregate.

We have the same issue here that we had in the previous section. What do we do with the entity identifiers (i.e. primary keys)? One identifier that we have to keep of course is the primary key of the root entity, Customer_Aggregate.customer_id. It is the identifier of the aggregate. The answer for an aggregate is not the same as the answer for a table. In a RDBMS we are optimizing the design of tables, in a NOSQL database we do not have tables. All the attributes of all entity types included in creation of the aggregate are considered attributes of the aggregate and no longer “stand alone” entities or entity instances. The identifiers of all aggregate participants are removed from the model.

Figure 9. Aggregate-oriented data model.

Not having tables as your “modeling target artifacts” may be a difficult and even disturbing concept for data modelers experienced in identifying and normalizing entities. For this reason an aggregate-oriented data model is a new type of data model. It is not a logical model, but it is also not a physical model in the strict sense that it can be used to generate pre-defined database schema structures (e.g. tables). It is used to relate the entity types and relationships designed in a logical data model, and the business requirements they represent, to the chosen type of NOSQL database (e.g. key-value, column family, document, and others) and the unique metadata structures each type of NOSQL database has.

A final transformation is required to create code for the construction of the appropriate aggregate-oriented NOSQL database structures. Because we typically see NOSQL data structures expressed as a mixed meta-data and data structure such as a JSON string it is easy to forget that JSON still has a schema. That schema is often maintained in the application code and not in the database engine. The advantage of this is more flexibility in application development but the cost is often a lack of the uniform constraints needed to maintain the re-useable data structures required for non-deterministic query construction and compliance to data governance standards.

In summary the steps needed to transform a LDM into an aggregate-oriented data model are the following. First, identify the root entity for the aggregate in the LDM, identify non-root parent entities to be eliminated by rolling down their attributes into their child entities, and finally identify child entities to be collapsed by rolling up their attributes into the root entity along dependency gradients that have the aggregate root as apex entity.

In a future paper I will describe how the concepts in this article can be applied to data targeted to be stored in popular NOSQL databases.

————————————————————————————————————————

[i] Pramod J. Saladage and Martin Fowler, NOSQL Distilled, A Brief Guide to the Emerging World of Polyglot Persistence, Addison-Wesley, 2013. Sadalage and Fowler identify four (4) types of NOSQL databases. The fourth being Graph databases which we will not be discussing in this article.

[ii] SQL is not the first relational database language. In the 1980s each relational database product had its own language, each of which did pretty much the same thing. SQL simply won the “database language war”. It is often criticized for being, among other short comings, “less than perfect”. But like the human communication vehicles known as “natural languages” it too has evolved to fit its purpose and become widely accepted as a result.

[iii] This is the reason why the choice of using a NOSQL style database is, or should be, as much a business decision as a technical decision. What needs to always be taken into consideration are the possible hidden costs of implementing a database solution that does not support SQL. Especially in light of the fact that SQL has become a virtual “lingua franca” of data analysis and business intelligence. These costs may be hidden from the group making the NOSQL database decision since it is quite likely they are not the same group that uses SQL to answer business questions.

[iv] Sadalage and Fowler, page.13.

[v] Sadalage and Fowler, page.17.

[vi] Vladan Jovanovic and Steven Benson, Aggregate Data Modeling Style, Proceedings of the Southern Association for Information Systems Conference, Savannah, GA, USA March 8th–9th, 2013, page 75.

[vii] Parties are often first specialized into “Persons” and “Organizations” but whether a party is only one or a group is not relevant to our discussion here. Instead I used a sub-categorization that is more reflective of the role that a party plays in an exchange.

[viii] In earlier articles on 6BI the term “Product” as the name of the 1^st level BOC that corresponds with the “what” interrogative was used. Product however is not universal enough. For lack of a better term I use “Thing” to include Money along with Goods and Services.

[ix] In earlier articles on 6BI the terms “Plans” and “Rules” were used as BOCs that correspond with the “why” interrogative. But these are only Means, not Ends, and “Why” needs to include both Ends and Means.

[x] It’s conventional for data modeling tools to call the rectangles in LDMs “entities”. They can also be called “types” or “entity types”. The physical form should be called an “entity instance”. In this paper I will use “entity” and “entity type” interchangeably because most of the paper deals with logical data modeling and most of the tools used for logical data modeling call them tables. Where I refer to physical modeling I use either “entity instance” of if entity is understood just “instance”.

[xi] In physics “gradient” is defined as an increase or decrease in the magnitude of a property, the change of which can usually be measured. The word seemed appropriate to name the concept of increasing or decreasing dependency.

[xii] The REFI and EMBED tags were identified by Jovanovic and Benson. Their usage in this article is derived from the paper but may differ slightly from the original meaning, including the numbering of each instance of a tag type for easier identification of each individual roll-up and roll-down action.

[xiii] It is important to understand that the three eliminated entities are not removed from the data model, neither logically or physically. They are simply not part of the aggregate-oriented data model.

[xiv] Any reader familiar with the Erwin Data Modeler tool will recognize that I’ve used the symbol for a View to represent the aggregate. This is not to say that an aggregate is the same as a view. Erwin, nor any other data modeling tool I am familiar with yet supports aggregate oriented modeling. The COMN data modeling technique introduced by Theodore Hills is the closest I’ve seen to actually recognizing and describing a methodology for doing so. Please see http://www.tewdur.com .

Birkdale Computing