Enterprise ETL Reference Architecture
This article describes a generic method for designing and implementing ETL processes. Figure 1 depicts a standardized, technology-neutral architecture for ETL process design, showing the processing tasks that every ETL implementation must perform. These tasks are grouped into three separate stages, each responsible for its own set of tasks; taken together, the stages cover all the goals an ETL process needs to achieve. The stages are color coded to distinguish them from one another and to communicate the objectives of each stage. The colors have meaning only in the context of the design process.
Each ETL stage performs tasks unique to that stage. Figure 1 also depicts the four schemas through which data passes as it is processed. The schemas can be implemented in any of several types of data store technology, and, as described below, there is some flexibility in the order of the stages. Data flows link the schemas with the ETL stages that use them. The ETL stages are described below.
Figure 1. ETL data process reference architecture
- ETL Stage 1 (ETL1) – The goal of an ETL1 type job is to extract the correct data from source data stores and to make the extracted data available for an ETL2 job to prepare. Source data stores are typically external to the ETL system, so ETL1 jobs must provide a variety of data source adapters in order to meet this responsibility (a brief adapter sketch follows the source list below). In the reference architecture, ETL1 type jobs are referred to as “Yellow” jobs.
Original data sources are called Source Schemas (see Figure 1) and can be hosted on any of several different source data store types. These include:
- Databases such as Oracle, DB2, SQL Server or others;
- File systems such as UNIX or Windows files in many formats including CSV, XML and others;
- Distributed file systems such as Hadoop, often referred to as “Big Data” stores for their ability to maintain and retrieve very large quantities of data, frequently as key-value pairs;
- Data streams such as IBM MQ Series, real-time feeds, or other sources where data is not persisted.
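To make the adapter responsibility concrete, the sketch below shows one way an ETL1 job might hide the differences between source data store types behind a common extract interface. It is a minimal sketch in Python; the adapter classes, table names, and file paths are illustrative assumptions, with sqlite3 and CSV standing in for enterprise databases and file systems.

```python
# A minimal sketch of ETL1 source adapters, using only Python's standard
# library. Table names, file paths, and queries are hypothetical.
import csv
import sqlite3
from typing import Dict, Iterable

class SourceAdapter:
    """Common interface: every adapter yields rows as dictionaries."""
    def extract(self) -> Iterable[Dict]:
        raise NotImplementedError

class DatabaseAdapter(SourceAdapter):
    """Adapter for relational sources (sqlite3 stands in for Oracle/DB2/etc.)."""
    def __init__(self, dsn: str, query: str):
        self.dsn, self.query = dsn, query

    def extract(self) -> Iterable[Dict]:
        conn = sqlite3.connect(self.dsn)
        conn.row_factory = sqlite3.Row  # rows become name-addressable
        try:
            for row in conn.execute(self.query):
                yield dict(row)
        finally:
            conn.close()

class CsvFileAdapter(SourceAdapter):
    """Adapter for flat-file sources such as CSV exports."""
    def __init__(self, path: str):
        self.path = path

    def extract(self) -> Iterable[Dict]:
        with open(self.path, newline="") as f:
            yield from csv.DictReader(f)

# An ETL1 job can then treat every source uniformly, e.g.:
# for row in DatabaseAdapter("orders.db", "SELECT * FROM orders").extract(): ...
```

A streaming adapter for sources such as IBM MQ Series would implement the same `extract()` interface, yielding messages as they arrive rather than reading a finite set.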
The ETL1 type job also identifies and captures changed source data so that data is not unintentionally processed more than once. Changed data capture (CDC) can be performed either in the Source Schema (i.e., in the source data store) before extraction, or after extraction in the Landing Schema. Several significant factors determine where and when CDC should occur in a given architecture implementation. For example, CDC can be performed by an ETL2 job if there is a need to reduce the complexity of the ETL1 job.
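One common CDC technique is a high-watermark comparison against a last-modified timestamp. The sketch below assumes each source row carries an `updated_at` column and that the ETL system persists the last watermark it processed; both are assumptions for illustration, not requirements of the reference architecture.

```python
# A minimal high-watermark CDC sketch. The updated_at column and the
# watermark file are illustrative assumptions.
import json
from pathlib import Path
from typing import Dict, Iterable, List

WATERMARK_FILE = Path("etl1_watermark.json")  # hypothetical state store

def load_watermark() -> str:
    if WATERMARK_FILE.exists():
        return json.loads(WATERMARK_FILE.read_text())["updated_at"]
    return "1970-01-01T00:00:00"  # first run: take everything

def save_watermark(value: str) -> None:
    WATERMARK_FILE.write_text(json.dumps({"updated_at": value}))

def changed_rows(rows: Iterable[Dict]) -> List[Dict]:
    """Keep only rows modified since the last run, then advance the mark."""
    mark = load_watermark()
    # ISO-8601 timestamps compare correctly as strings
    changed = [r for r in rows if r["updated_at"] > mark]
    if changed:
        save_watermark(max(r["updated_at"] for r in changed))
    return changed
```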
On completion, ETL1 type jobs load data into Landing Schemas (see Figure 1). Unlike Source Schemas, which are typically under the control of a source system, Landing Schemas are under the control of the ETL system and are shared between ETL1 and ETL2 type jobs. They can be hosted on the same data store types that host Source Schemas, and thus also require data source adapters, but are created for the use of the ETL system only. They are often hosted, locally or remotely, on a separate “ETL server”. Landing Schemas are one type of schema in the staging data store. Landing Schemas do not always need to be persisted (though they usually are) and can exist only in the memory of an ETL server, which is then used as a volatile staging area. Server memory is leveraged mostly in “near real-time” or “low latency” environments; this is a design decision of the instance architecture, where trade-offs between speed and consistency must be weighed.
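The sketch below illustrates one way an ETL1 job might persist its output into a Landing Schema, stamping each run with a batch identifier so that the downstream ETL2 job can select exactly that batch. The table layout and `batch_id` column are hypothetical; sqlite3 again stands in for the staging data store.

```python
# A minimal sketch of loading extracted rows into a Landing Schema.
# The landing table layout and batch_id column are illustrative assumptions.
import sqlite3
import uuid
from typing import Dict, Iterable

def load_to_landing(rows: Iterable[Dict], db_path: str = "staging.db") -> str:
    batch_id = str(uuid.uuid4())  # lets ETL2 pick up exactly this batch
    conn = sqlite3.connect(db_path)
    try:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS landing_orders "
            "(batch_id TEXT, order_id TEXT, amount REAL, updated_at TEXT)"
        )
        conn.executemany(
            "INSERT INTO landing_orders VALUES (?, ?, ?, ?)",
            [(batch_id, r["order_id"], r["amount"], r["updated_at"])
             for r in rows],
        )
        conn.commit()
        return batch_id
    finally:
        conn.close()
```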
- ETL Stage 2 (ETL2) – ETL2 type jobs transform and reconcile data between sources and targets. The goal of an ETL2 job is to prepare the data for use in a target system and to make the prepared data available for ETL3 jobs to load. In the reference architecture, ETL2 jobs are referred to as “Green” jobs.
ETL2 type jobs extract data from the same Landing Schema to which ETL1 jobs loaded the data. Landing Schemas can be hosted on any of several different staging data store types. These include:
- Databases such as Oracle, DB2, SQL Server or others;
- File systems such as UNIX or Windows files in many formats including CSV, XML and others;
- Server memory, which is usually backed up to a backing store, typically in an encoded and compressed format.
ETL2 jobs typically occur before ETL3 jobs; however, ETL2 can occur after ETL3 in scenarios called ELT, for extract, load, and transform. In these scenarios the transformation occurs after the data is loaded into the target system. Even in ELT scenarios, ETL2 jobs extract data from the same schemas to which ETL1 jobs loaded it; the difference is that the Landing Schema resides on the target data store rather than the staging data store.
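In an ELT arrangement the transformation is frequently expressed as SQL executed inside the target data store itself, a style often called pushdown. The sketch below assumes hypothetical `landing_orders` and `currency` tables that earlier jobs have already loaded onto the target, and a `fact_orders` table as the final destination.

```python
# A minimal ELT sketch: the transform runs as SQL inside the target store
# after the load. All table and column names are hypothetical, and the
# tables are assumed to already exist on the target.
import sqlite3

def transform_in_target(db_path: str = "target.db") -> None:
    conn = sqlite3.connect(db_path)
    try:
        # In ELT scenarios the Landing Schema lives on the target store,
        # so the transformation is pushed down to the target's engine.
        conn.execute(
            "INSERT INTO fact_orders (order_id, amount_usd) "
            "SELECT order_id, amount * exchange_rate "
            "FROM landing_orders JOIN currency USING (currency_code)"
        )
        conn.commit()
    finally:
        conn.close()
```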
ETL2 jobs perform two kinds of tasks: first, applying data transformation business rules, through either hand coding or manipulation of pre-constructed parameterized processing steps; and second, reconciling balances between sources and targets, often referred to as “Business Auditing” (see Figure 1).
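The sketch below pairs those two tasks: a parameterized transformation step and a business-audit check that reconciles a control total between the source-side and target-side row sets. The rule, column names, and tolerance are illustrative assumptions.

```python
# A minimal ETL2 sketch: one parameterized transformation rule plus a
# business-audit reconciliation on a control total. All names are
# illustrative assumptions.
from typing import Dict, Iterable, List

def apply_rate(rows: Iterable[Dict], rate: float) -> List[Dict]:
    """Parameterized transformation: convert amounts using a supplied rate."""
    return [{**r, "amount": r["amount"] * rate} for r in rows]

def business_audit(source_rows: List[Dict], target_rows: List[Dict],
                   rate: float, tolerance: float = 0.01) -> None:
    """Reconcile balances: the transformed total must match the source total."""
    source_total = sum(r["amount"] for r in source_rows) * rate
    target_total = sum(r["amount"] for r in target_rows)
    if abs(source_total - target_total) > tolerance:
        raise ValueError(
            f"Business audit failed: {source_total:.2f} != {target_total:.2f}"
        )

landing = [{"order_id": "A1", "amount": 100.0},
           {"order_id": "A2", "amount": 50.0}]
loading = apply_rate(landing, rate=1.25)
business_audit(landing, loading, rate=1.25)  # passes; totals reconcile
```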
On completion, ETL2 jobs load data into Loading Schemas. These schemas are separate from Landing Schemas, though they can be, and often are, hosted on the same staging data store as Landing Schemas. In ETL scenarios where little actual transformation is done, the Landing and Loading Schemas may be partially or fully the same set of tables in the staging data store. The disadvantage of using a single set of tables for both is that once the data is transformed (the primary goal of ETL2), its original state can no longer be accessed without either restoring a backup of the Landing Schema made prior to its transformation into the Loading Schema, or re-extracting the data by re-running the same ETL1 job. Two problems with re-running ETL1 are that exception logic must be run to undo the CDC constraint applied in ETL1, and that, since source systems are seldom static, the second reading of the original source data set may not be exactly identical to the first.
- ETL Stage 3 (ETL3) – The goal of ETL3 type jobs is to take prepared data from ETL2 and load it into final target data stores in a controlled and predictable manner. In the reference architecture, ETL3 type jobs are referred to as “Blue” jobs.
ETL3 jobs extract data from the same Loading Schemas to which ETL2 jobs loaded data and then load it into Target Schemas. Target Schemas can be supported on the same types of data stores as Source Schemas, that is, databases, file systems, and/or data streams. As described above, ETL3 jobs may in fact occur before ETL2 jobs in ELT scenarios.
On completion, ETL3 jobs typically record the statistics that provide a history of the ETL operations for management and control purposes. This process is often referred to as “operational auditing”. Operational auditing can be performed by ETL2 type jobs if no ETL3 type jobs are run.
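Operational auditing usually amounts to writing one run-history record per job execution. The sketch below logs hypothetical statistics (rows read, rows loaded, final status) to an audit table whose layout is an assumption for illustration.

```python
# A minimal operational-audit sketch: one history row per job run.
# The audit table layout is an illustrative assumption.
import sqlite3
from datetime import datetime, timezone

def record_run(job_name: str, rows_read: int, rows_loaded: int,
               status: str, db_path: str = "staging.db") -> None:
    conn = sqlite3.connect(db_path)
    try:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS etl_run_history "
            "(job_name TEXT, finished_at TEXT, rows_read INTEGER, "
            " rows_loaded INTEGER, status TEXT)"
        )
        conn.execute(
            "INSERT INTO etl_run_history VALUES (?, ?, ?, ?, ?)",
            (job_name, datetime.now(timezone.utc).isoformat(),
             rows_read, rows_loaded, status),
        )
        conn.commit()
    finally:
        conn.close()

# Example: record_run("etl3_orders", rows_read=1500, rows_loaded=1500,
#                     status="SUCCESS")
```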
Notice that each stage is an ETL process in and of itself: it reads data, processes it in some way, and then writes the processed data. Each stage is an ETL microcosm. The three processes become a system as they work together in a given sequence to meet system-level needs.
An ETL1 job typically starts on a signal from the External Controller and, on finishing, signals its completion either back to the controller or directly to the ETL2 job. When errors occur, they are identified as being produced by the ETL1 job and, depending on severity, frequency, and scope, are handled either by the ETL1 job itself or passed to the controller for resolution.
ETL2 and ETL3 jobs similarly start on a signal, either external or from the preceding ETL job, and follow the same pattern of either handling an error within the ETL stage or passing it to the controller for resolution.
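This signaling pattern can be sketched as a controller that runs the three stages in sequence, treating a job's normal return as its completion signal and escalating errors the job cannot handle itself. The job functions and the severity rule below are illustrative assumptions.

```python
# A minimal external-controller sketch: jobs signal completion by
# returning, and escalate errors they cannot resolve. The job functions
# and severity rule are illustrative assumptions.
from typing import Callable, List

class EtlError(Exception):
    def __init__(self, message: str, severity: str = "low"):
        super().__init__(message)
        self.severity = severity

def run_pipeline(jobs: List[Callable[[], None]]) -> None:
    for job in jobs:
        try:
            job()  # returning normally is the completion signal
        except EtlError as err:
            if err.severity == "low":
                print(f"{job.__name__}: handled locally -> {err}")
                continue  # stage-level handling; pipeline proceeds
            raise  # escalate to the controller's operator for resolution

def etl1(): print("Yellow: extract source data to Landing Schema")
def etl2(): print("Green: transform, audit, load to Loading Schema")
def etl3(): print("Blue: load Target Schema, record operational audit")

run_pipeline([etl1, etl2, etl3])
```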
When all three types of ETL jobs have run, the data should be in the target format and ready for consumption by the applications that need it.