The Data Processing Equation

The equation P(D) = R means that the Processing of Data produces Results, where P is the Processing, D is the Data, and R is the Results. Processing is a function that acts upon the Data to produce the Results. This can be read as “P of D yields R.”

Algebraically, we can solve for any one of the three variables (P, D or R). The variable we solve for (designated X) becomes the dependent variable, provided we know the values of the other two. Those other two are the independent variables: one is the experimental variable, and the other is the control variable, or constant.

Solving for the Results (R), we have equation 1: P(D) = X. This means that if we know the rules and procedures of the Processing (P) and we have the Data (D), we can calculate the Results (R). This is the classic Business Intelligence (BI) paradigm. In a classic star schema, think of the fact and dimension tables as containing the Data, and the various analyses and reports as the Processing that produces the Results, which are then used as a predictive model going forward. This can be called a “Results Driven Predictive Model” (RDPM) because the predictive power of the model is derived from the Results, the R factor of our equation. You use the Results (which you do not know ahead of time), derived from the interaction of Data and Processing, to inform your predictions.
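As a minimal sketch of equation 1, assume a tiny fact table of (region, amount) rows and a Processing rule that totals the amounts per region. The rows, the rule, and the names `data`, `processing` and `results` are all hypothetical, chosen only to show the known P being applied to the known D to obtain the unknown R.

```python
from collections import defaultdict

# Hypothetical fact rows (region, amount); the values are made up for illustration.
data = [("east", 100.0), ("west", 250.0), ("east", 50.0)]


def processing(rows):
    """Known rules and procedures (assumed here): total the amounts per region."""
    totals = defaultdict(float)
    for region, amount in rows:
        totals[region] += amount
    return dict(totals)


# Equation 1: X = P(D). The Results are the only unknown.
results = processing(data)
print(results)
```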

Solving for the Processing (P), we have equation 2: X(D) = R. This means that if we have the Results (R) and the Data (D), we can discover the rules and procedures of the Processing (P) that was applied to the Data (D) to produce those Results. This is the classic Machine Learning paradigm. Here, by progressively measuring how close each iteration of processing gets to the Results (which you already know), given the Data, you can produce a predictive model going forward. This can be called a “Processing Driven Predictive Model” (PDPM). You use the rules and procedures of the Processing (which you do not know ahead of time) that produced the Results given the Data, to inform your predictions.
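Equation 2 can be sketched in the same spirit. The example below assumes the unknown Processing has the simple linear form R = a * D + b, and uses plain gradient descent as one possible way to measure, iteration by iteration, how close the current guess gets to the known Results. The linear form, the sample values, the learning rate and the name `fit_processing` are all assumptions made for illustration, not a claim about how any particular machine learning system works.

```python
# Known Data (D) and known Results (R). The values are made up for illustration
# and happen to follow R = 2 * D + 1, which the fitting code does not know.
data = [1.0, 2.0, 3.0, 4.0]
results = [3.0, 5.0, 7.0, 9.0]


def fit_processing(data, results, learning_rate=0.05, steps=5000):
    """Recover a and b in the assumed rule R = a * D + b by gradient descent."""
    a, b = 0.0, 0.0
    n = len(data)
    for _ in range(steps):
        # How far the current guess at the Processing is from each known Result.
        errors = [a * d + b - r for d, r in zip(data, results)]
        # Gradients of the mean squared error with respect to a and b.
        grad_a = 2.0 * sum(e * d for e, d in zip(errors, data)) / n
        grad_b = 2.0 * sum(errors) / n
        a -= learning_rate * grad_a
        b -= learning_rate * grad_b
    return a, b


# Equation 2: X(D) = R. The Processing is the only unknown.
a, b = fit_processing(data, results)
print(f"discovered Processing: R = {a:.3f} * D + {b:.3f}")
```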

Solving for the Data (D), which is far less common than the previous two solutions, we have equation 3: P(X) = R. This means that if we have the Results and know the rules and procedures of the Processing, we can deduce the Data (D) that had to be used. As far as I know, this equation has no classic application to what is typically thought of as business, but it does have application to scientific and historical endeavors. It can be called the Historical paradigm. In other words, it asks what Data had to be processed according to the rules and procedures of the Processing to yield the observed Results. This can be called a “Data Driven Predictive Model” (DDPM). You use the Data (which you do not know ahead of time), upon which the Processing was used to produce the Results, to inform your predictions.
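A minimal sketch of equation 3 follows, assuming a known Processing rule that is strictly increasing, so that each Result corresponds to exactly one value of the Data. The rule d ** 3 + d, the search bounds and the name `solve_for_data` are hypothetical. When the Processing maps several different inputs to the same Result, the Data cannot be recovered uniquely this way.

```python
def processing(d):
    """Known rule, assumed for illustration; strictly increasing for d >= 0."""
    return d ** 3 + d


def solve_for_data(observed_result, low=0.0, high=10.0, iterations=60):
    """Bisection search for the d in [low, high] whose processing gives the observed result."""
    for _ in range(iterations):
        mid = (low + high) / 2.0
        if processing(mid) < observed_result:
            low = mid   # the Data must be larger than mid
        else:
            high = mid  # the Data is mid or smaller
    return (low + high) / 2.0


# Equation 3: P(X) = R. The Data is the only unknown.
observed_result = 10.0
recovered_data = solve_for_data(observed_result)
print(recovered_data, processing(recovered_data))
```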

We manipulate the experimental independent variable while holding the control independent variable constant. This is done in order to observe and measure how changes in the experimental variable (the one being changed) affect the dependent variable. For example, in equation 1 we can change the Processing (P) while leaving the Data constant and observe how the dependent Results change. This, of course, is very common. A constant set of data will almost always produce different results if processed according to a different set of rules and procedures.

We can also change the Data (the D factor) in equation 1 to observe how that changes the Results while the Processing stays constant. This opens up many predictive possibilities like comparing the different Results when different Data sets are processed the same way by constant Processing.
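Both experiments on equation 1 can be sketched briefly. In the first, the Data is the control and the Processing is varied; in the second, the Processing is the control and the Data is varied. The two data sets and the choice of mean and median as the competing Processing rules are assumptions made only for illustration.

```python
from statistics import mean, median

# Two hypothetical data sets; the values are made up for illustration.
data_a = [1, 2, 3, 4, 100]
data_b = [10, 20, 30, 40, 50]

# Experiment 1: hold the Data constant (data_a) and vary the Processing.
by_processing = {rule.__name__: rule(data_a) for rule in (mean, median)}

# Experiment 2: hold the Processing constant (mean) and vary the Data.
by_data = {name: mean(rows) for name, rows in (("a", data_a), ("b", data_b))}

print(by_processing)
print(by_data)
```

In the first experiment the single outlier pulls the mean far away from the median, which is the kind of difference in Results that a change in Processing alone can produce.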

The same experimental design structure can be applied to equations 2 and 3 as well. This becomes interesting when the Results are held constant. That is, we know what we want to see in the Results. The Data may be out of our control (that is, it may be supplied by others) and we want to know how we can Process that Data to give us the Results we want. This scenario is, in fact, the basis of fraud.

This examination is, of course, an oversimplification, but I believe it captures the essential interdependency between Processing, Data and Results. This interdependency follows the classic experimental model, where we have two independent variables (one experimental and one control) and one dependent variable, which is subject to the manipulation of either of the other two.

This entry was posted on April 29, 2021 at 5:24 pm and is filed under Business Intelligence.

Birkdale Computing