Machine Learning and Database Reverse Engineering

Artificial intelligence (AI) is based on the assumption that programming a computer with a feedback loop can improve the accuracy of its results.  Changing, in the right way, the values of the variables used in executing the code, called “parameters”, influences future executions of the code.  Those future executions are then expected to produce results closer to a desired result than previous executions did.  When this happens, the AI is said to have “learned”.
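
As a minimal sketch of this feedback loop, consider the following Python fragment.  The target value, the step size, and the doubling “code” it executes are invented purely for illustration:

```python
# A minimal sketch of the feedback loop described above: one parameter
# is nudged after each execution so that future executions land closer
# to a desired result. All of the values here are invented.

target = 10.0        # the desired result
parameter = 0.0      # the variable the loop is allowed to change
learning_rate = 0.1  # how strongly each execution's feedback is applied

for step in range(50):
    result = parameter * 2.0            # "execute the code"
    error = target - result             # compare to the desired result
    parameter += learning_rate * error  # feedback: adjust the parameter

print(f"learned parameter: {parameter:.3f} -> result: {parameter * 2.0:.3f}")
```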

Machine learning (ML) is a subset of AI.  A single ML training execution is called an “iteration”, and repeated iterations are what “train” the code to become more accurate.  Each iteration is distinctly a two-step process.  In the first step, input data is conceptualized into what are called “features”.  These features are labeled and assigned weights that reflect assumptions about their relative influence on the output.  The data is then processed by selected algorithms to produce an output, which is compared to an expected output, and a difference is calculated.  This closes out the first step, which is often called “forward propagation”.
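
A minimal sketch of this first step might look like the following.  The feature values, the weights, the single linear combination, and the squared-error difference are all assumptions made for illustration:

```python
import numpy as np

# Forward propagation, sketched under illustrative assumptions: the
# features, weights, and expected output below are invented values.
x = np.array([1.0, 2.0, 3.0])   # input data conceptualized as features
w = np.array([0.5, -0.2, 0.1])  # weights: assumed relative influence of each feature
b = 0.0                         # bias term
y = 1.5                         # the expected output

y_hat = np.dot(w, x) + b        # process the features to produce an output
difference = (y - y_hat) ** 2   # squared-error difference, closing the forward step
print(f"y_hat = {y_hat:.3f}, difference = {difference:.3f}")
```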

The second step, called “back propagation”, takes the difference between the output of the first step, called “y_hat”, and the expected output, called “y”, and, using a different but related set of algorithms, determines how the weights of the features should be modified to reduce that difference.  Iterations are repeated until either the user is satisfied with the output or changing the weights makes no further difference.  The trained and tested model can then be used to make predictions on similar data sets and, ideally, create value for the owning party (whether a person or an organization).
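
Under the same illustrative assumptions (a single linear combination of features and a squared-error difference), back propagation might be sketched as follows.  The gradient lines are the analytic derivatives of that particular loss, and the learning rate is an arbitrary choice:

```python
import numpy as np

# Back propagation, sketched for the same invented example: compute how
# each weight contributed to the gap between y and y_hat, then adjust.
x = np.array([1.0, 2.0, 3.0])   # the same illustrative features
w = np.array([0.5, -0.2, 0.1])  # initial weights
b, y, lr = 0.0, 1.5, 0.01       # bias, expected output, learning rate

for step in range(1000):
    y_hat = np.dot(w, x) + b      # forward propagation
    grad_w = 2 * (y_hat - y) * x  # d(difference)/dw for each weight
    grad_b = 2 * (y_hat - y)      # d(difference)/db
    w = w - lr * grad_w           # modify the weights to reduce the difference
    b = b - lr * grad_b
    if abs(y - y_hat) < 1e-9:     # stop once changing the weights makes no difference
        break

print(f"y = {y}, y_hat = {float(y_hat):.6f} after {step + 1} iterations")
```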

In a sense, ML is a bit like database reverse engineering (DRE).  In DRE we have the data, which is the result of some set of processing rules, unknown to us[i], that have been applied to that data.  We also have our assumptions about what a data model would have to look like to produce such data, and what it would need to look like to increase the value of the data.  We iteratively apply various techniques, mostly based on data profiling, to try to decipher the data modeling rules.  With each iteration we try to get closer to what we believe the original data model looked like.  As with ML training, we eventually stop, either because we are satisfied or because of resource limitations.
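
A single profiling pass of this kind might look like the sketch below.  The table, its column names, and the sample rows are hypothetical, and uniqueness is only one of many clues a real profiling effort would collect:

```python
import pandas as pd

# Data profiling, sketched on an invented table: with no record of the
# original model, we interrogate the data itself for structural clues.
rows = [
    {"emp_id": 1, "dept_id": 10, "dept_name": "Sales"},
    {"emp_id": 2, "dept_id": 10, "dept_name": "Sales"},
    {"emp_id": 3, "dept_id": 20, "dept_name": "Audit"},
]
df = pd.DataFrame(rows)

# One profiling question among many: which columns are distinct enough
# to have been candidate keys in the original model?
for col in df.columns:
    distinct = df[col].nunique()
    flag = "  <- candidate key?" if distinct == len(df) else ""
    print(f"{col}: {distinct}/{len(df)} distinct values{flag}")
```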

At that point we accept that we have produced a “good enough” model of the existing data.  We then move on to what we are going to do with the data, confident that we have an adequate abstraction of the data model as it exists, how it was arrived at, and what we need to do to improve it.  This is true even if there was never any “formal” modeling process originally.

Let’s look at third normal form (3NF) as an example of a possible rule that might have been applied to the data.  3NF requires that every non-key column of a table depend on the key, the table’s identifier, and nothing else.  If the data shows patterns of single-key dependencies, we can assume that 3NF was applied in its construction.  Applying the 3NF rule creates certain dependencies between the metadata and the data, and those dependencies represent business rules.
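
One way to test for that pattern is to check functional dependencies directly.  The sketch below, with a hypothetical table and an invented determines() helper, asks whether each value of one column maps to exactly one value of another:

```python
import pandas as pd

def determines(df, lhs, rhs):
    """True if each value of column lhs maps to exactly one value of rhs."""
    return bool((df.groupby(lhs)[rhs].nunique() <= 1).all())

# An invented table for illustration, with emp_id as its key.
df = pd.DataFrame({
    "emp_id":    [1, 2, 3, 4],
    "dept_id":   [10, 10, 20, 20],
    "dept_name": ["Sales", "Sales", "Audit", "Audit"],
})

# dept_name is determined by dept_id, a non-key column: a transitive
# dependency, so this table does not satisfy 3NF as described above.
print(determines(df, "emp_id", "dept_name"))   # True (trivially: emp_id is the key)
print(determines(df, "dept_id", "dept_name"))  # True: the telltale transitive dependency
```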

These dependencies are critical to how we change the data model to more closely fit, and thus be more valuable for, changing organizational expectations.  It is also these dependencies, discovered through ML and DRE, that enable artificial intelligence and business intelligence (BI), respectively.

It has been observed that the difference between AI and BI is that in BI we have the data and the rules, and we try to find the answers; in AI we have the data and the answers, and we try to find the rules.  Whether the results derived from either technology are answers to questions or rules governing patterns, both AI and BI are tools for increasing the value of data.

Increasing the value of data is an important goal because attaining it, or at least approaching it, allows a more efficient use of valuable resources, which in turn allows a system to be more sustainable, support more consumers of those resources, and produce more value for the owners of those resources.

[i] If we knew what the original data model looked like, we would have no need for reverse engineering.
