Detecting Change

That’s a one-word blog post, isn’t it? “GoldenGate!”

Well actually, no. GoldenGate is a fantastic product and, like many other Oracle-acquired products, has an enviable array of non-Oracle connectivity. In fact, it is my change data propagation tool of choice. We make changes on database X (which might or might not be an Oracle database), these changes are captured and converted into a source-neutral format (the trail file) for transmission to the target platform, and then these changes are replayed into database Y (which, again, might or might not be an Oracle database). The net result is that we have replicated that data from one platform to another. This may be good enough if we are using the replica directly as a reporting database, but it may not be enough if we are going to put that data into a traditional data warehouse and its dimension and fact tables. ETL tools such as ODI, with its ability to consume journalized data, can help greatly here, but the fact remains the same: we need to be able to detect what has changed and then populate those changes downstream into the foundation (relational or big data) and performance (star schema) layers.

This brings us neatly to the subject, Detecting Change.

It may be possible to use source database commits as an indicator of change. Sadly, we can’t always rely on a commit being the same as a change; sometimes developers don’t do a good job and create “commit-happy” applications (often web based) that seem to issue commits constantly as you navigate from field to field. Here we have to filter no-change from change. Perhaps the easiest way is to use analytic functions such as LAG and LEAD to identify when change actually occurs. Another possibility is to aggregate changes over short time periods (such as a few minutes) so that we get a series of snapshots; however, this still does not necessarily mean that we have found changes.
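
As a minimal sketch of that filtering idea (the table and column names — STG_CUSTOMER, CUSTOMER_ID, CHANGE_TS, STATUS — are illustrative, and CHANGE_TS is assumed to be never null), we can use LAG to discard rows whose tracked values match the previous row for the same key:

```sql
-- Keep only the first row per key and rows where the tracked value really changed.
SELECT customer_id, change_ts, status
FROM   (SELECT customer_id,
               change_ts,
               status,
               LAG(change_ts) OVER (PARTITION BY customer_id
                                    ORDER BY change_ts) AS prev_ts,
               LAG(status)    OVER (PARTITION BY customer_id
                                    ORDER BY change_ts) AS prev_status
        FROM   stg_customer)
WHERE  prev_ts IS NULL                          -- first row we have seen for this key
   OR  DECODE(status, prev_status, 0, 1) = 1;   -- value actually changed (DECODE is NULL-safe)
```

In a real source there would be more tracked columns; the DECODE (or an equivalent NULL-safe comparison) is simply repeated per column.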

We may be able to use timestamps on data items to indicate what happened and when. There is no guarantee, though, that timestamps exist in the source, or, even if they do, that they are used correctly or consistently. Sometimes the timestamp is held in some form of audit table; other times it is held in the same row as the data we are working with. I’m not saying don’t use the timestamp, just be aware that it may not be the correct way of sequencing or identifying change. One thing is certain though: never use a sequence number as a timestamp proxy. Provided we don’t recycle sequence numbers we can be certain the numbers are unique, but that is all; on Oracle RAC, for example, it is quite likely that each node is using its own pool of sequence values, so there is no certainty that 100 occurs before 200. Similarly, I know of a distributed system that uses odd sequence numbers on one platform and even ones on the other to aid bi-directional replication of data.

Sometimes, though, all we can do is extract the current data from the source and compare it with the data we already hold.

My colleague, Oracle ACE Director and data modeller Kent Graziano, presented at Oaktable World 2014 on using MD5 hashes of data to allow before-and-after data comparisons. This can be a great technique, but there are a couple of drawbacks: firstly, it can’t cope with nulls in the data (NULL does not concatenate well), and secondly, the MD5 hash is compute intensive; I have seen this method almost stall an under-resourced (by CPU count) server. If you do use this method, I would advise that you pre-compute and store the MD5 hashes on the target table to minimise the execution overhead.
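
A minimal sketch of that pre-computation might look like the following, assuming Oracle 12c’s STANDARD_HASH (DBMS_CRYPTO.HASH works on earlier releases) and a hypothetical DIM_CUSTOMER target. NVL and a delimiter guard against the NULL-concatenation problem and against adjacent values bleeding into each other:

```sql
-- Add a column to hold the pre-computed hash (MD5 returns 16 bytes).
ALTER TABLE dim_customer ADD (row_hash RAW(16));

-- Populate it once; the load process maintains it thereafter.
UPDATE dim_customer
SET    row_hash = STANDARD_HASH(
                      NVL(first_name, '~') || '|' ||
                      NVL(last_name , '~') || '|' ||
                      NVL(status    , '~'), 'MD5');
```

Incoming rows are then hashed the same way and compared hash-to-hash, rather than column-by-column, against the stored value.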

As I blogged last time, we can use set-based SQL, and in particular SQL MINUS, to identify change sets. This can get quite complex if we are building an SCD2-type operation, as we need to combine updates to the current record with new changes. We could well end up with a combination of UNION, MINUS, and possibly INTERSECT just to get all of the rows we need to process. The joy of the set-based SQL operators is that they don’t care if we have nulls in our data sets, though this is not a big problem as we will most likely fix up the NULLs first.
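
At its simplest, the MINUS approach looks something like this sketch, again with illustrative names (STG_CUSTOMER holding the latest extract, DIM_CUSTOMER holding the current target rows) and the same comparison columns on both sides:

```sql
-- Rows that are either brand new or differ from the current version in the
-- target; MINUS treats NULL = NULL as equal, so no NVL gymnastics are needed.
SELECT customer_id, first_name, last_name, status
FROM   stg_customer
MINUS
SELECT customer_id, first_name, last_name, status
FROM   dim_customer
WHERE  current_flag = 'Y';
```

The result is the change set we then feed into the SCD2 load logic.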

In addition to using set-based SQL we also have another weapon at our disposal: analytic functions. We create a time-ordered view of changes from (and including) the version in our target table and, using LAG (or LEAD), look for change between rows. The resulting SQL can get quite large if we are looking at many columns, but on the plus side we are partitioning our analytic functions on the same business key and using the same time-ordering key, so we are likely to minimise our need to sort data.
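
A sketch of that shape, under the same illustrative names as before (DIM_CUSTOMER for the target, STG_CUSTOMER for the incoming rows, one tracked column STATUS):

```sql
-- Union the current target version with the incoming rows, order per business
-- key by effective timestamp, and keep only incoming rows that differ from
-- the previous version.
WITH combined AS (
  SELECT customer_id, effective_ts, status, 'TARGET' AS src
  FROM   dim_customer
  WHERE  current_flag = 'Y'
  UNION ALL
  SELECT customer_id, change_ts, status, 'SOURCE'
  FROM   stg_customer
)
SELECT customer_id, effective_ts, status
FROM   (SELECT c.*,
               LAG(status) OVER (PARTITION BY customer_id
                                 ORDER BY effective_ts) AS prev_status
        FROM   combined c)
WHERE  src = 'SOURCE'                                         -- never re-load the target's own row
AND   (prev_status IS NULL OR DECODE(status, prev_status, 0, 1) = 1);
```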

However we detect change, our goal is to build the smallest dataset that encompasses all of the changes and to merge that with our published target table.
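
That final step is typically a MERGE; as a minimal sketch, with a hypothetical change-set table CHG_CUSTOMER feeding the published DIM_CUSTOMER:

```sql
-- Update rows whose key already exists, insert the rest. A full SCD2 load
-- would instead expire the old version and insert a new one, but the MERGE
-- shape is the same.
MERGE INTO dim_customer d
USING chg_customer c
ON    (d.customer_id = c.customer_id AND d.current_flag = 'Y')
WHEN MATCHED THEN
  UPDATE SET d.status = c.status
WHEN NOT MATCHED THEN
  INSERT (customer_id, status, current_flag)
  VALUES (c.customer_id, c.status, 'Y');
```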
