As you may have seen in the Twittersphere, today is a special date in the Oracle community, a day where we can express thanks to OTN for wealth of technical content that makes our life easier.
For my tiny contribution to the day here is a little note on detecting change using SQL MINUS, maybe not my absolute favourite feature but that one has already been blogged today!
In data warehouses we need to ensure that we refresh the content to keep it line with the source systems. However in the real world we find that only a small amount of data actually changes between load; we may get a few new customers and a few others may update their details, but the bulk are the same from one day to the next. Obviously we don’t want to truncate and replace our data each load, nor do we want to update all of our records. We just need to process the new and the changed data.
For some people a change data capture approach, be it Oracle GoldenGate, old-style streams based CDC or even triggers is the way to go. It is important to remember, though, that straight data replication is not what we want, we need to consume change in a way that will load our dimension tables with the deltas only, so some form of (in ODI speak) “journalization” is essential. For others the setting up of CDC is not the right solution.
In this case we need to compare the data we are loading with the data we already hold and look for changes. Some developers do this by outer joining source to target and then comparing columns to find changed data, either through direct comparison of columns or computing a MD5 hash of the concatenated columns. Either way need to do something with those pesky nulls in our data as we may get false matches or even miss changed rows.
A more NULL tolerant approach is the use a set-based approach. The classic is something like SELECT a,b,c from StagedData MINUS select x,y,z from TargetData. Using SQL minus like this we get rows in our result set that do not exist in our target + rows that are different in the two sets, stage and target.
Some people worry that the SQL minus is going to be a heavyweight query, but there is a simple thing we can do lighten the load. Instead of a straight
SELECT * from STAGE_TABLE -- a 20 row table
SELECT * from TARGET_TABLE -- a 50 million row table
we pre-filter the Target table to only include rows with the stage table’s business keys; and EXISTS or an inner join would do nicely here
Simple but an effective way of reducing the amount of data being processed in a set based query.