This is part one of a two-part blog post that’s going to talk about bright shiny things like Oracle GoldenGate for Big Data streaming my ERP system changes through Kafka to the cloud (in this case an Oracle Cloud App). But first …
In a galaxy far away, in a time when my Exadata box had an HP logo on it (yes, I did work on V1), my then employer had a customer that needed to stream data from their sales platform to the data warehouse. For this customer, change capture was about reducing the impact on a fragile source system that, by the nature of their business, was busy 24 hours a day; they simply could not afford an invasive bulk data read slowing their principal system.
This was in the days when the Data Warehouse Guide promoted two main forms of change replication: synchronous, trigger-based capture and asynchronous, log-based capture using one of three or four methods of Oracle Streams propagation between source and target. There was also Oracle’s then-recent acquisition, GoldenGate, which again uses the redo logs as a change source, but in conjunction with its own data transport mechanism. For this customer, making changes on the source platform was almost never permitted, so adding change capture triggers was not going to happen. That left us with log-based change capture. The other thing the customer would not do was change the database logging settings; in particular, they would not switch on supplementary logging, which eliminated GoldenGate as a change replication method.
Using Oracle Streams and log mining is a simple concept. When we mine the logs for changes we are looking for information that identifies the table being change-tracked, the row being changed, the type of change, and the change itself. That change message is then transported and inserted into a table of changes to apply. Over this list there are mechanisms to expose the data in windows of change to the individual data subscribers and to purge information that is no longer needed. Simplistically, this is a game of leap-frog played by the beginning-of-window and end-of-window pointers: extend moves the end pointer to the most recent streamed record, and purge jumps the beginning pointer to the record after the end.
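The leap-frog of the two pointers can be sketched in a few lines of Python. This is a toy simulation of the idea, not Streams internals; the class and attribute names are mine, and the pointers here are just SCN values.

```python
class ChangeWindow:
    """Toy model of a subscriber's change window, tracked by two SCN pointers."""

    def __init__(self):
        self.begin_scn = 0   # first change the subscriber has not yet consumed
        self.end_scn = 0     # last change currently exposed to the subscriber

    def extend(self, latest_streamed_scn):
        # Extend: move the end pointer to the most recent streamed record,
        # making everything up to it visible to the subscriber.
        self.end_scn = latest_streamed_scn

    def purge(self):
        # Purge: the begin pointer leaps to the record after the end;
        # changes before it are consumed and can be deleted.
        self.begin_scn = self.end_scn + 1


w = ChangeWindow()
w.extend(1047)                   # changes streamed up to SCN 1047
print(w.begin_scn, w.end_scn)    # the consumable window: 0 .. 1047
w.purge()                        # subscriber is done with this window
print(w.begin_scn)               # begin has leapt past the old end: 1048
```

Each subscriber keeps its own window, so a slow consumer simply purges later than a fast one.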
Each row in the window contains information on when a change happened (an SCN and a sequence number, since we may have more than one change recorded for each SCN), the type of change, and then the column values of the row itself, or in the case of updates two rows: one for the before image (U1) and one for the after image (U2). However, here lies the problem: without supplementary logging we only see column data that has changed or is part of the primary key. If a column has not changed, the U1 and U2 records are both NULL for it. Supplementary logging gives the complete row every time.
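To make the NULL problem concrete, here is a hypothetical U1/U2 pair for a single update that changed only the salary column on an EMP-style table (the table and column names are assumed for illustration, not from the original system):

```python
# One UPDATE that changed only SAL, mined WITHOUT supplementary logging.
# Only the primary key (empno) and the changed column (sal) are populated;
# the unchanged column (ename) is NULL (None) in BOTH images.
u1 = {"scn": 1001, "seq": 1, "op": "U1", "empno": 7369, "ename": None, "sal": 800}
u2 = {"scn": 1001, "seq": 2, "op": "U2", "empno": 7369, "ename": None, "sal": 850}

print(u1["ename"], u2["ename"])   # None None: no clue what ename is
print(u1["sal"], "->", u2["sal"])  # 800 -> 850: the actual change

# With supplementary logging the same update would carry every column,
# e.g. "ename": "SMITH" in both records, so each U2 is a complete row.
```

The consumer therefore cannot tell an unchanged column from one that genuinely holds NULL without looking elsewhere for the missing values.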
Having the full row of columns makes processing trivial: for each row in the CDC window, insert the ‘I’ records, then apply the updates in SCN, sequence-number order. You can even get a little smart and apply only the latest update if we don’t actually care about multiple changes over a short time period. As usual, I never throw data away, so I tend to ignore deletes and truncates. Without this detail we give ourselves a huge problem. Yes, it is possible to process this (after all, the database does it when recovering from the redo logs), but it is not simple. We need to build complete rows, those rows need to be applied in the correct order, and building them may need access to data we already have in our target table. This can be done using Oracle SQL analytic functions and a mix of UNION ALL statements. It takes time to write and is highly customised. The sensible kids get the logging turned on and do it with full data rows, preferably using a replication product such as GoldenGate. Next time I’ll give you the reasons why.