Go Faster Tape

When I was a kid you could buy a roll of sticky tape with a checkerboard print on it. It was used to decorate bikes to make them go faster. In reality it did not actually work, but as a young child you just accepted it for what it was: magic.

The same could be said for the silver bullets you might read about in tuning blogs. The reality of tuning is that there is never a one-size-fits-all solution. Invariably, performance is driven by the distribution of the data and the characteristics of the platform it is being accessed on; everyone’s data and environment are different, so what works for one person may not work for another. That is not to say there is no such thing as good practice for designing data access, it’s just that you can’t say “this block size” or “index this” is the best thing in all circumstances.

However, there are some general goals to explore, if not achieve:

[Image: don’t do it at all; do it less often; do it the same but faster]

Obviously, not doing something at all gives the best performance gain. Surprisingly often, looking at a system reveals time-consuming but unnecessary steps; sometimes it takes restating the problem in a different way and altering the processing logic to omit a step.

It’s harder to call whether doing things less frequently or doing them faster gives the bigger performance boost. “Less often” could be something simple such as committing less often (of course, commit as often as transactional integrity requires, but for a bulk load we might not need a commit for each row). Likewise, pre-filtering data can reduce the volume to be processed. Sometimes the line between less often and faster is quite blurred; is changing from a nested loop join to a hash join an example of “less often” or “faster”? You could argue the same for PL/SQL loops being replaced with a SQL join. Some things definitely sit in the realm of “same but faster”, such as where an operation occurs: in the CPU, in memory, at the storage cell, on disk, or using off-line data.
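As a minimal sketch of “less often”, here is a batch-commit pattern in PL/SQL; the staging table STAGE_D1 and the batch size of 10,000 are purely illustrative assumptions:

-- Commit per batch rather than per row (STAGE_D1 is a hypothetical staging table)
DECLARE
  v_rows PLS_INTEGER := 0;
BEGIN
  FOR r IN (SELECT dim_key FROM SOURCE_SCHEMA.D1) LOOP
    INSERT INTO STAGE_D1 (dim_key) VALUES (r.dim_key);  -- per-row processing
    v_rows := v_rows + 1;
    IF MOD(v_rows, 10000) = 0 THEN
      COMMIT;  -- commit every 10,000 rows instead of after every row
    END IF;
  END LOOP;
  COMMIT;  -- pick up the final partial batch
END;
/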

The fourth way – do something different

In some cases, typically with aggregation, it might be possible to apply a complex function to the result of an aggregate rather than to every individual value, though not all functions allow this.
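A minimal sketch of the idea, assuming a SALES table with PRODUCT_ID and QTY_GRAMS columns (illustrative names) and a function, here a simple unit conversion, that can legitimately be moved outside the aggregate:

-- Per-row: the conversion is evaluated once for every source row
SELECT product_id, SUM(qty_grams / 1000) AS qty_kg
FROM   sales
GROUP  BY product_id;

-- Per-group: the same conversion is evaluated once per aggregated result
SELECT product_id, SUM(qty_grams) / 1000 AS qty_kg
FROM   sales
GROUP  BY product_id;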

Let’s look at a typical problem we see in the data warehouse / data analytics world, where we get an extract of dimensional data from a source. A simplistic statement of the requirement: using the staged data as a source, insert new data and update existing data. We might write a PL/SQL block that loops through the source, tries to update the target and, if that fails, inserts. This, however, is likely to be slow as we loop through our data set row by row. A faster approach is to replace the loop code with a single MERGE (upsert) statement; the loop disappears and the load runs as one SQL statement. There is still scope for improvement, though, but that requires us to revisit the requirements.
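A minimal MERGE sketch, assuming the dimension is keyed on DIM_KEY and has a single attribute column ATTR1 (the attribute name is an assumption for illustration):

-- Set-based upsert replacing the PL/SQL loop
MERGE INTO TARGET_SCHEMA.D1 t
USING SOURCE_SCHEMA.D1 s
ON (t.DIM_KEY = s.DIM_KEY)
WHEN MATCHED THEN
  UPDATE SET t.ATTR1 = s.ATTR1
WHEN NOT MATCHED THEN
  INSERT (DIM_KEY, ATTR1) VALUES (s.DIM_KEY, s.ATTR1);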

In a real-world dimension we are likely to have new records, updated records and (almost certainly the majority) unchanged records. So how do we remove the unchanged records so we don’t process them? I have blogged on this before, but there are two basic approaches we can take:

  • We can outer join source to target and filter out the duplicate rows.
  • We can SQL MINUS the source and target.

As I said before, the join approach may be fine, but when you might get NULLs in some columns there is some coding to do, as comparing a NULL to a NULL is not quite straightforward in Oracle. Some people use an MD5 hash of the concatenated columns, but again NULLs may trip things up, and the hash function is compute intensive, so it could have a negative impact on systems with little spare CPU.
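For the join approach, here is a sketch of a NULL-safe change test, again using the illustrative attribute column ATTR1; a plain t.ATTR1 <> s.ATTR1 never evaluates to true when either side is NULL, so those changes would be missed:

-- New or changed rows, treating NULL-to-NULL as "no change"
SELECT s.*
FROM   SOURCE_SCHEMA.D1 s
LEFT   JOIN TARGET_SCHEMA.D1 t ON t.DIM_KEY = s.DIM_KEY
WHERE  t.DIM_KEY IS NULL                     -- new dimension member
   OR  DECODE(t.ATTR1, s.ATTR1, 0, 1) = 1;   -- changed attribute; DECODE treats two NULLs as equal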

However, if we could change our source data extract to contain only new and changed data, we could go back to a simple merge load. One approach is to use log-based change data capture on the source system; for example, Oracle GoldenGate on the source database only passes rows that have been committed, and for many systems a commit implies a change. However, experience shows that there are “noisy” systems out there that commit even when a record is merely viewed, and that leaves us with data that need not be processed. I say change data capture, but we could equally use data streams such as Apache Kafka.

Having less data in the source allows us to further reduce the data we process in a SQL MINUS:

-- Basic set-based approach to find change
SELECT * FROM SOURCE_SCHEMA.D1
MINUS
SELECT * FROM TARGET_SCHEMA.D1;

-- Variant for when SOURCE_SCHEMA.D1 has many fewer rows than TARGET_SCHEMA.D1
SELECT * FROM SOURCE_SCHEMA.D1
MINUS
SELECT T1.*
FROM   TARGET_SCHEMA.D1 T1
INNER  JOIN SOURCE_SCHEMA.D1 S1 ON S1.DIM_KEY = T1.DIM_KEY;

Our next question is: do we merge the data, or can we do something with partition exchange? The Oracle MERGE statement can be very efficient if we are merging on an indexed key, but there may be some mileage in using only direct-path (APPEND) inserts into an empty exchange table and then swapping the data in with partition exchange. You can also do this with a non-partitioned table!

-- Insert into exchange table EX1 the unchanged data
INSERT /*+ APPEND */ INTO EX1 (
  SELECT * FROM SOURCE_SCHEMA.D1
  INTERSECT
  SELECT T1.*
  FROM   TARGET_SCHEMA.D1 T1
  INNER  JOIN SOURCE_SCHEMA.D1 S1 ON S1.DIM_KEY = T1.DIM_KEY
);
COMMIT;

-- Now the changed + new data
INSERT /*+ APPEND */ INTO EX1 (
  SELECT * FROM SOURCE_SCHEMA.D1
  MINUS
  SELECT T1.*
  FROM   TARGET_SCHEMA.D1 T1
  INNER  JOIN SOURCE_SCHEMA.D1 S1 ON S1.DIM_KEY = T1.DIM_KEY
);
COMMIT;
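The swap itself is not shown above. As a hedged sketch, assuming TARGET_SCHEMA.D1 is partitioned with a single partition P_ALL (an illustrative name) and EX1 has an identical column layout:

-- Swap the freshly loaded exchange table into the target in one operation
ALTER TABLE TARGET_SCHEMA.D1
  EXCHANGE PARTITION P_ALL WITH TABLE EX1
  INCLUDING INDEXES WITHOUT VALIDATION;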

This is the basic approach I took in one of the ODI SCD2 IKMs I developed.

Maybe not go faster tape – but some methods to look at for performance improvement.
