OTN Appreciation Day – Looking for Change

As you may have seen in the Twittersphere, today is a special date in the Oracle community, a day where we can express thanks to OTN for the wealth of technical content that makes our lives easier.

For my tiny contribution to the day, here is a little note on detecting change using SQL MINUS. It is maybe not my absolute favourite feature, but that one has already been blogged about today!

In data warehouses we need to refresh the content to keep it in line with the source systems. However, in the real world we find that only a small amount of data actually changes between loads; we may get a few new customers and a few others may update their details, but the bulk are the same from one day to the next. Obviously we don’t want to truncate and replace our data each load, nor do we want to update all of our records. We just need to process the new and the changed data.

For some people a change data capture approach, be it Oracle GoldenGate, old-style Streams-based CDC or even triggers, is the way to go. It is important to remember, though, that straight data replication is not what we want; we need to consume change in a way that will load our dimension tables with the deltas only, so some form of (in ODI speak) “journalization” is essential. For others, setting up CDC is simply not the right solution.

In this case we need to compare the data we are loading with the data we already hold and look for changes. Some developers do this by outer joining source to target and then comparing columns to find changed data, either through direct comparison of columns or by computing an MD5 hash of the concatenated columns. Either way, we need to do something with those pesky NULLs in our data or we may get false matches or even miss changed rows.

A more NULL-tolerant approach is to use set operators. The classic is something like SELECT a,b,c FROM StagedData MINUS SELECT x,y,z FROM TargetData. Using SQL MINUS like this, we get a result set containing the rows that do not exist in our target plus the rows that differ between the two sets, stage and target.

Some people worry that the SQL MINUS is going to be a heavyweight query, but there is a simple thing we can do to lighten the load. Instead of a straight

SELECT * from STAGE_TABLE -- a 20 row table
MINUS
SELECT * from TARGET_TABLE -- a 50 million row table

we pre-filter the target table to only include rows with the stage table’s business keys; an EXISTS or an inner join would do nicely here, as in the hedged sketch below.
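
A minimal sketch of the pre-filtered form, using the same STAGE_TABLE and TARGET_TABLE names as above and illustrative columns:

SELECT CUSTOMER_KEY, NAME, ADDRESS
FROM   STAGE_TABLE
MINUS
SELECT CUSTOMER_KEY, NAME, ADDRESS
FROM   TARGET_TABLE t
WHERE  EXISTS (SELECT 1
               FROM   STAGE_TABLE s
               WHERE  s.CUSTOMER_KEY = t.CUSTOMER_KEY);  -- only compare target rows whose keys were staged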

It is a simple but effective way of reducing the amount of data being processed in a set-based query.

Modules of Knowledge

One of the features of Oracle Data Integrator that sets it apart from other data integration tools is the use of Knowledge Modules to define how parts of the data integration process occur. The developer creates a data load mapping from sources, joins, filters and targets (and in ODI 12 even more components are available). However, it is the knowledge modules that have responsibility for describing the code that will be generated at execution time. The same logical design could be coded as a bulk truncate-insert with one set of KMs and as an incremental update using another.

True, there are many out-of-the-box KMs, or KMs available for download from the ODI Exchange, but at some point most developers will wish to customize an existing KM or create a new piece of functionality for their ODI processes. Typically we work with Integration KMs (IKM) and Load KMs (LKM) as these do much of the ODI heavy lifting.

Writing a KM sounds complex but the process is not too daunting. Under the covers there is a lot of similarity between writing an ODI procedure and a KM. Both can have multiple steps, both can use a variety of technologies, both are able to execute code steps on the “source” and / or “target”; the step editors are very similar. The key difference is that we write a procedure to do some “one-off” piece of functionality such as update a control table or SFTP some files to a remote platform; the KM however provides a more generic process that can use ODI metadata to describe the data structures being manipulated.

Here at Red Pill Analytics we have already produced some IKMs for our own set-based + analytic function approach to loading and maintaining Slowly Changing Dimensions. We have created custom versions of this KM to support Oracle 11g (and higher) and the latest MS SQL Server versions (some SQL syntax, keywords and functions differ between MSSQL and Oracle hence the need for two versions). Similarly, we have created a bulk incremental update IKM for MSSQL, using the “EXCEPT” set operator (EXCEPT is MSSQL’s equivalent of Oracle’s MINUS) to identify changes between the work table and the target. More recently we have created custom KMs for cloud-hosted databases, the latest being the one we wrote for the Snowflake Elastic Data Warehouse service.

Our first Snowflake KM was a special RKM to reverse engineer an existing Snowflake database into the ODI data stores and models. This KM utilizes Snowflake’s information schema to retrieve table design. Our second Snowflake KM was an IKM to bulk load a Snowflake target table from a SQL source.

Although it is quite possible to load most cloud databases using standard SQL INSERT commands executed through a JDBC driver, it is preferable to adopt a different approach for moving large numbers of rows to cloud-hosted platforms such as Amazon Redshift or Snowflake. For this type of use we should extract our data to a file and then use platform-specific file load commands to populate our target table. This approach is equally valid for bulk loading relational databases such as Oracle and is especially recommended for Exadata.

Rather than give the exact code we use, I will outline the steps we might take to populate a cloud hosted data warehouse platform — in fact the exact step syntax will vary with database vendor and connection security methods.

  1. Unload the results of the source query to a file. It is most likely that we would use a character-separated format such as CSV, but we could use a schema-encoded format like JSON or XML. However, for most use cases simplicity is the best starting point, so go with CSV.
  2. Compress the output file: there are two reasons for this — smaller files are quicker to upload; and some cloud vendors charge by the byte transferred or stored.
  3. Upload the file. This is one of the highly platform-specific calls. Loading an S3 bucket using a Java API is not the same as FTP, nor the same as a native Snowflake “put” command, nor the same as… well, you get the picture.
  4. Load the target table from the uploaded file. We may not need to uncompress the file first since many vendors (including Snowflake) support unzipping on the fly during table load.

We do have a couple of Red Pill Analytics “secret sauce” steps in our IKM (the Jython task shown in the editor screenshot), but the four steps above are basically it!
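
As an illustration only, steps 3 and 4 for a Snowflake target might look something like the following sketch; the stage name, file name and file format options are assumptions, and PUT runs from a Snowflake client session (for example SnowSQL) rather than a plain SQL insert over JDBC.

-- Assumed stage and file names; the extract and gzip steps (1 and 2) happen on the agent host.
PUT file:///tmp/stage_extract.csv.gz @MY_STAGE AUTO_COMPRESS = FALSE;

-- Load the target; Snowflake can decompress the gzip file on the fly.
COPY INTO TARGET_TABLE
FROM @MY_STAGE/stage_extract.csv.gz
FILE_FORMAT = (TYPE = CSV FIELD_DELIMITER = ',' COMPRESSION = GZIP);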

 

The ODI 12c KM task editor

The first step in any KM development is to look at the available KMs to see if any of them would be a useful base to work from; for example a generic SQL to file IKM may already do some of the actions we need for a file based load. If we find a suitable base KM we can clone it and start work on editing the copy. We look at the steps we have to undertake and decide how much of the current code we can reuse, and how much is net-new. This is potentially the hard part as we have to weigh up technology choices for the various methods available. With the load to Snowflake I had choices such as:

  • do I upload the file through a S3 java API in a jython step in my KM?
  • do I raise an OS command to upload a file to S3?
  • do I raise an OS command to natively load Snowflake?
  • do I use a web service?

Each option will have its own positive and negative points. For example, can I put any required .JAR files on my ODI agent machine? How do I keep passwords secure? Do I have the code skills in Java / Jython / Groovy to code what I want, or am I better off with shell script? Answer these questions and you are on your way to creating your first ODI KMs.

 

My Favourite Things

Apart from the obligatory “raindrops on roses and whiskers on kittens” I must also add analytic functions to my list, and one function in particular, FIRST_VALUE (oh, and its almost identical twin LAST_VALUE.)

FIRST_VALUE returns the first value encountered for a selection window. Obviously, we specify the column to apply the function to and, as this is an analytic function, we have that OVER part at the end where we supply an optional PARTITION BY clause. What is not optional is the ORDER BY clause; clearly it makes no sense to look for a first value if the order we see those values in is not controlled, and this means we must window too! Getting the first value or last value in a select is perhaps less useful if we have NULL values in that column, but this is just where the IGNORE NULLS option comes in handy: we can find the first non-NULL value encountered.

Let’s consider this data set:

[Screenshot: the FAV_THINGS sample data set]

I have a couple of null values in “Your” things.

select first_value(THING)
       over (partition by WHOSE
             order by OTHER_THING
             rows between unbounded PRECEDING and unbounded FOLLOWING) AS our_fav
from FAV_THINGS

gives us

[Screenshot: query results without IGNORE NULLS]

select first_value(THING IGNORE NULLS)
       over (partition by WHOSE
             order by OTHER_THING
             rows between unbounded PRECEDING and unbounded FOLLOWING) our_fav
from FAV_THINGS

[Screenshot: query results with IGNORE NULLS]

IGNORE NULLS just skips those pesky NULL values, but of course if every row in the window is NULL then we will get a NULL back.

The first time I used this kind of functionality was a few years back when Stewart Bryson and I were involved in an Exadata V1 data warehouse. The customer had a requirement to use log-based change data capture, but was unable to set the CDC process to use supplemental logging. With log-based CDC each captured change is time-marked (an SCN plus an order number to handle multiple changes for a single SCN value), with a single row representing each insert or delete and TWO rows for each update (a before and an after state). With supplemental logging we get all of the columns populated; without it we only get the columns that change value and NULLs in the columns unaffected. Although far from ideal, we can use the captured data (in our case a subscriber view over it) to synthesise the “missing” data in the capture. This is where LAST_VALUE comes into play.

Our customer was only interested in the final outcome of change – a ‘noisy’ source application meant a row might change many times over a few seconds as each individual column was updated, but only the final state was of interest in the data warehouse. In outline, we built a data set from the UNION ALL of the CDC subscriber view and the current state of our target table (not forgetting to filter the target table select to only include the PK values found in the CDC subscriber view being processed). The only complication is to remember there are TWO kinds of NULLs in a CDC view: data that becomes NULL and data that remains unchanged on UPDATE. The resulting query (which we encapsulated in a view) was quite complex: LAG and LEAD analytics and CASE statements to handle individual changes, with an ultimate LAST_VALUE to build the final state of the row to merge into the target table.
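
Glossing over that two-kinds-of-NULL complication, a much-simplified sketch of the idea (table and column names are illustrative, not the actual customer design) might look like this:

-- Union the sparse CDC rows with the current target row (given SCN 0 so it sorts first),
-- then take the last non-NULL value per column in change order to synthesise the final state.
SELECT DISTINCT
       CUST_ID,
       LAST_VALUE(CUST_NAME IGNORE NULLS)
         OVER (PARTITION BY CUST_ID ORDER BY SCN, SEQ
               ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS CUST_NAME,
       LAST_VALUE(CUST_STATUS IGNORE NULLS)
         OVER (PARTITION BY CUST_ID ORDER BY SCN, SEQ
               ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS CUST_STATUS
FROM (
      SELECT t.CUST_ID, t.CUST_NAME, t.CUST_STATUS, 0 AS SCN, 0 AS SEQ
      FROM   TARGET_DIM t
      WHERE  EXISTS (SELECT 1 FROM CDC_SUBSCRIBER_VW c WHERE c.CUST_ID = t.CUST_ID)
      UNION ALL
      SELECT CUST_ID, CUST_NAME, CUST_STATUS, SCN, SEQ
      FROM   CDC_SUBSCRIBER_VW
     );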

Recently, I revisited FIRST_VALUE for a piece of ETL logic to supply ‘default’ values to some columns based on the content of other columns. A few years back I would have fought hard to keep data cleanse / data enrichment logic like this on the source system; today, I am less convinced that this is the right place to do things. We live in a world where source data systems are frequently outside our scope of influence – cloud-based packaged solutions and multiple data sources (web, cloud, on-premise, subscription) lead us to take the view that perhaps the best place to do things is where our data first lands on a system we control. I am working up a conference talk on this topic, so more on that later.

Simplistically, we needed to store some ‘rules’ in a table and then, if a rule is matched, apply the rule to the target column in the data. Suppose I run a grocery and the things I sell have item codes. What if I made a rule that said anything containing ‘OJ’ was going to be of type “Orange Juice” and category “Fruit Juice”, and likewise if it contains ‘GJ’ then it is “Grape Juice” and again “Fruit Juice”? It can be seen that we have two rules that supply the same category of “Fruit Juice”, so we need a solution that can take an arbitrary number of rules mapping to the same target value. One way of doing this is to use a regular expression as the join condition (REGEXP_LIKE is suitable) and to build a single expression separated by the ‘|’ symbol; remember that in Oracle regular expressions we use a pipe as an OR operator between expressions. This is a simple use of LISTAGG to build a ‘|’ separated list that we then use in our REGEXP_LIKE match, and we also get the benefit of optional case-insensitive matching.
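
A hedged sketch of that pattern, using hypothetical PRODUCTS and PRODUCT_RULES tables:

-- Collapse all match strings for a category into one '|' separated pattern with LISTAGG,
-- then join with a case-insensitive REGEXP_LIKE ('i' match parameter).
SELECT p.ITEM_CODE, r.CATEGORY
FROM   PRODUCTS p
JOIN  (SELECT CATEGORY,
              LISTAGG(MATCH_STRING, '|') WITHIN GROUP (ORDER BY MATCH_STRING) AS PATTERN
       FROM   PRODUCT_RULES
       GROUP  BY CATEGORY) r
  ON   REGEXP_LIKE(p.ITEM_CODE, r.PATTERN, 'i');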

Now suppose I wrote a rule saying PRODUCT_NAME contains “JUIC” – that will do nicely for telling me that “Orange Juice” is a fruit juice. However, what about Carrot Juice, or even Juicy Fruit™? One option is to give some form of rule “priority” so that highly specific rules take precedence over generic ones, and take the result of the most important matching rule; ROW_NUMBER does that nicely.

Did you spot that little ‘gotcha’ in the bit with item codes? We have one item code match string supplying two target columns. This is where we need to pivot our target values around (use a CASE statement, as the Oracle PIVOT command only works on aggregates) to give a sparse set of data (lots of NULLs), then use FIRST_VALUE IGNORE NULLS to generate our transformed column values. Only then can we reduce our data set to just one row with the ROW_NUMBER() = 1 filter. Each row in our output gives the best-fit rules for each target column.
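
Pulling those pieces together, a hedged sketch (again with hypothetical PRODUCTS and PRODUCT_RULES tables, where each rule carries a MATCH_STRING, TARGET_COLUMN, TARGET_VALUE and RULE_PRIORITY):

SELECT ITEM_CODE, PRODUCT_TYPE, PRODUCT_CATEGORY
FROM  (SELECT p.ITEM_CODE,
              -- CASE pivots the rule values into sparse columns; FIRST_VALUE IGNORE NULLS
              -- fills each column from the highest-priority rule that supplies it
              FIRST_VALUE(CASE WHEN r.TARGET_COLUMN = 'TYPE' THEN r.TARGET_VALUE END IGNORE NULLS)
                OVER (PARTITION BY p.ITEM_CODE ORDER BY r.RULE_PRIORITY
                      ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS PRODUCT_TYPE,
              FIRST_VALUE(CASE WHEN r.TARGET_COLUMN = 'CATEGORY' THEN r.TARGET_VALUE END IGNORE NULLS)
                OVER (PARTITION BY p.ITEM_CODE ORDER BY r.RULE_PRIORITY
                      ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS PRODUCT_CATEGORY,
              ROW_NUMBER() OVER (PARTITION BY p.ITEM_CODE ORDER BY r.RULE_PRIORITY) AS RN
       FROM   PRODUCTS p
       JOIN   PRODUCT_RULES r ON REGEXP_LIKE(p.ITEM_CODE, r.MATCH_STRING, 'i'))
WHERE RN = 1;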

Mind the Gap – Getting the number on ODI sequences

At work last week I lifted the covers on the Red Pill Analytics Oracle Slowly Changing Dimension ODI IKM. If you haven’t heard about this knowledge module it was specifically written to take a dimension source table and transform it to a slowly changing dimension target table. Unlike many other IKMs this module uses a mix of analytic functions and set operations to handle record closure and surrogate key management on SCD2 records efficiently. As an added bonus it works with mixed SCD2/ SCD1 attributes.

Currently, I am enhancing the KM to build-in flow control handling. We left that part out earlier as it is perhaps a less common use case. If you are following an Oracle Reference Architecture approach to building your BI platform you probably have already done the data validity checking at an earlier stage than the “publish to dimension table”.

Whilst implementing the new code and testing it I was reminded about some of the quirks of using sequences in ODI.

ODI supports many kinds of sequence: Standard, Specific, Native. However, coming from a database background, I prefer to use native if the underlying technology supports it. Not all databases support sequences, but where they do they will be the most efficient way to generate unique key values.
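
For reference, the native sequence that the ODI definition points at is just an ordinary database sequence. A minimal sketch, using the schema and sequence names that appear later in this post, and noting the CACHE clause that is one source of the gaps discussed below:

-- A cached Oracle sequence; unused cached values are lost on instance restart.
CREATE SEQUENCE ODI_DEMO_TRG.SCD_TEST START WITH 1 INCREMENT BY 1 CACHE 20;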

Adding a sequence to a mapping is a simple matter of editing the expression on the column properties, either directly on the target table or on any component that exposes the column. Using the RPA Oracle SCD IKM as an example this is often the target column as our dimension load is very simple indeed.

[Screenshot: the sequence expression set on the target column properties]

Sequences declared in ODI can be selected directly from the ODI Objects selector of the Expression Editor

[Screenshot: the ODI Objects selector in the Expression Editor]

By default our sequence becomes :[project name].[sequence name].NEXTVAL (I am assuming you don’t really want to use CURRVAL); that is an ODI ‘bind’ object. The manual also gives us a second syntax for a sequence expression, which replaces the leading colon with a #. The two syntaxes give different results at execution time; it comes down to how the ODI agent executing the mapping interprets the expression.

:[project name].[sequence name].NEXTVAL equates to ‘for each target row, get the next sequence value’

#[project name].[sequence name].NEXTVAL equates to ‘for this execution, get the next sequence value and then apply it to all target rows’

This difference between ‘:’ and ‘#’ also explains why using the ODI bind (:) is slower when using variables in a mapping.

Rail travellers in the UK and particularly on the London Underground will have heard the expression ‘MIND THE GAP’. Gaps need minding when the distance from the train to the platform in stations is larger than expected and there is a danger of falling.


(Image by Robert S Donovan under a CC license)

So what have gaps to do with sequences? In fact, having gaps in a sequence is a fact of life. In Oracle databases we can get gaps when transactions roll back (there is no mechanism to return the number to the sequence pool); gaps when we start the database, as the sequence is often set to ‘cache’ a block of numbers to speed sequence access; and gaps on RAC, where each node may have its own cache of numbers.

When testing my SCD module I noticed some large gaps between key values when new records were processed, and these were certainly not associated with RAC, DB restarts or rollback. So why the gaps? Let’s look at this snippet of pseudo code used in the KM for surrogate key creation.

CASE WHEN NEW_KEY_NEEDED = TRUE THEN :DEMO.sq1.NEXTVAL ELSE CURRENT_SK END

What happens here is that the ODI agent evaluates the sequence expression for each row processed, whether the value is used or not. This will leave gaps, and in the case of a dimension where few changes trigger new SCD rows per execution, gaps almost the same size as the number of members in the dimension. In a way this does not really matter, as the surrogate key is only a unique reference to the row and has no semantic meaning based on its value. However, big gaps may mean we hit the maximum value we set on the sequence sooner than we expect.

We can draw on the techniques we used in ODI 10 for using a database sequence and explicitly name the sequence in the column expression – so if our sequence, SCD_TEST, is in the ODI_DEMO_TRG schema we can set our expression to be ODI_DEMO_TRG.SCD_TEST.NEXTVAL, just as we would do in SQL. In reality I would use an ODI Ref Function to resolve the schema name. This gets rid of all the gaps generated by the execution of the KM.

ODI 12 LOOKUP Components

In the depths of time, before ODI 12, ODI Studio had a fairly simple canvas on which to design our interfaces. We dropped our datastores (tables) on to it and then dragged columns between tables to create joins, and columns to the canvas to make filters. If we needed to use expressions we just typed them on the target table and selected where to execute the expression, be it source, stage or target.

In a data warehouse world we frequently need to look up values; typically this would be by joining to a dimension table on the business key and picking up the surrogate key from the dimension. Sometimes we would need to outer join to cover the case of missing keys (hopefully only on dimension loads), and downstream we would convert the resulting NULL to another value using some form of expression. Of course, the more adventurous of us would write custom IKMs to handle dimension loads and fact-load key lookup chains; the latter is not too hard if we create a metadata table to identify all the keys to supply, or even use the data dictionary constraints.

ODI 12 changes things. We now have a palette of components that we can drop on to our logical design canvas. On this palette we have the join operator and, now, a look-up operator. So what’s the difference between the two and why should we use one or the other?

Firstly, there is nothing wrong with using the pre-ODI 12 design of joining to a table supplying our lookup data and doing a join on our key value; it works, and working is always the fundamental requirement to meet!

The lookup component gives us some additional functionality compared to a join. Firstly, we can supply a default value for a key that does not exist in our lookup table. Using a join we might have to do an outer join followed by an expression operator providing a CASE statement to replace the NULL from the unmatched key. This, however, is not the only thing we can do with the lookup component, as we shall see.
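
For comparison, the manual outer-join-plus-CASE pattern just described might look like this hedged sketch (illustrative table and column names):

-- Outer join to the lookup table, then swap the unmatched NULL for a default key.
SELECT f.ORDER_ID,
       CASE WHEN d.CUSTOMER_SK IS NULL THEN -1 ELSE d.CUSTOMER_SK END AS CUSTOMER_SK
FROM   STG_ORDERS f
LEFT OUTER JOIN DIM_CUSTOMER d
  ON   f.CUSTOMER_CODE = d.CUSTOMER_CODE;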

Let’s take a look in the Oracle ODI 12.2 VM. In the ODI LCM Trunk repository there is a mapping that uses a lookup; in this case we are not finding a key, but instead enhancing a feed to give age ranges.

[Screenshot: the demonstration mapping with its LOOKUP component]

We supply our lookup condition

[Screenshot: the lookup condition editor]

Next we define our Match Row Rules

[Screenshot: the Match Row Rules options]

What happens if our key is not unique in the lookup table? This can happen in a versioned table. Sometimes we can change our conditions to force a single row select, but sometimes we need to be able to handle multiple rows. There is a dropdown to help us define our action.

[Screenshot: the Multiple Match Rows dropdown]

As you see, there is a lot of choice. The last three choices require us to specify an ‘order by’ column. This can be good if we need to enforce using the most recent value in the lookup table.

Below the Multiple Match Rows dropdown there is another for No-Match Rows. This is the action to take when a match is not found. By default this is set to return a row, in other words do an outer join. We have two choices here, inner or outer join:

[Screenshot: the No-Match Rows dropdown]

Finally, we can supply default values for the no-match rows. For example, we can set a key value to a key for ‘not known’.

So how does this look in the generated SQL?

Setting a default value of ‘0-110’ for the age range, we see:

select 
 SRC_CUSTOMER_A.CUST_ID,
 INITCAP(SRC_CUSTOMER_A.FIRST_NAME) || ' ' || INITCAP(SRC_CUSTOMER_A.LAST_NAME),
 SRC_CUSTOMER_A.ADDRESS,
 SRC_CUSTOMER_A.CITY_ID,
 SRC_CUSTOMER_A.PHONE,
 SRC_CUSTOMER_A.AGE,
 CASE WHEN (SRC_AGE_GROUP_A.AGE_MIN IS NULL AND SRC_AGE_GROUP_A.AGE_MAX IS NULL AND SRC_AGE_GROUP_A.AGE_RANGE IS NULL) THEN '0-110' ELSE SRC_AGE_GROUP_A.AGE_RANGE END,
 SRC_SALES_PERSON_A.FIRST_NAME || ' ' || INITCAP(SRC_SALES_PERSON_A.LAST_NAME)
from (ODI_STAGING.C$_0SRC_CUSTOMER SRC_CUSTOMER_A LEFT OUTER JOIN ODI_STAGING.C$_1SRC_AGE_GROUP SRC_AGE_GROUP_A 
 ON SRC_CUSTOMER_A.AGE between SRC_AGE_GROUP_A.AGE_MIN and SRC_AGE_GROUP_A.AGE_MAX = SRC_AGE_GROUP_A.AGE_RANGE
 ) INNER JOIN ODI_STAGING.C$_2SRC_SALES_PERSON SRC_SALES_PERSON_A 
 ON SRC_CUSTOMER_A.SALES_PERS_ID = SRC_SALES_PERSON_A.SALES_PERS_ID
where 
 (1=1)

The code specific to our LOOKUP component is the left outer join in the FROM clause and the CASE statement that replaces the NULL when there is no match.

Setting the lookup to return the last single value, the code becomes more complex as we add in some analytics and change the way the CASE statement works. Note that I am using the supplied mapping as an example; it is most unlikely that I’d really order on AGE_MIN!

select 
 SRC_AGE_GROUP.CUST_ID,
 INITCAP(SRC_AGE_GROUP.FIRST_NAME) || ' ' || INITCAP(SRC_AGE_GROUP.LAST_NAME),
 SRC_AGE_GROUP.ADDRESS,
 SRC_AGE_GROUP.CITY_ID,
 SRC_AGE_GROUP.PHONE,
 SRC_AGE_GROUP.AGE,
 CASE WHEN SRC_AGE_GROUP.TC > 1 OR SRC_AGE_GROUP.AGE_MIN IS NOT NULL AND SRC_AGE_GROUP.AGE_MAX IS NOT NULL AND SRC_AGE_GROUP.AGE_RANGE IS NOT NULL THEN SRC_AGE_GROUP.AGE_RANGE ELSE '0-110' END,
 SRC_SALES_PERSON_A.FIRST_NAME || ' ' || INITCAP(SRC_SALES_PERSON_A.LAST_NAME)
from (
SELECT 
 SRC_CUSTOMER_A.DEAR DEAR ,
 SRC_CUSTOMER_A.LAST_NAME LAST_NAME ,
 SRC_CUSTOMER_A.FIRST_NAME FIRST_NAME ,
 SRC_CUSTOMER_A.ADDRESS ADDRESS ,
 SRC_CUSTOMER_A.CITY_ID CITY_ID ,
 SRC_CUSTOMER_A.PHONE PHONE ,
 SRC_CUSTOMER_A.AGE AGE ,
 SRC_CUSTOMER_A.SALES_PERS_ID SALES_PERS_ID ,
 SRC_CUSTOMER_A.CUST_ID CUST_ID ,
 SRC_AGE_GROUP_A_1.AGE_MIN AGE_MIN ,
 SRC_AGE_GROUP_A_1.AGE_MAX AGE_MAX ,
 SRC_AGE_GROUP_A_1.AGE_RANGE AGE_RANGE ,
 ROW_NUMBER() OVER (PARTITION BY SRC_CUSTOMER_A.AGE ORDER BY SRC_AGE_GROUP_A_1.AGE_MIN ASC) RN ,
 COUNT(*) OVER (PARTITION BY SRC_CUSTOMER_A.AGE) TC 
FROM 
 ODI_STAGING.C$_0SRC_CUSTOMER SRC_CUSTOMER_A LEFT OUTER JOIN (
SELECT 
 SRC_AGE_GROUP_A.AGE_MIN AGE_MIN ,
 SRC_AGE_GROUP_A.AGE_MAX AGE_MAX ,
 SRC_AGE_GROUP_A.AGE_RANGE AGE_RANGE 
FROM 
 ODI_STAGING.C$_1SRC_AGE_GROUP SRC_AGE_GROUP_A 
 ) SRC_AGE_GROUP_A_1 
 ON SRC_CUSTOMER_A.AGE between SRC_AGE_GROUP_A_1.AGE_MIN and SRC_AGE_GROUP_A_1.AGE_MAX = SRC_AGE_GROUP_A_1.AGE_RANGE
 ) SRC_AGE_GROUP INNER JOIN ODI_STAGING.C$_2SRC_SALES_PERSON SRC_SALES_PERSON_A 
 ON SRC_AGE_GROUP.SALES_PERS_ID = SRC_SALES_PERSON_A.SALES_PERS_ID
where 
 (1=1)
 and (TC=RN)

Erroring on multiple rows increases the complexity further:

select 
 SRC_CUSTOMER_A_1.CUST_ID,
 INITCAP(SRC_CUSTOMER_A_1.FIRST_NAME) || ' ' || INITCAP(SRC_CUSTOMER_A_1.LAST_NAME),
 SRC_CUSTOMER_A_1.ADDRESS,
 SRC_CUSTOMER_A_1.CITY_ID,
 SRC_CUSTOMER_A_1.PHONE,
 SRC_CUSTOMER_A_1.AGE,
 SRC_CUSTOMER_A_1.COL_2,
 SRC_SALES_PERSON_A.FIRST_NAME || ' ' || INITCAP(SRC_SALES_PERSON_A.LAST_NAME)
from (
SELECT 
 SRC_CUSTOMER_A.DEAR DEAR ,
 SRC_CUSTOMER_A.LAST_NAME LAST_NAME ,
 SRC_CUSTOMER_A.FIRST_NAME FIRST_NAME ,
 SRC_CUSTOMER_A.ADDRESS ADDRESS ,
 SRC_CUSTOMER_A.CITY_ID CITY_ID ,
 SRC_CUSTOMER_A.PHONE PHONE ,
 SRC_CUSTOMER_A.AGE AGE ,
 SRC_CUSTOMER_A.SALES_PERS_ID SALES_PERS_ID ,
 SRC_CUSTOMER_A.CUST_ID CUST_ID ,
 (
SELECT 
 SRC_AGE_GROUP.AGE_MIN AS AGE_MIN 
FROM 
 (
SELECT 
 SRC_AGE_GROUP_A.AGE_MIN AS AGE_MIN ,
 SRC_AGE_GROUP_A.AGE_MAX AS AGE_MAX ,
 SRC_AGE_GROUP_A.AGE_RANGE AS AGE_RANGE 
FROM 
 ODI_STAGING.C$_1SRC_AGE_GROUP SRC_AGE_GROUP_A 
 ) SRC_AGE_GROUP 
WHERE
 (SRC_CUSTOMER_A.AGE between SRC_AGE_GROUP.AGE_MIN and SRC_AGE_GROUP.AGE_MAX = SRC_AGE_GROUP.AGE_RANGE
) ) COL ,
 (
SELECT 
 SRC_AGE_GROUP.AGE_MAX AS AGE_MAX 
FROM 
 (
SELECT 
 SRC_AGE_GROUP_A.AGE_MIN AS AGE_MIN ,
 SRC_AGE_GROUP_A.AGE_MAX AS AGE_MAX ,
 SRC_AGE_GROUP_A.AGE_RANGE AS AGE_RANGE 
FROM 
 ODI_STAGING.C$_1SRC_AGE_GROUP SRC_AGE_GROUP_A 
 ) SRC_AGE_GROUP 
WHERE
 (SRC_CUSTOMER_A.AGE between SRC_AGE_GROUP.AGE_MIN and SRC_AGE_GROUP.AGE_MAX = SRC_AGE_GROUP.AGE_RANGE
) ) COL_1 ,
 COALESCE(((
SELECT 
 SRC_AGE_GROUP.AGE_RANGE AS AGE_RANGE 
FROM 
 (
SELECT 
 SRC_AGE_GROUP_A.AGE_MIN AS AGE_MIN ,
 SRC_AGE_GROUP_A.AGE_MAX AS AGE_MAX ,
 SRC_AGE_GROUP_A.AGE_RANGE AS AGE_RANGE 
FROM 
 ODI_STAGING.C$_1SRC_AGE_GROUP SRC_AGE_GROUP_A 
 ) SRC_AGE_GROUP 
WHERE
 (SRC_CUSTOMER_A.AGE between SRC_AGE_GROUP.AGE_MIN and SRC_AGE_GROUP.AGE_MAX = SRC_AGE_GROUP.AGE_RANGE
) )),'0-110') COL_2 
FROM 
 ODI_STAGING.C$_0SRC_CUSTOMER SRC_CUSTOMER_A 
 ) SRC_CUSTOMER_A_1 INNER JOIN ODI_STAGING.C$_2SRC_SALES_PERSON SRC_SALES_PERSON_A 
 ON SRC_CUSTOMER_A_1.SALES_PERS_ID = SRC_SALES_PERSON_A.SALES_PERS_ID
where 
 (1=1)

Finally, electing to ignore no-match rows removes the outer join:

select 
 SRC_CUSTOMER_A.CUST_ID,
 INITCAP(SRC_CUSTOMER_A.FIRST_NAME) || ' ' || INITCAP(SRC_CUSTOMER_A.LAST_NAME),
 SRC_CUSTOMER_A.ADDRESS,
 SRC_CUSTOMER_A.CITY_ID,
 SRC_CUSTOMER_A.PHONE,
 SRC_CUSTOMER_A.AGE,
 SRC_AGE_GROUP_A.AGE_RANGE,
 SRC_SALES_PERSON_A.FIRST_NAME || ' ' || INITCAP(SRC_SALES_PERSON_A.LAST_NAME)
from (ODI_STAGING.C$_0SRC_CUSTOMER SRC_CUSTOMER_A INNER JOIN ODI_STAGING.C$_1SRC_AGE_GROUP SRC_AGE_GROUP_A 
 ON SRC_CUSTOMER_A.AGE between SRC_AGE_GROUP_A.AGE_MIN and SRC_AGE_GROUP_A.AGE_MAX = SRC_AGE_GROUP_A.AGE_RANGE
 ) INNER JOIN ODI_STAGING.C$_2SRC_SALES_PERSON SRC_SALES_PERSON_A 
 ON SRC_CUSTOMER_A.SALES_PERS_ID = SRC_SALES_PERSON_A.SALES_PERS_ID
where 
 (1=1)

This just shows the richness of implementation available in the lookup operator – and, more importantly, a way to remove the multiple extra steps we would have to implement with a simple join.

What’s in a Name?

Another of those blogs so that Google can remind me of stuff I forget.

We all know the importance of naming conventions: they are a way to know the purpose of our tables and columns, and a means to identify what our constraints and indexes do without having to resort to the data dictionary to look them up. True, as ETL developers we are hampered in naming conventions for our data sources (we get what we are given), but downstream in our Staging, Raw Data Reservoir (if you are bang up to date), Foundation, and Data Access & Performance layers we can influence the naming to create a valuable metadata asset that helps the whole development team (both ETL and reporting) and business users alike.

ODI 12 has introduced a component-based development paradigm. We can now drop components such as filters, joins, expressions, aggregators and splitters directly on to our mapping canvas in ODI Studio. In the real world we are likely to drop multiple joins and expressions on to the canvas, and here we see another opportunity to name things, this time the operators themselves. Instead of Filter, Filter1, Filter2 we should give meaningful names such as ‘DateRangeFilter’, ‘ActiveCodeFilter’, ‘LargeStoreFilter’. Doing this makes our mappings more understandable; product_code_lookup is far easier to understand than ‘Lookup5’.

It may seem less important to name our expression operators, but in reality this is probably the most important thing to name, and to name uniquely across our mappings.

If we write a mapping that creates C$ tables, ODI 12 could well create a C$ table that encapsulates the extract from the source expression. This does not use the source table name (as ODI 11 and earlier did); instead it uses the expression operator name. Thus we might create a table called C$_0_EXPRESSION. This would be fine if we only ran our mappings serially. However, if we go for parallel execution using a Load Plan we have the potential for two or more mappings to execute at the same time, with all of the mappings creating, inserting into, selecting from and dropping the same C$ table – inevitably leading to execution failures. To make our debugging harder, these errors appear to occur randomly and do not always generate the same error condition: sometimes the table does not exist, sometimes a column does not exist, and sometimes the mapping actually works as expected. It is even conceivable that we have two mappings with the same structure for the C$ table and a situation where mapping 1 creates and populates the C$ table only to have mapping 2 drop and recreate the same table before mapping 1 uses the data.

Moral: give all of your ODI 12 operators sensible names and ensure that expression names are unique if there is the slightest chance that two mappings may execute simultaneously.

Side note: it is possible to force ODI to create ‘process specific’ table names (a sort of run PID appended to the C$ name). This can be done on the physical tab of the mapping editor by checking ‘use unique temporary object names’. This is a simple fix for the parallel issue, at the potential loss of the meaningful names that make our mappings self-documenting. My suggestion is to use both the unique names check box and your own meaningful names.

Random Random Thoughts

Last time I wrote about generating fake data for testing. Continuing with that theme here’s a quick note on getting non-even distributions.

Let’s say I have an employee dimension and I need to allocate staff across 4 offices (for simplicity, let’s say the offices are called 1, 2, 3 and 4). There are several things we can do, depending on how we want to mix up our staff. Simplest is the use of the NTILE analytic function to divvy up the employees across 4 buckets (or, as we call them here, offices).

Select NTILE(4) OVER (ORDER BY DBMS_RANDOM.VALUE()) as OFFICE_ID from EMP_SRC;

The hard part is to come up with a suitable ordering key for the ORDER BY – it is quite unlikely that ordering by employee name is a realistic solution, so I often use a random order just to mix things up. The next simplest option is a round-robin approach, where our office number is the row number of our employee table MOD 4 (but we also need to add 1 to get the office number into our 1 to 4 range), as sketched below. Both of these techniques will ensure that each of our 4 offices has the same number (±1) of staff.
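
A minimal sketch of that round-robin allocation, assuming the same EMP_SRC table and a hypothetical EMP_ID column:

-- MOD of a row number yields 0-3, so add 1 to land in the 1-4 office range.
SELECT EMP_ID,
       MOD(ROW_NUMBER() OVER (ORDER BY DBMS_RANDOM.VALUE), 4) + 1 AS OFFICE_ID
FROM   EMP_SRC;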

But what if we have been asked to allocate different numbers of staff to each office? Here DBMS_RANDOM.VALUE(LOW_RANGE_LIMIT, HIGH_RANGE_LIMIT) comes into play, giving a random number >= LOW_RANGE_LIMIT and < HIGH_RANGE_LIMIT. The thing to note is that the numbers returned are decimals and not integers, so we will need to convert them to integers to map to our OFFICE_ID. So let’s go with ROUND.

select OFFICE_ID, count(*) from
(select round (dbms_random.value(1,4)) OFFICE_ID from dual
connect by level <=1000)
group by OFFICE_ID order by OFFICE_ID;

So we expect about 250 people per office, yes?

Well, NO! We get something like this:

OFFICE_ID COUNT(*)
---------- ----------
1 157
2 343
3 338
4 162

So what has happened? The simple answer is that ROUND returns the nearest integer to the random value. So if we get a number between 1.50 and 2.49 (well, actually to more decimal places than that) we round it to 2. However, the first and last buckets are restricted in value – never less than 1 or more than 4. This means Office 1 maps to random values between 1.00 and 1.49, and not the usual N ± 0.5.

So what happens if we replace ROUND with TRUNC? Well, the first thing to note is that we only get three offices back.

select OFFICE_ID, count(*) from
(
select
TRUNC (dbms_random.value(1,4)) OFFICE_ID
from dual
connect by level <=1000
)
group by OFFICE_ID order by OFFICE_ID;
OFFICE_ID COUNT(*)
---------- ----------
1 339
2 316
3 345

Remember that the upper limit of the range is a ‘less than’ limit, so to get our distribution across 4 offices we need to set the upper limit to 5.
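
For reference, the adjusted query looks like this (the only change from the TRUNC example above is the upper bound):

select OFFICE_ID, count(*) from
(
select
TRUNC (dbms_random.value(1,5)) OFFICE_ID
from dual
connect by level <=1000
)
group by OFFICE_ID order by OFFICE_ID;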

OFFICE_ID COUNT(*)
---------- ----------
1 257
2 243
3 249
4 251

Now I am not saying that ROUND or TRUNC will give you the ‘best’ randomization; I am saying that they will give different results – perhaps you want a shoulder on the distribution, perhaps you want something flatter.

Choose wisely as Indiana Jones was told.