My Favourite Things

Apart from the obligatory “raindrops on roses and whiskers on kittens” I must also add analytic functions to my list, and one function in particular, FIRST_VALUE (oh, and its almost identical twin LAST_VALUE.)

FIRST_VALUE returns the first value encountered in a selection window. We specify the column to apply the function to and, as this is an analytic function, we have that OVER part at the end where we can supply an optional PARTITION BY clause. What is not optional is the ORDER BY clause; clearly, it makes no sense to look for a first value if the order in which we see those values is not controlled, and this means we must window too. Getting the first or last value in a select is perhaps less useful if we have NULL values in that column, but this is just where the IGNORE NULLS option comes in handy: we can find the first non-null value encountered.

Let’s consider this data set:

[Screenshot: the FAV_THINGS sample data set]

I have a couple of null values in “Your” things.

select first_value(THING)
       over (partition by WHOSE order by OTHER_THING
             rows between unbounded PRECEDING and unbounded FOLLOWING) AS our_fav
from FAV_THINGS

gives us

[Screenshot: query results]

select first_value(THING IGNORE NULLS)
       over (partition by WHOSE order by OTHER_THING
             rows between unbounded PRECEDING and unbounded FOLLOWING) our_fav
from FAV_THINGS

[Screenshot: query results with IGNORE NULLS]

IGNORE NULLS just skips those pesky null values; of course, if every row in the window is NULL then we will get a NULL back.
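And for completeness, a quick sketch of the twin against the same table – the last non-null THING per WHOSE partition:

select last_value(THING IGNORE NULLS)
       over (partition by WHOSE order by OTHER_THING
             rows between unbounded PRECEDING and unbounded FOLLOWING) our_fav
from FAV_THINGS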

The first time I used this kind of functionality was a few years back when Stewart Bryson and I were involved in an Exadata V1 data warehouse. The customer had a requirement to use log-based change data capture (CDC), but was unable to set the CDC process to use supplemental logging. With log-based CDC we get a time marker (the SCN plus an order number to handle multiple changes for a single SCN value), a single row representing each insert or delete, and TWO rows for each update (a before and an after state). With supplemental logging we get all of the columns populated; without it we only get the columns that change value and NULLs in the columns unaffected. Although far from ideal, we can use the captured data (in our case a subscriber view over it) to synthesise the “missing” data in the capture. This is where LAST_VALUE comes into play.

As our customer was only interested in the final outcome of change – a ‘noisy’ source application meant a row might change many times over a few seconds as each individual column was updated – only the final state was of interest in the data warehouse. In outline, we built a data set from the UNION ALL of the CDC subscriber view and the current state of our target table (not forgetting to filter the target table select to only include the PK values found in the CDC subscriber view being processed). The only complication is to remember that there are TWO kinds of NULLs in a CDC view: data that becomes NULL and data that remains unchanged on UPDATE. The resulting query (which we encapsulated in a view) was quite complex: LAG and LEAD analytics and CASE statements to handle individual changes, with an ultimate LAST_VALUE to build the final state of the row to merge into the target table.
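As a rough illustration of the final step, here is a minimal sketch – the CUSTOMER_TGT and CDC_SUBSCRIBER_V names and the two tracked columns are hypothetical, and it glosses over the before/after image handling and the ‘two kinds of NULL’ point above:

select customer_id,
       last_value(customer_name ignore nulls)
         over (partition by customer_id order by scn, scn_seq
               rows between unbounded preceding and unbounded following) as customer_name,
       last_value(credit_limit ignore nulls)
         over (partition by customer_id order by scn, scn_seq
               rows between unbounded preceding and unbounded following) as credit_limit
from (-- current state of the target, time-marked as before everything in this batch
      select customer_id, customer_name, credit_limit, 0 as scn, 0 as scn_seq
      from   customer_tgt
      where  customer_id in (select customer_id from cdc_subscriber_v)
      union all
      -- the captured changes, with sparse columns when supplemental logging is absent
      select customer_id, customer_name, credit_limit, scn, scn_seq
      from   cdc_subscriber_v)

Every row for a key comes back with the same rebuilt values, so a ROW_NUMBER() = 1 style filter then picks a single row per key to merge into the target.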

Recently, I revisited FIRST_VALUE for a piece of ETL logic to supply ‘default’ values to some columns based on the content of other columns. A few years back I would have fought hard to keep data cleanse / data enrichment logic like this on the source system; today, I am less convinced that this is the right place to do things. We live in a world where source data systems are frequently outside our scope of influence – cloud-based packaged solutions and multiple data sources (web, cloud, on-premise, subscription) lead us to take the view that perhaps the best place to do things is where our data first lands on a system we control. I am working up a conference talk on this topic, so more on that later.

Simplistically, we needed to store some ‘rules’ in a table and then, if a rule is matched, apply the rule to the target column in the data. Suppose I run a grocery and the things I sell have item codes. What if I made a rule that said anything containing “OJ” was going to be of type “Orange Juice” and category “Fruit Juice”, and likewise if it contains “GJ” then it is “Grape Juice” and again “Fruit Juice”? We now have two rules that supply the same category of “Fruit Juice”, so we need a solution that can take an arbitrary number of rules mapping to the same target value. One way of doing this is to use a regular expression as the join condition (REGEXP_LIKE is suitable) and to build a single expression separated by the ‘|’ symbol; remember, in Oracle regular expressions we use a pipe as an OR operator between alternatives. This is a simple use of LISTAGG to build a ‘|’ separated list that we then use in our REGEXP_LIKE match; we also get the benefit of an optional ignore-case flag.
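A minimal sketch of that idea, assuming a hypothetical RULES table holding MATCH_STRING and ITEM_CATEGORY, and a PRODUCTS table with the item codes:

select p.item_code,
       r.item_category
from   products p
join   (-- one '|' separated pattern per target category value
        select item_category,
               listagg(match_string, '|') within group (order by match_string) as pattern
        from   rules
        group by item_category) r
on     regexp_like(p.item_code, r.pattern, 'i');  -- 'i' makes the match case-insensitive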

Now suppose I wrote a rule: PRODUCT_NAME contains “JUIC”. That will do nicely for telling me that “Orange Juice” is a fruit juice. However, what about Carrot Juice, or even Juicy Fruit™? One option is to give rules some form of “priority” so that highly specific rules take precedence over generic ones, and we take the result of the most important matching rule; ROW_NUMBER does that nicely.

Did you spot that little ‘gotcha’ in the bit with item codes? We have one item code match string supplying two target columns. This is where we need to pivot our target values around (using a CASE statement, as the Oracle PIVOT command only works on aggregates) to give a sparse set of data (lots of NULLs), then use FIRST_VALUE IGNORE NULLS to generate our transformed column values. Now we can reduce our data set to just one row with a ROW_NUMBER() = 1 filter. Each row in our output gives the best-fit rules for each target column.
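Putting the pieces together, a sketch under the assumption of a hypothetical RULES table with one row per MATCH_STRING, TARGET_COLUMN, TARGET_VALUE and PRIORITY:

select item_code, item_type, item_category
from (select p.item_code,
             -- CASE pivots the sparse rule values; FIRST_VALUE IGNORE NULLS picks the best match by priority
             first_value(case when r.target_column = 'ITEM_TYPE'
                              then r.target_value end ignore nulls)
               over (partition by p.item_code order by r.priority
                     rows between unbounded preceding and unbounded following) as item_type,
             first_value(case when r.target_column = 'ITEM_CATEGORY'
                              then r.target_value end ignore nulls)
               over (partition by p.item_code order by r.priority
                     rows between unbounded preceding and unbounded following) as item_category,
             row_number() over (partition by p.item_code order by r.priority) as rn
      from   products p
      left join rules r
        on   regexp_like(p.item_code, r.match_string, 'i'))
where rn = 1;

The CASE expressions give the sparse pivot, FIRST_VALUE IGNORE NULLS picks the highest-priority match for each target column, and the RN = 1 filter collapses the result to one row per item.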

Mind the Gap – Getting the number on ODI sequences

At work last week I lifted the covers on the Red Pill Analytics Oracle Slowly Changing Dimension ODI IKM. If you haven’t heard of this knowledge module, it was specifically written to take a dimension source table and transform it to a slowly changing dimension target table. Unlike many other IKMs, this module uses a mix of analytic functions and set operations to handle record closure and surrogate key management on SCD2 records efficiently. As an added bonus it works with mixed SCD2/SCD1 attributes.

Currently, I am enhancing the KM to build in flow-control handling. We left that part out earlier as it is perhaps a less common use case: if you are following an Oracle Reference Architecture approach to building your BI platform you have probably already done the data validity checking at an earlier stage than the “publish to dimension table” step.

Whilst implementing the new code and testing it I was reminded about some of the quirks of using sequences in ODI.

ODI supports many kinds of sequence: Standard, Specific and Native. However, coming from a database background, I prefer to use native sequences if the underlying technology supports them. Not all databases support sequences, but where they do they will be the most efficient way to generate unique key values.

Adding a sequence to a mapping is a simple matter of editing the expression on the column properties, either directly on the target table or on any component that exposes the column. Using the RPA Oracle SCD IKM as an example this is often the target column as our dimension load is very simple indeed.

[Screenshot: the sequence expression on the target column properties]

Sequences declared in ODI can be selected directly from the ODI Objects selector of the Expression Editor

[Screenshot: selecting the sequence from the ODI Objects selector in the Expression Editor]

By default our sequence becomes :[project name].[sequence name].NEXTVAL (I am assuming you don’t really want to use CURRVAL); that is an ODI ‘bind’ object. The manual also gives us a second syntax for a sequence expression, which replaces the leading colon with a #. The two syntaxes give different results at execution time; it comes down to how the ODI agent executing the mapping interprets the expression.

:[project name].[sequence name].NEXTVAL equates to ‘for each target row, get the next sequence value’

#[project name].[sequence name].NEXTVAL equates to ‘for this execution, get the next sequence value then apply it to all target rows’

This difference between ‘:’ and ‘#’ also explains why using the ODI bind (:) is slower when using variables in a mapping.

Rail travellers in the UK and particularly on the London Underground will have heard the expression ‘MIND THE GAP’. Gaps need minding when the distance from the train to the platform in stations is larger than expected and there is a danger of falling.


(Image by Robert S Donovan under a CC license)

So what have gaps to do with sequences? In fact, having a gap in a sequence is a fact of life. In Oracle databases we can get gaps when transactions roll back (there is no mechanism to return the number to the sequence pool); gaps when we restart the database, as the sequence is often set to ‘cache’ a block of numbers to speed up sequence access; and gaps on RAC, where each node may have its own cache of numbers.

When testing my SCD module I noticed some large gaps between key values when new records were processed – and they were certainly not associated with RAC, database restarts or rollback. So why the gaps? Let’s look at this snippet of pseudo code used in the KM for surrogate key creation.

CASE WHEN NEW_KEY_NEEDED = TRUE THEN :DEMO.sq1.NEXTVAL ELSE CURRENT_SK END

What happens here is that the ODI agent evaluates the sequence expression for each row processed, whether the value is used or not. This will leave gaps – and in the case of a dimension where few changes trigger new SCD rows per execution, gaps almost the same size as the number of members in the dimension. In a way this does not really matter, as the surrogate key is only a unique reference to the row and has no semantic meaning based on its value. However, big gaps may mean we hit the maximum value we set on the sequence sooner than we expect.

We can draw on the techniques we used in ODI 10 for using a database sequence and explicitly name the sequence in the column expression – so if our sequence, SCD_TEST, is in the ODI_DEMO_TRG schema we can set our expression to ODI_DEMO_TRG.SCD_TEST.NEXTVAL, just as we would do in SQL. In reality I would use an ODI Ref function to resolve the schema name. This gets rid of all the gaps generated by the execution of the KM.
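Putting that into the surrogate key expression from the earlier pseudo code, it becomes something along these lines (using my demo schema and sequence names):

CASE WHEN NEW_KEY_NEEDED = TRUE THEN ODI_DEMO_TRG.SCD_TEST.NEXTVAL ELSE CURRENT_SK END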

ODI 12 LOOKUP Components

In the depths of time, before ODI 12, ODI Studio had a fairly simple canvas on which to design our interfaces. We dropped on our datastores (tables), dragged columns between tables to create joins, and dragged columns to the canvas to make filters. If we needed to use expressions we just typed them on the target table and selected where to execute the expression, be it source, stage or target.

In a data warehouse world we frequently need to look up values, typically by joining to a dimension table on the business key and picking up the surrogate key from the dimension. Sometimes we would need to outer join to cover the case of missing keys (hopefully only on dimension loads); downstream we would convert the resulting NULL to another value using some form of expression. Of course, the more adventurous of us would write custom IKMs to handle dimension loads and fact-load key lookup chains; the latter is not too hard if we create a metadata table to identify all the keys to supply, or even use the data dictionary constraints.

ODI 12 changes things. We now have a palette of components that we can drop on to our logical design canvas. On this palette we have the join operator and, now, a look-up operator. So what’s the difference between the two and why should we use one or the other?

Firstly, there is nothing wrong with using the pre-ODI 12 design of joining to a table supplying our lookup data and doing a join on our key value; it works, and working is always the fundamental requirement to meet!

The lookup component gives us some additional functionality compared to a join. Firstly, we can supply a default value for a key that does not exist in our lookup table. Using a join we might have to do an outer join followed by an expression operator providing a CASE statement to replace the NULL from the unmatched key. This, however, is not the only thing we can do with the lookup component, as we shall see.

Let’s take a look in the Oracle ODI 12.2 VM. In the ODI LCM Trunk repository there is a mapping that uses a lookup; in this case we are not finding a key, but instead enhancing a feed to give age ranges.

[Screenshot: the mapping with the lookup component]

We supply our lookup condition

[Screenshot: the lookup condition]

Next we define our Match Row Rules

[Screenshot: the Match Row Rules settings]

What happens if our key is not unique in the lookup table? This can happen in a versioned table. Sometimes we can change our conditions to force a single-row select, but sometimes we need to be able to handle multiple rows. There is a dropdown to help us define our action.

[Screenshot: the Multiple Match Rows dropdown options]

As you see, a lot of choice. The last three choices require us to specify an ‘order by’ column. This can be good if we need to enforce using the most recent value in the lookup table.

Below the Multiple Match Rows dropdown there is another for No-Match Rows. This is the action to take when a match is not found. By default this is set to return a row, in other words do an outer join. We have two choices here, inner or outer join:

[Screenshot: the No-Match Rows dropdown options]

Finally, we can supply default values for the no-match rows. For example, we can set a key value to a key for ‘not known’.

So how does this look in the generated SQL?

Setting a default value of ‘0-110’ for age range we see

select 
 SRC_CUSTOMER_A.CUST_ID,
 INITCAP(SRC_CUSTOMER_A.FIRST_NAME) || ' ' || INITCAP(SRC_CUSTOMER_A.LAST_NAME),
 SRC_CUSTOMER_A.ADDRESS,
 SRC_CUSTOMER_A.CITY_ID,
 SRC_CUSTOMER_A.PHONE,
 SRC_CUSTOMER_A.AGE,
 CASE WHEN (SRC_AGE_GROUP_A.AGE_MIN IS NULL AND SRC_AGE_GROUP_A.AGE_MAX IS NULL AND SRC_AGE_GROUP_A.AGE_RANGE IS NULL) THEN '0-110' ELSE SRC_AGE_GROUP_A.AGE_RANGE END,
 SRC_SALES_PERSON_A.FIRST_NAME || ' ' || INITCAP(SRC_SALES_PERSON_A.LAST_NAME)
from (ODI_STAGING.C$_0SRC_CUSTOMER SRC_CUSTOMER_A LEFT OUTER JOIN ODI_STAGING.C$_1SRC_AGE_GROUP SRC_AGE_GROUP_A 
 ON SRC_CUSTOMER_A.AGE between SRC_AGE_GROUP_A.AGE_MIN and SRC_AGE_GROUP_A.AGE_MAX = SRC_AGE_GROUP_A.AGE_RANGE
 ) INNER JOIN ODI_STAGING.C$_2SRC_SALES_PERSON SRC_SALES_PERSON_A 
 ON SRC_CUSTOMER_A.SALES_PERS_ID = SRC_SALES_PERSON_A.SALES_PERS_ID
where 
 (1=1)

The parts specific to our LOOKUP component are the LEFT OUTER JOIN in the FROM clause and the CASE statement that replaces the NULL when no match is found.

Setting the lookup to return the last single value, the code becomes more complex as we add in some analytics and change the way the CASE statement works. Note that I am using the supplied mapping as an example; it is most unlikely that I’d really order on the AGE_MIN!

select 
 SRC_AGE_GROUP.CUST_ID,
 INITCAP(SRC_AGE_GROUP.FIRST_NAME) || ' ' || INITCAP(SRC_AGE_GROUP.LAST_NAME),
 SRC_AGE_GROUP.ADDRESS,
 SRC_AGE_GROUP.CITY_ID,
 SRC_AGE_GROUP.PHONE,
 SRC_AGE_GROUP.AGE,
 CASE WHEN SRC_AGE_GROUP.TC > 1 OR SRC_AGE_GROUP.AGE_MIN IS NOT NULL AND SRC_AGE_GROUP.AGE_MAX IS NOT NULL AND SRC_AGE_GROUP.AGE_RANGE IS NOT NULL THEN SRC_AGE_GROUP.AGE_RANGE ELSE '0-110' END,
 SRC_SALES_PERSON_A.FIRST_NAME || ' ' || INITCAP(SRC_SALES_PERSON_A.LAST_NAME)
from (
SELECT 
 SRC_CUSTOMER_A.DEAR DEAR ,
 SRC_CUSTOMER_A.LAST_NAME LAST_NAME ,
 SRC_CUSTOMER_A.FIRST_NAME FIRST_NAME ,
 SRC_CUSTOMER_A.ADDRESS ADDRESS ,
 SRC_CUSTOMER_A.CITY_ID CITY_ID ,
 SRC_CUSTOMER_A.PHONE PHONE ,
 SRC_CUSTOMER_A.AGE AGE ,
 SRC_CUSTOMER_A.SALES_PERS_ID SALES_PERS_ID ,
 SRC_CUSTOMER_A.CUST_ID CUST_ID ,
 SRC_AGE_GROUP_A_1.AGE_MIN AGE_MIN ,
 SRC_AGE_GROUP_A_1.AGE_MAX AGE_MAX ,
 SRC_AGE_GROUP_A_1.AGE_RANGE AGE_RANGE ,
 ROW_NUMBER() OVER (PARTITION BY SRC_CUSTOMER_A.AGE ORDER BY SRC_AGE_GROUP_A_1.AGE_MIN ASC) RN ,
 COUNT(*) OVER (PARTITION BY SRC_CUSTOMER_A.AGE) TC 
FROM 
 ODI_STAGING.C$_0SRC_CUSTOMER SRC_CUSTOMER_A LEFT OUTER JOIN (
SELECT 
 SRC_AGE_GROUP_A.AGE_MIN AGE_MIN ,
 SRC_AGE_GROUP_A.AGE_MAX AGE_MAX ,
 SRC_AGE_GROUP_A.AGE_RANGE AGE_RANGE 
FROM 
 ODI_STAGING.C$_1SRC_AGE_GROUP SRC_AGE_GROUP_A 
 ) SRC_AGE_GROUP_A_1 
 ON SRC_CUSTOMER_A.AGE between SRC_AGE_GROUP_A_1.AGE_MIN and SRC_AGE_GROUP_A_1.AGE_MAX = SRC_AGE_GROUP_A_1.AGE_RANGE
 ) SRC_AGE_GROUP INNER JOIN ODI_STAGING.C$_2SRC_SALES_PERSON SRC_SALES_PERSON_A 
 ON SRC_AGE_GROUP.SALES_PERS_ID = SRC_SALES_PERSON_A.SALES_PERS_ID
where 
 (1=1)
 and (TC=RN)

Erroring on multiple rows increases the complexity further:

select 
 SRC_CUSTOMER_A_1.CUST_ID,
 INITCAP(SRC_CUSTOMER_A_1.FIRST_NAME) || ' ' || INITCAP(SRC_CUSTOMER_A_1.LAST_NAME),
 SRC_CUSTOMER_A_1.ADDRESS,
 SRC_CUSTOMER_A_1.CITY_ID,
 SRC_CUSTOMER_A_1.PHONE,
 SRC_CUSTOMER_A_1.AGE,
 SRC_CUSTOMER_A_1.COL_2,
 SRC_SALES_PERSON_A.FIRST_NAME || ' ' || INITCAP(SRC_SALES_PERSON_A.LAST_NAME)
from (
SELECT 
 SRC_CUSTOMER_A.DEAR DEAR ,
 SRC_CUSTOMER_A.LAST_NAME LAST_NAME ,
 SRC_CUSTOMER_A.FIRST_NAME FIRST_NAME ,
 SRC_CUSTOMER_A.ADDRESS ADDRESS ,
 SRC_CUSTOMER_A.CITY_ID CITY_ID ,
 SRC_CUSTOMER_A.PHONE PHONE ,
 SRC_CUSTOMER_A.AGE AGE ,
 SRC_CUSTOMER_A.SALES_PERS_ID SALES_PERS_ID ,
 SRC_CUSTOMER_A.CUST_ID CUST_ID ,
 (
SELECT 
 SRC_AGE_GROUP.AGE_MIN AS AGE_MIN 
FROM 
 (
SELECT 
 SRC_AGE_GROUP_A.AGE_MIN AS AGE_MIN ,
 SRC_AGE_GROUP_A.AGE_MAX AS AGE_MAX ,
 SRC_AGE_GROUP_A.AGE_RANGE AS AGE_RANGE 
FROM 
 ODI_STAGING.C$_1SRC_AGE_GROUP SRC_AGE_GROUP_A 
 ) SRC_AGE_GROUP 
WHERE
 (SRC_CUSTOMER_A.AGE between SRC_AGE_GROUP.AGE_MIN and SRC_AGE_GROUP.AGE_MAX = SRC_AGE_GROUP.AGE_RANGE
) ) COL ,
 (
SELECT 
 SRC_AGE_GROUP.AGE_MAX AS AGE_MAX 
FROM 
 (
SELECT 
 SRC_AGE_GROUP_A.AGE_MIN AS AGE_MIN ,
 SRC_AGE_GROUP_A.AGE_MAX AS AGE_MAX ,
 SRC_AGE_GROUP_A.AGE_RANGE AS AGE_RANGE 
FROM 
 ODI_STAGING.C$_1SRC_AGE_GROUP SRC_AGE_GROUP_A 
 ) SRC_AGE_GROUP 
WHERE
 (SRC_CUSTOMER_A.AGE between SRC_AGE_GROUP.AGE_MIN and SRC_AGE_GROUP.AGE_MAX = SRC_AGE_GROUP.AGE_RANGE
) ) COL_1 ,
 COALESCE(((
SELECT 
 SRC_AGE_GROUP.AGE_RANGE AS AGE_RANGE 
FROM 
 (
SELECT 
 SRC_AGE_GROUP_A.AGE_MIN AS AGE_MIN ,
 SRC_AGE_GROUP_A.AGE_MAX AS AGE_MAX ,
 SRC_AGE_GROUP_A.AGE_RANGE AS AGE_RANGE 
FROM 
 ODI_STAGING.C$_1SRC_AGE_GROUP SRC_AGE_GROUP_A 
 ) SRC_AGE_GROUP 
WHERE
 (SRC_CUSTOMER_A.AGE between SRC_AGE_GROUP.AGE_MIN and SRC_AGE_GROUP.AGE_MAX = SRC_AGE_GROUP.AGE_RANGE
) )),'0-110') COL_2 
FROM 
 ODI_STAGING.C$_0SRC_CUSTOMER SRC_CUSTOMER_A 
 ) SRC_CUSTOMER_A_1 INNER JOIN ODI_STAGING.C$_2SRC_SALES_PERSON SRC_SALES_PERSON_A 
 ON SRC_CUSTOMER_A_1.SALES_PERS_ID = SRC_SALES_PERSON_A.SALES_PERS_ID
where 
 (1=1)

Finally, electing to ignore no-match rows removes the outer join

select 
 SRC_CUSTOMER_A.CUST_ID,
 INITCAP(SRC_CUSTOMER_A.FIRST_NAME) || ' ' || INITCAP(SRC_CUSTOMER_A.LAST_NAME),
 SRC_CUSTOMER_A.ADDRESS,
 SRC_CUSTOMER_A.CITY_ID,
 SRC_CUSTOMER_A.PHONE,
 SRC_CUSTOMER_A.AGE,
 SRC_AGE_GROUP_A.AGE_RANGE,
 SRC_SALES_PERSON_A.FIRST_NAME || ' ' || INITCAP(SRC_SALES_PERSON_A.LAST_NAME)
from (ODI_STAGING.C$_0SRC_CUSTOMER SRC_CUSTOMER_A INNER JOIN ODI_STAGING.C$_1SRC_AGE_GROUP SRC_AGE_GROUP_A 
 ON SRC_CUSTOMER_A.AGE between SRC_AGE_GROUP_A.AGE_MIN and SRC_AGE_GROUP_A.AGE_MAX = SRC_AGE_GROUP_A.AGE_RANGE
 ) INNER JOIN ODI_STAGING.C$_2SRC_SALES_PERSON SRC_SALES_PERSON_A 
 ON SRC_CUSTOMER_A.SALES_PERS_ID = SRC_SALES_PERSON_A.SALES_PERS_ID
where 
 (1=1)

This just shows the richness of implementation available in the lookup operator – and, more importantly, a way to remove the multiple steps we might otherwise have to implement around a simple join.

What’s in a Name?

Another of those blogs so that Google can remind me of stuff I forget.

We all know the importance of naming conventions: they are how we know the purpose of our tables and columns, and a means to identify what our constraints and indexes do without having to resort to the data dictionary to look them up. True, as ETL developers we are hampered in naming conventions for our data sources (we get what we are given), but downstream in our Staging, Raw Data Reservoir (if you are bang up to date), Foundation, and Data Access & Performance layers we can influence the naming to create a valuable metadata asset that helps the whole development team (both ETL and reporting) and business users alike.

ODI 12 has introduced a component-based development paradigm. We can now drop components such as filters, joins, expressions, aggregators and splitters directly on to our mapping canvas in ODI Studio. In the real world we are likely to drop multiple joins and expressions on to the canvas, and here we see another opportunity to name things, this time the operators themselves. Instead of Filter, Filter1, Filter2 we should give meaningful names such as ‘DateRangeFilter’, ‘ActiveCodeFilter’, ‘LargeStoreFilter’. Doing this makes our mappings more understandable; product_code_lookup is far easier to understand than ‘Lookup5’.

It may seem less important to name our expression operators, but in reality this is probably the most important thing to name, and to name uniquely across our mappings.

If we write a mapping that creates C$ tables, ODI 12 could well create a C$ table that encapsulates the extract from the source expression. This does not use the source table name (as ODI 11 and earlier did); instead it uses the expression operator name. Thus we might create a table called C$_0_EXPRESSION. This would be OK if we only ran our mappings serially. However, if we go for parallel execution using a Load Plan we have the potential for two or more mappings to execute at the same time, all of them creating, inserting into, selecting from and dropping the same C$ table – inevitably leading to execution failures. To make our debugging harder, these errors appear to occur randomly and do not always generate the same error condition: sometimes a table does not exist, sometimes a column does not exist, and sometimes the mapping actually works as expected. It is even conceivable that two mappings share the same structure for the C$ table and mapping 1 creates and populates the C$ table only to have mapping 2 drop and recreate the same table before mapping 1 uses the data.

Moral: give all of your ODI 12 operators sensible names and ensure that expression names are unique if there is the slightest chance that two mappings may execute simultaneously.

Side note: it is possible to force ODI to create ‘process specific’ table names (a sort of run PID appended to the C$ name). This can be done on the physical tab of the mapping editor by checking ‘use unique temporary object names’. It is a simple fix for the parallel issue, at the potential loss of the meaningful names that make our mappings self-documenting. My suggestion is to use both the unique names checkbox and your own meaningful names.

Random Random Thoughts

Last time I wrote about generating fake data for testing. Continuing with that theme here’s a quick note on getting non-even distributions.

Let’s say I have an employee dimension and I need to allocate staff across 4 offices (for simplicity, let’s say the offices are called 1, 2, 3 and 4). There are several things we can do depending on how we want to mix up our staff. Simplest is the use of the NTILE analytic function to divvy up the employees across 4 buckets (or, as we call them here, offices).

Select NTILE(4) OVER (ORDER BY DBMS_RANDOM.VALUE()) as OFFICE_ID from EMP_SRC;

The hard part is to come up with a suitable ordering key for the ORDER BY – it is quite unlikely that ordering by employee name is a realistic solution, so I often use a random order just to mix things up. The next simplest approach is round-robin, where our office number is the row number of our employee table MOD 4 (plus 1 to get the office number into our 1 to 4 range), as sketched below. Both of these techniques will ensure that each of our 4 offices has the same number (±1) of staff.
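A minimal sketch of the round-robin version, assuming a hypothetical EMP_ID column to order by:

select EMP_ID,
       mod(row_number() over (order by EMP_ID), 4) + 1 as OFFICE_ID
from   EMP_SRC;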

But what if we have been asked to allocate different numbers of staff to each office? Here DBMS_RANDOM.VALUE(LOW_RANGE_LIMIT, HIGH_RANGE_LIMIT) comes into play, giving a random number >= LOW_RANGE_LIMIT and < HIGH_RANGE_LIMIT. The thing to note is that the numbers returned are decimals and not integers, so we will need to convert them to integers to map to our OFFICE_ID. So let’s go with ROUND.

select OFFICE_ID, count(*) from
(
select
round (dbms_random.value(1,4)) OFFICE_ID
from dual
connect by level <=1000
)
group by OFFICE_ID order by OFFICE_ID;

So we expect about 250 people per office, yes?

Well, no! We get something like this:

OFFICE_ID COUNT(*)
---------- ----------
1 157
2 343
3 338
4 162

So what has happened? The simple answer is that ROUND returns the nearest integer to the random value, so if we get a number between 1.50 and 2.49 (well, actually to more decimal places than that) we round it to 2. However, the first and last buckets are restricted – the value is never less than 1 or more than 4 – so Office 1 only maps to random values between 1.00 and 1.49 and Office 4 only to values between 3.50 and 4.00, roughly half a unit of the range each, while Offices 2 and 3 each get a full unit; hence the middle offices pick up roughly twice as many staff as the outer ones.

So what happens if we replace ROUND with TRUNC? Well, the first thing to note is that we only get three offices back.

select OFFICE_ID, count(*) from
(
select
TRUNC (dbms_random.value(1,4)) OFFICE_ID
from dual
connect by level <=1000
)
group by OFFICE_ID order by OFFICE_ID;
OFFICE_ID COUNT(*)
---------- ----------
1 339
2 316
3 345

Remember that the upper limit of the range is a ‘less than’ limit. So to get our distribution across 4 offices we need to set the upper limit to 5. 
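Adjusting the upper limit accordingly:

select OFFICE_ID, count(*) from
(
select
TRUNC (dbms_random.value(1,5)) OFFICE_ID
from dual
connect by level <=1000
)
group by OFFICE_ID order by OFFICE_ID;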

OFFICE_ID COUNT(*)
---------- ----------
1 257
2 243
3 249
4 251

Now, I am not saying that ROUND or TRUNC will give you the ‘best’ randomization; I am saying that they will give different results – perhaps you want a shoulder on the distribution, perhaps you want something flatter.

Choose wisely as Indiana Jones was told.

 

Making Stuff Up

My whole data career has been about finding the truth within data, seeking correlations within and between data sources (and not always relational sources at that). You get to notice patterns in the world and even develop skills to spot outliers on a rapidly scrolling screen. In those early days I used tools like i2 Analyst Notebook (now an IBM offering) to search call data records or (more fun) some of the free text systems I had access to.

In DSS / BI / Data Science / whatever, we just can’t do our job of finding patterns and relationships if the data is truly random. Relationships always exist (or are at least statistically significant). For example, the non-fiction books people buy from Amazon often show a bias towards the buyer’s occupation or education; the bookcase above my desk has a large number of database and mathematics books with the odd volume on management. It’s these biases that make building BI test or training data more complex than just randomly populating columns.

We could use obfuscated data, but that is sometimes problematic, as obfuscation needs careful thought to prevent inadvertent exposure of live data through subject domain knowledge and guessing. Often, though, for test and training we need to come up with 100% synthetic content.

Personally, I prefer to see my synthetic data in my own language. I have a bit of a bee in my bonnet about ‘Zsdefes Aagrtf’ being my top customer when ‘Ricco daCosta’ or ‘Claire Wilson’ would look so much better on generated reports. Likewise, randomising CITY, STATE and COUNTRY as independent entities may be great for building ETL test sets, but for reporting we end up with a far too flat distribution and many repeating city names that relate to completely different locations; just how many Little Rock, Utah really exist in Belgium? It is relatively easy to build a table of real city-state-country data and then map that to our synthetic geography dimensions. If we will be analyzing our data on phone number or postal code I’d extend my table to include realistic area codes and representative postal codes. As an aside, unless I am doing single-country work I tend to avoid using postal codes, as there is no global standard for where a postal code sits in an address hierarchy: UK and Netherlands postcodes go down to a few houses on a street, French codes are at the grain of one or more whole towns or villages within a Department, and US Zip codes sit somewhere in between and might represent an area in a large city.

We can’t just randomly allocate our fake customers to our generated city list or assign a random Standard and Poor’s credit rating to our corporate accounts. These things actually follow some form of distribution: we may well have more customers in San Francisco than in Fresno, and there are far fewer ‘AAA’ rated companies around than ‘A’ rated ones (but hopefully more ‘AAA’ than ‘D’). Thus we need to map some form of distribution over the synthetic data. If we have a real data source and something like Oracle Enterprise Data Quality (OEDQ) we can easily build the representative data distributions we need to superimpose on our synthetic data, or we could go home-grown with R or even DBMS_STAT_FUNCS.SUMMARY to find how the data is distributed. Often, though, something really simple like the analytic NTILE function will allow us to readily combine our source data with the synthetic obfuscation we are applying.
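As an example of the home-grown route, a quick profiling sketch (REAL_ACCOUNTS and its SP_RATING column are hypothetical stand-ins for a real source):

select sp_rating,
       count(*) as cnt,
       round(ratio_to_report(count(*)) over () * 100, 1) as pct  -- share of the total
from   real_accounts
group by sp_rating
order by cnt desc;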

My fake-data tool chest contains public domain geo-data (preferably including city size so that I can distribute on that), the DBMS_RANDOM family of Oracle functions, the NTILE and ROW_NUMBER analytic functions, and the good old CASE statement to help bucket up small domain data sets such as the distribution of S&P ratings. Building good synthetic data is a great exercise and allows you to get quite inventive with SQL.

As a recent example I needed to add a Standard and Poor’s rating attribute to each of the 5000 members of our company dimension. The testing requirement was for a realistic distribution of ratings over the dimension, but there was no need to use real company rating data.
To do this I created a couple of staging tables. The first holds the company key and a bucket number; here I am using 175 buckets.

create table bucket_list as select unique company_id, ntile(175) over (order by company_id) bkt from COMPANY_D;

I am using unique as company_id is the business key and for this use I want to keep the same rating across any dimension change history.
Next I get creative and build a simple table of S&P rating and start and end bucket numbers. Here I shape the distribution of the ratings; if I had the real distribution I could model that, but for this use I just shape a distribution that looks nice!

--------------------------------------------------------
 -- DDL for Table SP_DISTRIBUTION
 --------------------------------------------------------
CREATE TABLE "SP_DISTRIBUTION" 
 ( "RATING" VARCHAR2(10 BYTE), 
 "BUCKET_START" NUMBER, 
 "BUCKET_END" NUMBER
 ) ;
 REM INSERTING into SP_DISTRIBUTION
 SET DEFINE OFF;
 Insert into SP_DISTRIBUTION (RATING,BUCKET_START,BUCKET_END) values ('AAA',1,3);
 Insert into SP_DISTRIBUTION (RATING,BUCKET_START,BUCKET_END) values ('AA',4,10);
 Insert into SP_DISTRIBUTION (RATING,BUCKET_START,BUCKET_END) values ('A',11,25);
 Insert into SP_DISTRIBUTION (RATING,BUCKET_START,BUCKET_END) values ('BBB',26,50);
 Insert into SP_DISTRIBUTION (RATING,BUCKET_START,BUCKET_END) values ('BB',51,90);
 Insert into SP_DISTRIBUTION (RATING,BUCKET_START,BUCKET_END) values ('B',91,128);
 Insert into SP_DISTRIBUTION (RATING,BUCKET_START,BUCKET_END) values ('CCC',129,150);
 Insert into SP_DISTRIBUTION (RATING,BUCKET_START,BUCKET_END) values ('CC',151,165);
 Insert into SP_DISTRIBUTION (RATING,BUCKET_START,BUCKET_END) values ('C',166,174);
 Insert into SP_DISTRIBUTION (RATING,BUCKET_START,BUCKET_END) values ('D',175,175);
 COMMIT;

We could use analytic functions to derive the bucket_end value but for a small data set like this it is often as easy to type the values in.
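If we did want to derive the ranges, a running total would do it – a sketch assuming a hypothetical RATING_WEIGHTS table (RATING, RATING_ORDER, BUCKETS) that says how many of the 175 buckets each rating should get:

select rating,
       sum(buckets) over (order by rating_order) - buckets + 1 as bucket_start,
       sum(buckets) over (order by rating_order) as bucket_end
from   rating_weights;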

Finally, we join these two tables and use them in a merge statement to update the dimension.

merge into company_d dd using (
 select company_id, rating from BUCKET_LIST, SP_DISTRIBUTION
 where
 bkt >= SP_DISTRIBUTION.BUCKET_START and bkt <= SP_DISTRIBUTION.BUCKET_END) h
 on (h.company_id = dd.company_id)
 when matched then
 UPDATE SET 
 dd.sp_rating = h.rating;

A thought on ODI variables

One of the projects I have been working on recently had a requirement to enrich some data by bringing back additional information from a legacy system. For various reasons they did not want the whole of the legacy system extracted to the enterprise data warehouse; we only need to extract data for records that change in the current load cycle.

This may sound simple: we join the active records to the legacy system to create an enriched data set. There is, however, a problem with this approach (at least on Oracle databases where the legacy system is accessed through a database link), and that is that driving-site hints are ignored on insert statements. We end up with a horrible execution plan, as mentioned here by Jonathan Lewis. Yes, there are solutions, such as the pipelined table approach that I developed for that customer.

Sometimes we need to step back and think “is the use of a join the right thing to do?” Can I code this another way that doesn’t require me to write a function I will need to maintain in the database?

ODI gives us two fundamental ways of restricting the amount of data we bring back: the join and the filter. Is there a way we could use a FILTER to restrict the data? Obviously we can hard-code filters such as “data1.STATUS=‘Active’”, but what if we need the flexibility to dynamically choose our status value? This is of course simple for a single value – we use a variable in the filter. But what if we need to filter using an IN list, for example ‘data_source.key IN (1,2,5,7,99,2345)’? How do we supply a variable number of values? Here is one method that we can use.

Oracle 11gR2 introduced the LISTAGG function to the database. This function basically concatenates several rows of data into a single string. In our case we want to generate a comma separated list of values that we can slot into our filter expression in ODI. The variable refresh is a simple SELECT LISTAGG… from our list source table. Our filter expression in ODI would look something like 

[Screenshot: the ODI filter condition referencing the variable in an IN list]

The Oracle documentation for LISTAGG has the full syntax. For our use we need to specify the delimiter between items. For numeric matches this is simply a comma (','); it is a bit more complex for string matches, where we need to quote each item, so we need to quote our comma delimiter – and we also need to quote the variable in the ODI filter condition so that we supply the leading and trailing quotes.

For example we could define the refresh for the variable Filter_List as

SELECT LISTAGG(TO_CHAR(PRODUCT_KEY), ',') WITHIN GROUP (ORDER BY PRODUCT_KEY) from PRODUCTS_LOADED
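For a string match the delimiter itself needs quoting; a sketch assuming a hypothetical PRODUCT_CODE column, with the variable then wrapped in single quotes in the ODI filter condition (for example PRODUCTS.PRODUCT_CODE IN ('#Filter_List')) so that the expansion becomes a valid quoted IN list such as 'A1','B2','C3':

SELECT LISTAGG(PRODUCT_CODE, ''',''') WITHIN GROUP (ORDER BY PRODUCT_CODE) from PRODUCTS_LOADED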

I try to remember to explicitly define type conversions when converting between numbers and text, to avoid any implicit conversion surprises.

There are of course limitations to this approach: there is a finite size for an ODI character variable, and there is the limit on in-list size in the database. But for reasonably sized selections this approach works well.