Using ODI with Git Versioning – Part 2: A Game of Tag

Last time I blogged about getting started with the new Git VCS functionality in Oracle Data Integrator (ODI) 12.2. Then, I promised to write a bit on integrating Git into the deployment process.

Firstly, a disclaimer. I used to work for an IT services company that also ran one of the largest ITIL training organizations in EMEA. In the tradition of “eat your own dog food”, it was inevitable that the disciplines of release control and configuration management underpinned all we did. We also developed (in collaboration with other global IT companies) methodologies to bring the ‘software’ side of the house into a similar framework to the one used for ‘infrastructure’. Of course, this was not a simple drop-in of a process; developers have different needs from infrastructure deployment folk. From the deployment side of the house we need to document and create packages of artefacts to deploy (these may be scripts, binary files, ODI scenarios) and at the same time freeze a copy of the development source code so that developers and support staff have a code commit point to work from for subsequent development and bug-fix activities. Developers need more than that: they need to be able to save versions during the development process so that they protect the code they are developing from all kinds of mishap. Often the code that developers need to track in an SCM is far more than the deployed codebase, and the waters are muddied further when we consider commonplace processes such as continuous integration. Basically, it boils down to my expectations of Git-based development and deployment, and how those expectations are actually met in the SCM integration Oracle documented when they introduced Subversion VCS to ODI.

Organizing your work

Although the Oracle ODI versioning documentation suggests that small teams can do their development in trunk, that is simply not the best way to approach this, for two reasons. Firstly, ODI’s Create New Version paradigm always saves and commits to the remote Git repository; it is far cleaner to develop in an isolated branch and have a Git release manager or deployment engineer manage the downstream SCM merge process. Secondly, if we are using any automation based on Git commit hooks (such as continuous integration processes using Jenkins) it is sensible that we do this from a “clean” codebase and use the commit messages from the merge as the process trigger.

The other organizational question is where does everything sit in the environment? To help explain this I have created this simplified schematic:

Schematic

As you can see, there is a remote file system holding the whole Git repository. This remote Git repository may be at a hosting provider such as github.com, or it could be an in-house repository running on some Git server. The developer clones a single branch from the remote repository to a local Git file system (usually) on the developer’s workstation. So far this is standard Git practice. ODI adds to this by including a database-hosted ODI repository. Optionally, the Git file systems can seed the ODI repository, and of course the ODI repository sends changes to Git (local and remote repositories) whenever we create a new version.

It all sounds simple. The potential “gotcha” is that ODI only supports a single set of objects in the work and master repositories. As an example: I am developing some new mappings in my “SalesDevJan2017” branch and my colleague, Lisa, is working in her “CustomerDevMar2017” branch. Both of our branches contain some datastores and mappings that relate to customer; what if Lisa adds some new attributes to the customer table that won’t go live until the quarter after my changes? As soon as I version all changed objects there is a risk that my branch will contain code and objects that haven’t been finalized yet.

We have four basic options:

  1. Use ODI as the master codebase and Git as an out-of-line code store. This is pretty much what we did in ODI 12.1 (and earlier) days: create XML exports of the objects and store them in a file system. We get the benefit of version restore, but little else. I just don’t like this; ODI is never going to be the sole source of the artefacts we need to version in a project (think DDL, documentation, report definitions etc). Git should be our corporate repository of source, not a cluster of multi-master islands of code.
  2. Create a branch for each sprint (if we are agile) and have all developers work in the same branch. Of course, we still have potential problems for work that takes longer than a single sprint.
  3. Create an ODI WORK repository for each developer and develop in individual branches. This gives great isolation for all the design components, but still requires shared components in the security and topology parts of the MASTER repository.
  4. Give each developer his or her own ODI repository database and a Git branch to work in. This gives us maximum isolation of development. Obviously, there may be a potential license cost for all of those repository databases; however, we can consider ways to mitigate this, such as named user licensing on development / test platforms or by running multiple ODI repositories on the same platform, either as separate DB instances or as multiple schemas within one database.

The flow for option 4 could be:

  • Clone the master code line to create a new branch
  • Populate the ODI work and master repositories from the branch
  • Develop ODI code and version when necessary
  • Version the ODI repository
  • Merge the branch back into the master

Versioning

Deployments

Here my release management gene kicks in: I want to keep source control separate from code releases. True, there is always a need to tie code at a point in time to the executable version of released code, but we don’t really want to clutter our code repository with a large number of executables that are not required. We also need a way to handle the executables away from the code; there is no place for source code on any platform other than development (well, maybe hot-fix, but that is really just a special case of a Git branch). I also want a way to back out a release if things go wrong.

In ODI terms, I will need to deploy the scenarios and load plans generated from ODI. I also need to deploy any non-ODI configuration scripting; this will include shell scripts, DDL files, data seeding scripts etc. The non-ODI scripts are fairly straightforward: we just create additional directories in our branch to hold the scripts and manage them through a Git client or web interface. For the ODI materials we can use another ODI 12.2 feature, the deployment archive.

In simple terms, the deployment archive is a pair of zip files, one containing the XML source files, the other containing the executable scenarios and load plans. It is this executable archive that we need to deploy to our testing and production environments. The source zip should not contain anything we do not already have in SCM, but it is nice to have a single file that ties to the released executables. In fact, the deployment archive is a tad more sophisticated than this: we create ‘initial’ deployments (all content) or ‘patch’ deployments (just the changes to apply), and for patches we can also generate ‘rollback’ archives that allow us to revert them.

The deployment archive functionality is linked to the concept of ‘tags’ in SCM. Git tags are not the same as Subversion labels, but for our ODI deployments this is not too relevant. We create a tag in Git from ODI Studio and then use this tag to identify all of the objects to include in the deployment archive. ODI has a concept of a partial tag; this is of course not a real Git term. Think of a partial tag as a user-generated list of objects; the editor for a partial tag is quite similar to the one for smart exports. However, I think it is better to use a full tag, that is, all the tracked objects.

TeamVCSActions

Once our tag is in Git we can use Create Deployment Archive from VCS Label to generate the archives. This gives us a set of zip files we can add to Git and then deploy. Better yet, we can use the ODI scripting SDK to develop an automated process to roll out code to test, UAT, and production as part of a CI workflow.


Using ODI with Git Versioning – Part 1: The Basics

Last time I went through some of the basics about getting the latest patch-set of ODI to connect to a GitHub.com VCS repository. Today I will talk a little about versioning our ODI development in Git.

Navigator

The first point to make is that we can select the objects we wish to version-track in Git; it need not be the whole repository. We could just track an individual mapping, but most likely we would be tracking one or more ODI repository projects or project folders. I don’t know about you, but my preferred development style is to use a single project containing multiple folders to organize my codebase; I also create a ‘development playground’ folder where I can try out ideas. Whether we source-track the playground folder is debatable. It is a code asset and should be tracked; however, it may not sit well with some aspects of ODI Deployment Archive creation from Git sources. I will write more on deployments next time.

The easiest way of adding ODI content to Git VCS is by right-clicking on the object in the ODI designer navigator and then Version/Add to VCS.

AddContentToGit

This leads to the Add Objects To Git dialog box. As I mentioned in the previous blog, ODI uses a commit and push model; it is mandatory to provide a commit comment every time you version an object in ODI Git, and this dialog box is no exception. As you can see, versioning at the project (or folder) level will allow us to include child objects; this can save us work and, more importantly, add multiple items in a single commit.

AddToGit

Include Dependencies will ensure that dependent ODI objects are included in the files pushed to Git; the Regenerate and Version Scenarios option is self-explanatory. In the ODI Studio navigator, Git-versioned ODI objects have a small green ‘g’ on the object icon.

Having pushed to Git we can see what is in there! Looking in my local file system’s Git repository clone I see that ODI has generated two folder trees, one for the Master Repository and the other for the Work Repository. The structures of the two repository trees follow the same basic outline as you would see on the designer and topology navigators in ODI Studio.

 

GitTree

In my screenshot I have also included a couple of other objects that are not managed by ODI (a file and a folder); most organizations would also keep non-ODI items in the same Git repository, for example DDL scripts, data models, shell scripts, deployment notes, in fact just about any resource that needs versioning — being able to do this is an important plus.

Now that we have ODI source code in Git we can get on with developing and versioning as we go. As previously described, I am not in favor of automatically creating a new version on each save; rather, I prefer developer choice regarding the right time to commit and push.

We just right click on the object we want to version in the source navigator to access the Version context menu and then select Create VCS Version…

VersionMenu

As always, we will need to provide a commit comment for the save. Note too, that ODI is smart enough not to commit to Git if there have been no changes.

NoChange

Note the Version menu is context specific; if we right click on a container item such as Folder or Model we get an additional option to Add Non-versioned Objects to VCS… This does exactly what we expect: it brings up a list of objects in that container that are not currently under version control and allows us to select the ones we need to add to Git.

Other options on any VCS-tracked object are Version History, Version Tree, and Restore From VCS.

VersionHistory

From the version history we can restore a previous version from VCS. As a developer, I find it difficult to remember which commit date is the version that I need to restore. Instead, I prefer to use meaningful commit comments such as a Jira ticket number or release number. Of course, with Git we could use these commit comments as part of a workflow triggering other updates to our systems. The version history screen also lets us access a useful 3-way version compare window. This allows us to compare the Git baseline with the selected version and the current state of the ODI repository, or to compare two selected versions against the baseline. In my example below, I added the filter FILTER1 to the baseline mapping and saved to Git; next I renamed the filter Low_numbers and saved it to the ODI repository but did not push to Git.

Screenshot: 3-way version compare

 

Clicking on the ‘Select Version’ button brings up the version history window shown on the right of the screenshot above. Again, I find it easier to use the commit comment rather than the date column to identify versions, especially if we create multiple versions per day. The final option is to display a Version Tree.

Version Tree

A version tree is a graphical representation of the Git commits for an object, again with access to the version compare feature. The tree is color-coded with the initial create in green, changes in black and deletes in red. For my example mapping I tested the ability to recover a deleted version and hence I get a red spot representing the delete and a yellow one for the restore version step. Hover the mouse over a dot and a pop-up window shows additional detail on the version. Again, we get direct access to the 3-way compare feature by clicking on the icon in the screen menu bar.

Team Menu

The other place where the developer interacts with the ODI-Git versioning system is on the Team menu -> VCS Actions. I’ll cover most of these options in part 2 of this blog, but for now here are a few developer-useful features of VCS Actions.

TeamVCSActions

Pending Changes identifies all of the Git tracked objects that have been saved to the ODI metadata repository but have not yet been pushed to Git.

Pending Changes

We can, optionally, push one or more of the changes to Git from the Pending Changes viewer or select a single ODI object and restore it from Git. In fact, I prefer creating my new versions from the Pending Changes screen — it is a clear interface and allows us to ensure that all of the changes are committed in a single operation.

The Add Non-versioned Objects option from the VCS Actions menu is not quite the same as the one reached by right clicking on a folder or project in the Navigator. In the navigator we can add all the child objects and their dependencies; in the Team menu we have the ability to select anything in the repositories. If you are working with fine-grained development of just a project or a couple of folders, this may be too big a selection choice. For most of my initial use cases I would only really want to version the objects created or customized in the development process.

One final thing I’ll mention now is the ability to bring back deleted objects; yes, sometimes developers need to do that.

In the final part of this blog series on the new Git features in ODI I’ll talk about tags, deployments, and starting afresh in a repository.

Source, The No Longer Secret Sauce

Working as an IT professional I need to adhere to standards; working in consulting, these standards may vary between customer organizations. Some places are Agile shops, some are classic waterfall, others use whatever flavor of methodology suits them best. Two things are a given though: we use source control to protect and version the code assets we have spent a major part of our project time creating, and we use some form of deployment methodology that ensures we can roll out software releases in a controlled way and, if need be, revert to a previous stable version.

Historically, interfacing GUI-based data integration tools to industry standard version control software (VCS) has been less than easy. For example, to store ODI mappings in Git I would need to write some scripting to export the design as an XML file to a file system containing my Git branch, then execute some Git commands to commit my change and possibly push it to the remote Git repository. True, ODI had a Create Version feature; however, I never really loved it. This version is basically a ‘labeled’ copy of Design Repository objects held in the Master Repository. Two aspects of this feature were less than endearing.

  • The versions are only held in the ODI repository database; code security is through database backups and not whatever corporate standard VCS is in use.
  • There are no scripting APIs to automate this versioning process – it is all point and click, and that is often unacceptable in an enterprise IT situation.

ODI 12.2 introduced the use of Subversion as an integrated VCS tool within ODI Studio. This was a significant step forward; however, many IT organizations have adopted Git as the VCS of choice. The latest patch-set release of ODI 12.2 addresses this and gives us the ability to use Git as a VCS platform alongside ODI as a data integration tool.

Over the past few months I have taken an early look at the newly released Git functionality. One point to note is that the versioning process used in ODI Git integration is commit and push; this may not be the norm for developers used to committing their changes as they work and doing a single push as a milestone in the development process. Thought will need to be given to the most appropriate way to manage when and how to branch the master tree in the Git system; developing in master may not be the best choice.

When I use Git I am often using GitHub.com as a cloud-hosted Git platform. For security I use two-factor authentication (2FA) on my desktop client and for web access. If you have to use 2FA to access your Git system from ODI you need to generate an application token to use instead of your Git password in ODI. If you are using GitHub.com the token generator needs to know what level of access to Git you need, and for ODI this will be a level that allows you to push and commit. One watch point is that the generated token (a long hexadecimal string) is only visible at the time of creation; if you forget the string you need to create a new one – you should, of course, revoke the old one to reduce the risk of unused tokens being at large.

Setting up Git access in ODI is done through the Team menu. First use ‘Switch Versioning Application’ to select Git versioning, then work through the ‘Settings’ sub-menus in order from the top (shown below).

TeamMenu

First, edit the connection. For GitHub.com with 2FA you must use SSL authentication to write to the remote repository – the other options are not going to work for you and will error either straight away or at the stage where you try to create your local clone repository.

Connection

Next, create a local clone of the Git repository – the remote Git URL is already populated; we just need to give the local repository location, which will be created if necessary.

SetClone

The last part is to select an existing branch to work with. The branch selector will list all of the branches currently in the remote Git repository. The New Branch button will create a new branch based on an existing one.

SetBranch

One final point – avoid selecting the Auto Version option checkbox. Selecting this feature will commit and push a new version each time you hit save in the ODI repository; that’s almost certainly not a good choice to make.

Next time I will blog on adding objects to ODI version control and the workflow steps you follow in managing source control and creating deployment archives. To do that I will need to talk about how ODI versioning sets up its file systems.

OTN Appreciation Day – Looking for Change

As you may have seen in the Twittersphere, today is a special date in the Oracle community, a day where we can express thanks to OTN for the wealth of technical content that makes our lives easier.

For my tiny contribution to the day, here is a little note on detecting change using SQL MINUS; maybe not my absolute favourite feature, but that one has already been blogged about today!

In data warehouses we need to ensure that we refresh the content to keep it in line with the source systems. However, in the real world we find that only a small amount of data actually changes between loads; we may get a few new customers and a few others may update their details, but the bulk are the same from one day to the next. Obviously we don’t want to truncate and replace our data on each load, nor do we want to update all of our records. We just need to process the new and the changed data.

For some people a change data capture approach, be it Oracle GoldenGate, old-style Streams-based CDC, or even triggers, is the way to go. It is important to remember, though, that straight data replication is not what we want; we need to consume change in a way that will load our dimension tables with the deltas only, so some form of (in ODI speak) “journalization” is essential. For others, setting up CDC is not the right solution.

In this case we need to compare the data we are loading with the data we already hold and look for changes. Some developers do this by outer joining source to target and then comparing columns to find changed data, either through direct comparison of columns or by computing an MD5 hash of the concatenated columns. Either way, we need to do something with those pesky NULLs in our data or we may get false matches or even miss changed rows.
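As a rough illustration of that join-and-compare style (the STAGE_CUSTOMER and DIM_CUSTOMER tables, their columns, and the '~' padding value are all hypothetical):

select s.*                                               -- new or changed stage rows only
from   STAGE_CUSTOMER s
       left outer join DIM_CUSTOMER t on t.CUSTOMER_ID = s.CUSTOMER_ID
where  t.CUSTOMER_ID is null                             -- brand new row
   or  nvl(s.CUST_NAME, '~') <> nvl(t.CUST_NAME, '~')    -- NVL padding stops NULLs defeating the test
   or  nvl(s.CUST_CITY, '~') <> nvl(t.CUST_CITY, '~')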

A more NULL-tolerant approach is to use a set-based one. The classic is something like SELECT a,b,c from StagedData MINUS select x,y,z from TargetData. Using SQL MINUS like this we get rows in our result set that do not exist in our target plus rows that are different in the two sets, stage and target.

Some people worry that the SQL MINUS is going to be a heavyweight query, but there is a simple thing we can do to lighten the load. Instead of a straight

SELECT * from STAGE_TABLE -- a 20 row table
MINUS
SELECT * from TARGET_TABLE -- a 50 million row table

we pre-filter the target table to only include rows matching the stage table’s business keys; an EXISTS or an inner join would do nicely here.
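A sketch of the pre-filtered version, assuming CUSTOMER_ID is the business key and that the stage and target tables share the same column list:

SELECT * from STAGE_TABLE
MINUS
SELECT * from TARGET_TABLE t   -- only target rows whose business key appears in stage
WHERE EXISTS (SELECT 1 from STAGE_TABLE s WHERE s.CUSTOMER_ID = t.CUSTOMER_ID)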

Simple, but an effective way of reducing the amount of data being processed in a set-based query.

Modules of Knowledge

One of the features of Oracle Data Integrator that sets it apart from other data integration tools is the use of Knowledge Modules to define how parts of the data integration process occur. The developer creates a data load mapping from sources, joins, filters and targets (and in ODI 12 even more components are available). However, it is the knowledge modules that have responsibility for describing the code that will be generated at execution time. The same logical design could be coded as a bulk truncate-insert with one set of KMs and as an incremental update using another.

True, there are many out-of-the-box KMs, or KMs available for download from the ODI Exchange, but at some point most developers will wish to customize an existing KM or create a new piece of functionality for their ODI processes. Typically we work with Integration KMs (IKM) and Load KMs (LKM) as these do much of the ODI heavy lifting.

Writing a KM sounds complex but the process is not too daunting. Under the covers there is a lot of similarity between writing an ODI procedure and a KM. Both can have multiple steps, both can use a variety of technologies, both are able to execute code steps on the “source” and / or “target”; the step editors are very similar. The key difference is that we write a procedure to do some “one-off” piece of functionality such as update a control table or SFTP some files to a remote platform; the KM however provides a more generic process that can use ODI metadata to describe the data structures being manipulated.

Here at Red Pill Analytics we have already produced some IKMs for our own set-based + analytic function approach to loading and maintaining Slowly Changing Dimensions. We have created custom versions of this KM to support Oracle 11g (and higher) and the latest MS SQL Server versions (some SQL syntax, keywords and functions differ between MSSQL and Oracle, hence the need for two versions). Similarly, we have created a bulk incremental update IKM for MSSQL, using the “EXCEPT” set operator (EXCEPT is MSSQL’s equivalent of Oracle’s MINUS) to identify changes between the work table and the target. More recently we have created custom KMs for cloud-hosted databases, the latest being the one we wrote for the Snowflake Elastic Data Warehouse service.

Our first Snowflake KM was a special RKM to reverse engineer an existing Snowflake database into the ODI data stores and models. This KM utilizes Snowflake’s information schema to retrieve table design. Our second Snowflake KM was an IKM to bulk load a Snowflake target table from a SQL source.

Although it is quite possible to load most cloud databases using standard SQL insert commands executed through a JDBC driver, it is preferable to adopt a different approach for moving large numbers of rows to cloud hosted platforms such as Amazon Redshift or Snowflake. For this type of use we should extract our data to file and then use platform specific file load commands to populate our target table. This approach is equally valid for bulk loading relational databases such as Oracle and is especially recommended for Exadata.

Rather than give the exact code we use, I will outline the steps we might take to populate a cloud hosted data warehouse platform — in fact the exact step syntax will vary with database vendor and connection security methods.

  1. Unload the results of the source query to a file. It is most likely that we would use a character separated format such as CSV, but we could use some schema encoded format like JSON or XML. However, for most use cases simplicity is the best starting point, so go with CSV.
  2. Compress the output file: there are two reasons for this — smaller files are quicker to upload; and some cloud vendors charge by the byte transferred or stored.
  3. Upload the file. This is one of the highly platform-specific calls. Loading an S3 bucket using a Java API is not the same as FTP, nor is it the same as a native Snowflake “put” command, nor is it the same as… well, you get the picture.
  4. Load the target table from the uploaded file. We may not need to uncompress the file first since many vendors (including Snowflake) support unzipping on the fly during table load; the sketch after this list shows what steps 3 and 4 can look like for Snowflake.
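To make steps 3 and 4 concrete, here is a rough Snowflake-flavoured sketch, run through SnowSQL or a driver that supports PUT; the file path, stage and table names are illustrative, not taken from our actual KM:

PUT file:///tmp/customer_extract.csv.gz @%CUSTOMER_STG;   -- upload the gzipped extract to the table's internal stage

COPY INTO CUSTOMER_STG                                    -- load the table, decompressing the gzip on the fly
FROM @%CUSTOMER_STG/customer_extract.csv.gz
FILE_FORMAT = (TYPE = CSV FIELD_OPTIONALLY_ENCLOSED_BY = '"' SKIP_HEADER = 1);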

We do have a couple of Red Pill Analytics “secret sauce” steps in our IKM (the Jython task shown in the editor screenshot), but the four steps above are basically it!

 

The ODI 12c KM task editor

The first step in any KM development is to look at the available KMs to see if any of them would be a useful base to work from; for example, a generic SQL-to-file IKM may already do some of the actions we need for a file-based load. If we find a suitable base KM we can clone it and start work on editing the copy. We look at the steps we have to undertake and decide how much of the current code we can reuse, and how much is net-new. This is potentially the hard part, as we have to weigh up technology choices for the various methods available. With the load to Snowflake I had choices such as:

  • do I upload the file through an S3 Java API in a Jython step in my KM?
  • do I raise an OS command to upload a file to S3?
  • do I raise an OS command to natively load Snowflake?
  • do I use a web service?

Each option will have its own positive and negative points, for example: can I put any required JAR files on my ODI agent machine? How do I keep passwords secure? Do I have the coding skills in Java / Jython / Groovy to code what I want, or am I better off with shell script? Answer these questions and you are on your way to creating your first ODI KMs.

 

My Favourite Things

Apart from the obligatory “raindrops on roses and whiskers on kittens” I must also add analytic functions to my list, and one function in particular, FIRST_VALUE (oh, and its almost identical twin LAST_VALUE.)

FIRST_VALUE returns the first value encountered for a selection window; obviously, we specify the column to apply the function to, and as this is an analytic function we have that OVER part at the end where we supply an optional PARTITION BY clause. What is not optional is the ORDER BY clause; clearly, it makes no sense to look for a first value if the order in which we see those values is not controlled, and this means we must window too! Getting the first value or last value in a select is perhaps less useful if we have NULL values in that column, but this is just where the IGNORE NULLS option comes in handy: we can find the first non-NULL value encountered.

Let’s consider this data set:

Screenshot: the FAV_THINGS sample data

I have a couple of null values in “Your” things.

select first_value(THING)
         over (partition by WHOSE
               order by OTHER_THING
               rows between unbounded PRECEDING and unbounded FOLLOWING) AS our_fav
from FAV_THINGS

gives us

Screenshot: query results, NULLs included

select first_value(THING IGNORE NULLS)
         over (partition by WHOSE
               order by OTHER_THING
               rows between unbounded PRECEDING and unbounded FOLLOWING) our_fav
from FAV_THINGS

Screenshot: query results with IGNORE NULLS

Ignore Nulls just skips those pesky null values, but of course if every row in the window is a NULL then we will get a null back.

The first time I used this kind of functionality was a few years back when Stewart Bryson and I were involved in an Exadata V1 data warehouse. The customer had a requirement to use log-based change data capture, but was unable to set the CDC process to use supplemental logging. With log-based CDC we get a time marker (an SCN plus an order number to handle multiple changes for a single SCN value), a single row representing each insert or delete, and TWO rows for each update (a before and an after state). With supplemental logging we get all of the columns populated; without it we only get the columns that change value and NULLs in the columns unaffected. Although far from ideal, we can use the captured data (in our case a subscriber view over it) to synthesise the “missing” data in the capture. This is where LAST_VALUE comes into play.

Our customer was only interested in the final outcome of a change: a ‘noisy’ source application meant a row might change many times over a few seconds as each individual column was updated, but only the final state was of interest in the data warehouse. In outline, we built a data set from the UNION ALL of the CDC subscriber view and the current state of our target table (not forgetting to filter the target table select to only include the PK values found in the CDC subscriber view being processed); the only complication is to remember there are TWO kinds of NULLs in a CDC view: data that becomes NULL and data that remains unchanged on UPDATE. The resulting query (which we encapsulated in a view) was quite complex: LAG and LEAD analytics and CASE statements to handle individual changes, with an ultimate LAST_VALUE to build the final state of the row to merge into the target table.
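Stripped of the CASE logic that told those two kinds of NULL apart, the heart of the idea looks something like this (CDC_CUSTOMER, COMMIT_SCN and CHANGE_SEQ are illustrative names, not the real subscriber view):

-- Carry the most recent non-NULL value of each column forward for every key;
-- the full window frame is needed because the default frame stops at the current row
select distinct
       CUSTOMER_ID,
       last_value(CUSTOMER_NAME IGNORE NULLS)
         over (partition by CUSTOMER_ID
               order by COMMIT_SCN, CHANGE_SEQ
               rows between unbounded PRECEDING and unbounded FOLLOWING) as CUSTOMER_NAME,
       last_value(CREDIT_LIMIT IGNORE NULLS)
         over (partition by CUSTOMER_ID
               order by COMMIT_SCN, CHANGE_SEQ
               rows between unbounded PRECEDING and unbounded FOLLOWING) as CREDIT_LIMIT
from   CDC_CUSTOMER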

Recently, I revisited FIRST_VALUE for a piece of ETL logic to supply ‘default’ values to some columns based on the content of other columns. A few years back I would fight hard to keep data cleanse / data enrichment logic like this on the source system; today, I am less convinced that this is the right place to do things. We live in a world where source data systems are frequently outside our scope of influence – cloud-based packaged solutions and multiple data sources (web, cloud, on-premise, subscription) lead us to take the view that perhaps the best place to do things is where our data first lands on a system we control. I am working up a conference talk on this topic, so more on that later.

Simplistically, we needed to store some ‘rules’ in a table and then, if a rule is matched, apply the rule to the target column in the data. Suppose I run a grocery and the things I sell have item codes; what if I made a rule that said anything containing “OJ” was going to be of type “Orange Juice” and category “Fruit Juice”, and likewise if it contains “GJ” then it is “Grape Juice” and again “Fruit Juice”? It can be seen that we have two rules that supply the same value of “Fruit Juice”, so we need a solution that can take an arbitrary number of rules mapping to the same target value. One way of doing this is to use a regular expression as the join condition (REGEXP_LIKE is suitable) and to build a single expression separated by the ‘|’ symbol; remember, in Oracle regular expressions we use a pipe as an OR operator between expressions. This is a simple use of LISTAGG to build a ‘|’ separated list we will then use in our REGEXP_LIKE match; we also get the benefit of optionally ignoring case.
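A rough sketch of that LISTAGG / REGEXP_LIKE combination, using hypothetical RULES (RULE_MATCH, CATEGORY) and PRODUCTS (ITEM_CODE) tables:

-- Collapse all match strings for a category into one '|' separated pattern,
-- then join on a case-insensitive REGEXP_LIKE ('i' match parameter)
select p.ITEM_CODE, r.CATEGORY
from   PRODUCTS p
       join (select CATEGORY,
                    listagg(RULE_MATCH, '|') within group (order by RULE_MATCH) as MATCH_PATTERN
             from   RULES
             group by CATEGORY) r
       on regexp_like(p.ITEM_CODE, r.MATCH_PATTERN, 'i')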

Now suppose I wrote a rule “PRODUCT_NAME contains ‘JUIC’” – that will do nicely for telling me that “Orange Juice” is a fruit juice. However, what about Carrot Juice, or even Juicy Fruit™? One option is to give some form of rule “priority” so that highly specific rules take priority over generic ones, and take the result of the most important matching rule; ROW_NUMBER does that nicely.

Did you spot that little ‘gotcha’ in the bit with item codes? We have one item code match string supplying two target columns. This is where we need to pivot our target values around (use a CASE statement, as the Oracle PIVOT command only works on aggregates) to give a sparse set of data (lots of NULLs) and then use FIRST_VALUE IGNORE NULLS to generate our transformed column values. It is now that we can reduce our data set to just one row with the ROW_NUMBER()=1 filter. Each row in our output gives the best-fit rules for each target column.
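Pulling those pieces together, a compressed sketch of the pattern might look like the following. Here I assume the rules table has been reshaped into RULES (RULE_MATCH, RULE_PRIORITY, TARGET_COLUMN, TARGET_VALUE), with a lower RULE_PRIORITY meaning a more specific rule; all names are illustrative:

-- Each matching rule contributes one sparse row per item; the CASE expressions pivot
-- TARGET_VALUE into its own column, FIRST_VALUE IGNORE NULLS picks the best-priority
-- value for each column, and the ROW_NUMBER()=1 filter collapses to one row per item
select ITEM_CODE, ITEM_TYPE, CATEGORY
from (
  select p.ITEM_CODE,
         first_value(case when r.TARGET_COLUMN = 'ITEM_TYPE' then r.TARGET_VALUE end IGNORE NULLS)
           over (partition by p.ITEM_CODE order by r.RULE_PRIORITY
                 rows between unbounded PRECEDING and unbounded FOLLOWING) as ITEM_TYPE,
         first_value(case when r.TARGET_COLUMN = 'CATEGORY' then r.TARGET_VALUE end IGNORE NULLS)
           over (partition by p.ITEM_CODE order by r.RULE_PRIORITY
                 rows between unbounded PRECEDING and unbounded FOLLOWING) as CATEGORY,
         row_number() over (partition by p.ITEM_CODE order by r.RULE_PRIORITY) as RN
  from   PRODUCTS p
         join RULES r on regexp_like(p.ITEM_CODE, r.RULE_MATCH, 'i')
  )
where RN = 1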

Mind the Gap – Getting the number on ODI sequences

At work last week I lifted the covers on the Red Pill Analytics Oracle Slowly Changing Dimension ODI IKM. If you haven’t heard about this knowledge module, it was specifically written to take a dimension source table and transform it to a slowly changing dimension target table. Unlike many other IKMs, this module uses a mix of analytic functions and set operations to handle record closure and surrogate key management on SCD2 records efficiently. As an added bonus it works with mixed SCD2/SCD1 attributes.

Currently, I am enhancing the KM to build in flow control handling. We left that part out earlier as it is perhaps a less common use case. If you are following an Oracle Reference Architecture approach to building your BI platform you have probably already done the data validity checking at an earlier stage than the “publish to dimension table” step.

Whilst implementing the new code and testing it I was reminded about some of the quirks of using sequences in ODI.

ODI supports many kinds of sequence: Standard, Specific, Native. However, coming from a database background, I prefer to use native if the underlying technology supports it. Not all databases support sequences, but where they do they will be the most efficient way to generate unique key values.

Adding a sequence to a mapping is a simple matter of editing the expression on the column properties, either directly on the target table or on any component that exposes the column. Using the RPA Oracle SCD IKM as an example this is often the target column as our dimension load is very simple indeed.

Screenshot: sequence expression on the target column properties

Sequences declared in ODI can be selected directly from the ODI Objects selector of the Expression Editor.

Screenshot: selecting a sequence in the Expression Editor

By default our sequence becomes :[project name].[sequence name].NEXTVAL (I am assuming you don’t really want to use CURRVAL); that is an ODI ‘bind’ object. The manual also gives us a second syntax for a sequence expression. This syntax replaces the leading colon with a #. The two syntaxes give different results at execution time; it is down to how the ODI agent executing the mapping interprets the expression.

:[project name].[sequence name].NEXTVAL equates to for ‘each target row get the next sequence value’

#[project name].[sequence name].NEXTVAL equates to for ’this execution get the next sequence value then apply it to all target rows’

This difference between ‘:’ and ‘#’ also explains why using the ODI bind (:) is slower when using variables in a mapping.

Rail travellers in the UK and particularly on the London Underground will have heard the expression ‘MIND THE GAP’. Gaps need minding when the distance from the train to the platform in stations is larger than expected and there is a danger of falling.

By Robert S Donovan

(Image by Robert S Donovan under a CC license)

So what have gaps to do with sequences? In fact, having a gap in a sequence is a way of life. In Oracle databases we can get gaps when transactions roll back (there is no mechanism to return the number to the sequence pool); gaps when we start the database, as the sequence is often set to ‘cache’ a block of numbers to speed sequence access; and gaps on RAC, where each node may have its own cache of numbers.

When testing my SCD module I noticed some large gaps between key values when new records were processed and certainly not associated with RAC, DB restarts nor rollback. So why the gaps? Let’s look at this snippet of pseudo code used in the KM for surrogate key creation.

CASE WHEN NEW_KEY_NEEDED = TRUE THEN :DEMO.sq1.NEXTVAL ELSE CURRENT_SK END

What happens here is that the ODI agent evaluates the sequence expression for each row processed, whether the value is used or not. This will leave gaps, and in the case of a dimension where few changes trigger new SCD rows per execution, gaps almost the same size as the number of members in the dimension. In a way this does not really matter, as the surrogate key is only a unique reference to the row and has no semantic meaning based on its value. However, big gaps may mean we hit the maximum value we set on the sequence sooner than we expect.

We can draw on the techniques we used in ODI 10 for using a database sequence and explicitly name the sequence in the column – so if our sequence, SCD_TEST, is in the ODI_DEMO_TRG schema we can set our expression to be ODI_DEMO_TRG.SCD_TEST.NEXTVAL, just as we would do in SQL. In reality I would use an ODI Ref Function to resolve the schema name. This gets rid of all the gaps generated by the execution of the KM.