Tunneling Through the Clouds

That title sounds a little weird: tunnels go through the ground, and clouds are those fluffy things that float in the sky. However, when we are talking about Oracle Cloud and SSH tunneling, things make a lot more sense. Although this blog is specific to the Oracle Cloud Database, tunneling is a generic technique and will work with other cloud providers and your own on-premises systems.

Before the how, a little of the why.

Cloud resources are plugged into the internet — that is, our connection to the cloud database is going to be to the public IP address of the compute node of the database. Exposing the DB listener to the whole internet is probably a bad idea; even with strong passwords in the database, an exposed listener is a sort of ‘have a go at hacking me’ flag. To minimise such risks the Oracle Cloud Compute Node security rules default to no access from the public internet to the listener. If our client has a known public IP address we could set up a rule to allow just that IP address to access the listener; however, this is probably not sustainable in the long term. Many Internet Service Providers (ISPs) present multiple pools of addresses, and this hopping about of a user’s IP address can cause a lot of problems: we either keep editing rules to accommodate the changing addresses or leave over-generous masks in place to allow whole blocks of addresses to pass. Add to this the reluctance of some corporate networks to allow outbound SQL*Net traffic to pass through their routers and we see that it is probably best not to access the Cloud database listener over the internet.

If only there was a way to provide a secure, encrypted connection over the internet to the database that does not use the listener. There is! Step up tunneling. Here we create a secure SSH connection to the Oracle Cloud Compute Node and then create a port redirect on our local computer so that any traffic to that port is translated to an IP address and port on the remote network.

Setting it up

Before we create an Oracle Cloud Database we have to generate a public / private key pair and upload the public key to be used as part of the creation process. The ‘master’ private key is precious as it allows you to create an SSH connection to the DB and perform highly privileged operations in the underlying operating system. Obviously, we DO NOT WANT TO USE THIS KEY for our secure tunnel to the database listener; instead we should create a new private / public key pair and a new Linux user specifically for port forwarding. In my examples I am using tunnel_user as the Linux user and cloud_host as the public IP address of the Oracle Cloud database.

Oracle has posted a tutorial on Oracle Cloud Linux user creation, we can adapt that to suit our needs. The main difference is that we are not setting up a new administrator user, so we can omit the final stages of the tutorial. In outline the steps are:

  • Create a new private / public key pair. My preference is to use ssh-keygen on my MacBook to make an RSA 2048-bit key pair, but other tools can do this. In this blog I named the key pair tunnel_user. On my MacBook the key pair is created in my user’s .ssh directory, or in conventional *nix-like syntax, ~/.ssh (by the way, the tilde sign means ‘home of’: ~ = my home, ~oracle = the oracle user’s home)

  • Start a ssh terminal session to the Oracle Cloud Compute Node. Using the ‘master’ private key (the one paired with the public key used to create the database) connect as the opc user. 
    ssh -i ~/.ssh/MasterKey opc@cloud_host
  • Escalate privileges (sudo su)
  • Create a new Linux user
    useradd tunnel_user
    This user will not need a password as we will only be using the SSH key to access the Cloud Compute Node.
  • Create a .ssh directory in the new user’s home directory
    mkdir ~tunnel_user/.ssh
  • Copy the public key you created for this user into your paste buffer and add it to the authorized_keys file in the .ssh directory (the Oracle tutorial uses echo [my public key] > ~[tunnel user name]/.ssh/authorized_keys to do this, which might overwrite an existing file, so >> is probably safer than >). Depending on your sshd configuration you may also need to tighten permissions (chmod 700 on the .ssh directory and chmod 600 on authorized_keys) for the key to be accepted
  • Using your preferred text editor, add the new user to the AllowUsers line of /etc/ssh/sshd_config. Search for the line starting with AllowUsers and edit it, for example
    AllowUsers opc oracle becomes AllowUsers opc oracle tunnel_user
    Reload sshd afterwards (for example systemctl reload sshd or service sshd reload) so the change takes effect
  • Change the ownership of the key file to the new user
    chown -R tunnel_user:tunnel_user ~tunnel_user/.ssh
  • We have now finished with the Oracle Tutorial steps, so exit from root and exit from opc
  • Log in as the new user to verify that the new SSH connection works and then log out (this step is strictly not necessary, but it makes debugging simpler at this time)
    ssh -i ~/.ssh/tunnel_user tunnel_user@cloud_host
    exit
  • Log in as opc and sudo to root
  • Modify the new user to use the /sbin/nologin shell
    usermod -s /sbin/nologin tunnel_user
  • Exit from root and opc, then try connecting as tunnel_user through ssh
    ssh -i ~/.ssh/tunnel_user tunnel_user@cloud_host
    You should be politely refused.
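The client-side half of the steps above (key generation, plus what gets pasted to the server) can be sketched as a short terminal session. The empty passphrase is used here only to keep the example non-interactive; use a passphrase if your security policy allows it:

```shell
# Generate the tunnel-only RSA 2048-bit key pair (files named as in this post)
mkdir -p ~/.ssh
ssh-keygen -q -t rsa -b 2048 -N "" -f ~/.ssh/tunnel_user
# The public half is what goes into ~tunnel_user/.ssh/authorized_keys on the server
cat ~/.ssh/tunnel_user.pub
```

This leaves ~/.ssh/tunnel_user (private, stays on your machine) and ~/.ssh/tunnel_user.pub (public, copied to the compute node).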

The Linux compute node is now set up for tunneling with port forwarding but cannot be used to create an interactive Linux session.

As I mentioned earlier, port forwarding is where ALL the network traffic for a specific local port is redirected down the SSH tunnel to the remote server, where it is passed to an IP address and port visible to that server. The target database does not have to be on the machine we have tunneled to; it just needs to be network accessible from the remote host.

In my use case the database is running on the compute node and uses the default port (1521). The local port I redirect can be any unused user port. Some people use 1521 as the local port, but that is only suitable if you do not need to connect to any Oracle databases on port 1521 on your local network. Remember too, that if you tunnel to multiple destinations each tunnel will need its own local port number.

Creating the tunnel is relatively simple. If you are using Oracle SQL Developer 4.1 (or later) you can even do that in the GUI. As long as the connection is open in SQL Developer, the tunnel is also available to any of your applications that wish to connect to the database — this includes R Studio, SQLDesktopJ, the OBIEE Admin tool and ODI Studio, in fact any way we can connect to the database using an IP address and port number. As I often need to use SQL Developer for looking at the database, this is very convenient for me. You can find simple instructions to set this up here. I would make a couple of changes to that method. Firstly, do not use your database creation key (MasterKey in my example above); instead use the one created specifically for the tunnel_user. The second change is that we should manually specify the port number we want to use in the redirect; leaving it set as automatically assigned is fine if we are only going to use SQL Developer, but we do need to know the local port number if we are going to use the tunnel to connect to the database from other local clients.


If you want something a little more lightweight we can use a terminal session to host the tunnel — PuTTY works well if you are into Windows, but as an Apple fan-boy I would use Apple Terminal.

In PuTTY we create a new session to connect to the Oracle Cloud compute node, add our authentication details on the SSH > Auth tab, add the tunnel details on the SSH > Tunnels tab, and finally go back to the Session tab and save it. Open the session and leave it running. As always with PuTTY, you could instead use a command-line connection string; the syntax is very like that for ssh given below.

In a terminal application on Linux or a Mac things are even easier. We just invoke SSH to use our key file and create a tunnel using a command like:

ssh -i ~/.ssh/tunnel_user -L 1555:localhost:1521 tunnel_user@cloud_host -N

The -N at the end of the command is important as it tells ssh not to establish a command shell connection.

Remember, the tunnel is a connection to the remote server, not to the database; we will still need to use JDBC or whatever other protocol (OCI, ODBC etc.) to create a database session, so we still need a valid Oracle user name and password. The only things we change in our connection are the redirected local port and the address we set up in the tunnel command (in my example, local port 1555).
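For example, a client-side tnsnames.ora entry pointing at the tunnel might look like this (the service name is illustrative; the host is always the local end of the tunnel, never the cloud address):

```
DBTUNNEL =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = localhost)(PORT = 1555))
    (CONNECT_DATA = (SERVICE_NAME = PDB1.example.com))
  )
```

Any OCI-based client can then connect with the DBTUNNEL alias, while JDBC thin clients would use the equivalent localhost:1555 host and port directly in the URL.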

Happy tunneling folks


Using ODI with Git Versioning – Part 2: A Game of Tag

Last time I blogged about getting started with the new Git VCS functionality in Oracle Data Integrator (ODI) 12.2. Then, I promised to write a bit on integrating Git into the deployment process.

Firstly, a disclaimer. I used to work for an IT services company that also ran one of the largest ITIL training organizations in EMEA. In the traditions of “Eat your own dog food”, it was inevitable that the disciplines of release control and configuration management underpinned all we did. We also developed (in collaboration with other global IT companies) methodologies to bring the ‘software’ side of the house into a similar framework to the one used for ‘infrastructure’. Of course, this was not a simple drop-in of a process; developers have different needs from infrastructure deployment folk. From the deployment side of the house we need to document and create packages of artefacts to deploy (these may be scripts, binary files, ODI scenarios) and at the same time freeze a copy of the development source code so that developers and support staff have a code commit point to work from for subsequent development and bug-fix activities. Developers need more than that: they need to be able to save versions during the development process so that they protect the code they are developing from all kinds of mishap. Often the code that developers need to track in an SCM is way more than the deployed codebase, and the waters are muddied further when we consider commonplace processes such as continuous integration. Basically, it boils down to my expectations with Git-based development and deployment, and how those expectations are actually met in the SCM integration documented by Oracle when they introduced Subversion VCS to ODI.

Organizing your work

Although the Oracle ODI versioning documentation suggests that small teams can do their development in trunk, that is simply not the best way to approach this, for two reasons. Firstly, ODI’s Create New Version paradigm always saves and commits to the remote Git repository; it is far cleaner to develop in an isolated branch and have a Git release manager or deployment engineer manage the downstream SCM merge process. Secondly, if we are using any automation based on Git commit hooks (such as continuous integration processes using Jenkins) it is sensible that we do this from a “clean” codebase and use the commit messages from the merge as the process trigger.

The other organizational question is where does everything sit in the environment? To help explain this I have created this simplified schematic:


As you can see, there is a remote file system holding the whole Git repository. This remote Git repository may be at a hosting provider such as github.com, or it could be an in-house repository running on some git server. The developer clones a single branch from the remote repository to a local Git file system (usually) on the developer’s workstation. So far this is standard Git practice. ODI adds to this by including a database-hosted ODI repository. Optionally, the Git file systems can seed the ODI repository, and of course the ODI repository sends changes to Git (local and remote repositories) whenever we create a new version.

All sounds simple. The potential “gotcha” is that ODI only supports a single set of objects in the work and master repositories. As an example: I am developing some new mappings in my “SalesDevJan2017” branch and my colleague, Lisa, is working in her “CustomerDevMar2017” branch. Both of our branches contain some datastores and mappings that relate to customer; what if Lisa adds some new attributes to the customer table that won’t go live until the quarter after my changes? As soon as I version all changed objects there is a risk that my branch will contain code / objects that haven’t been finalized yet.

We have four basic options:

  1. Use ODI as the master codebase and Git as an out-of-line code store. This is pretty much what we did in ODI 12.1 (and earlier) days: create XML exports of the objects and store them in a file system. We get the benefit of version restore, but little else. I just don’t like this; ODI is never going to be the sole source of the artefacts we need to version in a project (think DDL, documentation, report definitions etc). Git should be our corporate repository of source, not a cluster of multi-master islands of code.
  2. Create a branch for each sprint (if we are agile) and have all developers work in the same branch. Of course we still have potential problems for work that takes longer than a single sprint.
  3. Create an ODI WORK repository for each developer and develop in individual branches. This gives great isolation for all the design components, but still requires shared components in the security and topology parts of the MASTER repository.
  4. Give each developer his or her own ODI repository database and a Git branch to work in. This gives us maximum isolation of development. Obviously, there may be a potential license cost for all of those repository databases however we can consider ways to mitigate this such as named user licensing on development / test platforms or by running multiple ODI repositories on the same platform, either as separate DB instances or as multiple schemas within one database.

The flow for option 4 could be:

  • Clone the master code line to create a new branch
  • Populate the ODI work and master repositories from the branch
  • Develop ODI code and version when necessary
  • Version the ODI repository
  • Merge the branch back in to the master
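Stripped of the ODI specifics, that flow is ordinary Git branching and merging. A self-contained sketch in a throwaway repository (all names are illustrative; git 2.28+ assumed for init -b):

```shell
set -e
# Throwaway repo standing in for the master code line
git init -q -b master demo && cd demo
git config user.email dev@example.com && git config user.name "Dev"
git commit -q --allow-empty -m "initial master codebase"
git checkout -q -b SalesDevJan2017               # isolated development branch
echo "<mapping/>" > sales_mapping.xml            # stand-in for ODI-generated source
git add sales_mapping.xml
git commit -q -m "JIRA-123: new sales mappings"  # an ODI 'create version' lands here
git checkout -q master
git merge -q --no-ff SalesDevJan2017 -m "Merge SalesDevJan2017"  # merge commit can trigger CI
git log --oneline
```

The --no-ff merge guarantees a merge commit even for a fast-forwardable branch, which gives commit-hook automation a single, unambiguous trigger point.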



Here my release management gene kicks in: I want to move source control away from code releases. True, there is always a need to tie code at a point in time to the executable version of released code, but we don’t really want to clutter our code repository with a large number of executables that are not required. We also need a way to handle the executables away from the code; there is no place for source code on any platform other than development (well, maybe hot-fix, but that is really just a special case of a Git branch). I also want a way to back out a release if things go wrong.

In ODI terms, I will need to deploy the scenarios and load plans generated from ODI. I also need to deploy any non-ODI configuration scripting; this will include shell scripts, DDL files, data seeding scripts etc. The non-ODI scripts are fairly straightforward: we just create additional directories in our branch to hold the scripts and manage them through a Git client or web interface. For the ODI materials we can use another ODI 12.2 feature, the deployment archive.

In simple terms, the deployment archive is a pair of zip files, one containing the XML source files, the other containing the executable scenarios and load plans. It is this executable archive that we need to deploy to our testing and production environments. The source zip should not contain anything we do not already have in SCM, but it is nice to have a single file that ties to the released executables. In fact the deployment archive is a tad more sophisticated than this: we create ‘Initial’ deployments (all content) or ‘patch’ deployments (just the changes to apply), where we can also generate ‘rollback’ archives to allow us to revert patches.

The deployment archive functionality is linked to the concept of ‘tags’ in SCM. Git tags are not the same as Subversion labels, but for our ODI deployments this is not too relevant. We create a tag in Git from ODI Studio and then use this tag to identify all of the objects to include in the deployment archive. ODI has a concept of a partial tag (not, of course, a real Git term). Think of a partial tag as a user-generated list of objects; the editor for a partial tag is quite similar to the one for smart exports. However, I think it is better to use a full tag, that is, all the tracked objects.


Once our tag is in Git we can use Create Deployment Archive from VCS Label to generate the archives. This gives us a set of zip files we can add to Git and then deploy. Better yet, we can use the ODI scripting SDK to develop an automated process to roll out code to test, UAT and production as part of a CI workflow.



Using ODI with Git Versioning – Part 1: The Basics

Last time I went through some of the basics about getting the latest patch-set of ODI to connect to a GitHub.com VCS repository. Today I will talk a little about versioning our ODI development in Git.


The first point to make is that we can select the objects we wish to version-track in Git; it need not be the whole repository. We could just track an individual mapping, but most likely we would be tracking one or more ODI repository projects or project folders. I don’t know about you, but my preferred development style is to use a single project containing multiple folders to organize my codebase; I also create a ‘development playground’ folder where I can try out ideas. Whether we source-track the playground folder is debatable. It is a code asset and should be tracked; however, it may not sit well with some aspects of ODI Deployment Archive creation from Git sources. I will write more on deployments next time.

The easiest way of adding ODI content to Git VCS is by right-clicking on the object in the ODI designer navigator and then Version/Add to VCS.


This leads to the Add Objects To Git dialog box. As I mentioned in the previous blog, ODI uses a commit and push model; it is mandatory to provide a commit comment every time you version an object in ODI Git, and this dialog box is no exception. As you can see, versioning at the project (or folder) level allows us to include child objects; this can save us work and, more importantly, add multiple items in a single commit.


Include Dependencies will ensure that dependent ODI objects are included in the files pushed to Git, the Regenerate and Version Scenarios option is self-explanatory. In the ODI Studio Navigator Git versioned ODI objects have a small green ‘g’ on the object icon.

Having pushed to Git we can see what is in there! Looking in my local file system’s Git repository clone I see that ODI has generated two folder trees, one for the Master Repository and the other for the Work Repository. The structures of the two repository trees follow the same basic outline as you would see on the designer and topology navigators in ODI Studio.



In my screenshot I have also included a couple of other objects that are not managed by ODI (a file and a folder); most organizations would also keep non-ODI items in the same Git repository, for example DDL scripts, data models, shell scripts, deployment notes, in fact just about any resource that needs versioning — being able to do this is an important plus.

Now that we have ODI source code in Git we can get on with developing and versioning as we go. As previously described, I am not in favor of automatically creating a new version on each save, rather I prefer developer choice regarding the right time to commit and push.

We just right click on the object we want to version in the source navigator to access the Version context menu and then select Create VCS Version…


As always, we will need to provide a commit comment for the save. Note too, that ODI is smart enough not to commit to Git if there have been no changes.


Note the Version menu is context specific; if we right click on a container item such as Folder or Model we get an additional option to Add Non-versioned Objects to VCS… This does exactly what we expect: it brings up a list of objects in that container that are not currently under version control and allows us to select the ones we need to add to Git.

Other options on any VCS tracked object are Version History, Version Tree, Restore From VCS.


From the version history we can restore a previous version from VCS. As a developer, I find it difficult to remember which commit date is the version that I need to restore. Instead, I prefer to use meaningful commit comments such as a Jira ticket number or release number. Of course, with Git we could use these commit comments as part of a workflow triggering other updates to our systems. The version history screen also lets us access a useful 3-way version compare window. This allows us to compare the Git baseline with the selected version and the current state of the ODI repository, or to compare two selected versions against the baseline. In my example below, I added the filter FILTER1 to the baseline mapping and saved to Git; next I renamed the filter to Low_numbers and saved it to the ODI repository but did not push to Git.



Clicking on the ‘Select Version’ button brings up a version history window shown on the right of the screenshot above. Again, I find it easier to use the commit comment rather than the date column to identify versions, especially if we create multiple versions per day. The final option is to display a Version Tree.

Version Tree

A version tree is a graphical representation of the Git commits for an object, again with access to the version compare feature. The tree is color-coded with the initial create in green, changes in black and deletes in red. For my example mapping I tested the ability to recover a deleted version and hence I get a red spot representing the delete and a yellow one for the restore version step. Hover the mouse over a dot and a pop-up window shows additional detail on the version. Again, we get direct access to the 3-way compare feature by clicking on the icon in the screen menu bar.

Team Menu

The other place where the developer interacts with the ODI-Git versioning system is the Team menu -> VCS Actions. I’ll cover most of these options in part 2 of this blog, but for now here are a few developer-useful features of VCS Actions.


Pending Changes identifies all of the Git tracked objects that have been saved to the ODI metadata repository but have not yet been pushed to Git.

Pending Changes

We can, optionally, push one or more of the changes to Git from the Pending Changes viewer, or select a single ODI object and restore it from Git. In fact, I prefer creating new versions from the Pending Changes screen; it is a clear interface and allows us to ensure that all of the change is committed in a single operation.

The Add Non-versioned Objects option from the VCS Actions menu is not quite the same as the one reached by right clicking on a folder or project in the Navigator. On the navigator we can add all the child objects and their dependencies; in Team we have the ability to select anything in the repositories. If you are working with fine-grain development of just a project or a couple of folders, this may be too big a selection choice. For most of my initial use cases I would only really want to version the stuff created or customized in the development process.

One final thing I’ll mention now is the ability to bring back deleted objects, yes, sometimes developers need to do that.

In the final part of the new Git features in ODI blog I’ll talk about Tags, Deployments and starting afresh in a repository.

Source, The No Longer Secret Sauce

Working as an IT Professional I need to adhere to standards; working in consulting these standards may vary between customer organizations. Some places are Agile shops, some are classic waterfall, others use whatever flavor of methodology that suits them best. Two things are a given though: We use source control to protect and version the code assets we have spent a major part of our project time creating and we use some form of deployment methodology that ensures we can roll out software releases in a controlled way and if need be revert to a previous stable version.

Historically, interfacing GUI-based data integration tools to industry-standard version control software (VCS) has been less than easy. For example, to store ODI mappings in Git I would need to write some scripting to export the design as an XML file to a file system containing my Git branch, then execute some git commands to commit my change and possibly push it to the remote Git repository. True, ODI had a Create Version feature, however I never really loved it. This version is basically a ‘labeled’ copy of Design Repository objects held in the Master Repository. Two aspects of this feature were less than endearing.

  • The versions are only held in the ODI repository database, code security is through database backups and not whatever corporate standard VCS is in use.
  • There are no scripting APIs to automate this versioning process – it is all point and click, and that is often unacceptable in an enterprise IT situation.

ODI 12.2 introduced the use of Subversion as an integrated VCS tool within ODI Studio. This was a significant step forward; however, many IT organisations have adopted Git as the VCS of choice. The latest patch-set release of ODI 12.2 addresses this and gives us the ability to use Git as a VCS platform alongside ODI as a data integration tool.

Over the past few months I have taken an early look at the newly released Git functionality. One point to note is that the versioning process used in ODI Git integration is commit and push; this may not be the norm for developers used to committing their changes as they work and doing a single push as a milestone in the development process. Thought will need to be given to the most appropriate way to manage when and how to branch the master tree in the Git system; developing in master may not be the best choice.

When I use Git I am often using GitHub.com as a cloud-hosted Git platform. For security I use two-factor authentication (2FA) on my desktop client and for web access. If you have to use 2FA to access your Git system from ODI you need to generate an application token to use instead of your Git password in ODI. If you are using GitHub.com the token generator needs to know what level of access to Git you need, and for ODI this will be a level that allows you to push and commit. One watch point is that the generated token (a long hexadecimal string) is only visible at the time of creation; if you forget the string you need to create a new one, and you should, of course, revoke the old one to reduce the risk of unused tokens being at large.

Setting up Git access in ODI is done through the Team menu. First use ‘Switch Versioning Application’ to select Git versioning, then enter the ‘Settings’ sub-menus in order from the top (shown below).


First edit the connection. For GitHub.com and 2FA you must use SSL authentication to write to the remote repository; the other options are not going to work for you and will error either straight away or at the stage you try to create your local clone repository.


Next, create a local clone of the Git repository. The remote Git URL is already populated; we just need to give the local repository location, which will be created if necessary.


The last part is to select an existing branch to work with. The branch selector will list all of the branches currently in the remote Git repository. The New Branch button will create a new branch based on an existing branch.


One final point: avoid selecting the Auto Version option checkbox. Selecting this feature will commit and push a new version each time you hit save in the ODI repository; that’s almost certainly not a good choice to make.

Next time I will blog about adding objects to ODI version control and the workflow steps to follow in managing source control and creating deployment archives. To do that I will need to talk about how ODI versioning sets up its file systems.

OTN Appreciation Day – Looking for Change

As you may have seen in the Twittersphere, today is a special date in the Oracle community, a day where we can express thanks to OTN for the wealth of technical content that makes our lives easier.

For my tiny contribution to the day, here is a little note on detecting change using SQL MINUS; maybe not my absolute favourite feature, but that one has already been blogged about today!

In data warehouses we need to ensure that we refresh the content to keep it in line with the source systems. However, in the real world we find that only a small amount of data actually changes between loads; we may get a few new customers and a few others may update their details, but the bulk are the same from one day to the next. Obviously we don’t want to truncate and replace our data each load, nor do we want to update all of our records. We just need to process the new and the changed data.

For some people a change data capture approach, be it Oracle GoldenGate, old-style Streams-based CDC or even triggers, is the way to go. It is important to remember, though, that straight data replication is not what we want; we need to consume change in a way that will load our dimension tables with the deltas only, so some form of (in ODI speak) “journalization” is essential. For others, setting up CDC is not the right solution.

In this case we need to compare the data we are loading with the data we already hold and look for changes. Some developers do this by outer joining source to target and then comparing columns to find changed data, either through direct comparison of columns or by computing an MD5 hash of the concatenated columns. Either way, we need to do something about those pesky NULLs in our data, or we may get false matches or even miss changed rows.

A more NULL-tolerant approach is a set-based one. The classic is something like SELECT a,b,c FROM StagedData MINUS SELECT x,y,z FROM TargetData. Using SQL MINUS like this, our result set contains the rows that do not exist in the target plus the rows that differ between the two sets, stage and target.

Some people worry that the SQL MINUS is going to be a heavyweight query, but there is a simple thing we can do to lighten the load. Instead of a straight

SELECT * FROM STAGE_TABLE  -- a 20 row table
MINUS
SELECT * FROM TARGET_TABLE -- a 50 million row table

we pre-filter the target table to only include rows with the stage table’s business keys; an EXISTS or an inner join would do nicely here.

A simple but effective way of reducing the amount of data being processed in a set-based query.
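Sketched out (the table and column names here are illustrative, not from a real system), the pre-filtered MINUS looks like this:

```sql
-- Only compare against target rows whose business key appears in the stage table
SELECT customer_id, name, email
  FROM stage_customers
MINUS
SELECT customer_id, name, email
  FROM target_customers t
 WHERE EXISTS (SELECT 1
                 FROM stage_customers s
                WHERE s.customer_id = t.customer_id);
```

The result is exactly the new and changed stage rows; the EXISTS pre-filter simply stops the MINUS from having to scan millions of target rows that could never match.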

Modules of Knowledge

One of the features of Oracle Data Integrator that sets it apart from other data integration tools is the use of Knowledge Modules to define how parts of the data integration process occur. The developer creates a data load mapping from sources, joins, filters and targets (and in ODI 12 even more components are available). However, it is the knowledge modules that have responsibility for describing the code that will be generated at execution time. The same logical design could be coded as a bulk truncate-insert with one set of KMs and as an incremental update using another.

True, there are many out-of-the-box KMs, or KMs available for download from the ODI Exchange, but at some point most developers will wish to customize an existing KM or create a new piece of functionality for their ODI processes. Typically we work with Integration KMs (IKM) and Load KMs (LKM) as these do much of the ODI heavy lifting.

Writing a KM sounds complex but the process is not too daunting. Under the covers there is a lot of similarity between writing an ODI procedure and a KM. Both can have multiple steps, both can use a variety of technologies, both are able to execute code steps on the “source” and / or “target”; the step editors are very similar. The key difference is that we write a procedure to do some “one-off” piece of functionality such as update a control table or SFTP some files to a remote platform; the KM however provides a more generic process that can use ODI metadata to describe the data structures being manipulated.
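To make that difference concrete, here is a minimal, illustrative pair of IKM-style tasks (modelled on the shipped “SQL Control Append” pattern, not the exact shipped code): each task is a SQL template, and the odiRef substitution API fills in the real table and column names from the mapping’s metadata at code-generation time.

```sql
-- Task 1 (illustrative): empty the target table
TRUNCATE TABLE <%=odiRef.getTable("L", "TARG_NAME", "A")%>

-- Task 2 (illustrative): load the target from the mapping's sources and filters
INSERT INTO <%=odiRef.getTable("L", "TARG_NAME", "A")%>
  (<%=odiRef.getColList("", "[COL_NAME]", ", ", "", "")%>)
SELECT <%=odiRef.getColList("", "[EXPRESSION]", ", ", "", "")%>
  FROM <%=odiRef.getFrom()%>
 WHERE (1=1) <%=odiRef.getFilter()%>
```

Swap the truncate-insert steps for a merge template and the same logical mapping becomes an incremental update, which is exactly the KM flexibility described above.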

Here at Red Pill Analytics we have already produced some IKMs for our own set-based + analytic function approach to loading and maintaining Slowly Changing Dimensions. We have created custom versions of this KM to support Oracle 11g (and higher) and the latest MS SQL Server versions (some SQL syntax, keywords and functions differ between MSSQL and Oracle hence the need for two versions). Similarly, we have created a bulk incremental update IKM for MSSQL, using the “EXCEPT” set operator (EXCEPT is MSSQL’s equivalent of Oracle’s MINUS) to identify changes between the work table and the target. More recently we have created custom KMs for cloud-hosted databases, the latest being the one we wrote for the Snowflake Elastic Data Warehouse service.

Our first Snowflake KM was a special RKM to reverse engineer an existing Snowflake database into the ODI data stores and models. This KM utilizes Snowflake’s information schema to retrieve table design. Our second Snowflake KM was an IKM to bulk load a Snowflake target table from a SQL source.

Although it is quite possible to load most cloud databases using standard SQL insert commands executed through a JDBC driver, it is preferable to adopt a different approach for moving large numbers of rows to cloud hosted platforms such as Amazon Redshift or Snowflake. For this type of use we should extract our data to file and then use platform specific file load commands to populate our target table. This approach is equally valid for bulk loading relational databases such as Oracle and is especially recommended for Exadata.

Rather than give the exact code we use, I will outline the steps we might take to populate a cloud hosted data warehouse platform — in fact the exact step syntax will vary with database vendor and connection security methods.

  1. Unload the results of the source query to a file. It is most likely that we would use a character separated format such as CSV, but we could use some schema encoded format like JSON or XML. However, for most use cases simplicity is the best starting point, so go with CSV.
  2. Compress the output file: there are two reasons for this — smaller files are quicker to upload; and some cloud vendors charge by the byte transferred or stored.
  3. Upload the file. This is one of the highly platform-specific calls. Loading an S3 bucket using a Java API is not the same as FTP, nor the same as a native Snowflake “put” command, nor the same as… well, you get the picture.
  4. Load the target table from the uploaded file. We may not need to uncompress the file first since many vendors (including Snowflake) support unzipping on the fly during table load.
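On Snowflake, steps 2 to 4 map onto native commands; a hedged sketch (the stage, file name and file-format options are illustrative, and the unload in step 1 is tool-specific):

```sql
-- Steps 2 and 3: upload the unloaded CSV to the table's stage;
-- AUTO_COMPRESS gzips the file on the way up
PUT file:///tmp/stage_data.csv @%target_table AUTO_COMPRESS = TRUE;

-- Step 4: load the target; Snowflake decompresses the gzip on the fly
COPY INTO target_table
  FROM @%target_table/stage_data.csv.gz
  FILE_FORMAT = (TYPE = CSV FIELD_OPTIONALLY_ENCLOSED_BY = '"');
```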

We do have a couple of Red Pill Analytics “secret sauce” steps in our IKM (the Jython task shown in the editor screenshot), but the four steps above are basically it!


The ODI 12c KM task editor

The first step in any KM development is to look at the available KMs to see if any of them would be a useful base to work from; for example a generic SQL to file IKM may already do some of the actions we need for a file based load. If we find a suitable base KM we can clone it and start work on editing the copy. We look at the steps we have to undertake and decide how much of the current code we can reuse, and how much is net-new. This is potentially the hard part as we have to weigh up technology choices for the various methods available. With the load to Snowflake I had choices such as:

  • do I upload the file through an S3 Java API in a Jython step in my KM?
  • do I raise an OS command to upload a file to S3?
  • do I raise an OS command to natively load Snowflake?
  • do I use a web service?

Each option will have its own positive and negative points; for example, can I put any required .JAR files on my ODI agent machine? How do I keep passwords secure? Do I have the code skills in Java / Jython / Groovy to code what I want, or am I better off with shell script? Answer these questions and you are on your way to creating your first ODI KMs.


My Favourite Things

Apart from the obligatory “raindrops on roses and whiskers on kittens” I must also add analytic functions to my list, and one function in particular, FIRST_VALUE (oh, and its almost identical twin, LAST_VALUE).

FIRST_VALUE returns the first value encountered in a selection window; obviously, we specify the column to apply the function to, and as this is an analytic function we have the OVER clause at the end, where we supply an optional PARTITION BY. What is not optional is the ORDER BY clause; clearly, it makes no sense to look for a first value if the order in which we see those values is not controlled, and this means we must specify a windowing clause too. Getting the first or last value in a select is perhaps less useful if we have NULL values in that column, but this is just where the IGNORE NULLS option comes in handy: we can find the first non-NULL value encountered.

Let’s consider this data set:

(screenshot: the FAV_THINGS data set)

I have a couple of null values in “Your” things.

SELECT FIRST_VALUE(THING) OVER (PARTITION BY WHOSE ORDER BY OTHER_THING
         ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS OUR_FAV
  FROM FAV_THINGS;

gives us

(screenshot: query results)

SELECT FIRST_VALUE(THING IGNORE NULLS) OVER (PARTITION BY WHOSE ORDER BY OTHER_THING
         ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS OUR_FAV
  FROM FAV_THINGS;

(screenshot: query results with IGNORE NULLS)

IGNORE NULLS just skips those pesky NULL values, but of course if every row in the window is NULL then we will get NULL back.
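A self-contained version of the same idea, with made-up rows standing in for the FAV_THINGS table above:

```sql
-- Made-up data: the 'Yours' partition has a NULL in THING at position 1
WITH fav_things AS (
  SELECT 'Mine'  AS whose, 1 AS other_thing, 'Raindrops' AS thing FROM dual UNION ALL
  SELECT 'Mine',           2,                'Whiskers'           FROM dual UNION ALL
  SELECT 'Yours',          1,                NULL                 FROM dual UNION ALL
  SELECT 'Yours',          2,                'Kettles'            FROM dual
)
SELECT whose, other_thing, thing,
       FIRST_VALUE(thing) OVER (PARTITION BY whose ORDER BY other_thing
         ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS our_fav,
       FIRST_VALUE(thing IGNORE NULLS) OVER (PARTITION BY whose ORDER BY other_thing
         ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS our_fav_ignore_nulls
  FROM fav_things;
-- For the 'Yours' rows OUR_FAV is NULL (the first THING encountered is NULL),
-- while OUR_FAV_IGNORE_NULLS is 'Kettles'
```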

The first time I used this kind of functionality was a few years back when Stewart Bryson and I were involved in an Exadata V1 data warehouse. The customer had a requirement to use log-based change data capture, but was unable to set the CDC process to use supplemental logging. With log-based CDC each change is time-marked (an SCN plus an order number, to handle multiple changes at a single SCN value), and we get a single row representing each insert or delete and TWO rows for each update (a before and an after state). With supplemental logging we get all of the columns populated; without it we only get the columns that changed value, with NULLs in the unaffected columns. Although far from ideal, we can use the captured data (in our case a subscriber view over it) to synthesise the “missing” data in the capture. This is where LAST_VALUE comes into play.

Our customer was only interested in the final outcome of change – a ‘noisy’ source application meant a row might change many times over a few seconds as each individual column was updated, but only the final state was of interest in the data warehouse. In outline, we built a data set from the UNION ALL of the CDC subscriber view and the current state of our target table (not forgetting to filter the target table select to only include the PK values found in the CDC subscriber view being processed); the only complication is to remember that there are TWO kinds of NULL in a CDC view: data that becomes NULL and data that remains unchanged on UPDATE. The resulting query (which we encapsulated in a view) was quite complex: LAG and LEAD analytics and CASE statements to handle individual changes, with an ultimate LAST_VALUE to build the final state of the row to merge into the target table.
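In a much-simplified sketch (made-up table and columns, and assuming for simplicity that NULL always means “column unchanged” – the real solution also had to distinguish “became NULL” from “unchanged”), LAST_VALUE IGNORE NULLS rolls a set of sparse change rows up into a final row image:

```sql
-- Made-up CDC rows for one key, ordered by SCN
WITH cdc AS (
  SELECT 1 AS pk, 100 AS scn, 'Ann' AS cust_name, NULL AS city FROM dual UNION ALL
  SELECT 1,       110,        NULL,               'Leeds'      FROM dual UNION ALL
  SELECT 1,       120,        'Anne',             NULL         FROM dual
)
SELECT DISTINCT pk,
       LAST_VALUE(cust_name IGNORE NULLS) OVER (PARTITION BY pk ORDER BY scn
         ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS cust_name,
       LAST_VALUE(city IGNORE NULLS) OVER (PARTITION BY pk ORDER BY scn
         ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS city
  FROM cdc;
-- One row per key holding the final state: PK=1, CUST_NAME='Anne', CITY='Leeds'
```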

Recently, I revisited FIRST_VALUE for a piece of ETL logic to supply ‘default’ values to some columns based on the content of other columns. A few years back I would have fought hard to keep data cleanse / data enrichment logic like this on the source system; today, I am less convinced that this is the right place to do things. We live in a world where source systems are frequently outside our scope of influence – cloud-based packaged solutions and multiple data sources (web, cloud, on-premises, subscription) lead me to the view that perhaps the best place to do things is where our data first lands on a system we control. I am working up a conference talk on this topic, so more on that later.

Simplistically, we needed to store some ‘rules’ in a table and then, if a rule is matched, apply the rule to the target column in the data. Suppose I run a grocery and the things I sell have item codes. What if I made a rule that said anything containing “OJ” was going to be of type “Orange Juice” and category “Fruit Juice”, and likewise if it contains “GJ” then it is “Grape Juice” and again “Fruit Juice”? We have two rules that supply the same value of “Fruit Juice”, so we need a solution that can take an arbitrary number of rules mapping to the same target value. One way of doing this is to use a regular expression as the join condition (REGEXP_LIKE is suitable) and to build a single expression separated by the ‘|’ symbol; remember that in Oracle regular expressions the pipe acts as an OR operator between alternatives. A simple use of LISTAGG builds the ‘|’-separated list we then use in our REGEXP_LIKE match, and we also get the benefit of REGEXP_LIKE’s optional case-insensitive matching.
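A sketch of that LISTAGG step (the rules data here is made up):

```sql
-- Collapse all match strings for a target value into one regex alternation
WITH rules AS (
  SELECT 'OJ' AS match_string, 'Fruit Juice' AS category FROM dual UNION ALL
  SELECT 'GJ',                 'Fruit Juice'             FROM dual
)
SELECT category,
       LISTAGG(match_string, '|') WITHIN GROUP (ORDER BY match_string) AS pattern
  FROM rules
 GROUP BY category;
-- CATEGORY = 'Fruit Juice', PATTERN = 'GJ|OJ', later used in a join such as
-- REGEXP_LIKE(item_code, pattern, 'i')  -- 'i' gives the case-insensitive match
```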

Now suppose I wrote a rule: PRODUCT_NAME contains “JUIC”. That will do nicely for telling me that “Orange Juice” is a fruit juice, but what about Carrot Juice, or even Juicy Fruit™? One option is to give rules some form of “priority”, so that highly specific rules take precedence over generic ones and we take the result of the most important matching rule; ROW_NUMBER does that nicely.

Did you spot that little ‘gotcha’ in the bit with item codes? We have one item-code match string supplying two target columns. This is where we need to pivot our target values around (use a CASE statement, as the Oracle PIVOT command only works on aggregates) to give a sparse set of data (lots of NULLs), then use FIRST_VALUE IGNORE NULLS to generate our transformed column values. Now we can reduce our data set to just one row per item with a ROW_NUMBER() = 1 filter; each row in our output gives the best-fit rules for each target column.
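Pulling the pieces together, a hedged sketch (PRODUCTS and RULES are hypothetical tables, with RULES carrying a regex pattern, a priority, a target column name and a target value):

```sql
WITH matched AS (
  -- Pivot each rule's output into its target column with CASE; the other
  -- columns stay NULL, giving one sparse row per matching rule
  SELECT p.item_code,
         r.priority,
         CASE WHEN r.target_col = 'TYPE'     THEN r.target_value END AS item_type,
         CASE WHEN r.target_col = 'CATEGORY' THEN r.target_value END AS category
    FROM products p
    JOIN rules r ON REGEXP_LIKE(p.item_code, r.pattern, 'i')
)
SELECT item_code, item_type, category
  FROM (SELECT item_code,
               -- Fill the sparse columns from the highest-priority match
               FIRST_VALUE(item_type IGNORE NULLS) OVER (PARTITION BY item_code
                 ORDER BY priority ROWS BETWEEN UNBOUNDED PRECEDING
                 AND UNBOUNDED FOLLOWING) AS item_type,
               FIRST_VALUE(category IGNORE NULLS) OVER (PARTITION BY item_code
                 ORDER BY priority ROWS BETWEEN UNBOUNDED PRECEDING
                 AND UNBOUNDED FOLLOWING) AS category,
               ROW_NUMBER() OVER (PARTITION BY item_code ORDER BY priority) AS rn
          FROM matched)
 WHERE rn = 1;
```

Note the full window on each FIRST_VALUE: without UNBOUNDED FOLLOWING, the rn = 1 row would only see its own (partly NULL) values rather than those filled in by lower-priority rules.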