It’s ODC Appreciation Day

Last year for the OTN Appreciation Day I wrote about one of my favourite Oracle Analytic functions, LISTAGG. I don’t have a new favourite bit of SQL, so this time I’m writing about my love of SQL Developer.

If you are not using it, you are really missing out on a lot of free functionality. Just take a look at the things you can explore from the View and Tools menus.


As a sort of hybrid developer / DBA / data modeler I find SQL Developer to be my "Swiss Army knife" database tool. I use it in so many ways at so many stages of a project.

I could go on about the OLAP and Data Mining features, or how (subject to the correct DB license) it can be used to investigate and tune problem queries. However, today I am going to mention the DBA view. Quick confession time: I am a fake DBA. My interests are in sizing, performance, security and storage; I don't retain a detailed knowledge of the DBA views and the more obscure DDL syntax, but I do know where to look to find something I can adapt.


One of those places is the query result pane in SQL Developer. That may sound surprising, but by right-clicking on the query results we can elect to save the grid as a report.

Then simply copy the report SQL and adapt it to our requirements. For example, I needed to list the users and their quotas on a given tablespace. By finding the quota for one user I could adapt the SQL Developer generated SQL to my needs:

 -- Report SQL as generated by SQL Developer (reconstructed here; the
 -- quota data comes from the DBA_TS_QUOTAS dictionary view)
 SELECT tablespace_name AS "Tablespace",
        max_bytes AS "Quota (Bytes)"
 FROM   dba_ts_quotas
 WHERE  username = :name
 AND    max_blocks != 0;

 -- Adapted: quotas for all users on a given tablespace
 SELECT username,
        max_bytes AS "Quota (Bytes)"
 FROM   dba_ts_quotas
 WHERE  tablespace_name = 'USERS';
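Once you know the dictionary view involved, further variations are easy. As a sketch (using the standard DBA_TS_QUOTAS columns; note that MAX_BYTES = -1 denotes an unlimited quota), here is one that lists every user's quota per tablespace:

```sql
-- Sketch: all users and their quotas, decoding -1 to UNLIMITED
SELECT username,
       tablespace_name,
       DECODE(max_bytes, -1, 'UNLIMITED', TO_CHAR(max_bytes)) AS quota
FROM   dba_ts_quotas
ORDER  BY tablespace_name, username;
```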



Joining It All Up

Time for a quick tip

When I start a new ETL project I often find it useful to have some idea how the source tables relate to each other. If I'm lucky there is documentation to guide me, or even a diagram showing how the tables link up (I do dream a lot). Of course, I could pull the structure into SQL Developer Data Modeler and get this data from the database.

However, for quick show-and-tells the latest releases of SQL Developer give us another, simpler way. Step up to the plate, the Model tab on the table viewer. You may not have noticed it, but it is there.


After a few moments a model diagram will appear based on the table you are inspecting and showing all the tables related to it through foreign key constraints.


However, if there are no constraints this will not work. "No constraints!" I hear you gasp. This does happen, sadly all too often, with a whole raft of commercial applications that for some reason use the application to enforce integrity. Shame on all developers who forget that the word data is in database, data quality and data integrity.
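If you suspect you are dealing with one of those constraint-free schemas, a quick dictionary query will tell you before you waste time on the Model tab. This is a sketch against the standard USER_CONSTRAINTS view (constraint_type 'R' is a referential, i.e. foreign key, constraint):

```sql
-- Count foreign key constraints per table in the current schema;
-- no rows back means the Model tab will have nothing to draw
SELECT table_name, COUNT(*) AS fk_count
FROM   user_constraints
WHERE  constraint_type = 'R'
GROUP  BY table_name
ORDER  BY table_name;
```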

Two other factoids on this feature:

  • If you edit the table structure or constraints you will need to click the refresh icon in the top bar of the window to see the changes.
  • The icon to the right of the refresh button copies the model into SQL Developer Data Modeler so that you can then use all of its features (like that very useful print function).

My next blog is on constraints in the data warehouse. Until then, bye!


Tunneling Through the Clouds

That title sounds a little weird, tunnels go through the ground and clouds are those fluffy things that float in the sky. However when we are talking about Oracle Cloud and SSH tunneling things make a lot more sense. Although this blog is specific to the Oracle Cloud Database, tunneling is a generic technique and will work with other cloud providers and your own on-premises systems.

Before the how, a little of the why.

Cloud resources are plugged into the internet: our connection to the cloud database is going to be to the public IP address of the database's compute node. Exposing the DB listener to the whole internet is probably a bad idea; even with strong passwords in the database, an exposed listener is a sort of 'have a go at hacking me' flag. To minimise such risks, the Oracle Cloud compute node security rules default to no public internet access to the listener. If our client has a known public IP address we could set up a rule to allow just that address to reach the listener, however this is probably not sustainable in the long term. Many Internet Service Providers (ISPs) allocate addresses from multiple pools, and this hopping about of a user's IP address means we either keep editing rules to accommodate the changes or leave over-generous masks in place that let whole blocks of addresses through. Add to this the reluctance of some corporate networks to allow outbound SQL*Net traffic through their routers, and we see that it is probably best not to access the cloud database listener directly over the internet.

If only there was a way to provide a secure, encrypted connection over the internet to the database that does not use the listener directly. There is! Step up, tunneling. Here we create a secure SSH connection to the Oracle Cloud compute node and then create a port redirect on our local computer so that any traffic to that port is forwarded to an IP address and port on the remote network.

Setting it up

Before we create an Oracle Cloud database we have to generate a public/private key pair and upload the public key as part of the creation process. This 'master' private key is precious, as it allows you to create an SSH connection to the DB server and perform highly privileged operations in the underlying operating system. Obviously, we DO NOT WANT TO USE THIS KEY for our secure tunnel to the database listener; instead we should create a new private/public key pair and a new Linux user specifically for port forwarding. In my examples I am using tunnel_user as the Linux user and cloud_host as the public IP address of the Oracle Cloud database.

Oracle has posted a tutorial on Oracle Cloud Linux user creation, we can adapt that to suit our needs. The main difference is that we are not setting up a new administrator user, so we can omit the final stages of the tutorial. In outline the steps are:

  • Create a new private/public key pair. My preference is to use ssh-keygen on my MacBook to make an RSA 2048-bit key pair, but other tools can do this. In this blog I named the key pair tunnel_user. On my MacBook the key pair is created in my user's .ssh directory, or in conventional *nix syntax ~/.ssh (by the way, the tilde sign means 'home of': ~ = my home, ~oracle = the oracle user's home). For example:
    ssh-keygen -t rsa -b 2048 -f ~/.ssh/tunnel_user

  • Start an SSH terminal session to the Oracle Cloud compute node. Using the 'master' private key (the one paired with the public key used to create the database) connect as the opc user.
    ssh -i ~/.ssh/MasterKey opc@cloud_host
  • Escalate privileges (sudo su)
  • Create a new Linux user
    useradd tunnel_user
    This user will not need a password as we will only be using the SSH key to access the Cloud compute node
  • Create a .ssh directory in the new user’s home directory
    mkdir  ~tunnel_user/.ssh
  • Copy the public key you created for this user into your paste buffer and add it to the authorized_keys file in the .ssh directory (the Oracle tutorial uses echo [my public key] > ~[tunnel user name]/.ssh/authorized_keys to do this, which might overwrite an existing file, so maybe >> is better to use than >)
  • Using your preferred text editor, add the new user to the AllowUsers line of /etc/ssh/sshd_config. Search for the line starting with AllowUsers and edit it, for example
    AllowUsers opc oracle becomes AllowUsers opc oracle tunnel_user
  • Change the ownership of the key file to the new user
    chown -R tunnel_user:tunnel_user ~tunnel_user/.ssh
  • We have now finished with the Oracle Tutorial steps, so exit from root and exit from opc
  • Log in as the new user to verify that the new SSH connection works and then log out; this step is not strictly necessary but it makes debugging simpler at this stage
    ssh -i ~/.ssh/tunnel_user tunnel_user@cloud_host
    exit
  • Log back in as opc and sudo to root
  • Modify the new user to use the /sbin/nologin shell
    usermod -s /sbin/nologin tunnel_user
  • Exit from root and opc, then try connecting as tunnel_user over SSH
    ssh -i ~/.ssh/tunnel_user tunnel_user@cloud_host
    You should be politely refused.

The Linux compute node is now set up for tunnelling with port forwarding but cannot be used to create an interactive Linux session.
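Pulling the steps above together, here is a sketch of the server-side setup as one script. This is not from the original post and must be run as root on the compute node; the key file name, user name and sshd_config details follow my examples and may differ on your image:

```shell
# Sketch only -- run as root on the cloud compute node, adapt before use
useradd tunnel_user                                  # no password; key-only access
mkdir -p ~tunnel_user/.ssh
# append (not overwrite) the tunnel_user public key
cat tunnel_user.pub >> ~tunnel_user/.ssh/authorized_keys
chown -R tunnel_user:tunnel_user ~tunnel_user/.ssh
chmod 700 ~tunnel_user/.ssh && chmod 600 ~tunnel_user/.ssh/authorized_keys
# allow the user through sshd, then block interactive logins
sed -i 's/^AllowUsers.*/& tunnel_user/' /etc/ssh/sshd_config
usermod -s /sbin/nologin tunnel_user
```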

As I mentioned earlier, port forwarding is where all the network traffic for a specific port is redirected down the SSH tunnel to the remote server, where it is passed to an IP address and port visible to that server. The target database does not have to be on the machine we have tunnelled to; it just needs to be network accessible from the remote host.

In my use case the database is running on the compute node and uses the default port (1521), so my target database connection, as seen from the compute node, is localhost:1521. The local port I redirect can be any unused user port. Some people use 1521 as the local port, but that is only suitable if you do not need to connect to any Oracle databases on port 1521 on your local network. Remember too, that if you tunnel to multiple destinations each tunnel will need its own local port number.
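Since the choice of local port is ours, it is worth checking that nothing on the workstation is already listening there before bringing the tunnel up. A small illustrative check (my own Python sketch, not from the original post; any scripting language would do):

```python
import socket

def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if nothing is accepting connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(0.5)
        # connect_ex returns 0 on success, i.e. something is listening
        return s.connect_ex((host, port)) != 0

# e.g. if port_is_free(1555) is True, 1555 is safe to use as the -L local port
```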

Creating the tunnel is relatively simple. If we are using Oracle SQL Developer 4.1 (or later) we can even do it in the GUI. As long as the connection is open in SQL Developer, the tunnel is also available to any of your applications that wish to connect to the database (this includes R Studio, SQLDesktopJ, the OBIEE Admin tool and ODI Studio; in fact any client that can connect to the database using an IP address and port number). As I often need to use SQL Developer for looking at the database, this is very convenient for me. You can find simple instructions to set this up here. I would make a couple of changes to that method. Firstly, do not use your database creation key (MasterKey in my example above); instead use the one created specifically for the tunnel_user. Secondly, we should manually specify the port number we want to use in the redirect; leaving it set to automatically assigned is fine if we are only going to use SQL Developer, however we do need to know the local port number if we are going to use the tunnel to connect to the database from other local clients.


If you want something a little more lightweight we can use a terminal session to host the tunnel. PuTTY works well if you are into Windows, but as an Apple-fan-boy I would use Apple Terminal.

In PuTTY we create a new session to connect to the Oracle Cloud compute node, add our authentication details on the SSH > Auth tab, then the tunnel details on the SSH > Tunnels tab, and finally go back to the Session tab and save it. Open the session and leave it running. As always with PuTTY, you could instead use a command-line connection string; the syntax is very like that for ssh given below.

In a terminal application on Linux or a Mac things are even easier. We just invoke SSH to use our key file and create a tunnel using a command like:

ssh -i ~/.ssh/tunnel_user -L 1555:localhost:1521 tunnel_user@cloud_host -N

The -N at the end of the command is important as it tells ssh not to establish a command shell connection.

Remember, the tunnel is a connection to the remote server, not to the database; we will still need to use JDBC or whatever other protocol (OCI, ODBC etc.) to create a database session, so we still need a valid Oracle user name and password. The only things we change in our connection are the redirected local port and the address we set up in the tunnel command (in my examples, port 1555 on localhost).
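With the tunnel up, clients simply treat the database as if it were listening locally. For example, a standard JDBC thin URL or tnsnames-style descriptor might look like this (the service name MYPDB is a placeholder; substitute your own):

```text
jdbc:oracle:thin:@localhost:1555/MYPDB

(DESCRIPTION=
  (ADDRESS=(PROTOCOL=TCP)(HOST=localhost)(PORT=1555))
  (CONNECT_DATA=(SERVICE_NAME=MYPDB)))
```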

Happy tunneling folks


Using ODI with Git Versioning – Part 2: A Game of Tag

Last time I blogged about getting started with the new Git VCS functionality in Oracle Data Integrator (ODI) 12.2. Then, I promised to write a bit on integrating Git into the deployment process.

Firstly, a disclaimer. I used to work for an IT services company that also ran one of the largest ITIL training organizations in EMEA. In the traditions of "Eat your own dog food", it was inevitable that the disciplines of release control and configuration management underpinned all we did. We also developed (in collaboration with other global IT companies) methodologies to bring the 'software' side of the house into a similar framework to the one used for 'infrastructure'. Of course, this was not a simple drop-in of a process; developers have different needs from infrastructure deployment folk. From the deployment side of the house we need to document and create packages of artefacts to deploy (these may be scripts, binary files, ODI scenarios) and at the same time freeze a copy of the development source code so that developers and support staff have a code commit point to work from for subsequent development and bug-fix activities. Developers need more than that: they need to be able to save versions during the development process so that they protect the code they are developing from all kinds of mishap. Often the code that developers need to track in an SCM is way more than the deployed codebase, and the waters are muddied further when we consider commonplace processes such as continuous integration. Basically, it boils down to how my expectations of Git-based development and deployment are actually met in the SCM integration documented by Oracle when they introduced Subversion VCS to ODI.

Organizing your work

Although the Oracle ODI versioning documentation suggests that small teams can do their development in trunk, that is simply not the best way to approach this, for two reasons. Firstly, ODI’s Create New Version paradigm always saves and commits to the remote Git repository; it is far cleaner to develop in an isolated branch and have a Git release manager or deployment engineer managing the downstream SCM merge process. Secondly if we are using any automation based on Git commit hooks (such as continuous integration processes using Jenkins) it is sensible that we do this from a “clean” codebase and use the commit messages from the merge as the process trigger.

The other organizational question is where does everything sit in the environment? To help explain this I have created this simplified schematic:


As you can see, there is a remote file system holding the whole Git repository. This remote Git repository may be at a hosting provider such as GitHub, or it could be an in-house repository running on some Git server. The developer clones a single branch from the remote repository to a local Git file system, usually on the developer's workstation. So far this is standard Git practice. ODI adds to this by including a database-hosted ODI repository. Optionally, the Git file systems can seed the ODI repository, and of course the ODI repository sends changes to Git (local and remote repositories) whenever we create a new version.

All sounds simple. The potential “gotcha” is that ODI only supports a single set of objects in the work and master repositories. As an example: I am developing some new mappings in my “SalesDevJan2017” branch and my colleague, Lisa, is working in her “CustomerDevMar2017” branch. Both of our branches contain some datastores and mappings that relate to customer; what if Lisa adds some new attributes to the customer table that won’t go live until the quarter after my changes? As soon as I version all changed objects there is a risk that my branch will contain code / objects that haven’t been finalized yet.

We have four basic options:

  1. Use ODI as the master codebase and Git as an out-of-line code store. This is pretty much what we did in ODI 12.1 (and earlier) days; create XML exports of the objects and store in a file system. We get the benefit of version restore, but little else. I just don’t like this, ODI is never going to be the sole source of the artefacts we need to version in a project (think DDL, documentation, report definitions etc). Git should be our corporate repository of source, not a cluster of multi-master islands of code.
  2. Create a branch for each sprint (if we are agile) and have all developers work in the same branch. Of course we still have potential problems for work that takes longer than a single sprint.
  3. Create an ODI WORK repository for each developer and develop in individual branches. This gives great isolation for all the design components, but still requires shared components in the security, topology parts of the MASTER repository.
  4. Give each developer his or her own ODI repository database and a Git branch to work in. This gives us maximum isolation of development. Obviously, there may be a potential license cost for all of those repository databases however we can consider ways to mitigate this such as named user licensing on development / test platforms or by running multiple ODI repositories on the same platform, either as separate DB instances or as multiple schemas within one database.

The flow for option 4 could be:

  • Clone the master code line to create a new branch
  • Populate the ODI work and master repositories from the branch
  • Develop ODI code and version when necessary
  • Version the ODI repository
  • Merge the branch back into the master
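At the Git level (outside ODI), the same flow looks something like the sketch below. The repository is a throwaway local one standing in for the master code line, the branch name is the hypothetical one from my example, and the commits in the middle are what ODI's Create VCS Version performs for you:

```shell
set -e
repo=$(mktemp -d)                        # stand-in for the master code line
cd "$repo"
git init -q
git config user.email dev@example.com && git config user.name dev
trunk=$(git symbolic-ref --short HEAD)   # 'master' or 'main', depending on Git version
echo '<mapping/>' > sales_map.xml
git add . && git commit -qm "baseline"
git checkout -qb SalesDevJan2017         # clone/branch for the new work
echo '<mapping v="2"/>' > sales_map.xml  # ODI versioning commits land here
git commit -qam "ODI: sales mapping changes"
git checkout -q "$trunk"
git merge -q --no-ff -m "merge SalesDevJan2017" SalesDevJan2017   # merge back to master
```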



Here my release management gene kicks in: I want to move source control away from code releases. True, there is always a need to tie code at a point in time to the executable version of released code but we don’t really want to clutter our code repository with a large number of executables that are not required. We also need a way to handle the executables away from the code; there is no place for source code on any platform other than development (well maybe hot-fix, but that is really just a special case of a git branch). I also want a way to back-out a release if things go wrong.

In ODI terms, I will need to deploy the scenarios and load plans generated from ODI. I also need to deploy any non-ODI configuration scripting; this will include shell scripts, DDL files, data seeding scripts etc. The non-ODI scripts are fairly straightforward: we just create additional directories in our branch to hold the scripts and manage them through a Git client or web interface. For the ODI materials we can use another ODI 12.2 feature, the deployment archive.

In simple terms, the deployment archive is a pair of zip files, one containing the XML source files, the other containing the executable scenarios and load plans. It is this executable archive that we need to deploy to our testing and production environments. The source zip should not contain anything we do not already have in SCM, but it is nice to have a single file that ties to the released executables. In fact the deployment archive is a tad more sophisticated than this: we create 'initial' deployments (all content) or 'patch' deployments (just the changes to apply), where we can also generate 'rollback' archives to allow us to revert patches.

The deployment archive functionality is linked to the concept of 'tags' in SCM. Git tags are not the same as Subversion labels, but for our ODI deployments this is not too relevant. We create a tag in Git from ODI Studio and then use this tag to identify all of the objects to include in the deployment archive. ODI has a concept of a partial tag (not, of course, a real Git term). Think of a partial tag as a user-generated list of objects; the editor for a partial tag is quite similar to the one for smart exports. However, I think it is better to use a full tag, that is, all the tracked objects.


Once our tag is in Git we can use Create Deployment Archive from VCS Label to generate the archives. This gives us a set of zip files we can add to Git and then deploy. Better yet, we can use the ODI scripting SDK to develop an automated process to roll out code to test, UAT and production as part of a CI workflow.



Using ODI with Git Versioning – Part 1: The Basics

Last time I went through some of the basics about getting the latest patch-set of ODI to connect to a VCS repository. Today I will talk a little about versioning our ODI development in Git.


The first point to make is that we can select the objects we wish to version-track in Git, it need not be the whole repository; we could just track an individual mapping, but most likely we would be tracking one or more ODI repository projects or project folders. I don’t know about you, but my preferred development style is to use a single project containing multiple folders to organize my codebase; I also create a ‘development playground’ folder where I can try out ideas. Whether we source track the playground folder is debatable. It is a code asset and should be tracked, however it may not sit well with some aspects of ODI Deployment Archive creation from Git sources; I will write more on deployments next time.

The easiest way of adding ODI content to Git VCS is by right-clicking on the object in the ODI designer navigator and then Version/Add to VCS.


This leads to the Add Objects To Git dialog box. As I mentioned in the previous blog, ODI uses a commit and push model; it is mandatory to provide a commit comment every time you version an object in ODI Git, and this dialog box is no exception. As you can see, versioning at the project (or folder) level will allow us to include child objects; this can save us work and, more importantly, add multiple items in a single commit.


Include Dependencies will ensure that dependent ODI objects are included in the files pushed to Git, the Regenerate and Version Scenarios option is self-explanatory. In the ODI Studio Navigator Git versioned ODI objects have a small green ‘g’ on the object icon.

Having pushed to Git we can see what is in there! Looking in my local file system’s Git repository clone I see that ODI has generated two folder trees, one for the Master Repository and the other for the Work Repository. The structures of the two repository trees follow the same basic outline as you would see on the designer and topology navigators in ODI Studio.



In my screenshot I have also included a couple of other objects that are not managed by ODI (a file and a folder); most organizations would also keep non-ODI items in the same Git repository, for example DDL scripts, data models, shell scripts, deployment notes, in fact just about any resource that needs versioning — being able to do this is an important plus.

Now that we have ODI source code in Git we can get on with developing and versioning as we go. As previously described, I am not in favor of automatically creating a new version on each save, rather I prefer developer choice regarding the right time to commit and push.

We just right click on the object we want to version in the source navigator to access the Version context menu and then select Create VCS Version…


As always, we will need to provide a commit comment for the save. Note too, that ODI is smart enough not to commit to Git if there have been no changes.


Note the Version menu is context specific; if we right click on a container item such as Folder or Model we get an additional option to Add Non-versioned Objects to VCS… This does exactly what we expect: it brings up a list of objects in that container that are not currently under version control and allows us to select the ones we need to add to Git.

Other options on any VCS tracked object are Version History, Version Tree, Restore From VCS.


From the version history we can restore a previous version from VCS. As a developer, I find it difficult to remember which commit date is the version that I need to restore. Instead, I prefer to use meaningful commit comments such as a Jira ticket number or release number. Of course, with Git we could use these commit comments as part of a workflow triggering other updates to our systems. The version history screen also lets us access a useful 3-way version compare window. This allows us to compare the Git baseline with the selected version and the current state of the ODI repository, or to compare 2 selected versions against the baseline. In my example below, I added the filter FILTER1 to the baseline mapping and saved to Git; next I renamed the filter Low_numbers and saved it to the ODI repository but did not push to Git.



Clicking on the 'Select Version' button brings up a version history window shown on the right of the screenshot above. Again, I find it easier to use the commit comment rather than the date column to identify versions, especially if we create multiple versions per day.

The final option is to display a Version Tree.

Version Tree

A version tree is a graphical representation of the Git commits for an object, again with access to the version compare feature. The tree is color-coded with the initial create in green, changes in black and deletes in red. For my example mapping I tested the ability to recover a deleted version and hence I get a red spot representing the delete and a yellow one for the restore version step. Hover the mouse over a dot and a pop-up window shows additional detail on the version. Again, we get direct access to the 3-way compare feature by clicking on the icon in the screen menu bar.

Team Menu

The other place where the developer interacts with the ODI-Git versioning system is the Team menu -> VCS Actions. I'll cover most of these options in part 2 of this blog, but for now here are a few developer-useful features of VCS Actions.


Pending Changes identifies all of the Git tracked objects that have been saved to the ODI metadata repository but have not yet been pushed to Git.

Pending Changes

We can, optionally, push one or more of the changes to Git from the Pending Changes viewer or select a single ODI object and restore it from Git. In fact, I prefer making my new version generation from the pending changes screen — it is a clear interface and allows us to ensure that all of the change is committed in a single operation.

The Add Non-versioned Objects option from the VCS Actions menu is not quite the same as the one reached by right clicking on a folder or project in the Navigator. On the navigator we can add all the child objects and their dependencies; in Team we have the ability to select anything in the repositories. If you are working with fine-grain development of just a project or a couple of folders, this may be too big a selection choice. For most of my initial use cases I would only really want to version the stuff created or customized in the development process.

One final thing I’ll mention now is the ability to bring back deleted objects, yes, sometimes developers need to do that.

In the final part of the new Git features in ODI blog I’ll talk about Tags, Deployments and starting afresh in a repository.

Source, The No Longer Secret Sauce

Working as an IT Professional I need to adhere to standards; working in consulting these standards may vary between customer organizations. Some places are Agile shops, some are classic waterfall, others use whatever flavor of methodology that suits them best. Two things are a given though: We use source control to protect and version the code assets we have spent a major part of our project time creating and we use some form of deployment methodology that ensures we can roll out software releases in a controlled way and if need be revert to a previous stable version.

Historically, interfacing GUI-based data integration tools to industry-standard version control software (VCS) has been less than easy. For example, to store ODI mappings in Git I would need to write some scripting to export the design as an XML file into a file system containing my Git branch, then execute some git commands to commit my change and possibly push it to the remote Git repository. True, ODI had a Create Version feature, however I never really loved it. This version is basically a 'labeled' copy of Design Repository objects held in the Master Repository. Two aspects of this feature were less than endearing.

  • The versions are only held in the ODI repository database, code security is through database backups and not whatever corporate standard VCS is in use.
  • There are no scripting APIs to automate this versioning process – it is all point and click, and that is often unacceptable in an enterprise IT situation.

ODI 12.2 introduced the use of Subversion as an integrated VCS tool within ODI Studio. This was a significant step forward, however many IT organisations have adopted Git as the VCS of choice. The latest patch-set release of ODI 12.2 addresses this and gives us the ability to use Git as a VCS platform alongside ODI as a data integration tool.

Over the past few months I have taken an early look at the newly released Git functionality. One point to note is that the versioning process used in ODI Git integration is commit & push; this may not be the norm for developers used to committing their changes as they work and doing a single push as a milestone in the development process. Thought will need to be given to the most appropriate way to manage when and how to branch the master tree in the Git system; developing in master may not be the best choice.

When I use Git I often use GITHUB.COM as a cloud-hosted Git platform. For security I use two-factor authentication (2FA) on my desktop client and for web access. If you use 2FA to access your Git system from ODI you need to generate an application token to use instead of your Git password in ODI. If you are using GITHUB.COM the token generator needs to know what level of access you need, and for ODI this will be a level that allows you to commit and push. One watch point is that the generated token (a long hexadecimal string) is only visible at the time of creation; if you forget the string you need to create a new one, and you should, of course, revoke the old one to reduce the risk of unused tokens being at large.

Setting up Git access in ODI is done through the Team menu. First use 'Switch Versioning Application' to select Git versioning, then enter the 'Settings' sub-menus in order from the top (shown below).


First edit the connection. For GITHUB.COM and 2FA you must use SSL authentication to write to the remote repository; the other options are not going to work for you and will error either straight away or at the stage you try to create your local clone repository.


Next, create a local clone of the Git repository. The remote Git URL is already populated; we just need to give the local repository location, which will be created if necessary.


The last part is to select an existing branch to work with. The branch selector will list all of the branches currently in the remote Git repository. The New Branch button will create a new branch based on an existing one.


One final point: avoid selecting the Auto Version option checkbox. Selecting this feature will commit and push a new version each time you hit save in the ODI repository; that's almost certainly not a good choice to make.

Next time I will blog on adding objects to ODI version control and the workflow steps you follow in managing source control and creating deployment archives. To do that I will need to talk about how ODI versioning sets up its file systems.

OTN Appreciation Day – Looking for Change

As you may have seen in the Twittersphere, today is a special date in the Oracle community, a day where we can express thanks to OTN for the wealth of technical content that makes our lives easier.

For my tiny contribution to the day, here is a little note on detecting change using SQL MINUS; maybe not my absolute favourite feature, but that one has already been blogged about today!

In data warehouses we need to refresh the content to keep it in line with the source systems. However, in the real world we find that only a small amount of data actually changes between loads; we may get a few new customers and a few others may update their details, but the bulk are the same from one day to the next. Obviously we don't want to truncate and replace our data each load, nor do we want to update all of our records. We just need to process the new and the changed data.

For some people a change data capture approach, be it Oracle GoldenGate, old-style streams based CDC or even triggers is the way to go. It is important to remember, though, that straight data replication is not what we want, we need to consume change in a way that will load our dimension tables with the deltas only, so some form of (in ODI speak) “journalization” is essential. For others the setting up of CDC is not the right solution.

In this case we need to compare the data we are loading with the data we already hold and look for changes. Some developers do this by outer joining source to target and then comparing columns to find changed data, either through direct comparison of columns or by computing an MD5 hash of the concatenated columns. Either way, we need to do something with those pesky NULLs in our data, as we may get false matches or even miss changed rows.
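To see why NULLs break naive concatenate-and-compare schemes, consider this small illustration (my own Python sketch standing in for the hash computation; the column values are invented):

```python
# Naive approach: concatenate columns, treating NULL (None) as empty --
# two genuinely different rows collapse to the same comparison string.
def naive_key(row):
    return "".join("" if v is None else str(v) for v in row)

# Safer: use an explicit NULL marker plus a separator that cannot appear
# in the data (the marker and separator choices here are assumptions).
def safe_key(row, sep="\x1f", null="\\N"):
    return sep.join(null if v is None else str(v) for v in row)

row_a = ("a", None, "b")    # e.g. first_name, middle_name, last_name
row_b = ("ab", None, None)  # a different customer entirely

print(naive_key(row_a) == naive_key(row_b))  # True  -- false match!
print(safe_key(row_a) == safe_key(row_b))    # False -- correctly distinct
```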

A more NULL-tolerant approach is to use a set-based comparison. The classic is something like SELECT a,b,c FROM StagedData MINUS SELECT x,y,z FROM TargetData. Using SQL MINUS like this we get rows in our result set that do not exist in our target plus rows that differ between the two sets, stage and target.

Some people worry that the SQL MINUS is going to be a heavyweight query, but there is a simple thing we can do to lighten the load. Instead of a straight

SELECT * from STAGE_TABLE -- a 20 row table
MINUS
SELECT * from TARGET_TABLE -- a 50 million row table

we pre-filter the target table to only include rows with the stage table's business keys; an EXISTS or an inner join would do nicely here.

Simple, but an effective way of reducing the amount of data being processed in a set-based query.
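Putting it together, here is a sketch with hypothetical table and key names (STAGE_CUSTOMERS / TARGET_CUSTOMERS keyed on CUST_ID):

```sql
-- Delta detection: rows in stage that are new or changed, with the
-- target pre-filtered to just the keys currently being staged
SELECT s.cust_id, s.name, s.email
FROM   stage_customers s
MINUS
SELECT t.cust_id, t.name, t.email
FROM   target_customers t
WHERE  EXISTS (SELECT 1
               FROM   stage_customers s2
               WHERE  s2.cust_id = t.cust_id);
```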