Partition Exchange (backwards)

The other day my Twitter-buddy (and real life friend) Jeff Smith replied to a question about efficient data removal using partitioning, which (as people reminded him) is an Enterprise Edition option (but is included in XE from version 18).

I have used partitioning since the Oracle 8 days (and, for some customers, partition views, which predated that technology). As versions are released the technology becomes more and more capable. Partitioning has always been a key part of my data warehousing and large database work. It addresses many important needs, including manageability and resilience, information life cycle, and query performance.

One particular aspect I have enjoyed in my ETL processing was the ability to do partition exchange.

Partition Exchange, simply

This is where we create a table with the same column structure as our partitioned table, load it with data that would belong in a single partition, and then execute an exchange statement that simply swaps data dictionary information for the table and the partition. The table becomes part of the partitioned table and the partition becomes the table. Being a dictionary operation this can be very quick and is an efficient way of working with bulk data. The data itself does not move in the exchange, which means the tablespace you create the table in will become the tablespace for the partition. You may need to factor that in if you need partitions in specific tablespaces; I will rephrase that: YOU SHOULD CREATE YOUR TABLE TO EXCHANGE IN THE CORRECT TABLESPACE.

One way of using this is for a daily load of new data into a weekly partitioned table. We use exchange as it is less invasive on the target partitioned table. For example:

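A minimal sketch of one way such a daily-into-weekly load can be shaped (all object names here are hypothetical):

-- 1. Create an empty exchange table with the same shape as the partitioned fact,
--    in the tablespace the partition should end up in
CREATE TABLE sales_exch
  TABLESPACE sales_w23_ts
  AS SELECT * FROM sales WHERE 1 = 0;

-- 2. Swap the current week's partition out, so the week-to-date rows are now in the table
ALTER TABLE sales EXCHANGE PARTITION sales_w23 WITH TABLE sales_exch;

-- 3. Append today's data to the standalone table (no impact on the partitioned fact)
INSERT /*+ APPEND */ INTO sales_exch SELECT * FROM staging_sales_today;
COMMIT;

-- 4. Swap it back: the table becomes the partition again
ALTER TABLE sales EXCHANGE PARTITION sales_w23 WITH TABLE sales_exch;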

Here I leave out some of the steps we would do in a real process such as stats collection and indexing.

One thing that people often overlook is that in partition exchange the PARTITION BECOMES THE TABLE. This fact allows us to modify the exchange technique so that it works with non-partitioned tables in exactly the same way.

For this quick tip we create a single-partition partitioned table – one where the partition will hold all of the data possible from the table. Two ideas spring to mind: a hash partitioned table with just one partition bucket, or a range partitioned table with the partition key max value set to a value greater than the greatest possible actual key value. We use the partition as our work space for the data loading process and when complete we exchange it for the table.
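As a hedged sketch of that idea (the names are made up), a hash partitioned work table with a single partition bucket can be the workspace and then be swapped for the ordinary, non-partitioned table:

-- a one-bucket hash partitioned work table with the same columns as the real table
CREATE TABLE customer_work
PARTITION BY HASH (customer_id) (PARTITION p_all)
AS SELECT * FROM customer WHERE 1 = 0;

-- ... run the load process against partition p_all ...

-- swap the finished partition for the real, non-partitioned table
ALTER TABLE customer_work EXCHANGE PARTITION p_all WITH TABLE customer;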


Streaming and Star Transforms

In a few weeks' time I will be in Seattle to speak at ODTUG's KSCOPE 2019. I have two talks: one about how data profiling is an important adjunct to data modelling, the second some thoughts on how the advent of streamed data has influenced our traditional star-schema data warehouse.

Without it being too much of a spoiler, one of my takeaways will be that the star is not dead, it’s just that what we do has morphed.

Let’s consider a classic Oracle data warehouse star schema. We have a fact table (let’s call it SALES) joined to some dimension tables using foreign key relationships. We record the item sold (PRODUCT_KEY), the person that bought it (CUSTOMER_KEY), the store location it was sold in (STORE_KEY) and date and time when the item was sold (SELL_DATE_TIME). We also record how much they spent and how many of each item they purchased in the transaction line. One fact table, four dimensions and two measures, almost as simple as it gets.  Except of course for the elephant in the room, the date/time dimension. We would not usually hold every possible date/time in the dimension table. For simplicity, I am going to store the actual transaction time on the fact table. This also has a benefit as this is likely to be the partitioning key and, as a bonus benefit, the optimizer knows date arithmetic and won’t get flustered when working out intervals in cardinality estimates. The other keys should be surrogate keys if we are using slowly changing dimensions – come to my talk or see my blog on temporal validity for a reason why that feature is not all that it seems.
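For reference, a bare-bones sketch of that fact table; the key columns come from the description above, while the measure names and the daily interval partitioning are my assumptions:

CREATE TABLE sales (
  product_key    NUMBER       NOT NULL,
  customer_key   NUMBER       NOT NULL,
  store_key      NUMBER       NOT NULL,
  sell_date_time DATE         NOT NULL,
  quantity_sold  NUMBER,
  amount_sold    NUMBER(12,2)
)
PARTITION BY RANGE (sell_date_time)
INTERVAL (NUMTODSINTERVAL(1,'DAY'))
(PARTITION p_start VALUES LESS THAN (DATE '2019-01-01'));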

There's nothing really new in the above. In our old, on-premises, big iron in the data center world we would have partitioned the fact table to make access by transaction date/time work on a small slice of the fact table. We would bitmap index the fact dimension keys to allow us to use star transformation query plans. Star transformations select rows from the fact by first 'AND'ing the dimension key bitmap indexes; we combine the bitmaps for selling a banana (what) to Peter (who) in Mayenne (where) to find only the rows where all of those conditions are true. Star transformations require us to have bitmap indexes on the fact dimension columns and to have at least RELY foreign key constraints in place.
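As a sketch of that classic setup (the index names are made up; the fact and dimension columns are as described above), we bitmap index each dimension key on the fact and let the optimizer consider the transformation:

-- local bitmap indexes on the fact's dimension key columns
-- (RELY foreign keys to the dimensions are also needed)
CREATE BITMAP INDEX sales_product_bix  ON sales (product_key)  LOCAL;
CREATE BITMAP INDEX sales_customer_bix ON sales (customer_key) LOCAL;
CREATE BITMAP INDEX sales_store_bix    ON sales (store_key)    LOCAL;

-- allow the optimizer to consider star transformation plans
ALTER SESSION SET star_transformation_enabled = TRUE;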

Oracle bitmap indexes are wonderful things but they need a bit of love and care when doing data loads. They index ranges of rows and not single rows. This can cause issues with contention for concurrent access, and during data load they would constantly perform (slow) index updates as new data arrives. There are things we can do here to get around the slow index update. Simplest is to disable the index before loading data and rebuild it afterwards. If, as I hope, the fact is partitioned we only need to disable the local bitmap indexes for the partitions being loaded. Other methods to avoid index updates during insert exist, such as loading into another table, indexing it, and then using partition exchange to swap the data and indexes for the target partition.
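For the partition being loaded, that disable-then-rebuild pattern is only a couple of statements; a sketch with hypothetical index and partition names:

-- before the bulk load: stop maintaining this partition of the local bitmap index
ALTER INDEX sales_product_bix MODIFY PARTITION sales_w23 UNUSABLE;

-- ... direct path load into partition sales_w23 ...

-- after the load: rebuild just that index partition
ALTER INDEX sales_product_bix REBUILD PARTITION sales_w23;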

Using a streamed source changes things – we don't really have the opportunity to maintain the bitmap indexes during a constant load process. Obtaining the keys on fact load is quite simple: we just join to the active records of the dimensions and obtain the keys; we can do this directly in the stream or on the relational side. To be a bit picky here, we should get the key of the dimension record that was active at the time of the transaction, but unless we are processing late arriving data this is usually the current record. These key lookups are likely to be quite speedy to perform, unlike indexing a row-by-row insert into a bitmap indexed table (or partition).

One way to avoid the update issue is to not use bitmap indexes at all! This is not as bad a choice as it first sounds. Star transforms have been with us since Oracle 9i, and since then we have had FIVE major version number shifts in the database. New features have been released that can get around the need for bitmap indexes in a near-realtime fact.

  • Storage indexes on the disk cells.
  • Zone maps to “coarse index” the fact to dimension joins.
  • Bloom filters.

All three of the above are “negative indexes”: they ignore the blocks of disk where the required data will not be found and thus restrict database block reads to the blocks that probably contain the data being sought.

Zone maps are interesting in that they can become stale through DML (and inserting new rows is DML), so we would periodically need to refresh the stale zones; however, this operation is carried out less often than you might think, as we tend to insert data and not revisit existing data.
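As a hedged sketch (zone maps need an engineered-systems platform such as Exadata, and the object names here are hypothetical), a basic zone map and its refresh look something like this:

-- a coarse 'index' recording min/max values of the named columns for each zone of the fact
CREATE MATERIALIZED ZONEMAP sales_zmap ON sales (sell_date_time, store_key);

-- periodically rebuild the zones that DML has made stale
ALTER MATERIALIZED ZONEMAP sales_zmap REBUILD;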

My bottom line is we probably don’t need to use star transform query plans with stream sourced fact data. Newer database structures support performant data access without the bitmap index updates seen on continuous data load.

 


Go Faster Tape

When I was a kid you could buy a roll of sticky tape with a checker board print on it. This was used to decorate bikes to make them go faster. The reality of this was that it did not actually work, but as a young child you just accepted it for what it was, magic.

The same could be said for the silver bullets you might read about in tuning blogs. The reality of tuning is that there is never a one-size-fits-all solution. Invariably, performance is driven by the distribution of the data and the characteristics of the platform it is being accessed on; everyone’s data and environment is different so what works for one person may not work for another. That is not to say that there is no such thing as good practice for designing data access, it’s just that you can’t say “this block size” or “index this” is the best thing for all circumstances.

However, there are some general goals to explore, if not achieve:

  • Don't do it at all.
  • Do it less often.
  • Do it the same way, but faster.
  • Do it a different way.

Obviously, by not doing something at all we get the best performance gain. Surprisingly, just by looking at a system you might spot time-consuming but unnecessary steps; sometimes it requires you to restate the problem in a different way and alter the processing logic to omit a step.

It's harder to call whether doing things less frequently or faster gives the best performance boost. Less often could be something simple such as committing less often (of course, commit as often as transactional integrity requires, but for bulk work we might not need to commit for each row). Likewise, pre-filtering data can reduce the volume to be processed. Sometimes the line between less often and faster is quite blurred; is changing from a nested loop join to a hash join an example of “less often” or “faster”? You could argue the same for PL/SQL loops being replaced with a SQL join. Some things definitely sit in the realm of “same but faster”, such as where an operation occurs: in the CPU, in memory, at the storage cell, on disk, or using off-line data.

The fourth way – do something different

In some cases, typically with aggregation, it might be possible to apply a complex function to the result of an aggregate rather than to each value, though some functions will not allow this.
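As a hedged illustration (the conversion function and table names are made up), compare applying the function per row with applying it once per aggregated group; this is only valid when the function distributes over the aggregate, which many functions do not:

-- row at a time: the function is evaluated for every fact row
SELECT store_key, SUM(to_reporting_currency(amount_sold)) AS total_sales
FROM sales
GROUP BY store_key;

-- aggregate first, convert once per group
SELECT store_key, to_reporting_currency(SUM(amount_sold)) AS total_sales
FROM sales
GROUP BY store_key;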


Let's look at a typical problem we see in the data warehouse / data analytics world where we get an extract of dimensional data from a source. A simplistic statement of this: using the staged data as a source, insert new data and update existing data. We might write a PL/SQL block that loops through the source and tries to update the target and, if that fails, tries to insert. This however is likely to be slow as we loop through our data set. Faster, we could replace the loop code with a MERGE (upsert) statement. This gets rid of our loop code and runs as a simple SQL statement. There is still scope for improvement though, but this requires us to revisit the requirements.
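A minimal sketch of that single-statement replacement, using the source and target names that appear later in this post (the attribute columns are hypothetical):

MERGE INTO TARGET_SCHEMA.D1 t
USING SOURCE_SCHEMA.D1 s
ON (t.DIM_KEY = s.DIM_KEY)
WHEN MATCHED THEN
  UPDATE SET t.ATTRIBUTE_1 = s.ATTRIBUTE_1,
             t.ATTRIBUTE_2 = s.ATTRIBUTE_2
WHEN NOT MATCHED THEN
  INSERT (DIM_KEY, ATTRIBUTE_1, ATTRIBUTE_2)
  VALUES (s.DIM_KEY, s.ATTRIBUTE_1, s.ATTRIBUTE_2);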

In a real world dimension we are likely to have new records, updated records and (almost certainly as the majority) unchanged records. So how do we remove the unchanged records so we don’t process them? I have blogged on that before but there are two basic approaches we can take:

  • We can outer join source to target and filter out the duplicate rows.
  • We can SQL MINUS the source and target.

As I said before, the join approach may be fine, but when you might get NULLs in some columns there is some coding to do, as comparing a NULL to a NULL is not quite straightforward in Oracle. Some people use an MD5 hash on the concatenated columns, but again NULLs may trip things up and the hash function is very compute intensive, so it could have a negative impact on systems with low CPU power.
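One common workaround, sketched here with hypothetical attribute columns, is to compare with DECODE, since DECODE treats two NULLs as equal:

-- changed rows only: DECODE returns 1 when the pair of values match (including NULL = NULL)
SELECT s.*
FROM SOURCE_SCHEMA.D1 s
INNER JOIN TARGET_SCHEMA.D1 t ON t.DIM_KEY = s.DIM_KEY
WHERE DECODE(s.ATTRIBUTE_1, t.ATTRIBUTE_1, 1, 0) = 0
   OR DECODE(s.ATTRIBUTE_2, t.ATTRIBUTE_2, 1, 0) = 0;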

However, if we could change our source data extract to contain only new and changed data we could go back to a simple merge load. One approach for this is to use log-based change data capture on the source system; for example, Oracle GoldenGate on the source database only passes rows that have a commit – and for many systems commit implies change. However, experience shows that there are “noisy” systems out there that commit on view, and that leaves us with data that need not be processed. I say change data capture, but we could equally use data streams such as Apache Kafka.

Having less data in the source allows us to further reduce the data we process in a SQL MINUS:

-- Basic set-based approach to find change
SELECT * FROM SOURCE_SCHEMA.D1
MINUS
SELECT * FROM TARGET_SCHEMA.D1;

-- Modified version for when SOURCE_SCHEMA.D1 has many fewer rows than TARGET_SCHEMA.D1
SELECT * FROM SOURCE_SCHEMA.D1
MINUS
SELECT T1.* FROM TARGET_SCHEMA.D1 T1 INNER JOIN SOURCE_SCHEMA.D1 S1 ON S1.DIM_KEY = T1.DIM_KEY;

Our next question is: do we merge the data or can we do something with partition exchange? The Oracle MERGE statement can be very efficient if we are merging on an indexed key, but there may be some mileage in using only append inserts into an empty table and swapping the data in with partition exchange. You can also do this with a non-partitioned table!

-- Insert into exchange table EX1 the unchanged data
INSERT /*+ APPEND */ INTO EX1 (
SELECT * FROM SOURCE_SCHEMA.D1
INTERSECT
SELECT T1.* FROM TARGET_SCHEMA.D1 T1 INNER JOIN SOURCE_SCHEMA.D1 S1 ON S1.DIM_KEY = T1.DIM_KEY );
COMMIT;

-- Now the changed + new data
INSERT /*+ APPEND */ INTO EX1 (
SELECT * FROM SOURCE_SCHEMA.D1
MINUS
SELECT T1.* FROM TARGET_SCHEMA.D1 T1 INNER JOIN SOURCE_SCHEMA.D1 S1 ON S1.DIM_KEY = T1.DIM_KEY );
COMMIT;

This is the basic approach I took in one of the ODI SCD2 IKMs I developed.

Maybe not go faster tape – but some methods to look at for performance improvement.

Kafka-in-a-database

Some people might take the view that I have one of those long-dead technical expertises, ETL; a sort of Latin of the tech world. However, that view is not strictly accurate. The domain of ETL is evolving from a “batch-orientated lift-n-shift to some on-premises data warehouse monolith” to one where “streams of data move constantly to one or more target systems that might well exist in someone else's cloud”. Although still data-herding, it is a million miles from what people were doing in Sunopsis, Oracle Warehouse Builder, or Informatica in the early 2000s. Today, we consume changes captured from databases and Web-API-sourced 3rd party data and often push it to a remote target using another web service. However, this is rarely going to be a simple pass-through: almost invariably the outgoing API requires some form of data enrichment before it can be consumed, and that takes us back to the transform, the T, of ETL. Which is my way of saying that going to the ETL talks at conferences is as relevant today as it ever was.

At Oracle Code-One 2018 in San Francisco, one talk particularly resonated with my current work using Kafka to load an Oracle Cloud SaaS product.

In my project we needed to structure our outgoing topics to fit the format expected by the upload-to-SaaS API. The API is a REST service that accepts JSON payloads with a schema specified in the SaaS vendor's documentation. Here we come to a difficulty with any form of stream processing: the parts we need to emit are often the amalgamation of many streams with different velocities. Take a PRODUCT being uploaded to some ERP system: the product barcode may change at a different frequency (and at different times) to information about the product supplier. If we need to provide both barcode and supplier in the same JSON payload we are going to have to find a way to join those streams across extended time periods. This brings up questions about stream retention and the need for some backing store. Maybe Kafka KTables could be part of the mix to present a pivoted view of the current state of all the keys, but things can get complex with key management, partitioning, replication, and the need for adequate sizing in terms of memory and storage of the nodes.

From the talk I learnt that sometime in the next 12 months Oracle will be releasing the first version of its in-database Kafka product. As with all future products, Oracle can't commit to a firm timeline for delivery, but I am hopeful it will be in the next couple of months. Hearing about the future Oracle in-database Kafka consumer set my head racing on simpler solutions to joining data over extended periods.

The initial version of the product is an in-database consumer of Kafka Topics (probably with a limited set of SERDEs). The basic approach is to extend the database functionality by adding code (similar to the way in which Enterprise R extends the database by installing some packages). We use the new package to register the topic and expose the data as an Oracle view. This does not sound that exciting but that view is the gateway to a whole host of database functionality that might be tricky on a Kafka stream or within KSQL:

  • we can join it to other tables and views (including other topics exposed as views).
  • we can use SQL functions to enrich and enhance the data.
  • we can use aggregate functions.
  • we can use analytic functions.
  • we can pivot and unpivot data.
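As a hedged sketch of what that opens up (the topic-backed view and reference table names are hypothetical, and the topic registration step itself is not shown), once the topic is exposed as a view it behaves like any other row source:

-- join the topic-backed view to a reference table, enrich, aggregate and rank
SELECT p.prod_category,
       TRUNC(t.event_time, 'HH24') AS event_hour,
       SUM(t.quantity) AS qty,
       RANK() OVER (PARTITION BY TRUNC(t.event_time, 'HH24')
                    ORDER BY SUM(t.quantity) DESC) AS category_rank
FROM sales_topic_vw t
INNER JOIN products p ON p.prod_id = t.prod_id
GROUP BY p.prod_category, TRUNC(t.event_time, 'HH24');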

More importantly, we can insert this data into other tables and then feed other systems.

As I understand it, the first release is a consumer only – we will have to wait a little for Kafka producers. Another wait-for is AVRO processing. In some ways this lack of AVRO from day 1 is a shame, but not one that we can't work around. For example, I use GoldenGate for Big Data to produce Kafka topics from database change, and this comes in AVRO format if we want lovely things like auto-registration of topics. However, this is not a technical showstopper if we need to consume GoldenGate change capture in a database, for that is exactly what the product has done from the start! The problems come when a data-flow architect mandates that all data moves using the same tech. Maybe Oracle to Oracle streaming is simpler using basic native GoldenGate.

Maybe the product is not the complete answer to every problem just yet, but it is one to look at for the future.

Are There Constraints in your Data Warehouse?

A while back I published a small blog on using Oracle SQLDeveloper to create quick data model diagrams. In that blog I said that the data model creation process depended on the presence of parent-child relationships with supporting foreign keys; that is, the presence of the FK declares the relationship to the modelling engine. I also mentioned that when FK relationships don't exist it is not possible to automatically reverse out the table relationships. For source systems (such as OLTP programs) a lack of constraints often comes from one of two perspectives: “our product is database agnostic, not all database vendors implement constraints the same way, therefore we go with the lowest common functionality”; the other reason for no constraints is that “the database is just a persistence layer” and all data integrity is enforced in the application.

What about in data warehouses? Do we use constraints or not? I created a Twitter poll that received 42 votes. As we all know, 42 implies we have the true answer to the ultimate question.

To be honest, I could have done with some more Twitter votes to give more statistically meaningful results; however, the results are fairly clear: people use constraints in their data warehouses. I could also do with more Twitter followers too!

Firstly, let’s consider the reasons why not.

No Need, My Data is Good

Congratulations, you are in a minority! Maybe your data is good, though in over 20 years of doing BI I don't recall a project with perfect source data; maybe your data cleanse and ETL process creates perfection from the source, but can you really be sure that some data anomaly or process orchestration error will not corrupt your perfect, but constraint-free, data warehouse?

No Need, They Slow the Database

I can see where that notion comes from; however, I think it is a wrong-headed idea. It's true that active primary and foreign key constraints have indexes lurking in the background and that maintaining those indexes during bulk ETL can reduce ETL throughput. It is also true that foreign keys can block actions such as table truncation. However, in my view, we design our ETL processes to work with constraints, not to get rid of constraints. I'll come back to the reasons a little later on.

What have Constraints ever done for us?

85% of the poll respondents use some form of constraint in the data warehouse, and from some of the poll comments posted, for good reason. Here I am going to duck into the Oracle database world as not all vendors have constraints that do exactly the same things – for example, some vendors have declarative but unenforced foreign keys, so some of the features of Oracle FKs are lost.

  1. Enabled database constraints prevent data that does not match our rules from going into the database; inserts and updates that fail the rules get blocked and, for FK relationships, deletes might also be prevented. Simplistically, we have three kinds of rule: the row keys must be unique in a table (primary and unique constraints), the column value must exist in another table (the foreign key constraint), and the value inserted must match a rule (the check constraint); NOT NULL is just a variant of the check constraint. Similarly, primary key constraints are a special variant of the unique constraint (UNIQUE and NOT NULL). In the Oracle world unique and primary key constraints are validated by b-tree indexes – these need not be created as unique indexes.
  2. Constraints describe data properties to the query optimizer. True, table and index statistics inform the optimizer about the quantity of data involved (and, in the case of histograms and extended statistics, detail on how the data is distributed and any inter-column correlations within it), but it is the constraints that finesse that information so the optimizer knows that for each value of a column in TABLE_1 there is exactly one row in TABLE_2 for the same key. The more the optimizer knows about the data, the better the generated query plan is likely to be.

Point 1 is all about the quality of the data we are storing, the truth of our data warehouse – in other words, the ETL. Conversely, point 2 is all about query performance and our ability to access some DWH-specific query optimizations. In fact the second point is really about declarative constraints and not necessarily enforced constraints, as we shall see in a moment.

A second poll around the same time asked whether people used enabled or declarative constraints. Not so many responses here, but the majority of respondents are using some form of constraint in the query layer.


Oracle constraints have three properties that can be independently set: VALIDATION, BELIEF and ENFORCEMENT. Enforcement is the checking that new data conforms to the constraint; validation, on the other hand, is concerned with data already in the table. Asserting belief by using RELY only works if you use NOVALIDATE, and this tells the optimizer to treat the constraint as if it were valid; this is especially useful in some join optimizations such as query rewrite.
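A hedged sketch of how those states are spelled in DDL (the table and constraint names are made up):

-- fully enforced and checked against existing rows - typical for a dimension key
ALTER TABLE product_dim ADD CONSTRAINT product_dim_pk
  PRIMARY KEY (product_key) ENABLE VALIDATE;

-- declarative only: not enforced, existing rows not checked,
-- but the optimizer may RELY on it (useful for query rewrite)
ALTER TABLE sales ADD CONSTRAINT sales_customer_fk
  FOREIGN KEY (customer_key) REFERENCES customer_dim (customer_key)
  RELY DISABLE NOVALIDATE;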

Validated unique and primary key constraints require an index behind them – one point to note is that a validated but disabled unique constraint backed by a unique index will still block duplicate data inserts, not through a constraint violation but through duplicate index keys.

If constraints are so important to ensure data validity in the data load process, why do we think that using declarative RELY constraints is such a benefit in the reporting/analytic layers? The reason here is related to performance on data load and the doubtful benefit of unique indexes on massive fact tables. In our data warehouse each row of the fact table is (or should be) unique; for a sales table we have a product (what), customer (who), store (where), and a date/time (when), and this information uniquely defines a sales line transaction (the quantity and values being measures). An enforced and validated constraint on these four columns is likely to require a huge four-column composite index. Updating it for each sales record insert is an ETL performance overhead and gives us no query benefit, as people never need single-row analytic access; invariably they want to aggregate, and aggregating implies that at least one dimension is not known, and thus the composite index is perhaps not the right choice.

RELY can be a risky option if we lie about the validity of our data, but with the correct and rigorous ETL design we should be safe to use RELY in the analytic layer. This rigour requires us to stage and validate our data before publishing it. Foreign keys on facts are simple to validate in the key lookup chain process; the fact that we found the key in our validated dimension table is an implicit FK validation. Row key uniqueness can only be violated by loading something twice – we can prevent this in staging by building in some check process before data publication.

Are You Free? Fiddling with PCTFREE

I just spotted on my friend Martin Widlake's Twitter feed a poll that I missed when it was active.


Long-term readers of the blog may have read this tale of mine before, but the story needs to be repeated again and again; though this time in a single post and not a three-part series.

Back then I managed the UK data warehouse team for an SI and outsourcing company. One of our customers had a 24×7 support contract for their central data hub system, and as that was more ETL-like than transactional its support fitted into my empire.

The customer operated multiple brands of restaurant, hotel and bar; however, they all shared a common data hub that moved data between the outlets and the central ERP and data warehouse systems. This data was sent as a single (large, since it was at item level) file that had to be processed as a single transaction or moved to a 'fail stream' for editing and data correction. To cope with business demand the incoming data was routed to one of four processors – that is, we ran the data from four outlets at a time through the system. At times, and particularly when new menus went live in the outlets, the system was prone to crash with deadlock errors in the database. The DBA team said this had to be the developers' fault for trying to update the same record twice; the developers said not, as outlets were processed in their own threads and, as outlet was part of the PK of a record, only one process could be active on a single record.

The real clue as to what was going on came from the deadlock graph in the trace file:

Deadlock graph:
                       ---------Blocker(s)--------  ---------Waiter(s)---------
Resource Name          process session holds waits  process session holds waits
TX-00050021-00014589        11      31     X             13      16           S
TX-00060017-00014503        13      16     X             11      31           S
session 31: DID 0001-000B-00000002 session 16: DID 0001-000D-00000002
session 16: DID 0001-000D-00000002 session 31: DID 0001-000B-00000002
Rows waited on:
Session 16: no row
Session 31: no row

This is not quite a ‘normal’ deadlock where two updates are waiting on a single row, instead we have two processes that need to update a database block but can’t because of contention in the block header. But why would this happen?

The specific tables that were always associated with the problem were the item allocation tables; there were three of them, dealing with different types of items, but they all had the same basic design:

 

DESC STKTAB
 OUTLET_ID       NOT NULL  NUMBER
 STOCKITEM_ID    NOT NULL  NUMBER
 DATELASTSOLD              DATE
 DATEVALIDATED             DATE

The primary key is the outlet and the item, and when a new allocation is created the two date fields are set to NULL. When the outlet is polled, DATEVALIDATED is set to sysdate and DATELASTSOLD is set to the date and time of the last sales transaction for that item. In the database our update expands a row from around 10 bytes for the two key values to 24 bytes when we set the two dates. Looking at the table DDL we see we use default values for both PCTFREE (10) and INITRANS (1). To make the perfect storm of a problem, our bulk allocation of items uses a nested loop approach of "for each item to be allocated, loop through the outlets to be allocated to and create a new row in the allocations table". This tends to make many outlets' records for a specific item appear in the same database block. Coupled with a long transaction time and lots of row expansion when items are sold for the first time, we get fairly full data blocks running out of space and no room to create the needed number of interested transaction list (ITL) slots.

We could fix up some of our pain by rebuilding the table, but that is only an interim fix as the next round of bulk allocations to a large brand will trigger the same behaviour. We could also change the PCTFREE value to make the blocks "less full" so that we have room for the massive row expansion. In fact this is perhaps the simplest fix without lifting the hood and fixing the code.
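A sketch of that sort of interim fix (the values and index name are illustrative, not a recommendation):

-- leave more free space per block and pre-allocate more ITL slots...
ALTER TABLE stktab PCTFREE 30 INITRANS 8;
-- ...but these settings only apply to newly formatted blocks, so rebuild the
-- table (and then its indexes) to apply them to the existing data
ALTER TABLE stktab MOVE;
ALTER INDEX stktab_pk REBUILD;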

The moral of the story, though, is that you need to consider how tables are updated and how row sizes might change before you decide to keep PCTFREE at the default. If possible, consider how concurrent updates might cause issues in a data block.

Oracle Data Integrator and the REST

Recent Oracle Data Integrator (ODI) releases have had REST available as a technology within the topology tab. Oracle's A-team wrote some good blog posts on this and there's a handy tutorial here. The use case is to invoke a REST web service, save the JSON response and then parse it and do some more processing. This uses an ODI package to orchestrate the steps required. However, there are a few things here that I am not so comfortable with. Firstly, this is an ODI 12.2 and higher method; there are some people (perhaps users of BI-Apps) that have not been able to upgrade yet. The second thing is that we need to process the returned JSON document by persisting it as a file local to an ODI Agent and then use that Agent to read the file as a complex text type; I think this is a little messy. The final thing I want to mention is that the documented method uses a multi-step package to process a single REST call; for a data enrichment use case we may need to add some form of looping to call the REST service for each record to be processed.

My customer's use case was to standardize customer address data and, where possible, geocode it. They had licensed access to a REST service API (in this case Google Maps, but other providers, such as openstreetmap.org, can provide similar functionality) to provide standardized (and corrected) address data including geocode components. Like many organizations with user-entered data there were a few problems with data quality and hence a need to correct the data as far as possible.

The REST API for address data could process a single address in one call. For API license reasons we also needed to restrict the number of addresses being passed in a day to 1000. We could do a few smart things with our processing of non-validated addresses. Firstly, just because a customer address has not been validated and geocoded it does not mean that we don't already have a validated address for another customer at the same location – spouses, siblings, co-workers might have been validated already; here we are just looking for matching address points, not customers' names. If we can enrich and geocode without using the API, all well and good. For the ones we can't match we need to check if we tried to validate them before – if it failed last time it will fail again unless we have edited the address. Then we process the first 1000 records and either enhance and standardize the data or mark it as failed to match in the API.

So how do we do this in ODI?  One answer is to write a simple mapping to create a table of addresses to process and a multi-step procedure to process the table. More on the mapping later.

ODI procedures for REST

A procedure in ODI is a set of one or more task steps that can be executed on an ODI source or target. Each step (task) can be written in any one of many supported languages including SQL, Jython and Groovy. Three things should be remembered:

  • A source step can be used as a row source for a target on the same task. That is, for each source row obtained we execute the target task once.
  • The source and target technologies need not be the same, we can read a source with Oracle SQL and do the target step in Jython.
  • We can use external java libraries with Jython.  If we need to use a custom JAR we just drop the file into the ODI agent’s JAR library directory and restart the agent. However, the libraries we need to send GET requests to a web service are already in the path, we may need something extra to parse a JSON object if we go that route.

For this procedure we need to select a customer_id (so that we know the record to update) and the address associated with that customer – to make our life somewhat easier, our addresses-to-process mapping combines the individual address components into a single text string representing the address. So the SOURCE tab on the procedure is something simple like:

SELECT customer_id THE_CUSTOMER, cust_address THE_ADDRESS FROM addresses_to_process

Note the aliases are important as they are used in the target command. In the real world we would use ODI substitution variables to identify the source schema (and possibly source table).

The target command is more complicated:

# first import the classes we need
import java.net.URL as URL
import java.io.BufferedReader as BRD
import java.io.InputStreamReader as SRD

# supply our credentials - best to use an ODI variable for this and not hard code it here
API_KEY = "[my api key]"
# create the URL to request the data - note the use of the source query alias for the address
URLRAW = "https://maps.googleapis.com/maps/api/geocode/json?address=#THE_ADDRESS&key=" + API_KEY
# but addresses could contain spaces, so we must escape them
REQUEST_URL = URLRAW.replace(' ', '%20')

# query the API
url = URL(REQUEST_URL)
api_con = url.openConnection()
api_con.setRequestMethod("GET")
api_con.connect()
# check the response
responseCode = api_con.getResponseCode()
responseMessage = api_con.getResponseMessage()
if responseCode == 200 and responseMessage == "OK":
    br = BRD(SRD(api_con.getInputStream()))
    # concatenate the (possibly multi-line) response into a single JSON string
    outJSON = ""
    line = br.readLine()
    while line is not None:
        outJSON += line
        line = br.readLine()
api_con.disconnect()


At this point we have populated outJSON with a single JSON response representing a standardized address (address_components), a formatted address and a geocode (location). The Google Maps Geocode API documentation gives an example output:
{
  "results" : [
    {
      "address_components" : [
        {
          "long_name" : "1600",
          "short_name" : "1600",
          "types" : [ "street_number" ]
        },
        {
          "long_name" : "Amphitheatre Pkwy",
          "short_name" : "Amphitheatre Pkwy",
          "types" : [ "route" ]
        },
        {
          "long_name" : "Mountain View",
          "short_name" : "Mountain View",
          "types" : [ "locality", "political" ]
        },
        {
          "long_name" : "Santa Clara County",
          "short_name" : "Santa Clara County",
          "types" : [ "administrative_area_level_2", "political" ]
        },
        {
          "long_name" : "California",
          "short_name" : "CA",
          "types" : [ "administrative_area_level_1", "political" ]
        },
        {
          "long_name" : "United States",
          "short_name" : "US",
          "types" : [ "country", "political" ]
        },
        {
          "long_name" : "94043",
          "short_name" : "94043",
          "types" : [ "postal_code" ]
        }
      ],
      "formatted_address" : "1600 Amphitheatre Parkway, Mountain View, CA 94043, USA",
      "geometry" : {
        "location" : {
          "lat" : 37.4224764,
          "lng" : -122.0842499
        },
        "location_type" : "ROOFTOP",
        "viewport" : {
          "northeast" : {
            "lat" : 37.4238253802915,
            "lng" : -122.0829009197085
          },
          "southwest" : {
            "lat" : 37.4211274197085,
            "lng" : -122.0855988802915
          }
        }
      },
      "place_id" : "ChIJ2eUgeAK6j4ARbn5u_wAGqWA",
      "types" : [ "street_address" ]
    }
  ],
  "status" : "OK"
}

We then need to extract the data from the JSON document. There are two ways we can go about this: if we are using an Oracle 12.2 or later database we could store the JSON document in the database and use the in-database JSON functionality to parse the document and extract the data of interest, or we can use a third-party JSON parser library such as Jackson and parse the document before storing it. To use an external JAR file we just need to make sure the library is in the ODI JAR file path, restart the agent to register the JAR, and finally import it into the procedure code. Whichever way we process the JSON document we will need to persist the output to the database along with the customer_id (#THE_CUSTOMER) key value.
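If we take the in-database route and persist the raw response, the extraction might look something like this sketch (the JSON_WORK column names are my assumption, based on the insert shown below):

-- pull out the formatted address and geocode for each customer with JSON_TABLE (12c onwards)
SELECT w.customer_id,
       jt.formatted_address,
       jt.lat,
       jt.lng
FROM JSON_WORK w,
     JSON_TABLE(w.json_doc, '$.results[0]'
       COLUMNS (formatted_address VARCHAR2(400) PATH '$.formatted_address',
                lat               NUMBER        PATH '$.geometry.location.lat',
                lng               NUMBER        PATH '$.geometry.location.lng')) jt;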

The returned JSON document can be quite large, so if we store it directly in an Oracle database we may need to insert the data using an anonymous PL/SQL block to get around limitations on VARCHAR2 size in an insert from SQL. This is fairly simple if we are using SQL from Jython: we just prepare a statement and execute it.

sqlStm = "Begin
INSERT INTO JSON_WORK VALUES ('#THE_CUSTOMER', ‘“+ outJSON +”’);
END"

Ideally we only want to issue the commit statement at the end of the processing loop.

One further point to note on the Jython REST query step is that we could write this as in-line Jython on a SQL technology step – this could make storing the results in the database easier to code. On balance, I come down on writing JDBC inserts in Jython over embedding the REST calls in SQL.

We could write the procedure as 3 tasks:

  • Open a JDBC connection to the database that will have the JSON results inserted into it
  • A loop step that gets each address to be validated from the source database and does a GET to the REST API, then uses the open JDBC connection to write the results to the target table (no commit)
  • A commit and close JDBC connection step

Along with the procedure we will need some mappings, the first to build the list of 1000 addresses to process.


And a second mapping to process the returned data.


The net result was a simple process to enrich data from an external provider.