Some people might take the view that I have one of those long-dead technical expertises: ETL, a sort of Latin of the tech world. However, that view is not strictly accurate. The domain of ETL is evolving from a “batch-oriented lift-and-shift to some on-premises data warehouse monolith” to one where “streams of data move constantly to one or more target systems that might well exist in someone else’s cloud”. Although still data-herding, it is a million miles from what people were doing in Sunopsis, Oracle Warehouse Builder, or Informatica in the early 2000s. Today we consume changes captured from databases and third-party data sourced from web APIs, and often push the result to a remote target using another web service. However, this is rarely a simple pass-through: almost invariably the outgoing API requires some form of data enrichment before it can be consumed, and that takes us back to the transform, the T, in ETL. Which is my way of saying that going to the ETL talks at conferences is as relevant today as it ever was.
At Oracle Code One 2018 in San Francisco, one talk particularly resonated with my current work using Kafka to load an Oracle Cloud SaaS product.
In my project we needed to structure our outgoing topics to fit the format expected by the SaaS upload API. The API is a REST service that accepts JSON payloads with a schema specified in the SaaS vendor’s documentation. Here we come to a difficulty with any form of stream processing: the payloads we need to emit are often the amalgamation of many streams with different velocities. Take a PRODUCT being uploaded to some ERP system: the product barcode may change at a different frequency (and at different times) to the information about the product’s supplier. If we need to provide both barcode and supplier in the same JSON payload, we are going to have to find a way to join those streams across extended time periods. This brings up questions about stream retention and the need for some backing store. Maybe Kafka KTables could be part of the mix to present a pivoted view of the current state of all the keys, but things can get complex with key management, partitioning, replication, and the need to size the nodes adequately in terms of memory and storage.
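To make the join-over-time problem concrete, here is a minimal sketch in plain Python (not real Kafka or KTable code): each stream updates a per-key state store standing in for a backing store, and the combined payload for a product can only be emitted once both halves have arrived. The field names (`barcode`, `supplier`, `productId`) are illustrative, not taken from any real API.

```python
# A minimal sketch of joining two streams with different velocities.
# Per-key dictionaries stand in for a backing store / pivoted KTable view.
import json

barcodes = {}   # latest barcode seen per product
suppliers = {}  # latest supplier seen per product

def on_barcode(product_id, barcode):
    """Handle a barcode-change event; return a payload if now joinable."""
    barcodes[product_id] = barcode
    return try_emit(product_id)

def on_supplier(product_id, supplier):
    """Handle a supplier-change event; return a payload if now joinable."""
    suppliers[product_id] = supplier
    return try_emit(product_id)

def try_emit(product_id):
    """Emit the enriched payload only when both streams have delivered."""
    if product_id in barcodes and product_id in suppliers:
        return json.dumps({
            "productId": product_id,
            "barcode": barcodes[product_id],
            "supplier": suppliers[product_id],
        })
    return None  # still waiting on the slower stream

# The supplier record may arrive long before the barcode changes.
assert on_supplier("P42", "Acme Ltd") is None
print(on_barcode("P42", "5012345678900"))
```

The awkward part in a real system is everything the dictionaries hide: how long that per-key state is retained, where it lives, and how it is partitioned and replicated, which is exactly why a database-backed alternative is appealing.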
From the talk I learnt that sometime in the next 12 months Oracle will be releasing the first version of its in-database Kafka product. As with all future products, Oracle can’t commit to a firm timeline for delivery, but I am hopeful it will arrive in the next couple of months. Hearing about the future Oracle in-database Kafka consumer set my mind racing towards simpler solutions for joining data over extended periods.
The initial version of the product is an in-database consumer of Kafka topics (probably with a limited set of SerDes). The basic approach is to extend the database functionality by adding code, similar to the way Oracle R Enterprise extends the database by installing some packages. We use the new package to register the topic and expose the data as an Oracle view. This does not sound that exciting, but that view is the gateway to a whole host of database functionality that might be tricky on a Kafka stream or within KSQL:
- we can join it to other tables and views (including other topics exposed as views);
- we can use SQL functions to enrich and enhance the data;
- we can use aggregate functions;
- we can use analytic functions;
- we can pivot and unpivot data.
More importantly, we can insert this data into other tables and then feed other systems.
As I understand it, the first release is a consumer only; we will have to wait a little for Kafka producers. Another item on the wait-for list is AVRO processing. In some ways this lack of AVRO support from day one is a shame, but not one we can’t work around. For example, I use GoldenGate for Big Data to produce Kafka topics from database changes, and this change data comes in AVRO format if we want lovely things like auto-registration of topics. However, this is not a technical showstopper if we need to consume GoldenGate change capture in a database, for applying change data to a database is exactly what GoldenGate has done from the start! The problems come when a data-flow architect mandates that all data moves using the same technology. Maybe Oracle-to-Oracle streaming is simpler using plain native GoldenGate.
Maybe the product is not the complete answer to every problem just yet, but it is one to look at for the future.