This is a tale that goes back to the times of Oracle 9i, Warehouse Builder and 4TB being a big thing. The lessons learned, though, apply just as much today, perhaps even more so.
My customer ran a “membership card” grocery shopping scheme in which each item sale was linked to a specific customer. There were around a million registered customers and a somewhat smaller number of active ones; on average each active customer shopped once a week. Once a week a customer “master” file was received containing new, updated and unchanged customers. So far this is all standard data warehouse ETL stuff. There was an additional requirement: a customer may have sales transactions before they first appear in the master customer feed. This is a synchronisation gap between the stores that create customers (in real time) and the ERP system that generates the master file (weekly). Again, this is not a tricky problem: we just create a skeleton customer using the new business key and apply an update when the detail comes in from the mastering process.
Most of the ETL code base for this customer was developed using Oracle Warehouse Builder (OWB). Every day we ran a mapping that looked for newly created customers in the sales fact feed and generated new customer records; once a week we ran the master data feed mapping, which compared the existing records with the new feed and applied the differences as inserts or updates (a merge). Two relatively simple mappings.
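The weekly master feed mapping boils down to a standard MERGE. A minimal sketch of what OWB would generate, with illustrative table and column names rather than the customer's actual schema:

```sql
-- Hypothetical sketch of the weekly master feed merge; CUSTOMER_DIM,
-- CUSTOMER_STAGE and the column names are illustrative only.
MERGE INTO customer_dim d
USING customer_stage s
ON (d.customer_bk = s.customer_bk)
WHEN MATCHED THEN
  UPDATE SET d.cust_name = s.cust_name,
             d.cust_address = s.cust_address
WHEN NOT MATCHED THEN
  INSERT (customer_bk, cust_name, cust_address)
  VALUES (s.customer_bk, s.cust_name, s.cust_address);
```

The daily mapping is the same idea in miniature: detect business keys in the sales feed that have no dimension row yet and insert skeletons for them.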
Over time the performance of the customer dimension table degraded. Full table scans took an age to complete and we often ran out of tablespace. Something was wrong here. Looking at the data dictionary we could see that the customer dimension consumed far more database blocks than the sales fact table, yet had far fewer rows of data. Looking at rows per block we could see that we were storing only a handful of rows in each block. That explained the abysmal table scan performance: a full scan reads whole blocks, not the individual rows within them, so physically we were reading far more bytes than we should. The simple remedy is an ALTER TABLE CUSTOMER_DIM MOVE to rebuild the table. However, that is a cure for the symptoms and not a resolution of the problem. We still needed to find the root cause and prevent it happening again.
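The rows-per-block check is a one-liner against the data dictionary, assuming statistics are reasonably fresh. A sketch (SALES_FACT is an illustrative name):

```sql
-- Compare allocated blocks with row counts; relies on up-to-date
-- optimizer statistics (NUM_ROWS and BLOCKS come from DBMS_STATS).
SELECT table_name,
       num_rows,
       blocks,
       ROUND(num_rows / NULLIF(blocks, 0), 1) AS rows_per_block
FROM   user_tables
WHERE  table_name IN ('CUSTOMER_DIM', 'SALES_FACT');
```

A healthy dimension table with short rows should pack tens or hundreds of rows per block; a value hovering near one or two is the red flag we saw here.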
Each day we got a very small number of new customer records arriving (say 20). New customer records were created for them and inserted into the customer dimension table. Looking at the OWB settings, we were inserting new data with the default APPEND PARALLEL hint. Bingo! This was the problem with our small inserts. Unlike a ‘normal’ insert, the APPEND hint writes the data into a buffer and then appends the whole buffer above the existing rows of the table. With no parallelism we insert above the current high water mark of the table (into a shared block if the rows fit); but when we use PARALLEL, each parallel process writes into its own new block. Our parallel 16 CUSTOMER_DIM table was getting 16 new blocks a day, containing between one and two rows each!
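The generated daily insert looked roughly like this sketch (names are illustrative, and the hint syntax is the pre-11g form OWB of that era emitted):

```sql
-- Direct-path, parallel insert of skeleton customers. With APPEND the
-- rows go above the high water mark; with PARALLEL at DOP 16, each
-- parallel server allocates its own fresh blocks, so a 20-row insert
-- can consume 16 nearly empty blocks.
INSERT /*+ APPEND PARALLEL(customer_dim, 16) */ INTO customer_dim
  (customer_bk, cust_name)
SELECT DISTINCT s.customer_bk, 'UNKNOWN'
FROM   sales_stage s
WHERE  NOT EXISTS (SELECT 1
                   FROM   customer_dim d
                   WHERE  d.customer_bk = s.customer_bk);
```

At data-load volumes this pattern is exactly what you want; at 20 rows a day it is pure block-level fragmentation.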
Remove the OWB PARALLEL hint from the mappings (APPEND on its own is fine) and redeploy, then shrink the table with ALTER TABLE MOVE to clean up the symptoms.
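The clean-up itself is a short script. A sketch, assuming a single primary key index (CUSTOMER_DIM_PK is an illustrative name; repeat the rebuild for every index on the table, since MOVE leaves them unusable):

```sql
-- Rebuild the table compactly below a new high water mark.
ALTER TABLE customer_dim MOVE;

-- MOVE changes rowids, so every index on the table must be rebuilt.
ALTER INDEX customer_dim_pk REBUILD;

-- Refresh statistics so the optimizer sees the new, smaller segment.
EXEC DBMS_STATS.GATHER_TABLE_STATS(USER, 'CUSTOMER_DIM');
```

Without the index rebuilds, queries against the dimension will start failing with unusable-index errors, which is an unpleasant way to discover the side effect.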
This problem is not unique to OWB; the same thing can happen with ODI or any ETL tool that might automagically include insert hints in the generated code. Hints are not a magic bullet for performance; sometimes (as in this case) they do more harm than good.
The lesson learned is that you need to understand what the right thing to do with YOUR data is, and not assume that the “out of the box” setting is the right one for all use cases. Look at the settings in the code generation step and override them if need be.
Happy ODC Appreciation Day reading!