I just spotted a poll on my friend Martin Widlake's Twitter feed that I missed while it was active.
Long-term readers of the blog may have read this tale of mine before, but the story needs to be repeated again and again, though this time as a single blog post rather than the original three-part series.
Back then I managed the UK data warehouse team for an SI and outsourcing company. One of our customers had a 24×7 support contract for their central data hub system, and as that was more ETL-like than transactional, its support fitted into my empire.
The customer operated multiple brands of restaurants, hotels and bars, but they all shared a common data hub that moved data between the outlets and the central ERP and data warehouse systems. Each outlet's data was sent as a single file (a large one, since it was at item level) that had to be processed as a single transaction or moved to a 'fail stream' for editing and data correction. To cope with business demand the incoming data was routed to one of four processors; that is, we ran the data from four outlets at a time through the system. At times, and particularly when new menus went live in the outlets, the system was prone to crash with deadlock errors in the database. The DBA team said this had to be the developers' fault for trying to update the same record twice; the developers said not, as each outlet was processed in its own thread, and since the outlet was part of the primary key of a record, only one process could be active on a single record.
The real clue as to what was going on came from the deadlock graph in the trace file:
Resource Name          process session holds waits   process session holds waits
TX-00050021-00014589        11      31   X                13      16         S
TX-00060017-00014503        13      16   X                11      31         S

session 31: DID 0001-000B-00000002    session 16: DID 0001-000D-00000002
session 16: DID 0001-000D-00000002    session 31: DID 0001-000B-00000002

Rows waited on:
Session 16: no row
Session 31: no row
This is not quite a 'normal' deadlock, where two updates wait on a single row. Instead we have two processes that each need to update a database block but can't, because of contention in the block header; the telltale sign is that neither session is waiting on a row. But why would this happen?
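One way to confirm this kind of block-header contention is to look at the 'ITL waits' statistic that Oracle keeps per segment. A minimal sketch (the view and statistic name are standard Oracle; any filtering thresholds are my own choice):

```sql
-- Segments accumulating ITL waits point at block-header (interested
-- transaction list) contention rather than ordinary row-lock waits.
SELECT owner,
       object_name,
       object_type,
       value AS itl_waits
FROM   v$segment_statistics
WHERE  statistic_name = 'ITL waits'
AND    value > 0
ORDER  BY value DESC;
```

In a case like ours, the item allocation tables would be expected to float to the top of that list.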
The specific tables that were always associated with the problem were the item allocation tables; there were three of them, dealing with different types of items, but they all had the same basic design:
OUTLET_ID      NOT NULL NUMBER
STOCKITEM_ID   NOT NULL NUMBER
LASTVALIDATED           DATE
LASTSOLDDATE            DATE
The primary key is the outlet and the item, and when a new allocation is created the two date fields are set to NULL. When the outlet is polled, LASTVALIDATED is set to sysdate and LASTSOLDDATE is set to the date and time of the last sales transaction for that item. In the database, that update expands a row from around 10 bytes (just the two key values) to 24 bytes once the dates are set. Looking at the table DDL, we see we use the default values for both PCTFREE (10) and INITRANS (1).

To make the perfect storm of a problem, our bulk allocation system uses a nested loop approach: "for each item to be allocated, loop through the outlets it is allocated to and create a new row in the allocations table". This tends to make the records for a specific item across many outlets land in the same database block. Coupled with a long transaction time and lots of row expansion when items are sold for the first time, we get fairly full data blocks running out of space, with no room to create the needed number of interested transaction list (ITL) slots.
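The polling update described above can be sketched as follows. This is my illustration of the pattern, not the customer's actual code; the table and column names (item_allocations, sales and so on) are assumptions:

```sql
-- Hedged sketch of the polling update that expands each row.
UPDATE item_allocations a
SET    a.lastvalidated = SYSDATE,
       a.lastsolddate  = (SELECT MAX(s.sale_datetime)
                          FROM   sales s
                          WHERE  s.outlet_id    = a.outlet_id
                          AND    s.stockitem_id = a.stockitem_id)
WHERE  a.outlet_id = :polled_outlet;
-- Each affected row grows from ~10 bytes (two NUMBER keys, NULL dates)
-- to ~24 bytes (two 7-byte DATE values added), eating the block's free space.
```

With four such updates running concurrently against rows packed into the same blocks, and only one ITL slot pre-allocated per block, two sessions can each hold a slot in a block the other needs, which is exactly the block-level deadlock the trace shows.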
We could relieve some of our pain by rebuilding the table, but that is only an interim fix, as the next round of bulk allocations to a large brand would trigger the same behaviour. We could instead change the PCTFREE value to make the blocks 'less full', so that there is room for the massive row expansion. In fact this is perhaps the simplest fix short of lifting the hood and fixing the code.
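That interim fix might look something like this; the table and index names are illustrative, and the values 30 and 8 are example settings rather than the ones used on the original system:

```sql
-- Rebuild the table with more free space per block (room for row growth)
-- and more pre-allocated ITL slots (room for concurrent transactions).
ALTER TABLE item_allocations MOVE PCTFREE 30 INITRANS 8;

-- Moving a table invalidates its indexes, so rebuild them afterwards.
ALTER INDEX item_allocations_pk REBUILD;
```

Note that PCTFREE and INITRANS changed this way only apply to newly formatted blocks, which is why the MOVE (a full rebuild) is needed rather than a plain ALTER TABLE on its own.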
The moral of the story, though, is that you need to consider how tables are updated, and how the sizes of rows might change, before you decide to keep PCTFREE at its default. If possible, also consider how concurrent updates might cause contention within a data block.