Last month I presented two talks at the UKOUG Tech 18 conference in Liverpool. In my first talk I discussed whether the star schema is a dead concept now we are in a world of multiple sources, event streams and status change tracking. I also attended a short talk on data warehousing “design mistakes” presented by David Kurtz of Accenture Enkitec. Now here is an interesting dichotomy between my (not necessarily Oracle) and David’s Oracle example world; the use of surrogate keys.
Surrogate keys have long been part of the Kimball inspired methodology. They provide unique identifiers for each version of a dimensional row. These identifiers [should] have no business meaning; users would not typically know that 424242 is a can of tomato soup as of Jan 1 2010; this forces the user to use joins on the surrogate key between the dimension and the fact. Surrogates keys also do other things that seem beneficial – for example they present a single column (almost certainly numeric) unique key that might otherwise be a multiple column key.
Creating a new numeric surrogate key is trivial if we can use a sequence or an identity column. However, this also gives us complexities in the ETL processing around looking up the correct key value to use on a fact table (we need a look up on the business key and the date) and around change detection and record closure on the slowly changing type 2 dimensions. These problems become more complex if we move to a streaming paradigm and either need to generate new keys at speed or find the correct keys for a fact, a problem worsened by the lack of identity values and sequences in some architectures. As an aside Frank Pachot recently blogged about why GUIDs or UUIDs are not good as unique identifiers in an Oracle database.
David Kurtz contended (and supported with query plans) that a unique, single column, dimension key would give far better performance than a key based on the business key combined with an effective date range. He reasoned that the non-equality date range predicate inhibited the query optimization benefits of a defined PK-FK relationship and, significantly, prevented the use of Bloom filters to reduce the amount of data being processed. One question in my mind was whether the Oracle Temporal Validity feature allowed us to use versioned dimensions joining to the fact on the business key. Not knowing meant it was research time.
Temporal Validity was a feature introduced in Oracle 12.1. Put simply it allows us to tell the database that we are tracking valid from and valid to dates on a table. This allows us to write queries using the AS OF clause and avoid writing messy looking date filters in our queries.
For my testing I used an Oracle 18 XE VM with the SH schema installed – – XE with partitions, nice!
/* create a temporal validity copy of SH.PRODUCTS */
CREATE TABLE "SH"."PRODUCTS_TV"
( "PROD_ID" NUMBER(6,0),
"PROD_NAME" VARCHAR2(50 BYTE),
"PROD_DESC" VARCHAR2(4000 BYTE),
"PROD_SUBCATEGORY" VARCHAR2(50 BYTE),
"PROD_SUBCATEGORY_DESC" VARCHAR2(2000 BYTE),
"PROD_CATEGORY" VARCHAR2(50 BYTE),
"PROD_CATEGORY_DESC" VARCHAR2(2000 BYTE),
"PROD_UNIT_OF_MEASURE" VARCHAR2(20 BYTE),
"PROD_PACK_SIZE" VARCHAR2(30 BYTE),
"PROD_STATUS" VARCHAR2(20 BYTE),
"PROD_TOTAL" VARCHAR2(13 BYTE),
"PROD_VALID" VARCHAR2(1 BYTE),
PERIOD FOR PROD_VALID_TIME (PROD_EFF_FROM, PROD_EFF_TO)
CREATE INDEX "SH"."PRODUCTSTV_PK" ON "SH"."PRODUCTS_TV" ("PROD_ID") ;
ALTER TABLE "SH"."PRODUCTS_TV" ADD CONSTRAINT "PRODUCTSTV_PK" PRIMARY KEY ("PROD_ID")
USING INDEX ENABLE NOVALIDATE;
insert /*+APPEND*/ into SH.products_tv (select * from SH.products);
UPDATE "SH"."PRODUCTS_TV" SET PROD_EFF_TO = TO_DATE('2019-01-04 00:00:00', 'YYYY-MM-DD HH24:MI:SS') WHERE PROD_ID=13;
select count(1) from sh.products_tv AS OF PERIOD for prod_valid_time sysdate;
select count(1) from sh.products_tv;
So far so good! Now let’s insert a new version of the product with PROD_ID=13
INSERT INTO "SH"."PRODUCTS_TV" (PROD_ID, PROD_NAME, PROD_DESC, PROD_SUBCATEGORY, PROD_SUBCATEGORY_ID, PROD_SUBCATEGORY_DESC, PROD_CATEGORY, PROD_CATEGORY_ID, PROD_CATEGORY_DESC, PROD_STATUS, PROD_LIST_PRICE, PROD_MIN_PRICE, PROD_TOTAL, PROD_TOTAL_ID, PROD_EFF_FROM) VALUES ('13', 'change', 'change', 'change', '2044', 'dgf', 'fghgf', '204', 'gfg', 'STATUS', '0', '0', 'TOTAL', '1', TO_DATE('2019-01-04 00:00:00', 'YYYY-MM-DD HH24:MI:SS'))
ORA-00001: unique constraint (SH.PRODUCTSTV_PK) violated
OPPS!! It seems that we can’t have multiple versions of the same PK. There seems to be 3 options around this:
- We don’t use primary keys on the dimension
- We use a compound primary key of the business key and the effective from date
- We use a surrogate key for the PK and join key
From David’s talk it would seem that the first two options are not great ideas in terms of query performance. The final option is not really any different to the original design except we get the “benefit” of being able to write AS OF PERIOD queries.
Looking at query plan I captured for the simple count as of query we see that query we execute uses the PROD_EFF_FROM and PROD_EFF_TO columns and nothing special or fancy, It’s here where I should have looked at Tim’s blog first!
Tim explains that Temporal Validity uses query rewrite under the covers to map AS OFF to the same style of query as we used to write on effective date ranges.
So my question to David about whether Temporal Validity alters his findings on the use of surrogate keys in DW queries is answered by “Nothing to see, move along”
More on surrogates in a future blog.