A while back I published a small blog on using Oracle SQLDeveloper to create quick data model diagrams. In that blog I said that the data model creation process depended on the presence of parent child relationships with supporting foreign key relationships, that is the presence of the FK declares the relationship to the modelling engine. I also mentioned that when FK relationships don’t exist it is not possible to automatically reverse out the table relationships. For source systems (such as OLTP programs) a lack of constraints often come from one of two perspectives: Our product is database agnostic, not all database vendors implement constraints the same way, therefore we go with the lowest common functionality; the other reason for no constraints is that “database is just a persistence layer” and all data integrity is enforced in the application.
What about in data warehouses? Do we use constraints or not? I created a Twitter poll that received 42 votes. As we all know 42 implies we have the true answer to the ultimate question.
To be honest, I could have done with some more Twitter votes to give more statistically meaningful results however the results are fairly clear; people use constraints in their data warehouses. I could also do with more Twitter followers too!
Firstly, let’s consider the reasons why not.
No Need, My Data is Good
Congratulations, you are in a minority! Maybe your data is good, though in over 20 years of doing BI I don’t recall a project with perfect source data; maybe your data cleanse and ETL process creates perfection from the source, but can you really be sure that some data anomaly or process orchestration error will not corrupt your perfect, but constraint free, data warehouse.
No Need, They Slow the Database
I can see where that notion comes from, however I think it is a wrong-headed idea. It’s true that active primary and foreign key constraints have indexes lurking in the background and that maintaining those indexes during bulk ETL can reduce ETL throughput. It is also true that foreign keys can block actions such as table truncation. However, in my view, we design our ETL processes to work with constraints, not get rid of constraints. I’ll come back to reasons a little later on
What have Constraints ever done for us?
85% of the poll respondents use some form of constraint in the data warehouse, and from some of the poll comments posted, for good reason. Here I am going to duck into the Oracle database world as not all vendors have constraints that do exactly the same things – for example, some vendors have declarative but unenforced foreign keys, so some of the features of Oracle FKs are lost.
- Enabled database constraints prevent data not matching our rules from going into the database; inserts and updates that fail the rules get blocked and, for FK relationships, deletes might also be prevented. Simplistically, we have three kinds of rule: the row keys must be unique in a table (primary and unique constraints), the column value must exist in another table (the foreign key constraint) and the value inserted much match a rule (the check constraint), NOT NULL is just a variant of the check constraint.Similarly, primary keys constraints are a special variant of the unique constraint (UNIQUE and NOT NULL). In the Oracle world unique and primary key constraints are validated by b-tree indexes – they need not be created as unique indexes.
- Constraints describe data properties to the query optimizer. True, table and index statistics inform the optimizer about the quantity of data involved (and in the case of histograms and extended statistics detail on how the data is distributed and any inter-column correlations within it) but it is the constraints that finesse that information so the optimizer knows that for each value for a column in TABLE_1 there is exactly one row in TABLE_2 for the same key. The more the optimizer knows about the data, the better to query plan generated is likely to be.
Point 1 is all about the quality of the data we are storing; the truth of our data warehouse. In other words the ETL. Conversely point two is all about the query performance and our ability to access some DWH specific query optimizations. In fact the second point is really about declarative constraints and not necessarily enforced constraints as we shall see in a moment.
A second poll around the same time asked about whether people used enabled or declarative constraints. Not so many responses here, but the majority of respondents are using some form of constraint in the query layer
Oracle constraints have 3 properties that can be independently set: VALIDATION, BELIEF and ENFORCEMENT. Enforcement is the checking that new data conforms to the constraint, validation on the other hand is concerned with data already in the table. Asserting belief by using RELY only works if you use NOVALIDATE and this tells optimizer to treat the constraint as if was valid; this is especially useful in some join optimizations such as query rewrite.
Validated constraints require an index behind it – one point to note is that a validated but disabled unique constraint backed by a unique index will still block duplicate data inserts, not through a constraint violation but through duplicate index keys.
If constraints are so important to ensure data validity in the data load process why do we think that using declarative RELY constraints are such a benefit in the reporting/analytic layers? The reason here is related to performance on data load and the doubtful benefit of unique indexes on massive fact tables. In our data warehouse each row of the fact table is (or should be) unique; for a sales table we have a product (what), customer (who), store (where), and a date/time (when), this information uniquely defines a sales line transaction (the quantity and values being measures). An enforced and validated constraint on these four columns is likely to require a huge four column composite index. updating it for each sales record insert is an ETL performance overhead and gives us no query benefit as people never need single row analytic access; invariably they want to aggregate and aggregating implies that at least one dimension is not known and thus the composite index is perhaps not the right choice.
RELY can be a risky option if we lie about the validity of our data, but with the correct and rigorous ETL design we should be sate to use RELY in the analytic layer. This rigour requires us to stage and validate our data before publishing it. Foreign keys on facts are simple to validate in the key lookup chain process, that we found the key in our validated dimension table is an implicit FK validation. Row key uniqueness can only be violated by loading something twice – we can prevent this in staging by building in some check process before data publication .