Partitioning for query performance

Not so long ago over on Twitter (remember that platform?) Tim Hall (Mr Oracle Base) asked about resources on the ‘whys’ around partitioning. Tim, of course, knows all the syntax details; his question was more about when to partition and which type to use. I mentioned some draft writing I had for a never-to-be-completed book on partitioning. I have now converted some of that material into a series of blogs. This is the first part.

For a long while my professional database world has been centred on business intelligence and reporting. When I started out, the world of data warehousing and reporting was alien to many DBAs and developers; long running queries and full table scan access were Dark Arts worthy of any evil wizard. The world has moved on; many of the strange things I did with data seem to be normal in today’s database environments: set-based operations such as unions and intersections, analytic functions, bitmap indexing and partitioning.

It is partitioning I will look at in these blogs. As usual, I am mainly talking about Oracle partitioning; other vendors offer similar features, which I might mention along the way.

What is partitioning?

Partitioning is the name we give to dividing a database object into one or more pieces based on a partition key. To most people the database object is a table; however, indexes may also be partitioned.

Partitioning can be either horizontal (row-wise) or vertical (column-wise).

The simplest implementation of a vertical partition uses INNER JOINs between two or more tables that share the same primary key. Typically, we would use this technique to create a slim dimension table and move large or infrequently accessed attributes such as LOB columns to a “side” table. Some database vendors offer columnar storage (the extreme implementation of a vertical partition with one partition per table column). 
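As a rough sketch of this hand-rolled vertical partitioning (the CUSTOMER_DIM and CUSTOMER_DETAIL tables and their columns are invented for illustration):

    -- Slim dimension table holding the frequently queried attributes
    CREATE TABLE customer_dim (
      customer_id   NUMBER PRIMARY KEY,
      customer_name VARCHAR2(100),
      state_code    VARCHAR2(2)
    );

    -- "Side" table holding the large, rarely accessed attributes
    CREATE TABLE customer_detail (
      customer_id     NUMBER PRIMARY KEY
                      REFERENCES customer_dim (customer_id),
      signed_contract BLOB,
      customer_notes  CLOB
    );

    -- Reassemble the full row only when the wide columns are needed
    SELECT d.customer_name, x.customer_notes
    FROM   customer_dim d
           INNER JOIN customer_detail x ON x.customer_id = d.customer_id;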

Partitioning is different from sharding. In a sharded table the content is distributed across multiple database instances, whereas in a partitioned table the partitions are in a single instance. There is a possible exception where “partitions” are in multiple databases (perhaps using database links) but that is quite uncommon. Like partitioning, sharding gives a choice between vertical sharding (where the columns are distributed across multiple databases) and horizontal sharding (where whole rows are distributed).

An often forgotten point is that the minimum number of partitions is one; I will give a use case for single-partition tables a bit later on. However, for now, let’s talk about straightforward horizontally partitioned tables.

The partition key value determines into which partition a row is inserted. Typically, we use a single column of the table as the key column and apply rules to map the key value from the table to a physical partition. In the early days of partitioning we used an in-built hash algorithm to reduce the key value to one of a small number of “buckets” (similar to the results of the ntile function) or mapped a range of values to each partition. Other methods to assign a row to a partition came in subsequent versions of the Oracle database. List partitioning is the other cornerstone of partitioning, here the actual key values are mapped to specific individual partitions. The other partitioning methods are variations on the methods already mentioned. In Oracle it is also possible to further partition partitions into sub-partitions.

Let’s consider a huge CLOTHING_ITEMS table; we might choose to partition it on item colour so that each colour has its own partition.

 

[Figure: the unpartitioned CLOTHING_ITEMS table alongside the same data split into one partition per colour]

Instead of the single table on the left, we create a partitioned table composed of multiple partitions, one for each colour.
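A minimal sketch of that colour-partitioned table (the other column names and the colour values are invented for illustration):

    CREATE TABLE clothing_items (
      item_id   NUMBER,
      item_name VARCHAR2(100),
      colour    VARCHAR2(20)
    )
    PARTITION BY LIST (colour)
    ( PARTITION p_red   VALUES ('RED'),
      PARTITION p_blue  VALUES ('BLUE'),
      PARTITION p_green VALUES ('GREEN'),
      PARTITION p_other VALUES (DEFAULT)  -- catch-all for any other colour
    );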

Why partition?

Ignoring single partition tables, partitioning divides a large table into smaller table-like chunks. These chunks inherit the column structure of the parent table but can have their own space management. Each partition:

  • can have its own tablespace (and hence be on different storage).
  • can have differing values for PCTFREE and other space usage parameters.
  • can have its own table compression setting.
  • can be set to read only.
  • can be recovered separately.
  • is accessed transparently in queries (we need not specify the partition name explicitly in a query).
  • can be skipped entirely by queries that filter on the partition key when it holds none of the data of interest.

Some of the items in that list, compression, read-only status and separate recovery in particular, only really work for us if the data in those partitions is not changeable. For example, if we partition on sales date it is quite likely that the sales partition for May 2020 will no longer be subject to update or insert.
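As an illustrative sketch of some of these per-partition settings (the tablespace names are invented, and read-only partitions need Oracle 12.2 or later):

    CREATE TABLE sales (
      sale_id   NUMBER,
      sale_date DATE,
      amount    NUMBER(10,2)
    )
    PARTITION BY RANGE (sale_date)
    ( PARTITION sales_2019 VALUES LESS THAN (DATE '2020-01-01')
        TABLESPACE ts_cold COMPRESS,
      PARTITION sales_2020 VALUES LESS THAN (DATE '2021-01-01')
        TABLESPACE ts_cold COMPRESS,
      PARTITION sales_2021 VALUES LESS THAN (DATE '2022-01-01')
        TABLESPACE ts_hot PCTFREE 10
    );

    -- Older, static partitions can be locked down once loading is complete
    ALTER TABLE sales MODIFY PARTITION sales_2019 READ ONLY;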

Partitioning was originally promoted as a way to improve the management and recovery of large tables, but it was soon realised that partitioning was also important in improving query performance. The PCTFREE and data compression settings can boost query performance as fewer database blocks need be accessed to read the same set of data, however, it is when we query on the partition key that we have the potential for even more improvement. Querying on the key will only access the partitions that contain matching key values.
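Assuming the range-partitioned SALES table sketched above, a query that filters on the partition key lets the optimizer prune partitions; the Pstart and Pstop columns of the execution plan show which partitions were actually touched:

    -- Only the partition covering May 2020 needs to be scanned
    SELECT SUM(amount)
    FROM   sales
    WHERE  sale_date >= DATE '2020-05-01'
    AND    sale_date <  DATE '2020-06-01';

    -- Check the pruning: look at the Pstart and Pstop columns of the plan
    EXPLAIN PLAN FOR
      SELECT SUM(amount)
      FROM   sales
      WHERE  sale_date >= DATE '2020-05-01'
      AND    sale_date <  DATE '2020-06-01';

    SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);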

Choice of partition type and key

For query performance we are looking for a partitioning scheme that minimises the number of partitions we visit in a query and possibly allows us to make best use of parallel processing. This means that the partition key should be a predicate in every query we run against the partitioned table. If it is not, we would need to access all the partitions to find the information we need.

Another thing to consider is that for most use cases the partition key value should be unchangeable. If a change of key value alters the partition to which the data belongs, the row has to either migrate to the correct partition or remain in the wrong place. In my opinion, for data warehouses and fact stores the partition key should not change value. There may be a use in a more transactional system for partitioning on a status or activity such as a “processed flag” and allowing row movement to put the data in the correct partition (more on this in a later blog).
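By default Oracle raises an error (ORA-14402) if an update would move a row to a different partition; row movement has to be enabled explicitly. A quick sketch, using an invented WORK_QUEUE table:

    -- Hypothetical table partitioned on a processing status
    CREATE TABLE work_queue (
      item_id NUMBER,
      status  VARCHAR2(10)
    )
    PARTITION BY LIST (status)
    ( PARTITION p_new       VALUES ('NEW'),
      PARTITION p_processed VALUES ('PROCESSED')
    );

    -- Without this, updating STATUS from NEW to PROCESSED fails with ORA-14402
    ALTER TABLE work_queue ENABLE ROW MOVEMENT;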

Let’s consider partitioning our SALES table by STORE_STATE, that is, the state in which the item was sold. Let’s assume we sell in each of the 50 US states and we keep the store’s state on the fact table (a big ask!). We have three fundamental ways to partition: Hash, Range and List.

HASH: if we create 4 hash partitions each partition will hold roughly 12 or 13 (50/4 = 12.5) states. Sales from (yet-to-be-state) Washington DC will most likely go into one of the ’12 states’ partitions. Fundamentally this seems a good way of dividing the data up. However, there are two data-driven features that might make this less good. Firstly, there is data distribution; some states like Alaska may have far fewer sales than, say, California, and this could dramatically skew partition size if chance puts all the lower sales states in the same partition. Secondly, it is likely that regional queries will hit many or even all of the partitions, which may well bring unnecessary IO scanning the data.
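A minimal sketch of the hash option (column names other than STORE_STATE are invented for illustration):

    -- Hash partitioning SALES on store_state into 4 partitions
    CREATE TABLE sales_by_state_hash (
      sale_id     NUMBER,
      sale_date   DATE,
      store_state VARCHAR2(20),
      amount      NUMBER(10,2)
    )
    PARTITION BY HASH (store_state)
    PARTITIONS 4;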

RANGE: this allows us to partition the data on key value, each partition holding states less than the upper bound defined for it. For example, all states alphabetically before California go into one partition. We can, by choice of range bounds, get a more even distribution of data, but we still have a potential problem with multi-partition access; for example Alaska and Alabama might be in the same partition but the West Coast states of California, Oregon and Washington are not. It is also possible to define a catch-all partition (VALUES LESS THAN (MAXVALUE)) to receive data that falls above all of the defined range bounds.
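A sketch of the range option; the bounds here are purely illustrative:

    CREATE TABLE sales_by_state_range (
      sale_id     NUMBER,
      sale_date   DATE,
      store_state VARCHAR2(20),
      amount      NUMBER(10,2)
    )
    PARTITION BY RANGE (store_state)
    ( PARTITION p_a_to_b VALUES LESS THAN ('CALIFORNIA'),
      PARTITION p_c_to_l VALUES LESS THAN ('MONTANA'),
      PARTITION p_rest   VALUES LESS THAN (MAXVALUE)  -- catch-all partition
    );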

LIST: this allows us to create partitions that contain our own choice of content; for example we may wish to group our West Coast states into a single region partition. Again, it is possible to create a catch-all DEFAULT partition to cope with unlisted keys.
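A sketch of the list option with a regional grouping (the regions chosen are just for illustration):

    CREATE TABLE sales_by_state_list (
      sale_id     NUMBER,
      sale_date   DATE,
      store_state VARCHAR2(20),
      amount      NUMBER(10,2)
    )
    PARTITION BY LIST (store_state)
    ( PARTITION p_west_coast VALUES ('CALIFORNIA', 'OREGON', 'WASHINGTON'),
      PARTITION p_mountain   VALUES ('NEVADA', 'IDAHO', 'UTAH', 'ARIZONA'),
      PARTITION p_other      VALUES (DEFAULT)  -- catch-all for unlisted states
    );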

List partitions seem to give us the ability to have some control of partition size and allow us to exploit partition elimination in our queries. They are almost certainly not the right answer for a sales data warehouse. There are two elephants in the list partition room.

  • it would not be usual to have a higher level of aggregation key (state) on a low-level fact table that already carries a STORE column. If we did this we would need to force the reporting tools to always add STATE to every query so that we can get partition elimination
  • and worse, the partitions are storing sales for all of time.

Maybe we are better off partitioning by date of sale. Let’s discount using a hash partition as that gives no advantage for things like compression (inserts will be spread round-robin across the partitions as the date increases); instead we have the options of LIST and RANGE. This time a LIST is a poor choice for the simple reason of having to maintain a list of every possible date (or, horror upon horror, every possible DATE+TIME). Range is far simpler to maintain, either as DBA tasks to add new partitions or by using the interval partitioning functionality that automatically adds new partitions when required. More on intervals in the next blog post.
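A minimal sketch of a date range-partitioned SALES table; the INTERVAL clause (more on this next time) tells Oracle to create new monthly partitions automatically as data arrives:

    CREATE TABLE sales_by_date (
      sale_id     NUMBER,
      sale_date   DATE,
      store_state VARCHAR2(20),
      amount      NUMBER(10,2)
    )
    PARTITION BY RANGE (sale_date)
    INTERVAL (NUMTOYMINTERVAL(1, 'MONTH'))
    ( PARTITION p_before_2020 VALUES LESS THAN (DATE '2020-01-01')
    );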

What about partitioning on more than one column?

Before we had a more complete range of partitioning styles there were two significant reasons why people did this. The first reason was that two columns needed to be combined to identify something; for example, in an ERP system the date might be stored in a year column and a period column, so partitioning on these two columns will ensure that 2018-P3 goes in a different partition to 2020-P8. A second use was to ensure that specific combinations of data were associated with individual partitions, for example year 2019 sales for a specific store are in a different partition to those for another store. Our ability in recent Oracle databases to partition on virtual columns, and the ability to use a variety of sub-partitioning styles, make this use less common.
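A sketch of the year-plus-period case using multi-column range partitioning (the table, columns and bounds are invented, and the bounds are illustrative rather than a sensible calendar design):

    CREATE TABLE erp_transactions (
      txn_id        NUMBER,
      fiscal_year   NUMBER(4),
      fiscal_period NUMBER(2),
      amount        NUMBER(12,2)
    )
    PARTITION BY RANGE (fiscal_year, fiscal_period)
    ( PARTITION p_2018_h1 VALUES LESS THAN (2018, 7),  -- up to 2018 period 6
      PARTITION p_2018_h2 VALUES LESS THAN (2019, 1),  -- rest of 2018
      PARTITION p_max     VALUES LESS THAN (MAXVALUE, MAXVALUE)
    );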

Next time I will delve into DATE/TIMESTAMP partitioning and sub-partitions in more detail, but first another historical aside.

Before Oracle 8i there was a precursor to partitioning called partition views. Here we create a physical table for each “partition” and a UNION ALL view over them for query access. Each table in the view has its own check constraints to implement the partition key rules; alternatively the rules can be hardcoded into the WHERE clauses of the view’s branches. The optimizer is aware of these view constructions and will not access the tables that cannot contain data matching the query. This feature required some database parameters to be set.
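A rough sketch of the partition view approach (table and view names invented; as I recall the enabling parameter was PARTITION_VIEW_ENABLED):

    -- One physical table per "partition", each with its own check constraint
    CREATE TABLE sales_1997 (
      sale_id   NUMBER,
      sale_date DATE,
      amount    NUMBER(10,2),
      CONSTRAINT ck_sales_1997 CHECK (sale_date >= DATE '1997-01-01'
                                  AND sale_date <  DATE '1998-01-01')
    );

    CREATE TABLE sales_1998 (
      sale_id   NUMBER,
      sale_date DATE,
      amount    NUMBER(10,2),
      CONSTRAINT ck_sales_1998 CHECK (sale_date >= DATE '1998-01-01'
                                  AND sale_date <  DATE '1999-01-01')
    );

    -- The "partition view" that queries actually use
    CREATE OR REPLACE VIEW sales_pv AS
      SELECT * FROM sales_1997
      UNION ALL
      SELECT * FROM sales_1998;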
