Partitioning for query performance

Not so long ago over on Twitter (remember that platform?) Tim Hall (Mr Oracle Base) asked about resources on the ‘whys’ around partitioning. Tim, of course, knows all the syntax details; his question was more about when to partition and which type to use. I mentioned some draft writing I had for a never-to-be-completed book on partitioning. I have now converted some of that material into a series of blogs. This is the first part.

For a long while my professional database world has been centred on business intelligence and reporting. When I started out, the world of data warehousing and reporting was alien to many DBAs and developers; long-running queries and full table scan access were Dark Arts worthy of any evil wizard. The world has moved on; many of the strange things I did with data seem to be normal in today’s database environments: features such as set-based operations (unions and intersections), analytic functions, bitmap indexing and partitioning.

It is partitioning I will look at in these blogs. As usual, I am mainly talking about Oracle partitioning, other vendors also offer similar features which I might mention.

What is partitioning?

Partitioning is the name we give to dividing a database object into one or more pieces based on a partition key. To most people the database object is a table; however, indexes may also be partitioned.

Partitioning can be either horizontal (row-wise) or vertical (column-wise).

The simplest implementation of a vertical partition uses INNER JOINs between two or more tables that share the same primary key. Typically, we would use this technique to create a slim dimension table and move large or infrequently accessed attributes such as LOB columns to a “side” table. Some database vendors offer columnar storage (the extreme implementation of a vertical partition with one partition per table column). 
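For example, a minimal sketch of that slim-dimension pattern (table and column names here are invented for illustration):

CREATE TABLE product_dim (
  product_key  NUMBER PRIMARY KEY,
  product_name VARCHAR2(100),
  brand        VARCHAR2(50)
);

-- bulky, rarely used attributes live in a "side" table sharing the same key
CREATE TABLE product_dim_ext (
  product_key NUMBER PRIMARY KEY REFERENCES product_dim (product_key),
  long_desc   CLOB,
  spec_sheet  BLOB
);

-- reunite the two halves only when the wide columns are actually needed
SELECT d.product_key, d.product_name, e.long_desc
FROM   product_dim d
INNER JOIN product_dim_ext e ON e.product_key = d.product_key;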

Partitioning is different from sharding. In a sharded table the content is distributed across multiple database instances, whereas in a partitioned table the partitions live in a single instance. There is a possible exception where “partitions” are in multiple databases (perhaps using database links) but that is quite uncommon. Like partitioning, sharding gives a choice between vertical sharding (where the columns are distributed across multiple databases) and horizontal sharding, where whole rows are distributed.

An often forgotten point is that the minimum number of partitions is one; I will give a use case for single-partition tables a bit later on. However, for now, let’s talk about straightforward horizontally partitioned tables.

The partition key value determines into which partition a row is inserted. Typically, we use a single column of the table as the key column and apply rules to map the key value from the table to a physical partition. In the early days of partitioning we used an in-built hash algorithm to reduce the key value to one of a small number of “buckets” (similar to the results of the ntile function) or mapped a range of values to each partition. Other methods to assign a row to a partition came in subsequent versions of the Oracle database. List partitioning is the other cornerstone of partitioning; here the actual key values are mapped to specific individual partitions. The remaining partitioning methods are variations on the ones already mentioned. In Oracle it is also possible to further partition partitions into sub-partitions.

Let’s consider a table of CLOTHING_ITEMS; we might choose to partition this huge table on item colour so that each colour has its own partition.


Instead of a single monolithic table we create a partitioned table composed of multiple partitions, one for each colour, as in the sketch below.
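A minimal sketch of what that might look like (column names and the colour list are invented):

CREATE TABLE clothing_items (
  item_id   NUMBER,
  item_name VARCHAR2(100),
  colour    VARCHAR2(20),
  price     NUMBER(8,2)
)
PARTITION BY LIST (colour)
( PARTITION p_red   VALUES ('RED')
, PARTITION p_green VALUES ('GREEN')
, PARTITION p_blue  VALUES ('BLUE')
, PARTITION p_other VALUES (DEFAULT)
);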

Why partition?

Ignoring single-partition tables, partitioning divides a large table into smaller table-like chunks. These chunks inherit the column structure of the parent table, but can have their own space management. Each partition

  • can have its own tablespace (and hence sit on different storage).
  • can have differing values for PCTFREE and other space usage parameters.
  • can have its own table compression setting.
  • can be set to read only.
  • can be recovered separately.
  • is accessed transparently in queries (we need not specify the partition name explicitly in a query).
  • is only accessed by queries whose partition key filters match the data it contains (partition pruning).

Items in that list such as compression, read only and separate recovery only really work for us if the data in those partitions is no longer changing. For example, if we partition on sales date it is quite likely that the sales partition for May 2020 will no longer be subject to updates or inserts, so we can treat it along the lines sketched below.
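A few illustrative statements against an assumed SALES table range-partitioned by month (partition, column and tablespace names are invented):

-- move the closed May 2020 partition to cheaper storage and compress it
ALTER TABLE sales MOVE PARTITION sales_2020_05 TABLESPACE cold_data COMPRESS;

-- from 12.2 onwards a partition can also be marked read only
ALTER TABLE sales MODIFY PARTITION sales_2020_05 READ ONLY;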

Partitioning was originally promoted as a way to improve the management and recovery of large tables, but it was soon realised that partitioning was also important in improving query performance. The PCTFREE and data compression settings can boost query performance as fewer database blocks need to be accessed to read the same set of data; however, it is when we query on the partition key that we have the potential for even greater improvement. Querying on the key will only access the partitions that contain matching key values, for example:
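Against the same assumed monthly-partitioned SALES table, a query like this should touch only the May 2020 partition (the PSTART/PSTOP columns of the execution plan confirm the pruning):

SELECT SUM(amount)
FROM   sales
WHERE  sale_date >= DATE '2020-05-01'
AND    sale_date <  DATE '2020-06-01';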

Choice of partition type and key

For query performance we are looking for a partitioning scheme that minimises the number of partitions we visit in a query and possibly allows us to make best use of parallel processing. This means that the partition key should be a predicate in every query we run against the partitioned table; if it is not, we would need to access all the partitions to find the information we need.

Another thing to consider is that for most use cases the partition key value should be unchangeable. If a change to the key value alters the partition to which the data belongs, the row has to either migrate to the correct partition or remain in the wrong place. In my opinion, for data warehouses and fact stores the partition key should not change value. There may be a use in a more transactional system for partitioning on a status or activity such as a “processed flag” and allowing row movement to put the data in the correct partition (more on this in a later blog).

Let’s consider partitioning our SALES table by STORE_STATE, that is, the state in which the item was sold. Let’s assume we sell in each of the 50 US states and we keep the store’s state on the fact table (a big ask!). We have three fundamental ways to partition: HASH, RANGE and LIST.

HASH: if we create 4 hash partitions each will hold either 12 or 13 (50/4 = 12.5) states. Sales from (yet-to-be-state) Washington DC will most likely go into one of the ’12 states’ partitions. Fundamentally this seems a good way of dividing the data up. However, there are two data-driven features that might make this less good. Firstly, there is data distribution; some states like Alaska may have far fewer sales than, say, California, and this could dramatically skew partition size if chance puts all the lower-sales states in the same partition. Secondly, it is likely that regional queries will hit many or even all of the partitions, which may well bring unnecessary IO scanning the data.

RANGE allows us to partition the data on key value, each partition holding state values less than its upper bound. For example, all states alphabetically less than California go into one partition. We can, by choice of range bounds, get a more even distribution of data, but we still have a potential problem with multi-partition access; for example Alaska and Alabama might be in the same partition but the West Coast states of California, Oregon and Washington are not. It is also possible to define a catch-all partition (VALUES LESS THAN (MAXVALUE)) to receive data that does not match any of the range bounds.

LIST allows us to create partitions that contain our own choice of partition content; for example we may wish to group our West Coast states into a single region partition. Again, it is possible to create a catch-all DEFAULT partition to cope with unlisted keys. Sketches of all three approaches follow.
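Hedged sketches of the three options for the STORE_STATE key (table names, state groupings and partition counts are purely illustrative, not a complete scheme):

-- HASH: states fall into buckets according to the hash of the key
CREATE TABLE sales_by_state_hash (
  store_state VARCHAR2(2),
  amount      NUMBER
)
PARTITION BY HASH (store_state) PARTITIONS 4;

-- RANGE: alphabetic bands of states, with a catch-all top partition
CREATE TABLE sales_by_state_range (
  store_state VARCHAR2(2),
  amount      NUMBER
)
PARTITION BY RANGE (store_state)
( PARTITION p_a_to_c VALUES LESS THAN ('D')
, PARTITION p_d_to_m VALUES LESS THAN ('N')
, PARTITION p_rest   VALUES LESS THAN (MAXVALUE)
);

-- LIST: our own groupings, e.g. a West Coast region, plus a DEFAULT catch-all
CREATE TABLE sales_by_state_list (
  store_state VARCHAR2(2),
  amount      NUMBER
)
PARTITION BY LIST (store_state)
( PARTITION p_west_coast VALUES ('CA','OR','WA')
, PARTITION p_north_east VALUES ('NY','NJ','CT','MA')
, PARTITION p_other      VALUES (DEFAULT)
);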

List partitions seem to give us the ability to have some control of partition size and allow us to exploit partition elimination in our queries. They are almost certainly not the right answer for a sales data warehouse. There are two elephants in the list partition room.

  • it would not be usual to have a higher-level aggregation key (state) on a low-level fact that already has a STORE column. If we did this we would need to force the reporting tools to always add STATE to every query so that we get partition elimination;
  • and worse, the partitions store sales for all of time.

Maybe we are better off partitioning by date of sale. Let’s discount using a hash partition, as it gives no advantage for things like compression (the inserts into the table will spread round-robin as the date increases), which leaves us with the options of LIST and RANGE. This time LIST is a poor choice for the simple reason of having to maintain a list of every possible date (or, horror upon horror, every possible DATE + TIME). RANGE is far simpler to maintain, either as DBA tasks to add new partitions or by using the interval partitioning functionality that automatically adds new partitions when required, as sketched below. More on intervals in the next blog post.
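A sketch of the range option with the interval clause doing the partition housekeeping automatically (names and the starting bound are assumptions):

CREATE TABLE sales_by_date (
  sale_date   DATE NOT NULL,
  store_state VARCHAR2(2),
  amount      NUMBER
)
PARTITION BY RANGE (sale_date)
INTERVAL (NUMTOYMINTERVAL(1, 'MONTH'))
( PARTITION p_history VALUES LESS THAN (DATE '2021-01-01') );

Rows arriving with a sale_date beyond the last defined bound cause a new monthly partition to be created automatically.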

What about partitioning on more than one column?

Before we had a more complete range of partitioning styles there were two significant reasons why people did this. The first reason was that two columns needed to be combined to identify something; for example, in an ERP system the date might be stored in a year column and a period column, so partitioning on these two columns ensures that 2018-P3 goes in a different partition to 2020-P8. A second use was to ensure that specific combinations of data were associated with individual partitions, for example year 2019 sales for a specific store are in a different partition to those for another store. Our ability in recent Oracle databases to partition on virtual columns, together with the variety of sub-partitioning styles, makes this use less common.
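A hedged sketch of the virtual-column approach for the year-plus-period case (the ERP column names are invented):

CREATE TABLE erp_ledger (
  fiscal_year   NUMBER(4) NOT NULL,
  fiscal_period NUMBER(2) NOT NULL,
  amount        NUMBER,
  -- virtual column combining the two real columns into a single partition key
  year_period   VARCHAR2(10) GENERATED ALWAYS AS (TO_CHAR(fiscal_year) || '-P' || TO_CHAR(fiscal_period)) VIRTUAL
)
PARTITION BY LIST (year_period)
( PARTITION p_2018_p3 VALUES ('2018-P3')
, PARTITION p_2020_p8 VALUES ('2020-P8')
, PARTITION p_other   VALUES (DEFAULT)
);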

Next time I will delve into DATE/TIMESTAMP partitioning and sub-partitions in more detail, but first another historical aside.

Before Oracle 8i there was a precursor to partitioning called partition views. Here we create a physical table for each “partition” and a UNION ALL view over them for query access. Each table in the view has its own check constraints to implement the partition key rules; alternatively the rules can be hardcoded into the various WHERE clauses of the view. The optimizer is aware of these view constructions and will not access the tables that cannot contain data matching the query. The feature required some database parameters to be set.
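A rough sketch of how a partition view was put together – in that era an init.ora parameter (PARTITION_VIEW_ENABLED, if memory serves) also had to be set; table and column names here are invented:

CREATE TABLE sales_2021_04 (
  sale_date DATE CHECK (sale_date >= DATE '2021-04-01' AND sale_date < DATE '2021-05-01'),
  amount    NUMBER
);

CREATE TABLE sales_2021_05 (
  sale_date DATE CHECK (sale_date >= DATE '2021-05-01' AND sale_date < DATE '2021-06-01'),
  amount    NUMBER
);

-- the "partitioned table" that queries actually see
CREATE OR REPLACE VIEW sales_pv AS
SELECT * FROM sales_2021_04
UNION ALL
SELECT * FROM sales_2021_05;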

Stopping Duplicate Late Arriving Data

This is old stuff (Oracle 9i, if that gives you a time span) and today there are other ways of doing this, both in Oracle databases and with other data transfer strategies such as streaming. There were probably other ways of doing this even then. So reader beware, this might not be of any use to you at all.

Way back in time I led my company’s team of developers and support engineers responsible for our large database customers. Most of these customers had retail data warehouses. One such customer ran a hundred or so membership stores (each transaction was linked to a known customer).

Each night the stores uploaded point of sale basket headers (basket number, where, when and who, plus a few measures such as total transaction value) and also the individual point of sale lines for each basket by item, quantity and value. Every store generated its own basket numbers, which were cyclical; that is, when the maximum number of baskets was reached numbering went back to 1. To make a globally unique basket identifier we would need to combine basket number, store number and transaction date. Customer number is not needed to uniquely identify a basket as each basket can only have one customer. The BASKET_HDR table might look a bit like this

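Something along these lines, with assumed column names and types:

CREATE TABLE basket_hdr (
  basket_number    NUMBER       NOT NULL,  -- cycles per store
  store_number     NUMBER       NOT NULL,
  transaction_date DATE         NOT NULL,
  customer_number  NUMBER       NOT NULL,  -- the known member
  total_value      NUMBER(10,2),
  total_items      NUMBER
);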

and the basket line detail looks a bit like this

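Again a guess at the shape, keyed back to the header:

CREATE TABLE basket_line (
  basket_number    NUMBER       NOT NULL,
  store_number     NUMBER       NOT NULL,
  transaction_date DATE         NOT NULL,
  line_number      NUMBER       NOT NULL,
  item_number      NUMBER       NOT NULL,
  quantity         NUMBER,
  line_value       NUMBER(10,2)
);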

As in the majority of data warehouses, data is taken from the sources, staged, validated and then if all is well, published. The validation includes checking the customers and items sold exist (we had special rules about first-time customers to allow them to shop). Our business goal was to publish as much valid data as possible so that the stakeholders had information available at the start of the working day; some needed total sales, others were interested in product categories and brands, some looked at customers and their sectors, and one group needed to know about supply chain logistics (getting stock to stores). We also reported on rejected and missing data to give as complete a picture as possible.

One data transfer issue we became aware of was the possibility that a store might retransmit a basket of sales a few days later. As the EPOS system was outside our control we would not be able to resolve this at source, so we needed to develop a method to prevent these ‘duplicates’ from being loaded a second time.

Stopping a duplicate from being inserted into a table sounds trivial: use a constraint that blocks duplicate inserts, such as a primary or unique key constraint. However, our target basket header table is partitioned by week and contains billions of rows. Likewise the basket line table is also partitioned and has about 15 times more rows. Maintaining a global index to enforce the key on the basket table is not trivial and, in addition, it wouldn’t be of much benefit for query performance optimisation – the only users accessing raw basket-level data are the data miners and the batch data publishing process user.

Another method commonly used to find duplicate keys is to outer join the just-loaded stage data to the target data and look for non-NULLs from the target where our key values match. Taking this slightly further, we can use a multi-table insert to direct duplicates to a reject table and new data to a second staging table for the regular data validation of keys and in-batch duplicates.


Our analysis of duplicates showed that any duplicate received occurred within one to two weeks of the first load of that basket and never after three weeks. As our raw basket table has about 260 weeks of data we only need to check a small window of it; perhaps we can do something more inventive.

We could make a table that contains the keys for three weeks’ worth of basket headers and use that as the table in the outer join. Doing this we don’t need to replicate any of the customer numbers or measure data, just a simple three-column table.

If we partition this table by transaction date we can create a process to drop the oldest data and create a new partition to receive the next day’s baskets. Remember, partition dropping is far more efficient than deleting by date.

How about if we changed the table from a heap table to an Index Organized Table (IOT)? After all, we are not really interested in any of the attributes of a basket other than the key columns. This would in effect create a B-tree index of the key columns that can be referenced in a join without the need to store a conventional heap table.

CREATE TABLE recent_baskets (
  basket_number    NUMBER NOT NULL,
  store_number     NUMBER NOT NULL,
  transaction_date DATE   NOT NULL,
  CONSTRAINT recent_baskets_pk PRIMARY KEY (basket_number, store_number, transaction_date)
)
ORGANIZATION INDEX
LOGGING
PARTITION BY RANGE (transaction_date)
-- one partition per day of the checking window; bounds shown are illustrative
( PARTITION p_20210222 VALUES LESS THAN (DATE '2021-02-23')
, PARTITION p_20210223 VALUES LESS THAN (DATE '2021-02-24')
, PARTITION p_20210224 VALUES LESS THAN (DATE '2021-02-25')
)
DISABLE ROW MOVEMENT;

As an aside, I often invert thinking about data objects. Here I wanted an index but didn’t need a table (or rather the table contained exactly the same column data as the index, so why store it twice?). As another example: I like partition exchange in data loads, but what can I do for a non-partitioned table? One answer is to create a single-partition partitioned table (perhaps a single hash bucket) to exchange with the conventional target table. We read the documentation as “the contents of a conventional table becoming part of a partitioned table” whereas in reality it is a two-way street; there is nothing wrong with feeding a table with a partition.

Back to our topic.

By creating a recent baskets table we can readily look for duplicated basket headers and, as lines cannot exist without the corresponding header, we can also use the same recent baskets table to remove duplicate basket lines. In both cases we take our source table and left outer join it to RECENT_BASKETS on the three key columns. The result of this select is used in a multi-table insert: if the RECENT_BASKETS keys are null we are good to insert into the staging pipeline; if not, we neatly park the duplicates with the ELSE path of the multi-table insert, for example:
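A sketch of that multi-table insert for the headers; the staging and reject table names (STG_BASKET_HDR, STG_BASKET_HDR_CLEAN, STG_BASKET_HDR_DUPS) are invented for illustration:

INSERT FIRST
  WHEN recent_basket IS NULL THEN
    INTO stg_basket_hdr_clean (basket_number, store_number, transaction_date, customer_number, total_value)
    VALUES (basket_number, store_number, transaction_date, customer_number, total_value)
  ELSE
    INTO stg_basket_hdr_dups (basket_number, store_number, transaction_date, customer_number, total_value)
    VALUES (basket_number, store_number, transaction_date, customer_number, total_value)
SELECT s.basket_number,
       s.store_number,
       s.transaction_date,
       s.customer_number,
       s.total_value,
       rb.basket_number AS recent_basket   -- NULL when the basket has not been seen before
FROM   stg_basket_hdr s
LEFT OUTER JOIN recent_baskets rb
  ON  rb.basket_number    = s.basket_number
  AND rb.store_number     = s.store_number
  AND rb.transaction_date = s.transaction_date;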

When our data processing is complete we need to do a bit of housekeeping to make things ready for the next data load. If our recent baskets table is partitioned by day we drop the oldest partition and create a new one for the next day’s data; if we are using weekly partitions we do this maintenance weekly. Either way, there is an important difference between partitioned heap tables and IOTs – partition information for an IOT is held in the same dictionary tables and views as partitioned indexes, not the usual ones used for heap tables.
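For daily partitions the housekeeping might look like this (partition names and bounds are illustrative):

-- lose the day that has just dropped out of the checking window
ALTER TABLE recent_baskets DROP PARTITION p_20210222;

-- and make room for the next day's baskets
ALTER TABLE recent_baskets ADD PARTITION p_20210316 VALUES LESS THAN (DATE '2021-03-17');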

The final step of our process is to insert the validated, deduplicated data into the warehouse basket level tables and insert the header keys into the IOT.

 

Grandma Brown’s SQL Cookbook

I like cake, and like people the world over I know that the best cake was the cake my grandmother made. So good, in fact, that there is a family hand-me-down recipe book so that the family that followed her can make cakes as good as the ones she made when she was around.

Likewise, some database developers rely on a series of tried and tested recipes to write SQL queries.


One of the problems with old recipe books is that the world has moved on. This is true for both cakes and SQL. My grandma baked in a coal-fired range oven, weighed ingredients in pounds and ounces and had a very limited larder. Some things translate (such as weights and measures); hardware properties and ingredient availability less readily. For Grandma Brown and her SQL recipes we also find that the things she always did and handed down to you may not give the best results now.

Enough of the cake analogy – what I am getting to is that just because you did something once in the past and it seemed to give the correct result does not mean it is right to use it every time or is even generally applicable.

DISTINCT

Or (for youngsters) UNIQUE, can have its uses – I sometimes use it for quick and dirty profiling to get a better feel for the range of data and its cardinality in a table, however I prefer to use GROUP BY with an actual count of values for this. There are three common problems I see with this:

The distinct does nothing useful in some SQL. Some database operations implicitly work over a distinct set of data, for example UNION, GROUP BY and IN(SELECT DISTINCT COL1 from TABLE1). A DISTINCT won’t change the result but may make the query slower to execute as it will attempt to find duplicates that don’t exist. 

There is no point doing a distinct on data that can’t have duplicates such as SELECT DISTINCT primary_key from TABLE1;  or even SELECT DISTINCT DUMMY from DUAL; 

It is even possible that the use of DISTINCT/UNIQUE is masking a bug in your query logic such as a missed join condition or a filter condition. 

ALIASES

Don’t get me wrong here, aliasing a table in a query is a good thing. Well-chosen aliases make a query easier to read and understand (a good thing if you are doing support), and they make clear which source table each of two identically named columns comes from. Where this goes wrong is when, instead of meaningful aliases such as PD for PRODUCT_DIM, you develop using single-letter aliases that have no relationship to the actual tables. Working out what A, B and C relate to is hard enough without coming across in-line views that use the same A, B and C alias convention but refer to different source tables. Talking of in-line views…

IN-LINE VIEWS

When Grandma started out some ingredients didn’t exist; now it is perfectly possible to use Common Table Expressions (CTEs) as a replacement for many in-line views. The CTE offers many advantages over the in-line view, such as being able to use the same CTE many times in a query, the possibility of recursion, the possibility to hint the CTE, and improved clarity of the SQL (easier to support). However, it is not exactly the same thing as an in-line view, so expect different query plans. Sometimes it is necessary to force Oracle to materialize the CTE and sometimes you need to do the reverse and force it not to materialize it. There are hints to do that (sketched below), which brings me to hints.
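A sketch of the sort of thing I mean – the MATERIALIZE hint (and its opposite, INLINE) are widely used for this, though not formally documented, and the table and column names below are invented:

WITH recent_sales AS (
  SELECT /*+ MATERIALIZE */ store_key, amount
  FROM   sales
  WHERE  sale_date >= DATE '2021-01-01'
)
-- the CTE is referenced twice, so materializing it once may pay off
SELECT   store_key, SUM(amount) AS total
FROM     recent_sales
GROUP BY store_key
HAVING   SUM(amount) > (SELECT AVG(amount) FROM recent_sales);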

HINTS

Most optimizer hints are a way of telling the cost-based optimizer that you know more about your data than it has obtained from the statistics on tables and indexes and from other information such as keys and constraints. If there were such a thing as a universal hint that would benefit all queries, you would think Oracle would have made it a default. There is no such thing; do not apply the same hint to every query as a go-faster switch – it will mostly not work and could make things worse, much, much worse. To my mind hints are useful when you are trying to optimize a query; once you work out the best plan you can look at giving extra information to the optimizer so that an unhinted query arrives at the same plan. The other thing that people following recipes often get wrong is mixing hints with aliases – it is so easy to malform a hint.
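For instance, against an assumed PRODUCT_DIM table: once a table has an alias the hint must reference the alias; name the table instead and the hint is silently ignored.

-- effective: the hint uses the alias pd
SELECT /*+ FULL(pd) */ COUNT(*) FROM product_dim pd WHERE pd.brand = 'ACME';

-- malformed: the table name is hinted but the table is aliased, so the hint does nothing
SELECT /*+ FULL(product_dim) */ COUNT(*) FROM product_dim pd WHERE pd.brand = 'ACME';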

Syntax translation

I have old US cake recipes that use General Purpose flour. We don’t have anything called that here in France so I am going to have to use something different. Likewise, when we translate code from one vendor’s SQL to another we come across things that would work a lot better if they were translated into the right equivalent. This is particularly common with conversion functions such as dates to strings. Sometimes we can replace complex hand-written functions that do string manipulation with vendor-supplied in-database functionality.

Old recipes

In the past people cooked things like roast swan stuffed with sturgeon. They probably don’t now. Our tastes and skills have evolved. My grandma never vacuum-packed and water-bathed a thing in her life. Similarly, SQL has evolved. If your coding style is stuck in Oracle 7.3 you will miss out on great things like analytic functions, the MODEL clause and MATCH_RECOGNIZE. Using some new methods can give great results, but use them in the right places.

I suspect that other people can add further chapters to the recipe book.

Obfuscation or Masking of selective column data

I just blogged this piece on a specific need I had in BI reporting: the ability to return a set of rows that relate to one user, but also return masked rows for the other users – I needed to work out totals across the whole data set but not reveal any descriptive data relating to the other users. I came up with a solution in Oracle Analytics but I also looked at what we can do in the database to make the solution more robust.

I asked a question on Twitter about techniques we might use to achieve this within the database. I got loads of great suggestions, some that helped me frame my question better, others that showed me new ways of looking at things. One friend asked if Fine-Grained Access Control would work for me. Actually, this idea was not on my list of possibilities; perhaps that was because I mainly do SQL and this felt like the dark arts of DBA-hood.

Taking a quick look at the go-to resource on all things Oracle, Tim Hall’s Oracle Base website, led me to a piece on changes to security features in release 10. Just the information I was looking for – something I could adapt and play with.

Unlike many places Google may lead you to, Tim’s web pages are really useful; Tim takes time to explain stuff and gives actual WORKING examples that you can follow along with. If you don’t know how to do something there’s a fair chance that Tim does and has an example you can learn from.

The use case I wanted was not the same as Tim’s, so I had the opportunity to base my work on some of his piece and adapt and modify it (and bring in some other concepts) to get the data doing what I wanted.

As I needed to use some Enterprise Edition features I needed access to a suitable database, and as I am just playing I decided my Vagrant-hosted 18XE database would do nicely. 18XE has the Enterprise Edition features in a small and FREE database. This XE set-up allows me to run the database on the old MacBook in the corner of my home office and run SQL Developer on my desktop connected to a pluggable database on XE.

Please remember that this is a quick and dirty prototype and other tasks may be needed to make it production ready.

What I am going to do:

  1. Create a test user in the database and give necessary grants.
  2. Create a test table and populate it with 6 rows.
  3. Create the access function (more on that in a moment).
  4. Create the VPD policies.
  5. Test.

As the SYS user on the PDB I create a new user (FLA) with the ability to create tables, views, packages, functions etc. I also give it this grant:

GRANT EXECUTE ON dbms_rls TO FLA;

Now connect to the new user and do the following:

--------------------------------------------------------
-- DDL for Table FLA_TEST
--------------------------------------------------------

CREATE TABLE "FLA"."FLA_TEST" 
( "ITEM_DESC" VARCHAR2(200 BYTE), 
"ITEM_SIZE" VARCHAR2(20 BYTE), 
"SUPPLIER" VARCHAR2(20 BYTE), 
"QUANTITY_SOLD" NUMBER(38,0)
) ;
REM INSERTING into "FLA".FLA_TEST
SET DEFINE OFF;
Insert into "FLA".FLA_TEST (ITEM_DESC,ITEM_SIZE,SUPPLIER,QUANTITY_SOLD) values ('Bright-white','100 ml','Shiny-teeth plc',10);
Insert into "FLA".FLA_TEST (ITEM_DESC,ITEM_SIZE,SUPPLIER,QUANTITY_SOLD) values ('Strong-White','100 ml','Shiny-teeth plc',18);
Insert into "FLA".FLA_TEST (ITEM_DESC,ITEM_SIZE,SUPPLIER,QUANTITY_SOLD) values ('Strong-White','50 ml','Shiny-teeth plc',25);
Insert into "FLA".FLA_TEST (ITEM_DESC,ITEM_SIZE,SUPPLIER,QUANTITY_SOLD) values ('Fang-brite','100 ml','Vampire product co',3);
Insert into "FLA".FLA_TEST (ITEM_DESC,ITEM_SIZE,SUPPLIER,QUANTITY_SOLD) values ('No-bleed fang clean','25 ml','Vampire product co',7);
Insert into "FLA".FLA_TEST (ITEM_DESC,ITEM_SIZE,SUPPLIER,QUANTITY_SOLD) values ('Star sparkle','50 ml','Party-Looks',4);

Here I’m inserting 6 kinds of toothpaste from 3 different suppliers. If I was a smart person I would use numeric keys for supplier and not VARCHAR2, but no matter, it will still work. What I want to do is apply a data access control on my table so that for a given supplier I can see my own data and only the numeric data for the other rows. For my BI use case I want to add up the total sum for all of the data but not see descriptions of the rows belonging to the two other suppliers.

First off, I need an access function to work with my access policy. I could use some form of static constant (Tim’s examples go this way for simplicity and clarity) but for my use case it is useful to pass the supplier in a database context – this can be set by an on-connect call in the BI server and is a secure way to pass the supplier details to the database. It also gives a degree of security, as without the correct context name and handle set, no data will be returned. We need to create a context using the SYS user and then write some PL/SQL.

create or replace context sup_ctx using FLA.t1_pkg;

Now back to the FLA user to do the rest of the tasks

Create a package to set the context for our testing – it makes things a little simpler to debug; strictly, you only need the DBMS_SESSION.SET_CONTEXT call.

--------------------------------------------------------
-- DDL for Package T1_PKG
--------------------------------------------------------

CREATE OR REPLACE EDITIONABLE PACKAGE "FLA"."T1_PKG" 
as
procedure setx(p_sup varchar2);
end t1_pkg;

/
--------------------------------------------------------
-- DDL for Package Body T1_PKG
--------------------------------------------------------

CREATE OR REPLACE EDITIONABLE PACKAGE BODY "FLA"."T1_PKG" 
as
procedure setx(p_sup varchar2) is
begin
dbms_session.set_context('sup_ctx','x',p_sup);
end;
end t1_pkg;

/

 

Next, build the access function. Note the silliness with quotes, as I used a string with spaces as my supplier value. We build a policy condition, held in the variable con, which is basically a boolean predicate.

--------------------------------------------------------
-- DDL for Function PF_JOB
--------------------------------------------------------

CREATE OR REPLACE EDITIONABLE FUNCTION "FLA"."PF_JOB" (oowner IN VARCHAR2, ojname IN VARCHAR2)
RETURN VARCHAR2 AS
con VARCHAR2 (200);
BEGIN
con := 'SUPPLIER='''|| sys_context('sup_ctx', 'x') ||'''';
RETURN (con);
END pf_job;

/

Next we set up the access policy itself. It identifies the object (our table) the policy is on, the function that implements the policy, and the fine-grained bit itself: the columns to mask and whether we want all rows back (we do!).

BEGIN
DBMS_RLS.ADD_POLICY (object_schema => 'FLA',
object_name => 'fla_test',
policy_name => 'sp_job',
function_schema => 'FLA',
policy_function => 'pf_job',
sec_relevant_cols => 'item_desc,supplier',
sec_relevant_cols_opt => DBMS_RLS.ALL_ROWS);
END;
/

That’s enough to get things up, so let’s test.

exec t1_pkg.setx('Vampire product co');
select * from FLA_TEST;

We can only see descriptive detail for items from the Vampire Product co.


So far so good. I highlight NULL values in SQL Developer in yellow so I can see at a glance what’s going on. In our use case we don’t really want NULLs so let’s create a view of it.

CREATE OR REPLACE FORCE EDITIONABLE VIEW "FLA"."FLA_TEST_VIEW" ("ITEM_DESC", "ITEM_SIZE", "SUPPLIER", "QUANTITY_SOLD") AS 
SELECT 
coalesce(item_desc, 'Other Item') item_desc,
item_size,
coalesce( supplier, 'Other Supplier') supplier,
quantity_sold
FROM FLA.fla_test;

select * from FLA_TEST_VIEW;


That’s better. Now we can aggregate things up like this:

select item_desc, sum(quantity_sold) from FLA_TEST_VIEW group by item_desc;
ITEM_DESC		SUM(QUANTITY_SOLD)
Other Item		57
Fang-brite		3
No-bleed fang clean	7

Just what we needed.

Finally we test again using different context values

SQL> set echo on
SQL> column item_desc format A20
SQL> column item_size format A10
SQL> column SUPPLIER format A20
SQL> 
SQL> exec t1_pkg.setx('Shiny-teeth plc');

PL/SQL procedure successfully completed.

SQL> select * from FLA_TEST_VIEW;

ITEM_DESC            ITEM_SIZE  SUPPLIER             QUANTITY_SOLD
-------------------- ---------- -------------------- -------------
Bright-white	     100 ml     Shiny-teeth plc      10
Strong-White         100 ml     Shiny-teeth plc      18
Strong-White         50 ml      Shiny-teeth plc      25
Other Item           100 ml     Other Supplier        3
Other Item           25 ml      Other Supplier        7
Other Item           50 ml      Other Supplier        4

6 rows selected.

SQL> exec t1_pkg.setx('Party-Looks');

PL/SQL procedure successfully completed.

SQL> select * from FLA_TEST_VIEW;

ITEM_DESC            ITEM_SIZE  SUPPLIER             QUANTITY_SOLD
-------------------- ---------- -------------------- -------------
Other Item           100 ml     Other Supplier       10
Other Item           100 ml     Other Supplier       18
Other Item           50 ml      Other Supplier       25
Other Item           100 ml     Other Supplier        3
Other Item           25 ml      Other Supplier        7
Star sparkle         50 ml      Party-Looks           4

6 rows selected.

SQL> exec t1_pkg.setx('Vampire product co');

PL/SQL procedure successfully completed.

SQL> select * from FLA_TEST_VIEW;

ITEM_DESC            ITEM_SIZE  SUPPLIER             QUANTITY_SOLD
-------------------- ---------- -------------------- -------------
Other Item           100 ml     Other Supplier       10
Other Item           100 ml     Other Supplier       18
Other Item           50 ml      Other Supplier       25
Fang-brite           100 ml     Vampire product co    3
No-bleed fang clean  25 ml      Vampire product co    7
Other Item           50 ml      Other Supplier        4

6 rows selected.

SQL> exec t1_pkg.setx('non exisitant supplier');

PL/SQL procedure successfully completed.

SQL> select * from FLA_TEST_VIEW;

ITEM_DESC            ITEM_SIZE  SUPPLIER             QUANTITY_SOLD
-------------------- ---------- -------------------- -------------
Other Item           100 ml     Other Supplier       10
Other Item           100 ml     Other Supplier       18
Other Item           50 ml      Other Supplier       25
Other Item           100 ml     Other Supplier        3
Other Item           25 ml      Other Supplier        7
Other Item           50 ml      Other Supplier        4
6 rows selected.

Just the job. Test complete; now to remove the policy and write things up.

BEGIN
DBMS_RLS.DROP_POLICY (object_schema => 'FLA',
object_name => 'fla_test',
policy_name => 'sp_job');
END;
/

When I started planning these two blogs I had not appreciated the simplicity of using Fine-Grained Access Control. It does exactly what I set out to achieve. And because I am using a context to build the filter condition I can build it into a user-specific connection from the BI Server to the database.

Applying Selective Data Restrictions to Provide Data Privacy in BI

Almost 20 years ago I was the technical lead on a large (in those days) data warehouse for a cash and carry wholesaler (that is a sort of warehouse–supermarket hybrid with a target customer base of convenience stores, filling stations and small hospitality operations like cafes, bars and hotels). Dimensionally, they knew the customer and their segment (you needed a membership card to shop); they knew each item in the basket and how it fitted in their product dimension; shop and transaction dates were also dimensions. So far, so classic sales data warehouse.

One of the things they wanted to do was to recover some of their development and operating costs by selling data to their suppliers. To do this they used an additional dimension for supplier. Of course this dimension correlates to the product dimension, but exposing it as a separate dimension allows analysis by supplier across any of their products. They also wanted to remove access to individual customer-level data so that the lowest customer level was customer segment; even in those days they could see the need not to expose data that could be misused to target individuals. Simplistically, we present a star schema with a dimensionality of product (item), supplier (manufacturer), customer segment, store location and date. Pre-built analyses can be run against the dataset and the results burst out to the individual suppliers.

Life however is not so simple. Some suppliers wanted to buy premium access, create their own reporting and slice and dice the data any way they liked. These suppliers tended to be large companies with many products in their portfolios. In fact two of them (competitors of each other) had thousands of products across many of the same product groups: laundry, health and beauty, and foods, in both consumer and industrial brands. Giving third parties direct data access means we need to make sure that information that is not theirs, or that is at too low a level, does not leak out. This is where the challenge is. We need:

  • Query access to all of a supplier’s products.
  • No access to a competitor’s products.
  • Query access to customer segments (the type of businesses purchasing the products).
  • No access to individual customers.
  • No access to supplier details except your own.
  • No access to financial data (that is we expose sales volumes and not cash amounts).
  • Access to market share information – that is supplier sales against all sales for a product group.

The last bullet point is potentially the trickiest part.

I’m not going to discuss how we achieved this as it was a couple of decades back and what we could achieve then in the database or in the reporting tools was somewhat restricted. At that time the database was Oracle 8i (upgraded to 9i), and the tools were one of Oracle Sales Analyzer, Business Objects or Siebel Analytics (depending where on the project timeline we sat). Instead, let’s skip forward to now and look at ways we can tackle those challenges.

Let’s just consider the Oracle database and Oracle Analytics (though OBIEE is going to be analogous here). We have two locations we can restrict data access: in the query tool and in the database. In all likelihood we will do both as we need to prevent the knowledgeable user from getting around our rules.

In Oracle Analytics we are going to do a few steps to prevent external ‘Supplier’ users from straying from their role assigned data resource. Typically we allow access to a single subject area with just the supplier_sales fact table and its supporting dimensions. We could also implement something like the data level security described by Oracle’s A-Team. In this blog they use an initialisation block to populate a session variable containing a list of countries and to use the variable as a filter on the logical SQL. We can use something similar to create the supplier filter to use in our queries. However as it stands the filter does not allow contribution share calculations without some other hoops to be jumped through.

The contribution problem

If I was a supplier of toothpaste and I had a range of 15 product names in a variety of pack sizes (say 50 SKUs in total) I might want to know the percentage share of the 100ml pack of basic fluoride toothpaste against all of the toothpastes I supply. Picture a simplified slice of the supplier data: my 50 SKUs sitting under the product type ‘toothpaste’.


This is relatively simple – we sum the sales of the toothpastes I supply and calculate the contribution of the SKU I am interested in. But what if I want to know my sales of the SKU against the total market? If I sum at product type = ‘toothpaste’ level I will only see the toothpastes I supply and not the whole market, as other suppliers are filtered out. What I want is the other suppliers’ products rolled up into a single ‘all other’ bucket alongside my own detail.


I could remove my supplier filter and use a case statement on the logical product dimension columns; something like (remember supplier is also an attribute of product)

 

CASE WHEN "Product"."SupplierCode" = VALUEOF(NQ_SESSION.USER_COMPANY) THEN "Product"."Description" ELSE 'Other products' END

We would need to do this for each of the exposed dimensional attributes so that, when we roll up, the sales of items not supplied by us go into a single bucket when the reporting platform aggregates the data. There is a cost to the CASE statement: some of the aggregation is moved from the database to the BI server. Perhaps a more viable solution is to create a new fact data source at the product type level to aggregate the total sales numbers for all products in each product type and map that into the logical model for supplier sales. A third option also exists in using the EVALUATE_AGGR function to push analytic SQL requests directly to the database. I would not favour this, as access to the EVALUATE family of functions has security implications and is perhaps best not exposed to external 3rd parties.

On the reporting side it looks like our best option is to filter product level sales by supplier and include an aggregate data source at “product type all suppliers” to provide the rolled up sales quantity metrics.

An obvious problem with implementing the security solely inside the reporting layer is a lack of portability – what if direct database access is needed? What if our reporting solution changes? Let’s assume we are using an Oracle database as our source. We have a few items in our tool chest that might help with this:

 

  • supplier-specific subsets of the data, secured with grants
  • Virtual Private Database (VPD) policies
  • fine-grained access control (column masking)
  • Oracle Label Security
  • Data Redaction
  • Data Masking and Subsetting

First off, the last three methods won’t really help us selectively hide data.

Label Security works on the basis of person A having more access than person B – it is basically implemented by giving a “permission number” to a user, checking a (probably hidden) column in the table for the row’s permission threshold and, if the user has a higher level than the threshold, returning the row. In our case supplier A can see rows that supplier B can’t and B can see rows that A can’t, and there’s the problem. Label Security assumes a hierarchy of access rights and not a flat structure of peers with different access levels. If Label Security was implemented using “between” instead of “greater than” it could be made to work for us.

Redaction is by columns – to use it we would need to split the data into two parts, the un-redacted data for supplier A and the redacted for all suppliers EXCEPT A. The two sets are then UNION ALL’d to provide the full set. Yes it can be done but is probably not worth it.

Data Masking and Subsetting is not the right tool for this production query use case; it is just not going to do this particular job.

Of the ways that can be made to work, the simplest to implement is to create supplier-specific subsets of the data and use database grants to allow select access to individual suppliers. Potentially, this is a very poor method to scale, as we need a database user for each supplier together with their own objects containing their data.

The two methods that, to me, show the most promise are the virtual private database and its child the fine grained access control. The VPD allows us to use a single table for the data and apply policies on the data to only show rows that match an access policy in the database. These policies are implemented by functions and calls to the VPD packages. The traditional VPD from Oracle 9 days rewrote queries to add a where predicate on the required column. So we somehow pass the user running the query and use that to build a filter such as Where [other filters] AND SUPPLIER_ID = 123. In effect we are emulating what was described in the A-Team blog.  However, we can get smart here and instead of a function that maps user to SUPPLIER_ID we could store a value in a database context and use the context to build our filter. This context can be set by the BI server as part of the query to the database. With no context set no data is returned.

If the source database is 10g or later we can be even smarter with this, as the fine-grained (column-masking) enhancements to VPD allow us to selectively display values or return nulls. We can now use one table as our source for both supplier sales and “all other supplier sales”.

Part 2 of this blog will show how this is set up.

 

 

 

Chocolate Clusters

More on my favourite sweet in a moment.

One of the things that takes time in BI is pushing the data around to aggregate it. We read it, sort it, and group it before consolidating it by like terms. We strive to reduce this time by a range of techniques: in the ETL, by pre-aggregating intermediate results, and also by making it less time consuming to retrieve the data by positioning it on appropriate storage.

This comes down to:

  • Move less data – either physically omit unneeded information or pre-aggregate it to do something costly just once and reduce the size of data being accessed.
  • Make fetching the data more efficient.

My friend Martin Widlake has a great conference talk on data clustering. If he is presenting it at a conference you can attend, I urge you to hear him. I’m not going to cover the same ground here; however, clustering of data is an important tool in data warehousing. In fact we have done this (or think that we have) for years. The basic idea is that data that gets accessed together sits together in the database blocks, potentially reducing the number of blocks we need to visit in a query.

Back in the early days people would do things like “ordered inserts”, which sort of worked but was not guaranteed; rows could fill “holes” in the database and, more importantly, we don’t insert all of our data in one go, so we would end up with a series of clusters. Maybe we could get around the gap-filling issue by using the append hint, but that has its own issues with “above the high watermark” inserts. Another approach to clustering could be the use of partitioning, or the old-school partition view method, to create a series of partitions or tables that group our data together.

As a sideline I run this e-commerce sweetshop (candy store) from Yorkshire, e-buy-gum.com. I sell chewing gum (obviously), chocolates and boiled sugar-candy trumpets (the “toot-sweet” range). I want to analyse my sales by product line. Consider my sales of just one chocolate item, the Cadbury’s Creme Egg (other filled egg-shaped chocolates are available). My sales table is partitioned by week, so if I want to see sales for April 2019 I would need to scan through 4 to 6 partitions to look for rows that relate to that single product code (remembering to ignore any rows for the last week of March and the first week of May). If we sell a lot of items we are scanning a load of database blocks needlessly. There are a few techniques we could use to reduce unneeded table (or partition) scans. We could bitmap index, but this solution is not for everyone; factors like cardinality and index maintenance can reduce the usefulness of the approach. Another way forward is to sub-partition our data on the product so that all of the data for one product is found in a subset of the sub-partitions. By creating 16 sub-partitions (hash or list) on the product key we reduce our scan requirement to one sub-partition per week.

But what if my needs change to looking at the total chocolate sales in the month? We get this by joining to the product dimension and restricting the selection to the products in the product group “chocolate”. We have a problem here with sub-partition contents: why should my creme egg data be in the same partition as my chocolate wafer bar? If we hash partitioned on product key there is no reason that all of our chocolate products will be in the same partition; in fact it is most unlikely they would be. If we went with list partitioning we come up against another problem: how do we add new products to the list?

Of course, as an entrepreneur, I’m not just interested in the sales of my products; I’m interested in who I sell to, their locations and a whole host of things I can use to boost my bottom line. So I have decided to look at my total chocolate sales in Q1 to France. Here my brilliant idea to sub-partition the data falls down (even more): it’s just not possible to arrange our data by date, product and customer at the same time.

Step up join attribute clustering. I wrote about this before. As a recap, it allows us to use dimensional attributes as clustering keys, and on multiple dimensions at once. So, for example, our French sales of chocolate-group products sit near each other in the database and, adding in the feature of zone maps, we can zero in on the data very efficiently without looking at database blocks we don’t need to visit. The other plus is that the cluster keys need not be part of the fact table, thus saving space in the fact.

In my previous blog I used a slightly different example – the DDL I grabbed from SQL Developer last time I wrote about this looked along these lines.
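A sketch along those lines rather than the exact original – table and column names are assumed, and the PRODUCTS, CUSTOMERS and STORES dimensions are presumed to exist with unique keys on the join columns:

CREATE TABLE sales (
  sale_date     DATE,
  product_key   NUMBER,
  customer_key  NUMBER,
  store_key     NUMBER,
  quantity_sold NUMBER,
  amount_sold   NUMBER
)
CLUSTERING
  sales JOIN products  ON (sales.product_key  = products.product_key)
        JOIN customers ON (sales.customer_key = customers.customer_key)
        JOIN stores    ON (sales.store_key    = stores.store_key)
BY INTERLEAVED ORDER (
  (products.prod_category, products.prod_subcategory, products.prod_name),
  (customers.country, customers.state, customers.city),
  (stores.country, stores.state, stores.city)
)
YES ON LOAD YES ON DATA MOVEMENT
WITH MATERIALIZED ZONEMAP;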

As you can see, we are clustering our sales data along product category, sub-category and product, interleaved with customer country, state and city, and a similar thing for stores. In the e-buy-gum case we are not using stores.

Happy clustering.

Partition Exchange (backwards)

The other day my Twitter buddy (and real-life friend) Jeff Smith replied to a question about efficient data removal using partitioning, which (as people reminded him) is an Enterprise Edition option (but included in XE from version 18).

I have used partitioning from Oracle 8 days (and for some customers, partition views which predated that technology). As versions are released the technology becomes more and more capable. Partitioning has always been a key part of my data warehousing and large database work. It addresses many important needs including: manageability and resilience, information life cycle and query performance.

One particular aspect I have enjoyed in my ETL processing was the ability to do partition exchange.

Partition Exchange, simply

This is where we create a table with the same column structure as our partitioned table, load it with data that would belong in a single partition and then execute an exchange statement that simply swaps data dictionary information for the table and the partition. The table becomes part of the partitioned table and the partition becomes the table. Being a dictionary operation this can be very quick and is an efficient way of working with bulk data. The data itself does not move in the exchange – this means the tablespace you create the table in will become the tablespace for the partition – you may need to factor that in if you need partitions in specific tablespaces – I will rephrase that: YOU SHOULD CREATE YOUR TABLE TO EXCHANGE IN THE CORRECT TABLESPACE.

One way of using this is for a daily load of new data into a weekly partitioned table. We use exchange as it is less invasive on the target partitioned table. For example:

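Very roughly, and with invented table and partition names, one way such a daily cycle might look:

-- 1. build a work table in the tablespace the partition should live in
CREATE TABLE sales_wk_load (
  sale_date   DATE,
  product_key NUMBER,
  quantity    NUMBER,
  amount      NUMBER
) TABLESPACE sales_current;

-- 2. fill it with the week loaded so far plus today's new rows
INSERT /*+ APPEND */ INTO sales_wk_load
SELECT sale_date, product_key, quantity, amount FROM sales PARTITION (sales_2021_w20)
UNION ALL
SELECT sale_date, product_key, quantity, amount FROM stg_sales_today;
COMMIT;

-- 3. swap the freshly built table with the live weekly partition
ALTER TABLE sales EXCHANGE PARTITION sales_2021_w20 WITH TABLE sales_wk_load;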

Here I leave out some of the steps we would do in a real process such as stats collection and indexing.

One thing that people often overlook is that in a partition exchange the PARTITION BECOMES THE TABLE. This fact allows us to modify the exchange technique so it works with non-partitioned tables in exactly the same way.

For this quick tip we create a single-partition partitioned table – one where the partition will hold all of the data possible from the table. Two ideas spring to mind: a hash-partitioned table with just one partition bucket, or a range-partitioned table whose single partition is bounded above the greatest possible key value. We use the partition as our workspace for the data loading process and, when complete, we exchange it for the table, for example:
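A sketch of the hash flavour with one named partition (all names invented):

-- the single-partition workspace
CREATE TABLE sales_work (
  sale_id   NUMBER,
  sale_date DATE,
  amount    NUMBER
)
PARTITION BY HASH (sale_id)
( PARTITION p_all );

-- ... bulk load and index SALES_WORK here ...

-- then swap its one partition with the conventional, non-partitioned table
ALTER TABLE sales_work
  EXCHANGE PARTITION p_all WITH TABLE sales_np
  INCLUDING INDEXES WITHOUT VALIDATION;  -- INCLUDING INDEXES assumes matching index structures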

Streaming and Star Transforms

In a few weeks’ time I will be in Seattle to speak at ODTUG’s KSCOPE 2019. I have two talks – one about how data profiling is an important adjunct to data modelling, the second some thoughts on how the advent of streamed data has influenced our traditional star-schema data warehouse.

Without it being too much of a spoiler, one of my takeaways will be that the star is not dead, it’s just that what we do has morphed.

Let’s consider a classic Oracle data warehouse star schema. We have a fact table (let’s call it SALES) joined to some dimension tables using foreign key relationships. We record the item sold (PRODUCT_KEY), the person that bought it (CUSTOMER_KEY), the store location it was sold in (STORE_KEY) and the date and time when the item was sold (SELL_DATE_TIME). We also record how much they spent and how many of each item they purchased in the transaction line. One fact table, four dimensions and two measures – almost as simple as it gets. Except, of course, for the elephant in the room, the date/time dimension. We would not usually hold every possible date/time in the dimension table, so for simplicity I am going to store the actual transaction time on the fact table. This has a benefit as it is likely to be the partitioning key and, as a bonus, the optimizer knows date arithmetic and won’t get flustered when working out intervals in cardinality estimates. The other keys should be surrogate keys if we are using slowly changing dimensions – come to my talk or see my blog on temporal validity for a reason why that feature is not all that it seems.

There’s nothing really new in the above. In our old, on-premises, big-iron-in-the-data-center world we would have partitioned the fact table so that access by transaction date/time works on a small slice of the fact table. We would bitmap index the fact dimension keys to allow us to use star transformation query plans. Star transformations select rows from the fact by first ‘AND’ing the dimension key bitmap indexes; we combine the bitmaps for selling a banana (what) to Peter (who) in Mayenne (where) to find only the rows where all of those conditions are true. Star transformations require us to have bitmap indexes on the fact dimension columns and to have at least RELY foreign key constraints in place.

Oracle bitmap indexes are wonderful things but they need a bit of love and care when doing data loads. They index ranges of rows, not single rows. This can cause issues with contention for concurrent access, and during data load they would constantly perform (slow) index updates as new data arrives. There are things we can do here to get around the slow index update. The simplest is to disable the index before loading data and rebuild it afterwards; if, as I hope, the fact is partitioned we only need to disable the local bitmap indexes for the partitions being loaded. Other methods to avoid index updates during insert exist, such as loading to another table, indexing it and then using partition exchange to swap the data and indexes into the target partition. The disable-and-rebuild route might look like this:
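For example (table and partition names assumed):

-- mark the local index partitions on the target partition unusable
ALTER TABLE sales MODIFY PARTITION sales_2021_w20 UNUSABLE LOCAL INDEXES;

-- ... bulk load the partition here ...

-- rebuild only the index partitions we made unusable
ALTER TABLE sales MODIFY PARTITION sales_2021_w20 REBUILD UNUSABLE LOCAL INDEXES;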

Using a streamed source changes things – we don’t really have the opportunity to maintain the bitmap indexes during a constant load process. Obtaining the keys on fact load is quite simple: we just join to the active records of the dimensions and pick up the keys; we can do this directly in the stream or on the relational side. To be a bit picky, we should get the key of the dimension record that was active at the time of the transaction, but unless we are processing late-arriving data this is usually the current record. These key lookups are likely to be quite speedy, unlike maintaining bitmap indexes during a row-by-row insert into an indexed table (or partition).

One way to avoid the update issue is to not use bitmap indexes at all! This is not as bad a choice as it first sounds. Star transformations have been with us since Oracle 9i, and since then we have had FIVE major version number shifts in the database. New features have been released that can get around the need for bitmap indexes in a near-realtime fact.

  • Storage indexes on the disk cells.
  • Zone maps to “coarse index” the fact to dimension joins.
  • Bloom filters.

All three of the above are “negative indexes”: they tell us which blocks of disk cannot contain the required data and thus restrict database block reads to the blocks that probably contain the data being sought.

Zone maps are interesting in that they can become stale through DML (and inserting new rows is DML), so we periodically need to refresh the stale zones; however, this operation is carried out less often than you might think, as we tend to insert new data rather than revisit existing data. For example:
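A basic zone map and its refresh, sketched with assumed names:

-- coarse "index" over the fact's commonly joined/filtered columns
CREATE MATERIALIZED ZONEMAP sales_zmap ON sales (product_key, customer_key, store_key);

-- after a spell of loading, rebuild any zones marked stale
ALTER MATERIALIZED ZONEMAP sales_zmap REBUILD;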

My bottom line is we probably don’t need to use star transform query plans with stream sourced fact data. Newer database structures support performant data access without the bitmap index updates seen on continuous data load.


Go Faster Tape

When I was a kid you could buy a roll of sticky tape with a checker board print on it. This was used to decorate bikes to make them go faster. The reality of this was that it did not actually work, but as a young child you just accepted it for what it was, magic.

The same could be said for the silver bullets you might read about in tuning blogs. The reality of tuning is that there is never a one-size-fits-all solution. Invariably, performance is driven by the distribution of the data and the characteristics of the platform it is being accessed on; everyone’s data and environment is different so what works for one person may not work for another. That is not to say that there is no such thing as good practice for designing data access, it’s just that you can’t say “this block size” or “index this” is the best thing for all circumstances.

However, there are some general goals to explore, if not achieve:

  • Don’t do it at all.
  • Do it less often.
  • Do the same thing, but faster.
  • Do something different.

 

Obviously, by not doing something at all we get the best performance gain. Surprisingly, by looking at a system you might spot time-consuming but unnecessary steps; sometimes it requires you to restate the problem in a different way and alter the processing logic to omit a step.

It’s harder to call whether doing things less frequently or doing them faster gives the best performance boost. Less often could be something simple such as committing less often (of course, commit as often as transactional integrity requires, but for bulk work we might not need to commit for each row). Likewise, pre-filtering data can reduce the volume to be processed. Sometimes the line between less often and faster is quite blurred; is changing from a nested loop join to a hash join an example of “less often” or “faster”? You could argue the same for PL/SQL loops being replaced with a SQL join. Some things definitely sit in the realm of “same but faster”, such as where an operation occurs: in the CPU, in memory, at the storage cell, on disk, using off-line data.

The fourth way – do something different

In some cases, typically with aggregation, it might be possible to apply a complex function once to the result of an aggregate rather than to each individual value, though not all functions allow this rewrite.
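For example, with a hypothetical FX_TO_GBP conversion function, and assuming the rewrite is valid because the function distributes over the sum (a fixed conversion rate, say):

-- Expensive: the conversion function is called once per row
SELECT cust_id, SUM(fx_to_gbp(amount)) AS total_gbp
  FROM f_sales
 GROUP BY cust_id;

-- Cheaper: aggregate first, then call the function once per group
SELECT cust_id, fx_to_gbp(SUM(amount)) AS total_gbp
  FROM f_sales
 GROUP BY cust_id;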


Let's look at a typical problem we see in the data warehouse / data analytics world: we receive an extract of dimensional data from a source system. A simplistic statement of the requirement: using the staged data as a source, insert new data and update existing data. We might write a PL/SQL block that loops through the source, tries to update the target, and inserts if the update fails. This, however, is likely to be slow as we loop through the data set row by row. Faster: we could replace the loop with a MERGE (upsert) statement, which gets rid of our loop code and runs as a single SQL statement. There is still scope for improvement, but that requires us to revisit the requirements.
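A minimal sketch of such a MERGE, using the same SOURCE_SCHEMA.D1 / TARGET_SCHEMA.D1 names as the examples further down (the ATTR_1 and ATTR_2 columns are invented for illustration):

MERGE INTO TARGET_SCHEMA.D1 t
USING SOURCE_SCHEMA.D1 s
   ON (t.dim_key = s.dim_key)
WHEN MATCHED THEN
  UPDATE SET t.attr_1 = s.attr_1,
             t.attr_2 = s.attr_2
WHEN NOT MATCHED THEN
  INSERT (dim_key, attr_1, attr_2)
  VALUES (s.dim_key, s.attr_1, s.attr_2);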

In a real-world dimension we are likely to have new records, updated records and (almost certainly the majority) unchanged records. So how do we remove the unchanged records so we don't process them? I have blogged on this before, but there are two basic approaches we can take:

  • We can outer join source to target and filter out the duplicate rows.
  • We can SQL MINUS the source and target.

As I said before, the join approach may be fine, but when you might get NULLs in some columns there is some coding to do, as comparing a NULL to a NULL is not quite straightforward in Oracle. Some people use an MD5 hash of the concatenated columns, but again NULLs may trip things up, and the hash function is compute intensive so could have a negative impact on systems with little spare CPU.
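One common idiom for the NULL-safe comparison is DECODE, which treats two NULLs as a match; a sketch of the outer-join approach using it (the ATTR_ columns are invented):

-- New or changed rows only: DECODE returns 0 when the two values match, including NULL = NULL
SELECT s.*
  FROM SOURCE_SCHEMA.D1 s
  LEFT JOIN TARGET_SCHEMA.D1 t ON t.dim_key = s.dim_key
 WHERE t.dim_key IS NULL                     -- brand new rows
    OR DECODE(s.attr_1, t.attr_1, 0, 1) = 1  -- changed attribute
    OR DECODE(s.attr_2, t.attr_2, 0, 1) = 1;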

However, if we could change our source data extract to contain only new and changed data, we could go back to a simple merge load. One approach is to use log-based change data capture on the source system; for example, Oracle GoldenGate on the source database only passes rows that have been committed – and for many systems a commit implies a change. However, experience shows that there are "noisy" systems out there that commit when a record has merely been viewed, and that leaves us with data that does not need to be processed. I say change data capture, but we could equally use data streams such as Apache Kafka.

Having less data in the source allows us to further reduce the data we process in a SQL MINUS:

-- Basic set-based approach to find change
SELECT * FROM SOURCE_SCHEMA.D1
MINUS
SELECT * FROM TARGET_SCHEMA.D1;

-- Variation for when SOURCE_SCHEMA.D1 has many fewer rows than TARGET_SCHEMA.D1
SELECT * FROM SOURCE_SCHEMA.D1
MINUS
SELECT T1.*
  FROM TARGET_SCHEMA.D1 T1
 INNER JOIN SOURCE_SCHEMA.D1 S1 ON S1.DIM_KEY = T1.DIM_KEY;

Our next question is: do we merge the data, or can we do something with partition exchange? The Oracle MERGE statement can be very efficient if we are merging on an indexed key, but there may be some mileage in using only direct-path (append) inserts into an empty exchange table and then swapping the data in with a partition exchange. You can also do this with a non-partitioned table!

-- Insert into exchange table EX1 the unchanged data
INSERT /*+ APPEND */ INTO EX1 (
  SELECT * FROM SOURCE_SCHEMA.D1
  INTERSECT
  SELECT T1.* FROM TARGET_SCHEMA.D1 T1 INNER JOIN SOURCE_SCHEMA.D1 S1 ON S1.DIM_KEY = T1.DIM_KEY);
COMMIT;

-- Now the changed + new data
INSERT /*+ APPEND */ INTO EX1 (
  SELECT * FROM SOURCE_SCHEMA.D1
  MINUS
  SELECT T1.* FROM TARGET_SCHEMA.D1 T1 INNER JOIN SOURCE_SCHEMA.D1 S1 ON S1.DIM_KEY = T1.DIM_KEY);
COMMIT;
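
The exchange step itself might then look something like the following; the partition name is made up, and it assumes TARGET_SCHEMA.D1 is partitioned (perhaps with just a single partition, as mentioned earlier in this series):

-- Swap the freshly loaded EX1 segment into the dimension as a dictionary operation
ALTER TABLE TARGET_SCHEMA.D1
  EXCHANGE PARTITION p_current WITH TABLE EX1
  INCLUDING INDEXES WITHOUT VALIDATION;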

This is the basic approach I took in one of the ODI SCD2 IKMs I developed.

Maybe not go faster tape – but some methods to look at for performance improvement.

Kafka-in-a-database

Some people might take the view that my area of technical expertise, ETL, is long dead; a sort of Latin of the tech world. However, that view is not strictly accurate. The domain of ETL is evolving from a "batch-orientated lift-and-shift to some on-premises data warehouse monolith" to one where "streams of data move constantly to one or more target systems that might well exist in someone else's cloud". Although it is still data-herding, it is a million miles from what people were doing in Sunopsis, Oracle Warehouse Builder, or Informatica in the early 2000s. Today we consume changes captured from databases and web-API-sourced third-party data and often push it to a remote target using another web service. However, this is rarely a simple pass-through; almost invariably the outgoing API requires some form of data enrichment before it can be consumed, and that takes us back to the transform, the T, of ETL. Which is my way of saying that going to the ETL talks at conferences is as relevant today as it ever was.

At Oracle Code-One 2018 in San Francisco, one talk particularly resonated with my current work using Kafka to load an Oracle Cloud SaaS product.

In my project we needed to structure our outgoing topics to fit the format expected by the SaaS upload API. The API is a REST service that accepts JSON payloads with a schema specified in the SaaS vendor's documentation. Here we come to a difficulty with any form of stream processing: the messages we need to emit are often the amalgamation of many streams with different velocities. Take a PRODUCT being uploaded to some ERP system: the product barcode may change at a different frequency (and at different times) to the information about the product's supplier. If we need to provide both barcode and supplier in the same JSON payload, we are going to have to find a way to join those streams across extended time periods. This raises questions about stream retention and the need for some backing store. Maybe Kafka KTables could be part of the mix to present a pivoted view of the current state of all the keys, but things can get complex with key management, partitioning, replication, and the need to size the nodes adequately in terms of memory and storage.

From the talk I learnt that sometime in the next 12 months Oracle will be releasing the first version of its in-database Kafka product. As with all future products, Oracle can't commit to a firm delivery timeline, but I am hopeful it will be in the next couple of months. Hearing about the future Oracle in-database Kafka consumer set my head racing on simpler solutions for joining data over extended periods.

The initial version of the product is an in-database consumer of Kafka topics (probably with a limited set of SerDes). The basic approach is to extend the database functionality by adding code (similar to the way Enterprise R extends the database by installing some packages). We use the new package to register a topic and expose its data as an Oracle view. That does not sound very exciting, but the view is the gateway to a whole host of database functionality that might be tricky on a Kafka stream or within KSQL:

  • We can join it to other tables and views (including other topics exposed as views).
  • We can use SQL functions to enrich and enhance the data.
  • We can use aggregate functions.
  • We can use analytic functions.
  • We can pivot and unpivot data.

More importantly, we can insert this data into other tables and then feed other systems.
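
Purely as an illustration (every object name here is hypothetical, including the view that the registration package would expose over the topic), that could be ordinary SQL such as:

-- ORDERS_TOPIC_V stands in for the view exposed over the Kafka topic
INSERT INTO staged_order_totals (order_id, customer_name, order_total)
SELECT o.order_id,
       c.customer_name,
       SUM(o.line_amount)
  FROM orders_topic_v o
  JOIN customers c
    ON c.customer_id = o.customer_id
 GROUP BY o.order_id, c.customer_name;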

As I understand it, the first release is a consumer only – we will have to wait a little for Kafka producers. Another wait-for is Avro processing. In some ways this lack of Avro from day one is a shame, but not one that we can't work around. For example, I use GoldenGate for Big Data to produce Kafka topics from database change, and this comes in Avro format if we want lovely things like auto-registration of topics. However, this is not a technical showstopper if we need to consume GoldenGate change capture in a database; that is exactly what the product has done from the start! The problems come when a data-flow architect mandates that all data moves use the same technology. Maybe Oracle-to-Oracle streaming is simpler using plain native GoldenGate.

Maybe the product is not the complete answer to every problem just yet, but it is one to look at for the future.