Grandma Brown’s SQL Cookbook

I like cake, and like people the world over I know that the best cake was the cake my grandmother made. So good, in fact, that there is a family hand-me-down recipe book so that the generations that followed her can make cakes as good as the ones she made when she was around.

Likewise, some database developers rely on a series of tried and tested recipes to write SQL queries.

One of the problems with old recipe books is that the world has moved on. This is true for both cakes and SQL. My grandma baked in a coal-fired range oven, weighed ingredients in pounds and ounces and had a very limited larder. Some things translate (such as weights and measures), but hardware properties and ingredient availability less readily. For Grandma Brown and her SQL recipes we also find that the things she always did and handed down to us may not give the best results now.

Enough of the cake analogy – what I am getting at is that just because you did something once in the past and it seemed to give the correct result does not mean it is right to use it every time, or that it is even generally applicable.

DISTINCT

Or (for youngsters) UNIQUE, can have its uses – I sometimes use it for quick and dirty profiling to get a better feel for the range of data and its cardinality in a table, though I prefer to use GROUP BY with an actual count of values for that. There are three common problems I see with DISTINCT:

The DISTINCT does nothing useful in some SQL. Some database operations implicitly work over a distinct set of data, for example UNION, GROUP BY and IN (SELECT DISTINCT COL1 FROM TABLE1). The DISTINCT won't change the result but may make the query slower to execute as it will attempt to find duplicates that don't exist.
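
A minimal sketch of the kind of redundant DISTINCT I mean (the tables and columns are made up):

-- The inner DISTINCT is redundant: IN already ignores duplicates in the subquery
SELECT order_id
FROM   orders
WHERE  customer_id IN (SELECT DISTINCT customer_id FROM customers WHERE region = 'EU');

-- Likewise, DISTINCT on top of a GROUP BY adds nothing but work
SELECT DISTINCT customer_id, COUNT(*)
FROM   orders
GROUP BY customer_id;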

There is no point doing a DISTINCT on data that can't have duplicates, such as SELECT DISTINCT primary_key FROM TABLE1; or even SELECT DISTINCT DUMMY FROM DUAL;

It is even possible that the use of DISTINCT/UNIQUE is masking a bug in your query logic, such as a missed join condition or a missing filter condition.
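
A hypothetical example of the masking problem: the one-to-many join below fans out the order rows, and slapping DISTINCT on top hides the fan-out rather than dealing with it.

-- DISTINCT papering over a fan-out from the one-to-many join
SELECT DISTINCT o.order_id, o.order_date, l.ship_date
FROM   orders o
JOIN   order_lines l ON l.order_id = o.order_id;

-- Better: decide what the grain should be and write the query to match it
SELECT o.order_id, o.order_date, MAX(l.ship_date) AS last_ship_date
FROM   orders o
JOIN   order_lines l ON l.order_id = o.order_id
GROUP BY o.order_id, o.order_date;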

ALIASES

Don't get me wrong here, aliasing a table in a query is a good thing. Well-chosen aliases make a query easier to read and understand (a good thing if you are doing support), and they make it clear which source table each of two identically named columns comes from. Where this goes wrong is when, instead of meaningful aliases such as PD for PRODUCT_DIM, you develop using single-letter aliases that have no relationship to the actual tables. Working out what A, B and C relate to is hard enough without coming across in-line views that use the same A, B and C alias convention but refer to different source tables. Talking of in-line views…
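
A quick contrast, using made-up table names:

-- Which table is B again?
SELECT a.product_name, b.store_name, SUM(c.quantity_sold)
FROM   product_dim a, store_dim b, sales_fact c
WHERE  c.product_key = a.product_key
AND    c.store_key   = b.store_key
GROUP BY a.product_name, b.store_name;

-- The same query with aliases that relate to the tables
SELECT pd.product_name, sd.store_name, SUM(sf.quantity_sold)
FROM   product_dim pd, store_dim sd, sales_fact sf
WHERE  sf.product_key = pd.product_key
AND    sf.store_key   = sd.store_key
GROUP BY pd.product_name, sd.store_name;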

IN-LINE VIEWS

When Grandma started out some ingredients didn't exist; now it is perfectly possible to use Common Table Expressions (CTEs) as a replacement for many in-line views. The CTE offers many advantages over the in-line view, such as being able to reference the same CTE many times in a query, the possibility of recursion, the possibility of hinting the CTE, and improved clarity of the SQL (easier to support). However, it is not exactly the same thing as an in-line view, so expect different query plans. Sometimes it is necessary to force Oracle to materialize the CTE and sometimes you need to do the reverse and force it not to materialize it. There are hints to do that, which brings me to hints.
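
A sketch of the two forms. The MATERIALIZE hint (and its opposite, INLINE) shown here is the widely used, though undocumented, lever for controlling materialization; treat it as something to verify on your own version rather than a guarantee.

-- In-line view version
SELECT rd.region, rs.total_sales
FROM   (SELECT region_key, SUM(amount_sold) AS total_sales
        FROM   sales
        GROUP BY region_key) rs
JOIN   region_dim rd ON rd.region_key = rs.region_key;

-- CTE version; the same CTE could be referenced again further down the query
WITH region_sales AS (
  SELECT /*+ MATERIALIZE */ region_key, SUM(amount_sold) AS total_sales
  FROM   sales
  GROUP BY region_key
)
SELECT rd.region, rs.total_sales
FROM   region_sales rs
JOIN   region_dim rd ON rd.region_key = rs.region_key;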

HINTS

Most optimizer hints are a way of telling the cost-based optimizer that you know more about your data than it has obtained from the statistics on tables and indexes and from other information derived from table structure such as keys and constraints. If there were such a thing as a universal hint that benefited all queries, you would think that Oracle would have made it a default. There is no such thing; do not apply the same hint to every query as a go-faster switch – it will mostly not work and could make things worse, much, much worse. To my mind hints are useful when you are trying to optimize a query. Once you work out the best plan you can look at giving extra information to the optimizer so that an unhinted query arrives at the same plan. The other thing that people following recipes often get wrong is mixing hints with aliases; it is so easy to malform a hint.
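
The classic alias mistake: once a table is aliased, the hint must reference the alias rather than the table name, otherwise it is silently ignored. A made-up example:

-- Ignored: the hint names the table, but the table is aliased as pd
SELECT /*+ FULL(PRODUCT_DIM) */ pd.product_name
FROM   product_dim pd
WHERE  pd.product_group = 'CHOCOLATE';

-- Honoured: the hint uses the alias
SELECT /*+ FULL(pd) */ pd.product_name
FROM   product_dim pd
WHERE  pd.product_group = 'CHOCOLATE';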

Syntax translation

I have old US cake recipes that use General Purpose flour. We don't have anything called that here in France so I am going to have to use something different. Likewise, when we translate code from one vendor's SQL to another we come across things that would work a lot better if they were translated into the right equivalent. This is particularly common with conversion functions such as dates to strings. Sometimes we can replace complex hand-written functions that do string manipulation with vendor-supplied in-database functionality.
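
A hypothetical Oracle example of that last point, replacing hand-rolled string slicing with the built-in conversion functions (the column name and format masks are illustrative):

-- Hand-written reformat of a DD/MM/YYYY string into YYYYMMDD
SELECT SUBSTR(order_date_str, 7, 4) || SUBSTR(order_date_str, 4, 2) || SUBSTR(order_date_str, 1, 2)
FROM   staged_orders;

-- Letting the database do the work (and validate the date on the way through)
SELECT TO_CHAR(TO_DATE(order_date_str, 'DD/MM/YYYY'), 'YYYYMMDD')
FROM   staged_orders;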

Old recipes

In the past people cooked things like roast swan stuffed with sturgeon. They probably don't now. Our tastes and skills have evolved. My grandma never vacuum-packed and water-bathed a thing in her life. Similarly, SQL has evolved. If your coding style is stuck in Oracle 7.3 you will miss out on great things like analytic functions, the MODEL clause and MATCH_RECOGNIZE. Using some new methods can give great results, but use them in the right places.
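
One small illustration of what the analytic-function ingredient buys you: a running total without a self-join (table and column names are made up):

-- Old style: a correlated scalar subquery to get a running total
SELECT s1.sell_date, s1.amount_sold,
       (SELECT SUM(s2.amount_sold) FROM sales s2 WHERE s2.sell_date <= s1.sell_date) AS running_total
FROM   sales s1;

-- Analytic function: one pass over the data
SELECT sell_date, amount_sold,
       SUM(amount_sold) OVER (ORDER BY sell_date) AS running_total
FROM   sales;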

I suspect that other people can add further chapters to the recipe book.

Obfuscation or Masking of selective column data

I just blogged a piece on a specific need I had in BI reporting: the ability to return the set of rows that relate to one user, but also return masked rows for the other users – I needed to work out totals across the whole data set but not reveal any descriptive data relating to the other users. I came up with a solution in Oracle Analytics, but I also looked at what we can do in the database to make the solution more robust.

I asked a question on Twitter about techniques we might use to achieve this within the database. I got loads of great suggestions; some helped me frame my question better, others showed me new ways of looking at things. One friend asked if Fine-Grained Access Control would work for me. Actually, this idea was not on my list of possibilities, perhaps because I mainly do SQL and this was the dark arts of DBA-hood.

A quick look at the go-to resource on all things Oracle, Tim Hall's Oracle Base website, led me to a piece on changes to security features in release 10. Just the information I was looking for – something I could adapt and play with.

Unlike many places Google may lead you to, Tim's web pages are really useful; Tim takes time to explain stuff and gives actual WORKING examples that you can follow along with. If you don't know how to do something there's a fair chance that Tim does and has an example you can learn from.

The use case I wanted was not the same as Tim's, so I had the opportunity to base my work on his piece and adapt and modify it (and bring in some other concepts) to get the data doing what I wanted.

As I needed to use some Enterprise Edition features I needed access to a suitable database, and as I am just playing I decided my Vagrant-hosted 18c XE database would do nicely. 18c XE has the Enterprise Edition features in a small and FREE database. This XE set-up allows me to run the database on the old MacBook I have in the corner of my home office and run SQL Developer on my desktop connected to a pluggable database on XE.

Please remember that this is a quick and dirty prototype and other tasks may be needed to make it production ready.

What I am going to do:

  1. Create a test user in the database and give necessary grants.
  2. Create a test table and populate it with 6 rows.
  3. Create the access function (more on that in a moment).
  4. Create the VPD policies.
  5. Test.

As the SYS user on the pluggable database I create a new user (FLA) with the ability to create tables, views, packages, functions etc. I also give it this grant:

GRANT EXECUTE ON dbms_rls TO FLA;
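
For completeness, a minimal sketch of the user creation that goes with that grant; the password, tablespace and privilege list are placeholders to adjust:

CREATE USER fla IDENTIFIED BY "change_me"
  DEFAULT TABLESPACE users
  QUOTA UNLIMITED ON users;

GRANT CREATE SESSION, CREATE TABLE, CREATE VIEW,
      CREATE PROCEDURE, CREATE SYNONYM TO fla;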

Now connect to the new user and do the following:

--------------------------------------------------------
-- DDL for Table FLA_TEST
--------------------------------------------------------

CREATE TABLE "FLA"."FLA_TEST" 
( "ITEM_DESC" VARCHAR2(200 BYTE), 
"ITEM_SIZE" VARCHAR2(20 BYTE), 
"SUPPLIER" VARCHAR2(20 BYTE), 
"QUANTITY_SOLD" NUMBER(38,0)
) ;
REM INSERTING into "FLA".FLA_TEST
SET DEFINE OFF;
Insert into "FLA".FLA_TEST (ITEM_DESC,ITEM_SIZE,SUPPLIER,QUANTITY_SOLD) values ('Bright-white','100 ml','Shiny-teeth plc',10);
Insert into "FLA".FLA_TEST (ITEM_DESC,ITEM_SIZE,SUPPLIER,QUANTITY_SOLD) values ('Strong-White','100 ml','Shiny-teeth plc',18);
Insert into "FLA".FLA_TEST (ITEM_DESC,ITEM_SIZE,SUPPLIER,QUANTITY_SOLD) values ('Strong-White','50 ml','Shiny-teeth plc',25);
Insert into "FLA".FLA_TEST (ITEM_DESC,ITEM_SIZE,SUPPLIER,QUANTITY_SOLD) values ('Fang-brite','100 ml','Vampire product co',3);
Insert into "FLA".FLA_TEST (ITEM_DESC,ITEM_SIZE,SUPPLIER,QUANTITY_SOLD) values ('No-bleed fang clean','25 ml','Vampire product co',7);
Insert into "FLA".FLA_TEST (ITEM_DESC,ITEM_SIZE,SUPPLIER,QUANTITY_SOLD) values ('Star sparkle','50 ml','Party-Looks',4);

Here I'm inserting 6 kinds of toothpaste from 3 different suppliers. If I were smart I would use numeric keys for supplier rather than VARCHAR2, but no matter, it will still work. What I want to do is apply a data access control on my table so that for a given supplier I can see my own data plus the numeric data for the other rows. For my BI use case I want to add up the total sum for all of the data but not see descriptions of the rows belonging to the two other suppliers.

First off I need an access function to work with my access policy. I could use some form of static constant (Tim's examples go this way for simplicity and clarity), but for my use case it will be useful to pass the supplier in a database context – this can be achieved by an on-connect call in the BI server and is a secure way to pass the supplier details to the database. It also gives a degree of security, as without the correct context name and handle set no data will be returned. We need to create a context using the SYS user and then write some PL/SQL.

create or replace context sup_ctx using FLA.t1_pkg;

Now back to the FLA user to do the rest of the tasks.

Create a package to set the context for our testing – it makes things a little simpler to debug; however, you only really need the DBMS_SESSION.SET_CONTEXT call.

--------------------------------------------------------
-- DDL for Package T1_PKG
--------------------------------------------------------

CREATE OR REPLACE EDITIONABLE PACKAGE "FLA"."T1_PKG" 
as
procedure setx(p_sup varchar2);
end t1_pkg;

/
--------------------------------------------------------
-- DDL for Package Body T1_PKG
--------------------------------------------------------

CREATE OR REPLACE EDITIONABLE PACKAGE BODY "FLA"."T1_PKG" 
as
procedure setx(p_sup varchar2) is
begin
dbms_session.set_context('sup_ctx','x',p_sup);
end;
end t1_pkg;

/

 

Next, build the access function. Note the silliness with quotes, as I used a string with spaces as my supplier value. The function builds a policy condition called con, which is basically a boolean condition:

--------------------------------------------------------
-- DDL for Function PF_JOB
--------------------------------------------------------

CREATE OR REPLACE EDITIONABLE FUNCTION "FLA"."PF_JOB" (oowner IN VARCHAR2, ojname IN VARCHAR2)
RETURN VARCHAR2 AS
con VARCHAR2 (200);
BEGIN
con := 'SUPPLIER='''|| sys_context('sup_ctx', 'x') ||'''';
RETURN (con);
END pf_job;

/

Next we set up the access policy itself. It identifies the object (our table) the policy is on, the function that implements the policy, and the fine-grained bit itself: the columns to mask and whether we want all rows back (we do!).

BEGIN
DBMS_RLS.ADD_POLICY (object_schema => 'FLA',
object_name => 'fla_test',
policy_name => 'sp_job',
function_schema => 'FLA',
policy_function => 'pf_job',
sec_relevant_cols => 'item_desc,supplier',
sec_relevant_cols_opt => DBMS_RLS.ALL_ROWS);
END;
/

That's enough to get things up and running, so let's test:

exec t1_pkg.setx('Vampire product co');
select * from FLA_TEST;

We can only see descriptive detail for items from the Vampire Product co.

[Screenshot: FLA_TEST query output with ITEM_DESC and SUPPLIER shown as NULL for the other suppliers' rows]

So far so good. I highlight NULL values in SQL Developer in yellow so I can see at a glance what's going on. In our use case we don't really want NULLs, so let's create a view over the table.

CREATE OR REPLACE FORCE EDITIONABLE VIEW "FLA"."FLA_TEST_VIEW" ("ITEM_DESC", "ITEM_SIZE", "SUPPLIER", "QUANTITY_SOLD") AS 
SELECT 
coalesce(item_desc, 'Other Item') item_desc,
item_size,
coalesce( supplier, 'Other Supplier') supplier,
quantity_sold
FROM FLA.fla_test;

select * from FLA_TEST_VIEW;

[Screenshot: FLA_TEST_VIEW output with the masked values shown as 'Other Item' and 'Other Supplier']

That’s better. Now we can aggregate things up like this:

select item_desc, sum(quantity_sold) from FLA_TEST_VIEW group by item_desc;
ITEM_DESC		SUM(QUANTITY_SOLD)
Other Item		57
Fang-brite		3
No-bleed fang clean	7

Just what we needed.

Finally, we test again using different context values:

SQL> set echo on
SQL> column item_desc format A20
SQL> column item_size format A10
SQL> column SUPPLIER format A20
SQL> 
SQL> exec t1_pkg.setx('Shiny-teeth plc');

PL/SQL procedure successfully completed.

SQL> select * from FLA_TEST_VIEW;

ITEM_DESC            ITEM_SIZE  SUPPLIER             QUANTITY_SOLD
-------------------- ---------- -------------------- -------------
Bright-white	     100 ml     Shiny-teeth plc      10
Strong-White         100 ml     Shiny-teeth plc      18
Strong-White         50 ml      Shiny-teeth plc      25
Other Item           100 ml     Other Supplier        3
Other Item           25 ml      Other Supplier        7
Other Item           50 ml      Other Supplier        4

6 rows selected.

SQL> exec t1_pkg.setx('Party-Looks');

PL/SQL procedure successfully completed.

SQL> select * from FLA_TEST_VIEW;

ITEM_DESC            ITEM_SIZE  SUPPLIER             QUANTITY_SOLD
-------------------- ---------- -------------------- -------------
Other Item           100 ml     Other Supplier       10
Other Item           100 ml     Other Supplier       18
Other Item           50 ml      Other Supplier       25
Other Item           100 ml     Other Supplier        3
Other Item           25 ml      Other Supplier        7
Star sparkle         50 ml      Party-Looks           4

6 rows selected.

SQL> exec t1_pkg.setx('Vampire product co');

PL/SQL procedure successfully completed.

SQL> select * from FLA_TEST_VIEW;

ITEM_DESC            ITEM_SIZE  SUPPLIER             QUANTITY_SOLD
-------------------- ---------- -------------------- -------------
Other Item           100 ml     Other Supplier       10
Other Item           100 ml     Other Supplier       18
Other Item           50 ml      Other Supplier       25
Fang-brite           100 ml     Vampire product co    3
No-bleed fang clean  25 ml      Vampire product co    7
Other Item           50 ml      Other Supplier        4

6 rows selected.

SQL> exec t1_pkg.setx('non exisitant supplier');

PL/SQL procedure successfully completed.

SQL> select * from FLA_TEST_VIEW;

ITEM_DESC            ITEM_SIZE  SUPPLIER             QUANTITY_SOLD
-------------------- ---------- -------------------- -------------
Other Item           100 ml     Other Supplier       10
Other Item           100 ml     Other Supplier       18
Other Item           50 ml      Other Supplier       25
Other Item           100 ml     Other Supplier        3
Other Item           25 ml      Other Supplier        7
Other Item           50 ml      Other Supplier        4
6 rows selected.

Just the job. Test complete; now to remove the policy and write things up.

BEGIN
DBMS_RLS.DROP_POLICY (object_schema => 'FLA',
object_name => 'fla_test',
policy_name => 'sp_job');
END;
/

When I started planning these two blog posts I had not appreciated the simplicity of using fine-grained access control. It does exactly what I was attempting to achieve. And because I am using a context to build the filter condition, I can build it into a user-specific connection from the BI Server to the database.

Applying Selective Data Restrictions to Provide Data Privacy in BI

Almost 20 years ago I was the technical lead on a large (in those days) data warehouse for a cash and carry wholesaler (that is, a sort of warehouse and supermarket hybrid with a target customer base of convenience stores, filling stations and small hospitality operations like cafes, bars and hotels). Dimensionally, they knew the customer and their segment (you needed a membership card to shop); they knew each item in the basket and how it fitted in their product dimension; shop and transaction date were also dimensions. So far, so classic sales data warehouse.

One of the things they wanted to do was to recover some of their development and operating costs by selling data to their suppliers. To do this they used an additional dimension for supplier. Of course this dimension correlates to the product dimension, but exposing it as a separate dimension allows analysis by supplier across any of their products. They also wanted to remove access to individual customer level data so that the lowest customer level was customer segment; even in those days they could see the need not to expose data that could be misused to target individuals. Simplistically, we present a star schema with a dimensionality of product (item), supplier (manufacturer), customer segment, store location and date. Pre-built analyses can be run against the dataset and the results burst out to the individual suppliers.

Life, however, is not so simple. Some suppliers wanted to buy premium access, create their own reporting and slice and dice the data any way they liked. These suppliers tended to be large companies with many products in their portfolios. In fact two of them (competitors of each other) had thousands of products across many of the same product groups: laundry, health and beauty, and foods, covering both consumer and industrial brands. Giving third parties direct data access means we need to make sure that information that is not about their data, or that is at too low a level, does not leak out. This is where the challenge is. We need:

  • Query access to all of a supplier’s products.
  • No access to a competitor’s products.
  • Query access to customer segments (the type of businesses purchasing the products).
  • No access to individual customers.
  • No access to supplier details except your own.
  • No access to financial data (that is we expose sales volumes and not cash amounts).
  • Access to market share information – that is supplier sales against all sales for a product group.

The last bullet point is potentially the trickiest part.

I'm not going to discuss how we achieved this as it was a couple of decades back and what we could achieve then in the database or in the reporting tools was somewhat restricted. At that time the database was Oracle 8i (upgraded to 9i), and the tool was one of Oracle Sales Analyzer, Business Objects or Siebel Analytics (depending where on the project timeline we sat). Instead, let's skip forward to now and look at ways we can tackle those challenges.

Let's just consider the Oracle database and Oracle Analytics (though OBIEE is going to be analogous here). We have two places we can restrict data access: in the query tool and in the database. In all likelihood we will do both, as we need to prevent the knowledgeable user from getting around our rules.

In Oracle Analytics we are going to take a few steps to prevent external 'supplier' users from straying from their role-assigned data resource. Typically we allow access to a single subject area with just the supplier_sales fact table and its supporting dimensions. We could also implement something like the data level security described by Oracle's A-Team. In that blog they use an initialisation block to populate a session variable containing a list of countries and then use the variable as a filter on the logical SQL. We can use something similar to create the supplier filter to use in our queries. However, as it stands, the filter does not allow contribution share calculations without some other hoops being jumped through.

The contribution problem

If I were a supplier of toothpaste and I had a range of 15 product names in a variety of pack sizes (say 50 SKUs in total) I might want to know the percentage share of the 100ml pack of basic fluoride toothpaste against all of the toothpastes I supply. Here's a simplified diagram of part of the supplier data:

 

[Diagram: a simplified view of one supplier's toothpaste SKUs and their sales]

This is relatively simple – we sum the sales of the toothpastes I supply and calculate the contribution of the SKU I am interested in. But what if I want to know my sales of the SKU against the total market? If I sum at product type = 'toothpaste' level I will only see the toothpastes I supply and not the whole market, as other suppliers are filtered out. I want my data to look more like this:

[Diagram: the same data with other suppliers' products rolled up into a single 'other products' bucket]

I could remove my supplier filter and use a CASE statement on the logical product dimension columns; something like this (remember, supplier is also an attribute of product):

 

CASE WHEN "Product"."SupplierCode" = VALUEOF(NQ_SESSION.USER_COMPANY) THEN "Product"."Description" ELSE 'Other products' END

We would need to do this for each of the exposed dimensional attributes so that, when we roll up the sales, items not supplied by us go into a single bucket as the reporting platform aggregates the data. There is a cost to the CASE statement: some of the aggregation is moved from the database to the BI server. Perhaps a more viable solution is to create a new fact data source at the product type level to aggregate the total sales numbers for all products in each product type and map that into the logical model for supplier sales. A third option also exists: using the EVALUATE_AGGR function to push an analytic SQL request directly to the database. I would not favour this, as access to the EVALUATE family of functions has security implications and is perhaps best not exposed to external third parties.

On the reporting side it looks like our best option is to filter product level sales by supplier and include an aggregate data source at 'product type, all suppliers' level to provide the rolled-up sales quantity metrics.

An obvious problem with implementing the security solely inside the reporting layer is a lack of portability – what if direct database access is needed? What if our reporting solution changes? Let's assume we are using an Oracle database as our source. We have a few items in our tool chest that might help with this:

 

[Image: the tool chest – Virtual Private Database, fine-grained access control, per-supplier subsets with grants, Label Security, Data Redaction, and Data Masking and Subsetting]

First off, the last three methods won’t really help us selectively hide data.

Label Security works on the basis of person A having more access than person B – it is basically implemented by giving a "permission number" to a user, checking a (probably hidden) column in the table for the row's permission threshold, and returning the row if the user's level is higher than the threshold. In our case supplier A can see rows that supplier B can't, and B can see rows that A can't, and there's the problem. Label Security assumes a hierarchy of access rights, not a flat structure of peers with different access levels. If Label Security were implemented using 'between' instead of 'greater than' it could be made to work for us.

Redaction works by column – to use it we would need to split the data into two parts, the un-redacted data for supplier A and the redacted data for all suppliers EXCEPT A. The two sets are then UNION ALL'd to provide the full set. Yes, it can be done, but it is probably not worth it.

Data Masking and Subsetting is not the right tool for this production query use case. It is just not going to do this particular job.

Of the ways that can be made to work, the simplest to implement is to create supplier-specific subsets of the data and use database grants to allow select access to individual suppliers. Potentially, this is a very poor method to scale, as we need a database user for each supplier along with their own objects containing their data.

The two methods that, to me, show the most promise are the Virtual Private Database and its child, fine-grained access control. The VPD allows us to use a single table for the data and apply policies so that we only show rows that match an access policy in the database. These policies are implemented by functions and calls to the VPD packages. The traditional VPD from Oracle 9 days rewrote queries to add a where-clause predicate on the required column. So we somehow pass in the user running the query and use that to build a filter such as WHERE [other filters] AND SUPPLIER_ID = 123. In effect we are emulating what was described in the A-Team blog. However, we can get smart here: instead of a function that maps user to SUPPLIER_ID we can store a value in a database context and use the context to build our filter. This context can be set by the BI server as part of the query to the database. With no context set, no data is returned.

If the source database is 10g or later we can be even smarter, as the fine-grained enhancements to VPD allow us to selectively display values or return NULLs for the columns we want to hide. We can now use one table as our source for both supplier sales and "all other supplier sales".

Part 2 of this blog will show how this is set up.

 

 

 

Chocolate Clusters

More on my favourite sweet in a moment.

One of the things that takes time in BI is pushing the data around to aggregate it. We read it, sort it, and group it before consolidating it by like terms. We strive to reduce this time by a range of techniques: in the ETL, by pre-aggregating intermediate results, and also by making it less time consuming to retrieve the data by positioning it on appropriate storage.

This comes down to:

  • Move less data – either physically omit unneeded information or pre-aggregate it to do something costly just once and reduce the size of data being accessed.
  • Make fetching the data more efficient.

My friend Martin Widlake has a great conference talk on data clustering. If he is presenting it at a conference you can attend, I urge you to hear him. I'm not going to cover the same ground here; however, clustering of data is an important tool in data warehousing. In fact we have done this (or think that we have) for years. The basic idea is that data that gets accessed together sits together in the database blocks, potentially reducing the number of blocks we need to visit in a query.

Back in the early days people would do things like "ordered inserts", which sort of worked but was not guaranteed; rows could fill "holes" in the table and, more importantly, we don't insert all of our data in one go, so we would end up with a series of clusters. Maybe we could get around the gap-filling issue by using the APPEND hint, but that has its own issues with above-the-high-watermark inserts. Another approach to clustering could be the use of partitioning, or the old-school partition view method, to create a series of partitions or tables that group our data together.

As a sideline I run an e-commerce sweetshop (candy store) from Yorkshire, e-buy-gum.com. I sell chewing gum (obviously), chocolates and boiled sugar-candy trumpets (the "toot-sweet" range). I want to analyse my sales by product line. Consider my sales of just one chocolate item, the Cadbury's Creme Egg (other filled egg-shaped chocolates are available). My sales table is partitioned by week, so if I want to see sales for April 2019 I would need to scan through 4 to 6 partitions looking for rows that relate to that single product code (remembering to ignore any rows for the last week of March and the first week of May). If we sell a lot of items we are scanning a load of database blocks needlessly. There are a few techniques we could use to reduce unneeded table (or partition) scans. We could use bitmap indexes, but that solution is not for everyone; factors like cardinality and index maintenance can reduce the usefulness of the approach. Another way forward is to sub-partition our data on the product so that all of a product's data is found in a subset of the sub-partitions. By creating 16 sub-partitions (hash or list) on the product key we reduce our scan requirement to one sub-partition per week.
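
A sketch of the kind of composite partitioning that gives us this; the column names and the weekly interval are illustrative:

CREATE TABLE sales (
  product_key    NUMBER NOT NULL,
  customer_key   NUMBER NOT NULL,
  sell_date      DATE   NOT NULL,
  quantity_sold  NUMBER,
  amount_sold    NUMBER
)
PARTITION BY RANGE (sell_date) INTERVAL (NUMTODSINTERVAL(7, 'DAY'))
SUBPARTITION BY HASH (product_key) SUBPARTITIONS 16
( PARTITION p_start VALUES LESS THAN (DATE '2019-01-07') );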

But what if my needs change to looking at total chocolate sales in the month? We get this by joining to the product dimension and restricting the selection to products in the product group "chocolate". We have a problem here with sub-partition contents: why should my Creme Egg data be in the same sub-partition as my chocolate wafer bar? If we hash sub-partition on product key there is no reason that all of our chocolate products will be in the same sub-partition; in fact it is most unlikely they would be. If we went with list partitioning we come up against another problem: how do we add new products to the list?

Of course, as an entrepreneur, I'm not just interested in the sales of my products; I'm interested in who I sell to, their locations and a whole host of things I can use to boost my bottom line. So I have decided to look at my total chocolate sales in Q1 to France. Here my brilliant idea to sub-partition the data falls down (even more): it's just not possible to arrange our data by date, product and customer at the same time.

Step up join attribute clustering. I wrote about this before. As a recap, it allows us to use dimensional attributes as clustering keys, and on multiple dimensions at once. So, for example, our French sales of chocolate-group products sit near each other in the database; add in zone maps and we can zero in on the data very efficiently without looking at database blocks we don't need to visit. The other plus is that the cluster keys need not be part of the fact table, thus saving space in the fact.

In my previous blog I used a slightly different example, with the DDL grabbed from SQL Developer.
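
A representative sketch of that style of clustering DDL; the table and column names are illustrative, and the dimension tables need primary or unique keys on the join columns:

CREATE TABLE sales (
  product_key    NUMBER NOT NULL,
  customer_key   NUMBER NOT NULL,
  store_key      NUMBER NOT NULL,
  sell_date      DATE   NOT NULL,
  quantity_sold  NUMBER,
  amount_sold    NUMBER
)
CLUSTERING
  sales JOIN products  ON (sales.product_key  = products.product_key)
        JOIN customers ON (sales.customer_key = customers.customer_key)
  BY INTERLEAVED ORDER (
    (products.prod_category, products.prod_subcategory, products.prod_name),
    (customers.country, customers.state, customers.city)
  )
  YES ON LOAD YES ON DATA MOVEMENT
  WITH MATERIALIZED ZONEMAP;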

As you can see, we cluster our sales data by product category, sub-category and product, interleaved with customer country, state and city, and we can do a similar thing for stores. In the e-buy-gum case we are not using stores.

Happy clustering.

Partition Exchange (backwards)

The other day my Twitter buddy (and real-life friend) Jeff Smith replied to a question about efficient data removal using partitioning, which (as people reminded him) is an Enterprise Edition option, though it is included in version 18 of XE.

I have used partitioning since the Oracle 8 days (and, for some customers, partition views, which predated that technology). As versions are released the technology becomes more and more capable. Partitioning has always been a key part of my data warehousing and large database work. It addresses many important needs, including manageability and resilience, information life cycle and query performance.

One particular aspect I have enjoyed in my ETL processing was the ability to do partition exchange.

Partition Exchange, simply

This is where we create a table with the same column structure as our partitioned table, load it with data that would belong in a single partition, and then execute an exchange statement that simply swaps the data dictionary information for the table and the partition. The table becomes part of the partitioned table and the partition becomes the table. Being a dictionary operation this can be very quick and is an efficient way of working with bulk data. The data itself does not move in the exchange – this means the tablespace you create the table in becomes the tablespace for the partition. You may need to factor that in if you need partitions in specific tablespaces – I will rephrase that: YOU SHOULD CREATE YOUR TABLE TO EXCHANGE IN THE CORRECT TABLESPACE.

One way of using this is for a daily load of new data into a weekly partitioned table. We use exchange as it is less invasive on the target partitioned table. For example:

[Image: the daily load and exchange steps]

Here I leave out some of the steps we would do in a real process, such as stats collection and indexing.
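
A hedged sketch of the core statements; the table, partition and tablespace names are made up, and index and stats handling are omitted as noted:

-- Build the exchange table in the tablespace the partition should end up in
CREATE TABLE sales_exchange TABLESPACE sales_2019_ts
  AS SELECT * FROM sales WHERE 1 = 0;

-- Populate it with the current content of this week's partition plus today's new rows
INSERT /*+ APPEND */ INTO sales_exchange
  SELECT * FROM sales PARTITION (sales_2019_w23)
  UNION ALL
  SELECT * FROM staged_daily_sales;
COMMIT;

-- Swap the loaded table with the partition: a dictionary-only operation
ALTER TABLE sales
  EXCHANGE PARTITION sales_2019_w23
  WITH TABLE sales_exchange
  WITHOUT VALIDATION;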

One thing that people often overlook is that in a partition exchange the PARTITION BECOMES THE TABLE. This fact allows us to modify the exchange technique to work with non-partitioned tables in exactly the same way.

For this quick tip we create a single-partition partitioned table – one where the partition will hold all of the data possible from the table. Two ideas spring to mind: a hash partitioned table with just one partition bucket, or a range partitioned table whose single partition has a boundary greater than the greatest possible key value. We use the partition as our work space for the data loading process and when complete we exchange it for the table.
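
A minimal sketch of the idea, assuming a non-partitioned target called D1 and a hash partitioned work table with a single, named partition:

-- Work table: one hash partition that will hold every row
CREATE TABLE d1_work (
  dim_key  NUMBER,
  attr1    VARCHAR2(100),
  attr2    VARCHAR2(100)
)
PARTITION BY HASH (dim_key) (PARTITION p_all);

-- ... do the load and transformation work in D1_WORK ...

-- Swap the single partition for the real, non-partitioned table
ALTER TABLE d1_work EXCHANGE PARTITION p_all WITH TABLE d1;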

Streaming and Star Transforms

In a few weeks' time I will be in Seattle to speak at ODTUG's KSCOPE 2019. I have two talks – one about how data profiling is an important adjunct to data modelling; the second, some thoughts on how the advent of streamed data has influenced our traditional star-schema data warehouse.

Without it being too much of a spoiler, one of my takeaways will be that the star is not dead, it’s just that what we do has morphed.

Let's consider a classic Oracle data warehouse star schema. We have a fact table (let's call it SALES) joined to some dimension tables using foreign key relationships. We record the item sold (PRODUCT_KEY), the person who bought it (CUSTOMER_KEY), the store location it was sold in (STORE_KEY) and the date and time the item was sold (SELL_DATE_TIME). We also record how much they spent and how many of each item they purchased in the transaction line. One fact table, four dimensions and two measures – almost as simple as it gets. Except, of course, for the elephant in the room: the date/time dimension. We would not usually hold every possible date/time in the dimension table, so for simplicity I am going to store the actual transaction time on the fact table. This also has a benefit, as this is likely to be the partitioning key and, as a bonus, the optimizer knows date arithmetic and won't get flustered when working out intervals in cardinality estimates. The other keys should be surrogate keys if we are using slowly changing dimensions – come to my talk or see my blog on temporal validity for a reason why that feature is not all that it seems.

There's nothing really new in the above. In our old, on-premises, big-iron-in-the-data-center world we would have partitioned the fact table so that access by transaction date/time works on a small slice of the fact table. We would bitmap index the fact's dimension keys to allow us to use star transformation query plans. Star transformations select rows from the fact by first 'AND'ing the dimension key bitmap indexes; we combine the bitmaps for selling a banana (what) to Peter (who) in Mayenne (where) to find only the rows where all of those conditions are true. Star transformations require us to have bitmap indexes on the fact dimension columns and to have at least RELY foreign key constraints in place.
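
A hedged sketch of the classic setup (object names are illustrative, and only one foreign key is shown):

-- Bitmap indexes on the fact's dimension key columns, local to each partition
CREATE BITMAP INDEX sales_prod_bix  ON sales (product_key)  LOCAL;
CREATE BITMAP INDEX sales_cust_bix  ON sales (customer_key) LOCAL;
CREATE BITMAP INDEX sales_store_bix ON sales (store_key)    LOCAL;

-- Foreign keys declared as RELY so the optimizer can trust the join without enforcement cost
ALTER TABLE sales ADD CONSTRAINT sales_prod_fk
  FOREIGN KEY (product_key) REFERENCES product_dim (product_key)
  RELY DISABLE NOVALIDATE;

-- Star transformation enabled for the session (or set at instance level)
ALTER SESSION SET star_transformation_enabled = TRUE;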

Oracle bitmap indexes are wonderful things but they need a bit of love and care when doing data loads. They index ranges of rows, not single rows. This can cause issues with contention for concurrent access, and during a data load we would constantly perform (slow) index updates as new data arrives. There are things we can do here to get around the slow index update. The simplest is to disable the index before loading the data and rebuild it afterwards. If, as I hope, the fact is partitioned we only need to disable the local bitmap indexes for the partitions being loaded. Other methods to avoid index updates during insert exist, such as loading into another table, indexing it and then using partition exchange to swap the data and indexes for the target partition.
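
The partition-level disable and rebuild might look something like this (the partition name is illustrative):

-- Mark the local index partitions unusable before the bulk load...
ALTER TABLE sales MODIFY PARTITION sales_2019_w23 UNUSABLE LOCAL INDEXES;

-- ... load the partition ...

-- ...then rebuild just those index partitions afterwards
ALTER TABLE sales MODIFY PARTITION sales_2019_w23 REBUILD UNUSABLE LOCAL INDEXES;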

Using a streamed source changes things – we don't really have the opportunity to maintain the bitmap indexes during a constant load process. Obtaining the keys on fact load is quite simple: we just join to the active records of the dimensions and obtain the keys; we can do this directly in the stream or on the relational side. To be a bit picky, we should get the key of the dimension record that was active at the time of the transaction, but unless we are processing late-arriving data this is usually the current record. These key lookups are likely to be quite speedy to perform, unlike a row-by-row insert into a bitmap indexed table (or partition).
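
A sketch of the relational-side key lookup, assuming current-record flags on the dimensions (all names are illustrative):

INSERT INTO sales (product_key, customer_key, store_key, sell_date_time, quantity, amount)
SELECT pd.product_key,
       cd.customer_key,
       sd.store_key,
       src.sell_date_time,
       src.quantity,
       src.amount
FROM   sales_stream src
JOIN   product_dim  pd ON pd.product_code  = src.product_code  AND pd.current_flag = 'Y'
JOIN   customer_dim cd ON cd.customer_code = src.customer_code AND cd.current_flag = 'Y'
JOIN   store_dim    sd ON sd.store_code    = src.store_code    AND sd.current_flag = 'Y';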

One way to avoid the update issue is not to use bitmap indexes at all! This is not as bad a choice as it first sounds. Star transformations have been with us since Oracle 9i, and since then we have had FIVE major version number shifts in the database. New features have been released that can get around the need for bitmap indexes on a near-real-time fact:

  • Storage indexes on the disk cells.
  • Zone maps to “coarse index” the fact to dimension joins.
  • Bloom filters.

All three of the above are "negative indexes": they let us skip the blocks of disk where the required data will not be found and thus restrict database block reads to the blocks that probably contain the data being sought.

Zone maps are interesting in that they can become stale through DML (and inserting new rows is DML), so we would periodically need to refresh the stale zones; however, this operation is carried out less often than you might think, as we tend to insert data and not revisit existing data.

My bottom line is that we probably don't need to use star transformation query plans with stream-sourced fact data. Newer database structures support performant data access without the bitmap index updates seen on continuous data loads.

 

Go Faster Tape

When I was a kid you could buy a roll of sticky tape with a checkerboard print on it. This was used to decorate bikes to make them go faster. The reality was that it did not actually work, but as a young child you just accepted it for what it was: magic.

The same could be said for the silver bullets you might read about in tuning blogs. The reality of tuning is that there is never a one-size-fits-all solution. Invariably, performance is driven by the distribution of the data and the characteristics of the platform it is being accessed on; everyone's data and environment is different, so what works for one person may not work for another. That is not to say that there is no such thing as good practice for designing data access; it's just that you can't say "this block size" or "index this" is the best thing for all circumstances.

However, there are some general goals to explore, if not achieve:

[Image: the general goals – don't do it at all, do it less often, do it the same but faster, do something different]

Obviously, by not doing something at all we get the best performance gain. Surprisingly often, just looking at a system lets you spot time-consuming but unnecessary steps; sometimes it requires you to restate the problem in a different way and alter the processing logic to omit a step.

It's harder to call whether doing things less frequently or doing them faster gives the best performance boost. Less often could be something simple such as committing less often (of course, commit as often as transactional integrity requires, but for bulk work we might not need a commit for each row). Likewise, pre-filtering data can reduce the volume to be processed. Sometimes the line between less often and faster is quite blurred; is changing from a nested loop join to a hash join an example of "less often" or "faster"? You could argue the same for PL/SQL loops being replaced with a SQL join. Some things definitely sit in the realm of "same but faster", such as where an operation occurs: in the CPU, in memory, at the storage cell, on disk, or using off-line data.

The fourth way – do something different

In some cases, typically with aggregation, it might be possible to move a complex function to the result of an aggregate rather than apply it to each value, though some functions will not allow this.

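A hypothetical illustration; this rewrite is only valid when the function distributes over the aggregate, and convert_currency here is a made-up function:

-- Function applied to every row before aggregating
SELECT SUM(convert_currency(amount_sold, 'EUR')) AS total_eur
FROM   sales;

-- Function applied once, to the aggregate result
SELECT convert_currency(SUM(amount_sold), 'EUR') AS total_eur
FROM   sales;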

Let's look at a typical problem we see in the data warehouse / data analytics world, where we get an extract of dimensional data from a source. A simplistic statement of this: using the staged data as a source, insert new data and update existing data. We might write a PL/SQL block that loops through the source, tries to update the target, and if that fails tries to insert. This, however, is likely to be slow as we loop through our data set. Faster: we could replace the loop code with a MERGE or UPSERT statement. This gets rid of our loop code and runs as a simple SQL statement. There is still scope for improvement though, but that requires us to revisit the requirements.
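
A sketch of the single-statement replacement for the loop, using the same SOURCE_SCHEMA.D1 / TARGET_SCHEMA.D1 names as the later examples (the attribute columns are made up):

MERGE INTO target_schema.d1 t
USING source_schema.d1 s
ON    (t.dim_key = s.dim_key)
WHEN MATCHED THEN UPDATE
  SET t.attr1 = s.attr1,
      t.attr2 = s.attr2
WHEN NOT MATCHED THEN INSERT (dim_key, attr1, attr2)
  VALUES (s.dim_key, s.attr1, s.attr2);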

In a real world dimension we are likely to have new records, updated records and (almost certainly as the majority) unchanged records. So how do we remove the unchanged records so we don’t process them? I have blogged on that before but there are two basic approaches we can take:

  • We can outer join source to target and filter out the duplicate rows.
  • We can SQL MINUS the source and target.

As I said before, the join approach may be fine, but when you might get NULLs in some columns there is some coding to do, as comparing a NULL to a NULL is not quite straightforward in Oracle. Some people use an MD5 hash on the concatenated columns, but again NULLs may trip things up, and the hash function is very compute intensive so could have a negative impact on systems with low CPU power.

However, if we could change our source data extract to contain only new and changed data we could go back to a simple merge load. One approach for this is to use log-based change data capture on the source system; for example, Oracle GoldenGate on the source database only passes rows that have a commit – and for many systems a commit implies a change. However, experience shows that there are "noisy" systems out there that commit on view, and that leaves us with data that need not be processed. I say change data capture, but we could equally use data streams such as Apache Kafka.

Having less data in the source allows us to further reduce the data we process in a SQL MINUS:

-- Basic set-based approach to find change
SELECT * FROM SOURCE_SCHEMA.D1
MINUS
SELECT * FROM TARGET_SCHEMA.D1;

-- Modify where SOURCE_SCHEMA.D1 has many fewer rows than TARGET_SCHEMA.D1
SELECT * FROM SOURCE_SCHEMA.D1
MINUS
SELECT T1.* FROM TARGET_SCHEMA.D1 T1 INNER JOIN SOURCE_SCHEMA.D1 S1 on S1.Dim_Key = T1.DIM_KEY;

Our next question is: do we merge the data or can we do something with partition exchange? The Oracle MERGE statement can be very efficient if we are merging on an indexed key, but there may be some mileage in only using append inserts into an empty table and swapping it in using partition exchange. You can also do this with a non-partitioned table!

-- Insert into exchange table EX1: the unchanged data
INSERT /*+APPEND */ INTO EX1 (
SELECT * FROM SOURCE_SCHEMA.D1
INTERSECT
SELECT T1.* FROM TARGET_SCHEMA.D1 T1 INNER JOIN SOURCE_SCHEMA.D1 S1 on S1.Dim_Key = T1.DIM_KEY );
COMMIT;
-- Now the changes + new data
INSERT /*+APPEND */ INTO EX1 (
SELECT * FROM SOURCE_SCHEMA.D1
MINUS
SELECT T1.* FROM TARGET_SCHEMA.D1 T1 INNER JOIN SOURCE_SCHEMA.D1 S1 on S1.Dim_Key = T1.DIM_KEY);
COMMIT;

This is the basic approach I took in one of the ODI SCD2 IKMs I developed.

Maybe not go faster tape – but some methods to look at for performance improvement.