Apart from the obligatory “raindrops on roses and whiskers on kittens” I must also add analytic functions to my list, and one function in particular, FIRST_VALUE (oh, and its almost identical twin, LAST_VALUE).
FIRST_VALUE returns the first value encountered in a selection window. We specify the column to apply the function to, and as this is an analytic function we have the OVER clause at the end, where we can supply an optional PARTITION BY. What is not optional is the ORDER BY clause; it makes no sense to look for a first value if the order in which we see those values is not controlled, and that means we must specify a windowing clause too. Getting the first (or last) value in a select is perhaps less useful if we have NULL values in that column, but this is just where the IGNORE NULLS option comes in handy: we can find the first non-NULL value encountered.
Let’s consider this data set. I have a couple of NULL values in “Your” things.
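The original table isn’t reproduced here, but a minimal FAV_THINGS table along these lines is enough to follow along (the column names come from the queries below; the rows themselves are my invention):

```sql
-- Hypothetical recreation of the FAV_THINGS table used in the queries;
-- the column names match the SQL below, the data is illustrative only.
CREATE TABLE fav_things (
  whose       VARCHAR2(10),
  thing       VARCHAR2(30),
  other_thing NUMBER
);

INSERT INTO fav_things VALUES ('My',   'Raindrops on roses',   1);
INSERT INTO fav_things VALUES ('My',   'Whiskers on kittens',  2);
INSERT INTO fav_things VALUES ('Your', NULL,                   1);
INSERT INTO fav_things VALUES ('Your', 'Brown paper packages', 2);
INSERT INTO fav_things VALUES ('Your', NULL,                   3);
COMMIT;
```

Note the NULLs in the “Your” partition sort first on OTHER_THING, which is what makes IGNORE NULLS interesting below.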
select first_value(THING) over (
         partition by WHOSE
         order by OTHER_THING
         rows between unbounded preceding and unbounded following
       ) as OUR_FAV
from   FAV_THINGS;

select first_value(THING ignore nulls) over (
         partition by WHOSE
         order by OTHER_THING
         rows between unbounded preceding and unbounded following
       ) as OUR_FAV
from   FAV_THINGS;
IGNORE NULLS just skips those pesky NULL values; of course, if every row in the window is NULL then we will get NULL back.
The first time I used this kind of functionality was a few years back, when Stewart Bryson and I were involved in an Exadata V1 data warehouse. The customer had a requirement to use log-based change data capture (CDC), but was unable to configure the CDC process to use supplemental logging. With log-based CDC we get a time marker (the SCN plus an order number, to handle multiple changes for a single SCN value), a single row representing each insert or delete, and TWO rows for each update (a before and an after state). With supplemental logging we get all of the columns populated; without it we only get the columns that change value, and NULLs in the columns unaffected. Although far from ideal, we can use the captured data (in our case a subscriber view over it) to synthesise the “missing” data in the capture. This is where LAST_VALUE comes into play.
Our customer was only interested in the final outcome of change: a ‘noisy’ source application meant a row might change many times over a few seconds as each individual column was updated, and only the final state was of interest in the data warehouse. In outline, we built a data set from the UNION ALL of the CDC subscriber view and the current state of our target table (not forgetting to filter the target table select to include only the PK values found in the CDC subscriber view being processed). The only complication is to remember that there are TWO kinds of NULL in a CDC view: data that becomes NULL, and data that remains unchanged on UPDATE. The resulting query (which we encapsulated in a view) was quite complex: LAG and LEAD analytics and CASE expressions to handle individual changes, with an ultimate LAST_VALUE to build the final state of the row to merge into the target table.
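Stripped of the LAG/LEAD and CASE handling, the core of that approach can be sketched as follows. The table and column names here are invented for illustration, not the customer’s actual schema, and I am glossing over the “became NULL versus unchanged” distinction that the real view had to resolve first:

```sql
-- Sketch: collapse a sparse CDC change set to one final-state row per key.
-- cdc_rows stands in for the UNION ALL of the subscriber view and the
-- current target rows. A NULL here means "column unchanged", so
-- LAST_VALUE ... IGNORE NULLS carries the most recent known value forward.
SELECT pk_id, col_a, col_b
FROM (
  SELECT pk_id,
         LAST_VALUE(col_a IGNORE NULLS)
           OVER (PARTITION BY pk_id ORDER BY scn, seq
                 ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS col_a,
         LAST_VALUE(col_b IGNORE NULLS)
           OVER (PARTITION BY pk_id ORDER BY scn, seq
                 ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS col_b,
         -- keep just one (fully assembled) row per key
         ROW_NUMBER()
           OVER (PARTITION BY pk_id ORDER BY scn DESC, seq DESC) AS rn
  FROM cdc_rows
)
WHERE rn = 1;
```

The output is one row per primary key holding the last known value of every column, ready to MERGE into the target.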
Recently, I revisited FIRST_VALUE for a piece of ETL logic to supply ‘default’ values to some columns based on the content of other columns. A few years back I would have fought hard to keep data cleansing / data enrichment logic like this on the source system; today, I am less convinced that this is the right place to do things. We live in a world where source data systems are frequently outside our scope of influence: cloud-based packaged solutions and multiple data sources (web, cloud, on-premise, subscription) lead us to take the view that perhaps the best place to do things is where our data first lands on a system we control. I am working up a conference talk on this topic, so more on that later.
Simplistically, we needed to store some ‘rules’ in a table and then, if a rule is matched, apply the rule to the target column in the data. Suppose I run a grocery and the things I sell have item codes. What if I made a rule that said anything containing “OJ” is of type “Orange Juice” and category “Fruit Juice”, and likewise anything containing “GJ” is “Grape Juice” and again “Fruit Juice”? We now have two rules that supply the same value of “Fruit Juice”, so we need a solution that can take an arbitrary number of rules mapping to the same target value. One way of doing this is to use a regular expression as the join condition (REGEXP_LIKE is suitable) and to build a single expression separated by the ‘|’ symbol; remember that in Oracle regular expressions we use the pipe as an OR operator between alternatives. This is a simple use of LISTAGG to build a ‘|’-separated list that we then use in our REGEXP_LIKE match, and we also get the benefit of an optional ignore-case match.
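A sketch of that LISTAGG step, assuming a hypothetical RULES table (MATCH_STRING, ITEM_TYPE) and a PRODUCTS table with a PRODUCT_CODE column; the names are mine, not from a real schema:

```sql
-- Build one '|'-separated alternation per target value, then use it
-- as a case-insensitive REGEXP_LIKE join condition.
SELECT p.product_code,
       r.item_type
FROM   products p
JOIN  (SELECT item_type,
              LISTAGG(match_string, '|')
                WITHIN GROUP (ORDER BY match_string) AS match_expr
       FROM   rules
       GROUP  BY item_type) r
ON     REGEXP_LIKE(p.product_code, r.match_expr, 'i');   -- 'i' = ignore case
```

So two rule rows ('OJ', 'Fruit Juice') and ('GJ', 'Fruit Juice') collapse to the single pattern 'GJ|OJ' before the join.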
Now suppose I wrote a rule: PRODUCT_NAME contains “JUIC”. That will do nicely for telling me that “Orange Juice” is a fruit juice; however, what about Carrot Juice, or even Juicy Fruit™? One option is to give rules some form of “priority”, so that highly specific rules take precedence over generic ones and we take the result of the most important matching rule. ROW_NUMBER does that nicely.
Did you spot that little ‘gotcha’ in the bit with item codes? We have one item-code match string supplying two target columns. This is where we need to pivot our target values around (using a CASE expression, as the Oracle PIVOT clause only works on aggregates) to give a sparse set of data (lots of NULLs), then use FIRST_VALUE IGNORE NULLS to generate our transformed column values. Now we can reduce our data set to just one row with a ROW_NUMBER() = 1 filter. Each row in our output gives the best-fit rules for each target column.
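Putting the pieces together, the shape of the query is something like this. Again the RULES table (MATCH_STRING, TARGET_COL, TARGET_VALUE, PRIORITY) and PRODUCTS table are invented stand-ins for illustration:

```sql
-- Pivot each matched rule's value into its target column with CASE,
-- fill the resulting sparse columns with FIRST_VALUE ... IGNORE NULLS
-- ordered by rule priority, then keep only the best-fit row per product.
SELECT product_code, item_type, category
FROM (
  SELECT p.product_code,
         FIRST_VALUE(CASE r.target_col WHEN 'ITEM_TYPE' THEN r.target_value END
                     IGNORE NULLS)
           OVER (PARTITION BY p.product_code ORDER BY r.priority
                 ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
           AS item_type,
         FIRST_VALUE(CASE r.target_col WHEN 'CATEGORY' THEN r.target_value END
                     IGNORE NULLS)
           OVER (PARTITION BY p.product_code ORDER BY r.priority
                 ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
           AS category,
         ROW_NUMBER()
           OVER (PARTITION BY p.product_code ORDER BY r.priority) AS rn
  FROM products p
  JOIN rules r
  ON   REGEXP_LIKE(p.product_code, r.match_string, 'i')
)
WHERE rn = 1;
```

The full ROWS BETWEEN frame matters here: with the default frame, the rn = 1 row would only see values from rules at or above its own priority, not the whole sparse set.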