10 ideas for using Window Functions to analyse data with SQL

Ever had to compare a row of data against a benchmark or grouped aggregate? How about getting the preceding and subsequent rows of a record within one query? Or being able not just to order, but also rank records, with SQL? These are great examples where window functions can achieve analysis that would be wordy to describe in SQL, or even impossible without them.

 

Window functions allow you to look beyond a row of data, return its position relative to a specified group (partition) and order, or aggregate across multiple rows, without having to group your data. Still confused? Read on and learn about 10 ways you may consider using window functions in your next SQL query to perform more complex analysis with fewer lines of SQL. Note: the following examples are written with BigQuery SQL syntax in mind, but similar functions can be found in other SQL dialects supporting window functions. This post will not focus on syntax, but rather on the possibilities that window functions open up.

 

There are three main groups of window functions:

  • Aggregate functions allow you to do what regular aggregate functions do, but in the context of window functions, they allow you to aggregate across multiple rows without grouping them explicitly. Generally the same result can be achieved without window functions, using multiple subqueries and joins, but window functions allow you to do the same in much more succinct SQL.
  • Ranking functions allow you to assign various types of ranking across rows, in your specified groups (partitions) and orders.
  • Value functions help you to identify specific values from another row of data, for example, the one just before or just after that row, also in your specified grouping (partition) and order.

Grouping (partitions) in window functions, in the scenarios above, refers only to the field(s) you are querying over and does not need to be applied across the entire query, unlike regular grouping. Here we go!
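For orientation, the examples below all share the same general anatomy, shown here as a schematic in BigQuery-style syntax (the PARTITION BY, ORDER BY and ROWS clauses are each optional, depending on the function and the analysis):

function_name(expression) OVER (
    PARTITION BY grouping_columns
    ORDER BY ordering_columns
    ROWS BETWEEN frame_start AND frame_end
)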

 

Jump to:

  1. Creating aggregated benchmarks to compare the value of a particular record/group against
  2. Counting the members of a group/ similar occurrences while retaining each record separately
  3. Calculate the moving average
  4. Getting the first/last occurrence of an event across a group/ de-duplication
  5. Getting the top x records and aggregating the rest of the data
  6. Performing a multi touch attribution analysis
  7. Calculating the time interval between consecutive events
  8. Calculating gap to nearest competitor/ best competitor
  9. Calculating changes in value over consecutive time periods – tracking performance changes
  10. Finding sequence of events

 

Aggregate window functions

1. Creating aggregated benchmarks to compare the value of a particular record/group against

You have prices of the same bag of nuts in different stores, and would like to see how the price in each store compares to a benchmark.

prices:

store     region    price
A         East      2.50
B         North     1.90
C         North     2.20
D         East      2.70
E         South     4.50
F         South     4.70

 

With window functions, you can easily create new columns with different benchmarks against which you'd like to compare the prices, such as the average price, the minimum price, etc. Ignoring the region for now:

SELECT store, price,
AVG(price) OVER () as average,
MIN(price) OVER () as minimum
FROM prices

 

What you’ll get would be something like this:

store     price     average   minimum
A         2.50      3.08      1.90
B         1.90      3.08      1.90
C         2.20      3.08      1.90
D         2.70      3.08      1.90
E         4.50      3.08      1.90
F         4.70      3.08      1.90

 

Since no partition (grouping) was specified, the window function looked across all rows in each case and produced the same result for each record. Further, since you cannot reference a window function's result by its alias elsewhere in the same query (for example, in a WHERE clause), you can use this as a subquery, and in the outer query easily obtain how far the price of the bag of nuts in each store is from the global average or minimum.
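For instance, a minimal sketch of that outer query (reusing the column aliases above) could look like:

SELECT store, price,
price - average as diff_from_avg,
price - minimum as diff_from_min
FROM
    (SELECT store, price,
    AVG(price) OVER () as average,
    MIN(price) OVER () as minimum
    FROM prices)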

 

Taking this a step further, you may wish to consider the average or minimum at a regional level rather than at the global level. This can be done by specifying a PARTITION:

SELECT store, region, price,
AVG(price) OVER () as average,
MIN(price) OVER () as minimum,
AVG(price) OVER (PARTITION BY region) as region_avg,
MIN(price) OVER (PARTITION BY region) as region_min
FROM prices

 

And get:

store     region    price     average   minimum   region_avg  region_min
A         East      2.50      3.08      1.90      2.60        2.50
B         North     1.90      3.08      1.90      2.05        1.90
C         North     2.20      3.08      1.90      2.05        1.90
D         East      2.70      3.08      1.90      2.60        2.50
E         South     4.50      3.08      1.90      4.60        4.50
F         South     4.70      3.08      1.90      4.60        4.50

 

By using this as a subquery, the possibilities for analysis grow as you calculate the difference from benchmarks and more.

 

2. Counting the members of a group/ similar occurrences while retaining each record separately

Taking the earlier example, let's say you'd like to generate a report with the price at each store, yet also have a column showing the number of stores in each region. You can easily do this with:

SELECT store, region, price,
SUM(1) OVER (PARTITION BY region) as reg_stores
FROM prices

 

And …

store     region    price     reg_stores
A         East      2.50      2
B         North     1.90      2
C         North     2.20      2
D         East      2.70      2
E         South     4.50      2
F         South     4.70      2

 

This may seem pretty basic, but if you combine it with ranking functions (further down), you can complete a competitive analysis in just SQL, counting the number of competitors, the number of competitors with a better/same price, and more.
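For example, one possible sketch of such a competitive analysis (assuming a lower price is better; the ranking functions used here are explained further down):

SELECT store, region, price,
SUM(1) OVER (PARTITION BY region) - 1 as competitors,
RANK() OVER (PARTITION BY region ORDER BY price) - 1 as cheaper_competitors,
SUM(1) OVER (PARTITION BY region, price) - 1 as same_price_competitors
FROM prices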

 

3. Calculate the moving average

Similar to the first example of calculating an average across a window, but taking sequence into consideration, and useful for time series analysis (e.g. flagging when a record has gone way beyond its moving average, possibly an anomaly or something to act on). This requires additional arguments in the OVER clause: not just the PARTITION, but also an ORDER BY as well as the ROWS BETWEEN frame to consider. Using a stock prices example:

prices:

date      closing
11/16     1.39
11/17     1.40
11/20     1.39
11/21     1.42
11/22     1.43
11/23     1.43
11/24     1.43
11/27     1.44
11/28     1.43
11/29     1.44
11/30     1.42
12/01     1.42

 

To obtain a simple moving average of the previous 2 days:

SELECT date, closing,
AVG(closing) OVER (ORDER BY date ROWS BETWEEN 2 PRECEDING AND 1 PRECEDING) as mov_avg
FROM prices
ORDER BY date

 

And here’s the result

date      closing  mov_avg
11/16     1.39     null
11/17     1.40     1.39
11/20     1.39     1.395
11/21     1.42     1.395
11/22     1.43     1.405
11/23     1.43     1.425
11/24     1.43     1.43
11/27     1.44     1.43
11/28     1.43     1.435
11/29     1.44     1.435
11/30     1.42     1.435
12/01     1.42     1.43

 

In this case, we did not specify a PARTITION, but you can add one before the ORDER BY if there is a need to. ORDER BY specifies how you would like to order the rows before applying the ROWS BETWEEN frame for the moving average. Other than PRECEDING, you can also use keywords such as FOLLOWING, CURRENT ROW and UNBOUNDED (in place of an integer for the number of rows). This can be used as a simplified and very easy to implement anomaly detection solution for key metrics you'd like to ensure stay within a specified range, while still printing out each individual record side by side. Furthermore, by not aggregating the raw records, you'd know the exact moment when specified thresholds were breached.
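As a rough sketch of such an anomaly check (the 0.03 threshold here is an arbitrary, made-up number – pick whatever tolerance suits your metric):

SELECT date, closing, mov_avg,
ABS(closing - mov_avg) > 0.03 as is_anomaly   #flag rows that stray too far from the moving average
FROM
    (SELECT date, closing,
    AVG(closing) OVER (ORDER BY date ROWS BETWEEN 2 PRECEDING AND 1 PRECEDING) as mov_avg
    FROM prices)
ORDER BY date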

 

Ranking Window Functions

4. Getting the first/last occurrence of an event across a group/ de-duplication

You’ve got a records of purchases done by each customer and would like to pick out just the first or last item purchased per customer.

 

One possible way to achieve this without window functions is to create a subquery grouping by customer ID and obtaining the first/last event time per customer, then inner joining on the timestamp and customer ID back to your original data.

 

But what if there are multiple events happening at that same time? The result would be multiple events per user left in the output. By using the ROW_NUMBER window function, you can assign a non-repeated ranking by a specified PARTITION and a specified ORDER.

purchases:

event_time            customer   item
2017-12-01 11:20:50   Adam       apple
2017-12-01 12:10:20   Brittany   pear
2017-12-02 12:15:40   Brittany   orange
2017-12-02 14:15:10   Brittany   avocado
2017-12-03 10:20:30   Adam       banana
2017-12-03 12:50:00   Adam       mango

 

Something like this:

SELECT event_time, customer, item,
ROW_NUMBER() OVER (PARTITION BY customer ORDER BY event_time) as first,
ROW_NUMBER() OVER (PARTITION BY customer ORDER BY event_time DESC) as last
FROM purchases

 

This will assign a non-repeating number within each PARTITION to each record in the specified ORDER, and even in the event of a clash at the same timestamp, each record will be assigned a different ROW_NUMBER.


event_time            customer   item     first  last
2017-12-01 11:20:50   Adam       apple    1      3
2017-12-01 12:10:20   Brittany   pear     1      3
2017-12-02 12:15:40   Brittany   orange   2      2
2017-12-02 14:15:10   Brittany   avocado  3      1
2017-12-03 10:20:30   Adam       banana   2      2
2017-12-03 12:50:00   Adam       mango    3      1

 

By using this as an inner query, then wrapping an outer query around it that takes only rows where first=1, you'll get the first purchase per customer; where last=1 gives you the last purchase per customer. This can also be used as a quicker way to de-duplicate your data, by specifying which conditions to pick out duplicates for, and the priority (ORDER) in which to consider records. The additional benefit of using this to de-duplicate data over using lots of GROUP BYs is that you retain the integrity of the record (the entire record in the output is an actual input record, as opposed to aggregating certain values where the raw input is mutated).
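A sketch of that outer query, keeping only the first purchase per customer, could look like:

SELECT event_time, customer, item
FROM
    (SELECT event_time, customer, item,
    ROW_NUMBER() OVER (PARTITION BY customer ORDER BY event_time) as first
    FROM purchases)
WHERE first=1   #use last=1 (with the DESC ordering) for the most recent purchase instead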

 

5. Getting the top x records and aggregating the rest of the data

You’ve got a list of sales per store, with the number of stores going into the hundreds, and most of the sales you’re interested in coming from just the top 5 stores. How can you output the sales per store for the top 5 stores, and aggregate the rest into ‘others’?

 

One possible way without window functions is to create a subquery with the same data, ORDER BY sales DESC, then LIMIT 5. Left join that subquery to the original data, and in the outer query, select the store if there is a matching record in the subquery, and 'others' if nothing joins. SUM your sales figures, GROUP BY this new store column and voila! You should be left with 6 rows: 5 for the sales of the top 5 stores, and 1 for 'others'.

 

What if there were a more elegant way of producing the same result, but with greater control over the output? Cue window functions! You might have already gotten the idea: assign a ROW_NUMBER to each record in the inner query, then in the outer query, select the store if the rank is less than or equal to 5, and 'others' in any other case.

sales:

store     sales
A         2.00
B         2.40
C         1277.00
D         59.30
E         48.10
F         566.90
G         832.20
H         832.20
I         832.20
J         566.90

 

Other than ROW_NUMBER, you may wish to consider RANK and DENSE_RANK. RANK assigns the same rank to records with the same value in the specified ORDER, and skips ahead by the number of records sharing that rank. DENSE_RANK also assigns the same rank to records with the same value, but in contrast to RANK, the next rank always follows in running order, regardless of how many records were assigned the previous rank.

SELECT store, sales,
ROW_NUMBER() OVER (ORDER BY sales DESC) as row_number,
RANK() OVER (ORDER BY sales DESC) as rank,
DENSE_RANK() OVER (ORDER BY sales DESC) as dense_rank
FROM sales

 

This will give you:

store     sales     row_number rank       dense_rank
A         2.00      10         10         7
B         2.40      9          9          6
C         1277.00   1          1          1
D         59.30     7          7          4
E         48.10     8          8          5
F         566.90    5          5          3
G         832.20    2          2          2
H         832.20    3          2          2
I         832.20    4          2          2
J         566.90    6          5          3

 

If you’ll just like to see exactly 5 stores unaggregated you’ll do a outer query with a field defined like this:

IF(row_number<=5, store, 'others') as store,

This replaces the store column. If, in the case of a tie at 5th position, you'd want all the tied stores to show up, you can use rank in place of row_number. And finally, if you are more interested in all stores with the top 5 sales values, regardless of the number of stores at each value, you can use dense_rank instead. As you can see, window functions make the same analysis much simpler and give you greater control.
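Put together, a sketch of the whole query could look like this (GROUP BY 1 refers to the first output column, i.e. the redefined store):

SELECT
IF(row_number<=5, store, 'others') as store,
SUM(sales) as total_sales
FROM
    (SELECT store, sales,
    ROW_NUMBER() OVER (ORDER BY sales DESC) as row_number
    FROM sales)
GROUP BY 1
ORDER BY total_sales DESC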

 

Note: This may not always be the most straightforward way of obtaining the top X records. If, for example, the 10 stores fall into 5 groups, and you'd instead like to obtain the stores belonging to the top 2 groups and aggregate the rest, then once there are multiple aggregations to be completed, it may be simpler to just create a subquery defining the groups you want singled out, then LEFT or INNER JOIN it back to the main query.

 

6. Performing a multi touch attribution analysis

If you’re doing marketing for an e-commerce platform, you might be interested in performing a multi touchpoint attribution analysis to more accurately evaluate the performance of any channel, campaign, adgroup/adset, or copy. If you’ve gotten your users’ conversion information and information of their various interactions with your campaigns, you can, simply by using a couple of window functions, perform a variety of multi touch attribution analysis, such as:

  • First touch – the first channel/campaign the user interacted with
  • Last touch – the last channel/campaign the user interacted with before making a purchase or conversion
  • Linear – dividing the conversion value equally across all interaction points prior to purchase/conversion
  • U shape – higher weight to first and last touch points, equally divided for middle touch points
  • Time decay – higher weight for the most recent touch points, gradually decreasing in weight for the earlier ones
  • And many more, restricted only to your imagination

Maybe an example will make this clearer (with lots of simplifications):

purchases:

date        customer   item       value
2017-11-29  Adam       socks      8.99
2017-12-01  Brittany   book       4.90

sessions:

date        customer   channel
2017-11-20  Adam       Banner Ad
2017-11-24  Adam       Search Ad
2017-11-25  Adam       Retargeting Ad
2017-11-29  Brittany   Search Ad
2017-11-29  Adam       Banner Ad
2017-12-01  Brittany   Google Search
2017-12-02  Adam       Google Search

 

Let's do a LEFT JOIN of the purchases to the sessions table on customer to get all the interactions by each customer with your website. On this result, you'll want to exclude sessions that happen after the purchase, and if you are considering a wide date range you may also want to restrict the sessions associated with a purchase to a smaller timeframe (perhaps 30 days). Finally, apply a couple of window functions over this result and you're almost there.

SELECT
p.date as purch_date,
p.customer as customer,
p.item as item,
p.value as value,
s.date as sess_date,
s.channel as channel,
ROW_NUMBER() OVER (PARTITION BY p.customer, p.date, p.item ORDER BY s.date) as sess_asc,
ROW_NUMBER() OVER (PARTITION BY p.customer, p.date, p.item ORDER BY s.date DESC) as sess_desc,
SUM(1) OVER (PARTITION BY p.customer, p.date, p.item) as sess_ttl
FROM
    purchases as p
LEFT JOIN
    sessions as s
ON p.customer=s.customer
WHERE s.date<=p.date   #to get only sessions before and up to the purchase
AND DATEDIFF(p.date, s.date)<=30   #to get sessions within 30 days prior to purchase

What we’ve got so far:

purch_date    customer    item   value   sess_date    channel         sess_asc  sess_desc  sess_ttl
2017-11-29    Adam        socks  8.99    2017-11-20   Banner Ad       1         4          4
2017-11-29    Adam        socks  8.99    2017-11-24   Search Ad       2         3          4
2017-11-29    Adam        socks  8.99    2017-11-25   Retargeting Ad  3         2          4
2017-11-29    Adam        socks  8.99    2017-11-29   Banner Ad       4         1          4
2017-12-01    Brittany    book   4.90    2017-11-29   Search Ad       1         2          2
2017-12-01    Brittany    book   4.90    2017-12-01   Google Search   2         1          2

 

With this as an inner query, you can easily get the first touch, last touch and linear attribution models out in an outer query. At that level, if you just want the attributed revenue per channel/campaign, you no longer need to carry the purchase item, date and value through to the output.

SELECT
sess_date,
channel,
SUM(IF(sess_asc=1, value, 0)) as first_touch,
SUM(IF(sess_desc=1, value, 0)) as last_touch,
SUM(value/sess_ttl) as linear
FROM
    [that earlier query]
GROUP BY sess_date, channel

As for a U shape (position based) attribution model, things get a little more complex, but it is nonetheless still manageable in a few lines of SQL. We'll use a CASE statement to assign different percentages to apply to each conversion value, based on whether it is the first, last, or a middle touch point, and also depending on the number of touch points. Let's look at the following possible cases.

  1. There are only 2 touch points. In that case we'll want to assign an equal value (50%) to the session where sess_asc=1 and 50% to the session where sess_desc=1 (which are the only 2 touch points).
  2. There is only 1 touch point. In this case we'll want to assign all credit to that touch point, but as this touch point/session will have sess_asc=1 and sess_desc=1, the same logic as the 2-touch-point case applies (0.5 + 0.5 = 1).
  3. Assuming we want to assign 40% each to the first and last touch points, and spread 20% across the middle ones, for all conversions with 3 or more touch points we can assign 40% to a touch point if sess_asc=1 or sess_desc=1, and in any other case divide 20% equally among all touch points excluding the first and last (sess_ttl-2).

With that in mind, it’ll look something like this…

SUM(
    CASE
    WHEN sess_ttl<=2 THEN IF(sess_asc=1, 0.5, 0)+IF(sess_desc=1, 0.5, 0)
    WHEN sess_asc=1 OR sess_desc=1 THEN 0.4
    ELSE 0.2/(sess_ttl-2) END 
* value) as U_shape

And there you have it! Lots more is still possible with attribution models.
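For instance, a time decay model could be sketched along the same lines. Here the weight simply halves with each earlier touch point – an arbitrary choice – and the weights are normalised per conversion so they still sum to the full conversion value:

SELECT channel,
SUM(value * raw_wt / ttl_wt) as time_decay
FROM
    (SELECT channel, value,
    POW(0.5, sess_desc-1) as raw_wt,   #most recent touch point gets the largest weight
    SUM(POW(0.5, sess_desc-1)) OVER (PARTITION BY customer, purch_date, item) as ttl_wt
    FROM [that earlier query])
GROUP BY channel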

 

Value Window Functions

7. Calculating the time interval between consecutive events

You’ve got a stream of web visits data coming in at irregular time intervals and would like to analyse these time intervals.

pageviews (assume for 1 user, or use a PARTITION to get these metrics by user):

event_time            page
2017-12-01 11:20:50   about
2017-12-01 12:10:20   homepage
2017-12-01 12:15:40   products
2017-12-01 14:15:10   homepage
2017-12-01 18:20:30   deal_of_the_day
2017-12-01 20:50:00   products

 

By using a LAG or LEAD window function, you can obtain the previous or next value of a specified column in your specified PARTITION and ORDER. That produces an output where you can compare the previous/next value to the current value, and from there do things such as examine whether there are irregular gaps that may point to data dropping off, pages with a high bounce rate, pages where users stay for a long time, or whether these intervals change throughout the day.

SELECT
event_time, page,
LAG(event_time) OVER (ORDER BY event_time) as lag,
LEAD(event_time) OVER (ORDER BY event_time) as lead
FROM pageviews

And get this:

event_time            page             lag                  lead
2017-12-01 11:20:50   about            null                 2017-12-01 12:10:20
2017-12-01 12:10:20   homepage         2017-12-01 11:20:50  2017-12-01 12:15:40
2017-12-01 12:15:40   products         2017-12-01 12:10:20  2017-12-01 14:15:10
2017-12-01 14:15:10   homepage         2017-12-01 12:15:40  2017-12-01 18:20:30
2017-12-01 18:20:30   deal_of_the_day  2017-12-01 14:15:10  2017-12-01 20:50:00
2017-12-01 20:50:00   products         2017-12-01 18:20:30  null

 

By performing operations over this, such as taking the difference from event_time, you can analyse how the time intervals between events change for different pages, and how they change throughout the day.
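A sketch of that difference, assuming event_time is stored as a TIMESTAMP (TIMESTAMP_DIFF is BigQuery Standard SQL – use your dialect's equivalent otherwise):

SELECT event_time, page,
TIMESTAMP_DIFF(event_time, prev_time, SECOND) as secs_since_prev_page
FROM
    (SELECT event_time, page,
    LAG(event_time) OVER (ORDER BY event_time) as prev_time
    FROM pageviews)
ORDER BY event_time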

 

8. Calculating gap to nearest competitor/ best competitor

You've got a list of stores with various sales performance metrics such as total sales, average value per sale, number of customers, and so on, and would like to obtain, alongside each record, the metrics for the next best store, as well as the best store in its group. You can use a LEAD or LAG window function, similar to the previous example, to get these metrics by specifying the ORDER by which you'll evaluate the 'best store', and a PARTITION if you'd also like to compare stores within groups. FIRST_VALUE and LAST_VALUE work similarly, but return the first or last value within the specified PARTITION, in the specified ORDER.

performance:

store     region    ttl_sales  avg_sale   cust_base
A         East      1000       5.70       120
B         East      1200       4.50       160
C         South     1050       4.80       140
D         East      900        5.60       190
E         South     900        3.60       100
F         South     950        4.20       180

 

Let's say for total sales, we'd like to benchmark each store against the store with the highest sales in its region. For average sale, we'd like to benchmark each store against the next better performer in terms of avg_sale, regardless of region. For customer base, we'll benchmark each store against the store with the highest total sales, regardless of region. To do that:

SELECT
store, region, ttl_sales as t_sales, avg_sale as a_sale, cust_base as c_base,
FIRST_VALUE(ttl_sales) OVER (PARTITION BY region ORDER BY ttl_sales DESC) as t_sales_bm,
LEAD(avg_sale) OVER (ORDER BY avg_sale) as a_sale_bm,
FIRST_VALUE(cust_base) OVER (ORDER BY ttl_sales DESC) as c_base_bm
FROM performance

 

store     region    t_sales  a_sale   c_base  t_sales_bm  a_sale_bm  c_base_bm
A         East      1000     5.70     120     1200        null       160
B         East      1200     4.50     160     1200        4.80       160
C         South     1050     4.80     140     1050        5.60       160
D         East      900      5.60     190     1200        5.70       160
E         South     900      3.60     100     1050        4.20       160
F         South     950      4.20     180     1050        4.50       160

 

Again, many different angles of analysis are possible by tweaking the parameters and applying operations over this in an outer query.
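For example, a sketch of the gaps to each benchmark:

SELECT store, region,
t_sales - t_sales_bm as t_sales_gap,
a_sale_bm - a_sale as a_sale_gap,   #how far behind the next better performer
c_base - c_base_bm as c_base_gap
FROM
    [that earlier query]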

 

9. Calculating changes in value over consecutive time periods – tracking performance changes

Similar to example #7, except that in this case you're not analysing the time difference itself, but a value of interest. For example, you may have the monthly sales volume and would like to calculate the month-on-month growth.

sales:

month    sales
2017-06  1200
2017-07  1400
2017-08  1500
2017-09  1300
2017-10  1000

 

First, we use the LAG window function to obtain sales figures from the previous month.

SELECT
month, sales,
LAG(sales) OVER (ORDER BY month) as last_mth
FROM sales

 

Now we can compare each month’s sales to the previous month:

month    sales  last_mth
2017-06  1200   null
2017-07  1400   1200
2017-08  1500   1400
2017-09  1300   1500
2017-10  1000   1300

 

To calculate the month-on-month growth, we can use something such as:

SELECT
month, sales, 
ROUND((sales/last_mth-1)*100,2) as growth #in percent
FROM 
    [that earlier query]

 

Here it is:

month    sales  growth
2017-06  1200   null
2017-07  1400   16.67
2017-08  1500   7.14
2017-09  1300   -13.33
2017-10  1000   -23.08

And on to our last example.

 

10. Finding sequence of events

Going back to the example in #6, where you have some conversion event of interest and a list of sessions leading up to it. However, instead of attributing the eventual revenue/value across different touch points, we'd now like to trace the entire interaction path completed by the user and analyse whether there are patterns across different conversion paths. Retrieving the data again:

purchases:

date        customer   item       value
2017-11-29  Adam       socks      8.99
2017-12-01  Brittany   book       4.90

sessions:

date        customer   channel
2017-11-20  Adam       Banner Ad
2017-11-24  Adam       Search Ad
2017-11-25  Adam       Retargeting Ad
2017-11-29  Brittany   Search Ad
2017-11-29  Adam       Banner Ad
2017-12-01  Brittany   Google Search
2017-12-02  Adam       Google Search

 

We'll do a LEFT JOIN again, but this time, other than adding sess_desc to identify the final touch point before conversion (and for de-duplication), we'll add a couple of LAG window functions to trace the path. If we are just interested in the last 3 events leading up to the conversion:

SELECT
purch_date, customer, item, value,
CONCAT(
    IFNULL(CONCAT(lag_2, '-'), ''),
    IFNULL(CONCAT(lag_1, '-'), ''),
    IFNULL(channel, '')
) as conversion_path
FROM
    (SELECT
    p.date as purch_date,
    p.customer as customer,
    p.item as item,
    p.value as value,
    s.date as sess_date,
    s.channel as channel,
    #we'll keep this to be able to identify unique conversions later
    ROW_NUMBER() OVER (PARTITION BY p.customer, p.date, p.item ORDER BY s.date DESC) as sess_desc,
    LAG(s.channel, 2) OVER (PARTITION BY p.customer, p.date, p.item ORDER BY s.date) as lag_2,
    LAG(s.channel, 1) OVER (PARTITION BY p.customer, p.date, p.item ORDER BY s.date) as lag_1
    FROM
        purchases as p
    LEFT JOIN
        sessions as s
    ON p.customer=s.customer
    WHERE s.date<=p.date   #to get only sessions before and up to the purchase
    AND DATEDIFF(p.date, s.date)<=30   #to get sessions within 30 days prior to purchase
    )
WHERE sess_desc=1

 

LAG can take an optional 2nd argument, the offset, which is the number of rows back to look. With the above query, we'll be left with:

purch_date  customer   item       value  conversion_path
2017-11-29  Adam       socks      8.99   Search Ad-Retargeting Ad-Banner Ad
2017-12-01  Brittany   book       4.90   Search Ad-Google Search

 

By grouping the different conversion paths after that, you'll be able to analyse and evaluate the performance of different paths, as well as how different campaigns/channels contribute to them.
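A sketch of that grouping step, reusing the query above:

SELECT conversion_path,
COUNT(*) as conversions,
SUM(value) as revenue
FROM
    [that earlier query]
GROUP BY conversion_path
ORDER BY conversions DESC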

 

Closing

And that wraps everything up! I was considering writing an additional post on window function syntax in BigQuery SQL (either Legacy or Standard), but having looked through BigQuery's documentation, I felt that this may not be necessary. Please do, however, let me know if a post on syntax would be helpful. Have you used window functions to perform your analysis with SQL? Any other examples where window functions can make analysis much simpler? Which of the above ideas is your favorite? Share your experience and thoughts in the comments below! Till the next time, thanks for reading 🙂
