Ever had to compare a row of data against a benchmark or a grouped aggregate? How about accessing the rows just before and after a given record, all within one query? Or not just ordering, but also ranking records with SQL? These are great examples of analysis that window functions make possible, where plain SQL would be wordy or even impossible.
Window functions allow you to look beyond a single row of data: you can return its position relative to a specified group (partition) and order, or aggregate across multiple rows, without having to group your data. Still confused? Read on and learn about 10 ways you may consider using window functions in your next SQL query to perform more complex analysis with fewer lines of SQL. Note: the following examples are written with BigQuery SQL syntax in mind, but similar functions can be found in other SQL dialects supporting window functions. This post will not focus on syntax, but rather on the possibilities that window functions open up.
There are 3 main groups of window functions:
- Aggregate functions do what regular aggregate functions do, but in the context of window functions they aggregate across multiple rows without grouping them explicitly. Generally the same result can be achieved without window functions, using multiple subqueries and joins, but window functions let you express it in much more succinct SQL.
- Ranking functions allow you to assign various types of ranking to rows, within your specified groups (partitions) and orders.
- Value functions help you to pick out specific values from other rows of data, for example the row just before or just after the current one, again within your specified grouping (partition) and order.
Grouping (partitioning) in window functions, in the scenarios above, applies only to the column the window function produces and does not need to be applied across the entire query, unlike regular grouping. Here we go!
Jump to:
- Creating aggregated benchmarks to compare the value of a particular record/group against
- Counting the members of a group/ similar occurrences while retaining each record separately
- Calculate the moving average
- Getting the first/last occurrence of an event across a group/ de-duplication
- Getting the top x records and aggregating the rest of the data
- Performing a multi-touch attribution analysis
- Calculating the time interval between consecutive events
- Calculating gap to nearest competitor/ best competitor
- Calculating changes in value over consecutive time periods – tracking performance changes
- Finding the sequence of events
Aggregate Window Functions
1. Creating aggregated benchmarks to compare the value of a particular record/group against
You have prices of the same bag of nuts in different stores, and would like to see how the price in each store compares to a benchmark.
prices:
| store | region | price |
| --- | --- | --- |
| A | East | 2.50 |
| B | North | 1.90 |
| C | North | 2.20 |
| D | East | 2.70 |
| E | South | 4.50 |
| D | South | 4.70 |
With window functions, you can easily create new columns with different benchmarks for which you’ll like to compare the prices, such as the average price, the minimum price, etc. Ignoring the region for now:
SELECT
  store,
  price,
  AVG(price) OVER () as average,
  MIN(price) OVER () as minimum
FROM prices
What you’ll get would be something like this:
| store | price | average | minimum |
| --- | --- | --- | --- |
| A | 2.50 | 3.08 | 1.90 |
| B | 1.90 | 3.08 | 1.90 |
| C | 2.20 | 3.08 | 1.90 |
| D | 2.70 | 3.08 | 1.90 |
| E | 4.50 | 3.08 | 1.90 |
| D | 4.70 | 3.08 | 1.90 |
Since no partition (grouping) was specified, the window function looked across all rows in each case and produced the same result for each record. Further, since you generally cannot apply arithmetic operators (+, -, etc.) directly to the result of a window function in the same expression (at least in BigQuery Legacy SQL), you can use this as a subquery, and in the outer query easily obtain how far the price of the bag of nuts in each store is from the global average or minimum.
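For example, a minimal sketch of that outer query, wrapping the query above as a subquery (the diff_from_avg/diff_from_min names are just for illustration):

SELECT
  store,
  price,
  price - average as diff_from_avg, #how far above/below the global average
  price - minimum as diff_from_min #how far above the global minimum
FROM (
  SELECT
    store,
    price,
    AVG(price) OVER () as average,
    MIN(price) OVER () as minimum
  FROM prices
)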
Taking this a step further, you may wish to consider the average or minimum at a regional level rather than at the global level. This can be done by specifying a PARTITION:
SELECT
  store,
  region,
  price,
  AVG(price) OVER () as average,
  MIN(price) OVER () as minimum,
  AVG(price) OVER (PARTITION BY region) as region_avg,
  MIN(price) OVER (PARTITION BY region) as region_min
FROM prices
And get:
| store | region | price | average | minimum | region_avg | region_min |
| --- | --- | --- | --- | --- | --- | --- |
| A | East | 2.50 | 3.08 | 1.90 | 2.60 | 2.50 |
| B | North | 1.90 | 3.08 | 1.90 | 2.05 | 1.90 |
| C | North | 2.20 | 3.08 | 1.90 | 2.05 | 1.90 |
| D | East | 2.70 | 3.08 | 1.90 | 2.60 | 2.50 |
| E | South | 4.50 | 3.08 | 1.90 | 4.60 | 4.50 |
| D | South | 4.70 | 3.08 | 1.90 | 4.60 | 4.50 |
By using this as a subquery, the possibilities for analysis grow as you calculate the difference from benchmarks and more.
2. Counting the members of a group/ similar occurrences while retaining each record separately
Taking the earlier example, let's say you'd like to generate a report with the price at each store, yet also have a column showing the number of stores in each region. You can easily do this with:
SELECT
  store,
  region,
  price,
  SUM(1) OVER (PARTITION BY region) as reg_stores
FROM prices
And …
| store | region | price | reg_stores |
| --- | --- | --- | --- |
| A | East | 2.50 | 2 |
| B | North | 1.90 | 2 |
| C | North | 2.20 | 2 |
| D | East | 2.70 | 2 |
| E | South | 4.50 | 2 |
| D | South | 4.70 | 2 |
This may seem pretty basic, but if you combine it with ranking functions (further down), you can complete a competitive analysis in just SQL, counting the number of competitors, the number of competitors with a better/same price, and more.
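As a hedged sketch of that idea, using the same prices table (the column names here are hypothetical), you could count each store's competitors in its region and those offering a strictly lower price:

SELECT
  store,
  region,
  price,
  reg_stores - 1 as competitors, #other stores in the same region
  price_rank - 1 as cheaper_competitors #stores in the same region with a strictly lower price
FROM (
  SELECT
    store,
    region,
    price,
    SUM(1) OVER (PARTITION BY region) as reg_stores,
    RANK() OVER (PARTITION BY region ORDER BY price) as price_rank
  FROM prices
)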
3. Calculate the moving average
This is similar to the first example of calculating an average across a window, but taking sequence into consideration, and it is useful for time series analysis (e.g. checking whether a record has gone way beyond its moving average, possibly an anomaly or something to act on). It requires additional arguments in the OVER clause, specifying not just the PARTITION but also an ORDER as well as the ROWS BETWEEN to consider. Using a stock prices example:
prices:
| date | closing |
| --- | --- |
| 11/16 | 1.39 |
| 11/17 | 1.40 |
| 11/20 | 1.39 |
| 11/21 | 1.42 |
| 11/22 | 1.43 |
| 11/23 | 1.43 |
| 11/24 | 1.43 |
| 11/27 | 1.44 |
| 11/28 | 1.43 |
| 11/29 | 1.44 |
| 11/30 | 1.42 |
| 12/01 | 1.42 |
To obtain a simple moving average of the previous 2 days:
SELECT
  date,
  closing,
  AVG(closing) OVER (ORDER BY date ROWS BETWEEN 2 PRECEDING AND 1 PRECEDING) as mov_avg
FROM prices
ORDER BY date
And here’s the result
| date | closing | mov_avg |
| --- | --- | --- |
| 11/16 | 1.39 | null |
| 11/17 | 1.40 | 1.39 |
| 11/20 | 1.39 | 1.395 |
| 11/21 | 1.42 | 1.395 |
| 11/22 | 1.43 | 1.405 |
| 11/23 | 1.43 | 1.425 |
| 11/24 | 1.43 | 1.43 |
| 11/27 | 1.44 | 1.43 |
| 11/28 | 1.43 | 1.435 |
| 11/29 | 1.44 | 1.435 |
| 11/30 | 1.42 | 1.435 |
| 12/01 | 1.42 | 1.43 |
In this case we did not specify a PARTITION, but you can add one before the ORDER BY if there is a need to. ORDER BY specifies how the rows are ordered before the ROWS BETWEEN frame is applied for the moving average. Other than PRECEDING, you can also use keywords such as FOLLOWING, CURRENT ROW and UNBOUNDED (in place of an integer for the number of rows). This can be used as a simplified and very easy to implement anomaly detection solution for key metrics you'd like to ensure stay within a specified range, while still keeping each individual record in the output. Furthermore, by not aggregating the raw records, you'd know the exact moment when specified thresholds were breached.
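As a minimal sketch of that anomaly check (the 2% threshold is an arbitrary, hypothetical choice), you could wrap the moving average query above and flag the days that stray too far from it:

SELECT
  date,
  closing,
  mov_avg,
  ABS(closing/mov_avg - 1) > 0.02 as is_anomaly #true when closing strays more than 2% from its moving average
FROM [that earlier query]
ORDER BY date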
Ranking Window Functions
4. Getting the first/last occurrence of an event across a group/ de-duplication
You've got records of purchases made by each customer and would like to pick out just the first or last item purchased per customer.
One possible way to achieve this without window functions is to create a subquery grouping by customer ID and obtaining the first/last event time per customer, then inner joining the timestamp and customer ID back to your original data.
But what if there are multiple events happening at that same time? The result would be multiple events per user left in the output. By using the ROW_NUMBER window function, you can assign a non-repeated ranking by a specified PARTITION and a specified ORDER.
purchases:
| event_time | customer | item |
| --- | --- | --- |
| 2017-12-01 11:20:50 | Adam | apple |
| 2017-12-01 12:10:20 | Brittany | pear |
| 2017-12-02 12:15:40 | Brittany | orange |
| 2017-12-02 14:15:10 | Brittany | avocado |
| 2017-12-03 10:20:30 | Adam | banana |
| 2017-12-03 12:50:00 | Adam | mango |
Something like this:
SELECT
  event_time,
  customer,
  item,
  ROW_NUMBER() OVER (PARTITION BY customer ORDER BY event_time) as first,
  ROW_NUMBER() OVER (PARTITION BY customer ORDER BY event_time DESC) as last
FROM purchases
This will assign a non-repeated number within each PARTITION to each record in the specified ORDER; even in the event of a clash at the same timestamp, each record will be assigned a different ROW_NUMBER.
The result:
| event_time | customer | item | first | last |
| --- | --- | --- | --- | --- |
| 2017-12-01 11:20:50 | Adam | apple | 1 | 3 |
| 2017-12-01 12:10:20 | Brittany | pear | 1 | 3 |
| 2017-12-02 12:15:40 | Brittany | orange | 2 | 2 |
| 2017-12-02 14:15:10 | Brittany | avocado | 3 | 1 |
| 2017-12-03 10:20:30 | Adam | banana | 2 | 2 |
| 2017-12-03 12:50:00 | Adam | mango | 3 | 1 |
By using this as an inner query, then wrapping an outer query around it that takes only rows where first=1, you'll get the first purchase per customer, and where last=1 will give you the last purchase per customer. This can also be used as a quicker way to de-duplicate your data, by specifying which conditions to pick out duplicates for, and the priority (ORDER) in which to consider records. The additional benefit of using this to de-duplicate data over using lots of GROUP BYs is that you retain the integrity of the record (the entire record in the output is an actual input record, as compared to aggregating certain values where the raw input is mutated).
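A minimal sketch of that outer query, keeping only each customer's first purchase (swap in last=1 for the most recent one):

SELECT
  event_time,
  customer,
  item
FROM [that earlier query]
WHERE first=1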
5. Getting the top x records and aggregating the rest of the data
You’ve got a list of sales per store, with the number of stores going into the hundreds, and most of the sales you’re interested in coming from just the top 5 stores. How can you output the sales per store for the top 5 stores, and aggregate the rest into ‘others’?
One possible way without window functions is to create a subquery with the same data, ORDER BY sales DESC, then LIMIT 5. Left join that subquery to the original data, and in the outer query, select the store if there is a matching record in the subquery, and 'others' if nothing joins. SUM your sales figures, GROUP BY this new store column, and voila! You should be left with 6 rows: 5 for the sales of the top 5 stores, and 1 for 'others'.
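A hedged sketch of that window-function-free approach, assuming the sales table shown below (the store_label and t5 names are just for illustration, and exact alias/GROUP BY rules vary slightly by dialect):

SELECT
  IF(t5.store IS NOT NULL, s.store, 'others') as store_label, #keep the store name only for the top 5
  SUM(s.sales) as sales
FROM sales as s
LEFT JOIN (
  SELECT store
  FROM sales
  ORDER BY sales DESC
  LIMIT 5
) as t5
ON s.store = t5.store
GROUP BY store_label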
What if there were a more elegant way of producing the same output, but with greater control? Cue window functions! You might have already gotten the idea: assign a ROW_NUMBER to each record in the inner query, then in the outer query, select the store if its rank is less than or equal to 5, and 'others' in any other case.
sales:
| store | sales |
| --- | --- |
| A | 2.00 |
| B | 2.40 |
| C | 1277.00 |
| D | 59.30 |
| E | 48.10 |
| F | 566.90 |
| G | 832.20 |
| H | 832.20 |
| I | 832.20 |
| J | 566.90 |
Other than ROW_NUMBER, you may wish to consider RANK and DENSE_RANK. RANK assigns the same rank to records with the same value in the specified ORDER, increasing the next rank by the number of records with that same previous rank. DENSE_RANK also assigns the same rank to records with the same value in the specified ORDER, but in contrast to RANK, the next rank is always in running number order, regardless of how many records were assigned the previous rank.
SELECT
  store,
  sales,
  ROW_NUMBER() OVER (ORDER BY sales DESC) as row_number,
  RANK() OVER (ORDER BY sales DESC) as rank,
  DENSE_RANK() OVER (ORDER BY sales DESC) as dense_rank
FROM sales
Will give you:
| store | sales | row_number | rank | dense_rank |
| --- | --- | --- | --- | --- |
| A | 2.00 | 10 | 10 | 7 |
| B | 2.40 | 9 | 9 | 6 |
| C | 1277.00 | 1 | 1 | 1 |
| D | 59.30 | 7 | 7 | 4 |
| E | 48.10 | 8 | 8 | 5 |
| F | 566.90 | 5 | 5 | 3 |
| G | 832.20 | 2 | 2 | 2 |
| H | 832.20 | 3 | 2 | 2 |
| I | 832.20 | 4 | 2 | 2 |
| J | 566.90 | 6 | 5 | 3 |
If you'd just like to see exactly 5 stores unaggregated, you can do an outer query with a field defined like this:
IF(row_number<=5, store, 'others') as store,
to replace the store column. If, in the case of a tie at 5th position, you'd want all stores at that tie to show up, you can use rank in place of row_number. And finally, if you are more interested in all stores with the top 5 sales values, regardless of the number of stores at each value, you can use dense_rank instead. As you can see, window functions make the same analysis much simpler and give you greater control.
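Putting it all together, a minimal sketch of the full query (store_label is a hypothetical alias; depending on your dialect you may need to GROUP BY the expression or its position instead):

SELECT
  IF(row_number<=5, store, 'others') as store_label,
  SUM(sales) as sales
FROM (
  SELECT
    store,
    sales,
    ROW_NUMBER() OVER (ORDER BY sales DESC) as row_number
  FROM sales
)
GROUP BY store_label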
Note: this may not always be the most straightforward way of obtaining the top X records. Say, for example, that the 10 stores fall into 5 groups, and you'd instead like to single out the stores belonging to the top 2 groups and aggregate the rest. In such a case, where multiple aggregations need to be completed, it may be simpler to just create a subquery defining the groups you want singled out, then LEFT or INNER JOIN it back to the main query.
6. Performing a multi-touch attribution analysis
If you're doing marketing for an e-commerce platform, you might be interested in performing a multi-touch attribution analysis to more accurately evaluate the performance of any channel, campaign, adgroup/adset, or copy. If you've got your users' conversion information and information on their various interactions with your campaigns, you can, simply by using a couple of window functions, perform a variety of multi-touch attribution analyses, such as:
- First touch – the first channel/campaign the user interacted with
- Last touch – last channel/campaign the user interacted with before making a purchase or conversion
- Linear – dividing the conversion value equally across all interaction points prior to purchase/conversion
- U shape – higher weight to first and last touch points, equally divided for middle touch points
- Time decay – higher weight for the most recent touch points, gradually decreasing in weight for the earlier ones
- And many more, limited only by your imagination
Maybe an example will make this clearer (with lots of simplifications):
purchases:
| date | customer | item | value |
| --- | --- | --- | --- |
| 2017-11-29 | Adam | socks | 8.99 |
| 2017-12-01 | Brittany | book | 4.90 |
sessions:
| date | customer | channel |
| --- | --- | --- |
| 2017-11-20 | Adam | Banner Ad |
| 2017-11-24 | Adam | Search Ad |
| 2017-11-25 | Adam | Retargeting Ad |
| 2017-11-29 | Brittany | Search Ad |
| 2017-11-29 | Adam | Banner Ad |
| 2017-12-01 | Brittany | Google Search |
| 2017-12-02 | Adam | Google Search |
Let’s do a LEFT JOIN of the purchases to the sessions table on customer to get all the interactions by that customer with your website. On this result, you’ll want to exclude sessions that happen after the purchase, and if you are considering a wide date range for your data you may want to restrict campaigns associated with a purchase to a smaller timeframe (perhaps 30 days). Finally, perform a couple of window functions over this result and you’re almost there.
SELECT
  p.date as purch_date,
  p.customer as customer,
  p.item as item,
  p.value as value,
  s.date as sess_date,
  s.channel as channel,
  ROW_NUMBER() OVER (PARTITION BY customer, purch_date, item ORDER BY sess_date) as sess_asc,
  ROW_NUMBER() OVER (PARTITION BY customer, purch_date, item ORDER BY sess_date DESC) as sess_desc,
  SUM(1) OVER (PARTITION BY customer, purch_date, item) as sess_ttl
FROM purchases as p
LEFT JOIN sessions as s ON p.customer=s.customer
WHERE s.date<=p.date #to get only sessions before and up to the purchase
AND DATEDIFF(p.date, s.date)<=30 #to get sessions within 30 days prior to purchase
What we’ve got so far:
| purch_date | customer | item | value | sess_date | channel | sess_asc | sess_desc | sess_ttl |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2017-11-29 | Adam | socks | 8.99 | 2017-11-20 | Banner Ad | 1 | 4 | 4 |
| 2017-11-29 | Adam | socks | 8.99 | 2017-11-24 | Search Ad | 2 | 3 | 4 |
| 2017-11-29 | Adam | socks | 8.99 | 2017-11-25 | Retargeting Ad | 3 | 2 | 4 |
| 2017-11-29 | Adam | socks | 8.99 | 2017-11-29 | Banner Ad | 4 | 1 | 4 |
| 2017-12-01 | Brittany | book | 4.90 | 2017-11-29 | Search Ad | 1 | 2 | 2 |
| 2017-12-01 | Brittany | book | 4.90 | 2017-12-01 | Google Search | 2 | 1 | 2 |
With this as an inner query, you can easily get the first touch, last touch and linear attribution models out in an outer query. At that level, if you just want the attributed revenue per channel/campaign, you no longer need to carry the purchase item, date and value as grouping columns.
SELECT
  sess_date,
  channel,
  SUM(IF(sess_asc=1, value, 0)) as first_touch,
  SUM(IF(sess_desc=1, value, 0)) as last_touch,
  SUM(value/sess_ttl) as linear
FROM [that earlier query]
GROUP BY sess_date, channel
As for a U-shape (position-based) attribution model, things get a little more complex, but it is nonetheless still manageable in a few lines of SQL. We'll use a CASE statement to assign different percentages to apply to each conversion value, based on whether it is the first, last, or a middle touch point, and also depending on the number of touch points. Let's look at the following possible cases.
- There are only 2 touch points. In that case we'll want to assign an equal value (50%) to each session: 50% to the session where sess_asc=1 and 50% to the session where sess_desc=1 (which are the only 2 touch points).
- There is only 1 touch point. In this case we’ll want to assign all credit to that touch point, but as this touch point/session will have sess_asc=1 and sess_desc=1, you can use the same logic as the earlier case of 2 touch points.
- Assuming we’ll want to assign 40% each to the first and last touch point, and 20% spread between the middle ones, for all conversions with 3 or more touch points, we can just assign 40% to a touch point if sess_asc=1 or sess_desc=1, and in any other cases, divide 20% equally among all touch points excluding the first and last (sess_ttl-2).
With that in mind, it’ll look something like this…
SUM(
  CASE
    WHEN sess_ttl<=2 THEN IF(sess_asc=1, 0.5, 0)+IF(sess_desc=1, 0.5, 0)
    WHEN sess_asc=1 OR sess_desc=1 THEN 0.4
    ELSE 0.2/(sess_ttl-2)
  END * value
) as U_shape
And there you have it! Lots more still possible with attribution models.
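For instance, here is a hedged sketch of a time-decay model, again building on [that earlier query]: the 7-day half-life and the POW-based weighting are arbitrary, hypothetical choices, and a second window function normalises the weights per conversion so each conversion still distributes exactly its full value.

SELECT
  sess_date,
  channel,
  SUM(value * decay_weight / ttl_weight) as time_decay
FROM (
  SELECT
    *,
    POW(0.5, DATEDIFF(purch_date, sess_date)/7) as decay_weight, #weight halves for every 7 days between session and purchase
    SUM(POW(0.5, DATEDIFF(purch_date, sess_date)/7)) OVER (PARTITION BY customer, purch_date, item) as ttl_weight
  FROM [that earlier query]
)
GROUP BY sess_date, channel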
Value Window Functions
7. Calculating the time interval between consecutive events
You’ve got a stream of web visits data coming in at irregular time intervals and would like to analyse these time intervals.
pageviews (assume for 1 user, or use a PARTITION to get these metrics by user):
| event_time | page |
| --- | --- |
| 2017-12-01 11:20:50 | about |
| 2017-12-01 12:10:20 | homepage |
| 2017-12-01 12:15:40 | products |
| 2017-12-01 14:15:10 | homepage |
| 2017-12-01 18:20:30 | deal_of_the_day |
| 2017-12-01 20:50:00 | products |
By using a LAG or LEAD window function, you can obtain the previous or next value of a specified column in your specified PARTITION and ORDER. That will produce an output where you can compare the previous/ next value to the current value, and from there do things such as examine if there are irregular gaps which may point to data dropping off, pages with a high bounce rate or pages where users are staying for a long time, or if these intervals change throughout the day.
SELECT
  event_time,
  page,
  LAG(event_time) OVER (ORDER BY event_time) as lag,
  LEAD(event_time) OVER (ORDER BY event_time) as lead
FROM pageviews
And get this:
| event_time | page | lag | lead |
| --- | --- | --- | --- |
| 2017-12-01 11:20:50 | about | null | 2017-12-01 12:10:20 |
| 2017-12-01 12:10:20 | homepage | 2017-12-01 11:20:50 | 2017-12-01 12:15:40 |
| 2017-12-01 12:15:40 | products | 2017-12-01 12:10:20 | 2017-12-01 14:15:10 |
| 2017-12-01 14:15:10 | homepage | 2017-12-01 12:15:40 | 2017-12-01 18:20:30 |
| 2017-12-01 18:20:30 | deal_of_the_day | 2017-12-01 14:15:10 | 2017-12-01 20:50:00 |
| 2017-12-01 20:50:00 | products | 2017-12-01 18:20:30 | null |
By performing operations over this, such as taking the difference between event_time and its lag, you can analyse how the time intervals between events differ across pages, and how they change throughout the day.
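A minimal sketch of that difference calculation, assuming event_time is a TIMESTAMP and BigQuery Standard SQL's TIMESTAMP_DIFF (use the equivalent date/time arithmetic in your dialect):

SELECT
  event_time,
  page,
  TIMESTAMP_DIFF(event_time, lag, SECOND) as secs_since_prev #seconds since the previous pageview
FROM [that earlier query]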
8. Calculating gap to nearest competitor/ best competitor
You’ve got a list of stores with various sales performance metrics such as total sales, average value per sale, number of customers, and so on, and would like to be able to obtain, with each record, the metrics for the next best store, as well as the best in its group of stores. You can use a LEAD or LAG window function, similar to the previous example, to get these metrics by specifying the ORDER which you’ll use to evaluate the ‘best store’, and PARTITION if you’ll also like to compare them in groups. FIRST_VALUE and LAST_VALUE function similarly, but obtain the first or last value within the specified PARTITION, in the specified ORDER.
performance:
| store | region | ttl_sales | avg_sale | cust_base |
| --- | --- | --- | --- | --- |
| A | East | 1000 | 5.70 | 120 |
| B | East | 1200 | 4.50 | 160 |
| C | South | 1050 | 4.80 | 140 |
| D | East | 900 | 5.60 | 190 |
| E | South | 900 | 3.60 | 100 |
| D | South | 950 | 4.20 | 180 |
Let’s say for total sales, we’d like to benchmark each store against the store with the highest sales per region. For average sale, we’d like to benchmark each store against the next better performer in terms of avg_sale regardless of region. For customer base, we’ll benchmark each store against the store with the highest sales regardless of region. To do that:
SELECT
  store,
  region,
  ttl_sales as t_sales,
  avg_sale as a_sale,
  cust_base as c_base,
  FIRST_VALUE(ttl_sales) OVER (PARTITION BY region ORDER BY ttl_sales DESC) as t_sales_bm,
  LEAD(avg_sale) OVER (ORDER BY avg_sale) as a_sale_bm,
  FIRST_VALUE(cust_base) OVER (ORDER BY ttl_sales DESC) as c_base_bm
FROM performance
| store | region | t_sales | a_sale | c_base | t_sales_bm | a_sale_bm | c_base_bm |
| --- | --- | --- | --- | --- | --- | --- | --- |
| A | East | 1000 | 5.70 | 120 | 1200 | null | 160 |
| B | East | 1200 | 4.50 | 160 | 1200 | 4.80 | 160 |
| C | South | 1050 | 4.80 | 140 | 1050 | 5.60 | 160 |
| D | East | 900 | 5.60 | 190 | 1200 | 5.70 | 160 |
| E | South | 900 | 3.60 | 100 | 1050 | 4.20 | 160 |
| D | South | 950 | 4.20 | 180 | 1050 | 4.50 | 160 |
Again, many different angles of analysis are possible by tweaking the parameters and applying operations in an outer query over this.
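For example, a minimal sketch of the gap against each benchmark:

SELECT
  store,
  region,
  t_sales - t_sales_bm as t_sales_gap, #gap to the best store in the region
  a_sale - a_sale_bm as a_sale_gap, #gap to the next better average sale
  c_base - c_base_bm as c_base_gap #gap to the customer base of the top-selling store
FROM [that earlier query]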
9. Calculating changes in value over consecutive time periods – tracking performance changes
This is similar to example #7, except that in this case you're not analysing the time difference itself, but some other value of interest. For example, you may have monthly sales volumes and would like to calculate the month-on-month growth.
sales:
| month | sales |
| --- | --- |
| 2017-06 | 1200 |
| 2017-07 | 1400 |
| 2017-08 | 1500 |
| 2017-09 | 1300 |
| 2017-10 | 1000 |
First, we use the LAG window function to obtain sales figures from the previous month.
SELECT
  month,
  sales,
  LAG(sales) OVER (ORDER BY month) as last_mth
FROM sales
Now we can compare each month’s sales to the previous month:
| month | sales | last_mth |
| --- | --- | --- |
| 2017-06 | 1200 | null |
| 2017-07 | 1400 | 1200 |
| 2017-08 | 1500 | 1400 |
| 2017-09 | 1300 | 1500 |
| 2017-10 | 1000 | 1300 |
To calculate the month-on-month growth, we can use something such as:
SELECT
  month,
  sales,
  ROUND((sales/last_mth-1)*100,2) as growth #in percent
FROM [that earlier query]
Here it is:
| month | sales | growth |
| --- | --- | --- |
| 2017-06 | 1200 | null |
| 2017-07 | 1400 | 16.67 |
| 2017-08 | 1500 | 7.14 |
| 2017-09 | 1300 | -13.33 |
| 2017-10 | 1000 | -23.08 |
And on to our last example.
10. Finding the sequence of events
Let's go back to the example in #6, where you have some conversion event of interest and a list of sessions leading up to that event. However, instead of attributing the eventual revenue/value across different touch points, we'd now like to examine the entire interaction path completed by the user and analyse whether there are patterns across different conversion paths. Retrieving the data again:
purchases:
| date | customer | item | value |
| --- | --- | --- | --- |
| 2017-11-29 | Adam | socks | 8.99 |
| 2017-12-01 | Brittany | book | 4.90 |
sessions:
| date | customer | channel |
| --- | --- | --- |
| 2017-11-20 | Adam | Banner Ad |
| 2017-11-24 | Adam | Search Ad |
| 2017-11-25 | Adam | Retargeting Ad |
| 2017-11-29 | Brittany | Search Ad |
| 2017-11-29 | Adam | Banner Ad |
| 2017-12-01 | Brittany | Google Search |
| 2017-12-02 | Adam | Google Search |
We'll do a LEFT JOIN again, but this time, besides adding sess_desc to identify the final touch point before conversion (and for de-duplication), we'll add a couple of LAG window functions to trace the path. If we are just interested in the last 3 events leading up to the conversion:
SELECT
  purch_date,
  customer,
  item,
  value,
  CONCAT(
    IFNULL(CONCAT(lag_2, '-'), ''),
    IFNULL(CONCAT(lag_1, '-'), ''),
    IFNULL(channel, '')
  ) as conversion_path
FROM (
  SELECT
    p.date as purch_date,
    p.customer as customer,
    p.item as item,
    p.value as value,
    s.date as sess_date,
    s.channel as channel, #we'll keep this to be able to identify unique conversions later
    ROW_NUMBER() OVER (PARTITION BY customer, purch_date, item ORDER BY sess_date DESC) as sess_desc,
    LAG(channel, 2) OVER (PARTITION BY customer, purch_date, item ORDER BY sess_date) as lag_2,
    LAG(channel, 1) OVER (PARTITION BY customer, purch_date, item ORDER BY sess_date) as lag_1
  FROM purchases as p
  LEFT JOIN sessions as s ON p.customer=s.customer
  WHERE s.date<=p.date #to get only sessions before and up to the purchase
  AND DATEDIFF(p.date, s.date)<=30 #to get sessions within 30 days prior to purchase
)
WHERE sess_desc=1
LAG can take in an optional 2nd argument which is the number of records backwards to check. With the above query, we’ll be left with:
| purch_date | customer | item | value | conversion_path |
| --- | --- | --- | --- | --- |
| 2017-11-29 | Adam | socks | 8.99 | Search Ad-Retargeting Ad-Banner Ad |
| 2017-12-01 | Brittany | book | 4.90 | Search Ad-Google Search |
By grouping by the different conversion paths after that, you'll be able to analyse and evaluate the performance of different paths, as well as how different campaigns/channels contribute to them.
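A minimal sketch of that grouping step:

SELECT
  conversion_path,
  COUNT(1) as conversions,
  SUM(value) as revenue
FROM [that earlier query]
GROUP BY conversion_path
ORDER BY revenue DESC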
Closing
And that wraps everything up! I was considering doing an additional post on window function syntax in BigQuery SQL (either Legacy or Standard), but having looked through BigQuery's documentation, I felt that this may not be necessary. Please do, however, let me know if a post on syntax would be helpful. Have you used window functions to perform your analysis with SQL? Any other examples where window functions can make analysis much simpler? Which of the above ideas is your favorite? Share your experience and thoughts in the comments below! Till the next time, thanks for reading 🙂