Often in analytics, we’d want to find that one number that is representative of a sample to be able to briefly describe it. There are a couple of ways this can be done quickly in BigQuery. Depending on the situation, you might want to use a different method to summarise the statistics of a group or sample to a single number. Here are some possible ideas, starting with the most basic ones, and then some slightly more complicated ones.
The following examples assume you’d want to get a representative estimate per category for parameter value. Given BigQuery’s increasing emphasis on StandardSQL, I’ll be providing the StandardSQL syntax implementation of each method. If the legacy SQL syntax is more relevant to you and you need help with them, please leave a comment below and I’ll add the legacy SQL syntax for these too 🙂
JUMP TO:
Mean
This is probably the most straightforward. Given a large sample size that is symmetrically distributed without outliers using the mean (average) is probably the best option. The usual reason for not using the mean is when there are outliers.
This is also the easiest to implement in BigQuery. More details on BigQuery’s StandardSQL AVG function here.
SELECT category, AVG(value) as mean FROM data GROUP BY category
Median
Medians are great when conditions are similar to the above, but where there are outliers present. It is also sometimes considered when there is skew in the data.
Unfortunately in BigQuery StandardSQL median calculation seems to only be available as an analytic function and not an aggregate function, whereas it was available as a both in Legacy SQL. What this means is that you have to first add medians as a new column to your data as a subquery, then select only one row per group/aggregate in the outer query. More on the PERCENTILE_CONT function here.
SELECT category, ANY_VALUE(median) as median FROM (SELECT category, value, PERCENTILE_CONT(value, 0.5) OVER (PARTITION BY category) as median FROM data) GROUP BY category
Mode
The mode is the most frequently occurring value, but probably one of the less frequently used method of parameter estimation. In some cases the mode would be the value we would want to use to describe the data. In other cases it would not be representative of the data at all, and this depends on the context of the problem. One limitation is that there may be more than one value that is the most frequently occurring, and these values may be far apart. In this case it may be difficult to pick the value to represent the category/data. Mode is least affected by outliers and skews.
There seems to be a native MODE function in Cloud Dataprep but not in BigQuery. In BigQuery StandardSQL you’d need to first get the frequency rank of each value, then get values where the frequency rank=1, and select an aggregation method to deal with cases where there is more than one most frequently occurring value. More on RANK and other ranking functions here.
SELECT category, AVG(value) as mode_1, ANY_VALUE(value) as mode_2, MIN(value) as mode_3, MAX(value) as mode_4 FROM (SELECT category, value, RANK() OVER (PARTITION BY category ORDER BY SUM(1) DESC) as frequency_rank FROM data GROUP BY category, value) WHERE frequency_rank=1 GROUP BY category
Trimmed mean/ Truncated mean
An adaptation of the mean, the trimmed mean allows you to remove effects of outliers while still capturing as much information from the data as possible. The trimming/ truncating removes a certain number or percentage of data points from both ends of the spectrum. This allows you to slide on a scale between mean and median: if you trimmed the data all the way to the middle (a fully truncated mean), you’d be left with one value in the middle – the median. However, doing so would mean discarding some of the additional data points to the left and right of it which may contain valuable information. More on trimmed/truncated mean from Wikipedia here.
To implement this in BigQuery a similar method to calculating the median is adopted. First we calculate the cutoff values for each category, then average out the remaining values. To calculate a 25% trimmed mean:
SELECT category, AVG(value) as trimmed_mean, FROM (SELECT category, value, PERCENTILE_CONT(value, .25) OVER (PARTITION BY category) as lower_bound, PERCENTILE_CONT(value, .75) OVER (PARTITION BY category) as upper_bound, FROM data) WHERE value BETWEEN lower_bound AND upper_bound GROUP BY category
Bayesian Estimate
This is a simplified version of a Bayesian estimate but basically leverages on the idea of priors and posteriors, where we have some estimate of a larger group that we are quite confident about, and an estimate for a subset of that group that we are less confident about, perhaps due to a small sample size. When such a situation arises, instead of simply picking one over the other, we can slide between these two values, more towards the prior if there is a tiny amount of data about the subset, or more towards the posterior, when we have much more data on the subset that we feel confident about using its data.
Despite the simplification, implementation is a little longer compared to the earlier cases. We first calculate a parameter estimate for the entire group as well as for the subsets, then build a slider that takes in the means, as well as the weight we place on the prior estimate. For the parameter estimate we may decide to use any of the above methods. Unlike the earlier methods, this method has a additional parameter to be tuned, the weight placed on the prior. Using a weight of 500:
SELECT category, (500*ANY_VALUE(prior_estimate)+SUM(1)*AVG(value))/(500+SUM(1)) as adjusted_estimate FROM (SELECT category, value, AVG(value) OVER () as prior_estimate FROM data) GROUP BY category
Note that in this case SUM(1)*AVG(value) can also be replaced with SUM(value) if value is never NULL. If value is sometimes NULL, the above implementation would replace NULLs with the average value. To ignore NULLs from the calculation you can use:
SELECT category, (500*ANY_VALUE(prior_estimate)+SUM(value))/(500+COUNT(value)) as adjusted_estimate FROM (SELECT category, value, AVG(value) OVER () as prior_estimate FROM data) GROUP BY category
That’s all I have for this quick post. Any thoughts or comments? Let me know below!