Random Sampling with BigQuery

BigQuery is great for storing huge amounts of data over which to run analytics. But sometimes, what we need is just a sample of a dataset to for a dry run. This may be so when we intend to run the analysis outside of BigQuery, which may be in an environment less suited to process the billions of rows of data. Let’s dive in and figure out how to easily sample your data in BigQuery.

The RAND() Function

The RAND() returns a float between 0 and 1. The Legacy SQL documentation for RAND() here and Standard SQL documentation for RAND() here. With this we’re ready to begin!

Option 1: Getting a Specific Number of Rows

If you’d like to get a random sample of 1000 rows, you can simply ORDER BY the newly created column and LIMIT 1000. Testing this out on the NYC Taxi and Limousine Trips dataset in BigQuery, a fairly large dataset with 1.1 billion rows:

In Legacy SQL:

SELECT
RAND() as x
FROM [nyc-tlc:yellow.trips]
ORDER BY x
LIMIT 1000

And in Standard SQL:

#standardSQL
SELECT
RAND() as x
FROM `nyc-tlc.yellow.trips`
ORDER BY x
LIMIT 1000

Sometimes, instead of a specific number of rows, we’d want a percentage sample of the dataset. Let’s get a 10% sample and see how we can generalise it to other percentages.

Option 2: Getting a percentage sample using integer and modulo functions

Steps:

Create a field with RAND()
Multiply field by 10
Get modulo of step 2
For 10% sample, select where step 3 equals any number between 0 and 9

And since this logic is entirely self contained, we can place this in the where clause and not have to print the output as an additional column.

In Legacy SQL:

SELECT
SUM(1) as count
FROM [nyc-tlc:yellow.trips]
WHERE INTEGER(RAND()*10)%10=0

returned 110,890,150, approximately 10.001% of the entire dataset (1,108,779,463 rows). Running it again will give a different result but it should be in that ballpark.

The same in Standard SQL:

#standardSQL
SELECT
SUM(1) as count
FROM `nyc-tlc.yellow.trips`
WHERE MOD(CAST(RAND()*10 AS INT64),10)=0

And on a random run I got 110,875,761, again very close to 10% (9.9998%).

Selecting =0, =1, … =9 should return approximately the same result.

To get a 50% sample you could do a modulo with 2 instead, using remainder of 0 or 1.

To get a 1% sample you can multiply by 100 instead (of 10), and use a modulo of 100.

For a 30% sample, instead of ‘=0’ you might put ‘IN (0,1,2)’ or any 3 numbers between 0 and 9.

And so on. You get the drift 😉

Done! Random Sampling with BigQuery

With that you can now make your huge dataset into something more manageable for exporting into your local machine for a quick dry run analysis before launching the entire dataset into whatever you’re building.

Hi Yu! That’s an interesting question, I’ve not thought about it before. One way I can think of offhand without the use of UDFs is by cross joining your dataset with a dummy set, then applying the same concept on the result. Something like that in Legacy SQL:

SELECT X, y FROM
(SELECT 1 as X, 2 as y) as base
CROSS JOIN
(SELECT * FROM
(SELECT NULL as var),
(SELECT NULL as var),
(SELECT NULL as var)
) as dummy

or in Standard SQL:

SELECT X, y FROM
(SELECT 1 as X, 2 as y) as base
CROSS JOIN
(SELECT * FROM
(SELECT NULL as var) UNION ALL
(SELECT NULL as var) UNION ALL
(SELECT NULL as var)
) as dummy

The more you duplicate it by should make the result more random but may also slow the query. From the duplicated dataset you can follow the earlier steps for random sampling to get any sized sample you’d like. There must be a more elegant way than the above to get the desired result but that’s what I have at the top of my head. Let me know if you think of a better solution! 🙂

4 thoughts on “Random Sampling with BigQuery”

Yu May 18, 2019 Reply

Hi, Thank you for your post. Just one question: how to do sample with replacement in BigQuery? Say I have a population of 1000, and I want to get a random sample of size 1500 using sample with replacement.
- Aaron Post authorMay 20, 2019 Reply
  
  Hi Yu! That’s an interesting question, I’ve not thought about it before. One way I can think of offhand without the use of UDFs is by cross joining your dataset with a dummy set, then applying the same concept on the result. Something like that in Legacy SQL:
  
  SELECT X, y FROM
  (SELECT 1 as X, 2 as y) as base
  CROSS JOIN
  (SELECT * FROM
  (SELECT NULL as var),
  (SELECT NULL as var),
  (SELECT NULL as var)
  ) as dummy
  
  or in Standard SQL:
  
  SELECT X, y FROM
  (SELECT 1 as X, 2 as y) as base
  CROSS JOIN
  (SELECT * FROM
  (SELECT NULL as var) UNION ALL
  (SELECT NULL as var) UNION ALL
  (SELECT NULL as var)
  ) as dummy
  
  The more you duplicate it by should make the result more random but may also slow the query. From the duplicated dataset you can follow the earlier steps for random sampling to get any sized sample you’d like. There must be a more elegant way than the above to get the desired result but that’s what I have at the top of my head. Let me know if you think of a better solution! 🙂
Luis October 14, 2020 Reply

Hi Aaron, What if I would like to get 32% or 35% samples? Thanks
- Aaron Post authorDecember 4, 2020 Reply
  
  Hi Luis, maybe you can try something like RAND()<0.32

The RAND() Function

Option 1: Getting a Specific Number of Rows

Option 2: Getting a percentage sample using integer and modulo functions

Done! Random Sampling with BigQuery

4 thoughts on “Random Sampling with BigQuery”

Leave a Reply Cancel reply