BigQuery is great for storing huge amounts of data over which to run analytics. But sometimes, what we need is just a sample of a dataset to for a dry run. This may be so when we intend to run the analysis outside of BigQuery, which may be in an environment less suited to process the billions of rows of data. Let’s dive in and figure out how to easily sample your data in BigQuery.
The RAND() Function
The RAND() returns a float between 0 and 1. The Legacy SQL documentation for RAND() here and Standard SQL documentation for RAND() here. With this we’re ready to begin!
Option 1: Getting a Specific Number of Rows
If you’d like to get a random sample of 1000 rows, you can simply ORDER BY the newly created column and LIMIT 1000. Testing this out on the NYC Taxi and Limousine Trips dataset in BigQuery, a fairly large dataset with 1.1 billion rows:
In Legacy SQL:
SELECT RAND() as x FROM [nyc-tlc:yellow.trips] ORDER BY x LIMIT 1000
And in Standard SQL:
#standardSQL SELECT RAND() as x FROM `nyc-tlc.yellow.trips` ORDER BY x LIMIT 1000
Sometimes, instead of a specific number of rows, we’d want a percentage sample of the dataset. Let’s get a 10% sample and see how we can generalise it to other percentages.
Option 2: Getting a percentage sample using integer and modulo functions
Steps:
- Create a field with RAND()
- Multiply field by 10
- Get modulo of step 2
- For 10% sample, select where step 3 equals any number between 0 and 9
And since this logic is entirely self contained, we can place this in the where clause and not have to print the output as an additional column.
In Legacy SQL:
SELECT SUM(1) as count FROM [nyc-tlc:yellow.trips] WHERE INTEGER(RAND()*10)%10=0
returned 110,890,150, approximately 10.001% of the entire dataset (1,108,779,463 rows). Running it again will give a different result but it should be in that ballpark.
The same in Standard SQL:
#standardSQL SELECT SUM(1) as count FROM `nyc-tlc.yellow.trips` WHERE MOD(CAST(RAND()*10 AS INT64),10)=0
And on a random run I got 110,875,761, again very close to 10% (9.9998%).
Selecting =0, =1, … =9 should return approximately the same result.
To get a 50% sample you could do a modulo with 2 instead, using remainder of 0 or 1.
To get a 1% sample you can multiply by 100 instead (of 10), and use a modulo of 100.
For a 30% sample, instead of ‘=0’ you might put ‘IN (0,1,2)’ or any 3 numbers between 0 and 9.
And so on. You get the drift 😉
Done! Random Sampling with BigQuery
With that you can now make your huge dataset into something more manageable for exporting into your local machine for a quick dry run analysis before launching the entire dataset into whatever you’re building.
Hi, Thank you for your post. Just one question: how to do sample with replacement in BigQuery? Say I have a population of 1000, and I want to get a random sample of size 1500 using sample with replacement.
Hi Yu! That’s an interesting question, I’ve not thought about it before. One way I can think of offhand without the use of UDFs is by cross joining your dataset with a dummy set, then applying the same concept on the result. Something like that in Legacy SQL:
SELECT X, y FROM
(SELECT 1 as X, 2 as y) as base
CROSS JOIN
(SELECT * FROM
(SELECT NULL as var),
(SELECT NULL as var),
(SELECT NULL as var)
) as dummy
or in Standard SQL:
SELECT X, y FROM
(SELECT 1 as X, 2 as y) as base
CROSS JOIN
(SELECT * FROM
(SELECT NULL as var) UNION ALL
(SELECT NULL as var) UNION ALL
(SELECT NULL as var)
) as dummy
The more you duplicate it by should make the result more random but may also slow the query. From the duplicated dataset you can follow the earlier steps for random sampling to get any sized sample you’d like. There must be a more elegant way than the above to get the desired result but that’s what I have at the top of my head. Let me know if you think of a better solution! 🙂
Hi Aaron, What if I would like to get 32% or 35% samples? Thanks
Hi Luis, maybe you can try something like RAND()<0.32