Random Sampling with BigQuery

Aaron

6 years ago

Sampling

BigQuery is great for storing huge amounts of data over which to run analytics. But sometimes, what we need is just a sample of a dataset to for a dry run. This may be so when we intend to run the analysis outside of BigQuery, which may be in an environment less suited to process the billions of rows of data. Let’s dive in and figure out how to easily sample your data in BigQuery.

The RAND() Function

The RAND() returns a float between 0 and 1. The Legacy SQL documentation for RAND() here and Standard SQL documentation for RAND() here. With this we’re ready to begin!

Option 1: Getting a Specific Number of Rows

If you’d like to get a random sample of 1000 rows, you can simply ORDER BY the newly created column and LIMIT 1000. Testing this out on the NYC Taxi and Limousine Trips dataset in BigQuery, a fairly large dataset with 1.1 billion rows:

In Legacy SQL:

SELECT
RAND() as x
FROM [nyc-tlc:yellow.trips]
ORDER BY x
LIMIT 1000

And in Standard SQL:

#standardSQL
SELECT
RAND() as x
FROM `nyc-tlc.yellow.trips`
ORDER BY x
LIMIT 1000

Sometimes, instead of a specific number of rows, we’d want a percentage sample of the dataset. Let’s get a 10% sample and see how we can generalise it to other percentages.

Option 2: Getting a percentage sample using integer and modulo functions

Steps:

Create a field with RAND()
Multiply field by 10
Get modulo of step 2
For 10% sample, select where step 3 equals any number between 0 and 9

And since this logic is entirely self contained, we can place this in the where clause and not have to print the output as an additional column.

In Legacy SQL:

SELECT
SUM(1) as count
FROM [nyc-tlc:yellow.trips]
WHERE INTEGER(RAND()*10)%10=0

returned 110,890,150, approximately 10.001% of the entire dataset (1,108,779,463 rows). Running it again will give a different result but it should be in that ballpark.

The same in Standard SQL:

#standardSQL
SELECT
SUM(1) as count
FROM `nyc-tlc.yellow.trips`
WHERE MOD(CAST(RAND()*10 AS INT64),10)=0

And on a random run I got 110,875,761, again very close to 10% (9.9998%).

Selecting =0, =1, … =9 should return approximately the same result.

To get a 50% sample you could do a modulo with 2 instead, using remainder of 0 or 1.

To get a 1% sample you can multiply by 100 instead (of 10), and use a modulo of 100.

For a 30% sample, instead of ‘=0’ you might put ‘IN (0,1,2)’ or any 3 numbers between 0 and 9.

And so on. You get the drift 😉

Done! Random Sampling with BigQuery

With that you can now make your huge dataset into something more manageable for exporting into your local machine for a quick dry run analysis before launching the entire dataset into whatever you’re building.