BigQuery is great for storing huge amounts of data over which to run analytics. But sometimes, what we need is just a sample of a dataset to for a dry run. This may be so when we intend to run the analysis outside of BigQuery, which may be in an environment less suited to process the billions of rows of data. Let’s dive in and figure out how to easily sample your data in BigQuery.
The RAND() Function
Option 1: Getting a Specific Number of Rows
If you’d like to get a random sample of 1000 rows, you can simply ORDER BY the newly created column and LIMIT 1000. Testing this out on the NYC Taxi and Limousine Trips dataset in BigQuery, a fairly large dataset with 1.1 billion rows:
In Legacy SQL:
SELECT RAND() as x FROM [nyc-tlc:yellow.trips] ORDER BY x LIMIT 1000
And in Standard SQL:
#standardSQL SELECT RAND() as x FROM `nyc-tlc.yellow.trips` ORDER BY x LIMIT 1000
Sometimes, instead of a specific number of rows, we’d want a percentage sample of the dataset. Let’s get a 10% sample and see how we can generalise it to other percentages.
Option 2: Getting a percentage sample using integer and modulo functions
- Create a field with RAND()
- Multiply field by 10
- Get modulo of step 2
- For 10% sample, select where step 3 equals any number between 0 and 9
And since this logic is entirely self contained, we can place this in the where clause and not have to print the output as an additional column.
In Legacy SQL:
SELECT SUM(1) as count FROM [nyc-tlc:yellow.trips] WHERE INTEGER(RAND()*10)%10=0
returned 110,890,150, approximately 10.001% of the entire dataset (1,108,779,463 rows). Running it again will give a different result but it should be in that ballpark.
The same in Standard SQL:
#standardSQL SELECT SUM(1) as count FROM `nyc-tlc.yellow.trips` WHERE MOD(CAST(RAND()*10 AS INT64),10)=0
And on a random run I got 110,875,761, again very close to 10% (9.9998%).
Selecting =0, =1, … =9 should return approximately the same result.
To get a 50% sample you could do a modulo with 2 instead, using remainder of 0 or 1.
To get a 1% sample you can multiply by 100 instead (of 10), and use a modulo of 100.
For a 30% sample, instead of ‘=0’ you might put ‘IN (0,1,2)’ or any 3 numbers between 0 and 9.
And so on. You get the drift 😉
Done! Random Sampling with BigQuery
With that you can now make your huge dataset into something more manageable for exporting into your local machine for a quick dry run analysis before launching the entire dataset into whatever you’re building.