Random Sampling with BigQuery

BigQuery is great for storing huge amounts of data over which to run analytics. But sometimes, what we need is just a sample of a dataset to for a dry run. This may be so when we intend to run the analysis outside of BigQuery, which may be in an environment less suited to process the billions of rows of data. Let’s dive in and figure out how to easily sample your data in BigQuery.

 

The RAND() Function

The RAND() returns a float between 0 and 1. The Legacy SQL documentation for RAND() here and Standard SQL documentation for RAND() here. With this we’re ready to begin!

 

Option 1: Getting a Specific Number of Rows

If you’d like to get a random sample of 1000 rows, you can simply ORDER BY the newly created column and LIMIT 1000. Testing this out on the NYC Taxi and Limousine Trips dataset in BigQuery, a fairly large dataset with 1.1 billion rows:

 

In Legacy SQL:

SELECT
RAND() as x
FROM [nyc-tlc:yellow.trips]
ORDER BY x
LIMIT 1000

 

And in Standard SQL:

#standardSQL
SELECT
RAND() as x
FROM `nyc-tlc.yellow.trips`
ORDER BY x
LIMIT 1000

 

Sometimes, instead of a specific number of rows, we’d want a percentage sample of the dataset. Let’s get a 10% sample and see how we can generalise it to other percentages.

 

Option 2: Getting a percentage sample using integer and modulo functions

Steps:

  1. Create a field with RAND()
  2. Multiply field by 10
  3. Get modulo of step 2
  4. For 10% sample, select where step 3 equals any number between 0 and 9

 

And since this logic is entirely self contained, we can place this in the where clause and not have to print the output as an additional column.

 

In Legacy SQL:

SELECT
SUM(1) as count
FROM [nyc-tlc:yellow.trips]
WHERE INTEGER(RAND()*10)%10=0

returned 110,890,150, approximately 10.001% of the entire dataset (1,108,779,463 rows). Running it again will give a different result but it should be in that ballpark.

 

The same in Standard SQL:

#standardSQL
SELECT
SUM(1) as count
FROM `nyc-tlc.yellow.trips`
WHERE MOD(CAST(RAND()*10 AS INT64),10)=0

 And on a random run I got 110,875,761, again very close to 10% (9.9998%).

 

Selecting =0, =1, … =9 should return approximately the same result.

To get a 50% sample you could do a modulo with 2 instead, using remainder of 0 or 1.

To get a 1% sample you can multiply by 100 instead (of 10), and use a modulo of 100.

For a 30% sample, instead of ‘=0’ you might put ‘IN (0,1,2)’ or any 3 numbers between 0 and 9.

And so on. You get the drift 😉

 

Done! Random Sampling with BigQuery

With that you can now make your huge dataset into something more manageable for exporting into your local machine for a quick dry run analysis before launching the entire dataset into whatever you’re building.

4 thoughts on “Random Sampling with BigQuery

  1. Yu Reply

    Hi, Thank you for your post. Just one question: how to do sample with replacement in BigQuery? Say I have a population of 1000, and I want to get a random sample of size 1500 using sample with replacement.

    • Aaron Post authorReply

      Hi Yu! That’s an interesting question, I’ve not thought about it before. One way I can think of offhand without the use of UDFs is by cross joining your dataset with a dummy set, then applying the same concept on the result. Something like that in Legacy SQL:

      SELECT X, y FROM
      (SELECT 1 as X, 2 as y) as base
      CROSS JOIN
      (SELECT * FROM
      (SELECT NULL as var),
      (SELECT NULL as var),
      (SELECT NULL as var)
      ) as dummy

      or in Standard SQL:

      SELECT X, y FROM
      (SELECT 1 as X, 2 as y) as base
      CROSS JOIN
      (SELECT * FROM
      (SELECT NULL as var) UNION ALL
      (SELECT NULL as var) UNION ALL
      (SELECT NULL as var)
      ) as dummy

      The more you duplicate it by should make the result more random but may also slow the query. From the duplicated dataset you can follow the earlier steps for random sampling to get any sized sample you’d like. There must be a more elegant way than the above to get the desired result but that’s what I have at the top of my head. Let me know if you think of a better solution! 🙂

Leave a Reply

Your email address will not be published. Required fields are marked *