Managing data access controls on BigQuery

Setting up access controls across an organization, even for a small one like ours, isn’t easy. Here’s an approach similar to what we use at Wego to manage access controls on BigQuery.

 

Long story short:

  1. We hadn’t set up access controls properly from the start. Employees who needed to do data analysis for their work (across various functions) had both read and write access to all BigQuery datasets in the common project
  2. The number of employees was growing, and so was the number of people who needed access to data in BigQuery to aid their decision making process
  3. The amount and variety of data we were storing was growing at an increasing rate. Not everyone needs, or should have, access to all of it
  4. We wanted to ensure everyone had access to the data they need for their work. At the same time we did not want to risk a catastrophic failure because of the current system of access controls (or lack thereof)

 

What we needed

  1. Everyone should be able to run queries – though only on datasets they have view access to
  2. Almost no one should need to create datasets
  3. Specific departments or groups should be able to see datasets relevant to them, but not all datasets
  4. New employees should automatically have the basic accesses their departments require
  5. Everyone should be able to share queries through Job Id – this is very useful for collaboration. With just a URL containing that Job Id you should be able to view the query and the results of an analysis another teammate has done
  6. Everyone should be able to write to a temporary dataset used to save larger query results, and export from there to Google Cloud Storage for downloading to their local machines
  7. Other than service accounts, individual user accounts should hardly need to create/edit tables

 

Predefined BigQuery roles on Google Cloud Platform’s IAM

Changes made here apply at the project level, so dataset-specific requirements cannot be handled through IAM. Anything done at this level applies project wide; access to specific datasets has to be configured on the datasets themselves.

Earlier on, in attempting to set up access controls, we assumed that the predefined BigQuery roles followed some sort of hierarchy (i.e. that those at a higher level would have all the access rights of the lower levels, plus more). We also assumed that the predefined roles were all there was to choose from. Both assumptions were wrong.

With that earlier assumption we thought that Data Viewer was the lowest level of the hierarchy, and thus that giving anyone any access at all would involve giving them at least view access to all datasets, which was not what we wanted. However, as you can see from the Predefined Roles Comparison Matrix (not sure how we missed this earlier on), the roles are not incremental levels of access. The project wide capability we required everyone to have was to create jobs and run queries, which is exactly what the Job User predefined role provides. It would also prevent users from adding datasets on their own.

 

Access control by Groups

As we didn’t want to manually configure each user’s access controls, we had to do it through groups. We have groups defined on Google Cloud for the various teams and offices, plus a companywide group. By assigning the companywide group the BigQuery Job User role, we immediately enabled everyone to run queries. They would still not be able to query our internal datasets, though, until they were given view access to those datasets – which is the next step.
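For reference, here is a rough sketch of what that project-level binding looks like when applied programmatically with the Google API Python client rather than through the console. The project ID and group address below are placeholders, and the read-modify-write of the policy is deliberately simplified.

```python
from googleapiclient import discovery

# Placeholders – substitute your own project and group.
PROJECT_ID = "our-gcp-project"
COMPANY_GROUP = "group:everyone@example.com"

# Uses Application Default Credentials.
crm = discovery.build("cloudresourcemanager", "v1")

# Read the current IAM policy, append a binding giving the companywide
# group the BigQuery Job User role, and write the policy back.
policy = crm.projects().getIamPolicy(resource=PROJECT_ID, body={}).execute()
policy.setdefault("bindings", []).append(
    {"role": "roles/bigquery.jobUser", "members": [COMPANY_GROUP]}
)
crm.projects().setIamPolicy(
    resource=PROJECT_ID, body={"policy": policy}
).execute()
```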

 

Enabling dataset access by groups

Next, we wanted teams to be able to run queries on the datasets relevant to their team. To do so, we reviewed the datasets and granted the relevant groups View access via the BigQuery UI; this can also be done through the API, as sketched below. With this set up, new employees automatically gain access to the datasets their teams require, and we do not have to clean up access entries for individual employees when they leave the company.
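For example, granting a (hypothetical) marketing group view access to a (hypothetical) marketing dataset with the google-cloud-bigquery Python client might look something like this:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Placeholder dataset and group – substitute your own.
dataset = client.get_dataset("our-gcp-project.marketing")
entries = list(dataset.access_entries)
entries.append(
    bigquery.AccessEntry(
        role="READER",                        # view access only
        entity_type="groupByEmail",
        entity_id="marketing-team@example.com",
    )
)
dataset.access_entries = entries
client.update_dataset(dataset, ["access_entries"])
```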

 

Sharing queries/Jobs with custom roles

One useful function in BigQuery is the ability to share queries and jobs with just a URL containing the job ID. It enables collaboration as users can share both their query syntax and results with just a URL, which can easily be sent over Slack. However, the ability to view another user’s jobs is not enabled by the predefined BigQuery Job User role; an additional permission, bigquery.jobs.list, is required. To enable this, we created a custom role from the existing BigQuery Job User role: we simply added the missing permission, created the role, and switched the companywide group from the predefined Job User role to the custom one. That’s it – everyone can share their queries now.

IAM & admin > Roles > Check BigQuery Job User > Create role from Selection > Add permission > Check bigquery.jobs.list > Create
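If you prefer to script it, something along these lines should achieve the same via the IAM API using the Google API Python client. The project ID and role ID are placeholders, and the permission list is abbreviated for illustration – when cloning from the console you start from the predefined Job User role’s full permission set.

```python
from googleapiclient import discovery

iam = discovery.build("iam", "v1")

# Placeholders – substitute your own project and role id. The permission
# list is abbreviated; a real clone of Job User carries all its permissions.
iam.projects().roles().create(
    parent="projects/our-gcp-project",
    body={
        "roleId": "customBigQueryJobUser",
        "role": {
            "title": "Custom BigQuery Job User",
            "includedPermissions": [
                "bigquery.jobs.create",  # create jobs / run queries
                "bigquery.jobs.list",    # the missing permission for sharing jobs
            ],
        },
    },
).execute()
```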

 

Creating a temporary dataset for exporting data

Sometimes query results are too large to be downloaded directly, and BigQuery requires you to first save the results to a table, export that table to Google Cloud Storage, and then download the data to your computer from there. Since we did not want to allow users to edit production datasets, we created a temporary dataset for these exports. A sketch of the workflow is below, followed by more on the configuration of that temporary dataset.
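As a rough sketch of that workflow with the google-cloud-bigquery Python client (the dataset, table, bucket, and query below are all placeholders):

```python
from google.cloud import bigquery

client = bigquery.Client()

# Placeholder destination table in the temporary dataset.
dest = bigquery.TableReference.from_string("our-gcp-project.aaaa_temp.big_results")

# 1. Run the query, saving the (large) result set to the temporary table.
job_config = bigquery.QueryJobConfig(destination=dest)
client.query(
    "SELECT * FROM `our-gcp-project.marketing.events`",  # placeholder query
    job_config=job_config,
).result()

# 2. Export that table to Google Cloud Storage as sharded CSV files.
client.extract_table(dest, "gs://our-export-bucket/big_results-*.csv").result()

# 3. Download the files from the bucket to your machine, e.g. with gsutil.
```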

 

Temporary dataset naming

Firstly, we created a dataset whose name is prefixed with a couple of ‘1’s or ‘a’s to keep it at the top of the dataset list. The option to save a table comes with a dropdown of the available datasets, and the default selection is the first dataset in alphabetical order – so keeping the temporary dataset at the top reduces the chance of someone with write access to another dataset accidentally saving these temporary tables there.
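Creating that dataset with the Python client would look something like this (the dataset name and location are placeholders):

```python
from google.cloud import bigquery

client = bigquery.Client()

# Placeholder name – the leading 'a's keep it first in the alphabetically
# sorted dropdown, so it becomes the default save target.
dataset = bigquery.Dataset("our-gcp-project.aaaa_temp")
dataset.location = "US"  # assumption: use whatever location your project uses
client.create_dataset(dataset)
```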

 

Setting a default table expiration

Next we set a default expiry on this dataset at 8 days (instead of 7). This is to prevent weekly jobs from running into issues. The scenario is as follows:

  1. A scheduled job creates a table in the temporary folder
  2. A subsequent step utilises this table, but does not delete it
  3. The following week the task runs again, updating the data, but the table’s expiry is unchanged. This results in the deletion of that table around that scheduled time.
  4. The step utilising the table fails as it can’t find the table

Basically, given that there are weekly jobs, the expiry should be anything but 7 days – unless you delete the tables once the entire task is complete. If the table is deleted, the next run of the job will create a new table, which will then have a fresh expiry.

From the UI:

Click on dataset > Beside default table expiration click edit > Select expiration > Click Ok
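The same can be done with the Python client, roughly like so (the dataset name is a placeholder):

```python
from google.cloud import bigquery

client = bigquery.Client()

dataset = client.get_dataset("our-gcp-project.aaaa_temp")  # placeholder name
dataset.default_table_expiration_ms = 8 * 24 * 60 * 60 * 1000  # 8 days
client.update_dataset(dataset, ["default_table_expiration_ms"])
```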

 

Adding access to the dataset

Finally, with that dataset ready, we just needed to add the companywide email group to the list of users with Edit access. From the UI:

Click on small down arrow beside dataset > Share dataset > Change User by e-mail to Group by e-mail > Enter group name > Change Can view to Can edit > Click Add
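Programmatically this mirrors the earlier view-access sketch, just with the WRITER role (what the UI calls “Can edit”) and the companywide group (placeholders below):

```python
from google.cloud import bigquery

client = bigquery.Client()

dataset = client.get_dataset("our-gcp-project.aaaa_temp")  # placeholder name
entries = list(dataset.access_entries)
entries.append(
    bigquery.AccessEntry(
        role="WRITER",                      # "Can edit" in the UI
        entity_type="groupByEmail",
        entity_id="everyone@example.com",   # placeholder companywide group
    )
)
dataset.access_entries = entries
client.update_dataset(dataset, ["access_entries"])
```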

 

Done!

And that’s it – a simple setup that maintains a basic level of access control over BigQuery while still allowing users to access the data they need! Any suggestions on how this can be done better? Let me know your thoughts in the comments below!
