How to Automate Governance Best Practices With Google Data Catalog and Terraform

There’s extensive documentation on what IAM Roles are available for Google Data Catalog. But when you are getting started with your data governance journey, you probably have wondered what kind of access controls are needed and who should be granted them in your organization…

  • Which end users should be able to discover my data assets?
  • Who should be able to classify and add tags to them?
  • And finally, who should be able to create templates and set standards for the data classification process?

This can get really complex, so in this blog post we will start by looking at the access controls on top of metadata, which is Google Data Catalog's playing field.

We can automate all that by using Terraform.

The solution

Data Governance Architecture

This is simply a suggestion on how to work with Data Catalog. To start off, let’s say you have some common templates that will be used to create tags in different projects.

For that we need two different pieces:

  • The Tag Central Project

This is where we store all the common resources, like Tag Templates, Policy Tags, and Custom entries. That way we don’t duplicate them, we are charged only once, and we have a much easier time managing and changing them.

To showcase this, in the Terraform sample we will create 4 Tag Templates in the Tag Central project:

★ Data Engineering Template

★ Derived Data Template

★ Data Governance Template

★ Data Quality Template

  • A Group of Analytics Projects

The personas

Now let’s look at the personas who will interact with the Tag Central and Analytics Projects and that we will automatically set up with Terraform.

Keep in mind that the names are just suggestions, and you could replace them with other names that play similar roles. For example, you could call Data Governors Data Architects, or Data Curators Data Stewards, among many other names in this data alphabet soup.

  • Data Governors
Data Governor Persona

Data Governor is the role for people who perform administrative workloads on top of your metadata. This means creating, updating, and deleting Data Catalog resources like Tag Templates, and setting the standards of your data governance process.

  • Data Curators
Data Curator Persona

Data Curators will take care of your data assets 🙂 … They will select the relevant ones and add meaning to them (by creating tags), so other users can easily discover and make use of them.

  • Data Analysts
Data Analyst Persona

This is the person who will use the curated assets and define and develop domain-specific analytics to support your decision making.

Take into consideration that those personas can change or overlap, depending on the size of your organization or the way it is structured. So you can have the same person playing more than one role.
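To make the personas a bit more concrete, here is a minimal Terraform sketch of how they could map to IAM roles on a project. The variable names and the persona-to-role mapping are assumptions made for illustration and are not necessarily what the sample repository uses; the role IDs are predefined Data Catalog roles.

```hcl
# Hypothetical variables: one project and one member per persona.
variable "project_id" {
  type = string
}

variable "data_governor_member" {
  type = string
}

variable "data_curator_member" {
  type = string
}

variable "data_analyst_member" {
  type = string
}

locals {
  # Assumed persona-to-role mapping; adjust it to your own governance model.
  persona_bindings = {
    governor_template_owner = {
      member = var.data_governor_member
      role   = "roles/datacatalog.tagTemplateOwner"
    }
    curator_template_user = {
      member = var.data_curator_member
      role   = "roles/datacatalog.tagTemplateUser"
    }
    curator_tag_editor = {
      member = var.data_curator_member
      role   = "roles/datacatalog.tagEditor"
    }
    analyst_viewer = {
      member = var.data_analyst_member
      role   = "roles/datacatalog.viewer"
    }
  }
}

# One non-authoritative IAM binding per entry in the map above.
resource "google_project_iam_member" "persona" {
  for_each = local.persona_bindings

  project = var.project_id
  role    = each.value.role
  member  = each.value.member
}
```

Because the mapping lives in a single map, a person who plays more than one role simply shows up as the member of several entries.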

If you use different personas, please feel free to contribute to the sample repository or add comments to this blog post with your use case. This will be really helpful.

The automation

Without further ado, let’s look at the Terraform automation because doing things manually does us no good!

The GitHub repo mesmacosta/google-datacatalog-governance-best-pratices (github.com) contains all the samples and a detailed step-by-step guide on how to run them.

To run Terraform, we are going to use a service account, since at the time of this writing Data Catalog does not support using end-user credentials from the Google Cloud SDK.

And to follow the best practices we won’t download the service account key, but use service account impersonation.
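One possible way to wire up impersonation on the Terraform side is sketched below. It assumes a version of the Google provider that supports the impersonate_service_account argument, and the variable name is hypothetical; the sample repository may set this up differently.

```hcl
# Hypothetical variable holding the email of the service account to impersonate.
variable "terraform_service_account" {
  type        = string
  description = "Email of the service account Terraform should act as"
}

provider "google" {
  # The identity running Terraform needs roles/iam.serviceAccountTokenCreator
  # on this service account; no key file is downloaded.
  impersonate_service_account = var.terraform_service_account
}
```

With this in place, Terraform requests short-lived credentials for the service account instead of reading a long-lived key from disk.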

Create the Service Account

So the first step is creating a service account and setting the appropriate IAM roles: https://medium.com/media/cca0bf7c553ec02d2258640f6b3845f7

Set Terraform variable placeholders

Next, we need to set the Terraform variable placeholders. After you clone the GitHub repo, change the .tfvars placeholders.

Let’s look at an example of a valid configuration file: https://medium.com/media/c24cc4c794b0aa4193d93b9adfd4d736

In the sample code above, whenever you see member, it can be any of: user:{email}, serviceAccount:{email}, group:{email}, or domain:{domain}.
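As a rough illustration of those member formats, a .tfvars file could look something like the sketch below. All variable names, project IDs, and email addresses here are hypothetical placeholders; check the repository's own .tfvars file for the real variable names.

```hcl
# Hypothetical variable names and values, for illustration only.
tag_central_project_id = "my-tag-central-project"
analytics_project_ids  = ["my-analytics-project-1", "my-analytics-project-2"]

# group:{email}
data_governor_members = ["group:data-governors@example.com"]

# user:{email} and serviceAccount:{email}
data_curator_members = [
  "user:curator@example.com",
  "serviceAccount:tagger@my-tag-central-project.iam.gserviceaccount.com",
]

# domain:{domain}
data_analyst_members = ["domain:example.com"]
```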

Run Terraform

And at last, let’s execute:

# After that, let's get Terraform started.
# Run the following to pull in the providers.
terraform init

# With the providers downloaded and terraform variables set,
# you're ready to use Terraform. Go ahead!

# Plan first to validate the execution
terraform plan -input=false -out=tfplan -var-file=".tfvars" 

# If successful, execute it
terraform apply tfplan

Generated Resources

After Terraform completes, we can look at the generated resources:

IAM Roles

We can see that all the projects we set up in Terraform contain the discussed personas, with the appropriate permissions.

And let’s not forget the common resources created by Terraform:

Data Catalog Tag Templates
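For reference, a tag template such as the Data Governance Template could be declared along the lines of the sketch below. The region, template ID, and field definitions are assumptions chosen for illustration, not the fields used in the sample repository.

```hcl
# Hypothetical variable for the project that hosts the common resources.
variable "tag_central_project_id" {
  type = string
}

resource "google_data_catalog_tag_template" "data_governance" {
  project         = var.tag_central_project_id
  region          = "us-central1"
  tag_template_id = "data_governance_template"
  display_name    = "Data Governance Template"

  # Illustrative required string field.
  fields {
    field_id     = "data_owner"
    display_name = "Data owner"
    is_required  = true

    type {
      primitive_type = "STRING"
    }
  }

  # Illustrative optional boolean field.
  fields {
    field_id     = "has_pii"
    display_name = "Contains PII"

    type {
      primitive_type = "BOOL"
    }
  }
}
```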

That’s pretty much it, thanks for reading :).

Wrapping up

Data Governance is a really complex area, and any automation that helps us set and enforce those standards is welcome. In this blog post, we looked at Terraform samples that support us when working at the project level.

Keep in mind that you may want to use the suggested access controls at the folder or organization level, which is a common use case for large organizations.

The iam module in the GitHub repo is easily adaptable to that use case: all you need to do is switch the google_project_iam_member resource to google_folder_iam_member or google_organization_iam_member, respectively.
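A minimal sketch of what that switch could look like is shown below. The variable names and the viewer role binding are assumptions for illustration; only the resource types come from the text above.

```hcl
# Hypothetical variables for the target folder, organization, and member.
variable "folder_id" {
  type = string
}

variable "org_id" {
  type = string
}

variable "data_analyst_member" {
  type = string
}

# Folder-level equivalent of a google_project_iam_member binding.
resource "google_folder_iam_member" "analyst_viewer" {
  folder = "folders/${var.folder_id}"
  role   = "roles/datacatalog.viewer"
  member = var.data_analyst_member
}

# Organization-level equivalent of the same binding.
resource "google_organization_iam_member" "analyst_viewer" {
  org_id = var.org_id
  role   = "roles/datacatalog.viewer"
  member = var.data_analyst_member
}
```

The role and member arguments stay the same across all three resource types; only the scope argument (project, folder, or org_id) changes.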
