-
Notifications
You must be signed in to change notification settings - Fork 189
Config file driven multiple AWS Databricks workspaces deployment [Issue 187] #188
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
nkvuong
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
half way through the review - some minor changes suggested
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should keep this file, but make sure we update the providers to the latest versions - https://developer.hashicorp.com/terraform/language/files/dependency-lock
| } | ||
| } | ||
| ``` | ||
| You can separate out these 2 pipelines into different projects, instead of keeping everything in the same repo folder. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| You can separate out these 2 pipelines into different projects, instead of keeping everything in the same repo folder. | |
| You should separate out these 2 pipelines into different projects, instead of keeping everything in the same repo folder. |
| You can separate out these 2 pipelines into different projects, instead of keeping everything in the same repo folder. | ||
|
|
||
| Since we are using CMK (customer managed key) for encryption on root S3 bucket and Databricks managed resources, you also need to provide an AWS IAM ARN for `cmk_admin`. The format will be: `arn:aws:iam::123456:user/xxx`. You need to create this user and assign KMS admin role to it. | ||
| > Step 1.1: Manually create a service principal with account admin role on Account Console, generate client secret; and note down client_id and client_secret values. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| > Step 1.1: Manually create a service principal with account admin role on Account Console, generate client secret; and note down client_id and client_secret values. | |
| > Step 1.1: Manually create a service principal with account admin role on Account Console, generate client secret; and note down `client_id` and `client_secret` values. |
| workspace_3 = var.workspace_3_config | ||
| } | ||
| export TF_VAR_aws_account_id=xxxx | ||
| export AWS_ACCESS_KEY_ID=your_aws_role_access_key_id |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's avoid this, tf should just use awscli profile to auth
| In the default setting, this template creates one VPC (with one public subnet and one private subnet for hosting VPCEs). Each incoming workspace will add 2 private subnets into this VPC. If you need to create multiple VPCs, you should copy paste the VPC configs and change accordingly, or you can wrap VPC configs into a module, we leave this to you. | ||
|
|
||
| At this step, your workspaces deployment and VPC networking infra should have been successfully deployed and you will have `n` config json files for `n` workspaces deployed, under `/artifacts` folder, to be used in another Terraform project to deploy workspace objects including IP Access List. | ||
| Then in root level `variables.tf`, change the region default value to your region. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why don't you add this to the yml config as well?
| @@ -0,0 +1,43 @@ | |||
| # VPC Configuration | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should start with three dashes
| # VPC Configuration | |
| --- | |
| # VPC Configuration |
| @@ -0,0 +1,107 @@ | |||
| terraform { | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we still suggest customer to configure log delivery? system tables should replace this now as parsing the audit logs is quite time-consuming
| deploy_metastore: "false" | ||
| existing_metastore_id: "xxxxx-xxxx-xxxx-xxxx-xxxxxx" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should combine these into a single flag existing_metastore_id
| } | ||
|
|
||
| module "vpc" { | ||
| source = "terraform-aws-modules/vpc/aws" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we pin the module version?
| } | ||
|
|
||
|
|
||
| resource "aws_iam_role_policy" "cross_account" { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought we already have data.databricks_aws_crossaccount_policy
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR revamps the AWS Databricks modular privatelink example to implement a config file-driven deployment approach. The changes enable deploying multiple Databricks workspaces across multiple VPCs using YAML configuration files, with support for Unity Catalog catalogs and isolated infrastructure.
- Complete restructuring to use YAML configuration files instead of hardcoded Terraform variables
- New modular architecture with account-level and workspace-level pipelines
- Enhanced Unity Catalog support with isolated S3 buckets and external locations
Reviewed Changes
Copilot reviewed 59 out of 69 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| main.tf | Replaced hardcoded workspace configs with YAML-driven configuration parsing |
| variables.tf | Simplified to core variables, removed workspace-specific config variables |
| providers.tf | Updated provider versions and added time provider |
| outputs.tf | Modified to output workspace URLs from new module structure |
| identity_management.tf | Added comprehensive user and group management using YAML config |
| databricks_account_tf_pipeline/ | New module for account-level resource deployment including workspaces and metastore |
| databricks_workspace_tf_pipeline/ | New module for workspace-level resources including UC catalogs |
| configs/ | New YAML configuration files for environment and user management |
| README.md | Updated documentation reflecting the new config-driven approach |
Files not reviewed (1)
- examples/aws-databricks-modular-privatelink/.terraform.lock.hcl: Language not supported
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
| principals { | ||
| type = "AWS" | ||
| identifiers = [var.cmk_admin] | ||
| identifiers = [data.aws_caller_identity.current.account_id] |
Copilot
AI
Oct 9, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The account ID should be formatted as an ARN. This should be arn:aws:iam::${data.aws_caller_identity.current.account_id}:root to properly identify the root user of the account.
| identifiers = [data.aws_caller_identity.current.account_id] | |
| identifiers = ["arn:aws:iam::${data.aws_caller_identity.current.account_id}:root"] |
| principals { | ||
| type = "AWS" | ||
| identifiers = [var.cmk_admin] | ||
| identifiers = [data.aws_caller_identity.current.account_id] |
Copilot
AI
Oct 9, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The account ID should be formatted as an ARN. This should be arn:aws:iam::${data.aws_caller_identity.current.account_id}:root to properly identify the root user of the account.
| identifiers = [data.aws_caller_identity.current.account_id] | |
| identifiers = ["arn:aws:iam::${data.aws_caller_identity.current.account_id}:root"] |
Revamping this example of modular AWS DB workspace deployment - we can now define multiple environments with yaml files to deploy m workspaces into n VPCs and create multiple catalogs with isolated underlying infra. All the workspaces are using backend PL and CMK feature; the VPCs are configured with NAT for egress.