subcategory |
---|
Storage |
This resource will mount your cloud storage on dbfs:/mnt/name
. Right now it supports mounting AWS S3, Azure (Blob Storage, ADLS Gen1 & Gen2), Google Cloud Storage. It is important to understand that this will start up the cluster if the cluster is terminated. The read and refresh terraform command will require a cluster and may take some time to validate the mount.
Note When cluster_id
is not specified, it will create the smallest possible cluster in the default availability zone with name equal to or starting with terraform-mount
for the shortest possible amount of time. To avoid mount failure due to potentially quota or capacity issues with the default cluster, we recommend specifying a cluster to use for mounting.
This resource provides two ways of mounting a storage account:
- Use a storage-specific configuration block - this could be used for the most cases, as it will fill most of the necessary details. Currently we support following configuration blocks:
s3
- to mount AWS S3gs
- to mount Google Cloud Storageabfs
- to mount ADLS Gen2 using Azure Blob Filesystem (ABFS) driveradl
- to mount ADLS Gen1 using Azure Data Lake (ADL) driverwasb
- to mount Azure Blob Storage using Windows Azure Storage Blob (WASB) driver
- Use generic arguments - you have a responsibility for providing all necessary parameters that are required to mount specific storage. This is most flexible option
cluster_id
- (Optional, String) Cluster to use for mounting. If no cluster is specified, a new cluster will be created and will mount the bucket for all of the clusters in this workspace. If the cluster is not running - it's going to be started, so be aware to set auto-termination rules on it.name
- (Optional, String) Name, under which mount will be accessible indbfs:/mnt/<MOUNT_NAME>
. If not specified, provider will try to infer it from depending on the resource type:bucket_name
for AWS S3 and Google Cloud Storagecontainer_name
for ADLS Gen2 and Azure Blob Storagestorage_resource_name
for ADLS Gen1
uri
- (Optional, String) the URI for accessing specific storage (s3a://....
,abfss://....
,gs://....
, etc.)extra_configs
- (Optional, String map) configuration parameters that are necessary for mounting of specific storageresource_id
- (Optional, String) resource ID for a given storage account. Could be used to fill defaults, such as storage account & container names on Azure.encryption_type
- (Optional, String) encryption type. Currently used only for AWS S3 mounts
locals {
tenant_id = "00000000-1111-2222-3333-444444444444"
client_id = "55555555-6666-7777-8888-999999999999"
secret_scope = "some-kv"
secret_key = "some-sp-secret"
container = "test"
storage_acc = "lrs"
}
resource "databricks_mount" "this" {
name = "tf-abfss"
uri = "abfss://${local.container}@${local.storage_acc}.dfs.core.windows.net"
extra_configs = {
"fs.azure.account.auth.type" : "OAuth",
"fs.azure.account.oauth.provider.type" : "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
"fs.azure.account.oauth2.client.id" : local.client_id,
"fs.azure.account.oauth2.client.secret" : "{{secrets/${local.secret_scope}/${local.secret_key}}}",
"fs.azure.account.oauth2.client.endpoint" : "https://login.microsoftonline.com/${local.tenant_id}/oauth2/token",
"fs.azure.createRemoteFileSystemDuringInitialization" : "false",
}
}
-> Note AAD passthrough is considered a legacy data access pattern. Use Unity Catalog for fine-grained data access control.
-> Note Mounts using AAD passthrough cannot be created using a service principal.
To mount ALDS Gen2 with Azure Active Directory Credentials passthrough we need to execute the mount commands using the cluster configured with AAD Credentials passthrough & provide necessary configuration parameters (see documentation for more details).
provider "azurerm" {
features {}
}
variable "resource_group" {
type = string
description = "Resource group for Databricks Workspace"
}
variable "workspace_name" {
type = string
description = "Name of the Databricks Workspace"
}
data "azurerm_databricks_workspace" "this" {
name = var.workspace_name
resource_group_name = var.resource_group
}
# it works only with AAD token!
provider "databricks" {
host = data.azurerm_databricks_workspace.this.workspace_url
}
data "databricks_node_type" "smallest" {
local_disk = true
}
data "databricks_spark_version" "latest" {
}
resource "databricks_cluster" "shared_passthrough" {
cluster_name = "Shared Passthrough for mount"
spark_version = data.databricks_spark_version.latest.id
node_type_id = data.databricks_node_type.smallest.id
autotermination_minutes = 10
num_workers = 1
spark_conf = {
"spark.databricks.cluster.profile" : "serverless",
"spark.databricks.repl.allowedLanguages" : "python,sql",
"spark.databricks.passthrough.enabled" : "true",
"spark.databricks.pyspark.enableProcessIsolation" : "true"
}
custom_tags = {
"ResourceClass" : "Serverless"
}
}
variable "storage_acc" {
type = string
description = "Name of the ADLS Gen2 storage container"
}
variable "container" {
type = string
description = "Name of container inside storage account"
}
resource "databricks_mount" "passthrough" {
name = "passthrough-test"
cluster_id = databricks_cluster.shared_passthrough.id
uri = "abfss://${var.container}@${var.storage_acc}.dfs.core.windows.net"
extra_configs = {
"fs.azure.account.auth.type" : "CustomAccessToken",
"fs.azure.account.custom.token.provider.class" : "{{sparkconf/spark.databricks.passthrough.adls.gen2.tokenProviderClassName}}",
}
}
This block allows specifying parameters for mounting of the ADLS Gen2. The following arguments are required inside the s3
block:
instance_profile
- (Optional) (String) ARN of registered instance profile for data access. If it's not specified, then thecluster_id
should be provided, and the cluster should have an instance profile attached to it. If bothcluster_id
&instance_profile
are specified, thencluster_id
takes precedence.bucket_name
- (Required) (String) S3 bucket name to be mounted.
// now you can do `%fs ls /mnt/experiments` in notebooks
resource "databricks_mount" "this" {
name = "experiments"
s3 {
instance_profile = databricks_instance_profile.ds.id
bucket_name = aws_s3_bucket.this.bucket
}
}
This block allows specifying parameters for mounting of the ADLS Gen2. The following arguments are required inside the abfs
block:
client_id
- (Required) (String) This is the client_id (Application Object ID) for the enterprise application for the service principal.tenant_id
- (Optional) (String) This is your azure directory tenant id. It is required for creating the mount. (Could be omitted if Azure authentication is used, and we can extracttenant_id
from it).client_secret_key
- (Required) (String) This is the secret key in which your service principal/enterprise app client secret will be stored.client_secret_scope
- (Required) (String) This is the secret scope in which your service principal/enterprise app client secret will be stored.container_name
- (Required) (String) ADLS gen2 container name. (Could be omitted ifresource_id
is provided)storage_account_name
- (Required) (String) The name of the storage resource in which the data is. (Could be omitted ifresource_id
is provided)directory
- (Computed) (String) This is optional if you don't want to add an additional directory that you wish to mount. This must start with a "/".initialize_file_system
- (Required) (Bool) either or not initialize FS for the first use
In this example, we're using Azure authentication, so we can omit some parameters (tenant_id
, storage_account_name
, and container_name
) that will be detected automatically.
resource "databricks_secret_scope" "terraform" {
name = "application"
initial_manage_principal = "users"
}
resource "databricks_secret" "service_principal_key" {
key = "service_principal_key"
string_value = "${var.ARM_CLIENT_SECRET}"
scope = databricks_secret_scope.terraform.name
}
resource "azurerm_storage_account" "this" {
name = "${var.prefix}datalake"
resource_group_name = var.resource_group_name
location = var.resource_group_location
account_tier = "Standard"
account_replication_type = "GRS"
account_kind = "StorageV2"
is_hns_enabled = true
}
resource "azurerm_role_assignment" "this" {
scope = azurerm_storage_account.this.id
role_definition_name = "Storage Blob Data Contributor"
principal_id = data.azurerm_client_config.current.object_id
}
resource "azurerm_storage_container" "this" {
name = "marketing"
storage_account_name = azurerm_storage_account.this.name
container_access_type = "private"
}
resource "databricks_mount" "marketing" {
name = "marketing"
resource_id = azurerm_storage_container.this.resource_manager_id
abfs {
client_id = data.azurerm_client_config.current.client_id
client_secret_scope = databricks_secret_scope.terraform.name
client_secret_key = databricks_secret.service_principal_key.key
initialize_file_system = true
}
}
This block allows specifying parameters for mounting of the Google Cloud Storage. The following arguments are required inside the gs
block:
service_account
- (Optional) (String) email of registered Google Service Account for data access. If it's not specified, then thecluster_id
should be provided, and the cluster should have a Google service account attached to it.bucket_name
- (Required) (String) GCS bucket name to be mounted.
resource "databricks_mount" "this_gs" {
name = "gs-mount"
gs {
service_account = "[email protected]"
bucket_name = "mybucket"
}
}
This block allows specifying parameters for mounting of the ADLS Gen1. The following arguments are required inside the adl
block:
-
client_id
- (Required) (String) This is the client_id for the enterprise application for the service principal. -
tenant_id
- (Optional) (String) This is your azure directory tenant id. It is required for creating the mount. (Could be omitted if Azure authentication is used, and we can extracttenant_id
from it) -
client_secret_key
- (Required) (String) This is the secret key in which your service principal/enterprise app client secret will be stored. -
client_secret_scope
- (Required) (String) This is the secret scope in which your service principal/enterprise app client secret will be stored. -
storage_resource_name
- (Required) (String) The name of the storage resource in which the data is for ADLS gen 1. This is what you are trying to mount. (Could be omitted ifresource_id
is provided) -
spark_conf_prefix
- (Optional) (String) This is the spark configuration prefix for adls gen 1 mount. The options arefs.adl
,dfs.adls
. Usefs.adl
for runtime 6.0 and above for the clusters. Otherwise usedfs.adls
. The default value is:fs.adl
. -
directory
- (Computed) (String) This is optional if you don't want to add an additional directory that you wish to mount. This must start with a "/".
resource "databricks_mount" "mount" {
name = "{var.RANDOM}"
adl {
storage_resource_name = "{env.TEST_STORAGE_ACCOUNT_NAME}"
tenant_id = data.azurerm_client_config.current.tenant_id
client_id = data.azurerm_client_config.current.client_id
client_secret_scope = databricks_secret_scope.terraform.name
client_secret_key = databricks_secret.service_principal_key.key
spark_conf_prefix = "fs.adl"
}
}
This block allows specifying parameters for mounting of the Azure Blob Storage. The following arguments are required inside the wasb
block:
auth_type
- (Required) (String) This is the auth type for blob storage. This can either be SAS tokens (SAS
) or account access keys (ACCESS_KEY
).token_secret_scope
- (Required) (String) This is the secret scope in which your auth type token is stored.token_secret_key
- (Required) (String) This is the secret key in which your auth type token is stored.container_name
- (Required) (String) The container in which the data is. This is what you are trying to mount. (Could be omitted ifresource_id
is provided)storage_account_name
- (Required) (String) The name of the storage resource in which the data is. (Could be omitted ifresource_id
is provided)directory
- (Computed) (String) This is optional if you don't want to add an additional directory that you wish to mount. This must start with a "/".
resource "azurerm_storage_account" "blobaccount" {
name = "${var.prefix}blob"
resource_group_name = var.resource_group_name
location = var.resource_group_location
account_tier = "Standard"
account_replication_type = "LRS"
account_kind = "StorageV2"
}
resource "azurerm_storage_container" "marketing" {
name = "marketing"
storage_account_name = azurerm_storage_account.blobaccount.name
container_access_type = "private"
}
resource "databricks_secret_scope" "terraform" {
name = "application"
initial_manage_principal = "users"
}
resource "databricks_secret" "storage_key" {
key = "blob_storage_key"
string_value = azurerm_storage_account.blobaccount.primary_access_key
scope = databricks_secret_scope.terraform.name
}
resource "databricks_mount" "marketing" {
name = "marketing"
wasb {
container_name = azurerm_storage_container.marketing.name
storage_account_name = azurerm_storage_account.blobaccount.name
auth_type = "ACCESS_KEY"
token_secret_scope = databricks_secret_scope.terraform.name
token_secret_key = databricks_secret.storage_key.key
}
}
Migration from the specific mount resource is straightforward:
- rename
mount_name
toname
- wrap storage-specific settings (
container_name
, ...) into corresponding block (adl
,abfs
,s3
,wasbs
) - for S3 mounts, rename
s3_bucket_name
tobucket_name
In addition to all arguments above, the following attributes are exported:
id
- mount namesource
- (String) HDFS-compatible url
-> Note Importing this resource is not currently supported.
The following resources are often used in the same context:
- End to end workspace management guide.
- databricks_aws_bucket_policy data to configure a simple access policy for AWS S3 buckets, so that Databricks can access data in it.
- databricks_cluster to create Databricks Clusters.
- databricks_dbfs_file data to get file content from Databricks File System (DBFS).
- databricks_dbfs_file_paths data to get list of file names from get file content from Databricks File System (DBFS).
- databricks_dbfs_file to manage relatively small files on Databricks File System (DBFS).
- databricks_instance_profile to manage AWS EC2 instance profiles that users can launch databricks_cluster and access data, like databricks_mount.
- databricks_library to install a library on databricks_cluster.