This section describes how to configure the platform to integrate with Databricks hosted in Azure.
NOTE: You cannot integrate with existing Azure Databricks clusters. |
Additional Databricks features supported by the platform:
The platform must be deployed in Microsoft Azure.
NOTE: If you are using Azure Databricks as a datasource, please verify that OpenJDK v1.8.0_242 or earlier is installed on the platform node. |
Beginning in Release 7.1, the Azure Databricks integration switched from a Hive-based driver to a Simba driver for the Spark integration.
NOTE: If you have upgraded from a release before Release 7.1, you should review the Connect String Options in your Azure Databricks connections, such as Databricks Tables. These options may not work with the Simba driver. |
No installation is required to use this new driver.
For more information on the Simba driver, see https://databricks.com/spark/odbc-driver-download.
NOTE: If you are using Azure AD to integrate with an Azure Databricks cluster, the Azure AD secret value stored in |
The platform must be installed on Microsoft Azure.
Import from nested folders is not supported for running jobs from Azure Databricks.
Azure Databricks integration works with Spark 2.4.x only.
NOTE: The version of Spark for Azure Databricks must be applied to the platform configuration through the databricks.sparkVersion property. |
Supported versions:
Azure Databricks 7.3 (Recommended)
NOTE: Some versions of Azure Databricks 6.x have already reached End of Life. Customers should upgrade to a version of Azure Databricks that is supported by the vendor. These EOL versions are likely to be deprecated from support by the platform. |
NOTE: When an Azure Databricks version is deprecated, it is no longer available for selection when creating a new cluster. As a result, if a new cluster needs to be created for whatever reason, it must be created using a version supported by Azure Databricks, which may require you to change the value of the Spark version settings in the platform configuration. |
For more information on supported versions, see https://docs.databricks.com/release-notes/runtime/releases.html#supported-databricks-runtime-releases-and-support-schedule.
By default, the number of jobs permitted on an Azure Databricks cluster is set to 1000.
Up to 5000 jobs can be submitted by using the run-submit API. For more information, see "Configure Databricks job management" below.
A workspace is limited to 150 concurrent job runs.
Managing limits:
To enable retrieval and auditing of job information after a job has completed, the platform does not delete jobs from the cluster. As a result, jobs can accumulate over time and exceed the number of jobs permitted on the cluster. If you reach these limits, you may receive a "Quota for number of jobs has been reached" error. For more information, see https://docs.databricks.com/user-guide/jobs.html.
Optionally, you can allow the platform to manage your jobs to avoid these limitations. For more information, see "Configure Databricks job management" below.
NOTE: Integration with pre-existing Azure Databricks clusters is not supported. |
When a user first requests access to Azure Databricks, a new Azure Databricks cluster is created for the user. Access can include a request to run a job or to browse Databricks Tables. Cluster creation may take a few minutes.
A new cluster is also created when a user launches a job after:
A user's cluster automatically terminates after a configurable time period. A new cluster is automatically created when the user next requests access to Azure Databricks. See "Configure Platform" below.
To enable Azure Databricks, please perform the following configuration changes.
Steps:
Locate the following parameter, which enables the Photon running environment for smaller job execution. Set it to Enabled:
Photon execution |
Locate the following parameters. Set them to the values listed below, which enable the Azure Databricks running environment (small to extra-large jobs):
"webapp.runInDatabricks": true, "webapp.runWithSparkSubmit": false, "webapp.runinEMR": false, "webapp.runInDataflow": false, |
Locate the following parameter and set it to azure:
"metadata.cloud": "azure", |
NOTE: Do not set this parameter to any value other than azure. |
When a user submits a job, the platform provides all of the cluster specifications through the Databricks API, and a cluster is created per user or per job. That means that once the job is complete, the cluster is terminated. Cluster creation may take less than 30 seconds if instance pools are used. If instance pools are not used, it may take 10-15 minutes.
For more information on job clusters, see https://docs.databricks.com/clusters/configure.html.
Job clusters automatically terminate after the job is completed. A new cluster is automatically created when the user next requests access to Azure Databricks.
Cluster Mode | Description |
---|---|
USER | Default cluster mode for running jobs in Azure Databricks. When a user submits a job, it runs on a cluster dedicated to that user, which is created if it does not already exist. |
JOB | When a user submits a job, a new cluster is created for that job and is terminated when the job completes. |
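For example, to run each job on its own short-lived job cluster, you might set the cluster mode as follows. This is a sketch only; USER remains the default and is appropriate when users run many jobs interactively:
"databricks.clusterMode": "JOB", |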
Please review and modify the following configuration settings.
NOTE: When you have finished modifying these settings, save them and restart the platform to apply. |
Parameter | Description | Value |
---|---|---|
databricks.clusterMode | Determines the cluster mode for running a Databricks job. | Default: USER |
feature.parameterization.matchLimitOnSampling.databricksSpark | Maximum number of parameterized source files that are permitted for matching in a single dataset with parameters. | |
databricks.workerNodeType | Type of node to use for the Azure Databricks Workers/Executors. There are 1 or more Worker nodes per cluster. | Default: For more information, see the sizing guide for Azure Databricks. |
databricks.sparkVersion | Azure Databricks cluster version, which also includes the version of Spark. | Depending on your version of Azure Databricks, please set this property accordingly. Please do not use other values. |
databricks.serviceUrl | URL to the Azure Databricks Service where Spark jobs will be run (Example: https://westus2.azuredatabricks.net) | |
databricks.minWorkers | Initial number of Worker nodes in the cluster, and also the minimum number of Worker nodes that the cluster can scale down to during auto-scale-down. | Increasing this value can increase compute costs. |
databricks.maxWorkers | Maximum number of Worker nodes the cluster can create during auto-scaling. | Minimum value: not less than the value of databricks.minWorkers. Increasing this value can increase compute costs. |
databricks.poolId | If you have enabled instance pooling in Azure Databricks, you can specify the worker node pool identifier here. For more information, see Configure instance pooling below. | |
databricks.poolName | If you have enabled instance pooling in Azure Databricks, you can specify the worker node pool name here. For more information, see Configure instance pooling below. | See previous. |
databricks.driverNodeType | Type of node to use for the Azure Databricks Driver. There is only 1 Driver node per cluster. | Default: For more information, see the sizing guide for Databricks. |
databricks.driverPoolId | If you have enabled instance pooling in Azure Databricks, you can specify the driver node pool identifier here. For more information, see Configure instance pooling below. | |
databricks.driverPoolName | If you have enabled instance pooling in Azure Databricks, you can specify the driver node pool name here. For more information, see Configure instance pooling below. | See previous. |
databricks.logsDestination | DBFS location to which cluster logs are sent every 5 minutes. | Leave this value as /trifacta/logs. |
databricks.enableAutotermination | Set to true to enable auto-termination of a user cluster after N minutes of idle time, where N is the value of the autoterminationMinutes property. | Unless otherwise required, leave this value as true. |
databricks.clusterStatePollerDelayInSeconds | Number of seconds to wait between polls for Azure Databricks cluster status when a cluster is starting up. | |
databricks.clusterStartupWaitTimeInMinutes | Maximum time in minutes to wait for a cluster to reach the Running state before aborting and failing an Azure Databricks job. | |
databricks.clusterLogSyncWaitTimeInMinutes | Maximum time in minutes to wait for a cluster to finish syncing its logs to DBFS before giving up on pulling the cluster logs to the platform. | Set this to 0 to disable cluster log pulls. |
databricks.clusterLogSyncPollerDelayInSeconds | Number of seconds to wait between polls for a Databricks cluster to sync its logs to DBFS after job completion. | |
databricks.autoterminationMinutes | Idle time in minutes before a user cluster auto-terminates. | Do not set this value to less than the cluster startup wait time value. |
databricks.enableLocalDiskEncryption | Enables encryption of data, such as shuffle data, that is temporarily stored on the cluster's local disk. | - |
databricks.patCacheTTLInMinutes | Lifespan in minutes for the Databricks personal access token in-memory cache. | Default: 10 |
databricks.maxAPICallRetries | Maximum number of retries to perform when a 429 error code response is received. | Default: 5. For more information, see the Configure Maximum Retries for REST API section below. |
spark.useVendorSparkLibraries | When set to true, the platform uses the Spark libraries provided by Databricks instead of the Spark JARs provided with the platform. | |
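As a point of reference, the relevant settings might look like the following in the platform configuration. The service URL and logs destination values are taken from the table above; the worker counts and auto-termination timeout are illustrative assumptions only and should be sized for your environment:
"databricks.clusterMode": "USER", "databricks.serviceUrl": "https://westus2.azuredatabricks.net", "databricks.minWorkers": 2, "databricks.maxWorkers": 8, "databricks.autoterminationMinutes": 60, "databricks.logsDestination": "/trifacta/logs", |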
Instance pooling reduces cluster node spin-up time by maintaining a set of idle, ready-to-use instances. The platform can be configured to leverage instance pooling on the Azure Databricks cluster for both worker and driver nodes.
NOTE: When instance pooling is enabled, the databricks.workerNodeType and databricks.driverNodeType parameters are not used. |
For more information, see https://docs.azuredatabricks.net/clusters/instance-pools/index.html.
Pre-requisites:
Users must have the ATTACH_TO permission on the instance pool.
To enable:
Acquire your pool identifier or pool name from Azure Databricks.
NOTE: You can use either the Databricks pool identifier or pool name. If both poolId and poolName are specified, poolId is used first. If that fails to find a matching identifier, then the poolName value is checked. |
Tip: If you specify a poolName value only, then you can run your Databricks jobs against the available clusters across multiple Azure Databricks workspaces. |
Set either of the following parameters:
Set the following parameter to the Azure Databricks pool identifier:
"databricks.poolId": "<my_pool_id>", |
Or, you can set the following parameter to the Azure Databricks pool name:
"databricks.poolName": "<my_pool_name>", |
The platform can be configured to use Databricks instance pooling for driver pools.
To enable:
Acquire your driver pool identifier or driver pool name from Databricks.
NOTE: You can use either the Databricks driver pool identifier or driver pool name. If both driverPoolId and driverPoolName are specified, driverPoolId is used first. If that fails to find a matching identifier, then the driverPoolName value is checked. |
Tip: If you specify a driverPoolName value only, then you can run your Databricks jobs against the available clusters across multiple Azure Databricks workspaces. |
Set either of the following parameters:
Set the following parameter to the Databricks driver pool identifier:
"databricks.driverPoolId": "<my_pool_id>", |
Or, you can set the following parameter to the Databricks driver pool name:
"databricks.driverPoolName": "<my_pool_name>", |
Azure Databricks enforces a hard limit of 1000 created jobs per workspace, and by default cluster jobs are not deleted. To support more than 1000 jobs per workspace, you can enable job management for Azure Databricks.
NOTE: This feature covers the deletion of the job definition on the cluster, which counts toward the enforced limits. The |
Tip: Regardless of your job management option, when you hit the limit for the number of job definitions that can be created on the Databricks workspace, the platform by default falls back to using the runs/submit API, if the Databricks Job Runs Submit Fallback setting has been enabled. |
Steps:
Locate the following property and set it to one of the values listed below:
Databricks Job Management |
Property Value | Description |
---|---|
Never Delete | (default) Job definitions are never deleted from the Azure Databricks cluster. |
Always Delete | The Azure Databricks job definition is deleted during the clean-up phase, which occurs after a job completes. |
Delete Successful Only | When a job completes successfully, the Azure Databricks job definition is deleted during the clean-up phase. Failed or canceled jobs are not deleted, which allows you to debug as needed. |
Skip Job Creation | For jobs that are to be executed only one time, the platform can be configured to use the runs/submit API, which avoids creating a job definition on the cluster. |
Default | Inherits the default system-wide setting. |
When the following setting is enabled, the platform falls back to the runs/submit API when the job limit for the Databricks workspace has been reached:
Databricks Job Runs Submit Fallback |
By default, Databricks workspaces apply limits on the number of jobs that can be submitted before the cluster begins to fail jobs. These limits are the following:
Maximum number of concurrent jobs per cluster
Max number of concurrent jobs per workspace
Max number of concurrent clusters per workspace
Depending on how your clusters are configured, these limits can vary. For example, if the maximum number of concurrent jobs per cluster is set to 20, then the 21st concurrent job submitted to the cluster fails.
To prevent unnecessary job failures, the platform can throttle the jobs that it submits to Databricks. When job throttling is enabled and the 21st concurrent job is submitted, the platform holds that job internally until the first of the following events happens:
NOTE: The |
Please complete the following steps to enable job throttling.
Steps:
Locate the following settings and set their values accordingly:
Setting | Description |
---|---|
databricks.userClusterThrottling.enabled | When set to true, throttling of jobs per user cluster is enabled. |
databricks.userClusterthrottling.maxTokensAllottedPerUserCluster | Set this value to the maximum number of concurrent jobs that can run on one user cluster. Default value is |
databricks.userClusterthrottling.tokenExpiryInMinutes | The time in minutes after which tokens reserved by a job are revoked, irrespective of the job status. If a job is in progress and this limit is reached, then the Databricks token is expired, and the token is revoked under the assumption that it is stale. Default value is |
jobMonitoring.queuedJobTimeoutMinutes | The maximum time in minutes that a job is permitted to remain in the queue for a slot on the Databricks cluster. If this limit is reached, the job is marked as failed. |
batch-job-runner.cleanup.enabled | When set to |
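For reference, the throttling settings above might be configured as follows. All of the numeric values here are illustrative assumptions, not documented defaults; choose values that match your cluster capacity and job mix:
"databricks.userClusterThrottling.enabled": true, "databricks.userClusterthrottling.maxTokensAllottedPerUserCluster": 20, "databricks.userClusterthrottling.tokenExpiryInMinutes": 120, "jobMonitoring.queuedJobTimeoutMinutes": 60, |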
Each user must insert a Databricks Personal Access Token to access Databricks resources. For more information, see Databricks Settings Page.
When profiling is enabled for a Spark job, the transform and profiling tasks are combined by default. As needed, you can separate these two tasks. Publishing behaviors vary depending on the approach. For more information, see Configure for Spark.
To enable SSO authentication with Azure Databricks, you enable SSO integration with Azure AD. For more information, see Configure SSO for Azure AD.
For enhanced security, you can configure the platform to use an Azure Managed Identity. When this feature is enabled, the platform queries the Key Vault for the secret holding the applicationId and secret to the service principal that provides access to the Azure services.
NOTE: This feature is supported for Azure Databricks only. |
NOTE: Your Azure Key Vault must already be configured, and the applicationId and secret must be available in the Key Vault. See Configure for Azure. |
To enable, the following parameters must be specified for the platform.
Parameter | Description |
---|---|
azure.managedIdentities.enabled | Set to true to enable use of Azure managed identities. |
azure.managedIdentities.keyVaultApplicationidSecretName | Specify the name of the Azure Key Vault secret that holds the service principal Application Id. |
azure.managedIdentities.keyVaultApplicationSecretSecretName | Specify the name of the Key Vault secret that holds the service principal secret. |
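A minimal sketch of these settings is shown below. The secret names are placeholders; substitute the names of the secrets that you created in your Azure Key Vault:
"azure.managedIdentities.enabled": true, "azure.managedIdentities.keyVaultApplicationidSecretName": "<application_id_secret_name>", "azure.managedIdentities.keyVaultApplicationSecretSecretName": "<application_secret_secret_name>", |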
Save your changes.
Optionally, you can provide to the platform the name of a shared Databricks cluster to be used to access Databricks Tables.
Pre-requisites:
NOTE: Any shared cluster must be maintained by the customer. |
Depending on the credential provider type, the following properties must be specified in the Spark configuration for the shared cluster.
ADLS Gen1 Credentials:
"dfs.adls.oauth2.client.id" "dfs.adls.oauth2.credential" "dfs.adls.oauth2.refresh.url" "dfs.adls.oauth2.access.token.provider.type" "dfs.adls.oauth2.access.token.provider" "fs.azure.account.oauth2.client.endpoint" "spark.hadoop.dfs.adls.oauth2.client.id" "spark.hadoop.dfs.adls.oauth2.credential" "spark.hadoop.dfs.adls.oauth2.refresh.url" "spark.hadoop.fs.azure.account.oauth2.client.endpoint" "spark.hadoop.dfs.adls.oauth2.access.token.provider.type" "spark.hadoop.dfs.adls.oauth2.access.token.provider" |
For more information, see https://docs.databricks.com/data/data-sources/azure/azure-datalake.html.
ADLS Gen2 Credentials:
"fs.azure.account.auth.type" "fs.azure.account.oauth.provider.type" "fs.azure.account.oauth2.client.id" "fs.azure.account.oauth2.client.secret" "fs.azure.account.oauth2.client.endpoint" "spark.hadoop.fs.azure.account.auth.type" "spark.hadoop.fs.azure.account.oauth.provider.type" "spark.hadoop.fs.azure.account.oauth2.client.id" "spark.hadoop.fs.azure.account.oauth2.client.secret" "spark.hadoop.fs.azure.account.oauth2.client.endpoint" |
For more information, see https://docs.databricks.com/data/data-sources/azure/azure-datalake-gen2.html.
WASB Credentials:
fs.azure.sas.<container-name>.<storage-account-name>.blob.core.windows.net
spark.hadoop.fs.azure.sas.<container-name>.<storage-account-name>.blob.core.windows.net
azure.wasb.stores |
For more information, see https://docs.databricks.com/data/data-sources/azure/azure-storage.html.
Steps:
Locate the following parameter, and add the name of the Databricks cluster to use to browse Databricks Tables:
"feature.databricks.connection.clusterName": "<your_cluster_name>", |
When a cluster name is provided:
NOTE: While using a single cluster shared across users to access Databricks Tables, each user must have a valid Databricks Personal Access Token to the shared cluster. |
If a cluster name is not provided:
Individual users can specify the name of a cluster to which they have been permissioned for accessing Databricks Tables through their Databricks settings. This cluster can also be shared among users. For more information, see Databricks Settings Page.
As needed, you can pass additional properties to the Spark running environment through the spark.props configuration area.
NOTE: These properties are passed to Spark for all jobs. |
Steps:
Locate the spark.props configuration area.
Insert new Spark properties. For example, you can specify the spark.props.spark.executor.memory property, which changes the memory allocated to the Spark executor on each node, by using the following in the spark.props area:
"spark": { ... "props": { "spark.executor.memory": "6GB" } ... } |
For more information on modifying these settings, see Configure for Spark.
There is a limit of 30 requests per second per workspace on the Databricks REST APIs. If this limit is reached, then an HTTP status code 429 error is returned, indicating that rate limiting is being applied by the server. By default, the platform re-attempts to submit a request 5 times and then fails the job if the request is not accepted.
If you want to change the number of retries, change the value for the databricks.maxAPICallRetries flag.
Value | Description |
---|---|
5 | (default) When a request is submitted through the Azure Databricks REST APIs, up to 5 retries can be performed if the request is rejected with a 429 error. |
0 | When an API call fails, the request fails. As the number of concurrent jobs increases, more jobs may fail. |
5+ | Increasing this setting above the default value may result in more requests eventually getting processed. However, increasing the value may consume additional system resources in a high-concurrency environment, and jobs might take longer to run due to exponentially increasing waiting time. |
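For example, to keep the default retry behavior explicit in your configuration, the flag might be set as follows (5 is the default described above):
"databricks.maxAPICallRetries": 5, |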
When the above configuration has been completed, you can select the running environment through the application.
NOTE: When a Databricks job fails, the failure is reported immediately in the |
See Run Job Page.
You can use API calls to execute jobs.
Please make sure that the request body contains the following:
"execution": "databricksSpark", |
For more information, see operation/runJobGroup.
When running a job using Spark on Azure Databricks, the job may fail with an invalid Spark version error. In this case, the Databricks version of Spark has been deprecated.
Solution:
Since an Azure Databricks cluster is created for each user, the solution is to identify the cluster version to use, configure the platform to use it, and then restart the platform.
Acquire the current value of databricks.sparkVersion.
In Azure Databricks, compare your value to the list of supported Azure Databricks versions. If your version is unsupported, identify a new version to use.
NOTE: Please make note of the version of Spark supported for the version of Azure Databricks that you have chosen. |
In the configuration, set databricks.sparkVersion to the new version to use.
NOTE: The value for |
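For example, if you selected Azure Databricks 7.3, the setting might look like the following. The runtime string shown is illustrative; use the exact version string that Azure Databricks lists for your chosen runtime:
"databricks.sparkVersion": "7.3.x-scala2.12", |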
When you run a job on Databricks, the job may fail with the following error:
java.lang.ClassCastException: org.apache.spark.scheduler.ResultTask cannot be cast to org.apache.spark.scheduler.Task at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:616) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) |
The job.log
file may contain something similar to the following:
2022-07-19T15:41:24.832Z - [sid=0cf0cff5-2729-4742-a7b9-4607ca287a98] - [rid=83eb9826-fc3b-4359-8e8f-7fbf77300878] - [Async-Task-9] INFO com.trifacta.databricks.spark.JobHelper - Got error org.apache.spark.SparkException: Stage 0 failed. Error: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, ip-10-243-149-238.eu-west-1.compute.internal, executor driver): java.lang.ClassCastException: org.apache.spark.scheduler.ResultTask cannot be cast to org.apache.spark.scheduler.Task at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:616) ... |
This error is due to a class mismatch between the Spark libraries provided by the platform and those provided by Databricks.
Solution:
The solution is to disable the precedence of the Spark JARs provided by the platform over the Databricks Spark JARs. Please perform the following steps:
Locate the spark.props section and add the following configuration elements:
"spark": { ... "props": { "spark.driver.userClassPathFirst": false, "spark.executor.userClassPathFirst": false, ... } }, |