This section describes how to configure the platform to integrate with Databricks hosted in Azure.
NOTE: You cannot integrate with existing Azure Databricks clusters.
Beginning in Release 7.1, the Azure Databricks integration switched from a Hive-based driver to a Simba driver for its Spark integration.
NOTE: If you have upgraded from a release before Release 7.1, you should review the Connect String Options in your Azure Databricks connections, such as Databricks Tables. These options may not work with the Simba driver.
No installation is required to use this new driver.
For more information on the Simba driver, see https://databricks.com/spark/odbc-driver-download.
NOTE: If you are using Azure AD to integrate with an Azure Databricks cluster, the Azure AD secret value stored in
Azure Databricks integration works with Spark 2.4.x only.
NOTE: The version of Spark for Azure Databricks must be applied to the platform configuration through the databricks.sparkVersion property.
By default, the number of jobs permitted on an Azure Databricks cluster is set to 1000, of which a maximum of 150 can run concurrently.
NOTE: To enable retrieval and auditing of job information after a job has been completed, the
For more information, see https://docs.databricks.com/user-guide/jobs.html.
NOTE: Integration with pre-existing Azure Databricks clusters is not supported.
When a user first requests access to Azure Databricks, a new Azure Databricks cluster is created for the user. Access can include a request to run a job or to browse Databricks Tables. Cluster creation may take a few minutes.
A new cluster is also created when a user launches a job after their existing cluster has terminated.
A user's cluster automatically terminates after a configurable time period. A new cluster is automatically created when the user next requests access to Azure Databricks. See "Configure Platform" below.
To enable Azure Databricks, please perform the following configuration changes.
Steps:
Locate the following parameters. Set them to the values listed below, which enable the Trifacta Photon (smaller jobs) and Azure Databricks (small to extra-large jobs) running environments:
"webapp.runInTrifactaServer": true, "webapp.runInDatabricks": true, "webapp.runWithSparkSubmit": false, "webapp.runinEMR": false, "webapp.runInDataflow": false, |
Please review and modify the following configuration settings.
NOTE: When you have finished modifying these settings, save them and restart the platform to apply.
Parameter | Description | Value |
---|---|---|
feature.parameterization.matchLimitOnSampling.databricksSpark | Maximum number of parameterized source files that are permitted for matching in a single dataset with parameters. | |
databricks.workerNodeType | Type of node to use for the Azure Databricks Workers/Executors. There are 1 or more Worker nodes per cluster. | Default: see the sizing guide for Azure Databricks. |
databricks.sparkVersion | Azure Databricks cluster version, which also includes the Spark version. | Set this property according to your version of Azure Databricks. Do not use other values. For more information, see Configure for Spark. |
databricks.serviceUrl | URL to the Azure Databricks Service where Spark jobs will be run (Example: https://westus2.azuredatabricks.net) | |
databricks.minWorkers | Initial number of Worker nodes in the cluster, and also the minimum number of Worker nodes that the cluster can scale down to during auto-scale-down. | Minimum value: Increasing this value can increase compute costs. |
databricks.maxWorkers | Maximum number of Worker nodes the cluster can create during auto-scaling. | Minimum value: not less than the value of databricks.minWorkers. Increasing this value can increase compute costs. |
databricks.poolId | If you have enabled instance pooling in Azure Databricks, you can specify the pool identifier here. For more information, see Configure instance pooling below. | |
databricks.driverNodeType | Type of node to use for the Azure Databricks Driver. There is only 1 Driver node per cluster. | Default: see the sizing guide for Azure Databricks. |
databricks.logsDestination | DBFS location to which cluster logs are sent every 5 minutes. | Leave this value as /trifacta/logs. |
databricks.enableAutotermination | Set to true to enable auto-termination of a user cluster after N minutes of idle time, where N is the value of databricks.autoterminationMinutes. | Unless otherwise required, leave this value as true. |
databricks.clusterStatePollerDelayInSeconds | Number of seconds to wait between polls for Azure Databricks cluster status when a cluster is starting up. | |
databricks.clusterStartupWaitTimeInMinutes | Maximum time in minutes to wait for a cluster to reach the Running state before aborting and failing an Azure Databricks job. | |
databricks.clusterLogSyncWaitTimeInMinutes | Maximum time in minutes to wait for a cluster to finish syncing its logs to DBFS before giving up on pulling the cluster logs to the platform. | Set this to 0 to disable cluster log pulls. |
databricks.clusterLogSyncPollerDelayInSeconds | Number of seconds to wait between polls for a Databricks cluster to sync its logs to DBFS after job completion. | |
databricks.autoterminationMinutes | Idle time in minutes before a user cluster auto-terminates. | Do not set this value to less than the cluster startup wait time value. |
spark.useVendorSparkLibraries | When set to true, the Spark libraries provided by Azure Databricks are used for job execution. | Leave the default value. Do not modify unless you are experiencing failures in Azure Databricks job execution. For more information, see Troubleshooting below. |
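For orientation, a sketch of how several of these cluster settings might appear together in platform configuration is shown below. The worker counts and idle timeout are placeholders to replace with your own values, and the service URL is the example from the table above, not a recommendation:

```
"databricks.serviceUrl": "https://westus2.azuredatabricks.net",
"databricks.minWorkers": <min_workers>,
"databricks.maxWorkers": <max_workers>,
"databricks.enableAutotermination": true,
"databricks.autoterminationMinutes": <idle_minutes>,
"databricks.logsDestination": "/trifacta/logs",
```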
Instance pooling reduces cluster node spin-up time by maintaining a set of idle, ready-to-use instances. The platform can be configured to leverage instance pooling on the Azure Databricks cluster.
Prerequisites:

The user or access token that creates clusters from the pool must have the ATTACH_TO permission on the instance pool.

To enable:
Set the following parameter to the Azure Databricks pool identifier:
"databricks.poolId": "<my_pool_id>", |
NOTE: When instance pooling is enabled, the following parameters are not used:
For more information, see https://docs.azuredatabricks.net/clusters/instance-pools/index.html.
Each user must insert a Databricks Personal Access Token to access Databricks resources. For more information, see Databricks Personal Access Token Page.
To enable SSO authentication with Azure Databricks, you enable SSO integration with Azure AD. For more information, see Configure SSO for Azure AD.
For enhanced security, you can configure the platform to use an Azure Managed Identity. When this feature is enabled, the platform queries the Azure Key Vault for the secrets holding the applicationId and secret of the service principal that provides access to the Azure services.
NOTE: This feature is supported for Azure Databricks only.
NOTE: Your Azure Key Vault must already be configured, and the applicationId and secret must be available in the Key Vault. See Configure for Azure.
To enable this feature, the following platform parameters must be specified.
Parameter | Description |
---|---|
azure.managedIdentities.enabled | Set to true to enable use of Azure managed identities. |
azure.managedIdentities.keyVaultApplicationidSecretName | Specify the name of the Azure Key Vault secret that holds the service principal Application Id. |
azure.managedIdentities.keyVaultApplicationSecretSecretName | Specify the name of the Key Vault secret that holds the service principal secret. |
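For example, these settings might look like the following in platform configuration; the secret names shown are placeholders for whatever secret names you created in your Azure Key Vault:

```
"azure.managedIdentities.enabled": true,
"azure.managedIdentities.keyVaultApplicationidSecretName": "<application_id_secret_name>",
"azure.managedIdentities.keyVaultApplicationSecretSecretName": "<application_secret_secret_name>",
```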
Save your changes.
As needed, you can pass additional properties to the Spark running environment through the spark.props
configuration area.
NOTE: These properties are passed to Spark for all jobs.
Steps:
Locate the spark.props configuration block.

Insert new Spark properties. For example, you can specify the spark.props.spark.executor.memory property, which changes the memory allocated to the Spark executor on each node, by using the following in the spark.props area:
"spark": { ... "props": { "spark.executor.memory": "6GB" } ... } |
For more information on modifying these settings, see Configure for Spark.
When the above configuration has been completed, you can select the running environment through the application. See Run Job Page.
You can use API calls to execute jobs.
Please make sure that the request body contains the following:
"execution": "databricksSpark", |
For more information, see operation/runJobGroup.
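For illustration, a request body that runs a job on Azure Databricks might look like the following sketch. The wrangledDataset identifier is a placeholder, and the overall field layout should be confirmed against the operation/runJobGroup reference:

```
{
  "wrangledDataset": {
    "id": <dataset_id>
  },
  "overrides": {
    "execution": "databricksSpark"
  }
}
```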
When running a job using Spark on Azure Databricks, the job may fail with an invalid version error. In this case, the Databricks version of Spark has been deprecated.
Solution:
Since an Azure Databricks cluster is created for each user, the solution is to identify the cluster version to use, configure the platform to use it, and then restart the platform.
Acquire the current value of databricks.sparkVersion from platform configuration. In Azure Databricks, compare your value to the list of supported Azure Databricks versions. If your version is unsupported, identify a new version to use.
NOTE: Please make note of the version of Spark supported for the version of Azure Databricks that you have chosen.
In platform configuration, set databricks.sparkVersion to the new Azure Databricks version to use, and set spark.version to the appropriate version of Spark to use. Save your changes and restart the platform.
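As a sketch, the two version settings end up side by side in platform configuration, as shown below; the version strings are placeholders, not recommended values:

```
"databricks.sparkVersion": "<supported_azure_databricks_version>",
"spark.version": "<matching_spark_version>",
```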