...
- Supported for Azure Databricks versions 5.5 LTS and 6.x.
- The platform must be installed on Microsoft Azure.
- Nested folders are not supported when running jobs from Azure Databricks.
NOTE: Avoid including spaces in the paths to your ADLS sources. Spaces in the path value can cause errors during execution on Databricks.
- When a job is started and no cluster is available, a cluster is initiated, which can take up to four minutes. If the job is canceled during cluster startup:
- The job is terminated, and the cluster remains.
- The job is reported in the application as Failed, instead of Canceled.
- Azure Databricks integration works with Spark 2.4.x only.
NOTE: The Spark version for Azure Databricks must be applied to the platform configuration through the databricks.sparkVersion property (a configuration sketch follows this list). Details are provided later.
- Azure Databricks integration does not work with Hive.
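As a reference, here is a minimal sketch of how this property might look if applied by editing the platform configuration file directly. This assumes the trifacta-conf.json mechanism; the version string shown is purely illustrative, so substitute the value that matches your Azure Databricks deployment:

```json
{
  "databricks": {
    "sparkVersion": "6.x-scala2.11"
  }
}
```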
...
Parameter | Description | Value
---|---|---
feature.parameterization.matchLimitOnSampling.databricksSpark | Maximum number of parameterized source files that are permitted for matching in a single dataset with parameters. |
databricks.workerNodeType | Type of node to use for the Azure Databricks Workers/Executors. There are one or more Worker nodes per cluster. | Default: For more information, see the sizing guide for Azure Databricks.
databricks.sparkVersion | Azure Databricks cluster version, which also includes the Spark version. | Set this property according to your version of Azure Databricks. Do not use other values. For more information, see Configure for Spark.
databricks.serviceUrl | URL to the Azure Databricks Service where Spark jobs will be run (example: https://westus2.azuredatabricks.net). |
databricks.minWorkers | Initial number of Worker nodes in the cluster, and also the minimum number of Worker nodes that the cluster can scale down to during auto-scale-down. | Minimum value: Increasing this value can increase compute costs.
databricks.maxWorkers | Maximum number of Worker nodes the cluster can create during auto-scaling. | Minimum value: not less than the value of databricks.minWorkers. Increasing this value can increase compute costs.
databricks.poolId | If you have enabled instance pooling in Azure Databricks, you can specify the pool identifier here. For more information, see Configure instance pooling below. |
databricks.driverNodeType | Type of node to use for the Azure Databricks Driver. There is only one Driver node per cluster. | Default: For more information, see the sizing guide for Azure Databricks.
databricks.logsDestination | DBFS location to which cluster logs are sent every five minutes. | Leave this value as /trifacta/logs.
databricks.enableAutotermination | Set to true to enable auto-termination of a user cluster after N minutes of idle time, where N is the value of the databricks.autoterminationMinutes property. | Unless otherwise required, leave this value as true.
databricks.clusterStatePollerDelayInSeconds | Number of seconds to wait between polls for Azure Databricks cluster status while a cluster is starting up. |
databricks.clusterStartupWaitTimeInMinutes | Maximum time in minutes to wait for a cluster to reach the Running state before aborting and failing an Azure Databricks job. |
databricks.clusterLogSyncWaitTimeInMinutes | Maximum time in minutes to wait for a cluster to finish syncing its logs to DBFS before giving up on pulling the cluster logs to the platform. | Set this to 0 to disable cluster log pulls.
databricks.clusterLogSyncPollerDelayInSeconds | Number of seconds to wait between polls for a Databricks cluster to sync its logs to DBFS after job completion. |
databricks.autoterminationMinutes | Idle time in minutes before a user cluster auto-terminates. | Do not set this value to less than the cluster startup wait time value.
spark.useVendorSparkLibraries | When set to true, the vendor's (Azure Databricks) Spark libraries are used for job execution instead of the platform's bundled Spark libraries. | Do not modify the default unless you are experiencing failures in Azure Databricks job execution. For more information, see Troubleshooting below.
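Taken together, a sketch of how these settings might appear in trifacta-conf.json is shown below. Every value here is an illustrative placeholder rather than a recommendation: the node types, worker counts, and service URL must reflect your own environment, and any property you omit keeps its platform default.

```json
{
  "databricks": {
    "sparkVersion": "6.x-scala2.11",
    "serviceUrl": "https://westus2.azuredatabricks.net",
    "workerNodeType": "Standard_D3_v2",
    "driverNodeType": "Standard_D3_v2",
    "minWorkers": 2,
    "maxWorkers": 8,
    "logsDestination": "/trifacta/logs",
    "enableAutotermination": true,
    "autoterminationMinutes": 30
  }
}
```

Note that maxWorkers is kept at or above minWorkers, and autoterminationMinutes is kept above the cluster startup wait time, per the constraints in the table above.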
...
To enable, set the following parameters in the platform configuration:
Parameter | Description
---|---
azure.managedIdentities.enabled | Set to true to enable use of Azure managed identities. |
azure.managedIdentities.keyVaultApplicationidSecretName | Specify the name of the Azure Key Vault secret that holds the service principal Application Id. |
azure.managedIdentities.keyVaultApplicationSecretSecretName | Specify the name of the Key Vault secret that holds the service principal secret. |
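For example, here is a sketch of the corresponding configuration entries, again assuming direct edits to trifacta-conf.json. The secret names are hypothetical placeholders; use the names of the secrets you actually created in your Azure Key Vault:

```json
{
  "azure": {
    "managedIdentities": {
      "enabled": true,
      "keyVaultApplicationidSecretName": "sp-application-id",
      "keyVaultApplicationSecretSecretName": "sp-application-secret"
    }
  }
}
```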
Save your changes.
Pass additional Spark properties
...