...

  • Supported for Azure Databricks versions 5.5 LTS and 6.x.
  • The platform must be installed on Microsoft Azure.
  • Nested folders are not supported when running jobs from Azure Databricks.

    Info

    NOTE: Avoid including spaces in the paths to your ADLS sources. Spaces in the path value can cause errors during execution on Databricks. (See the path example after this list.)


  • When a job is started and no cluster is available, a new cluster is started, which can take up to four minutes. If the job is canceled during cluster startup:
    • The job is terminated, and the cluster remains.
    • The job is reported in the application as Failed, instead of Canceled.
  • Azure Databricks integration works with Spark 2.4.x only.

    Info

    NOTE: The version of Spark for Azure Databricks must be applied to the platform configuration through the databricks.sparkVersion property. Details are provided later.


  • Azure Databricks integration does not work with Hive.
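As noted above, avoid spaces in ADLS paths. For example, the first path below is safe, while the second can cause execution errors on Databricks; the storage account and container names here are hypothetical:

  Acceptable:  abfss://sources@examplestore.dfs.core.windows.net/orders/orders_2020.csv
  Problematic: abfss://sources@examplestore.dfs.core.windows.net/order files/orders 2020.csv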

...

    
feature.parameterization.matchLimitOnSampling.databricksSpark

Maximum number of parameterized source files that are permitted for matching in a single dataset with parameters.

databricks.workerNodeType

Type of node to use for the Azure Databricks Workers/Executors. There are one or more Worker nodes per cluster.

Default: Standard_D3_v2

Info

NOTE: This property is unused when instance pooling is enabled. For more information, see Configure instance pooling below.

For more information, see the sizing guide for Azure Databricks.

databricks.sparkVersion

Azure Databricks cluster version, which also includes the Spark version.

Set this property according to your version of Azure Databricks:

  • Azure Databricks 5.5 LTS: 5.5.x-scala2.11
  • Azure Databricks 6.x: Use the default value for your Azure Databricks distribution.
  • Azure Databricks 7.x: Not supported.

Do not use other values. For more information, see Configure for Spark.

databricks.serviceUrl

URL of the Azure Databricks service where Spark jobs are run. Example: https://westus2.azuredatabricks.net
databricks.minWorkers

Initial number of Worker nodes in the cluster. This is also the minimum number of Worker nodes to which the cluster can scale down during auto-scaling.

Minimum value: 1

Increasing this value can increase compute costs.

databricks.maxWorkers

Maximum number of Worker nodes the cluster can create during auto-scaling.

Minimum value: Not less than databricks.minWorkers.

Increasing this value can increase compute costs.

databricks.poolId

If you have enabled instance pooling in Azure Databricks, you can specify the pool identifier here. For more information, see Configure instance pooling below.


databricks.driverNodeType

Type of node to use for the Azure Databricks Driver. There is only 1 Driver node per cluster.

Default: Standard_D3_v2

For more information, see the sizing guide for Azure Databricks.

Info

NOTE: This property is unused when instance pooling is enabled. For more information, see Configure instance pooling below.


databricks.logsDestination

DBFS location to which cluster logs are sent every 5 minutes.

Leave this value as /trifacta/logs.

databricks.enableAutotermination

Set to true to enable auto-termination of a user cluster after N minutes of idle time, where N is the value of the databricks.autoterminationMinutes property.

Unless otherwise required, leave this value as true.

databricks.clusterStatePollerDelayInSeconds

Number of seconds to wait between polls for Azure Databricks cluster status while a cluster is starting up.

databricks.clusterStartupWaitTimeInMinutes

Maximum time in minutes to wait for a cluster to reach the Running state before aborting and failing an Azure Databricks job.

databricks.clusterLogSyncWaitTimeInMinutes

Maximum time in minutes to wait for a cluster to finish syncing its logs to DBFS before giving up on pulling the cluster logs to the platform node.

Set this value to 0 to disable cluster log pulls.

databricks.clusterLogSyncPollerDelayInSeconds

Number of seconds to wait between polls for a Databricks cluster to sync its logs to DBFS after job completion.

databricks.autoterminationMinutes

Idle time in minutes before a user cluster auto-terminates.

Do not set this value to less than the value of databricks.clusterStartupWaitTimeInMinutes.
spark.useVendorSparkLibraries

When set to true, the platform does not ship its installed Spark libraries to the cluster with each job execution.

Default: false

Info

NOTE: For Azure Databricks, set this value to true.

If you experience failures in Azure Databricks job execution, see Troubleshooting below.
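The following is a minimal sketch of how these settings might look together in the platform configuration, assuming a flat JSON property format; the node types, worker counts, and service URL shown are illustrative assumptions, not recommendations:

  {
    "databricks.serviceUrl": "https://westus2.azuredatabricks.net",
    "databricks.sparkVersion": "5.5.x-scala2.11",
    "databricks.workerNodeType": "Standard_D3_v2",
    "databricks.driverNodeType": "Standard_D3_v2",
    "databricks.minWorkers": 2,
    "databricks.maxWorkers": 8,
    "databricks.enableAutotermination": true,
    "databricks.autoterminationMinutes": 30,
    "databricks.logsDestination": "/trifacta/logs",
    "spark.useVendorSparkLibraries": true
  }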


Configure instance pooling

...

To enable, the following parameters must be specified in the platform configuration.

azure.managedIdentities.enabled

Set to true to enable the use of Azure managed identities.

azure.managedIdentities.keyVaultApplicationidSecretName

Name of the Azure Key Vault secret that holds the service principal Application ID.

azure.managedIdentities.keyVaultApplicationSecretSecretName

Name of the Key Vault secret that holds the service principal secret.
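As a minimal sketch, assuming the same flat JSON property format as above; the secret names shown are hypothetical placeholders for the names of your own Key Vault secrets:

  {
    "azure.managedIdentities.enabled": true,
    "azure.managedIdentities.keyVaultApplicationidSecretName": "sp-application-id",
    "azure.managedIdentities.keyVaultApplicationSecretSecretName": "sp-application-secret"
  }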

Save your changes.

Pass additional Spark properties

...