...

  • The 

    D s platform
     must be deployed in Microsoft Azure.  

    Info

NOTE: If you are using Azure Databricks as a datasource, please verify that openJDK v1.8.0_242 or earlier is installed on the

    D s node
    . Java 8 is required. If necessary, downgrade the Java version and restart the platform. There is a known issue with TLS v1.3.


...

    
Parameter | Description | Value
feature.parameterization.matchLimitOnSampling.databricksSpark

Maximum number of parameterized source files that are permitted for matching in a single dataset with parameters.

databricks.workerNodeType

Type of node to use for the Azure Databricks Workers/Executors. There are 1 or more Worker nodes per cluster.

Default: Standard_D3_v2

Info

NOTE: This property is unused when instance pooling is enabled. For more information, see Configure instance pooling below.

For more information, see the sizing guide for Azure Databricks.

databricks.sparkVersion

Azure Databricks cluster version, which also includes the Spark version.

Depending on your version of Azure Databricks, please set this property according to the following:

  • Azure Databricks 5.5 LTS: 5.5.x-scala2.11
  • Azure Databricks 6.x: Please use the default value for your Azure Databricks distribution.
  • Azure Databricks 7.x: Not supported.

Please do not use other values. For more information, see Configure for Spark.

databricks.serviceUrl

URL to the Azure Databricks Service where Spark jobs will be run (Example: https://westus2.azuredatabricks.net)

databricks.minWorkers

Initial number of Worker nodes in the cluster, and also the minimum number of Worker nodes that the cluster can scale down to during auto-scale-down

Minimum value: 1

Increasing this value can increase compute costs.

databricks.maxWorkers

Maximum number of Worker nodes the cluster can create during auto-scaling

Minimum value: Not less than databricks.minWorkers.

Increasing this value can increase compute costs.

databricks.poolId

If you have enabled instance pooling in Azure Databricks, you can specify the pool identifier here. For more information, see Configure instance pooling below.


Info

NOTE: If both poolId and poolName are specified, poolId is used first. If that fails to find a matching identifier, then the poolName value is checked.


databricks.poolName

If you have enabled instance pooling in Azure Databricks, you can specify the pool name here. For more information, see Configure instance pooling below.

See previous.

Tip

Tip: If you specify only a poolName value, then instance pools with that name can be used across multiple Databricks workspaces when you create a new cluster.


databricks.driverNodeType

Type of node to use for the Azure Databricks Driver. There is only 1 Driver node per cluster.

Default: Standard_D3_v2

For more information, see the sizing guide for Databricks.

Info

NOTE: This property is unused when instance pooling is enabled. For more information, see Configure instance pooling below.


databricks.logsDestination

DBFS location that cluster logs will be sent to every 5 minutes. Leave this value as /trifacta/logs.

databricks.enableAutotermination

Set to true to enable auto-termination of a user cluster after N minutes of idle time, where N is the value of the autoterminationMinutes property. Unless otherwise required, leave this value as true.

databricks.clusterStatePollerDelayInSeconds

Number of seconds to wait between polls for Azure Databricks cluster status when a cluster is starting up

databricks.clusterStartupWaitTimeInMinutes

Maximum time in minutes to wait for a Cluster to get to Running state before aborting and failing an Azure Databricks job
databricks.clusterLogSyncWaitTimeInMinutes

Maximum time in minutes to wait for a Cluster to complete syncing its logs to DBFS before giving up on pulling the cluster logs to the

D s node
.

Set this to 0 to disable cluster log pulls.
databricks.clusterLogSyncPollerDelayInSeconds

Number of seconds to wait between polls for a Databricks cluster to sync its logs to DBFS after job completion

databricks.autoterminationMinutes

Idle time in minutes before a user cluster will auto-terminate. Do not set this value to less than the cluster startup wait time value.
spark.useVendorSparkLibraries

When true, the platform bypasses shipping its installed Spark libraries to the cluster with each job's execution.


Info

NOTE: This setting is ignored. The vendor Spark libraries are always used for Azure Databricks.


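As a point of reference, the following is a minimal sketch of how the databricks.* properties above might appear in the platform configuration file. It assumes the dotted property names in the table correspond to entries under a databricks block; the values shown are illustrative only, and the defaults and limits described in the table should be used for your environment:

Code Block
"databricks": {
    "serviceUrl": "https://westus2.azuredatabricks.net",
    "sparkVersion": "5.5.x-scala2.11",
    "workerNodeType": "Standard_D3_v2",
    "driverNodeType": "Standard_D3_v2",
    "minWorkers": 2,
    "maxWorkers": 4,
    "logsDestination": "/trifacta/logs",
    "enableAutotermination": true,
    "autoterminationMinutes": 120,
    "clusterStatePollerDelayInSeconds": 15,
    "clusterStartupWaitTimeInMinutes": 60,
    "clusterLogSyncWaitTimeInMinutes": 5,
    "clusterLogSyncPollerDelayInSeconds": 20
},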
...

  1. D s config
    methodws
  2. Locate the following property and set it to one of the values listed below:

    Code Block
    Databricks Job Management


    Property Value | Description

    Never Delete
    (default) Job definitions are never deleted from the Azure Databricks cluster.

    Always Delete
    The Azure Databricks job definition is deleted during the clean-up phase, which occurs after a job completes.

    Delete Successful Only
    When a job completes successfully, the Azure Databricks job definition is deleted during the clean-up phase. Failed or canceled jobs are not deleted, which allows you to debug as needed.
    Skip Job Creation

    For jobs that are to be executed only one time, the

    D s platform
    can be configured to use a different mechanism for submitting the job. When this option is enabled, the
    D s platform
    submits jobs using the run-submit API, instead of the run-now API. The run-submit API does not create an Azure Databricks job definition. Therefore the submitted job does not count toward the enforced job limit.

    Default
    Inherits the default system-wide setting.


  3. Save your changes and restart the platform.

...

To enable this feature, the following parameters for the

D s platform
 must be specified.

D s config

Parameter | Description

azure.managedIdentities.enabled

Set to true to enable use of Azure managed identities.

azure.managedIdentities.keyVaultApplicationidSecretName

Specify the name of the Azure Key Vault secret that holds the service principal Application Id.

azure.managedIdentities.keyVaultApplicationSecretSecretName

Specify the name of the Key Vault secret that holds the service principal secret.
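As a minimal sketch, assuming the dotted property names above correspond to a nested azure.managedIdentities block in the platform configuration file, and using illustrative secret names of your own choosing:

Code Block
"azure": {
    ...
    "managedIdentities": {
        "enabled": true,
        "keyVaultApplicationidSecretName": "my-sp-application-id",
        "keyVaultApplicationSecretSecretName": "my-sp-application-secret"
    }
},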

Save your changes.

Pass additional Spark properties
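As a minimal sketch, additional Spark properties can be passed through the spark.props section of the platform configuration. The property shown below (spark.driver.maxResultSize) is a standard Spark property used here purely as an illustration; substitute the properties and values required for your jobs:

Code Block
"spark": {
    ...
    "props": {
        "spark.driver.maxResultSize": "4g",
        ...
    }
},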

...

  1. D s config
  2. Acquire the value for databricks.sparkVersion.
  3. In Azure Databricks, compare your value to the list of supported Azure Databricks versions. If your version is unsupported, identify a new version to use.

    Info

    NOTE: Please make note of the version of Spark supported for the version of Azure Databricks that you have chosen.


  4. In the 
    D s platform
     configuration:
    1. Set databricks.sparkVersion to the new version to use.
    2. Set spark.version to the appropriate version of Spark to use. (See the sketch after these steps.)
  5. Restart the 
    D s platform
    .
  6. After the platform restarts, a new Azure Databricks cluster is created with the specified values for each user the next time that user runs a job.
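The following is a minimal sketch of the two properties set in step 4. The values shown are illustrative only; the correct Spark version depends on the Azure Databricks version that you selected, and in the configuration file these properties may appear as nested keys rather than flat dotted names:

Code Block
"databricks.sparkVersion": "5.5.x-scala2.11",
"spark.version": "2.4.0",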

Spark job fails with "spark scheduler cannot be cast" error

When you run a job on Databricks, the job may fail with the following error:

Code Block
java.lang.ClassCastException: org.apache.spark.scheduler.ResultTask cannot be cast to org.apache.spark.scheduler.Task
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:616)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

The job.log file may contain something similar to the following:

Code Block
2022-07-19T15:41:24.832Z - [sid=0cf0cff5-2729-4742-a7b9-4607ca287a98] - [rid=83eb9826-fc3b-4359-8e8f-7fbf77300878] - [Async-Task-9] INFO com.trifacta.databricks.spark.JobHelper - Got error org.apache.spark.SparkException: Stage 0 failed. Error: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, ip-10-243-149-238.eu-west-1.compute.internal, executor driver): java.lang.ClassCastException: org.apache.spark.scheduler.ResultTask cannot be cast to org.apache.spark.scheduler.Task
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:616)
...

This error is due to a class mismatch between the Spark libraries provided by the 

D s platform
 and those provided by Databricks.

Solution:

The solution is to stop giving precedence to the Spark JARs provided by the 

D s platform
 over the Databricks Spark JARs. Please perform the following steps:

  1. D s config
    methodt
  2. Locate the spark.props section and add the following configuration elements:

    Code Block
    "spark": {
        ...
        "props": {
          "spark.driver.userClassPathFirst": false,
          "spark.executor.userClassPathFirst": false,
          ...
        }
    },


  3. Save your changes and restart the platform.