
...

...

Prerequisites

  • The D s platform must be deployed in Microsoft Azure.

    Info

    NOTE: If you are using Azure Databricks as a datasource, verify that OpenJDK v1.8.0_302 or earlier is installed on the D s node. Java 8 is required. If necessary, downgrade the Java version and restart the platform. There is a known issue with TLS v1.3.

...

Job clusters automatically terminate after the job completes. A new cluster is automatically created the next time the user requests access to Azure Databricks. 

Cluster Mode | Description

USER

When a user submits a job, D s product creates a new cluster and persists the cluster ID in D s product metadata for the user if no cluster exists or the existing cluster is invalid. If the user already has a valid interactive cluster, the existing cluster is reused when the job is submitted.

This is the default cluster mode for running jobs in Azure Databricks.

JOB

When a user submits a job, D s product provides all of the cluster specifications through the Databricks API. Databricks creates a cluster for this job only and terminates it as soon as the job completes.
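As a rough sketch only, the cluster mode might be selected through a configuration entry similar to the following. The databricks.clusterMode key and the flat JSON layout are assumptions for illustration; the accepted values correspond to the modes in the table above.

Code Block
{
  "databricks.clusterMode": "JOB"
}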

Configure use of cluster policies

...

Instance pooling for worker nodes

Prerequisites:

  • All cluster nodes used by the D s platform are taken from the pool. If the pool has an insufficient number of nodes, cluster creation fails. (A configuration sketch follows this list.)
  • Each user must have access to the pool and must have at least the ATTACH_TO permission.
  • Each user must have a personal access token from the same Azure Databricks workspace. See Configure personal access token below.
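The following sketch shows how an instance pool might be referenced from platform configuration. The databricks.poolId key, the flat JSON layout, and the placeholder value are all assumptions for illustration; substitute the setting name and the instance pool ID that apply to your deployment.

Code Block
{
  "databricks.poolId": "<your-instance-pool-id>"
}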

...

  1. D s config
    methodws
  2. Locate the following property and set it to one of the values listed below (a configuration sketch follows these steps):

    Code Block
    Databricks Job Management

    Property Value | Description

    Never Delete

    (default) Job definitions are never deleted from the Azure Databricks cluster.

    Always Delete

    The Azure Databricks job definition is deleted during the clean-up phase, which occurs after a job completes.

    Delete Successful Only

    When a job completes successfully, the Azure Databricks job definition is deleted during the clean-up phase. Failed or canceled jobs are not deleted, which allows you to debug them as needed.

    Skip Job Creation

    For jobs that are to be executed only one time, the D s platform can be configured to use a different mechanism for submitting the job. When this option is enabled, the D s platform submits jobs using the run-submit API instead of the run-now API. The run-submit API does not create an Azure Databricks job definition, so the submitted job does not count toward the enforced job limit.

    Default

    Inherits the default system-wide setting.
  3. To have the platform fall back to the runs/submit API when the job limit for the Databricks workspace has been reached, enable the following setting:

    Code Block
    Databricks Job Runs Submit Fallback
  4. Save your changes and restart the platform.
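As a purely illustrative sketch, the two settings above might be represented in configuration along these lines. The databricks.jobManagement and databricks.enableRunsSubmitFallback key names, the value format, and the flat JSON layout are assumptions; in the product, use the Databricks Job Management and Databricks Job Runs Submit Fallback settings described above.

Code Block
{
  "databricks.jobManagement": "Skip Job Creation",
  "databricks.enableRunsSubmitFallback": true
}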

...

  1. D s config
  2. In the D s webapp, select User menu > Admin console > Admin settings.
  3. Locate the following settings and set their values accordingly (a configuration sketch follows these steps):

    Setting | Description

    databricks.userClusterThrottling.enabled

    When set to true, job throttling per Databricks cluster is enabled. Specify the following settings.

    databricks.userClusterthrottling.maxTokensAllottedPerUserCluster

    The maximum number of concurrent jobs that can run on one user cluster. Default value is 20.

    databricks.userClusterthrottling.tokenExpiryInMinutes

    The time in minutes after which the tokens reserved by a job are revoked, regardless of job status. If a job is still in progress when this limit is reached, its Databricks token is assumed to be stale and is revoked. Default value is 120 (2 hours).

    Tip

    Tip: Set this value to 0 to prevent token expiration. However, this setting is not recommended, as jobs can remain in the queue indefinitely.

    jobMonitoring.queuedJobTimeoutMinutes

    The maximum time in minutes that a job is permitted to remain in the queue waiting for a slot on the Databricks cluster. If this limit is reached, the job is marked as failed.

    batch-job-runner.cleanup.enabled

    When set to true, the Batch Job Runner service is permitted to clean up throttling tokens and job-level personal access tokens.

    Tip

    Tip: Unless you have reason to do otherwise, leave this setting set to true.

  4. Save your changes and restart the platform.
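Taken together, a throttling configuration using the default values from the table above might look like the following sketch. The flat, dotted-key JSON layout and the queued-job timeout value of 60 are illustrative assumptions; your configuration file may organize these keys differently.

Code Block
{
  "databricks.userClusterThrottling.enabled": true,
  "databricks.userClusterthrottling.maxTokensAllottedPerUserCluster": 20,
  "databricks.userClusterthrottling.tokenExpiryInMinutes": 120,
  "jobMonitoring.queuedJobTimeoutMinutes": 60,
  "batch-job-runner.cleanup.enabled": true
}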

...

  1. D s config
  2. Locate the following properties and set them accordingly:

    Setting | Description

    databricks.secretNamespace

    If multiple instances of D s product are using the same Databricks cluster, you can specify the Databricks namespace to which these properties apply.

    databricks.secrets

    An array of strings naming the properties that you want to store in Databricks Secrets Management. For example, the default value stores a recommended set of Spark and Databricks properties:

    Code Block
    ["spark.hadoop.dfs.adls.oauth2.client.id","spark.hadoop.dfs.adls.oauth2.credential",
    "dfs.adls.oauth2.client.id","dfs.adls.oauth2.credential",
    "fs.azure.account.oauth2.client.id","fs.azure.account.oauth2.client.secret"],

    You can add or remove properties from this array as needed. (A configuration sketch follows these steps.)

  3. Save your changes and restart the platform.
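As an illustration, the two properties might appear together as follows. The flat JSON layout and the placeholder namespace value are assumptions; the array contents mirror the default shown above.

Code Block
{
  "databricks.secretNamespace": "<your-databricks-namespace>",
  "databricks.secrets": [
    "spark.hadoop.dfs.adls.oauth2.client.id",
    "spark.hadoop.dfs.adls.oauth2.credential",
    "dfs.adls.oauth2.client.id",
    "dfs.adls.oauth2.credential",
    "fs.azure.account.oauth2.client.id",
    "fs.azure.account.oauth2.client.secret"
  ]
}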

...

To enable this feature, the following parameters must be specified for the D s platform.

D s config

Parameter | Description

azure.managedIdentities.enabled

Set to true to enable use of Azure managed identities.

azure.managedIdentities.keyVaultApplicationidSecretName

Specify the name of the Azure Key Vault secret that holds the service principal Application Id.

azure.managedIdentities.keyVaultApplicationSecretSecretName

Specify the name of the Key Vault secret that holds the service principal secret.

Save your changes.
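A minimal sketch of these parameters, assuming a flat JSON layout and placeholder Key Vault secret names (both assumptions for illustration):

Code Block
{
  "azure.managedIdentities.enabled": true,
  "azure.managedIdentities.keyVaultApplicationidSecretName": "<application-id-secret-name>",
  "azure.managedIdentities.keyVaultApplicationSecretSecretName": "<application-secret-secret-name>"
}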

Enable shared cluster for Databricks Tables

Optionally, you can provide the D s platform with the name of a shared Databricks cluster to use for accessing Databricks Tables. 

Prerequisites:

Info

NOTE: Any shared cluster must be maintained by the customer.

...

If you want to change the number of retries, change the value of the databricks.maxAPICallRetries flag.

Value | Description

5

(default) When a request is submitted through the Azure Databricks REST APIs, up to 5 retries can be performed if the request fails.

  • The waiting period increases exponentially with each retry. For example, the wait time is 10 seconds for the first retry, 20 seconds for the second, 40 seconds for the third, and so on.
  • You can adjust this value based on how many minutes or seconds of retrying you want to allow.

0

When an API call fails, the request fails. As the number of concurrent jobs increases, more jobs may fail.

Info

NOTE: This setting is not recommended.

5+

Increasing this setting above the default value may result in more requests eventually being processed. However, it may consume additional system resources in a high-concurrency environment, and jobs may take longer to run due to the exponentially increasing wait times.
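For example, keeping the default retry behavior corresponds to the following sketch (the flat JSON layout is assumed for illustration):

Code Block
{
  "databricks.maxAPICallRetries": 5
}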

Use

Run job from application

...

  1. D s config
  2. Acquire the value for databricks.sparkVersion.
  3. In Azure Databricks, compare your value to the list of supported Azure Databricks versions. If your version is unsupported, identify a new version to use.

    Info

    NOTE: Make note of the Spark version supported for the Azure Databricks version that you have chosen.

  4. In the D s platform configuration, set databricks.sparkVersion to the new version to use (see the sketch after these steps).

    Info

    NOTE: The value for spark.version does not apply to Databricks.

  5. Restart the D s platform.
  6. After the platform restarts, a new Azure Databricks cluster is created for each user, using the specified values, the next time the user runs a job.
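For illustration, the updated setting might look like the following sketch. The flat JSON layout and the placeholder value are assumptions; use a version string from the list of supported Azure Databricks versions, and note the Spark version it carries.

Code Block
{
  "databricks.sparkVersion": "<supported-azure-databricks-version>"
}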

Spark job fails with "spark scheduler cannot be cast" error

...

  1. Run a job.

...

Code Block
java.lang.ClassCastException: org.apache.spark.scheduler.ResultTask cannot be cast to org.apache.spark.scheduler.Task
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:616)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

The job.log  file may contain something similar to the following:

Code Block
2022-07-19T15:41:24.832Z - [sid=0cf0cff5-2729-4742-a7b9-4607ca287a98] - [rid=83eb9826-fc3b-4359-8e8f-7fbf77300878] - [Async-Task-9] INFO com.trifacta.databricks.spark.JobHelper - Got error org.apache.spark.SparkException: Stage 0 failed. Error: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, ip-10-243-149-238.eu-west-1.compute.internal, executor driver): java.lang.ClassCastException: org.apache.spark.scheduler.ResultTask cannot be cast to org.apache.spark.scheduler.Task
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:616)
...

This error is due to a class mismatch between the D s platform and Databricks.

Solution:

The solution is to disable the precedence of the Spark JARs provided by the D s platform over the Databricks Spark JARs. Perform the following steps:

  1. D s config
    methodt
  2. Locate the spark.props  section and add the following configuration elements:

    Code Block
    "spark": {
        ...
        "props": {
          "spark.driver.userClassPathFirst": false,
          "spark.executor.userClassPathFirst": false,
          ...
        }
    },
  3. Save your changes and restart the platform.