...
- Table access control: https://docs.databricks.com/security/access-control/table-acls/object-privileges.html
...
Prerequisites
The D s platform must be deployed in Microsoft Azure.
Info NOTE: If you are using Azure Databricks as a datasource, please verify that openJDK v1.8.0_302 or earlier is installed on the D s node. Java 8 is required. If necessary, downgrade the Java version and restart the platform. There is a known issue with TLS v1.3.
...
The job clusters automatically terminate after the job is completed. A new cluster is automatically created the next time the user requests access to Azure Databricks.
Cluster Mode | Description
---|---
USER | When a user submits a job, it runs on that user's dedicated cluster. This is the default cluster mode for running jobs in Azure Databricks.
JOB | When a user submits a job, a new job cluster is created to run it. The cluster automatically terminates after the job is completed.
Configure use of cluster policies
...
Instance pooling for worker nodes
Prerequisites:
- All cluster nodes used by the D s platform are taken from the pool. If the pool has an insufficient number of nodes, cluster creation fails.
- Each user must have access to the pool and must have at least the ATTACH_TO permission.
- Each user must have a personal access token from the same Azure Databricks workspace. See Configure personal access token below.
...
D s config method ws
Locate the following property and set it to one of the values listed below:
Code Block Databricks Job Management
Property Value | Description
---|---
Never Delete (default) | Job definitions are never deleted from the Azure Databricks cluster.
Always Delete | The Azure Databricks job definition is deleted during the clean-up phase, which occurs after a job completes.
Delete Successful Only | When a job completes successfully, the Azure Databricks job definition is deleted during the clean-up phase. Failed or canceled jobs are not deleted, which allows you to debug as needed.
Skip Job Creation | For jobs that are to be executed only one time, the D s platform can be configured to use a different mechanism for submitting the job. When this option is enabled, the D s platform submits jobs using the run-submit API instead of the run-now API. The run-submit API does not create an Azure Databricks job definition, so the submitted job does not count toward the enforced job limit.
Default | Inherits the default system-wide setting.
When the following feature is enabled, the platform falls back to the runs/submit API when the job limit for the Databricks workspace has been reached:
Code Block Databricks Job Runs Submit Fallback
- Save your changes and restart the platform.
...
D s config
- In the D s webapp, select User menu > Admin console > Admin settings.
- Locate the following settings and set their values accordingly:
Setting | Description
---|---
databricks.userClusterThrottling.enabled | When set to true, job throttling per Databricks cluster is enabled. Please specify the following settings.
databricks.userClusterthrottling.maxTokensAllottedPerUserCluster | Set this value to the maximum number of concurrent jobs that can run on one user cluster. Default value is 20.
databricks.userClusterthrottling.tokenExpiryInMinutes | The time in minutes after which tokens reserved by a job are revoked, irrespective of the job status. If a job is in progress and this limit is reached, the Databricks token is expired and revoked under the assumption that it is stale. Default value is 120 (2 hours). Tip: Set this value to 0 to prevent token expiration. However, this setting is not recommended, as jobs can remain in the queue indefinitely.
jobMonitoring.queuedJobTimeoutMinutes | The maximum time in minutes that a job is permitted to remain in the queue for a slot on the Databricks cluster. If this limit is reached, the job is marked as failed.
batch-job-runner.cleanup.enabled | When set to true, the Batch Job Runner service is permitted to clean up throttling tokens and job-level personal access tokens. Tip: Unless you have reason to do otherwise, leave this setting set to true.
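For reference, the following is a minimal sketch of how these throttling settings might look when set together. The values follow the defaults described above (20 tokens per user cluster, 120-minute token expiry); the 60-minute queued-job timeout is an assumed example rather than a documented default, and the exact representation (individual Admin Settings entries versus a nested configuration file) depends on how you apply configuration in your deployment.
Code Block
"databricks.userClusterThrottling.enabled": true,
"databricks.userClusterthrottling.maxTokensAllottedPerUserCluster": 20,
"databricks.userClusterthrottling.tokenExpiryInMinutes": 120,
"jobMonitoring.queuedJobTimeoutMinutes": 60,
"batch-job-runner.cleanup.enabled": true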
- Save your changes and restart the platform.
...
D s config
Locate the following properties and set them accordingly:
Setting | Description
---|---
databricks.secretNamespace | If multiple instances of D s product are using the same Databricks cluster, you can specify the Databricks namespace to which these properties apply.
databricks.secrets | An array containing strings representing the properties that you wish to store in Databricks Secrets Management.
For example, the default value of databricks.secrets stores a recommended set of Spark and Databricks properties:
Code Block
["spark.hadoop.dfs.adls.oauth2.client.id","spark.hadoop.dfs.adls.oauth2.credential", "dfs.adls.oauth2.client.id","dfs.adls.oauth2.credential", "fs.azure.account.oauth2.client.id","fs.azure.account.oauth2.client.secret"],
You can add or remove properties from this array as needed.
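As an illustration only, the two properties might be set together as follows. The namespace value is a hypothetical example, and the array has been shortened; list the property names you actually want stored in Databricks Secrets Management.
Code Block
"databricks.secretNamespace": "trifacta-prod",
"databricks.secrets": [
  "spark.hadoop.dfs.adls.oauth2.client.id",
  "spark.hadoop.dfs.adls.oauth2.credential",
  "fs.azure.account.oauth2.client.id",
  "fs.azure.account.oauth2.client.secret"
]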
- Save your changes and restart the platform.
...
To enable, set the following parameters for the D s platform.
D s config
Parameter | Description |
---|---|
azure.managedIdentities.enabled | Set to true to enable use of Azure managed identities. |
azure.managedIdentities.keyVaultApplicationidSecretName | Specify the name of the Azure Key Vault secret that holds the service principal Application Id. |
azure.managedIdentities.keyVaultApplicationSecretSecretName | Specify the name of the Key Vault secret that holds the service principal secret. |
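For example, a sketch with all three parameters set; the secret names are placeholders and must match secrets that already exist in your Azure Key Vault.
Code Block
"azure.managedIdentities.enabled": true,
"azure.managedIdentities.keyVaultApplicationidSecretName": "<application-id-secret-name>",
"azure.managedIdentities.keyVaultApplicationSecretSecretName": "<application-secret-secret-name>"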
Save your changes.
Enable shared cluster for Databricks Tables
Optionally, you can provide a shared cluster to the D s platform for use with Databricks Tables.
Prerequisites:
Info NOTE: Any shared cluster must be maintained by the customer.
...
If you want to change the number of retries, change the value for the databricks.maxAPICallRetries flag.
Value | Description
---|---
5 (default) | When a request is submitted through the Azure Databricks REST APIs, up to 5 retries can be performed if the request fails, with an exponentially increasing wait between retries.
0 | When an API call fails, the request fails. As the number of concurrent jobs increases, more jobs may fail.
5+ | Increasing this setting above the default value may result in more requests eventually getting processed. However, increasing the value may consume additional system resources in a high-concurrency environment, and jobs might take longer to run due to exponentially increasing waiting time.
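For example, to allow a few more retries in a high-concurrency environment, you might raise the flag above the default. The value 8 below is an arbitrary illustration, not a recommended setting.
Code Block
"databricks.maxAPICallRetries": 8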
Use
Run job from application
...
D s config
- Acquire the value for databricks.sparkVersion. In Azure Databricks, compare your value to the list of supported Azure Databricks versions. If your version is unsupported, identify a new version to use.
Info NOTE: Please make note of the version of Spark supported for the version of Azure Databricks that you have chosen.
- In the D s platform configuration, set databricks.sparkVersion to the new version to use (a sketch follows these steps).
Info NOTE: The value for spark.version does not apply to Databricks.
- Restart the D s platform.
- After the platform restarts, a new Azure Databricks cluster is created for each user, using the specified values, the next time the user runs a job.
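For illustration, the property would be updated along these lines; the version string is a placeholder and must be replaced with a runtime version that Azure Databricks currently supports.
Code Block
"databricks.sparkVersion": "<supported-azure-databricks-version>"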
Spark job fails with "spark scheduler cannot be cast" error
...
- a job
...
Code Block
java.lang.ClassCastException: org.apache.spark.scheduler.ResultTask cannot be cast to org.apache.spark.scheduler.Task
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:616)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
The job.log file may contain something similar to the following:
Code Block
2022-07-19T15:41:24.832Z - [sid=0cf0cff5-2729-4742-a7b9-4607ca287a98] - [rid=83eb9826-fc3b-4359-8e8f-7fbf77300878] - [Async-Task-9] INFO com.trifacta.databricks.spark.JobHelper - Got error org.apache.spark.SparkException: Stage 0 failed. Error: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, ip-10-243-149-238.eu-west-1.compute.internal, executor driver): java.lang.ClassCastException: org.apache.spark.scheduler.ResultTask cannot be cast to org.apache.spark.scheduler.Task
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:616)
    ...
This error is due to a class mismatch between the Spark libraries provided by the D s platform and those available on Azure Databricks.
Solution:
The solution is to disable the precedence of the Spark JARs provided from the D s platform, so that the Spark libraries available on Azure Databricks are used instead.
D s config method t
Locate the spark.props section and add the following configuration elements:
Code Block
"spark": {
  ...
  "props": {
    "spark.driver.userClassPathFirst": false,
    "spark.executor.userClassPathFirst": false,
    ...
  }
},
Save your changes and restart the platform.