...
Supported versions of Databricks
AWS Databricks 8.3
AWS Databricks 7.3 (Recommended)
AWS Databricks 5.5 LTS
...
The job clusters automatically terminate after the job is completed. A new cluster is automatically created when the user next requests access to AWS Databricks.
Cluster Mode | Description | ||
---|---|---|---|
USER | When a user submits a job,
Reset to JOB mode to run jobs in AWS Databricks. | ||
JOB | When a user submits a job,
|
Configure Instance Profiles in AWS Databricks
...
Info |
---|
NOTE: For AWS Databricks, you can configure the instance profile value in |
aws.credentialProvider | AWS Databricks permissions | ||||
---|---|---|---|---|---|
instance |
| ||||
temporary |
| ||||
default | n/a |
Info |
---|
NOTE: If the
|
...
Instance pooling reduces cluster node spin-up time by maintaining a set of idle and ready instances.
...
For more information, see https://docs.databricks.com/clusters/instance-pools/configure.html.
Instance pooling for worker nodes
...
- All cluster nodes used by the platform are taken from the pool. If the pool has an insufficient number of nodes, cluster creation fails.
- Each user must have access to the pool and must have at least the ATTACH_TO permission.
- Each user must have a personal access token from the same AWS Databricks workspace. See Configure personal access token below.
...
- Acquire your pool identifier or pool name from AWS Databricks.

  NOTE: You can use either the Databricks pool identifier or pool name. If both poolId and poolName are specified, poolId is used first. If that fails to find a matching identifier, then the poolName value is checked.

  Tip: If you specify a poolName value only, then you can run your Databricks jobs against the available clusters across multiple workspaces. This mechanism allows for better resource allocation and broader execution options.
- Set either of the following parameters:
  - Set the following parameter to the AWS Databricks pool identifier:

        "databricks.poolId": "<my_pool_id>",
  - Or, you can set the following parameter to the AWS Databricks pool name:

        "databricks.poolName": "<my_pool_name>",
- Save your changes and restart the platform.
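The lookup order described in the note above (poolId first, then poolName) can be sketched as follows. This is a minimal illustration, not the platform's actual code: the function name `resolve_pool_id` is hypothetical, and the `instance_pool_id` / `instance_pool_name` keys are assumed to match the shape of a pool listing returned by Databricks.

```python
from typing import Dict, List, Optional


def resolve_pool_id(settings: Dict[str, str], pools: List[Dict[str, str]]) -> Optional[str]:
    """Pick a pool: try databricks.poolId first, then fall back to databricks.poolName.

    `pools` is an assumed listing of available pools, one dict per pool.
    """
    pool_id = settings.get("databricks.poolId")
    pool_name = settings.get("databricks.poolName")

    # Step 1: poolId wins when it matches a known pool identifier.
    known_ids = {pool["instance_pool_id"] for pool in pools}
    if pool_id and pool_id in known_ids:
        return pool_id

    # Step 2: otherwise check the poolName value.
    if pool_name:
        for pool in pools:
            if pool["instance_pool_name"] == pool_name:
                return pool["instance_pool_id"]

    # Neither setting matched an available pool.
    return None
```

Under these assumptions, a stale `databricks.poolId` does not block the job as long as `databricks.poolName` still matches a pool.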
...
The following parameters must be set to integrate AWS Databricks with the platform:
Required Parameters
Parameter | Description | Value |
---|---|---|
| URL to the AWS Databricks Service where Spark jobs will be run | - |
metadata.cloud | Must be set to | Default: aws |
Following is the list of parameters that can be reviewed or modified based on your requirements:
Optional Parameters
Parameter | Description | Value |
---|---|---|
databricks.awsAttributes.firstOnDemandInstances | Number of initial cluster nodes to be placed on on-demand instances. The remainder is placed on availability instances. | Default: 1 |
databricks.awsAttributes.availability | Availability type used for all subsequent nodes past the firstOnDemandInstances. | Default: SPOT_WITH_FALLBACK |
databricks.awsAttributes.availabilityZone | Identifier for the availability zone/datacenter in which the cluster resides. The provided availability zone must be in the same region as the Databricks deployment. | |
databricks.awsAttributes.spotBidPricePercent | The max price for AWS spot instances, as a percentage of the corresponding instance type's on-demand price. When spot instances are requested for this cluster, only spot instances whose max price percentage matches this field will be considered. | Default: 100 |
databricks.awsAttributes.ebsVolume | The type of EBS volumes that will be launched with this cluster. | Default: None |
databricks.awsAttributes.instanceProfileArn | EC2 instance profile ARN for the cluster nodes. This is only used when AWS credential provider is set to temporary/instance. The instance profile must have previously been added to the Databricks environment by an account administrator. | For more information, see Configure for AWS Authentication. |
databricks.clusterMode | Determines the cluster mode for running a Databricks job. | Default: JOB |
feature.parameterization.matchLimitOnSampling.databricksSpark | Maximum number of parameterized source files that are permitted for matching in a single dataset with parameters. | Default: 0 |
databricks.workerNodeType | Type of node to use for the AWS Databricks Workers/Executors. There are 1 or more Worker nodes per cluster. | Default: |
databricks.sparkVersion | AWS Databricks runtime version which also references the appropriate version of Spark. | Depending on your version of AWS Databricks, please set this property according to the following: Please do not use other values. |
databricks.minWorkers | Initial number of Worker nodes in the cluster, and also the minimum number of Worker nodes that the cluster can scale down to during auto-scale-down. | Minimum value: Increasing this value can increase compute costs. |
databricks.maxWorkers | Maximum number of Worker nodes the cluster can create during auto scaling. | Minimum value: Not less than Increasing this value can increase compute costs. |
databricks.poolId | If you have enabled instance pooling in AWS Databricks, you can specify the pool identifier here. | |
databricks.poolName | If you have enabled instance pooling in AWS Databricks, you can specify the pool name here. | See previous. |
databricks.driverNodeType | Type of node to use for the AWS Databricks Driver. There is only one Driver node per cluster. | Default: For more information, see the sizing guide for Databricks. |
databricks.driverPoolId | If you have enabled instance pooling in AWS Databricks, you can specify the driver node pool identifier here. For more information, see Configure instance pooling below. | |
databricks.driverPoolName | If you have enabled instance pooling in AWS Databricks, you can specify the driver node pool name here. For more information, see Configure instance pooling below. | See previous. |
databricks.logsDestination | DBFS location that cluster logs will be sent to every 5 minutes. | Leave this value as /trifacta/logs. |
databricks.enableAutotermination | Set to true to enable auto-termination of a user cluster after N minutes of idle time, where N is the value of the autoterminationMinutes property. | Unless otherwise required, leave this value as true. |
databricks.clusterStatePollerDelayInSeconds | Number of seconds to wait between polls for AWS Databricks cluster status when a cluster is starting up. | |
databricks.clusterStartupWaitTimeInMinutes | Maximum time in minutes to wait for a Cluster to get to Running state before aborting and failing an AWS Databricks job. | Default: 60 |
databricks.clusterLogSyncWaitTimeInMinutes | Maximum time in minutes to wait for a Cluster to complete syncing its logs to DBFS before giving up on pulling the cluster logs. | Set this to 0 to disable cluster log pulls. |
databricks.clusterLogSyncPollerDelayInSeconds | Number of seconds to wait between polls for a Databricks cluster to sync its logs to DBFS after job completion. | Default: 20 |
databricks.autoterminationMinutes | Idle time in minutes before a user cluster will auto-terminate. | Do not set this value to less than the cluster startup wait time value. |
databricks.maxAPICallRetries | Maximum number of retries to perform in case of 429 error code response. | Default: 5. For more information, see Configure Maximum Retries for REST API section below. |
databricks.enableLocalDiskEncryption | Enables encryption of data like shuffle data that is temporarily stored on cluster's local disk. | - |
databricks.patCacheTTLInMinutes | Lifespan in minutes for the Databricks personal access token in-memory cache. | Default: 10 |
spark.useVendorSparkLibraries | When | |
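Several of the optional parameters constrain one another. The sketch below shows one way to sanity-check a settings dictionary before saving it. The function name `check_databricks_settings` is hypothetical, and only two of the checks come from this page (the JOB/USER cluster modes and the rule that the auto-termination time should not be less than the cluster startup wait time). The `minWorkers` versus `maxWorkers` check is an assumption based on their descriptions.

```python
from typing import Any, Dict, List


def check_databricks_settings(settings: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means no issue was found."""
    problems: List[str] = []

    # Cluster mode is documented as either JOB (default) or USER.
    mode = settings.get("databricks.clusterMode", "JOB")
    if mode not in ("JOB", "USER"):
        problems.append(f"databricks.clusterMode must be JOB or USER, got {mode!r}")

    # Auto-termination should not be shorter than the startup wait time (default 60).
    startup_wait = settings.get("databricks.clusterStartupWaitTimeInMinutes", 60)
    idle_minutes = settings.get("databricks.autoterminationMinutes")
    if idle_minutes is not None and idle_minutes < startup_wait:
        problems.append("databricks.autoterminationMinutes is less than the cluster startup wait time")

    # Assumption: the scale-down floor should not exceed the scale-up ceiling.
    min_workers = settings.get("databricks.minWorkers")
    max_workers = settings.get("databricks.maxWorkers")
    if min_workers is not None and max_workers is not None and max_workers < min_workers:
        problems.append("databricks.maxWorkers is less than databricks.minWorkers")

    return problems
```

A check like this is only a convenience; the platform's own validation, if any, remains authoritative.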
Configure Databricks Job Management
...
- Locate the following property and set it to one of the values listed below:

  Databricks Job Management

Property Value | Description |
---|---|
Never Delete | (default) Job definitions are never deleted from the AWS Databricks cluster. |
Always Delete | The AWS Databricks job definition is deleted during the clean-up phase, which occurs after a job completes. |
Delete Successful Only | When a job completes successfully, the AWS Databricks job definition is deleted during the clean-up phase. Failed or canceled jobs are not deleted, which allows you to debug as needed. |
Skip Job Creation | For jobs that are to be executed only one time, the platform can be configured to use a different mechanism for submitting the job. When this option is enabled, the platform submits jobs using the run-submit API, instead of the run-now API. The run-submit API does not create an AWS Databricks job definition. Therefore, the submitted job does not count toward the enforced job limit. |
Default | Inherits the default system-wide setting. |

- When the following feature is enabled, the platform uses the runs/submit API as a fallback when the job limit for the Databricks workspace has been reached:

  Databricks Job Runs Submit Fallback
- Save your changes and restart the platform.
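The effect of the job management values on submission and clean-up can be summarized in a small decision sketch. The enum and function names here are hypothetical, and the two endpoint paths are the Databricks Jobs API 2.0 paths for run-now and runs/submit, assumed here rather than taken from this page. The sketch only models the decisions and makes no network calls.

```python
from enum import Enum


class JobManagement(Enum):
    NEVER_DELETE = "Never Delete"
    ALWAYS_DELETE = "Always Delete"
    DELETE_SUCCESSFUL_ONLY = "Delete Successful Only"
    SKIP_JOB_CREATION = "Skip Job Creation"


RUN_NOW = "/api/2.0/jobs/run-now"          # runs an existing job definition
RUNS_SUBMIT = "/api/2.0/jobs/runs/submit"  # one-time run, no job definition created


def submission_endpoint(policy: JobManagement,
                        job_limit_reached: bool = False,
                        fallback_enabled: bool = False) -> str:
    """Choose which endpoint a job would be submitted through."""
    if policy is JobManagement.SKIP_JOB_CREATION:
        return RUNS_SUBMIT
    # Fallback: use runs/submit only once the workspace job limit has been reached.
    if job_limit_reached and fallback_enabled:
        return RUNS_SUBMIT
    return RUN_NOW


def delete_definition_after_run(policy: JobManagement, succeeded: bool) -> bool:
    """Decide whether the clean-up phase deletes the job definition."""
    if policy is JobManagement.ALWAYS_DELETE:
        return True
    if policy is JobManagement.DELETE_SUCCESSFUL_ONLY:
        return succeeded  # failed or canceled jobs are kept for debugging
    return False  # Never Delete, or no definition exists (Skip Job Creation)
```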
...
- In the application, select User menu > Admin console > Admin settings.
- Locate the following settings and set their values accordingly:

Setting | Description |
---|---|
databricks.userClusterThrottling.enabled | When set to true, job throttling per Databricks cluster is enabled. Please specify the following settings. |
databricks.userClusterthrottling.maxTokensAllottedPerUserCluster | Set this value to the maximum number of concurrent jobs that can run on one user cluster. Default value is 20. |
databricks.userClusterthrottling.tokenExpiryInMinutes | The time in minutes after which tokens reserved by a job are revoked, irrespective of the job status. If a job is in progress and this limit is reached, then the Databricks token is expired, and the token is revoked under the assumption that it is stale. Default value is 120 (2 hours). Tip: Set this value to 0 to prevent token expiration. However, this setting is not recommended, as jobs can remain in the queue indefinitely. |
jobMonitoring.queuedJobTimeoutMinutes | The maximum time in minutes in which a job is permitted to remain in the queue for a slot on a Databricks cluster. If this limit is reached, the job is marked as failed. |
batch-job-runner.cleanup.enabled | When set to true, the Batch Job Runner service is permitted to clean up throttling tokens and job-level personal access tokens. Tip: Unless you have reason to do otherwise, you should leave this setting set to true. |

- Save your changes and restart the platform.
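The throttling settings describe a token scheme: each running job holds one token per user cluster, and tokens older than the expiry time are treated as stale and revoked. The sketch below models that behavior under stated assumptions. The class `ClusterThrottle` and its methods are hypothetical and are not the platform's implementation; the defaults mirror the documented values of 20 tokens and 120 minutes, and an expiry of 0 disables expiration.

```python
from typing import Dict


class ClusterThrottle:
    """Toy model of per-cluster job throttling with expiring tokens."""

    def __init__(self, max_tokens: int = 20, expiry_minutes: float = 120):
        self.max_tokens = max_tokens
        self.expiry_seconds = expiry_minutes * 60
        self._held: Dict[str, float] = {}  # job id -> acquisition time in seconds

    def _revoke_stale(self, now: float) -> None:
        # An expiry of 0 means tokens never expire.
        if self.expiry_seconds == 0:
            return
        stale = [job for job, t in self._held.items() if now - t >= self.expiry_seconds]
        for job in stale:
            del self._held[job]

    def try_acquire(self, job_id: str, now: float) -> bool:
        """Reserve a token for job_id; return False if the cluster is at capacity."""
        self._revoke_stale(now)
        if len(self._held) >= self.max_tokens:
            return False  # the job stays queued
        self._held[job_id] = now
        return True

    def release(self, job_id: str) -> None:
        """Return the token when the job finishes."""
        self._held.pop(job_id, None)
```

In this model a job that cannot acquire a token simply waits; the queue timeout described above would be enforced by whatever component retries `try_acquire`.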
...
To change the number of retries, change the value of the databricks.maxAPICallRetries flag.
Value | Description |
---|---|
5 | (default) When a request is submitted through the AWS Databricks REST APIs, up to |
0 | When an API call fails, the request fails. As the number of concurrent jobs increases, more jobs may fail. |
5+ | Increasing this setting above the default value may result in more requests eventually getting processed. However, increasing the value may consume additional system resources in a high concurrency environment and jobs might take longer to run due to exponentially increasing waiting time. |
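The trade-off in the table can be illustrated with a small retry loop. This is a sketch under stated assumptions, not the platform's code: `call_with_retries` is a hypothetical helper, `send` stands in for any function that issues a request and returns an HTTP status code, and the one-second base delay with doubling is assumed purely to show the exponentially increasing wait.

```python
import time
from typing import Callable, Tuple


def call_with_retries(send: Callable[[], int],
                      max_retries: int = 5,
                      base_delay_s: float = 1.0,
                      sleep: Callable[[float], None] = time.sleep) -> Tuple[int, int]:
    """Call send() and retry on HTTP 429 up to max_retries times.

    Returns (final_status, number_of_calls_made).
    """
    calls = 0
    while True:
        status = send()
        calls += 1
        # Stop on any non-429 response, or when the retry budget is used up.
        if status != 429 or calls > max_retries:
            return status, calls
        # Wait 1x, 2x, 4x, ... the base delay before the next attempt.
        sleep(base_delay_s * 2 ** (calls - 1))
```

With `max_retries=0` the first 429 response is returned immediately, which matches the behavior described for a value of 0. Larger values give throttled requests more chances at the cost of longer total wait.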
Use
Run Job From Application
...