
...

Excerpt

This section provides high-level information on how to configure the

D s platform
to integrate with Databricks hosted on AWS.


AWS Databricks is a unified data analytics platform that has been optimized for use on the AWS infrastructure.

Additional Databricks features supported by the platform:

Prerequisites

  • The 
    D s platform
     must be installed in a customer-managed AWS environment.
  • The base storage layer must be set to S3. For more information, see Set Base Storage Layer.

...

  1. D s config
    methodws
  2. Locate the following parameter, which enables 

    D s photon
     for smaller job execution. Set it to Enabled:

    Code Block
    Photon execution
  3. You do not need to save for the above configuration change to take effect.
  4. D s config
  5. Locate the following parameters and set them to the values listed below, which enable the AWS Databricks (small to extra-large jobs) running environment:

    Code Block
    "webapp.runInDatabricks": true,
    "webapp.runWithSparkSubmit": false,
    "webapp.runInDataflow": false,
  6. Do not save your changes until you have completed the following configuration section.

Configure

Configure cluster mode 

When a user submits a job, the 

D s product
 provides all of the cluster specifications to the Databricks API. A cluster is created per user or per job, which means that once the job is complete, the cluster is terminated. Cluster creation may take less than 30 seconds if instance pools are used. If instance pools are not used, it may take 10-15 minutes.

For more information on job clusters, see https://docs.databricks.com/clusters/configure.html.

The job clusters automatically terminate after the job is completed. A new cluster is automatically created when the user next requests access to AWS Databricks. 

Cluster Mode | Description
USER

When a user submits a job,

D s product
 creates a new cluster and persists the cluster ID in 
D s product
 metadata for the user if the cluster does not exist or is invalid. If the user already has an existing valid interactive cluster, that cluster is reused when the job is submitted.

Reset the cluster mode to JOB to run jobs in AWS Databricks.

JOB

When a user submits a job,

D s product
 provides all of the cluster specifications to the Databricks API. Databricks creates a cluster only for this job and terminates it as soon as the job completes.

JOB is the default cluster mode for running jobs in AWS Databricks.
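
The cluster mode is controlled by the databricks.clusterMode parameter, which is described in the parameter table below. As a minimal sketch, assuming the JSON configuration style used elsewhere on this page, the entry looks like the following:

Code Block
"databricks.clusterMode": "JOB",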


Configure use of cluster policies


Optionally, you can configure the 

D s platform
 to use the Databricks cluster policies that have been specified by your Databricks administrator for creating and using clusters. These policies are effectively templates for the creation and use of Databricks clusters and govern aspects of clusters such as the type and count of nodes, the resources that can be accessed via the cluster, and other settings. For more information on Databricks cluster policies, see https://docs.databricks.com/administration-guide/clusters/policies.html.


Prerequisites


Info

NOTE: Your Databricks administrator must create and deploy the Databricks cluster policies from which

D s item
itemusers
can select for their personal use.


Notes:


  • When this feature is enabled, each user may select the appropriate Databricks cluster policy to use for jobs. If a user does not select one, that user's jobs are launched without a cluster policy, using the Databricks properties set in platform configuration. 

    Info

    NOTE: When this feature is in use, Databricks cluster configuration in the 

    D s platform
     is ignored, except for the Spark version and the cluster policy identifier in job-level overrides. All other job-level overrides are also ignored.

  • If a cluster policy is modified while existing clusters are using it, then subsequent job executions using that policy attempt to use the same cluster. This can cause performance issues and even job failures.

    Tip

    Tip: Avoid editing cluster policies that are in use, as the changed policies may be applied to clusters generated under the old policies. Instead, create a new policy and assign it for use.

  • If the cluster policy references a Databricks instance pool that does not exist, the job fails. 


Steps:


  1. D s config
    methodws
  2. Locate the following parameter and set it to Enabled:

    Code Block
    Databricks Cluster Policies
  3. Save your changes and restart the platform.


Info

NOTE: Each user must select a cluster policy to use. For more information, see Databricks Settings Page.


Job overrides:


A user's cluster policy can be overridden when a job is executed via the API by setting the clusterPolicyId request attribute.


Info

NOTE: If a Databricks cluster policy is used, all job-level overrides except for clusterPolicyId are ignored.


For more information, see API Workflow - Run Job.
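
For illustration, the following is a hypothetical request body for a job execution request that overrides the cluster policy. The identifier values and surrounding attributes shown here are placeholders assumed for this sketch; see API Workflow - Run Job for the exact request format:

Code Block
{
  "wrangledDataset": {
    "id": 28629
  },
  "overrides": {
    "clusterPolicyId": "D1234ABCD5678EFG"
  }
}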


Policy template for AWS - without instance pools:


The following example cluster policy can provide a basis for creating your own AWS cluster policies when instance pools are not in use:


Code Block
{
  "autoscale.max_workers": {
    "type": "fixed",
    "value": 3,
    "hidden": true
  },
  "autoscale.min_workers": {
    "type": "fixed",
    "value": 1,
    "hidden": true
  },
  "autotermination_minutes": {
    "type": "fixed",
    "value": 10,
    "hidden": true
  },
  "aws_attributes.availability": {
    "type": "fixed",
    "value": "SPOT_WITH_FALLBACK",
    "hidden": false
  },
  "aws_attributes.ebs_volume_count": {
    "type": "fixed",
    "value": 0,
    "hidden": false
  },
  "aws_attributes.ebs_volume_size": {
    "type": "fixed",
    "value": 0,
    "hidden": false
  },
  "aws_attributes.first_on_demand": {
    "type": "fixed",
    "value": 1,
    "hidden": false
  },
  "aws_attributes.spot_bid_price_percent": {
    "type": "fixed",
    "value": 100,
    "hidden": false
  },
  "aws_attributes.instance_profile_arn": {
    "type": "fixed",
    "value": "arn:aws:iam::9999999999999:instance-profile/SOME_Role_ARN",
    "hidden": false
  },
  "driver_node_type_id": {
    "type": "fixed",
    "value": "i3.xlarge",
    "hidden": true
  },
  "enable_local_disk_encryption": {
    "type": "fixed",
    "value": false
  },
  "node_type_id": {
    "type": "fixed",
    "value": "i3.xlarge",
    "hidden": true
  }
}


Policy template for AWS - with instance pools:


The following example cluster policy can provide a basis for creating your own AWS cluster policies when instance pools are in use:


Code Block
{
  "autoscale.max_workers": {
    "type": "fixed",
    "value": 3,
    "hidden": true
  },
  "autoscale.min_workers": {
    "type": "fixed",
    "value": 1,
    "hidden": true
  },
  "aws_attributes.instance_profile_arn": {
    "type": "fixed",
    "value": "arn:aws:iam::9999999999:instance-profile/SOME_POLICY",
    "hidden": false
  },
  "enable_local_disk_encryption": {
    "type": "fixed",
    "value": false
  },
  "instance_pool_id": {
    "type": "fixed",
    "value": "SOME_POOL",
    "hidden": true
  },
  "driver_instance_pool_id": {
    "type": "fixed",
    "value": "SOME_POOL",
    "hidden": true
  },
  "autotermination_minutes": {
    "type": "fixed",
    "value": 10,
    "hidden": true
  }
}


Configure Instance Profiles in AWS Databricks

D s platform
  EC2 instances can be configured with permissions to access AWS resources like S3 by attaching an IAM instance profile. Similarly, instance profiles can be attached to EC2 instances for use with AWS Databricks clusters.

Info

NOTE: You must register the instance profiles in the Databricks workspace, or your Databricks clusters reject the instance profile ARNs and display an error. For more information, see https://docs.databricks.com/administration-guide/cloud-configurations/aws/instance-profiles.html#step-5-add-the-instance-profile-to-databricks.

To configure the instance profile for AWS Databricks, you must provide an IAM instance profile ARN in the databricks.awsAttributes.instanceProfileArn parameter.

Info

NOTE: For AWS Databricks, you can configure the instance profile value in databricks.awsAttributes.instanceProfileArn only when aws.credentialProvider is set to instance or temporary.

aws.credentialProvider | AWS Databricks permissions
instance

D s platform
 or Databricks jobs get all permissions directly from the instance profile.

temporary

D s platform
  or Databricks jobs use temporary credentials that are issued based on system or user IAM roles.

Info

NOTE: The instance profile must have policies that allow  

D s platform
 or Databricks to assume those roles.


default | n/a


Info

NOTE: If the aws.credentialProvider is set to temporary or instance while using AWS Databricks:

  • databricks.awsAttributes.instanceProfileArn must be set to a valid value for Databricks jobs to run successfully.
  • The aws.ec2InstanceRoleForAssumeRole flag is ignored for Databricks jobs.

For more information, see Configure for AWS Authentication.
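
As a sketch, assuming the instance credential provider, the relevant configuration entries might look like the following. The ARN shown is a placeholder; the exact set of entries depends on your authentication setup, as described in Configure for AWS Authentication:

Code Block
"aws.credentialProvider": "instance",
"databricks.awsAttributes.instanceProfileArn": "arn:aws:iam::123456789012:instance-profile/my-databricks-role",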

Configure instance pooling

Instance pooling reduces cluster node spin-up time by maintaining a set of idle and ready instances. The 

D s platform
 can be configured to leverage instance pooling on the AWS Databricks cluster for both worker and driver nodes.

...

  • All cluster nodes used by the 
    D s platform
     are taken from the pool. If the pool has an insufficient number of nodes, cluster creation fails.
  • Each user must have access to the pool and must have at least the ATTACH_TO permission.
  • Each user must have a personal access token from the same AWS Databricks workspace. See Configure personal access token below.

...

  1. Acquire your pool identifier or pool name from AWS Databricks.

    Info

    NOTE: You can use either the Databricks pool identifier or pool name. If both poolId and poolName are specified, poolId is used first. If that fails to find a matching identifier, then the poolName value is checked.

    Tip

    Tip: If you specify a poolName value only, then you can run your Databricks jobs against the available clusters across multiple

    D s item
    itemworkspaces
    . This mechanism allows for better resource allocation and broader execution options.

  2. D s config
  3. Set either of the following parameters: 

    1. Set the following parameter to the AWS Databricks pool identifier:

      Code Block
      "databricks.poolId": "<my_pool_id>",
    2. Or, you can set the following parameter to the AWS Databricks pool name:

      Code Block
      "databricks.poolName": "<my_pool_name>",
  4. Save your changes and restart the platform.
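
If you are also using instance pools for driver nodes, a similar sketch applies using the databricks.driverPoolId and databricks.driverPoolName parameters described in the parameter table. The values below are placeholders:

Code Block
"databricks.driverPoolId": "<my_driver_pool_id>",
"databricks.driverPoolName": "<my_driver_pool_name>",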

...

Parameter | Description | Value

databricks.awsAttributes.firstOnDemandInstances

Number of initial cluster nodes to be placed on on-demand instances. The remainder are placed on instances of the configured availability type.

Default: 1

databricks.awsAttributes.availability

Availability type used for all subsequent nodes past the firstOnDemandInstances.

Default: SPOT_WITH_FALLBACK

databricks.awsAttributes.availabilityZone

Identifier for the availability zone/datacenter in which the cluster resides. The provided availability zone must be in the same region as the Databricks deployment.


databricks.awsAttributes.spotBidPricePercent

The max price for AWS spot instances, as a percentage of the corresponding instance type's on-demand price. When spot instances are requested for this cluster, only spot instances whose max price percentage matches this field will be considered.

Default: 100

databricks.awsAttributes.ebsVolume

The type of EBS volumes that will be launched with this cluster.

Default: None

databricks.awsAttributes.instanceProfileArn

EC2 instance profile ARN for the cluster nodes. This is used only when the AWS credential provider is set to temporary or instance. The instance profile must have previously been added to the Databricks environment by an account administrator.

For more information, see Configure for AWS Authentication.
databricks.clusterMode

Determines the cluster mode for running a Databricks job.

Default: JOB
feature.parameterization.matchLimitOnSampling.databricksSpark

Maximum number of parameterized source files that are permitted for matching in a single dataset with parameters.

Default: 0

databricks.workerNodeType

Type of node to use for the AWS Databricks Workers/Executors. There are one or more Worker nodes per cluster.

Default: i3.xlarge


databricks.sparkVersion

AWS Databricks runtime version, which also references the appropriate version of Spark.

Depending on your version of AWS Databricks, please set this property according to the following:

  • AWS Databricks 8.3: 8.3.x-scala2.12

    Info

    NOTE: Except for the above version, AWS Databricks 8.x is not supported.

  • AWS Databricks 7.3: 7.3.x-scala2.12

    Info

    NOTE: Except for the above version, AWS Databricks 7.x is not supported.

  • AWS Databricks 5.5 LTR: 5.5.x-scala2.11

Please do not use other values.

databricks.minWorkers

Initial number of Worker nodes in the cluster, and also the minimum number of Worker nodes that the cluster can scale down to during auto-scaling.

Minimum value: 1

Increasing this value can increase compute costs.

databricks.maxWorkers

Maximum number of Worker nodes the cluster can create during auto-scaling.

Minimum value: Not less than databricks.minWorkers.

Increasing this value can increase compute costs.

databricks.poolId

If you have enabled instance pooling in AWS Databricks, you can specify the pool identifier here.

Info

NOTE: If both poolId and poolName are specified, poolId is used first. If that fails to find a matching identifier, then the poolName value is checked.

databricks.poolName

If you have enabled instance pooling in AWS Databricks, you can specify the pool name here.

See previous.

Tip

Tip: If you specify a poolName value only, then you can use the instance pools with the same poolName available across multiple Databricks workspaces when you create a new cluster.

databricks.driverNodeType

Type of node to use for the AWS Databricks Driver. There is only one Driver node per cluster.

Default: i3.xlarge

For more information, see the sizing guide for Databricks.

Info

NOTE: This property is unused when instance pooling is enabled. For more information, see Configure instance pooling below.

databricks.driverPoolId

If you have enabled instance pooling in AWS Databricks, you can specify the driver node pool identifier here. For more information, see Configure instance pooling below.
Info

NOTE: If both driverPoolId and driverPoolName are specified, driverPoolId is used first. If that fails to find a matching identifier, then the driverPoolName value is checked.

databricks.driverPoolName

If you have enabled instance pooling in AWS Databricks, you can specify the driver node pool name here. For more information, see Configure instance pooling below.

See previous.

Tip

Tip: If you specify a driverPoolName value only, then you can use the instance pools with the same driverPoolName available across multiple Databricks workspaces when you create a new cluster.

databricks.logsDestination

DBFS location to which cluster logs are sent every 5 minutes.

Leave this value as /trifacta/logs.

databricks.enableAutotermination

Set to true to enable auto-termination of a user cluster after N minutes of idle time, where N is the value of the autoterminationMinutes property.

Unless otherwise required, leave this value as true.

databricks.clusterStatePollerDelayInSeconds

Number of seconds to wait between polls for AWS Databricks cluster status when a cluster is starting up.

databricks.clusterStartupWaitTimeInMinutes

Maximum time in minutes to wait for a cluster to reach the Running state before aborting and failing an AWS Databricks job.

Default: 60
databricks.clusterLogSyncWaitTimeInMinutes

Maximum time in minutes to wait for a Cluster to complete syncing its logs to DBFS before giving up on pulling the cluster logs to the

D s node
.

Set this to 0 to disable cluster log pulls.
databricks.clusterLogSyncPollerDelayInSeconds

Number of seconds to wait between polls for a Databricks cluster to sync its logs to DBFS after job completion.

Default: 20

databricks.autoterminationMinutes

Idle time in minutes before a user cluster auto-terminates.

Do not set this value to less than the cluster startup wait time value.

databricks.maxAPICallRetries

Maximum number of retries to perform in case of a 429 error code response.

Default: 5. For more information, see the Configure maximum retries for REST API section below.
databricks.enableLocalDiskEncryption

Enables encryption of data, such as shuffle data, that is temporarily stored on the cluster's local disk.

-
databricks.patCacheTTLInMinutes

Lifespan in minutes for the Databricks personal access token in-memory cache.

Default: 10
spark.useVendorSparkLibraries

When true, the platform bypasses shipping its installed Spark libraries to the cluster with each job's execution.

Info

NOTE: This setting is ignored. The vendor Spark libraries are always used for AWS Databricks.

...

Configure AWS Databricks workspace overrides

A single AWS Databricks account can have access to multiple Databricks workspaces. You can create more than one workspace by using the Account API if your account is on the E2 version of the platform or on a custom plan that allows multiple workspaces per account.

For more information, see https://docs.databricks.com/administration-guide/account-api/new-workspace.html.

Each workspace has a unique deployment name associated with it that defines the workspace URL. For example: https://<deployment-name>.cloud.databricks.com.

Info

NOTE:

  • The existing property databricks.serviceUrl is used to configure the URL of the Databricks Service used to run Spark jobs.
  • The databricks.serviceUrl setting defines the default Databricks workspace for all users in the
    D s product
    workspace.
  • Individual users can override this setting in the User Preferences, on the Databricks Personal Access Token page.

For more information, see Databricks Settings Page.

For more information, see Configure Platform section above.
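
As a sketch, the default workspace is configured through databricks.serviceUrl. The deployment name below is a placeholder:

Code Block
"databricks.serviceUrl": "https://<deployment-name>.cloud.databricks.com",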

...

Individual users can specify the name of the cluster that they are permitted to use for accessing Databricks Tables. This cluster can also be shared among users. For more information, see Databricks Settings Page.

Configure maximum retries for REST API 

There is a limit of 30 requests per second per workspace on the Databricks REST APIs. If this limit is reached, an HTTP status code 429 error is returned, indicating that rate limiting is being applied by the server. By default, the

D s platform
 re-attempts to submit a request 5 times and then fails the job if the request is not accepted. 

If you want to change the number of retries, change the value of the databricks.maxAPICallRetries flag.
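
For example, a sketch of raising the retry limit in platform configuration; the value shown is illustrative only:

Code Block
"databricks.maxAPICallRetries": 10,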

Value | Description
5

(default) When a request is submitted through the AWS Databricks REST APIs, up to 5 retries can be performed in the case of failures.

  • The waiting period increases exponentially with every retry. For example, the wait time is 10 seconds for the first retry, 20 seconds for the second, 40 seconds for the third, and so on.
  • Set this value based on how long you want the platform to keep retrying requests.
0

When an API call fails, the request fails. As the number of concurrent jobs increases, more jobs may fail.

Info

NOTE: This setting is not recommended.


5+ | Increasing this setting above the default value may result in more requests eventually being processed. However, a higher value may consume additional system resources in a high-concurrency environment, and jobs may take longer to run due to the exponentially increasing wait times.

...