
Release 8.2.2



This section provides high-level information on how to configure the Designer Cloud Powered by Trifacta platform to integrate with Databricks hosted on AWS. AWS Databricks is a unified data analytics platform that has been optimized for use on the AWS infrastructure.


Prerequisites

  • The Designer Cloud Powered by Trifacta platform must be installed in a customer-managed AWS environment.
  • The base storage layer must be set to S3. For more information, see Set Base Storage Layer.

Limitations

  • Importing datasets created from nested folders is not supported when running jobs on AWS Databricks.

  • If the job is submitted using the User cluster mode and no cluster is available, the following are the launch times for a new cluster with and without instance pools:
    • Without instance pools: Up to 5 minutes to launch
    • With instance pools: Up to 30 seconds to launch
  • If the job is canceled during cluster startup:
    • The cluster startup continues. After the cluster is running, the job is terminated, and the cluster remains. 
    • As a result, there is a delay in reporting the job cancellation in the Job Details page. The job should be reported as canceled, not failed.
  • AWS Databricks integration works with Spark 2.4.x only.

    NOTE: The version of Spark for AWS Databricks must be applied to the platform configuration through the databricks.sparkVersion property. Details are provided later.

Supported versions of Databricks

  • AWS Databricks 7.3 (Recommended)

  • AWS Databricks 5.5 LTS

Job Limits

By default, the number of jobs permitted on an AWS Databricks workspace is set to 1000.

  • The number of jobs that can be created per workspace in an hour is limited to 1000.
  • The number of jobs a workspace can create in an hour is limited to 5000 when using the run-submit API. This limit also affects jobs created by the REST API and notebook workflows. For more information, see "Configure Databricks job management" below.
  • The number of concurrently active job runs in a workspace is limited to 150.

These limits apply to any jobs that use workspace data on the cluster.

Managing Limits

To enable retrieval and auditing of job information after a job has been completed, the Designer Cloud Powered by Trifacta platform does not delete jobs from the cluster. As a result, jobs can accumulate over time and exceed the number of jobs permitted on the cluster. If you reach these limits, you may receive a Quota for number of jobs has been reached error. For more information, see https://docs.databricks.com/user-guide/jobs.html.

Optionally, you can allow the Designer Cloud Powered by Trifacta platform to manage your jobs to avoid these limitations. For more information, see "Configure Databricks job management" below.

Enable

To enable AWS Databricks, perform the following configuration changes:

Steps:

  1. You apply this change through the Workspace Settings Page. For more information, see Platform Configuration Methods.
  2. Locate the following parameter, which enables Trifacta Photon for smaller job execution. Set it to Enabled:

    Photon execution
  3. You do not need to save to enable the above configuration change.
  4. You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.
  5. Locate the following parameters. Set them to the values listed below to enable the AWS Databricks running environment (small to extra-large jobs):

    "webapp.runInDatabricks": true,
    "webapp.runWithSparkSubmit": false,
    "webapp.runInDataflow": false,
  6. Do not save your changes until you have completed the following configuration section.

Configure

Configure cluster mode 

When a user submits a job, Designer Cloud Enterprise Edition provides all of the cluster specifications through the Databricks API. The cluster is created per user or per job, which means that the cluster is terminated once the job completes. Cluster creation may take less than 30 seconds if instance pools are used. If instance pools are not used, cluster creation may take 10-15 minutes.

For more information on job clusters, see https://docs.databricks.com/clusters/configure.html.

The job clusters automatically terminate after the job is completed. A new cluster is automatically created when the user next requests access to AWS Databricks.

Cluster Mode / Description

USER

When a user submits a job, Designer Cloud Enterprise Edition creates a new cluster and persists the cluster ID in Designer Cloud Enterprise Edition metadata for the user if the cluster does not exist or is invalid. If the user already has an existing valid interactive cluster, then the existing cluster is reused when submitting the job.

Reset to JOB mode to run jobs in AWS Databricks.

JOB

When a user submits a job, Designer Cloud Enterprise Edition provides all the cluster specifications in the Databricks API. Databricks creates a cluster only for this job and terminates it as soon as the job completes.

Default cluster mode to run jobs in AWS Databricks.
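For reference, the cluster mode is controlled by the databricks.clusterMode property. A minimal sketch of the entry in trifacta-conf.json, using the default value:

    "databricks.clusterMode": "JOB",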


Configure Instance Profiles in AWS Databricks

Designer Cloud Powered by Trifacta platform  EC2 instances can be configured with permissions to access AWS resources like S3 by attaching an IAM instance profile. Similarly, instance profiles can be attached to EC2 instances for use with AWS Databricks clusters.

NOTE: You must register the instance profiles in the Databricks workspace, or your Databricks clusters reject the instance profile ARNs and display an error. For more information, see https://docs.databricks.com/administration-guide/cloud-configurations/aws/instance-profiles.html#step-5-add-the-instance-profile-to-databricks.

To configure the instance profile for AWS Databricks, you must provide an IAM instance profile ARN in the databricks.awsAttributes.instanceProfileArn parameter.

NOTE: For AWS Databricks, you can configure the instance profile value in databricks.awsAttributes.instanceProfileArn only when aws.credentialProvider is set to instance or temporary.

aws.credentialProvider / AWS Databricks permissions

instance

Designer Cloud Powered by Trifacta platform or Databricks jobs get all permissions directly from the instance profile.

temporary

Designer Cloud Powered by Trifacta platform or Databricks jobs use temporary credentials that are issued based on system or user IAM roles.

NOTE: The instance profile must have policies that allow Designer Cloud Powered by Trifacta platform or Databricks to assume those roles.

default

n/a


NOTE: If the aws.credentialProvider is set to temporary or instance while using AWS Databricks:

  • databricks.awsAttributes.instanceProfileArn must be set to a valid value for Databricks jobs to run successfully.
  • The aws.ec2InstanceRoleForAssumeRole flag is ignored for Databricks jobs.

For more information, see Configure for AWS Authentication.
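As a sketch, the relevant entries in trifacta-conf.json might look like the following. The account ID and profile name in the ARN are placeholders:

    "aws.credentialProvider": "temporary",
    "databricks.awsAttributes.instanceProfileArn": "arn:aws:iam::123456789012:instance-profile/my-databricks-profile",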

Configure instance pooling

Instance pooling reduces cluster node spin-up time by maintaining a set of idle and ready instances. The Designer Cloud Powered by Trifacta platform can be configured to leverage instance pooling on the AWS Databricks cluster for both worker and driver nodes.

NOTE: When instance pooling is enabled, the following parameters are not used:

databricks.driverNodeType

databricks.workerNodeType

For more information, see https://docs.azuredatabricks.net/clusters/instance-pools/index.html.

Instance pooling for worker nodes

Prerequisites:

  • All cluster nodes used by the Designer Cloud Powered by Trifacta platform are taken from the pool. If the pool has an insufficient number of nodes, cluster creation fails.
  • Each user must have access to the pool and must have at least the ATTACH_TO permission.
  • Each user must have a personal access token from the same AWS Databricks workspace. See Configure personal access token below.

To enable:

  1. Acquire your pool identifier or pool name from AWS Databricks.

    NOTE: You can use either the Databricks pool identifier or pool name. If both poolId and poolName are specified, poolId is used first. If that fails to find a matching identifier, then the poolName value is checked.

    Tip: If you specify a poolName value only, then you can run your Databricks jobs against the available clusters across multiple Alteryx workspaces. This mechanism allows for better resource allocation and broader execution options.

  2. You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.
  3. Set either of the following parameters: 

    1. Set the following parameter to the AWS Databricks pool identifier:

      "databricks.poolId": "<my_pool_id>",
    2. Or, you can set the following parameter to the AWS Databricks pool name:

      "databricks.poolName": "<my_pool_name>",
  4. Save your changes and restart the platform.

Instance pooling for driver nodes

The Designer Cloud Powered by Trifacta platform can be configured to use Databricks instance pooling for driver pools.

To enable:

  1. Acquire your driver pool identifier or driver pool name from Databricks.

    NOTE: You can use either the Databricks driver pool identifier or driver pool name. If both driverPoolId and driverPoolName are specified, driverPoolId is used first. If that fails to find a matching identifier, then the driverPoolName value is checked.

    Tip: If you specify a driverPoolName value only, then you can run your Databricks jobs against the available clusters across multiple Alteryx workspaces. This mechanism allows for better resource allocation and broader execution options.

  2. You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.
  3. Set either of the following parameters: 

    1. Set the following parameter to the Databricks driver pool identifier:

      "databricks.driverPoolId": "<my_pool_id>",
    2. Or, you can set the following parameter to the Databricks driver pool name:

      "databricks.driverPoolName": "<my_pool_name>",
  4. Save your changes and restart the platform.
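Taken together, a configuration that uses instance pools for both worker and driver nodes might look like the following sketch in trifacta-conf.json. The identifiers are placeholders:

    "databricks.poolId": "<my_pool_id>",
    "databricks.driverPoolId": "<my_driver_pool_id>",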

Configure Platform

Review and modify the following configuration settings, as required:

NOTE: Restart the platform after you modify the configuration settings for the changes to take effect.

The following parameters must be set to integrate AWS Databricks with the Designer Cloud Powered by Trifacta platform:

Required Parameters

Parameter / Description / Value

databricks.serviceUrl

URL to the AWS Databricks Service where Spark jobs are run.

-

metadata.cloud

Must be set to aws. Do not change this value to anything else while using AWS Databricks.

Default: aws

The following parameters can be reviewed or modified based on your requirements:

Optional Parameters

Parameter / Description / Value

databricks.awsAttributes.firstOnDemandInstances

Number of initial cluster nodes to be placed on on-demand instances. The remainder are placed on instances of the configured availability type.

Default: 1

databricks.awsAttributes.availability

Availability type used for all subsequent nodes past the firstOnDemandInstances.

Default: SPOT_WITH_FALLBACK

databricks.awsAttributes.availabilityZone

Identifier for the availability zone/datacenter in which the cluster resides. The provided availability zone must be in the same region as the Databricks deployment.


databricks.awsAttributes.spotBidPricePercent

The max price for AWS spot instances, as a percentage of the corresponding instance type's on-demand price. When spot instances are requested for this cluster, only spot instances whose max price percentage matches this field will be considered.

Default: 100

databricks.awsAttributes.ebsVolume

The type of EBS volumes that will be launched with this cluster.

Default: None

databricks.awsAttributes.instanceProfileArn

EC2 instance profile ARN for the cluster nodes. This is only used when AWS credential provider is set to temporary/instance. The instance profile must have previously been added to the Databricks environment by an account administrator.

For more information, see Configure for AWS Authentication.
databricks.clusterMode

Determines the cluster mode for running a Databricks job.

Default: JOB
feature.parameterization.matchLimitOnSampling.databricksSpark

Maximum number of parameterized source files that are permitted for matching in a single dataset with parameters.

Default: 0
databricks.workerNodeType

Type of node to use for the AWS Databricks Workers/Executors. There are one or more Worker nodes per cluster.

Default: i3.xlarge



databricks.sparkVersion

AWS Databricks runtime version, which also references the appropriate version of Spark.

Depending on your version of AWS Databricks, set this property according to the following:

  • AWS Databricks 7.3: 7.3.x-scala2.12

    NOTE: Except for the above version, AWS Databricks 7.x is not supported.

  • AWS Databricks 5.5 LTS: 5.5.x-scala2.11

Do not use other values.

databricks.minWorkers

Initial number of Worker nodes in the cluster, and also the minimum number of Worker nodes that the cluster can scale down to during auto-scale-down.

Minimum value: 1

Increasing this value can increase compute costs.

databricks.maxWorkers

Maximum number of Worker nodes that the cluster can create during auto-scaling.

Minimum value: Not less than databricks.minWorkers.

Increasing this value can increase compute costs.

databricks.poolId

If you have enabled instance pooling in AWS Databricks, you can specify the pool identifier here.

NOTE: If both poolId and poolName are specified, poolId is used first. If that fails to find a matching identifier, then the poolName value is checked.

databricks.poolName

If you have enabled instance pooling in AWS Databricks, you can specify the pool name here.

See previous.

Tip: If you specify a poolName value only, then you can use the instance pools with the same poolName available across multiple Databricks workspaces when you create a new cluster.

databricks.driverNodeType

Type of node to use for the AWS Databricks Driver. There is only one Driver node per cluster.

Default: i3.xlarge

For more information, see the sizing guide for Databricks.

NOTE: This property is unused when instance pooling is enabled. For more information, see Configure instance pooling above.

databricks.driverPoolId

If you have enabled instance pooling in AWS Databricks, you can specify the driver node pool identifier here. For more information, see Configure instance pooling above.

NOTE: If both driverPoolId and driverPoolName are specified, driverPoolId is used first. If that fails to find a matching identifier, then the driverPoolName value is checked.

databricks.driverPoolName

If you have enabled instance pooling in AWS Databricks, you can specify the driver node pool name here. For more information, see Configure instance pooling above.

See previous.

Tip: If you specify a driverPoolName value only, then you can use the instance pools with the same driverPoolName available across multiple Databricks workspaces when you create a new cluster.

databricks.logsDestination

DBFS location to which cluster logs are sent every 5 minutes.

Leave this value as /trifacta/logs.

databricks.enableAutotermination

Set to true to enable auto-termination of a user cluster after N minutes of idle time, where N is the value of the autoterminationMinutes property.

Unless otherwise required, leave this value as true.

databricks.clusterStatePollerDelayInSeconds

Number of seconds to wait between polls for AWS Databricks cluster status when a cluster is starting up.

databricks.clusterStartupWaitTimeInMinutes

Maximum time in minutes to wait for a cluster to reach the Running state before aborting and failing an AWS Databricks job.

Default: 60
databricks.clusterLogSyncWaitTimeInMinutes

Maximum time in minutes to wait for a Cluster to complete syncing its logs to DBFS before giving up on pulling the cluster logs to the Alteryx node.

Set this to 0 to disable cluster log pulls.
databricks.clusterLogSyncPollerDelayInSeconds

Number of seconds to wait between polls for a Databricks cluster to sync its logs to DBFS after job completion.

Default: 20

databricks.autoterminationMinutes

Idle time in minutes before a user cluster auto-terminates.

Do not set this value to less than the cluster startup wait time value.

databricks.maxAPICallRetries

Maximum number of retries to perform in case of a 429 error code response.

Default: 5. For more information, see Configure maximum retries for REST API below.
databricks.enableLocalDiskEncryption

Enables encryption of data, such as shuffle data, that is temporarily stored on the cluster's local disk.

-
databricks.patCacheTTLInMinutes

Lifespan in minutes for the Databricks personal access token in-memory cache.

Default: 10
spark.useVendorSparkLibraries

When true, the platform bypasses shipping its installed Spark libraries to the cluster with each job's execution.

NOTE: This setting is ignored. The vendor Spark libraries are always used for AWS Databricks.
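For reference, a sketch of how several of these settings might appear together in trifacta-conf.json. The worker counts shown here are example values only:

    "databricks.sparkVersion": "7.3.x-scala2.12",
    "databricks.clusterMode": "JOB",
    "databricks.workerNodeType": "i3.xlarge",
    "databricks.driverNodeType": "i3.xlarge",
    "databricks.minWorkers": 1,
    "databricks.maxWorkers": 4,
    "databricks.logsDestination": "/trifacta/logs",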

Configure Databricks Job Management

AWS Databricks enforces a hard limit of 1000 created jobs per workspace, and by default cluster jobs are not deleted. To support more than 1000 jobs per workspace, you can enable job management for AWS Databricks.

NOTE: This feature covers the deletion of the job definition on the cluster, which counts toward the enforced limits. The Designer Cloud Powered by Trifacta platform never deletes the outputs of a job or the job definition stored in the platform. When cluster job definitions are removed, the jobs remain listed in the Jobs page, and job metadata is still available. There is no record of the job on the AWS Databricks cluster. Jobs continue to run, but users on the cluster may not be aware of them.


Tip: Regardless of your job management option, when you hit the limit for the number of job definitions that can be created in the Databricks workspace, the platform falls back to using the runs/submit API, provided that the Databricks Job Runs Submit Fallback setting has been enabled.


Steps:

  1. You apply this change through the Workspace Settings Page. For more information, see Platform Configuration Methods.
  2. Locate the following property and set it to one of the values listed below:

    Databricks Job Management
    Property Value / Description

    Never Delete

    (default) Job definitions are never deleted from the AWS Databricks cluster.

    Always Delete

    The AWS Databricks job definition is deleted during the clean-up phase, which occurs after a job completes.

    Delete Successful Only

    When a job completes successfully, the AWS Databricks job definition is deleted during the clean-up phase. Failed or canceled jobs are not deleted, which allows you to debug them as needed.

    Skip Job Creation

    For jobs that are to be executed only one time, the Designer Cloud Powered by Trifacta platform can be configured to use a different mechanism for submitting the job. When this option is enabled, the Designer Cloud Powered by Trifacta platform submits jobs using the run-submit API instead of the run-now API. The run-submit API does not create an AWS Databricks job definition. Therefore, the submitted job does not count toward the enforced job limit.

    Default

    Inherits the default system-wide setting.
  3. To allow the platform to fall back to the runs/submit API when the job limit for the Databricks workspace has been reached, enable the following setting:

    Databricks Job Runs Submit Fallback
  4. Save your changes and restart the platform.

Enable shared cluster for Databricks Tables

Optionally, you can provide to the Designer Cloud Powered by Trifacta platform the name of a shared Databricks cluster to be used to access Databricks Tables. 

Prerequisites:

NOTE: Any shared cluster must be maintained by the customer.

  • Shared clusters are not provisioned per-user and cannot be used to run Spark jobs.
  • If you have enabled table access control on your high-concurrency cluster, you must configure access to data objects for users. For more information, see https://docs.databricks.com/security/access-control/table-acls/object-privileges.html.
  • If only a limited number of users are using the shared cluster, you must configure the attach permission on the shared cluster in the Databricks workspace. 

Configure cluster

Depending on the credential provider type, the following properties must be specified in the Spark configuration for the shared cluster.

Default credential provider:

"fs.s3a.access.key"
"fs.s3a.secret.key"
"spark.hadoop.fs.s3a.access.key"
"spark.hadoop.fs.s3a.secret.key"

For more information, see https://docs.databricks.com/data/data-sources/aws/amazon-s3.html.
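As a sketch, these properties can be entered in the shared cluster's Spark configuration as space-separated key-value pairs, one per line. The key ID and secret values below are placeholders:

    fs.s3a.access.key <your-access-key-id>
    fs.s3a.secret.key <your-secret-access-key>
    spark.hadoop.fs.s3a.access.key <your-access-key-id>
    spark.hadoop.fs.s3a.secret.key <your-secret-access-key>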

Instance credential provider:

"fs.s3a.credentialsType"
"fs.s3a.stsAssumeRole.arn"
"fs.s3a.canned.acl"
"fs.s3a.acl.default"
"spark.hadoop.fs.s3a.credentialsType"
"spark.hadoop.fs.s3a.stsAssumeRole.arn"
"spark.hadoop.fs.s3a.canned.acl"
"spark.hadoop.fs.s3a.acl.default"

For more information, see https://docs.databricks.com/administration-guide/cloud-configurations/aws/instance-profiles.html.

Temporary credential provider:

"fs.s3a.canned.acl"
"fs.s3a.acl.default"
"spark.hadoop.fs.s3a.canned.acl"
"spark.hadoop.fs.s3a.acl.default"

For more information, see  https://docs.databricks.com/administration-guide/cloud-configurations/aws/assume-role.html.

Enable

Steps:

  1. You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.
  2. Locate the following parameter, and add the name of the Databricks cluster to use to browse Databricks Tables:

    "feature.databricks.connection.clusterName": "<your_cluster_name>",
  3. Save your changes and restart the platform.

When a cluster name is provided:

  • If the cluster is available, all users of the Designer Cloud Powered by Trifacta platform attempt to connect to Databricks Tables through the listed cluster.
  • If the cluster has been terminated, a message indicates that the cluster must be restarted. After it has been restarted, you can try again.
  • If the cluster name is invalid, the Designer Cloud Powered by Trifacta platform fails any read/write operations to Databricks Tables.

NOTE: While using a single cluster shared across users to access Databricks Tables, each user must have a valid Databricks Personal Access Token to the shared cluster.

If a cluster name is not provided:

  • In USER mode, the default behavior is to create one interactive cluster for each user to browse Databricks Tables.
  • In JOB mode, the interactive cluster is created only when a user needs to browse Databricks Tables.

Configure user-level override for Databricks Tables access cluster

Individual users can specify the name of the cluster that accesses Databricks Tables through the Databricks settings. See "Specify Databricks Tables cluster name" below.

Configure for Secrets Manager

The AWS Secrets Manager is a secure vault for storing access credentials to AWS resources. Its use is mandatory with AWS Databricks. For more information, see Configure for AWS Secrets Manager.

Configure for Users

Configure AWS Databricks workspace overrides

A single AWS Databricks account can have access to multiple Databricks workspaces. You can create more than one workspace by using the Account API if your account is on the E2 version of the platform or on a select custom plan that allows multiple workspaces per account.

For more information, see https://docs.databricks.com/administration-guide/account-api/new-workspace.html

Each workspace has a unique deployment name associated with it that defines the workspace URL. For example: https://<deployment-name>.cloud.databricks.com.  

NOTE:

  • The existing property databricks.serviceUrl is used to configure the URL to the Databricks Service to run Spark jobs.
  • The databricks.serviceUrl setting defines the default Databricks workspace for all users in the Designer Cloud Enterprise Edition workspace.
  • Individual users can override this setting on the Databricks Personal Access Token page in their User Preferences.

For more information, see Databricks Settings Page.

For more information, see Configure Platform section above.
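For example, the default Databricks workspace might be configured in trifacta-conf.json as follows. The deployment name is a placeholder:

    "databricks.serviceUrl": "https://<deployment-name>.cloud.databricks.com",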

Configure Databricks job throttling

By default, Databricks workspaces apply limits on the number of jobs that can be submitted before the cluster begins to fail jobs. These limits are the following:

  • Maximum number of concurrent jobs per cluster

  • Maximum number of concurrent jobs per workspace

  • Maximum number of concurrent clusters per workspace

Depending on how your clusters are configured, these limits can vary. For example, if the maximum number of concurrent jobs per cluster is set to 20, then the 21st concurrent job submitted to the cluster fails. 

To prevent unnecessary job failures, the Designer Cloud Powered by Trifacta platform supports throttling of the jobs that it submits to Databricks. When job throttling is enabled and the 21st concurrent job is submitted, the Designer Cloud Powered by Trifacta platform holds that job internally until the first of the following events happens:

  • An active job on the cluster completes, and space is available for submitting a new job. The job is then submitted.
  • The user chooses to cancel the job.
  • One of the timeout limits described below is reached.

NOTE: The Designer Cloud Powered by Trifacta platform supports throttling of jobs based on the maximum number of concurrent jobs per cluster. Throttling against the other limits listed above is not supported at this time.

Steps:

Please complete the following steps to enable job throttling.

  1. You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.
  2. In the Designer Cloud application, select User menu > Admin console > Admin settings.
  3. Locate the following settings and set their values accordingly:

    Setting / Description

    databricks.userClusterThrottling.enabled

    When set to true, job throttling per Databricks cluster is enabled. Please specify the following settings.

    databricks.userClusterthrottling.maxTokensAllottedPerUserCluster

    Set this value to the maximum number of concurrent jobs that can run on one user cluster. Default value is 20.

    databricks.userClusterthrottling.tokenExpiryInMinutes

    The time in minutes after which tokens reserved by a job are revoked, irrespective of the job status. If a job is in progress and this limit is reached, then the Databricks token is expired, and the token is revoked under the assumption that it is stale. Default value is 120 (2 hours).

    Tip: Set this value to 0 to prevent token expiration. However, this setting is not recommended, as jobs can remain in the queue indefinitely.

    jobMonitoring.queuedJobTimeoutMinutes

    The maximum time in minutes that a job is permitted to remain in the queue for a slot on the Databricks cluster. If this limit is reached, the job is marked as failed.
    batch-job-runner.cleanup.enabled

    When set to true, the Batch Job Runner service is permitted to clean up throttling tokens and job-level personal access tokens.

    Tip: Unless you have reason to do otherwise, leave this setting set to true.

  4. Save your changes and restart the platform.
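As a sketch, the job throttling settings in trifacta-conf.json might look like the following. The queue timeout value shown here is an example only:

    "databricks.userClusterThrottling.enabled": true,
    "databricks.userClusterthrottling.maxTokensAllottedPerUserCluster": 20,
    "databricks.userClusterthrottling.tokenExpiryInMinutes": 120,
    "jobMonitoring.queuedJobTimeoutMinutes": 120,
    "batch-job-runner.cleanup.enabled": true,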

Configure personal access token

Each user must insert a Databricks Personal Access Token to access Databricks resources. For more information, see Databricks Settings Page.

Specify Databricks Tables cluster name

Individual users can specify the name of the cluster that they are permitted to use to access Databricks Tables. This cluster can also be shared among users. For more information, see Databricks Settings Page.

Configure maximum retries for REST API 

There is a limit of 30 requests per second per workspace on the Databricks REST APIs. If this limit is reached, an HTTP status code 429 error is returned, indicating that rate limiting is being applied by the server. By default, the Designer Cloud Powered by Trifacta platform re-attempts to submit a request 5 times and then fails the job if the request is not accepted.

If you want to change the number of retries, change the value of the databricks.maxAPICallRetries flag.

Value / Description

5

(default) When a request is submitted through the AWS Databricks REST APIs, up to 5 retries can be performed in the case of failures.

  • The waiting period increases exponentially for every retry. For example, the wait time is 10 seconds for the first retry, 20 seconds for the second retry, 40 seconds for the third retry, and so on. With the default of 5 retries, the total wait across all retries is 10 + 20 + 40 + 80 + 160 seconds, or just over five minutes.
  • Set this value based on the number of minutes or seconds that you are willing to wait for retries.

0

When an API call fails, the request fails. As the number of concurrent jobs increases, more jobs may fail.

NOTE: This setting is not recommended.


5+

Increasing this setting above the default value may result in more requests eventually getting processed. However, increasing the value may consume additional system resources in a high-concurrency environment, and jobs might take longer to run due to the exponentially increasing waiting time.

Use

Run Job From Application

When the above configuration has been completed, you can select the running environment through the application.

NOTE: When a Databricks job fails, the failure is reported immediately in the Designer Cloud application. In the background, the job logs are collected from Databricks and may not be immediately available.

See Run Job Page.

Run Job via API

You can use API calls to execute jobs.

Make sure that the request body contains the following:

    "execution": "databricksSpark",

For more information, see the API documentation for the runJobGroup operation.
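For illustration, a request body for the runJobGroup operation might look like the following sketch. The wrangled dataset identifier is a placeholder:

    {
      "wrangledDataset": {
        "id": 28629
      },
      "execution": "databricksSpark"
    }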

Troubleshooting

Spark job on AWS Databricks fails with "Invalid spark version" error

When running a job using Spark on AWS Databricks, the job may fail with the above invalid version error. In this case, the Databricks version of Spark has been deprecated.

Solution:

Since an AWS Databricks cluster is created for each user, the solution is to identify the cluster version to use, configure the platform to use it, and then restart the platform.

  1. You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.
  2. Acquire the value for databricks.sparkVersion.
  3. In AWS Databricks, compare your value to the list of supported AWS Databricks versions. If your version is unsupported, identify a new version to use.

    NOTE: Be sure to note the version of Spark that is supported for the version of AWS Databricks that you have chosen.

  4. In the Designer Cloud Powered by Trifacta platform configuration, set databricks.sparkVersion to the new version to use.

    NOTE: The value for spark.version does not apply to Databricks.

  5. Restart the Designer Cloud Powered by Trifacta platform.
  6. When a user next runs a job, a new AWS Databricks cluster is created for that user using the specified values.
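For example, to move to the supported AWS Databricks 7.3 runtime, the setting would be:

    "databricks.sparkVersion": "7.3.x-scala2.12",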
