Configure for Azure Databricks

This section describes how to configure the Designer Cloud Powered by Trifacta platform to integrate with Databricks hosted in Azure.

Azure Databricks is an Apache Spark implementation that has been optimized for use on the Azure platform. For more information, see https://databricks.com/product/azure.

Note

You cannot integrate with existing Azure Databricks clusters.

Additional Databricks features supported by the platform:

Table access control: https://docs.databricks.com/security/access-control/table-acls/object-privileges.html

Prerequisites

The Designer Cloud Powered by Trifacta platform must be deployed in Microsoft Azure.
Note
If you are using Azure Databricks as a datasource, please verify that openJDKv1.8.0_302 or earlier is installed on the Trifacta node. Java 8 is required. If necessary, downgrade the Java version and restart the platform. There is a known issue with TLS v1.3.

Simba driver

Beginning in Release 7.1, the integration with Azure Databricks switched from using a Hive-based driver to a Simba driver for the integration with Spark.

Note

If you have upgraded from a release before Release 7.1, you should review the Connect String Options in your Azure Databricks connections, such as Databricks Tables. These options may not work with the Simba driver.

No installation is required to use this new driver.

For more information on the Simba driver, see https://databricks.com/spark/odbc-driver-download.

Limitations

Note

If you are using Azure AD to integrate with an Azure Databricks cluster, the Azure AD secret value stored in azure.secret must begin with an alphanumeric character. This is a known issue.

The Designer Cloud Powered by Trifacta platform must be installed on Microsoft Azure.
Import from nested folders is not supported for running jobs from Azure Databricks.
When a job is started and no cluster is available, a cluster is initiated, which can take up to four minutes. If the job is canceled during cluster startup:
- The job is terminated, and the cluster remains.
- The job is reported in the application as Failed, instead of Canceled.
Azure Databricks integration works with Spark 2.4.x, Spark 3.0.1, Spark 3.2.0, Spark 3.2.1, and Spark 3.3.0.
Note
The version of Spark for Azure Databricks must be applied to the platform configuration through the databricks.sparkVersion property. Details are provided later.
Azure Databricks integration does not work with Hive.

Supported versions of Azure Databricks

Supported versions:

Azure Databricks 10.x
Azure Databricks 9.1 LTS (Recommended)
Azure Databricks 7.3 LTS

Note

When an Azure Databricks version is deprecated, it is no longer available for selection when creating a new cluster. As a result, if a new cluster needs to be created for whatever reason, it must be created using a version supported by Azure Databricks, which may require you to change the value of the Spark version settings in the Designer Cloud Powered by Trifacta platform. See "Troubleshooting" for more information.

For more information on supported versions, see https://docs.databricks.com/release-notes/runtime/releases.html#supported-databricks-runtime-releases-and-support-schedule.

Job limits

By default, the number of jobs permitted on an Azure Databricks cluster is set to 1000.

The number of jobs that can be created per workspace in an hour is limited to 1000.
- If Databricks Job Management is enabled in the platform, then this limit is raised to 5000 by using the run-submit API. For more information, see "Configure Databricks job management" below.
- For more information, see https://docs.databricks.com/dev-tools/api/latest/jobs.html#run-now.
These limits apply to any jobs run for workspace data on the cluster.
The number of actively concurrent job runs in a workspace is limited to 150.

Managing limits:

To enable retrieval and auditing of job information after a job has been completed, the Designer Cloud Powered by Trifacta platform does not delete jobs from the cluster. As a result, jobs can accumulate over time and exceed the number of jobs permitted on the cluster. If you reach these limits, you may receive a `Quota for number of jobs has been reached` error. For more information, see https://docs.databricks.com/user-guide/jobs.html.

Optionally, you can allow the Designer Cloud Powered by Trifacta platform to manage your jobs to avoid these limitations. For more information, see "Configure Databricks job management" below.

Create Cluster

Note

Integration with pre-existing Azure Databricks clusters is not supported.

When a user first requests access to Azure Databricks, a new Azure Databricks cluster is created for the user. Access can include a request to run a job or to browse Databricks Tables. Cluster creation may take a few minutes.

A new cluster is also created when a user launches a job after:

The Azure Databricks configuration properties or Spark properties are changed in platform configuration.
A JAR file is updated on the Trifacta node.

A user's cluster automatically terminates after a configurable time period. A new cluster is automatically created when the user next requests access to Azure Databricks. See "Configure Platform" below.

Enable

To enable Azure Databricks, please perform the following configuration changes.

Steps:

You apply this change through the Workspace Settings Page. For more information, see Platform Configuration Methods.
Locate the following parameter, which enables Trifacta Photon for smaller job execution. Set it to Enabled:
```
Photon execution
```
You do not need to save to enable the above configuration change.
You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.
Locate the following parameters. Set them to the values listed below, which enables Azure Databricks (small to extra-large jobs) running environments:
```
"webapp.runInDatabricks": true,
"webapp.runWithSparkSubmit": false,
"webapp.runinEMR": false,
"webapp.runInDataflow": false,
```
Locate the following parameter and set it to azure:
```
"metadata.cloud": "azure",
```
Note
Do not set this parameter to any value other than azure while using Azure Databricks.
Do not save your changes until you have completed the following configuration section.
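
Before saving, it can help to confirm that the running-environment settings match the values listed above. The following is a minimal sketch, not a platform tool: it assumes the settings have been exported as a flat dictionary keyed by the dotted property names, and the `find_mismatches` helper and the sample values are hypothetical.

```python
import json

# Expected values, taken from the settings listed in the steps above.
EXPECTED = {
    "webapp.runInDatabricks": True,
    "webapp.runWithSparkSubmit": False,
    "webapp.runinEMR": False,
    "webapp.runInDataflow": False,
    "metadata.cloud": "azure",
}


def find_mismatches(flat_config):
    """Return {key: (expected, actual)} for every setting that differs."""
    mismatches = {}
    for key, expected in EXPECTED.items():
        actual = flat_config.get(key)
        if actual != expected:
            mismatches[key] = (expected, actual)
    return mismatches


# Hypothetical flattened settings, for illustration only.
sample = json.loads("""
{
  "webapp.runInDatabricks": true,
  "webapp.runWithSparkSubmit": false,
  "webapp.runinEMR": false,
  "webapp.runInDataflow": false,
  "metadata.cloud": "aws"
}
""")

print(find_mismatches(sample))
```

In this sample, only `metadata.cloud` is reported, because it is set to a value other than azure.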

Configure

Configure cluster mode

When a user submits a job, Designer Cloud Powered by Trifacta Enterprise Edition provides all the cluster specifications in the Databricks API, and a cluster is created either per user or per job, depending on the cluster mode. For a job cluster, the cluster is terminated once the job is complete. Cluster creation may take less than 30 seconds if instance pools are used. If instance pools are not used, it may take 10-15 minutes.

For more information on job clusters, see https://docs.databricks.com/clusters/configure.html.

Job clusters automatically terminate after the job is completed. A new cluster is automatically created when the user next requests access to Azure Databricks.

Cluster Mode	Description
USER	When a user submits a job, Designer Cloud Powered by Trifacta Enterprise Edition creates a new cluster and persists the cluster ID in its metadata for the user if the cluster does not exist or is invalid. If the user already has a valid interactive cluster, then the existing cluster is reused when submitting the job. This is the default cluster mode for running jobs in Azure Databricks.
JOB	When a user submits a job, Designer Cloud Powered by Trifacta Enterprise Edition provides all the cluster specifications in the Databricks API. Databricks creates a cluster only for this job and terminates it as soon as the job completes.
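
The difference between the two modes can be sketched as a small decision function. This is an illustration of the behavior described in the table, not the platform's implementation; the `UserCluster` type and the `plan_cluster` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class UserCluster:
    """Hypothetical record of the cluster ID persisted for a user."""
    cluster_id: str
    valid: bool


def plan_cluster(mode: str, existing: Optional[UserCluster]) -> str:
    """Describe what happens to clusters when a job is submitted."""
    if mode == "JOB":
        # A cluster is created only for this job and ends with it.
        return "create job cluster; terminate when job completes"
    if mode == "USER":
        if existing is not None and existing.valid:
            return f"reuse cluster {existing.cluster_id}"
        # No cluster yet, or the stored one is invalid.
        return "create user cluster; persist cluster ID"
    raise ValueError(f"unknown cluster mode: {mode!r}")


print(plan_cluster("USER", UserCluster("abc-123", valid=True)))
print(plan_cluster("USER", None))
print(plan_cluster("JOB", None))
```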

Configure use of cluster policies

Optionally, you can configure the Designer Cloud Powered by Trifacta platform to use the Databricks cluster policies that have been specified by your Databricks administrator for creating and using clusters. These policies are effectively templates for the creation and use of Databricks clusters. They govern aspects of clusters such as the type and count of nodes, the resources that can be accessed via the cluster, and other settings. For more information on Databricks cluster policies, see https://docs.databricks.com/administration-guide/clusters/policies.html.

Prerequisites

Note

Your Databricks administrator must create and deploy the Databricks cluster policies from which Alteryx users can select for their personal use.

Notes:

When this feature is enabled, each user may select the appropriate Databricks cluster policy to use for jobs. If none is selected by a user, jobs are launched without a cluster policy for the user using the Databricks properties set in platform configuration.
Note
Except for Spark version and cluster policy identifier in job-level overrides, other Databricks cluster configuration in the Designer Cloud Powered by Trifacta platform is ignored when this feature is in use. Other job-level overrides are also ignored.
If a cluster policy is modified and existing clusters are using it, then subsequent job executions using that policy attempt to use the same cluster. This can cause issues in performance and even job failures.
Tip
Avoid editing cluster policies that are in use, as these changed policies may be applied to clusters generated under the old policies. Instead, you should create a new policy and assign it for use.
If the cluster policy references a Databricks instance pool that does not exist, the job fails.

Steps:

You apply this change through the Workspace Settings Page. For more information, see Platform Configuration Methods.
Locate the following parameter and set it to Enabled:
```
Databricks Cluster Policies
```
Save your changes and restart the platform.

Note

Each user must select a cluster policy to use. For more information, see Databricks Settings Page.

Job overrides:

A user's cluster policy can be overridden when a job is executed via API. Set the clusterPolicyId attribute in the request.

Note

If a Databricks cluster policy is used, all job-level overrides except for clusterPolicyId are ignored.

For more information, see API Task - Run Job.

Policy template for Azure - without instance pools:

The following example cluster policy can provide a basis for creating your own Azure cluster policies when instance pools are not in use:

{
  "autoscale.max_workers": {
    "type": "fixed",
    "value": 3,
    "hidden": true
  },
  "autoscale.min_workers": {
    "type": "fixed",
    "value": 1,
    "hidden": true
  },
  "autotermination_minutes": {
    "type": "fixed",
    "value": 10,
    "hidden": true
  },
  "driver_node_type_id": {
    "type": "fixed",
    "value": "Standard_D3_v2",
    "hidden": true
  },
  "enable_local_disk_encryption": {
    "type": "fixed",
    "value": false
  },
  "node_type_id": {
    "type": "fixed",
    "value": "Standard_D3_v2",
    "hidden": true
  }
}

Policy template for Azure - with instance pools:

The following example cluster policy can provide a basis for creating your own Azure cluster policies when instance pools are in use:

{
  "autoscale.max_workers": {
    "type": "fixed",
    "value": 3,
    "hidden": true
  },
  "autoscale.min_workers": {
    "type": "fixed",
    "value": 1,
    "hidden": true
  },
  "enable_local_disk_encryption": {
    "type": "fixed",
    "value": false
  },
  "instance_pool_id": {
    "type": "fixed",
    "value": "SOME_POOL",
    "hidden": true
  },
  "driver_instance_pool_id": {
    "type": "fixed",
    "value": "SOME_POOL",
    "hidden": true
  },
  "autotermination_minutes": {
    "type": "fixed",
    "value": 10,
    "hidden": true
  }
}
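
If you maintain both variants, generating the policy from code avoids hand-editing mistakes such as stray trailing commas. The following is a minimal sketch that reproduces the two templates above; the `fixed` and `build_policy` helper names are hypothetical, and the values are the example values from the templates, not recommendations.

```python
import json


def fixed(value, hidden=True):
    """Build a 'fixed' policy rule like those in the templates above."""
    rule = {"type": "fixed", "value": value}
    if hidden:
        rule["hidden"] = True
    return rule


def build_policy(pool_id=None, node_type="Standard_D3_v2"):
    """Return the policy as a dict; pass pool_id to use instance pools."""
    policy = {
        "autoscale.max_workers": fixed(3),
        "autoscale.min_workers": fixed(1),
        "autotermination_minutes": fixed(10),
        "enable_local_disk_encryption": fixed(False, hidden=False),
    }
    if pool_id is None:
        # Without instance pools, node types are set explicitly.
        policy["driver_node_type_id"] = fixed(node_type)
        policy["node_type_id"] = fixed(node_type)
    else:
        # With instance pools, nodes come from the pool instead.
        policy["instance_pool_id"] = fixed(pool_id)
        policy["driver_instance_pool_id"] = fixed(pool_id)
    return policy


# json.dumps always emits valid JSON for this structure.
print(json.dumps(build_policy(pool_id="SOME_POOL"), indent=2))
```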

Configure Platform

Please review and modify the following configuration settings.

Note

When you have finished modifying these settings, save them and restart the platform to apply.

Parameter	Description	Value
databricks.clusterMode	Determines the cluster mode for running a Databricks job.	Default: USER
feature.parameterization.matchLimitOnSampling.databricksSpark	Maximum number of parameterized source files that are permitted for matching in a single dataset with parameters.
databricks.workerNodeType	Type of node to use for the Azure Databricks Workers/Executors. There are 1 or more Worker nodes per cluster.	Default: `Standard_D3_v2` Note This property is unused when instance pooling is enabled. For more information, see Configure instance pooling below. For more information, see the sizing guide for Azure Databricks.
databricks.sparkVersion	Azure Databricks cluster version which also includes the version of Spark.	Depending on your version of Azure Databricks, please set this property according to the following: Azure Databricks 10.x: `10.0.x-scala2.12` Azure Databricks 9.1 LTS: `9.1.x-scala2.12` Azure Databricks 7.3 LTS: `7.3.x-scala2.12` Please do not use other values.
databricks.serviceUrl	URL to the Azure Databricks Service where Spark jobs will be run (Example: https://westus2.azuredatabricks.net)
databricks.minWorkers	Initial number of Worker nodes in the cluster, and also the minimum number of Worker nodes that the cluster can scale down to during auto-scale-down	Minimum value: `1` Increasing this value can increase compute costs.
databricks.maxWorkers	Maximum number of Worker nodes the cluster can create during auto scaling	Minimum value: Not less than `databricks.minWorkers`. Increasing this value can increase compute costs.
databricks.poolId	If you have enabled instance pooling in Azure Databricks, you can specify the worker node pool identifier here. For more information, see Configure instance pooling below.	Note If both poolId and poolName are specified, poolId is used first. If that fails to find a matching identifier, then the poolName value is checked.
databricks.poolName	If you have enabled instance pooling in Azure Databricks, you can specify the worker node pool name here. For more information, see Configure instance pooling below.	See previous. Tip If you specify a poolName value only, then you can use the instance pools with the same poolName available across multiple Databricks workspaces when you create a new cluster.
databricks.driverNodeType	Type of node to use for the Azure Databricks Driver. There is only 1 Driver node per cluster.	Default: `Standard_D3_v2` For more information, see the sizing guide for Databricks. Note This property is unused when instance pooling is enabled. For more information, see Configure instance pooling below.
databricks.driverPoolId	If you have enabled instance pooling in Azure Databricks, you can specify the driver node pool identifier here. For more information, see Configure instance pooling below.	Note If both driverPoolId and driverPoolName are specified, driverPoolId is used first. If that fails to find a matching identifier, then the driverPoolName value is checked.
databricks.driverPoolName	If you have enabled instance pooling in Azure Databricks, you can specify the driver node pool name here. For more information, see Configure instance pooling below.	See previous. Tip If you specify a driverPoolName value only, then you can use the instance pools with the same driverPoolName available across multiple Databricks workspaces when you create a new cluster.
databricks.logsDestination	DBFS location that cluster logs will be sent to every 5 minutes	Leave this value as `/trifacta/logs`.
databricks.enableAutotermination	Set to true to enable auto-termination of a user cluster after N minutes of idle time, where N is the value of the autoterminationMinutes property.	Unless otherwise required, leave this value as `true`.
databricks.clusterStatePollerDelayInSeconds	Number of seconds to wait between polls for Azure Databricks cluster status when a cluster is starting up
databricks.clusterStartupWaitTimeInMinutes	Maximum time in minutes to wait for a Cluster to get to Running state before aborting and failing an Azure Databricks job
databricks.clusterLogSyncWaitTimeInMinutes	Maximum time in minutes to wait for a Cluster to complete syncing its logs to DBFS before giving up on pulling the cluster logs to the Trifacta node.	Set this to `0` to disable cluster log pulls.
databricks.clusterLogSyncPollerDelayInSeconds	Number of seconds to wait between polls for a Databricks cluster to sync its logs to DBFS after job completion
databricks.autoterminationMinutes	Idle time in minutes before a user cluster will auto-terminate.	Do not set this value to less than the cluster startup wait time value.
databricks.enableLocalDiskEncryption	Enables encryption of data like shuffle data that is temporarily stored on cluster's local disk.	-
databricks.patCacheTTLInMinutes	Lifespan in minutes for the Databricks personal access token in-memory cache	Default: 10
databricks.maxAPICallRetries	Maximum number of retries to perform in case of 429 error code response	Default: 5. For more information, see the Configure Maximum Retries for REST API section below.
spark.useVendorSparkLibraries	When `true`, the platform bypasses shipping its installed Spark libraries to the cluster with each job's execution.	Note This setting is ignored. The vendor Spark libraries are always used for Azure Databricks.
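
Several of the settings above constrain each other. The following is a minimal sketch of a pre-save check for those constraints. It assumes the settings are available as a flat dictionary keyed by property name. The `validate_databricks_settings` function is hypothetical and is not part of the platform.

```python
# Values listed for databricks.sparkVersion in the table above.
SUPPORTED_SPARK_VERSIONS = {
    "10.0.x-scala2.12",  # Azure Databricks 10.x
    "9.1.x-scala2.12",   # Azure Databricks 9.1 LTS
    "7.3.x-scala2.12",   # Azure Databricks 7.3 LTS
}
CLUSTER_MODES = {"USER", "JOB"}


def validate_databricks_settings(settings):
    """Return a list of problems; an empty list means none were found."""
    problems = []

    mode = settings.get("databricks.clusterMode", "USER")
    if mode not in CLUSTER_MODES:
        problems.append(f"databricks.clusterMode must be USER or JOB, got {mode!r}")

    spark_version = settings.get("databricks.sparkVersion")
    if spark_version not in SUPPORTED_SPARK_VERSIONS:
        problems.append(f"databricks.sparkVersion {spark_version!r} is not a listed value")

    min_workers = settings.get("databricks.minWorkers")
    max_workers = settings.get("databricks.maxWorkers")
    if min_workers is not None and min_workers < 1:
        problems.append("databricks.minWorkers must be at least 1")
    if None not in (min_workers, max_workers) and max_workers < min_workers:
        problems.append("databricks.maxWorkers must not be less than databricks.minWorkers")

    idle = settings.get("databricks.autoterminationMinutes")
    startup = settings.get("databricks.clusterStartupWaitTimeInMinutes")
    if None not in (idle, startup) and idle < startup:
        problems.append("databricks.autoterminationMinutes is less than the cluster startup wait time")

    return problems


# Hypothetical values, for illustration only.
for problem in validate_databricks_settings({
    "databricks.sparkVersion": "9.1.x-scala2.12",
    "databricks.minWorkers": 2,
    "databricks.maxWorkers": 1,
    "databricks.autoterminationMinutes": 5,
    "databricks.clusterStartupWaitTimeInMinutes": 10,
}):
    print(problem)
```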

Configure instance pooling

Instance pooling reduces cluster node spin-up time by maintaining a set of idle and ready instances. The Designer Cloud Powered by Trifacta platform can be configured to leverage instance pooling on the Azure Databricks cluster for both worker and driver nodes.

Note

When instance pooling is enabled, the following parameters are not used:

databricks.driverNodeType

databricks.workerNodeType

For more information, see https://docs.azuredatabricks.net/clusters/instance-pools/index.html.

Instance pooling for worker nodes

Prerequisites:

All cluster nodes used by the Designer Cloud Powered by Trifacta platform are taken from the pool. If the pool has an insufficient number of nodes, cluster creation fails.
Each user must have access to the pool and must have at least the ATTACH_TO permission.
Each user must have a personal access token from the same Azure Databricks workspace. See Configure personal access token below.

To enable:

Acquire your pool identifier or pool name from Azure Databricks.
Note
You can use either the Databricks pool identifier or pool name. If both poolId and poolName are specified, poolId is used first. If that fails to find a matching identifier, then the poolName value is checked.
Tip
If you specify a poolName value only, then you can run your Databricks jobs against the available clusters across multiple Alteryx workspaces. This mechanism allows for better resource allocation and broader execution options.
You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.
Set either of the following parameters:
1. Set the following parameter to the Azure Databricks pool identifier:
```
"databricks.poolId": "<my_pool_id>",
```
2. Or, you can set the following parameter to the Azure Databricks pool name:
```
"databricks.poolName": "<my_pool_name>",
```
Save your changes and restart the platform.
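
The lookup order described in the note above (poolId first, then poolName) can be sketched as follows. This illustrates the documented precedence only. The `resolve_pool` function and the shape of the pool records are assumptions, and the sketch does not call the Databricks API.

```python
def resolve_pool(available_pools, pool_id=None, pool_name=None):
    """Pick a pool: match on identifier first, then fall back to name.

    available_pools is a list of dicts with 'id' and 'name' keys
    (a hypothetical shape chosen for this sketch).
    """
    if pool_id:
        for pool in available_pools:
            if pool["id"] == pool_id:
                return pool
    if pool_name:
        # Reached when no poolId was given or it matched nothing.
        for pool in available_pools:
            if pool["name"] == pool_name:
                return pool
    return None


# Hypothetical pools, for illustration only.
pools = [
    {"id": "pool-001", "name": "workers-small"},
    {"id": "pool-002", "name": "workers-large"},
]

print(resolve_pool(pools, pool_id="pool-002", pool_name="workers-small"))
print(resolve_pool(pools, pool_id="missing", pool_name="workers-small"))
```

The same precedence applies to driverPoolId and driverPoolName for driver nodes.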

Instance pooling for driver nodes

The Designer Cloud Powered by Trifacta platform can be configured to use Databricks instance pooling for driver pools.

To enable:

Acquire your driver pool identifier or driver pool name from Databricks.
Note
You can use either the Databricks driver pool identifier or driver pool name. If both driverPoolId and driverPoolName are specified, driverPoolId is used first. If that fails to find a matching identifier, then the driverPoolName value is checked.
Tip
If you specify a driverPoolName value only, then you can run your Databricks jobs against the available clusters across multiple Alteryx workspaces. This mechanism allows for better resource allocation and broader execution options.
You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.
Set either of the following parameters:
1. Set the following parameter to the Databricks driver pool identifier:
```
"databricks.driverPoolId": "<my_pool_id>",
```
2. Or, you can set the following parameter to the Databricks driver pool name:
```
"databricks.driverPoolName": "<my_pool_name>",
```
Save your changes and restart the platform.

Configure Databricks job management

Azure Databricks enforces a hard limit of 1000 created jobs per workspace, and by default cluster jobs are not deleted. To support more than 1000 jobs per cluster, you can enable job management for Azure Databricks.

Note

This feature covers the deletion of the job definition on the cluster, which counts toward the enforced limits. The Designer Cloud Powered by Trifacta platform never deletes the outputs of a job or the job definition stored in the platform. When cluster job definitions are removed, the jobs remain listed in the Job History page, and job metadata is still available. There is no record of the job on the Azure Databricks cluster. Jobs continue to run, but users on the cluster may not be aware of them.

Tip

Regardless of your job management option, when you hit the limit for the number of job definitions that can be created on the Databricks workspace, the platform by default falls back to using the runs/submit API, if the Databricks Job Runs Submit Fallback setting has been enabled.

Steps:

You apply this change through the Workspace Settings Page. For more information, see Platform Configuration Methods.

Locate the following property and set it to one of the values listed below:

Databricks Job Management

Property Value	Description
Never Delete	(default) Job definitions are never deleted from the Azure Databricks cluster.
Always Delete	The Azure Databricks job definition is deleted during the clean-up phase, which occurs after a job completes.
Delete Successful Only	When a job completes successfully, the Azure Databricks job definition is deleted during the clean-up phase. Failed or canceled jobs are not deleted, which allows you to debug as needed.
Skip Job Creation	For jobs that are to be executed only one time, the Designer Cloud Powered by Trifacta platform can be configured to use a different mechanism for submitting the job. When this option is enabled, the Designer Cloud Powered by Trifacta platform submits jobs using the run-submit API, instead of the run-now API. The run-submit API does not create an Azure Databricks job definition. Therefore the submitted job does not count toward the enforced job limit.
Default	Inherits the default system-wide setting.

When the following setting is enabled, the platform falls back to the runs/submit API when the job limit for the Databricks workspace has been reached:
```
Databricks Job Runs Submit Fallback
```
Save your changes and restart the platform.
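
The effect of each Databricks Job Management value on the Azure Databricks job definition can be summarized in a short sketch. It illustrates the table above and is not the platform's code. The `job_definition_action` function and its return strings are hypothetical.

```python
def job_definition_action(policy, job_succeeded, system_default="Never Delete"):
    """Say what happens to the Databricks job definition after a job ends."""
    if policy == "Default":
        # Inherit the system-wide setting.
        policy = system_default
    if policy == "Never Delete":
        return "keep"
    if policy == "Always Delete":
        return "delete"
    if policy == "Delete Successful Only":
        # Failed or canceled jobs are kept for debugging.
        return "delete" if job_succeeded else "keep"
    if policy == "Skip Job Creation":
        # The run-submit API creates no job definition at all.
        return "none created"
    raise ValueError(f"unknown job management value: {policy!r}")


for value in ("Never Delete", "Always Delete", "Delete Successful Only", "Skip Job Creation"):
    print(value, "->", job_definition_action(value, job_succeeded=False))
```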

Configure Databricks job throttling

By default, Databricks workspaces apply limits on the number of jobs that can be submitted before the cluster begins to fail jobs. These limits are the following:

Maximum number of concurrent jobs per cluster
Max number of concurrent jobs per workspace
Max number of concurrent clusters per workspace

Depending on how your clusters are configured, these limits can vary. For example, if the maximum number of concurrent jobs per cluster is set to 20, then the 21st concurrent job submitted to the cluster fails.

To prevent unnecessary job failure, the Designer Cloud Powered by Trifacta platform supports the throttling of jobs submitted to Databricks. When job throttling is enabled and the 21st concurrent job is submitted, the Designer Cloud Powered by Trifacta platform holds that job internally until the first of the following events happens:

An active job on the cluster completes, and space is available for submitting a new job. The job is then submitted.
The user chooses to cancel the job.
One of the timeout limits described below is reached.

Note

The Designer Cloud Powered by Trifacta platform supports throttling of jobs based on the maximum number of concurrent jobs per cluster. Throttling against the other limits listed above is not supported at this time.

Steps:

Please complete the following steps to enable job throttling.

You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.
In the Trifacta Application, select User menu > Admin console > Admin settings.

Locate the following settings and set their values accordingly:

Setting	Description
databricks.userClusterThrottling.enabled	When set to `true`, job throttling per Databricks cluster is enabled. Please specify the following settings.
databricks.userClusterthrottling.maxTokensAllottedPerUserCluster	Set this value to the maximum number of concurrent jobs that can run on one user cluster. Default value is `20`.
databricks.userClusterthrottling.tokenExpiryInMinutes	The time in minutes after which tokens reserved by a job are revoked, irrespective of the job status. If a job is in progress and this limit is reached, then the Databricks token is expired, and the token is revoked under the assumption that it is stale. Default value is `120` (2 hours). Tip Set this value to `0` to prevent token expiration. However, this setting is not recommended, as jobs can remain in the queue indefinitely.
jobMonitoring.queuedJobTimeoutMinutes	The maximum time in minutes that a job is permitted to remain in the queue for a slot on a Databricks cluster. If this limit is reached, the job is marked as failed.
batch-job-runner.cleanup.enabled	When set to `true`, the Batch Job Runner service is permitted to clean up throttling tokens and job-level personal access tokens. Tip Unless you have reason to do otherwise, you should leave this setting set to `true`.

Save your changes and restart the platform.
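
The token-based behavior described by these settings can be sketched as a small in-memory throttle. This is an illustration under stated assumptions, not the platform's implementation. The `ClusterThrottle` class is hypothetical, the defaults mirror the documented defaults of 20 tokens and 120 minutes, and a value of 0 disables expiry as described in the table.

```python
import time


class ClusterThrottle:
    """Track job tokens for one user cluster (hypothetical sketch)."""

    def __init__(self, max_tokens=20, token_expiry_minutes=120, clock=time.monotonic):
        self.max_tokens = max_tokens
        self.expiry_seconds = token_expiry_minutes * 60
        self.clock = clock
        self.tokens = {}  # job_id -> time the token was reserved

    def _revoke_stale(self):
        # An expiry of 0 means tokens never expire.
        if self.expiry_seconds == 0:
            return
        now = self.clock()
        self.tokens = {
            job: reserved_at
            for job, reserved_at in self.tokens.items()
            if now - reserved_at < self.expiry_seconds
        }

    def try_acquire(self, job_id):
        """Reserve a token; return False if the job must stay queued."""
        self._revoke_stale()
        if len(self.tokens) >= self.max_tokens:
            return False
        self.tokens[job_id] = self.clock()
        return True

    def release(self, job_id):
        """Free the token when the job completes or is canceled."""
        self.tokens.pop(job_id, None)


throttle = ClusterThrottle(max_tokens=2)
print(throttle.try_acquire("job-1"), throttle.try_acquire("job-2"), throttle.try_acquire("job-3"))
throttle.release("job-1")
print(throttle.try_acquire("job-3"))
```

In this sketch, a job that cannot reserve a token is simply reported as not acquired. The queue timeout controlled by jobMonitoring.queuedJobTimeoutMinutes is not modeled.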

Configure for Databricks Secrets Management

Optionally, you can leverage Databricks Secrets Management to store sensitive Databricks configuration properties. When this feature is enabled and a set of properties are specified, those properties and their values are stored in masked form. For more information on Databricks Secrets Management, see https://docs.databricks.com/security/secrets/index.html.

Steps:

You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.

Locate the following properties and set accordingly:

Setting	Description
databricks.secretNamespace	If multiple instances of Designer Cloud Powered by Trifacta Enterprise Edition are using the same Databricks cluster, you can specify the Databricks namespace to which these properties apply.
databricks.secrets	An array containing strings representing the properties that you wish to store in Databricks Secrets Management. For example, the default value stores a recommended set of Spark and Databricks properties: ["spark.hadoop.dfs.adls.oauth2.client.id","spark.hadoop.dfs.adls.oauth2.credential", "dfs.adls.oauth2.client.id","dfs.adls.oauth2.credential", "fs.azure.account.oauth2.client.id","fs.azure.account.oauth2.client.secret"], You can add or remove properties from this array list as needed.

Save your changes and restart the platform.
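
To see which properties a given databricks.secrets list would cover, you can partition a set of properties against it. The following is a minimal sketch that only illustrates the selection. It does not call Databricks Secrets Management, and the `split_properties` function and the sample property values are hypothetical.

```python
# Default value of databricks.secrets, as listed above.
DEFAULT_SECRET_KEYS = [
    "spark.hadoop.dfs.adls.oauth2.client.id",
    "spark.hadoop.dfs.adls.oauth2.credential",
    "dfs.adls.oauth2.client.id",
    "dfs.adls.oauth2.credential",
    "fs.azure.account.oauth2.client.id",
    "fs.azure.account.oauth2.client.secret",
]


def split_properties(properties, secret_keys=DEFAULT_SECRET_KEYS):
    """Return (secret, plain) dicts based on membership in secret_keys."""
    secret_set = set(secret_keys)
    secret = {k: v for k, v in properties.items() if k in secret_set}
    plain = {k: v for k, v in properties.items() if k not in secret_set}
    return secret, plain


# Hypothetical property values, for illustration only.
secret, plain = split_properties({
    "fs.azure.account.oauth2.client.id": "example-client-id",
    "fs.azure.account.oauth2.client.secret": "example-secret",
    "fs.azure.account.auth.type": "OAuth",
})

print(sorted(secret))
print(sorted(plain))
```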

Configure personal access token

Each user must insert a Databricks Personal Access Token to access Databricks resources. For more information, see Databricks Settings Page.

Combine transform and profiling for Spark jobs

When profiling is enabled for a Spark job, the transform and profiling tasks are combined by default. As needed, you can separate these two tasks. Publishing behaviors vary depending on the approach. For more information, see Configure for Spark.

Additional Configuration

Enable SSO for Azure Databricks

To enable SSO authentication with Azure Databricks, you enable SSO integration with Azure AD. For more information, see Configure SSO for Azure AD.

Enable Azure Managed Identity access

For enhanced security, you can configure the Designer Cloud Powered by Trifacta platform to use an Azure Managed Identity. When this feature is enabled, the platform queries the Key Vault for the secret holding the applicationId and secret to the service principal that provides access to the Azure services.

Note

This feature is supported for Azure Databricks only.

Note

Your Azure Key Vault must already be configured, and the applicationId and secret must be available in the Key Vault. See Configure for Azure.

To enable, the following parameters for the Designer Cloud Powered by Trifacta platform must be specified.

You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.

Parameter	Description
azure.managedIdentities.enabled	Set to `true` to enable use of Azure managed identities.
azure.managedIdentities.keyVaultApplicationidSecretName	Specify the name of the Azure Key Vault secret that holds the service principal Application Id.
azure.managedIdentities.keyVaultApplicationSecretSecretName	Specify the name of the Key Vault secret that holds the service principal secret.

Save your changes.

Enable shared cluster for Databricks Tables

Optionally, you can provide to the Designer Cloud Powered by Trifacta platform the name of a shared Databricks cluster to be used to access Databricks Tables.

Prerequisites:

Note

Any shared cluster must be maintained by the customer.

Shared clusters are not provisioned per-user and cannot be used to run Spark jobs.
If you have enabled table access control on your high-concurrency cluster, you must configure access to data objects for users. For more information, see https://docs.databricks.com/security/access-control/table-acls/object-privileges.html.
If only a limited number of users are using the shared cluster, you must configure the attach permission on the shared cluster in the Databricks workspace.

Configure cluster

Depending on the credential provider type, the following properties must be specified in the Spark configuration for the shared cluster.

ADLS Gen2 Credentials:

"fs.azure.account.auth.type"
"fs.azure.account.oauth.provider.type"
"fs.azure.account.oauth2.client.id"
"fs.azure.account.oauth2.client.secret"
"fs.azure.account.oauth2.client.endpoint"
"spark.hadoop.fs.azure.account.auth.type"
"spark.hadoop.fs.azure.account.oauth.provider.type"
"spark.hadoop.fs.azure.account.oauth2.client.id"
"spark.hadoop.fs.azure.account.oauth2.client.secret"
"spark.hadoop.fs.azure.account.oauth2.client.endpoint"

For more information, see https://docs.databricks.com/data/data-sources/azure/azure-datalake-gen2.html.
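The property names listed above can be generated in pairs, since each `fs.azure.*` key also appears with a `spark.hadoop.` prefix. The following sketch prints them in the `key value` form used in a cluster's Spark configuration. The function name is hypothetical, and the values are assumptions: `OAuth`, the `ClientCredsTokenProvider` class, and the `login.microsoftonline.com` token endpoint are the settings commonly documented for OAuth client credentials with ADLS Gen2, so verify them against the Databricks documentation linked above.

```python
def adls_gen2_spark_conf(client_id: str, client_secret: str, tenant_id: str) -> dict:
    """Build the ten ADLS Gen2 properties: five base keys plus spark.hadoop.* copies."""
    base = {
        # Assumed values for OAuth client-credentials access; confirm before use.
        "fs.azure.account.auth.type": "OAuth",
        "fs.azure.account.oauth.provider.type":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        "fs.azure.account.oauth2.client.id": client_id,
        "fs.azure.account.oauth2.client.secret": client_secret,
        "fs.azure.account.oauth2.client.endpoint":
            f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
    }
    conf = dict(base)
    # Mirror every base key under the spark.hadoop. prefix.
    conf.update({f"spark.hadoop.{key}": value for key, value in base.items()})
    return conf


# Placeholder inputs; substitute your service principal details.
conf = adls_gen2_spark_conf("<application-id>", "<secret>", "<tenant-id>")
for key, value in conf.items():
    print(f"{key} {value}")
```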

Enable

Steps:

You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.
Locate the following parameter, and add the name of the Databricks cluster to use to browse Databricks Tables:
```
"feature.databricks.connection.clusterName": "<your_cluster_name>",
```
Save your changes and restart the cluster.

When a cluster name is provided:

If the cluster is available, all users of the Designer Cloud Powered by Trifacta platform attempt to connect to Databricks Tables through the listed cluster.
If the cluster has been terminated, a message indicates that the cluster must be restarted. After it has been restarted, you can try again.
If the cluster name is invalid, the Designer Cloud Powered by Trifacta platform fails any read/write operations to Databricks Tables.

Note

While using a single cluster shared across users to access Databricks Tables, each user must have a valid Databricks Personal Access Token to the shared cluster.

If a cluster name is not provided:

In USER mode, the default behavior is to create one interactive cluster for each user to browse Databricks Tables.
In JOB mode, the interactive cluster is created only when a user needs to browse Databricks Tables.

Configure user-level override for Databricks Tables access cluster

Individual users can specify the name of the cluster that accesses Databricks Tables through the Databricks settings. See "Specify Databricks Tables cluster name" below.

Specify Databricks Tables cluster name

Individual users can specify the name of a cluster that they are permitted to use to access Databricks Tables. This cluster can also be shared among users. For more information, see Databricks Settings Page.

Pass additional Spark properties

As needed, you can pass additional properties to the Spark running environment through the spark.props configuration area.

Note

These properties are passed to Spark for all jobs.

Steps:

You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.
Search for the following property: spark.props.
Insert new Spark properties. For example, you can specify the spark.props.spark.executor.memory property, which changes the memory allocated to the Spark executor on each node by using the following in the spark.props area:
```
"spark": {
  ...
  "props": {
    "spark.executor.memory": "6GB"
  }
  ...
}
```
Save your changes and restart the platform.
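If you edit trifacta-conf.json directly instead of using the Admin Settings Page, the change in the steps above amounts to setting one key under `spark.props`. The sketch below shows that update on a temporary file. It assumes the configuration file is plain JSON with a top-level `spark` object, as in the fragment above; the helper name `set_spark_prop` is hypothetical. Back up the real file before trying anything similar.

```python
import json
import os
import tempfile


def set_spark_prop(conf_path: str, name: str, value) -> dict:
    """Set one property under spark.props and write the file back."""
    with open(conf_path, encoding="utf-8") as fh:
        conf = json.load(fh)
    # Create the spark and props objects if they are missing.
    props = conf.setdefault("spark", {}).setdefault("props", {})
    props[name] = value
    with open(conf_path, "w", encoding="utf-8") as fh:
        json.dump(conf, fh, indent=2)
    return conf


# Demonstrate on a throwaway file rather than a live trifacta-conf.json.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "trifacta-conf.json")
    with open(path, "w", encoding="utf-8") as fh:
        json.dump({"spark": {"props": {}}}, fh)
    updated = set_spark_prop(path, "spark.executor.memory", "6GB")
    print(json.dumps(updated["spark"], indent=2))
```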

For more information on modifying these settings, see Configure for Spark.

Configure Maximum Retries for REST API

There is a limit of 30 requests per second per workspace on the Databricks REST APIs. If this limit is reached, an HTTP status code 429 error is returned, indicating that the server is applying rate limiting. By default, the Designer Cloud Powered by Trifacta platform re-attempts to submit a request 5 times and then fails the job if the request is not accepted.

If you want to change the number of retries, change the value for the databricks.maxAPICallRetries flag.

Value	Description
5	(default) When a request is submitted through the Azure Databricks REST APIs, up to `5` retries can be performed in the case of failures. The waiting period increases exponentially for every retry. For example, the wait time is 10 seconds for the first retry, 20 seconds for the second retry, 40 seconds for the third retry, and so on. Set the value based on the total number of minutes or seconds for which you want the platform to keep retrying.
0	When an API call fails, the request fails. As the number of concurrent jobs increases, more jobs may fail. Note: This setting is not recommended.
5+	Increasing this setting above the default value may result in more requests eventually getting processed. However, increasing the value may consume additional system resources in a high concurrency environment and jobs might take longer to run due to exponentially increasing waiting time.
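To judge how long a job may wait before it fails, it can help to add up the waiting periods for a given retry count. The sketch below assumes that the doubling pattern described in the table (10, 20, 40 seconds) continues unchanged for later retries; the function name is hypothetical and the platform's actual timing may differ.

```python
from typing import List


def retry_wait_times(max_retries: int, first_wait_seconds: int = 10) -> List[int]:
    """Return the assumed wait before each retry, doubling every time."""
    return [first_wait_seconds * 2 ** attempt for attempt in range(max_retries)]


waits = retry_wait_times(5)
print(waits)       # [10, 20, 40, 80, 160]
print(sum(waits))  # 310
```

Under this assumption, the default of 5 retries corresponds to a little over five minutes of total waiting, and each additional retry roughly doubles that figure.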

Use

Run job from application

When the above configuration has been completed, you can select the running environment through the application.

Note

When a Databricks job fails, the failure is reported immediately in the Trifacta Application. In the background, the job logs are collected from Databricks and may not be immediately available.

See Run Job Page.

Run job via API

You can use API calls to execute jobs.

Please make sure that the request body contains the following:

    "execution": "databricksSpark",

For more information, see https://api.trifacta.com/ee/9.7/index.html#operation/runJobGroup
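For orientation, a request body containing the setting above might be built as follows. This is a sketch under assumptions: the `wrangledDataset` object and the placement of `execution` under `overrides` reflect a typical job-run request and are not confirmed by this page, and the recipe identifier `7` is a made-up example. Check the API reference linked above for the authoritative schema.

```python
import json


def build_run_job_body(recipe_id: int) -> dict:
    """Build a job-run request body that selects the Databricks running environment."""
    return {
        # Assumed field names; verify against the API reference.
        "wrangledDataset": {"id": recipe_id},
        "overrides": {"execution": "databricksSpark"},
    }


body = build_run_job_body(7)  # 7 is a hypothetical recipe identifier
print(json.dumps(body, indent=2))
```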

Troubleshooting

Spark job on Azure Databricks fails with "Invalid spark version" error

When running a job using Spark on Azure Databricks, the job may fail with the above invalid version error. In this case, the Databricks version of Spark has been deprecated.

Solution:

Since an Azure Databricks cluster is created for each user, the solution is to identify the cluster version to use, configure the platform to use it, and then restart the platform.

You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.
Acquire the value for databricks.sparkVersion.
In Azure Databricks, compare your value to the list of supported Azure Databricks versions. If your version is unsupported, identify a new version to use.
Note
Please make note of the version of Spark supported for the version of Azure Databricks that you have chosen.
In the Designer Cloud Powered by Trifacta platform configuration, set databricks.sparkVersion to the new version to use.
Note
The value for spark.version does not apply to Databricks.
Restart the Designer Cloud Powered by Trifacta platform.
After the platform restarts, a new Azure Databricks cluster is created with the specified values for each user when that user runs a job.

Spark job fails with "spark scheduler cannot be cast" error

When you run a job on Databricks, the job may fail with the following error:

java.lang.ClassCastException: org.apache.spark.scheduler.ResultTask cannot be cast to org.apache.spark.scheduler.Task
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:616)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

The job.log file may contain something similar to the following:

2022-07-19T15:41:24.832Z - [sid=0cf0cff5-2729-4742-a7b9-4607ca287a98] - [rid=83eb9826-fc3b-4359-8e8f-7fbf77300878] - [Async-Task-9] INFO com.trifacta.databricks.spark.JobHelper - Got error org.apache.spark.SparkException: Stage 0 failed. Error: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, ip-10-243-149-238.eu-west-1.compute.internal, executor driver): java.lang.ClassCastException: org.apache.spark.scheduler.ResultTask cannot be cast to org.apache.spark.scheduler.Task
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:616)
...

This error is due to a class mismatch between the Designer Cloud Powered by Trifacta platform and Databricks.

Solution:

The solution is to stop the Spark JARs provided by the Designer Cloud Powered by Trifacta platform from taking precedence over the Databricks Spark JARs. Please perform the following steps:

To apply this configuration change, log in as an administrator to the Trifacta node. Then, edit trifacta-conf.json. For more information, see Platform Configuration Methods.

Locate the spark.props section and add the following configuration elements:

"spark": {
    ...
    "props": {
      "spark.driver.userClassPathFirst": false,
      "spark.executor.userClassPathFirst": false,
      ...
    }
},

Save your changes and restart the platform.
