This section describes how to configure the Designer Cloud Powered by Trifacta® platform to integrate with Databricks hosted in Azure.
- Azure Databricks is an Apache Spark implementation that has been optimized for use on the Azure platform. For more information, see https://databricks.com/product/azure.
NOTE: You cannot integrate with existing Azure Databricks clusters.
Pre-requisites
The Designer Cloud Powered by Trifacta platform must be deployed in Microsoft Azure.
NOTE: If you are using Azure Databricks as a datasource, verify that OpenJDK v1.8.0_212 or earlier is installed on the Alteryx node. Java 8 is required. If necessary, downgrade the Java version and restart the platform. There is a known issue with TLS v1.3.
Simba driver
Beginning in Release 7.1, the Azure Databricks integration uses a Simba driver to connect to Spark, instead of the previous Hive-based driver.
NOTE: If you have upgraded from a release before Release 7.1, you should review the Connect String Options in your Azure Databricks connections, such as Databricks Tables. These options may not work with the Simba driver.
No installation is required to use this new driver.
For more information on the Simba driver, see https://databricks.com/spark/odbc-driver-download.
Limitations
NOTE: If you are using Azure AD to integrate with an Azure Databricks cluster, the Azure AD secret value stored in azure.secret must begin with an alphanumeric character. This is a known issue.
- The Designer Cloud Powered by Trifacta platform must be installed on Microsoft Azure.
- Nested folders are not supported when running jobs from Azure Databricks.
- When a job is started and no cluster is available, a cluster is initiated, which can take up to four minutes. If the job is canceled during cluster startup:
  - The job is terminated, and the cluster remains.
  - The job is reported in the application as Failed, instead of Canceled.
- Azure Databricks integration works with Spark 2.4.x only.
  NOTE: The version of Spark for Azure Databricks must be applied to the platform configuration through the databricks.sparkVersion property. Details are provided later.
- Azure Databricks integration does not work with Hive.
Supported versions of Azure Databricks
Supported versions:
- Azure Databricks 6.x
- Azure Databricks 5.5 LTS
NOTE: Some versions of Azure Databricks 6.x have already reached End of Life. Customers should upgrade to a version of Azure Databricks that is supported by the vendor. These EOL versions are likely to be deprecated from support by the Designer Cloud Powered by Trifacta platform in a subsequent release.
NOTE: When an Azure Databricks version is deprecated, it is no longer available for selection when creating a new cluster. As a result, if a new cluster needs to be created for whatever reason, it must be created using a version supported by Azure Databricks, which may require you to change the value of the Spark version settings in the Designer Cloud Powered by Trifacta platform. See "Troubleshooting" for more information.
For more information on supported versions, see https://docs.databricks.com/release-notes/runtime/releases.html#supported-databricks-runtime-releases-and-support-schedule.
Job limits
By default, the number of jobs permitted on an Azure Databricks cluster is set to 1000.
- The number of jobs that can be created per workspace in an hour is limited to 1000.
  - If Databricks Job Management is enabled in the platform, then this limit is raised to 5000 by using the run-submit API. For more information, see "Configure Databricks job management" below.
  - For more information, see https://docs.databricks.com/dev-tools/api/latest/jobs.html#run-now.
- These limits apply to any jobs run for workspace data on the cluster.
- The number of concurrently active job runs in a workspace is limited to 150.
Managing limits:
To enable retrieval and auditing of job information after a job has completed, the Designer Cloud Powered by Trifacta platform does not delete jobs from the cluster. As a result, jobs can accumulate over time and exceed the number of jobs permitted on the cluster. If you reach these limits, you may receive a "Quota for number of jobs has been reached" error. For more information, see https://docs.databricks.com/user-guide/jobs.html.
Optionally, you can allow the Designer Cloud Powered by Trifacta platform to manage your jobs to avoid these limitations. For more information, see "Configure Databricks job management" below.
Create Cluster
NOTE: Integration with pre-existing Azure Databricks clusters is not supported.
When a user first requests access to Azure Databricks, a new Azure Databricks cluster is created for the user. Access can include a request to run a job or to browse Databricks Tables. Cluster creation may take a few minutes.
A new cluster is also created when a user launches a job after:
- The Azure Databricks configuration properties or Spark properties are changed in platform configuration.
- A JAR file is updated on the Alteryx node.
A user's cluster automatically terminates after a configurable time period. A new cluster is automatically created when the user next requests access to Azure Databricks. See "Configure Platform" below.
Enable
To enable Azure Databricks, please perform the following configuration changes.
Steps:
- You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.
- Locate the following parameters. Set them to the values listed below, which enable the Trifacta Photon (smaller jobs) and Azure Databricks (small to extra-large jobs) running environments:

"webapp.runInTrifactaServer": true,
"webapp.runInDatabricks": true,
"webapp.runWithSparkSubmit": false,
"webapp.runInEMR": false,
"webapp.runInDataflow": false,
- Do not save your changes until you have completed the following configuration section.
Configure
Configure Platform
Please review and modify the following configuration settings.
NOTE: When you have finished modifying these settings, save them and restart the platform to apply.
Parameter | Description | Value |
---|---|---|
feature.parameterization.matchLimitOnSampling.databricksSpark | Maximum number of parameterized source files that are permitted for matching in a single dataset with parameters. | |
databricks.workerNodeType | Type of node to use for the Azure Databricks Workers/Executors. There are 1 or more Worker nodes per cluster. | Default: NOTE: This property is unused when instance pooling is enabled. For more information, see Configure instance pooling below. For more information, see the sizing guide for Azure Databricks. |
databricks.sparkVersion | Azure Databricks cluster version, which also includes the Spark version. | Set this property to the value that corresponds to your version of Azure Databricks. Do not use other values. For more information, see Configure for Spark. |
databricks.serviceUrl | URL to the Azure Databricks Service where Spark jobs will be run (Example: https://westus2.azuredatabricks.net) | |
databricks.minWorkers | Initial number of Worker nodes in the cluster, and also the minimum number of Worker nodes that the cluster can scale down to during auto-scale-down | Minimum value: Increasing this value can increase compute costs. |
databricks.maxWorkers | Maximum number of Worker nodes the cluster can create during auto scaling | Minimum value: Not less than Increasing this value can increase compute costs. |
databricks.poolId | If you have enabled instance pooling in Azure Databricks, you can specify the pool identifier here. For more information, see Configure instance pooling below. | NOTE: If both poolId and poolName are specified, poolId is used first. If that fails to find a matching identifier, then the poolName value is checked. |
databricks.poolName | If you have enabled instance pooling in Azure Databricks, you can specify the pool name here. For more information, see Configure instance pooling below. | See previous. Tip: If you specify a poolName value only, then you can use the instance pools with the same poolName available across multiple Databricks workspaces when you create a new cluster. |
databricks.driverNodeType | Type of node to use for the Azure Databricks Driver. There is only 1 Driver node per cluster. | Default: For more information, see the sizing guide for Databricks. NOTE: This property is unused when instance pooling is enabled. For more information, see Configure instance pooling below. |
databricks.logsDestination | DBFS location that cluster logs will be sent to every 5 minutes | Leave this value as /trifacta/logs . |
databricks.enableAutotermination | Set to true to enable auto-termination of a user cluster after N minutes of idle time, where N is the value of the autoterminationMinutes property. | Unless otherwise required, leave this value as true . |
databricks.clusterStatePollerDelayInSeconds | Number of seconds to wait between polls for Azure Databricks cluster status when a cluster is starting up | |
databricks.clusterStartupWaitTimeInMinutes | Maximum time in minutes to wait for a Cluster to get to Running state before aborting and failing an Azure Databricks job | |
databricks.clusterLogSyncWaitTimeInMinutes | Maximum time in minutes to wait for a Cluster to complete syncing its logs to DBFS before giving up on pulling the cluster logs to the Alteryx node. | Set this to 0 to disable cluster log pulls. |
databricks.clusterLogSyncPollerDelayInSeconds | Number of seconds to wait between polls for a Databricks cluster to sync its logs to DBFS after job completion | |
databricks.autoterminationMinutes | Idle time in minutes before a user cluster will auto-terminate. | Do not set this value to less than the cluster startup wait time value. |
spark.useVendorSparkLibraries | When set to true, the Spark libraries provided by the vendor are used instead of the Spark libraries provided with the platform. | NOTE: This setting is ignored. The vendor Spark libraries are always used for Azure Databricks. |
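For reference, here is a minimal sketch of how a subset of these settings might appear in trifacta-conf.json, assuming that the dotted property names map to nested JSON keys in the same way as the spark.props example later on this page. The bracketed and numeric values are placeholders only; choose values according to the defaults and sizing guidance in the table above.

"databricks": {
  "serviceUrl": "https://westus2.azuredatabricks.net",
  "sparkVersion": "<databricks-runtime-version>",
  "workerNodeType": "<azure-vm-instance-type>",
  "driverNodeType": "<azure-vm-instance-type>",
  "minWorkers": 2,
  "maxWorkers": 8,
  "logsDestination": "/trifacta/logs",
  "enableAutotermination": true,
  "autoterminationMinutes": 60
},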
Configure instance pooling
Instance pooling reduces cluster node spin-up time by maintaining a set of idle and ready instances. The Designer Cloud Powered by Trifacta platform can be configured to leverage instance pooling on the Azure Databricks cluster.
Pre-requisites:
- All cluster nodes used by the Designer Cloud Powered by Trifacta platform are taken from the pool. If the pool has an insufficient number of nodes, cluster creation fails.
- Each user must have access to the pool and must have at least the ATTACH_TO permission.
- Each user must have a personal access token from the same Azure Databricks workspace. See Configure personal access token below.
To enable:
Acquire your pool identifier or pool name from Azure Databricks.
NOTE: You can use either the Databricks pool identifier or pool name. If both poolId and poolName are specified, poolId is used first. If that fails to find a matching identifier, then the poolName value is checked.
Tip: If you specify a poolName value only, then you can run your Databricks jobs against the available clusters across multiple Databricks workspaces. This mechanism allows for better resource allocation and broader execution options.
- You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.
- Set either of the following parameters:
Set the following parameter to the Azure Databricks pool identifier:
"databricks.poolId": "<my_pool_id>",
Or, you can set the following parameter to the Azure Databricks pool name:
"databricks.poolName": "<my_pool_name>",
- Save your changes and restart the platform.
NOTE: When instance pooling is enabled, the following parameters are not used:
databricks.driverNodeType
databricks.workerNodeType
For more information, see https://docs.azuredatabricks.net/clusters/instance-pools/index.html.
Configure Databricks job management
Azure Databricks enforces a hard limit of 1000 created jobs per workspace, and by default cluster jobs are not deleted. To support more than 1000 jobs per cluster, you can enable job management for Azure Databricks.
NOTE: This feature covers the deletion of the job definition on the cluster, which counts toward the enforced limits. The Designer Cloud Powered by Trifacta platform never deletes the outputs of a job or the job definition stored in the platform. When cluster job definitions are removed, the jobs remain listed in the Jobs page, and job metadata is still available. There is no record of the job on the Azure Databricks cluster. Jobs continue to run, but users on the cluster may not be aware of them.
Tip: Regardless of your job management option, when you hit the limit for the number of job definitions that can be created on the cluster, the platform automatically falls back to using the run-submit API, which does not create a job definition on the Azure Databricks cluster. You can still access your job results and outputs through the Designer Cloud Powered by Trifacta platform, but there is no record of the job on the Databricks cluster.
Steps:
- You apply this change through the Workspace Settings Page. For more information, see Platform Configuration Methods.
Locate the following property and set it to one of the values listed below:
Databricks Job Management
Property Value | Description |
---|---|
Never Delete (default) | Job definitions are never deleted from the Azure Databricks cluster. |
Always Delete | The Azure Databricks job definition is deleted during the clean-up phase, which occurs after a job completes. |
Delete Successful Only | When a job completes successfully, the Azure Databricks job definition is deleted during the clean-up phase. Failed or canceled jobs are not deleted, which allows you to debug as needed. |
Skip Job Creation | For jobs that are to be executed only one time, the Designer Cloud Powered by Trifacta platform can be configured to use a different mechanism for submitting the job. When this option is enabled, the Designer Cloud Powered by Trifacta platform submits jobs using the run-submit API, instead of the run-now API. The run-submit API does not create an Azure Databricks job definition, so the submitted job does not count toward the enforced job limit. |
Default | Inherits the default system-wide setting. |
- Save your changes and restart the platform.
Configure personal access token
Each user must insert a Databricks Personal Access Token to access Databricks resources. For more information, see Databricks Personal Access Token Page.
Combine transform and profiling for Spark jobs
When profiling is enabled for a Spark job, the transform and profiling tasks are combined by default. As needed, you can separate these two tasks. Publishing behaviors vary depending on the approach. For more information, see Configure for Spark.
Additional Configuration
Enable SSO for Azure Databricks
To enable SSO authentication with Azure Databricks, you enable SSO integration with Azure AD. For more information, see Configure SSO for Azure AD.
Enable Azure Managed Identity access
For enhanced security, you can configure the Designer Cloud Powered by Trifacta platform to use an Azure Managed Identity. When this feature is enabled, the platform queries the Key Vault for the secret holding the applicationId and secret to the service principal that provides access to the Azure services.
NOTE: This feature is supported for Azure Databricks only.
NOTE: Your Azure Key Vault must already be configured, and the applicationId and secret must be available in the Key Vault. See Configure for Azure.
To enable this feature, the following parameters for the Designer Cloud Powered by Trifacta platform must be specified.
You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.
Parameter | Description |
---|---|
azure.managedIdentities.enabled | Set to true to enable use of Azure managed identities. |
azure.managedIdentities.keyVaultApplicationidSecretName | Specify the name of the Azure Key Vault secret that holds the service principal Application Id. |
azure.managedIdentities.keyVaultApplicationSecretSecretName | Specify the name of the Key Vault secret that holds the service principal secret. |
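As a sketch only, and assuming that these dotted property names map to nested JSON keys in trifacta-conf.json, the managed identity settings might look like the following. The secret names are placeholders; use the names of the secrets that you created in your Azure Key Vault.

"azure": {
  "managedIdentities": {
    "enabled": true,
    "keyVaultApplicationidSecretName": "<application-id-secret-name>",
    "keyVaultApplicationSecretSecretName": "<service-principal-secret-name>"
  }
},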
Save your changes.
Pass additional Spark properties
As needed, you can pass additional properties to the Spark running environment through the spark.props
configuration area.
NOTE: These properties are passed to Spark for all jobs.
Steps:
- You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.
- Search for the following property: spark.props.
- Insert new Spark properties. For example, you can specify the spark.props.spark.executor.memory property, which changes the memory allocated to the Spark executor on each node, by using the following in the spark.props area:

"spark": {
  ...
  "props": {
    "spark.executor.memory": "6GB"
  }
  ...
}
- Save your changes and restart the platform.
For more information on modifying these settings, see Configure for Spark.
Use
Run job from application
When the above configuration has been completed, you can select the running environment through the application. See Run Job Page.
Run job via API
You can use API calls to execute jobs.
Please make sure that the request body contains the following:
"execution": "databricksSpark",
For more information, see https://api.trifacta.com/ee/7.6/index.html#operation/runJobGroup
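As an illustration only, a request body for the runJobGroup endpoint might look similar to the following sketch. The dataset identifier is a placeholder, and the exact placement of the execution setting should be confirmed against the linked API reference.

{
  "wrangledDataset": {
    "id": "<id-of-the-recipe-to-run>"
  },
  "overrides": {
    "execution": "databricksSpark"
  }
}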
Troubleshooting
Spark job on Azure Databricks fails with "Invalid spark version" error
When running a job using Spark on Azure Databricks, the job may fail with the above invalid version error. In this case, the Databricks version of Spark has been deprecated.
Solution:
Since an Azure Databricks cluster is created for each user, the solution is to identify the cluster version to use, configure the platform to use it, and then restart the platform.
- You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.
- Acquire the value for databricks.sparkVersion. In Azure Databricks, compare your value to the list of supported Azure Databricks versions. If your version is unsupported, identify a new version to use.
NOTE: Please make note of the version of Spark supported for the version of Azure Databricks that you have chosen.
- In the Designer Cloud Powered by Trifacta platform configuration:
  - Set databricks.sparkVersion to the new version to use.
  - Set spark.version to the appropriate version of Spark to use, as shown in the sketch after these steps.
- Restart the Designer Cloud Powered by Trifacta platform.
- After the restart, a new Azure Databricks cluster is created for each user with the specified values when the user next runs a job.
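As a sketch, and assuming that these properties map to nested JSON keys in trifacta-conf.json, the two settings from the steps above might look like the following. The values are placeholders; use a runtime version that Azure Databricks currently supports and the Spark version that corresponds to it.

"databricks": {
  "sparkVersion": "<supported-databricks-runtime-version>"
},
"spark": {
  "version": "<matching-spark-version>"
},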
Spark job fails with "spark scheduler cannot be cast" error
When you run a job on Databricks, the job may fail with the following error:
java.lang.ClassCastException: org.apache.spark.scheduler.ResultTask cannot be cast to org.apache.spark.scheduler.Task
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:616)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
The job.log file may contain something similar to the following:
2022-07-19T15:41:24.832Z - [sid=0cf0cff5-2729-4742-a7b9-4607ca287a98] - [rid=83eb9826-fc3b-4359-8e8f-7fbf77300878] - [Async-Task-9] INFO com.trifacta.databricks.spark.JobHelper - Got error org.apache.spark.SparkException: Stage 0 failed. Error: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, ip-10-243-149-238.eu-west-1.compute.internal, executor driver): java.lang.ClassCastException: org.apache.spark.scheduler.ResultTask cannot be cast to org.apache.spark.scheduler.Task at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:616) ...
This error is due to a class mismatch between the Designer Cloud Powered by Trifacta platform and Databricks.
Solution:
The solution is to disable the precedence of the Spark JARs provided by the Designer Cloud Powered by Trifacta platform over the Databricks Spark JARs. Please perform the following steps:
- To apply this configuration change, log in as an administrator to the Alteryx node. Then, edit trifacta-conf.json. For more information, see Platform Configuration Methods.
- Locate the spark.props section and add the following configuration elements:

"spark": {
  ...
  "props": {
    "spark.driver.userClassPathFirst": false,
    "spark.executor.userClassPathFirst": false,
    ...
  }
},
- Save your changes and restart the platform.