This section describes how to configure the Designer Cloud Powered by Trifacta® platform to integrate with Databricks hosted in Azure. 

NOTE: For each user, a separate cluster is created. It may take a few minutes to spin up a new cluster.

Prerequisites

  • The Designer Cloud Powered by Trifacta platform must be deployed in Microsoft Azure.  

Limitations

NOTE: If you are using Azure AD to integrate with an Azure Databricks cluster, the Azure AD secret value stored in azure.secret must begin with an alphanumeric character. This is a known issue.

  • Nested folders are not supported when running jobs from Azure Databricks.
  • When a job is started and no cluster is available, a cluster is initiated, which can take up to four minutes. If the job is canceled during cluster startup:
    • The job is terminated, and the cluster remains. 
    • The job is reported in the application as Failed, instead of Canceled.
  • For ADLS integration, Azure Databricks integration does not support ADL user-mode authentication. Only system-mode authentication is supported.
  • Azure Databricks integration works with Spark 2.4.0 only. 
  • Azure Databricks integration does not work with Hive.
  • Use of partitioned tables in Azure Databricks is not supported.

Enable

To enable Azure Databricks, please perform the following configuration changes. 

Steps:

  1. You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.
  2. Locate the following parameters. Set them to the values listed below, which enable the Photon running environment (smaller jobs) and the Azure Databricks running environment (small to extra-large jobs):

    "webapp.runInTrifactaServer": true,
    "webapp.runInDatabricks": true,
    "webapp.runInHadoop": false,
    "webapp.runinEMR": false,
    "webapp.runInDataflow": false,
    "photon.enabled": true,
  3. Do not save your changes until you have completed the following configuration section.

Configure

Configure Platform

Please review and modify the following configuration settings.

NOTE: When you have finished modifying these settings, save them and restart the platform to apply.

  • feature.parameterization.maxNumberOfFilesForExecution.databricksSpark
    Maximum number of parameterized source files that are permitted to be executed as part of an Azure Databricks job.
  • feature.parameterization.matchLimitOnSampling.databricksSpark
    Maximum number of parameterized source files that are permitted for matching in a single dataset with parameters.
  • databricks.workerNodeType
    Type of node to use for the Azure Databricks Workers/Executors. There are one or more Worker nodes per cluster.
    Default: Standard_D3_v2. For more information, see the sizing guide for Azure Databricks.
  • databricks.sparkVersion
    Azure Databricks cluster version, which also includes the Spark version.
    Value: Please do not change.
  • databricks.serviceUrl
    URL to the Azure Databricks service where Spark jobs will be run. Example: https://westus2.azuredatabricks.net
  • databricks.minWorkers
    Initial number of Worker nodes in the cluster, and also the minimum number of Worker nodes that the cluster can scale down to during auto-scale-down.
    Minimum value: 1. Increasing this value can increase compute costs.
  • databricks.maxWorkers
    Maximum number of Worker nodes the cluster can create during auto-scaling.
    Minimum value: not less than databricks.minWorkers. Increasing this value can increase compute costs.
  • databricks.logsDestination
    DBFS location to which cluster logs are sent every five minutes.
    Value: Leave this value as /trifacta/logs.
  • databricks.enableAutotermination
    Set to true to enable auto-termination of a user cluster after N minutes of idle time, where N is the value of the autoterminationMinutes property.
    Value: Unless otherwise required, leave this value as true.
  • databricks.driverNodeType
    Type of node to use for the Azure Databricks Driver. There is only one Driver node per cluster.
    Default: Standard_D3_v2. For more information, see the sizing guide for Azure Databricks.
  • databricks.clusterStatePollerDelayInSeconds
    Number of seconds to wait between polls for Azure Databricks cluster status when a cluster is starting up.
  • databricks.clusterStartupWaitTimeInMinutes
    Maximum time in minutes to wait for a cluster to reach the Running state before aborting and failing an Azure Databricks job.
  • databricks.clusterLogSyncWaitTimeInMinutes
    Maximum time in minutes to wait for a cluster to finish syncing its logs to DBFS before giving up on pulling the cluster logs to the Alteryx node.
    Value: Set this to 0 to disable cluster log pulls.
  • databricks.clusterLogSyncPollerDelayInSeconds
    Number of seconds to wait between polls for a Databricks cluster to sync its logs to DBFS after job completion.
  • databricks.autoterminationMinutes
    Idle time in minutes before a user cluster auto-terminates.
    Value: Do not set this value to less than the value of databricks.clusterStartupWaitTimeInMinutes.
  • spark.useVendorSparkLibraries
    When true, the platform bypasses shipping its installed Spark libraries to the cluster with each job's execution.
    Default: false. Do not modify unless you are experiencing failures in Azure Databricks job execution. For more information, see Troubleshooting below.
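For reference, a minimal sketch of how a few of these settings might look is shown below, using the same flat key style as the Enable snippet above. The worker counts and autotermination timeout are illustrative values only; choose values appropriate to your environment and subscription:

    "databricks.serviceUrl": "https://westus2.azuredatabricks.net",
    "databricks.workerNodeType": "Standard_D3_v2",
    "databricks.driverNodeType": "Standard_D3_v2",
    "databricks.minWorkers": 2,
    "databricks.maxWorkers": 8,
    "databricks.logsDestination": "/trifacta/logs",
    "databricks.enableAutotermination": true,
    "databricks.autoterminationMinutes": 30,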

Configure Personal Access Token

After the above configuration has been performed, each user must insert their personal access token into their User Settings page. The token enables the user to authenticate to the Azure Databricks REST APIs, which is required to run jobs.

NOTE: Each user must apply a personal access token to their User Profile. Users who do not provide a personal access token cannot run jobs on Azure Databricks, including transformation, sampling, and profiling jobs.

Steps:

  1. Acquire your Azure Databricks personal access token. For more information, see https://docs.azuredatabricks.net/api/latest/authentication.html#requirements.
  2. Log in to the application. From the menu bar, select Settings menu > Settings.
  3. Click Databricks.
  4. In the Databricks Personal Access Token field, paste your token.

    Figure: Databricks user configuration

  5. Click Save.

Azure Databricks personal access tokens are saved in Azure Key Vault.
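If you want to confirm that a token is valid before saving it, one option is a direct call to the Azure Databricks REST API. This check is optional and outside the product workflow; the region URL below is the example value used earlier on this page:

    # Optional, out-of-band token check. Substitute your own region URL and token.
    # A valid token returns a JSON list of clusters; an invalid token returns a 403 error.
    curl -H "Authorization: Bearer <your-personal-access-token>" \
      https://westus2.azuredatabricks.net/api/2.0/clusters/list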

Use

Run job from application

When the above configuration has been completed, you can select the running environment through the application. See Run Job Page.

Run job via CLI

You can run jobs on Azure Databricks via the CLI. When executing, the job_type parameter must be set to databricksSpark; an illustrative invocation is sketched below. See CLI for Jobs.
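The following is an illustrative sketch only; the script name and flags are assumptions, and all values are placeholders. Confirm the exact syntax for your release in CLI for Jobs:

    # Illustrative sketch; confirm the script name and flags in CLI for Jobs.
    # All values below are placeholders.
    python trifacta_cli.py run_job \
      --user_name <trifacta_user> \
      --password <trifacta_password> \
      --job_type databricksSpark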

Run job via API

You can use API calls to execute jobs. When executing, the request body might look like the following. Please note the value for execution:

{  "wrangledDataset": {
    "id": 1
  },
  "overrides": {
    "execution": "databricksSpark",
    "profiler": false,
    "writesettings": [
      {
        "path": "mypath/cdr_txt.csv",
        "action": "create",
        "format": "csv",
        "compression": "none",
        "header": false,
        "asSingleFile": false
      }
    ]
  },
}

See API JobGroups Create v3.
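As a hedged example of submitting this request body, assuming the endpoint path implied by the v3 API naming, with placeholder host and credentials:

    # Assumes the request body above is saved as request-body.json.
    # The /v3/jobGroups path, host, and credentials are placeholders; see
    # API JobGroups Create v3 for the authoritative endpoint and authentication details.
    curl -X POST https://<platform-host>/v3/jobGroups \
      -H "Content-Type: application/json" \
      -u <user>:<password> \
      -d @request-body.json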

Troubleshooting

Spark job on Azure Databricks fails with "Invalid spark version 4.1.x-scala2.11" error

When running a job using Spark on Azure Databricks, the job may fail with the above invalid version error. This error indicates that the configured Azure Databricks cluster version, which pins the Spark version, has been deprecated.

Solution:

Since an Azure Databricks cluster is created for each user, the solution is to identify the cluster version to use, configure the platform to use it, and then restart the platform.

  1. You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.
  2. Acquire the value for databricks.sparkVersion.
  3. In Azure Databricks, compare your value to the list of supported Azure Databricks versions. If your version is unsupported, identify a new version to use. 

    NOTE: Please make note of the version of Spark supported for the version of Azure Databricks that you have chosen.

  4. In the Designer Cloud Powered by Trifacta platform configuration:
    1. Set databricks.sparkVersion to the new version to use.
    2. Set spark.version to the appropriate version of Spark to use.
  5. Restart the Designer Cloud Powered by Trifacta platform.
  6. After the restart, a new Azure Databricks cluster is created for each user with the specified values the next time that user runs a job.
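For example, if the supported-versions list pointed you to an Azure Databricks 5.3 runtime, which bundles Spark 2.4, the paired settings might look like the following. The runtime identifier is illustrative; copy the exact string from the supported-versions list:

    "databricks.sparkVersion": "5.3.x-scala2.11",
    "spark.version": "2.4.0",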
