AWS Databricks Admin Setup

After you've completed the initial Databricks workspace configuration, follow this setup guide to provision AWS Databricks workspaces for Alteryx Analytics Cloud (AAC) users.

Important

You must first configure your base storage environment to S3 and disable ADS before setting up Databricks workspaces. Go to AWS as Private Data Storage to learn more.

Workspace Details

Enter a unique Workspace Name under Workspace Details. The Service URL automatically populates.

Cluster for Spark Jobs

AAC uses this cluster configuration when it schedules imports from or publishes to Databricks via Spark Jobs. AAC creates a new cluster based on these details:

  1. Select a Cluster Policy. This defines the limits on the attributes available during cluster creation. For an unrestricted policy (default), leave this option blank. Refer to the later Cluster Policy Requirements section for details on these options.

  2. Select either Private Data Storage Credentials or Instance Profile as the S3 Auth Mode. This determines how the cluster authenticates to AWS S3. Refer to the later S3 Access Configuration section for details on these options.

  3. If you selected Instance Profile, enter an Instance Profile ARN.

  4. Select a Driver Node Type.

  5. Select a Worker Node Type. The driver and worker node types are the AWS EC2 instance types used to launch cluster nodes. If you have a pool available, you can select it instead of standalone instances. To reduce workflow job run latency, use pools with a reasonable number of warm, idle instances.

  6. Enter the Minimum Workers and Maximum Workers for the Databricks job cluster. Every cluster starts with the minimum number of workers provisioned. More workers dynamically add to the cluster if required based on workload, up to the maximum.
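
These selections roughly correspond to a Databricks job-cluster specification like the sketch below. Field names follow the Databricks Clusters API; the node types, worker counts, and instance profile ARN are example values only (the instance profile applies only if you chose the Instance Profile auth mode), and the exact specification AAC generates isn't shown here.

{
  "spark_version": "12.2.x-scala2.12",
  "driver_node_type_id": "i3.xlarge",
  "node_type_id": "i3.xlarge",
  "autoscale": {
    "min_workers": 2,
    "max_workers": 8
  },
  "aws_attributes": {
    "instance_profile_arn": "arn:aws:iam::<account-id>:instance-profile/<INSTANCE_PROFILE>",
    "availability": "SPOT_WITH_FALLBACK",
    "first_on_demand": 1
  }
}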

Cluster Policy Requirements

With Databricks, you can create policies with a specific set of restrictions on cluster configurations. Select one of these policies from the Cluster Policy dropdown. You can choose between Unrestricted Policy or Other Policy.

Unrestricted Policy (Recommended): The unrestricted policy places no limits on the cluster configurations you can define. This is the default policy.

Other Policy: If you select a policy from the dropdown, make sure that it permits the chosen cluster configuration. AAC also applies some default configurations when it creates a job cluster, so the selected policy must allow the following defaults:

{
	"spark_version": "12.2.x-scala2.12",
	"runtime_engine": "photon",
	"cluster_type": "JOB",
	"cluster_log_conf.path": "/trifacta/logs",
	"autotermination_minutes": 60,
	"enable_local_disk_encryption": false,
	"aws_attributes.availability": "SPOT_WITH_FALLBACK",
	"aws_attributes.ebs_volume_count": 0,
	"aws_attributes.ebs_volume_size": 0,
	"aws_attributes.ebs_volume_type": "NONE",
	"aws_attributes.first_on_demand": 1,
	"aws_attributes.spot_bid_price_percent": 100
}

Note

The default configuration provided might change with future releases, so we recommend that you don't define any of these defaults in the cluster policy.

During Databricks workspace creation, AAC performs basic cluster policy validation, but the actual validation takes place during job execution. If the configuration doesn’t match the cluster policy, the Databricks job will fail with a validation error indicating a configuration mismatch.
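
If you do select a restrictive policy, one way to keep it compatible is to constrain only the attributes you need to govern and leave the defaults listed above unpinned. The following is a minimal sketch of such a policy definition using standard Databricks policy attribute types; the node-type allowlist and limits are example choices, not AAC requirements:

{
  "node_type_id": {
    "type": "allowlist",
    "values": ["i3.xlarge", "i3.2xlarge", "r5.xlarge"]
  },
  "autotermination_minutes": {
    "type": "range",
    "maxValue": 120
  },
  "dbus_per_hour": {
    "type": "range",
    "maxValue": 50
  }
}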

S3 Access Configuration

To support workflow runs with any sources or destinations that aren't Databricks tables, Databricks clusters require access to the S3 bucket configured in AAC as the default storage bucket. There are two ways to provide this access:

Instance Profile

When you select this mode, it is mandatory for the admin to also select an Instance Profile ARN. The Instance Profile ARN MUST have read and write access to the S3 bucket that AAC uses as the default storage bucket.

To configure an instance profile with the required permissions, go to the Databricks tutorial.
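
For reference, the IAM role behind the instance profile typically needs S3 permissions along these lines. The bucket name below is a placeholder for your AAC default storage bucket, and your security team may scope the actions and resources further:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket", "s3:GetBucketLocation"],
      "Resource": "arn:aws:s3:::<default-storage-bucket>"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
      "Resource": "arn:aws:s3:::<default-storage-bucket>/*"
    }
  ]
}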

This is a secure and recommended option for authorizing S3 access to Databricks: No sensitive S3 credentials exchange between AAC and Databricks.

Private Data Storage Credentials

When you select this mode, AAC attempts to use the same credentials that you provided for private data storage:

  • If you configured AAC with an AWS key-secret…

    • You don’t need additional configuration. The key and secret pass to the Databricks cluster. The job uses the key and secret to access the S3 bucket.

Caution

This is NOT a recommended method for authorizing S3 access to Databricks.

  • If you configured AAC with an AWS cross-account IAM role…

    • When using a cross-account IAM role, it’s mandatory for the admin to also select an Instance Profile ARN. AAC uses the identity of the instance profile to securely impersonate the configured IAM role. To configure an instance profile with the required permissions, go to the Databricks tutorial. The instance profile doesn’t need S3 access permissions. However, it does require permission to assume any cross-account IAM role associated with AAC. Use these permissions and trust relationships:

Note

Replace <account-id> with your AWS account ID.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": [
          "arn:aws:iam::<account-id>:role/<ROLE_1>",
          "arn:aws:iam::<account-id>:role/<ROLE_2>",
          "arn:aws:iam::<account-id>:role/<ROLE_3>"
        ]
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

The cross-account role also needs a new trust relationship that allows the instance profile above to assume it. This is in addition to the trust relationship it already requires with AAC. Use these permissions and trust relationships:

Note

Replace <account-id> with your AWS account ID.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "<aws_account_id>"
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringLike": {
          "sts:ExternalId": [
            "<external_id>"
          ]
        }
      }
    },
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::<account-id>:role/<INSTANCE_PROFILE_ROLE>"
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringLike": {
          "sts:ExternalId": [
            "<external_id>"
          ]
        }
      }
    }
  ]
}

This is a secure and recommended option for authorizing S3 access to Databricks: No sensitive S3 credentials exchange between AAC and Databricks.

Cluster for Photon Jobs

This is a long-running cluster required to browse, preview, and import Databricks tables as datasets in AAC. The cluster must meet these requirements to show up as an option:

  • Run in shared-access mode.

  • Use Databricks runtime version 12.2 LTS.
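
For example, a compatible cluster defined through the Databricks Clusters API would include settings along these lines. The cluster name, node type, and worker count are placeholders; the settings that matter are the access mode (USER_ISOLATION corresponds to shared-access mode) and the runtime version, and setting autotermination_minutes to 0 keeps the cluster long-running:

{
  "cluster_name": "aac-databricks-tables",
  "spark_version": "12.2.x-scala2.12",
  "data_security_mode": "USER_ISOLATION",
  "node_type_id": "i3.xlarge",
  "num_workers": 2,
  "autotermination_minutes": 0
}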

Once you've chosen your Photon cluster, select Save.

You've now configured your Databricks workspace for use in AAC.

To Edit or Delete your Databricks workspace, select the 3-dot menu next to your workspace.

Use Databricks for Workflow Execution

After you've configured at least 1 Databricks workspace, you can enable the Databricks runtime for workflows in Admin Console > Settings > Job execution > Spark Engine. This changes the scalable runtime used for executing workflows from EMR Spark to Databricks.

Once you’ve switched the engine, Databricks becomes available as a workflow job run option for users who’ve registered a personal access token against at least 1 Databricks workspace.

When you run a full workflow, AAC launches a dedicated job cluster using the Databricks configuration defined by the admin (for example, the driver and worker node types and the autoscaling configuration). Every workflow job run gets a dedicated cluster. Workflow job run clusters last only for the duration of the run and then terminate automatically. AAC never shares these clusters between users or workflow runs.