Private Data Processing

Private data processing involves running Alteryx Analytics Cloud (AAC) on a data processing cluster inside of your AWS account and VPC. This combination of your infrastructure, together with AWS resources and software managed by Alteryx, is commonly referred to as a private data plane.

This page focuses on the private data processing cluster itself. We first describe what’s going on inside the cluster, then walk through the setup.

Shared Responsibility Model

In the private data handling scenario, Alteryx Analytics Cloud requires clear boundaries of ownership. The shared responsibility matrix represents these boundaries of ownership.

| Area | Customer | Alteryx, Inc. |
| --- | --- | --- |
| AWS Account-Wide Resources | Account Details, IAM Credentials, IAM Policy | Specification |
| VPC | Infrastructure, Subnets, Routing, Endpoints | Specification |
| Cloud Resources | | S3, EKS, IAM Roles, IAM Policies, Secrets Manager, EMR Serverless, EC2 |
| Software | | On-demand Jobs, Long-running Services |

Account-Wide Resources

At the highest level, Alteryx requires a set of permissions to run a private data plane. However, you will own the AWS account, the IAM credentials, and the IAM policy.

Alteryx provides a CloudFormation template that defines the necessary permissions. You can use it to complete this step.

Virtual Private Cloud

At the next level down, Alteryx defines a specification for the VPC. This includes the definition of a number of subnets, CIDR blocks, route tables, and endpoints.

You must implement the VPC according to this spec. Alteryx provides CloudFormation templates for subnet and route table creation, which you can use to complete this step.

Cloud Resources

Once you’ve completed setup of the AWS account and VPC, sign in to Alteryx Analytics Cloud to trigger the provisioning process that creates your private data processing cluster. The list of resources varies depending on which services you enable in the private data plane, but includes temporary storage, a Kubernetes cluster, compute nodes, secret management, and elastic Spark processing.

The Alteryx Analytics Cloud will create and maintain these resources for you using automated provisioning pipelines in Terraform.

Software

After provisioning the needed resources, Alteryx deploys and maintains the software necessary to process your data within the private cluster. This includes a few long-running services and on-demand jobs.

AWS Services

Private Data Handling utilizes a number of AWS services inside the customer VPC to handle data processing tasks. These are the services used:

| Service | Usage |
| --- | --- |
| S3 | Base storage layer. |
| EC2 | Compute resources required to run Alteryx Analytics Cloud services. |
| EKS | Runs the EC2 instances for platform services and jobs in the data plane. |
| Secrets Manager | Storage of infrastructure secrets. |
| IAM Roles | Provide permissions needed by Alteryx Analytics Cloud to manage the necessary AWS resources. |
| IAM Policies | Permissions underlying the IAM roles. |
| VPC and Subnets | Define networking paths between different services. |

Supported Regions

Private data handling is currently available in a variety of regions. In order to provide private data handling in a region, the region must:

  • Support EMR Serverless (available AWS regions).

  • Have 3+ availability zones.

  • Support the specific EKS node types needed.

  • Provide EKS 1.24.

In addition to these regional requirements, Alteryx also needs to replicate the container images to local repositories to improve design-time and runtime performance.

These are the regions where private data handling is available:

| Cloud Provider | Region Code | Region Name |
| --- | --- | --- |
| AWS | ap-east-1 | Asia Pacific (Hong Kong) |
| AWS | ap-northeast-1 | Asia Pacific (Tokyo) |
| AWS | ap-northeast-2 | Asia Pacific (Seoul) |
| AWS | ap-south-1 | Asia Pacific (Mumbai) |
| AWS | ap-southeast-1 | Asia Pacific (Singapore) |
| AWS | ap-southeast-2 | Asia Pacific (Sydney) |
| AWS | ca-central-1 | Canada (Central) |
| AWS | eu-north-1 | Europe (Stockholm) |
| AWS | eu-west-1 | Europe (Ireland) |
| AWS | eu-west-2 | Europe (London) |
| AWS | eu-west-3 | Europe (Paris) |
| AWS | eu-central-1 | Europe (Frankfurt) |
| AWS | sa-east-1 | South America (São Paulo) |
| AWS | us-east-1 | US East (N. Virginia) |
| AWS | us-east-2 | US East (Ohio) |
| AWS | us-west-2 | US West (Oregon) |
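Before setting up the AWS account and VPC, it can help to confirm that the target region is on the list above. The following is a minimal sketch: the set is copied from the table as it appears on this page and may change over time, and the helper name `is_supported_region` is hypothetical, not part of any Alteryx API.

```python
# Region codes copied from the table on this page (may change over time).
SUPPORTED_REGIONS = {
    "ap-east-1", "ap-northeast-1", "ap-northeast-2", "ap-south-1",
    "ap-southeast-1", "ap-southeast-2", "ca-central-1", "eu-north-1",
    "eu-west-1", "eu-west-2", "eu-west-3", "eu-central-1",
    "sa-east-1", "us-east-1", "us-east-2", "us-west-2",
}


def is_supported_region(region_code: str) -> bool:
    """Return True if the region code appears in the table on this page."""
    return region_code.strip().lower() in SUPPORTED_REGIONS


for code in ("us-east-1", "us-west-1"):
    status = "listed" if is_supported_region(code) else "not listed"
    print(f"{code}: {status}")
```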

Software

Alteryx Analytics Cloud runs a number of jobs and services inside the private data plane.

Kubernetes On-demand Jobs

For Kubernetes on-demand jobs, Alteryx Analytics Cloud retrieves a container image (from cache or from a central store) and deploys it within an ephemeral pod that lasts for the duration of the job. All executables are in Java or Python.

  • conversion-jobs: Convert datasets from one format to another as needed within a workflow.

  • connectivity-jobs: Connect to external data systems at runtime.

  • photon-jobs: Photon is an in-memory prep and blend engine at runtime for smaller dataset sizes.

  • file-writer-jobs: Write processed data to the output destination specified within the workflow.

  • automl-jobs: In-memory jobs for Machine Learning used at runtime.

Kubernetes Long-running Services

  • data-service: Connects to external data systems at design-time via the JDBC API. Alteryx developed this service. Snyk scans the image for vulnerabilities.

  • teleport-agent: Sets up a secure way for Alteryx SRE to connect to the cluster for troubleshooting. Alteryx Analytics Cloud pulls the Helm chart from the https://charts.releases.teleport.dev repository. Alteryx doesn't scan this third-party image.

  • datadog-agent: Collects logs and metrics from the cluster. Alteryx Analytics Cloud pulls the Helm chart from the https://helm.datadoghq.com repository. Alteryx doesn't scan this third-party image.

  • keda: Auto-scaling of long-running services based on custom metrics, with Kafka support. Alteryx doesn't scan this third-party image.

  • external-secrets: Imports and exports secrets between AWS Secrets Manager and the Kubernetes Secrets Store. Alteryx doesn't scan this third-party image.

  • cluster-autoscaler: Scale EKS nodes based on pod demand. Alteryx doesn't scan this third-party image.

  • metrics-server: Allow EKS to use the metrics API. Alteryx doesn't scan this third-party image.

  • kubernetes-reflector: Replication of the dockerConfigJson secret across all namespaces. Alteryx doesn't scan this third-party image.

VM Long-running Services

  • Cloud Execution for Desktop (Optional): Cloud-execution-host container service that listens on a message bus for YXZP files uploaded from Designer Desktop to be processed in the data plane. Alteryx developed this service. Snyk scans the image for vulnerabilities.

Provisioning Pipeline

Provisioning a private data plane consists of 2 primary steps:

  1. Creating Cloud Resources

  2. Deploying Software

Creating Cloud Resources

Private data planes use Infrastructure as Code (IaC). Alteryx Analytics Cloud uses Terraform Cloud to manage this. Terraform is an IaC tool that lets you define and manage infrastructure resources through human-readable configuration files. Terraform Cloud is a SaaS product provided by HashiCorp. To create and manage private data handling resources, Alteryx Analytics Cloud uses a set of Terraform files, Terraform Cloud APIs, and private Terraform Cloud agents running on Alteryx infrastructure.

Provisioned AWS Resources

When you enable and provision private data handling through the Cloud Portal, Alteryx Analytics Cloud creates these resources in the AWS infrastructure:

| AWS Service | Purpose of Use | Size/Type | Desired Size [Min–Max] |
| --- | --- | --- | --- |
| S3 | Stores logs and staged temporary files. | < 50 TB | |
| EKS Cluster | Alteryx in-memory processing engine. | Photon NodeGroup, m6i.4xlarge | [1–30] |
| EKS Cluster | Convert datasets from one format to another. | Convert NodeGroup, m6i.4xlarge | [1–30] |
| EKS Cluster | Connect to data sources. | data-system NodeGroup, m6i.4xlarge | [1–30] |
| EKS Cluster | Publish job outputs to their destination. | file-system NodeGroup, m6i.xlarge | [1–30] |
| EKS Cluster | Execute ML jobs. | AutoML NodeGroup, m6i.4xlarge | [1–30] |
| EKS Cluster | Additional common tooling (for example, datadog and teleport) deployed. | t3a.medium | [3–8] |
| VPC | Dedicated VPC to deploy EKS, EC2, and EMR. | | |
| IAM Roles and Policies | Least privileged permissions to run the software. | | |
| Secrets Manager | Store infrastructure encryption keys. | ≈3-6 secrets | |
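The node group ranges above translate into a minimum and maximum number of EC2 instances per instance type, which may be useful when checking account quotas. The following is a rough sketch of that arithmetic. It assumes every node group in the table is enabled; the dictionary keys (including `common-tooling`) are hypothetical labels chosen for illustration.

```python
from collections import defaultdict

# (instance type, min nodes, max nodes) per node group, from the table above.
NODE_GROUPS = {
    "photon": ("m6i.4xlarge", 1, 30),
    "convert": ("m6i.4xlarge", 1, 30),
    "data-system": ("m6i.4xlarge", 1, 30),
    "file-system": ("m6i.xlarge", 1, 30),
    "automl": ("m6i.4xlarge", 1, 30),
    "common-tooling": ("t3a.medium", 3, 8),  # hypothetical label
}


def instance_bounds(groups):
    """Sum the min and max node counts for each instance type."""
    totals = defaultdict(lambda: [0, 0])
    for instance_type, minimum, maximum in groups.values():
        totals[instance_type][0] += minimum
        totals[instance_type][1] += maximum
    return {itype: (low, high) for itype, (low, high) in totals.items()}


bounds = instance_bounds(NODE_GROUPS)
for instance_type, (low, high) in sorted(bounds.items()):
    print(f"{instance_type}: {low} to {high} instances")
print("total:", sum(b[0] for b in bounds.values()), "to",
      sum(b[1] for b in bounds.values()))
```

Under these assumptions the cluster ranges from 8 to 158 nodes in total.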

Deploying Software

You can find container images for on-demand jobs in a private container image repository that the private data plane has access to.

Alteryx deploys and maintains long-running services in your EKS cluster using Argo CD. Argo CD is a declarative, GitOps continuous delivery tool for Kubernetes.

Setup Steps

Data plane provisioning is triggered from the Admin Console inside Alteryx Analytics Cloud. You need Admin privileges within a workspace to see the Admin Console.

  1. From the Alteryx Analytics Cloud homepage, select the profile icon. Select Admin Console from the menu.

  2. From the left navigation panel, select Private Data Handling.

Caution

Modifying or removing any of the AAC-provisioned public cloud resources once Private Data Handling has been provisioned will lead to an inconsistent state. This inconsistency will trigger errors during the job execution or deprovisioning of the Private Data Handling setup.

Step 1: Trigger Private Data Handling Provisioning

Make sure that Private Data Storage shows “Successfully Configured” before proceeding. If the status is “Not Configured,” go to Private Data Storage first, then return to this step.

Under the Private Data Processing section, there are 5 fields to fill out. These values come from completing the steps in Setup AWS Account and VPC.

Select Create to trigger the deployment of the cluster and resources in the AWS account. This runs a set of validation checks to verify the correct configuration of the AWS account. If there are incorrectly configured permissions, or the creation or tagging of the VPC resources is incorrect, you’ll receive an error message with a description that should point you in the right direction.

Once the initial validation checks complete, provisioning will commence. A message box on the screen periodically refreshes with status updates.

Note

The provisioning process takes approximately 35–40 minutes to complete.

Step 2: Append Custom Role’s Trust Relationship

Note

This step is only necessary if you used a cross-account role for permissions when you configured private data storage. If you used an access key for that step, you can skip this one.

Note

You must wait for the successful completion of Step 1 before you proceed with this step.

If your private data storage uses a cross-account role, your new private data plane can't read from or write to that storage until the role trusts your new Kubernetes cluster role. Update the role by appending this trust relationship:

{
    "Sid": "",
    "Effect": "Allow",
    "Principal": {
         "AWS": "arn:aws:iam::<accountid>:role/aac-<xxxx-xxxxxxxxxxxx>-cluster-role"
    },
    "Action": "sts:AssumeRole"
}

Note

Replace AWS Principal with the ARN of the IAM role created by the private data handling provisioning process.

<accountid>: AWS account number where private data handling has been provisioned.

<xxxx-xxxxxxxxxxxx>: Last 2 segments of Private Data Processing Environment ID. You can locate this ID in the Admin UI after the private data plane has been successfully provisioned.

Example Scenario:

Account ID: 123456789012

Private Data Processing Environment ID: b2a65fbd-95dc-490a-b69b-a1dc92df224e

Role ARN: arn:aws:iam::123456789012:role/aac-b69b-a1dc92df224e-cluster-role
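The naming rule in the example scenario can be sketched in a few lines of Python. This is an illustration only: the helper names `environment_suffix` and `cluster_role_arn` are hypothetical, and the sketch assumes the environment ID is always a hyphen-separated identifier like the one shown above.

```python
def environment_suffix(environment_id: str) -> str:
    """Return the last two hyphen-separated segments of the environment ID."""
    segments = environment_id.split("-")
    if len(segments) < 2:
        raise ValueError("expected a hyphen-separated environment ID")
    return "-".join(segments[-2:])


def cluster_role_arn(account_id: str, environment_id: str) -> str:
    """Build the cluster role ARN following the pattern shown in this step."""
    suffix = environment_suffix(environment_id)
    return f"arn:aws:iam::{account_id}:role/aac-{suffix}-cluster-role"


# Values from the example scenario on this page.
print(cluster_role_arn("123456789012", "b2a65fbd-95dc-490a-b69b-a1dc92df224e"))
```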

For more information, go to https://docs.aws.amazon.com/directoryservice/latest/admin-guide/edit_trust.html.
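If you manage the trust policy as a JSON document rather than editing it in the AWS console, appending the statement might look like the sketch below. It only manipulates the policy document in memory; the existing principal `arn:aws:iam::111122223333:root` is a placeholder, and the helper name `append_trust_statement` is hypothetical. Applying the result to the role (for example, through the IAM console or the IAM `UpdateAssumeRolePolicy` API) is left out.

```python
import copy
import json


def append_trust_statement(trust_policy: dict, principal_arn: str) -> dict:
    """Return a copy of the trust policy that also trusts principal_arn."""
    updated = copy.deepcopy(trust_policy)
    statements = updated.get("Statement", [])
    if isinstance(statements, dict):
        # IAM allows a single statement object; normalize it to a list.
        statements = [statements]
    new_statement = {
        "Sid": "",
        "Effect": "Allow",
        "Principal": {"AWS": principal_arn},
        "Action": "sts:AssumeRole",
    }
    if new_statement not in statements:  # avoid adding a duplicate
        statements.append(new_statement)
    updated["Statement"] = statements
    return updated


existing_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"AWS": "arn:aws:iam::111122223333:root"},  # placeholder
            "Action": "sts:AssumeRole",
        }
    ],
}

updated_policy = append_trust_statement(
    existing_policy,
    "arn:aws:iam::123456789012:role/aac-b69b-a1dc92df224e-cluster-role",
)
print(json.dumps(updated_policy, indent=4))
```

Working on a copy keeps the original document available for comparison or rollback.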

Step 3: EMR Serverless (Optional)

Configure EMR Serverless if you are using Spark/EMR processing.

Enable EMR

  1. Log in to Alteryx Analytics Cloud.

  2. From the Profile menu, select Admin Console.

  3. From the leftmost navigation panel, select Private Data Handling.

  4. Select Enable EMR and then select Update.

Update Custom Role Created for S3 Connection

Append these permissions and trust relationships for EMR Serverless to the custom policy and custom role from Step 2:

Append Custom Policy Document

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "EMRServerlessAccess",
            "Effect": "Allow",
            "Action": [
                "emr-serverless:CreateApplication",
                "emr-serverless:UpdateApplication",
                "emr-serverless:DeleteApplication",
                "emr-serverless:ListApplications",
                "emr-serverless:GetApplication",
                "emr-serverless:StartApplication",
                "emr-serverless:StopApplication",
                "emr-serverless:StartJobRun",
                "emr-serverless:CancelJobRun",
                "emr-serverless:ListJobRuns",
                "emr-serverless:GetJobRun"
            ],
            "Resource": "*"
        },
        {
            "Sid": "AllowNetworkInterfaceCreationViaEMRServerless",
            "Effect": "Allow",
            "Action": "ec2:CreateNetworkInterface",
            "Resource": [
                "arn:aws:ec2:*:*:network-interface/*",
                "arn:aws:ec2:*:*:security-group/*",
                "arn:aws:ec2:*:*:subnet/*"
            ],
            "Condition": {
                "StringEquals": {
                    "aws:CalledViaLast": "ops.emr-serverless.amazonaws.com"
                }
            }
        },
        {
            "Sid": "AllowEMRServerlessServiceLinkedRoleCreation",
            "Effect": "Allow",
            "Action": "iam:CreateServiceLinkedRole",
            "Resource": "arn:aws:iam::<accountid>:role/aws-service-role/ops.emr-serverless.amazonaws.com/AWSServiceRoleForAmazonEMRServerless"
        },
        {
            "Sid": "AllowPassingRuntimeRole",
            "Effect": "Allow",
            "Action": "iam:PassRole",
            "Resource": "arn:aws:iam::<accountid>:role/aac-<xxxx-xxxxxxxxxxxx>-emr-serverless-spark-execution",
            "Condition": {
                "StringLike": {
                    "iam:PassedToService": "emr-serverless.amazonaws.com"
                }
            }
        },
        {
            "Sid": "S3ResourceBucketAccess",
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:GetObject",
                "s3:ListBucket",
                "s3:DeleteObject"
            ],
            "Resource": [
                "arn:aws:s3:::aac-<xxxx-xxxxxxxxxxxx>-emr-logs",
                "arn:aws:s3:::aac-<xxxx-xxxxxxxxxxxx>-emr-logs/*"
            ]
        }
    ]
}
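The policy document contains the placeholders <accountid> and <xxxx-xxxxxxxxxxxx> in several places, so filling them in by hand is easy to get wrong. One possible approach, sketched below, substitutes both placeholders in a template string and then parses the result to confirm it is still valid JSON. The helper name `render_policy` is hypothetical, and the template is shortened to a single statement for brevity.

```python
import json
import re

# Shortened template: one statement from the policy document above.
POLICY_TEMPLATE = """
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowPassingRuntimeRole",
            "Effect": "Allow",
            "Action": "iam:PassRole",
            "Resource": "arn:aws:iam::<accountid>:role/aac-<xxxx-xxxxxxxxxxxx>-emr-serverless-spark-execution"
        }
    ]
}
"""


def render_policy(template: str, account_id: str, environment_id: str) -> dict:
    """Fill in both placeholders and return the parsed policy document."""
    suffix = "-".join(environment_id.split("-")[-2:])
    rendered = template.replace("<accountid>", account_id)
    rendered = rendered.replace("<xxxx-xxxxxxxxxxxx>", suffix)
    leftover = re.search(r"<[^>]+>", rendered)
    if leftover:
        raise ValueError(f"unreplaced placeholder: {leftover.group(0)}")
    return json.loads(rendered)  # raises if the result is not valid JSON


policy = render_policy(
    POLICY_TEMPLATE, "123456789012", "b2a65fbd-95dc-490a-b69b-a1dc92df224e"
)
print(policy["Statement"][0]["Resource"])
```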

Append Custom Role's Trust Relationship

{
    "Sid": "",
    "Effect": "Allow",
    "Principal": {
        "AWS": "arn:aws:iam::<accountid>:role/aac-<xxxx-xxxxxxxxxxxx>-emr-serverless-spark-execution"
    },
    "Action": "sts:AssumeRole"
},
{
    "Sid": "",
    "Effect": "Allow",
    "Principal": {
        "Service": "emr-serverless.amazonaws.com"
    },
    "Action": "sts:AssumeRole"
}

Note

When you delete Private Data Handling, AWS replaces the trust relationship of aac-<xxxx-xxxxxxxxxxxx>-cluster-role ARN with an access key. You must also delete the trust relationship from the UI.

Note

Replace AWS Principal with the ARN of the IAM role created by the private data handling provisioning process.

<accountid>: AWS account number where private data handling has been provisioned.

<xxxx-xxxxxxxxxxxx>: Last 2 segments of Private Data Processing Environment ID. You can locate this ID in the Admin UI after the private data plane has been successfully provisioned.

Example Scenario:

Account ID: 123456789012

Private Data Processing Environment ID: b2a65fbd-95dc-490a-b69b-a1dc92df224e

Role ARN: arn:aws:iam::123456789012:role/aac-b69b-a1dc92df224e-emr-serverless-spark-execution

S3 ARN: arn:aws:s3:::aac-b69b-a1dc92df224e-emr-logs

Step 4: Cloud Execution for Desktop (Optional)

Select the Cloud Execution for Desktop option to run Designer Desktop workflows in the cloud. Go to Enable Cloud Execution for Desktop for more information on how to enable this feature.

Common Issues

When provisioning your private data plane, these are some common trouble spots that we see. These are organized based on when you’d run across them.

Setup Validation

Symptom: Error Message on Private Data Handling Page in Admin Console

  • When: Occurs when performing the initial validation before kicking off the private data plane provisioning pipeline.

  • Examples:

    • Error insufficient subnets tagged AACSubnet with value eks_node, 3 required

  • Causes: Resources (IAM accounts, policies, VPCs, subnets) not tagged correctly.

  • Fix: Address each error as described in its message. If the cause is unclear, reread the AWS Account and VPC setup instructions to make sure you've correctly tagged all resources.
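The example error above can be checked for locally before you trigger provisioning. The sketch below assumes subnet records shaped like the `Subnets` list returned by the EC2 DescribeSubnets API, where each subnet carries a `Tags` list of `Key`/`Value` pairs. The subnet IDs and the helper name `count_tagged_subnets` are hypothetical, and the required count of 3 is taken from the example error message.

```python
REQUIRED_EKS_NODE_SUBNETS = 3  # from the example error message


def count_tagged_subnets(subnets, key="AACSubnet", value="eks_node"):
    """Count subnets that carry the given tag key with the given value."""
    count = 0
    for subnet in subnets:
        tags = {tag["Key"]: tag["Value"] for tag in subnet.get("Tags", [])}
        if tags.get(key) == value:
            count += 1
    return count


# Hypothetical sample data in the DescribeSubnets response shape.
subnets = [
    {"SubnetId": "subnet-aaa", "Tags": [{"Key": "AACSubnet", "Value": "eks_node"}]},
    {"SubnetId": "subnet-bbb", "Tags": [{"Key": "AACSubnet", "Value": "eks_node"}]},
    {"SubnetId": "subnet-ccc", "Tags": [{"Key": "Name", "Value": "untagged"}]},
]

found = count_tagged_subnets(subnets)
if found < REQUIRED_EKS_NODE_SUBNETS:
    print(f"only {found} subnets tagged AACSubnet=eks_node, "
          f"{REQUIRED_EKS_NODE_SUBNETS} required")
```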

During Provisioning

Symptom: Instances Failed to Join the Kubernetes Cluster

On the Private Data Handling page in Alteryx Analytics Cloud, you’ll see a note that provisioning failed, but without a good error message for what went wrong.

In the AWS admin console, you’ll see a NodeCreationFailure issue type with a description of Instances failed to join the kubernetes cluster. This typically indicates that there’s no route that allows the EKS subnets egress out to the internet. The new EKS nodes need access to an AWS EC2 Service Endpoint to attach to the EKS Cluster.

  • When: Occurs several minutes after the provisioning process has started.

  • Potential causes:

    • Firewall rules preventing egress from the cluster.

    • Failed DNS resolution within the cluster.

    • No internet gateway between the VPC and the internet.

    • Overlapping subnets.

    • NAT gateway configured as private instead of public.

  • Fix: You need to investigate and test the networking setup. AWS documentation provides suggested troubleshooting steps.
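Of the potential causes listed above, overlapping subnets can be checked offline from the CIDR blocks alone. A minimal sketch using Python's standard `ipaddress` module follows; the CIDR values are hypothetical examples and the helper name `find_overlaps` is an assumption for illustration.

```python
import ipaddress
from itertools import combinations


def find_overlaps(cidrs):
    """Return every pair of CIDR blocks that share at least one address."""
    networks = [ipaddress.ip_network(cidr) for cidr in cidrs]
    return [
        (str(a), str(b))
        for a, b in combinations(networks, 2)
        if a.overlaps(b)
    ]


# Hypothetical subnet CIDR blocks; the third overlaps the first.
subnet_cidrs = ["10.0.0.0/24", "10.0.1.0/24", "10.0.0.128/25"]

for first, second in find_overlaps(subnet_cidrs):
    print(f"{first} overlaps {second}")
```

The other causes (firewall rules, DNS resolution, gateways) depend on live network state and need to be tested inside the VPC.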

After Provisioning

Symptom: Workflow Run Failures in the Wrangle Step

Your job will fail and you’ll see a red “x” on the wrangle step of the workflow (rather than a green check mark).

  • When: Occurs when running a workflow.

  • Cause: The workflow can't access the private data store. This is usually because private data storage is using a cross-account role and the Kubernetes cluster can't assume that role to access the store.

  • Fix: Update the trust relationship on the cross-account role used by your private data storage (see Step 2 on this page).