
Dataproc Engine Setup Guide

Connect your Alteryx Analytics Cloud (AAC) workspace to your Dataproc Serverless account to enable the Dataproc Engine. Dataproc is a distributed Spark engine that can run your Designer Cloud workflows if your workspace is set up with Google Cloud Storage (GCS) as Private Data Storage. Follow these steps to enable the Dataproc engine in your AAC workspace.

Prerequisites

  • You must be a Workspace Admin in AAC.

  • Your AAC workspace must be set up with GCS as Private Data Storage.

  • A GCS service account to run Dataproc batches (jobs).

  • Have administrative access to the target GCP project.

  • Create a VPC network for all the regions you want to use.

  • Set the constraint constraints/compute.requireOsLogin to false in the project you want to use.


GCS Service Accounts

There are two types of service accounts that you need:

  1. Base storage service account for GCS. You only need this account if you use workspace mode. AAC uses this account to access GCS during design time and to create Dataproc batches. The account must have permission to create and monitor Dataproc batches. These are the recommended roles:

    Note

    If you use user mode, AAC doesn’t use the base storage service account. Instead, AAC uses your SSO identity to launch the Dataproc batch. However, your identity needs the same roles as listed for the base storage service account.

    1. Dataproc Editor (roles/dataproc.editor) in the project where you want to execute Dataproc batches.

    2. Service Account User (roles/iam.serviceAccountUser) on the Dataproc service account. For more information, go to the Google Cloud IAM roles documentation.

  2. Dataproc service account. AAC passes this service account as an argument when creating a Dataproc batch. It must have the Dataproc Worker role (roles/dataproc.worker) in the project it executes in.
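If you manage these accounts with the gcloud CLI, the role grants above can be sketched as follows. The project ID and service account emails are placeholders; substitute your own values.

```shell
# Placeholder values -- substitute your own project and service accounts.
PROJECT_ID="my-gcp-project"
BASE_SA="base-storage@${PROJECT_ID}.iam.gserviceaccount.com"
DATAPROC_SA="dataproc-worker@${PROJECT_ID}.iam.gserviceaccount.com"

# Base storage service account: Dataproc Editor on the project.
gcloud projects add-iam-policy-binding "${PROJECT_ID}" \
  --member="serviceAccount:${BASE_SA}" \
  --role="roles/dataproc.editor"

# Base storage service account: Service Account User on the Dataproc service account.
gcloud iam service-accounts add-iam-policy-binding "${DATAPROC_SA}" \
  --member="serviceAccount:${BASE_SA}" \
  --role="roles/iam.serviceAccountUser"

# Dataproc service account: Dataproc Worker on the project it executes in.
gcloud projects add-iam-policy-binding "${PROJECT_ID}" \
  --member="serviceAccount:${DATAPROC_SA}" \
  --role="roles/dataproc.worker"
```

In user mode, grant the first two roles to your SSO identity (a user: member) instead of the base storage service account.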

GCP Project Configuration

Set the constraint constraints/compute.requireOsLogin to false in the Google Cloud Platform (GCP) project you want to use. For more information, go to the GCP organization policy documentation.
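Assuming you use the gcloud CLI and have permission to manage organization policies on the project, this boolean constraint can be disabled with a command like the following (the project ID is a placeholder):

```shell
# Placeholder project ID -- substitute your own.
# Requires org-policy admin permissions on the project or its organization.
gcloud resource-manager org-policies disable-enforce \
  compute.requireOsLogin --project="my-gcp-project"
```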

VPC Network Configuration

You must have a VPC network set up to run Dataproc jobs. For more information on how to configure this network, go to the Dataproc Serverless documentation.
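The Dataproc Serverless documentation is authoritative for network requirements; as a sketch, assuming the gcloud CLI and placeholder names, an auto-mode network (where each subnet shares the network's name) could be set up like this. Dataproc Serverless requires Private Google Access on the subnet and open internal connectivity between Spark executors, which the last two commands address.

```shell
# Placeholder names -- substitute your own network, region, and project.
gcloud compute networks create dataproc-test-2 \
  --project="my-gcp-project" --subnet-mode=auto

# Dataproc Serverless requires Private Google Access on the subnet it uses.
gcloud compute networks subnets update dataproc-test-2 \
  --project="my-gcp-project" --region=us-central1 \
  --enable-private-ip-google-access

# Allow internal traffic within the network so Spark executors can communicate.
# 10.128.0.0/9 is the default auto-mode subnet range.
gcloud compute firewall-rules create allow-internal-dataproc \
  --project="my-gcp-project" --network=dataproc-test-2 \
  --allow=all --source-ranges="10.128.0.0/9"
```

Repeat the subnet update for every region you plan to use.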

Complete Setup

The workspace admin can configure Dataproc for their workspace using the admin console.

[Image: Dataproc setup configuration form (dataproc_setup_form.png)]
  1. Go to the workspace Admin section > Data Warehouses > Dataproc.

  2. Fill in the configuration form.

Table 1. Example default values for these configurations

  • Project ID: The Google Cloud project within which the Dataproc batch is executed. Example: aac-multicloud-dev-4447

  • VPC Network Name: The VPC network to use. In this example the network uses auto-mode subnets, so you don’t need to specify a subnet name; if the network uses custom subnets, you must also specify the subnet name in the form. Example: dataproc-test-2

  • Region: The region where the Dataproc batch is executed. Example: us-central1

  • Service Account Name: The service account used to run the Dataproc batch. It is specified as a parameter at launch time and is not necessarily the same service account as the base storage service account. Example: dataproc-worker@aac-multicloud-dev-4447.iam.gserviceaccount.com



To complete the Dataproc engine setup, you need help from the Alteryx fulfillment team. To get started, contact your customer success manager. You must provide this information from your Dataproc Serverless account:

  • VPC Network

  • Subnet name in the appropriate region.

  • Project ID

  • Region or Location

  • Service Accounts

  • Job types that run using the Alteryx main engine (batch-job-runner-v2).

    • The list must contain transformation.dataprocSpark.
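Assuming you use the gcloud CLI, most of this information can be gathered with a few listing commands (project ID and region below are placeholders):

```shell
# Placeholder values -- substitute your own project and region.
PROJECT_ID="my-gcp-project"
REGION="us-central1"

# VPC networks, and subnets in the chosen region.
gcloud compute networks list --project="${PROJECT_ID}"
gcloud compute networks subnets list \
  --project="${PROJECT_ID}" --regions="${REGION}"

# Service accounts in the project.
gcloud iam service-accounts list --project="${PROJECT_ID}"
```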