This section describes how you can configure Dataprep by Trifacta to operate within your enterprise's virtual private cloud (VPC).
The Trifacta Application runs in your VPC in the Google Cloud Platform. No additional configuration is required.
Optionally, you can configure Dataflow jobs to be executed within your VPC. When enabled, data remains in your VPC during full execution of the job.
Previewing and sampling use the default network settings.
To enable in-VPC execution, the VPC network mode must be set to custom, and additional VPC properties must be provided. In-VPC job execution can be configured per-user or per-output:
Per-user: For more information, see User Execution Settings Page.
Per-output: For more information, see Runtime Dataflow Execution Settings.
Per-output settings override any settings specified in your preferences.
This feature may not be available in all product editions. For more information on available features, see Compare Editions.
When job execution is migrated from the platform VPC to your enterprise VPC, you may incur additional costs to execute each job.
By default, Trifacta Photon and connectivity jobs execute in the Alteryx VPC. As needed, you can configure these jobs to run in your VPC.
Service accounts may be used for execution of these jobs where possible.
All job types supported for in-VPC execution are supported for manual and scheduled execution.
Job Type | Description |
Batch job processing | For execution of batch jobs within your VPC, you must perform the configuration, including specifying the appropriate service accounts to use. After configuration, these jobs are automatically executed within your VPC. |
Trifacta Photon | These jobs are transformation and quick scan sampling jobs that execute in memory. This type of job execution is suitable for small- to medium-sized jobs. |
Connectivity | If your data source or publishing target is a relational or API-based source, some or all of the job occurs through the connectivity framework. Tip: If connectivity jobs have been enabled for execution in your environment, then BigQuery connectivity is enabled, including publishing and using BigQuery for running transformation jobs, using the appropriate service account. |
Connectivity - design time | In-VPC execution supports connection from the design time functions of the Trifacta Application to an in-VPC data service instance. This connection to the data service allows for testing connections, viewing table and schema information, and collecting initial samples from datasources hosted within your VPC. Note: When this feature is enabled, SSH tunneling for connections does not work. |
Conversion | Ingestion jobs of datasources that need to be converted, such as binary formats like PDF, XLSX, and Google Sheets, can be executed within your VPC. Note: Google Sheets conversion jobs use user credentials within the project, even if service accounts are enabled. |
For these job types, there are two types of configuration:
Configuration Type | Description |
Basic | Uses the GKE default namespace and default node pool. See below. |
Advanced | User-configured GKE namespace and user-specified node pool. See Dataprep In-VPC Execution - Advanced. |
Details on these configuration methods are provided below.
The following limitations apply to this release. These limitations may change in the future:
A running job is permitted to execute for no more than 1 hour.
For this release, only regions in the U.S. and Europe are supported.
Before you begin, please verify that your VPC environment has the following:
The project owner must perform configuration in Dataprep by Trifacta as part of this setup.
A GKE cluster is available for transformation jobs to use.
Your GKE cluster must have a public endpoint.
Use VPC-native clusters. Routes-based clusters are not supported.
If using a GKE cluster with private nodes, a Cloud NAT (network address translation) gateway must be available in your VPC to access the Alteryx image registry.
Workload identity must be enabled on the GKE cluster. Additional configuration for Dataprep by Trifacta is described later.
The use of service accounts (Compute Engine or Companion Service Accounts) is required to run jobs in your VPC.
Use of individual user credentials is not supported for Workload Identity.
Access to the following tools:
gcloud command line interface (CLI)
kubectl
Acquire from Alteryx:
IP address for authorized control plane access.
In-VPC execution must be enabled by an administrator. In the Dataprep Settings page, you can enable the following settings.
Setting | Description |
In-VPC execution | Enables general in-VPC execution, which includes execution of the following job types:
In-VPC Conversion job execution | Enables execution of conversion jobs within your VPC. Note: This setting is available when In-VPC Execution has been enabled. |
In-VPC Data-Service communication | Enables design-time connectivity jobs to be executed within your VPC. Note: This setting is available when In-VPC Execution has been enabled. |
The Scheduling feature must also be enabled for the project.
For more information, see Dataprep Project Settings Page.
Please complete the following steps for the Basic configuration.
This Service Account is assigned to the nodes in the GKE node pool and is configured to have minimal privileges.
Following are variables listed in the configuration steps. They can be modified based on your requirements and supported values:
Variable | Description |
| Default service account name |
myproject | Name of your Google project |
myregion | Your Google Cloud region |
Please execute the following commands from the gcloud CLI.
Below, some values are too long for a single line. Single lines that overflow to additional lines are marked with a \. The backslash should not be included if the line is used as input.
gcloud iam service-accounts create trifacta-service-account \
--display-name="Service Account for running Trifacta Remote jobs"
gcloud projects add-iam-policy-binding myproject \
--member "" \
--role roles/logging.logWriter
gcloud projects add-iam-policy-binding myproject \
--member "" \
--role roles/monitoring.metricWriter
gcloud projects add-iam-policy-binding myproject \
--member "" \
--role roles/monitoring.viewer
gcloud projects add-iam-policy-binding myproject \
--member "" \
--role roles/stackdriver.resourceMetadata.writer
gcloud projects add-iam-policy-binding myproject \
--member "" \
--role roles/artifactregistry.reader
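The five bindings above differ only in the role, so they can be generated in a loop. The following is a minimal sketch, not part of the product instructions: `print_role_bindings` is a hypothetical helper that only prints the commands for review, and the member string passed to it is a placeholder that you must replace with the member value for your own service account.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints (does not run) one add-iam-policy-binding
# command for each role granted to the node service account.
print_role_bindings() {
  local project="$1" member="$2" role
  for role in \
    roles/logging.logWriter \
    roles/monitoring.metricWriter \
    roles/monitoring.viewer \
    roles/stackdriver.resourceMetadata.writer \
    roles/artifactregistry.reader
  do
    printf 'gcloud projects add-iam-policy-binding %s --member "%s" --role %s\n' \
      "$project" "$member" "$role"
  done
}

# Placeholder arguments; substitute your own project and member values.
print_role_bindings myproject "MEMBER_PLACEHOLDER"
```

Printing the commands first makes it possible to review them before piping the output to a shell.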
Verification steps:
gcloud projects get-iam-policy myproject --flatten="bindings[].members" --format="table(bindings.role)" --filter=""
The output should look like the following:
ROLE
roles/artifactregistry.reader
roles/logging.logWriter
roles/monitoring.metricWriter
roles/monitoring.viewer
roles/stackdriver.resourceMetadata.writer
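Comparing the verification output against the expected roles by eye is error-prone. As a rough sketch, the hypothetical helper `missing_roles` below reads role names on standard input, one per line, and prints every expected role that is absent. The sample input is made up for illustration; in practice you would pipe in the output of the verification command.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads role names on stdin (one per line) and prints
# each expected role that does not appear in the input.
missing_roles() {
  local actual role
  actual="$(cat)"
  for role in \
    roles/artifactregistry.reader \
    roles/logging.logWriter \
    roles/monitoring.metricWriter \
    roles/monitoring.viewer \
    roles/stackdriver.resourceMetadata.writer
  do
    # -x matches whole lines only, -F treats the role as a fixed string.
    printf '%s\n' "$actual" | grep -qxF "$role" || printf '%s\n' "$role"
  done
}

# Sample input that lacks one of the five roles.
printf 'ROLE\nroles/logging.logWriter\nroles/monitoring.metricWriter\nroles/monitoring.viewer\nroles/stackdriver.resourceMetadata.writer\n' | missing_roles
# prints: roles/artifactregistry.reader
```

Empty output means that all five roles are bound.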
The following configuration is required for Internet access to acquire assets from Dataprep by Trifacta, if the GKE cluster has private nodes.
Below, some values are too long for a single line. Single lines that overflow to additional lines are marked with a \. The backslash should not be included if the line is used as input.
gcloud compute routers create myproject-myregion \
--network myproject-network \
gcloud compute routers nats create myproject-myregion \
--router=myproject-myregion \
--auto-allocate-nat-external-ips \
--nat-all-subnet-ip-ranges \
Verification Steps:
You can verify that the router NAT was created in the Google Cloud Platform Console:
This configuration creates the GKE cluster for use in executing jobs. This cluster must be created in the VPC/sub-network that has access to your datasources, such as your databases and Cloud Storage.
In the following, please replace w.x.y.z with the IP address provided to you by Alteryx for authorized control plane access.
The Pod address range limits the maximum size of the cluster.
For more information about the available zones in myregion, please see the Google Cloud documentation. If your account quotas are too low for the node size used below, you must reconfigure the node size to fit inside your quota, or your cluster may not start.
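To make the effect of the Pod address range concrete, the arithmetic can be sketched as follows. This is an illustration under a stated assumption: per the GKE documentation (not this page), a cluster configured for a maximum of 110 Pods per node, the value used in the cluster command, reserves a /24 block of the Pod range for each node. The function name `max_nodes_for_pod_range` is hypothetical.

```shell
#!/usr/bin/env bash
# Assumption (GKE documentation, not this page): with 110 max Pods per node,
# each node reserves a /24 block from the cluster's Pod address range.
max_nodes_for_pod_range() {
  local prefix="$1"
  if [ "$prefix" -lt 1 ] || [ "$prefix" -gt 24 ]; then
    echo "Pod range prefix length must be between 1 and 24" >&2
    return 1
  fi
  # Number of /24 blocks that fit in a /prefix range: 2^(24 - prefix).
  echo $(( 1 << (24 - prefix) ))
}

max_nodes_for_pod_range 21   # prints 8
max_nodes_for_pod_range 14   # prints 1024
```

Under this assumption, a /21 Pod range supports at most 8 nodes, which still leaves room for the autoscaling maximum of 3 nodes used in the command below.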
Below, some values are too long for a single line. Single lines that overflow to additional lines are marked with a \. The backslash should not be included if the line is used as input.
gcloud container clusters create "trifacta-cluster" \
--project "myproject" \
--region "myregion" \
--no-enable-basic-auth \
--cluster-version "1.20.8-gke.900" \
--release-channel "None" \
--machine-type "n1-standard-16" \
--image-type "COS_CONTAINERD" \
--disk-type "pd-standard" \
--disk-size "100" \
--metadata disable-legacy-endpoints=true \
--service-account "" \
--max-pods-per-node "110" \
--num-nodes "1" \
--monitoring=SYSTEM \
--enable-ip-alias \
--network "projects/myproject/global/networks/myproject-network" \
--subnetwork "projects/myproject/regions/myregion/subnetworks/myproject-subnet-myregion" \
--no-enable-intra-node-visibility \
--default-max-pods-per-node "110" \
--enable-autoscaling \
--min-nodes "0" \
--max-nodes "3" \
--enable-master-authorized-networks \
--master-authorized-networks w.x.y.z/32 \
--addons HorizontalPodAutoscaling,HttpLoadBalancing,GcePersistentDiskCsiDriver \
--no-enable-autoupgrade \
--enable-autorepair \
--max-surge-upgrade 1 \
--max-unavailable-upgrade 0 \
--workload-pool "" \
--enable-private-nodes \
--enable-shielded-nodes \
--shielded-secure-boot \
--node-locations "myregion-a","myregion-b","myregion-c" \
--master-ipv4-cidr= \
Verification Steps:
You can verify that the cluster was created through the Google Cloud Platform Console:
Use the following command to set up configuration to connect to the new cluster:
gcloud container clusters get-credentials trifacta-cluster --region myregion --project myproject
The following commands whitelist the Cloud shell for use on the cluster:
After you have acquired access, you can whitelist the following account:
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: ServiceAccount
automountServiceAccountToken: false
metadata:
  namespace: default
  name: trifacta-job-runner
EOF
You can whitelist the following role using the appropriate definition below:
Use the following if you are enabling design-time connectivity to a remote data service instance:
cat <<EOF | kubectl apply -n data-system-job-namespace -f -
apiVersion:
kind: Role
metadata:
  name: trifacta-job-runner-role
rules:
- apiGroups: [""]
  resources: ["secrets"]
  verbs: ["list", "create", "delete"]
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["list", "update"]
- apiGroups: [""]
  resources: ["pods/log", "pods/portforward"]
  verbs: ["get", "list", "create"]
- apiGroups: ["batch"]
  resources: ["jobs"]
  verbs: ["get", "create", "delete", "watch"]
- apiGroups: [""]
  resources: ["serviceaccounts"]
  verbs: ["list", "get"]
- apiGroups: [""]
  resources: ["configmaps"]
  verbs: ["patch", "list", "get", "create"]
- apiGroups: [""]
  resources: ["services"]
  verbs: ["create", "list", "get"]
- apiGroups: ["apps"]
  resources: ["deployments"]
  verbs: ["patch", "create", "list", "get"]
EOF
Specify the following role bindings and cluster roles:
cat <<EOF | kubectl apply -f -
apiVersion:
kind: RoleBinding
metadata:
  name: trifacta-job-runner-rb
subjects:
- kind: ServiceAccount
  name: trifacta-job-runner
  namespace: default
roleRef:
  kind: Role
  name: trifacta-job-runner-role
  apiGroup:
EOF
cat <<EOF | kubectl apply -f -
apiVersion:
kind: ClusterRole
metadata:
  name: node-list-role
rules:
- apiGroups: [""]
  resources: ["nodes"]
  verbs: ["list"]
EOF
cat <<EOF | kubectl apply -f -
kind: ClusterRoleBinding
apiVersion:
metadata:
  name: node-list-rb
subjects:
- kind: ServiceAccount
  name: trifacta-job-runner
  namespace: default
roleRef:
  kind: ClusterRole
  name: node-list-role
  apiGroup:
EOF
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: ServiceAccount
automountServiceAccountToken: false
metadata:
  name: trifacta-pod-sa
EOF
For basic configuration, Trifacta Photon uses the default node pool. No additional configuration is required.
For basic configuration, Trifacta Photon uses the default namespace. No additional configuration is required.
Variable | Description |
trifacta-job-runner | Service Account used by Dataprep by Trifacta externally to launch jobs into the GKE cluster. |
trifacta-pod-sa | Service Account assigned to the job pod running in the GKE cluster. |
Please execute the following commands:
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: ServiceAccount
automountServiceAccountToken: false
metadata:
  namespace: default
  name: trifacta-job-runner
EOF
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Secret
metadata:
  name: trifacta-job-runner-secret
  annotations: trifacta-job-runner
EOF
cat <<EOF | kubectl apply -f -
kind: Role
metadata:
  name: trifacta-job-runner-role
rules:
- apiGroups: [""]
  resources: ["secrets"]
  verbs: ["create", "delete"]
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["list"]
- apiGroups: [""]
  resources: ["pods/log"]
  verbs: ["get"]
- apiGroups: ["batch"]
  resources: ["jobs"]
  verbs: ["get", "create", "delete", "watch"]
- apiGroups: [""]
  resources: ["serviceaccounts"]
  verbs: ["list", "get"]
EOF
cat <<EOF | kubectl apply -f -
kind: RoleBinding
metadata:
  name: trifacta-job-runner-rb
subjects:
- kind: ServiceAccount
  name: trifacta-job-runner
  namespace: default
roleRef:
  kind: Role
  name: trifacta-job-runner-role
EOF
cat <<EOF | kubectl apply -f -
kind: ClusterRole
metadata:
  name: node-list-role
rules:
- apiGroups: [""]
  resources: ["nodes"]
  verbs: ["list"]
EOF
cat <<EOF | kubectl apply -f -
kind: ClusterRoleBinding
metadata:
  name: node-list-rb
subjects:
- kind: ServiceAccount
  name: trifacta-job-runner
  namespace: default
roleRef:
  kind: ClusterRole
  name: node-list-role
EOF
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: ServiceAccount
automountServiceAccountToken: false
metadata:
  name: trifacta-pod-sa
EOF
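After the role and role binding have been applied, the permissions granted to the job runner account can be spot-checked with `kubectl auth can-i` and user impersonation. The sketch below is illustrative only: `print_can_i_checks` is a hypothetical helper that prints the check commands rather than running them, and the verb and resource pairs are a subset taken from the role definition for the basic configuration. Running the printed commands assumes that your own account is allowed to impersonate service accounts.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints one "kubectl auth can-i" command per permission
# that the trifacta-job-runner service account should hold in the default namespace.
print_can_i_checks() {
  local sa="system:serviceaccount:default:trifacta-job-runner" verb resource
  while read -r verb resource; do
    printf 'kubectl auth can-i %s %s --as=%s -n default\n' "$verb" "$resource" "$sa"
  done <<'EOF'
create secrets
delete secrets
list pods
create jobs.batch
list serviceaccounts
EOF
}

print_can_i_checks
```

Each printed command answers `yes` or `no` when run against the cluster.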
The following commands create the encryption keys for credentials:
Below, some values are too long for a single line. Single lines that overflow to additional lines are marked with a \. The backslash should not be included if the line is used as input.
openssl genrsa -out private_key.pem 2048
openssl pkcs8 -topk8 -inform PEM -outform DER -in private_key.pem -out private_key.der -nocrypt
openssl rsa -in private_key.pem -pubout -outform DER -out public_key.der
base64 -i public_key.der > public_key.der.base64
base64 -i private_key.der > private_key.der.base64
kubectl create secret generic trifacta-credential-encryption -n default \
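Before pasting the key material into the application, it can help to confirm that the generated files belong together. The following is a minimal sketch, assuming the file names produced by the openssl and base64 commands above and a GNU-style `base64 -d`. The function name `check_keypair` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: checks that public_key.der matches private_key.pem and
# that public_key.der.base64 decodes back to public_key.der.
# Usage: check_keypair [directory]   (defaults to the current directory)
check_keypair() {
  local dir="${1:-.}" priv_mod pub_mod
  # An RSA key pair shares one modulus; compare it on both sides.
  priv_mod="$(openssl rsa -in "$dir/private_key.pem" -noout -modulus)" || return 1
  pub_mod="$(openssl rsa -pubin -inform DER -in "$dir/public_key.der" -noout -modulus)" || return 1
  if [ "$priv_mod" != "$pub_mod" ]; then
    echo "public and private keys do not match" >&2
    return 1
  fi
  # Round-trip the base64 file and compare it byte for byte with the DER file.
  if ! base64 -d "$dir/public_key.der.base64" | cmp -s - "$dir/public_key.der"; then
    echo "public_key.der.base64 does not decode to public_key.der" >&2
    return 1
  fi
  echo "keypair OK"
}
```

Run `check_keypair .` in the directory where the keys were generated. A non-zero exit status indicates that the files should be regenerated.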
After you have completed the above configuration, you must configure the Trifacta Application based on the commands that you have executed.
Log in to the Trifacta Application as a project owner.
Select Admin console > VPC runtime settings.
Please complete the following configuration. For more information on these settings, see VPC Runtime Settings Page.
Setting | Command or Value |
Master URL | Command: gcloud container clusters describe trifacta-cluster --zone=myregion --format="value(endpoint)" Returns: This command returns a URL that looks similar to the following: |
OAuth token | Command: kubectl get secret/trifacta-job-runner-secret -o json | jq -r '.data.token' | base64 --decode |
Cluster CA certificate | Command: gcloud container clusters describe trifacta-cluster --zone=myregion --format="value(masterAuth.clusterCaCertificate)" |
Service account name | Value: |
Public key | Insert the contents of: To acquire this value: cat public_key.der.base64 Note: To process Google Sheets data in your VPC, this value is required. Otherwise, it is optional. |
Private key secret name | Value: Note: To process Google Sheets data in your VPC, this value is required. The private key must be accessible within your VPC. Otherwise, this value is optional. |
Setting | Command or Value |
Kubernetes namespace | default |
CPU, memory - request, limits | Adjust defaults, if necessary. |
Node selector, tolerations | Node selector = ""
Node tolerations = "" |
To test your configuration, click Test. A success message should be displayed.
To save your configuration, click Save.
Google access tokens are valid for 1 hour. Some jobs can be long running. To protect against timeouts during these jobs and to support security practices recommended by Google, Dataprep by Trifacta supports the use of Workload Identity, which is Google's recommended approach for accessing Google APIs.
For more information on Workload Identity, see
For more information on enabling Workload Identity in your project, see
Workload Identity is required for running jobs on a GKE cluster, which is required for In-VPC job execution.
Workload Identity requires the use of Compute Engine or Companion service accounts. Use of individual user credentials is not supported. For more information, see Google Service Account Management.
For each Companion Service Account assigned to a user in Dataprep by Trifacta:
A new Kubernetes ServiceAccount must be created on the GKE cluster.
This step must be completed by your Google Cloud Platform administrator.
Using Workload Identity, the Companion Service Account must be bound to the newly created Kubernetes ServiceAccount.
# Create a new Kubernetes ServiceAccount on the GKE cluster with an annotation to bind it to the Companion Service Account.
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: ServiceAccount
automountServiceAccountToken: false
metadata:
  name: trifacta-pod-sa-allaccess
EOF
# Allow the Kubernetes ServiceAccount to impersonate the Google IAM service account by adding an IAM policy binding between the two service accounts. This binding allows the Kubernetes ServiceAccount to act as the IAM service account.
gcloud iam service-accounts add-iam-policy-binding \
--project <project_name> \
--role roles/iam.workloadIdentityUser \
--member "serviceAccount:<project_name>[default/trifacta-pod-sa-allaccess]" \
Wait a couple of minutes for the binding to take effect.
For relational connectivity, additional configuration is required. Search for data-system in Dataprep In-VPC Execution - Advanced.
You can use the following command to watch the Kubernetes clusters for job execution and to check active pods:
kubectl get pods -n default -w
To get details on a specific pod:
kubectl describe pod <podId> -n default
Then, run a job in Trifacta Photon through the Trifacta Application. If the job runs successfully, then the configuration has been properly applied. See Run Job Page.
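When many pods are listed, it can be easier to filter the output of `kubectl get pods` down to the pods that need attention. The sketch below assumes the default column layout (NAME, READY, STATUS, RESTARTS, AGE). The helper name `unhealthy_pods` and the pod names in the sample input are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads "kubectl get pods" output on stdin and prints the
# name and status of every pod that is neither Running nor Completed.
unhealthy_pods() {
  awk 'NR > 1 && $3 != "Running" && $3 != "Completed" { print $1, $3 }'
}

# Sample input standing in for real "kubectl get pods -n default" output.
unhealthy_pods <<'EOF'
NAME                 READY   STATUS             RESTARTS   AGE
photon-job-abc       1/1     Running            0          2m
photon-job-def       0/1     ImagePullBackOff   0          5m
photon-job-ghi       0/1     Completed          0          9m
EOF
# prints: photon-job-def ImagePullBackOff
```

Against a live cluster, the equivalent call is `kubectl get pods -n default | unhealthy_pods`. Any pod reported this way can then be inspected with `kubectl describe pod`.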