This section describes how you can configure Dataprep by Trifacta to operate within your enterprise's virtual private cloud (VPC).

The Dataprep by Trifacta application runs in your VPC in the Google Cloud Platform. No additional configuration is required.

Dataflow

Optionally, you can configure Dataflow jobs to be executed within your VPC. When enabled, data remains in your VPC during full execution of the job. 

NOTE: Previewing and sampling use the default network settings.

To enable in-VPC execution, the VPC network mode must be set to custom, and additional VPC properties must be provided. In-VPC job execution can be configured per user or per output, as illustrated below.
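
For example, when overriding the Dataflow execution settings for a single output, the VPC settings look similar to the following (a sketch; the field labels and names are illustrative, and the subnetwork must be specified as a full URL):

VPC network mode: Custom
Network: myproject-network
Subnetwork: https://www.googleapis.com/compute/v1/projects/myproject/regions/myregion/subnetworks/myproject-subnet-myregion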

Running Jobs

Feature Availability: This feature may not be available in all product editions. For more information on available features, see Compare Editions.

NOTE: When job execution is migrated from the platform VPC to your enterprise VPC, you may incur additional costs to execute each job.

Job Types

By default, Trifacta Photon and connectivity jobs execute in the Dataprep by Trifacta VPC. As needed, you can configure these jobs to run in your VPC.

Tip: Service accounts may be used for execution of these jobs where possible.

Tip: All job types supported for in-VPC execution are supported for manual and scheduled execution.


Job Type | Description
Batch job processing

For execution of batch jobs within your VPC, you must perform the configuration, including specifying the appropriate service accounts to use. After configuration, these jobs are automatically executed within your VPC.

Trifacta Photon

These jobs are transformation and quick scan sampling jobs that execute in memory. This type of job execution is suitable for small- to medium-sized jobs.

Connectivity

If your data source or publishing target is a relational or API-based source, some or all of the job occurs through the connectivity framework.

Tip: If connectivity jobs have been enabled for execution in your environment, BigQuery connectivity is also enabled, including publishing to BigQuery and using BigQuery to run transformation jobs, using the appropriate service account.

Connectivity - design time

In-VPC execution supports connection from the design time functions of the Dataprep by Trifacta application to an in-VPC data service instance. This connection to the data service allows for testing connections, viewing table and schema information, and collecting initial samples from datasources hosted within your VPC. 

NOTE: When this feature is enabled, SSH tunneling for connections does not work.

Conversion

Ingestion jobs for datasources that require conversion, such as binary formats like PDF, XLSX, and Google Sheets, can be executed within your VPC.

NOTE: Google Sheets conversion jobs use user credentials within the project, even if service accounts are enabled.

For these job types, there are two types of configuration:

Configuration Type | Description
Basic

Uses the GKE default namespace and default node pool. See below.

Advanced

User-configured GKE namespace and user-specified node pool. See Dataprep In-VPC Execution - Advanced.

Details on these configuration methods are provided below. 

Limitations

The following limitations apply to this release. These limitations may change in the future:

  • A running job is permitted to execute for no more than 1 hour. 
  • For this release, only regions in the U.S. and Europe are supported.

Prerequisites

Before you begin, please verify that your VPC environment has the following:

  • The project owner must perform configuration in Dataprep by Trifacta as part of this setup.
  • A GKE cluster is available for transformation jobs to use.

    • Your GKE cluster must have a public endpoint.
    • Use VPC-native clusters. Routes-based clusters are not supported.
    • If using a GKE cluster with private nodes, a Cloud NAT (network address translation) gateway must be available in your VPC to access the Dataprep by Trifacta image registry.
  • Workload identity must be enabled on the GKE cluster. Additional configuration for Dataprep by Trifacta is described later.
  • The use of service accounts (Compute Engine or Companion Service Accounts) is required to run jobs in your VPC. 
    • Use of individual user credentials is not supported for Workload Identity.
  • Access to the following tools:
    • gcloud command line interface (CLI)
    • kubectl
    • openssl
    • base64
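
To confirm that the required tools are available in your shell, you can run quick version checks (a sketch; the exact output varies by tool version and platform):

gcloud version
kubectl version --client
openssl version
base64 --version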

Acquire the following from Alteryx Inc.:

  • IP address for authorized control plane access.

Enable

In-VPC execution must be enabled by an administrator. In the Dataprep Settings page, you can enable the following settings.

Setting | Description
In-VPC execution

Enables general in-VPC execution, which includes execution of the following job types:

  • Trifacta Photon jobs
  • Batch processing jobs
  • Connectivity jobs
In-VPC Conversion job execution

Enables execution of conversion jobs within your VPC.

NOTE: This setting is available when In-VPC Execution has been enabled.

In-VPC Data-Service communication

Enables design-time connectivity jobs to be executed within your VPC.

NOTE: This setting is available when In-VPC Execution has been enabled.

NOTE: The Scheduling feature must also be enabled for the project. 


For more information, see Dataprep Project Settings Page.

Basic configuration

Please complete the following steps for the Basic configuration.

Google Cloud IAM Service Account

This Service Account is assigned to the nodes in the GKE node pool and is configured to have minimal privileges.

The following variables are used in the configuration steps. You can modify them based on your requirements and supported values:

Variable | Description
trifacta-service-account | Default service account name
myproject | Name of your Google project
myregion | Your Google Cloud region

Please execute the following commands from the gcloud CLI: 

NOTE: Below, some values are too long for a single line. Single lines that overflow to additional lines are marked with a \. The backslash should not be included if the line is used as input.

# Create the service account used to run remote jobs
gcloud iam service-accounts create trifacta-service-account \
--display-name="Service Account for running Trifacta Remote jobs"

# Grant the minimal roles required for logging, monitoring, and pulling images
gcloud projects add-iam-policy-binding myproject \
--member "serviceAccount:trifacta-service-account@myproject.iam.gserviceaccount.com" \
--role roles/logging.logWriter

gcloud projects add-iam-policy-binding myproject \
--member "serviceAccount:trifacta-service-account@myproject.iam.gserviceaccount.com" \
--role roles/monitoring.metricWriter

gcloud projects add-iam-policy-binding myproject \
--member "serviceAccount:trifacta-service-account@myproject.iam.gserviceaccount.com" \
--role roles/monitoring.viewer

gcloud projects add-iam-policy-binding myproject \
--member "serviceAccount:trifacta-service-account@myproject.iam.gserviceaccount.com" \
--role roles/stackdriver.resourceMetadata.writer

gcloud projects add-iam-policy-binding myproject \
--member "serviceAccount:trifacta-service-account@myproject.iam.gserviceaccount.com" \
--role roles/artifactregistry.reader

Verification steps:

Command:

gcloud projects get-iam-policy myproject \
--flatten="bindings[].members" \
--format="table(bindings.role)" \
--filter="bindings.members:serviceAccount:trifacta-service-account@myproject.iam.gserviceaccount.com"

The output should look like the following:

ROLE
roles/artifactregistry.reader
roles/logging.logWriter
roles/monitoring.metricWriter
roles/monitoring.viewer
roles/stackdriver.resourceMetadata.writer

Router and NAT

The following configuration is required for Internet access to acquire assets from Dataprep by Trifacta, if the GKE cluster has private nodes.

NOTE: Below, some values are too long for a single line. Single lines that overflow to additional lines are marked with a \. The backslash should not be included if the line is used as input.

gcloud compute routers create myproject-myregion \
--network myproject-network \
--region=myregion

gcloud compute routers nats create myproject-myregion \
--router=myproject-myregion \
--region=myregion \
--auto-allocate-nat-external-ips \
--nat-all-subnet-ip-ranges \
--enable-logging

Verification Steps:

You can verify that the router NAT was created in the Google Cloud Platform Console: https://console.cloud.google.com/net-services/nat/list.
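
You can also verify from the CLI (a sketch using the router and region created above):

gcloud compute routers nats list --router=myproject-myregion --region=myregion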

GKE cluster 

This configuration creates the GKE cluster for use in executing jobs. This cluster must be created in the VPC/sub-network that has access to your datasources, such as your databases and Cloud Storage.

In the following, please replace w.x.y.z with the IP address provided to you by Alteryx Inc for authorized control plane access.

NOTE: Below, some values are too long for a single line. Single lines that overflow to additional lines are marked with a \. The backslash should not be included if the line is used as input.

gcloud container clusters create "trifacta-cluster" \
--project "myproject" \
--region "myregion" \
--no-enable-basic-auth \
--cluster-version "1.20.8-gke.900" \
--release-channel "None" \
--machine-type "n1-standard-16" \
--image-type "COS_CONTAINERD" \
--disk-type "pd-standard" \
--disk-size "100" \
--metadata disable-legacy-endpoints=true \
--service-account "trifacta-service-account@myproject.iam.gserviceaccount.com" \
--max-pods-per-node "110" \
--num-nodes "1" \
--logging=SYSTEM,WORKLOAD \
--monitoring=SYSTEM \
--enable-ip-alias \
--network "projects/myproject/global/networks/myproject-network" \
--subnetwork "projects/myproject/regions/myregion/subnetworks/myproject-subnet-myregion" \
--no-enable-intra-node-visibility \
--default-max-pods-per-node "110" \
--enable-autoscaling \
--min-nodes "0" \
--max-nodes "3" \
--enable-master-authorized-networks \
--master-authorized-networks w.x.y.z/32 \
--addons HorizontalPodAutoscaling,HttpLoadBalancing,GcePersistentDiskCsiDriver \
--no-enable-autoupgrade \
--enable-autorepair \
--max-surge-upgrade 1 \
--max-unavailable-upgrade 0 \
--workload-pool "myproject.svc.id.goog" \
--enable-private-nodes \
--enable-shielded-nodes \
--shielded-secure-boot \
--node-locations "myregion-a","myregion-b","myregion-c" \
--master-ipv4-cidr=10.1.0.0/28 \
--enable-binauthz 

Verification Steps:

You can verify that the cluster was created through the Google Cloud Platform Console: https://console.cloud.google.com/kubernetes/list/overview.
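
You can also confirm the cluster status from the CLI (a sketch; the expected output is RUNNING):

gcloud container clusters describe trifacta-cluster --region myregion --format="value(status)"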

Switch to new cluster

Use the following command to set up configuration to connect to the new cluster:

gcloud container clusters get-credentials trifacta-cluster --region myregion --project myproject
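
To confirm that kubectl now points at the new cluster, you can check the active context (a quick sanity check):

kubectl config current-context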

The following commands add the Cloud Shell IP address to the cluster's authorized networks:

  1. Get the IP for the shell instance:

    dig +short myip.opendns.com @resolver1.opendns.com
  2. Modify the authorized networks to include the IP. You must include the Dataprep by Trifacta IP address each time, since Cloud Shell IP addresses are not static.

    gcloud container clusters update trifacta-cluster \
     --region myregion \
     --enable-master-authorized-networks \
     --master-authorized-networks 34.68.114.64/28,192.77.238.35/32,34.75.7.151/32
  3. After you have acquired access, create the following Kubernetes ServiceAccount:

    cat <<EOF | kubectl apply -f -
    apiVersion: v1
    kind: ServiceAccount
    automountServiceAccountToken: false
    metadata:
      namespace: default
      name: trifacta-job-runner
    EOF
  4. Create the following role using the appropriate definition below:

    1. Use the following for basic in-VPC connectivity:

      cat <<EOF | kubectl apply -f -
      apiVersion: rbac.authorization.k8s.io/v1
      kind: Role
      metadata:
        name: trifacta-job-runner-role
      rules:
      - apiGroups: [""]
        resources: ["secrets"]
        verbs: ["create", "delete"]
      - apiGroups: [""]
        resources: ["pods"]
        verbs: ["list"]
      - apiGroups: [""]
        resources: ["pods/log"]
        verbs: ["get"]
      - apiGroups: ["batch"]
        resources: ["jobs"]
        verbs: ["get", "create", "delete", "watch"]
      - apiGroups: [""]
        resources: ["serviceaccounts"]
        verbs: ["list", "get"]
      EOF
    2. Use the following if you are enabling design-time connectivity to a remote data service instance:

      cat <<EOF | kubectl apply -f -
      apiVersion: rbac.authorization.k8s.io/v1
      kind: Role
      metadata:
        name: trifacta-job-runner-role
      rules:
      - apiGroups: [""]
        resources: ["secrets"]
        verbs: ["create", "delete"]
      - apiGroups: ["apps"]
        resources: ["deployments"]
        verbs: ["get", "create", "delete"]
      - apiGroups: [""]
        resources: ["pods", "configmaps", "services"]
        verbs: ["list", "get", "create", "delete"]
      - apiGroups: [""]
        resources: ["pods/log", "pods/portforward"]
        verbs: ["get", "list", "create"]
      - apiGroups: ["batch"]
        resources: ["jobs"]
        verbs: ["get", "create", "delete", "watch"]
      - apiGroups: [""]
        resources: ["serviceaccounts"]
        verbs: ["list", "get"]
      EOF
  5. Specify the following role bindings and cluster roles:

    cat <<EOF | kubectl apply -f -
    apiVersion: rbac.authorization.k8s.io/v1
    kind: RoleBinding
    metadata:
      name: trifacta-job-runner-rb
    subjects:
    - kind: ServiceAccount
      name: trifacta-job-runner
      namespace: default
    roleRef:
      kind: Role
      name: trifacta-job-runner-role
      apiGroup: rbac.authorization.k8s.io
    EOF
    cat <<EOF | kubectl apply -f -
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRole
    metadata:
      name: node-list-role
    rules:
    - apiGroups: [""]
      resources: ["nodes"]
      verbs: ["list"]
    EOF
    cat <<EOF | kubectl apply -f -
    kind: ClusterRoleBinding
    apiVersion: rbac.authorization.k8s.io/v1
    metadata:
      name: node-list-rb
    subjects:
    - kind: ServiceAccount
      name: trifacta-job-runner
      namespace: default
    roleRef:
      kind: ClusterRole
      name: node-list-role
      apiGroup: rbac.authorization.k8s.io
    EOF
    cat <<EOF | kubectl apply -f -
    apiVersion: v1
    kind: ServiceAccount
    automountServiceAccountToken: false
    metadata:
      name: trifacta-pod-sa
    EOF

Node pool

For basic configuration, Trifacta Photon uses the default node pool. No additional configuration is required.

Kubernetes namespace

For basic configuration, Trifacta Photon uses the default namespace. No additional configuration is required.

Kubernetes Service Accounts

Variable | Description
trifacta-job-runner | Service Account used by Dataprep by Trifacta externally to launch jobs into the GKE cluster.
trifacta-pod-sa | Service Account assigned to the job pod running in the GKE cluster.

Please execute the following commands:

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: ServiceAccount
automountServiceAccountToken: false
metadata:
  namespace: default
  name: trifacta-job-runner
EOF
cat <<EOF | kubectl apply -f -
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: trifacta-job-runner-role
rules:
- apiGroups: [""]
  resources: ["secrets"]
  verbs: ["create", "delete"]
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["list"]
- apiGroups: [""]
  resources: ["pods/log"]
  verbs: ["get"]
- apiGroups: ["batch"]
  resources: ["jobs"]
  verbs: ["get", "create", "delete", "watch"]
- apiGroups: [""]
  resources: ["serviceaccounts"]
  verbs: ["list", "get"]
EOF
cat <<EOF | kubectl apply -f -
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: trifacta-job-runner-rb
subjects:
- kind: ServiceAccount
  name: trifacta-job-runner
  namespace: default
roleRef:
  kind: Role
  name: trifacta-job-runner-role
  apiGroup: rbac.authorization.k8s.io
EOF
cat <<EOF | kubectl apply -f -
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: node-list-role
rules:
- apiGroups: [""]
  resources: ["nodes"]
  verbs: ["list"]
EOF
cat <<EOF | kubectl apply -f -
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: node-list-rb
subjects:
- kind: ServiceAccount
  name: trifacta-job-runner
  namespace: default
roleRef:
  kind: ClusterRole
  name: node-list-role
  apiGroup: rbac.authorization.k8s.io
EOF
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: ServiceAccount
automountServiceAccountToken: false
metadata:
  name: trifacta-pod-sa
EOF
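
Verification steps:

You can confirm that the objects were created with the following checks (a sketch; the names match the definitions above):

kubectl get serviceaccount trifacta-job-runner trifacta-pod-sa -n default
kubectl get role trifacta-job-runner-role -n default
kubectl get rolebinding trifacta-job-runner-rb -n default
kubectl get clusterrole node-list-role
kubectl get clusterrolebinding node-list-rb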

Credential encryption keys

The following commands create the encryption keys for credentials:

NOTE: Below, some values are too long for a single line. Single lines that overflow to additional lines are marked with a \. The backslash should not be included if the line is used as input.

# Generate a 2048-bit RSA private key
openssl genrsa -out private_key.pem 2048

# Convert the private key to PKCS#8 DER format
openssl pkcs8 -topk8 -inform PEM -outform DER -in private_key.pem -out private_key.der -nocrypt

# Extract the matching public key in DER format
openssl rsa -in private_key.pem -pubout -outform DER -out public_key.der

# Base64-encode both keys
base64 -i public_key.der > public_key.der.base64
base64 -i private_key.der > private_key.der.base64

# Store the private key as a Kubernetes secret in the default namespace
kubectl create secret generic trifacta-credential-encryption -n default \
--from-file=privateKey=private_key.der.base64
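
To verify that the secret was created:

kubectl get secret trifacta-credential-encryption -n default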

Dataprep by Trifacta application configuration

After you have completed the above configuration, you must configure the Dataprep by Trifacta application based on the commands that you have executed. 

Steps:

  1. Log in to the Dataprep by Trifacta application as a project owner.
  2. Select Admin console > VPC runtime settings.

Please complete the following configuration. For more information on these settings, see VPC Runtime Settings Page.

Kubernetes cluster tab

Setting | Command or Value
Master URL

Command:

gcloud container clusters describe trifacta-cluster --region=myregion --format="value(endpoint)"

The command returns a URL similar to the following:

https://34.0.0.0
OAuth token

Command:

kubectl get secret `kubectl get sa trifacta-job-runner -o json | jq -r '.secrets[0].name'` -o json | jq -r '.data.token' | base64 --decode
Cluster CA certificate

Command:

gcloud container clusters describe trifacta-cluster --region=myregion --format="value(masterAuth.clusterCaCertificate)"
Service account name

Value: trifacta-pod-sa

Public key

Insert the contents of: public_key.der.base64.

To acquire this value:

cat public_key.der.base64

NOTE: To process Google Sheets data in your VPC, this value is required. Otherwise, it is optional. 

Private key secret name

Value: trifacta-credential-encryption 

NOTE: To process Google Sheets data in your VPC, this value is required. The private key must be accessible within your VPC. Otherwise, this value is optional. 

Photon tab

Setting | Command or Value
Kubernetes namespace

Value: default

To acquire the namespace value:

kubectl get namespace
CPU, memory - request, limits

Adjust as needed.

NOTE: CPU and memory requests and limits should be lower than the CPU and memory that can be allocated on the GKE node.

Node selector, tolerations


Values:

Node selector = ""
Node tolerations = ""

Connectivity tab

Setting | Command or Value
Kubernetes namespace | default
CPU, memory - request, limits | Adjust defaults, if necessary.
Node selector, tolerations | Node selector = ""; Node tolerations = ""

  • To test your configuration, click Test. A success message should be displayed.
  • To save your configuration, click Save.

Conversion tab

Setting | Command or Value
Kubernetes namespace | default
CPU, memory - request, limits | Adjust defaults, if necessary.
Node selector, tolerations | Node selector = ""; Node tolerations = ""

  • To test your configuration, click Test. A success message should be displayed.
  • To save your configuration, click Save.

Configure Workload Identity

Feature Availability: This feature may not be available in all product editions. For more information on available features, see Compare Editions.

Google access tokens are valid for 1 hour. Some jobs can be long-running. To protect against timeouts during these jobs and to support security practices recommended by Google, Dataprep by Trifacta supports the use of Workload Identity, which is Google's recommended approach for accessing Google APIs.

NOTE: Workload Identity is required for running jobs on a GKE cluster, which is required for In-VPC job execution.


NOTE: Workload Identity requires the use of Compute Engine or Companion service accounts. Use of individual user credentials is not supported. For more information, see Google Service Account Management.

This section describes how to bind a Companion Service Account to a Kubernetes ServiceAccount on the GKE cluster using Workload Identity. These steps need to be modified if you are binding a Compute Engine service account.

For each Companion Service Account assigned to a user in Dataprep by Trifacta:

  1. A new Kubernetes ServiceAccount must be created on the GKE cluster.

    NOTE: This step must be completed by your Google Cloud Platform administrator.


  2. Using Workload Identity, the Companion Service Account must be bound to the newly created Kubernetes ServiceAccount.

The following assumes that a Companion Service Account named allAccess@myproject.iam.gserviceaccount.com already exists:

# Create a new Kubernetes ServiceAccount on the GKE cluster with an annotation that
# binds it to the allAccess@myproject.iam.gserviceaccount.com Companion Service Account.
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: ServiceAccount
automountServiceAccountToken: false
metadata:
  annotations:
    iam.gke.io/gcp-service-account: allAccess@myproject.iam.gserviceaccount.com
  name: trifacta-pod-sa-allaccess
EOF

# Allow the Kubernetes ServiceAccount to impersonate the Google IAM Service Account by
# adding an IAM policy binding between the two service accounts. This binding allows
# the Kubernetes ServiceAccount to act as the IAM Service Account.
gcloud iam service-accounts add-iam-policy-binding \
  --project myproject \
  --role roles/iam.workloadIdentityUser \
  --member "serviceAccount:myproject.svc.id.goog[default/trifacta-pod-sa-allaccess]" \
  allAccess@myproject.iam.gserviceaccount.com

Wait a couple of minutes for the binding to take effect.
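
To verify the binding, you can inspect both sides (a sketch; the annotation and the workloadIdentityUser binding should reference each other):

kubectl get serviceaccount trifacta-pod-sa-allaccess -n default -o yaml

gcloud iam service-accounts get-iam-policy allAccess@myproject.iam.gserviceaccount.com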

NOTE: For relational connectivity, additional configuration is required. Search for data-system in Dataprep In-VPC Execution - Advanced.

Testing

You can use the following command to watch the Kubernetes cluster for job execution and to check active pods:

kubectl get pods -n default -w

To get details on a specific pod:

kubectl describe pod <podId> -n default

Then, run a job in Trifacta Photon through the Dataprep by Trifacta application. If the job runs successfully, then the configuration has been properly applied. See Run Job Page.
