Trifacta Dataprep




In Dataprep by Trifacta®, jobs that transform your data are executed on Dataflow, a managed service for executing data pipelines within the Trifacta platform. Dataprep by Trifacta has been designed to integrate with Dataflow and to take advantage of multiple features available in the service. This section describes how to execute a job on Dataflow and the options that are available.

Default Dataflow Jobs

Steps:

  1. To run a job, open the flow containing the recipe whose output you wish to generate.
  2. Locate the recipe. Click the recipe's output object icon. 
  3. On the right side of the screen, information about the output object is displayed. The output object defines:
    1. The type and location of the outputs, including filenames and method of updating.
    2. Profiling options
    3. Execution options 
  4. For now, you can ignore the options for the output object. Click Run Job.
  5. In the Run Job page, you can review the job as it is currently specified. Click Run Job.
  6. The job is queued with default settings for execution on Dataflow.

For more information, see Run Job Page.

Tracking progress

You can track progress of your job through the following areas:

  • Flow View: select the output object. On the right side of the screen, click the Jobs tab. Your job in progress is listed. See Flow View Page.
  • Job Details Page: Click the link in the Jobs tab. You can review progress and individual details related to your job. See Job Details Page.

Download results

When your job has finished successfully, a Completed message is displayed in the Job Details page. 

Steps:

  1. In the Job Details page, click the Output Destinations tab.
  2. The generated outputs are listed. For each output, you can select download or view choices from the context menu on the right side of the screen. 

Publish

If you have created the connections to do so, you can choose to publish your generated results to external systems. In the Output Destinations tab, click the Publish link.

NOTE: You must have a connection configured to publish to an external datastore. For more information, see Connections Page.

For more information, see Publishing Dialog.

Output Options

When specifying the job you wish to run, you can define the following types of output options.

Profiling

When you select the Profiling checkbox, a visual profile of your generated results is produced as part of your data transformation job. This visual profile can help you identify any remaining issues with your data after the transformation is complete.

Tip: Use visual profiling when you are building your recipes. It can also be useful as a quick check of outputs for production flows.

If you enabled visual profiling, the visual profile is available for review in the Profile tab of the Job Details page after the job completes.

Tip: You can download PDF and JSON versions of the visual profile for offline analysis.

See Job Details Page.

For more information, see Overview of Visual Profiling.

Publishing actions

For each output object, you can define one or more publishing actions. A publishing action specifies the following:


| | GCS | BigQuery |
| --- | --- | --- |
| Type of output | file type | table |
| Location of output | path | database |
| Name | filename | table name |
| Update method | create, append, replace | create, append, truncate, drop |
| Other options | headers, quotes, delimiters, multi-part options, compression | |

Parameterized destination

You can parameterize the output filename or table name as needed.

  • Parameter values can be defined at the flow level through Flow View. For more information, see Manage Parameters Dialog.
  • These parameter values can be passed into the running environment and inserted into the output filename or table name.
  • For more information, see Overview of Parameterization.

Execution Overrides

You can specify settings for the following aspects of job execution on Dataflow:

  • Endpoint, region, and zone where the job is executed
  • Machine resources and billing account to use for the job
  • Network and subnetwork where the job is executed

These settings can be specified at the project level or at the individual output object level:

  • Execution settings: Within your preferences, you can define your execution options for jobs. By default, all of your jobs executed from flows within the project use these settings. For more information, see Execution Settings Page.
  • Output object settings: The execution settings in the Execution Settings page can be overridden at the output object level. When you define an individual output object for a recipe, the execution settings that you specify in the Run Job page apply whenever the outputs are generated for this flow. See Dataflow Execution Settings.
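The precedence described above can be sketched as a simple merge: start from the project-level execution settings and apply whatever the output object overrides. This is a minimal illustration only. The function name and the dictionary keys (`region`, `zone`, `machine_type`) are hypothetical and are not the product's actual setting names.

```python
def effective_settings(project_defaults, output_overrides):
    """Return project-level settings with output-object overrides applied on top."""
    merged = dict(project_defaults)
    # Only values that the output object actually sets replace the project default.
    merged.update({k: v for k, v in output_overrides.items() if v is not None})
    return merged


# Hypothetical project-level defaults (Execution Settings page).
project_defaults = {
    "region": "us-central1",
    "zone": "auto-select",
    "machine_type": "n1-standard-1",
}

# Hypothetical overrides for a single output object (Run Job page).
output_overrides = {"region": "europe-west1", "machine_type": None}

settings = effective_settings(project_defaults, output_overrides)
print(settings)
```

In this sketch, only the region changes for the output object. The zone and machine type fall back to the project-level values.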

Some examples of how these settings can be used are provided below.

Run Job Options

Run job in a different region and zone

If needed, you can run your job in a different region and zone. 

  • The region determines the geographic region where your job is executed, including the execution details related to your Dataflow job.
  • The zone is a sub-section of the region.

You might want to change the default settings for the following reasons:

  • Security and compliance: You may need to constrain your Dataflow work to a specific region for your enterprise's security requirements.
  • Data locality: If you know your project data is stored in a specific region, you may wish to set the job to run in this region to minimize network latency and potential costs associated with cross-region execution.
  • Resilience: If there are outages in your default Trifacta platform region, you may need to switch regions.

For more information, see https://cloud.google.com/dataflow/docs/concepts/regional-endpoints.

Steps:

In the Dataflow Execution Settings:

  1. Region: Choose the new Regional Endpoint from the drop-down list. 
  2. Zone: By default, the zone within the selected region is auto-selected for you. As needed, you can select a specific zone.

    Tip: Unless you have a specific reason to do so, you should leave the Zone value at Auto-Select to allow the platform to choose it for you.

Run job in custom VPC network

Dataprep by Trifacta supports job execution in the following Virtual Private Cloud (VPC) network modes:

  • Auto: (default) Dataflow job is executed over publicly available IP addresses using the VPC Network and Subnetwork settings determined by Trifacta platform.

    NOTE: In Auto mode, do not set values in the Dataflow Execution Settings for Network, Subnetwork, or (if available) Worker IP address configuration. These settings are ignored in Auto mode.

  • Custom: If you need specific network settings, including a private network, you can customize the VPC network settings that are applied to your job. Set the VPC Network Mode to Custom and apply the additional settings described below.

    NOTE: Custom network settings do not apply to data previewing or sampling, which use the default network settings.

For more information on Google Virtual Private Clouds (VPCs):

Public vs. internal IP addresses

If the VPC Network mode is set to custom, then choose one of the following:

  • Allow public IP addresses - Use Dataflow workers that are available through public IP addresses. No further configuration is required.
  • Use internal IP addresses only - Dataflow workers use private IP addresses for all communication. Additional configuration is below.

Run job in custom VPC using Network value (internal IP addresses)

You can specify the VPC network to use in the Network value. 

NOTE: This network must be in the region that you have specified for the job. Do not specify a Subnetwork value.

NOTE: The Network must have Private Google Access enabled.


Run job in custom VPC using Subnetwork value (internal IP addresses)

For a subnetwork that is associated with your project, you can specify the subnetwork using a short URI.

NOTE: If the Subnetwork value is specified, then the Network value should be set to default. Dataflow chooses the Network for you.


NOTE: The Subnetwork must have Private Google Access enabled.


Short URI form:

regions/<REGION>/subnetworks/<SUBNETWORK>

where:

  • <REGION> is the region to use. 

    NOTE: This value must match the Regional Endpoint value.

  • <SUBNETWORK> is the subnetwork identifier.
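
As a minimal sketch, the short URI form can be assembled and checked against the Regional Endpoint value before you paste it into the Subnetwork field. The helper names and the sample region and subnetwork values are hypothetical.

```python
def short_subnetwork_uri(region, subnetwork):
    """Build the short URI form: regions/<REGION>/subnetworks/<SUBNETWORK>."""
    if not region or not subnetwork:
        raise ValueError("region and subnetwork are both required")
    return f"regions/{region}/subnetworks/{subnetwork}"


def region_matches_endpoint(uri, regional_endpoint):
    """Check that the region inside a short URI equals the Regional Endpoint value."""
    parts = uri.split("/")
    well_formed = (
        len(parts) == 4 and parts[0] == "regions" and parts[2] == "subnetworks"
    )
    return well_formed and parts[1] == regional_endpoint


uri = short_subnetwork_uri("us-central1", "my-subnet")
print(uri)
print(region_matches_endpoint(uri, "us-central1"))
```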


Run job in shared VPC network (internal IP addresses)

Feature Availability: This feature is available in Dataprep by Trifacta Premium only.

You can run your job in a VPC network that is shared with multiple projects. Configure the following settings in your Dataflow Execution Settings:

  • VPC Network Mode: Custom
  • Network: Do not modify. When a Subnetwork value is specified, Dataflow ignores the Network value. 
  • Subnetwork: Specify the subnetwork using a full URL. See below.

    NOTE: If the Subnetwork is located within a shared VPC network, you must specify the complete URL.

    NOTE: Additional subnet-level permissions may be required.

Full URL form: 

https://www.googleapis.com/compute/v1/projects/<HOST_PROJECT_ID>/regions/<REGION>/subnetworks/<SUBNETWORK>

where:

  • <HOST_PROJECT_ID> corresponds to the project identifier. This value must be between 6 and 30 characters. The value can contain only lowercase letters, digits, or hyphens. It must start with a letter. Trailing hyphens are prohibited.
  • <REGION> is the region to use. 

    NOTE: This value must match the Regional Endpoint value.

  • <SUBNETWORK> is the subnetwork identifier.
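
The project identifier rules above translate into a short validation step before the full URL is assembled. The following is a sketch under those stated rules only. The function name and the sample values are hypothetical.

```python
import re

# 6 to 30 characters: starts with a lowercase letter, contains only lowercase
# letters, digits, or hyphens, and does not end with a hyphen.
PROJECT_ID_RE = re.compile(r"[a-z][a-z0-9-]{4,28}[a-z0-9]")


def full_subnetwork_url(host_project_id, region, subnetwork):
    """Build the full subnetwork URL for a subnetwork in a shared VPC network."""
    if not PROJECT_ID_RE.fullmatch(host_project_id):
        raise ValueError(f"invalid host project identifier: {host_project_id!r}")
    return (
        "https://www.googleapis.com/compute/v1/"
        f"projects/{host_project_id}/regions/{region}/subnetworks/{subnetwork}"
    )


url = full_subnetwork_url("my-host-project", "us-central1", "shared-subnet")
print(url)
```

An identifier such as `abcde-` (trailing hyphen) or `1abcdef` (leading digit) raises a `ValueError` in this sketch.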

Subnet permissions for managed user service accounts

To execute the job on a shared VPC, you must set up subnet-level permissions for the managed user service account:

  1. In your host project, you must add the Cloud Dataprep Service Account with the role of Network user.
  2. If you are using a Shared VPC, you must enable access to the Shared VPC to the managed user service account. This account must be added as a member of the shared subnet permissions for your shared VPC. For more information, see https://cloud.google.com/vpc/docs/provisioning-shared-vpc.
  3. In the Dataflow Execution Settings:
    1. VPC Network Mode: Custom
    2. Network: Leave as default.
    3. Subnetwork: Specify the full URL, including the host project identifier, region, and subnetwork values.
    4. Service Account: Enter the name of the managed user service account.

For more information on subnet-level permissions, see https://cloud.google.com/vpc/docs/provisioning-shared-vpc#networkuseratsubnet

Run job API

You can also run jobs using the REST APIs.

Tip: You can pass in overrides to the Dataflow execution settings as part of your API request.

For more information, see API Workflow - Run Job.

Configure Machine Resources

By default, Dataflow attempts to select the appropriate machine for your job, based on the size of the job and any specified account-level settings. As needed, you can override these settings at the project level or for specific jobs through the Dataflow Execution Settings.

Tip: Unless performance issues related to your resource selections apply to all jobs in the project, you should make changes to your resources for individual output objects. If those changes improve performance and you are comfortable with the higher costs associated with the change, you can consider applying them through the Execution Settings page.

Choose machine type

A machine type is a set of virtualized hardware resources, including memory size, CPU, and persistent disk storage, which are assigned to a virtual machine (VM) responsible for executing your job.

Notes:

  • Billing for your job depends on the machine type (resources) that have been assigned to the job. If you select a more powerful machine type, you should expect higher costs for each job execution.
  • Dataprep by Trifacta provides a subset of available machine types from which you can select to execute your jobs. By default, Dataprep by Trifacta uses the machine type that you define in your Execution Settings page.
  • If you are experiencing long execution times and are willing to incur additional costs, you can select a more powerful machine type. 

To select a different machine type, choose your option from the Machine Type drop-down in the Dataflow Execution Settings. Higher numbers in the machine type name indicate more powerful machines. 

Machine scaling algorithms

Feature Availability: This feature is available in Dataprep by Trifacta Premium only.

By default, Dataprep by Trifacta Premium utilizes a scaling algorithm based on throughput to scale up or down the Google Compute Engine instances that are deployed to execute your job.

NOTE: Auto-scaling can increase the costs of job execution. If you use auto-scaling, you should specify a reasonable maximum limit.

Optionally, you can disable this scaling by setting Autoscaling algorithm to None.

Below, you can see the matrix of options.

| Auto-scaling Algorithm | Initial number of workers | Maximum number of workers |
| --- | --- | --- |
| Throughput based | Must be an integer between 1 and 1000, inclusive. NOTE: This number may be adjusted as part of job execution. | Must be an integer between 1 and 1000, inclusive. This value must be greater than the initial number of workers. |
| None | Must be an integer between 1 and 1000, inclusive. NOTE: This number determines the fixed number of Google Compute Engine instances that are launched when your job begins. | Not used. |
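
The constraints in the matrix above can be expressed as a small check. This is an illustrative sketch only: the algorithm identifiers `"throughput_based"` and `"none"` and the function name are hypothetical stand-ins for the values shown in the Dataflow Execution Settings.

```python
def validate_workers(algorithm, initial, maximum=None):
    """Return a list of problems with the worker settings (empty if valid)."""

    def in_range(n):
        # Must be an integer between 1 and 1000, inclusive (booleans excluded).
        return isinstance(n, int) and not isinstance(n, bool) and 1 <= n <= 1000

    errors = []
    if not in_range(initial):
        errors.append("initial number of workers must be an integer from 1 to 1000")

    if algorithm == "throughput_based":
        if not in_range(maximum):
            errors.append("maximum number of workers must be an integer from 1 to 1000")
        elif in_range(initial) and maximum <= initial:
            errors.append("maximum number of workers must be greater than the initial number")
    elif algorithm == "none":
        pass  # The maximum is not used when scaling is disabled.
    else:
        errors.append(f"unknown scaling algorithm: {algorithm!r}")
    return errors


print(validate_workers("throughput_based", 2, 10))
print(validate_workers("throughput_based", 10, 10))
print(validate_workers("none", 5))
```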

Change Billing Options

Use different service account

By default, Dataprep by Trifacta uses the service account that is configured for use with your project. In the Dataflow Execution Settings, enter the name for the Service Account to use.

To see the current service account for your project:

  1. Click the Trifacta platform icon at the bottom of the left nav bar. 
  2. In the Google Cloud Console, select IAM & Admin > Service Accounts.

For more information, see https://cloud.google.com/iam/docs/service-accounts

Apply job labels

As needed, you can add labels to your job. For billing purposes, labels can be applied so that expenses related to jobs are properly categorized within your Google account. 

  • Each label must have a unique key within your project. 
  • You can create up to 64 labels per project.
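
The two rules above can be checked before labels are applied. This is a minimal sketch that assumes labels are held as a list of key-value pairs. The function name and the sample labels are hypothetical.

```python
from collections import Counter

MAX_LABELS = 64  # Limit stated above: up to 64 labels per project.


def check_labels(labels):
    """Return a list of problems found in a list of (key, value) label pairs."""
    problems = []
    counts = Counter(key for key, _ in labels)
    duplicates = sorted(key for key, n in counts.items() if n > 1)
    if duplicates:
        problems.append("duplicate keys: " + ", ".join(duplicates))
    if len(labels) > MAX_LABELS:
        problems.append(f"too many labels: {len(labels)} > {MAX_LABELS}")
    return problems


print(check_labels([("team", "data-eng"), ("env", "prod")]))
print(check_labels([("team", "data-eng"), ("team", "analytics")]))
```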
