In Cloud Dataprep, most jobs to transform your data are executed by default on Dataflow, a managed service for executing data pipelines within the Google Cloud Platform. Cloud Dataprep has been designed to integrate with Dataflow and to take advantage of multiple features available in the service. This section describes how to execute a job on Dataflow, as well as its options.

Project owners can choose to enable an in-memory running environment that is hosted within Cloud Dataprep itself. This running environment yields faster performance on small- to medium-sized jobs. For more information, see Dataprep Project Settings Page.

Default Dataflow Jobs

Steps:

  1. To run a job, open the flow containing the recipe whose output you wish to generate.
  2. Locate the recipe. Click the recipe's output object icon. 
  3. On the right side of the screen, information about the output object is displayed. The output object defines:
    1. The type and location of the outputs, including filenames and method of updating.
    2. Profiling options
    3. Execution options 
  4. For now, you can ignore the options for the output object. Click Run Job.
  5. In the Run Job page, you can review the job as it is currently specified. 
  6. To run the job on Dataflow, select Dataflow.
  7. Click Run Job.
  8. The job is queued with default settings for execution on Dataflow.

Tracking progress

You can track the progress of your job from several areas of the application, including the Job Details page.

Download results

When your job has finished successfully, a Completed message is displayed in the Job Details page. 

Steps:

  1. In the Job Details page, click the Output Destinations tab.
  2. The generated outputs are listed. For each output, you can select download or view choices from the context menu on the right side of the screen. 

Publish

If you have created the connections to do so, you can choose to publish your generated results to external systems. In the Output Destinations tab, click the Publish link.

NOTE: To publish to an external datastore, you must have a connection to it configured through the Connections page.

Output Options

When specifying the job you wish to run, you can define the following types of output options.

Profiling

When you select the Profiling checkbox, a visual profile of your results is generated as part of your data transformation job. This visual profile can be useful for identifying any remaining issues with your data after the transformation is complete. 

Tip: Use visual profiling when you are building your recipes. It can also be useful as a quick check of outputs for production flows.

If you enabled visual profiling, the visual profile is available for review through the Profile tab in the Job Details page when the job completes. 

Tip: You can download PDF and JSON versions of the visual profile for offline analysis in the Job Details page.

For more information, see Overview of Visual Profiling.

Publishing actions

For each output object, you can define one or more publishing actions. A publishing action specifies the following:


For publishing to GCS:
  • Type of output: file type
  • Location of output: path
  • Name: filename
  • Update method: create, append, or replace
  • Other options: headers, quotes, delimiters, multi-part options, and compression

For publishing to BigQuery:
  • Type of output: table
  • Location of output: database
  • Name: table name
  • Update method: create, append, truncate, or drop

Parameterized destination

You can parameterize the output filename or table name as needed.

Execution Overrides

You can override settings for several aspects of job execution on Dataflow, including the region and zone, the VPC network, the machine resources, and the service account used to run the job.

These settings can be specified at the project level or at the individual output object level.

Some examples of how these settings can be used are provided below.

Run Job Options

Run job in a different region and zone

If needed, you can run your job in a different region and zone. 

You might want to change the default settings, for example, to run the job closer to where your source data is stored or to meet regional requirements for your data.

Steps:

In the Dataflow Execution Settings:

  1. Region: Choose the new Regional Endpoint from the drop-down list. 
  2. Zone: By default, the zone within the selected region is auto-selected for you. As needed, you can select a specific zone.

    Tip: Unless you have a specific reason to do so, you should leave the Zone value at Auto-Select to allow the platform to choose it for you.
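
For example, if your source data is stored in the us-central1 region (a hypothetical case), you might select us-central1 as the Regional Endpoint and leave the Zone at Auto-Select.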

Run job in custom VPC network

Cloud Dataprep supports job execution in Virtual Private Cloud (VPC) networks in two modes: the default network mode, in which the network is selected for you, and custom network mode, in which you specify the VPC network or subnetwork to use.

For more information on Google Virtual Private Clouds (VPCs), see https://cloud.google.com/vpc/docs/overview.

Public vs. internal IP addresses

If the VPC Network mode is set to Custom, then choose whether the Dataflow workers for your job use public IP addresses or internal IP addresses only.

Run job in custom VPC using Network value (internal IP addresses)

You can specify the VPC network to use in the Network value. 

NOTE: This network must be in the region that you have specified for the job. Do not specify a Subnetwork value.

NOTE: The Network must have Private Google Access enabled.

Run job in custom VPC using Subnetwork value (internal IP addresses)

For a subnetwork that is associated with your project, you can specify the subnetwork using a short URI.

NOTE: If the Subnetwork value is specified, then the Network value should be set to default. The network is then chosen for you based on the subnetwork.


NOTE: The Subnetwork must have Private Google Access enabled.


Short URI form:

regions/<REGION>/subnetworks/<SUBNETWORK>

where:

  • <REGION> is the region of the subnetwork. This must match the region that you have specified for the job.
  • <SUBNETWORK> is the name of the subnetwork.
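
For example, assuming a hypothetical subnetwork named mysubnet in the us-central1 region, the short URI would be:

regions/us-central1/subnetworks/mysubnet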

Run job in shared VPC network (internal IP addresses)

You can run your job in a VPC network that is shared across multiple projects. In the Dataflow Execution Settings, set the VPC Network Mode to Custom, leave the Network value as default, and specify the Subnetwork as a full URL, as described below.

Full URL form: 

https://www.googleapis.com/compute/v1/projects/<HOST_PROJECT_ID>/regions/<REGION>/subnetworks/<SUBNETWORK>

where:

  • <HOST_PROJECT_ID> is the identifier of the host project that owns the shared VPC.
  • <REGION> is the region of the subnetwork.
  • <SUBNETWORK> is the name of the shared subnetwork.
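
For example, assuming a hypothetical host project my-host-project and a shared subnetwork named shared-subnet in the us-central1 region, the full URL would be:

https://www.googleapis.com/compute/v1/projects/my-host-project/regions/us-central1/subnetworks/shared-subnet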

Subnet permissions for managed user service accounts

To execute the job on a shared VPC, you must set up subnet-level permissions for the managed user service account:

  1. In your host project, add the Cloud Dataprep Service Account with the Network User role.
  2. If you are using a Shared VPC, grant the managed user service account access to the Shared VPC. This account must be added as a member of the shared subnet permissions for your shared VPC. For more information, see https://cloud.google.com/vpc/docs/provisioning-shared-vpc.
  3. In the Dataflow Execution Settings:
    1. VPC Network Mode: Custom
    2. Network: Leave as default.
    3. Subnetwork: Specify the full URL, including the host project identifier, region, and subnetwork values.
    4. Service Account: Enter the name of the managed user service account.

For more information on subnet-level permissions, see https://cloud.google.com/vpc/docs/provisioning-shared-vpc#networkuseratsubnet.

Run job API

You can also run jobs using the REST APIs.

Tip: You can pass in overrides to the Dataflow execution settings as part of your API request.

For more information, see API Workflow - Run Job.
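
The sketch below shows what such a request might look like in Python, using the requests library. It assumes the /v4/jobGroups endpoint described in API Workflow - Run Job, a hypothetical recipe (wrangled dataset) identifier, and a placeholder access token; the overrides payload shown is illustrative, so consult the API documentation for the exact field names and values supported.

import requests

# Hypothetical values: substitute your own access token and the id of the
# recipe (wrangled dataset) whose output you want to generate.
API_BASE = "https://api.clouddataprep.com"  # assumed API base URL
ACCESS_TOKEN = "<your-access-token>"
RECIPE_ID = 28629  # hypothetical identifier

payload = {
    "wrangledDataset": {"id": RECIPE_ID},
    # Illustrative overrides for the Dataflow execution settings; check the
    # API documentation for the exact schema your version supports.
    "overrides": {
        "execution": "dataflow",
        "profiler": True,
    },
}

response = requests.post(
    f"{API_BASE}/v4/jobGroups",
    headers={
        "Authorization": f"Bearer {ACCESS_TOKEN}",
        "Content-Type": "application/json",
    },
    json=payload,
)
response.raise_for_status()

# The response identifies the queued job group, which you can poll for status.
print(response.json())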

Configure Machine Resources

By default, Cloud Dataprep attempts to select the appropriate machine for your job, based on the size of the job and any specified account-level settings. As needed, you can override these settings at the project level or for specific jobs through the Dataflow Execution Settings.

Tip: Unless performance issues related to your resource selections apply to all jobs in the project, you should make changes to your resources for individual output objects. If those changes improve performance and you are comfortable with the higher costs associated with the change, you can consider applying them through the Execution Settings page.

Choose machine type

A machine type is a set of virtualized hardware resources, including memory size, CPU, and persistent disk storage, which are assigned to a virtual machine (VM) responsible for executing your job. 

To select a different machine type, choose your option from the Machine Type drop-down in the Dataflow Execution Settings. Higher numbers in the machine type name indicate more powerful machines; for example, an n1-standard-16 instance has more vCPUs and memory than an n1-standard-4 instance. 

Machine scaling algorithms

By default, Dataflow utilizes a scaling algorithm based on throughput to scale up or down the number of Google Compute Engine instances that are deployed to execute your job. 

NOTE: Auto-scaling can increase the costs of job execution. If you use auto-scaling, you should specify a reasonable maximum limit.

Optionally, you can disable this scaling by setting the Autoscaling algorithm to None.

Below, you can see the matrix of options.

Throughput based:
  • Initial number of workers: Must be an integer between 1 and 1000, inclusive. NOTE: This number may be adjusted as part of job execution.
  • Maximum number of workers: Must be an integer between 1 and 1000, inclusive. This value must be greater than the initial number of workers.

None:
  • Initial number of workers: Must be an integer between 1 and 1000, inclusive. NOTE: This number determines the fixed number of Google Compute Engine instances that are launched when your job begins.
  • Maximum number of workers: Not used.

Change Billing Options

Use different service account

By default, Cloud Dataprep uses the service account that is configured for use with your project. To use a different service account, enter its name in the Service Account field of the Dataflow Execution Settings. 

NOTE: Under the Permissions tab, please verify that Include Google-provided role grants is selected.

To see the current service account for your project:

  1. Click the Console icon at the bottom of the left nav bar to open the Google Cloud Console. 
  2. In the Google Cloud Console, select IAM & Admin > Service Accounts.

For more information, see https://cloud.google.com/iam/docs/service-accounts.

Apply job labels

As needed, you can add labels to your job. For billing purposes, labels can be applied so that expenses related to jobs are properly categorized within your Google account. For example, you might apply a label such as cost-center: data-prep (a hypothetical value) so that the Dataflow charges for these jobs can be grouped in your billing reports.