Project owners can choose to enable Trifacta Photon, an in-memory running environment that yields faster performance on small- to medium-sized jobs. For more information, see the Dataprep Project Settings Page.
You can track the progress of your job through the Job Details page. When your job has finished successfully, a Completed message is displayed there.
If you have created the connections to do so, you can choose to publish your generated results to external systems.
NOTE: To publish to an external datastore, you must have a connection configured through the Connections page.
Steps:
In the Output Destinations tab, click the Publish link.
When specifying the job you wish to run, you can define the following types of output options.
When you select the Profiling checkbox, a visual profile of your results is generated as part of your data transformation job. This visual profile can be useful for identifying any remaining issues with your data after the transformation is complete.
Tip: Use visual profiling when you are building your recipes. It can also be useful as a quick check of outputs for production flows.
When the job is complete and visual profiling was enabled, the visual profile is available for review through the Profile tab in the Job Details page.
Tip: You can download PDF and JSON versions of the visual profile from the Job Details page for offline analysis.
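For example, the visual profile can show the proportion of valid, mismatched, and missing values remaining in each column after your recipe runs.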
For more information, see Overview of Visual Profiling.
For each output object, you can define one or more publishing actions. A publishing action specifies the following:
| | GCS | BigQuery |
| --- | --- | --- |
| Type of output | file type | table |
| Location of output | path | database |
| Name | filename | table name |
| Update method | create, append, replace | create, append, truncate, drop |
| Other options | | |
You can parameterize the output filename or table name as needed.
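To make the structure of a publishing action concrete, here is a minimal sketch in Python. The field names mirror the table above; they, the bucket, dataset, and table names are illustrative assumptions, not the exact API schema:

```python
# Illustrative publishing-action definitions, one per output destination.
# Field names follow the table above; they are assumptions, not the exact schema.
gcs_action = {
    "type": "file",                      # type of output
    "path": "gs://example-bucket/out/",  # location: GCS path (hypothetical bucket)
    "filename": "orders.csv",            # name of the output file
    "action": "append",                  # update method: create, append, or replace
}

bigquery_action = {
    "type": "table",                # type of output
    "database": "example_dataset",  # location: BigQuery dataset (hypothetical)
    "table_name": "orders",         # name of the target table
    "action": "truncate",           # update method: create, append, truncate, or drop
}

publishing_actions = [gcs_action, bigquery_action]
```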
You can specify settings for the following aspects of job execution on Cloud Dataflow:
- Region and zone where the job is executed
- Network and subnetwork where the job is executed
- Machine type and auto-scaling behavior for the job
- Service account and labels applied to the job
These settings can be specified at the project level or at the individual output object level. Some examples of how these settings can be used are provided below.
If needed, you can run your job in a different region and zone. A zone is a subdivision of a region.
You might want to change the default settings, for example, to execute the job closer to where your data is stored.
Steps:
In the Dataflow Execution Settings:
Regional Endpoint: As needed, select the region in which the job is executed.
Zone: By default, the zone within the selected region is auto-selected for you. As needed, you can select a specific zone.
Tip: Unless you have a specific reason to do so, you should leave the Zone value at the auto-selected default.
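For example, you might select us-central1 as the region and us-central1-a as a specific zone within it.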
Cloud Dataprep supports job execution in the following Virtual Private Cloud (VPC) network modes:
Auto: (default) The job is executed over publicly available IP addresses, using the VPC Network and Subnetwork settings determined by Cloud Dataflow.
NOTE: In Auto mode, do not set values in the Dataflow Execution Settings for Network, Subnetwork, or (if available) Worker IP address configuration. These settings are ignored in Auto mode.
Custom: Optionally, you can customize the VPC network settings applied to your job, including use of a private network. Set the VPC Network Mode to Custom and apply the additional settings described below.
NOTE: Custom network settings do not apply to data previewing or sampling, which use the default network settings.
For more information on Google Virtual Private Clouds (VPCs), see https://cloud.google.com/vpc/docs.
If the VPC Network Mode is set to Custom, then choose one of the following:
- Allow public IP addresses
- Use internal IP addresses only

You can specify the VPC network to use in the Network value.
NOTE: This network must be in the region that you have specified for the job. Do not specify a Subnetwork value.
NOTE: The Network must have Private Google Access enabled.
For a subnetwork that is associated with your project, you can specify the subnetwork using a short URI.
NOTE: If the Subnetwork value is specified, then the Network value should be set to default.
NOTE: The Subnetwork must have Private Google Access enabled.
Short URI form:
regions/<REGION>/subnetworks/<SUBNETWORK>
where:
<REGION> is the region to use.
NOTE: This value must match the Regional Endpoint value.
<SUBNETWORK> is the subnetwork identifier.
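For example, a subnetwork named example-subnet (a hypothetical name) in the us-central1 region would be specified as:
regions/us-central1/subnetworks/example-subnet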
You can run your job in a VPC network that is shared across multiple projects. Configure the following settings in the Dataflow Execution Settings:
VPC Network Mode: Set to Custom.
Subnetwork: Specify the subnetwork using a full URL. See below.
NOTE: If the Subnetwork is located within a shared VPC network, you must specify the complete URL.
NOTE: Additional subnet-level permissions may be required.
Full URL form:
https://www.googleapis.com/compute/v1/projects/<HOST_PROJECT_ID>/regions/<REGION>/subnetworks/<SUBNETWORK>
where:
<HOST_PROJECT_ID> corresponds to the identifier of the host project. This value must be between 6 and 30 characters and can contain only lowercase letters, digits, and hyphens. It must start with a letter; trailing hyphens are prohibited.
<REGION> is the region to use.
NOTE: This value must match the Regional Endpoint value.
<SUBNETWORK> is the subnetwork identifier.
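The constraints above can be checked before submitting a job. Here is a minimal sketch in Python, assuming hypothetical project, region, and subnetwork names:

```python
import re

def subnetwork_url(host_project_id: str, region: str, subnetwork: str) -> str:
    """Build the full subnetwork URL, validating the host project ID against
    the constraints above: 6-30 characters, lowercase letters, digits, and
    hyphens only, starting with a letter, with no trailing hyphen."""
    if not re.fullmatch(r"[a-z][a-z0-9-]{4,28}[a-z0-9]", host_project_id):
        raise ValueError(f"invalid host project ID: {host_project_id!r}")
    return (
        "https://www.googleapis.com/compute/v1/"
        f"projects/{host_project_id}/regions/{region}/subnetworks/{subnetwork}"
    )

# Hypothetical names for illustration only.
print(subnetwork_url("shared-vpc-host", "us-central1", "example-subnet"))
```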
To execute the job on a shared VPC, you must set up subnet-level permissions for the managed user service account (the Cloud Dataprep Service Account), granting it the role of Network User on the subnet. For more information on subnet-level permissions, see https://cloud.google.com/vpc/docs/provisioning-shared-vpc#networkuseratsubnet.
You can also run jobs using the REST APIs.
Tip: You can pass in overrides to the Dataflow execution settings as part of your API request.
For more information, see API Workflow - Run Job.
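As an illustrative sketch (not the definitive API contract), a job-run request with overrides might look like the following in Python. The endpoint, token, output object ID, and override field names shown here are assumptions for illustration; check them against the API documentation:

```python
import requests

API_BASE = "https://api.clouddataprep.com"  # assumed endpoint
TOKEN = "YOUR_ACCESS_TOKEN"                 # placeholder credential

# Hypothetical request body: run the recipe attached to output object 21,
# enabling visual profiling and Dataflow execution. Field names are
# assumptions for illustration only.
body = {
    "wrangledDataset": {"id": 21},
    "overrides": {
        "profiler": True,
        "execution": "dataflow",
    },
}

resp = requests.post(
    f"{API_BASE}/v4/jobGroups",
    json=body,
    headers={"Authorization": f"Bearer {TOKEN}"},
)
resp.raise_for_status()
print(resp.json())  # response includes the job group ID for status polling
```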
By default, Cloud Dataprep attempts to select the appropriate machine type for your job, based on the size of the job and any specified account-level settings. As needed, you can override these settings at the project level or for specific jobs through the Dataflow Execution Settings.
Tip: Unless performance issues related to your resource selections apply to all jobs in the project, you should make changes to your resources for individual output objects. If those changes improve performance and you are comfortable with the higher costs associated with the change, you can consider applying them through the Execution Settings page.
A machine type is a set of virtualized hardware resources, including memory size, CPU, and persistent disk storage, which are assigned to a virtual machine (VM) responsible for executing your job.
If you are experiencing long execution times and are willing to incur additional costs, you can select a more powerful machine type.
To select a different machine type, choose your option from the Machine Type drop-down in the Dataflow Execution Settings. Higher numbers in the machine type name indicate more powerful machines.
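For example, among Compute Engine machine types, n1-standard-16 provides more vCPUs and memory than n1-standard-4, at a correspondingly higher cost; the specific types available in the drop-down may vary.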
By default, Cloud Dataflow utilizes a scaling algorithm based on throughput to scale up or down the Google Compute Engine instances that are deployed to execute your job.
NOTE: Auto-scaling can increase the costs of job execution. If you use auto-scaling, you should specify a reasonable maximum limit.
Optionally, you can disable this scaling by setting the Autoscaling algorithm to None.
Below, you can see the matrix of options.
| Auto-scaling Algorithm | Initial number of workers | Maximum number of workers |
| --- | --- | --- |
| Throughput based | Must be an integer between 1 and 1000, inclusive. | Must be an integer between 1 and 1000, inclusive. This value must be greater than the initial number of workers. |
| None | Must be an integer between 1 and 1000, inclusive. | Not used. |
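For example, with throughput-based auto-scaling, you might set the initial number of workers to 5 and the maximum to 50; the service then adds workers beyond the initial 5 only when throughput demands it.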
By default, Cloud Dataprep uses the service account that is configured for use with your project. In the Dataflow Execution Settings, enter the name of the Service Account to use.
To see the current service account for your project, open the IAM page in the Google Cloud console.
NOTE: Under the Permissions tab, verify that Include Google-provided role grants is selected.
For more information, see https://cloud.google.com/iam/docs/service-accounts.
As needed, you can add labels to your job. For billing purposes, labels can be applied so that expenses related to jobs are properly categorized within your Google account.
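For example, a label such as cost-center : analytics (a hypothetical key and value) can be attached to your job so that its charges are grouped under that category in billing reports.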