
D toc

In 

D s product
, the jobs that transform your data are executed on 
D s dataflow
, a managed service for executing data pipelines within the 
D s platform
. 
D s product
 is designed to integrate with 
D s dataflow
 and to take advantage of multiple features available in the service. This section describes how to execute a job on 
D s dataflow
, as well as the available options.

Default 
D s dataflow
 Jobs

Steps:

  1. To run a job, open the flow containing the recipe whose output you wish to generate.
  2. Locate the recipe. Click the recipe's output object icon. 
  3. On the right side of the screen, information about the output object is displayed. The output object defines:
    1. The type and location of the outputs, including filenames and method of updating.
    2. Profiling options
    3. Execution options 
  4. For now, you can ignore the options for the output object. Click Run Job.
  5. In the Run Job page, you can review the job as it is currently specified. Click Run Job.
  6. The job is queued with default settings for execution on 
    D s dataflow
    .

For more information, see Run Job Page.

Tracking progress

You can track progress of your job through the following areas:

  • Flow View: select the output object. On the right side of the screen, click the Jobs tab. Your job in progress is listed. See Flow View Page.
  • Job Details Page: Click the link in the Jobs tab. You can review progress and individual details related to your job. See Job Details Page.

Download results

When your job has finished successfully, a Completed message is displayed in the Job Details page. 

Steps:

  1. In the Job Details page, click the Output Destinations tab.
  2. The generated outputs are listed. For each output, you can download or view results through the context menu on the right side of the screen. 

Publish

If you have created the connections to do so, you can choose to publish your generated results to external systems. In the Output Destinations tab, click the Publish link.

Info

NOTE: You must have a connection configured to publish to an external datastore. For more information, see Connections Page.

For more information, see Publishing Dialog.

Output Options

When specifying the job you wish to run, you can define the following types of output options.

Profiling

When you select the Profiling checkbox, a visual profile of your generated results is created as part of your data transformation job. This visual profile can help you identify any remaining issues with your data after the transformation is complete. 

Tip

Tip: Use visual profiling when you are building your recipes. It can also be useful as a quick check of outputs for production flows.

If you enabled visual profiling, the visual profile is available for review through the Profile tab in the Job Details page when the job completes. 

Tip

Tip: You can download PDF and JSON versions of the visual profile for offline analysis.

See Job Details Page.

For more information, see Overview of Visual Profiling.

Publishing actions

For each output object, you can define one or more publishing actions. A publishing action specifies the following:


GCS:
  • Type of output: file type
  • Location of output: path
  • Name: filename
  • Update method: create, append, or replace
  • Other options: headers, quotes, delimiters, multi-part options, compression

BigQuery:
  • Type of output: table
  • Location of output: database
  • Name: table name
  • Update method: create, append, truncate, or drop

Parameterized destination

You can parameterize the output filename or table name as needed.

  • Parameter values can be defined at the flow level through Flow View. For more information, see Manage Parameters Dialog.
  • These parameter values can be passed into the running environment and inserted into the output filename or table name. 
  • For more information, see Overview of Parameterization.
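As a rough illustration, parameter substitution into an output name can be sketched as follows. The `{name}` placeholder syntax and the parameter names are hypothetical and chosen only for this sketch; actual parameters are defined and resolved by the platform through Flow View.

```python
# Illustrative sketch only: the "{name}" placeholder syntax and the
# parameter names below are hypothetical, not the product's own syntax.
def resolve_destination(pattern: str, params: dict) -> str:
    """Substitute flow-level parameter values into an output name."""
    resolved = pattern
    for name, value in params.items():
        resolved = resolved.replace("{" + name + "}", value)
    return resolved

filename = resolve_destination(
    "sales_{region}_{runDate}.csv",
    {"region": "emea", "runDate": "2021-06-01"},
)
print(filename)  # sales_emea_2021-06-01.csv
```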

Execution Overrides

You can specify some settings on the following aspects of job execution on 

D s dataflow
:

  • Endpoint, region, and zone where the job is executed
  • Machine resources and billing account to use for the job
  • Network and subnetwork where the job is executed

These settings can be specified at the project level or at the individual output object level:

  • Project settings: At the project level, you can define the execution options for jobs. By default, all jobs executed from flows within the project use these settings. For more information, see Project Settings Page.
  • Output object settings: The execution settings in the Project Settings page can be overridden at the output object level. When you define an individual output object for a recipe, the execution settings that you specify in the Run Job page apply whenever the outputs are generated for this flow. See Dataflow Execution Settings.

Some examples of how these settings can be used are provided below.

Run Job Options

Run job in a different region and zone

If needed, you can run your job in a different region and zone. 

  • The region determines the geographic location in which your 
    D s dataflow
     job is executed.
  • The zone is a sub-section of the region.

You might want to change the default settings for the following reasons:

  • Security and compliance: You may need to constrain your 
    D s dataflow
     work to a specific region for your enterprise's security requirements.
  • Data locality: If you know your project data is stored in a specific region, you may wish to set the job to run in this region to minimize network latency and potential costs associated with cross-region execution.
  • Resilience: If there are outages in your default 
    D s platform
     region, you may need to switch regions.
  • For more information, see https://cloud.google.com/dataflow/docs/concepts/regional-endpoints.

Steps:

In the Dataflow Execution Settings:

  1. Region: Choose the new Regional Endpoint from the drop-down list. 
  2. Zone: By default, the zone within the selected region is auto-selected for you. As needed, you can select a specific zone.

    Tip

    Tip: Unless you have a specific reason to do so, you should leave the Zone value at Auto-Select to allow the platform to choose it for you.

Run job in custom VPC network

D s product
 supports job execution in the following Virtual Private Cloud (VPC) network modes:

  • Auto: (default) 

    D s dataflow
     job is executed over publicly available IP addresses using the VPC Network and Subnetwork settings determined by 
    D s platform
    .

    Info

    NOTE: In Auto mode, do not set values in the Dataflow Execution Settings for Network, Subnetwork, or (if available) Worker IP address configuration. These settings are ignored in Auto mode.

  • Custom: Optionally, you can customize the VPC network settings that are applied to your job, including the use of a private network. Set the VPC Network Mode to Custom and configure the additional settings described below.

    Info

    NOTE: Custom network settings do not apply to data previewing or sampling, which use the default network settings.

For more information on Google Virtual Private Clouds (VPCs), see the Google Cloud VPC documentation.

Public vs. internal IP addresses

If the VPC Network mode is set to Custom, then choose one of the following:

  • Allow public IP addresses - 
    D s dataflow
     workers are available through public IP addresses. No further configuration is required.
  • Use internal IP addresses only - 
    D s dataflow
     workers use private IP addresses for all communication. Additional configuration is described below.

Run job in custom VPC using Network value (internal IP addresses)

You can specify the VPC network to use in the Network value. 

Info

NOTE: This network must be in the region that you have specified for the job. Do not specify a Subnetwork value.

Info

NOTE: The Network must have Private Google Access enabled.


Run job in custom VPC using Subnetwork value (internal IP addresses)

For a subnetwork that is associated with your project, you can specify the subnetwork using a short URI.

Info

NOTE: If the Subnetwork value is specified, then the Network value should be set to default.

D s dataflow
chooses the Network for you.


Info

NOTE: The Subnetwork must have Private Google Access enabled.


Short URI form:

Code Block
regions/<REGION>/subnetworks/<SUBNETWORK>

where:

  • <REGION> is the region to use. 

    Info

    NOTE: This value must match the Regional Endpoint value.

  • <SUBNETWORK> is the subnetwork identifier.
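The short URI above can be assembled programmatically. A minimal sketch, using hypothetical region and subnetwork names; the check mirrors the note that the region must match the Regional Endpoint:

```python
def short_subnetwork_uri(region: str, subnetwork: str, regional_endpoint: str) -> str:
    """Build the short URI for a project-local subnetwork.

    The region must match the job's Regional Endpoint (see the note above).
    """
    if region != regional_endpoint:
        raise ValueError("region must match the Regional Endpoint")
    return f"regions/{region}/subnetworks/{subnetwork}"

# "us-central1" and "my-subnet" are hypothetical example values.
print(short_subnetwork_uri("us-central1", "my-subnet", "us-central1"))
# regions/us-central1/subnetworks/my-subnet
```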


Run job in shared VPC network (internal IP addresses)

D s ed
editionsgdppr

You can run your job in a VPC network that is shared across multiple projects. Configure the following settings in your Dataflow Execution Settings:

  • VPC Network Mode: Custom
  • Network: Do not modify. When a Subnetwork value is specified,
    D s dataflow
    ignores the Network value. 
  • Subnetwork: Specify the subnetwork using a full URL. See below.

    Info

    NOTE: If the Subnetwork is located within a shared VPC network, you must specify the complete URL.

    Info

    NOTE: Additional subnet-level permissions may be required.

Full URL form: 

Code Block
https://www.googleapis.com/compute/v1/projects/<HOST_PROJECT_ID>/regions/<REGION>/subnetworks/<SUBNETWORK>

where:

  • <HOST_PROJECT_ID> corresponds to the project identifier. This value must be between 6 and 30 characters. The value can contain only lowercase letters, digits, or hyphens. It must start with a letter. Trailing hyphens are prohibited.
  • <REGION> is the region to use. 

    Info

    NOTE: This value must match the Regional Endpoint value.

  • <SUBNETWORK> is the subnetwork identifier.
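The full URL form and the host project ID rules above can be sketched as follows; the project, region, and subnetwork names used are hypothetical examples.

```python
import re

def full_subnetwork_url(host_project_id: str, region: str, subnetwork: str) -> str:
    """Build the full subnetwork URL for a shared VPC.

    The host project ID rules mirror the text above: 6-30 characters;
    lowercase letters, digits, or hyphens; must start with a letter;
    trailing hyphens are prohibited.
    """
    if not re.fullmatch(r"[a-z][a-z0-9-]{4,28}[a-z0-9]", host_project_id):
        raise ValueError(f"invalid host project ID: {host_project_id!r}")
    return (
        "https://www.googleapis.com/compute/v1/projects/"
        f"{host_project_id}/regions/{region}/subnetworks/{subnetwork}"
    )

# "host-proj-01", "us-central1", and "shared-subnet" are hypothetical values.
print(full_subnetwork_url("host-proj-01", "us-central1", "shared-subnet"))
```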

Subnet permissions for managed user service accounts

To execute the job on a shared VPC, you must set up subnet-level permissions for the managed user service account:

  1. In your host project, you must add the Cloud Dataprep Service Account with the role of Network User.
  2. If you are using a Shared VPC, you must grant the managed user service account access to the Shared VPC. This account must be added as a member of the shared subnet permissions for your shared VPC. For more information, see https://cloud.google.com/vpc/docs/provisioning-shared-vpc.
  3. In the Dataflow Execution Settings:
    1. VPC Network Mode: Custom
    2. Network: Leave as default.
    3. Subnetwork: Specify the full URL, including the host project identifier, region, and subnetwork values.
    4. Service Account: Enter the name of the managed user service account.

For more information on subnet-level permissions, see https://cloud.google.com/vpc/docs/provisioning-shared-vpc#networkuseratsubnet

Run job API

You can also run jobs using the REST APIs.

Tip

Tip: You can pass in overrides to the dataflow execution settings as part of your API request.

For more information, see API Workflow - Run Job.
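As a rough sketch, a run-job request with execution-setting overrides might be assembled as below. The endpoint path, field names, and ID value are assumptions made for illustration only; consult API Workflow - Run Job for the actual request shape.

```python
import json

# Hypothetical request body: the field names and the id value below are
# illustrative assumptions, not the documented API schema.
payload = {
    "wrangledDataset": {"id": 12345},
    "overrides": {
        "execution": "dataflow",
        "profiler": True,
        "runParameters": {
            "region": "us-central1",         # Dataflow execution override
            "machineType": "n1-standard-4",  # Dataflow execution override
        },
    },
}
body = json.dumps(payload)

# The request itself would resemble the following (not executed here):
# requests.post(f"{BASE_URL}/v4/jobGroups", headers=auth_headers, data=body)
print(body)
```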

Configure Machine Resources

By default, 

D s dataflow
 attempts to select the appropriate machine for your job, based on the size of the job and any specified account-level settings. As needed, you can override these settings at the project level or for specific jobs through the Dataflow Execution Settings.

Tip

Tip: Unless performance issues related to your resource selections apply to all jobs in the project, you should make changes to your resources for individual output objects. If those changes improve performance and you are comfortable with the higher costs associated with the change, you can consider applying them through the Project Settings page for all jobs in the project.

Choose machine type

A machine type is a set of virtualized hardware resources, including memory, CPU, and persistent disk storage, that is assigned to a virtual machine (VM) responsible for executing your job. 

Notes:

  • Billing for your job depends on the machine type (resources) that have been assigned to the job. If you select a more powerful machine type, you should expect higher costs for each job execution.
  • D s product
     provides a subset of available machine types from which you can select to execute your jobs. By default, 
    D s product
     uses a machine type that you define in the Project Settings page.
  • If you are experiencing long execution times and are willing to incur additional costs, you can select a more powerful machine type. 

To select a different machine type, choose your option from the Machine Type drop-down in the Dataflow Execution Settings. Higher numbers in the machine type name indicate more powerful machines. 

Machine scaling algorithms

D s ed
editionsgdppr

By default, 

D s product
 utilizes a scaling algorithm based on throughput to scale up or down the Google Compute Engine instances that are deployed to execute your job. 

Info

NOTE: Auto-scaling can increase the costs of job execution. If you use auto-scaling, you should specify a reasonable maximum limit.

Optionally, you can disable this scaling by setting Autoscaling algorithm to None.

Below, you can see the matrix of options.

Throughput based:

  • Initial number of workers: Must be an integer between 1 and 1000, inclusive.

    Info

    NOTE: This number may be adjusted as part of job execution.

  • Maximum number of workers: Must be an integer between 1 and 1000, inclusive. This value must be greater than the initial number of workers.

None:

  • Initial number of workers: Must be an integer between 1 and 1000, inclusive.

    Info

    NOTE: This number determines the fixed number of Google Compute Engine instances that are launched when your job begins.

  • Maximum number of workers: Not used.
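The constraints in the matrix above can be summarized in a small validation sketch. The algorithm names are taken from the matrix; the function itself is illustrative, not part of the product.

```python
def validate_scaling(algorithm: str, initial_workers: int, max_workers=None) -> bool:
    """Check worker counts against the constraints in the matrix above."""
    if not 1 <= initial_workers <= 1000:
        raise ValueError("initial number of workers must be between 1 and 1000")
    if algorithm == "Throughput based":
        if max_workers is None or not 1 <= max_workers <= 1000:
            raise ValueError("maximum number of workers must be between 1 and 1000")
        if max_workers <= initial_workers:
            raise ValueError("maximum must be greater than the initial number of workers")
    elif algorithm == "None":
        pass  # fixed fleet: the maximum number of workers is not used
    else:
        raise ValueError(f"unknown algorithm: {algorithm}")
    return True

print(validate_scaling("Throughput based", 5, 50))  # True
print(validate_scaling("None", 10))                 # True
```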

Change Billing Options

Use different service account

By default, 

D s product
 uses the service account that is configured for your project. To use a different service account, enter its name in the Service Account field in the Dataflow Execution Settings. 

To see the current service account for your project:

  1. Click the 
    D s platform
     icon at the bottom of the left nav bar. 
  2. In the Google Cloud Console, select IAM & Admin > Service Accounts.

For more information, see https://cloud.google.com/iam/docs/service-accounts?_ga=2.77818962.-730391614.1565820652

Apply job labels

As needed, you can add labels to your job. For billing purposes, labels can be applied so that expenses related to jobs are properly categorized within your Google account. 

  • Each label must have a unique key within your project. 
  • You can create up to 64 labels per project.
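A small sketch of these label constraints, using hypothetical label keys; dictionary keys are unique by construction, which mirrors the unique-key requirement above.

```python
def validate_labels(labels: dict) -> dict:
    """Check job labels against the constraints above.

    Keys are unique by construction in a dict; the 64-label cap comes
    from the text above.
    """
    if len(labels) > 64:
        raise ValueError("at most 64 labels are allowed per project")
    return labels

# "cost-center" and "team" are hypothetical label keys.
print(validate_labels({"cost-center": "analytics", "team": "data-eng"}))
```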