
D toc

Excerpt

In 

D s product
,

...

most jobs to transform your data are executed by default on 

D s dataflow
, a managed service for executing data pipelines within the 
D s platform
. 
D s product
 has been designed to integrate with 
D s dataflow
 and to take advantage of multiple features available in the service. This section describes how to execute a job on 
D s dataflow
, as well as its options.

Project owners can choose to enable 

D s photon
, an in-memory running environment hosted on the 
D s node
. This running environment yields faster performance on small- to medium-sized jobs. For more information, see Dataprep Project Settings Page.

Default 
D s dataflow
 Jobs

...

  1. To run a job, open the flow containing the recipe whose output you wish to generate.
  2. Locate the recipe. Click the recipe's output object icon. 
  3. On the right side of the screen, information about the output object is displayed. The output object defines:
    1. The type and location of the outputs, including filenames and method of updating.
    2. Profiling options
    3. Execution options 
  4. For now, you can ignore the options for the output object. Click Run Job.
  5. In the Run Job page, you can review the job as it is currently specified. 
  6. To run the job on 
    D s dataflow
    , select Dataflow.
  7. Click Run Job.
  8. The job is queued with default settings for execution on 
    D s dataflow
    .

...

  1. In the Job Details page, click the Output Destinations tab.
  2. The generated outputs are listed. For each output, you can select options to download or view the results from the context menu on the right side of the screen. 

Publish

If your deployment has been configured and you have created the connections to do so, you can choose to publish your generated results to external systems. In the Output Destinations tab, click the Publish link.

...

These settings can be specified at the project level or at the individual output object level:

  • Project Execution settings: Within your preferences, you can define your execution options for jobs. By default, all of your jobs executed from flows within the project use these settings. For more information, see Execution Settings Page.
  • Output object settings: The execution settings in the Project Execution Settings page can be overridden at the output object level. When you define an individual output object for a recipe, the Dataflow execution settings that you specify in the Run Job page apply whenever the outputs are generated for this flow. See Dataflow Execution Settings.

...

Run job in shared VPC network (internal IP addresses)

D s ed
editions: gdpent, gdppro, gdppr

You can choose to run your job in a VPC network that is shared across multiple projects. Configure the following settings in your Dataflow Execution Settings:

  • VPC Network Mode: Custom
  • Network: Do not modify. When a Subnetwork value is specified,
    D s dataflow
    ignores the Network value. 
  • Subnetwork: Specify the subnetwork using a full URL. See below.

    Info

    NOTE: If the Subnetwork is located within a shared VPC network, you must specify the complete URL.

    Info

    NOTE: Additional subnet-level permissions may be required.
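
    For illustration only (the project, region, and subnetwork names below are placeholders), a full subnetwork URL typically takes the following form, where the referenced project is the host project of the shared VPC network:

        https://www.googleapis.com/compute/v1/projects/HOST_PROJECT_ID/regions/REGION/subnetworks/SUBNETWORK_NAME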

...

For more information on subnet-level permissions, see https://cloud.google.com/vpc/docs/provisioning-shared-vpc#networkuseratsubnet. 

Run job API

You can also run jobs using the REST APIs.
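
As a minimal sketch only, the following assumes the v4 jobGroups endpoint (https://api.clouddataprep.com/v4/jobGroups), an access token generated from your user preferences, and a placeholder recipe identifier; verify the endpoint, authentication method, and request body against the API documentation for your product edition.

    # Minimal sketch: queue a job run for a recipe via the REST API.
    # Placeholders/assumptions: access token, recipe id, and endpoint URL.
    import requests

    ACCESS_TOKEN = "<your-access-token>"   # placeholder token
    RECIPE_ID = 28629                      # placeholder id of the recipe to run

    response = requests.post(
        "https://api.clouddataprep.com/v4/jobGroups",
        headers={
            "Authorization": f"Bearer {ACCESS_TOKEN}",
            "Content-Type": "application/json",
        },
        json={"wrangledDataset": {"id": RECIPE_ID}},
    )
    response.raise_for_status()
    print(response.json())  # the response includes the id of the queued jobGroup

The queued job then appears in the Job Details page, where you can monitor its progress as with jobs launched from the Run Job page.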

...

Tip

Tip: Unless performance issues related to your resource selections apply to all jobs in the project, you should make changes to your resources for individual output objects. If those changes improve performance and you are comfortable with the higher costs associated with the change, you can consider applying them through the Project Execution Settings page for all jobs in the project.

Choose machine type

A machine type is a set of virtualized hardware resources, including memory size, CPU, and persistent disk storage, that is assigned to a virtual machine (VM) responsible for executing your job.
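
For example (illustrative values only; the set of machine types offered in the product may differ), predefined Compute Engine machine types trade cost against the resources available to each worker:

    n1-standard-1    1 vCPU     3.75 GB memory
    n1-standard-4    4 vCPUs    15 GB memory
    n1-highmem-8     8 vCPUs    52 GB memory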

...

  • Billing for your job depends on the machine type (resources) that have been assigned to the job. If you select a more powerful machine type, you should expect higher costs for each job execution.
  • D s product
     provides a subset of available machine types from which you can select to execute your jobs. By default, 
    D s product
     uses a machine type that you define in your Execution Settings page.
  • If you are experiencing long execution times and are willing to incur additional costs, you can select a more powerful machine type. 

...

Machine scaling algorithms

D s ed
editions: gdpent, gdppro, gdppr

By default, 

D s product
 utilizes a scaling algorithm based on throughput to scale up or down the Google Compute Engine instances that are deployed to execute your job. 

Info

NOTE: Auto-scaling can increase the costs of job execution. If you use auto-scaling, you should specify a reasonable maximum limit.
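
As an illustration only (the field names exposed in the Dataflow execution settings may differ), this behavior corresponds to the Dataflow service's throughput-based autoscaling, which is expressed as pipeline options roughly like the following, with the maximum worker count acting as the cost cap:

    --autoscalingAlgorithm=THROUGHPUT_BASED   # scale worker count with observed throughput
    --maxNumWorkers=10                        # upper bound on workers, to limit cost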

...

  • Each label must have a unique key within your project. 
  • You can create up to 64 labels per project.
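
For example (hypothetical keys and values; label keys generally must be lowercase and unique), labels are simple key-value pairs attached to the Compute Engine resources that execute your job and can be used to break down costs in billing reports:

    cost-center : analytics
    environment : production
    team        : wrangling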