
In Dataprep, most jobs to transform your data are executed by default on Dataflow, a managed service for executing data pipelines within the Google Cloud Platform. Dataprep has been designed to integrate with Dataflow and to take advantage of multiple features available in the service. This section describes how to execute a job on Dataflow, as well as its options.

Project owners can choose to enable Trifacta Photon, an in-memory running environment hosted on the Trifacta node. This running environment yields faster performance on small- to medium-sized jobs. For more information, see the Dataprep Project Settings page.

Default Dataflow Jobs

...

  1. To run a job, open the flow containing the recipe whose output you wish to generate.
  2. Locate the recipe. Click the recipe's output object icon.
  3. On the right side of the screen, information about the output object is displayed. The output object defines:
    a. The type and location of the outputs, including filenames and method of updating.
    b. Profiling options.
    c. Execution options.
  4. For now, you can ignore the options for the output object. Click Run Job.
  5. In the Run Job page, you can review the job as it is currently specified.
  6. To run the job on Dataflow, select Dataflow.
  7. Click Run Job.
  8. The job is queued with default settings for execution on Dataflow.

...

Run job in shared VPC network (internal IP addresses)

You can specify to run your job in a VPC network that is shared across multiple projects. Configure the following settings in the Dataflow Execution Settings:

  • VPC Network Mode: Custom
  • Network: Do not modify. When a Subnetwork value is specified, Dataflow ignores the Network value.
  • Subnetwork: Specify the subnetwork using a full URL. See below.

    NOTE: If the subnetwork is located within a shared VPC network, you must specify the complete URL.

    NOTE: Additional subnet-level permissions may be required.
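
For reference, a full subnetwork URL for Dataflow typically takes the following form; the host project ID, region, and subnetwork name shown here are placeholders, not values from your environment:

    https://www.googleapis.com/compute/v1/projects/HOST_PROJECT_ID/regions/REGION/subnetworks/SUBNETWORK_NAME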

...

For more information on subnet-level permissions, see https://cloud.google.com/vpc/docs/provisioning-shared-vpc#networkuseratsubnet. 

Run job API

You can also run jobs using the REST APIs.
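
As a minimal sketch, the following Python snippet requests execution of a job through the REST API. The endpoint, the output object (wrangledDataset) ID, and the access token are illustrative assumptions; substitute the values for your environment.

    import requests

    # Illustrative values; replace with your own.
    API_URL = "https://api.clouddataprep.com/v4/jobGroups"
    ACCESS_TOKEN = "<your-access-token>"

    # Request execution of the recipe associated with this output object id.
    response = requests.post(
        API_URL,
        headers={
            "Authorization": f"Bearer {ACCESS_TOKEN}",
            "Content-Type": "application/json",
        },
        json={"wrangledDataset": {"id": 28629}},
    )
    response.raise_for_status()

    # The response includes the id of the queued job group.
    print(response.json())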

...

Machine scaling algorithms

By default, Dataprep utilizes a scaling algorithm based on throughput to scale up or down the Google Compute Engine instances that are deployed to execute your job.

NOTE: Auto-scaling can increase the costs of job execution. If you use auto-scaling, you should specify a reasonable maximum limit.
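
To make the cost trade-off concrete, the sketch below shows the standard Dataflow pipeline options that govern autoscaling, expressed as a Python dictionary. The values, and the idea of supplying them through your job's execution settings, are illustrative assumptions.

    # Illustrative Dataflow autoscaling options; values are examples only.
    dataflow_options = {
        "autoscalingAlgorithm": "THROUGHPUT_BASED",  # scale by observed throughput
        "numWorkers": 2,                             # initial worker count
        "maxNumWorkers": 10,                         # cap workers to bound cost
    }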

...

By default, Dataprep uses the service account that is configured for use with your project. In the Dataflow Execution Settings, enter the name of the service account to use.

NOTE: Under the Permissions tab, verify that Include Google-provided role grants is selected.

To see the current service account for your project:

...
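
As an alternative to the console steps above, the following sketch lists the service accounts in a project using the Google IAM API client for Python. It assumes that the google-api-python-client library is installed and that Application Default Credentials are configured; the project ID is a placeholder.

    from googleapiclient import discovery

    # Build an IAM API client using Application Default Credentials.
    iam = discovery.build("iam", "v1")

    # List the service accounts in the project (placeholder project ID).
    response = iam.projects().serviceAccounts().list(
        name="projects/my-project-id"
    ).execute()

    for account in response.get("accounts", []):
        print(account["email"])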