Most jobs to transform your data are executed by default on Dataflow, a managed service for executing data pipelines within the Google Cloud Platform. Dataprep has been designed to integrate with Dataflow and to take advantage of multiple features available in the service. This section describes how to execute a job on Dataflow, as well as the available options.
Project owners can choose to enable an alternative in-memory running environment hosted on the platform. This running environment yields faster performance on small- to medium-sized jobs. For more information, see the Dataprep Project Settings Page.
- To run a job, open the flow containing the recipe whose output you wish to generate.
- Locate the recipe, and click its output object icon.
- On the right side of the screen, information about the output object is displayed. The output object defines:
  - The type and location of the outputs, including filenames and method of updating.
  - Profiling options.
  - Execution options.
- For now, you can ignore the options for the output object. Click Run Job.
- In the Run Job page, you can review the job as it is currently specified.
- To run the job on Dataflow, select Dataflow.
- Click Run Job.
- The job is queued with default settings for execution on Dataflow.
Run job in shared VPC network (internal IP addresses)
You can choose to run your job in a VPC network that is shared across multiple projects. Configure the following values in your Dataflow Execution Settings:
- VPC Network Mode: Set to Custom.
- Network: Do not modify. When a Subnetwork value is specified, Dataflow ignores the Network value.
- Subnetwork: Specify the subnetwork using a full URL. See below.
NOTE: If the Subnetwork is located within a shared VPC network, you must specify the complete URL.
NOTE: Additional subnet-level permissions may be required.
For more information on subnet-level permissions, see https://cloud.google.com/vpc/docs/provisioning-shared-vpc#networkuseratsubnet.
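For reference, the full subnetwork URL follows the standard Compute Engine resource format, in which the project ID is that of the host project that owns the shared VPC (not the service project running the job). A minimal sketch; the project, region, and subnet names below are placeholders:

```python
def shared_vpc_subnetwork_url(host_project: str, region: str, subnetwork: str) -> str:
    """Build the full subnetwork URL required for a shared VPC.

    The host project that owns the shared VPC network appears in the URL,
    which is why the short "regions/REGION/subnetworks/SUBNET" form is not
    sufficient in the shared-VPC case.
    """
    return (
        "https://www.googleapis.com/compute/v1/"
        f"projects/{host_project}/regions/{region}/subnetworks/{subnetwork}"
    )

# Placeholder values for illustration only:
print(shared_vpc_subnetwork_url("vpc-host-project", "us-central1", "dataflow-subnet"))
```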
Run job API
You can also run jobs using the REST APIs.
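As a sketch of what an API-driven launch looks like, the snippet below assembles (but does not send) an authenticated POST request. The endpoint path, payload shape, and identifiers here are illustrative assumptions; consult the platform's REST API reference for the exact contract.

```python
import json
import urllib.request


def build_run_job_request(base_url: str, output_object_id: int, token: str) -> urllib.request.Request:
    """Assemble an HTTP request that would queue a job for execution.

    The "/v4/jobGroups" path and the payload keys are placeholders for
    illustration; they are not guaranteed to match your API version.
    """
    body = json.dumps({"wrangledDataset": {"id": output_object_id}}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/v4/jobGroups",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


# Hypothetical host, object ID, and token:
req = build_run_job_request("https://api.example.com", 42, "TOKEN")
# urllib.request.urlopen(req) would then submit the job with default settings.
```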
Machine scaling algorithms
By default, Dataflow utilizes a throughput-based scaling algorithm to scale up or down the Google Compute Engine instances that are deployed to execute your job.
NOTE: Auto-scaling can increase the costs of job execution. If you use auto-scaling, you should specify a reasonable maximum limit.
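Capping auto-scaling prevents a pathological job from provisioning an unbounded number of workers. A minimal sketch of the corresponding Dataflow pipeline options; the cap value is a placeholder you should tune to your workload:

```python
def dataflow_scaling_options(max_workers: int) -> list:
    """Pipeline options that keep throughput-based auto-scaling bounded.

    THROUGHPUT_BASED is Dataflow's throughput-driven scaling algorithm;
    maxNumWorkers sets the ceiling on the number of Compute Engine
    instances the job may provision.
    """
    if max_workers < 1:
        raise ValueError("max_workers must be at least 1")
    return [
        "--autoscalingAlgorithm=THROUGHPUT_BASED",
        f"--maxNumWorkers={max_workers}",
    ]


# Example: allow scaling, but never beyond 10 workers.
print(dataflow_scaling_options(10))
```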
The job runs under the service account that is configured for use with your project. In the Dataflow Execution Settings, enter the name of the Service Account to use.
NOTE: Under the Permissions tab, please verify that Include Google-provided role grants is selected.
To see the current service account for your project: