When you specify a Dataflow job, you can pass a set of property values to the execution environment to apply to the job run. Overrides are defined in the Run Job page and are applied to the configured job.
- You can specify overrides for ad-hoc jobs through the Run Job page.
- You can specify overrides when you configure a scheduled job execution.
These property values override any settings applied to the project.
- Properties whose values are not specified in the Dataflow execution overrides use the values that you set in the Project Settings page.
- See Project Settings Page.
Figure: Dataflow Execution Properties
Default execution settings:
By default, Dataprep by Trifacta runs your job on an n1-standard-1 machine in the us-central1 region. As needed, you can change the geographic location and the machine type where your job is executed.
Tip: You can change the default values for the following in your project settings. See Project Settings Page.
Making changes to these settings can affect the execution time of your job.
Tip: For more information on how the following settings affect your jobs, see Run Job on Cloud Dataflow.
Setting | Description |
---|---|
Regional Endpoint | A regional endpoint manages execution details for your Dataflow job; its location determines where the Dataflow job is executed. |
Zone | A sub-section of a region, a zone contains specific resources available to that region. Select the zone in which to execute your job. |
Machine Type | Choose the type of machine on which to run your job. The default is n1-standard-1. Note: Not all machine types are supported directly through Dataprep by Trifacta. |
For more information on these regional endpoints, see https://cloud.google.com/dataflow/docs/concepts/regional-endpoints.
For more information on machine types, see https://cloud.google.com/compute/docs/machine-types.
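The Regional Endpoint and Zone settings are related through Compute Engine's naming convention: a zone name is its region name plus a letter suffix (for example, zone us-central1-f belongs to region us-central1). The following is a minimal sketch of that relationship; the helper function is hypothetical and not part of the product:

```python
def zone_in_region(zone: str, region: str) -> bool:
    """Check whether a Compute Engine zone belongs to a region.

    A zone name is its region name plus a "-<letter>" suffix,
    e.g. "us-central1-f" is a zone in region "us-central1".
    """
    return zone.rsplit("-", 1)[0] == region

print(zone_in_region("us-central1-f", "us-central1"))  # True
print(zone_in_region("europe-west1-b", "us-central1"))  # False
```

If you override the regional endpoint, any zone override should belong to that region.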
Advanced settings:
Setting | Description |
---|---|
VPC Network mode | As needed, you can override the default network settings configured for your project for this job. NOTE: Avoid applying overrides unless necessary. These network settings apply to job execution only; preview and sampling jobs use the settings configured for the project. For more information, see Project Settings Page. |
Network | To use a different VPC network, enter the name of the VPC network to use as an override for this job. Click Save to apply the override. |
Subnetwork | To specify a different subnetwork, enter the URL of the subnetwork in the following format: regions/<REGION>/subnetworks/<SUBNETWORK>, where <REGION> is the region of the subnetwork and <SUBNETWORK> is the name of the subnetwork. If you have access to another project within your organization, you can execute your Dataflow job through it by specifying a full URL in the following form: https://www.googleapis.com/compute/v1/projects/<HOST_PROJECT_ID>/regions/<REGION>/subnetworks/<SUBNETWORK>, where <HOST_PROJECT_ID> is the ID of the host project. Click Save to apply the override. |
For more information on these settings, see Project Settings Page.
Setting | Description |
---|---|
Worker IP address configuration | When the VPC Network mode is overridden, configures whether Dataflow worker instances use external (public) or internal IP addresses. |
Autoscaling Algorithms | The type of algorithm used to scale the number of Google Compute Engine instances to accommodate the size of your job. Possible values: Throughput-based and None. |
Initial number of workers | Number of Google Compute Engine instances with which to launch the job. This number may be adjusted as part of job execution. This number must be an integer between 1 and 1000, inclusive. |
Maximum number of workers | Maximum number of Google Compute Engine instances to use during execution. This value must be greater than the initial number of workers and must be an integer between 1 and 1000, inclusive. |
Service account | Email address of the service account under which to run the job. NOTE: When using a named service account to access data and run jobs in other projects, the user running the job must be granted the appropriate IAM role on that service account. |
Labels | Create or assign labels to apply to the billing for the Dataprep by Trifacta jobs run in your project. You may reference up to 64 labels. NOTE: Each label must have a unique key name. For more information, see https://cloud.google.com/resource-manager/docs/creating-managing-labels. |
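The worker-count and label constraints in the table above can be summarized in a short validation sketch. This is a hypothetical helper illustrating the documented limits (1-1000 workers, maximum greater than initial, up to 64 labels with unique keys), not product code:

```python
def validate_overrides(initial_workers: int, max_workers: int,
                       labels: dict[str, str]) -> bool:
    """Check the worker-count and label constraints described above."""
    assert 1 <= initial_workers <= 1000, "initial workers must be 1-1000"
    assert max_workers > initial_workers, "maximum must exceed initial workers"
    assert max_workers <= 1000, "maximum workers must not exceed 1000"
    assert len(labels) <= 64, "at most 64 labels may be referenced"
    # Modeling labels as a dict enforces unique key names by construction.
    return True

print(validate_overrides(2, 20, {"team": "analytics"}))  # True
```

A dict is used for labels because each label must have a unique key name.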
Notes on behavior:
- Values specified here are applied to the current job or to all jobs executed using the output object.
- Properties not specified here are not submitted, and the default values for Dataflow are used.
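The behavior in the second note can be sketched as follows: only properties you explicitly set are submitted with the job, so any unset property is simply omitted and Dataflow's own default applies. The property names below are illustrative assumptions, not an actual request schema:

```python
def build_submitted_properties(overrides: dict) -> dict:
    """Drop unset (None) properties rather than submitting them.

    Omitted properties fall back to the Dataflow defaults because
    they never appear in the submitted job configuration.
    """
    return {k: v for k, v in overrides.items() if v is not None}

props = build_submitted_properties({
    "machineType": "n1-highmem-4",
    "zone": None,            # not specified: Dataflow default applies
    "maxNumWorkers": 20,
})
print(props)  # {'machineType': 'n1-highmem-4', 'maxNumWorkers': 20}
```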