
D toc

Excerpt

In the Run Job page, you can specify transformation and profiling jobs for the currently loaded recipe. Available options include output formats and output destinations.




Info

NOTE: When you run a job in

D s product
, the job is queued and executed on
D s dataflow
.
D s product
observes the job in progress and reports its progress back into the application as needed.
D s product
does not control the execution of the job.


Tip

Tip: Jobs can be scheduled for periodic execution through the Flow View page. For more information, see Add Schedule Dialog.


Tip

Tip: Columns that have been hidden in the Transformer page still appear in the generated output. Before you run a job, verify that all currently hidden columns are acceptable to include in the output.





D caption
typefigure
Run Job Page

Running Environment

Select the environment where you wish to execute the job. Some of the following environments may not be available to you. These options appear only if there are multiple accessible running environments.

Info

NOTE: Running a job executes the transformations on the entire dataset and saves the transformed data to the specified location. Depending on the size of the dataset and available processing resources, this process can take a while.


Photon: Executes the job in Photon, an embedded running environment hosted on the same server as the

D s product
.

Info

NOTE: Jobs that are executed on

D s photon
may be limited to run for a maximum of 10 minutes, after which they fail with a timeout error. If your job fails due to this limit, please switch to running the job on
D s dataflow
.

Spark: Executes the job using the Spark running environment.

Dataflow: Executes the job on

D s dataflow
within the
D s platform
. This environment is best suited for larger jobs.

Dataflow + BigQuery: For flows whose data sources and outputs reside in BigQuery, you can configure the flow to run its jobs directly in BigQuery.


Options

Profile Results: Optionally, you can disable profiling of your output, which can improve the speed of overall job execution. When the profiling job finishes, details are available through the Job Details page, including links to download results.

Info

NOTE: Percentages for valid, missing, or mismatched column values may not add up to 100% due to rounding.

See Job Details Page.

Ignore recipe errors: Optionally, you can choose to ignore errors in your recipes and proceed with the job execution. 

Info

NOTE: When this option is selected, the job may complete even if the recipe contains errors. For notification purposes, these jobs are treated as successful, although you may be notified that the job completed with warnings.

Details are available in the Job Details page. For more information, see Job Details Page.

Publishing Actions


You can add, remove, or edit the outputs generated from this job. By default, a CSV output written to your home directory on the selected datastore is included in the list of destinations; you can remove it if needed. You must include at least one output destination.

Columns:

  • Actions: Lists the action and the format for the output.
  • Location: The directory and filename or table information where the output is to be written.
  • Settings: Identifies the output format and any compression, if applicable, for the publication.

Actions:

  • To change format, location, and settings of an output, click the Edit icon.
  • To delete an output, click the X icon.

Add publishing action

From the available datastores in the left column, select the target for your publication. 

D caption
typefigure
Add Publishing Action
Info

NOTE: Do not create separate publishing actions that apply to the same file or database table.

New/Edit: You can create new connections or modify existing ones. By default, the displayed connections support publishing. See Create Connection Window.

Steps:

  1. Select the publishing target. Click an icon in the left column.
    1. BigQuery: You can publish your results to the current project or to a different one to which you have access.

      Info

      NOTE: You must have read and write access to any BigQuery database to which you are publishing.

      To publish to a different project, click the BigQuery link at the front of the breadcrumb trail. Then, enter the identifier for the project where you wish to publish your job results.

      Tip

      Tip: Your projects and their identifiers are available for review through the

      D s product
      menu bar. See UI Reference.

      Click Go. Navigate to the database where you wish to write your BigQuery results.

      For more information, see BigQuery Connections.

  2. Locate a publishing destination: Do one of the following.

    1. Explore: 

      Info

      NOTE: The publishing location must already exist before you can publish to it. The publishing user must have write permissions to the location.

      1. To sort the listings in the current directory, click the carets next to any column name.
      2. For larger directories, browse using the paging controls.
      3. Use the breadcrumb trail to explore the target datastore. Navigate folders as needed.
    2. Search: Use the search bar to search for specific locations in the current folder only.
    3. Manual entry: Click the Edit icon to manually edit or paste in a destination.
  3. Create Folder: Depending on the storage destination, you can click Create Folder to create a new folder for the job inside the currently selected one. Do not include spaces in the folder name.

  4. Create a new file: Enter the filename under which to save the dataset.

    1. Select the Data Storage Format.
    2. Supported output formats:
      1. CSV
      2. JSON
      3. Avro
    3. You can also write output as a BigQuery table, if you are connected to BigQuery.

  5. BigQuery: When publishing to BigQuery, you must specify the table to which to publish and related actions. See below.

  6. As needed, you can parameterize the outputs that you are creating. Click Parameterize destination in the right panel. See Parameterize destination settings below.

  7. To save the publishing destination, click Add.

To update a publishing action, hover over its entry. Then, click Edit.

To delete a publishing action, select Delete from its context menu.

Variables

If any variable parameters have been specified for the datasets or outputs of the flow, you can apply overrides to their default values. Click the listed default value and insert a new value. A variable can have an empty value.

D s job overrides

Info

NOTE: Unless this output is a scheduled destination, variable overrides apply only to this job. Subsequent jobs use the default variable values, unless specified again. No data validation is performed on entries for override values.

Tip

Tip: You can also specify overrides at the flow level. Override values are applied to parameters of all types whose names are a case-sensitive match. However, values that are specified at runtime override flow-level overrides. For more information, see Manage Parameters Dialog.

For more information on variables, see Overview of Parameterization.
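If you run jobs programmatically (see Run jobs via API below), runtime variable overrides can typically be supplied as key/value pairs in the job request. The following is a minimal sketch only; the variable names (varRegion, varDate) and the exact payload shape are assumptions for illustration, so consult the API Reference for the authoritative schema.

# Sketch only: supplying runtime variable overrides with a job request.
# Keys must match the variable names defined for the flow (case-sensitive).
# The payload shape shown here is an assumption, not the documented schema.
run_parameters = {
    "overrides": {
        "data": [
            {"key": "varRegion", "value": "emea"},       # hypothetical variable
            {"key": "varDate", "value": "2024-01-01"},   # hypothetical variable
        ]
    }
}
# This mapping would be included in the job request body alongside the
# recipe identifier, for example:
# {"wrangledDataset": {"id": <recipe-id>}, "runParameters": run_parameters}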

Output settings

Depending on the type of output that you are generating, you must specify additional settings to define the location, format, and other characteristics of the output.

D children
alltrue

Run Job

To execute the job as configured, click Run. The job is queued for execution.

D s dataflow
 imposes a limit on the size of the job, as represented by the JSON job definition that is passed to it.

Tip

Tip: If this limit is exceeded, the job may fail with a job graph too large error. The workaround is to split the job into smaller jobs, such as by splitting the recipe into multiple recipes. This is a known limitation of

D s dataflow
.

After a job has been queued, you can track its progress toward completion. See Job Details Page.

Automation

Run jobs via API

You can use the available REST APIs to execute jobs for known datasets. For more information, see API Reference.
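For example, a job for a known recipe can be queued with a single POST request and then polled until it completes. The following Python sketch assumes a v4-style endpoint (/v4/jobGroups), a base URL, an access token, and a recipe identifier, all of which are placeholders; the exact endpoint paths and payload fields are documented in the API Reference.

# Minimal sketch: queueing a job for a known recipe via the REST API and
# polling for completion. Endpoint paths, payload fields, and identifiers
# are assumptions for illustration; see the API Reference for the schema.
import time
import requests

BASE_URL = "https://example.com"   # hypothetical base URL of your instance
TOKEN = "<access-token>"           # hypothetical API access token
RECIPE_ID = 28629                  # hypothetical recipe (wrangled dataset) id

headers = {"Authorization": f"Bearer {TOKEN}", "Content-Type": "application/json"}

# Queue the job. Run Job page options such as the running environment or
# profiling setting may also be expressible as overrides in this request.
response = requests.post(
    f"{BASE_URL}/v4/jobGroups",
    headers=headers,
    json={"wrangledDataset": {"id": RECIPE_ID}},
)
response.raise_for_status()
job_group_id = response.json()["id"]

# Poll the job group until it reaches a terminal state.
while True:
    status = requests.get(
        f"{BASE_URL}/v4/jobGroups/{job_group_id}/status", headers=headers
    ).json()
    if status in ("Complete", "Failed", "Canceled"):
        break
    time.sleep(30)

print(f"Job group {job_group_id} finished with status: {status}")

Once the job is queued, its progress can also be tracked in the Job Details page, just as for jobs started from the Run Job page.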