In the Run Job page, you can specify transformation and profiling jobs for the currently loaded dataset. Available options include output formats and output destinations. You can also configure the environment where the job is to be executed.
Tip: Columns that have been hidden in the Transformer page still appear in the generated output. Before you run a job, verify that all currently hidden columns are acceptable to include in the output.
Run Job Page
Select the environment where you wish to execute the job. Some of the following environments may not be available to you. These options appear only if there are multiple accessible running environments.
NOTE: Running a job executes the transformations on the entire dataset and saves the transformed data to the specified location. Depending on the size of the dataset and available processing resources, this process can take a while.
Tip: The application attempts to identify the best running environment for you. You should choose the default option, which factors in the available environments and the size of your dataset to identify the most efficient processing environment.
Photon: Executes the job on the running environment hosted on the same server as the .
Spark: Executes the job using the Spark running environment.
Databricks: Executes the job on the Azure Databricks cluster with which the platform is integrated.
NOTE: Use of Azure Databricks is not supported on Marketplace installs.
For more information, see Configure for Azure Databricks.
Profile Results: Optionally, you can disable profiling of your output, which can improve the speed of overall job execution. When the profiling job finishes, details are available through the Job Details page, including links to download results.
NOTE: Percentages for valid, missing, or mismatched column values may not add up to 100% due to rounding. This issue applies to the Photon running environment.
See Job Details Page.
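The rounding effect described in the note above can be reproduced with a small sketch. The function name and the whole-percent rounding are assumptions made for illustration; the application's actual rounding method is not described here.

```python
def rounded_percentages(valid, missing, mismatched):
    """Return each category's share of the total, rounded to whole percents."""
    total = valid + missing + mismatched
    return [round(100 * count / total) for count in (valid, missing, mismatched)]


# Three equal categories: each share is 33.33...%, which rounds to 33.
shares = rounded_percentages(1, 1, 1)
print(shares, sum(shares))  # prints "[33, 33, 33] 99"
```

Each value is rounded independently, so the displayed total can land slightly under or over 100%.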
You can add, remove, or edit the outputs generated from this job. By default, a CSV output for your home directory on the selected datastore is included in the list of destinations, which can be removed if needed. You must include at least one output destination.
From the available datastores in the left column, select the target for your publication.
Add Publishing Action
NOTE: Do not create separate publishing actions that apply to the same file or database table.
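As a rough illustration of the note above, the following sketch flags publishing actions that point at the same target. The dictionary layout, the `target` key, and the example paths are hypothetical and do not reflect how the application stores publishing actions.

```python
from collections import Counter


def duplicate_targets(actions):
    """Return targets that more than one publishing action writes to."""
    counts = Counter(action["target"] for action in actions)
    return sorted(target for target, n in counts.items() if n > 1)


# Hypothetical list of publishing actions for one job.
actions = [
    {"action": "create", "target": "/out/sales.csv"},
    {"action": "append", "target": "/out/sales.csv"},
    {"action": "create", "target": "/out/sales.json"},
]
duplicates = duplicate_targets(actions)
print(duplicates)  # prints "['/out/sales.csv']"
```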
If Hive publishing is enabled, you must select or specify a database table to which to publish.
Depending on the running environment, results are generated in Avro or Parquet format. See below for details on specifying the action and the target table.
If you are publishing a wide dataset to Hive, you should generate results using Parquet.
For more information on how data is written to Hive, see Hive Data Type Conversions.
Locate a publishing destination: Do one of the following.
NOTE: The publishing location must already exist before you can publish to it. The publishing user must have write permissions to the location.
NOTE: If your HDFS environment is encrypted, the default output home directory for your user and the output directory where you choose to generate results must be in the same encryption zone. Otherwise, writing the job results fails with a
Choose an existing file or folder: When the location is found, select the file to overwrite or the folder into which to write the results.
NOTE: You must have write permissions to the folder or file that you select.
Create a new file: See below.
As needed, you can parameterize the outputs that you are creating. Click Parameterize destination in the right panel. See Parameterize destination settings below.
To save the publishing destination, click Add.
To delete a publishing action, select Delete from its context menu.
If any variable parameters have been specified for the datasets or outputs of the flow, you can apply overrides to their default values. Click the listed default value and insert a new value. A variable can have an empty value.
NOTE: Unless this output is a scheduled destination, variable overrides apply only to this job. Subsequent jobs use the default variable values, unless specified again. No data validation is performed on entries for override values.
For more information on variables, see Overview of Parameterization.
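The override behavior described above can be pictured as a per-job merge over the default values. This is a minimal sketch with hypothetical variable names, not the application's implementation.

```python
def resolve_variables(defaults, overrides=None):
    """Merge per-job overrides onto default variable values.

    The defaults dict is never modified, so the next job starts from the
    defaults again. No validation is applied to override values.
    """
    resolved = dict(defaults)
    resolved.update(overrides or {})
    return resolved


defaults = {"region": "emea", "suffix": "daily"}

# An empty string is accepted as an override value.
this_job = resolve_variables(defaults, {"region": "apac", "suffix": ""})

# A later job without overrides falls back to the defaults.
next_job = resolve_variables(defaults)
```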
When you generate file-based results, you can configure the filename, storage format, compression, number of files, and the updating actions in the right-hand panel.
Output File Settings
Configure the following settings.
To change it, navigate to the proper directory.
Avro: This format is used to support data serialization within a Hadoop environment.
CSV and JSON: These formats are supported for all types of imported datasets and all running environments.
TDE: Choose TDE (Tableau Data Extract) to generate results that can be imported into Tableau.
If you have created a Tableau Server connection, you can write results to Tableau Server or publish them after they have been generated in TDE format.
NOTE: If you encounter errors generating results in TDE format, additional configuration may be required. See Supported File Formats.
Publishing action: Select one of the following:
NOTE: If multiple jobs are attempting to publish to the same filename, a numeric suffix (
myOutput_3.csv, and so on).
Append to this file every run: For each job run with the selected publishing destination, the same file is appended, which means that the file grows until it is purged or trimmed.
NOTE: When publishing single files to S3 or WASB, the
NOTE: When appending data into a Hive table, the columns displayed in the Transformer page must match the order and data type of the columns in the Hive table.
NOTE: This option is not available for outputs in TDE format.
NOTE: Compression of published files is not supported for an
Include headers as first row on creation: For CSV outputs, you can choose to include the column headers as the first row in the output. For other formats, these headers are included automatically.
NOTE: Headers cannot be applied to compressed outputs.
Include quotes: For CSV outputs, you can choose to include double quote marks around all values, including headers.
Delimiter: For CSV outputs, you can enter the delimiter that is used to separate fields in the output. The default value is the global delimiter, which you can override on a per-job basis in this field.
Tip: If needed for your job, you can enter Unicode characters in the following format:
NOTE: The Spark running environment does not support use of multi-character delimiters for CSV outputs. You can switch your job to a different running environment or use single-character delimiters. For more information on this issue, see https://issues.apache.org/jira/browse/SPARK-24540.
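To show how the header, quoting, and delimiter settings affect a CSV file, here is a sketch that uses Python's standard `csv` module rather than the application itself. The function and parameter names are assumptions chosen to mirror the settings above.

```python
import csv
import io


def write_csv(rows, header, delimiter=",", include_header=True, quote_all=False):
    """Render rows as CSV text using the given output settings."""
    if len(delimiter) != 1:
        # Python's csv module, like the Spark running environment,
        # accepts only single-character delimiters.
        raise ValueError("delimiter must be a single character")
    buffer = io.StringIO()
    writer = csv.writer(
        buffer,
        delimiter=delimiter,
        quoting=csv.QUOTE_ALL if quote_all else csv.QUOTE_MINIMAL,
        lineterminator="\n",
    )
    if include_header:
        writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


text = write_csv([[1, "a"], [2, "b"]], ["id", "name"], delimiter="|", quote_all=True)
print(text)
```

With quoting enabled, every value is wrapped in double quotes, including the header row.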
Single File: Output is written to a single file. Default setting for smaller, file-based jobs or for jobs executed on the
Multiple Files: Output is written to multiple files. Default setting for larger file-based jobs or for jobs executed in a remote, cluster-based running environment.
Compression: For text-based outputs, compression can be applied to significantly reduce the size of the output. Select a preferred compression format for each format you want to compress.
NOTE: If you encounter errors generating results using Snappy, additional configuration may be required. See Supported File Formats.
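The size reduction for text-based output can be illustrated outside the application. This sketch uses gzip only because it ships with Python's standard library; the compression formats actually offered depend on your installation, and the sample data is made up.

```python
import gzip

# Repetitive text, as is typical of delimited output.
csv_text = "id,status\n" + "".join(f"{i},ok\n" for i in range(1000))
raw = csv_text.encode("utf-8")
compressed = gzip.compress(raw)

# Compare sizes in bytes before and after compression.
print(len(raw), len(compressed))

# Decompressing restores the original bytes exactly.
restored = gzip.decompress(compressed)
```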
Some relational connections can be configured to support writing directly to the database. Please configure the following settings to specify the output table.
NOTE: You cannot write to multiple relational outputs from the same job.
Output database: To change the database to which you are publishing, click the database icon in the sidebar. Select a different database.
Append to this table every run: Each run adds any new results to the end of the table.
If you are creating a publishing action for a Redshift database table, you must provide the following information.
NOTE: Some data types may be exported to Redshift using different data types. For more information, see Redshift Data Type Conversions.
Output database: To change the database to which you are publishing, click the Redshift icon in the sidebar. Select a different database.
When publishing to Hive, please complete the following steps to configure the table and settings to apply to the publish action.
NOTE: Some data types may be exported to Hive using different data types. For more information on how types are exported to Hive, see Hive Data Type Conversions.
Output database: To change the database to which you are publishing, click the Hive icon in the sidebar. Select a different database.
NOTE: You cannot publish to a Hive database that is empty. The database must contain at least one table.
Publish actions: Select one of the following.
NOTE: If you are writing to unmanaged tables in Hive, create and drop & load actions are not supported.
Append to this table every run: Each run adds any new results to the end of the table.
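When appending to a Hive table, the output columns must match the table's columns in order and data type. One way to picture that check is a position-by-position comparison. The `(name, type)` tuples and type names below are hypothetical, and this sketch does not query Hive.

```python
def schema_mismatches(output_columns, table_columns):
    """Compare (name, type) pairs position by position and list differences."""
    problems = []
    if len(output_columns) != len(table_columns):
        problems.append(
            f"column count differs: {len(output_columns)} vs {len(table_columns)}"
        )
    for position, (out, tab) in enumerate(zip(output_columns, table_columns)):
        if out != tab:
            problems.append(f"position {position}: {out} does not match {tab}")
    return problems


table = [("id", "int"), ("name", "string")]

# Same columns in the same order: no problems reported.
same = schema_mismatches([("id", "int"), ("name", "string")], table)

# Same columns in a different order: both positions are reported.
swapped = schema_mismatches([("name", "string"), ("id", "int")], table)
```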
Tip: Optionally, users can be permitted to publish to Hive staging schemas to which they do not have full create and drop permissions. This feature must be enabled. For more information, see Configure for Hive.
When enabled, the name of the staging DB must be inserted into your user profile. See User Profile Page.
When publishing to Tableau Server, please complete the following steps to configure the datasource and settings to apply to the publish action.
Output project: To change the project to which you are publishing, click the Tableau icon in the sidebar. Select a different project.
Append to this datasource every run: Each run adds any new results to the end of the datasource.
Tip: If you generate a TDE file as part of your output, you can choose to download and later publish it to Tableau Server. For more information, see Publishing Dialog.
For file- or table-based publishing actions, you can parameterize elements of the output path. Whenever you execute a job, you can pass in parameter values through the Run Job page.
NOTE: Output parameters are independent of dataset parameters. However, two variables of different types with the same name should resolve to the same value.
Supported parameter types:
For more information, see Overview of Parameterization.
Define destination parameter
Name: Enter a display name for the variable.
NOTE: Variable names do not have to be unique. Two variables with the same name should resolve to the same value.
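A parameterized output path can be thought of as a template that is filled in at job run time. The `${name}` placeholder syntax below comes from Python's `string.Template` and is an assumption for illustration only; it is not the application's own syntax, and the path and variable names are made up.

```python
from string import Template


def resolve_output_path(path_template, values):
    """Fill each variable in the path with its value for this job run."""
    return Template(path_template).substitute(values)


# The same name appears twice and resolves to the same value both times.
path = resolve_output_path(
    "/outputs/${region}/sales_${region}_${run_date}.csv",
    {"region": "emea", "run_date": "2019-01-31"},
)
print(path)  # prints "/outputs/emea/sales_emea_2019-01-31.csv"
```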
To execute the job as configured, click Run Job. The job is queued for execution. After a job has been queued, you can track its progress toward completion. See Jobs Page.
You can use the available REST APIs to execute jobs for known datasets. For more information, see API JobGroups Create v4.
For more information on the entire API workflow, see API Workflow - Develop a Flow.
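As a rough sketch of what queuing a job through the REST API might look like, the following builds a request without sending it. The `/v4/jobGroups` path, the `wrangledDataset` payload shape, the bearer-token header, and the example host are all assumptions; confirm them against API JobGroups Create v4 before use.

```python
import json
import urllib.request


def build_run_job_request(base_url, token, dataset_id):
    """Build (but do not send) a request that queues a job for a dataset."""
    # Assumed payload shape: the dataset to run is identified by its id.
    payload = {"wrangledDataset": {"id": dataset_id}}
    return urllib.request.Request(
        url=f"{base_url}/v4/jobGroups",  # assumed endpoint path
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",  # assumed auth scheme
        },
        method="POST",
    )


request = build_run_job_request("https://example.com", "<token>", 7)
# To submit the request: urllib.request.urlopen(request)
```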