The Command Line Interface for Jobs enables programmatic control over a variety of operations on the platform. You can use the CLI to execute any of the following types of commands:
- Run a job
  NOTE: In this release, you cannot run jobs using datasets imported from Redshift or SQL DW connections via the CLI. This known issue will be fixed in a future release.
- Check job status
- Publish a completed job to other datastores asynchronously
  NOTE: In this release, you cannot publish results to Redshift or SQL DW connections via the CLI. This known issue will be fixed in a future release.
- Get information on publications of a job
- Load data into a new or existing table in Redshift
- (Hive only) Clear an existing table and load with new data
Requirements
- The CLI must have access to a running instance of the Designer Cloud Powered by Trifacta® platform. You can specify the host and port of this instance.
- If you are running jobs for a dataset with parameters, the downloaded assets reference only the first matching file of the dataset. To run the job across all files in the dataset with parameters, you must build the matching logic within your CLI script. For more information on datasets with parameters, see Overview of Parameterization.
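A minimal wrapper sketch for building that matching logic outside the platform, under stated assumptions: the matching files live in a single HDFS directory, the hdfs client is available where the CLI runs, and the source URL appears as a literal value in the downloaded datasources.tsv (inspect your copy to confirm its layout). The directory path, datasources.tsv.template, and the <...> values are hypothetical placeholders; the run_job parameters used here are described under Command Reference below.
# iterate over all files matching the pattern and run the job once per file
for SRC in $(hdfs dfs -ls -C /data/incoming/sales_*.csv); do
  # rewrite the source URL in a copy of the TSV so it points at the current file
  sed "s|<original_source_url>|hdfs://<namenode>:8020${SRC}|" datasources.tsv.template > datasources.tsv
  ./trifacta_cli.py run_job --user_name <trifacta_user> --password <trifacta_password> \
    --job_type spark --script <path>/script.cli --data datasources.tsv \
    --output_format csv --output_path "hdfs://<namenode>:8020/results/$(basename "$SRC")" \
    --cli_output_path ./job_info.out --profiler on
done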
Command Reference
Execute the following command from the top-level Alteryx directory. The Python script references script.cli and datasources.tsv as parameters.
For repeat executions of the same script.cli file, you can parameterize the values in the datasources.tsv file.
The CLI tools are stored in the following directory:
/opt/trifacta/bin/
To execute jobs, specify parameters for the following command:
./trifacta_cli.py (parameters)
Parameters
Common
These parameters are common to job or connection actions.
Parameter | Description | Applicable CLI Commands |
---|---|---|
command_type | The type of CLI command to execute. Accepted values include the job commands described on this page (run_job, get_job_status, publish, get_publications, load_data, truncate_and_load) and the connection commands described in CLI for Connections. See Examples below. | All |
user_name | (Required) Alteryx username of the user to execute the job. Please specify the full username. NOTE: In the response, this value is listed as user. | All |
password | Alteryx password for the username. NOTE: If you have enabled Kerberos-based access to the Designer Cloud Powered by Trifacta platform, you do not need to provide a password. To enable, additional configuration is required. See Set up for a Kerberos-enabled Hadoop cluster. NOTE: You can store username and password information in an external file so that they don't need to be included in each command. For more information, see CLI Config File. | All |
cli_output_path | Defines the client-side path where the JSON output is stored for all commands. NOTE: The user issuing the command must also have execute permissions on all parent folders in the specified cli_output_path. | All |
disable_ssl_certification | (Optional) When communicating over HTTPS, this setting can be used to override the default behavior of validating the server certificate before executing the command. NOTE: If you have stored a self-signed certificate on the Alteryx node, set this flag. | All |
conn_ssl | (Optional) Connect to the datastore over SSL. NOTE: SSL connections are not supported for Hive, Redshift, or SQL Server. | All |
Params for managing jobs
These parameters apply to managing jobs only.
Parameter | Description | Applicable CLI Commands |
---|---|---|
host | (Required) The server and port number of the Alteryx instance. | All |
conn_name | Internal name of the connection. This name is referenced in your CLI scripts. It should be a single value without spaces. NOTE: This value must be unique among your connection names. | load_data, publish, truncate_and_load |
conn_id | The internal identifier for the connection. When a connection is created, it is assigned an internal numeric identifier. This ID or the conn_name can be used to reference the connection. Tip: This value is available when you hover over a connection in the application. See Flows Page. | publish, load_data, truncate_and_load |
job_type | The execution environment in which to run the job. The examples on this page use spark. NOTE: When this job type is applied, your CLI scripts automatically transition to running jobs in Spark. | run_job |
job_id | The internal identifier for the job. This value can be retrieved from the output of a completed run_job command. | get_job_status, publish, get_publications, load_data |
profiler | When on, profiling of your job is enabled. Default is off. | run_job |
data | Full UNIX path to the source TSV file. This file contains a URL pointing to the actual Hive or HDFS source: one TSV file for each job run. The executing user must have access to this file. | run_job |
script | Full UNIX path from the Alteryx root directory to the CLI script file. The executing user must have access. | run_job |
publish_action | (Optional) Defines the action taken on second and subsequent publish operations. For example, create writes results to a new file with each job execution (see the File Publishing Options example below). | run_job |
header | (Optional) When true, a header row is included in CSV output. | run_job |
single_file | (Optional) When true, CSV or JSON outputs are written to a single file. Default is false. | run_job |
output_path | (Required) Defines the fully qualified URI to which the job results are written, as in the following examples: hdfs://host:port/path/filename.csv, s3://bucketName/path/filename.csv. This parameter specifies the base filename. | run_job |
output_format | The output format for generated results. The examples on this page use json, csv, and avro. | run_job |
database | Name of the Redshift or Hive database to which you are publishing or loading. | publish, load_data |
table | The table of the database to which you are publishing or loading. | publish, load_data |
publish_format | The format of the output file from which to publish to Hive or Redshift tables. The examples on this page use avro; see the Publish section below for the formats each target supports. | publish, get_publications |
publish_opt_file | Path to a file containing definitions for multiple file or table targets to which to write the job's results. For more information, see CLI Publishing Options File. | run_job |
skip_publish_validation | By default, the CLI automatically validates the schema when generating results to a pre-existing source. If this flag is set, schema validation is skipped on results output. | run_job |
For documentation on the CLI parameters, run:
./trifacta_cli.py --help
Additional documentation may be available for individual commands using the following:
./trifacta_cli.py <command> --help
Examples
A key function of the CLI is to execute jobs. You can also check job status through the command line interface and then take subsequent publication actions using other commands.
Run job
This command requires a dataset and a CLI script. The CLI script is used to programmatically run a recipe produced in the Transformer page.
For example, if you receive raw data each day, you can parameterize the execution of the same recipe against daily downloads written to HDFS.
Each run of the CLI script creates a new job. A finished CLI job appears on the Jobs page.
Steps:
- A recipe is specific to a dataset. In the Transformer page, open the Recipe Panel.
- Click Download.
- Select CLI Script.
- Download to your desktop. The ZIP contains the following:
  - script.cli - Contains the necessary code and configuration to access HDFS and the script in the Alteryx database.
  - datasources.tsv - Contains pointers to the source storage location of your datasource(s).
    - If you are running jobs for a dataset with parameters, the downloaded assets reference only the first matching file of the dataset. To run the job across all files in the dataset with parameters, you must build the matching logic within your CLI script. For more information on datasets with parameters, see Overview of Parameterization.
    - For an example of how to add parameters in a local script, see CLI Example - Parameterize Job Runs.
  - publishopts.json - Template file for defining one or more publishing targets for running jobs. See CLI Publishing Options File.
- These files must be transferred to the Alteryx Server, where you can reference them from the Alteryx root directory.
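As a sketch only (the host name and target directory are placeholders; this assumes /opt/trifacta is the Alteryx root directory on your node, and any standard file-transfer method works):
scp script.cli datasources.tsv publishopts.json <user>@<alteryx-node>:/opt/trifacta/redshift-test/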
Notes on connections and jobs
In the downloaded ZIP, the datasources.tsv file may contain a reference to the connection used to import the dataset. However, if you are running the CLI in a Designer Cloud Powered by Trifacta platform instance that is different from its source, this connectionId may be different in the new environment. From the new environment, please do the following:
- Use the list_connections operation to acquire the list of connections available in the new environment. See CLI for Connections. Acquire the Id value for the connection corresponding to the one used in datasources.tsv.
  NOTE: The user who is executing the CLI script must be able to access the connection in the new environment.
- Edit datasources.tsv. Replace the connection Id value in the file with the value retrieved through the CLI.
- When the job is executed, it should properly connect to the source through the connection in the new environment.
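For example (a sketch; list_connections and its full parameter set are documented in CLI for Connections, and python3 -m json.tool is used here only to pretty-print the JSON written to the output file):
./trifacta_cli.py list_connections --user_name <trifacta_user> --password <trifacta_password> --cli_output_path ./connections.out
python3 -m json.tool ./connections.out
Copy the Id of the matching connection from that output into datasources.tsv in place of the old value.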
Command - Basic Job Run
NOTE: This method of specifying a single-file publishing action has been superseded by a newer method, which relies on an external file for specifying publishing targets. In a future release, this method may be deprecated. For more information, see CLI Publishing Options File.
Example (All one command):
./trifacta_cli.py run_job --user_name <trifacta_user> --password <trifacta_password> --job_type spark --output_format json --data redshift-test/datasources.tsv --script redshift-test/script.cli --cli_output_path ./job_info.out --profiler on --output_path hdfs://localhost:8020/trifacta/queryResults/foo@trifacta.com/MyDataset/42/cleaned_table_1.json
Output
Job #42 has been successfully launched: You may monitor the progress of your job here: http://localhost:3005/jobs
JSON Response
JSON response written to job_info.out:
{ "status": "success", "job_outputs": { "other_outputs": [ "hdfs://localhost:8020/trifacta/queryResults/foo@trifacta.com/MyDataset/42/.profiler/profilerValidValueHistograms.json", "hdfs://localhost:8020/trifacta/queryResults/foo@trifacta.com/MyDataset/42/.profiler/profilerSamples.json", "hdfs://localhost:8020/trifacta/queryResults/foo@trifacta.com/MyDataset/42/.profiler/profilerTypeCheckHistograms.json", "hdfs://localhost:8020/trifacta/queryResults/foo@trifacta.com/MyDataset/42/.profiler/profilerInput.json" ] "job_result_files": [ "hdfs://localhost:8020/trifacta/queryResults/foo@trifacta.com/MyDataset/42/cleaned_table_1.json", ] }, "job_id": 42, "cli_script": "/trifacta/queryResults/foo@trifacta.com/redshift-test/script.cli", "job_type": "spark", "profiler": "on", "source_data": "/trifacta/queryResults/foo@trifacta.com/redshift-test/datasources.tsv", "host": "localhost:3005", "output_path": "hdfs://localhost:8020/trifacta/queryResults/foo@trifacta.com/MyDataset/42/cleaned_table_1.json", "user": "foo@trifacta.com", "output_file_formats": [ "json" ], }
Command - File Publishing Options
You can specify publication options as part of your run_job command. In the following example, a single CSV file with headers is written to a new file with each job execution.
Example (All one command):
./trifacta_cli.py run_job --user_name <trifacta_user> --password <trifacta_password> --job_type spark --output_format csv --data redshift-test/datasources.tsv --script redshift-test/script.cli --publish_action create --header true --single_file true --cli_output_path ./job_info.out --profiler on --output_path hdfs://localhost:8020/trifacta/queryResults/foo@trifacta.com/MyDataset/43/cleaned_table_1.csv
Output
Job #43 has been successfully launched: You may monitor the progress of your job here: http://localhost:3005/jobs
JSON Response
JSON response written to job_info.out:
{ "status": "success", "job_outputs": { "other_outputs": [ "hdfs://localhost:8020/trifacta/queryResults/foo@trifacta.com/MyDataset/43/.profiler/profilerValidValueHistograms.json", "hdfs://localhost:8020/trifacta/queryResults/foo@trifacta.com/MyDataset/43/.profiler/profilerSamples.json", "hdfs://localhost:8020/trifacta/queryResults/foo@trifacta.com/MyDataset/43/.profiler/profilerTypeCheckHistograms.json", "hdfs://localhost:8020/trifacta/queryResults/foo@trifacta.com/MyDataset/43/.profiler/profilerInput.json" ] "job_result_files": [ "hdfs://localhost:8020/trifacta/queryResults/foo@trifacta.com/MyDataset/43/cleaned_table_1.csv", ] }, "job_id": 43, "cli_script": "/trifacta/queryResults/foo@trifacta.com/redshift-test/script.cli", "output_file_formats": [ "csv", ], "job_type": "spark", "host": "localhost:3005", "job_output_path": "/trifacta/queryResults/foo@trifacta.com/MyDataset/43/", "user": "foo@trifacta.com", "source_data": "/trifacta/queryResults/foo@trifacta.com/redshift-test/datasources.tsv", "profiler": "on" }
Command - Publishing to Multiple Targets
As part of the CLI job, you can define multiple file or table targets to which to write the job results. For more information, see CLI Publishing Options File.
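For example (a sketch only; publishopts.json here refers to the template file included in the downloaded ZIP, and the fields it must contain, as well as whether --output_path and --output_format are still required alongside it in your release, are described in CLI Publishing Options File):
./trifacta_cli.py run_job --user_name <trifacta_user> --password <trifacta_password> --job_type spark --data redshift-test/datasources.tsv --script redshift-test/script.cli --publish_opt_file redshift-test/publishopts.json --cli_output_path ./job_info.out --profiler on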
Get job status
After you queue a job through the CLI, you can review the status of the job through the application or through the CLI.
Tip: You can acquire the job ID through the application as needed. For example, at some point in the future, you might decide to publish to Hive the results from a job you executed two weeks ago. It might be easiest to retrieve this job identifier from the Dataset Details page. See Dataset Details Page.
Command
Example (All one command):
./trifacta_cli.py get_job_status --user_name <trifacta_user> --password <trifacta_password> --job_id 42 --cli_output_path ./job_info.out
Output
Job status: Complete
JSON Response
JSON response written to job_info.out:
{ "status": "success", "job_outputs": { "other_outputs": [ "hdfs://localhost:8020/trifacta/queryResults/foo@trifacta.com/MyDataset/42/.profiler/profilerValidValueHistograms.json", "hdfs://localhost:8020/trifacta/queryResults/foo@trifacta.com/MyDataset/42/.profiler/profilerSamples.json", "hdfs://localhost:8020/trifacta/queryResults/foo@trifacta.com/MyDataset/42/.profiler/profilerTypeCheckHistograms.json", "hdfs://localhost:8020/trifacta/queryResults/foo@trifacta.com/MyDataset/42/.profiler/profilerInput.json" ] "job_result_files": [ "hdfs://localhost:8020/trifacta/queryResults/foo@trifacta.com/MyDataset/42/cleaned_table_1.json", "hdfs://localhost:8020/trifacta/queryResults/foo@trifacta.com/MyDataset/42/cleaned_table_1.csv", "hdfs://localhost:8020/trifacta/queryResults/foo@trifacta.com/MyDataset/42/cleaned_table_1.avro", ] }, "job_id": 42, "cli_script": "/trifacta/queryResults/foo@trifacta.com/redshift-test/script.cli", "output_file_formats": [ "csv", "json", "avro", ], "job_type": "spark", "host": "localhost:3005", "job_output_path": "/trifacta/queryResults/foo@trifacta.com/MyDataset/42/", "user": "foo@trifacta.com", "source_data": "/trifacta/queryResults/foo@trifacta.com/redshift-test/datasources.tsv", "profiler": "on" }
Publish
You can publish job results to Hive or Redshift:
- For Hive, you can publish Avro or Parquet results from HDFS or S3 to Hive.
- For Redshift, you can publish CSV, JSON, or Avro results from S3 to Redshift.
NOTE: To publish to Redshift, results must be written first to S3.
NOTE: Even if you are publishing to the default schema, you must preface the table value with the name of the schema to use: MySchema.MyTable.
Publish commands can be executed as soon as the job identifier has been created. After the publish command is submitted, the publish job is queued for execution after any related transform job has been completed.
You execute one publish command for each output you wish to write to a supported database table.
Command
Example (All one command):
./trifacta_cli.py publish --user_name <trifacta_user> --password <trifacta_password> --job_id 42 --database dev --table table_job_42 --conn_name 1 --publish_format avro --cli_output_path ./publish_info.out
Output
Create new table every run has been successfully launched:
You may monitor the progress of your publish job here: http://localhost:3005/jobs
Upon success, you may view the results of your publish job here: http://localhost:3005/jobs/42
JSON Response
JSON response written to publish_info.out:
{"status": "Job Started", "job_id": 42}
Get publications
You can retrieve a JSON list of all publications that have been executed for a specific job.
Command
Example (All one command):
./trifacta_cli.py get_publications --user_name <trifacta_user> --password <trifacta_password> --job_id 42 --cli_output_path ./publications.out --publish_format avro
Output
Job with id 42 has 2 avro publication(s) associated with it. The list of publications is available in "./publications.out".
JSON Response
JSON response written to publications.out:
{ "publications": [ { "publication_target": "redshift", "job_id": "42", "database": "dev", "publication_id": 69, "app_host": "trifacta.example.com:3005", "user": "foo@trifacta.com", "table": "table_job_42", "publish_format": "avro", "connect_str": "jdbc:redshift://dev.example.com:5439/dev" }, { "publication_target": "hive", "job_id": "42", "database": "default", "publication_id": 70, "app_host": "trifacta.example.com:3005", "user": "foo@trifacta.com", "table": "table_job_42", "publish_format": "avro", "connect_str": "jdbc:hive2://hadoop:10000/default" } ], }
Load data into table
You can load data into pre-existing Redshift tables.
- Data is appended after any existing rows.
- If the table does not exist, the job fails.
NOTE: When appending data into a Redshift table, the columns displayed in the Transformer page must match the order and data type of the columns in the target table.
Command
Example (All one command):
./trifacta_cli.py load_data --user_name <trifacta_user> --password <trifacta_password> --job_id 42 --database dev --table table_42 --conn_name aSQLServerConnection --publish_format avro --cli_output_path ./load_info.out
Output
Append to this table every run has been successfully launched:
You may monitor the progress of your publish job here: http://localhost:3005/jobs
Upon success, you may view the results of your Load data/Append job here: http://localhost:3005/jobs/42
JSON Response
JSON response written to load_info.out:
{"status": "Job Started", "job_id": 42}
Truncate and load
For existing tables, you can clear them and load them with results from a job. If the table does not exist, a new one is created and populated.
Command
Example (All one command):
./trifacta_cli.py truncate_and_load --user_name <trifacta_user> --password <trifacta_password> --job_id 10 --database dev --table table_43 --conn_name aSQLServerConnection --publish_format avro --cli_output_path ./load_and_trunc_info.out
Output
Truncate and Load has been successfully launched:
You may monitor the progress of your publish job here: http://localhost:3005/jobs
Upon success, you may view the results of your Truncate and Load job here: http://localhost:3005/jobs/10
JSON Response
JSON response written to load_and_trunc_info.out:
{"status": "Job Started", "job_id": 10}