The Command Line Interface for Jobs enables programmatic control over a variety of operations on the platform. You can use the CLI to execute any of the following types of commands:
- Run a job

  NOTE: In this release, you cannot run jobs via the CLI using datasets imported from Redshift or SQL DW connections. This known issue will be fixed in a future release.

- Publish a completed job to other datastores asynchronously

  NOTE: In this release, you cannot publish results to Redshift or SQL DW connections via the CLI. This known issue will be fixed in a future release.
Execute the following command from the top-level directory. The Python script references `script.cli` and `datasources.tsv` as parameters. For repeat executions of the same `script.cli` file, you can parameterize the values in the `datasources.tsv` file.
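For example, one simple way to parameterize the source is to keep a template copy of the TSV and rewrite it before each run. The sketch below is illustrative only: the `{{RUN_DATE}}` token, the `datasources.tsv.template` file, and the output path are hypothetical, and the exact layout of `datasources.tsv` is release-specific, so inspect your downloaded copy first.

```bash
# Minimal parameterization sketch. Assumes the template holds an HDFS source URL
# containing a {{RUN_DATE}} token; the token and file names are placeholders.
RUN_DATE=$(date +%Y-%m-%d)
sed "s/{{RUN_DATE}}/$RUN_DATE/g" datasources.tsv.template > datasources.tsv

# Run the job against the rewritten sources file. All flags shown here are
# documented in the parameter tables below.
./trifacta_cli.py run_job \
  --user_name <trifacta_user> --password <trifacta_password> \
  --job_type spark --output_format json \
  --data datasources.tsv --script script.cli \
  --cli_output_path ./job_info.out \
  --output_path "hdfs://localhost:8020/trifacta/queryResults/daily/${RUN_DATE}/cleaned.json"
```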
The CLI tools are stored in the following directory:

```
/opt/trifacta/bin/
```
For executing jobs, specify parameters for the following command:

```
./trifacta_cli.py (parameters)
```
These parameters are common to job or connection actions.
| Parameter | Description | Applicable CLI Commands |
| --- | --- | --- |
| `command_type` | The type of CLI command to execute. Accepted values include `run_job`, `get_job_status`, `publish`, `get_publications`, `load_data`, and `truncate_and_load`, plus connection commands such as `list_connections`. See Examples below. For more information on the connection commands, see CLI for Connections. | All |
| `user_name` | (Required) Username of the user executing the command. | All |
| `password` | Password of the specified user. | All |
| `cli_output_path` | Defines the client-side path where the JSON output is stored for all commands. Default value is | All |
| `disable_ssl_certification` | (Optional) When communicating over HTTPS, this setting can be used to override the default behavior of validating the server certificate before executing the command. | All |
| `conn_ssl` | (Optional) Connect to the datastore over SSL. | All |
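As an illustration, the two SSL-related options might be combined with a `publish` command as follows. This is a hedged sketch: whether these options are bare flags or expect values can vary by release, and the connection name is a placeholder, so confirm the syntax with `./trifacta_cli.py publish --help` first.

```bash
# Hypothetical invocation combining conn_ssl (SSL to the datastore) with
# disable_ssl_certification (skip server certificate validation over HTTPS).
# myRedshiftConn is a placeholder connection name; verify flag syntax via --help.
./trifacta_cli.py publish \
  --user_name <trifacta_user> --password <trifacta_password> \
  --job_id 42 --database dev --table table_job_42 \
  --conn_name myRedshiftConn --publish_format avro \
  --conn_ssl --disable_ssl_certification \
  --cli_output_path ./publish_info.out
```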
These parameters apply to managing jobs only.
| Parameter | Description | Applicable CLI Commands |
| --- | --- | --- |
| `host` | (Required) The server and port number of the platform instance. | All |
| `conn_name` | Internal name of the connection. This name is referenced in your CLI scripts. It should be a single value without spaces. | `load_data`, `publish`, `truncate_and_load` |
| `conn_id` | The internal identifier for the connection. When a connection is created, it is assigned an internal numeric identifier. This ID or the `conn_name` value can be used to reference the connection. | `publish`, `load_data`, `truncate_and_load` |
| `job_type` | The execution environment in which to run the job (for example, `spark`). | `run_job` |
| `job_id` | The internal identifier for the job. This value can be retrieved from the output of a completed `run_job` command. | `get_job_status`, `publish`, `get_publications`, `load_data`, `truncate_and_load` |
| `profiler` | When `on`, profiling of your job is enabled. Default is `off`. | `run_job` |
| `data` | Full UNIX path to the source TSV file. This file contains a URL pointing to the actual Hive or HDFS source: one TSV file for each job run. The executing user must have access to this file. | `run_job` |
| `script` | Full UNIX path to the CLI script file (`script.cli`). | `run_job` |
| `publish_action` | (Optional) Defines the action taken on second and subsequent publish operations (for example, `create`, which writes a new file or table with each run). | `run_job` |
| `header` | (Optional) When `true`, the output for a CSV job includes a header row of column names. | `run_job` |
| `single_file` | (Optional) When `true`, CSV or JSON outputs are written to a single file. Default is `false`. | `run_job` |
| `output_path` | (Required) Defines the fully qualified URI to which the job results are written, as in the examples below. This parameter specifies the base filename. If you are publishing files, the protocol is set in this URI (for example, `hdfs://`). | `run_job` |
| `output_format` | Accepted values include `csv`, `json`, and `avro`. For `csv` and `json` outputs, see also the `header` and `single_file` parameters. | `run_job` |
| `database` | Name of the Redshift or Hive database to which you are publishing or loading. | `publish`, `load_data`, `truncate_and_load` |
| `table` | The table of the database to which you are publishing or loading. | `publish`, `load_data`, `truncate_and_load` |
| `publish_format` | The format of the output file from which to publish to Hive or Redshift tables. Accepted values include `csv`, `json`, and `avro`. | `publish`, `get_publications` |
| `publish_opt_file` | Path to a file containing definitions for multiple file or table targets to which to write the job's results. For more information, see CLI Publishing Options File. | `run_job` |
| `skip_publish_validation` | By default, the CLI automatically checks for schema validation when generating results to a pre-existing source. If this flag is set, schema validation is skipped on results output. | `run_job` |
For documentation on the CLI parameters, run:

```
./trifacta_cli.py --help
```

Additional documentation may be available for individual commands:

```
./trifacta_cli.py <command> --help
```
A key function of the CLI is to execute jobs. You can also check job status through the command line interface and then take subsequent publication actions using other commands.
This command requires a dataset and a CLI script. The CLI script is used to programmatically run a recipe produced in the Transformer page.
For example, if you receive raw data each day, you can parameterize the execution of the same recipe against daily downloads written to HDFS.
Each run of the CLI script creates a new job. A finished CLI job appears on the Jobs page.
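For a daily feed, the run can be scheduled with cron. The entry below is a sketch: the wrapper script path and schedule are hypothetical, and the wrapper itself would regenerate `datasources.tsv` for the day's download (see the parameterization sketch earlier) before invoking `run_job`.

```bash
# Hypothetical crontab entry: wrangle the daily download at 06:00.
# /opt/scripts/run_daily_wrangle.sh is a placeholder for your own wrapper
# that rewrites datasources.tsv and then calls ./trifacta_cli.py run_job.
0 6 * * * /opt/scripts/run_daily_wrangle.sh >> /var/log/daily_wrangle.log 2>&1
```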
Steps:

Acquire the following files, which are contained in the downloaded ZIP:

- `script.cli`: Contains the necessary code and configuration to access HDFS and the sources listed in `datasources.tsv`.
- `datasources.tsv`: Contains pointers to the source storage location of your datasource(s).
- `publishopts.json`: Template file for defining one or more publishing targets for running jobs. See CLI Publishing Options File.

Notes on connections and jobs:
In the downloaded ZIP, the `datasources.tsv` file may contain a reference to the connection used to import the dataset. However, if you are running the CLI in an instance that is different from its source, this connection Id may be different in the new environment. From the new environment, do the following:

1. Use the `list_connections` operation to acquire the list of connections available in the new environment. See CLI for Connections.
2. Acquire the Id value for the connection corresponding to the one used in `datasources.tsv`.

   NOTE: The user who is executing the CLI script must be able to access the connection in the new environment.

3. Edit `datasources.tsv`. Replace the connection Id value in the file with the value retrieved through the CLI.
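A sketch of steps 1-3 above, under stated assumptions: `list_connections` writes its JSON results to the path given by `--cli_output_path`, and each entry exposes `id` and `name` fields. Those field names, and the old and new Id values (7 and 12), are illustrative, so check your actual output file before scripting against it.

```bash
# Step 1: list the connections available in the new environment.
./trifacta_cli.py list_connections \
  --user_name <trifacta_user> --password <trifacta_password> \
  --cli_output_path ./connections.out

# Step 2: inspect the JSON to find the Id of the matching connection.
# The "connections", "id", and "name" keys are assumptions about the output shape.
python3 -c "
import json
doc = json.load(open('connections.out'))
for conn in doc.get('connections', []):
    print(conn.get('id'), conn.get('name'))
"

# Step 3: swap the old connection Id (7) for the new one (12) in datasources.tsv.
# Review the file afterward; a bare numeric substitution can match more than intended.
sed -i 's/\b7\b/12/g' datasources.tsv
```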
Example (All one command):

```
./trifacta_cli.py run_job --user_name <trifacta_user> --password <trifacta_password> --job_type spark --output_format json --data redshift-test/datasources.tsv --script redshift-test/script.cli --cli_output_path ./job_info.out --profiler on --output_path hdfs://localhost:8020/trifacta/queryResults/foo@trifacta.com/MyDataset/42/cleaned_table_1.json
```

Console output:

```
Job #42 has been successfully launched:
You may monitor the progress of your job here: http://localhost:3005/jobs
```

JSON response written to `job_info.out`:

```
{
  "status": "success",
  "job_outputs": {
    "other_outputs": [
      "hdfs://localhost:8020/trifacta/queryResults/foo@trifacta.com/MyDataset/42/.profiler/profilerValidValueHistograms.json",
      "hdfs://localhost:8020/trifacta/queryResults/foo@trifacta.com/MyDataset/42/.profiler/profilerSamples.json",
      "hdfs://localhost:8020/trifacta/queryResults/foo@trifacta.com/MyDataset/42/.profiler/profilerTypeCheckHistograms.json",
      "hdfs://localhost:8020/trifacta/queryResults/foo@trifacta.com/MyDataset/42/.profiler/profilerInput.json"
    ],
    "job_result_files": [
      "hdfs://localhost:8020/trifacta/queryResults/foo@trifacta.com/MyDataset/42/cleaned_table_1.json"
    ]
  },
  "job_id": 42,
  "cli_script": "/trifacta/queryResults/foo@trifacta.com/redshift-test/script.cli",
  "job_type": "spark",
  "profiler": "on",
  "source_data": "/trifacta/queryResults/foo@trifacta.com/redshift-test/datasources.tsv",
  "host": "localhost:3005",
  "output_path": "hdfs://localhost:8020/trifacta/queryResults/foo@trifacta.com/MyDataset/42/cleaned_table_1.json",
  "user": "foo@trifacta.com",
  "output_file_formats": [
    "json"
  ]
}
```
You can specify publication options as part of your `run_job` command. In the following example, a single CSV file with headers is written to a new file with each job execution.

NOTE: This method of specifying a single-file publishing action has been superseded by a newer method, which relies on an external file for specifying publishing targets. In a future release, this method may be deprecated. For more information, see CLI Publishing Options File.

Example (All one command):

```
./trifacta_cli.py run_job --user_name <trifacta_user> --password <trifacta_password> --job_type spark --output_format csv --data redshift-test/datasources.tsv --script redshift-test/script.cli --publish_action create --header true --single_file true --cli_output_path ./job_info.out --profiler on --output_path hdfs://localhost:8020/trifacta/queryResults/foo@trifacta.com/MyDataset/43/cleaned_table_1.csv
```

Console output:

```
Job #43 has been successfully launched:
You may monitor the progress of your job here: http://localhost:3005/jobs
```

JSON response written to `job_info.out`:

```
{
  "status": "success",
  "job_outputs": {
    "other_outputs": [
      "hdfs://localhost:8020/trifacta/queryResults/foo@trifacta.com/MyDataset/43/.profiler/profilerValidValueHistograms.json",
      "hdfs://localhost:8020/trifacta/queryResults/foo@trifacta.com/MyDataset/43/.profiler/profilerSamples.json",
      "hdfs://localhost:8020/trifacta/queryResults/foo@trifacta.com/MyDataset/43/.profiler/profilerTypeCheckHistograms.json",
      "hdfs://localhost:8020/trifacta/queryResults/foo@trifacta.com/MyDataset/43/.profiler/profilerInput.json"
    ],
    "job_result_files": [
      "hdfs://localhost:8020/trifacta/queryResults/foo@trifacta.com/MyDataset/43/cleaned_table_1.csv"
    ]
  },
  "job_id": 43,
  "cli_script": "/trifacta/queryResults/foo@trifacta.com/redshift-test/script.cli",
  "output_file_formats": [
    "csv"
  ],
  "job_type": "spark",
  "host": "localhost:3005",
  "job_output_path": "/trifacta/queryResults/foo@trifacta.com/MyDataset/43/",
  "user": "foo@trifacta.com",
  "source_data": "/trifacta/queryResults/foo@trifacta.com/redshift-test/datasources.tsv",
  "profiler": "on"
}
```
As part of the CLI job, you can define multiple file or table targets to which to write the job results. For more information, see CLI Publishing Options File.
After you queue a job through the CLI, you can review the status of the job through the application or through the CLI.
Tip: You can acquire the job ID through the application as needed. For example, at some point in the future, you might decide to publish to Hive the results from a job you executed two weeks ago. It might be easiest to retrieve this job identifier from the Dataset Details page. See Dataset Details Page.
Example (All one command):

```
./trifacta_cli.py get_job_status --user_name <trifacta_user> --password <trifacta_password> --job_id 42 --cli_output_path ./job_info.out
```

Console output:

```
Job status: Complete
```

JSON response written to `job_info.out`:

```
{
  "status": "success",
  "job_outputs": {
    "other_outputs": [
      "hdfs://localhost:8020/trifacta/queryResults/foo@trifacta.com/MyDataset/42/.profiler/profilerValidValueHistograms.json",
      "hdfs://localhost:8020/trifacta/queryResults/foo@trifacta.com/MyDataset/42/.profiler/profilerSamples.json",
      "hdfs://localhost:8020/trifacta/queryResults/foo@trifacta.com/MyDataset/42/.profiler/profilerTypeCheckHistograms.json",
      "hdfs://localhost:8020/trifacta/queryResults/foo@trifacta.com/MyDataset/42/.profiler/profilerInput.json"
    ],
    "job_result_files": [
      "hdfs://localhost:8020/trifacta/queryResults/foo@trifacta.com/MyDataset/42/cleaned_table_1.json",
      "hdfs://localhost:8020/trifacta/queryResults/foo@trifacta.com/MyDataset/42/cleaned_table_1.csv",
      "hdfs://localhost:8020/trifacta/queryResults/foo@trifacta.com/MyDataset/42/cleaned_table_1.avro"
    ]
  },
  "job_id": 42,
  "cli_script": "/trifacta/queryResults/foo@trifacta.com/redshift-test/script.cli",
  "output_file_formats": [
    "csv",
    "json",
    "avro"
  ],
  "job_type": "spark",
  "host": "localhost:3005",
  "job_output_path": "/trifacta/queryResults/foo@trifacta.com/MyDataset/42/",
  "user": "foo@trifacta.com",
  "source_data": "/trifacta/queryResults/foo@trifacta.com/redshift-test/datasources.tsv",
  "profiler": "on"
}
```
You can publish job results to Hive or Redshift:

- For Redshift, you can publish CSV, JSON, or Avro results from S3 to Redshift.

NOTE: To publish to Redshift, results must first be written to S3.
NOTE: Even if you are publishing to the default schema, you must preface the table value with the name of the schema (`schema.table`).
Publish commands can be executed as soon as the job identifier has been created. After the publish command is submitted, the publish job is queued for execution after any related transform job has been completed.
You execute one publish command for each output you wish to write to a supported database table.
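So, to write the same job's results to several tables, you issue several `publish` calls, one per target, as in the sketch below. The table and connection names are placeholders.

```bash
# One publish command per target table; names here are hypothetical.
for TBL in sales_clean sales_clean_archive; do
  ./trifacta_cli.py publish \
    --user_name <trifacta_user> --password <trifacta_password> \
    --job_id 42 --database dev --table "$TBL" \
    --conn_name myRedshiftConn --publish_format avro \
    --cli_output_path "./publish_${TBL}.out"
done
```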
Example (All one command):

```
./trifacta_cli.py publish --user_name <trifacta_user> --password <trifacta_password> --job_id 42 --database dev --table table_job_42 --conn_name 1 --publish_format avro --cli_output_path ./publish_info.out
```

Console output:

```
Create new table every run has been successfully launched:
You may monitor the progress of your publish job here: http://localhost:3005/jobs
Upon success, you may view the results of your publish job here: http://localhost:3005/jobs/42
```

JSON response written to `publish_info.out`:

```
{"status": "Job Started", "job_id": 42}
```
You can retrieve a JSON list of all publications that have been executed for a specific job.
Example (All one command):

```
./trifacta_cli.py get_publications --user_name <trifacta_user> --password <trifacta_password> --job_id 42 --cli_output_path ./publications.out --publish_format avro
```

Console output:

```
Job with id 42 has 2 avro publication(s) associated with it.
The list of publications is available in "./publications.out".
```

JSON response written to `publications.out`:

```
{
  "publications": [
    {
      "publication_target": "redshift",
      "job_id": "42",
      "database": "dev",
      "publication_id": 69,
      "app_host": "trifacta.example.com:3005",
      "user": "foo@trifacta.com",
      "table": "table_job_42",
      "publish_format": "avro",
      "connect_str": "jdbc:redshift://dev.example.com:5439/dev"
    },
    {
      "publication_target": "hive",
      "job_id": "42",
      "database": "default",
      "publication_id": 70,
      "app_host": "trifacta.example.com:3005",
      "user": "foo@trifacta.com",
      "table": "table_job_42",
      "publish_format": "avro",
      "connect_str": "jdbc:hive2://hadoop:10000/default"
    }
  ]
}
```
You can load data into pre-existing Redshift tables. If the table does not exist, the job fails.

NOTE: When appending data into a Redshift table, the columns displayed in the Transformer page must match the order and data type of the columns in the target table.

Example (All one command):

```
./trifacta_cli.py load_data --user_name <trifacta_user> --password <trifacta_password> --job_id 42 --database dev --table table_42 --conn_name aSQLServerConnection --publish_format avro --cli_output_path ./load_info.out
```

Console output:

```
Append to this table every run has been successfully launched:
You may monitor the progress of your publish job here: http://localhost:3005/jobs
Upon success, you may view the results of your Load data/Append job here: http://localhost:3005/jobs/42
```

JSON response written to `load_info.out`:

```
{"status": "Job Started", "job_id": 42}
```
For existing tables, you can truncate them and load them with results from a job. If the table does not exist, a new one is created and populated.

Example (All one command):

```
./trifacta_cli.py truncate_and_load --user_name <trifacta_user> --password <trifacta_password> --job_id 10 --database dev --table table_43 --conn_name aSQLServerConnection --publish_format avro --cli_output_path ./load_and_trunc_info.out
```

Console output:

```
Truncate and Load has been successfully launched:
You may monitor the progress of your publish job here: http://localhost:3005/jobs
Upon success, you may view the results of your Truncate and Load job here: http://localhost:3005/jobs/10
```

JSON response written to `load_and_trunc_info.out`:

```
{"status": "Job Started", "job_id": 10}
```