Page tree

 

Contents:


NOTE: This feature requires developer-level skills to enable and use.

The Command Line Interface for Jobs enables programmatic control over a variety of operations on the platform. You can use the CLI to execute any of the following types of commands:

  • Run a job

    NOTE: In this release, you cannot run jobs using datasets imported from Redshift or SQL DW connections via the CLI. This known issue will be fixed in a future release.

  • Check job status
  • Publish a completed job to other datastores asynchronously

    NOTE: In this release, you cannot publish results to Redshift or SQL DW connections via the CLI. This known issue will be fixed in a future release.

  • Get information on publications of a job
  • Load data into a new or existing table in Redshift
  • (Hive only) Clear an existing table and load with new data

Requirements

  • The CLI must have access to a running instance of the Trifacta® platform. You can specify the host and port of this instance. 
  • If you are running jobs for a dataset with parameters, the downloaded assets reference only the first matching file of the dataset. To run the job across all files in the dataset with parameters, you must build the matching logic within your CLI script. For more information on datasets with parameters, see Overview of Parameterization.

Command Reference

Execute the following command from the top-level Trifacta directory. The Python script references script.cliand datasources.tsv as parameters.

For repeat executions of the same script.cli file, you can parameterize the values in the datasources.tsv.

The CLI tools are stored in the following directory:

/opt/trifacta/bin/

For executing jobs, specify parameters for the following:

./trifacta_cli.py (parameters)

Parameters

Common

These parameters are common to job or connection actions.

ParameterDescriptionApplicable CLI Commands
command_type
The type of CLI command to execute. Accepted values:

  • run_job - Execute a specified job on the specified running environment.
  • get_job_status - Get job status information.
  • get_publications - Acquire publication information for a specified job.
  • publish - Publish a completed job to the specified database table. If the table exists, the data is appended to the existing table.

  • load_data  - Load data into the database table. If the table exists, the data is appended to the existing table. If the table does not exist, the job fails.
  • truncate_and_load - Overwrite data in specified table. If the table does not exist, a new one is created and populated.

See Examples below.

For more information on the following commands, see CLI for Connections.

  • create_connection - Create a new connection object.
  • edit_connection - Edit an existing connection object.
  • list_connections - List all connection objects for the specified user.
  • delete_connection - Delete a connection object.
All
user_name

(Required) Trifacta username of the user to execute the job. Please specify the full username.

NOTE: In the response, this value is listed as  user .

All
password

Trifacta password for the username
If no password is specified, you are prompted to enter one.

NOTE: If you have enabled Kerberos-based access to the Trifacta platform, you do not need to provide a password. To enable, additional configuration is required. See Set up for a Kerberos-enabled Hadoop cluster.

NOTE: You can store username and password information in an external file so that they don't need to be included in each command. For more information, see CLI Config File.

All
cli_output_path

Defines the client-side path where the JSON output is stored for all commands. Default value iscli_results.out

NOTE: The user issuing the command must also have execute permissions on all parent folders in the specified cli_output_path.

All
disable_ssl_certification

(Optional) When communicating over HTTPS, this setting can be used to override the default behavior of validating the server certificate before executing the command.

NOTE: If you have stored a self-signed certificate on the Trifacta node, please set the REQUESTS_CA_BUNDLE environment variable to point to the directory that contains the trusted server's certificate(s). The CLI will verify against these certs. In this case, the disable_ssl_certificate parameter is not needed.

All commands
conn_ssl

(Optional) Connect to the datastore over SSL.

NOTE: You must modify the host parameter value to include the appropriate port number for the SSL connection.

NOTE: SSL connections are not supported for Hive, Redshift, or SQL Server.


All commands 

Params for managing jobs

These parameters apply to managing jobs only. 

ParameterDescriptionApplicable CLI Commands
host

(Required) The server and port number of the Trifacta instance.
Replace this value with the host and port of the running Trifacta instance. If it is not provided, localhost:3005 is assumed.

NOTE: In some environments, the http:// or https:// protocol identifier may be required as part of the host value.

All
conn_name

Internal name of the connection. This name is referenced in your CLI scripts. It should be a single value without spaces.

NOTE: This value must be unique among your connection names.

load_data, publish, truncate_and_load
conn_id

The internal identifier for the connection. When a connection is created, it is assigned an internal numeric identifier. This ID or the connection_name can be used to reference the connection in future commands.

Tip: This value is available when you hover over a connection in the application. See Flows Page.

publish , load_data, truncate_and_load
job_type

The execution environment in which to run the job:

photon = Run on Photon running environment on Trifacta Server .

NOTE: If the job_type parameter is not specified, CLI jobs are run on the Photon running environment.

hadoop = Run in the default running environment for your Hadoop cluster.

NOTE: When this job type is applied, your CLI scripts automatically transition to running jobs in Spark.

spark = Run on the Spark running environment in Hadoop.

run_job
job_idThe internal identifier for the job. This value can be retrieved from the output of a completed run_job command.get_job_statuspublish,get_publications,load_data
profilerWhen on, profiling of your job is enabled. Default is off.run_job
dataFull UNIX path to the source TSV file. This file contains a URL pointing to the actual Hive or HDFS source: one TSV file for each job run. Executing user must have access to this file.run_job
script
Full UNIX path from the Trifacta root directory to the CLI script file. Executing user must have access.
run_job
publish_action

(Optional) Defines the action taken on second and subsequent publish operations:

  • create - (default) A new file is created with each publication. Filename is numeric identifier of the job ID.
  • append - Each publication appends to the existing output file. Filename is consistent across publications.

    NOTE: Compression of published files is not supported through the command line interface.

    NOTE: When publishing single files to S3, the append operation is not supported.

  • replace - Subsequent publications replace the same file with each execution.
run_job
header

(Optional), The output for a CSV job with append or create publishing action includes the column headers as the first row. Default is false.

NOTE: If you use the header option, you must also include the single_file option, or this setting is ignored.

run_job
single_file(Optional) When true, CSV or JSON outputs are written to a single file. Default is false.run_job
output_path

(Required) Defines the fully qualified URI to where the job results are written, as in the following examples:

hdfs://host:port/path/filename.csv
s3://bucketName/path/filename.csv

NOTE: The output_path must include the protocol identifier or host and port number (if applicable).

This parameter specifies the base filename. If you are publishing files, the publish_action parameter value may change the exact filename that is written.

Protocol is set in webapp.storageProtocolin trifacta-conf.json.

run_job
output_format

Accepted values:csv,jsonpqt (Parquet), and avro (Avro).

NOTE: For  pqt  format,  job_type=spark is required.

For job_type=photon, you may generate csv, json, and avro results.

run_job
database
Name of Redshift or Hive database to which you are publishing or loading.
publish,load_data
table

The table of the database to which you are publishing or loading.

publish,load_data
publish_format

The format of the output file from which to publish to Hive or Redshift tables. Accepted values: csv, jsonpqt (Parquet), or avro (Avro).

publishget_publications
publish_opt_filePath to file containing definitions for multiple file or table targets to which to write the job's results. For more information, see CLI Publishing Options File.run_job
skip_publish_validation

By default, the CLI automatically checks for schema validation when generating results to a pre-existing source.

If this flag is set, schema validation is skipped on results output.

run_job

 

For documentation on the CLI parameters, run:

./trifacta_cli.py --help

Additional documentation may be available for individual commands using the following:

./trifacta_cli.py <commmand> --help

Examples

A key function of the CLI is to execute jobs. You can also check job status through the command line interface and then take subsequent publication actions using other commands.

Run job

This command requires a dataset and a CLI script. The CLI script is used to programmatically run a recipe produced in the Transformer page.

For example, if you receive raw data each day, you can parameterize the execution of the same recipe against daily downloads written to HDFS.

Each run of the CLI script creates a new job. A finished CLI job appears on the Jobs page.

Steps:

  1. A recipe is specific to a dataset. In the Transformer page, open the Recipe Panel.
  2. Click Download.
  3. Select CLI Script.
  4. Download to your desktop. The ZIP contains the following:
    • script.cli Contains the necessary code and configuration to access HDFS and the script in the Trifacta database.
    • datasources.tsv Contains pointers to the source storage location of your datasource(s). 
      • If you are running jobs for a dataset with parameters, the downloaded assets reference only the first matching file of the dataset. To run the job across all files in the dataset with parameters, you must build the matching logic within your CLI script. For more information on datasets with parameters, see Overview of Parameterization.
      • For an example of how to add parameters in a local script, see CLI Example - Parameterize Job Runs.
    • publishopts.json Template file for defining one or more publishing targets for running jobs. See CLI Publishing Options File.
  5. These files must be transferred to the Trifacta Server where you can reference them from the Trifacta root directory.

Notes on connections and jobs

In the downloaded ZIP, the datasources.tsv file may contain a reference to the connection used to import the dataset. However, if you are running the CLI in an Trifacta platform instance that is different from its source, this connectionId may be different in the new environment. From the new environment, please do the following:

  1. Use the list_connections operation to acquire the list of connections available in the new environment. See CLI for Connections.
  2. Acquire the Id value for the connection corresponding to the one used in datasources.tsv

    NOTE: The user who is executing the CLI script must be able to access the connection in the new environment.

  3. Edit datasources.tsv. Replace the connection Id value in the file with the value retrieved through the CLI. 
  4. When the job is executed, it should properly connect to the source through the connection in the new environment.

Command - Basic Job Run

NOTE: This method of specifying a single-file publishing action has been superseded by a newer method, which relies on an external file for specifying publishing targets. In a future release, this method may be deprecated. For more information, see CLI Publishing Options File.

Example (All one command):

./trifacta_cli.py run_job --user_name <trifacta_user> --password <trifacta_password> --job_type spark 
--output_format json --data redshift-test/datasources.tsv --script redshift-test/script.cli 
--cli_output_path ./job_info.out --profiler on --output_path hdfs://localhost:8020/trifacta/queryResults/foo@trifacta.com/MyDataset/42/cleaned_table_1.json

Output

Job #42 has been successfully launched:

You may monitor the progress of your job here: http://localhost:3005/jobs

JSON Response

JSON response written to job_info.out:

{
  "status": "success",
  "job_outputs": {
    "other_outputs": [
      "hdfs://localhost:8020/trifacta/queryResults/foo@trifacta.com/MyDataset/42/.profiler/profilerValidValueHistograms.json",
      "hdfs://localhost:8020/trifacta/queryResults/foo@trifacta.com/MyDataset/42/.profiler/profilerSamples.json",
      "hdfs://localhost:8020/trifacta/queryResults/foo@trifacta.com/MyDataset/42/.profiler/profilerTypeCheckHistograms.json",
      "hdfs://localhost:8020/trifacta/queryResults/foo@trifacta.com/MyDataset/42/.profiler/profilerInput.json"
    ]
    "job_result_files": [
      "hdfs://localhost:8020/trifacta/queryResults/foo@trifacta.com/MyDataset/42/cleaned_table_1.json",
    ]
  },
  "job_id": 42,
  "cli_script": "/trifacta/queryResults/foo@trifacta.com/redshift-test/script.cli",
  "job_type": "spark",
  "profiler": "on",
  "source_data": "/trifacta/queryResults/foo@trifacta.com/redshift-test/datasources.tsv",
  "host": "localhost:3005",
  "output_path": "hdfs://localhost:8020/trifacta/queryResults/foo@trifacta.com/MyDataset/42/cleaned_table_1.json",
  "user": "foo@trifacta.com",
  "output_file_formats": [
    "json"
  ],
}

Command - File Publishing Options

You can specify publication options as part of your run_job command. In the following, a single CSV file with headers is written to a new file with each job execution.

Example (All one command):

./trifacta_cli.py run_job --user_name <trifacta_user> --password <trifacta_password> --job_type spark 
--output_format csv --data redshift-test/datasources.tsv --script redshift-test/script.cli 
--publish_action create --header true --single_file true
--cli_output_path ./job_info.out --profiler on --output_path hdfs://localhost:8020/trifacta/queryResults/foo@trifacta.com/MyDataset/43/cleaned_table_1.csv

Output

Job #43 has been successfully launched:

You may monitor the progress of your job here: http://localhost:3005/jobs

JSON Response

JSON response written to job_info.out:

{
  "status": "success",
  "job_outputs": {
    "other_outputs": [
      "hdfs://localhost:8020/trifacta/queryResults/foo@trifacta.com/MyDataset/43/.profiler/profilerValidValueHistograms.json",
      "hdfs://localhost:8020/trifacta/queryResults/foo@trifacta.com/MyDataset/43/.profiler/profilerSamples.json",
      "hdfs://localhost:8020/trifacta/queryResults/foo@trifacta.com/MyDataset/43/.profiler/profilerTypeCheckHistograms.json",
      "hdfs://localhost:8020/trifacta/queryResults/foo@trifacta.com/MyDataset/43/.profiler/profilerInput.json"
    ]
    "job_result_files": [
      "hdfs://localhost:8020/trifacta/queryResults/foo@trifacta.com/MyDataset/43/cleaned_table_1.csv",
    ]
  },
  "job_id": 43,
  "cli_script": "/trifacta/queryResults/foo@trifacta.com/redshift-test/script.cli",
  "output_file_formats": [
    "csv",
  ],
  "job_type": "spark",
  "host": "localhost:3005",
  "job_output_path": "/trifacta/queryResults/foo@trifacta.com/MyDataset/43/",
  "user": "foo@trifacta.com",
  "source_data": "/trifacta/queryResults/foo@trifacta.com/redshift-test/datasources.tsv",
  "profiler": "on"
}

Command - publishing to multiple targets

As part of the CLI job, you can define multiple file or table targets to which to write the job results. For more information, see CLI Publishing Options File.

Get job status

After you queue a job through the CLI, you can review the status of the job through the application or through the CLI. 

Tip: You can acquire the job ID through the application as needed. For example, at some point in the future, you might decide to publish to Hive the results from a job you executed two weeks ago. It might be easiest to retrieve this job identifier from the Dataset Details page. See Dataset Details Page.

Command

Example (All one command):

./trifacta_cli.py get_job_status --user_name <trifacta_user> --password <trifacta_password> --job_id 42 
--cli_output_path ./job_info.out

Output

Job status: Complete

JSON Response

JSON response written to job_info.out:

{
  "status": "success",
  "job_outputs": {
    "other_outputs": [
      "hdfs://localhost:8020/trifacta/queryResults/foo@trifacta.com/MyDataset/42/.profiler/profilerValidValueHistograms.json",
      "hdfs://localhost:8020/trifacta/queryResults/foo@trifacta.com/MyDataset/42/.profiler/profilerSamples.json",
      "hdfs://localhost:8020/trifacta/queryResults/foo@trifacta.com/MyDataset/42/.profiler/profilerTypeCheckHistograms.json",
      "hdfs://localhost:8020/trifacta/queryResults/foo@trifacta.com/MyDataset/42/.profiler/profilerInput.json"
    ]
    "job_result_files": [
      "hdfs://localhost:8020/trifacta/queryResults/foo@trifacta.com/MyDataset/42/cleaned_table_1.json",
      "hdfs://localhost:8020/trifacta/queryResults/foo@trifacta.com/MyDataset/42/cleaned_table_1.csv",
      "hdfs://localhost:8020/trifacta/queryResults/foo@trifacta.com/MyDataset/42/cleaned_table_1.avro",
    ]
  },
  "job_id": 42,
  "cli_script": "/trifacta/queryResults/foo@trifacta.com/redshift-test/script.cli",
  "output_file_formats": [
    "csv",
    "json",
    "avro",
  ],
  "job_type": "spark",
  "host": "localhost:3005",
  "job_output_path": "/trifacta/queryResults/foo@trifacta.com/MyDataset/42/",
  "user": "foo@trifacta.com",
  "source_data": "/trifacta/queryResults/foo@trifacta.com/redshift-test/datasources.tsv",
  "profiler": "on"
}

Publish

You can publish job results to Hive or Redshift:

  • For Hive, you can publish Avro or Parquet results from HDFS or S3 to Hive.
  • For Redshift, you can publish CSV, JSON, or Avro results from S3 to Redshift.

    NOTE: To publish to Redshift, results must be written first to S3.

    NOTE: Even if you are publishing to the default schema, you must preface the table value with the name of the schema to use: MySchema.MyTable.

Publish commands can be executed as soon as the job identifier has been created. After the publish command is submitted, the publish job is queued for execution after any related transform job has been completed.

You execute one publish command for each output you wish to write to a supported database table.

Command

Example (All one command):

./trifacta_cli.py publish --user_name <trifacta_user> --password <trifacta_password> --job_id 42 
--database dev --table table_job_42 --conn_name 1 --publish_format avro 
--cli_output_path ./publish_info.out

 

Output

Create new table every run has been successfully launched:
You may monitor the progress of your publish job here: http://localhost:3005/jobs
Upon success, you may view the results of your publish job here: http://localhost:3005/jobs/42

JSON Response

JSON response written to publish_info.out:

{"status": "Job Started", "job_id": 42}

Get publications

You can retrieve a JSON list of all publications that have been executed for a specific job. 

Command

Example (All one command):

./trifacta_cli.py get_publications --user_name <trifacta_user> --password <trifacta_password> --job_id 42
--cli_output_path ./publications.out --publish_format avro

Output

Job with id 42 has 2 avro publication(s) associated with it. The list of publications is available in "./publications.out".

JSON Response

JSON response written to publications.out:

{
  "publications": [
    {
      "publication_target": "redshift",
      "job_id": "42",
      "database": "dev",
      "publication_id": 69,
      "app_host": "trifacta.example.com:3005",
      "user": "foo@trifacta.com",
      "table": "table_job_42",
      "publish_format": "avro",
      "connect_str": "jdbc:redshift://dev.example.com:5439/dev"
    },
    {
      "publication_target": "hive",
      "job_id": "42",
      "database": "default",
      "publication_id": 70,
      "app_host": "trifacta.example.com:3005",
      "user": "foo@trifacta.com",
      "table": "table_job_42",
      "publish_format": "avro",
      "connect_str": "jdbc:hive2://hadoop:10000/default"
     }
  ],
}

Load data into table

You can load data into pre-existing Redshift tables.

  • Data is appended after any existing rows.
  • If the table does not exist, the job fails.

NOTE: When appending data into a Redshift table, the columns displayed in the Transformer page must match the order and data type of the columns in the target table.

Command

Example (All one command):

./trifacta_cli.py load_data --user_name <trifacta_user> --password <trifacta_password> --job_id 42 
--database dev --table table_42 --conn_name aSQLServerConnection --publish_format avro 
--cli_output_path ./load_info.out

Output

Append to this table every run has been successfully launched:
You may monitor the progress of your publish job here: http://localhost:3005/jobs
Upon success, you may view the results of your Load data/Append job here: http://localhost:3005/jobs/42

JSON Response

JSON response written to load_info.out:

{"status": "Job Started", "job_id": 42}

Truncate and load

For existing tables, you can clear them and load them with results from a job. If the table does not exist, a new one is created and populated.

Command

Example (All one command):

./trifacta_cli.py truncate_and_load --user_name <trifacta_user> --password <trifacta_password> --job_id 10 
--database dev --table table_43 --conn_name aSQLServerConnection --publish_format avro 
--cli_output_path ./load_and_trunc_info.out

Output

Truncate and Load has been successfully launched:You may monitor the progress of your publish job here: http://localhost:3005/jobs
Upon success, you may view the results of your Truncate and Load job here: http://localhost:3005/jobs/10

JSON Response

JSON response written to load_and_trunc_info.out:

{"status": "Job Started", "job_id": 10}

This page has no comments.