API Task - Run Job on Dataset with Parameters

Warning

API access is migrating to Enterprise only. Beginning in Release 9.5, all new or renewed subscriptions have access to public API endpoints on the Enterprise product edition only. Existing customers on non-Enterprise editions will retain access to their available endpoints (Legacy) until their subscription expires. To use API endpoints after renewal, you must upgrade to the Enterprise product edition or use a reduced set of endpoints (Current). For more information on differences between product editions in the new model, please visit Pricing and Packaging.

Overview

This example task describes how to run jobs on datasets with parameters through Dataprep by Trifacta.

A dataset with parameters is a dataset in which some part of the path to the data objects has been parameterized. Because one or more parts of the path can vary, you can build a dataset with parameters to capture data that spans multiple files. For example, datasets with parameters can be used to parameterize serialized data by region, date, or other variables. For more information on datasets with parameters, see Overview of Parameterization.

Basic Task

The basic method by which you build and run a job for a dataset with parameters is very similar to the method for a non-parameterized dataset, with a few notable exceptions. The steps in this task follow the same sequence as the standard task. Where the steps overlap, links have been provided to the non-parameterized task. For more information, see API Task - Develop a Flow.

Example Datasets

This example covers four different datasets, each of which features a different type of dataset with parameters.

  • Example 1 (Datetime parameter): In this example, a directory is used to store daily orders transactions. This dataset must be defined with a Datetime parameter to capture the preceding 7 days of data. Jobs can be configured to process all of this data as it appears in the directory.

  • Example 2 (Variable): This dataset segments data into four timezones across the US. These timezones are defined using the following text values in the path: pacific, mountain, central, and eastern. In this case, you can create a parameter called region, which can be overridden at runtime to be set to one of these four values during job execution.

  • Example 3 (Pattern parameter): This example is a directory containing point-of-sale transactions captured into individual files for each region. Since each region is defined by a numeric value (01, 02, 03), the dataset can be defined using a pattern parameter.

  • Example 4 (Environment parameter): An environment parameter is defined by an admin and is available for every user of the project or workspace. In particular, environment parameters are useful for defining source bucket names, which may vary between environments in the same organization.

Step - Create Containing Flow

You must create the flow to host your dataset with parameters.

In the response, you must capture and retain the flow Identifier. For more information, see API Task - Develop a Flow.
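The flow-creation call can be sketched as follows. This is a minimal illustration only: the POST /v4/flows endpoint is assumed from the v4 endpoint pattern used elsewhere in this task, and the host and bearer token are placeholders. Check the API reference for the exact request shape in your release.

```python
import json
import urllib.request

API_BASE = "http://www.example.com:3005/v4"   # example host used in this task
TOKEN = "<your-access-token>"                 # placeholder credential

def create_flow(name, dry_run=True):
    """Build (and optionally send) an assumed POST /v4/flows request.

    With dry_run=True the payload is returned for inspection instead of
    being sent, so no live server is needed.
    """
    payload = {"name": name}
    if dry_run:
        return payload
    req = urllib.request.Request(
        f"{API_BASE}/flows",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {TOKEN}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        # Capture and retain the flow identifier from the response
        return json.load(resp)["id"]
```

The returned id is the flow identifier you retain for the later steps.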

Step - Create Datasets with Parameters

Note

When you import a dataset with parameters, only the first matching dataset is used for the initial file. If you want to see data from other matching files, you must collect a new sample within the Transformer page.

Example 1 - Dataset with Datetime parameter

Suppose your files are stored in the following paths:

MyFiles/1/Datetime/2018-04-06-orders.csv
MyFiles/1/Datetime/2018-04-05-orders.csv
MyFiles/1/Datetime/2018-04-04-orders.csv
MyFiles/1/Datetime/2018-04-03-orders.csv
MyFiles/1/Datetime/2018-04-02-orders.csv
MyFiles/1/Datetime/2018-04-01-orders.csv
MyFiles/1/Datetime/2018-03-31-orders.csv

When you navigate to the directory through the application, you mouse over one of these files and select Parameterize.

In the window, select the date value (e.g. YYYY-MM-DD) and then click the Datetime icon.

Datetime Parameter:

  • Format: YYYY-MM-DD

  • Date Range: Date is last 7 days.

  • Click Save.

The Datetime parameter should match all files in the directory. Import this dataset and wrangle it.
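The "last 7 days" range can be illustrated with a short sketch that checks which of the example filenames fall inside the window. This assumes one reading of the range (inclusive of the run date); the product's exact boundary handling may differ.

```python
from datetime import date, timedelta

def matches_last_7_days(path, today):
    """Check whether a YYYY-MM-DD-orders.csv filename falls within the
    preceding 7 days, inclusive of the run date (an assumption about how
    the Datetime parameter's range is evaluated)."""
    # Extract the date stamp from e.g. MyFiles/1/Datetime/2018-04-06-orders.csv
    stamp = path.rsplit("/", 1)[-1].split("-orders")[0]
    file_date = date.fromisoformat(stamp)
    return today - timedelta(days=6) <= file_date <= today
```

With a run date of 2018-04-06, all seven example files above fall inside the window, while a 2018-03-30 file would not.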

After you wrangle the dataset, return to its flow view and select the recipe. You should be able to extract the flowId and recipeId values from the URL.

For purposes of this example, here are some key values:

  • flowId: 35

  • recipeId: 127

Example 2 - Dataset with Variable

Suppose your files are stored in the following paths:

MyFiles/1/variable/census-eastern.csv
MyFiles/1/variable/census-central.csv
MyFiles/1/variable/census-mountain.csv
MyFiles/1/variable/census-pacific.csv

When you navigate to the directory through the application, you mouse over one of these files and select Parameterize.

In the window, select the region value, which could be one of the following depending on the file: eastern, central, mountain, or pacific. Click the Variable icon.

Variable Parameter:

  • Name: region

  • Default Value: Set this default to pacific.

  • Click Save.

In this case, the default value matches only one file in the directory. However, when you apply a runtime override to the region variable, you can set it to any of the four values.

Import this dataset and wrangle it.

After you wrangle the dataset, return to its flow view and select the recipe. You should be able to extract the flowId and recipeId values from the URL.

For purposes of this example, here are some key values:

  • flowId: 33

  • recipeId: 123

Example 3 - Dataset with pattern parameter

Suppose your files are stored in the following paths:

MyFiles/1/pattern/POS-r01.csv
MyFiles/1/pattern/POS-r02.csv
MyFiles/1/pattern/POS-r03.csv

When you navigate to the directory through the application, you mouse over one of these files and select Parameterize.

In the window, select the two numeric digits (e.g. 02). Click the Pattern icon.

Pattern Parameter:

  • Type: Wrangle

  • Matching regular expression: {digit}{2}

  • Click Save.

In this case, the Wrangle pattern matches any sequence of two digits in a row. In the above example, this expression matches 01, 02, and 03, which covers all of the files in the directory.
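As an illustration, the Wrangle pattern {digit}{2} is equivalent to the regular expression \d{2}. The following sketch shows that equivalent regex matching all three example files:

```python
import re

# \d{2} is the regular-expression equivalent of the Wrangle pattern {digit}{2}.
# Anchoring it inside the POS-r...csv filename shape mirrors the example paths.
pattern = re.compile(r"POS-r(\d{2})\.csv$")

files = [
    "MyFiles/1/pattern/POS-r01.csv",
    "MyFiles/1/pattern/POS-r02.csv",
    "MyFiles/1/pattern/POS-r03.csv",
]
matched = [m.group(1) for f in files if (m := pattern.search(f))]
```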

Import this dataset and wrangle it.

After you wrangle the dataset, return to its flow view and select the recipe. You should be able to extract the flowId and recipeId values from the URL.

For purposes of this example, here are some key values:

  • flowId: 32

  • recipeId: 121

Note

You have created flows for each type of dataset with parameters.

Example 4 - Dataset with parameterized bucket name

You can parameterize part or all of the bucket name in your source or target paths.

Suppose you have multiple workspaces that use different S3 buckets for sources of data. For example, your environments might look like the following:

  • Dev environment: S3 bucket myco-dev

  • Prod environment: S3 bucket myco-prod

For your datasources, you can parameterize the name of the bucket, so that if you migrate your flow between these environments, the references to datasources are updated based on the parameterized value for the bucket in the new environment.

Create environment parameter

Parameterized buckets are a good use for environment parameters. An environment parameter is a parameter that is available for use by every user in the project or workspace. In this case, the bucket name can be referenced for all datasets in the project or workspace, so turning that value into a parameter makes managing your datasources much more efficient.

You can use the following example to create an environment parameter called env.bucketName, with a value of myco-dev. This environment parameter would be created in your Dev environment:

Note

The overrideKey value, which is the name of the environment parameter, must begin with env..

Endpoint

http://www.example.com:3005/v4/environmentParameters

Authentication

Required

Method

POST

Request Body

{
  "overrideKey": "env.bucketName",
  "value": {
    "variable": {
      "value": "myco-dev"
    }
  }
}

Response

{
  "id": 1,
  "overrideKey": "env.bucketName",
  "value": {
    "variable": {
      "value": "myco-dev"
    }
  },
  "createdAt": "2021-06-24T14:15:22Z",
  "updatedAt": "2021-06-24T14:15:22Z",
  "deleted_at": "2021-06-24T14:15:22Z",
  "usageInfo": {
    "runParameters": 1
  }
}

For more information on creating environment parameters, see Dataprep by Trifacta: API Reference docs.
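Before sending the request above, the body can be assembled and validated locally. The helper below simply enforces the env. prefix noted earlier; the function name is illustrative, not part of the API.

```python
def make_env_param_payload(name, value):
    """Build the request body for POST /v4/environmentParameters.

    The overrideKey (the environment parameter's name) must begin
    with 'env.', as noted in this task.
    """
    if not name.startswith("env."):
        raise ValueError("environment parameter names must begin with 'env.'")
    return {"overrideKey": name, "value": {"variable": {"value": value}}}
```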

Create dataset with parameterized bucket name

The following example creates an imported dataset with two parameters:

  • myPath (type: path; not an environment parameter): The parameterized part of the path. The static value is /. The default value is /dummy. In this case, for the job run, the value is overridden with /dummy2.

  • env.bucketName (type: bucket; environment parameter): The parameterized part of the bucket path. The static value is myco-. In this case, for the job run, the value dev is inserted after the fifth character in the variable.

Endpoint

http://www.example.com:3005/v4/importedDatasets

Authentication

Required

Method

POST

Request Body

{
  "name": "Dummy Dataset",
  "uri": "/path",
  "description": "My S3 parameterized dataset",     
  "type": "S3",     
  "isDynamic": true,
  "runParameters": [
    {
      "type": "path",
      "overrideKey": "myPath",
      "insertionIndices": [
        {
          "index": 1,
          "order": 0
        }
      ],
      "value": {
        "variable": {
          "value": "dummy2"
        }
      }
    },
    {
      "type": "bucket",
      "overrideKey": "env.bucketName",
      "insertionIndices": [
        {
          "index": 5,
          "order": 0
        }
      ],
      "value": {
        "variable": {
          "value": "dev"
        }
      }
    }
  ],
  "dynamicBucket": "myco-",
  "dynamicPath": "/"
}

Response

{
  "visible": true,
  "numFlows": 0,
  "path": "/dummy",
  "bucket": "",
  "type": "s3",
  "isDynamic": true,
  "runParameters": [
    {
      "type": "path",
      "overrideKey": "myPath",
      "insertionIndices": [
        {
          "index": 1,
          "order": 0
        }
      ],
      "value": {
        "variable": {
          "value": "dummy2"
        }
      },
      "isEnvironmentParameter": false
    },
    {
      "type": "bucket",
      "overrideKey": "env.bucketName",
      "insertionIndices": [
        {
          "index": 5,
          "order": 0
        }
      ],
      "value": {
        "variable": {
          "value": "dev"
        }
      },
      "isEnvironmentParameter": true
    }
  ],
  "dynamicBucket": "myco-",
  "dynamicPath": "/"
}

For more information, see Dataprep by Trifacta: API Reference docs.
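One way to read the insertionIndices values in this example: the parameter value is inserted into the static string at the given character index. The helper below illustrates that reading against the two parameters above; it is an assumption based on this example, not a documented algorithm.

```python
def apply_insertion(static, value, index):
    """Insert a parameter value into a static string at a character index.

    This mirrors how the example's insertionIndices appear to compose the
    final bucket and path values (an illustrative reading, not an API).
    """
    return static[:index] + value + static[index:]
```

Under this reading, inserting "dev" at index 5 of the static bucket "myco-" yields "myco-dev", and inserting "dummy2" at index 1 of the static path "/" yields "/dummy2".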

Step - Wrangle Data

After you have created your dataset with parameters, you can wrangle it through the application. For more information, see Transformer Page.

Step - Run Job

Below, you can review the API calls to run a job for each type of dataset with parameters, including relevant information about overrides.

Example 1 - Dataset with Datetime parameter

Note

You cannot apply overrides to these types of datasets with parameters. The following request contains overrides for write settings but no overrides for parameters.

  1. Endpoint

    http://www.example.com:3005/v4/jobGroups

    Authentication

    Required

    Method

    POST

    Request Body

    {
      "wrangledDataset": {
        "id": 127
      },
      "overrides": {
        "execution": "photon",
        "profiler": true,
        "writesettings": [
          {
            "path": "MyFiles/queryResults/joe@example.com/2018-04-03-orders.csv",
            "action": "create",
            "format": "csv",
            "compression": "none",
            "header": false,
            "asSingleFile": false
          }
        ]
      },
      "runParameters": {}
    }
  2. In the above example, the job has been launched for recipe 127 to execute on the Trifacta Photon running environment with profiling enabled.

    1. Output format is CSV to the designated path. For more information on these properties, see Dataprep by Trifacta: API Reference docs.

    2. Output is written as a new file with no overwriting of previous files.

  3. A response code of 201 - Created is returned. The response body should look like the following:

    {
        "sessionId": "79276c31-c58c-4e79-ae5e-fed1a25ebca1",
        "reason": "JobStarted",
        "jobGraph": {
            "vertices": [
                21,
                22
            ],
            "edges": [
                {
                    "source": 21,
                    "target": 22
                }
            ]
        },
        "id": 29,
        "jobs": {
            "data": [
                {
                    "id": 21
                },
                {
                    "id": 22
                }
            ]
        }
    }
  4. Retain the jobGroupId=29 value for monitoring.
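The job-run request bodies in these examples share one shape, differing only in the recipe id, write path, and parameter overrides. A sketch of a builder for that body follows; the function name is illustrative, while the field names mirror the request bodies shown in this task.

```python
def make_job_request(recipe_id, out_path, overrides=None):
    """Assemble a POST /v4/jobGroups body like the examples in this task."""
    body = {
        "wrangledDataset": {"id": recipe_id},
        "overrides": {
            "execution": "photon",
            "profiler": True,
            "writesettings": [
                {
                    "path": out_path,
                    "action": "create",
                    "format": "csv",
                    "compression": "none",
                    "header": False,
                    "asSingleFile": False,
                }
            ],
        },
        # Left empty for Datetime and pattern parameters, which do not
        # accept parameter overrides.
        "runParameters": {},
    }
    if overrides:
        body["runParameters"] = {
            "overrides": {
                "data": [{"key": k, "value": v} for k, v in overrides.items()]
            }
        }
    return body
```

For example, make_job_request(123, "out.csv", {"region": "central"}) produces the variable-override body used in Example 2 below.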

Example 2 - Dataset with Variable

In the following example, the region variable has been overridden with the value central to execute the job on census-central.csv:

  1. Endpoint

    http://www.example.com:3005/v4/jobGroups

    Authentication

    Required

    Method

    POST

    Request Body

    {
      "wrangledDataset": {
        "id": 123
      },
      "overrides": {
        "execution": "photon",
        "profiler": true,
        "writesettings": [
          {
            "path": "MyFiles/queryResults/joe@example.com/region-central.csv",
            "action": "create",
            "format": "csv",
            "compression": "none",
            "header": false,
            "asSingleFile": false
          }
        ]
      },
      "runParameters": {
        "overrides": {
          "data": [
            {
              "key": "region",
              "value": "central"
            }
          ]
        }
      }
    }
  2. In the above example, the job has been launched for recipe 123 to execute on the Trifacta Photon running environment with profiling enabled.

    1. Output format is CSV to the designated path. For more information on these properties, see Dataprep by Trifacta: API Reference docs.

    2. Output is written as a new file with no overwriting of previous files.

  3. A response code of 201 - Created is returned. The response body should look like the following:

    {
        "sessionId": "79276c31-c58c-4e79-ae5e-fed1a25ebca1",
        "reason": "JobStarted",
        "jobGraph": {
            "vertices": [
                21,
                22
            ],
            "edges": [
                {
                    "source": 21,
                    "target": 22
                }
            ]
        },
        "id": 27,
        "jobs": {
            "data": [
                {
                    "id": 21
                },
                {
                    "id": 22
                }
            ]
        }
    }
  4. Retain the jobGroupId=27 value for monitoring.

Example 3 - Dataset with pattern parameter

Note

You cannot apply overrides to these types of datasets with parameters. The following request contains overrides for write settings but no overrides for parameters.

  1. Endpoint

    http://www.example.com:3005/v4/jobGroups

    Authentication

    Required

    Method

    POST

    Request Body

    {
      "wrangledDataset": {
        "id": 121
      },
      "overrides": {
        "execution": "photon",
        "profiler": false,
        "writesettings": [
          {
            "path": "hdfs://hadoop:50070/trifacta/queryResults/admin@example.com/POS-r02.txt",
            "action": "create",
            "format": "csv",
            "compression": "none",
            "header": false,
            "asSingleFile": false
          }
        ]
      },
      "runParameters": {}
    }
  2. In the above example, the job has been launched for recipe 121 to execute on the Trifacta Photon running environment with profiling disabled.

    1. Output format is CSV to the designated path. For more information on these properties, see Dataprep by Trifacta: API Reference docs.

    2. Output is written as a new file with no overwriting of previous files.

  3. A response code of 201 - Created is returned. The response body should look like the following:

    {
        "sessionId": "79276c31-c58c-4e79-ae5e-fed1a25ebca1",
        "reason": "JobStarted",
        "jobGraph": {
            "vertices": [
                21,
                22
            ],
            "edges": [
                {
                    "source": 21,
                    "target": 22
                }
            ]
        },
        "id": 28,
        "jobs": {
            "data": [
                {
                    "id": 21
                },
                {
                    "id": 22
                }
            ]
        }
    }
  4. Retain the jobGroupId=28 value for monitoring.

Example 4 - Dataset with parameterized bucket name

The following example contains a parameterized bucket reference, with a specified override value. Administrators and project owners can specify the default value for environment parameters, and users can specify overrides for these values at job execution time.

Endpoint

http://www.example.com:3005/v4/jobGroups

Authentication

Required

Method

POST

Request Body

{
  "wrangledDataset": {
    "id": 121
  },
  "runParameters": {
    "overrides": {
      "data": [
        {
          "key": "env.bucketName", 
          "value": "myco-dev2"
        }
      ]
    }
  }
}

In the above example, the job has been launched for recipe 121 to execute with the env.bucketName override value (myco-dev2) for the environment parameter.

For more information on these properties, see Dataprep by Trifacta: API Reference docs.

Step - Monitoring Your Job

After the job has been created and you have captured the jobGroupId value, you can use it to monitor the status of your job. For more information, see Dataprep by Trifacta: API Reference docs.
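A monitoring loop might look like the following sketch. The status endpoint shape, the status field name, and the terminal state values are assumptions based on typical jobGroup responses; confirm them against the API reference for your release.

```python
import json
import time
import urllib.request

def job_status_url(base, job_group_id):
    """Build the jobGroup URL, e.g. .../v4/jobGroups/29 (assumed shape)."""
    return f"{base}/jobGroups/{job_group_id}"

def is_terminal(status):
    """Terminal states are an assumption; verify against the API reference."""
    return status in ("Complete", "Failed", "Canceled")

def wait_for_job(job_group_id,
                 base="http://www.example.com:3005/v4",
                 token="<your-access-token>",
                 poll_seconds=10,
                 max_polls=60):
    """Poll the jobGroup until it reaches a terminal state or times out."""
    for _ in range(max_polls):
        req = urllib.request.Request(
            job_status_url(base, job_group_id),
            headers={"Authorization": f"Bearer {token}"},
        )
        with urllib.request.urlopen(req) as resp:
            status = json.load(resp).get("status", "")
        if is_terminal(status):
            return status
        time.sleep(poll_seconds)
    raise TimeoutError(f"jobGroup {job_group_id} did not finish in time")
```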

Step - Re-run Job

If you need to re-run the job as specified, you can use the wrangledDataset identifier to re-run the most recent job.

Tip

When you re-run a job, you can change any variable values as part of the request.

Example request:

Endpoint

http://www.example.com:3005/v4/jobGroups

Authentication

Required

Method

POST

Request Body

{
  "wrangledDataset": {
    "id": 123
  },
  "runParameters": {
    "overrides": {
      "data": [
        {
          "key": "region",
          "value": "central"
        }
      ]
    }
  }
}

For more information, see API Task - Develop a Flow.