API Task - Run Job on Dataset with Parameters

Warning

API access is migrating to Enterprise only. Beginning in Release 9.5, all new or renewed subscriptions have access to public API endpoints on the Enterprise product edition only. Existing customers on non-Enterprise editions will retain access to their available endpoints (Legacy) until their subscription expires. To use API endpoints after renewal, you must upgrade to the Enterprise product edition or use a reduced set of endpoints (Current). For more information on differences between product editions in the new model, please visit Pricing and Packaging.

Overview

This example task describes how to run jobs on datasets with parameters through Dataprep by Trifacta.

A dataset with parameters is a dataset in which some part of the path to the data objects has been parameterized. Because one or more parts of the path can vary, you can build a dataset with parameters to capture data that spans multiple files. For example, datasets with parameters can be used to parameterize serialized data by region, date, or other variables. For more information on datasets with parameters, see Overview of Parameterization.

Basic Task

The basic method by which you build and run a job for a dataset with parameters is very similar to the method for a non-parameterized dataset, with a few notable exceptions. The steps in this task follow the same sequence as the standard task. Where the steps overlap, links have been provided to the non-parameterized task. For more information, see API Task - Develop a Flow.

Example Datasets

This example covers four different datasets, each of which features a different type of dataset with parameters.

  • Example 1 (Datetime parameter): In this example, a directory is used to store daily orders transactions. This dataset must be defined with a Datetime parameter to capture the preceding 7 days of data. Jobs can be configured to process all of this data as it appears in the directory.

  • Example 2 (Variable): This dataset segments data into four timezones across the US. These timezones are defined using the following text values in the path: pacific, mountain, central, and eastern. In this case, you can create a parameter called region, which can be overridden at runtime to be set to one of these four values during job execution.

  • Example 3 (Pattern parameter): This example is a directory containing point-of-sale transactions captured into individual files for each region. Since each region is defined by a numeric value (01, 02, 03), the dataset can be defined using a pattern parameter.

  • Example 4 (Environment parameter): An environment parameter is defined by an admin and is available for every user of the project or workspace. In particular, environment parameters are useful for defining source bucket names, which may vary between environments in the same organization.

Step - Create Containing Flow

You must create the flow to host your dataset with parameters.

In the response, you must capture and retain the flow Identifier. For more information, see API Task - Develop a Flow.
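The flow-creation call can be sketched as follows. This is a minimal illustration only: the POST /v4/flows endpoint is assumed from the v4 endpoint pattern used elsewhere in this task, and the host and bearer token are placeholders. Check the API reference for the exact request shape in your release.

```python
import json
import urllib.request

API_BASE = "http://www.example.com:3005/v4"   # example host used in this task
TOKEN = "<your-access-token>"                 # placeholder credential

def create_flow(name, dry_run=True):
    """Build (and optionally send) an assumed POST /v4/flows request.

    With dry_run=True the payload is returned for inspection instead of
    being sent, so no live server is needed.
    """
    payload = {"name": name}
    if dry_run:
        return payload
    req = urllib.request.Request(
        f"{API_BASE}/flows",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {TOKEN}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        # Capture and retain the flow identifier from the response
        return json.load(resp)["id"]
```

The returned id is the flow identifier you retain for the later steps.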

Step - Create Datasets with Parameters

Note

When you import a dataset with parameters, only the first matching dataset is used for the initial file. If you want to see data from other matching files, you must collect a new sample within the Transformer page.

Example 1 - Dataset with Datetime parameter

Suppose your files are stored in the following paths:

MyFiles/1/Datetime/2018-04-06-orders.csv
MyFiles/1/Datetime/2018-04-05-orders.csv
MyFiles/1/Datetime/2018-04-04-orders.csv
MyFiles/1/Datetime/2018-04-03-orders.csv
MyFiles/1/Datetime/2018-04-02-orders.csv
MyFiles/1/Datetime/2018-04-01-orders.csv
MyFiles/1/Datetime/2018-03-31-orders.csv

When you navigate to the directory through the application, you mouse over one of these files and select Parameterize.

In the window, select the date value (e.g. YYYY-MM-DD) and then click the Datetime icon.

Datetime Parameter:

  • Format: YYYY-MM-DD

  • Date Range: Date is last 7 days.

  • Click Save.

The Datetime parameter should match all files in the directory. Import this dataset and wrangle it.
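The "last 7 days" range can be illustrated with a short sketch that checks which of the example filenames fall inside the window. This assumes one reading of the range (inclusive of the run date); the product's exact boundary handling may differ.

```python
from datetime import date, timedelta

def matches_last_7_days(path, today):
    """Check whether a YYYY-MM-DD-orders.csv filename falls within the
    preceding 7 days, inclusive of the run date (an assumption about how
    the Datetime parameter's range is evaluated)."""
    # Extract the date stamp from e.g. MyFiles/1/Datetime/2018-04-06-orders.csv
    stamp = path.rsplit("/", 1)[-1].split("-orders")[0]
    file_date = date.fromisoformat(stamp)
    return today - timedelta(days=6) <= file_date <= today
```

With a run date of 2018-04-06, all seven example files above fall inside the window, while a 2018-03-30 file would not.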

After you wrangle the dataset, return to its flow view and select the recipe. You should be able to extract the flowId and recipeId values from the URL.

For purposes of this example, here are some key values:

  • flowId: 35

  • recipeId: 127

Example 2 - Dataset with Variable

Suppose your files are stored in the following paths:

MyFiles/1/variable/census-eastern.csv
MyFiles/1/variable/census-central.csv
MyFiles/1/variable/census-mountain.csv
MyFiles/1/variable/census-pacific.csv

When you navigate to the directory through the application, you mouse over one of these files and select Parameterize.

In the window, select the region value, which could be one of the following depending on the file: eastern, central, mountain, or pacific. Click the Variable icon.

Variable Parameter:

  • Name: region

  • Default Value: Set this default to pacific.

  • Click Save.

In this case, the default value matches only one file in the directory. However, when you apply a runtime override to the region variable, you can set it to any of the four values.

Import this dataset and wrangle it.

After you wrangle the dataset, return to its flow view and select the recipe. You should be able to extract the flowId and recipeId values from the URL.

For purposes of this example, here are some key values:

  • flowId: 33

  • recipeId: 123

Example 3 - Dataset with pattern parameter

Suppose your files are stored in the following paths:

MyFiles/1/pattern/POS-r01.csv
MyFiles/1/pattern/POS-r02.csv
MyFiles/1/pattern/POS-r03.csv

When you navigate to the directory through the application, you mouse over one of these files and select Parameterize.

In the window, select the two numeric digits (e.g. 02). Click the Pattern icon.

Pattern Parameter:

  • Type: Wrangle

  • Matching regular expression: {digit}{2}

  • Click Save.

In this case, the Wrangle pattern matches any sequence of two digits in a row. In the above example, this expression matches 01, 02, and 03, which covers all of the files in the directory.
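As an illustration, the Wrangle pattern {digit}{2} is equivalent to the regular expression \d{2}. The following sketch shows that equivalent regex matching all three example files:

```python
import re

# \d{2} is the regular-expression equivalent of the Wrangle pattern {digit}{2}.
# Anchoring it inside the POS-r...csv filename shape mirrors the example paths.
pattern = re.compile(r"POS-r(\d{2})\.csv$")

files = [
    "MyFiles/1/pattern/POS-r01.csv",
    "MyFiles/1/pattern/POS-r02.csv",
    "MyFiles/1/pattern/POS-r03.csv",
]
matched = [m.group(1) for f in files if (m := pattern.search(f))]
```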

Import this dataset and wrangle it.

After you wrangle the dataset, return to its flow view and select the recipe. You should be able to extract the flowId and recipeId values from the URL.

For purposes of this example, here are some key values:

  • flowId: 32

  • recipeId: 121

Note

You have created flows for each type of dataset with parameters.

Example 4 - Dataset with parameterized bucket name

You can parameterize part or all of the bucket name in your source or target paths.

Suppose you have multiple workspaces that use different S3 buckets for sources of data. For example, your environments might look like the following:

  • Dev environment: S3 bucket myco-dev

  • Prod environment: S3 bucket myco-prod

For your datasources, you can parameterize the name of the bucket, so that if you migrate your flow between these environments, the references to datasources are updated based on the parameterized value for the bucket in the new environment.

Create environment parameter

Parameterized buckets are a good use for environment parameters. An environment parameter is a parameter that is available for use by every user in the project or workspace. In this case, the bucket name can be referenced for all datasets in the project or workspace, so turning that value into a parameter makes managing your datasources much more efficient.

You can use the following example to create an environment parameter called env.bucketName, with a value of myco-dev. This environment parameter would be created in your Dev environment:

Note

The overrideKey value, which is the name of the environment parameter, must begin with env..

Endpoint

http://www.example.com:3005/v4/environmentParameters

Authentication

Required

Method

POST

Request Body

{
  "overrideKey": "env.bucketName",
  "value": {
    "variable": {
      "value": "myco-dev"
    }
  }
}

Response

{
  "id": 1,
  "overrideKey": "env.bucketName",
  "value": {
    "variable": {
      "value": "myco-dev"
    }
  },
  "createdAt": "2021-06-24T14:15:22Z",
  "updatedAt": "2021-06-24T14:15:22Z",
  "deleted_at": "2021-06-24T14:15:22Z",
  "usageInfo": {
    "runParameters": 1
  }
}

For more information on creating environment parameters, see Dataprep by Trifacta: API Reference docs.
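Before sending the request above, the body can be assembled and validated locally. The helper below simply enforces the env. prefix noted earlier; the function name is illustrative, not part of the API.

```python
def make_env_param_payload(name, value):
    """Build the request body for POST /v4/environmentParameters.

    The overrideKey (the environment parameter's name) must begin
    with 'env.', as noted in this task.
    """
    if not name.startswith("env."):
        raise ValueError("environment parameter names must begin with 'env.'")
    return {"overrideKey": name, "value": {"variable": {"value": value}}}
```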

Create dataset with parameterized bucket name

The following example creates an imported dataset with two parameters:

  • myPath (type: path; not an environment parameter): The parameterized part of the path. The static value is /. The default value is /dummy. In this case, for the job run, the value is overridden with /dummy2.

  • env.bucketName (type: bucket; environment parameter): The parameterized part of the bucket path. The static value is myco-. In this case, for the job run, the value dev is inserted after the fifth character in the variable.

Endpoint

http://www.example.com:3005/v4/importedDatasets

Authentication

Required

Method

POST

Request Body

{
  "name": "Dummy Dataset",
  "uri": "/path",
  "description": "My S3 parameterized dataset",     
  "type": "S3",     
  "isDynamic": true,
  "runParameters": [
    {
      "type": "path",
      "overrideKey": "myPath",
      "insertionIndices": [
        {
          "index": 1,
          "order": 0
        }
      ],
      "value": {
        "variable": {
          "value": "dummy2"
        }
      }
    },
    {
      "type": "bucket",
      "overrideKey": "env.bucketName",
      "insertionIndices": [
        {
          "index": 5,
          "order": 0
        }
      ],
      "value": {
        "variable": {
          "value": "dev"
        }
      }
    }
  ],
  "dynamicBucket": "myco-",
  "dynamicPath": "/"
}

Response

{
  "visible": true,
  "numFlows": 0,
  "path": "/dummy",
  "bucket": "",
  "type": "s3",
  "isDynamic": true,
  "runParameters": [
    {
      "type": "path",
      "overrideKey": "myPath",
      "insertionIndices": [
        {
          "index": 1,
          "order": 0
        }
      ],
      "value": {
        "variable": {
          "value": "dummy2"
        }
      },
      "isEnvironmentParameter": false
    },
    {
      "type": "bucket",
      "overrideKey": "env.bucketName",
      "insertionIndices": [
        {
          "index": 5,
          "order": 0
        }
      ],
      "value": {
        "variable": {
          "value": "dev"
        }
      },
      "isEnvironmentParameter": true
    }
  ],
  "dynamicBucket": "myco-",
  "dynamicPath": "/"
}

For more information, see Dataprep by Trifacta: API Reference docs.
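One way to read the insertionIndices values in this example: the parameter value is inserted into the static string at the given character index. The helper below illustrates that reading against the two parameters above; it is an assumption based on this example, not a documented algorithm.

```python
def apply_insertion(static, value, index):
    """Insert a parameter value into a static string at a character index.

    This mirrors how the example's insertionIndices appear to compose the
    final bucket and path values (an illustrative reading, not an API).
    """
    return static[:index] + value + static[index:]
```

Under this reading, inserting "dev" at index 5 of the static bucket "myco-" yields "myco-dev", and inserting "dummy2" at index 1 of the static path "/" yields "/dummy2".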

Step - Wrangle Data

After you have created your dataset with parameters, you can wrangle it through the application. For more information, see Transformer Page.

Step - Run Job

Below, you can review the API calls to run a job for each type of dataset with parameters, including relevant information about overrides.

Example 1 - Dataset with Datetime parameter

Note

You cannot apply overrides to these types of datasets with parameters. The following request contains overrides for write settings but no overrides for parameters.

  1. Endpoint

    http://www.example.com:3005/v4/jobGroups

    Authentication

    Required

    Method

    POST

    Request Body

    {
      "wrangledDataset": {
        "id": 127
      },
      "overrides": {
        "execution": "photon",
        "profiler": true,
        "writesettings": [
          {
            "path": "MyFiles/queryResults/joe@example.com/2018-04-03-orders.csv",
            "action": "create",
            "format": "csv",
            "compression": "none",
            "header": false,
            "asSingleFile": false
          }
        ]
      },
      "runParameters": {}
    }
  2. In the above example, the job has been launched for recipe 127 to execute on the Trifacta Photon running environment with profiling enabled.

    1. Output format is CSV to the designated path. For more information on these properties, see Dataprep by Trifacta: API Reference docs.

    2. Output is written as a new file with no overwriting of previous files.

  3. A response code of 201 - Created is returned. The response body should look like the following:

    {
        "sessionId": "79276c31-c58c-4e79-ae5e-fed1a25ebca1",
        "reason": "JobStarted",
        "jobGraph": {
            "vertices": [
                21,
                22
            ],
            "edges": [
                {
                    "source": 21,
                    "target": 22
                }
            ]
        },
        "id": 29,
        "jobs": {
            "data": [
                {
                    "id": 21
                },
                {
                    "id": 22
                }
            ]
        }
    }
  4. Retain the jobGroupId=29 value for monitoring.
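The job-run request bodies in these examples share one shape, differing only in the recipe id, write path, and parameter overrides. A sketch of a builder for that body follows; the function name is illustrative, while the field names mirror the request bodies shown in this task.

```python
def make_job_request(recipe_id, out_path, overrides=None):
    """Assemble a POST /v4/jobGroups body like the examples in this task."""
    body = {
        "wrangledDataset": {"id": recipe_id},
        "overrides": {
            "execution": "photon",
            "profiler": True,
            "writesettings": [
                {
                    "path": out_path,
                    "action": "create",
                    "format": "csv",
                    "compression": "none",
                    "header": False,
                    "asSingleFile": False,
                }
            ],
        },
        # Left empty for Datetime and pattern parameters, which do not
        # accept parameter overrides.
        "runParameters": {},
    }
    if overrides:
        body["runParameters"] = {
            "overrides": {
                "data": [{"key": k, "value": v} for k, v in overrides.items()]
            }
        }
    return body
```

For example, make_job_request(123, "out.csv", {"region": "central"}) produces the variable-override body used in Example 2 below.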

Example 2 - Dataset with Variable

In the following example, the region variable has been overridden with the value central to execute the job on census-central.csv:

  1. Endpoint

    http://www.example.com:3005/v4/jobGroups

    Authentication

    Required

    Method

    POST

    Request Body

    {
      "wrangledDataset": {
        "id": 123
      },
      "overrides": {
        "execution": "photon",
        "profiler": true,
        "writesettings": [
          {
            "path": "MyFiles/queryResults/joe@example.com/region-central.csv",
            "action": "create",
            "format": "csv",
            "compression": "none",
            "header": false,
            "asSingleFile": false
          }
        ]
      },
      "runParameters": {
        "overrides": {
          "data": [
            {
              "key": "region",
              "value": "central"
            }
          ]
        }
      }
    }
  2. In the above example, the job has been launched for recipe 123 to execute on the Trifacta Photon running environment with profiling enabled.

    1. Output format is CSV to the designated path. For more information on these properties, see Dataprep by Trifacta: API Reference docs.

    2. Output is written as a new file with no overwriting of previous files.

  3. A response code of 201 - Created is returned. The response body should look like the following:

    {
        "sessionId": "79276c31-c58c-4e79-ae5e-fed1a25ebca1",
        "reason": "JobStarted",
        "jobGraph": {
            "vertices": [
                21,
                22
            ],
            "edges": [
                {
                    "source": 21,
                    "target": 22
                }
            ]
        },
        "id": 27,
        "jobs": {
            "data": [
                {
                    "id": 21
                },
                {
                    "id": 22
                }
            ]
        }
    }
  4. Retain the jobGroupId=27 value for monitoring.

Example 3 - Dataset with pattern parameter

Note

You cannot apply overrides to these types of datasets with parameters. The following request contains overrides for write settings but no overrides for parameters.

  1. Endpoint

    http://www.example.com:3005/v4/jobGroups

    Authentication

    Required

    Method

    POST

    Request Body

    {
      "wrangledDataset": {
        "id": 121
      },
      "overrides": {
        "execution": "photon",
        "profiler": false,
        "writesettings": [
          {
            "path": "hdfs://hadoop:50070/trifacta/queryResults/admin@example.com/POS-r02.txt",
            "action": "create",
            "format": "csv",
            "compression": "none",
            "header": false,
            "asSingleFile": false
          }
        ]
      },
      "runParameters": {}
    }
  2. In the above example, the job has been launched for recipe 121 to execute on the Trifacta Photon running environment with profiling disabled.

    1. Output format is CSV to the designated path. For more information on these properties, see Dataprep by Trifacta: API Reference docs.

    2. Output is written as a new file with no overwriting of previous files.

  3. A response code of 201 - Created is returned. The response body should look like the following:

    {
        "sessionId": "79276c31-c58c-4e79-ae5e-fed1a25ebca1",
        "reason": "JobStarted",
        "jobGraph": {
            "vertices": [
                21,
                22
            ],
            "edges": [
                {
                    "source": 21,
                    "target": 22
                }
            ]
        },
        "id": 28,
        "jobs": {
            "data": [
                {
                    "id": 21
                },
                {
                    "id": 22
                }
            ]
        }
    }
  4. Retain the jobGroupId=28 value for monitoring.

Example 4 - Dataset with parameterized bucket name

The following example contains a parameterized bucket reference, with a specified override value. Administrators and project owners can specify the default value for environment parameters, and users can specify overrides for these values at job execution time.

Endpoint

http://www.example.com:3005/v4/jobGroups

Authentication

Required

Method

POST

Request Body

{
  "wrangledDataset": {
    "id": 121
  },
  "runParameters": {
    "overrides": {
      "data": [
        {
          "key": "env.bucketName", 
          "value": "myco-dev2"
        }
      ]
    }
  }
}

In the above example, the job has been launched for recipe 121 to execute with the env.bucketName override value (myco-dev2) for the environment parameter.

For more information on these properties, see Dataprep by Trifacta: API Reference docs.

Step - Monitoring Your Job

After the job has been created and you have captured the jobGroupId value, you can use it to monitor the status of your job. For more information, see Dataprep by Trifacta: API Reference docs.
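A monitoring loop might look like the following sketch. The status endpoint shape, the status field name, and the terminal state values are assumptions based on typical jobGroup responses; confirm them against the API reference for your release.

```python
import json
import time
import urllib.request

def job_status_url(base, job_group_id):
    """Build the jobGroup URL, e.g. .../v4/jobGroups/29 (assumed shape)."""
    return f"{base}/jobGroups/{job_group_id}"

def is_terminal(status):
    """Terminal states are an assumption; verify against the API reference."""
    return status in ("Complete", "Failed", "Canceled")

def wait_for_job(job_group_id,
                 base="http://www.example.com:3005/v4",
                 token="<your-access-token>",
                 poll_seconds=10,
                 max_polls=60):
    """Poll the jobGroup until it reaches a terminal state or times out."""
    for _ in range(max_polls):
        req = urllib.request.Request(
            job_status_url(base, job_group_id),
            headers={"Authorization": f"Bearer {token}"},
        )
        with urllib.request.urlopen(req) as resp:
            status = json.load(resp).get("status", "")
        if is_terminal(status):
            return status
        time.sleep(poll_seconds)
    raise TimeoutError(f"jobGroup {job_group_id} did not finish in time")
```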

Step - Re-run Job

If you need to re-run the job as specified, you can use the wrangledDataset identifier to re-run the most recent job.

Tip

When you re-run a job, you can change any variable values as part of the request.

Example request:

Endpoint

http://www.example.com:3005/v4/jobGroups

Authentication

Required

Method

POST

Request Body

{
  "wrangledDataset": {
    "id": 123
  },
  "runParameters": {
    "overrides": {
      "data": [
        {
          "key": "region",
          "value": "central"
        }
      ]
    }
  }
}

For more information, see API Task - Develop a Flow.