API Task - Swap Datasets

Overview

After you have created a flow, imported a dataset, and created a recipe for that dataset, you may need to swap in a different dataset and run the recipe against that one. This task steps through that process via the APIs.

Note

If you are processing multiple parallel datasources in a single job, you should create a dataset with parameters and then run the job. For more information, see API Task - Run Job on Dataset with Parameters.

This task utilizes the following methods:

  1. Create an imported dataset. After the new file has been added to the backend datastore, you can import it into Designer Cloud Powered by Trifacta Enterprise Edition as an imported dataset.

  2. Swap dataset. Using the ID of the imported dataset you created, you can now assign the dataset to the recipe in your flow.

  3. Run a job. Run the job against the dataset.

  4. Monitor progress. Monitor the progress of the job until it is complete.

Example Datasets

In this example, you are wrangling data from orders placed in different regions on a quarterly basis. When a new file drops, you want to be able to swap out the current dataset that is assigned to the recipe and swap in the new one. Then, run the job.

Example Files:

The following files are stored in HDFS:

Path and Filename                               Description
hdfs:///user/orders/MyCo-orders-west-Q1.txt     Orders from West region for Q1
hdfs:///user/orders/MyCo-orders-west-Q2.txt     Orders from West region for Q2
hdfs:///user/orders/MyCo-orders-north-Q1.txt    Orders from North region for Q1
hdfs:///user/orders/MyCo-orders-north-Q2.txt    Orders from North region for Q2
hdfs:///user/orders/MyCo-orders-east-Q1.txt     Orders from East region for Q1
hdfs:///user/orders/MyCo-orders-east-Q2.txt     Orders from East region for Q2
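
Because the example files follow one naming pattern, a scheduling script can derive the path of the next file from the region and quarter. The following is a minimal sketch; the function name order_file_path and the constant ORDERS_DIR are hypothetical names introduced only for illustration.

```python
ORDERS_DIR = "hdfs:///user/orders"  # directory used by the example files


def order_file_path(region: str, quarter: int) -> str:
    """Build the path for a region/quarter file: MyCo-orders-<region>-Q<n>.txt."""
    return f"{ORDERS_DIR}/MyCo-orders-{region}-Q{quarter}.txt"


print(order_file_path("west", 2))
```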

Assumptions

You have already created a flow, which contains the following imported dataset and recipe:

Note

When an imported dataset is created via API, it is always imported as an unstructured dataset. Any recipe that references this dataset should contain initial parsing steps required to structure the data.

Tip

Through the UI, you can import one of your datasets as unstructured. Create a recipe for this dataset and then edit it. In the Recipe panel, you should be able to see the structuring steps. Back in Flow View, you can chain your structural recipe off of this one. Dataset swapping should happen on the first recipe.

Object Type                 Name                       Id
flow                        MyCo-Orders-Quarter        2
Imported Dataset            MyCo-orders-west-Q1.txt    8
Recipe (wrangledDataset)    n/a                        9
Job                         n/a                        3

Base URL:

For purposes of this example, the base URL for the platform is the following:

http://www.example.com:3005

Step - Import Dataset

Note

You cannot add datasets to the flow through the flows endpoint. Moving pre-existing datasets into a flow is not supported in this release. Create or locate the flow first and then when you create the datasets, associate them with the flow at the time of creation.

The following steps describe how to create an imported dataset that you later swap into the flow that has already been created (flowId=2).

Steps:

  1. To create an imported dataset, you must acquire the following information about the source.

    1. path

    2. type

    3. name

    4. description

    5. bucket (if a file stored on S3)

  2. In this example, the file you are importing is MyCo-orders-west-Q2.txt. Because the files are similar in nature and are stored in the same directory, you can reuse the source information from the imported dataset that is already part of the flow. Execute the following:

    Endpoint

    http://www.example.com:3005/v4/importedDatasets

    Authentication

    Required

    Method

    POST

    Request Body

    {
      "path": "hdfs:///user/orders/MyCo-orders-west-Q2.txt",
      "name": "MyCo-orders-west-Q2.txt",
      "description": "MyCo-orders-west-Q2"
    }
    
  3. The response should be a 201 - Created status code with a body similar to the following:

    {
        "id": 12,
        "size": "281032",
        "path": "hdfs:///user/orders/MyCo-orders-west-Q2.txt",
        "dynamicPath": null,
        "workspaceId": 1,
        "isSchematized": false,
        "isDynamic": false,
        "disableTypeInference": false,
        "createdAt": "2018-10-29T23:15:01.831Z",
        "updatedAt": "2018-10-29T23:15:01.889Z",
        "parsingRecipe": {
            "id": 11
        },
        "runParameters": [],
        "name": "MyCo-orders-west-Q2.txt",
        "description": "MyCo-orders-west-Q2",
        "creator": {
            "id": 1
        },
        "updater": {
            "id": 1
        },
        "connection": null
    }
  4. You must retain the id value so you can reference it when you swap the dataset into the recipe.

  5. See https://api.trifacta.com/ee/9.7/index.html#operation/createImportedDataset

Note

You have imported a dataset that is unstructured and is not associated with any flow.
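
The import request above could be scripted roughly as follows. This is a sketch using only the Python standard library; the helper names build_import_request and extract_dataset_id are hypothetical, and the Authorization value is a placeholder because this task does not specify an authentication scheme. The request is built but not sent.

```python
import json
import urllib.request

BASE_URL = "http://www.example.com:3005"  # example base URL used in this task


def build_import_request(path, name, description, auth_header):
    """Build (but do not send) the POST request that creates an imported dataset."""
    body = {"path": path, "name": name, "description": description}
    return urllib.request.Request(
        url=BASE_URL + "/v4/importedDatasets",
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Assumption: placeholder; use the scheme your deployment requires.
            "Authorization": auth_header,
        },
    )


def extract_dataset_id(response_text):
    """Return the id of the new imported dataset from the 201 response body."""
    return json.loads(response_text)["id"]


request = build_import_request(
    path="hdfs:///user/orders/MyCo-orders-west-Q2.txt",
    name="MyCo-orders-west-Q2.txt",
    description="MyCo-orders-west-Q2",
    auth_header="<your-authorization-header>",
)
print(request.get_method(), request.full_url)
# To send it, pass the request to urllib.request.urlopen() and hand the
# response body to extract_dataset_id().
```

Keeping request construction separate from sending makes it possible to inspect the URL and body before anything reaches the platform.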

Step - Swap Dataset from Recipe

The next step is to swap the primary input dataset for the recipe to point at the newly imported dataset. This step automatically adds the imported dataset to the flow and drops the previous imported dataset from the flow.

  1. Use the following to swap the primary input dataset for the recipe:

    Endpoint

    http://www.example.com:3005/v4/wrangledDatasets/9/primaryInputDataset

    Authentication

    Required

    Method

    PUT

    Request Body

    {
      "importedDataset": {
        "id": 12
      }
    }
  2. The response should be a 200 - OK status code with a body similar to the following:

    {
        "id": 9,
        "wrangled": true,
        "createdAt": "2019-03-03T17:58:53.979Z",
        "updatedAt": "2019-03-03T18:01:11.310Z",
        "recipe": {
            "id": 9,
            "name": "POS-r01",
            "description": null,
            "active": true,
            "nextPortId": 1,
            "createdAt": "2019-03-03T17:58:53.965Z",
            "updatedAt": "2019-03-03T18:01:11.308Z",
            "currentEdit": {
                "id": 8
            },
            "redoLeafEdit": {
                "id": 7
            },
            "creator": {
                "id": 1
            },
            "updater": {
                "id": 1
            }
        },
        "referenceInfo": null,
        "activeSample": {
            "id": 7
        },
        "creator": {
            "id": 1
        },
        "updater": {
            "id": 1
        },
        "referencedFlowNode": null,
        "flow": {
            "id": 2
        }
    }
  3. The new imported dataset is now the primary input for the recipe, and the old imported dataset has been removed from the flow.

For more information, see https://api.trifacta.com/ee/9.7/index.html#operation/updateInputDataset
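
As an illustration, the swap request could be built like this. The sketch assumes the recipe Id (9) and imported dataset Id (12) from this example; the helper name build_swap_request is hypothetical, and the Authorization value is a placeholder for whatever scheme your deployment uses.

```python
import json
import urllib.request

BASE_URL = "http://www.example.com:3005"  # example base URL used in this task


def build_swap_request(recipe_id, imported_dataset_id, auth_header):
    """Build (but do not send) the PUT request that swaps the recipe's primary input."""
    body = {"importedDataset": {"id": imported_dataset_id}}
    return urllib.request.Request(
        url=f"{BASE_URL}/v4/wrangledDatasets/{recipe_id}/primaryInputDataset",
        data=json.dumps(body).encode("utf-8"),
        method="PUT",
        headers={
            "Content-Type": "application/json",
            "Authorization": auth_header,  # assumption: placeholder value
        },
    )


request = build_swap_request(
    recipe_id=9,
    imported_dataset_id=12,
    auth_header="<your-authorization-header>",
)
print(request.get_method(), request.full_url)
```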

Step - Rerun Job

To execute a job on this recipe, you can re-run any job that was executed on the old imported dataset, because you reference the job by jobId and wrangledDataset (recipe) Id rather than by the imported dataset.

Endpoint

http://www.example.com:3005/v4/jobGroups

Authentication

Required

Method

POST

Request Body

{
  "wrangledDataset": {
    "id": 9
  }
}

The job is re-run as it was previously specified.

If you need to modify any job parameters, you must create a new job definition.
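
The job request above might be assembled as in the following sketch. The helper name build_run_job_request is hypothetical and the Authorization value is a placeholder; the body contains only the recipe Id, matching the request shown in this step.

```python
import json
import urllib.request

BASE_URL = "http://www.example.com:3005"  # example base URL used in this task


def build_run_job_request(recipe_id, auth_header):
    """Build (but do not send) the POST request that runs a job for the recipe."""
    body = {"wrangledDataset": {"id": recipe_id}}
    return urllib.request.Request(
        url=BASE_URL + "/v4/jobGroups",
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": auth_header,  # assumption: placeholder value
        },
    )


request = build_run_job_request(
    recipe_id=9,
    auth_header="<your-authorization-header>",
)
print(request.get_method(), request.full_url)
```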

Step - Monitor Your Job

After the job has been queued, you can track it to completion. See API Task - Develop a Flow.
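
Tracking a job to completion usually means polling its status until it reaches a terminal state. The sketch below shows one generic way to do that. It makes several assumptions: get_status is a hypothetical callable that you supply (for example, one that wraps the status request described in API Task - Develop a Flow), and the status names in TERMINAL_STATUSES are assumed values that you should check against your platform's responses.

```python
import time

# Assumption: status names are illustrative; confirm them for your release.
TERMINAL_STATUSES = {"Complete", "Failed", "Canceled"}


def wait_for_job(get_status, poll_seconds=5.0, max_polls=120, sleep=time.sleep):
    """Call get_status() until it returns a terminal status or polls run out."""
    for _ in range(max_polls):
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        sleep(poll_seconds)  # wait before asking again
    raise TimeoutError(f"job did not finish after {max_polls} polls")


# Simulated run: a fake status source stands in for the real API call,
# and a no-op sleep keeps the demonstration instant.
fake_statuses = iter(["Pending", "InProgress", "Complete"])
final_status = wait_for_job(lambda: next(fake_statuses), sleep=lambda seconds: None)
print(final_status)
```

Passing the status source and the sleep function as parameters keeps the loop independent of any particular HTTP client and easy to exercise without a running platform.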

Step - Schedule Your Job

When you are satisfied with how your flow is working, you can set up periodic schedules using a third-party tool to execute the job on a regular basis.

The tool must call the endpoints described above to swap in the new dataset and run the job.
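
A scheduled script could chain the three calls in order, as in the sketch below. The function swap_and_run and its send parameter are hypothetical: send(method, path, body) stands for whatever authenticated HTTP helper your tool provides and is assumed to return the parsed JSON response. The demonstration uses a fake send that records calls instead of contacting the platform.

```python
def swap_and_run(send, path, name, description, recipe_id):
    """Import a new file, swap it into the recipe, then run the job."""
    # 1. Create the imported dataset and keep its id.
    imported = send(
        "POST",
        "/v4/importedDatasets",
        {"path": path, "name": name, "description": description},
    )
    # 2. Make the new dataset the recipe's primary input.
    send(
        "PUT",
        f"/v4/wrangledDatasets/{recipe_id}/primaryInputDataset",
        {"importedDataset": {"id": imported["id"]}},
    )
    # 3. Run the job for the recipe and return the response for monitoring.
    return send("POST", "/v4/jobGroups", {"wrangledDataset": {"id": recipe_id}})


calls = []


def fake_send(method, path, body):
    """Record the call and return a canned response (no network access)."""
    calls.append((method, path, body))
    return {"id": 12} if path == "/v4/importedDatasets" else {}


swap_and_run(
    fake_send,
    path="hdfs:///user/orders/MyCo-orders-west-Q2.txt",
    name="MyCo-orders-west-Q2.txt",
    description="MyCo-orders-west-Q2",
    recipe_id=9,
)
for method, path, _ in calls:
    print(method, path)
```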