API Task - Manage Outputs

Warning

API access is migrating to Enterprise only. Beginning in Release 9.5, all new or renewed subscriptions have access to public API endpoints on the Enterprise product edition only. Existing customers on non-Enterprise editions will retain access to their available endpoints (Legacy) until their subscription expires. To use API endpoints after renewal, you must upgrade to the Enterprise product edition or use a reduced set of endpoints (Current). For more information on differences between product editions in the new model, please visit Pricing and Packaging.

Overview

Through the APIs, you can separately manage the outputs associated with an individual recipe. This task describes how to create output objects, which are associated with your recipe, and how to publish those outputs to different datastores in varying formats. You can continue to modify the output objects and their related write settings and publications independently of managing the wrangling process. Whenever you need new results, you can reference the wrangled dataset with which your outputs have been associated, and the job is executed and published in the appropriate manner to your targets.

Relevant terms:

Term

Description

outputObjects

An outputObject is a definition of one or more types of outputs and how they are generated. It must be associated with a recipe.

Note

An outputObject must be created for a recipe before you can run a job on it. One and only one outputObject can be associated with a recipe.

writeSettings

A writeSettings object defines file-based outputs within an outputObject. Settings include path, format, compression, and delimiters.

publications

A publications object is used to specify a table-based output and is associated with an outputObject. Settings include the connection to use, path, table type, and write action to apply.

Note

If you need to make changes for purposes of a specific job run, you can add overrides to the request for the job. These overrides apply only for the current job. For more information, see Dataprep by Trifacta: API Reference docs

Basic Task

Here's the basic task described in this section.

  1. Get the internal identifier for the recipe for which you are building outputs.

  2. Create the outputObject for the recipe.

  3. Create a writeSettings object and associate it with the outputObject.

  4. Run a test job, if desired.

  5. For any publication, get the internal identifier for the connection to use.

  6. Create a publication object and associate it with the outputObject.

  7. Run your job.

Variations

If you are generating exclusively file-based or relational outputs, you can vary this task in the following ways:

For file-based outputs:

  1. Get the internal identifier for the recipe for which you are building outputs.

  2. Create the outputObject for the recipe.

  3. Create a writeSettings object and associate it with the outputObject.

  4. Run your job.

For relational outputs:

  1. Get the internal identifier for the recipe for which you are building outputs.

  2. Create the outputObject for the recipe.

  3. For any publication, get the internal identifier for the connection to use.

  4. Create a publication object and associate it with the outputObject.

  5. Run your job.
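
The basic task and its two variations can be summarized as an ordered list of API calls. The following is a minimal sketch, not part of the product: the function name `plan_requests` and its flags are hypothetical, and only the methods and paths come from the steps in this task. The optional test job is omitted.

```python
def plan_requests(file_based: bool = True, relational: bool = True) -> list[tuple[str, str]]:
    """Return the ordered (method, path) calls for the chosen output types."""
    steps = [
        ("GET", "/v4/wrangledDatasets"),  # get the recipe identifier
        ("POST", "/v4/outputObjects"),    # create the outputObject
    ]
    if file_based:
        steps.append(("POST", "/v4/writeSettings"))  # file-based output
    if relational:
        steps.append(("GET", "/v4/connections"))     # get the connection identifier
        steps.append(("POST", "/v4/publications"))   # table-based output
    steps.append(("POST", "/v4/jobGroups"))          # run the job
    return steps


for method, path in plan_requests(file_based=True, relational=False):
    print(method, path)
```

With both flags set to their defaults, the list matches the basic task; setting one flag to False yields the corresponding variation.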

Step - Get Recipe ID

To begin, you need the internal identifier for the recipe.

Note

In the APIs, a recipe is identified by its internal name: wrangledDataset.

Request:

Endpoint

http://www.wrangle-dev.example.com:3005/v4/wrangledDatasets

Authentication

Required

Method

GET

Request Body

None.

Response:

Status Code

200 - OK

Response Body

{
    "data": [
        {
            "id": 11,
            "wrangled": true,
            "createdAt": "2018-11-12T23:06:36.473Z",
            "updatedAt": "2018-11-12T23:06:36.539Z",
            "recipe": {
                "id": 10
            },
            "name": "POS-r01",
            "description": null,
            "referenceInfo": null,
            "activeSample": {
                "id": 11
            },
            "creator": {
                "id": 1
            },
            "updater": {
                "id": 1
            },
            "flow": {
                "id": 4
            }
        },
        {
            "id": 1,
            "wrangled": true,
            "createdAt": "2018-11-12T23:19:57.650Z",
            "updatedAt": "2018-11-12T23:20:47.297Z",
            "recipe": {
                "id": 19
            },
            "name": "member_info",
            "description": null,
            "referenceInfo": null,
            "activeSample": {
                "id": 20
            },
            "creator": {
                "id": 1
            },
            "updater": {
                "id": 1
            },
            "flow": {
                "id": 6
            }
        }
    ]
}

cURL example:

curl -X GET \
  http://www.wrangle-dev.example.com:3005/v4/wrangledDatasets \
  -H 'authorization: Basic <auth_token>' \
  -H 'cache-control: no-cache'

Relevant terms:

Term

Description

URL

URL and method to execute.

authorization

Authorization token to pass to the platform. Basic authorization works.

Note

This token must be passed with each request to the platform.

cache-control

Cache control setting.

content-type

HTTP content type to send. These applications use application/json.

Tip

In the above, let's assume that the recipe identifier of interest is wrangledDataset=11. This means that the flow where it is hosted is flow.id=4. Retain this information for later.

For more information, see Dataprep by Trifacta: API Reference docs
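
If you know the recipe by name, you can pick its identifiers out of the response programmatically. The following is a minimal sketch that assumes the response shape shown above; the function name `find_recipe` is hypothetical, and the sample string is an abbreviated copy of the example response.

```python
import json


def find_recipe(response_body: str, name: str) -> tuple[int, int]:
    """Return (wrangledDataset id, flow id) for the recipe with the given name."""
    for item in json.loads(response_body)["data"]:
        if item["name"] == name:
            return item["id"], item["flow"]["id"]
    raise LookupError(f"no wrangled dataset named {name!r}")


# Abbreviated copy of the example response above.
sample = '{"data": [{"id": 11, "name": "POS-r01", "flow": {"id": 4}}]}'
print(find_recipe(sample, "POS-r01"))  # (11, 4)
```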

Step - Create outputObject

Create the outputObject and associate it with the recipe identifier. In the following request, the wrangledDataset identifier that you retrieved in the previous call is applied as the flowNode.id value.

The following example includes an embedded writeSettings object, which generates an Avro file output. You can remove this embedded object if desired, but you must create a writeSettings object before you can generate an output.

Request:

Endpoint

http://www.wrangle-dev.example.com:3005/v4/outputObjects

Authentication

Required

Method

POST

Request Body

{
    "execution": "photon",
    "profiler": true,
    "isAdhoc": true,
    "writeSettings": {
        "data": [
            {
                "delim": ",",
                "path": "hdfs://hadoop:50070/trifacta/queryResults/admin@example.com/POS_01.avro",
                "action": "create",
                "format": "avro",
                "compression": "none",
                "header": false,
                "asSingleFile": false,
                "prefix": null,
                "suffix": "_increment",
                "includeMismatches": true,
                "hasQuotes": false
            }
        ]
    },
    "flowNode": {
        "id": 11
    }
}

Response:

Status Code

201 - Created

Response Body

{
    "id": 4,
    "execution": "photon",
    "profiler": true,
    "isAdhoc": true,
    "updatedAt": "2018-11-13T00:20:49.258Z",
    "createdAt": "2018-11-13T00:20:49.258Z",
    "creator": {
        "id": 1
    },
    "updater": {
        "id": 1
    },
    "flowNode": {
        "id": 11
    }
}

cURL example:

curl -X POST \
  http://www.wrangle-dev.example.com:3005/v4/outputObjects \
  -H 'authorization: Basic <auth_token>' \
  -H 'cache-control: no-cache' \
  -H 'content-type: application/json' \
  -d '{
    "execution": "photon",
    "profiler": true,
    "isAdhoc": true,
    "writeSettings": {
        "data": [
            {
                "delim": ",",
                "path": "hdfs://hadoop:50070/trifacta/queryResults/admin@example.com/POS_01.avro",
                "action": "create",
                "format": "avro",
                "compression": "none",
                "header": false,
                "asSingleFile": false,
                "prefix": null,
                "suffix": "_increment",
                "includeMismatches": true,
                "hasQuotes": false
            }
        ]
    },
    "flowNode": {
        "id": 11
    }
}'

Tip

You've created an outputObject (id=4) and an embedded writeSettings object and have associated them with the appropriate recipe flowNodeId=11. You can now run a job for this recipe generating the specified output.

For more information, see Dataprep by Trifacta: API Reference docs
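
The request body above can also be assembled in code, which makes the embedded writeSettings object optional. This is a minimal sketch under the assumption that the field names in the example request are the ones you need; the function name `build_output_object` and its defaults are hypothetical.

```python
import json


def build_output_object(flow_node_id, write_settings=None, execution="photon", profiler=True):
    """Build an outputObject request body like the example above."""
    payload = {
        "execution": execution,
        "profiler": profiler,
        "isAdhoc": True,
        "flowNode": {"id": flow_node_id},  # the wrangledDataset id of the recipe
    }
    if write_settings:
        # Embedded writeSettings are optional at creation time.
        payload["writeSettings"] = {"data": list(write_settings)}
    return payload


avro_settings = {
    "path": "hdfs://hadoop:50070/trifacta/queryResults/admin@example.com/POS_01.avro",
    "action": "create",
    "format": "avro",
    "compression": "none",
}
print(json.dumps(build_output_object(11, [avro_settings]), indent=4))
```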

Step - Run a Test Job

Now that outputs have been defined for the recipe, you can execute a job on the specified recipe (flowNodeId=11), which is passed as the wrangledDataset id in the request:

Request:

Endpoint

http://www.wrangle-dev.example.com:3005/v4/jobGroups

Authentication

Required

Method

POST

Request Body

{
  "wrangledDataset": {
    "id": 11
  }
}

Response:

Status Code

201 - Created

Response Body

{
    "sessionId": "79276c31-c58c-4e79-ae5e-fed1a25ebca1",
    "reason": "JobStarted",
    "jobGraph": {
        "vertices": [
            21,
            22
        ],
        "edges": [
            {
                "source": 21,
                "target": 22
            }
        ]
    },
    "id": 2,
    "jobs": {
        "data": [
            {
                "id": 21
            },
            {
                "id": 22
            }
        ]
    }
}

Note

To re-run the job against its currently specified outputs, writeSettings, and publications, you only need the recipe ID. If you need to make changes for purposes of a specific job run, you can add overrides to the request for the job. These overrides apply only for the current job. For more information, see Dataprep by Trifacta: API Reference docs
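
Because only the recipe ID is required, the job request is small enough to build with the Python standard library. The following sketch constructs the request without sending it. The host is the example host used on this page, and the function name `job_group_request` is hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://www.wrangle-dev.example.com:3005"  # example host from this page


def job_group_request(recipe_id: int, auth_token: str) -> urllib.request.Request:
    """Build (but do not send) the POST request that runs a job for a recipe."""
    body = json.dumps({"wrangledDataset": {"id": recipe_id}}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/v4/jobGroups",
        data=body,
        method="POST",
        headers={
            "authorization": f"Basic {auth_token}",
            "content-type": "application/json",
        },
    )


req = job_group_request(11, "<auth_token>")
print(req.get_method(), req.full_url)
# Passing req to urllib.request.urlopen() would submit the job; it is not called here.
```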

To track the status of the job:

  • You can monitor the progress through the application.

  • You can monitor progress through the status field by querying the specific job. For more information, see Dataprep by Trifacta: API Reference docs
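
Monitoring through the status field usually means polling until the job reaches a final state. The following is a minimal sketch of such a loop. It is an illustration only: the status names in `TERMINAL` are assumptions that you should confirm against the API reference, and `fetch_status` is a hypothetical callable that you supply to query the job and return its status string.

```python
import time
from typing import Callable

# Assumed terminal status names; confirm them against the API reference.
TERMINAL = {"Complete", "Failed", "Canceled"}


def wait_for_job(fetch_status: Callable[[], str], interval: float = 5.0, max_polls: int = 120) -> str:
    """Call fetch_status until it returns a terminal status, then return that status."""
    for _ in range(max_polls):
        status = fetch_status()
        if status in TERMINAL:
            return status
        time.sleep(interval)  # wait before asking again
    raise TimeoutError("job did not reach a terminal status")


# Usage with canned statuses in place of real API calls.
canned = iter(["Pending", "InProgress", "Complete"])
print(wait_for_job(lambda: next(canned), interval=0))
```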

Tip

You've run a job, generating one output in Avro format.

Step - Create writeSettings Object

Suppose you want to create another file-based output for this outputObject. You can create a second writeSettings object, which publishes the results of the job run on the recipe to the specified location.

The following example creates settings for generating a Parquet-based output.

Request:

Endpoint

http://www.wrangle-dev.example.com:3005/v4/writeSettings/

Authentication

Required

Method

POST

Request Body

{
    "delim": ",",
    "path": "hdfs://hadoop:50070/trifacta/queryResults/admin@example.com/POS_r03.pqt",
    "action": "create",
    "format": "pqt",
    "compression": "none",
    "header": false,
    "asSingleFile": false,
    "prefix": null,
    "suffix": "_increment",
    "hasQuotes": false,
    "outputObjectId": 4
}

Response:

Status Code

201 - Created

Response Body

{
    "delim": ",",
    "id": 2,
    "path": "hdfs://hadoop:50070/trifacta/queryResults/admin@example.com/POS_r03.pqt",
    "action": "create",
    "format": "pqt",
    "compression": "none",
    "header": false,
    "asSingleFile": false,
    "prefix": null,
    "suffix": "_increment",
    "hasQuotes": false,
    "updatedAt": "2018-11-13T01:07:52.386Z",
    "createdAt": "2018-11-13T01:07:52.386Z",
    "creator": {
        "id": 1
    },
    "updater": {
        "id": 1
    },
    "outputObject": {
        "id": 4
    }
}

cURL example:

curl -X POST \
  http://www.wrangle-dev.example.com:3005/v4/writeSettings \
  -H 'authorization: Basic <auth_token>' \
  -H 'cache-control: no-cache' \
  -H 'content-type: application/json' \
  -d '{
    "delim": ",",
    "path": "hdfs://hadoop:50070/trifacta/queryResults/admin@example.com/POS_r03.pqt",
    "action": "create",
    "format": "pqt",
    "compression": "none",
    "header": false,
    "asSingleFile": false,
    "prefix": null,
    "suffix": "_increment",
    "hasQuotes": false,
    "outputObject": {
      "id": 4
    }
}'

Tip

You've added a new writeSettings object and associated it with your outputObject (id=4). When you run the job again, the Parquet output is also generated.

For more information, see Dataprep by Trifacta: API Reference docs
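
If you add several file-based outputs to the same outputObject, a small helper keeps the path and format in step. This is a minimal sketch that reuses the field names from the request body above; the function name `build_write_settings` and the convention of deriving the file extension from the format are assumptions made for illustration.

```python
def build_write_settings(output_object_id: int, directory: str, name: str, fmt: str) -> dict:
    """Build a writeSettings request body for one file-based output."""
    return {
        # Assumption: the file extension mirrors the format, as in the examples.
        "path": f"{directory.rstrip('/')}/{name}.{fmt}",
        "action": "create",
        "format": fmt,
        "compression": "none",
        "header": False,
        "asSingleFile": False,
        "outputObjectId": output_object_id,
    }


settings = build_write_settings(
    4, "hdfs://hadoop:50070/trifacta/queryResults/admin@example.com", "POS_r03", "pqt"
)
print(settings["path"])
```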

Step - Get Connection ID for Publication

To generate a publication, you must identify the connection through which you are publishing the results.

Below, the request returns a single connection to Hive (id=1).

Request:

Endpoint

http://www.wrangle-dev.example.com:3005/v4/connections

Authentication

Required

Method

GET

Request Body

None.

Response:

Status Code

200 - OK

Response Body

{
    "data": [
        {
            "id": 1,
            "host": "hadoop",
            "port": 10000,
            "vendor": "hive",
            "params": {
                "jdbc": "hive2",
                "connectStringOptions": "",
                "defaultDatabase": "default"
            },
            "ssl": false,
            "vendorName": "hive",
            "name": "hive",
            "description": null,
            "type": "jdbc",
            "isGlobal": true,
            "credentialType": "conf",
            "credentialsShared": true,
            "uuid": "28415970-e6c4-11e8-82be-9947a31ecdd5",
            "disableTypeInference": false,
            "createdAt": "2018-11-12T21:44:39.816Z",
            "updatedAt": "2018-11-12T21:44:39.842Z",
            "credentials": [],
            "creator":  {
                "id":  1
            },
            "updater":  {
                "id":  1
            },
            "workspace":  {
                "id":  1
            }
        }
    ],
    "count": 1
}

cURL example:

curl -X GET \
  http://www.wrangle-dev.example.com:3005/v4/connections \
  -H 'authorization: Basic <auth_token>' \
  -H 'cache-control: no-cache' \
  -H 'content-type: application/json'

For more information, see Dataprep by Trifacta: API Reference docs
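
When the response lists more than one connection, you can select the one you need by its vendor value. The following is a minimal sketch that assumes the response shape shown above; the function name `connection_id_for` is hypothetical, and the sample is an abbreviated copy of the example response.

```python
def connection_id_for(response: dict, vendor: str) -> int:
    """Return the id of the single connection whose vendor matches."""
    matches = [conn["id"] for conn in response["data"] if conn["vendor"] == vendor]
    if len(matches) != 1:
        raise LookupError(f"expected one {vendor!r} connection, found {len(matches)}")
    return matches[0]


# Abbreviated copy of the example response above.
connections = {"data": [{"id": 1, "vendor": "hive", "name": "hive"}], "count": 1}
print(connection_id_for(connections, "hive"))  # 1
```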

Step - Create a Publication

Example - Hive:

You can create publications that publish table-based outputs through specified connections. In the following, a Hive table is written out to the default database through connectionId = 1. This publication is associated with the outputObject id=4.

Request:

Endpoint

http://www.wrangle-dev.example.com:3005/v4/publications

Authentication

Required

Method

POST

Request Body

{
    "path": [
        "default"
    ],
    "tableName": "myPublishedHiveTable",
    "targetType": "hive",
    "action": "create",
    "outputObject": {
        "id": 4
    },
    "connection": {
        "id": 1
    }
}

Response:

Status Code

201 - Created

Response Body

{
    "path": [
        "default"
    ],
    "id": 3,
    "tableName": "myPublishedHiveTable",
    "targetType": "hive",
    "action": "create",
    "updatedAt": "2018-11-13T01:25:39.698Z",
    "createdAt": "2018-11-13T01:25:39.698Z",
    "creator": {
        "id": 1
    },
    "updater": {
        "id": 1
    },
    "outputObject": {
        "id": 4
    },
    "connection": {
        "id": 1
    }
}

cURL example:

curl -X POST \
  http://www.wrangle-dev.example.com:3005/v4/publications \
  -H 'authorization: Basic <auth_token>' \
  -H 'cache-control: no-cache' \
  -H 'content-type: application/json' \
  -d '{
    "path": [
        "default"
    ],
    "tableName": "myPublishedHiveTable",
    "targetType": "hive",
    "action": "create",
    "outputObject": {
        "id": 4
    },
    "connection": {
        "id": 1
    }
}' 
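
The Hive publication body ties together the two identifiers gathered so far: the outputObject id and the connection id. The following is a minimal sketch that mirrors the request body above; the function name `build_publication` and its defaults are hypothetical.

```python
def build_publication(output_object_id, connection_id, table_name,
                      database="default", target_type="hive", action="create"):
    """Build a publications request body for a table-based output."""
    return {
        "path": [database],  # database that receives the table
        "tableName": table_name,
        "targetType": target_type,
        "action": action,
        "outputObject": {"id": output_object_id},
        "connection": {"id": connection_id},
    }


publication = build_publication(4, 1, "myPublishedHiveTable")
print(publication["path"], publication["connection"])
```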

Example - BigQuery:

In the following, a BigQuery table is updated through the BigQuery connection (connectionId = bigquery) using an upsert action. This action updates values in the target table based on matching values between the key columns in the source and the corresponding columns in the target table. When these key values match, the columns that you specify in the request are updated with values from the source data. Additional options are listed under Notes below.

This publication is associated with the outputObject id=8.

Request:

Endpoint

http://www.wrangle-dev.example.com:3005/v4/publications

Authentication

Required

Method

POST

Request Body

{
    "tableName": "bq_table_merge",
    "path": [
        "tri-dev",
        "demo_test"
    ],
    "targetType": "bigquery",
    "connectionId": "bigquery",
    "action": "upsert",
    "outputObjectId": 8,
    "runParameters": [],
    "parameters": {
        "mergeJoinKeys": "[\"col1\"]",
        "colsToUpdate": "[\"col2\",\"col3\"]",
        "deleteUnmatchedRowsInTarget": "false",
        "externalTableLocation": null,
        "insertChecked": "true",
        "isDeltaTable": "false",
        "isExternalTable": "false",
        "matchedRowsAction": "update"
    }
}

Response:

Status Code

201 - Created

Response Body

{
    "path": [
        "default"
    ],
    "id": 3,
    "tableName": "bq_table_merge",
    "targetType": "bigquery",
    "action": "upsert",
    "updatedAt": "2018-11-13T01:25:39.698Z",
    "createdAt": "2018-11-13T01:25:39.698Z",
    "creator": {
        "id": 1
    },
    "updater": {
        "id": 1
    },
    "outputObject": {
        "id": 8
    },
    "connection": {
        "id": "bigquery"
    }
}

Notes:

Property

Description

action

Set this value to upsert to update the target table using the merge (upsert) action.

connection.id

This value must be set to bigquery.

mergeJoinKeys

The set of one or more columns whose values are used to determine if a row in the source data that you are publishing matches a row in the target table.

colsToUpdate

If the join keys above do match, then the colsToUpdate columns are updated with the values from the source data.

Note

These column values are updated if matchedRowsAction is set to update.

deleteUnmatchedRowsInTarget

When true, rows in the target table that do not match the mergeJoinKeys columns are deleted. Only matching rows remain.

externalTableLocation

This parameter can be used to include a path to the external table. Otherwise, this value is null.

insertChecked

When true and the join keys do not match, the non-matching rows are written to the table.

isDeltaTable

When true, the target table is written as a delta table, which means data is stored as changes from the previous version of the table.

matchedRowsAction

This parameter defines the action taken on the table when rows are matched for the join keys:

  • update - (default) update the colsToUpdate columns with the values from the source data

  • delete - delete the matching rows in the target

For more information, see Dataprep by Trifacta: API Reference docs
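
In the BigQuery request above, the column lists and boolean flags inside parameters are passed as strings, with the column lists written as JSON arrays inside those strings. The following is a minimal sketch of building that object so the escaping is handled for you. The function name `upsert_parameters` and its argument names are hypothetical; the property names and values come from the example request.

```python
import json


def _as_json_string(columns) -> str:
    """Encode a list of column names as a compact JSON array string."""
    return json.dumps(list(columns), separators=(",", ":"))


def upsert_parameters(join_keys, cols_to_update, insert_unmatched=True,
                      delete_unmatched_in_target=False, matched_rows_action="update"):
    """Build the parameters object for a publication that uses the upsert action."""
    if matched_rows_action not in ("update", "delete"):
        raise ValueError("matchedRowsAction must be 'update' or 'delete'")
    return {
        "mergeJoinKeys": _as_json_string(join_keys),
        "colsToUpdate": _as_json_string(cols_to_update),
        # Boolean flags are sent as the strings "true" and "false" in the example.
        "deleteUnmatchedRowsInTarget": str(delete_unmatched_in_target).lower(),
        "externalTableLocation": None,
        "insertChecked": str(insert_unmatched).lower(),
        "isDeltaTable": "false",
        "isExternalTable": "false",
        "matchedRowsAction": matched_rows_action,
    }


params = upsert_parameters(["col1"], ["col2", "col3"])
print(params["colsToUpdate"])  # ["col2","col3"]
```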

Tip

You're done.

You have done the following:

  1. Created an output object:

    1. Embedded a writeSettings object to define an Avro output.

    2. Associated the outputObject with a recipe.

  2. Added another writeSettings object to the outputObject.

  3. Added a table-based publication object to the outputObject.

You can now generate results for these three different outputs whenever you run a job (create a jobGroup) for the associated recipe.

Step - Apply Overrides

When you are publishing results to a relational source, you can optionally apply overrides to the job to redirect the output or change the action applied to the target table. For more information, see API Task - Run Job.

Step - Apply Dataflow Job Overrides

Note

This feature may not be available in all product editions. For more information on available features, see Compare Editions.

Note

Overrides applied to the output objects are merged with any overrides specified as part of the jobGroup at the time of execution. For more information, see API Task - Run Job.

If neither object has a specified override for a Dataflow property, the applicable project setting is used. See User Execution Settings Page.

You can optionally submit override values for a predefined set of Dataflow properties on the output object. These overrides are applied each time that the outputObject is used to generate a set of results.

Note

If you are using automatic VPC network mode, then network, subnetwork, and usePublicIps do not apply.

Tip

You can apply job overrides to the job itself, instead of applying overrides to the outputObject. For more information, see API Task - Run Job.

Example - Apply labels to output object

In the following example, an existing outputObject (id=4) is modified to include override values for the labels of the job. Each property and its value is specified as a key-value pair in the request:

Request:

Endpoint

https://www.api.clouddataprep.com/v4/outputObjects/4

Authentication

Required

Method

PATCH

Request Body

{
  "execution": "dataflow",
  "profiler": true,
  "outputObjectDataflowOptions": {
    "region": "us-central1",
    "zone": "us-central1-a", 
    "machineType": "n1-standard-64",
    "network": "my-network-name",
    "subnetwork": "regions/us-central1/subnetworks/my-subnetwork",
    "autoscalingAlgorithm": "THROUGHPUT_BASED",
    "serviceAccount": "my-service-account-name@<project-id>.iam.gserviceaccount.com",
    "numWorkers": "1",
    "maxNumWorkers": "1000",
    "usePublicIps": "true",
    "labels": [
       {
         "key": "my-billing-label-key",
         "value": "my-billing-label-value"      
       }
     ]
   }
}

Response:

Status Code

200 - OK

Response Body

{
    "id": 4,
    "updater": {
        "id": 1
    },
    "updatedAt": "2020-03-21T00:27:00.937Z",
    "createdAt": "2020-03-20T23:30:42.991Z"
}

cURL example:

curl -X PATCH \
  https://www.api.clouddataprep.com/v4/outputObjects/4 \
  -H 'authorization: Bearer <auth_token>' \
  -H 'cache-control: no-cache' \
  -H 'content-type: application/json' \
  -d '{
  "execution": "dataflow",
  "profiler": true,
  "outputObjectDataflowOptions": {
    "region": "us-central1",
    "zone": "us-central1-a", 
    "machineType": "n1-standard-64",
    "network": "my-network-name",
    "subnetwork": "regions/us-central1/subnetworks/my-subnetwork",
    "autoscalingAlgorithm": "THROUGHPUT_BASED",
    "serviceAccount": "my-service-account-name@<project-id>.iam.gserviceaccount.com",
    "numWorkers": "1",
    "maxNumWorkers": "1000",
    "usePublicIps": "true",
    "labels": [
       {
         "key": "my-billing-label-key",
         "value": "my-billing-label-value"      
       }
     ]
   }
}'

Notes on properties:

  • If a network value, subnetwork value, or both is specified, then the VPC mode is custom. This setting is available in the UI for convenience.

  • You can submit empty or null values for property values in the payload. These values are submitted.

  • If you are not using auto-scaling on your job:

    • "autoscalingAlgorithm": "NONE",

    • Use "numWorkers" instead to specify the number of compute nodes to use for the job.

      Note

      This feature may not be available in all product editions. For more information on available features, see Compare Editions.

  • If you are using auto-scaling on your job:

    • "autoscalingAlgorithm": "THROUGHPUT_BASED",

    • Use "numWorkers" and "maxNumWorkers" to specify the number of compute nodes to use for the job.

      Note

      This feature may not be available in all product editions. For more information on available features, see Compare Editions.

Notes on labels:

You can use labels to assign billing information for the job in your project.

Note

This feature may not be available in all product editions. For more information on available features, see Compare Editions.

You can apply up to 64 labels for a job. For more information on the available properties, see Runtime Dataflow Execution Settings.
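
The notes on auto-scaling and labels can be combined into one helper that builds the outputObjectDataflowOptions object. This is a minimal sketch under the assumptions stated in the notes above: worker counts are sent as strings as in the example request, and at most 64 labels are allowed. The function name `dataflow_options` and its arguments are hypothetical.

```python
def dataflow_options(num_workers, max_num_workers=None, labels=None):
    """Build outputObjectDataflowOptions with scaling and label overrides."""
    labels = labels or {}
    if len(labels) > 64:
        raise ValueError("at most 64 labels can be applied to a job")
    options = {"numWorkers": str(num_workers)}
    if max_num_workers is None:
        # No auto-scaling: numWorkers alone sets the number of compute nodes.
        options["autoscalingAlgorithm"] = "NONE"
    else:
        options["autoscalingAlgorithm"] = "THROUGHPUT_BASED"
        options["maxNumWorkers"] = str(max_num_workers)
    options["labels"] = [{"key": key, "value": value} for key, value in labels.items()]
    return options


overrides = dataflow_options(1, 1000, {"my-billing-label-key": "my-billing-label-value"})
print(overrides["autoscalingAlgorithm"], overrides["labels"])
```

The returned object can be placed under the "outputObjectDataflowOptions" key of the PATCH request body.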

Example - Override VPC settings

In the following example, an existing outputObject (id=4) is modified to override the VPC settings to use a non-local VPC:

Request:

Endpoint

https://www.api.clouddataprep.com/v4/outputObjects/4

Authentication

Required

Method

PATCH

Request Body

{
  "execution": "dataflow",
  "outputObjectDataflowOptions": {
    "region": "second-region",
    "zone": "us-central1-a", 
    "network": "my-other-network",
    "subnetwork": "regions/second-region/subnetworks/my-other-subnetwork"
  }
}

Response:

Status Code

200 - OK

Response Body

{
    "id": 4,
    "updater": {
        "id": 1
    },
    "updatedAt": "2020-03-21T00:27:00.937Z",
    "createdAt": "2020-03-20T23:30:42.991Z"
}

cURL example:

curl -X PATCH \
  https://www.api.clouddataprep.com/v4/outputObjects/4 \
  -H 'authorization: Bearer <auth_token>' \
  -H 'cache-control: no-cache' \
  -H 'content-type: application/json' \
  -d '{
  "execution": "dataflow",
  "outputObjectDataflowOptions": {
    "region": "second-region",
    "zone": "us-central1-a", 
    "network": "my-other-network",
    "subnetwork": "regions/second-region/subnetworks/my-other-subnetwork"
  }
}'

Notes on properties:

  • If a network value, subnetwork value, or both is specified, then the VPC mode is custom. This setting is available in the UI for convenience.

  • Subnetwork values must be specified as a short URL or a full URL.

    • To specify the VPC associated with a different project to which you have access, use the full URL pattern for the subnetwork value:

      https://www.googleapis.com/compute/v1/projects/<HOST_PROJECT_ID>/regions/<REGION>/subnetworks/<SUBNETWORK>

      <HOST_PROJECT_ID> corresponds to the project identifier. This value must be between 6 and 30 characters. The value can contain only lowercase letters, digits, or hyphens. It must start with a letter. Trailing hyphens are prohibited.

    • To specify a different VPC subnetwork, you can also use a short URL pattern for the subnetwork value:

      regions/<REGION>/subnetworks/<SUBNETWORK>

For more information on these properties, see Runtime Dataflow Execution Settings.
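
The two subnetwork patterns above differ only in the project prefix, so one helper can produce either form. The following is a minimal sketch: the function name `subnetwork_value` is hypothetical, and the regular expression encodes the project identifier rules stated above.

```python
import re

# 6 to 30 characters: starts with a letter; lowercase letters, digits, or
# hyphens in between; no trailing hyphen.
PROJECT_ID = re.compile(r"[a-z][a-z0-9-]{4,28}[a-z0-9]")


def subnetwork_value(region: str, subnetwork: str, host_project_id=None) -> str:
    """Return the short URL, or the full URL when a host project is given."""
    short = f"regions/{region}/subnetworks/{subnetwork}"
    if host_project_id is None:
        return short
    if not PROJECT_ID.fullmatch(host_project_id):
        raise ValueError(f"invalid project identifier: {host_project_id!r}")
    return f"https://www.googleapis.com/compute/v1/projects/{host_project_id}/{short}"


print(subnetwork_value("us-central1", "my-subnetwork"))
print(subnetwork_value("us-central1", "my-subnetwork", "my-host-project"))
```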