On April 28, 2021, Google is changing the required permissions for attaching IAM roles to service accounts. If you are using IAM roles for your Google service accounts, please see Changes to User Management.
For the latest updates on available API endpoints and documentation, see api.trifacta.com.
Overview
Basic Workflow
Here's the basic workflow described in this section.
- Get the internal identifier for the recipe for which you are building outputs.
- Create the outputObject for the recipe.
- Create a writeSettings object and associate it with the outputObject.
- Run a test job, if desired.
- For any publication, get the internal identifier for the connection to use.
- Create a publication object and associate it with the outputObject.
- Run your job.
Variations
If you are generating exclusively file-based or relational outputs, you can vary this workflow in the following ways:
For file-based outputs:
- Get the internal identifier for the recipe for which you are building outputs.
- Create the outputObject for the recipe.
- Create a writeSettings object and associate it with the outputObject.
- Run your job.
For relational outputs:
- Get the internal identifier for the recipe for which you are building outputs.
- Create the outputObject for the recipe.
- For any publication, get the internal identifier for the connection to use.
- Create a publication object and associate it with the outputObject.
- Run your job.
Step - Get Recipe ID
To begin, you need the internal identifier for the recipe.
NOTE: In the APIs, a recipe is identified by the internal identifier of the wrangledDataset object with which it is associated.
Request:
Endpoint | http://www.wrangle-dev.example.com:3005/v4/wrangledDatasets |
---|---|
Authentication | Required |
Method | GET |
Request Body | None. |
Response:
Status Code | 200 - OK |
---|---|
Response Body | { "data": [ { "id": 11, "wrangled": true, "createdAt": "2018-11-12T23:06:36.473Z", "updatedAt": "2018-11-12T23:06:36.539Z", "recipe": { "id": 10 }, "name": "POS-r01", "description": null, "referenceInfo": null, "activeSample": { "id": 11 }, "creator": { "id": 1 }, "updater": { "id": 1 }, "flow": { "id": 4 } }, { "id": 1, "wrangled": true, "createdAt": "2018-11-12T23:19:57.650Z", "updatedAt": "2018-11-12T23:20:47.297Z", "recipe": { "id": 19 }, "name": "member_info", "description": null, "referenceInfo": null, "activeSample": { "id": 20 }, "creator": { "id": 1 }, "updater": { "id": 1 }, "flow": { "id": 6 } } ] } |
cURL example:
curl -X GET \
  http://www.wrangle-dev.example.com:3005/v4/wrangledDatasets \
  -H 'authorization: Basic <auth_token>' \
  -H 'cache-control: no-cache'
Checkpoint: In the above, let's assume that the recipe identifier of interest is wrangledDataset=11. This means that the flow where it is hosted is flow.id=4. Retain this information for later.
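If you are scripting this workflow, you can extract these identifiers from the response with a JSON tool such as jq. A minimal sketch, assuming jq is installed and the response shown above:

curl -s -X GET http://www.wrangle-dev.example.com:3005/v4/wrangledDatasets \
  -H 'authorization: Basic <auth_token>' \
  | jq '.data[] | {recipeId: .id, flowId: .flow.id}'

# For the example response above, the first entry yields:
# { "recipeId": 11, "flowId": 4 }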
For more information, see Cloud Dataprep by TRIFACTA INC. API Reference docs: Premium | Standard
Step - Create outputObject
Create the outputObject and associate it with the recipe identifier. In the following request, the wrangledDataset identifier that you retrieved in the previous call is applied as the flowNodeId value.
The following example includes an embedded writeSettings object, which generates an Avro file output. You can remove this embedded object if desired, but you must create a writeSettings object before you can generate an output.
Request:
Endpoint | http://www.wrangle-dev.example.com:3005/v4/outputObjects |
---|---|
Authentication | Required |
Method | POST |
Request Body | { "execution": "photon", "profiler": true, "isAdhoc": true, "writeSettings": { "data": [ { "delim": ",", "path": "hdfs://hadoop:50070/trifacta/queryResults/admin@example.com/POS_01.avro", "action": "create", "format": "avro", "compression": "none", "header": false, "asSingleFile": false, "prefix": null, "suffix": "_increment", "hasQuotes": false } ] }, "flowNode": { "id": 11 } } |
Response:
Status Code | 201 - Created |
---|---|
Response Body | { "id": 4, "execution": "photon", "profiler": true, "isAdhoc": true, "updatedAt": "2018-11-13T00:20:49.258Z", "createdAt": "2018-11-13T00:20:49.258Z", "creator": { "id": 1 }, "updater": { "id": 1 }, "flowNode": { "id": 11 } } |
cURL example:
curl -X POST \
  http://www.wrangle-dev.example.com/v4/outputObjects \
  -H 'authorization: Basic <auth_token>' \
  -H 'cache-control: no-cache' \
  -H 'content-type: application/json' \
  -d '{
    "execution": "photon",
    "profiler": true,
    "isAdhoc": true,
    "writeSettings": {
      "data": [
        {
          "delim": ",",
          "path": "hdfs://hadoop:50070/trifacta/queryResults/admin@example.com/POS_01.avro",
          "action": "create",
          "format": "avro",
          "compression": "none",
          "header": false,
          "asSingleFile": false,
          "prefix": null,
          "suffix": "_increment",
          "hasQuotes": false
        }
      ]
    },
    "flowNode": {
      "id": 11
    }
  }'
Checkpoint: You've created an outputObject (id=4) and an embedded writeSettings object and have associated them with the appropriate recipe (flowNodeId=11). You can now run a job for this recipe, generating the specified output.
For more information, see Cloud Dataprep by TRIFACTA INC. API Reference docs: Premium | Standard
Step - Run a Test Job
Now that outputs have been defined for the recipe, you can execute a job on the specified recipe flowNodeId=11:
Request:
Endpoint | http://www.wrangle-dev.example.com:3005/v4/jobGroups |
---|---|
Authentication | Required |
Method | POST |
Request Body | { "wrangledDataset": { "id": 11 } } |
Response:
Status Code | 201 - Created |
---|---|
Response Body | { "sessionId": "79276c31-c58c-4e79-ae5e-fed1a25ebca1", "reason": "JobStarted", "jobGraph": { "vertices": [ 21, 22 ], "edges": [ { "source": 21, "target": 22 } ] }, "id": 2, "jobs": { "data": [ { "id": 21 }, { "id": 22 } ] } } |
NOTE: To re-run the job against its currently specified outputs, writeSettings, and publications, you only need the recipe ID. If you need to make changes for purposes of a specific job run, you can add overrides to the request for the job. These overrides apply only for the current job. For more information, see Cloud Dataprep by TRIFACTA INC. API Reference docs: Premium | Standard
To track the status of the job:
- You can monitor progress through the application.
- You can monitor progress through the status field by querying the specific job. For more information, see Cloud Dataprep by TRIFACTA INC. API Reference docs: Premium | Standard
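For example, a sketch of checking status via cURL, assuming the jobGroup id=2 returned above and the jobGroups status endpoint described in the API reference:

curl -X GET \
  http://www.wrangle-dev.example.com:3005/v4/jobGroups/2/status \
  -H 'authorization: Basic <auth_token>'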
Checkpoint: You've run a job, generating one output in Avro format.
Step - Create writeSettings Object
Suppose you want to create another file-based output for this outputObject. You can create a second writeSettings object, which publishes the results of the job run on the recipe to the specified location.
The following example creates settings for generating a Parquet-based output.
Request:
Endpoint | http://www.wrangle-dev.example.com:3005/v4/writeSettings/ |
---|---|
Authentication | Required |
Method | POST |
Request Body | { "delim": ",", "path": "hdfs://hadoop:50070/trifacta/queryResults/admin@example.com/POS_r03.pqt", "action": "create", "format": "pqt", "compression": "none", "header": false, "asSingleFile": false, "prefix": null, "suffix": "_increment", "hasQuotes": false, "outputObjectId": 4 } |
Response:
Status Code | 201 - Created |
---|---|
Response Body | { "delim": ",", "id": 2, "path": "hdfs://hadoop:50070/trifacta/queryResults/admin@example.com/POS_r03.pqt", "action": "create", "format": "pqt", "compression": "none", "header": false, "asSingleFile": false, "prefix": null, "suffix": "_increment", "hasQuotes": false, "updatedAt": "2018-11-13T01:07:52.386Z", "createdAt": "2018-11-13T01:07:52.386Z", "creator": { "id": 1 }, "updater": { "id": 1 }, "outputObject": { "id": 4 } } |
cURL example:
curl -X POST \
  http://www.wrangle-dev.example.com/v4/writeSettings \
  -H 'authorization: Basic <auth_token>' \
  -H 'cache-control: no-cache' \
  -H 'content-type: application/json' \
  -d '{
    "delim": ",",
    "path": "hdfs://hadoop:50070/trifacta/queryResults/admin@example.com/POS_r03.pqt",
    "action": "create",
    "format": "pqt",
    "compression": "none",
    "header": false,
    "asSingleFile": false,
    "prefix": null,
    "suffix": "_increment",
    "hasQuotes": false,
    "outputObject": {
      "id": 4
    }
  }'
Checkpoint: You've added a new writeSettings object and associated it with your outputObject (id=4). When you run the job again, the Parquet output is also generated.
For more information, see Cloud Dataprep by TRIFACTA INC. API Reference docs: Premium | Standard
Step - Get Connection ID for Publication
To generate a publication, you must identify the connection through which you are publishing the results.
Below, the request returns a single connection to Hive (id=1).
Request:
Endpoint | http://www.wrangle-dev.example.com:3005/v4/connections |
---|---|
Authentication | Required |
Method | GET |
Request Body | None. |
Response:
Status Code | 200 - OK |
---|---|
Response Body | { "data": [ { "id": 1, "host": "hadoop", "port": 10000, "vendor": "hive", "params": { "jdbc": "hive2", "connectStringOptions": "", "defaultDatabase": "default" }, "ssl": false, "vendorName": "hive", "name": "hive", "description": null, "type": "jdbc", "isGlobal": true, "credentialType": "conf", "credentialsShared": true, "uuid": "28415970-e6c4-11e8-82be-9947a31ecdd5", "disableTypeInference": false, "createdAt": "2018-11-12T21:44:39.816Z", "updatedAt": "2018-11-12T21:44:39.842Z", "credentials": [], "creator": { "id": 1 }, "updater": { "id": 1 }, "workspace": { "id": 1 } } ], "count": 1 } |
cURL example:
curl -X GET \
  http://www.wrangle-dev.example.com/v4/connections \
  -H 'authorization: Basic <auth_token>' \
  -H 'cache-control: no-cache' \
  -H 'content-type: application/json'
For more information, see Cloud Dataprep by TRIFACTA INC. API Reference docs: Premium | Standard
Step - Create a Publication
You can create publications that publish table-based outputs through specified connections. In the following example, a Hive table is written out to the default database through connectionId=1. This publication is associated with the outputObject id=4.
Request:
Endpoint | http://www.wrangle-dev.example.com:3005/v4/publications |
---|---|
Authentication | Required |
Method | POST |
Request Body | { "path": [ "default" ], "tableName": "myPublishedHiveTable", "targetType": "hive", "action": "create", "outputObject": { "id": 4 }, "connection": { "id": 1 } } |
Response:
Status Code | 201 - Created |
---|---|
Response Body | { "path": [ "default" ], "id": 3, "tableName": "myPublishedHiveTable", "targetType": "hive", "action": "create", "updatedAt": "2018-11-13T01:25:39.698Z", "createdAt": "2018-11-13T01:25:39.698Z", "creator": { "id": 1 }, "updater": { "id": 1 }, "outputObject": { "id": 4 }, "connection": { "id": 1 } } |
cURL example:
curl -X POST \
  http://example.com:3005/v4/publications \
  -H 'authorization: Basic <auth_token>' \
  -H 'cache-control: no-cache' \
  -H 'content-type: application/json' \
  -d '{
    "path": [
      "default"
    ],
    "tableName": "myPublishedHiveTable",
    "targetType": "hive",
    "action": "create",
    "outputObject": {
      "id": 4
    },
    "connection": {
      "id": 1
    }
  }'
For more information, see Cloud Dataprep by TRIFACTA INC. API Reference docs: Premium | Standard
Checkpoint: You're done.
You have done the following:
- Created an output object:
  - Embedded a writeSettings object to define an Avro output.
  - Associated the outputObject with a recipe.
- Added another writeSettings object to the outputObject.
- Added a table-based publication object to the outputObject.
You can now generate results for these three different outputs whenever you run a job (create a jobGroup) for the associated recipe.
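Putting it all together: the following is a minimal shell sketch of the file-based portion of this workflow, using cURL and jq. The BASE and AUTH variables are illustrative placeholders; endpoints and payloads are taken from the steps above.

BASE='http://www.wrangle-dev.example.com:3005/v4'
AUTH='authorization: Basic <auth_token>'

# 1. Look up the recipe's wrangledDataset id (first entry, per the example above).
RECIPE_ID=$(curl -s "$BASE/wrangledDatasets" -H "$AUTH" | jq '.data[0].id')

# 2. Create the outputObject with an embedded writeSettings object (Avro output).
OUTPUT_ID=$(curl -s -X POST "$BASE/outputObjects" \
  -H "$AUTH" -H 'content-type: application/json' \
  -d '{
    "execution": "photon", "profiler": true, "isAdhoc": true,
    "writeSettings": { "data": [ {
      "delim": ",", "path": "hdfs://hadoop:50070/trifacta/queryResults/admin@example.com/POS_01.avro",
      "action": "create", "format": "avro", "compression": "none", "header": false,
      "asSingleFile": false, "prefix": null, "suffix": "_increment", "hasQuotes": false
    } ] },
    "flowNode": { "id": '"$RECIPE_ID"' }
  }' | jq '.id')

# 3. Run a job for the recipe.
curl -s -X POST "$BASE/jobGroups" \
  -H "$AUTH" -H 'content-type: application/json' \
  -d '{ "wrangledDataset": { "id": '"$RECIPE_ID"' } }'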
Step - Apply Overrides
When you are publishing results to a relational source, you can optionally apply overrides to the job to redirect the output or change the action applied to the target table. For more information, see API Workflow - Run Job.
Step - Apply Cloud Dataflow Job Overrides
NOTE: Overrides applied to the output objects are merged with any overrides specified as part of the jobGroup at the time of execution. For more information, see API Workflow - Run Job.
If neither object has a specified override for a Cloud Dataflow property, the applicable project setting is used. See Project Settings Page.
You can optionally submit override values for a predefined set of Cloud Dataflow properties on the output object. These overrides are applied each time the outputObject is used to generate a set of results.
NOTE: If you are using automatic VPC network mode, then network, subnetwork, and usePublicIps do not apply.
Tip: You can apply job overrides to the job itself, instead of applying overrides to the outputObject. For more information, see API Workflow - Run Job.
Example - Apply labels to output object
In the following example, an existing outputObject (id=4) is modified to include override values for the labels of the job. Each property and its value is specified as a key-value pair in the request:
Request:
Endpoint | https://www.api.clouddataprep.com/v4/outputObjects/4 |
---|---|
Authentication | Required |
Method | PATCH |
Request Body | { "execution": "dataflow", "profiler": true, "outputObjectDataflowOptions": { "region": "us-central1", "zone": "us-central1-a", "machineType": "n1-standard-64", "network": "my-network-name", "subnetwork": "regions/us-central1/subnetworks/my-subnetwork", "autoscalingAlgorithm": "THROUGHPUT_BASED", "serviceAccount": "my-service-account-name@<project-id>.iam.gserviceaccount.com", "numWorkers": "1", "maxNumWorkers": "1000", "usePublicIps": "true", "labels": [ { "key": "my-billing-label-key", "value": "my-billing-label-value" } ] } } |
Response:
Status Code | 200 - Ok |
---|---|
Response Body | { "id": 4, "updater": { "id": 1 }, "updatedAt": "2020-03-21T00:27:00.937Z", "createdAt": "2020-03-20T23:30:42.991Z" } |
cURL example:
curl -X PATCH \
  http://www.wrangle-dev.example.com/v4/outputObjects/4 \
  -H 'authorization: Bearer <auth_token>' \
  -H 'cache-control: no-cache' \
  -H 'content-type: application/json' \
  -d '{
    "execution": "dataflow",
    "profiler": true,
    "outputObjectDataflowOptions": {
      "region": "us-central1",
      "zone": "us-central1-a",
      "machineType": "n1-standard-64",
      "network": "my-network-name",
      "subnetwork": "regions/us-central1/subnetworks/my-subnetwork",
      "autoscalingAlgorithm": "THROUGHPUT_BASED",
      "serviceAccount": "my-service-account-name@<project-id>.iam.gserviceaccount.com",
      "numWorkers": "1",
      "maxNumWorkers": "1000",
      "usePublicIps": "true",
      "labels": [
        {
          "key": "my-billing-label-key",
          "value": "my-billing-label-value"
        }
      ]
    }
  }'
Notes on properties:
- If a network value, a subnetwork value, or both are specified, then the VPC mode is custom. This setting is available in the UI for convenience.
- You can submit empty or null values for property values in the payload. These values are submitted as-is.
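For example, a sketch of a PATCH body that submits null values for the network properties (hypothetical values, shown only to illustrate the syntax):

{
  "execution": "dataflow",
  "outputObjectDataflowOptions": {
    "network": null,
    "subnetwork": null
  }
}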
- If you are not using auto-scaling on your job:
  - Set "autoscalingAlgorithm": "NONE".
  - Use "numWorkers" to specify the fixed number of compute nodes to use for the job.
  - Feature Availability: This feature is available in Cloud Dataprep Premium by TRIFACTA INC.
- If you are using auto-scaling on your job:
  - Set "autoscalingAlgorithm": "THROUGHPUT_BASED".
  - Use "numWorkers" and "maxNumWorkers" to specify the initial and maximum number of compute nodes to use for the job.
  - Feature Availability: This feature is available in Cloud Dataprep Premium by TRIFACTA INC.
Notes on labels:
- You can use labels to assign billing information for the job in your project. Feature Availability: This feature is available in Cloud Dataprep Premium by TRIFACTA INC.
- Key: This value must be unique among your job labels.
- Value: Assign based on the accepted values for the label. For more information, see https://cloud.google.com/resource-manager/docs/creating-managing-labels.
- You can apply up to 64 labels for a job. For more information on the available properties, see Dataflow Execution Settings.
Example - Override VPC settings
In the following example, an existing outputObject (id=4) is modified to override the VPC settings to use a non-local VPC:
Request:
Endpoint | https://www.api.clouddataprep.com/v4/outputObjects/4 |
---|---|
Authentication | Required |
Method | PATCH |
Request Body | { "execution": "dataflow", "outputObjectDataflowOptions": { "region": "second-region", "zone": "us-central1-a", "network": "my-other-network", "subnetwork": "regions/second-region/subnetworks/my-other-subnetwork" } } |
Response:
Status Code | 200 - Ok |
---|---|
Response Body | { "id": 4, "updater": { "id": 1 }, "updatedAt": "2020-03-21T00:27:00.937Z", "createdAt": "2020-03-20T23:30:42.991Z" } |
cURL example:
curl -X PATCH \
  http://www.wrangle-dev.example.com/v4/outputObjects/4 \
  -H 'authorization: Bearer <auth_token>' \
  -H 'cache-control: no-cache' \
  -H 'content-type: application/json' \
  -d '{
    "execution": "dataflow",
    "outputObjectDataflowOptions": {
      "region": "second-region",
      "zone": "us-central1-a",
      "network": "my-other-network",
      "subnetwork": "regions/second-region/subnetworks/my-other-subnetwork"
    }
  }'
Notes on properties:
If a network value, a subnetwork value, or both are specified, then the VPC mode is custom. This setting is available in the UI for convenience.
Subnetwork values must be specified as a short URL or a full URL.
To specify the VPC associated with a different project to which you have access, use the full URL pattern for the subnetwork value:
https://www.googleapis.com/compute/v1/projects/<HOST_PROJECT_ID>/regions/<REGION>/subnetworks/<SUBNETWORK>
<HOST_PROJECT_ID> corresponds to the project identifier. This value must be between 6 and 30 characters, can contain only lowercase letters, digits, or hyphens, must start with a letter, and must not end with a hyphen.
To specify a different VPC subnetwork, you can also use the short URL pattern for the subnetwork value:
regions/<REGION>/subnetworks/<SUBNETWORK>
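For example, a hypothetical subnetwork my-shared-subnet in host project my-host-project and region us-central1 would use the full URL:
https://www.googleapis.com/compute/v1/projects/my-host-project/regions/us-central1/subnetworks/my-shared-subnet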
For more information on these properties, see Dataflow Execution Settings.