This example walks through the process of identifying or creating a flow, adding datasets to it, and executing a job, using a combination of steps in the application and calls to the APIs.
NOTE: This API workflow applies to a Development instance of the platform.
In this example, you are attempting to wrangle monthly point of sale (POS) data from three separate regions into a single dataset for the state. This monthly data must be enhanced with information about the products and stores in the state. So, the example has a combination of transactional and reference data, which must be brought together into a single dataset.
Tip: To facilitate re-execution of this job each month, the transactional data should be stored in a dedicated directory. This directory can be overwritten with next month's data using the same filenames. As long as the new files are structured in an identical manner to the original ones, the new month's data can be processed by re-running the API aspects of this workflow.
Example Files:
The following files are stored on HDFS:
| Path and Filename | Description |
|---|---|
| hdfs:///user/pos/POS-r01.txt | Point of sale transactions for Region 1. |
| hdfs:///user/pos/POS-r02.txt | Point of sale transactions for Region 2. |
| hdfs:///user/pos/POS-r03.txt | Point of sale transactions for Region 3. |
| hdfs:///user/ref/REF_PROD.txt | Reference data on products for the state. |
| hdfs:///user/ref/REF_CAL.txt | Reference data on stores in the state. |
NOTE: The reference and transactional data are stored in separate directories. In this case, you can assume that the user has read access to both directories through their platform account.
Base URL:
For purposes of this example, the base URL for the platform is the following:
http://www.example.com:3005
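If you prefer to script the calls in this workflow, the following Python sketch shows the shared setup reused in the later sketches. The requests library and the Bearer-token Authorization header are assumptions; substitute whatever HTTP client and credential scheme your instance uses.

```python
# Shared setup for the API sketches in this workflow.
# The Authorization header is an assumption; use whatever credential
# scheme (token, username/password, etc.) your instance supports.
import requests

BASE_URL = "http://www.example.com:3005"
HEADERS = {
    "Authorization": "Bearer <your-access-token>",
    "Content-Type": "application/json",
}
```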
To begin, you must locate a flow or create a flow through the APIs to contain the datasets that you are importing.
NOTE: You cannot add datasets to the flow through the flows endpoint. Datasets are associated with the flow when you create their recipes (wrangledDatasets) later in this workflow.
Locate:
NOTE: If you know the display name value for the flow and are confident that it is not shared with any other flows, you can use the APIs to retrieve the flowId by listing flows and matching on the name, as sketched below.
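For example, a minimal Python sketch for retrieving the flowId by display name might look like the following. It assumes the list endpoint wraps its results in a data array, as the other responses in this example do; verify against your instance's API reference.

```python
import requests

BASE_URL = "http://www.example.com:3005"
HEADERS = {"Authorization": "Bearer <your-access-token>"}  # assumed auth scheme

# List flows and match the display name client-side.
resp = requests.get(f"{BASE_URL}/v4/flows", headers=HEADERS)
resp.raise_for_status()
flows = resp.json().get("data", [])
matches = [f for f in flows if f.get("name") == "Point of Sale - 2013"]

if len(matches) == 1:
    flow_id = matches[0]["id"]
    print("flowId:", flow_id)
else:
    raise RuntimeError("Flow name not found or not unique; locate the flowId manually.")
```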
In the Flow Details page for that flow, locate the flow identifier in the URL:
| Flow Details URL | http://www.example.com:3005/flows/10 |
|---|---|
| Flow Id | 10 |
Retain this identifier for later use.
Create:
Through the APIs, you can create a flow using the following call:
| Endpoint | http://www.example.com:3005/v4/flows |
|---|---|
| Authentication | Required |
| Method | POST |
| Request Body | See the sketch below. |
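A minimal request body needs at least a name and a description, which you can see echoed in the response that follows. Here is a Python sketch of the call; only the auth header is an assumption.

```python
import requests

BASE_URL = "http://www.example.com:3005"
HEADERS = {"Authorization": "Bearer <your-access-token>"}  # assumed auth scheme

# Create the flow; the name and description match the response shown below.
resp = requests.post(
    f"{BASE_URL}/v4/flows",
    headers=HEADERS,
    json={
        "name": "Point of Sale - 2013",
        "description": "Point of Sale data for state",
    },
)
resp.raise_for_status()
flow_id = resp.json()["id"]  # 10 in this example
```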
The response should return status code 201 - Created with a response body like the following:
{ "id": 10, "updatedAt": "2017-02-17T17:08:57.848Z", "createdAt": "2017-02-17T17:08:57.848Z", "name": "Point of Sale - 2013", "description": "Point of Sale data for state", "creator": { "id": 1 }, "updater": { "id": 1 }, "workspace": { "id": 1 } } |
Retain the id value (10) for later use. For more information, see operation/createFlow.
Checkpoint: You have identified or created the flow to contain your dataset or datasets.
To create datasets from the above sources, you must:
1. Create an imported dataset for each source file.
2. For each imported dataset, create a recipe (wrangledDataset), which adds the dataset to your flow.
The following steps describe how to complete these actions via the APIs for a single file; repeat them for each source file.
Steps:
To create an imported dataset, you must acquire some information about the source, most importantly its full path. In the above example, the source is the POS-r01.txt file at hdfs:///user/pos/POS-r01.txt.
Construct the following request:
| Endpoint | http://www.example.com:3005/v4/importedDatasets |
|---|---|
| Authentication | Required |
| Method | POST |
| Request Body | See the sketch below. |
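At minimum, the request must identify the source file. The field names in the following Python sketch are inferred from the response below; depending on your connection type, additional properties may be required.

```python
import requests

BASE_URL = "http://www.example.com:3005"
HEADERS = {"Authorization": "Bearer <your-access-token>"}  # assumed auth scheme

# Register the HDFS file as an imported dataset.
# Field names are inferred from the response shown below.
resp = requests.post(
    f"{BASE_URL}/v4/importedDatasets",
    headers=HEADERS,
    json={
        "uri": "hdfs:///user/pos/POS-r01.txt",
        "name": "POS-r01.txt",
        "description": "POS-r01.txt",
    },
)
resp.raise_for_status()
imported_dataset_id = resp.json()["id"]  # 8 in this example
```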
You should receive a 201 - Created response with a response body similar to the following:
{ "id": 8, "size": "281032", "uri": "hdfs:///user/pos/POS-r01.txt", "dynamicPath": null, "bucket": null, "isSchematized": true, "isDynamic": false, "disableTypeInference": false, "updatedAt": "2017-02-08T18:38:56.640Z", "createdAt": "2017-02-08T18:38:56.560Z", "parsingScriptId": { "id": 14 }, "runParameters": { "data": [] }, "name": "POS-r01.txt", "description": "POS-r01.txt", "creator": { "id": 1 }, "updater": { "id": 1 }, "connection": null } |
You must retain the id value so you can reference it when you create the recipe. For more information, see operation/createImportedDataset.
Next, you create the recipe. Construct the following request:
| Endpoint | http://www.example.com:3005/v4/wrangledDatasets |
|---|---|
| Authentication | Required |
| Method | POST |
| Request Body | See the sketch below. |
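The request references the imported dataset from the previous step, the flow, and a name for the recipe. The field names in the following Python sketch are inferred from the response below.

```python
import requests

BASE_URL = "http://www.example.com:3005"
HEADERS = {"Authorization": "Bearer <your-access-token>"}  # assumed auth scheme

# Create a recipe (wrangledDataset) from the imported dataset and attach it
# to the flow. Field names are inferred from the response shown below.
resp = requests.post(
    f"{BASE_URL}/v4/wrangledDatasets",
    headers=HEADERS,
    json={
        "importedDataset": {"id": 8},  # id returned when the file was imported
        "flow": {"id": 10},            # flow located or created earlier
        "name": "POS-r01",
    },
)
resp.raise_for_status()
recipe_id = resp.json()["id"]  # 23 in this example
```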
You should receive a 201 - Created response with a response body similar to the following:
{ "id": 23, "wrangled": true, "updatedAt": "2018-02-06T19:59:22.735Z", "createdAt": "2018-02-06T19:59:22.698Z", "name": "POS-r01", "active": true, "referenceInfo": null, "activeSample": { "id": 23 }, "creator": { "id": 1 }, "updater": { "id": 1 }, "recipe": { "id": 23 }, "flow": { "id": 10 } } |
From the response, you must retain the value for id (23 in this example). For more information, see operation/createWrangledDataset.
Checkpoint: You have created a flow with multiple imported datasets and recipes.
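Since the same two calls must be issued for every source file, a scripted loop is a natural fit. The following Python sketch repeats the import and recipe-creation calls above for all five files; it assumes the same field names as before and a flow id of 10.

```python
import requests

BASE_URL = "http://www.example.com:3005"
HEADERS = {"Authorization": "Bearer <your-access-token>"}  # assumed auth scheme
FLOW_ID = 10  # flow located or created earlier

SOURCES = {
    "POS-r01": "hdfs:///user/pos/POS-r01.txt",
    "POS-r02": "hdfs:///user/pos/POS-r02.txt",
    "POS-r03": "hdfs:///user/pos/POS-r03.txt",
    "REF_PROD": "hdfs:///user/ref/REF_PROD.txt",
    "REF_CAL": "hdfs:///user/ref/REF_CAL.txt",
}

recipe_ids = {}
for name, uri in SOURCES.items():
    # Create the imported dataset for this file.
    imported = requests.post(
        f"{BASE_URL}/v4/importedDatasets",
        headers=HEADERS,
        json={"uri": uri, "name": name, "description": name},
    )
    imported.raise_for_status()

    # Create the recipe and attach it to the flow.
    wrangled = requests.post(
        f"{BASE_URL}/v4/wrangledDatasets",
        headers=HEADERS,
        json={
            "importedDataset": {"id": imported.json()["id"]},
            "flow": {"id": FLOW_ID},
            "name": name,
        },
    )
    wrangled.raise_for_status()
    recipe_ids[name] = wrangled.json()["id"]

print(recipe_ids)
```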
After you have created the flow with all of your source datasets, you can wrangle the base dataset to integrate all of the sources into it.
Steps for transactional data:
1. Open the POS-r01 dataset. It's loaded in the Transformer page.
2. Add a union transform. In the Search panel, enter union in the textbox and press ENTER.
3. Select the other two transactional datasets: POS-r02 and POS-r03.
NOTE: When you join or union one dataset into another, changes made in the joined dataset are automatically propagated to the dataset where it has been joined.
Steps for reference data:
In the POS data, the columns Store_Nbr and Item_Nbr are unique keys into the REF_CAL and REF_PROD datasets, respectively. Using the Join window, you can pull in the other fields from these reference datasets based on these unique keys.
1. Open the POS-r01 dataset.
2. In the Search panel, enter join datasets for the transform. The Join window opens.
3. Select the REF_PROD dataset. Click Accept. Click Next.
4. Specify the join keys. For each Item_Nbr value that has a matching ITEM_NBR value in the reference dataset, all of the other reference fields are pulled into the POS-r01 dataset.
You can repeat the above general process to integrate the reference data for stores.
Checkpoint: You have created a flow with multiple datasets and have integrated all of the relevant data into a single dataset.
Before you run a job, you must define output objects, which specify the location, format, and other settings applied to the generated results.
NOTE: You can continue with this workflow without creating outputObjects yet. In this workflow, overrides are applied during the job definition, so you don't have to create the outputObjects and writeSettings at this time.
For more information on creating outputObjects, writeSettings, and publications, see API Workflow - Manage Outputs.
Through the APIs, you can specify and run a job. In the above example, you must run the job for the terminal dataset, which is POS-r01 in this case. This dataset contains references to all of the other datasets. When the job is run, the recipes for the other datasets are also applied to the terminal dataset, which ensures that the output reflects the proper integration of these other datasets into POS-r01.
NOTE: In the following example, writeSettings have been specified as overrides in the job definition. These overrides are applied for this job run only. If you need to re-run the job with these settings, you must either 1) re-apply the overrides or 2) create the writeSettings objects. For more information, see API Workflow - Manage Outputs.
Steps:
Construct a request to run a job for the recipe you created for POS-r01, whose id is 23.
| Endpoint | http://www.example.com:3005/v4/jobGroups |
|---|---|
| Authentication | Required |
| Method | POST |

Request Body:
{ "wrangledDataset": { "id": 23 }, "overrides": { "execution": "photon", "profiler": true, "writesettings": [ { "path": "hdfs://hadoop:50070/trifacta/queryResults/admin@example.com/POS-r01.csv", "action": "create", "format": "csv", "compression": "none", "header": false, "asSingleFile": false } ] }, "ranfrom": null } |
This request launches the job for recipe 23 to execute on the Photon running environment, with profiling enabled. Output format is CSV, written to the designated path. For more information on these properties, see operation/runJobGroup.
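If you are scripting the run, the same request can be submitted as in the following Python sketch. The payload mirrors the request body above; only the auth header is an assumption.

```python
import requests

BASE_URL = "http://www.example.com:3005"
HEADERS = {"Authorization": "Bearer <your-access-token>"}  # assumed auth scheme

# Submit the job request shown above for recipe 23, overriding the running
# environment, profiling, and write settings for this run only.
job_request = {
    "wrangledDataset": {"id": 23},
    "overrides": {
        "execution": "photon",
        "profiler": True,
        "writesettings": [
            {
                "path": "hdfs://hadoop:50070/trifacta/queryResults/admin@example.com/POS-r01.csv",
                "action": "create",
                "format": "csv",
                "compression": "none",
                "header": False,
                "asSingleFile": False,
            }
        ],
    },
    "ranfrom": None,
}

resp = requests.post(f"{BASE_URL}/v4/jobGroups", headers=HEADERS, json=job_request)
resp.raise_for_status()
job_group_id = resp.json()["id"]  # 3 in this example
```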
A response code of 201 - Created is returned. The response body should look like the following:
{ "sessionId": "79276c31-c58c-4e79-ae5e-fed1a25ebca1", "reason": "JobStarted", "jobGraph": { "vertices": [ 21, 22 ], "edges": [ { "source": 21, "target": 22 } ] }, "id": 3, "jobs": { "data": [ { "id": 21 }, { "id": 22 } ] } } |
Retain the id value (3), which is the jobGroup identifier, for monitoring.
You can monitor the status of your job through the following endpoint:
| Endpoint | http://www.example.com:3005/v4/jobGroups/<id>/status |
|---|---|
| Authentication | Required |
| Method | GET |
| Request Body | None. |
When the job has successfully completed, the returned status message is the following:
"Complete"
For more information, see operation/runJobGroup.
In the future, you can re-run the job exactly as you specified it by executing the following call:
Tip: You can swap imported datasets before re-running the job. For example, if you have uploaded a new file, you can change the primary input dataset for the dataset and then use the following API call to re-run the job as specified.
| Endpoint | http://www.example.com:3005/v4/jobGroups |
|---|---|
| Authentication | Required |
| Method | POST |
| Request Body | See the sketch below. |
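The following Python sketch sends only the recipe reference and assumes the output settings have since been saved as outputObjects/writeSettings; if they have not, include the same overrides shown in the earlier job request.

```python
import requests

BASE_URL = "http://www.example.com:3005"
HEADERS = {"Authorization": "Bearer <your-access-token>"}  # assumed auth scheme

# Re-run the job for the same recipe. This sketch sends only the recipe
# reference; add the overrides block if your output settings are not saved.
resp = requests.post(
    f"{BASE_URL}/v4/jobGroups",
    headers=HEADERS,
    json={"wrangledDataset": {"id": 23}},
)
resp.raise_for_status()
print("new jobGroup id:", resp.json()["id"])
```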
The job is re-run as it was previously specified.