This example walks through the process of identifying or creating a flow, creating datasets within it, and executing a job, using a combination of the application interface and the APIs.
NOTE: This API workflow applies to a Development instance of the platform.
In this example, you are attempting to wrangle monthly point of sale (POS) data from three separate regions into a single dataset for the state. This monthly data must be enhanced with information about the products and stores in the state. So, the example has a combination of transactional and reference data, which must be brought together into a single dataset.
Tip: To facilitate re-execution of this job each month, the transactional data should be stored in a dedicated directory. This directory can be overwritten with next month's data using the same filenames. As long as the new files are structured in an identical manner to the original ones, the new month's data can be processed by re-running the API aspects of this workflow.
Example Files:
The following files are stored on your HDFS deployment:
Path and Filename | Description |
---|---|
hdfs:///user/pos/POS-r01.txt | Point of sale transactions for Region 1. |
hdfs:///user/pos/POS-r02.txt | Point of sale transactions for Region 2. |
hdfs:///user/pos/POS-r03.txt | Point of sale transactions for Region 3. |
hdfs:///user/ref/REF_PROD.txt | Reference data on products for the state. |
hdfs:///user/ref/REF_CAL.txt | Reference data on stores in the state. |
NOTE: The reference and transactional data are stored in separate directories. In this case, you can assume that the user has read access to these locations through his or her account.
Base URL:
For purposes of this example, the base URL for the platform is the following:
http://www.example.com:3005
To begin, you must locate a flow or create a flow through the APIs to contain the datasets that you are importing.
NOTE: You cannot add datasets to the flow through the flows endpoint. Datasets are associated with the flow when they are created, as described below.
Locate:
NOTE: If you know the display name value for the flow and are confident that the name is not shared with any other flows, you can use the APIs to retrieve the flowId. See API Flows Get List v3.
In the Flow Details page for that flow, locate the flow identifier in the URL:
Flow Details URL | http://www.example.com:3005/flows/10 |
---|---|
Flow Id | 10 |
Retain this identifier for later use.
Create:
Through the APIs, you can create a flow using the following call:
Endpoint | http://www.example.com:3005/v3/flows |
---|---|
Authentication | Required |
Method | POST |
Request Body | See the example below. |
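The request body for this call is not preserved above. Based on the response that follows, a minimal body presumably supplies the flow's name and description; treat this as a sketch rather than the definitive schema:

```json
{
  "name": "Point of Sale - 2013",
  "description": "Point of Sale data for state"
}
```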
The response should be status code 201 - Created with a response body like the following:

```json
{
  "id": 10,
  "name": "Point of Sale - 2013",
  "description": "Point of Sale data for state",
  "createdBy": 1,
  "updatedBy": 1,
  "updatedAt": "2017-02-17T17:08:57.848Z",
  "createdAt": "2017-02-17T17:08:57.848Z"
}
```
Retain the flow identifier (10) for later use.

Checkpoint: You have identified or created the flow to contain your dataset or datasets.
To create datasets from the above sources, you must create an imported dataset for each source file and then create a recipe (wrangled dataset) for it within your flow.
The following steps describe how to complete these actions via API for a single file.
Steps:
To create an imported dataset, you must know some basic information about the source, such as its path and storage type. In the above example, the source is the POS-r01.txt file.
Construct the following request:
Endpoint | http://www.example.com:3005/v3/importedDatasets |
---|---|
Authentication | Required |
Method | POST |
Request Body | See the example below. |
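The request body itself is not shown above. Judging from the fields echoed back in the response below, a plausible minimal body identifies the source by its path and storage type; any additional fields the endpoint may accept (for example, a display name) are not shown here:

```json
{
  "path": "/user/pos/POS-r01.txt",
  "type": "hdfs"
}
```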
You should receive a 201 - Created response with a response body similar to the following:

```json
{
  "id": 8,
  "size": "281032",
  "path": "/user/pos/POS-r01.txt",
  "isSharedWithAll": false,
  "type": "hdfs",
  "bucket": null,
  "isSchematized": false,
  "createdBy": 1,
  "updatedBy": 1,
  "updatedAt": "2017-02-08T18:38:56.640Z",
  "createdAt": "2017-02-08T18:38:56.560Z",
  "connectionId": null,
  "parsingScriptId": 14,
  "cpProject": null
}
```
You must retain the id value (8 in this example) so you can reference it when you create the recipe.
Next, you create the recipe. Construct the following request:
Endpoint | http://www.example.com:3005/v3/wrangledDataset |
---|---|
Authentication | Required |
Method | POST |
Request Body | See the example below. |
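The request body itself is not preserved above. A plausible sketch, assuming related objects are referenced by nested id (the same pattern used in the job request later in this example), ties the imported dataset (id 8) to the flow (id 10); the exact field names here are inferred, not confirmed:

```json
{
  "importedDataset": { "id": 8 },
  "flow": { "id": 10 },
  "name": "POS-r01"
}
```

The response below confirms that the resulting recipe is attached to flow 10 and named POS-r01.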
You should receive a 201 - Created response with a response body similar to the following:

```json
{
  "id": 23,
  "flowId": 10,
  "scriptId": 24,
  "wrangled": true,
  "createdBy": 1,
  "updatedBy": 1,
  "updatedAt": "2017-02-08T20:28:06.067Z",
  "createdAt": "2017-02-08T20:28:06.067Z",
  "flowNodeId": null,
  "deleted_at": null,
  "activesampleId": null,
  "name": "POS-r01",
  "active": true
}
```
Retain the id value (23) from this response; you need it to run a job for this recipe. For more information, see API WrangledDatasets Create v3. Repeat the above steps for each of the remaining source files.

Checkpoint: You have created a flow with multiple imported datasets and recipes.
After you have created the flow with all of your source datasets, you can wrangle the base dataset to integrate all of the sources into it.
Steps for transactional data:

1. Open the POS-r01 dataset. It is loaded in the Transformer page.
2. Add a union transform. In the Search panel, enter union in the textbox and press ENTER.
3. Select the other two transactional datasets: POS-r02 and POS-r03.
Steps for reference data:
The columns Store_Nbr and Item_Nbr are unique keys into the REF_CAL and REF_PROD datasets, respectively. Using the Join panel, you can pull in the other fields from these reference datasets based on these unique keys.

1. Open the POS-r01 dataset.
2. In the Search panel, enter join for the transform. The Join panel opens.
3. Select the REF_PROD dataset. Click Accept. Click Next.
4. Specify Item_Nbr and ITEM_NBR as the join keys. For each Item_Nbr value that has a matching ITEM_NBR value in the reference dataset, all of the other reference fields are pulled into the POS-r01 dataset.
5. You can repeat the above general process to integrate the reference data for stores.
Checkpoint: You have created a flow with multiple datasets and have integrated all of the relevant data into a single dataset.
Through the APIs, you can specify and run a job. In the above example, you must run the job for the terminal dataset, which is POS-r01 in this case. This dataset contains references to all of the other datasets. When the job is run, the recipes for the other datasets are also applied to the terminal dataset, which ensures that the output reflects the proper integration of these other datasets into POS-r01.
Steps:
To run the job, you need the internal identifier of the recipe (wrangled dataset) that you created above, which is 23 in this example. Construct the following request:
Endpoint | http://www.example.com:3005/v3/jobGroups |
---|---|
Authentication | Required |
Method | POST |
Request Body:
{ "wrangledDataset": { "id": 23 }, "overrides": { "execution": "photon", "profiler": true, "writesettings": [ { "path": "hdfs://hadoop:50070/trifacta/queryResults/admin@trifacta.local/cdr_txt.csv", "action": "create", "format": "csv", "compression": "none", "header": false, "asSingleFile": false } ] }, "ranfrom": null } |
This request specifies the job for wrangled dataset 23 to execute on the Photon running environment with profiling enabled. A response code of 201 - Created is returned. The response body should look like the following:

```json
{
  "jobgroupId": 3,
  "jobIds": [5, 6],
  "reason": "JobStarted",
  "sessionId": "9c2c6220-ef2d-11e6-b644-6dbff703bdfc"
}
```
Retain the jobgroupId value (3 in this example) for monitoring.
You can monitor the status of your job through the following endpoint:
Endpoint | http://www.example.com:3005/v3/jobGroups/<id>/status |
---|---|
Authentication | Required |
Method | GET |
Request Body | None. |
When the job has successfully completed, the returned status message is the following:
"Complete" |
For more information, see API JobGroups Get Status v3.
In the future, you can re-run the job exactly as you specified it by executing the following call:
Tip: You can swap imported datasets before re-running the job. For example, if you have uploaded a new file, you can change the primary input dataset for the wrangled dataset and then use the following API call to re-run the job as specified. See API WrangledDatasets Put PrimaryInputDataset v3.
Endpoint | http://www.example.com:3005/v3/jobGroups/<id> |
---|---|
Authentication | Required |
Method | POST |
Request Body | See the example below. |
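The request body for this call is not preserved above. A plausible minimal body, assuming the call follows the same pattern as the original job request but without overrides, simply references the wrangled dataset again; if the <id> in the URL is sufficient to identify the job group, an empty body may also work. This is an assumption, not a confirmed schema:

```json
{
  "wrangledDataset": { "id": 23 }
}
```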
The job is re-run as it was previously specified.
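As a sketch of the swap described in the tip above: assuming the primary input dataset is updated with a PUT to a URL of the form http://www.example.com:3005/v3/wrangledDatasets/23/primaryInputDataset (a path inferred from the endpoint name, not confirmed here), and assuming the new month's file has been imported as a hypothetical imported dataset with id 9, the body would reference that new dataset:

```json
{
  "importedDataset": { "id": 9 }
}
```

After the swap, re-running the job as shown above processes the new month's data without redefining the recipe.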
If you need to modify any job parameters, you must create a new job definition.