After you have created a flow, imported a dataset, and created a recipe for that dataset, you may need to swap in a different dataset and run the recipe against that one. This workflow steps through that process via the APIs.
NOTE: If you are processing multiple parallel datasources in a single job, you should create a dataset with parameters and then run the job. For more information, see API Workflow - Run Job on Dataset with Parameters.
This workflow utilizes the following methods:
Creating an imported dataset. After the new file has been added to the backend datastore, you can import it as an imported dataset.
Swapping the primary input dataset of the recipe, so that the recipe points to the new imported dataset.
Changing the source of an imported dataset to a different bucket and path (file-based datastores only).
Running a job on the recipe.
In this example, you are wrangling data from orders placed in different regions on a quarterly basis. When a new file drops, you want to be able to swap out the current dataset that is assigned to the recipe and swap in the new one. Then, run the job.
Example Files:
The following files are stored in HDFS:
Path and Filename | Description
---|---
hdfs:///user/orders/MyCo-orders-west-Q1.txt | Orders from West region for Q1
hdfs:///user/orders/MyCo-orders-west-Q2.txt | Orders from West region for Q2
hdfs:///user/orders/MyCo-orders-north-Q1.txt | Orders from North region for Q1
hdfs:///user/orders/MyCo-orders-north-Q2.txt | Orders from North region for Q2
hdfs:///user/orders/MyCo-orders-east-Q1.txt | Orders from East region for Q1
hdfs:///user/orders/MyCo-orders-east-Q2.txt | Orders from East region for Q2
You have already created a flow, which contains the following imported dataset and recipe:
NOTE: When an imported dataset is created via API, it is always imported as an unstructured dataset. Any recipe that references this dataset should contain the initial parsing steps required to structure the data.
Tip: Through the UI, you can import one of your datasets as unstructured. Create a recipe for this dataset and then edit it. In the Recipe panel, you should be able to see the structuring steps. Back in Flow View, you can chain your other recipes off of this structuring recipe. Dataset swapping should happen on the first recipe.
Object Type | Name | Id
---|---|---
Flow | MyCo-Orders-Quarter | 2
Imported Dataset | MyCo-orders-west-Q1.txt | 8
Recipe (wrangledDataset) | n/a | 9
Job | n/a | 3
Base URL:
For purposes of this example, the base URL for the platform is the following:
http://www.example.com:3005
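The calls in this workflow can also be scripted. Below is a minimal Python setup sketch that the later examples build on; the requests session, the API_TOKEN placeholder, and the Bearer-style Authorization header are assumptions, so substitute whatever authentication method your deployment uses.

```python
# Minimal setup sketch for scripting the calls in this workflow.
# Assumptions: token-based auth sent as a Bearer header; adjust for your deployment.
import requests

BASE_URL = "http://www.example.com:3005"   # example base URL from this workflow
API_TOKEN = "<your-access-token>"          # hypothetical placeholder

session = requests.Session()
session.headers.update({
    "Authorization": f"Bearer {API_TOKEN}",
    "Content-Type": "application/json",
})
```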
NOTE: You cannot add datasets to the flow through the flow endpoints. Instead, an imported dataset is added to the flow when it is assigned as the primary input of a recipe in that flow, as described in the swap step below.
The following steps describe how to create an imported dataset and assign it to the flow that has already been created (flowId=2).
Steps:
To create an imported dataset, you must acquire some information about the source, including its full path in the datastore.
In this example, the file you are importing is MyCo-orders-west-Q2.txt. Since the files are similar in nature and are stored in the same directory, you can gather this information from the imported dataset that is already part of the flow. Execute the following:
Endpoint | http://www.example.com:3005/v4/importedDatasets
---|---
Authentication | Required
Method | POST
Request Body | See the request sketch below.
The response should be a 201 - Created status code with something like the following:

{
  "id": 12,
  "size": "281032",
  "path": "hdfs:///user/orders/MyCo-orders-west-Q2.txt",
  "dynamicPath": null,
  "workspaceId": 1,
  "isSchematized": false,
  "isDynamic": false,
  "disableTypeInference": false,
  "createdAt": "2018-10-29T23:15:01.831Z",
  "updatedAt": "2018-10-29T23:15:01.889Z",
  "parsingRecipe": { "id": 11 },
  "runParameters": [],
  "name": "MyCo-orders-west-Q2.txt.txt",
  "description": "MyCo-orders-west-Q2.txt",
  "creator": { "id": 1 },
  "updater": { "id": 1 },
  "connection": null
}
You must retain the id value (12 in this example) so that you can reference it when you swap datasets for the recipe.
See operation/createImportedDataset.
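As a rough sketch, the create call above could be scripted as follows, continuing from the setup sketch earlier. The request body fields (path, name, description) are assumptions inferred from the response shown above; consult operation/createImportedDataset for the exact schema.

```python
# Sketch: create an imported dataset for the new file (body fields are assumptions).
create_payload = {
    "path": "hdfs:///user/orders/MyCo-orders-west-Q2.txt",
    "name": "MyCo-orders-west-Q2.txt",
    "description": "Orders from West region for Q2",
}
resp = session.post(f"{BASE_URL}/v4/importedDatasets", json=create_payload)
resp.raise_for_status()                      # expect 201 - Created
new_imported_dataset_id = resp.json()["id"]  # 12 in this example; retain for the swap step
```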
Checkpoint: You have imported a dataset that is unstructured and is not associated with any flow.
The next step is to swap the primary input dataset for the recipe to point at the newly imported dataset. This step automatically adds the imported dataset to the flow and drops the previous imported dataset from the flow.
NOTE: When you swap datasets, existing samples are not automatically discarded, even though they are no longer valid. As a workaround, you can generate a new sample manually. For more information on generating samples through the application, see Samples Panel.
Use the following to swap the primary input dataset for the recipe:
Endpoint | http://www.example.com:3005/v4/wrangledDatasets/9/primaryInputDataset
---|---
Authentication | Required
Method | PUT
Request Body | See the request sketch below.
The response should be a 200 - OK status code with something like the following:

{
  "id": 9,
  "wrangled": true,
  "createdAt": "2019-03-03T17:58:53.979Z",
  "updatedAt": "2019-03-03T18:01:11.310Z",
  "recipe": {
    "id": 9,
    "name": "POS-r01",
    "description": null,
    "active": true,
    "nextPortId": 1,
    "createdAt": "2019-03-03T17:58:53.965Z",
    "updatedAt": "2019-03-03T18:01:11.308Z",
    "currentEdit": { "id": 8 },
    "redoLeafEdit": { "id": 7 },
    "creator": { "id": 1 },
    "updater": { "id": 1 }
  },
  "referenceInfo": null,
  "activeSample": { "id": 7 },
  "creator": { "id": 1 },
  "updater": { "id": 1 },
  "referencedFlowNode": null,
  "flow": { "id": 2 }
}
See operation/updateInputDataset.
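Continuing from the setup sketch, the swap might be scripted as below. The request body shape, which identifies the new imported dataset by id, is an assumption; see operation/updateInputDataset for the exact schema.

```python
# Sketch: make imported dataset 12 the primary input of recipe (wrangledDataset) 9.
new_imported_dataset_id = 12                 # id returned by the create call above
swap_payload = {"importedDataset": {"id": new_imported_dataset_id}}

resp = session.put(
    f"{BASE_URL}/v4/wrangledDatasets/9/primaryInputDataset",
    json=swap_payload,
)
resp.raise_for_status()                      # expect 200 - OK
print(resp.json()["flow"])                   # the recipe now reports flow id 2
```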
For a file-based backend datastore, you can change the source of your imported dataset to point to a different bucket and path.
NOTE: This endpoint changes the source of the imported dataset itself. The wrangledDataset (recipe) continues to point to the imported dataset, which now points to the new source. Since the source of the imported dataset object is altered, this change affects all objects that reference the imported dataset, even in other flows.
Tip: This endpoint is useful if you have imported your flow into a different project that uses a different source bucket.
Use the following to change the source of your imported dataset:
Endpoint | http://www.example.com:3005/v4/importedDatasets/9/
---|---
Authentication | Required
Method | PUT
Request Body | See the request sketch below.
The response should be a 200 - OK status code with the imported dataset definition.
See operation/updateImportedDataset.
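A sketch of the source change, again continuing from the setup sketch. The field names in the request body and the new location used here are assumptions for illustration; see operation/updateImportedDataset for the supported fields.

```python
# Sketch: point an existing imported dataset at a different location.
# The "path" field name and the new path below are assumptions; for object stores
# you might also set a "bucket" field (assumption).
source_payload = {
    "path": "hdfs:///user/orders-archive/MyCo-orders-west-Q2.txt",  # hypothetical new path
}
resp = session.put(f"{BASE_URL}/v4/importedDatasets/9/", json=source_payload)
resp.raise_for_status()                      # expect 200 - OK with the imported dataset definition
```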
To execute a job on this recipe, you can simply re-run any job that was previously executed against the old imported dataset, since the job references the recipe (wrangledDataset) Id, which now points to the new imported dataset.
Endpoint | http://www.example.com:3005/v4/jobGroups
---|---
Authentication | Required
Method | POST
Request Body | See the request sketch below.
The job is re-run as it was previously specified.
If you need to modify any job parameters, you must create a new job definition.
After the job has been queued, you can track it to completion. See API Workflow - Develop a Flow.
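As a sketch, re-running the job and tracking it could look like the following, continuing from the setup sketch. The request body and the status values checked in the polling loop are assumptions; consult the jobGroups reference for the exact schema.

```python
# Sketch: run a job against recipe (wrangledDataset) 9 and poll the job group until it finishes.
import time

resp = session.post(f"{BASE_URL}/v4/jobGroups", json={"wrangledDataset": {"id": 9}})
resp.raise_for_status()
job_group_id = resp.json()["id"]

# Poll until the job group leaves its queued/running states (status names are assumptions).
while True:
    status = session.get(f"{BASE_URL}/v4/jobGroups/{job_group_id}").json().get("status")
    if status not in ("Created", "Pending", "InProgress"):
        break
    time.sleep(30)

print(f"Job group {job_group_id} finished with status: {status}")
```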
When you are satisfied with how your flow is working, you can set up periodic schedules using a third-party tool to execute the job on a regular basis.
The tool must hit the above endpoints to swap in the new dataset and run the job.
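For example, a hypothetical wrapper like the one below could be invoked by cron or another scheduler whenever a new file lands; it simply chains the earlier sketches (create, swap, run), and the request body fields carry the same assumptions noted above.

```python
# Sketch: hypothetical helper a scheduler could call for each new quarterly file.
def swap_and_run(new_path: str, recipe_id: int = 9) -> int:
    """Create an imported dataset for new_path, swap it into the recipe, and run a job."""
    created = session.post(
        f"{BASE_URL}/v4/importedDatasets",
        json={"path": new_path, "name": new_path.rsplit("/", 1)[-1]},
    )
    created.raise_for_status()
    new_id = created.json()["id"]

    swapped = session.put(
        f"{BASE_URL}/v4/wrangledDatasets/{recipe_id}/primaryInputDataset",
        json={"importedDataset": {"id": new_id}},
    )
    swapped.raise_for_status()

    run = session.post(f"{BASE_URL}/v4/jobGroups", json={"wrangledDataset": {"id": recipe_id}})
    run.raise_for_status()
    return run.json()["id"]                  # job group id to track to completion

# Example invocation for the next quarterly drop:
# swap_and_run("hdfs:///user/orders/MyCo-orders-west-Q3.txt")
```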