
...

  1. Create an imported dataset. After the new file has been added to the backend datastore, you can import it into the platform as an imported dataset.

  2. Swap dataset. Using the ID of the imported dataset you created, you can now assign the dataset to the recipe in your flow.
  3. Run a job. Run the job against the dataset.
  4. Monitor progress. Monitor the progress of the job until it is complete. The sketch below shows this workflow end to end.
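
Taken together, the workflow is three API calls plus monitoring. Below is a minimal sketch in Python using the requests library; the endpoints, request bodies, and IDs are taken from the example that follows, while the Authorization header is an assumption to be replaced with your deployment's authentication scheme.

Code Block
import requests

BASE = "http://www.example.com:3005"
HEADERS = {"Authorization": "Bearer <your-access-token>"}  # assumed auth scheme

# 1. Create an imported dataset for the new file.
resp = requests.post(
    f"{BASE}/v4/importedDatasets",
    headers=HEADERS,
    json={
        "path": "hdfs:///user/orders/MyCo-orders-west-Q2.txt",
        "name": "MyCo-orders-west-Q2.txt",
        "description": "MyCo-orders-west-Q2",
    },
)
resp.raise_for_status()
new_dataset_id = resp.json()["id"]  # id 12 in the example below

# 2. Swap dataset: make the new dataset the primary input of recipe 9.
resp = requests.put(
    f"{BASE}/v4/wrangledDatasets/9/primaryInputDataset",
    headers=HEADERS,
    json={"importedDataset": {"id": new_dataset_id}},
)
resp.raise_for_status()

# 3. Run a job against the recipe.
resp = requests.post(
    f"{BASE}/v4/jobGroups",
    headers=HEADERS,
    json={"wrangledDataset": {"id": 9}},
)
resp.raise_for_status()
job_group_id = resp.json()["id"]  # retain for monitoring (step 4)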

...

The following files are stored on your HDFS deployment:

Path and Filename                              Description
hdfs:///user/orders/MyCo-orders-west-Q1.txt    Orders from West region for Q1
hdfs:///user/orders/MyCo-orders-west-Q2.txt    Orders from West region for Q2
hdfs:///user/orders/MyCo-orders-north-Q1.txt   Orders from North region for Q1
hdfs:///user/orders/MyCo-orders-north-Q2.txt   Orders from North region for Q2
hdfs:///user/orders/MyCo-orders-east-Q1.txt    Orders from East region for Q1
hdfs:///user/orders/MyCo-orders-east-Q2.txt    Orders from East region for Q2

Assumptions

You have already created a flow, which contains the following imported dataset and recipe:

...

Tip: Through the UI, you can import one of your datasets as unstructured. Create a recipe for this dataset and then edit it. In the Recipe panel, you should be able to see the structuring steps. Back in Flow View, you can chain your structural recipe off of this one. Dataset swapping should happen on the first recipe.

Object Type                Name                      Id
flow                       MyCo-Orders-Quarter       2
Imported Dataset           MyCo-orders-west-Q1.txt   8
Recipe (wrangledDataset)   n/a                       9
Job                        n/a                       3

 

Base URL:

For purposes of this example, the base URL for the platform is the following:

Code Block
http://www.example.com:3005

...

  1. To create an imported dataset, you must acquire the following information about the source. 

    1. path
    2. type
    3. name
    4. description
    5. bucket (if the file is stored on S3)
  2. In this example, the file you are importing is MyCo-orders-west-Q2.txt. Since the files are similar in nature and are stored in the same directory, you can acquire this information from the imported dataset that is already part of the flow (see the sketch after these steps). Execute the following:

    Endpoint: http://www.example.com:3005/v4/importedDatasets
    Authentication: Required
    Method: POST
    Request Body:
    Code Block
    {
      "path": "hdfs:///user/orders/MyCo-orders-west-Q2.txt",
      "name": "MyCo-orders-west-Q2.txt",
      "description": "MyCo-orders-west-Q2"
    }
    
  3. The response should be a 201 - Created status code with something like the following:

    Code Block
    {
        "id": 12,
        "size": "281032",
        "path": "hdfs:///user/orders/MyCo-orders-west-Q2.txt",
        "dynamicPath": null,
        "workspaceId": 1,
        "isSchematized": false,
        "isDynamic": false,
        "disableTypeInference": false,
        "createdAt": "2018-10-29T23:15:01.831Z",
        "updatedAt": "2018-10-29T23:15:01.889Z",
        "parsingRecipe": {
            "id": 11
        },
        "runParameters": [],
        "name": "MyCo-orders-west-Q2.txt",
        "description": "MyCo-orders-west-Q2",
        "creator": {
            "id": 1
        },
        "updater": {
            "id": 1
        },
        "connection": null
    }
  4. You must retain the id value so you can reference it when you swap the dataset into the recipe.

  5. For more information, see the API reference documentation for the createImportedDataset operation.
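
Because the new file shares its directory, naming convention, and type with the imported dataset already in the flow (id 8), you can script the gathering step as well. The following is a sketch in Python, assuming the requests library, the assumed Authorization header from the earlier sketch, and that a GET variant of the importedDatasets endpoint is available on your deployment (an assumption; check your API reference):

Code Block
import requests

BASE = "http://www.example.com:3005"
HEADERS = {"Authorization": "Bearer <your-access-token>"}  # assumed auth scheme

# Read the existing imported dataset (id 8) and derive the Q2 request body.
# GET /v4/importedDatasets/{id} is assumed here; check your API reference.
existing = requests.get(f"{BASE}/v4/importedDatasets/8", headers=HEADERS)
existing.raise_for_status()
src = existing.json()

payload = {
    "path": src["path"].replace("Q1", "Q2"),
    "name": src["name"].replace("Q1", "Q2"),
    "description": src["description"].replace("Q1", "Q2"),
}
resp = requests.post(f"{BASE}/v4/importedDatasets", headers=HEADERS, json=payload)
assert resp.status_code == 201, f"unexpected status: {resp.status_code}"
new_dataset_id = resp.json()["id"]  # retain this id for the swap step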

...

  1. Use the following to swap the primary input dataset for the recipe:

    Endpoint: http://www.example.com:3005/v4/wrangledDatasets/9/primaryInputDataset
    Authentication: Required
    Method: PUT
    Request Body:
    Code Block
    {
      "importedDataset": {
        "id": 12
      }
    }
  2. The response should be a 200 - OK status code with something like the following:

    Code Block
    {
        "id": 9,
        "wrangled": true,
        "createdAt": "2019-03-03T17:58:53.979Z",
        "updatedAt": "2019-03-03T18:01:11.310Z",
        "recipe": {
            "id": 9,
    x        "name": "POS-r01",
    x        "description": null,
            "active": true,
            "nextPortId": 1,
            "createdAt": "2019-03-03T17:58:53.965Z",
            "updatedAt": "2019-03-03T18:01:11.308Z",
            "currentEdit": {
                "id": 8
            },
            "redoLeafEdit": {
                "id": 7
            },
            "creator": {
                "id": 1
            },
            "updater": {
                "id": 1
            }
        },
        "referenceInfo": null,
        "activeSample": {
            "id": 7
        },
        "creator": {
            "id": 1
        },
        "updater": {
            "id": 1
        },
        "referencedFlowNode": null,
        "flow": {
            "id": 2
        }
    }
  3. The new imported dataset is now the primary input for the recipe, and the old imported dataset has been removed from the flow. You can confirm the swap with the read-back sketch below.
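
To confirm the swap programmatically, you can read the recipe's primary input back. A sketch in Python, assuming the requests library and that the primaryInputDataset endpoint also supports GET (an assumption; check your API reference):

Code Block
import requests

BASE = "http://www.example.com:3005"
HEADERS = {"Authorization": "Bearer <your-access-token>"}  # assumed auth scheme

# Read back the recipe's primary input; GET on this endpoint is an assumption.
resp = requests.get(f"{BASE}/v4/wrangledDatasets/9/primaryInputDataset", headers=HEADERS)
resp.raise_for_status()
print(resp.json())  # should now reference the new imported dataset (id 12)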

...

To execute a job on this recipe, you can simply re-run any job that was executed on the old imported dataset. The job request references the recipe (wrangledDataset) by its id, which has not changed.

 

Endpoint: http://www.example.com:3005/v4/jobGroups
Authentication: Required
Method: POST
Request Body:
Code Block
{
  "wrangledDataset": {
    "id": 9
  }
}

The job is re-run as it was previously specified.
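
To cover the final monitoring step, you can poll the job group until it reaches a terminal state. A sketch in Python, assuming the requests library; the /status sub-endpoint and the status strings shown are assumptions, so confirm them against your API reference:

Code Block
import time
import requests

BASE = "http://www.example.com:3005"
HEADERS = {"Authorization": "Bearer <your-access-token>"}  # assumed auth scheme

job_group_id = 42  # hypothetical id returned by the POST /v4/jobGroups call

# Poll until the job group reaches a terminal state. The /status endpoint
# and these status values are assumptions; check your API reference.
while True:
    resp = requests.get(f"{BASE}/v4/jobGroups/{job_group_id}/status", headers=HEADERS)
    resp.raise_for_status()
    status = resp.json()
    if status in ("Complete", "Failed", "Canceled"):
        break
    time.sleep(10)

print(f"Job group {job_group_id} finished with status: {status}")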

...