
D toc

Excerpt

Version 1: This section describes how you can import JSON files into 

D s product
rtrue
, convert them to tabular format, wrangle them, and then export them back in the same JSON format.

The basic method is described by way of example. In the example, the JSON file must be imported into 

D s product
rtrue
, a new column must be inserted into the JSON, and the resulting JSON must be exported in the same structure.

Enable

Warning

This legacy method of working with JSON is likely to be deprecated in a future release. Features developed in the future, such as schema management, are unlikely to support this method of JSON import.

If you have existing flows that were created using this legacy method, they should continue to work as expected. However, you should migrate your flows to use the newer version as soon as possible. See Migrate below.

Info

NOTE: The legacy version of JSON import is required if you are working with compressed JSON files or Newline JSON files only.

This method can be enabled through workspace settings. For more information, see Workspace Settings Page.

Migrate to v2

Any flow that you have created using the v1 version of the JSON importer should work without modification.

In the future, the v1 version will be deprecated. You can use the following method to migrate your flows to use the new version of the JSON importer.

Info

NOTE: The v1 version of JSON import remains supported for existing imported datasets. If you use these datasets in other flows, those flows are likely to require the modifications that you have made in your recipes.

Basic process:

This migration process creates new versions of these imported datasets and fixes recipes accordingly.

  1. Through the Library for Data page, locate the imported datasets that are based on JSON files.
    1. You may be able to locate them by searching for json.
  2. For each JSON imported dataset:
    1. Click the link.
    2. In the Dataset Details page, copy the value for the Location. Paste it into a text file.
    3. In the Dataset Details page, locate the flow or flows where the dataset is in use.

      Tip

      Tip: If you copy the link address of the flow and paste it into a text file, you can paste that later into a browser and jump directly to the flow.

    4.  Repeat the above steps for each JSON-based imported dataset.

  3. You should now have a list of links to the source data and the flows where your JSON imported datasets are in use. 
  4. In the Library page, create a new version of each imported dataset:
    1. Click Import Data.
    2. Click the appropriate connection. 
    3. Paste the link to the Location where the source is stored.
    4. The data is ingested through the conversion service.

      Tip

      Tip: Click the icon for the dataset in the right panel. All rows in the Preview panel should be properly structured. Nested data may not be broken out into separate columns at this time.

    5. Rename the dataset as needed.

      Tip

      Tip: You should give each new version of the imported dataset a consistent prefix or suffix tag, such as -v2. Later, you can locate these new imported datasets easily through search in the Library for Data page.

    6. Click Continue.
  5. Repeat the above steps for each imported dataset that you are updating to v2.
  6. For each of these flows:
    1. Navigate to it. 
    2. Locate the v1 imported dataset in it. You might copy the name. 
    3. Click Add Datasets. Search for the v2 imported dataset. Add it to the flow.
  7. In Flow View:
    1. Click the recipe that is in use with the v1 version of the imported dataset. In the context menu in the right panel, select Make a copy > without inputs.
    2. Select the copied recipe. 
    3. In the context menu in the right panel, select Change input. Select the v2 imported dataset.
    4. Your v2 imported dataset is now connected to a version of your recipe. 
    5. Select the recipe object. In the right panel, you should see a preview of the recipe steps. 

      Info

      NOTE: In the recipe, the steps where you modified the imported dataset into tabular format are likely to be broken. This is expected.

  8. Click Edit Recipe.
  9. In the Transformer page:
    1. Disable recipe step 1.
    2. Review the state of the data grid to see if the data is organized in tabular form. 
    3. If not, repeat the above steps for the next step in your recipe.
    4. Continue until the data is in tabular form.
  10. After some additional tweaking, your recipe should contain no broken steps, and your data should appear in tabular form. 
  11. You may wish to run a job or download your sample data to compare it to outputs from your v1 imported dataset and steps. You may need to create an output object first.
  12. You can now integrate these changes in either of the following ways:
    1. Apply to existing recipe: Change the input on the existing recipe to the v2 imported dataset. Apply any disabling of steps and other tweaks to the recipe connected to the v1 imported dataset.

      Info

      NOTE: Before applying the above changes, you might want to download the v1 recipe through the Recipe panel.

    2. Use v2 recipe in the flow: You could simply switch over to using the new recipe. Caveats:
      1. You must recreate any outputs and schedules from the v1 recipe.  
      2. Internal identifiers for the new recipe and its outputs are different from the v1 recipe. These new identifiers may impact API-based automation.
      3. Other application objects that reference the v1 recipes, such as flow tasks in your plans, must be fixed to use the new recipe or output objects.
  13. Run a production job to verify that your flow is producing consistent data with the v2 imported dataset. 
  14. Repeat as needed for other flows.

JSON Input and Output

Input: 

  • It is easier to work with JSON in which each row of the file is a record. When a record spans multiple rows, additional steps are required in the application to render it into tabular format. The example uses multi-row JSON records.

Output:

Info

NOTE: JSON-formatted files that are generated by

D s product
are rendered in JSON Lines format, which is a single-line-per-record variant of JSON. For more information, see http://jsonlines.org.

  • D s product
     can generate a JSON file as an output for your job. Characteristics of generated JSON files:
    • Newline-delimited: The end of each record is the \n character. If your downstream system expects comma-delimited records except for the last one, additional work is required outside of the application (see the sketch after this list).
    • Non-nested: Each record in the generated file is flat. 
      • For multi-level JSON hierarchies, you can nest columns together and leave the top level as a set of columns in the data grid. However, the second and lower hierarchies appear as quoted string values in the output. Additional cleanup is required outside of the application. 
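If you need a comma-delimited JSON array for a downstream system, a short script can reassemble the generated file. The following is a minimal sketch in Python; the filenames output.json and output_array.json are hypothetical placeholders for your own paths.

Code Block
import json

# Hypothetical paths; substitute the locations of your generated output.
# Each line of the generated file is one self-contained JSON record.
with open("output.json") as src:
    records = [json.loads(line) for line in src if line.strip()]

# Write the records back out as a standard, comma-delimited JSON array.
with open("output_array.json", "w") as dst:
    json.dump(records, dst, indent=2)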

Example

This example dataset contains information on books. In this case:

  • The data is submitted as one attribute per row. A single JSON record spans many rows.
  • The total number of books is three.
  • The JSON data has two hierarchies.
Code Block
"book": {
  "id": "bk101",
  "author": "Guy, Joe",
  "title": "Json Guide",
  "genre": "Computer",
  "price": "44.95",
  "publish_date": "2002-04-26",
  "characteristics": {
    "cover_color": "black",
    "paper_stock": "20",
    "paper_source": "new"
  },
  "description": "An in-depth look at creating applications."
},
"book": {
  "id": "bk102",
  "author": "Nelson, Rogers",
  "title": "When Doves Cry",
  "genre": "Biography",
  "price": "24.95",
  "publish_date": "2016-04-21",
  "characteristics": {
    "cover_color": "white",
    "paper_stock": "15",
    "paper_source": "recycled"
  },
  "description": "Biography of a prince."
},
"book": {
  "id": "bk103",
  "author": "Fitzgerald, F. Scott",
  "title": "The Great Gatsby",
  "genre": "Fiction",
  "price": "9.95",
  "publish_date": "1925-04-10",
  "characteristics": {
    "cover_color": "blue",
    "paper_stock": "20",
    "paper_source": "new"
  },
  "description": "Classic American novel."
}

JSON Ingestion Process

  1. Import the JSON file.

    Info

    NOTE: During import, you should deselect the Detect Structure option. You are likely to need to rebuild the initial parsing steps to consume the file properly. Details are provided later.

  2. If needed, convert loose JSON to a single JSON record per row. 
  3. Unnest the data into columns.
    1. Each level in the JSON hierarchy must be un-nested in a separate step.
  4. When all of the JSON data is in tabular form, perform any 
    D s lang
     transformations. 
  5. If you need to retain the hierarchy, you must nest the lower levels of the JSON hierarchy back into their original form. Leave the top level un-nested.
    1. If it is acceptable to write out flat JSON records, you can export without nesting the data again.
  6. Run the job, generating a JSON output. 

Step - Import the file 

  1. Through the Import Data page, navigate and select your JSON file for import. 
  2. When the file has been loaded, click Edit settings for the dataset card in the right panel. In the Import Settings dialog, deselect the Detect Structure checkbox.
  3. Complete the rest of the import process. 
  4. In Flow View, add the JSON-based imported dataset to your flow and create a recipe for it.
  5. Select the recipe, and click Edit Recipe...

Step - Convert to one JSON record per row

Info

NOTE: This step is required only if a single JSON record in your imported dataset spans multiple rows. If you have single-row JSON records in the Transformer page, please skip to the next section.

  1. In the Transformer page, you should see your loosely formatted JSON in a single column. Each row contains a separate attribute, and a single record spans multiple rows.
  2. Open the Recipe panel on the right side. The initial parsing steps for the data are displayed. 
  3. In the Recipe panel, delete all steps except the first one.
  4. The first one is a Break into rows transformation. This transformation can only appear in the first step of a recipe. 
  5. Select the step, and click the Pencil icon to edit it.
  6. In the Transform Builder, the Split on value is probably the \n character.

  7. The above signals to the application to break up the data into individual rows on the newline (\n) character. This transformation then breaks up your loose JSON on every single attribute. You must modify the Split on value so that it captures only the first attribute of each JSON record. For the above dataset, the Split on value must be the following, noting the space after the colon:

    Code Block
    "book": 
  8. Click Add to save the step again. 
  9. The above dataset should now have four rows, with the first one an empty row. This empty row is caused by the insertion of the \n in front of the first instance of the above string. In the column histogram, select the gray bar, which selects the empty row. In the Suggestions panel, locate the Delete rows suggestion, and click Add. The row is removed.
  10. You now have individual rows for each JSON record. (A scripted approximation of this split is sketched below.)
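If it helps to reason about this step outside the application, the following Python sketch approximates the Break into rows behavior on the example data. The path books.json is a hypothetical placeholder.

Code Block
# books.json is a hypothetical path to the loose JSON shown above.
with open("books.json") as f:
    raw = f.read()

# Splitting on the first attribute of each record mirrors the
# Break into rows transformation with '"book": ' as the Split on value.
rows = raw.split('"book": ')

# The first element is empty because the file begins with the split value;
# this corresponds to the empty row that you delete in the application.
rows = [r for r in rows if r.strip()]
print(len(rows))  # 3 records for the example dataset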

Step - Convert JSON to Object type

The next step involves converting your JSON records to a column of Object type values. The Object data type is a means of rendering records into key-value pairs. However, its structure is a little different from JSON. For more information, see Object Data Type.

Steps:

The following steps convert your JSON to an Object data type. 

  1. JSON files often use indentation and line breaks for readability. You should remove these indentations if they appear in your dataset. For our two-layered example, you can use the following transformation:

    D trans
    p03Value\"
    Typeref
    p01NameColumn
    p01Valuecolumn1
    p02NameFind
    p02Value/\n\s*"/
    p03NameReplace with
    p04Valuetrue
    p04NameMatch all occurrences
    SearchTermReplace text or patterns
    1. In the above, the key term is the Find pattern, which is a regular expression:

      Code Block
      /\n\s*"/
    2. The two forward slashes at either end define the pattern as a regular expression.
    3. The content in the middle matches on the pattern of a newline character, an arbitrary number of spaces, and a double quote.
    4. This pattern is replaced with just the double-quote, removing the preceding part of the pattern from the dataset.
    5. For more information on matching patterns, see Text Matching.
  2. In standard JSON, a comma is used to demarcate the end of a line or a record, except for the last one in a set. 
    1. In the above example, the first two records have commas at the end of them. Here is a snippet of their ends:

      Code Block
      ... "description":"An in-depth look at creating applications."},
      ... "description":"Biography of a prince."},
      ... "description":"Classic American novel."}
    2. To convert these records to Object type, the commas at the end of the first two rows must be removed:
      D trans
      p03Value}
      Typeref
      p01NameColumn
      p01Valuecolumn1
      p02NameFind
      p02Value`\n\},\n{end}`
      p03NameReplace with
      p04Valuetrue
      p04NameMatch all occurrences
      SearchTermReplace text or patterns

      1. The above transformation is similar to the previous one. However, in this one, the Find pattern uses a 

        D s lang
        itempattern
         to indicate that the pattern should only be matched at the end of a record:

        Code Block
        {end}
      2. This token in the pattern prevents it from matching if there are other instances of the pattern nested within the record. (A scripted approximation of both replacements is sketched after these steps.)
  3. Individual records should look similar to the following:

    D s property overflow

    Code Block
    {"id": "bk101","author": "Guy, Joe","title": "Json Guide","genre": "Computer", \
    "price": "44.95","publish_date": "2002-04-26","characteristics": {"cover_color": "black", \
    "paper_stock": "20","paper_source": "new"}, \
    "description": "An in-depth look at creating applications."}
  4. These records are suitable for conversion to Object data type. 
  5. To change the data type for the column, click the icon to the left of the column header. Select Object.
  6. The column data type is changed to Object. The step to change data type is added to your recipe, too.
  7. The column histogram may now display some mismatched records. If so:
    1. Review those records to determine what is malformed.
    2. Delete the recipe step that changes the data type to Object.
    3. Make fixes as necessary.
    4. Switch back to Object data type. Iterate as needed until all records are valid when the column is converted to Object type.
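If it helps to verify the two Replace transformations outside the application, the following Python sketch applies equivalent regular expressions to one record. The patterns mirror the Find values above; note that a similar pass may be needed for newlines before closing braces, depending on how your source is indented.

Code Block
import re

record = '''{
  "id": "bk101",
  "description": "An in-depth look at creating applications."
},'''

# Equivalent of Find /\n\s*"/, Replace with '"': removes the newline and
# indentation in front of each attribute.
record = re.sub(r'\n\s*"', '"', record)

# Equivalent of matching the trailing comma only at the end of the record
# (the {end} token in the application's pattern syntax).
record = re.sub(r'\},\s*\Z', '}', record)

print(record)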

Step - Unnest JSON records

The next step is to convert your JSON records to tabular format. 

Info

NOTE: For JSON records that have multiple levels in the hierarchy, you should unnest the top level of the hierarchy first, followed by each successive level.

Tip

Tip: The easiest way to unnest is to select the column header for the column containing your Object data. Unnest should be one of the suggested options. If not, you can use the following process.


  1. In the Recipe panel, click New Step.
  2. In the Search panel, enter unnest object elements.
  3. Specify the following transformation. Substitute the Path to elements values below with the top-level keys in your JSON records:

    D trans
    p03Valueauthor
    p06NamePath to elements5
    p09Valuetrue
    p01NameColumn
    p06Valueprice
    p03NamePath to elements2
    p07Valuepublish_date
    p04Valuetitle
    SearchTermUnnest object elements
    p07NamePath to elements6
    p09NameRemove elements from original
    Typeref
    p05NamePath to elements4
    p01Valuecolumn1
    p02NamePath to elements1
    p02Valueid
    p05Valuegenre
    p04NamePath to elements3
    p08Valuedescription
    p08NamePath to elements7

    1. In the above, each Path to elements entry specifies a key in the JSON record. The key's associated value becomes the value in the new column, which is given the same name as the key. 
    2. This step breaks out the key-value pairs for the specified keys into separate columns in the dataset. (A scripted approximation of this unnesting is sketched after these steps.)

      Tip

      Tip: You can choose whether to remove the original elements from the source column. In deeper or wider JSON files, removing them can help you identify what remains to be unnested.

  4. Repeat the above process for the next level in the hierarchy. In the example, this step means unnesting the characteristics node:

    D trans
    p03Valuecharacteristics.paper_stock
    Typeref
    p05NameRemove elements from original
    p01NameColumn
    p01Valuecolumn1
    p02NamePath to elements1
    p02Valuecharacteristics.cover_color
    p05Valuetrue
    p03NamePath to elements2
    p04Valuecharacteristics.paper_source
    p04NamePath to elements3
    SearchTermUnnest object elements
  5. You can now delete column1. From the column menu to the right of column1, select Delete.
  6. You have now converted your JSON to tabular format.

    Tip

    Tip: If the above set of steps needs to be applied to multiple files, consider stopping your work and returning to Flow View. Select this recipe and click Add New Recipe. If you add successive steps in the new recipe, the first recipe can be reused for initial processing of your JSON files, separate from any wrangling that you do for individual files.

    Tip

    Tip: The unnesting process may have moved some columns into positions that are different from their order in the original JSON. Use the Move command from the column menu to reposition your columns.
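For reference, the following Python sketch approximates the effect of the two unnesting steps on a single record. It assumes the record has already been parsed from the Object column; the names are taken from the example dataset.

Code Block
import json

record = json.loads('{"id": "bk101", "author": "Guy, Joe", '
                    '"characteristics": {"cover_color": "black", '
                    '"paper_stock": "20", "paper_source": "new"}}')

# First pass: break out the top-level keys into their own columns.
row = {k: v for k, v in record.items() if not isinstance(v, dict)}

# Second pass: unnest the characteristics node, prefixing each key with
# its path, as the application does (characteristics.cover_color, ...).
for key, value in record.get("characteristics", {}).items():
    row[f"characteristics.{key}"] = value

print(row)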

Step - Wrangle your dataset

Your JSON data is ready for wrangling. 

In the following example, the discount column is created. If the publication date is before 01/01/2000, then the discount is 0.1 (10%):

D trans
p03Valuediscount
Typeref
p01NameFormula type
p01ValueSingle row formula
p02NameFormula
p02ValueIF(publish_date < DATE(2000, 1, 1), 0.1, 0)
p03NameNew column name
SearchTermNew formula
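For comparison, here is a rough Python equivalent of this formula; it is illustrative only, not what the application executes.

Code Block
from datetime import date

# Illustrative equivalent of IF(publish_date < DATE(2000, 1, 1), 0.1, 0)
def discount(publish_date: date) -> float:
    return 0.1 if publish_date < date(2000, 1, 1) else 0.0

print(discount(date(1925, 4, 10)))  # 0.1 -- The Great Gatsby qualifies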

Continue adding steps until you have transformed your data as needed and are ready to run a job on it.

Step - Nest the JSON records

Info

NOTE: If your desired JSON output does not include multiple hierarchies, you can skip this section. Generated JSON files contain a single JSON record per row.

If you run a job using the recipe created so far on the example data, the application generates a newline-delimited JSON file that contains no hierarchies. However, the source dataset is a two-level hierarchy, so the elements in the characteristics hierarchy are written out in the following manner:

Code Block
"characteristics.cover_color":"black","characteristics.paper_stock":20,"characteristics.paper_source":"new",
"characteristics.cover_color":"white","characteristics.paper_stock":15,"characteristics.paper_source":"recycled",
"characteristics.cover_color":"blue","characteristics.paper_stock":20,"characteristics.paper_source":"new",

You can take one of two approaches:

  1. Generate the JSON file with a flat hierarchy. Output looks like the above. Use an external tool to unnest the second and lower hierarchies appropriately.
  2. Re-nest the lower hierarchies until you have a single flat record containing some Object type columns that hold the underlying hierarchies. When the re-nested JSON records are exported, secondary hierarchies appear as escaped string values. More details are provided later.

If you are re-nesting the lower hierarchies, you can use the following approach. 

Tip

Tip: The following steps reshape your data. You may wish to create a new recipe as an output of the previous recipe where you can add the following steps.


  1. When you re-nest, work from the lowest tier of the hierarchy to the top. 
  2. In the example, the following columns should be nested together: characteristics.cover_color, characteristics.paper_stock, and characteristics.paper_source:

    D trans
    p03Valuecharacteristics.paper_source
    Typeref
    p05NameNew column name
    p01Namecolumn1
    p01Valuecharacteristics.cover_color
    p02Namecolumn2
    p02Valuecharacteristics.paper_stock
    p05Valuecharacteristics
    p03Namecolumn3
    p04ValueObject
    p04NameNest columns to
    SearchTermNest columns into Objects
  3. In the generated characteristics column, you can remove the characteristics. from the key value:

    D trans
    p03Value(empty)
    Typeref
    p01NameColumn
    p01Valuecharacteristics
    p02NameFind
    p02Value`characteristics.`
    p03NameReplace with
    SearchTermReplace text or patterns
  4. Now, delete the three source columns:

    D trans
    p03Valuecharacteristics.paper_source
    Typeref
    p01Namecolumn1
    p01Valuecharacteristics.cover_color
    p02Namecolumn2
    p02Valuecharacteristics.paper_stock
    p03Namecolumn3
    SearchTermDelete columns
  5. Repeat the above steps for the next level of the hierarchy in your dataset. (A scripted approximation of this re-nesting is sketched after these steps.)

    Info

    NOTE: Do not nest the columns at the top level of the hierarchy.
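For reference, the following Python sketch approximates the effect of the nest, prefix-cleanup, and delete steps on one flat record; the column names are taken from the example dataset.

Code Block
row = {
    "id": "bk101",
    "characteristics.cover_color": "black",
    "characteristics.paper_stock": "20",
    "characteristics.paper_source": "new",
}

# Nest the characteristics.* columns into a single Object-like value,
# stripping the "characteristics." prefix from each key.
prefix = "characteristics."
nested = {k[len(prefix):]: v for k, v in row.items() if k.startswith(prefix)}

# Delete the three source columns, keeping only the nested column.
row = {k: v for k, v in row.items() if not k.startswith(prefix)}
row["characteristics"] = nested

print(row)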

Step - Generate JSON output

When you are ready, you can run the job. Create or modify a publishing action to generate a JSON file for output. See Run Job Page.

When the job completes, you can click the JSON link in the Output Destinations tab of the Job Details page to download your JSON file. See Job Details Page.

Step - Final Cleanup

Outside the application, you may need to do the following:

  1. Since the JSON output is newline-delimited, your downstream system may require you to add commas at the end of each record except the last one. 
  2. If you have re-nested JSON hierarchies into your flat records, the exported JSON for secondary hierarchies appears as quoted strings, like the following:

    Code Block
    "characteristics":"{\"cover_color\":\"black\",\"paper_stock\":\"20\",\"paper_source\":\"new\"}",
    "characteristics":"{\"cover_color\":\"white\",\"paper_stock\":\"15\",\"paper_source\":\"recycled\"}",
    "characteristics":"{\"cover_color\":\"blue\",\"paper_stock\":\"20\",\"paper_source\":\"new\"}",

    These quoted strings can be fixed with a simple search and replace, or with a short script such as the following sketch. 
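The following Python sketch shows one way to perform both cleanup tasks: it parses the escaped characteristics strings back into nested objects and writes the records as a comma-delimited array. The paths output.json and books_clean.json are hypothetical placeholders.

Code Block
import json

# output.json is a hypothetical path to the generated JSON Lines file.
records = []
with open("output.json") as f:
    for line in f:
        record = json.loads(line)
        # Secondary hierarchies arrive as quoted strings; parse them
        # back into real nested objects.
        if isinstance(record.get("characteristics"), str):
            record["characteristics"] = json.loads(record["characteristics"])
        records.append(record)

# Writing an array restores comma-delimited records for downstream systems.
with open("books_clean.json", "w") as out:
    json.dump(records, out, indent=2)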

 

D s also
inCQLtrue
label((label = "structuring_tasks") OR (label = "json"))