Page tree

Trifacta Dataprep



Contents:

   

Contents:


Version 2: This section describes how you can import JSON files into  Dataprep by Trifacta®, convert them to tabular format, wrangle them, and then export them back in the same JSON format.

The basic workflow is described by way of example. In the example workflow, the JSON file must be imported into  Dataprep by Trifacta®, a new column must be inserted into the JSON, and the resulting JSON must be exported in the same structure.

Enable

This method of working with JSON is enabled by default.

For more information on disabling, see Dataprep Project Settings Page.


You can choose to continue using the legacy method of working with JSON.

NOTE: The legacy version of JSON import is required if you are working with compressed JSON files or only Newline JSON files.

You should migrate your flows to using the new version.

NOTE: The legacy version of working with JSON is likely to be deprecated in a future release.

For more information on migrating to the new version, see Working with JSON v1.

Requirements

JSON input 

  • Recommended limit of 1 GB in source file size. Since conversion happens within the Trifacta node, this limit may vary depending on the memory of the Trifacta node.
  • Filename extensions must be .json or .JSON
  • Conversion of compressed JSON files is not supported. Compressed JSON files can be imported using the previous method. See Working with JSON v1.
  • For best results, all keys and values should be quoted and imported as strings. 

    NOTE: Escape characters that make JSON invalid can cause your JSON file to fail to import.

    • You can escape quote values to treat them as literals in your strings using the backslash character. For example: \"

    • When the values are imported into the Transformer page, the Trifacta application re-infers the data type for each column.
JSON structureDescriptionSupported?
Newline

The newline character (\n) denotes the end of a record. Each record can contain the keys (object or array) and values for the JSON object.

Tip: This version is supported through through both versions of JSON import, but it performs better in v1. If you are using the Newline form of JSON exclusively, you should use v1.


Supported

Top-level objectTop-level row contains keys for mapping JSON objectsSupported
Top-level arrayTop-level row contains array of objectsSupported

JSON output

NOTE: JSON-formatted files that are generated by Dataprep by Trifacta are rendered in JSON Lines format, which is a single line per-record variant of JSON. For more information, see http://jsonlines.org.

  • Dataprep by Trifacta can generate a JSON file as an output for your job. Characteristics of generated JSON files:
    • Newline-delimited: The end of each record is the \n character. If your downstream system is expecting comma-delineated records except for the last one, additional work is required outside of the application.
    • Non-nested: Each record in the generated file is flat. 
      • For multi-level JSON hierarchies, you can nest columns together and leave the top level as a set of columns in the data grid. However, on output, the second and lower hierarchies appear as quoted string values in the output. Additional cleanup is required outside of the application. 

Example

The following example contains records of images from a website:

{"metrics": [{"rank": "1043", "score" : "9679"}], "caption": "Such a good boy!", "id": "9kt8ex", "url": "https://www.example.com/w285fpp11.jpg", "filename": "w285fpp11.jpg"}
{"metrics": [{"rank": "1042", "score" : "9681"}], "caption": "This sweet puppy has transformed our life!", "id": "9x2774", "url": "https://www.example.com/fmll0cy11.jpg", "filename": "fmll0cy11.jpg"}
{"metrics": [{"rank": "1041", "score" : "9683"}], "caption": "We sure love our fur babies.", "id": "a8guou", "url": "https://www.example.com/mljnmq521.jpg", "filename": "mljnmq521.jpg"}

Notes:

  • Each row is a complete JSON record containing keys and values.

    Tip: Nested JSON, such as metrics above, can be inserted as part of a record. It can then be unnested within the application.

  • Each key's value must have a comma after it, except for the final key value in any row. 

    NOTE: The end of a JSON record is the right curly bracket (}). Commas are not added to the end of each line in this format.

JSON Workflow

  1. Import the JSON file.

  2. Any nested data must be unnested within columns. Each level in the JSON hierarchy must be un-nested in a separate step.
  3. When all of the JSON data is in tabular form, perform any  Wrangle  transformations. 
  4. If you need to rebuild the loose JSON hierarchy, you must nest the lower levels of the JSON hierarchy back into their original form.
    1. If it is ok to write out flat JSON records, you can export without nesting the data again.
  5. Run the job, generating a JSON output. 

Step - Import the file 

  1. Through the Import Data page, navigate and select your JSON file for import. 

    NOTE: File formats are detected based on the file extension. Please verify that your file extension is .json or .JSON, which ensures that it is passed through the conversion service.

    1. The file is passed through the conversion process, which reviews the JSON file and stores it on the base storage layer in a format that can be easily ingested as in row-per-record format. This process happens within the Import Data page. You can track progress on the right side of the screen.

  2. After the file has been converted, click the Preview icon on the right side of the screen. In the Preview, you can review the first few rows of the imported file.

    1. If some rows are missing from the preview, then you may have a syntax error in the first row after the last well-structured row. You should try to fix this in source and re-import.

    2. If all of the rows are problematic, your data is likely malformed.

  3. Complete the rest of the import process. For more information, see Import Data Page

  4. Add the JSON-based imported dataset to a flow and create a recipe for it. For more information, see Flow View Page.
    1. Select the recipe, and click Edit Recipe...

In the Transformer page, the example above should look like the following:

metricscaptionidurlfilename
[{"rank":"1043","score":"9679"}]Such a good boy!9kt8exhttps://www.example.com/w285fpp11.jpgw285fpp11.jpg
[{"rank":"1042","score":"9681"}]This sweet puppy has transformed our life!9x2774https://www.example.com/fmll0cy11.jpgfmll0cy11.jpg
[{"rank":"1041","score":"9683"}]We sure love our fur babies.a8guouhttps://www.example.com/mljnmq521.jpgmljnmq521.jpg


Step - Unnest JSON records

Your JSON records are in tabular format. If you have nested JSON objects within your JSON records, the next step is to unnest your JSON records. 

NOTE: For JSON records that have multiple levels in the hierarchy, you should unnest the top level of the hierarchy first, followed by each successive level.

Tip: The easiest way to unnest is to select the column header for the column containing your nested data. Unnest should be one of the suggested options, and the suggestion should include the specification for the paths to the key values. If not, you can use the following process.

 

  1. In the Recipe panel, click New Step
  2. In the Search panel, enter unnest values into new columns
  3. Specify the following transformation. Substitute the Paths to elements values below with the top-level keys in your JSON records:

    Transformation Name Unnest values into new columns
    Parameter: Column metrics
    Parameter: Path to elements1 [0]

    Tip: You can choose to remove the original from the source or not. In deeper or wider JSON files, removing can help to identify what remains to be unnested.

  4. In the above transformation, the bracketing array around the set of values has been broken down into raw JSON. This value may now be interpreted as a String data type. From the column drop-down, you can select Object data type.
  5. Click the column head again, or specify the following transformation to unnest the Object column:
    Transformation Name Unnest Objects into columns
    Parameter: Column 0
    Parameter: Path to elements1 rank
    Parameter: Path to elements2 score

    1. In the above, each Paths to elements entry specifies a key in the JSON record. The key's associated value becomes the value in the new column, which is given the same name as the key. 
    2. So, this step breaks out the key-value pairs for the specified keys into separate columns in the dataset. 

  6. Repeat the above process for the next level in the hierarchy. 
  7. You can now delete the source columns. In the example, these source columns are named metrics and 0

    Tip: SHIFT + click these columns and then select Delete columns from the right panel. Click Add.

  8. Repeat the above steps for each nested JSON object.

    Tip: If the above set of steps needs to be applied to multiple files, you might consider stopping your work and returning to Flow View. Select this recipe and click Add New Recipe. If you add successive steps in another recipe, the first one can be used for doing initial processing of your JSON files, separate from any wrangling that you may do for individual files.

    Tip: The unnesting process may have moved some columns into positions that are different from their order in the original JSON. Use the Move command from the column menu to reposition your columns.

Step - Wrangle your dataset

Your JSON data is ready for wrangling. Continue adding steps until you have transformed your data as needed and are ready to run a job on it.

Step - Nest the JSON records

NOTE: If your desired JSON output does not include multiple hierarchies, you can skip this section. The generated JSON files are a single JSON record per row.

If you ran a job on the example dataset, the output would look like the following:

{"rank":1043,"score":9679,"caption":"Such a good boy!","id":"9kt8ex","url":"https://www.example.com/w285fpp11.jpg","filename":"w285fpp11.jpg"}
{"rank":1042,"score":9681,"caption":"This sweet puppy has transformed our life!","id":"9x2774","url":"https://www.example.com/fmll0cy11.jpg","filename":"fmll0cy11.jpg"}
{"rank":1041,"score":9683,"caption":"We sure love our fur babies.","id":"a8guou","url":"https://www.example.com/mljnmq521.jpg","filename":"mljnmq521.jpg"}

Suppose you want to nest the url and filename columns into a nested array called, resources.

Re-nest the lower hierarchies until have you have a single flat record, containing some Object type columns that hold the underlying hierarchies. When the re-nested JSON records are exported, secondary hierarchies appear as escaped string values. More details later.

Tip: The following steps reshape your data. You may wish to create a new recipe as an output of the previous recipe where you can add the following steps.

Steps:

  1. SHIFT + click the url and filename columns. Then, select Nest columns in the right-hand panel. This transformation should look like the following:

    Transformation Name Nest columns into Objects
    Parameter: column1 url
    Parameter: column2 filename
    Parameter: Nest columns to Object
    Parameter: New column name column1
  2. column1 now contains an Object mapping of the two columns. You can now nest this column again into an Array:
    Transformation Name Nest columns into Objects
    Parameter: Columns column1
    Parameter: Nest columns to Array
    Parameter: New column name resources
  3. Delete column1.
  4. Continue nesting other columns in a similar fashion. Repeat the above steps for the next level of the hierarchy in your dataset.
  5. You must re-nested from the bottom of the target hierarchy to the top. 

    NOTE: Do not nest the columns at the top level of the hierarchy.

  6. When the column names contain all of the keys that you wish to generate in the top-level JSON output, you can run the job.

Step - Generate JSON output

When you are ready, you can run the job. Create or modify a publishing action to generate a JSON file for output. See Run Job Page.

When the job completes, you can click the JSON link in the Output Destinations tab of the Job Details page to download your JSON file. See Job Details Page.

Output file for the above example should look like the following:

{"rank":1043,"score":9679,"caption":"Such a good boy!","id":"9kt8ex","url":"https://www.example.com/w285fpp11.jpg","filename":"w285fpp11.jpg","resources":[{"url":"https://www.example.com/w285fpp11.jpg","filename":"w285fpp11.jpg"}]}
{"rank":1042,"score":9681,"caption":"This sweet puppy has transformed our life!","id":"9x2774","url":"https://www.example.com/fmll0cy11.jpg","filename":"fmll0cy11.jpg","resources":[{"url":"https://www.example.com/fmll0cy11.jpg","filename":"fmll0cy11.jpg"}]}
{"rank":1041,"score":9683,"caption":"We sure love our fur babies.","id":"a8guou","url":"https://www.example.com/mljnmq521.jpg","filename":"mljnmq521.jpg","resources":[{"url":"https://www.example.com/mljnmq521.jpg","filename":"mljnmq521.jpg"}]}

This page has no comments.