When a dataset is initially loaded into the Transformer page, one or more steps may be automatically added to the new recipe in order to assist in parsing the data. The added steps are based on the type of data that is being loaded and the ability of the application to recognize the structure of the data.

File Encoding

When a text file is used as an imported dataset, the product assumes that the imported files are encoded in UTF-8 by default.

...

NOTE: In some cases, imported files are not properly parsed due to issues with encryption types or encryption keys in the source datastore. For more information, please contact your datastore administrator.

As needed, you can change the encoding to use when parsing individual files. In the Import Data page, click Edit Settings in the right-hand panel.
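Before import, it may help to check whether a file actually matches the default UTF-8 assumption. The following is a minimal sketch using only the Python standard library; the function name `is_valid_utf8` and the sample bytes are hypothetical and are not part of the product.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Hypothetical samples: the same text encoded two different ways.
utf8_bytes = "café,1\n".encode("utf-8")
latin1_bytes = "café,1\n".encode("latin-1")

print(is_valid_utf8(utf8_bytes))    # UTF-8 input decodes cleanly
print(is_valid_utf8(latin1_bytes))  # the Latin-1 byte for "é" is not valid UTF-8 here
```

If a check like this fails, the file is a candidate for a different encoding setting at import time.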

...

  

Automatic Structure Detection

NOTE: By default, these steps do not appear in the recipe panel due to automatic structure detection. If you are having issues with the initial structuring of your dataset, you may choose to re-import the dataset with Detect structure disabled. Then, you can review this section to identify how to manually structure your data. For more information on changing the import settings for a dataset, see Import Data Page.

This section provides information on how to apply initial parsing steps to unstructured imported datasets. These steps should be applied through the recipe panel.

...

  • In some cases, the application may be unable to create this header row. Instead, the columns are titled column1, column2, column3, and so on.
  • If the column names are split across multiple rows in your dataset, you may need to modify the header column naming transformation step. For more information, see Rename Columns.
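The fallback naming pattern described above can be illustrated with a short sketch. This is an illustration only: `default_column_names` is a hypothetical helper, the sample rows are invented, and nothing here reflects how the application generates names internally.

```python
from typing import List


def default_column_names(width: int) -> List[str]:
    """Generate placeholder names column1..columnN for a headerless dataset."""
    return [f"column{i}" for i in range(1, width + 1)]


# Hypothetical rows from a dataset with no usable header row.
rows = [["2021-01-04", "widget", "3"], ["2021-01-05", "gadget", "7"]]
names = default_column_names(len(rows[0]))
print(names)  # ['column1', 'column2', 'column3']
```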

Converted Data

Some formats, such as binary data or JSON, are converted to a format that is natively understood by the product before the data is available for sampling and transformation.

Excel

...

Microsoft Excel files are internally converted to CSV files and then loaded into the Transformer page. The resulting CSV files are processed using the general parsing steps described in the previous section. For more information, see Import Excel Data.

JSON

If 80% of the records in an imported dataset are valid JSON objects, then the data is parsed as JSON through a conversion process.

Notes:

  • For JSON files, it is important to import them in unstructured format.
  • The product requires that JSON files be submitted with one valid JSON object per line.
    • Multi-line JSON import is not supported.
    • Consistently malformed JSON objects or objects that span line breaks might cause the import to fail.
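To estimate ahead of time whether a file is likely to meet the 80% threshold and the one-object-per-line requirement, a rough pre-import check could look like the sketch below. It is not the product's detection logic: the helper `valid_json_object_ratio` is hypothetical, and both skipping blank lines and treating the threshold as inclusive are assumptions.

```python
import json
from typing import Iterable


def valid_json_object_ratio(lines: Iterable[str]) -> float:
    """Fraction of non-empty lines that parse as a JSON object (a dict)."""
    # Assumption: blank lines are ignored rather than counted as invalid.
    candidates = [ln for ln in lines if ln.strip()]
    if not candidates:
        return 0.0
    valid = 0
    for ln in candidates:
        try:
            if isinstance(json.loads(ln), dict):
                valid += 1
        except json.JSONDecodeError:
            pass  # malformed or truncated record
    return valid / len(candidates)


# Hypothetical sample: four complete objects and one broken across a line break.
sample = [
    '{"id": 1, "name": "a"}',
    '{"id": 2, "name": "b"}',
    '{"id": 3, "name": "c"}',
    '{"id": 4, "name": "d"}',
    '{"id": 5, "name":',
]
ratio = valid_json_object_ratio(sample)
# Assumption: the 80% threshold is compared inclusively.
print(ratio, ratio >= 0.8)
```

Lines that parse as JSON arrays or scalars are counted as invalid here, because the text above refers specifically to JSON objects.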

For more information, see Working with JSON v2.

Database Tables

Properly formatted database tables with a provided schema should not require any initial parsing steps.

Known Issues

  • Some characters in imported datasets, such as NUL (ASCII character 0) characters, may cause problems with recognizing line breaks. If initial parsing is having trouble with line breaks, you may need to fix the issue in the source data prior to import, since the Splitrows transformation must be the first step in your recipe. 
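One way to fix NUL characters in the source data prior to import is to strip them from a copy of the file. The sketch below assumes a UTF-8 or single-byte encoding; in UTF-16 or UTF-32 files, zero bytes are a legitimate part of the encoding and must not be removed this way. The helper `strip_nul_bytes` and the sample data are hypothetical.

```python
from typing import Tuple


def strip_nul_bytes(data: bytes) -> Tuple[bytes, int]:
    """Return (cleaned_bytes, removed_count) with NUL (0x00) bytes removed."""
    removed = data.count(b"\x00")
    return data.replace(b"\x00", b""), removed


# Hypothetical CSV content containing two stray NUL bytes.
raw = b"id,name\n1,al\x00ice\n2,bob\x00\n"
cleaned, removed = strip_nul_bytes(raw)
print(removed)  # 2
print(cleaned)  # b'id,name\n1,alice\n2,bob\n'
```

Reporting the number of removed bytes makes it easier to confirm whether NUL characters were actually the cause of the line-break problem.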

...

The new initial parsing steps are now inserted into the recipe flow before the recipe steps in development.
