Comment: Published by Scroll Versions from space DEV and version r097

...

  • relational datasets (tables and views)
  • schematized files (e.g. Parquet)
  • file-based datasets (e.g. CSV files)

To assist with these issues, the application can be configured to monitor schema changes on your dataset. Schema validation performs the following actions on your dataset:

...

File settings

During the creation of an imported dataset, you can configure the following settings for schema validation:

Steps:

  1. After a file has been selected in the Import Data page, click Edit settings.
  2. In the Edit settings dialog:


    Setting: Effects on schema validation

    Detect structure: When enabled, the structure of the first chunk from the imported dataset is used for determining the schema of the dataset.

    NOTE: If the imported dataset is composed of multiple files, only the first file is used for schema validation purposes. If there are changes to the schema of the second or later files, they are undetected.

    When disabled, the structure of the file is ignored, and all data is imported as a single column. Schema validation is effectively disabled for the dataset.

    Infer header: The first row of data is used as the column headers.

    No headers: Default column names are used in the stored schema: column1, column2, and so on.
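
As a rough illustration of how the settings above affect the stored schema, the following Python sketch derives column names from the first chunk of a CSV file. It is not the application's implementation. The function name `stored_schema`, its parameters, and the choice to return `None` when structure detection is disabled are assumptions made for this example.

```python
import csv
import io
from typing import List, Optional


def stored_schema(
    first_chunk: str,
    detect_structure: bool = True,
    infer_header: bool = True,
) -> Optional[List[str]]:
    """Return the column names a schema check could store (hypothetical helper)."""
    if not detect_structure:
        # Assumption: with structure detection off, all data is one column and
        # schema validation is effectively disabled, so no schema is stored.
        return None

    rows = list(csv.reader(io.StringIO(first_chunk)))
    if not rows:
        return []

    first_row = rows[0]
    if infer_header:
        # "Infer header": the first row of data supplies the column headers.
        return first_row

    # "No headers": default names column1, column2, and so on.
    return [f"column{i}" for i in range(1, len(first_row) + 1)]


sample = "id,name\n1,Ada\n"
print(stored_schema(sample))
print(stored_schema(sample, infer_header=False))
print(stored_schema(sample, detect_structure=False))
```

Only the first chunk is inspected in this sketch, which mirrors the note above that later files in a multi-file dataset are not examined.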

For more information, see File Import Settings.

Use

When a job is launched, the schema validation check is performed in parallel with the data ingestion step. Schema validation checks for:

...

  • Reduces the number of duplicate or invalid datasets created from the same source.
  • Reduces challenges of replacing datasets and retaking samples. 

Limitations

...

  • If a column's data type is modified but no other changes, such as column name changes, are detected, the change is not considered a schema drift error.
  • You cannot refresh the schemas of reference datasets or uploaded sources.
  • Schema refresh does not apply to any file formats that require conversion to native formats.

    NOTE: Schema management does not work for JSON-based imported datasets that were created under the v1 legacy method of JSON import. All JSON imported datasets created under the legacy method (v1) of JSON import must be recreated to behave like v2 datasets with respect to conversion and schema management. Features developed in the future may not retroactively be supported in the v1 legacy mode. For more information, see Working with JSON v2.

NOTE: If you have imported a flow from an earlier version of the application, you may receive warnings of schema drift during job execution when there have been no changes to the underlying schema. This is a known issue. The workaround is to create a new version of the underlying imported dataset and use it in the imported flow.
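
The data type limitation listed above can be pictured with a small Python sketch. It assumes, purely for illustration, that a drift check compares column names and their order and ignores data types. The function `detect_schema_drift` and the `(name, type)` tuple layout are hypothetical and do not describe the application's internals.

```python
from typing import Dict, List, Tuple

Column = Tuple[str, str]  # (column name, data type): layout assumed for this sketch


def detect_schema_drift(stored: List[Column], current: List[Column]) -> Dict[str, object]:
    """Compare two schemas by column name and order only (hypothetical check)."""
    stored_names = [name for name, _ in stored]
    current_names = [name for name, _ in current]

    added = [n for n in current_names if n not in stored_names]
    removed = [n for n in stored_names if n not in current_names]
    # Same set of names but in a different order.
    reordered = not added and not removed and stored_names != current_names

    return {
        "added": added,
        "removed": removed,
        "reordered": reordered,
        "drift": bool(added or removed or reordered),
    }


stored = [("id", "Integer"), ("name", "String")]

# Only the data type of "id" changed: this sketch reports no drift.
type_only = [("id", "String"), ("name", "String")]
print(detect_schema_drift(stored, type_only)["drift"])

# A renamed column shows up as one removed name and one added name.
renamed = [("id", "Integer"), ("full_name", "String")]
print(detect_schema_drift(stored, renamed))
```

Because data types never enter the comparison, a type change on its own passes through without a schema drift error, which is the behavior the limitation describes.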

Limitations for parameterized datasets

Parameterized files:

NOTE: If you attempt to refresh the schema of a parameterized dataset based on a set of files, only the schema for the first file is checked for changes. If changes are detected, the other files are assumed to contain those changes as well. This can lead to changes being assumed or undetected in later files and potential data corruption in the flow.
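
To see why this matters, the following Python sketch mimics a refresh that inspects only the first file of a parameterized dataset. It is an illustration under stated assumptions, not the application's code. The names `refresh_from_first_file` and `files_differing_from_first`, and the representation of each file as a list of column names, are hypothetical.

```python
from typing import List, Tuple


def refresh_from_first_file(
    stored: List[str], files: List[List[str]]
) -> Tuple[List[str], bool]:
    """Refresh the schema using only the first file (hypothetical helper).

    Returns the new schema and whether a change was detected.
    """
    first = files[0]
    # Later files are never read; they are assumed to match the first file.
    return first, first != stored


def files_differing_from_first(files: List[List[str]]) -> List[int]:
    """Return indexes of files whose columns differ from the first file."""
    first = files[0]
    return [i for i, cols in enumerate(files) if cols != first]


stored = ["id", "name"]
files = [
    ["id", "name"],           # first file: unchanged
    ["id", "name", "email"],  # second file: gained a column
]

schema, changed = refresh_from_first_file(stored, files)
print(changed)                            # no change is reported
print(files_differing_from_first(files))  # yet the second file differs
```

A pre-check along the lines of `files_differing_from_first`, run outside the application before a refresh, is one possible way to spot files that would otherwise go unexamined.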

Parameterized tables:

NOTE: Refreshing the schema of a parameterized dataset using custom SQL is not supported.

Effects of refreshing schemas

...

For more information on how to refresh the schemas of your datasets, see:

...