- relational datasets (tables and views)
- schematized files (e.g. Parquet)
- file-based datasets (e.g. CSV files)
NOTE: Schema validation is not supported for CSV files.
To assist with these issues, the webapp can be configured to monitor schema changes on your dataset. Schema validation performs the following actions on your dataset:
During the creation of an imported dataset, you can configure the following settings for schema validation:
- After a file has been selected in the Import Data page, click Edit settings.
In the Edit settings dialog:
Each setting has the following effects on schema validation:
- Detect structure:
  - When enabled, the structure of the first chunk from the imported dataset is used for determining the schema of the dataset.
    NOTE: If the imported dataset is composed of multiple files, only the first file is used for schema validation purposes. If there are changes to the schema of the second or later files, they are undetected.
  - When disabled, the structure of the file is ignored, and all data is imported as a single column. Schema validation is effectively disabled for the dataset.
- Infer header: The first row of data is used as the column headers.
- No headers: Default column names are used in the stored schema: column2, and so on.
For more information, see File Import Settings.
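As a rough illustration of how these settings shape the stored schema, the following Python sketch derives column names from the first chunk of a delimited file. It is not the application's implementation. The function name `derive_schema`, its parameters, and the `column<N>` naming pattern (inferred from the `column2` example above) are assumptions made for this example.

```python
import csv
import io


def derive_schema(sample_text, detect_structure=True, infer_header=True):
    """Return column names derived from the first chunk of a delimited file.

    Hypothetical helper: the name, parameters, and column<N> pattern are
    assumptions for illustration, not the product's API.
    """
    if not detect_structure:
        # Structure is ignored, so all data is treated as a single column.
        return ["column1"]
    # Only the first row of the sample is needed to determine the columns.
    first_row = next(csv.reader(io.StringIO(sample_text)), [])
    if infer_header:
        # The first row of data supplies the column headers.
        return [name.strip() for name in first_row]
    # No headers: fall back to positional default names.
    return [f"column{i}" for i in range(1, len(first_row) + 1)]


sample = "id,name,amount\n1,Ann,9.50\n2,Bob,3.25\n"
print(derive_schema(sample))                          # ['id', 'name', 'amount']
print(derive_schema(sample, infer_header=False))      # ['column1', 'column2', 'column3']
print(derive_schema(sample, detect_structure=False))  # ['column1']
```

With structure detection off, the schema has only one column, so there is nothing meaningful left to validate.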
When a job is launched, the schema validation check is performed in parallel with the data ingestion step. Schema validation checks for:
- Reduces challenges of replacing datasets and retaking samples.
- If a column's data type is modified but no other changes, such as column name changes, are detected, the change is not considered a schema drift error.
- You cannot refresh the schemas of reference datasets or uploaded sources.
- Schema refresh does not apply to any file formats that require conversion to native formats.
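The rule above that a data type change on its own is not a schema drift error can be sketched in Python. This is a minimal sketch under stated assumptions, not the application's actual check: the function `find_schema_drift` is hypothetical, schemas are modeled as ordered lists of `(name, type)` pairs, and drift is assumed to mean added, removed, or reordered columns (a rename shows up as one removal plus one addition).

```python
def find_schema_drift(stored, current):
    """Compare two schemas given as ordered lists of (name, type) pairs.

    Hypothetical helper. Returns a list of drift messages; type-only
    changes are deliberately ignored and produce an empty list.
    """
    stored_names = [name for name, _ in stored]
    current_names = [name for name, _ in current]
    added = [n for n in current_names if n not in stored_names]
    removed = [n for n in stored_names if n not in current_names]
    drift = [f"column added: {n}" for n in added]
    drift += [f"column removed: {n}" for n in removed]
    # Same set of names but a different sequence counts as drift here (assumption).
    if not added and not removed and stored_names != current_names:
        drift.append("column order changed")
    return drift


stored = [("id", "Integer"), ("amount", "Decimal")]

# Only the type of "id" changed: not reported as drift.
print(find_schema_drift(stored, [("id", "String"), ("amount", "Decimal")]))
# []

# "amount" was renamed to "total": reported as drift.
print(find_schema_drift(stored, [("id", "Integer"), ("total", "Decimal")]))
# ['column added: total', 'column removed: amount']
```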
NOTE: Schema management does not work for JSON-based imported datasets that were created under the v1 legacy method of JSON import. All JSON imported datasets created under the legacy method (v1) of JSON import must be recreated to behave like v2 datasets with respect to conversion and schema management. Features developed in the future may not retroactively be supported in the v1 legacy mode. For more information, see Working with JSON v2.
NOTE: If you have imported a flow from an earlier version of the application, you may receive warnings of schema drift during job execution when there have been no changes to the underlying schema. This is a known issue. The workaround is to create a new version of the underlying imported dataset and use it in the imported flow.
Limitations for parameterized datasets
NOTE: If you attempt to refresh the schema of a parameterized dataset based on a set of files, only the schema for the first file is checked for changes. If changes are detected, the other files are assumed to contain those changes as well. This can lead to changes being assumed or undetected in later files and potential data corruption in the flow.
NOTE: Refreshing the schema of a parameterized dataset using custom SQL is not supported.
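Because only the first file of a file-based parameterized dataset is checked, one way to reduce the risk is to compare headers across all files yourself before refreshing. The following Python sketch shows such a pre-check performed outside the application. The function names and the sample file names are hypothetical, and the files are represented as in-memory CSV text to keep the example self-contained.

```python
import csv
import io


def header_of(csv_text):
    """Return the first row of a CSV document, or an empty list if it is empty."""
    return next(csv.reader(io.StringIO(csv_text)), [])


def files_differing_from_first(files):
    """Return the names of files whose header differs from the first file's.

    `files` maps a file name to its CSV text, in the order the files
    would be read. Hypothetical helper for illustration only.
    """
    names = list(files)
    if not names:
        return []
    baseline = header_of(files[names[0]])
    return [n for n in names[1:] if header_of(files[n]) != baseline]


files = {
    "sales_01.csv": "id,amount\n1,5\n",
    "sales_02.csv": "id,amount\n2,7\n",
    "sales_03.csv": "id,total,region\n3,9,EU\n",  # schema changed in a later file
}
print(files_differing_from_first(files))  # ['sales_03.csv']
```

Any file name returned by this check points to a schema change that a first-file-only comparison would miss.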
Effects of refreshing schemas
For more information on how to refresh the schemas of your datasets, see: