Page tree

Release 9.2


Contents:

   

Contents:


A schema refers to the sequence and data type of columns in a dataset. Schemas are applicable to relational tables and some file formats. This section provides an overview of how Designer Cloud powered by Trifacta® Enterprise Edition enables the capture and tracking of changes of input schemas as well as the methods available for transforming your data to match a target schema.

Overview of Schemas

A schema is a skeleton structure that represents the logical view of the dataset. The dataset can be a file, table, or a SQL query in a database. A schema defines how the data is structured and organized. Schema information includes:

  • Column names
  • Column ordering
  • Column data types

Schemas may apply to relational tables and schematized file formats such as Avro and Parquet. 

Input type conversions

Depending on the data source,  Designer Cloud powered by Trifacta® Enterprise Edition can read in native data types into Trifacta data types. For more information, see Type Conversions.

Schema Validation


Over time, schema sources may change in major and minor ways, often without warning. From within the Designer Cloud application , schema changes may appear as broken recipe steps and can cause data corruption downstream. To assist with these issues, the Designer Cloud application can be configured to monitor schema changes on your dataset. Schema validation performs the following actions on your dataset:

  • On read, the schema information from the dataset is captured and stored separately in the Trifacta database. This information identifies column names, data types, and ordering of the dataset.
  • When the dataset is read during job execution, the new schema information is read and compared to the stored version, which enables identification of changes to the dataset.

    Tip: This check occurs as the first step of the job execution process and is labeled as Schema validation.

  • You can configure the Designer Cloud application to halt job execution when schema validation issues have been encountered.

    Tip: Configuration settings can be overridden for individual jobs.

Limitations

  • Schema validation applies only to sources that have published schemas (relational datasources and schematized file types).

    NOTE: CSV files are not supported.

    NOTE: If you attempt to refresh the schema of a parameterized dataset based on a set of files, only the schema for the first file is checked for changes. If changes are detected, the other files are contain those changes as well. This can lead to changes being assumed or undetected in later files and potential data corruption in the flow.

Enable

Schema management service

If you are not enabling schema validation, the Schema Management service can be disabled.

Steps:

  1. You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.
  2. Locate the following parameter and set it to false:

    "schema-management-service.enabled": true,
  3. Save your changes and restart the platform.

Settings

At the project or workspace level, an administrator can set the default settings for outputs to validate schemas or not. 

Tip: Workspace-level defaults can be overridden at the job level, even if the workspace-level settings are disabled. For more information, see Run Job Page.

Use

When schema validation is enabled and a job is launched, the schema validation check is performed in parallel with the data ingestion step. The results of the schema validation check are reported in the Job Details page in the Schema validation stage.

NOTE: Jobs may be configured to fail if schema validation checks fail. If jobs are not configured to fail, jobs may complete with warnings and publish output data to the specified targets, when schema validation fails.

For more information, see Job Details Page.

When schema validation detects differences in the Job Details page, those findings can be explored in detail. See Schema Changes Dialog.

Job-level overrides

You can override the project or workspace level settings for schema validation for individual jobs. For more information, see Run Job Page.

Schema Refresh

Schema refresh enables on-demand updating of your imported dataset schemas to capture changes to columns. For example, when you are working with datasets in a flow view, you can refresh your imported datasets' schemas by checking the source schema for changes. Schema refresh automatically generates a new initial sample, which allows you to gather fresh data in the Transformer page.

Schema refresh applies to:

  • Relational schemas
  • Schematized files
  • Delimited files

    NOTE: Delimiter files include CSVs and TSVs and can include other files whose delimiters can be inferred by the Designer Cloud application during import. Delimited files do not contain data type information; data types are inferred by the Designer Cloud application for these file types.

    NOTE: File types that require conversion, such as Excel, PDF, and JSON, are not supported.


Key Benefits:

  • Reduces the number of duplicate or invalid datasets created from the same source.
  • Reduces challenges of replacing datasets and retaking samples. 

Limitations

NOTE: If you attempt to refresh the schema of a parameterized dataset based on a set of files, only the schema for the first file is checked for changes. If changes are detected, the other files are assumed to contain those changes as well. This can lead to changes being assumed or undetected in later files and potential data corruption in the flow.


  • You cannot refresh the schemas of reference datasets or uploaded sources.
  • Schema refresh does not apply to any file formats that require conversion to native formats. These file formats include PDF, Excel, and JSON among others.

  • If a column's data type is modified and other changes, such as column name changes, are not detected, this change is not considered a schema drift error.

Effects of refreshing schemas

When you choose to refresh a schema, the schema is refreshed without checking for changes, which forces the deletion of all samples and recollection of a new initial sample. All pre-existing samples must be recreated. In some environments, this sample collection incurs costs.

When you refresh the schema in the Designer Cloud application :  

  • The source schema is applied to the imported dataset in all cases. 
    • All the existing samples are invalidated.
    • A new initial sample is generated, which updates the previewed data. This may take some time. 
  • Addition or removal of columns may cause recipe steps to break, which can cause any transformation jobs to fail. You must fix these broken steps in the Recipe panel.

Refresh your schemas

For more information on how to refresh the schemas of your datasets, see:

Via API:

For more information, see  https://api.trifacta.com/ee/9.2/index.html#operation/asyncRefreshSchema

Output Schemas

Output type conversions

Depending on the output system,  Designer Cloud powered by Trifacta Enterprise Edition can deliver your results in columns and data types native to the target. For more information, see Type Conversions.

Target schemas

As needed, you can import a dataset the columns of which can serve as the target of your transformation efforts. When this target schema is imported, it is super-imposed on the columns of your dataset in the Transformer page, allowing you to quickly change the naming, order, and data typing of your columns to match the target schema. For more information, see Overview of RapidTarget.

See Also for Overview of Schema Management:

This page has no comments.