
Trifacta® provides multiple mechanisms to transform and standardize data to meet usage needs, including profile visualizations and type-based quality bars to identify potential anomalies and quality problems. Data quality checks can be applied during data import, transformation, or export in the form of visual profiling.

Broadly speaking, data quality describes the degree to which data is usable and responsive to your use case. When you assess data quality, you are designing tests to evaluate the data's suitability both for generic usage and for your specific purposes.

Data Quality Characteristics

Data quality covers the following characteristics:

  • Completeness: values are present where they are needed and expected
  • Accuracy: data is substantively free of errors
  • Consistency: a dataset can be matched across different data sources of the enterprise
  • Timeliness: data values are up-to-date
  • Uniqueness: aggregate data are free from duplication, which can be enforced through filters or other transformations of source data
  • Validity: data are structured based on an adequate and rigorous classification system
  • Availability / Accessibility: data are made available to the relevant stakeholders
  • Traceability: the history, processing and location of the data under consideration can be easily traced

Schema Validation

Type inference

When data is imported, Trifacta attempts to infer the data types in the source and to type the columns of the dataset accordingly. Type inference uses the first 20-25 rows of the initial sample to assess the appropriate data type to apply to each column. For more information, see Type Conversions.

Some imported data, such as relational tables, may include schema information that identifies the data type of each column. In some cases, you can disable type inference on imported data:

  • Connections: As part of the definition of a connection, you can optionally choose to disable type inference. For more information, see Create Connection Window.

  • Per-dataset: When you import a dataset, you can modify the import settings for the selected source to disable type inference. See Import Data Page.
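If inference assigns an unsuitable type, you can also set the column type explicitly in your recipe. A minimal sketch in Wrangle (the CustomerId column is hypothetical, and step syntax may vary by release):

    settype col: CustomerId type: 'Integer'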

Assign targets

To assist in your transformation efforts, you can assign a target schema to each recipe. This target schema is superimposed on the columns of your data. Using visual tools to review differences and select changes, you can rapidly convert the structure of your dataset in development to match the expected target schema. For more information, see Overview of RapidTarget.

Identify Anomalies

In the Transformer page, you can use the available visual tools to review the data quality characteristics of the columns in your data. These data visualizations and type-based quality bars can assist in identifying potential anomalies and quality problems.

Data quality bar

At the top of each column, you can see a data quality bar, which uses the following color coding to validate the column values against the selected column type.

Color bar | Description
green | Values that are valid for the current data type of the column
red | Values that are mismatched for the column data type
gray | Missing or null values

Tip: Click any of the color bars to receive suggestions for transformations to add to your recipe. See Overview of Predictive Transformation.


Tip: You can change a column's data type in the column header. See Column Menus.

For more information, see Data Quality Bars.
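The validity checks behind the quality bar can also be applied as functions in recipe steps. A brief sketch in Wrangle (the Zip column is hypothetical, and exact step syntax may vary by release):

    Flag values that do not match the column's Zipcode type:
        derive type: single value: ISMISMATCHED(Zip, ['Zipcode']) as: 'zip_mismatched'

    Flag missing or null values:
        derive type: single value: ISMISSING(Zip) as: 'zip_missing'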

Column histogram

In the column header, you can review the count and distribution of values in the column. A column's histogram can be useful for identifying anomalies or for selecting specific sets of values in the column for further exploration.  

Tip: Click and drag over any set of values to receive suggestions for transformations to add to your recipe. See Overview of Predictive Transformation.

See Column Histograms.

Column details

Through the Column Details panel, you can explore the quality and distribution of the values in the column. The contents of the panel vary depending on the data type. For example, if the column is typed for Datetime values, then the Column Details panel includes information on the distribution of values across the days of the week and days of the month. 

For all data types, you can review useful statistics, including quartiles, the uniqueness of values, mismatches, and outliers.

Tip: The Column Details panel is very useful for acquiring statistical information on column values in a visual format. Click any data quality bar to be prompted for suggestions of transformation steps. See Overview of Predictive Transformation.

For more information, see Column Details Panel.

Data Quality Rules

In the Transformer page, you can design data quality rules to apply to the displayed sample of your data. These rules can be used to identify anomalies and to assess completeness, uniqueness, and validity. Rules can also be defined to assess the quality of the data for its intended purpose in your data pipeline. Additionally, you can use calculated metrics (derived metrics) as inputs to create metric-based data quality rules.

Examples:

  • Check that all product identifiers fit a specified pattern.
  • Verify that there are no negative values in any count columns.
  • Validate that primary key columns contain unique values.
  • Specify metrics as the Column value for some rule types. For example, instead of specifying a column name such as OrderTotal as the input for a rule, you could specify AVERAGE(OrderTotal) for supported rule types. For more examples, see Add Data Quality Rule. A sketch of a metric-based check appears after this list.
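As an illustration, a metric-based range check might be expressed along these lines (the OrderTotal column and the bounds are hypothetical, and whether a given rule type accepts this form depends on the release):

    AVERAGE(OrderTotal) >= 50 && AVERAGE(OrderTotal) <= 500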

NOTE: Metric-based rules are supported only for some metric types. For more information on the rules that support metrics, see Data Quality Rules Reference.

NOTE: Data quality rules are not transformation steps.  They assess the current state of the sampled data in the Transformer page and can be used to assist in constructing transformation steps to improve data quality.

NOTE: As you apply transformation steps to the data, the state of your data quality rules is automatically updated to reflect the changes. If you delete columns or other elements referenced in the data quality rules, errors are generated in the Transformer page.

Limitations

  • Rules cannot be included in macros.
  • Rules cannot be parameterized.
  • Sets of rules are created for each recipe. Rules cannot be shared between recipes.

Data quality rule categories

Rules break down into the following categories:

Category | Description
Integrity Constraints | Rule types in this category assess the validity of a column's data and any implied relationships between the data (e.g., City + State implies Zip Code).
Pattern Matching | These rule types test whether the data in your column matches patterns that you define.
Column Values | These rule types compare column values to limits or sets of acceptable values. In addition to column references, you can specify metric-based values. For example, you can create a constraint that the sales quantity should be within a specific range.
Other Rules | You can also create data quality rules based on custom Wrangle formulas.
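For instance, the earlier rule examples could be written as custom Wrangle formulas along these lines (the column names and the pattern are hypothetical, and exact pattern syntax may vary by release):

    Product identifiers fit a specified pattern:
        MATCHES(ProductId, `{alpha}{3}-{digit}{4}`)

    No negative values in a count column:
        ItemCount >= 0

    Primary key values are unique (assuming aggregate functions are permitted in the formula):
        COUNTDISTINCT(OrderId) == COUNT()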

Data quality types

Within each of the above categories, you can explore and define a variety of types of data quality rules. These rule types provide a template for creating the rule, which accepts one or more input parameters that you specify. For more information on the set of available rule types, see Data Quality Rules Reference.

Creating rules

For each recipe, you can create individualized sets of rules from within the Transformer page. In the Data Quality Rules panel, you build your data-specific rules and can review the quality bars of each rule as you continue to build your recipe.

For more information on creating rules, see Add Data Quality Rule.

For more information, see Data Quality Rules Panel.

Reviewing suggestions

Through the Data Quality Rules panel, you can review a set of suggested data quality rules that are applicable to your dataset. These rules are generated based on heuristics applied to your sampled data. For more information, see Data Quality Rules Panel.

Rules evaluation

In the Transformer page:

  1. Rules are evaluated and displayed for the current location in the recipe. For example, if you change the location of the recipe cursor to a point earlier in the recipe, all of the defined rules are evaluated for the state of the dataset sample at that point in the recipe.
  2. The data quality rules defined in the Transformer page are applied to the displayed sample. If your sample is not the full dataset, you should consider taking additional samples to validate the rules across other parts of your dataset.

In job results:

After job execution, these rules are applied across the entire dataset and available when visual profiling is enabled. For more information, see "Data Quality in Job Details" below.

Standardization

You can use the Standardization tool to standardize clustered sets of column values to values that are common and consistent throughout your enterprise's data. For more information, see Overview of Standardization.

Data Quality Functions

The following functions are available for assessing data quality.

Type functions

Item | Description
NULL Function | The NULL function generates null values.
IFNULL Function | The IFNULL function writes out a specified value if the source value is null. Otherwise, it writes the source value. Input can be a literal, a column reference, or a function.
IFMISSING Function | The IFMISSING function writes out a specified value if the source value is null or missing. Otherwise, it writes the source value. Input can be a literal, a column reference, or a function.
IFMISMATCHED Function | The IFMISMATCHED function writes out a specified value if the input expression does not match the specified data type or typing array. Otherwise, it writes the source value. Input can be a literal, a column reference, or a function.
IFVALID Function | The IFVALID function writes out a specified value if the input expression matches the specified data type. Otherwise, it writes the source value. Input can be a literal, a column reference, or a function.
ISNULL Function | The ISNULL function tests whether a column of values contains null values. For input column references, this function returns true or false.
ISMISSING Function | The ISMISSING function tests whether a column of values is missing or null. For input column references, this function returns true or false.
ISMISMATCHED Function | Tests whether a set of values is not valid for a specified data type.
VALID Function | Tests whether a set of values is valid for a specified data type and is not a null value.
PARSEINT Function | Evaluates a String input against the Integer datatype. If the input matches, the function outputs an Integer value. Input can be a literal, a column of values, or a function returning String values.
PARSEBOOL Function | Evaluates a String input against the Boolean datatype. If the input matches, the function outputs a Boolean value. Input can be a literal, a column of values, or a function returning String values.
PARSEFLOAT Function | Evaluates a String input against the Decimal datatype. If the input matches, the function outputs a Decimal value. Input can be a literal, a column of values, or a function returning String values.
PARSEARRAY Function | Evaluates a String input against the Array datatype. If the input matches, the function outputs an Array value. Input can be a literal, a column of values, or a function returning String values.
PARSEOBJECT Function | Evaluates a String input against the Object datatype. If the input matches, the function outputs an Object value. Input can be a literal, a column of values, or a function returning String values.
PARSESTRING Function | Evaluates an input of any data type against the String datatype. If the input matches, the function outputs a String value. Input can be a literal, a column of values, or a function.
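A brief sketch of two of these functions in recipe steps (Wrangle; the Discount and Qty columns are hypothetical, and step syntax may vary by release):

    Replace null values with a default (Discount is a hypothetical column):
        set col: Discount value: IFNULL(Discount, 0)

    Replace values that do not match the Integer type (assuming IFMISMATCHED accepts a typing array and a fallback value):
        set col: Qty value: IFMISMATCHED(Qty, ['Integer'], 0)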

Count functions

The following functions measure counts of values within a column, optionally counted by group. 

Item | Description
COUNT Function | Generates the count of rows in the dataset. Generated value is of Integer type.
COUNTA Function | Generates the count of non-null rows in a specified column, optionally counted by group. Generated value is of Integer type.
COUNTDISTINCT Function | Generates the count of distinct values in a specified column, optionally counted by group. Generated value is of Integer type.
UNIQUE Function | Extracts the set of unique values from a column into an array stored in a new column. This function is typically part of an aggregation.
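Count functions are typically used within an aggregation step. A minimal sketch (the CustomerId and Region columns are assumptions, and pivot syntax may vary by release):

    pivot value: COUNTDISTINCT(CustomerId) group: Region limit: 1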

Aggregation functions

Item | Description
AVERAGE Function | Computes the average (mean) of all row values in a column or group. Input column can be of Integer or Decimal type.
SUM Function | Computes the sum of all row values in a column. Input column can be of Integer or Decimal type.
MIN Function | Computes the minimum value found in all row values in a column. Input column can be of Integer, Decimal, or Datetime type.
MAX Function | Computes the maximum value found in all row values in a column. Input column can be of Integer, Decimal, or Datetime type.
MODE Function | Computes the mode (most frequent value) from all row values in a column, according to their grouping. Input column can be of Integer, Decimal, or Datetime type.
MINDATE Function | Computes the minimum value found in all row values in a Datetime column.
MAXDATE Function | Computes the maximum value found in all row values in a Datetime column.
MODEDATE Function | Computes the most frequent (mode) value found in all row values in a Datetime column.
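These aggregates are commonly computed per group in an aggregation step. A hedged sketch (the OrderTotal and Region columns are hypothetical; whether multiple aggregates can be listed in one step may vary by release):

    pivot value: AVERAGE(OrderTotal), MIN(OrderTotal), MAX(OrderTotal) group: Region limit: 1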

Statistical functions - single column

Variations in these functions:

  • Some of these functions have variations that use the sample method of computation, as opposed to the population method.
  • IF conditional functions can be used to perform statistical computations based on a condition.

General statistics

Item | Description
VAR Function | Computes the variance among all values in a column. Input column can be of Integer or Decimal type. If no numeric values are detected in the input column, the function returns 0.
STDEV Function | Computes the standard deviation across all column values of Integer or Decimal type.
MEDIAN Function | Computes the median from all row values in a column or group. Input column can be of Integer or Decimal type.
QUARTILE Function | Computes a specified quartile across all row values in a column or group. Input column can be of Integer or Decimal type.
PERCENTILE Function | Computes a specified percentile across all row values in a column or group. Input column can be of Integer or Decimal type.


Item | Description
APPROXIMATEMEDIAN Function | Computes an approximation of the median from all row values in a column or group. Input column can be of Integer or Decimal type.
APPROXIMATEQUARTILE Function | Computes an approximation for a specified quartile across all row values in a column or group. Input column can be of Integer or Decimal type.
APPROXIMATEPERCENTILE Function | Computes an approximation for a specified percentile across all row values in a column or group. Input column can be of Integer or Decimal type.
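For example, to compare the spread of a numeric column across groups (the Price and Category columns are assumptions, and pivot syntax may vary by release):

    pivot value: STDEV(Price) group: Category limit: 1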

Statistical functions - multi-column

Item | Description
COVAR Function | Computes the covariance between two columns using the population method. Source values can be of Integer or Decimal type.
CORREL Function | Computes the correlation coefficient between two columns. Source values can be of Integer or Decimal type.
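A sketch of a two-column check (the Qty and OrderTotal columns are hypothetical, and whether pivot accepts an ungrouped aggregate may vary by release):

    pivot value: CORREL(Qty, OrderTotal) limit: 1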

Data Quality in Job Details

When you run a job and generate results, you can review the quality of the data in the generated output.

Visual profiling

In parallel with executing the job, you can generate a visual profile of the generated results. This visual profile provides graphical representations of the valid and mismatched values against each column's data type, as well as indications about missing values in the output. 

Tip: Visual profiles can be downloaded in PDF or JSON format for offline analysis.

Visual profiling is selected as part of the job definition process. See Run Job Page.

For more information, see Overview of Visual Profiling.

Rules tab

When visual profiling is enabled for your job, the Rules tab in the Job Details page contains the results of the data quality rules for the job's recipes applied across the entire dataset. 

Tip: Data quality rules are available for download in JSON and PDF format. For more information, see Job Details Page.

For more information, see Job Details Page.
