
Validate Your Data

The process of cleansing, enhancing, and transforming your data can introduce significant changes to it, some of which might not be intended. This page provides tips and techniques for validating your dataset from the start to the end of your data wrangling efforts.

Data validation can be broken down into the following categories:

  • Consistency - Does your data contain the values you expect? Do field values match the data type of the column? Are values within acceptable ranges? Are rows unique, or are some duplicated?

  • Completeness - Are all expected values included in your data? Are some fields missing values? Are there expected values that are not present in the dataset?

Before You Begin

Before you begin building your data pipeline, you should identify your standards for data quality.

Note

Depending on your source system, you might be able to generate data quality reports from within it. These reports can be used as the basis for validating your work in Designer Cloud Powered by Trifacta Enterprise Edition.

If your source system does not enable generation of these reports, you should consider profiling your dataset as soon as you load your data into Designer Cloud Powered by Trifacta Enterprise Edition.

Verify downstream requirements

Before you begin modifying your dataset, you should review the columns and ranges of values in those columns that are expected by the downstream consumer of your dataset. A quick review can provide guidance to identify the key areas of your dataset that require end-to-end validation.

Identify important fields

For datasets with many columns, it might be impractical to apply consistent validation across all columns. In these situations, you might need to identify the columns whose consistency, completeness, and accuracy are most important.

Profile your source data

Before you get started building your recipe on your dataset, it might be a good idea to create a visual profile of your source data. This process involves creating a minimal recipe on a dataset after you have loaded it into the Transformer page. Then, you run a job to generate a profile of the data, which can serve as a baseline for validating the data and as an aid in debugging the origin of any data problems you discover.

Visual profiling also generates statistics on the values in each column in the dataset. You can use this statistical information to assess overall data quality of the source data. This visual profile information is part of the record for the job, which remains in the system after execution.

Generate a new random sample

When a dataset is first loaded into the Transformer, the default sampling collects the first N rows of data, depending on the size and density of each row. However, your dataset might contain variations in the data that are not present in this first sample. New samples can be generated through the Samples panel.

Transformations vs. Data Quality Rules

You can perform data quality checks through the following general methods:

  1. Transformations: You can verify the quality of your data by creating transformations to check values for consistency and completeness and, if needed, taking action on the data itself for deviations.

    1. Transformations are built in the Transformer page to add steps to your recipe.

      Tip

      If you need to take actions in the data itself based on data quality checks, it may be better to use a transformation.

  2. Data quality rules: You can create data quality rules, which are persistent checks of columnar data against rules that you define. You can perform a variety of checks that exist outside of the recipe, so as you transform your data, the data quality rules automatically show the effects of your transformations on the overall quality of your data.

    1. Data quality rules are not recipe steps. They exist outside of recipes and persist in the Transformer page to help you to build steps to transform your data.

    2. Data quality rules are built in the Data Quality Rules panel in the Transformer page.

    3. For more information, see Overview of Data Quality.

      Tip

      If you are attempting to transform the data to get all values in a column to pass one or more data quality checks, use data quality rules.

Examples of both types of data quality checks are provided below.

Validate Consistency

Designer Cloud Powered by Trifacta Enterprise Edition provides useful features for checking that your data is consistent across its rows. With a few recipe steps, you can create custom validation checks to verify values.

Mismatched values

In the data quality bar at the top of a column, you can review the valid (green), mismatched (red), and missing (gray) values.

When you click the red bar:

  • The rows that contain mismatched values are highlighted in the data grid.

  • The application provides suggestions in the form of suggestion cards for ways that you can transform your data.

Transformation:

Perhaps you are unsure of what to do with these mismatched values. If you would like to examine all of the affected rows together, you can insert a transformation like the following in your recipe.

Transformation Name

New formula

Parameter: Formula type

Single row formula

Parameter: Formula

ismismatched(Primary_Website_or_URL, ['Url'])

Parameter: New column name

mismatched_Primary_Website_or_URL

The above checks the values in the Primary_Website_or_URL column against the Url data type. If the value in the source column is not a valid URL, then the new column value is true.

Data quality rule:

The following data quality rule checks the Primary_Website_or_URL column against the Url data type:

Data Quality Rule

Valid

Parameter: Column

Primary_Website_or_URL

Parameter: Data type

'Url'
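Whether you use the transformation or the rule, you might decide that rows with irreparably mismatched values should simply be dropped. A minimal sketch of that cleanup step, reusing the same type check as a deletion condition on the column from the examples above:

Transformation Name

Delete rows

Parameter: Condition

ismismatched(Primary_Website_or_URL, ['Url'])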

Outlying values

Through the Column Details panel, you can review statistical information about individual columns. To open, select Column Details... from a column's drop-down menu.

In the Summary area, you can review the count of Outlier values. In Designer Cloud Powered by Trifacta Enterprise Edition, an outlier is defined as any value that is more than 4 standard deviations from the mean for the set of column values.

The Column Details panel also contains:

  • Counts of valid, unique, mismatched, and missing values.

  • Breakdowns by quartile and information on maximum, minimum, and mean values.

Available statistics depend on the data type for the column.

Data range checks

Standard deviation ranges

For example, suppose your definition of an outlier does not match the application's, and you need to identify values that are more than 5 standard deviations from the mean.

You can create custom transformations to evaluate standard deviations from the mean for a specific column. For more information, see Locate Outliers.
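The following is a rough sketch of such a check, assuming that the AVERAGE and STDEV aggregate functions can be referenced from a new-formula step to compute column-wide statistics; see Locate Outliers for the exact procedure. It flags values in a Rating column that fall more than 5 standard deviations from the column mean:

Transformation Name

New formula

Parameter: Formula type

Single row formula

Parameter: Formula

(ABS(Rating - AVERAGE(Rating)) > (5 * STDEV(Rating)))

Parameter: New column name

Outlier_Rating_StDev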

Fixed value ranges

Transformation:

If you need to test a column of values against two fixed limits, you can use the following transformation. It evaluates each value in the Rating column: if the value is less than 10 or greater than 90, the generated column value is true.

Transformation Name

New formula

Parameter: Formula type

Single row formula

Parameter: Formula

((Rating < 10) || (Rating > 90))

Parameter: New column name

Outlier_Rating

Data quality rule:

The following data quality rule performs the same evaluation as the previous transformation yet persists in the Transformer page.

Data Quality Rule

Formula

Parameter: Formula

((Rating < 10) || (Rating > 90))

Parameter: Group rows by

(empty)

Duplicate rows

Entire rows can be tested for duplication. The deduplicate transform allows you to remove identical rows. Note that rows that differ only in whitespace or capitalization are not considered identical.
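Because differences in case and whitespace prevent rows from matching, you might normalize the relevant text columns before removing duplicates. A minimal sketch, assuming an edit-in-place formula step (the Wrangle set transform; the transformation and column names shown here are illustrative) followed by the deduplication step:

Transformation Name

Edit column with formula

Parameter: Columns

Company_Name

Parameter: Formula

TRIM(LOWER(Company_Name))

Transformation Name

Remove duplicate rows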

Uniqueness checks

For an individual column, the Column Details panel contains an indicator of the number of unique values in the column. If this number is lower than the count of values in the column or the count of rows in the sample, then some values are duplicated. Remember that these counts apply only to the sample in the Transformer page and may not be consistent measures across the entire dataset.

You can perform ad-hoc tests for uniqueness of individual values.

Data quality rule:

The following data quality rule verifies that all of the values in the custId column are unique:

Data Quality Rule

Unique

Parameter: Column

custId

Permitted character checks

You can test whether individual columns contain only permitted characters by using a regular expression test.

Transformation:

The following transformation evaluates to true if all of the characters in a column field are alphanumeric or the space character:

Transformation Name

New formula

Parameter: Formula type

Single row formula

Parameter: Formula

MATCHES(MarketName, /^[a-zA-Z0-9 ]*$/)

You can add additional permitted characters inside the square brackets. For more information, see Text Matching.

Data quality rule:

This data quality rule performs the same test as the above transformation:

Data Quality Rule

Match

Parameter: Column

MarketName

Parameter: Matches pattern

/^[a-zA-Z0-9 ]*$/

Validate Completeness

Designer Cloud Powered by Trifacta Enterprise Edition provides easy methods for identifying if cells are missing values or contain null values. You can also create lookups to identify if values are not represented in your dataset.

Missing values

At the top of each column, the data quality bar includes a gray bar indicating the number of cells in the column that do not contain values. This count covers both empty (missing) cells and null values.

Click the gray bar to display suggestion cards for handling those values.
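If you prefer an explicit check in your recipe, you can also generate a flag column for missing values. A minimal sketch, reusing the column from the earlier examples:

Transformation Name

New formula

Parameter: Formula type

Single row formula

Parameter: Formula

ISMISSING([Primary_Website_or_URL])

Parameter: New column name

missing_Primary_Website_or_URL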

Null values

While null values are categorized with missing values, they are not the same thing. In some cases, it might be important to distinguish the actual null values within your dataset, and several Wrangle functions can assist in finding them.
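For example, the ISNULL function tests specifically for null values (unlike ISMISSING, which also covers empty cells). A minimal sketch that flags only the true nulls in a column:

Transformation Name

New formula

Parameter: Formula type

Single row formula

Parameter: Formula

ISNULL(Primary_Website_or_URL)

Parameter: New column name

null_Primary_Website_or_URL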

Validate data against other data

You can also test whether your dataset contains at least one instance of each value in a reference set.

For example, your dataset contains businesses throughout the United States. You might want to check to see if each state is represented in your dataset.

Steps:

  1. Create a reference dataset that contains a single instance of each item you are checking. In this example, it would be a simple CSV file with the name of each state on a separate line (see the sketch following these steps).

    Tip

    In your reference dataset, you might want to add a second column containing the value true, which allows you to keep the validation data separate from the columns that you join.

  2. Add this CSV file as a new dataset to your flow.

  3. Open your source dataset. In the Search panel, enter join datasets.

  4. In the Join window:

    1. Select the reference dataset you just created. Click Accept. Click Next.

    2. Select the type of join to perform:

      1. Right outer join: Select this join type if you want to delete rows in your source dataset that do not have a key value in the reference dataset. In the example, all rows that do not have a value in the State column would be removed from the generated dataset.

      2. Full outer join: Select this type to preserve all data, including the rows in the source that do not contain key values.

    3. Select the two fields that you want to use to join. In the example, you would select the two fields that identify state values. Click Next.

    4. Select the fields that you want to include in the final dataset. Click Review.

    5. Click Add to Recipe.

  5. The generated dataset includes all of the fields you specified.

  6. For one of your key columns, click the gray bar and select the link for the number of affected rows, which loads them into the data grid. Review the missing values in each key column.

  7. To remove these rows, select the missing value category in the data quality bar for the appropriate column and apply a delete statement.

  8. The generated command should look like the following:

    Transformation Name

    Delete rows

    Parameter: Condition

    ISMISSING([State])

For a detailed example, see Validate Column Values against a Dataset.
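For reference, the file created in step 1 of the example above might look like the following truncated sketch. The second column carries the optional literal true value described in the tip; its name is illustrative:

State,is_reference
Alabama,true
Alaska,true
Arizona,true
...
Wyoming,true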

After Transformation

Generate output profile

After you have completed your recipe, you should generate a profile with your executed job. You can open this profile and the profile you created for the source data in separate browser tabs to evaluate how consistent and complete your data remains from beginning to end of the wrangling process.

Note

The statistical information in the generated profile should be compared to the statistics generated from the source, so that you can identify whether your transformations have introduced unwanted changes to these values.

Decisions

After you have performed your data validation checks, you might need to make some decisions about how to address any issues you might have encountered:

  • Some problems in the data might have been generated in the source system. If you plan to use additional sources from this system, you should try to get these issues corrected in the source and, if necessary, have your source data regenerated.

  • Some data quality issues can be ignored. For the sake of downstream consumers of the data, you might want to annotate your dataset with information about possible issues. Be sure to inform consumers on how to identify this information.