The process of cleansing, enhancing, and transforming your data can introduce significant changes to it, some of which might not be intended. This page provides tips and techniques for validating your dataset from the start to the finish of your data wrangling efforts.
Data validation can be broken down into the following categories:
Before you begin building your data pipeline, you should identify your standards for data quality.
NOTE: Depending on your source system, you might be able to generate data quality reports from within it. These reports can be used as the basis for validating your work in the application. If your source system does not support generating these reports, you should consider profiling your dataset as soon as you load your data into the application.
Before you begin modifying your dataset, you should review the columns, and the ranges of values in those columns, that are expected by the downstream consumer of your dataset. A quick review can help you identify the key areas of your dataset that require end-to-end validation.
For datasets with many columns, it might be impractical to apply consistent validation across all columns. In these situations, you might need to decide which columns' consistency, completeness, and accuracy are most important.
Before you get started building your recipe, it is a good idea to create a visual profile of your source data. This process involves creating a minimal recipe on a dataset after you have loaded it into the Transformer page. Then, you run a job to generate a profile of the data, which can be used as a baseline for validating the data and as an aid in debugging the origin of any data problems you discover.
Visual profiling also generates statistics on the values in each column in the dataset. You can use this statistical information to assess the overall quality of the source data. This visual profile information is part of the record for the job, which remains in the system after execution.
When a dataset is first loaded into the Transformer, the default sampling collects the first N rows of data, depending on the size and density of each row. However, your dataset might contain variations in the data that are not present in this first sample. New samples can be generated through the Samples panel.
You can perform data quality checks through the following general methods:
Transformations are built in the Transformer page to add steps to your recipe.
Tip: If you need to take actions in the data itself based on data quality checks, it may be better to use a transformation.
For more information, see Overview of Data Quality.
Tip: If you are attempting to transform the data so that all values in a column pass one or more data quality checks, use data quality rules.
Examples of both types of data quality checks are provided below.
The application provides useful features for checking that your data is consistent across its rows. With a few recipe steps, you can create custom validation checks to verify values.
In the data quality bar at the top of a column, you can review the valid (green), mismatched (red), and missing (gray) values.
When you click the red bar, the rows containing mismatched values are highlighted in the data grid, and suggestions for handling them are displayed.
Transformation:
Perhaps you are unsure of what to do with the mismatched values. If you would like to examine all of these rows together, you can insert a transformation like the following in your recipe.
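A step along these lines performs the check. This is a sketch in the transformation language; the ISMISMATCHED function tests a value against a data type, and the generated column name is illustrative:

    derive type: single value: ISMISMATCHED(Primary_Website_or_URL, ['Url']) as: 'mismatched_Primary_Website_or_URL'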
The above checks the values in the Primary_Website_or_URL column against the Url data type. If the value in the source column is not a valid URL, then the new column value is true.
Data quality rule:
The following data quality rule checks the Primary_Website_or_URL column against the Url data type:
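Assuming a formula-type rule, which passes for each row where its formula returns true, the definition might look like the following sketch. ISVALID tests a value against one or more data types:

    ISVALID(Primary_Website_or_URL, ['Url'])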
Through the Column Details panel, you can review statistical information about individual columns. To open, select Column Details... from a column's drop-down menu.
In the Summary area, you can review the count of Outlier values. In the application, an outlier is defined as any value that is more than 4 standard deviations from the mean for the set of column values.
The Column Details panel also contains a variety of other statistics. The available statistics depend on the data type for the column.
For example, suppose that the application's definition of an outlier does not suit your range of values, and you need to identify values that are more than 5 standard deviations from the mean.
You can create your own custom transforms to evaluate standard deviations from the mean for a specific column. For more information, see Locate Outliers.
Transformation:
If you need to test a column of values against two fixed limits, you can use the following transformation. It evaluates each value in the Rating column: if the value is less than 10 or greater than 90, then the generated column value is true.
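A sketch of such a step (the generated column name is illustrative):

    derive type: single value: (Rating < 10) || (Rating > 90) as: 'Rating_outlier'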
Data quality rule:
The following data quality rule performs the same evaluation as the previous transformation, yet it persists in the Transformer page as you work:
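Assuming a formula-type rule that passes for rows where its formula returns true, the definition might be the following sketch:

    (Rating >= 10) && (Rating <= 90)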
Entire rows can be tested for duplication. The deduplicate transform allows you to remove identical rows. Note that rows that differ only in whitespace or capitalization are evaluated as distinct rows.
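If identical rows should simply be removed, the step is a single transform:

    deduplicate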
For an individual column, the Column Details panel contains an indicator of the number of unique values in the column. If this value does not match the count of values and the count of rows in the sample, then some values are duplicated. Remember that these counts apply only to the sample in the Transformer page and may not be consistent measures across the entire dataset.
You can perform ad-hoc tests for uniqueness of individual values.
Data quality rule:
The following data quality rule verifies that all of the values in the custId column are unique:
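Column-wide uniqueness is not easily expressed as a per-row formula, so this check is typically defined through the Data Quality Rules panel. A sketch of the definition (the exact rule type name can vary by version):

    Rule type: Unique
    Column: custId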
You can test whether individual columns contain only permitted characters by using a regular expression test.
Transformation:
The following transformation evaluates to true if all of the characters in a column value are alphanumeric or the space character:
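A sketch of the step, assuming a column named MyCol. MATCHES tests a value against a pattern, and the anchored regular expression forces the match to cover the entire value:

    derive type: single value: MATCHES(MyCol, /^[a-zA-Z0-9 ]*$/) as: 'all_alphanumeric'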
You can add additional permitted characters inside the square brackets. For more information, see Text Matching.
Data quality rule:
This data quality rule performs the same test as the above transformation:
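As a formula-type rule, under the same assumptions as the sketch above:

    MATCHES(MyCol, /^[a-zA-Z0-9 ]*$/)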
The application provides easy methods for identifying whether cells are missing values or contain null values. You can also create lookups to identify whether values are not represented in your dataset.
At the top of each column, the data quality bar includes a gray bar indicating the number of cells in the column that do not contain values. This set of values includes null values.
Click the gray bar to display a set of suggestion cards for handling those values.
While null values are categorized with missing values, they are not the same thing. In some cases, it might be important to distinguish the actual null values within your dataset, and several functions can assist in finding them.
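For example, a step along these lines flags the true nulls in a column (a sketch; MyCol is a placeholder name):

    derive type: single value: ISNULL(MyCol) as: 'is_null'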
You can also test whether your dataset contains at least one instance of each value in a set.
For example, your dataset contains businesses throughout the United States. You might want to check to see if each state is represented in your dataset.
Steps:
Create a reference dataset that contains a single instance of each item you are checking. In this example, it would be a simple CSV file with the name of each state on a separate line.
Tip: To your second dataset, you might want to add a second column containing a constant marker value. After the join, rows with a missing value in this column are easy to identify.
Join the two datasets. After the join, rows that contain missing values in the joined columns identify items that are not represented in your dataset. To remove these rows, select the missing value category in the data quality bar for the appropriate column and apply a delete statement.
The generated command should look like the following:
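Assuming the joined column is named State, a sketch of the generated step:

    delete row: ISMISSING([State])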
For a detailed example, see Validate Column Values against a Dataset.
After you have completed your recipe, you should generate a profile with your executed job. You can open this profile and the profile you created for the source data in separate browser tabs to evaluate how consistent and complete your data remains from beginning to end of the wrangling process.
NOTE: The statistical information in the generated profile should be compared to the statistics generated from the source, so that you can identify whether your changes have introduced unwanted differences in these values.
After you have performed your data validation checks, you might need to make some decisions about how to address any issues that you encountered: