The process of cleansing, enhancing, and transforming your data can introduce significant changes to it, some of which might not be intended. This page provides tips and techniques for validating your dataset from the start to the finish of your data wrangling efforts.
Data validation can be broken down into the following categories:
Before you begin building your data pipeline, you should identify your standards for data quality.
NOTE: Depending on your source system, you might be able to generate data quality reports from within it. These reports can be used as the basis for validating your work in the application.
If your source system does not enable generation of these reports, you should consider profiling your dataset as soon as you load your data into the application.
Before you begin modifying your dataset, you should review the columns and ranges of values in those columns that are expected by the downstream consumer of your dataset. A quick review can provide guidance to identify the key areas of your dataset that require end-to-end validation.
For datasets with many columns, it might be problematic to apply consistent validation across all columns. In these situations, you might need to decide the columns whose consistency, completeness, and accuracy are most important.
Before you start building a recipe on your dataset, it is a good idea to create a visual profile of your source data. This process involves creating a minimal recipe on the dataset after you have loaded it into the Transformer page. Then, you run a job to generate a profile of the data, which can be used as a baseline for validating the data and as an aid in tracing the origin of any data problems that you discover.
Visual profiling also generates statistics on the values in each column in the dataset. You can use this statistical information to assess overall data quality of the source data. This visual profile information is part of the record for the job, which remains in the system after execution.
For more information, see Profile Your Source Data.
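The kind of per-column statistics that a baseline profile captures can be sketched outside the application. The following is a minimal Python illustration, not the application's profiling job: the function name `profile_column`, the sample rows, and the rule that empty strings count as missing are all assumptions made for this example.

```python
def profile_column(values):
    """Summarize one column: row count, missing count, distinct count, numeric min/max."""
    # Assumption of this sketch: None and blank strings both count as missing.
    present = [v for v in values if v is not None and str(v).strip() != ""]
    numeric = []
    for v in present:
        try:
            numeric.append(float(v))
        except (TypeError, ValueError):
            pass  # non-numeric values are left out of min/max
    return {
        "rows": len(values),
        "missing": len(values) - len(present),
        "distinct": len(set(present)),
        "min": min(numeric) if numeric else None,
        "max": max(numeric) if numeric else None,
    }

# Hypothetical sample data standing in for a loaded dataset.
rows = [
    {"state": "CA", "rating": "42"},
    {"state": "NY", "rating": ""},
    {"state": "CA", "rating": "87"},
]

# One profile per column, keyed by column name.
baseline = {col: profile_column([r[col] for r in rows]) for col in rows[0]}
```

Saving a summary like `baseline` before any changes are made gives you fixed numbers to compare against later.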
When a dataset is first loaded into the Transformer page, the default sampling collects the first N rows of data, depending on the size and density of each row. However, your dataset might contain variations in the data that are not present in this first sample. For more information, see Samples Panel.
The application provides useful features for checking that your data is consistent across its rows. With a few recipe steps, you can create custom validation checks to verify values.
In the data quality bar at the top of a column, you can review the valid (green), mismatched (red), and missing (black) values.
When you click the red bar:
Maybe you are unsure of what to do with your data. If you would like to examine all of the rows together, you can insert a transformation like the following in your recipe:
The above checks the values in the Primary_Website_or_URL column against the Url data type. If the value in the source column is not a valid URL, then the new column value is true. After sorting by this new column, all of the invalid URLs are displayed next to each other in the data grid, where you can review them in detail.
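The same flag-and-sort idea can be illustrated in Python. This is a rough sketch under stated assumptions, not the application's Url data type: the helper `is_invalid_url`, the `invalid_url` column name, and the sample rows are hypothetical, and the validity rule (an http or https scheme plus a host) is a simplification chosen for the example. Unlike the data quality bar, this sketch also flags empty values as invalid.

```python
from urllib.parse import urlparse

def is_invalid_url(value):
    """Return True when value does not look like an http(s) URL with a host."""
    if not isinstance(value, str) or not value.strip():
        return True
    parts = urlparse(value.strip())
    return parts.scheme not in ("http", "https") or not parts.netloc

# Hypothetical sample rows.
rows = [
    {"name": "Acme", "Primary_Website_or_URL": "https://acme.example.com"},
    {"name": "Bolt", "Primary_Website_or_URL": "not a url"},
    {"name": "Cog", "Primary_Website_or_URL": "http://cog.example.org/shop"},
    {"name": "Dyn", "Primary_Website_or_URL": ""},
]

# Add the flag column, then sort so that flagged rows sit together at the top.
for row in rows:
    row["invalid_url"] = is_invalid_url(row["Primary_Website_or_URL"])
rows.sort(key=lambda r: r["invalid_url"], reverse=True)
```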
Through the Column Details panel, you can review statistical information about individual columns. To open, select Column Details... from a column's drop-down menu.
In the Summary area, you can review the count of Outlier values. In the application, an outlier is defined as any value that is more than 4 standard deviations from the mean for the set of column values.
The Column Details panel also contains:
For more information, see Column Details Panel.
Available statistics depend on the data type for the column. For more information, see Locate Outliers.
For example, suppose that the application's definition of an outlier does not suit your range of values, and you need to identify values that are more than 5 standard deviations from the mean.
You can create custom transforms to evaluate the number of standard deviations from the mean for a specific column. For more information, see Locate Outliers.
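The arithmetic behind a custom outlier threshold can be sketched in Python. This is an illustration only: the function `find_outliers` is hypothetical, and it uses the population standard deviation, which is an assumption, since this page does not state which form the application computes.

```python
from statistics import mean, pstdev

def find_outliers(values, threshold=5.0):
    """Return values that lie more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (assumption of this sketch)
    if sigma == 0:
        return []  # every value is identical, so nothing can be an outlier
    return [v for v in values if abs(v - mu) > threshold * sigma]

# Hypothetical column: fifty ordinary values and one extreme value.
values = [10] * 50 + [1000]
flagged = find_outliers(values, threshold=5.0)
```

Raising or lowering `threshold` changes how strict the test is, which is the same adjustment you would make when the default of 4 standard deviations is not appropriate.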
If you need to test a column of values against two fixed values, you can use the following transformation, which evaluates a column value. If the value in the Rating column is less than 10 or greater than 90, then the generated column value is
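A range test of this kind can be sketched in Python as follows. The function name `out_of_range` and the sample ratings are hypothetical, and returning True for out-of-range values is this sketch's own convention, not a statement of what the application generates.

```python
def out_of_range(value, low=10, high=90):
    """Return True when value is below `low` or above `high` (the bounds themselves pass)."""
    return value < low or value > high

# Hypothetical Rating values, including both boundary values.
ratings = [5, 10, 47, 90, 93]
flags = [out_of_range(r) for r in ratings]
```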
Entire rows can be tested for duplication. The deduplicate transform allows you to remove identical rows. Note that whitespace and case differences are evaluated as different rows. For more information, see Deduplicate Data.
For an individual column, the Column Details panel contains an indicator of the number of unique values in the column. If this number does not match the count of values and the count of rows in the sample, then some values are duplicated. Remember that these counts apply only to the sample in the Transformer page and might not be consistent measures across the entire dataset. See Column Details Panel.
You can perform ad-hoc tests for uniqueness of individual values. For more information, see Deduplicate Data.
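Because whitespace and case differences cause otherwise matching values to be treated as distinct, it can help to test for duplicates both before and after normalizing. The following Python sketch illustrates the idea; the helper names and sample values are hypothetical, and the normalization rule (collapse whitespace, ignore case) is one possible choice.

```python
from collections import Counter

def duplicated_values(values):
    """Return {value: count} for every value that appears more than once."""
    counts = Counter(values)
    return {v: n for v, n in counts.items() if n > 1}

def normalized(value):
    """Collapse runs of whitespace, trim the ends, and ignore case."""
    return " ".join(value.split()).casefold()

# Hypothetical column values.
names = ["Acme Corp", "acme corp", "Acme  Corp ", "Bolt LLC", "Bolt LLC"]

exact = duplicated_values(names)                           # exact matches only
loose = duplicated_values([normalized(n) for n in names])  # after normalization
```

Comparing `exact` with `loose` shows which duplicates an exact-match test would miss.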
You can test for the presence of permitted characters in individual columns by using a regular expression test. The following transformation evaluates to true if all of the characters in a column field are alphanumeric or the space character:
You can add additional permitted characters inside the square brackets. For more information, see Text Matching.
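As an illustration of a permitted-character test, here is a minimal Python sketch. It is not the application's transformation syntax: the pattern, the name `only_permitted`, and the decision to treat an empty value as passing (it contains no forbidden characters) are assumptions of this example.

```python
import re

# Letters, digits, and the space character. Add more permitted characters inside the brackets.
PERMITTED = re.compile(r"[A-Za-z0-9 ]*")

def only_permitted(value):
    """Return True when every character in value is alphanumeric or a space."""
    return PERMITTED.fullmatch(value) is not None

checks = {v: only_permitted(v) for v in ["Suite 200", "Suite #200", ""]}
```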
The application provides easy methods for identifying whether cells are missing values or contain null values. You can also create lookups to identify whether values are not represented in your dataset.
At the top of each column, the data quality bar includes a black bar indicating the number of cells in the column that do not contain values. This set of values includes missing values.
Click the black bar to display a set of suggestion cards for handling those values.
For more information, see Find Missing Data.
While null values are categorized with missing values, they are not the same thing. In some cases, it might be important to distinguish the actual null values within your dataset, and several methods can assist in finding them. See Manage Null Values.
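The distinction between null and empty values can be illustrated with a short Python sketch. How the application represents nulls internally is not covered here, so this example uses Python's `None` for null and a blank string for empty; the function `classify_cell` and the sample column are hypothetical.

```python
from collections import Counter

def classify_cell(value):
    """Label a cell as 'null', 'empty', or 'value'."""
    if value is None:
        return "null"
    if isinstance(value, str) and value.strip() == "":
        return "empty"
    return "value"

# Hypothetical column containing nulls, blanks, and real values.
column = ["CA", None, "", "  ", "NY", None]
summary = Counter(classify_cell(v) for v in column)
```

A count like `summary` makes it clear how many of the cells that appear missing are true nulls.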
You can also test if your dataset contains at least one instance of a set of values.
For example, your dataset contains businesses throughout the United States. You might want to check to see if each state is represented in your dataset.
Create a reference dataset that contains a single instance of each item you are checking. In this example, it would be a simple CSV file with the name of each state on a separate line.
Tip: To your second dataset, you might want to add a second column containing the value
To remove these rows, select the missing value category in the data quality bar for the appropriate column and apply a delete statement.
The generated command should look like the following:
For more information, see Join Window.
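The reference-dataset check can also be sketched in Python, using a set lookup in place of a join. This is an illustration under assumptions: the function `missing_reference_values` and the sample lists are hypothetical, and matching ignores case and surrounding whitespace, which might differ from how a join in the application compares keys.

```python
def missing_reference_values(reference, observed):
    """Return the reference items that never appear among the observed values."""
    seen = {v.strip().casefold() for v in observed if v}
    return [r for r in reference if r.strip().casefold() not in seen]

# Hypothetical reference list (one entry per item) and observed column values.
reference_states = ["California", "Nevada", "Oregon", "Washington"]
business_states = ["California", "oregon", "California", "Washington "]

unrepresented = missing_reference_values(reference_states, business_states)
```

An empty result means that every item in the reference list is represented at least once.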
After you have completed your recipe, you should generate a profile with your executed job. You can open this profile and the profile you created for the source data in separate browser tabs to evaluate how consistent and complete your data remains from beginning to end of the wrangling process.
NOTE: Compare the statistical information in the generated profile to the statistics generated from the source, so that you can identify whether your recipe has introduced unwanted changes to these values.
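If you export or record the statistics from both profiles, the comparison itself can be automated. The following Python sketch assumes each profile has been captured as a dictionary of per-column statistics; the function `diff_profiles`, the statistic names, and the numbers are hypothetical and do not reflect the application's profile format.

```python
def diff_profiles(source, output, keys=("rows", "missing", "distinct", "min", "max")):
    """Return {column: {stat: (source_value, output_value)}} for every statistic that changed."""
    changes = {}
    for column in sorted(set(source) | set(output)):
        before = source.get(column, {})
        after = output.get(column, {})
        changed = {
            k: (before.get(k), after.get(k))
            for k in keys
            if before.get(k) != after.get(k)
        }
        if changed:
            changes[column] = changed
    return changes

# Hypothetical statistics recorded from the source profile and the output profile.
source_profile = {"rating": {"rows": 100, "missing": 4, "distinct": 60, "min": 1.0, "max": 99.0}}
output_profile = {"rating": {"rows": 100, "missing": 0, "distinct": 60, "min": 10.0, "max": 90.0}}

report = diff_profiles(source_profile, output_profile)
```

Each entry in `report` is a change to review: some are intended results of your recipe, and others might indicate a problem.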
After you have performed your data validation checks, you might need to make some decisions about how to address any issues you might have encountered: