The process of cleansing, enhancing, and transforming your data can introduce significant changes to it, some of which might not be intended. This page provides some tips and techniques for validating your dataset, from start to finish for your data wrangling efforts.
Data validation can be broken down into the following categories:
- Consistency - Does your data fit into expected values for it? Do field values match the data type for the column? Are values within acceptable ranges? Are rows unique? Duplicated?
- Completeness - Are all expected values included in your data? Are some fields missing values? Are there expected values that are not present in the dataset?
Before You Begin
Before you begin building your data pipeline, you should identify your standards for data quality.
Visual profiling also generates statistics on the values in each column in the dataset. You can use this statistical information to assess overall data quality of the source data. This visual profile information is part of the record for the job, which remains in the system after execution.
For more information, see Profile Your Source Data.
Generate a new random sample
When a dataset is first loaded into the Transformer, the default sampling collects the first N rows of data, depending on the size and density of each row. However, your dataset might contain variations in the data that are not present in this first sample. For more information, see Samples PanelNew samples can be generated through the Samples panel.
Transformations vs. Data Quality Rules
- Transformations: You can verify the quality of your data by creating transformations to check values for consistency and completeness and, if needed, taking action on the data itself for deviations.
Transformations are built in the Transform Builder in the Transformer page to add steps to your recipe. For more information, see Transform Builder.
Tip: If you need to take actions in the data itself based on data quality checks, it may be better to use a transformation.
- Data quality rules: You can create data quality rules, which are persistent checks of columnar data against rules that you define. You can perform a variety of checks that exist outside of the recipe, so as you transform your data, the data quality rules automatically show the effects of your transformations on the overall quality of your data.
- Data quality rules are not recipe steps. They exist outside of recipes and persist in the Transformer page to help you to build steps to transform your data.
- Data quality rules are built in the Data Quality Rules panel in the Transformer page.
For more information, see Overview of Data Quality.
Tip: If you are attempting to transform the data to get all values in a column to pass one or more data quality checks, use data quality rules.
In the data quality bar at the top of a column, you can review the valid (green), mismatched (red), and missing (blackgray) values.
When you click the red bar:
The above checks the values in the
Primary_Website_or_URL column against the
Url data type. If the value in the source column is not a valid URL, then the new column value is
true.After sorting by this new column, all of the invalid URLs are displayed next to each other in the data grid, where you can review them in detail.
Data quality rule:
The following data quality rule checks the
Primary_Website_or_URL column against the
Url data type:
- Counts of valid, unique, mismatched, and missing values.
- Breakdowns by quartile and information on maximum, minimum, and mean values.
For more information, see Column Details Panel.
Available statistics depend on the data type for the column. For more information, see Locate Outliers.
Data range checks
Standard deviation ranges
Entire rows can be tested for duplication. The
deduplicate transform allows you to remove identical rows. Note that whitespace and case differences are evaluated as different rows. For more information, see Deduplicate Data.
For an individual column, the column details Column Details panel contains an indicator of the number of unique values in the column. If this value does not match the count of values and the count of rows in the sample, then some values are duplicated. Remember that these counts apply to just the sample in the Transformer page and may not be consistent measures across the entire dataset. See Column Details Panel.
You can perform ad-hoc tests for uniqueness of individual values. For more information, see Deduplicate Data.
Data quality rule:
The following data quality rule verifies that all of the values in the
custId column are unique:
At the top of each column, the data quality bar includes a black gray bar indicating the number of cells in the column that do not contain values. This set of values includes missing values.
Click the black gray bar to prompt for a set of suggestion cards for handling those values. For more information, see Find Missing Data.
While null values are categorized with missing values, they are not the same thing. In some cases, it might be important to distinguish the actual null values within your dataset, and several
|D s lang|
Validate data against other data
For example, your dataset contains businesses throughout the United States. You might want to check to see if each state is represented in your dataset.
Create a reference dataset that contains a single instance of each item you are checking. In this example, it'd be a simple CSV file with the name of each state on a separate line.
Tip: To your second dataset, you might want to add a second column containing the value
true, which allows you to keep separate validation data from the columns that you join.
- Add this CSV file as a new dataset to your flow.
- Open your source dataset. In the Search panel, enter
- In the Join window:
- Select the reference dataset you just created. Click Accept. Click Next.
- Select the type of join to perform:
- Right outer join: Select this join type if you want to delete rows in your source dataset that do not have a key value in the reference dataset. In the example, all rows that do not have a value in the State column would be removed from the generated dataset.
- Full outer join: Select this type to preserve all data, including the rows in the source that do not contain key values.
- Select the two fields that you want to use to join. In the example, you would select the two fields that identify state values. Click Next.
- Select the fields that you want to include in the final dataset. Click Review.
- Click Add to Recipe.
- The generated dataset includes all of the fields you specified.
- For one of your key values, click the black bar gray bar and select the link for the number of affected rows, which loads them into the data grid. Review the missing values in each key column.
To remove these rows, select the missing value category in the data quality bar for the appropriate column and apply a delete statement.
The generated command should look like the following:
D trans Type step p01Name Condition p01Value ISMISSING([State]) SearchTerm Delete rows
Generate output profile
- Some problems in the data might have been generated in the source system. If you plan to use additional sources from this system, you should try to get these issues corrected in the source and, if necessary, have your source data regenerated.
- Some data quality issues can be ignored. For the sake of downstream consumers of the data, you might want to annotate your dataset with information about possible issues. Be sure to inform consumers on how to identify this information.
|D s also|