...
When a dataset is first loaded into the Transformer, the default sampling collects the first N rows of data, depending on the size and density of each row. However, your dataset might contain variations in the data that are not present in this first sample. For more information, see Samples Panel.
Transformations vs. Data Quality Rules
You can perform data quality checks through the following general methods:
- Transformations: You can verify the quality of your data by creating transformations to check values for consistency and completeness and, if needed, taking action on the data itself for deviations.
Transformations are built in the Transform Builder in the Transformer page to add steps to your recipe. For more information, see Transform Builder.
Tip: If you need to take actions in the data itself based on data quality checks, it may be better to use a transformation.
- Data quality rules: You can create data quality rules, which are persistent checks of columnar data against rules that you define. You can perform a variety of checks that exist outside of the recipe, so as you transform your data, the data quality rules automatically show the effects of your transformations on the overall quality of your data.
- Data quality rules are not recipe steps. They exist outside of recipes and persist in the Transformer page to help you to build steps to transform your data.
- Data quality rules are built in the Data Quality Rules panel in the Transformer page.
For more information, see Overview of Data Quality.
Tip: If you are attempting to transform the data to get all values in a column to pass one or more data quality checks, use data quality rules.
Examples of both types of data quality checks are provided below.
Validate Consistency
...
- The rows that contain mismatched values are highlighted in the data grid.
- The application provides suggestions in the form of suggestion cards for ways that you can transform your data.
Transformation:
Maybe you are unsure of what to do with your data. If you would like to examine all of the rows together, you can insert a transformation like the following in your recipe:
D trans | ||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
The above transformation checks the values in the Primary_Website_or_URL column against the Url data type. If the value in the source column is not a valid URL, then the new column value is true. After sorting by this new column, all of the invalid URLs are displayed next to each other in the data grid, where you can review them in detail.
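Outside the application, the same flag-then-sort idea can be sketched in Python. This is only an illustration: the function is_url_mismatch and the column url_mismatch are hypothetical names, and the check below is a simple heuristic (http or https scheme plus a dotted host), not the validation logic of the Url data type.

```python
from urllib.parse import urlparse


def is_url_mismatch(value):
    """Return True when the value does not look like a URL.

    Assumption: a value counts as a URL if it has an http/https scheme
    and a host containing a dot. Empty values are flagged as well.
    """
    parsed = urlparse(value.strip())
    return not (parsed.scheme in ("http", "https") and "." in parsed.netloc)


rows = [
    {"Primary_Website_or_URL": "https://www.example.com"},
    {"Primary_Website_or_URL": "not a url"},
    {"Primary_Website_or_URL": "http://example.org/about"},
    {"Primary_Website_or_URL": "example"},
]

# Generate the new true/false column for every row.
for row in rows:
    row["url_mismatch"] = is_url_mismatch(row["Primary_Website_or_URL"])

# Sort so that the flagged rows are grouped together at the top.
rows_sorted = sorted(rows, key=lambda r: r["url_mismatch"], reverse=True)

for row in rows_sorted:
    print(row["url_mismatch"], row["Primary_Website_or_URL"])
```

Because the sort is stable, flagged rows keep their original relative order, which makes them easier to compare against the source.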
Data quality rule:
The following data quality rule checks the Primary_Website_or_URL column against the Url data type:
D trans | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
|
Outlying values
Through the Column Details panel, you can review statistical information about individual columns. To open, select Column Details... from a column's drop-down menu.
...
You can create your own custom transformations to evaluate the number of standard deviations from the mean for values in a specific column. For more information, see Locate Outliers.
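As a rough illustration of the standard deviation approach, the following Python sketch flags values that lie more than a chosen number of standard deviations from the column mean. The function name flag_outliers, the threshold of 2.0, and the sample values are assumptions made for this example; they are not taken from the application.

```python
from statistics import mean, stdev


def flag_outliers(values, threshold=2.0):
    """Flag values farther than `threshold` sample standard deviations from the mean."""
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        # All values are identical, so nothing can be an outlier.
        return [False] * len(values)
    return [abs(v - mu) > threshold * sigma for v in values]


ratings = [50, 52, 48, 51, 49, 50, 53, 47, 120]
flags = flag_outliers(ratings)

for value, is_outlier in zip(ratings, flags):
    print(value, is_outlier)
```

Note that a single extreme value inflates both the mean and the standard deviation, so on small samples a lower threshold or a median-based measure may be more appropriate.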
Fixed value ranges
Transformation:
If you need to test a column of values against two fixed values, you can use the following transformation, which evaluates each value in the column. If the value in the Rating column is less than 10 or greater than 90, then the generated column value is true.
D trans | ||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
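The logic of this range test can be sketched in Python as follows. The function name out_of_range and the sample ratings are hypothetical; only the bounds 10 and 90 come from the description above. Values equal to 10 or 90 are treated as in range, since the test is strictly less than or greater than.

```python
def out_of_range(value, low=10, high=90):
    """Return True when the value is below `low` or above `high`."""
    return value < low or value > high


ratings = [5, 10, 42, 90, 91]
flags = [out_of_range(r) for r in ratings]

for rating, flag in zip(ratings, flags):
    print(rating, flag)
```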
Data quality rule:
The following data quality rule performs the same evaluation as the previous transformation but persists in the Transformer page.
D trans | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
|
Duplicate rows
Entire rows can be tested for duplication. The deduplicate transform allows you to remove identical rows. Note that rows that differ only in whitespace or case are treated as different rows. For more information, see Deduplicate Data.
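To make the exact-match behavior concrete, here is a minimal Python sketch of row-level deduplication. It is not the deduplicate transform itself; the function name and sample rows are assumptions for illustration. Rows are compared value by value, so differences in case or whitespace keep rows distinct.

```python
def deduplicate(rows):
    """Keep the first occurrence of each row, comparing values exactly."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


rows = [
    ("Acme", "NY"),
    ("Acme", "NY"),   # exact duplicate: removed
    ("acme", "NY"),   # differs by case: kept
    ("Acme ", "NY"),  # differs by trailing space: kept
]

unique_rows = deduplicate(rows)
print(len(rows), "rows in,", len(unique_rows), "rows out")
```

If near-duplicates like these should be collapsed, trim and standardize the case of the values before testing for duplicates.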
...
You can perform ad-hoc tests for uniqueness of individual values. For more information, see Deduplicate Data.
Data quality rule:
The following data quality rule verifies that all of the values in the custId column are unique:
D trans | ||||||||
---|---|---|---|---|---|---|---|---|
|
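An equivalent uniqueness test can be sketched in Python by counting occurrences of each value. The function name duplicated_values and the sample identifiers are hypothetical; the rule in the application is defined through the Data Quality Rules panel, not through code like this.

```python
from collections import Counter


def duplicated_values(values):
    """Return the sorted list of values that appear more than once."""
    counts = Counter(values)
    return sorted(v for v, n in counts.items() if n > 1)


cust_ids = ["C001", "C002", "C003", "C002"]

dupes = duplicated_values(cust_ids)
all_unique = not dupes

print("All unique:", all_unique)
print("Duplicated values:", dupes)
```

Returning the offending values, rather than only a pass or fail result, makes it easier to decide which rows to investigate.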
Permitted character checks
You can test for the presence of permitted characters in individual columns by using a regular expression test.
Transformation:
The following transformation evaluates to true if all of the characters in a column field are alphanumeric or the space character:
...
You can add more permitted characters inside the square brackets. For more information, see Text Matching.
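The general technique can be sketched in Python with a character class. The pattern below is an assumption written for this example and may differ from the pattern used in the transformation; it permits ASCII letters, digits, and the space character, and it treats an empty value as passing.

```python
import re

# Permitted characters go inside the square brackets.
PERMITTED = re.compile(r"[A-Za-z0-9 ]*")


def only_permitted(value):
    """Return True when every character in the value is permitted."""
    return PERMITTED.fullmatch(value) is not None


samples = ["Hello World 42", "hello_world", "O'Brien", ""]
results = [only_permitted(s) for s in samples]

for sample, ok in zip(samples, results):
    print(repr(sample), ok)
```

To permit more characters, such as the underscore, extend the character class, for example [A-Za-z0-9 _].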
Data quality rule:
This data quality rule performs the same test as the above transformation:
D trans | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
|
Validate Completeness
...
While null values are categorized with missing values, they are not the same thing. In some cases, it might be important to distinguish the actual null values within your dataset, and several
D s lang |
---|
...
Validate data against other data
You can also test if your dataset contains at least one instance of a set of values.
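As an illustration of this kind of test, the following Python sketch reports which required values never appear in a column. The function name missing_required, the column contents, and the required set are all hypothetical.

```python
def missing_required(values, required):
    """Return the required values that do not appear at least once in `values`."""
    return set(required) - set(values)


state_column = ["CA", "NY", "CA", "WA", "NY"]
required_states = {"CA", "NY", "TX"}

missing = missing_required(state_column, required_states)

print("All required values present:", not missing)
print("Missing values:", sorted(missing))
```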
...