You can create rules to validate the quality of the data in your sample. When created, these rules allow you to highlight exceptions to the rule to assist in building your data cleansing recipe steps.
NOTE: Data quality rules are not transformation steps. They assess the current state of the sampled data in the Transformer page. |
NOTE: As you apply transformation steps to the data, the state of your data quality rules is automatically updated to reflect the changes. If you delete columns or other elements referenced in the data quality rules, errors are generated in the Transformer page. |
You can add a rule from inside the Transformer page.
Steps:
If you have not created any rules, the panel is empty. To create a new rule, click Add rule.
Tip: You can review a set of suggested data quality rules that are based on your dataset and add them as needed. Click View suggestions. For more information, see Data Quality Rules Panel. |
Not Null
. See Examples below.May be missing: Some rule types support the May be missing checkbox. When it is enabled, the Data Quality rule allows missing values to be acceptable for a specified column.
NOTE: The May be missing rule parameter is not applicable to Not Null, Not Missing, Not Equal, Not In Set, and Formula rule types. |
Select the column or columns to which the rule applies.
Tip: Some rules can be applied to multiple columns. |
Click Add.
Review the previewed results.
Tip: To simplify the preview, click the Show Only Affected Columns checkbox in the status bar. |
The new rule is displayed in the Data Quality Rules panel. In the data quality bar for the rule, green indicates the row values that have passed the rule, and red indicates the row values that failed.
Tip: After creating a rule, you can jump back and forth between the Recipe panel and this panel to review how your changes to your recipe steps affect the data quality bars for your rules. |
Additional options are available in the context menu for the rule. For more information, see Data Quality Rules Panel.
The following data quality rule tests the values in the storeAddress
column to see if they are missing (empty) values.
The following rule evaluates the primaryKey
column to determine if all values in it are unique.
Suppose the values of your SKUs must be in the form of "SKU
+ 6 digits".
Following uses to perform the match. For more information on
, see Text Matching.
This rule tests the values in the orderColor
column to verify that all values are Blue
, Yellow
, or Green
.
In the following, the Acceptable values must be formatted as an array. See below.
For some rule types, you can create a data quality rule using custom metrics to assess the data quality. You can use the calculated metric type (derived metrics) as a data quality input type and create a metric-based data quality rule. For more information on input types, see Data Quality Rules Reference.
For example, you can create a metric-based rule to find out if the minimum value metric is within the acceptable range. To do this, you can select a In Range
data quality rule, select Minimum
input metric, and other required parameters to create a metric-based rule.
NOTE: Metric-based rules are supported only for some metric types. For more information on the rule types that support metrics, see Data Quality Rules Reference. |
Steps:
If you have not created any rules, the panel is empty. To create a new rule, click Add rule.
In Range
. For Group by rows, you can select the column whose individual values are used to group the calculated metric rule. For example, if you group a maximum order by product identifier column, then the rule is calculated for individual product identifiers.
NOTE: If you do not group by row values, then the metric is calculated over all rows in the dataset. For large datasets, calculating flat aggregates can impact performance. |
Maximum value: Indicates the maximum (upper bound) of the range.
NOTE: For Greater than and Less than rule types, Minimum value and Maximum value options are displayed. |
Exclude minimum and maximum from range: When this option is enabled, the Data Quality rule excludes the minimum and maximum values from the acceptable range.
NOTE: The Exclude minimum and maximum from range rule parameter is applicable only to In Range, Greater than, and Less than rule types. |
May be missing: When this checkbox is selected, the Data Quality rule allows missing values to be acceptable for a specified column.
Click Add. The metric-based data quality rule is added.
The new rule is displayed in the Data Quality Rules panel. In the data quality bar for the rule, the green bar indicates the row values that have passed the rule, and the red bar indicates the row values that failed.
Additional options are available in the context menu for the rule. For more information, see Data Quality Rules Panel.
This rule checks if the values in the POS_SALES
column fall between the minimum and maximum range.
When this rule is added, the rows that do not fall within the range are flagged in red.
This rule checks if the values in the POS_QTY
column are greater than the minimum value.
When this rule is added, the rows that do not fall within the range are flagged in red.
This rule checks if the values in the NET_SHIP_QTY
column are less than the maximum value.
When this rule is added, the rows that do not fall within the range are flagged in red.
Tip: You can use the available checkboxes to include the missing values and to exclude the minimum and maximum values from range. |
You can add custom rules using formulas containing functions.
is the proprietary language used to transform your data. You can also apply the functions of the language to your data quality rules. For more information, see Wrangle Language.
Steps:
Formula
.In the Formula textbox, enter the formula to test your data.
NOTE: The formula that you provide must evaluate to |
For aggregation functions, you can group the evaluation of your rule based on the values in your grouping column.
Tip: You can group by multiple columns. The first column is the outermost grouping. |
To add the rule, click Add.
You can use data quality rules to perform some data analysis functions. For example, suppose you want to flag the dates where the total sales of all of your orders was less than 100.
When this rule is added, the rows whose date total is less than 100 are flagged in red.
To edit a rule, select Edit rule from the context menu for the rule in the panel.
To delete a rule, select Delete rule from the context menu for the rule in the panel.
When you generate a profile as part of your job results, you can download the profile in JSON or PDF format.
When you download the profile in JSON format, the set of rules for the job are also included. Search for profilerRules
in the JSON file.
For more information, see Job Details Page.
When flows are exported and imported, the rule definitions for the recipes in the flow are also exported. For more information, see Export Flow.