...

When a dataset is first loaded into the Transformer, the default sampling collects the first N rows of data, depending on the size and density of each row. However, your dataset might contain variations in the data that are not present in this first sample. For more information, see Samples Panel.

Transformations vs. Data Quality Rules

You can perform data quality checks through the following general methods:

  1. Transformations: You can verify the quality of your data by creating transformations to check values for consistency and completeness and, if needed, taking action on the data itself for deviations.
    1. Transformations are built in the Transform Builder in the Transformer page to add steps to your recipe. For more information, see Transform Builder.

      Tip: If you need to take actions in the data itself based on data quality checks, it may be better to use a transformation.

  2. Data quality rules: You can create data quality rules, which are persistent checks of columnar data against rules that you define. You can perform a variety of checks that exist outside of the recipe, so as you transform your data, the data quality rules automatically show the effects of your transformations on the overall quality of your data. 
    1. Data quality rules are not recipe steps. They exist outside of recipes and persist in the Transformer page to help you to build steps to transform your data. 
    2. Data quality rules are built in the Data Quality Rules panel in the Transformer page.
    3. For more information, see Overview of Data Quality.

      Tip: If you are attempting to transform the data to get all values in a column to pass one or more data quality checks, use data quality rules.

Examples of both types of data quality checks are provided below.

Validate Consistency

The application provides useful features for checking that your data is consistent across its rows. With a few recipe steps, you can create custom validation checks to verify values.

...

  • The rows that contain mismatched values are highlighted in the data grid.
  • The application provides suggestions in the form of suggestion cards for ways that you can transform your data.

Transformation:

You might be unsure of what to do with your data. To examine all of the affected rows together, you can insert a transformation like the following in your recipe:

  Transformation Name           New formula
  Parameter: Formula type       Single row formula
  Parameter: Formula            ismismatched(Primary_Website_or_URL, ['Url'])
  Parameter: New column name    mismatched_Primary_Website_or_URL

The above checks the values in the Primary_Website_or_URL column against the Url data type. If the value in the source column is not a valid URL, then the new column value is true. After sorting by this new column, all of the invalid URLs are displayed next to each other in the data grid, where you can review them in detail.

Data quality rule:

The following data quality rule checks the Primary_Website_or_URL column against the Url data type:

  Data quality rule type        Valid
  Parameter: Column             Primary_Website_or_URL
  Parameter: Data type          'Url'
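
For readers who want to reason about the mismatch check outside the application, the following Python sketch mimics the idea: flag each value that does not look like a URL, then sort so that the flagged rows sit together. The helper name is_mismatched_url, the sample rows, and the validity test (an http or https scheme plus a host) are assumptions made for illustration; the Url data type in the application may accept or reject different values.

```python
from urllib.parse import urlparse


def is_mismatched_url(value):
    """Return True when the value does not look like an http(s) URL.

    Assumption: "valid" here means an http/https scheme and a non-empty host.
    """
    if not isinstance(value, str):
        return True
    parsed = urlparse(value.strip())
    return not (parsed.scheme in ("http", "https") and bool(parsed.netloc))


# Hypothetical sample rows for illustration only.
rows = [
    {"Primary_Website_or_URL": "https://example.com"},
    {"Primary_Website_or_URL": "not a url"},
    {"Primary_Website_or_URL": "http://example.org/path"},
    {"Primary_Website_or_URL": ""},
]

# Add the flag column, mirroring the new column created by the transformation.
for row in rows:
    row["mismatched_Primary_Website_or_URL"] = is_mismatched_url(
        row["Primary_Website_or_URL"]
    )

# Sort so that flagged rows are listed first and can be reviewed together.
flagged_first = sorted(
    rows, key=lambda r: r["mismatched_Primary_Website_or_URL"], reverse=True
)
```

Because the sort is stable, rows with the same flag value keep their original relative order.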

Outlying values

Through the Column Details panel, you can review statistical information about individual columns. To open, select Column Details... from a column's drop-down menu.

...

You can create custom transformations to evaluate standard deviations from the mean for a specific column. For more information, see Locate Outliers.
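
As a rough illustration of a standard-deviation test, the Python sketch below flags values that lie more than a chosen number of standard deviations from the column mean. The function name flag_outliers, the threshold of 2.0, and the sample values are assumptions for this example and are not taken from the application.

```python
from statistics import mean, pstdev


def flag_outliers(values, threshold=2.0):
    """Flag values more than `threshold` standard deviations from the mean.

    Assumption: the population standard deviation is used, and a column with
    no variation produces no outliers.
    """
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [False] * len(values)
    return [abs(v - mu) / sigma > threshold for v in values]


# Hypothetical column values; only the last one is far from the rest.
ratings = [50, 52, 48, 51, 49, 50, 95]
flags = flag_outliers(ratings)
```

A single extreme value also inflates the mean and standard deviation, so the threshold may need tuning for small columns.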

Fixed value ranges

Transformation:

If you need to test a column of values against two fixed values, you can use the following transformation, which evaluates each value in the column. If the value in the Rating column is less than 10 or greater than 90, then the generated column value is true.

  Transformation Name           New formula
  Parameter: Formula type       Single row formula
  Parameter: Formula            ((Rating < 10) || (Rating > 90))
  Parameter: New column name    Outlier_Rating

Data quality rule:

The following data quality rule performs the same evaluation as the previous transformation yet persists in the Transformer page.

  Data quality rule type        Formula
  Parameter: Formula            ((Rating < 10) || (Rating > 90))
  Parameter: Group rows by      (empty)
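
The same range test can be expressed in a few lines of Python, which may help when checking the boundary behavior. This is a minimal sketch: the function name is_outlier_rating and the sample values are hypothetical, while the bounds 10 and 90 come from the formula above.

```python
def is_outlier_rating(rating, low=10, high=90):
    """Return True when the rating is below `low` or above `high`.

    The bounds themselves (10 and 90) are not outliers, matching the strict
    < and > comparisons in the formula.
    """
    return rating < low or rating > high


# Hypothetical sample values, including both boundary values.
ratings = [5, 10, 47, 90, 91]
outlier_rating = [is_outlier_rating(r) for r in ratings]
```

Only the values 5 and 91 are flagged; the boundary values 10 and 90 pass.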

Duplicate rows

Entire rows can be tested for duplication. The deduplicate transform allows you to remove identical rows. Note that whitespace and case differences are evaluated as different rows. For more information, see Deduplicate Data.

...

You can perform ad-hoc tests for uniqueness of individual values. For more information, see Deduplicate Data.

Data quality rule:

The following data quality rule verifies that all of the values in the custId column are unique:

  Data quality rule type        Unique
  Parameter: Column             custId
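
To see what a uniqueness check involves, the following Python sketch counts each value and reports those that occur more than once. The function name find_duplicates and the sample identifiers are hypothetical. The sketch compares exact strings, so values that differ only in case or whitespace count as different; that is an assumption of this example, chosen to match the behavior described above for duplicate rows.

```python
from collections import Counter


def find_duplicates(values):
    """Return a dict of value -> count for values that appear more than once."""
    counts = Counter(values)
    return {value: n for value, n in counts.items() if n > 1}


# Hypothetical custId values; "c001" differs from "C001" only by case.
cust_ids = ["C001", "C002", "C003", "C002", "c001"]
duplicates = find_duplicates(cust_ids)

# The column passes the uniqueness check only when no duplicates are found.
all_unique = not duplicates
```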

Permitted character checks

You can test for the presence of permitted characters in individual columns by using a regular expression test.

Transformation:

The following transformation evaluates to true if all of the characters in a column field are alphanumeric or the space character:

...

You can add additional permitted characters inside the square brackets. For more information, see Text Matching.

Data quality rule:

This data quality rule performs the same test as the above transformation:

  Data quality rule type        Match
  Parameter: Column             MarketName
  Parameter: Matches pattern    /^[a-zA-Z0-9 ]*$/
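
The pattern can be tried out in Python before it is used in a rule. The sketch below is an approximation: the function name has_only_permitted_chars and the sample market names are hypothetical, and Python's regular expression engine may differ in details from the one used by the application. It uses fullmatch instead of the ^ and $ anchors so that a trailing newline is not silently accepted.

```python
import re

# Same character class as the rule: letters, digits, and the space character.
PERMITTED = re.compile(r"[a-zA-Z0-9 ]*")


def has_only_permitted_chars(value):
    """Return True when every character in `value` is in the permitted set."""
    return PERMITTED.fullmatch(value) is not None


# Hypothetical MarketName values; the apostrophe is not a permitted character.
market_names = ["Main Street Market", "Farmers' Market", "Market 42", ""]
results = [has_only_permitted_chars(name) for name in market_names]
```

Because the pattern uses the * quantifier, an empty string also passes; combine the test with a completeness check if empty values should fail.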

Validate Completeness

The application provides easy methods for identifying if cells are missing values or contain null values. You can also create lookups to identify if values are not represented in your dataset.

...

While null values are categorized with missing values, they are not the same thing. In some cases, it might be important to distinguish the actual null values within your dataset, and several language functions can assist in finding them. See Manage Null Values.
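
The distinction between null and missing values can be illustrated with a short Python sketch. Here None stands in for a null value and an empty string stands in for a missing value; this mapping, the function name classify_cell, and the sample column are assumptions made for the example and do not describe how the application stores either kind of value.

```python
def classify_cell(value):
    """Classify a cell as "null", "missing", or "present".

    Assumption: None represents a null value and "" represents a missing value.
    """
    if value is None:
        return "null"
    if value == "":
        return "missing"
    return "present"


# Hypothetical column containing both kinds of incomplete values.
column = ["Acme", "", None, "Globex"]
labels = [classify_cell(v) for v in column]

# Count each category separately so nulls are not hidden among missing values.
null_count = labels.count("null")
missing_count = labels.count("missing")
```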

...

Validate data against other data

You can also test if your dataset contains at least one instance of a set of values.
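
One way to think about this test is as a set difference: start from the values that must be represented and remove every value that appears in the column. The Python sketch below shows the idea; the function name missing_required_values, the column values, and the required set are hypothetical.

```python
def missing_required_values(column_values, required_values):
    """Return the required values that never appear in the column."""
    return set(required_values) - set(column_values)


# Hypothetical column and reference set for illustration.
regions = ["East", "West", "East", "North"]
required = {"North", "South", "East", "West"}

# An empty result means every required value appears at least once.
absent = missing_required_values(regions, required)
```

In this example the result contains only "South", since that value never appears in the column.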

...