
The process of cleansing, enhancing, and transforming your data can introduce significant changes to it, some of which might not be intended. This page provides tips and techniques for validating your dataset throughout your data wrangling efforts, from start to finish.

Data validation can be broken down into the following categories, both of which are illustrated in the sketch after this list:

  • Consistency - Do your data values fall within the set of expected values? Do field values match the data type of the column? Are values within acceptable ranges? Are rows unique, or are some duplicated?
  • Completeness - Are all expected values included in your data? Are some fields missing values? Are there expected values that are not present in the dataset?
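
As a rough illustration of both categories outside of the application, the following Python sketch (using pandas; the orders data, column names, and ranges are hypothetical) applies a consistency check and a completeness check:

    import pandas as pd

    # Hypothetical sample data; column names and ranges are illustrative only.
    orders = pd.DataFrame({
        "order_id": [1001, 1002, 1002, 1004],
        "quantity": [5, -2, 3, None],
    })

    # Consistency: values should be numeric and fall within an acceptable range.
    quantity = pd.to_numeric(orders["quantity"], errors="coerce")
    out_of_range = orders[(quantity < 0) | (quantity > 1000)]

    # Consistency: duplicated keys indicate rows that are not unique.
    duplicate_ids = orders[orders["order_id"].duplicated(keep=False)]

    # Completeness: flag rows where an expected value is missing.
    missing_quantity = orders[orders["quantity"].isna()]

    print(out_of_range, duplicate_ids, missing_quantity, sep="\n\n")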

...

Visual profiling also generates statistics on the values in each column in the dataset. You can use this statistical information to assess the overall data quality of the source data. This visual profile information is part of the record for the job, which remains in the system after execution. For more information, see Profile Your Source Data.
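
As a hedged approximation of the counts that a visual profile reports, the following Python sketch (pandas; the column contents are assumptions) tallies valid, mismatched, and missing values for a single column that is expected to be numeric:

    import pandas as pd

    # Hypothetical column expected to contain integer ages.
    ages = pd.Series(["34", "41", "not available", None, "29"])

    missing = ages.isna().sum()                      # values absent from the source
    parsed = pd.to_numeric(ages, errors="coerce")    # mismatched values become NaN
    mismatched = parsed.isna().sum() - missing       # present, but not valid numbers
    valid = len(ages) - missing - mismatched

    print(f"valid={valid}, mismatched={mismatched}, missing={missing}")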

Generate a new random sample

...

  • Counts of valid, unique, mismatched, and missing values.
  • Breakdowns by quartile and information on maximum, minimum, and mean values.

...


Available statistics depend on the data type for the column. For more information, see Locate Outliers. 
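
For a numeric column, the quartile and range statistics mentioned above can be approximated outside the product as follows (Python with pandas; the sample values are made up):

    import pandas as pd

    prices = pd.Series([9.99, 12.50, 11.25, 240.00, 10.75, 12.10])

    summary = {
        "min": prices.min(),
        "q1": prices.quantile(0.25),
        "median": prices.quantile(0.50),
        "q3": prices.quantile(0.75),
        "max": prices.max(),
        "mean": prices.mean(),
        "unique": prices.nunique(),
    }
    print(summary)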

Data range checks

Standard deviation ranges

...
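
One common implementation of this kind of range check flags values that fall more than a chosen number of standard deviations from the mean. A minimal Python sketch (the cutoff of two standard deviations and the sample values are assumptions):

    import pandas as pd

    values = pd.Series([10.2, 9.8, 10.5, 10.1, 55.0, 9.9, 10.3])

    mean, std = values.mean(), values.std()
    cutoff = 2  # assumed number of standard deviations; tune to your data

    outliers = values[(values - mean).abs() > cutoff * std]
    print(outliers)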

You can perform ad-hoc tests for uniqueness of individual values. For more information, see Deduplicate Data. 
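Outside of the application, an ad-hoc uniqueness test can be sketched in Python as follows (pandas; the customer IDs are invented for illustration):

    import pandas as pd

    cust_ids = pd.Series(["C-100", "C-101", "C-102", "C-101"])

    # The column passes the test only if it contains no duplicated values.
    print(f"all values unique: {cust_ids.is_unique}")

    # List every row that participates in a duplicate.
    print(cust_ids[cust_ids.duplicated(keep=False)])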

Data quality rule:

The following data quality rule verifies that all of the values in the custId column are unique:

...

Click the gray bar to prompt for a set of suggestion cards for handling those values. For more information, see Find Missing Data.
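
Inside the application, the suggestion cards propose ways to handle the missing values; locating them in the first place can be sketched as follows (Python with pandas; the email column is a hypothetical example):

    import pandas as pd

    df = pd.DataFrame({"email": ["a@example.com", None, "", "c@example.com"]})

    # Treat empty strings as missing in addition to actual nulls.
    missing_mask = df["email"].isna() | (df["email"] == "")
    print(df[missing_mask])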

Null values

While null values are categorized with missing values, they are not the same thing. In some cases, it might be important to distinguish the actual null values within your dataset, and several functions in the transformation language can assist in finding them. See Manage Null Values.
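
As a hedged sketch of that distinction (Python with pandas; the sample values are assumptions), true nulls can be separated from other forms of missing data, such as empty strings:

    import pandas as pd

    emails = pd.Series(["a@example.com", None, "", "b@example.com"])

    true_nulls = emails.isna()       # actual null values
    empty_strings = (emails == "")   # present in the data, but empty

    print(f"nulls: {true_nulls.sum()}, empty strings: {empty_strings.sum()}")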

Validate data against other data

...

  • Some problems in the data might have been generated in the source system. If you plan to use additional sources from this system, you should try to get these issues corrected in the source and, if necessary, have your source data regenerated. 
  • Some data quality issues can be ignored. For the sake of downstream consumers of the data, you might want to annotate your dataset with information about possible issues. Be sure to inform consumers how to identify this information.
