
The product provides multiple mechanisms to transform and standardize data to meet usage needs, including profile visualizations and type-based quality bars to identify potential anomalies and quality problems. Data quality checks can be applied during data import, transformation, or export in the form of visual profiling.

Broadly speaking, data quality is the degree to which data is usable and fits your use case. When you assess data quality, you design tests that measure its suitability for general usage and for your specific uses.

Data Quality Characteristics

Data quality covers the following characteristics:

  • Completeness: values are present where they are needed and expected
  • Accuracy: data is substantively free of errors
  • Consistency: a dataset can be matched across different data sources of the enterprise
  • Timeliness: data values are up-to-date
  • Uniqueness: aggregate data are free from duplication introduced by filters or other transformations of source data
  • Validity: data are structured based on an adequate and rigorous classification system
  • Availability / Accessibility: data are made available to the relevant stakeholders
  • Traceability: the history, processing and location of the data under consideration can be easily traced
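
Some of these characteristics can be measured directly. The following is a minimal Python sketch of how completeness and uniqueness might be scored for a single column. The function names, and the choice to treat None and empty strings as missing, are assumptions made for illustration and are not part of the product.

```python
def _present(values):
    """Keep only values that are neither None nor an empty string."""
    return [v for v in values if v is not None and v != ""]


def completeness(values):
    """Fraction of values that are present."""
    if not values:
        return 0.0
    return len(_present(values)) / len(values)


def uniqueness(values):
    """Fraction of present values that are distinct."""
    present = _present(values)
    if not present:
        return 0.0
    return len(set(present)) / len(present)


column = ["a", "b", "", None, "b"]
print(completeness(column))  # 3 of the 5 values are present
print(uniqueness(column))    # 2 distinct values among the 3 present
```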

Schema Validation

Type inference

When data is imported, the product attempts to infer the data types in the source and to type columns in the dataset accordingly. Type inference uses the first 20-25 rows of the initial sample to assess the appropriate data type to apply to the column. For more information, see Type Conversions.
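
To illustrate the idea of sample-based inference, here is a minimal Python sketch. It is not the product's algorithm: the type names, the order of checks, the single date format, and the rule that every non-empty sampled value must parse are all assumptions.

```python
from datetime import datetime


def _parses(parser, text):
    """Return True if parser(text) succeeds without a ValueError."""
    try:
        parser(text)
        return True
    except ValueError:
        return False


# Checked in order, from most to least specific (hypothetical type names).
CHECKS = [
    ("Integer", lambda t: _parses(int, t)),
    ("Decimal", lambda t: _parses(float, t)),
    ("Datetime", lambda t: _parses(lambda s: datetime.strptime(s, "%Y-%m-%d"), t)),
]


def infer_type(values, sample_size=25):
    """Infer a column type from the non-empty values in the first sample_size rows."""
    sample = [v for v in values[:sample_size] if v not in (None, "")]
    if not sample:
        return "String"
    for type_name, check in CHECKS:
        if all(check(v) for v in sample):
            return type_name
    return "String"


print(infer_type(["1", "2", "3"]))
print(infer_type(["2020-01-01", "", "2020-02-15"]))
```

Because only the sampled rows are examined in this sketch, values further down the column can still fail to match the inferred type.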

Some imported data, such as relational tables, may include schema information to identify the data type of each column. In some cases you can disable type inferencing on imported data:

  • Global: Administrators can disable type inferencing for all imported schematized sources. In this manner, the platform uses the schema of the source to define the initial types assigned to the columns of the dataset.

  • Connections: As part of the definition of a connection, you can optionally choose to disable type inference. For more information, see Create Connection Window.

  • Per-dataset: When you import a dataset, you can modify the import settings for the selected source to disable type inference. See Import Data Page.

Assign targets

To assist in your transformation efforts, you can assign a target schema for each recipe. This target schema is superimposed on the columns of your data. Using visual tools to review differences and select changes, you can rapidly convert the structure of your dataset in development to meet the expected target schema. For more information, see Overview of RapidTarget.
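
The kind of comparison that a target schema enables can be sketched in a few lines of Python. This is an illustration only: the schema representation (ordered name and type pairs) and the function name diff_schema are hypothetical, and the product's own matching logic is not described here.

```python
def diff_schema(current, target):
    """Compare two schemas given as ordered lists of (column name, type) pairs."""
    current_types = dict(current)
    target_types = dict(target)
    return {
        # Columns the target expects but the dataset lacks
        "missing": [name for name, _ in target if name not in current_types],
        # Columns in the dataset that the target does not include
        "extra": [name for name, _ in current if name not in target_types],
        # Columns present in both but with different types
        "type_mismatch": [
            name
            for name, type_name in target
            if name in current_types and current_types[name] != type_name
        ],
    }


current = [("id", "String"), ("amount", "Decimal"), ("notes", "String")]
target = [("id", "Integer"), ("amount", "Decimal"), ("order_date", "Datetime")]
result = diff_schema(current, target)
print(result)
```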

Identify Anomalies

In the Transformer page, you can use the available visual tools to review the data quality characteristics of the columns in your data. These data visualizations and type-based quality bars can assist in identifying potential anomalies and quality problems.

Data quality bar

At the top of each column, you can see a data quality bar, which uses the following color coding to validate the column values against the selected column type.

Color bar | Description
green | Values that are valid for the current data type of the column
red | Values that are mismatched for the column data type
black | Missing or null values

Tip: Click any of the color bars to receive suggestions for transformations to add to your recipe. See Overview of Predictive Transformation.

Tip: You can change a column's data type in the column header. See Column Menus.

For more information, see Data Quality Bars.
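
The three categories in the quality bar can be sketched as a simple classification. In this minimal Python example, the validity check for an integer column and the treatment of empty strings as missing are assumptions for illustration.

```python
def classify(values, is_valid):
    """Count valid, mismatched, and missing values for one column."""
    counts = {"valid": 0, "mismatched": 0, "missing": 0}
    for value in values:
        if value is None or value == "":
            counts["missing"] += 1
        elif is_valid(value):
            counts["valid"] += 1
        else:
            counts["mismatched"] += 1
    return counts


def is_integer(text):
    """Hypothetical validity check for a column typed as integer."""
    try:
        int(text)
        return True
    except ValueError:
        return False


bar = classify(["1", "2", "x", "", None, "7"], is_integer)
print(bar)
```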

Column histogram

In the column header, you can review the count and distribution of values in the column. A column's histogram can be useful for identifying anomalies or for selecting specific sets of values in the column for further exploration.  

Tip: Click and drag over any set of values to receive suggestions for transformations to add to your recipe. See Overview of Predictive Transformation.

See Column Histograms.
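
As a rough illustration of how a histogram surfaces anomalies, the following Python sketch counts numeric values in equal-width buckets. The bucketing scheme is an assumption; the source does not state how the product chooses its bins.

```python
def histogram(values, bins=5):
    """Count numeric values in `bins` equal-width buckets between min and max."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1  # fall back to 1 when all values are equal
    counts = [0] * bins
    for value in values:
        # The maximum value would land one past the end, so clamp it.
        index = min(int((value - lo) / width), bins - 1)
        counts[index] += 1
    return counts


counts = histogram([1, 2, 2, 3, 10], bins=3)
print(counts)
```

In this example, the value 10 sits alone in the last bucket, separated from the rest by an empty bucket, which is the kind of shape that suggests an outlier worth inspecting.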

Column details

Through the Column Details panel, you can explore the quality and distribution of the values in the column. The contents of the panel vary depending on the data type. For example, if the column is typed for Datetime values, then the Column Details panel includes information on the distribution of values across the days of the week and days of the month. 

For all data types, you can review statistics on quartiles, the uniqueness of values, mismatches, and outliers.

Tip: The Column Details panel is very useful for acquiring statistical information on column values in a visual format. Click any data quality bar to be prompted for suggestions of transformation steps. See Overview of Predictive Transformation.

For more information, see Column Details Panel.
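
Comparable statistics can be computed offline with the Python standard library. This is a sketch under assumptions: the 1.5 × IQR rule for outliers is a common convention and is not confirmed as the rule used by the panel, and statistics.quantiles uses its default "exclusive" method.

```python
import statistics


def column_summary(values):
    """Quartiles, distinct count, and IQR-based outliers for numeric values."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "distinct": len(set(values)),
        "outliers": [v for v in values if v < low_fence or v > high_fence],
    }


summary = column_summary([10, 12, 11, 13, 12, 11, 95])
print(summary["outliers"])
```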

Standardization

You can use the Standardization tool to standardize clustered sets of column values to values that are common and consistent throughout your enterprise's data. For more information, see Overview of Standardization.
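
The general idea of clustering similar values and replacing them with one spelling can be sketched as follows. The clustering method shown here (a normalized "fingerprint" key) is a swapped-in technique chosen for illustration; the source does not say which algorithms the Standardization tool uses, and the function names are hypothetical.

```python
import re
from collections import Counter, defaultdict


def fingerprint(value):
    """Normalize a value: lowercase, drop punctuation, sort unique tokens."""
    tokens = re.sub(r"[^\w\s]", "", value.lower()).split()
    return " ".join(sorted(set(tokens)))


def standardize(values):
    """Map each value to the most frequent spelling in its cluster."""
    clusters = defaultdict(Counter)
    for value in values:
        clusters[fingerprint(value)][value] += 1
    canonical = {
        key: spellings.most_common(1)[0][0] for key, spellings in clusters.items()
    }
    return [canonical[fingerprint(value)] for value in values]


names = ["Acme Inc.", "acme inc", "Acme Inc.", "Inc. Acme", "Globex"]
cleaned = standardize(names)
print(cleaned)
```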

Data Quality Functions

The following functions are available for assessing data quality.

Count functions

The following functions measure counts of values within a column, optionally counted by group. 

  • COUNT Function
  • COUNTA Function
  • COUNTDISTINCT Function
  • UNIQUE Function
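
Counting by group can be illustrated with a short Python sketch. It does not reproduce the exact semantics of the functions listed above, which are documented on their own pages; the three measures shown (rows, non-missing values, distinct values) and the row format are assumptions.

```python
from collections import defaultdict


def grouped_counts(rows, group_key, value_key):
    """Per group: number of rows, non-missing values, and distinct values."""
    stats = defaultdict(lambda: {"rows": 0, "non_missing": 0, "values": set()})
    for row in rows:
        entry = stats[row[group_key]]
        entry["rows"] += 1
        value = row.get(value_key)
        if value not in (None, ""):
            entry["non_missing"] += 1
            entry["values"].add(value)
    return {
        group: {
            "rows": entry["rows"],
            "non_missing": entry["non_missing"],
            "distinct": len(entry["values"]),
        }
        for group, entry in stats.items()
    }


rows = [
    {"region": "east", "sku": "A"},
    {"region": "east", "sku": "A"},
    {"region": "east", "sku": None},
    {"region": "west", "sku": "B"},
]
by_region = grouped_counts(rows, "region", "sku")
print(by_region)
```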

Aggregation functions

  • AVERAGE Function
  • SUM Function
  • MIN Function
  • MAX Function
  • MODE Function
  • MINDATE Function
  • MAXDATE Function
  • MODEDATE Function

Statistical functions - single column

Variations in these functions:

  • Some of these functions have variations that use the sample population method of computation.  
  • IF conditional functions can be used to perform statistical computations based on a condition.
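
The two kinds of variation can be illustrated with the Python standard library. This sketch shows the general difference between population and sample variance, and a statistic restricted by a condition; it does not state which method any particular product function uses, and the column data is invented for the example.

```python
import statistics

amounts = [10.0, 12.0, 14.0, 16.0]
regions = ["east", "west", "east", "west"]

# Population variance divides the squared deviations by n.
population_var = statistics.pvariance(amounts)

# Sample variance divides by n - 1, so it is slightly larger.
sample_var = statistics.variance(amounts)

# Conditional statistic: the mean of only the rows where region is "east".
east_amounts = [a for a, r in zip(amounts, regions) if r == "east"]
east_mean = statistics.mean(east_amounts)

print(population_var, sample_var, east_mean)
```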

General statistics

  • VAR Function
  • STDEV Function
  • MEDIAN Function
  • QUARTILE Function
  • PERCENTILE Function


  • APPROXIMATEMEDIAN Function
  • APPROXIMATEQUARTILE Function
  • APPROXIMATEPERCENTILE Function

Statistical functions - multi-column

  • COVAR Function
  • CORREL Function
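
For reference, covariance and correlation between two columns can be computed as in the following Python sketch. It uses the population form of covariance and the Pearson coefficient; whether the functions listed above use the population or sample form is not stated here, so treat that choice as an assumption.

```python
import math


def covariance(xs, ys):
    """Population covariance of two equal-length numeric columns."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n


def correlation(xs, ys):
    """Pearson correlation coefficient, between -1 and 1."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))


xs = [1, 2, 3, 4]
ys = [2, 4, 6, 8]
print(covariance(xs, ys), correlation(xs, ys))
```

A correlation near 1 or -1 between columns that should be unrelated, or near 0 between columns that should move together, can point to a quality problem worth investigating.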

Data Quality in Job Details

When you run a job and generate results, you can review the quality of the data in the generated output.

Visual profiling

In parallel with executing the job, you can generate a visual profile of the generated results. This visual profile provides graphical representations of the valid and mismatched values against each column's data type, as well as indications about missing values in the output. 

Tip: Visual profiles can be downloaded in PDF or JSON format for offline analysis.

Visual profiling is selected as part of the job definition process. See Run Job Page.

For more information, see Overview of Visual Profiling.

Rules tab

When visual profiling is enabled for your job, the Rules tab in the Job Details page contains the results of the data quality rules for the job's recipes applied across the entire dataset. 

Tip: Data quality rules are available for download in JSON and PDF format. For more information, see Job Details Page.