Page tree

Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.
Comment: Published by Scroll Versions from space DEV and version r088

D toc

Excerpt

This section provides an overview of how data types are managed during import, transformation, and export of data in

D s product
rtrue
.

Terms:

data type is a definition of a set of values, including:

  • Possible set of values (e.g. Dates from 1/1/1400 to 12/31/2599)
  • Meaning of those values (e.g. two-letter codes can represent states of the United States)
  • Set of operations that can be applied to those values (e.g. functions applicable to integers) 

A data format is a representation of the underlying type, which has the same meaning and available operations associated with the data type. For example, the following values are all valid for Datetime data type, but each is represented in a different format:

Code Block
12/31/2021
31-12-2021
December 31, 2021


Info

NOTE: Some data types can be explicitly formatted through functions. Other data types support different formats implicitly through the application.

How Data Types Are Used

In the

D s webapp
, data types are used for the following:

  • Anomaly detection (Is the value valid or invalid?)
  • Suggestions (What are the available transformation suggestions for this column based on its data type?)
  • Standardization (How can all of these valid values be standardized for the column's data type?)
  • Pattern recognition (How to identify different formats in the same column?)

Data Types

D s product
 supports the following categories of data types:

Logical data types

logical data type is a class of values that is understood by native system representations.

Tip

Tip: These types are recognized internally by

D s product
. Each running environment to which
D s product
connections natively supports these logical data types.

These data types have no additional specification requirements:

Data TypeDescription
String Data Type

D excerpt include
pageString Data Type
nopaneltrue

Integer Data Type

D excerpt include
pageInteger Data Type
nopaneltrue

Decimal Data Type

D excerpt include
pageDecimal Data Type
nopaneltrue

Boolean Data Type

D excerpt include
pageBoolean Data Type
nopaneltrue

Datetime Data Type

D excerpt include
pageDatetime Data Type
nopaneltrue

Object Data Type

D excerpt include
pageObject Data Type
nopaneltrue

Array Data Type

D excerpt include
pageArray Data Type
nopaneltrue

Formatting differences may apply. For example, 

D s product
 may recognize Y and N as Boolean data type, while other systems may not.

Complex data types

complex data type typically is defined by applying additional restrictions on String data type values to define the class of possible values. For example, 

D s product
 supports a Gender data type, which validates values such as MF and male and female as Gender data type. 

The following are the complex data types supported by 

D s product
.

Data TypeDescription
Social Security Number Data Type

D excerpt include
pageSocial Security Number Data Type
nopaneltrue

Phone Number Data Type

D excerpt include
pagePhone Number Data Type
nopaneltrue

Email Address Data Type

D excerpt include
pageEmail Address Data Type
nopaneltrue

Credit Card Data Type

D excerpt include
pageCredit Card Data Type
nopaneltrue

Gender Data Type

D excerpt include
pageGender Data Type
nopaneltrue

Zip Code Data Type

D excerpt include
pageZip Code Data Type
nopaneltrue

State Data Type

D excerpt include
pageState Data Type
nopaneltrue

IP Address Data Type

D excerpt include
pageIP Address Data Type
nopaneltrue

URL Data Type

D excerpt include
pageURL Data Type
nopaneltrue

HTTP Code Data Type

D excerpt include
pageHTTP Code Data Type
nopaneltrue

Complex data types are typically defined based on a regular expression applied to String values. For example, the following regex is used to define the accepted values for the Email Address data type:

Code Block
      "regexes": [
        "^[a-z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-z0-9.-]+(?:\\.[a-z0-9-]+)*\\.[a-z]{2,}$"
      ],


Logical and complex data types

Data typeCategoryInternal data typeNotes
String Data TypelogicalStringThe default data type. Any non-empty/non-value is valid for String data type.
Integer Data TypelogicalIntUse NUMFORMAT Function to format these values. Underlying values are not modified.
Decimal Data TypelogicalFloatUse NUMFORMAT Function to format these values. Underlying values are not modified.
Boolean Data TypelogicalBool
Datetime Data TypelogicalDatetimeUse DATEFORMAT Function to format these values. Underlying values are not modified.
Object Data TypelogicalMap/Object
Array Data TypelogicalArray
Social Security Number Data Type

complex

StringString data type constrained by a regular expression.
Phone Number Data Type
complex
StringString data type constrained by a regular expression.
Email Address Data Type
complex
StringString data type constrained by a regular expression.
Credit Card Data Type

complex

StringString data type constrained by a regular expression.
Gender Data Type
complex
StringString data type constrained by a regular expression.
Zip Code Data Type
complex
StringString data type constrained by a regular expression.
State Data Type
complex
StringString data type constrained by a regular expression.
IP Address Data Type
complex
StringString data type constrained by a regular expression.
URL Data Type
complex
StringString data type constrained by a regular expression.
HTTP Code Data Type
complex
StringString data type constrained by a regular expression.


Types in Source Data

Depending on the source system, imported data may be typed to

D s item
itemdata types
 according to one of the following methods. 

Tip

Tip: For each method, 

D s product
 attempts to map the source data to one of the above data types. For schematized sources, however, you may wish to use the original data types in the source. Optionally, you can choose to disable the mapping of source to internal data type. See "Type Inference" below.

Schematized files

Some file formats, such as Avro or Parquet, are stored in a non-readable format. Part of the metadata associated with the file is information identifying the schema of the file. A schema represents the data types and other constraints of individual columns. It can be read independently of the data in the source table.  

Schematized tables

For relational sources, schema information is typically stored with the table. This schema information defines data type validation within the datastore and can be read independently from the source table.

Tip

Tip: Database schemas can be used to define a class of tables to ensure consistency within a database.

Inferred data types

In most cases, an imported data source is assigned a data type for each column based on a review of a subset of the data. For example, a CSV file contains no information about the data types of individual columns. The data types for each column must be assigned by 

D s product
. This process is called type inference. For more information, see "Type Inference" below.

Type Inference

By default, the 

D s webapp
 applies type inference for imported data. The application attempts to infer a column's appropriate data type in the application based on a review of the first lines in the sample.

Tip

Tip: In many programming languages, a column must be explicitly "type cast" to a data type as part of a functional operation.

D s lang
handles this typecasting for you through the process of type inference.


Info

NOTE: Mapping source data types to

D s item
itemdata types
depends on a sufficient number of values that match the criteria of the internal data type. The mapping of import types to internal data types depends on the data.

  • Type inference needs a minimum of 25 rows of data in a column to work consistently.

  • If your dataset has fewer than 20 rows, type inference may not have sufficient data to properly infer the column type.

In some datasets, the first 25 rows may be of a data type that is a subset of the best matching type. For example, if the first 25 rows in the initial same match the Integer data type, the column may be typed as Integer, even if the other 2,000 rows match for the Decimal data type. If the column data type is unmodified:

  • The data is written out from 
    D s product
     as Integer data type. This works for the first 25 rows.
  • The other 2,000 rows are written out as null values, since they do not match the Integer data type. If the source data used decimal notation (e.g. 3.0 in the source), then those values are written out as null values, too.

In this case, it may be easier to disable type inference for this dataset.

Tip

Tip: If you are having trouble getting your imported dataset to map to expected data types, you can disable type inference for the individual dataset. For more information, see Import Data Page.

After data is imported, the 

D s webapp
 provides some mechanisms for applying stronger typecasting to the data. Example:

  • If all input values are double-quoted, then 
    D s product
     evaluates all columns as String type. As a result, type inference cannot be applied. 
  • Since non-String data types cannot be inferred, then the first row cannot be detected as anomalous against the inferred type (String). Column headers cannot be automatically detected from double-quoted source files.
Tip

Tip: The default data type is String. If the

D s webapp
is unable to evaluate a column's data type, the type is mapped to String data type. Within the application, you can use functions to remap the data type or to parse values according to a specified type.

For more information, see "Working with Data Types" below.

Disable type inference

For schematized files or tables, the 

D s webapp
 inference of data type from the source may result in incorrect initial typing of a dataset's columns in the application. As needed, you can disable type inference for the following:

Info

NOTE: When type inference is disabled for imported datasets, it is not disabled within the

D s webapp
. For more information, see "Type inference in the application" below.


  • Disable for individual files: In the Import Data page, select the file. In the right-hand column, click Edit Settings. For more information, see File Import Settings.
  • Disable for individual tables: In the Import Data page, select the table. In the right-hand column, click Edit Settings. For more information, see Relational Table Settings.


  • Disable for individual connections: In the Connections page, edit the connection. In the Edit Connection window, select Disabled under Default Column Data Type Inference. By default, all datasets through this connection have type inference disabled. For more information, see Create Connection Window.

Type inference in the application

Within the 

D s webapp
, column data types may be re-inferred based on your activities in the Transformer page: 

Info

NOTE: Disabling type inference does not disable the re-inference of types in the Transformer page.

The following general actions may result in column data types being re-inferred:

  • After a sample is taken, column data types are inferred based on the first set of rows in the sample.
  • If a transform or function is provided with a data type that does not match the expected input data type, the values are typecast to the expected output, so you may see changes to the data type of the output to better align with the function.
  • Multi-dataset operations generally do not cause re-inferring of data types. However, if there is a mismatch of data types between two columns in a union operation, for example, the data type of the first dataset is preferred.

Working with Data Types

After data has been imported, you can remap individual column types through recipe steps. For more information, see Change Column Data Type.

Data types in the grid

When a sample is loaded, the data types and their formats for each column are inferred by default. Data types and formatting information is displayed for each column in the Transformer page.


D caption
Column header example

At the top of each column, you can review graphical representations of type information:

  • Data type indicator: To the left of the column name, you can see a graphic of the data type. In the above, the data type is set to Zip code.

    Tip

    Tip: Select the data type indicator to change the column to a different data type. This change is added as a step in your recipe. See "Changing the data type."

  • Data quality bar: Below the column name, you can see a bar indicating the relative percentage of valid, invalid and empty values in the column, compared to the listed data type.
    • Green: Valid for the data type
    • Red: Invalid for the data type
    • Gray: empty or null

      Tip

      Tip: Select one of the colored bars to be prompted by a set of transformation suggestions that can be applied to the selected set of values.

  • Column histogram: Below the data quality bar, you can see the distribution of values within the column. The column histogram may represent the data in different ways, depending on the column's data type.

    Tip

    Tip: Click or SHIFT-click values in the histogram to be prompted for transformation suggestions that can be applied to the selected values.

For more information, see Data Grid Panel

For more information, see Column Menus.

Changing the data type

Change the data type through the data type menu at the top of a column.

Tip

Tip: For some types, such as Datetime type, you must select a data format when you are selecting the type. See below.

Info

NOTE: Changing the data type may not change the underlying logical type. For example, if you change a String column to a Gender column, the underlying data is still stored as String values.

See Change Column Data Type.

Changing type across multiple columns

To change the data type for multiple columns, you can a transformation similar to the following, which changes the data type from the reqProdId column to the prodC column and all columns in between:

D trans
RawWrangletrue
p03ValueString
Typestep
WrangleTextsettype col: reqProdId~prodC type: String
p01NameColumns
p01ValueRange
p02NameColumn list
p02ValuereqProdId~prodC
p03NameNew type
SearchTermChange column data type

Changing the data format

You can use the following functions to apply formatting on top of a column of a specified data type.  For example, depending on your locale, numbers may require different formatting for use of the decimal point and the digit separator.

Info

NOTE: When you apply a formatting function to a column, the data appears in the specified format in the 

D s webapp
, but the underlying data is unmodified. Formatting changes appear as a step in your recipe and are applied to the generated results.


Formatting FunctionApplicable Data TypeDescription
NUMFORMAT FunctionInteger, Decimal

D excerpt include
pageNUMFORMAT Function
nopaneltrue

DATEFORMAT FunctionDatetime

D excerpt include
pageDATEFORMAT Function
nopaneltrue

UNIXTIMEFORMAT FunctionDatetime

D excerpt include
pageUNIXTIMEFORMAT Function
nopaneltrue

Type functions

The 

D s webapp
 provides a set of functions for managing types. 

Validation functions

These functions can be used to test for valid or invalid values against a specific data type.

FunctionDescription
VALID Function

D excerpt include
pageVALID Function
nopaneltrue

ISMISMATCHED Function

D excerpt include
pageISMISMATCHED Function
nopaneltrue

IFVALID Function

D excerpt include
pageIFVALID Function
nopaneltrue

IFMISMATCHED Function

D excerpt include
pageIFMISMATCHED Function
nopaneltrue


Parsing functions

These functions can be used to parse String values against a specific data type.

FunctionDescription
PARSEINT Function

D excerpt include
pagePARSEINT Function
nopaneltrue

PARSEFLOAT Function

D excerpt include
pagePARSEFLOAT Function
nopaneltrue

PARSEBOOL Function

D excerpt include
pagePARSEBOOL Function
nopaneltrue

PARSEDATE Function

D excerpt include
pagePARSEDATE Function
nopaneltrue

PARSEARRAY Function

D excerpt include
pagePARSEARRAY Function
nopaneltrue

PARSEOBJECT Function

D excerpt include
pagePARSEOBJECT Function
nopaneltrue

PARSESTRING Function

D excerpt include
pagePARSESTRING Function
nopaneltrue

Managing null and empty values

These functions allow you to generate or test for missing or null values.

FunctionDescription
ISNULL Function

D excerpt include
pageISNULL Function
nopaneltrue

NULL Function

D excerpt include
pageNULL Function
nopaneltrue

IFNULL Function

D excerpt include
pageIFNULL Function
nopaneltrue

ISMISSING Function

D excerpt include
pageISMISSING Function
nopaneltrue

IFMISSING Function

D excerpt include
pageIFMISSING Function
nopaneltrue

Type Conversions on Export

The 

D s webapp
 attempts to map the data types that you have specified to match data types in the target platform.

Info

NOTE: Values that do not match the data type of the target system for a column are subject to the method by which the target system handles mismatches. Rows could be dropped. Values can be rendered as null values. You should attempt to verify that all columns have valid values before generating results.


Info

NOTE: Missing or null values may be treated differently between target systems. Additionally, if these systems feed downstream systems, those systems may have independent rules for managing missing or null values.

D generate child excerpts
pagesType Conversions
heading3

For more information, see Type Conversions.