This section provides an overview of how data types are managed during import, transformation, and export of data in .
A data type is a definition of a set of values, including:
- Possible set of values (e.g. Dates from 1/1/1400 to 12/31/2599)
- Meaning of those values (e.g. two-letter codes can represent states of the United States)
- Set of operations that can be applied to those values (e.g. functions applicable to integers)
A data format is a representation of the underlying type, which has the same meaning and available operations associated with the data type. For example, the following values are all valid for Datetime data type, but each is represented in a different format:
December 31, 2021
NOTE: Some data types can be explicitly formatted through functions. Other data types support different formats implicitly through the application.
How Data Types Are Used
, data types are used for the following:
- Anomaly detection (Is the value valid or invalid?)
- Suggestions (What are the available transformation suggestions for this column based on its data type?)
- Standardization (How can all of these valid values be standardized for the column's data type?)
- Pattern recognition (How to identify different formats in the same column?)
supports the following categories of data types:
Logical data types
A logical data type is a class of values that is understood by native system representations.
Tip: These types are recognized internally by . Each running environment to which connections natively supports these logical data types.
These data types have no additional specification requirements:
Formatting differences may apply. For example, may recognize
N as Boolean data type, while other systems may not.
Complex data types
A complex data type typically is defined by applying additional restrictions on String data type values to define the class of possible values. For example,
supports a Gender data type, which validates values such as
as Gender data type.
The following are the complex data types supported by
Complex data types are typically defined based on a regular expression applied to String values. For example, the following regex is used to define the accepted values for the Email Address data type:
Logical and complex data types
Types in Source Data
Depending on the source system, imported data may be typed to
according to one of the following methods.
Tip: For each method, attempts to map the source data to one of the above data types. For schematized sources, however, you may wish to use the original data types in the source. Optionally, you can choose to disable the mapping of source to internal data type. See "Type Inference" below.
Some file formats, such as Avro or Parquet, are stored in a non-readable format. Part of the metadata associated with the file is information identifying the schema of the file. A schema represents the data types and other constraints of individual columns. It can be read independently of the data in the source table.
For relational sources, schema information is typically stored with the table. This schema information defines data type validation within the datastore and can be read independently from the source table.
Tip: Database schemas can be used to define a class of tables to ensure consistency within a database.
Inferred data types
In most cases, an imported data source is assigned a data type for each column based on a review of a subset of the data. For example, a CSV file contains no information about the data types of individual columns. The data types for each column must be assigned by
. This process is called type inference. For more information, see "Type Inference" below.
By default, the
applies type inference for imported data. The application attempts to infer a column's appropriate data type in the application based on a review of the first lines in the sample.
Tip: In many programming languages, a column must be explicitly "type cast" to a data type as part of a functional operation. handles this typecasting for you through the process of type inference.
NOTE: Mapping source data types to depends on a sufficient number of values that match the criteria of the internal data type. The mapping of import types to internal data types depends on the data.
In some datasets, the first 25 rows may be of a data type that is a subset of the best matching type. For example, if the first 25 rows in the initial same match the Integer data type, the column may be typed as Integer, even if the other 2,000 rows match for the Decimal data type. If the column data type is unmodified:
- The data is written out from as Integer data type. This works for the first 25 rows.
- The other 2,000 rows are written out as null values, since they do not match the Integer data type. If the source data used decimal notation (e.g.
3.0 in the source), then those values are written out as null values, too.
In this case, it may be easier to disable type inference for this dataset.
Tip: If you are having trouble getting your imported dataset to map to expected data types, you can disable type inference for the individual dataset. For more information, see Import Data Page.
After data is imported, the
provides some mechanisms for applying stronger typecasting to the data. Example:
- If all input values are double-quoted, then evaluates all columns as String type. As a result, type inference cannot be applied.
- Since non-String data types cannot be inferred, then the first row cannot be detected as anomalous against the inferred type (String). Column headers cannot be automatically detected from double-quoted source files.
Tip: The default data type is String. If the is unable to evaluate a column's data type, the type is mapped to String data type. Within the application, you can use functions to remap the data type or to parse values according to a specified type.
For more information, see "Working with Data Types" below.
Disable type inference
For schematized files or tables, the
inference of data type from the source may result in incorrect initial typing of a dataset's columns in the application. As needed, you can disable type inference for the following:
NOTE: When type inference is disabled for imported datasets, it is not disabled within the . For more information, see "Type inference in the application" below.
- Disable for individual files: In the Import Data page, select the file. In the right-hand column, click Edit Settings. For more information, see File Import Settings.
- Disable for individual tables: In the Import Data page, select the table. In the right-hand column, click Edit Settings. For more information, see Relational Table Settings.
- Disable for individual connections: In the Connections page, edit the connection. In the Edit Connection window, select
Disabled under Default Column Data Type Inference. By default, all datasets through this connection have type inference disabled. For more information, see Create Connection Window.
Type inference in the application
, column data types may be re-inferred based on your activities in the Transformer page:
NOTE: Disabling type inference does not disable the re-inference of types in the Transformer page.
The following general actions may result in column data types being re-inferred:
- After a sample is taken, column data types are inferred based on the first set of rows in the sample.
- If a transform or function is provided with a data type that does not match the expected input data type, the values are typecast to the expected output, so you may see changes to the data type of the output to better align with the function.
- Multi-dataset operations generally do not cause re-inferring of data types. However, if there is a mismatch of data types between two columns in a union operation, for example, the data type of the first dataset is preferred.
Working with Data Types
After data has been imported, you can remap individual column types through recipe steps. For more information, see Change Column Data Type.
Data types in the grid
When a sample is loaded, the data types and their formats for each column are inferred by default. Data types and formatting information is displayed for each column in the Transformer page.
Column header example
At the top of each column, you can review graphical representations of type information:
Data type indicator: To the left of the column name, you can see a graphic of the data type. In the above, the data type is set to Zip code.
Tip: Select the data type indicator to change the column to a different data type. This change is added as a step in your recipe. See "Changing the data type."
- Data quality bar: Below the column name, you can see a bar indicating the relative percentage of valid, invalid and empty values in the column, compared to the listed data type.
- Green: Valid for the data type
- Red: Invalid for the data type
Gray: empty or null
Tip: Select one of the colored bars to be prompted by a set of transformation suggestions that can be applied to the selected set of values.
Column histogram: Below the data quality bar, you can see the distribution of values within the column. The column histogram may represent the data in different ways, depending on the column's data type.
Tip: Click or SHIFT-click values in the histogram to be prompted for transformation suggestions that can be applied to the selected values.
For more information, see Data Grid Panel.
For more information, see Column Menus.
Changing the data type
Change the data type through the data type menu at the top of a column.
Tip: For some types, such as Datetime type, you must select a data format when you are selecting the type. See below.
NOTE: Changing the data type may not change the underlying logical type. For example, if you change a String column to a Gender column, the underlying data is still stored as String values.
See Change Column Data Type.
Changing type across multiple columns
To change the data type for multiple columns, you can a transformation similar to the following, which changes the data type from the
reqProdId column to the
prodC column and all columns in between:
|WrangleText||settype col: reqProdId~prodC type: String|
|SearchTerm||Change column data type|
Changing the data format
You can use the following functions to apply formatting on top of a column of a specified data type. For example, depending on your locale, numbers may require different formatting for use of the decimal point and the digit separator.
NOTE: When you apply a formatting function to a column, the data appears in the specified format in the , but the underlying data is unmodified. Formatting changes appear as a step in your recipe and are applied to the generated results.
The provides a set of functions for managing types.
These functions can be used to test for valid or invalid values against a specific data type.
These functions can be used to parse String values against a specific data type.
Managing null and empty values
These functions allow you to generate or test for missing or null values.
Type Conversions on Export
attempts to map the data types that you have specified to match data types in the target platform.
NOTE: Values that do not match the data type of the target system for a column are subject to the method by which the target system handles mismatches. Rows could be dropped. Values can be rendered as null values. You should attempt to verify that all columns have valid values before generating results.
NOTE: Missing or null values may be treated differently between target systems. Additionally, if these systems feed downstream systems, those systems may have independent rules for managing missing or null values.
|D generate child excerpts|
For more information, see Type Conversions.