Page tree

Trifacta SaaS



Contents:


   

Contents:


This section provides an overview of how data types are managed during import, transformation, and export of data in Trifacta® SaaS.

Terms:

data type is a definition of a set of values, including:

  • Possible set of values (e.g. Dates from 1/1/1400 to 12/31/2599)
  • Meaning of those values (e.g. two-letter codes can represent states of the United States)
  • Set of operations that can be applied to those values (e.g. functions applicable to integers) 

A data format is a representation of the underlying type, which has the same meaning and available operations associated with the data type. For example, the following values are all valid for Datetime data type, but each is represented in a different format:

12/31/2021
31-12-2021
December 31, 2021


NOTE: Some data types can be explicitly formatted through functions. Other data types support different formats implicitly through the application.

How Data Types Are Used

In the Trifacta application, data types are used for the following:

  • Anomaly detection (Is the value valid or invalid?)
  • Suggestions (What are the available transformation suggestions for this column based on its data type?)
  • Standardization (How can all of these valid values be standardized for the column's data type?)
  • Pattern recognition (How to identify different formats in the same column?)

Data Types

Trifacta SaaS supports the following categories of data types:

Logical data types

logical data type is a class of values that is understood by native system representations.

Tip: These types are recognized internally by Trifacta SaaS. Each running environment to which Trifacta SaaS connections natively supports these logical data types.

These data types have no additional specification requirements:

Data TypeDescription
String Data Type

Any non-null value can be typed as String. A String can be anything.

Integer Data Type

The Integer data type applies to positive and negative numeric values that have no decimal point. 

Decimal Data Type

Decimal data type applies to floating points up to 15 digits in length.

  • In the Trifacta application, this data type is referenced as Decimal.
  • In storage, this data type is written as Double.

Boolean Data Type

The Boolean data type expresses true or false values.

Datetime Data Type

Trifacta® SaaS supports a variety of Datetime formats, each of which has additional variations to it.

Object Data Type

An Object data type is a method for encoding key-value pairs. A single field value may contain one or more sets of key-value pairs.

Array Data Type

An array is a list of values grouped into a single value. An array may be of variable length; in one record the array field may contain two elements, while in the next record, it contains six elements. 

Formatting differences may apply. For example,  Trifacta SaaS may recognize Y and N as Boolean data type, while other systems may not.

Complex data types

complex data type typically is defined by applying additional restrictions on String data type values to define the class of possible values. For example,  Trifacta SaaS supports a Gender data type, which validates values such as MF and male and female as Gender data type. 

The following are the complex data types supported by  Trifacta SaaS.

Data TypeDescription
Social Security Number Data Type

This data type is applied to numeric data following the pattern for United States Social Security numbers.

Phone Number Data Type

This data type is applied to numeric data following common patterns that express telephone numbers and known valid phone numbers in the United States.

Email Address Data Type

This data type matches text values that are properly formatted email addresses.

Credit Card Data Type

Credit card numbers are numeric data that follow the 14-digit or 16-digit patterns for credit cards.

Gender Data Type

This data type matches a variety of text patterns for expressing male/female distinctions.

Zip Code Data Type

This data type matches five- and nine-digit U.S. zipcode patterns.

State Data Type

State data type is applied to data that uses the full names or the two-letter abbreviations for states in the United States.

IP Address Data Type

The IP Address data type supports IPv4 address.

URL Data Type

URL data type is applied to data that follows generalized patterns of URLs.

HTTP Code Data Type

Values of these data types are three-digit numeric values, which correspond to recognized HTTP Status Codes.

Complex data types are typically defined based on a regular expression applied to String values. For example, the following regex is used to define the accepted values for the Email Address data type:

      "regexes": [
        "^[a-z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-z0-9.-]+(?:\\.[a-z0-9-]+)*\\.[a-z]{2,}$"
      ],


Logical and complex data types

Data typeCategoryInternal data typeNotes
String Data TypelogicalStringThe default data type. Any non-empty/non-value is valid for String data type.
Integer Data TypelogicalIntUse NUMFORMAT Function to format these values. Underlying values are not modified.
Decimal Data TypelogicalFloatUse NUMFORMAT Function to format these values. Underlying values are not modified.
Boolean Data TypelogicalBool
Datetime Data TypelogicalDatetimeUse DATEFORMAT Function to format these values. Underlying values are not modified.
Object Data TypelogicalMap/Object
Array Data TypelogicalArray
Social Security Number Data Type

complex

StringString data type constrained by a regular expression.
Phone Number Data Type
complex
StringString data type constrained by a regular expression.
Email Address Data Type
complex
StringString data type constrained by a regular expression.
Credit Card Data Type

complex

StringString data type constrained by a regular expression.
Gender Data Type
complex
StringString data type constrained by a regular expression.
Zip Code Data Type
complex
StringString data type constrained by a regular expression.
State Data Type
complex
StringString data type constrained by a regular expression.
IP Address Data Type
complex
StringString data type constrained by a regular expression.
URL Data Type
complex
StringString data type constrained by a regular expression.
HTTP Code Data Type
complex
StringString data type constrained by a regular expression.


Types in Source Data

Depending on the source system, imported data may be typed to Trifacta data types according to one of the following methods. 

Tip: For each method,  Trifacta SaaS attempts to map the source data to one of the above data types. For schematized sources, however, you may wish to use the original data types in the source. Optionally, you can choose to disable the mapping of source to internal data type. See "Type Inference" below.

Schematized files

Some file formats, such as Avro or Parquet, are stored in a non-readable format. Part of the metadata associated with the file is information identifying the schema of the file. A schema represents the data types and other constraints of individual columns. It can be read independently of the data in the source table.  

Schematized tables

For relational sources, schema information is typically stored with the table. This schema information defines data type validation within the datastore and can be read independently from the source table.

Tip: Database schemas can be used to define a class of tables to ensure consistency within a database.

Inferred data types

In most cases, an imported data source is assigned a data type for each column based on a review of a subset of the data. For example, a CSV file contains no information about the data types of individual columns. The data types for each column must be assigned by  Trifacta SaaS. This process is called type inference. For more information, see "Type Inference" below.

Type Inference

By default, the Trifacta application applies type inference for imported data. The application attempts to infer a column's appropriate data type in the application based on a review of the first lines in the sample.

Tip: In many programming languages, a column must be explicitly "type cast" to a data type as part of a functional operation. Wrangle handles this typecasting for you through the process of type inference.


NOTE: Mapping source data types to Trifacta data types depends on a sufficient number of values that match the criteria of the internal data type. The mapping of import types to internal data types depends on the data.

  • Type inference needs a minimum of 25 rows of data in a column to work consistently.

  • If your dataset has fewer than 20 rows, type inference may not have sufficient data to properly infer the column type.

In some datasets, the first 25 rows may be of a data type that is a subset of the best matching type. For example, if the first 25 rows in the initial same match the Integer data type, the column may be typed as Integer, even if the other 2,000 rows match for the Decimal data type. If the column data type is unmodified:

  • The data is written out from  Trifacta SaaS as Integer data type. This works for the first 25 rows.
  • The other 2,000 rows are written out as null values, since they do not match the Integer data type. If the source data used decimal notation (e.g. 3.0 in the source), then those values are written out as null values, too.

In this case, it may be easier to disable type inference for this dataset.

Tip: If you are having trouble getting your imported dataset to map to expected data types, you can disable type inference for the individual dataset. For more information, see Import Data Page.

After data is imported, the Trifacta application provides some mechanisms for applying stronger typecasting to the data. Example:

  • If all input values are double-quoted, then  Trifacta SaaS evaluates all columns as String type. As a result, type inference cannot be applied. 
  • Since non-String data types cannot be inferred, then the first row cannot be detected as anomalous against the inferred type (String). Column headers cannot be automatically detected from double-quoted source files.

Tip: The default data type is String. If the Trifacta application is unable to evaluate a column's data type, the type is mapped to String data type. Within the application, you can use functions to remap the data type or to parse values according to a specified type.

For more information, see "Working with Data Types" below.

Disable type inference

For schematized files or tables, the Trifacta application inference of data type from the source may result in incorrect initial typing of a dataset's columns in the application. As needed, you can disable type inference for the following:

NOTE: When type inference is disabled for imported datasets, it is not disabled within the Trifacta application. For more information, see "Type inference in the application" below.


  • Disable for individual files: In the Import Data page, select the file. In the right-hand column, click Edit Settings. For more information, see File Import Settings.
  • Disable for individual tables: In the Import Data page, select the table. In the right-hand column, click Edit Settings. For more information, see Relational Table Settings.


  • Disable for individual connections: In the Connections page, edit the connection. In the Edit Connection window, select Disabled under Default Column Data Type Inference. By default, all datasets through this connection have type inference disabled. For more information, see Create Connection Window.

Type inference in the application

Within the Trifacta application, column data types may be re-inferred based on your activities in the Transformer page: 

NOTE: Disabling type inference does not disable the re-inference of types in the Transformer page.

The following general actions may result in column data types being re-inferred:

  • After a sample is taken, column data types are inferred based on the first set of rows in the sample.
  • If a transform or function is provided with a data type that does not match the expected input data type, the values are typecast to the expected output, so you may see changes to the data type of the output to better align with the function.
  • Multi-dataset operations generally do not cause re-inferring of data types. However, if there is a mismatch of data types between two columns in a union operation, for example, the data type of the first dataset is preferred.

Working with Data Types

After data has been imported, you can remap individual column types through recipe steps. For more information, see Change Column Data Type.

Data types in the grid

When a sample is loaded, the data types and their formats for each column are inferred by default. Data types and formatting information is displayed for each column in the Transformer page.


Figure: Column header example

At the top of each column, you can review graphical representations of type information:

  • Data type indicator: To the left of the column name, you can see a graphic of the data type. In the above, the data type is set to Zip code.

    Tip: Select the data type indicator to change the column to a different data type. This change is added as a step in your recipe. See "Changing the data type."

  • Data quality bar: Below the column name, you can see a bar indicating the relative percentage of valid, invalid and empty values in the column, compared to the listed data type.
    • Green: Valid for the data type
    • Red: Invalid for the data type
    • Gray: empty or null

      Tip: Select one of the colored bars to be prompted by a set of transformation suggestions that can be applied to the selected set of values.

  • Column histogram: Below the data quality bar, you can see the distribution of values within the column. The column histogram may represent the data in different ways, depending on the column's data type.

    Tip: Click or SHIFT-click values in the histogram to be prompted for transformation suggestions that can be applied to the selected values.

For more information, see Data Grid Panel

For more information, see Column Menus.

Changing the data type

Change the data type through the data type menu at the top of a column.

Tip: For some types, such as Datetime type, you must select a data format when you are selecting the type. See below.

NOTE: Changing the data type may not change the underlying logical type. For example, if you change a String column to a Gender column, the underlying data is still stored as String values.

See Change Column Data Type.

Changing type across multiple columns

To change the data type for multiple columns, you can a transformation similar to the following, which changes the data type from the reqProdId column to the prodC column and all columns in between:

Transformation Name Change column data type
Parameter: Columns Range
Parameter: Column list reqProdId~prodC
Parameter: New type String

Changing the data format

You can use the following functions to apply formatting on top of a column of a specified data type.  For example, depending on your locale, numbers may require different formatting for use of the decimal point and the digit separator.

NOTE: When you apply a formatting function to a column, the data appears in the specified format in the Trifacta application, but the underlying data is unmodified. Formatting changes appear as a step in your recipe and are applied to the generated results.


Formatting FunctionApplicable Data TypeDescription
NUMFORMAT FunctionInteger, Decimal

Formats a numeric set of values according to the specified number formatting. Source values can be a literal numeric value, a function returning a numeric value, or reference to a column containing an Integer or Decimal values.

DATEFORMAT FunctionDatetime

Formats a specified Datetime set of values according to the specified date format. Source values can be a reference to a column containing Datetime values.

UNIXTIMEFORMAT FunctionDatetime

Formats a set of Unix timestamps according to a specified date formatting string.

Type functions

The Trifacta application provides a set of functions for managing types. 

Validation functions

These functions can be used to test for valid or invalid values against a specific data type.

FunctionDescription
VALID Function

Tests whether a set of values is valid for a specified data type and is not a null value.

ISMISMATCHED Function

Tests whether a set of values is not valid for a specified data type.

IFVALID Function

The IFVALID function writes out a specified value if the input expression matches the specified data type. Otherwise, it writes the source value. Input can be a literal, a column reference, or a function.

IFMISMATCHED Function

The IFMISMATCHED function writes out a specified value if the input expression does not match the specified data type or typing array. Otherwise, it writes the source value. Input can be a literal, a column reference, or a function.


Parsing functions

These functions can be used to parse String values against a specific data type.

FunctionDescription
PARSEINT Function

Evaluates a String input against the Integer datatype. If the input matches, the function outputs an Integer value. Input can be a literal, a column of values, or a function returning String values.

PARSEFLOAT Function

Evaluates a String input against the Decimal datatype. If the input matches, the function outputs a Decimal value. Input can be a literal, a column of values, or a function returning String values.

PARSEBOOL Function

Evaluates a String input against the Boolean datatype. If the input matches, the function outputs a Boolean value. Input can be a literal, a column of values, or a function returning String values.

PARSEDATE Function

Evaluates an input against the default input formats or (if specified) an array of Datetime format strings in their listed order. If the input matches one of the formats, the function outputs a Datetime value.

PARSEARRAY Function

Evaluates a String input against the Array datatype. If the input matches, the function outputs an Array value. Input can be a literal, a column of values, or a function returning String values.

PARSEOBJECT Function

Evaluates a String input against the Object datatype. If the input matches, the function outputs an Object value. Input can be a literal, a column of values, or a function returning String values.

PARSESTRING Function

Evaluates an input against the String datatype. If the input matches, the function outputs a String value. Input can be a literal, a column of values, or a function returning values. Values can be of any data type.

Managing null and empty values

These functions allow you to generate or test for missing or null values.

FunctionDescription
ISNULL Function

The ISNULL function tests whether a column of values contains null values. For input column references, this function returns true or false.

NULL Function

The NULL function generates null values.

IFNULL Function

The IFNULL function writes out a specified value if the source value is a null. Otherwise, it writes the source value. Input can be a literal, a column reference, or a function.

ISMISSING Function

The ISMISSING function tests whether a column of values is missing or null. For input column references, this function returns true or false.

IFMISSING Function

The IFMISSING function writes out a specified value if the source value is a null or missing value. Otherwise, it writes the source value. Input can be a literal, a column reference, or a function.

Type Conversions on Export

The Trifacta application attempts to map the data types that you have specified to match data types in the target platform.

NOTE: Values that do not match the data type of the target system for a column are subject to the method by which the target system handles mismatches. Rows could be dropped. Values can be rendered as null values. You should attempt to verify that all columns have valid values before generating results.


NOTE: Missing or null values may be treated differently between target systems. Additionally, if these systems feed downstream systems, those systems may have independent rules for managing missing or null values.

Type Conversions

Item Description
Redshift Data Type Conversions
Snowflake Data Type Conversions
DB2 Data Type Conversions
Oracle Data Type Conversions
MySQL Data Type Conversions
Postgres Data Type Conversions
Avro Data Type Conversions
Salesforce Data Type Conversions
SharePoint Data Type Conversions
SQL Server Data Type Conversions

For more information, see Type Conversions.

This page has no comments.