Page tree

Trifacta Dataprep



Contents:

If you licensed Dataprep by Trifacta before Oct. 14, 2020, you are using the Dataprep by Trifacta Legacy product edition. On October 14, 2022, this product edition will be decommissioned by Google and will be no longer available for use. Current customers of this product edition are encouraged to transition to one of the product editions hosted by Trifacta. See Product Editions.

   

Contents:


NOTE: The Trifacta® data types listed in this page reflect the raw data type of the converted column. Depending on the contents of the column, the Transformer Page may re-infer a different data type, when a dataset using this type of source is loaded.

Access/Read

When a BigQuery data type is imported, its data type is remapped according to the following table.

Tip: Data precision may be lost during conversion. You may want to generate min and max values and compute significant digits for values in your BigQuery tables and then compute the same in Dataprep by Trifacta.

Source Data TypeSupported?

Trifacta Data Type

Notes
StringYString
BytesYString
IntegerYInteger
FloatYDecimal
BooleanYBoolean
TimestampYDatetimeMillisecond and microsecond precision in BigQuery is dropped during the import process.
DateYDatetime

If imported data contains microsecond precision, values are handled differently depending on the area of the product:

  • Initial sampling, import: Uses the BigQuery API, which can return microseconds.
  • All other areas of the product, including job execution: Uses the Dataflow SDK, which returns milliseconds.
TimeYDatetime
DatetimeYDatetime
ArrayYArray
RecordYObject

Publish/Write

NOTE: Trifacta data types not listed here are written as string types in the target database.

Trifacta Data Type
BigQuery Data TypeNotes
StringString
IntegerInteger
DecimalFloat
BooleanBoolean
Datetime
See below.
Datetime (time data only)
See below.
ArrayArray for primitive data typesString

for other data types


See below.
ObjectString

Publishing Datetime values to BigQuery

NOTE: Publishing of microseconds to BigQuery is not supported.

Depending on several factors, Datetime values in  Dataprep by Trifacta are published to BigQuery in different ways. Factors include:

  • Method of publication:
    • create/drop, which creates a new table
    • append/truncate/merge, which updates an existing table
  • Date format of the Trifacta source

NOTE: The following publishing behaviors assume that the default behavior of strict rules for type matching has been enabled on the publishing action for the target BigQuery table. Otherwise, Datetime values are always published as Datetime. For more information on strict type matching, see BigQuery Table Settings.

Datetime for creating new tables

Trifacta Datetime value

BigQuery target table typeCreate and drop operations
Datetime with date information onlyDateDate
Datetime with time information onlyTimeTime
DatetimeDatetimeDatetime
Datetime with timestamp information onlyTimestampTimestamp
Datetime with date information only and no timezoneTimestamp (no timezone)Timestamp appends 1970-01-01

Datetime for updating existing tables

For append, truncate, and merge operations, these updates depend on the target table type in BigQuery

Trifacta Datetime value

Date table typeTime table typeDatetime table typeTimestamp table type
Datetime with date information onlyDateINVALIDAppends 00:00:00Appends UTC, 00:00:00
Datetime with time information onlyINVALIDTimeAppends 1970-01-01Appends UTC, 1970-01-01
DatetimeTruncates Time infoTruncates Date infoDatetimeAppends UTC
Datetime with timestamp information onlyTruncates Time and Timezone infoTruncates Date and Timezone infoTruncates Timezone infoTimestamp
Datetime with date information only and no timezoneTimestamp appends 1970-01-01

Truncates Timezone

Appends Date and truncates Timezone infoAppends UTC, 1970-01-01

Publishing limitations for Trifacta Photon:

When publishing to BigQuery for a job executed on Trifacta Photon, the following limitations apply:

  • A staging CSV file is created for the job, and publication to BigQuery happens through a separate, internal Publish job. Since a CSV has no declared schema, the schema must be retrieved from the target table or tables, where possible.
  • When writing to multiple BigQuery target tables: 
    • The Trifacta application assumes that the schema of all tables is the same.
    • The Trifacta application selects one table to represent the schema for all, including representative datetime formats.
    • If a target table does not exist, then  Dataprep by Trifacta typecasting is applied:
      • For create and drop actions, a new table is creating using  Dataprep by Trifacta typecasting.
      • For append, truncate, and merge actions, the table may or may not exist.

Supported Datetime formats

Dataprep by Trifacta supports the following Datetime formats for writing dates to BigQuery:

yy-dd-mm
yy-mm-dd
dd-mm-yy
mm-dd-yy
dd-mm
mm-dd
mm-yy

For the last three formats in the above list, additional information is written to BigQuery, where appropriate:

FormatWritten as
dd-mm1970-dd-mm
mm-dd1970-mm-dd
mm-yyyy-mm-01

Publishing errors

Strict data type matching is enabled by default when publishing to BigQuery. When strictly matching, job runs result in the following:

Big Query data type matchingResult

Publishing Array type columns to BigQuery

You can publish Dataprep by Trifacta array type column in BigQuery as an array type. For more information, see BigQuery Table Settings.

NOTE: Dataprep by Trifacta array type columns in datasets that were imported before Release 9.2 are still published as String type. This is a known issue. As a workaround, you can create a new imported dataset from the same source to publish those columns as BigQuery arrays.

In BigQuery,  an array is defined by the field Type along with the Mode. The following is the schema of array columns in BigQuery:

Field NameTypeMode

int_array

INTEGERREPEATED
float_arrayFLOATREPEATED
string_arraySTRINGREPEATED
bool_arrayBOOLEANREPEATED

You can publish the following primitive type columns as Arrays in BigQuery: 

  • Integer

  • Float 

  • Boolean

  • String

Other complex types of Arrays are published as String values.

INVALID Trifacta type matching with BigQuery data type

Error
Data is appended or truncated on publicationWarning
Data is written as-is to BigQueryValid

This page has no comments.