Page tree

 

Contents:


The Data Service enables the Trifacta® platform to stream metadata and records from JDBC sources for sampling and job execution in the Trifacta Photon running environment. This section describes how to enable and configure the service, including performance tweaks and connection-specific configuration. 

Configure Service

The following basic properties enable the service and specify basic location for it.

You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.

PropertyDescription
"data-service.enabled"

When true, the data service is enabled.

NOTE: When set to false, access to any relational connection is prevented.

Default is true .

"data-service.host"Hostname for the service. Default is localhost.
"data-service.port"

Port number used by the service. Default is 41912 .

NOTE: If you are changing the port number, avoid creating conflicts with existing ports in use. For more information, see System Ports.


"data-service.classpath"The Java class path for the data service.
"data-service.autoRestart"When true, the data service is automatically restarted if it crashes. Default is true .
"data-service.vendorPath"

Path to the vendor configuration files for relational connections. Default value:

%(topOfTree)s/services/data-service/build/conf/vendor

Configure connection pools

The Trifacta platform uses a connection pool mechanism for managing individual connections between the platform and its relational datasources. As needed, you can modify the following file to configure global connection pool limits:

NOTE: Do not modify this file unless you are experiencing connection issues across multiple relational connections.

/opt/trifacta/services/data-service/build/conf/c3po.properties

NOTE: You should create a backup of this file before you modify it. To apply changes to the file, make your edits, save the file, and restart the platform.

PropertyDescription
c3p0.maxIdleTimeMaximum number of seconds that a connection should remain idle. Default is 100.
c3p0.initialPoolSizeInitial number of connections per pool. Default is 1.
c3p0.minPoolSizeMinimum number of connections per pool. Default is 1.
c3p0.maxPoolSizeMaximum number of connections per pool. Default is 10.
c3p0.maxStatementsMaximum number of statements per pool. Default is 180.
c3p0.acquireRetryAttemptsNumber of re-tries before connection breaks. Default is 1.
c3p0.acquireIncrementNumber of allowed parallel calls when a pool is exhausted. Default is 3.
c3p0.numHelperThreadsFor connection pool optimization, this value specifies the number of cleanup threads. Default is 3.

Configure SQL Options  

Configure relational read stream limits

You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.

The Data Service reads data from relational sources in streams of records. You can modify the following parameters to configure the limits of SQL record streaming during read operations. The size of these streams are defined by the following parameters:

"data-service.sqlOptions.maxReadStreamRecords": -1,
"data-service.sqlOptions.limitedReadStreamRecords": 1000000,
"data-service.sqlOptions.initialReadStreamRecords": 25,
"data-service.sqlOptions.hiveReadStreamRecords": 100000000,
PropertyDescription
"data-service.sqlOptions.maxReadStreamRecords"

The maximum number of JDBC records pulled in per stream read during batch execution.

If this value is set to -1, then no limit is applied.

"data-service.sqlOptions.limitedReadStreamRecords"

Max number of records read for the initial sample and quick scan sampling. Setting to -1 means there is no limit.

"data-service.sqlOptions.initialReadStreamRecords"Initial number of records to read for client-side preview and for client-side transform. Set to -1 to apply no limit.
"data-service.sqlOptions.hiveReadStreamRecords"

Max number of records that can be read from Hive, if maxReadStreamRecords is -1.

NOTE: This value cannot be set to -1, which results in a Data Service error. Hive reads must be limited.

VARCHAR string length max

By default, when the Trifacta platform publishes to one of the following relational systems, String types are published to VARCHAR columns with a maximum length of 256 characters. This setting applies to the following relational systems:

Relational DBMaximum string length
Hive65,535
Redshift65,535
SQL DW

8000

NOTE: For string values that exceed this length in the Trifacta platform, the value is truncated to this limit on export.

NOTE: This setting applies to the data service, which is used for publication to all three systems. Performance may be impacted across all three systems if you raise this value.

As needed, you can change the maximum permitted length of strings published from the Trifacta platform to VARCHAR columns.

Steps:

  1. You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.
  2. Modify the following property:

    "data-service.sqlOptions.stringSizeInBytes": 256,
  3. Save your changes and restart the platform.

Configure Caching

The data service maintains a cache of JDBC objects that have been retrieved for use. You can configure the following properties to tune settings of the cache.

You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.

PropertyDescription
"data-service.cacheOptions.validationDelayMilliseconds"Number of milliseconds to wait between checks validating cached pools. Default is 3600000 (1 hour).
"data-service.cacheOptions.maxSize"

Maximum number of objects in the cache. Default is 100.

NOTE: Set this value to 0 to disable data service caching.

"data-service.cacheOptions.expirySeconds"Objects in the cache that are older than this number of seconds are automatically expired. Default is 86400 (1 day).

Configure for Specific Integrations

Configure Data Service for Hive

The following properties apply to how the platform connects to Hive.

You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.

PropertyDescription
"data-service.hiveManagedTableFormat"Managed table format for your Hive deployment. Default is PARQUET.
"data-service.hiveJdbcJar"Path to the JAR to use for JDBC connectivity to Hive. Default path depends on your Hadoop distribution.

Configure Oracle ciphers for SSL connections

For SSL connections, Oracle requires that a set of ciphers be passed as part of the connection string to determine the algorithm to use for the connection. This cipher list can be modified in the following file:

/opt/trifacta/conf/data-service/application.properties

NOTE: Do not modify this cipher list unless you are experiencing difficulties making Oracle SSL connections work or are aware of new ciphers missing from this list. For more information, see https://docs.oracle.com/cd/B19306_01/java.102/b14355/sslthin.htm#BABDJBAE.

com.trifacta.dataservice.oracle.ciphers=(SSL_RSA_WITH_3DES_EDE_CBC_SHA,SSL_RSA_WITH_RC4_128_SHA,SSL_RSA_WITH_RC4_128_MD5,
SSL_RSA_WITH_DES_CBC_SHA,SSL_DH_anon_WITH_3DES_EDE_CBC_SHA,SSL_DH_anon_WITH_RC4_128_MD5,SSL_DH_anon_WITH_DES_CBC_SHA,
SSL_RSA_EXPORT_WITH_RC4_40_MD5,SSL_RSA_EXPORT_WITH_DES40_CBC_SHA)

Optimize JDBC fetch size by vendor

You can optimize the maximum number of records to fetch per read for each database vendor. Edit the application.properties file from the previous section.

The following are the defaults:

com.trifacta.dataservice.trifacta.postgresql.fetchSize=1000
com.trifacta.dataservice.trifacta.redshift.fetchSize=1000

You can add settings for individual JDBC connections using the following structure:

com.trifacta.dataservice.trifacta.VENDOR_NAME.fetchSize=1000

where:

  • VENDOR_NAME is the internal name of the vendor. For more information on these names, see Connection Types.

 

Configure Vendor Field Mappings

As needed, you can configure the mappings between fields from specific vendor sources and the Trifacta data types. These mappings can be tweaked as new fields become available or as requirements change. These mappings are defined in the following file:

/opt/trifacta/conf/data-service/application.properties

NOTE: This file applies only to the relational connections that are natively supported by the platform. For other relational connections, field mappings are maintained in their individual configuration files.

The following JDBC connections are supported natively in the platform:

  • Hive
  • PostgreSQL
  • SQL Server
  • Teradata
  • Oracle

  • Redshift
  • SQL Datawarehouse

NOTE: You should create a backup of this file before you modify it. To apply changes to the file, make your edits, save the file, and restart the platform.

Disable vendor-specific connections

If needed, you can disable the creation of any connections of the following listed types by setting these parameters to false . When a connection type is disabled, no user of the platform can create a connection of that type.

  • Hive: com.trifacta.dataservice.trifacta.sources.hive.enabled=true
  • PostgreSQL: com.trifacta.dataservice.trifacta.sources.postgres.enabled=true
  • SQL Server: com.trifacta.dataservice.trifacta.sources.sqlserver.enabled=true
  • Teradata: com.trifacta.dataservice.trifacta.sources.teradata.enabled=true
  • Oracle: com.trifacta.dataservice.trifacta.sources.oracle.enabled=true

  • Redshift: com.trifacta.dataservice.trifacta.sources.redshift.enabled=true
  • SQL Datawarehouse: com.trifacta.dataservice.trifacta.sources.sqldatawarehouse.enabled=true

 


Whitelists and greylists

For each of the natively supported relational connection types, a set of fields have been specified for import in two separate lists: whitelist and greylist.

NOTE: These lists are maintained in this file only for the specified listed types. Other relational connections maintain their type lists in their configuration files.

 

  • Whitelist: This comma-separated list identifies the fields that are imported from the source as their native type. Whitelisted types include more typing information and are more likely to match a type in the Trifacta platform.
  • Greylist: This comma-separated list identifies the fields that are imported as String type in the Trifacta platform. Greylisted fields are imported into the platform, but they are likely to require some cleanup in the application.

    Tip: When importing a new data type, you should try to import it as on the greylist first. If it doesn't look correct as a basic string, you can try bumping to the whitelist, but this runs the risk of crashing the whole load for some types. In general, greylisting is safer.

  • Unlisted: Unlisted types are not imported into the Trifacta platform.

You can add or remove fields from the whitelist and graylist by modifying the appropriate property in the file. Below, you can see the whitelist and greylist for the Oracle connection type:

com.trifacta.dataservice.oracle.whitelist=NUMBER,BINARY_FLOAT,BINARY_DOUBLE,DATE,CLOB,NCLOB,RAW,LONG RAW,CHAR,VARCHAR,VARCHAR2,NCHAR,NVARCHAR2,FLOAT,HTTPURITYPE,XDBURITYPE,DBURITYPE,URITYPE
com.trifacta.dataservice.oracle.greylist=TIMESTAMP,TIMESTAMP WITH TIME ZONE,TIMESTAMP WITH LOCAL TIME ZONE,INTERVAL YEAR TO MONTH,INTERVAL DAY TO SECOND,INTERVALYM,INTERVALDS

Data type mappings for publishing

When publishing results from the Trifacta application to a relational target, the  Trifacta platform utilizes a set of field mappings from Trifacta fields to equivalent fields in the target. 

The following table identifies the Trifacta type and its corresponding list of comma-separated fields. In the right column, the asterisk (*) is a wildcard for the target name (e.g. oracle).  

Tip: As the target systems support additional fields, you can add them to the appropriate list for publication. You must acquire the vendor-specific string for each field to add from the vendor's documentation.

NOTE: If there is no mapping between the Trifacta type to a target field type, the data for the field is written as String type.

Trifacta type

Property List of Fields
Integer
com.trifacta.dataservice.*.publish.type.integer
String
com.trifacta.dataservice.*.publish.type.string
Boolean
com.trifacta.dataservice.*.publish.type.bool
Decimal (float)
com.trifacta.dataservice.*.publish.type.float
Datetime
com.trifacta.dataservice.*.publish.type.datetime

or

com.trifacta.dataservice.*.publish.type.time

Logging

For more information on logging for the service, see Configure Logging for Services.

Other Topics

  • If you are reading large datasets from relational sources, you can enable JDBC ingestion, which reads source data in the background and stages on the backend datastore for execution. For more information, see Configure JDBC Ingestion
  • Optionally, SSO authentication can be applied to relational connections. For more information, see Enable SSO for Relational Connections.

This page has no comments.