Configure Data Source Caching

This section describes some of the configuration options for the data source caching feature. When data is read from the source, the Designer Cloud Powered by Trifacta platform can populate a global or user-specific cache with ingested objects. These objects can be sourced from:

- JDBC tables, which are ingested as part of running jobs
- Excel data, which must be converted to CSV format and ingested
- PDF table data, which must be converted to CSV format and ingested

After initial ingest, cached objects can be referenced later for faster performance on tasks such as sampling and job execution.

Limitations

JDBC ingest caching is not supported for Hive.

Enable

To enable JDBC ingestion and performance caching, the following parameter must be enabled.

You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.

Parameter Name	Description
feature.jdbcIngestionCaching.enabled	Enables caching of ingested JDBC data. When disabled, no caching of JDBC data sources is performed. Note: `webapp.connectivity.ingest.enabled` must also be set to `true` to enable JDBC caching.
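
As a rough illustration of how the two settings relate, the Python sketch below expands the dotted parameter names into a nested structure and prints it as JSON. The idea that dotted names correspond to nested keys in trifacta-conf.json is an assumption made for this example, and `expand_dotted` is a hypothetical helper, not part of the platform. Verify the actual layout against your own configuration file.

```python
import json


def expand_dotted(settings):
    """Expand dotted parameter names into a nested dict (hypothetical helper)."""
    nested = {}
    for dotted_name, value in settings.items():
        node = nested
        *parents, leaf = dotted_name.split(".")
        for key in parents:
            # Create intermediate levels on demand.
            node = node.setdefault(key, {})
        node[leaf] = value
    return nested


# Both parameter names come from the table above.
# JDBC caching requires both to be set to true.
jdbc_caching_settings = {
    "webapp.connectivity.ingest.enabled": True,
    "feature.jdbcIngestionCaching.enabled": True,
}

nested_settings = expand_dotted(jdbc_caching_settings)
print(json.dumps(nested_settings, indent=2))
```

Applying the change through the Admin Settings Page remains the recommended method; the sketch is only meant to show which two parameters must agree.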

Configure

In the following sections, you can review the available configuration parameters for performance caching.

You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.

Configure Storage

When files are ingested, they are stored in one of the following locations:

If caching is enabled:
- If the global datasource cache is enabled: files are stored in a user-specific sub-folder of the path indicated by the following parameter: hdfs.pathsConfig.globalDatasourceCache
- If the global cache is disabled: files are stored in a sub-folder of the output area for each user, named: /.datasourceCaches.
If caching is disabled: files are stored in a sub-folder within the jobs area for the job group. Ingested files are stored as .trifacta files.
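
The decision described above can be sketched as a small Python function. This is only an illustration of the documented rules, not the platform's implementation: the function name, its arguments, the example paths, and the use of the user name as the user-specific sub-folder are all assumptions. The sub-folder used inside the jobs area is not named in this section, so the sketch returns the job group path itself.

```python
def ingest_storage_location(caching_enabled, use_global_cache,
                            global_cache_path, user_output_path,
                            job_group_path, user_name):
    """Return the directory where ingested files would land (illustrative only)."""
    if caching_enabled:
        if use_global_cache:
            # User-specific sub-folder of the global cache path.
            # Naming the sub-folder after the user is an assumption.
            return f"{global_cache_path.rstrip('/')}/{user_name}"
        # Per-user cache inside that user's output area.
        return f"{user_output_path.rstrip('/')}/.datasourceCaches"
    # Caching disabled: files go somewhere under the job group's jobs area.
    # The exact sub-folder name is not specified, so none is appended here.
    return job_group_path.rstrip("/")


# Hypothetical paths, for illustration only.
location = ingest_storage_location(
    caching_enabled=True,
    use_global_cache=False,
    global_cache_path="/example/globalCache",
    user_output_path="/example/output/alice",
    job_group_path="/example/jobs/42",
    user_name="alice",
)
print(location)
```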

Note

Whenever a job is run, its source files must be re-ingested. If two or more datasets in the same job run share the same source, only one copy of the source is ingested.

Additional information is provided below.

Global or user cache

Parameter	Description
datasourceCaching.useGlobalDatasourceCache	When set to `true`, the platform uses the global data source cache location for storing cached ingest data. Note: When global caching is enabled, data is still stored in individual locations per user. Through the application, users cannot access the cached objects stored for other users. When set to `false`, the platform uses the output directory for each user for storing cached ingest data. Within the output directory, cached data is stored in the `.datasourceCaches` directory. Note: Verify that each user's output directory has sufficient storage for the maximum cache size as well as any projected uploaded datasets.
hdfs.pathsConfig.globalDataSourceCache	Specifies the path of the global datasource cache, if it is enabled. Specify the path from the root folder of the backend datastore. Tip: This setting applies to HDFS or other backend datastores.

Cache sizing

Parameter	Description
datasourceCaching.refreshThreshold	The number of hours that an object can be cached. If the object has not been refreshed in that period of time, the next request for the datasource collects fresh data from the source. By default, this value is set to `168` (one week).
datasourceCaching.maxSize	Maximum size in bytes of the datasource cache. This value applies to individual user caches when either global or user-specific caching is enabled.
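
To make the two sizing parameters concrete, here is a minimal Python sketch. It assumes that a cached object counts as stale once more than `datasourceCaching.refreshThreshold` hours have passed since its last refresh; the exact boundary behavior is not stated here, and the function names are hypothetical. The second helper simply converts a size in GiB to the byte count that `datasourceCaching.maxSize` expects.

```python
from datetime import datetime, timedelta

# Default from the table above: 168 hours (one week).
DEFAULT_REFRESH_THRESHOLD_HOURS = 168


def needs_refresh(last_refreshed, now,
                  threshold_hours=DEFAULT_REFRESH_THRESHOLD_HOURS):
    """Return True if the cached object is older than the threshold.

    Treating "older than" as a strict comparison is an assumption.
    """
    return now - last_refreshed > timedelta(hours=threshold_hours)


def gib_to_bytes(gib):
    """Convert GiB to bytes, the unit used by datasourceCaching.maxSize."""
    return gib * 1024 ** 3


now = datetime(2024, 1, 8, 12, 0)
stale = needs_refresh(datetime(2024, 1, 1, 0, 0), now)   # cached 7.5 days ago
fresh = needs_refresh(datetime(2024, 1, 5, 0, 0), now)   # cached 3.5 days ago
max_size_bytes = gib_to_bytes(10)                        # example value only
print(stale, fresh, max_size_bytes)
```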

Logging

Parameter Name	Description
data-service.systemProperties.logging.level	When the logging level is set to `debug`, log messages on JDBC caching are recorded in the data service log. Note: Use this setting for debugging only, as the log files can grow quite large. Lower the logging level after the issue has been debugged. The logged cache messages are described below.

When the logging level is set to debug for the data service and caching is enabled, cache messages are logged. These messages include:

- Cache hits and misses
- Cache key generation
