Configure Data Source Caching
This section describes some of the configuration options for the data source caching feature. When data is read from the source, the Designer Cloud Powered by Trifacta platform can populate a global or user-specific cache with ingested objects. These objects can be sourced from:
JDBC tables, which are ingested as part of running jobs
Excel data, which must be converted to CSV format and ingested
PDF table data, which must be converted to CSV format and ingested
After initial ingest, cached objects can be referenced later for faster performance on tasks such as sampling and job execution.
Limitations
JDBC ingest caching is not supported for Hive.
Enable
To enable JDBC ingestion and performance caching, the following parameter must be enabled.
You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.
Parameter Name | Description |
---|---|
feature.jdbcIngestionCaching.enabled | Enables caching of ingested JDBC data. Note: When disabled, no caching of JDBC data sources is performed. |
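If you edit trifacta-conf.json directly instead of using the Admin Settings Page, the change could be scripted roughly as follows. This is a minimal sketch, not a platform utility: it assumes that a dotted parameter name corresponds to nested JSON objects in the file, and the helper name `set_dotted` and the sample configuration are hypothetical.

```python
import json


def set_dotted(config: dict, dotted_key: str, value) -> dict:
    """Set a value in a nested dict using a dotted parameter name.

    Assumption: a name such as a.b.c corresponds to config["a"]["b"]["c"].
    """
    node = config
    *parents, leaf = dotted_key.split(".")
    for part in parents:
        # Create intermediate objects if they do not exist yet.
        node = node.setdefault(part, {})
    node[leaf] = value
    return config


# Hypothetical, minimal stand-in for the contents of trifacta-conf.json.
config = {"feature": {"jdbcIngestionCaching": {"enabled": False}}}
set_dotted(config, "feature.jdbcIngestionCaching.enabled", True)
print(json.dumps(config, indent=2))
```

In practice you would load the real file with `json.load`, apply the change, and write it back. Back up the file first.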
Configure
In the following sections, you can review the available configuration parameters for performance caching.
You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.
When files are ingested, they are stored in one of the following locations:
If caching is enabled:
If the global datasource cache is enabled: files are stored in a user-specific sub-folder of the path indicated by the following parameter: hdfs.pathsConfig.globalDatasourceCache
If the global cache is disabled: files are stored in a sub-folder of the output area for each user, named: /.datasourceCaches
If caching is disabled: files are stored in a sub-folder within the jobs area for the job group. Ingested files are stored as .trifacta files.
Note
Whenever a job is run, its source files must be re-ingested. If two or more datasets in the same job run share the same source, only one copy of the source is ingested.
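The storage rules above can be summarized as a small decision function. This is an illustrative sketch only: the `CacheSettings` class, the function name, the example paths, and the choice to name the user-specific sub-folder after the user are all assumptions, not documented platform behavior.

```python
from dataclasses import dataclass
from posixpath import join


@dataclass
class CacheSettings:
    caching_enabled: bool    # assumed to mirror feature.jdbcIngestionCaching.enabled
    use_global_cache: bool   # datasourceCaching.useGlobalDatasourceCache
    global_cache_path: str   # value of the global datasource cache path parameter


def ingest_location(settings: CacheSettings, user_name: str,
                    user_output_dir: str, job_group_dir: str) -> str:
    """Return the base location where ingested files would be stored."""
    if not settings.caching_enabled:
        # The sub-folder name within the job group area is not specified,
        # so this sketch returns the job group area itself.
        return job_group_dir
    if settings.use_global_cache:
        # Assumption: the user-specific sub-folder is named after the user.
        return join(settings.global_cache_path, user_name)
    # User-specific cache: a sub-folder of the user's output area.
    return join(user_output_dir, ".datasourceCaches")


# Hypothetical paths, for illustration only.
settings = CacheSettings(caching_enabled=True, use_global_cache=False,
                         global_cache_path="/example/dsCache")
print(ingest_location(settings, "alice", "/example/output/alice",
                      "/example/jobs/42"))
```

Whichever branch applies, files remain separated per user, which matches the statement that users cannot access objects cached for other users.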
Additional information is provided below.
Parameter | Description |
---|---|
datasourceCaching.useGlobalDatasourceCache | When enabled, ingested files are stored in the global datasource cache. Note: When global caching is enabled, data is still stored in individual locations per user. Through the application, users cannot access the cached objects stored for other users. When disabled, ingested files are stored in each user's output directory. Note: You should verify that there is sufficient storage in each user's output directory to store the maximum cache size as well as any projected uploaded datasets. |
hdfs.pathsConfig.globalDataSourceCache | Specifies the path of the global datasource cache, if it is enabled. Specify the path from the root folder of the backend datastore. Tip: This setting applies to HDFS or other backend datastores. |
Parameter | Description |
---|---|
datasourceCaching.refreshThreshold | The number of hours that an object can be cached. If the object has not been refreshed in that period of time, the next request for the datasource collects fresh data from the source. By default, this value is set to |
| Maximum size in bytes of the datasource cache. This value applies to individual user caches when either global or user-specific caching is enabled. |
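To make the refresh threshold concrete, the following sketch shows how a staleness check based on a number of hours might work. It is an assumption-laden illustration: the function name is hypothetical, the 24-hour threshold is an example value rather than the platform default, and treating an object as stale exactly at the threshold is a choice made for this sketch.

```python
from datetime import datetime, timedelta


def needs_refresh(last_refreshed: datetime, now: datetime,
                  refresh_threshold_hours: float) -> bool:
    """Return True if the cached object is older than the threshold.

    A True result means the next request for the datasource should
    collect fresh data from the source instead of using the cache.
    """
    age = now - last_refreshed
    return age >= timedelta(hours=refresh_threshold_hours)


# Example value only; not the platform default.
threshold_hours = 24
cached_at = datetime(2024, 1, 1, 0, 0)

print(needs_refresh(cached_at, datetime(2024, 1, 1, 12, 0), threshold_hours))
print(needs_refresh(cached_at, datetime(2024, 1, 2, 1, 0), threshold_hours))
```

The first call checks an object that is 12 hours old and the second one that is 25 hours old, so only the second would trigger a fresh read under this sketch.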
Logging
Parameter Name | Description |
---|---|
data-service.systemProperties.logging.level | When the logging level is set to debug and caching is enabled, cache messages are logged. Note: Use this setting for debugging purposes only, as the log files can grow quite large. Lower the setting after the issue has been debugged. See Logging below. |
When the logging level is set to debug for the data service and caching is enabled, cache messages are logged. These messages include:
Cache hits and misses
Cache key generation