This section describes some of the configuration options for the JDBC (relational) ingestion and caching features, which enable execution of large-scale JDBC-based jobs on the Spark running environment.
Data ingestion works by streaming a JDBC source into a temporary storage space in the base storage layer. The job can then be run on Spark, the default running environment. When the job is complete, the temporary data is removed from base storage or retained in the cache (if it is enabled).
Data ingestion happens only for Spark jobs; it does not apply to jobs run on other running environments.
When data is read from the source, the platform can populate a user-specific ingest cache, which is maintained to limit long load times from JDBC sources and to improve overall platform performance.
Cached data is allowed to remain until a predefined expiration limit is reached, after which any new requests for the data are pulled from the source.
NOTE: Expired cache files are not automatically purged at expiration.
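As a rough illustration of this lifecycle, the Python sketch below models the documented ingest-then-cleanup-or-cache behavior. All names (`run_jdbc_job`, `stream_source`, and so on) are illustrative assumptions, not platform code.

```python
import shutil
import tempfile
from pathlib import Path

def run_jdbc_job(stream_source, run_on_spark, caching_enabled, cache_dir):
    """Illustrative sketch of the documented lifecycle, not platform code."""
    # 1. Stream the JDBC source (returned here as bytes) into a temporary
    #    location in base storage.
    temp = Path(tempfile.mkdtemp()) / "ingested-source"
    temp.write_bytes(stream_source())

    # 2. Run the job on Spark against the ingested copy, not the live source.
    result = run_on_spark(temp)

    # 3. On completion, retain the copy in the cache (if enabled) or
    #    remove the temporary data from base storage.
    if caching_enabled:
        shutil.move(str(temp), str(Path(cache_dir) / temp.name))
    else:
        temp.unlink()
    return result
```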
Job Type | JDBC Ingestion Enabled Only | JDBC Ingestion and Caching Enabled
---|---|---
Transformation job | Data is retrieved from the source and stored in a temporary backend location for use in sampling. | Data is retrieved from the source for the job and refreshes the cache where applicable.
Sampling job | Same as for a transformation job. | The cache is first checked for valid data objects; outdated objects are retrieved from the data source, and retrieved data refreshes the cache. As needed, you can force an override of the cache when executing the sample, in which case data is collected directly from the source. See Samples Panel.
Although there is no absolute limit, you should avoid executing jobs on tables larger than a few hundred gigabytes. Larger data sources can significantly impact end-to-end performance.
NOTE: This recommendation applies to all JDBC-based jobs.
Scalability rule of thumb: for a similar edge node, the maximum number of concurrent ingest jobs is:

`max concurrent sources = max cores - cores used for services`
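For example, on a 16-core edge node where 4 cores are reserved for platform services, plan for at most 16 - 4 = 12 concurrent ingest sources. (The core counts here are illustrative only.)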
To enable JDBC ingestion and performance caching, both of the following parameters must be enabled.
NOTE: For new installations, this feature is enabled by default. For customers upgrading to Release 5.1 and later, this feature is disabled by default.
Parameter Name | Description
---|---
`webapp.connectivity.ingest.enabled` | Enables JDBC ingestion. Default is `true`.
`feature.jdbcIngestionCaching.enabled` | Enables caching of ingested JDBC data.
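These dotted parameter names typically correspond to nested keys in the platform's JSON configuration. If you edit the configuration file directly rather than through an administrative settings page, a helper like the following Python sketch can apply them. The file path and editing approach are assumptions; back up the file before changing it.

```python
import json

# Assumed default location of the platform's JSON configuration file.
CONF_PATH = "/opt/trifacta/conf/trifacta-conf.json"

def set_dotted(conf: dict, dotted_key: str, value) -> None:
    """Apply a dotted parameter name (e.g. webapp.connectivity.ingest.enabled)
    to the nested JSON structure."""
    *parents, leaf = dotted_key.split(".")
    node = conf
    for part in parents:
        node = node.setdefault(part, {})
    node[leaf] = value

with open(CONF_PATH) as f:
    conf = json.load(f)

# Enable both parameters required for ingestion and caching.
set_dotted(conf, "webapp.connectivity.ingest.enabled", True)
set_dotted(conf, "feature.jdbcIngestionCaching.enabled", True)

with open(CONF_PATH, "w") as f:
    json.dump(conf, f, indent=2)
```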
In the following sections, you can review the available configuration parameters for JDBC ingestion and JDBC performance caching.
Parameter Name | Description
---|---
`batchserver.workers.ingest.max` | Maximum number of ingester threads that can run concurrently on the node.
`batchserver.workers.ingest.bufferSizeBytes` | Memory buffer size while copying to backend storage. A larger buffer yields fewer network calls, which in rare cases may speed up ingest.
`batch-job-runner.cleanup.enabled` | Clean up after job, which deletes the ingested data from backend storage. Default is
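For example, to use a 16 MB copy buffer, you would set `batchserver.workers.ingest.bufferSizeBytes` to 16777216 (16 × 1024 × 1024 bytes). The specific value here is illustrative only.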
When files are ingested, they are stored in one of the following locations:

- If the global datasource cache is enabled: under the path specified by `hdfs.pathsConfig.globalDatasourceCache`, in a `.datasourceCache` directory.
- If the global cache is disabled: in a user-specific `.datasourceCache` directory among the user's `.trifacta` files.

NOTE: Whenever a job is run, its source files must be re-ingested. If two or more datasets in the same job run share the same source, only one copy of the source is ingested.
Additional information is provided below.
Parameter | Description
---|---
`datasourceCaching.useGlobalDatasourceCache` | When set to `true`, a single global cache is shared by all users. When set to `false`, each user maintains an individual, user-specific cache.
`hdfs.pathsConfig.globalDatasourceCache` | Specifies the path of the global datasource cache, if it is enabled. Specify the path from the root folder of HDFS, S3, or other backend datastore.
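For example, setting `hdfs.pathsConfig.globalDatasourceCache` to a path such as `/trifacta/datasourceCache` (a hypothetical value) places every user's cached objects under that single root on the base storage layer.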
Parameter | Description
---|---
`datasourceCaching.refreshThreshold` | The number of hours that an object can be cached. If the object has not been refreshed in that period of time, the next request for the datasource collects fresh data from the source. By default, this value is set to
`datasourceCaching.maxSize` | Maximum size in bytes of the datasource cache. This value applies to individual user caches when either global or user-specific caching is enabled.
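As a rough model of how these two limits interact, the following Python sketch checks whether a cached object is still usable. The threshold and size values are arbitrary examples, and the logic is an assumption rather than the platform's implementation.

```python
from datetime import datetime, timedelta

REFRESH_THRESHOLD = timedelta(hours=24)  # example stand-in for datasourceCaching.refreshThreshold
MAX_CACHE_BYTES = 1 * 1024**3            # example stand-in for datasourceCaching.maxSize (1 GB)

def cached_object_is_usable(cached_at: datetime, current_cache_bytes: int) -> bool:
    """Return True if the cached copy may be served instead of re-ingesting."""
    is_fresh = datetime.utcnow() - cached_at < REFRESH_THRESHOLD
    within_limit = current_cache_bytes <= MAX_CACHE_BYTES
    return is_fresh and within_limit

# Example: an object cached 30 hours ago is stale, so the next request
# pulls fresh data from the source.
print(cached_object_is_usable(datetime.utcnow() - timedelta(hours=30), 512 * 1024**2))  # False
```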
Parameter Name | Description
---|---
`data-service.systemProperties.logging.level` | When the logging level is set to `debug`, log messages on JDBC ingestion and caching are recorded in the data service log. See Logging below.
You can use the following methods to track progress of ingestion jobs:

- API: You can track `jobType=ingest` jobs through the API endpoints. See API JobGroups Get Jobs v3.
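Below is a minimal sketch of polling for ingest jobs via the referenced v3 API. The host, endpoint path, authentication scheme, and response field names are assumptions based on the reference above, not confirmed API details.

```python
import requests

# Hypothetical values: host, port, token, and job group ID are placeholders.
BASE_URL = "http://localhost:3005"
HEADERS = {"Authorization": "Bearer <api-token>"}

# Fetch the jobs for a job group and keep only the ingest jobs.
resp = requests.get(f"{BASE_URL}/v3/jobGroups/<id>/jobs", headers=HEADERS)
resp.raise_for_status()
ingest_jobs = [job for job in resp.json().get("data", [])
               if job.get("jobType") == "ingest"]
for job in ingest_jobs:
    print(job.get("id"), job.get("status"))
```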
Ingest:

During and after an ingest job, you can download the job logs through the Jobs page. See Jobs Page.
Caching:

When the logging level is set to `debug` for the data service and caching is enabled, cache messages are logged. These messages include: