This section describes some of the configuration options for the JDBC (relational) ingestion and caching features, which enable execution of large-scale JDBC-based jobs on the Spark running environment.
Overview
Data ingestion works by streaming a JDBC source into a temporary storage space in the base storage layer. The job can then be run on Spark, the default running environment. When the job is complete, the temporary data is removed from base storage or retained in the cache (if it is enabled).
Data ingestion happens only for Spark jobs. Data ingestion does not apply to Trifacta Photon jobs.
- Data ingestion applies only to JDBC sources that are not native to the running environment. For example, JDBC ingestion is not supported for Hive.
- Data ingestion is supported for HDFS and other large-scale backend datastores.
When data is read from the source, the Designer Cloud Powered by Trifacta® platform can populate a user-specific ingest cache, which is maintained to limit long load times from JDBC sources and to improve overall platform performance.
The cache is retained for a predefined expiration period, after which any new requests for the data are pulled from the source.
NOTE: Expired cache files are not automatically purged at expiration.
- If the cache fails for some reason, the platform falls back to ingest-only mode, and the related job should complete as expected.
Job Type | JDBC Ingestion Enabled only | JDBC Ingestion and Caching Enabled |
---|---|---|
transformation job | Data is retrieved from the source and stored in a temporary backend location for use in sampling. | Data is retrieved from the source for the job and refreshes the cache where applicable. |
sampling job | Same as for transformation jobs. | The cache is first checked for valid data objects. Outdated objects are retrieved from the data source, and the retrieved data refreshes the cache. NOTE: Caching applies only to full scan sampling jobs. Quick scan sampling is performed in the Trifacta Photon web client. If needed, you can force an override of the cache when executing the sample, in which case data is collected directly from the source. See Samples Panel. |
Recommended Table Size
Although there is no absolute limit, you should avoid executing jobs on tables larger than several hundred gigabytes. Larger data sources can significantly impact end-to-end performance.
NOTE: This recommendation applies to all JDBC-based jobs.
Performance
Rule of thumb:
- For a single job with 16 ingest jobs running in parallel, the maximum expected transfer rate is approximately 1 GB per minute.
Scalability:
- One ingest job runs per source; for example, a dataset with three sources results in three ingest jobs.
Rule of thumb for the maximum number of concurrent ingest jobs on a comparable edge node:
max concurrent sources = max cores - cores used for services
For example, an edge node with 24 cores that reserves 8 cores for platform services can support roughly 16 concurrent sources.
- This rule holds until the network becomes a bottleneck. In internal testing, throughput maxed out at about 15 concurrent sources.
- The default is 16 concurrent jobs, with a connection pool size of 10 and a two-minute timeout on the pool. These limits prevent overloading your database.
- Adding more concurrent jobs after the network has become a bottleneck slows down all transfer jobs simultaneously.
- If processing is fully saturated (the number of workers is maxed out):
  - The maximum transfer rate can drop to about 1/3 GB per minute.
  - Ingest waits up to two minutes to acquire a connection. If a connection cannot be acquired within two minutes, the job fails.
- When a job is queued for processing:
  - The job is silently queued and appears to be in progress.
  - The service waits until other jobs complete.
  - Currently, there is no queueing timeout based on the maximum number of concurrent ingest jobs.
Limitations
JDBC ingest caching is not supported for Hive.
Enable
To enable JDBC ingestion and performance caching, both of the following parameters must be enabled.
NOTE: For new installations, this feature is enabled by default. For customers upgrading to Release 5.1 and later, this feature is disabled by default.
You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.
Parameter Name | Description |
---|---|
webapp.connectivity.ingest.enabled | Enables JDBC ingestion. Default is true. |
feature.jdbcIngestionCaching.enabled | Enables caching of ingested JDBC data. NOTE: To use caching, JDBC ingestion must also be enabled. |
feature.enableLongLoading | When enabled, you can monitor the ingestion of long-loading JDBC datasets through the Import Data page. Tip: After a long-loading dataset has been ingested, importing the data and loading it in the Transformer page should perform faster. |
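For reference, the corresponding entries might look like the following in trifacta-conf.json. This is a minimal sketch that assumes the dotted parameter names map onto nested JSON keys; verify the exact structure against your own configuration file. In the Admin Settings page, the same parameters appear under their flat dotted names.

```json
{
  "webapp": {
    "connectivity": {
      "ingest": {
        "enabled": true
      }
    }
  },
  "feature": {
    "jdbcIngestionCaching": {
      "enabled": true
    },
    "enableLongLoading": true
  }
}
```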
Configure
In the following sections, you can review the available configuration parameters for JDBC ingestion and JDBC performance caching.
You can apply these changes through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.
Configure Ingestion
Parameter Name | Description |
---|---|
batchserver.workers.ingest.max | Maximum number of ingester threads that can run on the Designer Cloud Powered by Trifacta platform at the same time. |
batchserver.workers.ingest.bufferSizeBytes | Memory buffer size while copying to backend storage. A larger size for the buffer yields fewer network calls, which in rare cases may speed up ingest. |
batch-job-runner.cleanup.enabled | Cleans up after the job by deleting the ingested data from backend storage. Default is true. NOTE: If JDBC ingestion is disabled, relational source data is not removed from platform backend storage. This feature can be disabled for debugging and should be re-enabled afterward. NOTE: This setting rarely applies if JDBC ingest caching has been enabled. |
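As a sketch, assuming the same dotted-name-to-nested-key mapping in trifacta-conf.json, the ingestion settings above might be expressed as follows. The numeric values are illustrative only, not recommended defaults.

```json
{
  "batchserver": {
    "workers": {
      "ingest": {
        "max": 16,
        "bufferSizeBytes": 1048576
      }
    }
  },
  "batch-job-runner": {
    "cleanup": {
      "enabled": true
    }
  }
}
```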
Configure Storage
When files are ingested, they are stored in one of the following locations:
- If caching is enabled:
  - If the global datasource cache is enabled: files are stored in a user-specific sub-folder of the path indicated by the hdfs.pathsConfig.globalDatasourceCache parameter.
  - If the global cache is disabled: files are stored in a sub-folder named /.datasourceCache within each user's output area.
- If caching is disabled: files are stored in a sub-folder within the jobs area for the job group. Ingested files are stored as .trifacta files.
NOTE: Whenever a job is run, its source files must be re-ingested. If two or more datasets in the same job run share the same source, only one copy of the source is ingested.
Additional information is provided below.
Global or user cache
Parameter | Description |
---|---|
datasourceCaching.useGlobalDatasourceCache | When set to true, the global datasource cache is used. NOTE: When global caching is enabled, data is still stored in individual locations per user. Through the application, users cannot access the cached objects stored for other users. When set to false, each user's cache is stored in the user's output directory. NOTE: You should verify that there is sufficient storage in each user's output directory to store the maximum cache size as well as any projected uploaded datasets. |
hdfs.pathsConfig.globalDataSourceCache | Specifies the path of the global datasource cache, if it is enabled. Specify the path from the root folder of HDFS or other backend datastore. |
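For example, enabling the global cache and pointing it at a shared path might look like this in trifacta-conf.json. This is a sketch only: the nesting is assumed from the dotted names, and the path is a placeholder you should replace with a location in your base storage layer.

```json
{
  "datasourceCaching": {
    "useGlobalDatasourceCache": true
  },
  "hdfs": {
    "pathsConfig": {
      "globalDataSourceCache": "/trifacta/datasourceCache"
    }
  }
}
```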
Cache sizing
Parameter | Description |
---|---|
datasourceCaching.refreshThreshold | The number of hours that an object can be cached. If the object has not been refreshed in that period of time, the next request for the datasource collects fresh data from the source. By default, this value is set to |
datasourceCaching.maxSize | Maximum size in bytes of the datasource cache. This value applies to individual user caches when either global or user-specific caching is enabled. |
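These sizing parameters sit alongside the cache-location settings above. A hedged sketch with illustrative values (a 48-hour refresh threshold and a 1 GB per-user cache, where 1073741824 bytes = 1 GB):

```json
{
  "datasourceCaching": {
    "refreshThreshold": 48,
    "maxSize": 1073741824
  }
}
```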
Logging
Parameter Name | Description |
---|---|
data-service.systemProperties.logging.level | When the logging level is set to debug and caching is enabled, cache messages are logged. NOTE: Use this setting for debug purposes only, as the log files can grow quite large. Lower the setting after the issue has been debugged. See Logging below. |
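To turn on cache-related logging, you might raise the data service logging level to debug, for example as follows. This is a sketch; the exact nesting under systemProperties may differ in your deployment, so confirm against your trifacta-conf.json before applying it.

```json
{
  "data-service": {
    "systemProperties": {
      "logging": {
        "level": "debug"
      }
    }
  }
}
```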
Monitoring Progress
You can use the following methods to track progress of ingestion jobs.
- Through the application: In the Jobs page, you can track the progress of all jobs, including ingestion. Where there are errors, you can download logs for further review.
- See Jobs Page.
- See Logging below.
- Through APIs:
  - You can track the status of jobType=ingest jobs through the API endpoints.
  - From the above endpoint, get the ingest jobId to track progress.
  - See API JobGroups Get Jobs v4.
Logging
Ingest:
During and after an ingest job, you can download the job logs through the Jobs page. Logs include:
- All details including errors
- Progress on ingest transfer
- Record ingestion
See Jobs Page.
Caching:
When the logging level is set to debug for the data service and caching is enabled, cache messages are logged. These messages include:
- Cache hits and misses
- Cache key generation