This section describes configuration options for JDBC (relational) ingestion, which support faster execution of JDBC-based jobs.
Data ingestion works by streaming a JDBC source into a temporary storage space in the base storage layer to stage the data for job execution. The job can then be run on Photon or Spark. When the job is complete, the temporary data is removed from base storage or retained in the cache (if it is enabled).
Data ingestion happens for both Photon and Spark jobs.
Data caching refers to the process of ingesting and storing data sources in the base storage layer for a period of time, so that they can be accessed more quickly if they are needed for additional platform operations.
Tip: Data ingestion and data caching can work together. For more information on data caching, see Configure Data Source Caching.
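As a conceptual sketch of this stage, run, and clean-up flow (not the platform's implementation), the following uses a local temporary directory as a stand-in for base storage; the callables, file layout, and `cache_enabled` flag are assumptions for illustration only.

```python
import shutil
import tempfile
from pathlib import Path

def execute_jdbc_job(read_source, run_job, cache_enabled=False):
    """Illustrative lifecycle of a JDBC-based job (hypothetical, not a platform API).

    read_source: callable that streams rows from the JDBC source.
    run_job:     callable that executes the job against the staged file.
    """
    # 1. Ingest: stream the relational source into temporary storage.
    staging_dir = Path(tempfile.mkdtemp(prefix="jdbc-ingest-"))
    staged_file = staging_dir / "staged.csv"
    staged_file.write_text("\n".join(read_source()))

    try:
        # 2. Execute: the job runs (on Photon or Spark) against the staged data.
        return run_job(staged_file)
    finally:
        # 3. Clean up the staged data, unless caching retains it for reuse.
        if not cache_enabled:
            shutil.rmtree(staging_dir, ignore_errors=True)
```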
Job Type | JDBC Ingestion Enabled only | JDBC Ingestion and Caching Enabled
---|---|---
transformation job | Data is retrieved from the source and stored in a temporary backend location for use in sampling. | Data is retrieved from the source for the job and refreshes the cache where applicable.
sampling job | Same as the previous row. | The cache is first checked for valid data objects. Outdated objects are retrieved from the data source, and the retrieved data refreshes the cache (illustrated in the sketch below). As needed, you can force an override of the cache when executing the sample, in which case data is collected directly from the source. See Samples Panel.
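The sampling behavior above follows a cache-first (cache-aside) pattern. The sketch below is illustrative only; the cache store, freshness check, and source reader shown here are assumptions, not the platform's implementation.

```python
import time

def fetch_for_sample(source_id, cache, read_from_source,
                     ttl_seconds=3600, force_override=False):
    """Cache-first retrieval as described above (illustrative only).

    cache: dict mapping source_id -> (timestamp, data)
    read_from_source: callable that pulls the data from the JDBC source
    """
    entry = cache.get(source_id)
    is_valid = entry is not None and (time.time() - entry[0]) < ttl_seconds

    # Valid cached objects are used directly, unless the user forces an override.
    if is_valid and not force_override:
        return entry[1]

    # Outdated (or overridden) objects are retrieved from the data source,
    # and the retrieved data refreshes the cache.
    data = read_from_source(source_id)
    cache[source_id] = (time.time(), data)
    return data
```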
Although there is no absolute limit, you should avoid executing jobs on tables larger than several hundred gigabytes. Larger data sources can significantly impact end-to-end performance.
NOTE: This recommendation applies to all JDBC-based jobs.
Rule of thumb:
Scalability:
Rule of thumb for max concurrent jobs for a similar edge node:
`max concurrent sources = max cores - cores used for services`
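For example (illustrative numbers only): on a 16-core edge node where 4 cores are reserved for platform services, this rule of thumb yields `16 - 4 = 12` concurrent ingest sources.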
To enable JDBC ingestion and performance caching, the first two parameters in the following table must be enabled. A configuration sketch follows the table.
NOTE: For new installations, this feature is enabled by default. For customers upgrading to Release 5.1 or later, this feature is disabled by default.
Parameter Name | Description
---|---
`webapp.connectivity.ingest.enabled` | Enables JDBC ingestion. Default is `true`.
`feature.jdbcIngestionCaching.enabled` | Enables caching of ingested JDBC data.
`feature.enableLongLoading` | When enabled, you can monitor the ingestion of long-loading JDBC datasets through the Import Data page. Default is
`longloading.addToFlow` | When long-loading is enabled, set this value to `true` to enable monitoring of the ingest process when large relational sources are added to a flow. Default is `true`. See Flow View Page.
`longloading.addToLibrary` | When long-loading is enabled, this parameter enables monitoring of the ingest process when large relational sources are added to the library. Default is
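As a minimal sketch of how these flags might be applied, the following assumes the platform settings are exported as a JSON document; the file name `platform-settings.json` is a placeholder, and the actual mechanism for editing settings varies by deployment.

```python
import json
from pathlib import Path

SETTINGS_FILE = Path("platform-settings.json")  # placeholder path (assumption)

# Flags required for JDBC ingestion and performance caching, per the table above.
REQUIRED_FLAGS = {
    "webapp.connectivity.ingest.enabled": True,
    "feature.jdbcIngestionCaching.enabled": True,
}

settings = json.loads(SETTINGS_FILE.read_text())
settings.update(REQUIRED_FLAGS)
SETTINGS_FILE.write_text(json.dumps(settings, indent=2))
print("Updated:", ", ".join(REQUIRED_FLAGS))
```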
In the following sections, you can review the available configuration parameters for JDBC ingest.
Parameter Name | Description
---|---
`batchserver.workers.ingest.max` | Maximum number of ingester threads that can run on the node.
`batchserver.workers.ingest.bufferSizeBytes` | Memory buffer size while copying to backend storage. A larger buffer yields fewer network calls, which in rare cases may speed up ingest.
`batch-job-runner.cleanup.enabled` | Clean up after the job, which deletes the ingested data from backend storage. Default is
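Because `batchserver.workers.ingest.bufferSizeBytes` takes a raw byte count, it can be convenient to derive it from a more readable unit. The values below are purely illustrative choices for these parameters, not recommendations.

```python
# Illustrative values only; tune against your own workload and hardware.
buffer_mib = 16
ingest_settings = {
    "batchserver.workers.ingest.max": 4,                                      # example thread cap
    "batchserver.workers.ingest.bufferSizeBytes": buffer_mib * 1024 * 1024,   # 16 MiB in bytes
    "batch-job-runner.cleanup.enabled": True,                                 # delete staged data after the job
}
print(ingest_settings)
```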
Parameter Name | Description
---|---
`data-service.systemProperties.logging.level` | Logging level for the data service. When the logging level is set to a more verbose value, ingest-related messages are included in the logs. See Logging below.
You can use the following methods to track progress of ingestion jobs.
- You can track `jobType=ingest` jobs through the API endpoints (a polling sketch follows). See `operation/getJobGroup`.
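The following is a minimal polling sketch, assuming a REST deployment where job groups are readable at a `/v4/jobGroups/{id}` style endpoint with token authentication; the base URL, header names, status values, and response fields used here are assumptions, so verify them against the API reference (`operation/getJobGroup`).

```python
import time
import requests

BASE_URL = "https://example.com"               # assumption: your platform's API base URL
HEADERS = {"Authorization": "Bearer <token>"}  # assumption: token-based auth

def wait_for_ingest_job(job_group_id, poll_seconds=10):
    """Poll a job group until it leaves a running state (field names are assumptions)."""
    while True:
        resp = requests.get(f"{BASE_URL}/v4/jobGroups/{job_group_id}", headers=HEADERS)
        resp.raise_for_status()
        job = resp.json()
        status = job.get("status")
        print(f"jobGroup {job_group_id}: status={status}")
        # Status values here are illustrative; consult the API reference for the real ones.
        if status not in ("Pending", "InProgress"):
            return job
        time.sleep(poll_seconds)
```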
During and after an ingest job, you can download the job logs through the Jobs page. Logs include:
See Jobs Page.