This section describes some of the configuration options for the JDBC (relational) ingestion and caching features, which enable execution of large-scale JDBC-based jobs on the Spark running environment.

Overview

Data ingestion works by streaming a JDBC source into a temporary storage space in the base storage layer. The job can then be run on Spark, the default running environment. When the job is complete, the temporary data is removed from base storage or retained in the cache (if it is enabled).

When data is read from the source, the platform can populate a user-specific ingest cache, which is maintained to limit long load times from JDBC sources and to improve overall platform performance.

Job Type | JDBC Ingestion Enabled only | JDBC Ingestion and Caching Enabled
transformation job | Data is retrieved from the source and stored in a temporary backend location for use in sampling. | Data is retrieved from the source for the job and refreshes the cache where applicable.
sampling job | See previous. | Cache is first checked for valid data objects. Outdated objects are retrieved from the data source. Retrieved data refreshes the cache.

NOTE: Caching applies only to full scan sampling jobs. Quick scan sampling is performed in the web client.

As needed, you can force an override of the cache when executing the sample. In that case, data is collected from the source. See Samples Panel.
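The cache behavior described above for full scan sampling jobs can be sketched roughly as follows. This is an illustration only, not the platform's implementation: the CachedObject class, the load_for_sampling function, the valid flag, and the force_refresh argument are all hypothetical names introduced for this example.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class CachedObject:
    """Hypothetical cache entry; 'valid' is False once the object is outdated."""
    data: bytes
    valid: bool


def load_for_sampling(
    source_id: str,
    cache: Dict[str, CachedObject],
    fetch_from_source: Callable[[str], bytes],
    force_refresh: bool = False,
) -> bytes:
    """Return data for a full scan sample, preferring a valid cached copy."""
    entry = cache.get(source_id)
    # Use the cache only when a valid object exists and no override was requested.
    if entry is not None and entry.valid and not force_refresh:
        return entry.data
    # Missing, outdated, or overridden: collect the data from the source.
    data = fetch_from_source(source_id)
    # The retrieved data refreshes the cache.
    cache[source_id] = CachedObject(data=data, valid=True)
    return data
```

Passing force_refresh=True in this sketch corresponds to forcing an override of the cache when the sample is executed.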

Recommended Table Size

Although there is no absolute limit, you should avoid executing jobs on tables larger than several hundred GB. Larger data sources can significantly impact end-to-end performance.

NOTE: This recommendation applies to all JDBC-based jobs.

Performance

Rule of thumb:

Scalability:

Limitations

Enable

To enable JDBC ingestion and performance caching, both of the following parameters must be enabled.

NOTE: For new installations, this feature is enabled by default. For customers upgrading to Release 5.1 and later, this feature is disabled by default.

Parameter Name | Description
webapp.connectivity.ingest.enabled | Enables JDBC ingestion. Default is true.
feature.jdbcIngestionCaching.enabled | Enables caching of ingested JDBC data. When disabled, no caching of JDBC data sources is performed. NOTE: webapp.connectivity.ingest.enabled must be set to true to enable JDBC caching.
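The dependency between the two parameters can be expressed as a small check. The sketch below assumes the settings have already been loaded into a flat Python dictionary keyed by the dotted parameter names; that representation and the jdbc_caching_active function are assumptions made for illustration, not how the platform stores or evaluates its configuration.

```python
def jdbc_caching_active(settings: dict) -> bool:
    """Return True only when both ingestion and caching are enabled.

    Assumption: 'settings' is a flat dict keyed by dotted parameter names.
    Missing keys are treated as disabled here, which is a choice made for
    this sketch and not a statement about platform defaults.
    """
    ingest_enabled = settings.get("webapp.connectivity.ingest.enabled") is True
    caching_enabled = settings.get("feature.jdbcIngestionCaching.enabled") is True
    # Caching has no effect unless JDBC ingestion is also enabled.
    return ingest_enabled and caching_enabled
```

A check of this kind can be useful after an upgrade, when the feature may be disabled even though it is enabled on new installations.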

Configure

In the following sections, you can review the available configuration parameters for JDBC ingest and JDBC performance caching.

Configure Ingestion

Parameter Name | Description
batchserver.workers.ingest.max | Maximum number of ingester threads that can run at the same time.
batchserver.workers.ingest.bufferSizeBytes | Memory buffer size, in bytes, used while copying to backend storage. A larger buffer yields fewer network calls, which in rare cases may speed up ingest.
batch-job-runner.cleanup.enabled | Cleans up after the job by deleting the ingested data from backend storage. Default is true. NOTE: If JDBC ingestion is disabled, relational source data is not removed from platform backend storage. This feature can be disabled for debugging and should be re-enabled afterward. NOTE: This setting rarely applies if JDBC ingest caching has been enabled.
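To illustrate why a larger value for batchserver.workers.ingest.bufferSizeBytes yields fewer calls, the following sketch copies a stream in fixed-size chunks and counts the writes. It uses in-memory streams as stand-ins for the JDBC source and backend storage; the copy_in_chunks function is hypothetical and does not reflect the ingester's actual code.

```python
import io
from typing import BinaryIO


def copy_in_chunks(src: BinaryIO, dst: BinaryIO, buffer_size_bytes: int) -> int:
    """Copy src to dst in chunks of at most buffer_size_bytes.

    Returns the number of write calls, a stand-in for network calls
    to backend storage in this sketch.
    """
    writes = 0
    while True:
        chunk = src.read(buffer_size_bytes)
        if not chunk:
            break  # end of stream
        dst.write(chunk)
        writes += 1
    return writes


payload = b"x" * 10_000
print(copy_in_chunks(io.BytesIO(payload), io.BytesIO(), 1024))  # prints 10
print(copy_in_chunks(io.BytesIO(payload), io.BytesIO(), 4096))  # prints 3
```

The number of writes is the payload size divided by the buffer size, rounded up, so quadrupling the buffer here cuts the calls from 10 to 3.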


Configure Storage

When files are ingested, they are stored in one of the following locations: 

NOTE: Whenever a job is run, its source files must be re-ingested. If two or more datasets in the same job run share the same source, only one copy of the source is ingested.

Additional information is provided below.

Global or user cache

Parameter | Description
datasourceCaching.useGlobalDatasourceCache | When set to true, the platform uses the global data source cache location for storing cached ingest data. NOTE: When global caching is enabled, data is still stored in individual locations per user. Through the application, users cannot access the cached objects stored for other users. When set to false, the platform uses the output directory for each user for storing cached ingest data. Within the output directory, cached data is stored in the .datasourceCache directory. NOTE: You should verify that there is sufficient storage in each user's output directory to store the maximum cache size as well as any projected uploaded datasets.
hdfs.pathsConfig.globalDataSourceCache | Specifies the path of the global datasource cache, if it is enabled. Specify the path from the root folder of HDFS, S3, or other backend datastore.
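As a rough illustration of how the two settings above select a cache location, consider the sketch below. The resolve_cache_dir function and the example paths are hypothetical. In particular, the per-user subfolder under the global cache path is an assumed layout: the documentation states only that data is stored in individual locations per user, not how those locations are named.

```python
from pathlib import PurePosixPath


def resolve_cache_dir(
    use_global_cache: bool,
    global_cache_path: str,
    user_output_dir: str,
    user_name: str,
) -> PurePosixPath:
    """Pick the directory that would hold a user's cached ingest data."""
    if use_global_cache:
        # Assumed layout: one subfolder per user under the global cache path.
        return PurePosixPath(global_cache_path) / user_name
    # User-specific caching: .datasourceCache inside the user's output directory.
    return PurePosixPath(user_output_dir) / ".datasourceCache"
```

For example, with global caching disabled and a hypothetical output directory of /example/output/alice, the sketch resolves to /example/output/alice/.datasourceCache.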

Cache sizing

Parameter | Description
datasourceCaching.refreshThreshold | The number of hours that an object can be cached. If the object has not been refreshed in that period of time, the next request for the datasource collects fresh data from the source. By default, this value is set to 168 (one week).
datasourceCaching.maxSize | Maximum size in bytes of the datasource cache. This value applies to individual user caches when either global or user-specific caching is enabled.
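The refresh threshold amounts to a simple age comparison, and the maximum size must be expressed in bytes. The sketch below shows both calculations. The needs_refresh and gib_to_bytes helpers are hypothetical, and treating an object that is exactly at the threshold as still fresh is an assumption of this example, since the boundary behavior is not documented here.

```python
from datetime import datetime, timedelta


def needs_refresh(
    last_refreshed: datetime,
    now: datetime,
    refresh_threshold_hours: int = 168,  # default of one week
) -> bool:
    """Return True when the cached object is older than the threshold."""
    # Assumption: an object exactly at the threshold still counts as fresh.
    return now - last_refreshed > timedelta(hours=refresh_threshold_hours)


def gib_to_bytes(gib: int) -> int:
    """Convert gibibytes to bytes, the unit used by datasourceCaching.maxSize."""
    return gib * 1024 ** 3


print(needs_refresh(datetime(2024, 1, 1), datetime(2024, 1, 9)))  # prints True
print(needs_refresh(datetime(2024, 1, 1), datetime(2024, 1, 5)))  # prints False
print(gib_to_bytes(10))  # prints 10737418240
```

With the default of 168 hours, an object last refreshed eight days ago is collected again from the source on the next request, while one refreshed four days ago is served from the cache.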


Logging

Parameter Name | Description
data-service.systemProperties.logging.level | When the logging level is set to debug, log messages on JDBC caching are recorded in the data service log. NOTE: Use this setting for debug purposes only, as the log files can grow quite large. Lower the setting after the issue has been debugged. See Logging below.

Monitoring Progress

You can use the following methods to track progress of ingestion jobs.

Logging

Ingest:

During and after an ingest job, you can download the job logs through the Jobs page. Logs include:

See Jobs Page.

Caching:

When the logging level is set to debug for the data service and caching is enabled, cache messages are logged. These messages include: