Data ingestion happens only for Spark jobs. It does not apply to Photon jobs.
- Data ingestion applies only to JDBC sources that are not native to the running environment. For example, JDBC ingestion is not supported for Hive.
- Supported for HDFS and other large-scale backend datastores.
- For a single job with 16 ingest jobs running in parallel, the maximum expected transfer rate is 1 GB/minute.
- Well suited to tables of hundreds of megabytes. Not well suited to tables that are gigabytes in size.
- One ingest job runs per source, so a dataset with three sources requires three ingest jobs.
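As a rough illustration of the transfer-rate guideline above, the following Python sketch estimates a best-case transfer time. The function name and the conversion of 1 GB = 1024 MB are assumptions made for this example; the 1 GB/minute figure is the maximum expected rate quoted above, so real transfers may be slower.

```python
# Best-case rate from the guideline above: 16 ingest jobs in parallel.
MAX_RATE_GB_PER_MIN = 1.0


def best_case_minutes(total_mb: float,
                      rate_gb_per_min: float = MAX_RATE_GB_PER_MIN) -> float:
    """Return a lower bound, in minutes, on the time to transfer total_mb.

    Assumes 1 GB = 1024 MB (an assumption of this sketch, not of the product).
    """
    if total_mb < 0:
        raise ValueError("total_mb must be non-negative")
    if rate_gb_per_min <= 0:
        raise ValueError("rate_gb_per_min must be positive")
    return (total_mb / 1024) / rate_gb_per_min
```

For example, `best_case_minutes(512)` estimates half a minute for a 512 MB table at the maximum rate.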
Rule of thumb for max concurrent jobs for a similar edge node:
max concurrent sources = max cores - cores used for services
- This rule of thumb holds until the network becomes a bottleneck. Internally, it maxed out at about 15 concurrent sources.
- Defaults: 16 concurrent jobs, a connection pool size of 10, and a 2-minute timeout on the pool. These limits prevent overloading your database.
- Once the network becomes the bottleneck, adding more concurrent jobs slows down all transfer jobs simultaneously.
- If processing is fully saturated (the number of workers is maxed out):
  - The maximum transfer rate can drop to 1/3 GB/minute.
  - Ingest waits up to two minutes to acquire a connection. If a connection cannot be acquired within two minutes, the job fails.
- When a job is queued for processing:
  - The job is silently queued and appears to be in progress.
  - The service waits until other jobs complete.
  - Currently, there is no timeout for queueing based on the maximum number of concurrent ingest jobs.
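The rule of thumb and the connection-pool behavior described above can be sketched in Python as follows. This is an illustrative model only, not the product's implementation: the function, the class, and the `network_cap` parameter are hypothetical names introduced for this example, while the pool size of 10 and the two-minute wait are the defaults stated above.

```python
import threading
from typing import Optional

POOL_SIZE = 10             # default pool size stated above
ACQUIRE_TIMEOUT_S = 120.0  # two-minute wait stated above


def max_concurrent_sources(total_cores: int, service_cores: int,
                           network_cap: Optional[int] = None) -> int:
    """Apply: max concurrent sources = max cores - cores used for services.

    network_cap is a hypothetical ceiling for the point where the network
    becomes the bottleneck (about 15 in the internal figure quoted above).
    """
    limit = max(total_cores - service_cores, 0)
    if network_cap is not None:
        limit = min(limit, network_cap)
    return limit


class ConnectionPool:
    """Toy pool: a caller waits for a free slot, then fails on timeout."""

    def __init__(self, size: int = POOL_SIZE) -> None:
        self._slots = threading.BoundedSemaphore(size)

    def acquire(self, timeout: float = ACQUIRE_TIMEOUT_S) -> None:
        # Semaphore.acquire returns False if no slot frees up in time.
        if not self._slots.acquire(timeout=timeout):
            raise TimeoutError("no connection acquired; the ingest job fails")

    def release(self) -> None:
        self._slots.release()
```

Under these assumptions, an edge node with 16 cores and 4 cores reserved for services yields `max_concurrent_sources(16, 4) == 12`.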
JDBC ingest caching is not supported for Hive.
To enable JDBC ingestion and performance caching, both of the following parameters must be enabled.
NOTE: For new installations, this feature is enabled by default. For customers upgrading to Release 5.1 and later, this feature is disabled by default.
Enables JDBC ingestion. Default is
Enables caching of ingested JDBC data.
When enabled, you can monitor the ingestion of long-loading JDBC datasets through the Import Data page. Default is
The number of hours that an object can be cached. If the object has not been refreshed in that period of time, the next request for the datasource collects fresh data from the source.
By default, this value is set to
Maximum size in bytes of the datasource cache. This value applies to individual user caches when either global or user-specific caching is enabled.
When the logging level is set to
See Logging below.
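The cache-lifetime setting described above (the number of hours that an object can be cached) amounts to a staleness check on each request. A minimal Python sketch of that check follows, assuming the time of the last refresh is tracked per cached object. The function name and the 24-hour value in the usage note are illustrative only and are not the product's parameter name or default.

```python
from datetime import datetime, timedelta
from typing import Optional


def needs_refresh(last_refreshed: datetime, max_cache_hours: float,
                  now: Optional[datetime] = None) -> bool:
    """Return True if the cached object is older than the allowed lifetime.

    When True, the next request for the datasource would collect fresh
    data from the source instead of reading the cache.
    """
    if now is None:
        now = datetime.now()
    return now - last_refreshed > timedelta(hours=max_cache_hours)
```

With an illustrative lifetime of 24 hours, an object last refreshed 25 hours ago is treated as stale, while one refreshed 12 hours ago is still served from the cache.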