...
Data ingestion happens only for Spark jobs. Data ingestion does not apply to Photon jobs.
- Data ingestion applies only to JDBC sources that are not native to the running environment. For example, JDBC ingestion is not supported for Hive.
- Supported for HDFS and other large-scale backend datastores.
...
- For a single job with 16 ingest jobs occurring in parallel, the maximum expected transfer rate is 1 GB/minute.
Scalability:
- Good for tables in the hundreds of MB. Not good for tables of GB size.
- 1 ingest job per source, meaning a dataset with 3 sources = 3 ingest jobs.
Rule of thumb for the maximum number of concurrent jobs on a similar edge node:
`max concurrent sources = max cores - cores used for services`
- The rule above is valid until the network becomes a bottleneck. Internally, it maxed out at about 15 concurrent sources.
- Defaults: 16 concurrent jobs, a pool size of 10, and a 2-minute timeout on the pool. These limits prevent overloading of your database.
- Adding more concurrent jobs once the network has become a bottleneck slows down all of the transfer jobs simultaneously.
- If processing is fully saturated (the number of workers is maxed out):
  - The maximum transfer rate can drop to 1/3 GB/minute.
  - Ingest waits for two minutes to acquire a connection. If a connection cannot be acquired after two minutes, the job fails.
- When a job is queued for processing:
  - The job is silently queued and appears to be in progress.
  - The service waits until other jobs complete.
  - Currently, there is no timeout for queueing based on the maximum number of concurrent ingest jobs.
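The sizing notes above can be turned into rough planning arithmetic. The following is a minimal sketch, not product code: the function names are hypothetical, and the constants are simply the figures quoted in this section (1 GB/minute at peak, 1/3 GB/minute when workers are saturated, 16 concurrent jobs by default).

```python
# Planning figures quoted in this section; treat them as rough estimates.
PEAK_RATE_GB_PER_MIN = 1.0           # single job, 16 ingest jobs in parallel
SATURATED_RATE_GB_PER_MIN = 1 / 3    # all workers busy
DEFAULT_MAX_CONCURRENT_JOBS = 16


def max_concurrent_sources(max_cores: int, service_cores: int) -> int:
    """Rule of thumb: max cores minus cores used for services."""
    return max(max_cores - service_cores, 0)


def ingest_jobs_for(source_count: int) -> int:
    """One ingest job is created per source in the dataset."""
    return source_count


def queued_jobs(requested: int, limit: int = DEFAULT_MAX_CONCURRENT_JOBS) -> int:
    """Jobs beyond the concurrency limit wait in the queue."""
    return max(requested - limit, 0)


def estimated_minutes(total_gb: float, rate_gb_per_min: float) -> float:
    """Approximate transfer time at a given aggregate rate."""
    return total_gb / rate_gb_per_min


if __name__ == "__main__":
    # Hypothetical edge node: 16 cores, 4 of them reserved for services.
    print("concurrent sources:", max_concurrent_sources(16, 4))
    print("ingest jobs for 3 sources:", ingest_jobs_for(3))
    print("queued when 20 requested:", queued_jobs(20))
    print("minutes for 0.5 GB at peak:", estimated_minutes(0.5, PEAK_RATE_GB_PER_MIN))
```

These estimates hold only while the network is not the bottleneck. Once it is, adding concurrent jobs lowers the rate of every transfer.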
Limitations
JDBC ingest caching is not supported for Hive.
Enable
To enable JDBC ingestion and performance caching, both of the following parameters must be enabled.
NOTE: For new installations, this feature is enabled by default. For customers upgrading to Release 5.1 and later, this feature is disabled by default.
Parameter Name | Description |
---|---|
webapp.connectivity.ingest.enabled | Enables JDBC ingestion. Default is true. |
feature.jdbcIngestionCaching.enabled | Enables caching of ingested JDBC data. |
feature.enableLongLoading | When enabled, you can monitor the ingestion of long-loading JDBC datasets through the Import Data page. Default is |
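As an illustration, the dotted parameter names above can be expanded into a nested settings structure. This sketch assumes that the platform stores its configuration as nested JSON in which each dot-separated segment is one level. That layout is an assumption, and the `nest` helper is hypothetical. All three values are set to true only to show the feature switched on.

```python
import json


def nest(flat: dict) -> dict:
    """Expand dotted parameter names into nested dictionaries."""
    root: dict = {}
    for dotted_key, value in flat.items():
        *parents, leaf = dotted_key.split(".")
        node = root
        for part in parents:
            # Create intermediate levels on demand.
            node = node.setdefault(part, {})
        node[leaf] = value
    return root


# Parameter names come from the table above; values are illustrative.
settings = {
    "webapp.connectivity.ingest.enabled": True,
    "feature.jdbcIngestionCaching.enabled": True,
    "feature.enableLongLoading": True,
}

if __name__ == "__main__":
    print(json.dumps(nest(settings), indent=2))
```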
...
Parameter | Description |
---|---|
datasourceCaching.refreshThreshold | The number of hours that an object can be cached. If the object has not been refreshed in that period of time, the next request for the datasource collects fresh data from the source. By default, this value is set to |
datasourceCaching.maxSize | Maximum size in bytes of the datasource cache. This value applies to individual user caches when either global or user-specific caching is enabled. |
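The refresh threshold can be read as a simple age check on each cached object. The sketch below shows that interpretation only. The `needs_refresh` function is hypothetical, the 24-hour threshold is an illustrative value rather than the product default, and treating an object exactly at the threshold as still fresh is an assumption.

```python
from datetime import datetime, timedelta


def needs_refresh(last_refreshed: datetime, now: datetime,
                  refresh_threshold_hours: float) -> bool:
    """Return True when the cached object is older than the threshold,
    so the next request should collect fresh data from the source."""
    return now - last_refreshed > timedelta(hours=refresh_threshold_hours)


if __name__ == "__main__":
    cached_at = datetime(2024, 1, 1, 0, 0)
    # Illustrative threshold of 24 hours (not the documented default).
    print(needs_refresh(cached_at, datetime(2024, 1, 1, 12, 0), 24))
    print(needs_refresh(cached_at, datetime(2024, 1, 2, 1, 0), 24))
```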
...
Logging
Parameter Name | Description |
---|---|
data-service.systemProperties.logging.level | When the logging level is set to
See Logging below. |
...