Data ingestion applies only to Spark jobs. It does not apply to Photon jobs.
- Data ingestion applies only to JDBC sources that are not native to the running environment. For example, JDBC ingestion is not supported for Hive.
- Supported for HDFS, S3, and other large-scale backend datastores.
- For a single job with 16 ingest jobs running in parallel, the maximum expected transfer rate is 1 GB/minute.
- Good for hundreds of MB. Not good for tables that are gigabytes in size.
- One ingest job is created per source, so a dataset with 3 sources requires 3 ingest jobs.
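As a rough illustration of the figures above, the following Python sketch estimates the number of ingest jobs and a lower bound on transfer time. The function names are hypothetical and not part of the product. The 1 GB/minute default is the maximum rate quoted above, assumed here to be the aggregate across parallel ingest jobs, so real transfers may well be slower.

```python
def ingest_job_count(num_sources: int) -> int:
    """Return the number of ingest jobs: one per source."""
    if num_sources < 0:
        raise ValueError("num_sources must be non-negative")
    return num_sources


def min_transfer_minutes(total_gb: float, rate_gb_per_min: float = 1.0) -> float:
    """Lower bound on transfer time, assuming the quoted maximum rate is sustained.

    rate_gb_per_min defaults to the 1 GB/minute maximum stated in the text;
    treating it as an aggregate rate is an assumption of this sketch.
    """
    if total_gb < 0:
        raise ValueError("total_gb must be non-negative")
    if rate_gb_per_min <= 0:
        raise ValueError("rate_gb_per_min must be positive")
    return total_gb / rate_gb_per_min
```

For example, `ingest_job_count(3)` mirrors the three-source dataset mentioned above, and `min_transfer_minutes(0.5)` gives the best-case time in minutes for a 500 MB table.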
Rule of thumb for max concurrent jobs for a similar edge node:
max concurrent sources = max cores - cores used for services
- The rule above is valid until the network becomes a bottleneck. Internally, throughput maxed out at about 15 concurrent sources.
- Defaults: 16 concurrent jobs, a pool size of 10, and a 2-minute timeout on the pool. These limits prevent overloading your database.
- Adding more concurrent jobs after the network has become a bottleneck slows down all transfer jobs simultaneously.
- If processing is fully saturated (the number of workers is maxed out), the maximum transfer rate can drop to 1/3 GB/minute.
- Ingest waits for two minutes to acquire a connection. If after two minutes a connection cannot be acquired, the job fails.
- When a job is queued for processing:
  - The job is silently queued and appears to be in progress.
  - The service waits until other jobs complete.
  - Currently, there is no timeout for queueing based on the maximum number of concurrent ingest jobs.
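The rule of thumb and the pool timeout described above can be sketched in Python as follows. This is a toy model under stated assumptions, not the product's implementation: `max_concurrent_sources` and `ConnectionPool` are hypothetical names, and a counting semaphore stands in for a real JDBC connection pool. The defaults (pool size of 10, 120-second timeout) are the values quoted above.

```python
import threading


def max_concurrent_sources(total_cores: int, service_cores: int) -> int:
    """Rule of thumb: cores remaining after those used for services."""
    return max(total_cores - service_cores, 0)


class ConnectionPool:
    """Toy bounded pool: a semaphore stands in for real database connections."""

    def __init__(self, size: int = 10, timeout_s: float = 120.0):
        self._slots = threading.BoundedSemaphore(size)
        self._timeout_s = timeout_s

    def acquire(self) -> None:
        # Wait up to the timeout for a free slot; fail if none becomes available.
        if not self._slots.acquire(timeout=self._timeout_s):
            raise TimeoutError(
                f"no connection available after {self._timeout_s} seconds"
            )

    def release(self) -> None:
        # Return the slot so that a waiting job can proceed.
        self._slots.release()
```

In this model, a job that cannot acquire a slot within the timeout raises `TimeoutError`, which corresponds to the ingest job failing. Queueing on the concurrent-job limit is not modeled, since the text states that it has no timeout.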
The number of hours that an object can be cached. If the object has not been refreshed in that period of time, the next request for the datasource collects fresh data from the source.
By default, this value is set to
Maximum size in bytes of the datasource cache. This value applies to individual user caches when either global or user-specific caching is enabled.
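The cache-duration setting described above amounts to a freshness check, which the following minimal Python sketch illustrates. The function and parameter names are hypothetical, and the sketch assumes that an object counts as stale once its age exceeds the configured number of hours. How the size limit is enforced is not described here, so it is not modeled.

```python
from datetime import datetime, timedelta


def needs_refresh(last_refreshed: datetime, max_age_hours: float, now: datetime) -> bool:
    """Return True if the cached object is older than the allowed number of hours.

    A True result means the next request should collect fresh data from the source.
    """
    return now - last_refreshed > timedelta(hours=max_age_hours)
```

Passing `now` explicitly keeps the check deterministic. A caller would typically supply `datetime.now()` with the same timezone handling used when `last_refreshed` was recorded.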
When the logging level is set to
See Logging below.