...

  • Data ingestion happens only for Spark jobs. Data ingestion does not apply to Photon jobs.

  • Data ingestion applies only to JDBC sources that are not native to the running environment. For example, JDBC ingestion is not supported for Hive.
  • Supported for HDFS, S3, and other large-scale backend datastores.

...

  • For a single job with 16 ingest jobs running in parallel, the maximum expected transfer rate is 1 GB/minute.
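
As a rough illustration of what this rate means in practice, the sketch below estimates how long a table of a given size might take to ingest. The function name is hypothetical, and the constant-throughput assumption is a simplification; actual rates depend on the network and the source database.

```python
def estimated_transfer_minutes(size_gb: float, rate_gb_per_min: float = 1.0) -> float:
    """Estimate ingest time in minutes, assuming a constant transfer rate.

    The default of 1 GB/minute is the maximum expected rate stated above.
    Treating the rate as constant is an assumption of this sketch.
    """
    if size_gb < 0:
        raise ValueError("size_gb must be non-negative")
    if rate_gb_per_min <= 0:
        raise ValueError("rate_gb_per_min must be positive")
    return size_gb / rate_gb_per_min


# A 500 MB table at the maximum expected rate.
print(estimated_transfer_minutes(0.5))
```

Because 1 GB/minute is a ceiling, the result is best read as a lower bound on transfer time.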

Scalability:

  • Suitable for tables of hundreds of megabytes. Not suitable for tables of gigabyte size.
  • 1 ingest job per source, meaning a dataset with 3 sources = 3 ingest jobs.
  • Rule of thumb for max concurrent jobs for a similar edge node:

    max concurrent sources = max cores - cores used for services


    • The rule above holds until the network becomes a bottleneck. Internally, it maxed out at about 15 concurrent sources.
    • Defaults: 16 concurrent jobs, a pool size of 10, and a 2-minute timeout on the pool. These limits prevent overloading your database.
    • Adding more concurrent jobs after the network becomes a bottleneck slows down all transfer jobs simultaneously.
  • If processing is fully saturated (the number of workers is maxed out):
    • The maximum transfer rate can drop to 1/3 GB/minute.
    • Ingest waits for two minutes to acquire a connection. If after two minutes a connection cannot be acquired, the job fails.
  • When a job is queued for processing:
    • Job is silently queued and appears to be in progress.
    • Service waits until other jobs complete. 
    • Currently, there is no timeout for queueing based on the maximum number of concurrent ingest jobs.
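
The sizing rules above can be sketched as a small calculation. The function names are hypothetical, and treating the smaller of the core-based rule of thumb and the default job limit as the effective capacity is an assumption of this sketch, not documented behavior.

```python
DEFAULT_MAX_CONCURRENT_JOBS = 16  # default number of concurrent jobs stated above


def max_concurrent_sources(max_cores: int, service_cores: int) -> int:
    """Rule of thumb: max concurrent sources = max cores - cores used for services."""
    return max(max_cores - service_cores, 0)


def ingest_plan(num_sources: int, max_cores: int, service_cores: int,
                job_limit: int = DEFAULT_MAX_CONCURRENT_JOBS) -> tuple[int, int]:
    """Return (running, queued) ingest jobs for one dataset.

    One ingest job is created per source. Capping capacity at the smaller
    of the two limits is an assumption made for illustration.
    """
    capacity = min(max_concurrent_sources(max_cores, service_cores), job_limit)
    running = min(num_sources, capacity)
    return running, num_sources - running


# A dataset with 3 sources on a 20-core edge node with 4 cores reserved for services.
print(ingest_plan(3, 20, 4))
# A dataset with 20 sources on the same node: some jobs have to wait.
print(ingest_plan(20, 20, 4))
```

Any jobs counted as queued here would wait silently, as described above, since queueing currently has no timeout.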

...

Parameter | Description
datasourceCaching.refreshThreshold | The number of hours that an object can be cached. If the object has not been refreshed in that period of time, the next request for the datasource collects fresh data from the source. By default, this value is set to 168 (one week).
datasourceCaching.maxSize | Maximum size in bytes of the datasource cache. This value applies to individual user caches when either global or user-specific caching is enabled.
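
To make the refresh threshold concrete, the following sketch shows one way the staleness check could work. The function name is hypothetical, and the strict comparison at exactly the threshold is an assumption; the product's internal logic may differ.

```python
from datetime import datetime, timedelta

DEFAULT_REFRESH_THRESHOLD_HOURS = 168  # default stated above (one week)


def needs_refresh(last_refreshed: datetime, now: datetime,
                  threshold_hours: int = DEFAULT_REFRESH_THRESHOLD_HOURS) -> bool:
    """Return True if the cached object is older than the refresh threshold.

    When this returns True, the next request for the datasource would
    collect fresh data from the source instead of using the cache.
    """
    return now - last_refreshed > timedelta(hours=threshold_hours)


cached_at = datetime(2024, 1, 1)
print(needs_refresh(cached_at, cached_at + timedelta(days=6)))  # within one week
print(needs_refresh(cached_at, cached_at + timedelta(days=8)))  # past one week
```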

...


Logging

Parameter Name | Description
data-service.systemProperties.logging.level | When the logging level is set to debug, log messages on JDBC caching are recorded in the data service log. NOTE: Use this setting for debug purposes only, as the log files can grow quite large. Lower the setting after the issue has been debugged. See Logging below.

...