The platform interacts with your enterprise Hadoop cluster like any other Hadoop client. It uses existing HDFS interfaces and can optionally integrate with Hadoop components such as Spark and Hive for better connectivity and performance.
In a standard deployment, the platform reads files from and stores results in HDFS. Optionally, it can execute distributed data-processing jobs in Spark.
The following diagrams illustrate how the platform interacts with Hadoop in various execution and deployment scenarios. Each diagram includes the client and server components in use and the ports over which they communicate.
WebHDFS requires access to HTTP port 50070 on the NameNode; read requests are then redirected to HTTP port 50075 on a DataNode that holds the queried data.
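The two-step WebHDFS flow above can be sketched as follows. This is a minimal illustration, not product code; the hostnames and file path are hypothetical.

```python
# Sketch of the WebHDFS read flow: the client first contacts the NameNode
# on port 50070, which answers with an HTTP 307 redirect to port 50075 on
# a DataNode holding the requested blocks.
NAMENODE = "namenode.example.com"  # hypothetical hostname

def webhdfs_open_url(path: str) -> str:
    """Build the initial OPEN request URL sent to the NameNode (port 50070)."""
    return f"http://{NAMENODE}:50070/webhdfs/v1{path}?op=OPEN"

url = webhdfs_open_url("/data/input.csv")
print(url)
# The NameNode then redirects the client to a DataNode, e.g.:
#   http://datanode1.example.com:50075/webhdfs/v1/data/input.csv?op=OPEN&...
# and the client fetches the file content from that DataNode directly.
```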
Run on Server
YARN jobs are submitted through the ResourceManager IPC port 8032. The client also communicates with the ApplicationMaster created for each job on the cluster, which listens on a new port within a configured range of ports.
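The ports involved correspond to standard YARN settings in `yarn-site.xml`. The excerpt below is an illustrative sketch; the hostname and port range are hypothetical examples, not product defaults.

```xml
<!-- Excerpt from yarn-site.xml (hostname is a hypothetical example). -->
<property>
  <!-- ResourceManager IPC address used for job submission (port 8032). -->
  <name>yarn.resourcemanager.address</name>
  <value>resourcemanager.example.com:8032</value>
</property>
<property>
  <!-- Restricts the ports on which MapReduce ApplicationMasters listen
       for client connections (the "configured range" mentioned above). -->
  <name>yarn.app.mapreduce.am.job.client.port-range</name>
  <value>50100-50200</value>
</property>
```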
Run Job in Hadoop YARN Cluster
By default, the platform executes transformation jobs and profiling jobs using the Scala version of Spark. This set of libraries can be deployed to cluster nodes from the platform, so that no cluster-side instance of Spark is required.
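Shipping bundled Spark libraries with a job can be sketched as a `spark-submit` invocation that passes them via `--jars`. The sketch below only assembles the command line; the paths and jar names are illustrative assumptions, not actual product values.

```python
# Hypothetical sketch: build a spark-submit command that ships the
# platform's bundled Spark-side libraries to the YARN cluster, so no
# cluster-side Spark installation is needed.

def build_spark_submit(job_jar: str, bundled_libs: list) -> list:
    """Return the spark-submit argv for a YARN cluster-mode job."""
    return [
        "spark-submit",
        "--master", "yarn",                # run on the YARN cluster
        "--deploy-mode", "cluster",        # driver runs inside the cluster
        "--jars", ",".join(bundled_libs),  # ship bundled libraries with the job
        job_jar,
    ]

cmd = build_spark_submit(
    "/opt/platform/jobs/transform-job.jar",      # hypothetical job jar
    ["/opt/platform/libs/spark-runtime.jar"],    # hypothetical bundled library
)
print(" ".join(cmd))
```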
Run Job on Scala Spark
The following diagram shows the workflow for Scala Spark-based profiling jobs:
Run Profiling Job on Scala Spark
The platform uses the Batch Job Runner, an Activiti-based service, for job orchestration. In the following publishing flow, the platform publishes results to Hive.
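One common way to publish HDFS results to Hive is to create an external table over the output directory. The sketch below only builds the HiveQL statement; the database, table, columns, and path are hypothetical examples, not the product's actual publishing mechanism.

```python
# Hypothetical sketch: expose job results already written to HDFS as a
# Hive table by generating a CREATE EXTERNAL TABLE statement over the
# output directory. All names and paths are illustrative only.

def publish_to_hive_ddl(db: str, table: str, hdfs_path: str) -> str:
    """Build HiveQL that maps an HDFS results directory to a Hive table."""
    return (
        f"CREATE EXTERNAL TABLE {db}.{table} "
        "(id STRING, value STRING) "
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' "
        f"LOCATION '{hdfs_path}'"
    )

print(publish_to_hive_ddl("analytics", "job_results", "/results/job_1234"))
```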
Coordination and publishing to Hive