The  interacts with your enterprise Hadoop cluster like any other Hadoop client. It uses standard HDFS interfaces and can optionally integrate with Hadoop components such as Spark and Hive for improved connectivity and performance.

In a standard deployment, the  reads source files from HDFS and stores results back in HDFS. Optionally, it can execute distributed data processing jobs in Spark.

The following diagrams illustrate how the  interacts with Hadoop in various execution and deployment scenarios. Each diagram includes the client and server components in use and the ports over which they communicate.

Run the Job Locally

WebHDFS requires access to HTTP port 50070 on the NameNode server and follows a redirected request to HTTP port 50075 on any DataNode containing the queried data.
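The two-step exchange above can be sketched as follows. This is a minimal illustration of the WebHDFS OPEN protocol, not code from the product; the host name and file path are hypothetical placeholders, and the ports are the Hadoop defaults named in this section.

```python
# Sketch of the WebHDFS two-step read protocol (illustrative only).
# Step 1: the client sends an OPEN request to the NameNode on HTTP
#         port 50070.
# Step 2: the NameNode answers with an HTTP 307 redirect to a DataNode
#         on port 50075, which the client follows to stream the data.
# "namenode.example.com" and the file path are placeholder values.

NAMENODE_HOST = "namenode.example.com"  # assumption: cluster NameNode
NAMENODE_HTTP_PORT = 50070              # default NameNode HTTP port
DATANODE_HTTP_PORT = 50075              # default DataNode HTTP port


def webhdfs_open_url(path: str) -> str:
    """Build the initial OPEN request URL sent to the NameNode."""
    return (f"http://{NAMENODE_HOST}:{NAMENODE_HTTP_PORT}"
            f"/webhdfs/v1{path}?op=OPEN")


# The NameNode's redirect points at a DataNode holding the block, e.g.:
#   http://<datanode-host>:50075/webhdfs/v1<path>?op=OPEN&...
print(webhdfs_open_url("/data/input.csv"))
```

An HTTP client with redirect-following enabled (for example, `curl -L` or Python's `urllib`) handles the second hop transparently, which is why both ports must be reachable from the client.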

Run on Server

Run in Hadoop (YARN Cluster)

YARN jobs are submitted through the ResourceManager IPC port 8032. The client also communicates with the ApplicationMaster created for each job on the cluster, which listens on a dynamically assigned port within a configured range of ports.
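For reference, the ResourceManager submission port is governed by the standard Hadoop property shown below. This is a generic `yarn-site.xml` fragment, not product-specific configuration; the host name is a placeholder.

```xml
<!-- yarn-site.xml (illustrative fragment; host name is a placeholder) -->
<property>
  <name>yarn.resourcemanager.address</name>
  <!-- Clients submit jobs to the ResourceManager on this address;
       8032 is the Hadoop default IPC port. -->
  <value>resourcemanager.example.com:8032</value>
</property>
```

If this property is overridden in your cluster, the client must be configured to use the same port.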

Run Job in Hadoop YARN Cluster

Run Job in Scala Spark

By default, the  executes transformation jobs and profiling jobs using the Scala version of Spark. This set of libraries can be deployed to the nodes of the cluster from the , so that no cluster-wide instance of Spark is required.

Run Job on Scala Spark

Run Profiling Job in Scala Spark

The following diagram shows the workflow for Scala Spark-based profiling jobs:

Run Profiling Job on Scala Spark

Coordination and Publishing Flows

The  uses Batch Job Runner, an Activiti-based service, for job orchestration. In the following publishing flow, the  publishes results to Hive.

Coordination and publishing to Hive