Before you deploy the platform, you should complete the following configuration steps within your Hadoop environment.

NOTE: The platform requires access to a set of Hadoop components. See System Requirements.

Create platform account on Hadoop cluster

The platform interacts with Hadoop through a single system user account. A user for the platform must be added to the cluster.

NOTE: In a cluster without Kerberos or SSO user management, the user must be created on each node of the cluster.

If LDAP is enabled, the user should be created in the same realm as the cluster.

If Kerberos is enabled, the user must exist on every node where jobs run.

For POSIX-compliant Hadoop environments, the user ID of the account accessing the cluster and the user ID of the Hadoop user must match exactly.
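
A minimal sketch of creating the account, assuming a POSIX-compliant environment; the UID of 1500 shown here is an example and must match the user ID of the account accessing the cluster:

# Run on each node of the cluster.
# Create the group first, then the user with an explicit, matching UID.
$ sudo groupadd trifacta
$ sudo useradd -u 1500 -g trifacta trifacta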

User ID:

If possible, please create the user ID as: trifacta

This user should belong to the group: trifacta

User requirements: the account needs access to HDFS and permission to run YARN jobs on the cluster.

Verify that the following HDFS paths have been created and that their permissions enable access to the platform user account:

NOTE: Depending on your Hadoop distribution, you may need to modify the following commands to use the Hadoop client installed on the platform node.

Below, change the value trifacta to match the Hadoop user for your environment:

$ hdfs dfs -mkdir /trifacta
$ hdfs dfs -chown trifacta /trifacta
$ hdfs dfs -mkdir -p /user/trifacta
$ hdfs dfs -chown trifacta /user/trifacta

HDFS directories

The following directories must be available to the platform on HDFS. Below, you can review the minimum permissions required for basic and impersonated authentication for each default directory. Secure impersonation is described later.

NOTE: Except for the dictionaries directory, which is used to hold smaller reference files, each of these directories should be configured to permit storage of a user's largest datasets.

Directory               Minimum required permissions   Secure impersonation permissions
/trifacta/uploads       700                            770 (set this to 730 to prevent users from browsing this directory)
/trifacta/queryResults  700                            770
/trifacta/dictionaries  700                            770
/trifacta/tempfiles     770                            770

You can use the following commands to configure permissions on these directories. The following permissions scheme reflects the secure impersonation permissions in the table above:

$ hdfs dfs -mkdir -p /trifacta/uploads
$ hdfs dfs -mkdir -p /trifacta/queryResults
$ hdfs dfs -mkdir -p /trifacta/dictionaries
$ hdfs dfs -mkdir -p /trifacta/tempfiles
$ hdfs dfs -chown -R trifacta:trifacta /trifacta
$ hdfs dfs -chmod -R 770 /trifacta
$ hdfs dfs -chmod -R 730 /trifacta/uploads
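
To verify the resulting ownership and permissions, you can list the directory tree:

$ hdfs dfs -ls -R /trifacta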

If these standard locations cannot be used, you can configure the HDFS paths in platform configuration:

"hdfs.pathsConfig.fileUpload": "/trifacta/uploads",
"hdfs.pathsConfig.batchResults": "/trifacta/queryResults",
"hdfs.pathsConfig.dictionaries": "/trifacta/dictionaries"

Kerberos authentication

The platform supports Kerberos authentication on Hadoop.

NOTE: If Kerberos is enabled for the Hadoop cluster, the keytab file must be made accessible to the platform. See Set up for a Kerberos-enabled Hadoop cluster.
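
As a sketch, assuming the keytab has been copied to the hypothetical path /opt/trifacta/conf/trifacta.keytab, you can restrict access to the platform user and verify its principals:

# Keytab path is an example; adjust to your environment.
$ sudo chown trifacta:trifacta /opt/trifacta/conf/trifacta.keytab
$ sudo chmod 600 /opt/trifacta/conf/trifacta.keytab
# List the principals stored in the keytab:
$ klist -kt /opt/trifacta/conf/trifacta.keytab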

Hadoop component configuration

Acquire cluster configuration files

The Hadoop cluster configuration files must be made available to the platform. You can either copy the files over from the cluster or create a local symlink to them.
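
For example, assuming the cluster keeps its configuration in /etc/hadoop/conf and a hypothetical local destination of /opt/trifacta/conf/hadoop-site (both paths vary by distribution and install location):

# Option 1: copy the site files from a cluster node:
$ scp admin@cluster-node:/etc/hadoop/conf/*-site.xml /opt/trifacta/conf/hadoop-site/
# Option 2: if this node is part of the cluster, symlink to the local configuration:
$ ln -s /etc/hadoop/conf /opt/trifacta/conf/hadoop-site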

For more information, see Configure for Hadoop.

YARN configuration overview

This section provides an overview of configuration recommendations to be applied to the Hadoop cluster for use with the platform.

NOTE: The recommendations in this section are optimized for use with the platform. They may not conform to the requirements of other applications using the Hadoop cluster. The platform vendor assumes no responsibility for the configuration of the cluster.

YARN manages cluster resources (CPU and memory) by running all processes within allocated containers. Each container restricts the resources available to its processes. Processes are monitored and killed if they overrun their container allocation.

YARN configuration specifies the per-node resources available for containers and the per-container minimum, maximum, and increment allocations for memory and virtual CPUs.

The following parameters are available in yarn-site.xml:

Parameter                                   Type              Description
yarn.nodemanager.resource.memory-mb         Per cluster node  Amount of physical memory, in MB, that can be allocated for containers
yarn.nodemanager.resource.cpu-vcores        Per cluster node  Number of CPU cores that can be allocated for containers
yarn.scheduler.minimum-allocation-mb        Per container     Minimum container memory, in MB; requests lower than this are raised to this value
yarn.scheduler.maximum-allocation-mb        Per container     Maximum container memory, in MB; requests higher than this are capped at this value
yarn.scheduler.increment-allocation-mb      Per container     Granularity of container memory requests
yarn.scheduler.minimum-allocation-vcores    Per container     Minimum virtual CPU cores per container; requests lower than this are raised to this value
yarn.scheduler.maximum-allocation-vcores    Per container     Maximum virtual CPU cores per container; requests higher than this are capped at this value
yarn.scheduler.increment-allocation-vcores  Per container     Granularity of container virtual CPU requests
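
As an illustration, the per-node and maximum-allocation settings from the 4-node column of the recommendations table below would appear in yarn-site.xml as follows. This is a sketch; apply the values appropriate to your cluster:

<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>57344</value>
</property>
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>13</value>
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>57344</value>
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-vcores</name>
  <value>13</value>
</property>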

Spark configuration overview

Spark processes run multiple executors per job. Each executor must run within a YARN container. Therefore, resource requests must fit within YARN’s container limits.

As with other YARN containers, multiple executors can run on a single node. More executors provide additional computational power and decrease runtime.

Spark’s dynamic allocation adjusts the number of executors launched for each job based on the workload.

The per-executor resource request sizes can be specified by setting the following properties in the spark.props section:

Parameter              Description
spark.executor.memory  Amount of memory to use per executor process (in a specified unit)
spark.executor.cores   Number of cores to use on each executor; limit to 5 cores per executor for best performance

A single special process, the application driver, also runs in a container. Its resources are specified in the spark.props section:

Parameter            Description
spark.driver.memory  Amount of memory to use for the driver process (in a specified unit)
spark.driver.cores   Number of cores to use for the driver process
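
For example, using the values recommended below for a 4-node cluster, the spark.props section might look like the following sketch; the surrounding configuration structure may vary:

"spark.props": {
  "spark.executor.memory": "16GB",
  "spark.executor.cores": "4",
  "spark.driver.memory": "4GB",
  "spark.driver.cores": "1"
}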

Recommendations

The following configuration settings can be applied through platform configuration, based on the number of nodes in the Hadoop cluster.

NOTE: These recommendations should be modified based on the technical capabilities of your network, the nodes in the cluster, and other applications using the cluster.

Number of nodes in cluster                  1      2      4      10      16
Available memory (GB)                       16     32     64     160     256
Available vCPUs                             4      8      16     40      64
yarn.nodemanager.resource.memory-mb         12288  24576  57344  147456  245760
yarn.nodemanager.resource.cpu-vcores        3      6      13     32      52
yarn.scheduler.minimum-allocation-mb        1024   1024   1024   1024    1024
yarn.scheduler.maximum-allocation-mb        12288  24576  57344  147456  245760
yarn.scheduler.increment-allocation-mb      512    512    512    512     512
yarn.scheduler.minimum-allocation-vcores    1      1      1      1       1
yarn.scheduler.maximum-allocation-vcores    3      6      13     32      52
yarn.scheduler.increment-allocation-vcores  1      1      1      1       1
spark.executor.memory                       6GB    6GB    16GB   20GB    20GB
spark.executor.cores                        2      2      4      5       5
spark.driver.memory                         4GB    4GB    4GB    4GB     4GB
spark.driver.cores                          1      1      1      1       1


The specified configuration allows, at maximum, the following Spark configurations per node:

Cores x Node  Configuration options
1x1           (1 driver + 1 executor) or 1 executor
2x1           (1 driver + 2 executors) or 3 executors
4x1           (1 driver + 3 executors) or 3 executors
10x1          (1 driver + 6 executors) or 6 executors
16x1          (1 driver + 10 executors) or 10 executors