The Trifacta® platform supports integration with a number of Hadoop distributions, using a range of components within each distribution. This page provides information on the set of configuration tasks that you need to complete to integrate the platform with your Hadoop environment.
Before You Begin
Key deployment considerations
- Hadoop cluster: The Hadoop cluster should already be installed and operational. As part of the install preparation, you should have prepared the Hadoop platform for integration with the Trifacta platform. See Prepare Hadoop for Integration with the Platform.
- For more information on the components supported in your Hadoop distribution, see Install Reference.
- Storage: on-premises, cloud, or hybrid.
- The Trifacta platform can interact with storage that is in the local environment, in the cloud, or in some combination. How your storage is deployed affects your configuration scenarios. See Storage Deployment Options.
- Base storage layer: You must configure one storage platform to be the base storage layer. Details are described later.
NOTE: Some deployments require that you select a specific base storage layer.
After you have defined the base storage layer, it cannot be changed. Please review your Storage Deployment Options carefully. The required configuration is described later.
The Trifacta platform supports integration only with the versions of Hadoop that are supported for your version of the platform.
NOTE: The versions of your Hadoop software and the libraries in use by the Trifacta platform must match. Unless specifically directed by Trifacta Support, integration with your Hadoop cluster using a set of Hadoop libraries from a different version of Hadoop is not supported.
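One way to check this before integrating is to compare the version reported on the cluster with the version reported by the client libraries. The sketch below is illustrative only. It assumes the first line of `hadoop version` output has the form `Hadoop <version>`, and the function names are hypothetical, not part of the Trifacta platform.

```python
import re


def parse_hadoop_version(output: str) -> str:
    """Extract the version string from text printed by `hadoop version`.

    Assumes the first line looks like "Hadoop 2.7.3". Any vendor suffix
    attached to the version token is kept as part of the result.
    """
    lines = output.strip().splitlines()
    if not lines:
        raise ValueError("empty version output")
    match = re.match(r"Hadoop\s+(\S+)", lines[0])
    if match is None:
        raise ValueError(f"unrecognized version line: {lines[0]!r}")
    return match.group(1)


def versions_match(cluster_output: str, client_output: str) -> bool:
    """Return True only when both outputs report the identical version."""
    return parse_hadoop_version(cluster_output) == parse_hadoop_version(client_output)
```

A strict string comparison is used deliberately, since the note above requires the versions to match rather than merely be compatible.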
For more information, see System Requirements.
After the Trifacta platform and its databases have been installed, you can perform platform configuration. You can apply configuration changes through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.
NOTE: Some platform configuration is required, regardless of your deployment. See Required Platform Configuration.
Specify Trifacta user
NOTE: Where possible, you should define or select a user with a userID value greater than 1000. In some environments, lower userID values can result in failures when running jobs on Hadoop.
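As a rough illustration of the check described in this note, the following sketch looks up a local account and tests its userID against the 1000 threshold. The function names are hypothetical, and the lookup relies on the standard `pwd` module, so it applies to Unix-like nodes only.

```python
import pwd

MIN_RECOMMENDED_UID = 1000  # threshold taken from the note above


def uid_is_recommended(uid: int, minimum: int = MIN_RECOMMENDED_UID) -> bool:
    """Return True when the userID is strictly greater than the minimum."""
    return uid > minimum


def check_local_user(username: str) -> bool:
    """Look up a local account and test its userID (Unix only)."""
    # pwd.getpwnam raises KeyError if the account does not exist.
    uid = pwd.getpwnam(username).pw_uid
    return uid_is_recommended(uid)
```

For example, `check_local_user("trifacta")` would report whether an account of that name (an assumed username) meets the recommendation on the current node.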
Set the Hadoop username for the Trifacta platform to use for executing jobs.
If the Trifacta software is installed in a Kerberos environment, additional steps are required, which are described later.
Configuration for Hadoop
In the sections below are a series of questions about the Hadoop environment with which the Trifacta platform is integrating. Based on your answer, additional configuration may be required.
The Trifacta platform supports access to the following storage environments:
Set the base storage layer
At this time, you should define the base storage layer from the platform. See Set Base Storage Layer.
Required configuration for each type of storage is described below.
If output files are to be written to an HDFS environment, you must configure the Trifacta platform to interact with HDFS.
- Hadoop Distributed File System (HDFS) is a distributed file system that provides read-write access to large datasets in a Hadoop cluster. For more information, see http://hadoop.apache.org/docs/r1.2.1/hdfs_design.html.
If your deployment is using HDFS, do not use the trifacta/uploads directory. This directory is used for storing uploads and metadata, which may be used by multiple users. Manipulating files outside of the Trifacta application can destroy other users' data. Please use the tools provided through the interface for managing uploads from HDFS.
Below, replace the placeholder value with the value appropriate for your environment. You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.
Username in the Hadoop cluster to be used by the Trifacta platform for executing jobs.
Host name of namenode in the Hadoop cluster. You may reference multiple namenodes.
Port to use to access the namenode. You may reference multiple namenodes.
NOTE: Default values for the port number depend on your Hadoop distribution. See System Ports.
Individual users can configure the HDFS directory where exported results are stored.
NOTE: Multiple users cannot share the same home directory.
See User Profile Page.
Access to HDFS is supported over one of the following protocols:
If you are using HDFS, it is assumed that WebHDFS has been enabled on the cluster. Apache WebHDFS enables access to an HDFS instance over HTTP REST APIs. For more information, see https://hadoop.apache.org/docs/r1.0.4/webhdfs.html.
The following properties can be modified:
Path to locally installed version of WebHDFS.
Hostname for the WebHDFS service.
NOTE: If this value is not specified, then the expected host must be defined in
Port number for WebHDFS. The default value is 50070.
NOTE: The default port number for SSL to WebHDFS is
To use HttpFS instead of WebHDFS, set this value to
- Set webhdfs.host to be the hostname of the node that hosts WebHDFS.
- Set webhdfs.port to be the port number over which WebHDFS communicates. The default value is 50070. For SSL, the default value is
- In hdfs.namenodes, you must set the port values to point to the active namenode for WebHDFS.
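To confirm that the configured host and port reach the active namenode, you could request a directory listing over the WebHDFS REST API. The sketch below only builds the request URL. The `/webhdfs/v1` prefix, the `LISTSTATUS` operation, and the `user.name` query parameter come from the Apache WebHDFS REST API. The function name, host names, and username are hypothetical.

```python
from urllib.parse import urlencode


def webhdfs_url(host: str, port: int = 50070, path: str = "/",
                op: str = "LISTSTATUS", user: str = "",
                ssl: bool = False) -> str:
    """Build a WebHDFS REST URL for the given namenode host and port."""
    if not path.startswith("/"):
        path = "/" + path  # WebHDFS paths are absolute
    scheme = "https" if ssl else "http"
    params = {"op": op}
    if user:
        # Used for simple (non-Kerberos) authentication.
        params["user.name"] = user
    return f"{scheme}://{host}:{port}/webhdfs/v1{path}?{urlencode(params)}"
```

Opening the resulting URL with any HTTP client should return a JSON listing when WebHDFS is enabled and the host and port point at the active namenode. In a Kerberos environment the request would need additional authentication, which this sketch does not cover.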
You can configure the Trifacta platform to use the HttpFS service to communicate with HDFS, in addition to WebHDFS.
NOTE: HttpFS serves as a proxy to WebHDFS. When HttpFS is enabled, both services are required.
In some cases, it is required:
- High Availability requires HttpFS.
- Your secured HDFS user account has access restrictions.
NOTE: If your environment meets any of the above requirements, you must enable HttpFS.
For more information, see Enable HttpFS.
The Trifacta platform can integrate with an S3 bucket for reading and writing of data. See Enable S3 Access.
Configure ResourceManager settings
Configure the following:
NOTE: Do not modify the other host/port settings unless you have specific information requiring the modifications.
For more information, see System Ports.
Default Hadoop job results format
For smaller datasets, the platform recommends using the Trifacta Server.
For larger datasets, or if size information is unavailable, the platform recommends by default that you run the job on the Hadoop cluster. For these jobs, the default publishing action is to run on the Hadoop cluster and generate the output format defined by this parameter. Publishing actions, including output format, can always be changed as part of the job specification.
For more information, see Run Job Page.
Specify distribution client bundle
The Trifacta platform ships with client bundles supporting a number of major Hadoop distributions. You must configure the jarfile for the distribution to use. These distributions are stored in the following directory:
Configure the bundle distribution property (hadoopBundleJar) in platform configuration. Examples:
x.y is the major-minor build number (e.g. 5.4).
NOTE: The path must be specified relative to the install directory.
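Because the value must be relative to the install directory, a quick sanity check on the configured string can catch mistakes before a restart. This is a minimal sketch under stated assumptions: the function name and the example paths are hypothetical, and the rule that the path must not climb out of the install directory with `..` is an extra assumption rather than a documented requirement.

```python
import posixpath


def is_valid_bundle_path(bundle_jar: str) -> bool:
    """Accept only relative paths that end in .jar."""
    if not bundle_jar or posixpath.isabs(bundle_jar):
        return False  # empty or absolute paths are rejected
    normalized = posixpath.normpath(bundle_jar)
    # Assumption: the jar should live under the install directory.
    if normalized == ".." or normalized.startswith("../"):
        return False
    return normalized.endswith(".jar")
```

For example, `is_valid_bundle_path("bundles/example/bundle.jar")` returns True, while an absolute path such as `/opt/bundles/bundle.jar` is rejected.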
The Trifacta platform supports integration with Kerberos security. The platform can utilize Kerberos' secure impersonation to broker interactions with the Hadoop environment.
The Trifacta platform can integrate with your SSO platform to manage authentication to the Trifacta application. See Configure SSO for AD-LDAP.
If you are using Hadoop KMS to encrypt data transfers to and from the Hadoop cluster, additional configuration is required. See Configure for KMS.
Hadoop distribution version
Tip: If there is no bundle for the distribution you need, you might try the one that is the closest match in terms of Apache Hadoop baseline. For example, CDH5 is based on Apache Hadoop 2.3.0, so that client bundle will probably run correctly against a vanilla Apache Hadoop 2.3.0 installation. For more information, see Trifacta Support.
Integration with Cloudera requires some additional configuration. See Configure for Cloudera.
After install, integration with the Hortonworks Data Platform requires additional configuration. See Configure for Hortonworks.
Apache Hive is a data warehouse service for querying and managing large datasets in a Hadoop environment using a SQL-like querying language. For more information, see https://hive.apache.org/.
See Configure for Hive.
High availability environment
You can integrate the platform with the Hadoop cluster's high availability configuration, so that the Trifacta platform can match the failover configuration for the cluster.
NOTE: If you are deploying high availability failover, you must use HttpFS, instead of WebHDFS, for communicating with HDFS, which is described in a previous section.
For more information, see Enable Integration with Cluster High Availability.
Acquire Hadoop cluster configuration files
NOTE: If the Trifacta node has been properly configured as a Hadoop Edge node, these files should already exist on the local node. The location of these files on the Hadoop cluster may vary based on Hadoop distribution, version, and enabled components. For more information, please contact your Hadoop administrator.
To enable the platform to use YARN installations, you must provide a set of client
NOTE: The above file is required if you are integrating with Hive, using the Spark running environment, or both. For more information, see Configure for Hadoop.
NOTE: If these configuration files change in the Hadoop cluster, the versions installed on the Trifacta node should be updated, or components may fail to work. You may be better served by setting permissions on these files so that they can be read by the Trifacta user and then creating a symlink from the Trifacta platform node.
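The symlink approach mentioned in this note could be sketched as follows. The function name is hypothetical, and both directories are parameters because the correct source and target locations depend on your distribution and install. The sketch links every `*-site.xml` file rather than copying it, so later changes to the source files are picked up without another copy step.

```python
import os
from pathlib import Path


def link_site_files(source_dir: str, target_dir: str) -> list:
    """Create symlinks in target_dir for every *-site.xml file in source_dir."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    linked = []
    for src in sorted(Path(source_dir).glob("*-site.xml")):
        dest = target / src.name
        if dest.is_symlink() or dest.exists():
            dest.unlink()  # replace a stale copy or an old link
        os.symlink(src.resolve(), dest)
        linked.append(dest.name)
    return linked
```

The account that runs the platform still needs read permission on the source files, as the note describes.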
Locate Client Configuration
For CDH 5:
In Cloudera Manager, select Actions > Download Client Configuration.
Configuration files are also available in /etc/hadoop/conf on any cluster edge node.
For HDP 2:
Client configuration files can be retrieved from an existing client node. Acquire *-site.xml files from
If you are using Hortonworks, you must complete the following modification to the site configuration file that is hosted on the Trifacta node.
NOTE: Before you begin, you must acquire the full version and build number of your Hortonworks distribution. On any of the Hadoop nodes, navigate to /usr/hdp. The version and build number is a directory in this location, named in the following form:
In the Trifacta deployment, edit the following file:
- Perform the following global search and replace:
Replace with your hard-coded version and build number:
- Save the file.
- Restart the Trifacta platform.
YARN maintains its site configuration files in a similar location. These XML files should be retrieved, too.
Deploy Client Configuration
After you've collected the Hadoop client configuration, copy all *-site.xml files to the following:
Restart services. See Start and Stop the Platform.
You can review system services and download log files through the Trifacta application.