
The D s platform supports integration with a number of Hadoop distributions, using a range of components within each distribution. This page provides information on the set of configuration tasks that you need to complete to integrate the platform with your Hadoop environment.

Before You Begin

Key deployment considerations

  1. Hadoop cluster: The Hadoop cluster should already be installed and operational. As part of the install preparation, you should have prepared the Hadoop platform for integration with the D s platform. See Prepare Hadoop for Integration with the Platform.
    1. For more information on the components supported in your Hadoop distribution, see Install Reference.
  2. Storage: on-premises, cloud, or hybrid.
    1. The D s platform can interact with storage that is in the local environment, in the cloud, or in some combination. How your storage is deployed affects your configuration scenarios. See Storage Deployment Options.
  3. Base storage layer: You must configure one storage platform to be the base storage layer. Details are described later.

    Info

    NOTE: Some deployments require that you select a specific base storage layer.

    Warning

    After you have defined the base storage layer, it cannot be changed. Please review your Storage Deployment Options carefully. The required configuration is described later. 

Hadoop versions

The D s platform supports integration only with the versions of Hadoop that are supported for your version of the platform.

Info

NOTE: The versions of your Hadoop software and the libraries in use by the D s platform must match. Unless specifically directed by D s support, integration with your Hadoop cluster using a set of Hadoop libraries from a different version of Hadoop is not supported.

For more information, see Product Support Matrix.

Platform configuration

After the D s platform and its databases have been installed, you can perform platform configuration.

D s config

Info

NOTE: Some platform configuration is required, regardless of your deployment. See Required Platform Configuration.


Required Configuration for Hadoop

Please complete the following sections to configure the platform to work with Hadoop.

Specify 
D s item
itemuser

Info

NOTE: Where possible, you should define or select a user with a userID value greater than 1000. In some environments, lower userID values can result in failures when running jobs on Hadoop.

 

Set the Hadoop username 

D s defaultuser
Typehadoop
Fulltrue
 for the 
D s platform
 to use for executing jobs:

Code Block
"hdfs.username": [hadoop.user],

If the 

D s item
itemsoftware
 is installed in a Kerberos environment, additional steps are required, which are described later.

Data storage

The D s platform supports access to the following Hadoop storage layers:

  • HDFS

  • S3

Set the base storage layer

At this time, you should define the base storage layer from the platform. 

Excerpt Include
Install Set Base Storage Layer
Install Set Base Storage Layer
nopaneltrue
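
As a hedged illustration only (the authoritative parameter name and accepted values are given in Install Set Base Storage Layer), selecting HDFS as the base storage layer is typically a single setting in platform configuration:

Code Block
"webapp.storageProtocol": "hdfs",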

Required configuration for each type of storage is described below.

HDFS

If output files are to be written to an HDFS environment, you must configure the D s platform to interact with HDFS.

Warning

If your deployment is using HDFS, do not use the trifacta/uploads directory. This directory is used for storing uploads and metadata, which may be used by multiple users. Manipulating files outside of the D s webapp can destroy other users' data. Please use the tools provided through the interface for managing uploads from HDFS.

Info

NOTE: Use of HDFS in safe mode is not supported.

Below, replace the value for 

D s defaultuser
Typehadoop
Fulltrue
 with the value appropriate for your environment. 
D s config

Code Block
"hdfs": {
  "username": "[hadoop.user]",
  ...
  "namenode": {
    "host": "hdfs.example.com",
    "port": 8080
  },
}, 
Parameter | Description
username | Username in the Hadoop cluster to be used by the D s platform for executing jobs.
namenode.host | Host name of the namenode in the Hadoop cluster. You may reference multiple namenodes.
namenode.port | Port to use to access the namenode. You may reference multiple namenodes.

Info

NOTE: Default values for the port number depend on your Hadoop distribution. See System Ports.
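
Because you may reference multiple namenodes, hdfs.namenodes can list more than one entry. The following is a sketch only; the entry structure and hostnames are assumptions for illustration, and port values depend on your distribution (see System Ports).

Code Block
"hdfs": {
  ...
  "namenodes": [
    {
      "host": "nn1.example.com",
      "port": 8080
    },
    {
      "host": "nn2.example.com",
      "port": 8080
    }
  ],
},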

Individual users can configure the HDFS directory where exported results are stored.

Info

NOTE: Multiple users cannot share the same home directory.

See Storage Config Page.

Access to HDFS is supported over one of the following protocols:

WebHDFS

If you are using HDFS, it is assumed that WebHDFS has been enabled on the cluster. Apache WebHDFS enables access to an HDFS instance over HTTP REST APIs. For more information, see https://hadoop.apache.org/docs/r1.0.4/webhdfs.html.

The following properties can be modified:

Code Block
"webhdfs": {
  ...
  "version": "/webhdfs/v1",
  "host": "",
  "port": 50070,
  "httpfs": false
}, 
Parameter | Description
version | Path to locally installed version of WebHDFS. NOTE: For version, please leave the default value unless instructed to do otherwise.
host | Hostname for the WebHDFS service. NOTE: If this value is not specified, then the expected host must be defined in hdfs.namenode.host.
port | Port number for WebHDFS. The default value is 50070. NOTE: The default port number for SSL to WebHDFS is 50470.
httpfs | To use HttpFS instead of WebHDFS, set this value to true. The port number must be changed. See HttpFS below.

Steps:

  1. Set webhdfs.host to be the hostname of the node that hosts WebHDFS. 
  2. Set webhdfs.port to be the port number over which WebHDFS communicates. The default value is 50070. For SSL, the default value is 50470.
  3. Set webhdfs.httpfs to false.
  4. For hdfs.namenodes, you must set the host and port values to point to the active namenode for WebHDFS.
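
Putting the steps above together, a completed WebHDFS configuration might look like the following sketch. The hostname is a placeholder for illustration; substitute the node in your cluster that hosts WebHDFS.

Code Block
"webhdfs": {
  ...
  "version": "/webhdfs/v1",
  "host": "webhdfs.example.com",
  "port": 50070,
  "httpfs": false
},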

HttpFS

You can configure the D s platform to use the HttpFS service to communicate with HDFS, in addition to WebHDFS.

Info

NOTE: HttpFS serves as a proxy to WebHDFS. When HttpFS is enabled, both services are required.

In some cases, HttpFS is required:

  • Your deployment uses high availability failover (see High availability environment below).
  • Your secured HDFS user account has access restrictions.

If your environment meets any of the above requirements, you must enable HttpFS. For more information, see Enable HttpFS.
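
As a sketch only, enabling HttpFS amounts to repointing the webhdfs settings at the HttpFS service; see Enable HttpFS for the authoritative steps. The hostname below is a placeholder, and 14000 is assumed to be the standard HttpFS port in your environment.

Code Block
"webhdfs": {
  ...
  "host": "httpfs.example.com",
  "port": 14000,
  "httpfs": true
},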

S3

The D s platform can integrate with an S3 bucket. See Enable S3 Access.

Configure ResourceManager settings

Configure the following:

Code Block
"yarn.resourcemanager.host": "hadoop",
"yarn.resourcemanager.port": 8032,
Info

NOTE: Do not modify the other host/port settings unless you have specific information requiring the modifications.

For more information, see System Ports.

Specify distribution client bundle

The D s platform ships with client bundles supporting a number of major Hadoop distributions. You must configure the platform to use the jarfile for your distribution. These bundles are stored in the following directory:

/opt/trifacta/hadoop-deps

Configure the bundle distribution property (hadoopBundleJar) in platform configuration. Examples:

Hadoop Distribution | hadoopBundleJar property value
Cloudera | "hadoop-deps/cdh-x.y/build/libs/cdh-x.y-bundle.jar"
Hortonworks | "hadoop-deps/hdp-x.y/build/libs/hdp-x.y-bundle.jar"

where:

x.y is the major-minor build number (e.g. 5.4)

Info

NOTE: The path must be specified relative to the install directory.
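
For example, for a Cloudera 5.4 cluster, the property would look like the following (substitute the major-minor version of your distribution):

Code Block
"hadoopBundleJar": "hadoop-deps/cdh-5.4/build/libs/cdh-5.4-bundle.jar",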

Tip

Tip: If there is no bundle for the distribution you need, you might try the one that is the closest match in terms of Apache Hadoop baseline. For example, CDH5 is based on Apache Hadoop 2.3.0, so that client bundle will probably work against a vanilla Apache Hadoop 2.3.0 installation. For more information, see D s support.

Cloudera distribution

Some additional configuration is required. See Configure for Cloudera.

Hortonworks distribution

After install, integration with the Hortonworks Data Platform requires additional configuration. See Configure for Hortonworks.

Default Hadoop job results format

For smaller datasets, the platform recommends using the D s photon running environment.

For larger datasets, or if size information is unavailable, the platform by default recommends running the job on the Hadoop cluster. For these jobs, the default publishing action is set to run on the Hadoop cluster, generating the output format defined by this parameter. Publishing actions, including output format, can always be changed as part of the job specification.

As needed, you can change this default format. 

D s config

Code Block
"webapp.defaultHadoopFileFormat": "csv",

Accepted values: csv, json, avro, pqt
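
For example, to write Parquet output by default for Hadoop jobs:

Code Block
"webapp.defaultHadoopFileFormat": "pqt",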

For more information, see Run Job Page.

Additional Configuration for Hadoop

Authentication

Kerberos

The D s platform supports integration with Kerberos security. The platform can utilize Kerberos secure impersonation to broker interactions with the Hadoop environment.

See Configure for Kerberos Integration.

See Configure for Secure Impersonation.

Single Sign-On

The D s platform can integrate with your SSO platform to manage authentication to the D s webapp. See Configure SSO for AD-LDAP.

Hadoop KMS

If you are using Hadoop KMS to encrypt data transfers to and from the Hadoop cluster, additional configuration is required. See Configure for KMS.

Hive access

Apache Hive is a data warehouse service for querying and managing large datasets in a Hadoop environment using a SQL-like querying language. For more information, see https://hive.apache.org/.

See Configure for Hive.

High availability environment

You can integrate the platform with the Hadoop cluster's high availability configuration, so that the D s platform can match the failover configuration for the cluster.

Info

NOTE: If you are deploying high availability failover, you must use HttpFS, instead of WebHDFS, for communicating with HDFS, which is described in a previous section.

For more information, see Enable Integration with Cluster High Availability.

Include Page
Install Cluster Configuration Files
Install Cluster Configuration Files

Configure Snappy publication

If you are publishing using Snappy compression, you may need to perform the following additional configuration.

Steps:

  1. Verify that the snappy and snappy-devel packages have been installed on the D s node. For more information, see https://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-common/NativeLibraries.html.

  2. From the D s node, execute the following command:

    Code Block
    hadoop checknative
  3. The above command identifies where the native libraries are located on the D s node.
  4. Cloudera:
    1. On the cluster, locate the libsnappy.so file. Verify that this file has been installed on all nodes of the cluster, including the D s node. Retain the path to the file on the D s node.
    2. D s config
    3. Locate the spark.props configuration block. Insert the following properties and values inside the block (a filled-in sketch appears after these steps):

      Code Block
      "spark.driver.extraLibraryPath": "/path/to/file",
      "spark.executor.extraLibraryPath": "/path/to/file",
  5. Hortonworks:
    1. Verify on the D s node that the following locations are available:

      Info

      NOTE: The asterisk below is a wildcard. Please collect the entire path of both values.

      Code Block
      /hadoop-client/lib/snappy*.jar
      /hadoop-client/lib/native/
    2. D s config
    3. Locate the spark.props configuration block. Insert the following properties and values inside the block:

      Code Block
      "spark.driver.extraLibraryPath": "/hadoop-client/lib/snappy*.jar;/hadoop-client/lib/native/",
      "spark.executor.extraLibraryPath": "/hadoop-client/lib/snappy*.jar;/hadoop-client/lib/native/",
  6. Save your changes and restart the platform.
  7. Verify that the /tmp directory has the proper permissions for publication. For more information, see Supported File Formats.
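
For the Cloudera case in step 4, a filled-in spark.props block might look like the following sketch. The native library path is an assumption for illustration; use the location of libsnappy.so identified for your cluster.

Code Block
"spark.props": {
  ...
  "spark.driver.extraLibraryPath": "/opt/cloudera/parcels/CDH/lib/hadoop/lib/native",
  "spark.executor.extraLibraryPath": "/opt/cloudera/parcels/CDH/lib/hadoop/lib/native"
},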

Debugging

You can review system services and download log files through the D s webapp.

See System Services and Logs.