D toc |
---|
The platform can be configured to integrate with a supported Hadoop cluster. This section describes how to configure the platform for that integration.
Before You Begin
Key deployment considerations
- Hadoop cluster: The Hadoop cluster should already be installed and operational. As part of the install preparation, you should have prepared the Hadoop platform for integration with the platform. See Prepare Hadoop for Integration with the Platform. For more information on the components supported in your Hadoop distribution, see Install Reference.
- Storage: The platform can interact with storage that is in the local environment (on-premises), in the cloud, or in some hybrid combination. How your storage is deployed affects your configuration scenarios. See Storage Deployment Options.
- Base storage layer: You must configure one storage platform to be the base storage layer. Details are described later.
  NOTE: Some deployments require that you select a specific base storage layer.
  Warning: After you have defined the base storage layer, it cannot be changed. Please review your Storage Deployment Options carefully. The required configuration is described later.
Hadoop versions
The platform supports integration with a specific set of Hadoop distributions and versions.
NOTE: The versions of your Hadoop software and the libraries in use by the platform must match.
For more information, see Product Support Matrix.
Platform configuration
After the platform has been installed, you can apply the following changes through platform configuration.
NOTE: Some platform configuration is required, regardless of your deployment. See Required Platform Configuration.
Required Configuration for Hadoop
Please complete the following sections to configure the platform to work with Hadoop.
Specify Hadoop user
NOTE: Where possible, you should define or select a user with a userID value greater than 1000. In some environments, lower userID values can result in failures when running jobs on Hadoop.
Set the Hadoop username
Set the Hadoop username for the platform to use when running jobs. Replace [hadoop.user] with the username appropriate for your environment. You can apply this change through platform configuration:

    "hdfs.username": [hadoop.user],
If the platform is installed in a Kerberos-secured environment, additional configuration for this user is required. See the Authentication section later in this document.
Data storage
The platform supports the following Hadoop storage layers:
- HDFS
- S3
Set the base storage layer
At this time, you should define the base storage layer from the platform.
Required configuration for each type of storage is described below.
HDFS
If output files are to be written to an HDFS environment, you must configure the platform to interact with HDFS.
Hadoop Distributed File System (HDFS) is a distributed file system that provides read-write access to large datasets in a Hadoop cluster. For more information, see http://hadoop.apache.org/docs/r1.2.1/hdfs_design.html.
Warning: If your deployment is using HDFS, do not use the
NOTE: Use of HDFS in safe mode is not supported.
Below, replace the value for [hadoop.user] with the username appropriate for your deployment. You can apply this change through platform configuration:

    "hdfs": {
      "username": "[hadoop.user]",
      ...
      "namenode": {
        "host": "hdfs.example.com",
        "port": 8080
      },
    },
Parameter | Description
---|---
username | Username in the Hadoop cluster to be used by the platform.
namenode.host | Host name of the namenode in the Hadoop cluster. You may reference multiple namenodes.
namenode.port | Port to use to access the namenode. You may reference multiple namenodes.
Individual users can configure the HDFS directory where exported results are stored.
NOTE: Multiple users cannot share the same home directory.
See Storage Config Page.
Access to HDFS is supported over one of the following protocols:
WebHDFS
If you are using HDFS, it is assumed that WebHDFS has been enabled on the cluster. Apache WebHDFS enables access to an HDFS instance over HTTP REST APIs. For more information, see https://hadoop.apache.org/docs/r1.0.4/webhdfs.html.
The following properties can be modified:
    "webhdfs": {
      ...
      "version": "/webhdfs/v1",
      "host": "",
      "port": 50070,
      "httpfs": false
    },
Parameter | Description
---|---
version | Path to locally installed version of WebHDFS.
host | Hostname for the WebHDFS service.
port | Port number for WebHDFS. The default value is 50070; for SSL, the default is 50470.
httpfs | To use HttpFS instead of WebHDFS, set this value to true. The port number must be changed. See HttpFS below.
Steps:
- Set webhdfs.host to be the hostname of the node that hosts WebHDFS.
- Set webhdfs.port to be the port number over which WebHDFS communicates. The default value is 50070. For SSL, the default value is 50470.
- Set webhdfs.httpfs to false.
- For hdfs.namenodes, you must set the host and port values to point to the active namenode for WebHDFS, as shown in the sketch below.
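The following is a minimal sketch of these settings, assuming a single active namenode serving WebHDFS on a host named webhdfs.example.com (a hypothetical hostname) over the default non-SSL port; the list structure shown for hdfs.namenodes is also an assumption for illustration. Substitute the values for your cluster:

    "webhdfs": {
      "version": "/webhdfs/v1",
      "host": "webhdfs.example.com",
      "port": 50070,
      "httpfs": false
    },
    "hdfs": {
      ...
      "namenodes": [
        {
          "host": "webhdfs.example.com",
          "port": 50070
        }
      ]
    },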
HttpFS
You can configure the platform to use HttpFS, instead of WebHDFS, to communicate with HDFS.
NOTE: HttpFS serves as a proxy to WebHDFS. When HttpFS is enabled, both services are required.
In some cases, HttpFS is required:
- High availability requires HttpFS.
- Your secured HDFS user account has access restrictions.
If your environment meets any of the above requirements, you must enable HttpFS. For more information, see Enable HttpFS.
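A minimal sketch of the corresponding change, which enables the httpfs flag described under WebHDFS above and points the port at the HttpFS service, follows. The port value of 14000 shown here is a commonly used HttpFS default and is an assumption; verify the port in use on your cluster:

    "webhdfs": {
      ...
      "port": 14000,
      "httpfs": true
    },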
S3
The platform can also use S3 as a storage layer. For more information, see Enable S3 Access.
Configure ResourceManager settings
Configure the following:
    "yarn.resourcemanager.host": "hadoop",
    "yarn.resourcemanager.port": 8032,
NOTE: Do not modify the other host/port settings unless you have specific information requiring the modifications.
For more information, see System Ports.
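For example, if your ResourceManager runs on a host named yarn-rm.example.com (a hypothetical hostname) and listens on the port shown above, the entries would look like this sketch:

    "yarn.resourcemanager.host": "yarn-rm.example.com",
    "yarn.resourcemanager.port": 8032,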
Specify distribution client bundle
The platform ships with client bundles supporting a number of major Hadoop distributions. These bundles are stored in the following directory on the Trifacta node:

    /opt/trifacta/hadoop-deps

Configure the bundle distribution property (hadoopBundleJar) in platform configuration. Examples:
Hadoop Distribution | hadoopBundleJar property value
---|---
Cloudera | "hadoop-deps/cdh-x.y/build/libs/cdh-x.y-bundle.jar"
Hortonworks | "hadoop-deps/hdp-x.y/build/libs/hdp-x.y-bundle.jar"
where x.y is the major-minor build number (e.g. 5.4).
NOTE: The path must be specified relative to the install directory.
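For example, for a hypothetical Hortonworks 2.3 installation, the property would look like the following sketch; substitute the build number for your distribution:

    "hadoopBundleJar": "hadoop-deps/hdp-2.3/build/libs/hdp-2.3-bundle.jar",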
Tip: If there is no bundle for the distribution you need, you might try the one that is the closest match in terms of Apache Hadoop baseline. For example, CDH5 is based on Apache Hadoop 2.3.0, so that client bundle will probably run without issues against a vanilla Apache Hadoop 2.3.0 installation.
Cloudera distribution
Some additional configuration is required. See Configure for Cloudera.
Hortonworks distribution
After install, integration with the Hortonworks Data Platform requires additional configuration. See Configure for Hortonworks.
Default Hadoop job results format
For smaller datasets, the platform recommends using the Photon running environment.
For larger datasets, or if the size information is unavailable, the platform by default recommends that you run the job on the Hadoop cluster. For these jobs, the default publishing action is specified to run on the Hadoop cluster, generating the output format defined by this parameter. Publishing actions, including output format, can always be changed as part of the job specification.
As needed, you can change this default format.
You can apply this change through platform configuration:

    "webapp.defaultHadoopFileFormat": "csv",
Accepted values: csv, json, avro, pqt
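For example, to make Avro the default output format for Hadoop jobs, you could set the following (a sketch; any of the accepted values can be used):

    "webapp.defaultHadoopFileFormat": "avro",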
For more information, see Run Job Page.
Additional Configuration for Hadoop
Authentication
Kerberos
The platform can integrate with a Kerberos-enabled Hadoop cluster. Secure impersonation is also supported.
See Configure for Kerberos Integration.
See Configure for Secure Impersonation.
Single Sign-On
The platform can integrate with your enterprise single sign-on (SSO) solution for authenticating users to the Trifacta application.
Hadoop KMS
If you are using Hadoop KMS to encrypt data transfers to and from the Hadoop cluster, additional configuration is required. See Configure for KMS.
Hive access
Apache Hive is a data warehouse service for querying and managing large datasets in a Hadoop environment using a SQL-like querying language. For more information, see https://hive.apache.org/.
See Configure for Hive.
High availability environment
You can integrate the platform with the Hadoop cluster's high availability configuration, so that the platform can continue to communicate with the cluster in the event of a failover.
NOTE: If you are deploying high availability failover, you must use HttpFS, instead of WebHDFS, for communicating with HDFS, which is described in a previous section.
For more information, see Enable Integration with Cluster High Availability.
Configure Snappy publication
If you are publishing using Snappy compression, you may need to perform the following additional configuration.
Steps:
- Verify that the snappy and snappy-devel packages have been installed on the Trifacta node. For more information, see https://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-common/NativeLibraries.html.
- From the Trifacta node, execute the following command:

    hadoop checknative
- The above command identifies where the native libraries are located on the Trifacta node.
- Cloudera:
  - On the cluster, locate the libsnappy.so file. Verify that this file has been installed on all nodes of the cluster, including the Trifacta node. Retain the path to the file on the Trifacta node.
  - In platform configuration, locate the spark.props configuration block. Insert the following properties and values inside the block:

        "spark.driver.extraLibraryPath": "/path/to/file",
        "spark.executor.extraLibraryPath": "/path/to/file",
- Hortonworks:
  - Verify on the Trifacta node that the following locations are available:
    NOTE: The asterisk below is a wildcard. Please collect the entire path of both values.

        /hadoop-client/lib/snappy*.jar
        /hadoop-client/lib/native/

  - In platform configuration, locate the spark.props configuration block. Insert the following properties and values inside the block:

        "spark.driver.extraLibraryPath": "/hadoop-client/lib/snappy*.jar;/hadoop-client/lib/native/",
        "spark.executor.extraLibraryPath": "/hadoop-client/lib/snappy*.jar;/hadoop-client/lib/native/",
- Save your changes and restart the platform.
- Verify that the /tmp directory has the proper permissions for publication. For more information, see Supported File Formats.
Debugging
You can review system services and download log files through the Trifacta application.