The platform supports access to the following Hadoop storage layers:

  • HDFS

  • S3

Set the base storage layer

At this time, you should define the base storage layer for the platform.

For more information, see Install Set Base Storage Layer.
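
For orientation, a minimal sketch of this setting follows. The property name webapp.storageProtocol and its values are assumptions in this sketch; confirm the exact property and procedure in Install Set Base Storage Layer.

Code Block
"webapp": {
  ...
  "storageProtocol": "hdfs"
},

Set the value to hdfs or s3, depending on the base storage layer chosen for your environment.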

Required configuration for each type of storage is described below.

S3

The platform can integrate with an S3 bucket:

  • If you are using HDFS as the base storage layer, you can integrate with S3 for read-only access.
  • If you are using S3 as the base storage layer, read-write access to S3 is required.

    NOTE: If you are integrating with S3, additional configuration is required. Instead of completing the HDFS configuration below, please enable read-write access to S3. See S3 Access in the Configuration Guide.
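
For orientation only, a hedged sketch of what enabling read-write S3 access can look like in the platform configuration is shown below. The property names and structure (aws.s3.enabled, aws.s3.bucket.name, and the credential entries) are assumptions in this sketch; S3 Access in the Configuration Guide is the authoritative reference.

Code Block
"aws": {
  "s3": {
    "enabled": true,
    "bucket": {
      "name": "[s3-bucket-name]"
    },
    "key": "[aws-access-key-id]",
    "secret": "[aws-secret-access-key]"
  }
},

The bracketed values are placeholders; supply the bucket and credentials appropriate for your environment.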

HDFS

If output files are to be written to an HDFS environment, you must configure the platform to interact with HDFS.

Warning

If your deployment is using HDFS, do not manipulate the trifacta/uploads directory directly. This directory is used for storing uploads and metadata, which may be used by multiple users. Manipulating files outside of the application can destroy other users' data. Please use the tools provided through the interface for managing uploads from HDFS.

NOTE: Use of HDFS in safe mode is not supported.

Below, replace the [hadoop.user] placeholder with the Hadoop username appropriate for your environment. Make this change in the platform configuration.

Code Block
"hdfs": {
  "username": "[hadoop.user]",
  ...
  "namenode": {
    "host": "hdfs.example.com",
    "port": 8080
  },
}, 
Parameter descriptions:

  • username - Username in the Hadoop cluster to be used by the platform for executing jobs.
  • namenode.host - Host name of the namenode in the Hadoop cluster. You may reference multiple namenodes.
  • namenode.port - Port to use to access the namenode. You may reference multiple namenodes.

NOTE: Default values for the port number depend on your Hadoop distribution. See System Ports in the Planning Guide.

Individual users can configure the HDFS directory where exported results are stored.

NOTE: Multiple users cannot share the same home directory.

See Storage Config Page in the User Guide.

Access to HDFS is supported over one of the following protocols:

WebHDFS

If you are using HDFS, it is assumed that WebHDFS has been enabled on the cluster. Apache WebHDFS enables access to an HDFS instance over HTTP REST APIs. For more information, see https://hadoop.apache.org/docs/r1.0.4/webhdfs.html.

The following properties can be modified:

Code Block
"webhdfs": {
  ...
  "version": "/webhdfs/v1",
  "host": "",
  "port": 50070,
  "httpfs": false
}, 
Parameter descriptions:

  • version - Path to locally installed version of WebHDFS.

    NOTE: For version, please leave the default value unless instructed to do otherwise.

  • host - Hostname for the WebHDFS service.

    NOTE: If this value is not specified, then the expected host must be defined in hdfs.namenode.host.

  • port - Port number for WebHDFS. The default value is 50070.

    NOTE: The default port number for SSL to WebHDFS is 50470.

  • httpfs - To use HttpFS instead of WebHDFS, set this value to true. The port number must also be changed. See HttpFS below.

Steps:

  1. Set webhdfs.host to be the hostname of the node that hosts WebHDFS. 
  2. Set webhdfs.port to be the port number over which WebHDFS communicates. The default value is 50070. For SSL, the default value is 50470.
  3. Set webhdfs.httpfs to false.
  4. For hdfs.namenodes, you must set the host and port values to point to the active namenode for WebHDFS.
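
As a worked illustration of steps 1-3, the resulting webhdfs settings could look like the following. The host hdfs.example.com is an example value only; per step 4, the hdfs.namenode settings shown earlier must point to the same active namenode.

Code Block
"webhdfs": {
  ...
  "version": "/webhdfs/v1",
  "host": "hdfs.example.com",
  "port": 50070,
  "httpfs": false
},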

HttpFS

You can configure the platform to use the HttpFS service to communicate with HDFS, in addition to WebHDFS.

NOTE: HttpFS serves as a proxy to WebHDFS. When HttpFS is enabled, both services are required.

In some cases, HttpFS is required:

  • High availability requires HttpFS.
  • Your secured HDFS user account has access restrictions.

If your environment meets any of the above requirements, you must enable HttpFS. For more information, see Enable HttpFS in the Configuration Guide.
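
As a minimal sketch, switching the platform from WebHDFS to HttpFS could involve pointing the webhdfs settings at the HttpFS service, which conventionally listens on port 14000. The hostname below is an example value, and the full procedure may include additional steps; see Enable HttpFS in the Configuration Guide.

Code Block
"webhdfs": {
  ...
  "version": "/webhdfs/v1",
  "host": "httpfs.example.com",
  "port": 14000,
  "httpfs": true
},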