
Release 8.2.1




The Trifacta platform supports access to the following Hadoop storage layers:

  • HDFS

  • S3

Set the base storage layer

At this time, you should define the base storage layer from the platform. 

The platform requires that one backend datastore be configured as the base storage layer. This base storage layer is used for storing uploaded data and writing results and profiles. Please complete the following steps to set the base storage layer for the Trifacta platform.

You cannot change the base storage layer after it has been set. You must uninstall and reinstall the platform to change it.

Steps:

  1. You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.
  2. Locate the following parameter and set it to the value for your base storage layer:

    "webapp.storageProtocol": "hdfs",
  3. Save your changes and restart the platform.

    NOTE: To complete the integration with the base storage layer, additional configuration is required.
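
If you edit trifacta-conf.json directly rather than through the Admin Settings Page, the change in step 2 can be sketched as follows. This is a minimal illustration, not a supported tool: the helper names are hypothetical, the file is assumed to be nested JSON in which the dotted name webapp.storageProtocol maps to a storageProtocol key under webapp, and "s3" is assumed to be the value for the S3 storage layer. A platform restart is still required afterward.

```python
import json
from pathlib import Path

# Assumption: only "hdfs" appears in the documentation above; "s3" is assumed
# to be the matching value for the S3 storage layer.
SUPPORTED_PROTOCOLS = {"hdfs", "s3"}


def set_dotted(config: dict, dotted_key: str, value) -> None:
    """Set a value in a nested dict using a dotted path such as 'webapp.storageProtocol'."""
    *parents, leaf = dotted_key.split(".")
    node = config
    for name in parents:
        # Create intermediate objects when they do not exist yet.
        node = node.setdefault(name, {})
    node[leaf] = value


def set_base_storage_layer(conf_path: Path, protocol: str) -> None:
    """Rewrite the configuration file with the chosen base storage layer."""
    if protocol not in SUPPORTED_PROTOCOLS:
        raise ValueError(f"unsupported storage protocol: {protocol!r}")
    config = json.loads(conf_path.read_text())
    set_dotted(config, "webapp.storageProtocol", protocol)
    conf_path.write_text(json.dumps(config, indent=2) + "\n")
```

Because the base storage layer cannot be changed after it has been set, a script like this is only appropriate during initial configuration.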

Required configuration for each type of storage is described below.

S3

The Trifacta platform can integrate with an S3 bucket:

  • If you are using HDFS as the base storage layer, you can integrate with S3 for read-only access.
  • If you are using S3 as the base storage layer, read-write access is required.

    NOTE: If you are integrating with S3, additional configuration is required. Instead of completing the HDFS configuration below, please enable read-write access to S3. See Enable S3 Access in the Configuration Guide.

HDFS

If output files are to be written to an HDFS environment, you must configure the Trifacta platform to interact with HDFS. 

If your deployment is using HDFS, do not use the trifacta/uploads directory. This directory is used for storing uploads and metadata, which may be used by multiple users. Manipulating files outside of the Trifacta application can destroy other users' data. Please use the tools provided through the interface for managing uploads from HDFS.

NOTE: Use of HDFS in safe mode is not supported.

Below, replace the value for [hadoop.user (default=trifacta)] with the value appropriate for your environment. You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.

"hdfs": {
  "username": "[hadoop.user]",
  ...
  "namenode": {
    "host": "hdfs.example.com",
    "port": 8080
  },
}, 

| Parameter | Description |
| --- | --- |
| username | Username in the Hadoop cluster to be used by the Trifacta platform for executing jobs. |
| namenode.host | Host name of the namenode in the Hadoop cluster. You may reference multiple namenodes. |
| namenode.port | Port to use to access the namenode. You may reference multiple namenodes. NOTE: Default values for the port number depend on your Hadoop distribution. See System Ports in the Planning Guide. |
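
Before restarting the platform, it can help to check that the hdfs settings described above are filled in. The following is a minimal sketch under stated assumptions: validate_hdfs_config is a hypothetical helper, not part of the platform, and it assumes the hdfs block has already been loaded into a Python dict with a single namenode entry.

```python
def validate_hdfs_config(hdfs: dict) -> list:
    """Return a list of problems found; an empty list means the block looks complete."""
    problems = []

    # The documentation uses "[hadoop.user]" as a placeholder that must be replaced.
    username = hdfs.get("username", "")
    if not username or username.startswith("["):
        problems.append("hdfs.username is missing or still a placeholder")

    namenode = hdfs.get("namenode", {})
    if not namenode.get("host"):
        problems.append("hdfs.namenode.host is empty")

    # JSON booleans load as bool, which is a subclass of int, so exclude them explicitly.
    port = namenode.get("port")
    if not isinstance(port, int) or isinstance(port, bool) or not 1 <= port <= 65535:
        problems.append("hdfs.namenode.port must be an integer between 1 and 65535")

    return problems
```

The check only confirms that values are present and well-formed. It does not confirm that the port is correct for your Hadoop distribution.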

Individual users can configure the HDFS directory where exported results are stored.

NOTE: Multiple users cannot share the same home directory.

See Storage Config Page in the User Guide.

Access to HDFS is supported over one of the following protocols:

WebHDFS

If you are using HDFS, it is assumed that WebHDFS has been enabled on the cluster. Apache WebHDFS enables access to an HDFS instance over HTTP REST APIs. For more information, see https://hadoop.apache.org/docs/r1.0.4/webhdfs.html.

The following properties can be modified:

"webhdfs": {
  ...
  "version": "/webhdfs/v1",
  "host": "",
  "port": 50070,
  "httpfs": false
}, 

| Parameter | Description |
| --- | --- |
| version | Path to locally installed version of WebHDFS. NOTE: For version, please leave the default value unless instructed to do otherwise. |
| host | Hostname for the WebHDFS service. NOTE: If this value is not specified, then the expected host must be defined in hdfs.namenode.host. |
| port | Port number for WebHDFS. The default value is 50070. NOTE: The default port number for SSL to WebHDFS is 50470. |
| httpfs | To use HttpFS instead of WebHDFS, set this value to true. The port number must be changed. See HttpFS below. |

Steps:

  1. Set webhdfs.host to be the hostname of the node that hosts WebHDFS. 
  2. Set webhdfs.port to be the port number over which WebHDFS communicates. The default value is 50070. For SSL, the default value is 50470.
  3. Set webhdfs.httpfs to false.
  4. For hdfs.namenodes, you must set the host and port values to point to the active namenode for WebHDFS.
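
To see how these settings combine, the sketch below builds the REST URL that a WebHDFS client would call. It is an illustration only and does not describe how the platform constructs its requests internally. The webhdfs_url helper and the use_ssl flag are hypothetical, the user.name query parameter is the standard WebHDFS way to identify the caller on a cluster without security enabled, and the fallback to hdfs.namenode.host follows the note on the host parameter.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(webhdfs: dict, hdfs: dict, path: str,
                op: str = "LISTSTATUS", use_ssl: bool = False) -> str:
    """Build a WebHDFS REST URL from the webhdfs and hdfs configuration blocks."""
    # If webhdfs.host is empty, fall back to the namenode host.
    host = webhdfs.get("host") or hdfs["namenode"]["host"]
    scheme = "https" if use_ssl else "http"
    base = f"{scheme}://{host}:{webhdfs['port']}{webhdfs['version']}"
    query = urlencode({"op": op, "user.name": hdfs["username"]})
    # quote() leaves "/" intact, so an absolute HDFS path is appended as-is.
    return f"{base}{quote(path)}?{query}"


webhdfs = {"version": "/webhdfs/v1", "host": "", "port": 50070, "httpfs": False}
hdfs = {"username": "trifacta", "namenode": {"host": "hdfs.example.com", "port": 8080}}
print(webhdfs_url(webhdfs, hdfs, "/tmp"))
# prints http://hdfs.example.com:50070/webhdfs/v1/tmp?op=LISTSTATUS&user.name=trifacta
```

Requesting the resulting URL with any HTTP client is a quick way to confirm that the host and port values are reachable.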

HttpFS

You can configure the Trifacta platform to use the HttpFS service to communicate with HDFS, in addition to WebHDFS. 

NOTE: HttpFS serves as a proxy to WebHDFS. When HttpFS is enabled, both services are required.

In some cases, HttpFS is required:

  • High availability requires HttpFS.
  • Your secured HDFS user account has access restrictions.

If your environment meets any of the above requirements, you must enable HttpFS. For more information, see Enable HttpFS in the Configuration Guide.
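
The decision above can be sketched as a small helper that produces the webhdfs settings for either case. This is a hedged illustration: needs_httpfs and webhdfs_block are hypothetical names, and port 14000 is the stock Apache Hadoop default for HttpFS, assumed here because this page does not state the port. Confirm the value for your distribution before using it.

```python
# Assumption: 14000 is the stock Apache Hadoop HttpFS port; your distribution may differ.
HTTPFS_DEFAULT_PORT = 14000
WEBHDFS_DEFAULT_PORT = 50070
WEBHDFS_SSL_DEFAULT_PORT = 50470


def needs_httpfs(high_availability: bool, restricted_account: bool) -> bool:
    """HttpFS is required if either condition listed above applies."""
    return high_availability or restricted_account


def webhdfs_block(host: str, high_availability: bool, restricted_account: bool,
                  use_ssl: bool = False) -> dict:
    """Return webhdfs settings for plain WebHDFS or for HttpFS."""
    if needs_httpfs(high_availability, restricted_account):
        # When httpfs is true, the port must be changed from the WebHDFS default.
        return {"version": "/webhdfs/v1", "host": host,
                "port": HTTPFS_DEFAULT_PORT, "httpfs": True}
    port = WEBHDFS_SSL_DEFAULT_PORT if use_ssl else WEBHDFS_DEFAULT_PORT
    return {"version": "/webhdfs/v1", "host": host, "port": port, "httpfs": False}
```

For example, webhdfs_block("httpfs.example.com", True, False) yields a block with httpfs set to true and the assumed HttpFS port.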
