- HDFS
- S3
Set the base storage layer
At this time, you should define the base storage layer for the platform.
The platform requires that one backend datastore be configured as the base storage layer. This base storage layer is used for storing uploaded data and writing results and profiles. Please complete the following steps to set the base storage layer for the Designer Cloud Powered by Trifacta platform.
You cannot change the base storage layer after it has been set. You must uninstall and reinstall the platform to change it.
Steps:
- You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.
- Locate the following parameter and set it to the value for your base storage layer:
"webapp.storageProtocol": "hdfs",
- Save your changes and restart the platform.
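If you edit the configuration with a script rather than through the Admin Settings Page, the change could look like the following minimal sketch. It assumes that dotted parameter names such as webapp.storageProtocol map to nested JSON objects, and that "s3" is the value for S3 storage; both are assumptions, and the helper names are hypothetical. The check against an existing value reflects the rule that the base storage layer cannot be changed after it has been set.

```python
import json

# Assumption: only "hdfs" appears in the documented example; "s3" is assumed here.
SUPPORTED_PROTOCOLS = ("hdfs", "s3")


def set_dotted(config: dict, dotted_key: str, value) -> dict:
    """Set a value in a nested dict using a dotted key such as 'a.b.c'."""
    *parents, leaf = dotted_key.split(".")
    node = config
    for part in parents:
        # Create intermediate objects when they do not exist yet.
        node = node.setdefault(part, {})
    node[leaf] = value
    return config


def set_base_storage_layer(config: dict, protocol: str) -> dict:
    """Set webapp.storageProtocol, refusing to change a value that is already set."""
    if protocol not in SUPPORTED_PROTOCOLS:
        raise ValueError(f"unsupported storage protocol: {protocol!r}")
    current = config.get("webapp", {}).get("storageProtocol")
    if current not in (None, "", protocol):
        raise RuntimeError(
            f"base storage layer is already {current!r}; reinstall the platform to change it"
        )
    return set_dotted(config, "webapp.storageProtocol", protocol)


# Example with an in-memory configuration (hypothetical content).
conf = {"webapp": {}}
set_base_storage_layer(conf, "hdfs")
print(json.dumps(conf, indent=2))
```

After writing the updated configuration back to disk, the platform still needs to be restarted for the change to take effect.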
NOTE: To complete the integration with the base storage layer, additional configuration is required.
Required configuration for each type of storage is described below.
S3
The Designer Cloud Powered by Trifacta platform can integrate with an S3 bucket:
- If you are using HDFS as the base storage layer, you can integrate with S3 for read-only access.
- The base storage layer requires read-write access.
NOTE: If you are integrating with S3, additional configuration is required. Instead of completing the HDFS configuration below, please enable read-write access to S3. See S3 Access in the Configuration Guide.
HDFS
If output files are to be written to an HDFS environment, you must configure the Designer Cloud Powered by Trifacta platform to interact with HDFS.
- Hadoop Distributed File System (HDFS) is a distributed file system that provides read-write access to large datasets in a Hadoop cluster. For more information, see http://hadoop.apache.org/docs/r1.2.1/hdfs_design.html.
If your deployment is using HDFS, do not use the trifacta/uploads directory. This directory is used for storing uploads and metadata, which may be used by multiple users. Manipulating files outside of the Designer Cloud application can destroy other users' data. Please use the tools provided through the interface for managing uploads from HDFS.
NOTE: Use of HDFS in safe mode is not supported.
Below, replace the value for [hadoop.user (default=trifacta)] with the value appropriate for your environment. You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.
"hdfs": {
  "username": "[hadoop.user]",
  ...
  "namenode": {
    "host": "hdfs.example.com",
    "port": 8080
  },
},
Parameter | Description |
---|---|
username | Username in the Hadoop cluster to be used by the Designer Cloud Powered by Trifacta platform for executing jobs. |
namenode.host | Host name of namenode in the Hadoop cluster. You may reference multiple namenodes. |
namenode.port | Port to use to access the namenode. You may reference multiple namenodes. NOTE: Default values for the port number depend on your Hadoop distribution. See System Ports in the Planning Guide. |
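As an illustration of the three parameters above, the following sketch assembles the hdfs block and rejects an unreplaced [hadoop.user] placeholder. The function name is hypothetical, the host and port are the example values from this page rather than recommended defaults, and the settings elided by "..." in the fragment are not covered.

```python
import json


def build_hdfs_block(username: str, host: str, port: int) -> dict:
    """Build the documented subset of the hdfs settings as a nested dict."""
    # Catch the common mistake of leaving the placeholder in place.
    if not username or username.startswith("["):
        raise ValueError("replace the [hadoop.user] placeholder with a real username")
    if not host:
        raise ValueError("namenode.host must not be empty")
    if not 0 < port < 65536:
        raise ValueError(f"namenode.port out of range: {port}")
    return {
        "hdfs": {
            "username": username,
            "namenode": {"host": host, "port": port},
        }
    }


# Example values taken from the fragment above; the correct port depends on
# your Hadoop distribution.
print(json.dumps(build_hdfs_block("trifacta", "hdfs.example.com", 8080), indent=2))
```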
Individual users can configure the HDFS directory where exported results are stored.
NOTE: Multiple users cannot share the same home directory.
See Storage Config Page in the User Guide.
Access to HDFS is supported over one of the following protocols:
WebHDFS
If you are using HDFS, it is assumed that WebHDFS has been enabled on the cluster. Apache WebHDFS enables access to an HDFS instance over HTTP REST APIs. For more information, see https://hadoop.apache.org/docs/r1.0.4/webhdfs.html.
The following properties can be modified:
"webhdfs": {
  ...
  "version": "/webhdfs/v1",
  "host": "",
  "port": 50070,
  "httpfs": false
},
Parameter | Description |
---|---|
version | Path to locally installed version of WebHDFS. NOTE: For |
host | Hostname for the WebHDFS service. NOTE: If this value is not specified, then the expected host must be defined in |
port | Port number for WebHDFS. The default value is 50070. NOTE: The default port number for SSL to WebHDFS is 50470. |
httpfs | To use HttpFS instead of WebHDFS, set this value to true. The port number must be changed. See HttpFS below. |
Steps:
- Set webhdfs.host to be the hostname of the node that hosts WebHDFS.
- Set webhdfs.port to be the port number over which WebHDFS communicates. The default value is 50070. For SSL, the default value is 50470.
- Set webhdfs.httpfs to false.
- For hdfs.namenodes, you must set the host and port values to point to the active namenode for WebHDFS.
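To see how these settings fit together, the sketch below builds a WebHDFS REST URL from the webhdfs properties. The URL layout (host, port, version path, HDFS path, then an op query parameter) follows the Apache WebHDFS REST API, and user.name is its simple-authentication parameter. The function name, the example path /user/trifacta, and the example host are hypothetical; whether the platform builds its requests this way is not stated here.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(webhdfs: dict, hdfs_path: str, op: str, user: str, ssl: bool = False) -> str:
    """Build a WebHDFS REST URL from the webhdfs settings."""
    host = webhdfs.get("host")
    if not host:
        raise ValueError("webhdfs.host is empty")
    if not hdfs_path.startswith("/"):
        raise ValueError("hdfs_path must be absolute")
    scheme = "https" if ssl else "http"
    # Percent-encode the path but keep the slashes that separate directories.
    path = quote(hdfs_path)
    query = urlencode({"op": op, "user.name": user})
    return f"{scheme}://{host}:{webhdfs['port']}{webhdfs['version']}{path}?{query}"


settings = {"version": "/webhdfs/v1", "host": "hdfs.example.com", "port": 50070, "httpfs": False}
print(webhdfs_url(settings, "/user/trifacta", "LISTSTATUS", "trifacta"))
```

Requesting a URL of this form with a tool such as curl is one way to check that the configured host and port reach WebHDFS before restarting the platform.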
HttpFS
You can configure the Designer Cloud Powered by Trifacta platform to use the HttpFS service to communicate with HDFS, in addition to WebHDFS.
NOTE: HttpFS serves as a proxy to WebHDFS. When HttpFS is enabled, both services are required.
In some cases, HttpFS is required:
- High availability requires HttpFS.
- Your secured HDFS user account has access restrictions.
If your environment meets any of the above requirements, you must enable HttpFS. For more information, see Enable HttpFS in the Configuration Guide.
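As a rough illustration of the switch, the sketch below flips webhdfs.httpfs to true and changes the port, as the parameter description requires. The port 14000 is the default HttpFS port in Apache Hadoop and is an assumption for your environment; the function name is hypothetical, and any further settings described in Enable HttpFS are not covered.

```python
# Apache Hadoop's default HttpFS port; confirm the value for your distribution.
HTTPFS_DEFAULT_PORT = 14000


def enable_httpfs(webhdfs: dict, port: int = HTTPFS_DEFAULT_PORT) -> dict:
    """Return a copy of the webhdfs settings with HttpFS enabled on a new port."""
    if port == webhdfs.get("port"):
        # The port number must be changed when switching to HttpFS.
        raise ValueError("HttpFS port must differ from the current WebHDFS port")
    updated = dict(webhdfs)  # leave the caller's settings untouched
    updated["httpfs"] = True
    updated["port"] = port
    return updated


before = {"version": "/webhdfs/v1", "host": "", "port": 50070, "httpfs": False}
print(enable_httpfs(before))
```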