Below are instructions on how to configure the Trifacta platform to point to S3.
- Simple Storage Service (S3) is an online data storage service provided by Amazon, which provides low-latency access through web services. For more information, see https://aws.amazon.com/s3/.
NOTE: Please review the limitations before enabling. See Limitations of S3 Integration below.
Base Storage Layer
- If base storage layer is S3: you can enable read/write access to S3.
- If base storage layer is not S3: you can enable read only access to S3.
Limitations of S3 Integration
- The Trifacta platform supports S3 connectivity for the following distributions:
- The Trifacta platform only supports running S3-enabled instances over AWS.
Write access requires using S3 as the base storage layer. See Set Base Storage Layer.
NOTE: If S3 is set as the base storage layer, you cannot publish to Hive.
- On the Trifacta node, you must install the Oracle Java Runtime Environment for Java 1.8. Other versions of the JRE are not supported. For more information on the JRE, see http://www.oracle.com/technetwork/java/javase/downloads/index.html.
Required AWS Account Permissions
All access to S3 sources occurs through a single AWS account (system mode) or through an individual user's account (user mode). For either mode, the AWS access key and secret combination must provide read and write access to the default bucket associated with the account.
NOTE: To enable viewing and browsing of all folders within a bucket, the following permissions are required:
- The system account or individual user accounts must have the ListAllMyBuckets access permission for the bucket.
- All objects to be browsed within the bucket must have Get access enabled.
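As a rough illustration of the permissions described above, the following Python sketch builds an IAM policy document. The bucket name my-example-bucket is a placeholder. The s3:ListBucket and s3:PutObject actions are assumptions added here to cover folder listing and the write access to the default bucket; they are not named in this page, so adjust them to your own security requirements.

```python
import json


def build_s3_policy(bucket: str) -> dict:
    """Build an IAM policy document for browsing, reading, and writing one bucket.

    The bucket name and the ListBucket/PutObject actions are illustrative
    assumptions, not settings taken from the platform documentation.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Lets the account list the buckets it can see.
                "Effect": "Allow",
                "Action": ["s3:ListAllMyBuckets"],
                "Resource": "arn:aws:s3:::*",
            },
            {
                # Assumption: needed to list folders inside the bucket.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": f"arn:aws:s3:::{bucket}",
            },
            {
                # Get access on objects; PutObject is assumed for write access.
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": f"arn:aws:s3:::{bucket}/*",
            },
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_s3_policy("my-example-bucket"), indent=2))
```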
Depending on your S3 environment, you can define:
- S3 as base storage layer
- S3 access protocol method
- read access to S3
- S3 bucket that is the default write destination
- access to additional S3 buckets
Define base storage layer
The base storage layer is the default platform for storing results. To enable write access to S3, you must define it as the base storage layer for your Trifacta deployment.
The base storage layer for your Trifacta instance is defined during initial installation and cannot be changed afterward.
If S3 is the base storage layer, you must also define the default storage bucket to use during initial installation, which cannot be changed at a later time. See Define default S3 write bucket below.
For more information on the various options for storage, see Storage Deployment Options.
For more information on setting the base storage layer, see Set Base Storage Layer.
S3 access protocol method
The Trifacta platform requires use of the S3a protocol to connect to S3.
NOTE: Use of s3n is not supported.
Access S3 buckets from a region using V4 signing protocol
This configuration applies to the following conditions:
- If you are planning to access S3 buckets that are located in a region that uses the V4 signing method, additional configuration is required.
NOTE: If V4 signing is used for the region, the Trifacta platform can be configured to work with buckets in this region only.
NOTE: These changes should be applied in the XML file local to the Trifacta Server. You do not need to apply these changes in the cluster.
On the Trifacta Server, edit the local version of core-site.xml. This file is typically located in the following directory:
Locate the following configuration. Apply the location of the S3 bucket, including the geographic region:
- Save your changes and restart the platform.
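The configuration snippet for this step is not shown on this page. As a hedged sketch only, the Python below generates a Hadoop-style property element for a regional S3 endpoint. It assumes the setting in question is the standard Hadoop S3A property fs.s3a.endpoint and that the endpoint follows the s3.<region>.amazonaws.com pattern; the region eu-central-1 is an example value, not a recommendation.

```python
import xml.etree.ElementTree as ET


def s3a_endpoint_property(region: str) -> str:
    """Return a core-site.xml <property> element for a regional S3 endpoint.

    Assumption: fs.s3a.endpoint is the property to set; verify this against
    the configuration present in your own core-site.xml.
    """
    endpoint = f"s3.{region}.amazonaws.com"
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = "fs.s3a.endpoint"
    ET.SubElement(prop, "value").text = endpoint
    return ET.tostring(prop, encoding="unicode")


if __name__ == "__main__":
    # Prints the element on a single line; paste it inside <configuration>.
    print(s3a_endpoint_property("eu-central-1"))
```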
Enable read access to S3
When read access is enabled, Trifacta users can explore S3 buckets for creating datasets. See S3 Browser.
NOTE: When read access is enabled, Trifacta users have automatic access to all buckets to which the specified S3 user has access. You may want to create a specific user account for S3 access.
NOTE: Data that is mirrored from one S3 bucket to another might inherit the permissions from the bucket where it is owned.
- In the S3 configuration section, set enabled=true, which allows Trifacta users to browse S3 buckets through the Trifacta application.
Specify the AWS key and secret values for the user to access S3 storage.
NOTE: Please verify that the AWS account has all required permissions to access the S3 buckets in use. The account must have the ListAllMyBuckets ACL among its permissions.
S3 access modes
The Trifacta platform supports the following modes for accessing S3. You must choose one access mode and then complete the related configuration steps.
System mode (default): Access to S3 buckets is enabled and defined for all users of the platform. All users use the same AWS access key, secret, and default bucket.
System mode is enabled by default. To complete configuration, apply the settings to platform configuration that are listed in the rest of this section.
User mode: Optionally, access to S3 can be defined on a per-user basis. This mode allows administrators to define access to specific buckets using various key/secret combinations as a means of controlling permissions.
NOTE: When this mode is enabled, individual users must have AWS configuration settings applied to their account, either by an administrator or by themselves. The global settings in this section do not apply in this mode.
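The difference between the two access modes can be sketched in a few lines of Python. This is a conceptual model only: the S3Credentials class, the resolve_credentials function, and the mode strings "system" and "user" are hypothetical names chosen for illustration and do not correspond to actual platform settings.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class S3Credentials:
    """Hypothetical container for one key/secret/default-bucket combination."""
    key: str
    secret: str
    bucket: str


def resolve_credentials(
    mode: str,
    system_creds: Optional[S3Credentials],
    user_creds: Optional[S3Credentials] = None,
) -> S3Credentials:
    """Pick the credentials that apply under the given access mode."""
    if mode == "system":
        # Every user shares the same global key, secret, and default bucket.
        if system_creds is None:
            raise ValueError("system mode requires global AWS settings")
        return system_creds
    if mode == "user":
        # Global settings are ignored; each user needs their own settings.
        if user_creds is None:
            raise ValueError("user mode requires per-user AWS settings")
        return user_creds
    raise ValueError(f"unknown S3 access mode: {mode!r}")
```

In user mode the function never falls back to the global settings, which mirrors the note above that the global settings do not apply in that mode.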
Please verify that the settings below have been configured:
- Per-user configuration:
User mode - Create encryption key file
If you have enabled user mode for S3 access, you must create and deploy an encryption key file. For more information, see Create Encryption Key File.
NOTE: If you have enabled user access mode, you can skip the following sections, which pertain to the system access mode, and jump to the Create Redshift Connection section below.
System access mode - additional configuration
The following sections apply only to system access mode.
Define default S3 write bucket
When S3 is defined as the base storage layer, write access to S3 is enabled. The Trifacta platform attempts to store outputs in the designated default S3 bucket.
NOTE: This bucket must be set during initial installation. Modifying it at a later time is not recommended and can result in inaccessible data in the platform.
NOTE: Bucket names cannot have underscores in them. See http://docs.aws.amazon.com/AmazonS3/latest/dev/BucketRestrictions.html.
- Define S3 to be the base storage layer. See Set Base Storage Layer.
- Enable read access. See Enable read access.
Specify a value for aws.s3.bucket.name, which defines the S3 bucket where data is written. Do not include a protocol identifier. For example, if your bucket address is s3://MyOutputBucket, the value to specify is the following:
NOTE: This bucket also appears as a read-access bucket if the specified S3 user has access.
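The two naming rules in this section (no protocol identifier, no underscores) can be checked before you edit the configuration. The helper below is a minimal sketch; the function name bucket_setting_value is hypothetical and is not part of the platform.

```python
def bucket_setting_value(address: str) -> str:
    """Convert a bucket address into the bare name expected by the setting.

    Strips a leading protocol identifier such as "s3://" and rejects names
    that break the rules described in this section.
    """
    # Drop everything up to and including "://", if present.
    name = address.split("://", 1)[-1].rstrip("/")
    if not name:
        raise ValueError("bucket name is empty")
    if "/" in name:
        raise ValueError("expected a bucket name, not a path inside a bucket")
    if "_" in name:
        raise ValueError("bucket names cannot contain underscores")
    return name


if __name__ == "__main__":
    print(bucket_setting_value("s3://MyOutputBucket"))  # prints MyOutputBucket
```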
Adding additional S3 buckets
When read access is enabled, all S3 buckets of which the specified user is the owner appear in the Trifacta application. You can also add additional S3 buckets from which to read.
NOTE: Additional buckets are accessible only if the specified S3 user has read privileges.
NOTE: Bucket names cannot have underscores in them.
Specify the extraBuckets array as a comma-separated list of buckets, as in the following:
These values are mapped to the following bucket addresses:
Set this value to the name of the S3 bucket to which you are writing.
Access Key ID for the AWS account to use.
NOTE: This value cannot contain a slash (/).
Secret Access Key for the AWS account.
Add references to any additional S3 buckets to this comma-separated array of values.
The S3 user must have read access to these buckets.
Enable use of server-side encryption
You can configure the Trifacta platform to publish data on S3 when a server-side encryption policy is enabled. SSE-S3 and SSE-KMS methods are supported. For more information, see http://docs.aws.amazon.com/AmazonS3/latest/dev/serv-side-encryption.html.
- When encryption is enabled, all buckets to which you are writing must share the same encryption policy. Read operations are unaffected.
- This feature is supported for the following Hadoop distributions:
- SSE-S3: CDH 5.10 or later, HDP 2.6 or later
- SSE-KMS: CDH 5.11 or later, HDP 2.6.1 or later
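The minimum versions listed above can be encoded as a small lookup for pre-installation checks. The Python below is a sketch under that reading of the list; the function and table names are hypothetical, and the labels "SSE-S3", "SSE-KMS", "CDH", and "HDP" are used only as dictionary keys here, not as configuration values.

```python
# Minimum distribution versions per encryption method, from the list above.
MIN_VERSIONS = {
    "SSE-S3": {"CDH": (5, 10), "HDP": (2, 6)},
    "SSE-KMS": {"CDH": (5, 11), "HDP": (2, 6, 1)},
}


def _pad(version: tuple, width: int) -> tuple:
    """Right-pad a version tuple with zeros so 2.6 compares as 2.6.0."""
    return version + (0,) * (width - len(version))


def supports_sse(method: str, distribution: str, version: str) -> bool:
    """Return True if the distribution version meets the listed minimum."""
    minimum = MIN_VERSIONS[method][distribution]
    actual = tuple(int(part) for part in version.split("."))
    width = max(len(actual), len(minimum))
    return _pad(actual, width) >= _pad(minimum, width)


if __name__ == "__main__":
    print(supports_sse("SSE-KMS", "HDP", "2.6"))    # False: needs 2.6.1 or later
    print(supports_sse("SSE-S3", "CDH", "5.10"))    # True
```

Versions are compared as integer tuples rather than strings, so that 5.10 is correctly treated as later than 5.9.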
To enable, please specify the following parameters.
Server-side encryption method
Set this value to the method of encryption used by the S3 server. Supported values:
NOTE: Lower case values are required.
Server-side KMS key identifier
When KMS encryption is enabled, you must specify the AWS KMS key ID to use for the server-side encryption.
- The authenticating user must have access to this key, or the AWS IAM role must be assigned to this key.
- The authenticating user or the AWS IAM role must be given Encrypt/Decrypt permissions for the specified KMS key ID. For more information, see https://docs.aws.amazon.com/kms/latest/developerguide/key-policy-modifying.html.
The format for referencing this key is the following:
You can use an AWS alias in the following formats. The format of the AWS-managed alias is the following:
The format for a custom alias is the following:
<FSR> is the name of the alias for the entire key.
Save your changes and restart the platform.
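The exact reference formats are not reproduced on this page, so the following Python sketch relies on the general AWS conventions instead: a key ARN of the form arn:aws:kms:<region>:<account>:key/<id>, and alias names that begin with alias/, where the aws/ prefix is reserved for AWS-managed aliases. Treat the patterns as assumptions and confirm the format your deployment expects. The function name classify_kms_reference is hypothetical.

```python
import re

# Assumed shapes, based on general AWS KMS naming conventions.
KEY_ARN = re.compile(r"^arn:aws:kms:[a-z0-9-]+:\d{12}:key/[0-9A-Za-z-]+$")
ALIAS = re.compile(r"^(?:arn:aws:kms:[a-z0-9-]+:\d{12}:)?alias/([A-Za-z0-9/_-]+)$")


def classify_kms_reference(ref: str) -> str:
    """Classify a KMS reference as a key, an AWS-managed alias, or a custom alias."""
    if KEY_ARN.match(ref):
        return "key"
    match = ALIAS.match(ref)
    if match:
        # Alias names under aws/ are reserved for AWS-managed keys.
        if match.group(1).startswith("aws/"):
            return "aws-managed alias"
        return "custom alias"
    return "unrecognized"


if __name__ == "__main__":
    for ref in ("alias/aws/s3", "alias/my-data-key", "not-a-key"):
        print(ref, "->", classify_kms_reference(ref))
```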
Create Redshift Connection
For more information, see Create Redshift Connections.
Hadoop distribution-specific configuration
NOTE: If you are using Spark profiling through Hortonworks HDP on data stored in S3, additional configuration is required. See Configure for Hortonworks.
Additional Configuration for S3
The following parameters can be configured through the Trifacta platform to affect the integration with S3. You may or may not need to modify these values for your deployment.
S3 does not guarantee that at any time the files that have been written to a directory will be consistent with the files available for reading. S3 does guarantee that eventually the files are in sync.
This guarantee is important for some platform jobs that write data to S3 and then immediately attempt to read from the written data.
This timeout defines how long the platform waits for this guarantee. If the timeout is exceeded, the job fails. The default value is
Depending on your environment, you may need to modify this value.
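To make the role of this timeout concrete, the Python below sketches a generic wait-until-visible loop. It is not the platform's implementation: the function name, the polling interval, and the is_visible callback are all hypothetical, and the callback stands in for whatever check confirms that written files can be read back.

```python
import time
from typing import Callable


def wait_until_visible(
    is_visible: Callable[[], bool],
    timeout_s: float,
    poll_s: float = 0.5,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until is_visible() returns True or the timeout elapses.

    Returns True if the data became visible in time, False otherwise.
    A caller would typically fail the job on a False result.
    """
    deadline = clock() + timeout_s
    while True:
        if is_visible():
            return True
        if clock() >= deadline:
            return False
        sleep(poll_s)


if __name__ == "__main__":
    attempts = {"count": 0}

    def visible_on_third_check() -> bool:
        # Simulates a listing that catches up after two stale reads.
        attempts["count"] += 1
        return attempts["count"] >= 3

    print(wait_until_visible(visible_on_third_check, timeout_s=5, poll_s=0.1))
```

A longer timeout tolerates slower synchronization at the cost of jobs taking longer to report a genuine failure.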
If your S3 deployment is either of the following:
Then, you can specify this setting to point to the S3 endpoint for Java/Spark services.
For more information on this location, see https://docs.aws.amazon.com/general/latest/gr/rande.html#s3_region.
Restart services. See Start and Stop the Platform.
Try running a simple job from the Trifacta application. For more information, see Verify Operations.
Profiling consistently fails for S3 sources of data
If you are executing visual profiles of datasets sourced from S3, you may see an error similar to the following in the
This issue is caused by improperly configured buffering when jobs write to S3. The specified local buffer cannot be accessed as part of the batch job running process, and the job fails to write results to S3.
You may do one of the following:
- Use a valid temp directory when buffering to S3.
- Disable buffering to directory completely.
Locate the following, where you can insert either of the following Hadoop configuration properties:
Specifies the temporary directory on the Trifacta node to use for buffering when uploading to S3. If fs.s3a.fast.upload is set to false, this parameter is unused.
When set to false, buffering is disabled.
Save your changes and restart the platform.
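Before choosing between the two options, it can help to confirm that the candidate temp directory is actually writable by the account that runs the jobs. The Python below is a sketch of that check. The property name fs.s3a.buffer.dir is an assumption (it is the standard Hadoop S3A name for the buffer directory and is not shown on this page), as is the reading that fs.s3a.fast.upload is the property that disables buffering when set to false. The function names are hypothetical.

```python
import os
import tempfile


def usable_buffer_dir(path: str) -> bool:
    """Return True if path is an existing directory that files can be created in."""
    if not os.path.isdir(path):
        return False
    try:
        # Creating and discarding a temp file proves write access.
        with tempfile.TemporaryFile(dir=path):
            pass
    except OSError:
        return False
    return True


def buffering_properties(buffer_dir: str) -> dict:
    """Suggest which Hadoop property to set, given a candidate buffer directory."""
    if usable_buffer_dir(buffer_dir):
        # Assumed property name for the S3A buffer directory.
        return {"fs.s3a.buffer.dir": buffer_dir}
    # Fall back to disabling buffering entirely.
    return {"fs.s3a.fast.upload": "false"}


if __name__ == "__main__":
    print(buffering_properties(tempfile.gettempdir()))
```

Run the check as the same operating system user that executes batch jobs, since directory permissions can differ between accounts.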
Spark local directory has no space
During execution of a Spark job, you may encounter the following error:
- Restart Trifacta services, which may free up some temporary space.
- Use the steps in the preceding solution to reassign a temporary directory for Spark to use (
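To see whether a directory has enough room before or after applying either remedy, a quick free-space check can be run on the Trifacta node. The sketch below uses only the Python standard library; the function name, the example path, and the 1 GiB threshold are arbitrary illustrations, not platform requirements.

```python
import shutil


def free_space_ok(path: str, min_free_bytes: int) -> bool:
    """Return True if the filesystem holding path has at least min_free_bytes free."""
    usage = shutil.disk_usage(path)
    return usage.free >= min_free_bytes


if __name__ == "__main__":
    one_gib = 1024 ** 3
    # "." is a stand-in; point this at the directory Spark uses locally.
    print(free_space_ok(".", one_gib))
```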