Below are instructions on how to configure the platform to point to S3.
NOTE: Please review the limitations before enabling. See Limitations of S3 Integration below.
The platform supports S3 connectivity for the following distributions:
NOTE: You must use a Hadoop version that is supported for your release of the product.
Hortonworks 2.5 and later. See Supported Deployment Scenarios for Hortonworks.
NOTE: Spark 2.3.0 jobs may fail on S3-based datasets due to a known incompatibility. For details, see https://github.com/apache/incubator-druid/issues/4456. If you encounter this issue, please set
All access to S3 sources occurs through a single AWS account (system mode) or through an individual user's account (user mode). For either mode, the AWS access key and secret combination must provide access to the default bucket associated with the account.
NOTE: These permissions should be set up by your AWS administrator.
NOTE: To enable viewing and browsing of all folders within a bucket, the GetObject, ListBucket, and GetBucketLocation permissions shown in the policy below are required.
The policy statement to enable read-only access to your default S3 bucket should look similar to the following. Replace `3c-my-s3-bucket` with the name of your bucket:
{ "Version": "2012-10-17", "Statement": [ { "Sid": "VisualEditor0", "Effect": "Allow", "Action": [ "s3:GetObject", "s3:ListBucket", "s3:GetBucketLocation" ], "Resource": [ "arn:aws:s3:::3c-my-s3-bucket", "arn:aws:s3:::3c-my-s3-bucket/*", ] } ] } |
Write access is enabled by adding the `PutObject` and `DeleteObject` actions to the above. Replace `3c-my-s3-bucket` with the name of your bucket:
{ "Version": "2012-10-17", "Statement": [ { "Sid": "VisualEditor0", "Effect": "Allow", "Action": [ "s3:GetObject", "s3:ListBucket", "s3:GetBucketLocation", "s3:PutObject", "s3:DeleteObject" ], "Resource": [ "arn:aws:s3:::3c-my-s3-bucket", "arn:aws:s3:::3c-my-s3-bucket/*", ] } ] } |
NOTE: This feature must be enabled.
To access S3 assets that are managed by the platform, you must apply the following policy definition to any IAM role that is used to access the platform. This bucket contains demo assets for the On-Boarding tour:
{ "Version": "2012-10-17", "Statement": [ { "Sid": "VisualEditor1", "Effect": "Allow", "Action": [ "s3:GetObject", "s3:ListBucket" ], "Resource": [ "arn:aws:s3:::trifacta-public-datasets/*", "arn:aws:s3:::trifacta-public-datasets" ] } ] } |
For more information on creating policies, see https://console.aws.amazon.com/iam/home#/policies.
If any accessible bucket is encrypted with SSE-KMS, another policy must be deployed. For more information, see https://docs.aws.amazon.com/kms/latest/developerguide/iam-policies.html.
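As a rough sketch only, an additional policy statement for an SSE-KMS key typically grants the standard KMS actions on the key; the key ARN below is a placeholder, and the exact set of actions should be confirmed with your AWS administrator:

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowUseOfKmsKeyForS3",
      "Effect": "Allow",
      "Action": [
        "kms:Encrypt",
        "kms:Decrypt",
        "kms:ReEncrypt*",
        "kms:GenerateDataKey*",
        "kms:DescribeKey"
      ],
      "Resource": "arn:aws:kms:<regionId>:<acctId>:key/<keyId>"
    }
  ]
}
```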
Depending on your S3 environment, you can define:
S3 as the base storage layer: the base storage layer is the default platform location for storing results. Required for: using S3 as the base storage layer for your deployment.

NOTE: If S3 is the base storage layer, you must also define the default storage bucket to use during initial installation, which cannot be changed at a later time. See Define default S3 write bucket below.
For more information on the various options for storage, see Storage Deployment Options.
For more information on setting the base storage layer, see Set Base Storage Layer.
When read access is enabled, the platform can explore S3 buckets for creating datasets.
NOTE: Data that is mirrored from one S3 bucket to another might inherit the permissions from the bucket where it is owned.
Steps:
Set the following property to `true`:

```
"aws.s3.enabled": true,
```
This setting (`enabled=true`) allows the platform to connect to S3.
Specify the AWS `key` and `secret` values for the user to access S3 storage.
The platform supports the following modes for accessing S3. You must choose one access mode and then complete the related configuration steps.
NOTE: Avoid switching between user mode and system mode, which can disable user access to data. At install time, you should choose your preferred mode.
(default) Access to S3 buckets is enabled and defined for all users of the platform. All users use the same AWS access key, secret, and default bucket.
For read-only access, the key, secret, and default bucket must be specified in configuration.
NOTE: Please verify that the AWS account has all required permissions to access the S3 buckets in use. The account must have the ListAllMyBuckets ACL among its permissions.
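As an illustration only, the ListAllMyBuckets permission can be granted with a policy statement similar to the following (the statement ID is arbitrary):

```
{
  "Sid": "AllowListAllMyBuckets",
  "Effect": "Allow",
  "Action": "s3:ListAllMyBuckets",
  "Resource": "arn:aws:s3:::*"
}
```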
Steps:
Locate the following parameters:
| Parameters | Description |
|---|---|
| `aws.s3.key` | Set this value to the AWS key to use to access S3. |
| `aws.s3.secret` | Set this value to the secret corresponding to the AWS key provided. |
| `aws.s3.bucket.name` | Set this value to the name of the S3 bucket from which users may read data. |
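For example, a minimal sketch of these settings with placeholder values might look like the following:

```
"aws.s3.key": "<AWS_KEY>",
"aws.s3.secret": "<AWS_SECRET>",
"aws.s3.bucket.name": "<DEFAULT_S3_BUCKET>"
```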
Optionally, access to S3 can be defined on a per-user basis. This mode allows administrators to define access to specific buckets using various key/secret combinations as a means of controlling permissions.
NOTE: When this mode is enabled, individual users must have AWS configuration settings applied to their account, either by an administrator or by themselves. The global settings in this section do not apply in this mode.
To enable:
Please verify that the settings below have been configured:
"aws.s3.enabled": true, "aws.mode": "user", |
If you have enabled user mode for S3 access, you must create and deploy an encryption key file. For more information, see Create Encryption Key File.
The following sections apply only to system access mode.
When S3 is defined as the base storage layer, write access to S3 is enabled. The platform attempts to store outputs in the designated default S3 bucket.
NOTE: This bucket must be set during initial installation. Modifying it at a later time is not recommended and can result in inaccessible data in the platform.

NOTE: Bucket names cannot have underscores in them. See http://docs.aws.amazon.com/AmazonS3/latest/dev/BucketRestrictions.html.
Steps:
Specify a value for `aws.s3.bucket.name`, which defines the S3 bucket where data is written. Do not include a protocol identifier. For example, if your bucket address is `s3://MyOutputBucket`, the value to specify is the following:

```
MyOutputBucket
```
NOTE: Specify the top-level bucket name only. There should not be any backslashes in your entry.

NOTE: This bucket also appears as a read-access bucket if the specified S3 user has access.
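As a brief sketch, assuming the example bucket name above, the setting in platform configuration would look like the following:

```
"aws.s3.bucket.name": "MyOutputBucket",
```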
When read access is enabled, all S3 buckets of which the specified user is the owner appear in the S3 file browser. You can also add additional S3 buckets from which to read.
NOTE: Additional buckets are accessible only if the specified S3 user has read privileges.

NOTE: Bucket names cannot have underscores in them.
Steps:
Locate the following parameter: `aws.s3.extraBuckets`.
In the Admin Settings page, specify the extra buckets as a comma-separated string of additional S3 buckets that are available for storage. Do not put any quotes around the string. Whitespace between string values is ignored.
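For example, using hypothetical bucket names, the Admin Settings entry would be:

```
MyExtraBucket01,MyExtraBucket02,MyExtraBucket03
```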
In platform configuration, specify the `extraBuckets` array as a comma-separated list of buckets, as in the following:

```
"extraBuckets": ["MyExtraBucket01","MyExtraBucket02","MyExtraBucket03"]
```

NOTE: Specify the top-level bucket name only. There should not be any backslashes in your entry.
These values are mapped to the following bucket addresses:
```
s3://MyExtraBucket01
s3://MyExtraBucket02
s3://MyExtraBucket03
```
"aws.s3.enabled": true, "aws.s3.bucket.name": "<BUCKET_FOR_OUTPUT_IF_WRITING_TO_S3>" "aws.s3.key": "<AWS_KEY>", "aws.s3.secret": "<AWS_SECRET>", "aws.s3.extraBuckets": ["<ADDITIONAL_BUCKETS_TO_SHOW_IN_FILE_BROWSER>"] |
| Setting | Description |
|---|---|
| `enabled` | When set to `true`, the S3 file browser is displayed in the GUI for locating files. For more information, see S3 Browser. |
| `bucket.name` | Set this value to the name of the S3 bucket to which you are writing. |
| `key` | Access Key ID for the AWS account to use. |
| `secret` | Secret Access Key for the AWS account. |
| `extraBuckets` | Add references to any additional S3 buckets to this comma-separated array of values. The S3 user must have read access to these buckets. |
You can configure the platform to publish data on S3 when a server-side encryption policy is enabled. SSE-S3 and SSE-KMS methods are supported. For more information, see http://docs.aws.amazon.com/AmazonS3/latest/dev/serv-side-encryption.html.
Notes:
To enable, please specify the following parameters.
"aws.s3.serverSideEncryption": "none", |
Set this value to the method of encryption used by the S3 server. Supported values:

NOTE: Lowercase values are required.

- `sse-s3`
- `sse-kms`
- `none`
When KMS encryption is enabled, you must specify the AWS KMS key ID to use for the server-side encryption.
"aws.s3.serverSideKmsKeyId": "", |
Notes:
- The AWS IAM role must be assigned to this key.
- The AWS IAM role must be granted the permissions required to use this key.
The format for referencing this key is the following:

```
"arn:aws:kms:<regionId>:<acctId>:key/<keyId>"
```
You can use an AWS alias in the following formats. The format of the AWS-managed alias is the following:

```
"alias/aws/s3"
```
The format for a custom alias is the following:

```
"alias/<FSR>"
```

where `<FSR>` is the name of the alias for the entire key.
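As an illustration, a configuration that enables SSE-KMS with a specific key might look like the following; the region, account, and key ID values are placeholders:

```
"aws.s3.serverSideEncryption": "sse-kms",
"aws.s3.serverSideKmsKeyId": "arn:aws:kms:<regionId>:<acctId>:key/<keyId>",
```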
Save your changes and restart the platform.
The following configuration can be applied through the Hadoop `site-config.xml` file. If your installation does not have a copy of this file, you can insert the properties listed in the steps below into platform configuration to configure the behavior of the S3 filewriter.
Steps:
Locate the `filewriter.hadoopConfig` block, where you can insert the following Hadoop configuration properties:

```
"filewriter": {
  "max": 16,
  "hadoopConfig": {
    "fs.s3a.buffer.dir": "/tmp",
    "fs.s3a.fast.upload": "true"
  },
  ...
}
```
| Property | Description |
|---|---|
| `fs.s3a.buffer.dir` | Specifies the temporary directory on the node to use for buffering when uploading to S3. |
| `fs.s3a.fast.upload` | Set to `true` to enable buffering when writing to S3. When set to `false`, buffering is disabled. |
Save your changes and restart the platform.
For more information, see Create Redshift Connections.
NOTE: If you are using Spark profiling through Hortonworks HDP on data stored in S3, additional configuration is required. See Configure for Hortonworks.
The following parameters can be configured through platform configuration to affect the integration with S3. You may or may not need to modify these values for your deployment.
| Parameter | Description |
|---|---|
| `aws.s3.consistencyTimeout` | S3 does not guarantee that at any time the files that have been written to a directory are consistent with the files available for reading. S3 does guarantee that eventually the files are in sync. This guarantee is important for some platform jobs that write data to S3 and then immediately attempt to read from the written data. This timeout defines how long the platform waits for this guarantee. If the timeout is exceeded, the job fails. Depending on your environment, you may need to modify the default value. |
| `aws.s3.endpoint` | This value should be the S3 endpoint DNS name value. Depending on your S3 deployment, you can specify this setting to point to the S3 endpoint for Java/Spark services. For more information on regional endpoint values, see https://docs.aws.amazon.com/general/latest/gr/rande.html#s3_region. |
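For example, a hypothetical configuration that points to a specific regional endpoint and adjusts the consistency timeout might look like the following; both values are placeholders to adapt for your region and environment:

```
"aws.s3.endpoint": "s3.us-west-2.amazonaws.com",
"aws.s3.consistencyTimeout": <TIMEOUT>,
```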
Restart services. See Start and Stop the Platform.
Try running a simple job from the application. For more information, see Verify Operations.
If you are executing visual profiles of datasets sourced from S3, you may see an error similar to the following in the `batch-job-runner.log` file:

```
01:19:52.297 [Job 3] ERROR com.trifacta.hadoopdata.joblaunch.server.BatchFileWriterWorker - BatchFileWriterException: Batch File Writer unknown error: {jobId=3, why=bound must be positive}
01:19:52.298 [Job 3] INFO com.trifacta.hadoopdata.joblaunch.server.BatchFileWriterWorker - Notifying monitor for job 3 with status code FAILURE
```
This issue is caused by improperly configured buffering when writing job results to S3. The specified local buffer cannot be accessed as part of the batch job running process, and the job fails to write results to S3.
Solution:
You may do one of the following, as shown in the steps below: specify an accessible local buffer directory (`fs.s3a.buffer.dir`), or disable buffering (`fs.s3a.fast.upload` set to `false`).
Steps:
Locate the `filewriter.hadoopConfig` block, where you can insert either of the following Hadoop configuration properties:

```
"filewriter": {
  "max": 16,
  "hadoopConfig": {
    "fs.s3a.buffer.dir": "/tmp",
    "fs.s3a.fast.upload": false
  },
  ...
}
```
| Property | Description |
|---|---|
| `fs.s3a.buffer.dir` | Specifies the temporary directory on the node to use for buffering when uploading to S3. |
| `fs.s3a.fast.upload` | When set to `false`, buffering is disabled. |
Save your changes and restart the platform.
During execution of a Spark job, you may encounter the following error:
```
org.apache.hadoop.util.DiskChecker$DiskErrorException: No space available in any of the local directories.
```
Solution:

Verify that sufficient space is available in the local directories used for buffering uploads to S3 (`fs.s3a.buffer.dir`).
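If freeing space on the default volume is not possible, one option is to point the buffer directory at a volume with more capacity; the path below is only an example:

```
"filewriter": {
  ...
  "hadoopConfig": {
    "fs.s3a.buffer.dir": "/data/tmp",
    "fs.s3a.fast.upload": "true"
  }
}
```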