
D toc

Below are instructions on how to configure

D s product
rtrue
 to point to S3.

  • Simple Storage Service (S3) is an online data storage service from Amazon that provides low-latency access through web services. For more information, see https://aws.amazon.com/s3/.

 

Info

NOTE: Please review the limitations before enabling this integration. See Limitations below.

Base Storage Layer

  • If base storage layer is S3: you can enable read/write access to S3.
  • If base storage layer is not S3: you can enable read-only access to S3.

Limitations

  • The
    D s platform
     only supports running S3-enabled instances over AWS.
  • Access to AWS S3 regional endpoints over the internet is required. If the machine hosting the
    D s platform
     is in a VPC with no internet access, a VPC endpoint enabled for S3 services is required. The 
    D s platform
     does not support access to S3 through a proxy server.


Info

NOTE: Spark 2.3.0 jobs may fail on S3-based datasets due to a known incompatibility. For details, see https://github.com/apache/incubator-druid/issues/4456.

If you encounter this issue, please set spark.version to 2.1.0 in platform configuration. For more information, see Admin Settings Page.

Prerequisites

  • If IAM instance role is used for S3 access, it must have access to resources at the bucket level.

Required AWS Account Permissions

All access to S3 sources occurs through a single AWS account (system mode) or through an individual user's account (user mode). For either mode, the AWS access key and secret combination must provide access to the default bucket associated with the account. 

Info

NOTE: These permissions should be set up by your AWS administrator.

Read-only access policies

Info

NOTE: To enable viewing and browsing of all folders within a bucket, the following permissions are required:

  • The system account or individual user accounts must have the ListAllMyBuckets access permission for the bucket.
  • All objects to be browsed within the bucket must have Get access enabled.

The policy statement to enable read-only access to your default S3 bucket should look similar to the following. Replace 3c-my-s3-bucket with the name of your bucket:

Code Block
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:ListBucket",
                "s3:GetBucketLocation"
            ],
            "Resource": [
                "arn:aws:s3:::3c-my-s3-bucket",
                "arn:aws:s3:::3c-my-s3-bucket/*",
            ]
        }
    ]
}


Write access policies

Write access is enabled by adding the PutObject and DeleteObject actions to the above policy. Replace 3c-my-s3-bucket with the name of your bucket:

Code Block
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:ListBucket",
                "s3:GetBucketLocation",
                "s3:PutObject",
                "s3:DeleteObject"
            ],
            "Resource": [
                "arn:aws:s3:::3c-my-s3-bucket",
                "arn:aws:s3:::3c-my-s3-bucket/*",
            ]
        }
    ]
}

Other AWS policies for S3

Policy for access to 
D s item
itempublic buckets

Info

NOTE: This feature must be enabled. For more information, see Enable Onboarding Tour.

To access S3 assets that are managed by 

D s company
, you must apply the following policy definition to any IAM role that is used to access 
D s product
productssp
. This bucket contains demo assets for the Onboarding Tour:

Code Block
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor1",
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::trifacta-public-datasets/*",
                "arn:aws:s3:::trifacta-public-datasets"
            ]
        }
    ]
}

For more information on creating policies, see https://console.aws.amazon.com/iam/home#/policies.

KMS policy

If any accessible bucket is encrypted with KMS-SSE, another policy must be deployed. For more information, see https://docs.aws.amazon.com/kms/latest/developerguide/iam-policies.html.
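
As a rough illustration, a KMS policy attached to the role or account might resemble the following. This is a minimal sketch only; the key ARN is a placeholder, and the exact set of actions required depends on your environment, so consult the AWS documentation linked above:

Code Block
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowUseOfKmsKey",
            "Effect": "Allow",
            "Action": [
                "kms:Decrypt",
                "kms:GenerateDataKey"
            ],
            "Resource": "arn:aws:kms:<regionId>:<acctId>:key/<keyId>"
        }
    ]
}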

Configuration

Depending on your S3 environment, you can define:

  • Read access to S3
  • Access to additional S3 buckets
  • S3 as the base storage layer
  • Write access to S3
    • S3 bucket that is the default write destination

Define base storage layer

The base storage layer is the default location for storing results.

Required for:

  • Write access to S3
  • Connectivity to Redshift
Warning

The base storage layer for your

D s item
iteminstance
is defined during initial installation and cannot be changed afterward.

If S3 is the base storage layer, you must also define the default storage bucket to use during initial installation, which cannot be changed at a later time. See Define default S3 write bucket below.

For more information on the various options for storage, see Storage Deployment Options.

For more information on setting the base storage layer, see Set Base Storage Layer.

Enable job output manifest

When the base storage layer is set to S3, you must enable the platform to generate job output manifest files. During job execution, the platform can create a manifest file listing all files generated by the job. When the job results are published, this manifest file ensures proper publication.

Info

NOTE: This feature must be enabled when using S3 as the base storage layer.

Steps:

  1. D s config
  2. Locate the following parameter and set it to true:

    Code Block
    "feature.enableJobOutputManifest": true,
  3. Save your changes and restart the platform.

Enable read access to S3

When read access is enabled,

D s item
itemusers
 can explore S3 buckets for creating datasets. 

Info

NOTE: When read access is enabled,

D s item
itemusers
have automatic access to all buckets to which the specified S3 user has access. You may want to create a specific user account for S3 access.


Info

NOTE: Data that is mirrored from one S3 bucket to another might inherit the permissions from the bucket where it is owned.

Steps:

  1. D s config
  2. Set the following property to true:

    Code Block
    "aws.s3.enabled": true,
  3. Save your changes.
  4. In the S3 configuration section, set enabled=true, which allows
    D s item
    itemusers
     to browse S3 buckets through the
    D s webapp
    .
  5. Specify the AWS key and secret values for the user to access S3 storage.

S3 access modes

The 

D s platform
 supports the following modes for accessing S3. You must choose one access mode and then complete the related configuration steps.

Info

NOTE: Avoid switching between user mode and system mode, which can disable user access to data. Choose your preferred mode at install time.

 

System mode

(default) Access to S3 buckets is enabled and defined for all users of the platform. All users use the same AWS access key, secret, and default bucket.

System mode - read-only access

For read-only access, the key, secret, and default bucket must be specified in configuration.

Info

NOTE: Please verify that the AWS account has all required permissions to access the S3 buckets in use. The account must have the ListAllMyBuckets ACL among its permissions.

Steps:

  1. D s config

  2. Locate the following parameters:

    aws.s3.key: Set this value to the AWS key to use to access S3.
    aws.s3.secret: Set this value to the secret corresponding to the AWS key provided.
    aws.s3.bucket.name: Set this value to the name of the S3 bucket from which users may read data.

    Info

    NOTE: Additional buckets may be specified. See below.

  3. Save your changes.
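
For example, after these steps the relevant entries might look similar to the following (the key, secret, and bucket values are illustrative placeholders):

Code Block
"aws.s3.key": "<AWS_KEY>",
"aws.s3.secret": "<AWS_SECRET>",
"aws.s3.bucket.name": "3c-my-s3-bucket",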

User mode

Optionally, access to S3 can be defined on a per-user basis. This mode allows administrators to define access to specific buckets using various key/secret combinations as a means of controlling permissions.

Info

NOTE: When this mode is enabled, individual users must have AWS configuration settings applied to their account, either by an administrator or by themselves. The global settings in this section do not apply in this mode.

To enable:

  1. D s config

  2. Please verify that the settings below have been configured:

    Code Block
    "aws.s3.enabled": true,
    "aws.mode": "user",
  3. Additional configuration is required for per-user authentication. For more information, see Configure AWS Per-User Authentication.
User mode - Create encryption key file

If you have enabled user mode for S3 access, you must create and deploy an encryption key file. For more information, see Create Encryption Key File.

Info

NOTE: If you have enabled user access mode, you can skip the following sections, which pertain to the system access mode, and jump to the Create Redshift Connection section below.

System mode - additional configuration

The following sections apply only to system access mode.

Define default S3 write bucket

When S3 is defined as the base storage layer, write access to S3 is enabled. The 

D s platform
 attempts to store outputs in the designated default S3 bucket. 

Info

NOTE: This bucket must be set during initial installation. Modifying it at a later time is not recommended and can result in inaccessible data in the platform.

Info

NOTE: Bucket names cannot have underscores in them. See http://docs.aws.amazon.com/AmazonS3/latest/dev/BucketRestrictions.html.

Steps:

  1. Define S3 to be the base storage layer. See Set Base Storage Layer.
  2. Enable read access. See Enable read access to S3 above.
  3. Specify a value for aws.s3.bucket.name, which defines the S3 bucket where data is written. Do not include a protocol identifier. For example, if your bucket address is s3://MyOutputBucket, the value to specify is the following:

    Code Block
    MyOutputBucket
    Info

    NOTE: Specify the top-level bucket name only. There should not be any backslashes in your entry.

Info

NOTE: This bucket also appears as a read-access bucket if the specified S3 user has access.
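
In platform configuration, the corresponding entry would look similar to the following (the bucket name shown is illustrative):

Code Block
"aws.s3.bucket.name": "MyOutputBucket",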

Adding additional S3 buckets

When read access is enabled, all S3 buckets of which the specified user is the owner appear in the

D s webapp
. You can also add additional S3 buckets from which to read.

Info

NOTE: Additional buckets are accessible only if the specified S3 user has read privileges.

Info

NOTE: Bucket names cannot have underscores in them.

Steps:

  1. D s config

  2. Locate the aws.s3.extraBuckets parameter:

    1. In the Admin Settings page, specify the extra buckets as a comma-separated string of additional S3 buckets that are available for storage. Do not put any quotes around the string. Whitespace between string values is ignored.

    2. In 

      D s triconf
      , specify the extraBuckets array as a comma-separated list of buckets as in the following: 

      Code Block
      "extraBuckets": ["MyExtraBucket01","MyExtraBucket02","MyExtraBucket03"]
      Info

      NOTE: Specify the top-level bucket name only. There should not be any backslashes in your entry.

  3. These values are mapped to the following bucket addresses:

    Code Block
    s3://MyExtraBucket01
    s3://MyExtraBucket02
    s3://MyExtraBucket03

S3 Configuration

Configuration reference

D s config

Code Block
"aws.s3.enabled": true,
"aws.s3.bucket.name": "<BUCKET_FOR_OUTPUT_IF_WRITING_TO_S3>"
"aws.s3.key": "<AWS_KEY>",
"aws.s3.secret": "<AWS_SECRET>",
"aws.s3.extraBuckets": ["<ADDITIONAL_BUCKETS_TO_SHOW_IN_FILE_BROWSER>"]
Settings:

enabled

When set to true, the S3 file browser is displayed in the GUI for locating files. For more information, see S3 Browser.
bucket.name 

Set this value to the name of the S3 bucket to which you are writing.

  • When webapp.storageProtocol is set to s3, the output is delivered to aws.s3.bucket.name.
key

Access Key ID for the AWS account to use.

Info

NOTE: This value cannot contain a slash (/).

secret

Secret Access Key for the AWS account.

extraBuckets

Add references to any additional S3 buckets to this comma-separated array of values.

The S3 user must have read access to these buckets.

Enable use of server-side encryption

You can configure the 

D s platform
 to publish data on S3 when a server-side encryption policy is enabled. SSE-S3 and SSE-KMS methods are supported. For more information, see http://docs.aws.amazon.com/AmazonS3/latest/dev/serv-side-encryption.html.

Notes:

  • When encryption is enabled, all buckets to which you are writing must share the same encryption policy. Read operations are unaffected.

To enable, please specify the following parameters.

D s config

Server-side encryption method

Code Block
"aws.s3.serverSideEncryption": "none",

Set this value to the method of encryption used by the S3 server. Supported values:

Info

NOTE: Lowercase values are required.

 

  • sse-s3
  • sse-kms
  • none

Server-side KMS key identifier

When KMS encryption is enabled, you must specify the AWS KMS key ID to use for the server-side encryption.

Code Block
"aws.s3.serverSideKmsKeyId": "",

Notes:

The format for referencing this key is the following:

Code Block
"arn:aws:kms:<regionId>:<acctId>:key/<keyId>"

You can use an AWS alias in the following formats. The format of the AWS-managed alias is the following:

Code Block
"alias/aws/s3"

The format for a custom alias is the following:

Code Block
"alias/<FSR>"

where:

<FSR> is the name of the alias for the entire key.
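
For example, to enable SSE-KMS, the two settings together might look similar to the following (the key ARN is a placeholder):

Code Block
"aws.s3.serverSideEncryption": "sse-kms",
"aws.s3.serverSideKmsKeyId": "arn:aws:kms:<regionId>:<acctId>:key/<keyId>",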

Save your changes and restart the platform.

Configure S3 filewriter

The following configuration can be applied through the Hadoop site-config.xml file. If your installation does not have a copy of this file, you can insert the properties listed in the steps below into

D s triconf
to configure the behavior of the S3 filewriter.

Steps:

  1. D s config
    typet

  2. Locate the filewriter.hadoopConfig block, where you can insert the following Hadoop configuration properties:

    Code Block
    "filewriter": {
      "max": 16,
      "hadoopConfig": {
        "fs.s3a.buffer.dir": "/tmp",
        "fs.s3a.fast.upload": "true"
      },
      ...
    }
    Properties:
    fs.s3a.buffer.dir

    Specifies the temporary directory on the

    D s node
    to use for buffering when uploading to S3. If fs.s3a.fast.upload is set to false, this parameter is unused.

    Info

    NOTE: This directory must be accessible to the Batch Job Runner process during job execution.

    fs.s3a.fast.upload

    Set to true to enable buffering in blocks.

    When set to false, buffering in blocks is disabled. For a given file, the entire object is buffered to the disk of the

    D s node
    . Depending on the size and volume of your datasets, the node can run out of disk space.

  3. Save your changes and restart the platform.

Create Redshift Connection

For more information, see Create Redshift Connections.

Hadoop distribution-specific configuration

Hortonworks

Info

NOTE: If you are using Spark profiling through Hortonworks HDP on data stored in S3, additional configuration is required. See Configure for Hortonworks.

Additional Configuration for S3

The following parameters can be configured through the 

D s platform
 to affect the integration with S3. You may or may not need to modify these values for your deployment.

D s config

Parameters:
aws.s3.consistencyTimeout

S3 does not guarantee that at any time the files that have been written to a directory will be consistent with the files available for reading. S3 does guarantee that eventually the files are in sync.

This guarantee is important for some platform jobs that write data to S3 and then immediately attempt to read from the written data.

This timeout defines how long the platform waits for this guarantee. If the timeout is exceeded, the job fails. The default value is 120.

Depending on your environment, you may need to modify this value.

aws.s3.endpoint

Set this value to the DNS name of the S3 endpoint.

Info

NOTE: Do not include the protocol identifier.

Example value:

Code Block
s3.us-east-1.amazonaws.com

If your S3 deployment meets either of the following conditions:

  • it is located in a region that does not support the default endpoint, or
  • v4-only signature is enabled in the region

then you can specify this setting to point to the S3 endpoint for Java/Spark services.

For more information on this location, see https://docs.aws.amazon.com/general/latest/gr/rande.html#s3_region.
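
For example, in platform configuration these parameters might look similar to the following (the values shown are illustrative):

Code Block
"aws.s3.consistencyTimeout": 120,
"aws.s3.endpoint": "s3.us-east-1.amazonaws.com",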

Testing

Restart services. See Start and Stop the Platform.

Try running a simple job from the

D s webapp
. For more information, see Verify Operations.

Troubleshooting

Profiling consistently fails for S3 sources of data

If you are executing visual profiles of datasets sourced from S3, you may see an error similar to the following in the batch-job-runner.log file:

Code Block
01:19:52.297 [Job 3] ERROR com.trifacta.hadoopdata.joblaunch.server.BatchFileWriterWorker - BatchFileWriterException: Batch File Writer unknown error:
{jobId=3, why=bound must be positive}
01:19:52.298 [Job 3] INFO com.trifacta.hadoopdata.joblaunch.server.BatchFileWriterWorker - Notifying monitor for job 3 with status code FAILURE

This issue is caused by improperly configured buffering when writing job results to S3. The specified local buffer directory cannot be accessed by the batch job runner process, and the job fails to write results to S3.

Solution:

You may do one of the following:

  • Use a valid temp directory when buffering to S3.
  • Disable buffering to directory completely.

Steps:

  • D s config
    typet

  • Locate the filewriter.hadoopConfig block, where you can insert either of the following Hadoop configuration properties:

    Code Block
    "filewriter": {
      "max": 16,
      "hadoopConfig": {
        "fs.s3a.buffer.dir": "/tmp",
        "fs.s3a.fast.upload": false
      },
      ...
    }
    Properties:
    fs.s3a.buffer.dir

    Specifies the temporary directory on the

    D s node
    to use for buffering when uploading to S3. If fs.s3a.fast.upload is set to false, this parameter is unused.

    fs.s3a.fast.upload
    When set to false, buffering is disabled.
  • Save your changes and restart the platform.

Spark local directory has no space

During execution of a Spark job, you may encounter the following error:

Code Block
org.apache.hadoop.util.DiskChecker$DiskErrorException: No space available in any of the local directories.

Solution:

  • Restart 
    D s item
    services
    services
    , which may free up some temporary space.
  • Use the steps in the preceding solution to reassign a temporary directory for Spark to use (fs.s3a.buffer.dir).
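
For example, you might point fs.s3a.buffer.dir at a location with more available disk space. The path shown below is a hypothetical example only; substitute a directory that is valid on your node:

Code Block
"filewriter": {
  "hadoopConfig": {
    "fs.s3a.buffer.dir": "/data/tmp",
    "fs.s3a.fast.upload": "true"
  },
  ...
}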