External S3 Connections

You can create connections to specific S3 buckets through the Trifacta Application. These connections to S3 enable workspace users to read from and write to specific S3 buckets.

Simple Storage Service (S3) is an online data storage service provided by Amazon that offers low-latency access through web services. For more information, see https://aws.amazon.com/s3/.

Supported Environments:

Operation | Designer Cloud Powered by Trifacta Enterprise Edition | Amazon    | Microsoft Azure
Read      | Supported                                             | Supported | Not supported
Write     | Supported                                             | Supported | Not supported

Prerequisites

Before you begin, please verify that your Alteryx environment meets the following requirements:

  • Integration: Your Alteryx instance is connected to a running environment supported by your product edition.

  • Multiple regions: S3 connections can be configured in different regions.

  • Verify that Enable S3 Connectivity has been enabled in the Workspace Settings Page.

  • Acquire the Access Key ID and Secret Key for the S3 bucket or buckets to which you are connecting. For more information on acquiring your key/secret combination, contact your S3 administrator.

Permissions

Access to S3 requires the following:

  • Each user must have appropriate permissions to access S3.

    Note

    If a user does not have write permissions to the specified S3 bucket, publishing jobs to that bucket fails.

  • To browse multiple buckets through a single S3 connection, additional permissions are required. See below.

Limitations

  • Authentication using IAM roles is not supported.

  • Automatic region detection when creating or editing a connection is not supported.

  • Publishing the output to multi-part files is not supported.

    Note

    For some file formats, like Parquet, multi-part files are the default output.

  • Publishing output using the compression option is not supported for Trifacta Photon jobs.

    Note

    If you need to generate compressed output to this S3 bucket, you can run the job on another running environment.

Create Connection

You can create additional S3 connections using the following method:

Create through application

You can create an S3 connection through the application.

Steps:

  1. Log in to the application.

  2. In the left navigation bar, click the Connections icon.

  3. In the Create Connection page, click the External Amazon S3 card.

  4. Specify the connection properties:

    Property

    Description

    Default Bucket

    (Optional) The default S3 bucket to which to connect. When the connection is first accessed for browsing, the contents of this bucket are displayed.

    If this value is not provided, then the list of available buckets based on the key/secret combination is displayed when browsing through the connection.

    Note

    To see the list of available buckets, the connecting user must have the getBucketList permission. If that permission is not present and no default bucket is listed, then the user cannot browse S3.

    Access Key ID

    Access Key ID for the S3 connection.

    Secret Key

    Secret Key for the S3 connection.

    Server Side Encryption

    If server-side encryption has been enabled on your bucket, you can select the server-side encryption policy to use when writing to the bucket. SSE-S3 and SSE-KMS methods are supported. For more information, see http://docs.aws.amazon.com/AmazonS3/latest/dev/serv-side-encryption.html.

    Server Side KMS Key ID

    When KMS encryption is enabled, you must specify the AWS KMS key ID to use for the server-side encryption. For more information, see "Server Side KMS Key Identifier" below.

    For more information on the other options, see Create Connection Window.

  5. Click Save.

Server Side KMS Key Identifier

When KMS encryption is enabled, you must specify the AWS KMS key ID to use for the server-side encryption.

The format for referencing this key is the following:

"arn:aws:kms:<regionId>:<acctId>:key/<keyId>"

You can use an AWS alias in the following formats. The format of the AWS-managed alias is the following:

"alias/aws/s3"

The format for a custom alias is the following:

"alias/<FSR>"

where:

<FSR> is the name of the alias for the entire key.
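The three identifier formats above can be sketched as simple string templates. This is a minimal illustration, assuming placeholder region, account, and key values; the helper names are not part of the Trifacta or AWS APIs.

```python
# Illustrative helpers for the KMS key identifier formats described above.
# The region, account, and key IDs shown are made-up example values.

def kms_key_arn(region_id: str, acct_id: str, key_id: str) -> str:
    """Full ARN form: arn:aws:kms:<regionId>:<acctId>:key/<keyId>"""
    return f"arn:aws:kms:{region_id}:{acct_id}:key/{key_id}"

def aws_managed_alias() -> str:
    """AWS-managed alias for S3."""
    return "alias/aws/s3"

def custom_alias(alias_name: str) -> str:
    """Custom alias form: alias/<alias name>."""
    return f"alias/{alias_name}"

print(kms_key_arn("us-east-1", "123456789012", "1234abcd-12ab-34cd-56ef-1234567890ab"))
print(aws_managed_alias())
print(custom_alias("my-s3-key"))
```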

Create via API

For more information on the vendor and type information to use, see Connection Types.

For more information, see https://api.trifacta.com/ee/9.7/index.html#operation/createConnection

API: API Reference

  • type: remotefile

  • vendor: aws
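A request body for creating this connection via the API might look like the following sketch. Only the type and vendor values come from this documentation; the other field names (name, params, defaultBucket) are assumptions, so check the API Reference for the exact schema your release expects.

```python
import json

# Illustrative createConnection request body for an external S3 connection.
# "type" and "vendor" are documented values; the remaining fields are assumed.
payload = {
    "type": "remotefile",                      # from Connection Types
    "vendor": "aws",                           # from Connection Types
    "name": "my-external-s3",                  # assumed field name
    "params": {"defaultBucket": "my-bucket"},  # assumed field name
}
print(json.dumps(payload, indent=2))
```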

Java VFS Service

The Java VFS Service has been modified to handle an optional connection ID, enabling S3 URLs with connection ID and credentials. The other connection details are fetched through the Trifacta Application to create the required URL and configuration.

// sample URI
s3://bucket-name/path/to/object?connectionId=136


// sample java-vfs-service CURL request with s3
curl -H 'x-trifacta-person-workspace-id: 1' -X GET 'http://localhost:41917/vfsList?uri=s3://bucket-name/path/to/object?connectionId=136'
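When building the request programmatically, percent-encoding the uri value keeps its embedded connectionId query string from being parsed as part of the outer URL. A minimal sketch, assuming the same local endpoint as the cURL sample:

```python
from urllib.parse import urlencode

# Sketch: constructing the vfsList request URL with the connection ID.
# Percent-encoding the "uri" value preserves the nested "?connectionId=136".
s3_uri = "s3://bucket-name/path/to/object?connectionId=136"
request_url = "http://localhost:41917/vfsList?" + urlencode({"uri": s3_uri})
print(request_url)
```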

Write

You can publish results to your external S3 buckets.

Note

The append action is not supported when publishing to S3.

Using S3 Connections

Uses of S3

The Designer Cloud Powered by Trifacta platform can use S3 for the following tasks:

  1. Enabled S3 Integration: The Designer Cloud Powered by Trifacta platform has been configured to integrate with your S3 instance. For more information, see S3 Access.

  2. Creating Datasets from S3 Files: You can read in source data stored in S3. An imported dataset may be a single S3 file or a folder of identically structured files. See Reading from Sources in S3 below.

  3. Reading Datasets: When creating a dataset, you can pull your data from a source in S3. See Creating Datasets below.

  4. Writing Results: After a job has been executed, you can write the results back to S3. See Writing Results below.

In the Trifacta Application, S3 is accessed through the S3 browser. See S3 Browser.

Note

When the Trifacta Application executes a job on a dataset, the source data is untouched. Results are written to a new location, so no data is disturbed by the process.

Before you begin using S3

  • Access: If you are using system-wide permissions, your administrator must configure access parameters for S3 locations. If you are using per-user permissions, this requirement does not apply. See S3 Access.

    Warning

    Avoid using /trifacta/uploads for reading and writing data. This directory is used by the Trifacta Application.

  • Your administrator should provide a writeable home output directory for you. This directory location is available through your user profile. See Storage Config Page.

Secure access

Your administrator can grant access on a per-user basis or for the entire workspace.

The Designer Cloud Powered by Trifacta platform utilizes an S3 key and secret to access your S3 instance. These keys must enable read/write access to the appropriate directories in the S3 instance.

Note

If you disable or revoke your S3 access key, you must update the S3 keys for each user or for the entire system.

For more information, see S3 Access.

Storing data in S3

Your administrator should provide raw data or locations and access for storing raw data within S3. All Alteryx users should have a clear understanding of the folder structure within S3 where each individual can read from and write results.

  • Users should know where shared data is located and where personal data can be saved without interfering with or confusing other users.

  • The Trifacta Application stores the results of each job in a separate folder in S3.

Note

Designer Cloud Powered by Trifacta Enterprise Edition does not modify source data in S3. Source data stored in S3 is read without modification from source locations, and source data uploaded to the Designer Cloud Powered by Trifacta platform is stored in /trifacta/uploads.

Reading from sources in S3

You can create an imported dataset from one or more files stored in S3.

Note

To be able to import datasets from the base storage layer, your user account must include the dataAdmin role.

Note

Import of glaciered objects is not supported.

Wildcards:

You can parameterize your input paths to import source files as part of the same imported dataset. For more information, see Overview of Parameterization.
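As an illustration of how a wildcard in a parameterized input path selects multiple identically structured files, consider the following sketch. The bucket layout and key names are made up for the example:

```python
from fnmatch import fnmatch

# Illustration only: a wildcard pattern selecting monthly files that share
# the same structure, while skipping files with a different name.
pattern = "sales/2023-*/orders.csv"
keys = [
    "sales/2023-01/orders.csv",
    "sales/2023-02/orders.csv",
    "sales/2023-02/refunds.csv",
]
matched = [k for k in keys if fnmatch(k, pattern)]
print(matched)  # only the two orders.csv files match
```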

Folder selection:

When you select a folder in S3 to create your dataset, you select all files in the folder to be included.

Notes:

  • This option selects all files in all sub-folders and bundles them into a single dataset. If your sub-folders contain separate datasets, you should be more specific in your folder selection.

  • All files used in a single imported dataset must be of the same format and have the same structure. For example, you cannot mix and match CSV and JSON files if you are reading from a single directory.

When a folder is selected from S3, the following file types are ignored:

  • *_SUCCESS and *_FAILED files, which may be present if the folder has been populated by the running environment.
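The folder-import filtering described above can be sketched as follows. This is a simplified model of the documented behavior, not the application's actual implementation:

```python
from fnmatch import fnmatch

# Sketch: files matching *_SUCCESS or *_FAILED are skipped; everything else
# in the folder (and its sub-folders) is bundled into one dataset.
IGNORED_PATTERNS = ("*_SUCCESS", "*_FAILED")

def importable(keys):
    return [
        k for k in keys
        if not any(fnmatch(k.rsplit("/", 1)[-1], p) for p in IGNORED_PATTERNS)
    ]

keys = ["out/part-0001.csv", "out/part-0002.csv", "out/_SUCCESS"]
print(importable(keys))
```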

Note

If you have a folder and file with the same name in S3, search only retrieves the file. You can still navigate to locate the folder.

Creating datasets

When creating a dataset, you can choose to read data from a source stored in S3 or from a local file.

  • S3 sources are not moved or changed.

  • Local file sources are uploaded to /trifacta/uploads where they remain and are not changed.

Data may be individual files or all of the files in a folder. In the Import Data page, click the S3 tab. See Import Data Page.

Writing results

When you run a job, you can specify the S3 bucket and file path where the generated results are written. By default, the output is generated in your default bucket and default output home directory.

  • Each set of results must be stored in a separate folder within your S3 output home directory.

  • For more information on your output home directory, see Storage Config Page.
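The per-job folder layout described above can be sketched as a simple path builder. The directory names here are illustrative, not the application's actual naming scheme:

```python
from posixpath import join

# Sketch: every job's results land in their own folder under the user's
# S3 output home directory. Folder naming is illustrative.
def job_output_folder(output_home: str, job_id: int) -> str:
    return join(output_home, f"job_{job_id}")

print(job_output_folder("s3://my-bucket/output/user@example.com", 42))
```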

Warning

If your Alteryx installation is using S3, do not use the /trifacta/uploads directory. This directory is used for storing uploads and metadata, which may be used by multiple users. Manipulating files outside of the Trifacta Application can destroy other users' data. Please use the tools provided through the Trifacta Application interface for managing uploads from S3.

Note

When writing files to S3, you may encounter an issue where the UI indicates that the job failed, but the output file or files have been written to S3. This issue may be caused when S3 does not report the files back to the application before the S3 consistency timeout has expired. For more information on raising this timeout setting, see S3 Access.

Creating a new dataset from results

As part of writing results, you can choose to create a new dataset, so that you can chain together data wrangling tasks.

Note

When you create a new dataset as part of your results, the file or files are written to the designated output location for your user account. Depending on how your permissions are configured, this location may not be accessible to other users.

Purging files

Other than temporary files, the Designer Cloud Powered by Trifacta platform does not remove any files that were generated or used by the platform, including:

  • Uploaded datasets

  • Generated samples

  • Generated results

If you are concerned about data accumulation, you should create a bucket policy to periodically back up or purge directories in use. For more information, please see the S3 documentation.
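A purge policy of this kind can be expressed as an S3 lifecycle configuration. The sketch below builds one rule that expires objects under a prefix after 30 days; the prefix is illustrative, so adapt it to the folders your workspace actually uses and apply the policy with the AWS console, CLI, or SDK.

```python
import json

# Sketch of an S3 lifecycle rule expiring generated results after 30 days.
# The prefix value is an assumption for illustration only.
lifecycle_config = {
    "Rules": [
        {
            "ID": "expire-old-job-results",
            "Filter": {"Prefix": "trifacta/queryResults/"},
            "Status": "Enabled",
            "Expiration": {"Days": 30},
        }
    ]
}
print(json.dumps(lifecycle_config, indent=2))
```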