This installation process applies to installing Trifacta® Wrangler Enterprise on AWS infrastructure that you manage.

AWS Marketplace deployments:

NOTE: Content in this section does not apply to deployments from the AWS Marketplace. For more information on installing from the Marketplace, see the AWS Marketplace listing.


Scenario Description

NOTE: All hardware in use for supporting the platform is maintained within your enterprise infrastructure on AWS.

  • Installation of Trifacta Wrangler Enterprise on an EC2 server in AWS
  • Installation of Trifacta databases on AWS
  • Integration with a supported EMR cluster
  • Base storage layer and backend datastore of S3

For more information on deployment scenarios, see Supported Deployment Scenarios for AWS.

Limitations

Deployment Limitations

The following limitations apply to installations of Trifacta Wrangler Enterprise on AWS:

  • No support for high availability and failover
  • Job cancellation is not supported on EMR.
  • When publishing single files to S3, you cannot apply an append publishing action.
  • The following limitations apply to EMR integration only:
    • No support for Hive integration
    • No support for secure impersonation or Kerberos

Product Limitations

For general limitations of Trifacta Wrangler Enterprise, see Product Limitations in the Planning Guide.

Pre-requisites

Please acquire the following assets:

  • Install Package: Acquire the installation package for your operating system.
    • License Key: As part of the installation package, you should receive a license key file. See License Key for details.
    • For more information, contact  Trifacta Support.
  • Offline system dependencies: If you are completing the installation without Internet access, you must also acquire the offline versions of the system dependencies. See Install Dependencies without Internet Access.

AWS desktop requirements

  • All desktop users must be able to connect to the EC2 instance through the enterprise infrastructure.

AWS pre-requisites

Depending on which of the following AWS components you are deploying, additional pre-requisites and limitations may apply. Please review these sections as well.

Preparation

Before you install Trifacta Wrangler Enterprise on AWS, please verify that you have completed the following:

  1. Read: Please read this entire document before you create the EMR cluster or install the Trifacta platform.
  2. VPC: Enable and deploy a working AWS VPC.
    1. In your VPC, you must define a subnet where you plan to deploy the Trifacta node.
  3. S3: Enable and deploy an AWS S3 bucket to use as the base storage layer for the platform. In the bucket, the platform stores metadata in the following location:

    <S3_bucket_name>/trifacta

    See https://s3.console.aws.amazon.com/s3/home.

  4. IAM Policies: Create IAM policies for access to the S3 bucket. Required permissions are the following: 
    • The system account or individual user accounts must have full permissions for the S3 bucket:

      Delete*, Get*, List*, Put*, Replicate*, Restore*
    • These policies must apply to the bucket and its contents. Example:

      "arn:aws:s3:::my-trifacta-bucket-name"
      "arn:aws:s3:::my-trifacta-bucket-name/*"
    • See https://console.aws.amazon.com/iam/home#/policies
  5. EC2 instance role: Create an EC2 instance role for your S3 bucket policy. See https://console.aws.amazon.com/iam/home#/roles.
  6. EC2 instance: Deploy an AWS EC2 instance with SELinux where the Trifacta software can be installed.
    1. The required set of ports must be enabled for listening. See System Ports in the Planning Guide.

    2. This node should be dedicated for Trifacta use.

      NOTE: The EC2 node must meet the system requirements for installing the platform. For more information, see System Requirements in the Planning Guide.

  7. EMR cluster: An existing EMR cluster is required. 
    1. Cluster sizing: Before you begin, determine how large the cluster must be and allocate sufficient resources for it. For guidance, please contact your Trifacta representative.

    2. See Deploy the Cluster below.
  8. Databases: 
    1. The platform utilizes a set of databases that must be accessed from the Trifacta node. Databases are installed as part of the workflow described later.
    2. If installing databases on Amazon RDS, an admin account to RDS is required. For more information, see Install Databases on Amazon RDS.
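
The IAM policy described in step 4 above can be sketched as a policy document. This is a minimal illustration, not the official policy: the function name `build_s3_bucket_policy` is hypothetical, the `s3:` action prefix and the `2012-10-17` policy version follow standard IAM policy syntax, and the bucket name is the example value used above. Review the generated document against your own security requirements before creating the policy in IAM.

```python
import json


def build_s3_bucket_policy(bucket_name: str) -> dict:
    """Return an IAM policy document granting the listed S3 actions on a bucket.

    Hypothetical helper for illustration; it only builds the JSON structure
    and does not call AWS.
    """
    if not bucket_name:
        raise ValueError("bucket_name must not be empty")
    # Action families listed in step 4. Each wildcard matches every S3
    # action whose name starts with that prefix.
    action_prefixes = ["Delete", "Get", "List", "Put", "Replicate", "Restore"]
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [f"s3:{prefix}*" for prefix in action_prefixes],
                # The policy must apply to the bucket and to its contents.
                "Resource": [
                    f"arn:aws:s3:::{bucket_name}",
                    f"arn:aws:s3:::{bucket_name}/*",
                ],
            }
        ],
    }


# Print the policy as JSON so that it can be reviewed before use.
print(json.dumps(build_s3_bucket_policy("my-trifacta-bucket-name"), indent=2))
```

The two `Resource` entries matter: the first covers bucket-level actions such as listing, and the second covers the objects stored in the bucket.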

AWS Information

Before you begin installation, please acquire the following information from AWS:

  • EMR:
    • AWS region for the EMR cluster, if it exists.
    • ID for EMR cluster, if it exists
      • If you are creating an EMR cluster as part of this process, please retain the ID.
      • The EMR cluster must allow access from the Trifacta node. This configuration is described later.
  • Subnet: Subnet within your virtual private cloud (VPC) where you want to launch the Trifacta platform.
    • This subnet should be in the same VPC as the EMR cluster.
    • Subnet can be private or public.
    • If it is private and it cannot access the Internet, additional configuration is required. See below.
  • S3:
    • Name of the S3 bucket that the platform can use
    • Path to resources on the S3 bucket

  • EC2: 
    • Instance type for the Trifacta node
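
One way to keep track of the values listed above is a small record that reports what is still missing. This is only a sketch: the class `AwsInstallInfo`, its field names, and the sample values are hypothetical and are not part of the product or of AWS.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AwsInstallInfo:
    """Values to collect from AWS before installing (illustrative names)."""

    subnet_id: str
    subnet_has_internet_access: bool
    s3_bucket: str
    s3_resource_path: str
    ec2_instance_type: str
    # EMR values are optional because the cluster may not exist yet.
    emr_region: Optional[str] = None
    emr_cluster_id: Optional[str] = None

    def missing_items(self) -> List[str]:
        """Return the names of required values that are still empty."""
        required = {
            "subnet_id": self.subnet_id,
            "s3_bucket": self.s3_bucket,
            "s3_resource_path": self.s3_resource_path,
            "ec2_instance_type": self.ec2_instance_type,
        }
        return [name for name, value in required.items() if not value.strip()]

    def follow_ups(self) -> List[str]:
        """Return reminders taken from the notes in the list above."""
        notes = []
        if self.emr_cluster_id is None:
            notes.append("Create the EMR cluster and retain its ID.")
        if not self.subnet_has_internet_access:
            notes.append("Subnet cannot access the Internet: additional configuration is required.")
        return notes


# Placeholder values only; substitute the values from your own AWS account.
info = AwsInstallInfo(
    subnet_id="subnet-example",
    subnet_has_internet_access=False,
    s3_bucket="my-trifacta-bucket-name",
    s3_resource_path="",
    ec2_instance_type="example-instance-type",
)
print("Missing:", info.missing_items())
print("Follow-ups:", info.follow_ups())
```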

Internet access

From AWS, the Trifacta platform requires Internet access for the following services:

NOTE: Depending on your AWS deployment, some of these services may not be required.

  • AWS S3
  • AWS Key Management Service [KMS] (if sse-kms server-side encryption is enabled)
  • AWS Security Token Service [STS] (if a temporary credential provider is used)
  • EMR (if integration with EMR cluster is enabled)

NOTE: If the Trifacta platform is hosted in a VPC where Internet access is restricted, access to S3, KMS and STS services must be provided by creating a VPC endpoint. If the platform is accessing an EMR cluster, a proxy server can be configured to provide access to the AWS ElasticMapReduce regional endpoint.
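
The conditions above can be expressed as a short decision function. This is a sketch under stated assumptions: both function names are hypothetical, and the endpoint names follow the common AWS pattern `com.amazonaws.<region>.<service>`, which you should confirm in the VPC console for your region. EMR is excluded from the endpoint list because the note above describes reaching it through a proxy server.

```python
from typing import List


def required_aws_services(sse_kms: bool, temporary_credentials: bool, emr: bool) -> List[str]:
    """Return the AWS services that the platform must be able to reach."""
    services = ["s3"]  # S3 is the base storage layer, so it is always needed.
    if sse_kms:
        services.append("kms")
    if temporary_credentials:
        services.append("sts")
    if emr:
        services.append("elasticmapreduce")
    return services


def vpc_endpoint_service_names(region: str, services: List[str]) -> List[str]:
    """Return assumed VPC endpoint service names for S3, KMS, and STS only."""
    via_endpoint = ("s3", "kms", "sts")
    return [f"com.amazonaws.{region}.{name}" for name in services if name in via_endpoint]


# Example: SSE-KMS enabled, no temporary credentials, EMR integration enabled.
needed = required_aws_services(sse_kms=True, temporary_credentials=False, emr=True)
print(needed)
print(vpc_endpoint_service_names("us-east-1", needed))
```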

Deploy the Cluster

In your AWS infrastructure, you must deploy a supported version of EMR across a recommended number of nodes to support the expected data volumes of your Trifacta jobs.

  • For more information on suggested sizing, see Sizing Guidelines in the Planning Guide.

NOTE: Cluster information including cluster configuration files must be accessible to the Trifacta node. These requirements are described below.

Deploy the EC2 Node

An EC2 node must be deployed to host the Trifacta platform software. Here are some guidelines for deploying the EC2 instance:

  1. Instance size: Select the instance size.
  2. Network: Configure the VPC, subnet, firewall and other configuration settings necessary to communicate with the instance. 
  3. Auto-assigned Public IP: You must create a public IP to access the Trifacta platform.
  4. EC2 role: Select the EC2 role that you created.
  5. Local storage: Select a local EBS volume. The default volume includes 100 GB of storage.

    NOTE: The local storage environment contains the Trifacta databases, the product installation, and its log files. No source data is ever stored within the product.

  6. Security group: Use a security group that exposes access to port 3005, which is the default port for the platform. 
  7. Create an AWS key-pair for access: This key is used to provide SSH access to the platform, which may be required for some admin tasks.
  8. Save your changes.
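
The security group rule from step 6 can be sketched as data before you enter it in the console. This is an illustration only: `build_ingress_rules` is a hypothetical helper, the dictionary keys mirror the ingress-permission structure used by the EC2 API, the CIDR range is a placeholder, and port 22 is included on the assumption that SSH access (step 7) uses the standard SSH port.

```python
import ipaddress
from typing import Dict, List


def build_ingress_rules(
    allowed_cidr: str, platform_port: int = 3005, include_ssh: bool = True
) -> List[Dict]:
    """Build inbound TCP rules for the platform port and, optionally, SSH."""
    # ip_network() raises ValueError for malformed CIDR blocks and for
    # blocks that have host bits set (for example 10.0.0.5/16).
    network = ipaddress.ip_network(allowed_cidr)
    if network.version != 4:
        raise ValueError("this sketch handles IPv4 ranges only")
    ports = [platform_port] + ([22] if include_ssh else [])
    return [
        {
            "IpProtocol": "tcp",
            "FromPort": port,
            "ToPort": port,
            "IpRanges": [{"CidrIp": str(network)}],
        }
        for port in ports
    ]


# Placeholder range: restrict access to the enterprise network, not 0.0.0.0/0.
for rule in build_ingress_rules("10.0.0.0/16"):
    print(rule)
```

Limiting the source range to your enterprise network keeps the platform port and SSH from being exposed to the whole Internet, even though the instance has a public IP.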

Install Workflow

NOTE: These steps are covered in greater detail later in this section.

The installation and configuration process requires the following steps. To continue, see Next Steps below.

  1. Install software: Install the Trifacta platform software on the Trifacta node. See Install Software.

  2. Install databases: The platform requires several databases for storage.

    NOTE: The default configuration assumes that you are installing the databases on a PostgreSQL server on the same edge node as the software using the default ports. If you are changing the default configuration, additional configuration is required as part of this installation process.

    For more information, see Install Databases in the Databases Guide.

  3. Start the platform: For more information, see Start and Stop the Platform.
  4. Log in to the application: After software and databases are installed, you can log in to the application to complete configuration:
    1. See Login.
    2. As soon as you log in, you should change the password on the admin account. In the left menu bar, select Settings > Settings > Admin Settings. Scroll down to Manage Users. For more information, see Change Admin Password in the Configuration Guide.

      Tip: At this point, you can access the online documentation through the application. In the left menu bar, select Help menu > Documentation. All of the following content, plus updates, is available online. See Documentation below.

  5. Install configuration: After you can successfully log in to the Trifacta application, you must configure the product to work with your backend storage layer and the running environment on the cluster. See Install Configuration.
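
After starting the platform in step 3, a quick connectivity check can confirm that the default port accepts connections before you try to log in. This is a sketch, not a product utility: `is_port_open` is a hypothetical helper that uses only the Python standard library, and it verifies that the TCP port is reachable, not that the application is healthy.

```python
import socket


def is_port_open(host: str, port: int = 3005, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and attempts a TCP connect.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution errors.
        return False
```

For example, calling `is_port_open` with the public IP or hostname of the Trifacta node returns False if the security group does not expose port 3005 to your network or if the platform has not started.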

Next Steps

To continue, please install the Trifacta software on the Trifacta node.

NOTE: Please complete the installation steps for the operating system version that is installed on the Trifacta node.

See Install Software.
