This install process applies to installing Trifacta® Wrangler Enterprise on an AWS infrastructure that you manage.

AWS Marketplace deployments:

NOTE: Content in this section does not apply to deployments from the AWS Marketplace, which provide fewer deployment and configuration options. For more information, see the AWS Marketplace.

Scenario Description

NOTE: All hardware in use for supporting the platform is maintained within the enterprise infrastructure on AWS.

  • Installation of Trifacta Wrangler Enterprise on an EC2 server in AWS
  • Installation of Trifacta databases on AWS
  • Integration with a supported EMR cluster
  • Base storage layer and backend datastore of S3

NOTE: When the above installation and configuration steps have been completed, the platform is operational. Additional configuration may be required, which is referenced at the end of this section.

For more information on deployment scenarios, see Supported Deployment Scenarios for AWS.

Product Limitations

The following limitations apply to installations of Trifacta Wrangler Enterprise on AWS:

  • No support for Hive integration
  • No support for secure impersonation or Kerberos
  • No support for high availability and failover
  • Job cancellation is not supported on EMR.
  • When publishing single files to S3, you cannot apply an append publishing action.

Pre-requisites

Desktop Requirements

  • All desktop users of the platform should have a supported version of Google Chrome installed on their desktops.
  • All desktop users must be able to connect to the EC2 instance through the enterprise infrastructure.

AWS Pre-requisites

Depending on which AWS components you are deploying, additional pre-requisites and limitations may apply. Please review the related sections as well.

Prep

Before you begin, please verify that you have completed the following:

 

  1. Review Planning Guide: Please review and verify Install Preparation and sub-topics.
    1. Limitations: For more information on limitations of this scenario, see Product Limitations in the Install Preparation area.
  2. Read: Please read this entire document before you create the EMR cluster or install the Trifacta platform.

  3. Acquire Assets: Acquire the installation package for your operating system and your license key. For more information, contact Trifacta Support.
    1. If you are completing the installation without Internet access, you must also acquire the offline versions of the system dependencies. See Install Dependencies without Internet Access.
  4. VPC: Enable and deploy a working AWS VPC.
  5. S3: Enable and deploy an AWS S3 bucket to use as the base storage layer for the platform. In the bucket, the platform stores metadata in the following location:

    <S3_bucket_name>/trifacta

    See https://s3.console.aws.amazon.com/s3/home.

  6. IAM Policies: Create IAM policies for access to the S3 bucket. Required permissions are the following: 
    • The system account or individual user accounts must have full permissions for the S3 bucket:

      Delete*, Get*, List*, Put*, Replicate*, Restore*
    • These policies must apply to the bucket and its contents. Example:

      "arn:aws:s3:::my-trifacta-bucket-name"
      "arn:aws:s3:::my-trifacta-bucket-name/*"
    • See https://console.aws.amazon.com/iam/home#/policies
  7. EC2 instance role: Create an EC2 instance role for your S3 bucket policy. See https://console.aws.amazon.com/iam/home#/roles.
  8. EC2 instance: Deploy an AWS EC2 instance, with SELinux disabled, where the Trifacta software can be installed.
    1. The required set of ports must be enabled for listening. See System Ports.

    2. This node should be dedicated for Trifacta use.

      NOTE: The EC2 node must meet the system requirements. For more information, see System Requirements.

  9. EMR cluster: An existing EMR cluster is required. 
    1. Cluster sizing: Before you begin, you should allocate sufficient resources for sizing the cluster. For guidance, please contact your Trifacta representative.

    2. See Deploy the Cluster below.
  10. Databases:
    1. The platform utilizes a set of databases that must be accessible from the Trifacta node. The databases are installed as part of the workflow described later.
    2. For more information on the supported databases and versions, see System Requirements.
    3. For more information on database installation requirements, see Install Databases.
    4. If you are installing the databases on Amazon RDS, an RDS admin account is required. For more information, see Install Databases on Amazon RDS.
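The S3 permissions described in the IAM Policies step above can be combined into a single IAM policy document. The following is a sketch only: the bucket name my-trifacta-bucket-name is a placeholder, and you should adapt the actions and ARNs to your environment and your security policies.

```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:Delete*",
                "s3:Get*",
                "s3:List*",
                "s3:Put*",
                "s3:Replicate*",
                "s3:Restore*"
            ],
            "Resource": [
                "arn:aws:s3:::my-trifacta-bucket-name",
                "arn:aws:s3:::my-trifacta-bucket-name/*"
            ]
        }
    ]
}
```

You can attach this policy to the EC2 instance role created in the next step, or to individual user accounts, depending on the credential mode you choose later.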

AWS Information

Before you begin installation, please acquire the following information from AWS:

  • EMR:
    • AWS region for the EMR cluster, if it exists
    • ID for the EMR cluster, if it exists
      • If you are creating an EMR cluster as part of this process, please retain the ID.
      • The EMR cluster must allow access from the Trifacta node. This configuration is described later.
  • Subnet: Subnet within your virtual private cloud (VPC) where you want to launch the Trifacta platform.
    • This subnet should be in the same VPC as the EMR cluster.
    • Subnet can be private or public.
    • If it is private and it cannot access the Internet, additional configuration is required. See below.
  • S3:
    • Name of the S3 bucket that the platform can use
    • Path to resources on the S3 bucket

  • EC2: 
    • Instance type for the Trifacta node

Internet access

From AWS, the Trifacta platform requires Internet access for the following services:

NOTE: Depending on your AWS deployment, some of these services may not be required.

 

  • AWS S3
  • Key Management Service (KMS), if SSE-KMS server-side encryption is enabled
  • Security Token Service (STS), if the temporary credential provider is used
  • EMR, if integration with an EMR cluster is enabled

NOTE: If the Trifacta platform is hosted in a VPC where Internet access is restricted, access to the S3, KMS, and STS services must be provided by creating VPC endpoints. If the platform is accessing an EMR cluster, a proxy server can be configured to provide access to the AWS ElasticMapReduce regional endpoint.
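For a restricted VPC, a gateway endpoint for S3 can be created from the AWS CLI. The following is a sketch only; the VPC and route table IDs are placeholders, the region in the service name must match your deployment, and KMS and STS require interface endpoints instead of gateway endpoints.

```shell
# Sketch: gateway VPC endpoint so a restricted VPC can reach S3.
# vpc-0abc1234 and rtb-0abc1234 are placeholder IDs; substitute your own.
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-0abc1234 \
  --service-name com.amazonaws.us-east-1.s3 \
  --route-table-ids rtb-0abc1234
```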

Deploy the Cluster

In your AWS infrastructure, you must deploy a supported version of EMR across a recommended number of nodes to support the expected data volumes of your Trifacta jobs.

For more information on the supported EMR distributions, see Supported Deployment Scenarios for AWS.

When you configure the platform to integrate with the cluster, you must acquire some information about the cluster resources. For more information on the set of information to collect, see Pre-Install Checklist in the Install Preparation area.

Deploy the EC2 Node

A dedicated EC2 node must be deployed to host the Trifacta platform software. For more information on the requirements of this node, see System Requirements.


Here are some guidelines for deploying the EC2 instance from the EC2 console:

  1. Instance size: Select the instance size.
  2. Network: Configure the VPC, subnet, firewall and other configuration settings necessary to communicate with the instance. 
  3. Auto-assigned Public IP: You must create a public IP to access the Trifacta platform.
  4. EC2 role: Select the EC2 role that you created.
  5. Local storage: Select a local EBS volume. The default volume includes 100 GB of storage.

    NOTE: The local storage environment contains the Trifacta databases, the product installation, and its log files. No source data is ever stored within the product.

  6. Security group: Use a security group that exposes access to port 3005, which is the default port for the platform. 
  7. Create an AWS key-pair for access: This key is used to provide SSH access to the platform, which may be required for some admin tasks.
  8. Save your changes.
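The guidelines above can also be applied from the AWS CLI. The following is a sketch only: the AMI, instance type, key pair name, and all IDs are placeholders, and the instance type should be chosen to meet the System Requirements for your expected workload.

```shell
# Sketch: launch the Trifacta node per the guidelines above.
# All IDs, the AMI, and the instance type are placeholders.
aws ec2 run-instances \
  --image-id ami-0abc1234 \
  --instance-type m5.2xlarge \
  --subnet-id subnet-0abc1234 \
  --security-group-ids sg-0abc1234 \
  --iam-instance-profile Name=trifacta-ec2-role \
  --associate-public-ip-address \
  --key-name trifacta-keypair \
  --block-device-mappings 'DeviceName=/dev/sda1,Ebs={VolumeSize=100}'
```

The security group referenced here should be the one that exposes port 3005, and the instance profile should wrap the EC2 role created for your S3 bucket policy.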

Install Workflow

NOTE: These steps are covered in greater detail later in this section.

After you have completed the above, please complete the following steps in order:

  1. Install software: Install the Trifacta platform software on the EC2 node you created. See Install Software.

  2. Install databases: The platform requires several databases for storing metadata.

    NOTE: The software assumes that you are installing the databases on a PostgreSQL server on the same node as the software. If you are not or are changing database names or ports, additional configuration is required as part of this installation process.

    For more information, see Install Databases.

  3. Start the platform: For more information, see Start and Stop the Platform.
  4. Log in to the application: After the software and databases are installed, you can log in to the application to complete configuration:
    1. See Login.
    2. As soon as you log in, you should change the password on the admin account. In the left menu bar, select Settings > Admin Settings. Scroll down to Manage Users. For more information, see Change Admin Password.

Tip: At this point, you can access the online documentation through the application. In the left menu bar, select Help menu > Product Docs. All of the following content, plus updates, is available online. See Documentation below.

Configure for EMR

NOTE: If you are creating a new EMR cluster as part of this installation process, please skip this section. That workflow is covered later in the document. For more information, see Configure for EMR.

Please complete the following configuration to enable access to your pre-existing EMR cluster from the Trifacta platform.

IAM and Security Group updates

You must make changes to your IAM roles and security groups to enable the Trifacta instance to communicate with your existing EMR cluster, and to enable your EMR cluster to read from and write to the Trifacta data bucket. Below are the requirements and suggested implementation details. Please adapt these suggestions to fit your environment, as long as the requirements are satisfied.

Each requirement is listed below with an example implementation.

Trifacta EC2 instance role must be permitted to use your EMR cluster.

{
    "Version": "2008-10-17",
    "Statement": [
        {
            "Action": [
                "elasticmapreduce:DescribeStep",
                "elasticmapreduce:ListBootstrapActions",
                "elasticmapreduce:ListClusters",
                "elasticmapreduce:DescribeCluster",
                "elasticmapreduce:AddJobFlowSteps",
                "elasticmapreduce:DescribeJobFlows",
                "elasticmapreduce:ListInstanceGroups"
            ],
            "Resource": "*",
            "Effect": "Allow"
        }
    ]
}

EMR EC2 instance role must be permitted to use the Trifacta data bucket.

{
    "Version": "2008-10-17",
    "Statement": [
        {
            "Action": [
                "elasticmapreduce:Describe*",
                "elasticmapreduce:List*",
                "s3:ListAllMyBuckets",
                "ec2:Describe*"
            ],
            "Resource": "*",
            "Effect": "Allow"
        },
        {
            "Action": [
                "s3:PutObject",
                "s3:ListBucket",
                "s3:GetObject",
                "s3:DeleteObject"
            ],
            "Resource": [
                "arn:aws:s3:::YOUR-TRIFACTA-BUCKET",
                "arn:aws:s3:::YOUR-TRIFACTA-BUCKET/*"
            ],
            "Effect": "Allow"
        }
    ]
}

Your EMR Service Role should permit access to the Trifacta bucket.

NOTE: This example is not a complete policy. You should update your existing policy with these statements.

        {
            "Action": [
                "s3:HeadBucket",
                "s3:ListAllMyBuckets"
            ],
            "Resource": "*",
            "Effect": "Allow"
        },
        {
            "Action": [
                "s3:PutObject",
                "s3:GetObject",
                "s3:ListBucket",
                "s3:DeleteObject"
            ],
            "Resource": [
                "arn:aws:s3:::YOUR-TRIFACTA-BUCKET",
                "arn:aws:s3:::YOUR-TRIFACTA-BUCKET/*"
            ],
            "Effect": "Allow"
        },

Your EMR cluster master node must permit the Trifacta EC2 instance to access it.

  • The Trifacta EC2 instance must be able to communicate with your EMR master node on TCP ports 18080 and 8088.
  • You should create a security group and then associate it with your EMR master node using the "additional security groups" functionality.
  • For future ease of use, you should specify the security group associated with your Trifacta EC2 instance as the source.
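The suggested security group can be created from the AWS CLI. The following is a sketch only; the VPC ID and the group IDs for the EMR master and the Trifacta instance are placeholders that you should replace with your own values.

```shell
# Sketch: admit the Trifacta node to the EMR master on the two required ports.
# sg-0emr1234 is the group attached to the EMR master as an additional
# security group; sg-0tri1234 is the Trifacta instance's security group.
aws ec2 create-security-group \
  --group-name trifacta-to-emr \
  --description "Allow Trifacta node to reach EMR master" \
  --vpc-id vpc-0abc1234
for port in 8088 18080; do
  aws ec2 authorize-security-group-ingress \
    --group-id sg-0emr1234 \
    --protocol tcp --port "$port" \
    --source-group sg-0tri1234
done
```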

Additional configuration must be applied within the platform. These steps are described later.

Additional Configuration for AWS Installs

Apply license key to EC2 node

Steps:

  1. Acquire the license.json license key file that was provided to you by your Trifacta representative.

  2. Transfer the license key file to the EC2 node that is hosting the Trifacta platform. Navigate to the directory where you stored it.

  3. Make the Trifacta user the owner of the file:

    sudo chown trifacta:trifacta license.json
  4. Make sure that the Trifacta user has read permissions on the file:

    sudo chmod 644 license.json
  5. Copy the license key file to the proper location:

    cp license.json /opt/trifacta/license/

Launch the platform

For more information on how to launch the platform, see Start and Stop the Platform.

When the instance is spinning up for the first time, performance may be slow. When the instance is up, navigate to the following:

http://<public_hostname>:3005

When the login screen appears, enter the default admin credentials provided to you.

NOTE: As soon as you log in as an admin for the first time, you should immediately change the password. From the left nav bar, select Settings > Settings > User Profile. Change the password and click Save to restart the platform.

Configure for EMR clusters

Complete the following steps to configure the platform to integrate with the EMR cluster:

  1. From the application menu, select the Settings menu. Then, click Settings > Admin Settings.
  2. In the Admin Settings page, you can configure many aspects of the platform, including user management tasks, and perform restarts to apply the changes.
    1. In the Search bar, enter the following:

      aws.s3.bucket.name
    2. Set the value of this setting to the name of your S3 bucket.

  3. Check the following setting. Verify that it is set to 2.3.0:

    "spark.version": "2.3.0",
  4. The following setting must be specified.

    "aws.mode":"system",

    You can set the above value to either of the following:

    system - Set the mode to system to enable use of EC2 instance-based authentication for access.
    user - Set the mode to user to utilize user-based credentials. This mode requires additional configuration.

    Details on the above configuration are described later.

  5. Set the following parameter to true, which instructs the Trifacta application to run jobs on the integrated EMR cluster:

    "webapp.runInEMR": true,
  6. In the Admin Settings page, locate the External Service Settings section.

    1. AWS EMR Cluster ID: Paste the ID of the EMR cluster to which the platform is connecting.

    2. AWS Region: Enter the region where your EMR cluster is located.
    3. Resource Bucket: Enter the name of the S3 bucket to use.
    4. Resource Path: Enter a path on the bucket for platform resources, such as EMRLOGS.
  7. Click Save underneath the External Service Settings section.
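If you apply these settings through platform configuration rather than the Admin Settings page, the values from the steps above correspond to entries like the following. This is a sketch only: the bucket name is a placeholder, the key casing follows the settings named in the steps above, and the External Service Settings are configured separately in the application.

```json
{
  "aws.s3.bucket.name": "my-trifacta-bucket-name",
  "spark.version": "2.3.0",
  "aws.mode": "system",
  "webapp.runInEMR": true
}
```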

Set base storage layer

The platform requires that one backend datastore be configured as the base storage layer. This base storage layer is used for storing uploaded data and writing results and profiles. 

NOTE: By default, the base storage layer for Trifacta Wrangler Enterprise is set to HDFS. You must change this value to S3. After this base storage layer is defined, it cannot be changed again.

See Set Base Storage Layer.
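In platform configuration, the base storage layer is typically governed by a single setting. The following fragment is a sketch under an assumed key name; verify the exact key against your installation and the Set Base Storage Layer topic before applying it.

```json
"webapp.storageProtocol": "s3",
```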

Verify Operations

NOTE: You can try to verify operations using the Trifacta Photon running environment at this time. While you can also try to run a job in the Spark running environment, additional configuration may be required to complete the integration. These steps are listed under Next Steps below.

 

Prepare Your Sample Dataset

To complete this test, you should locate or create a simple dataset. Your dataset should be created in the format that you wish to test.

Characteristics:

  • Two or more columns. 
  • If there are specific data types that you would like to test, please be sure to include them in the dataset.
  • A minimum of 25 rows is required for best results from type inference.
  • Ideally, your dataset is a single file or sheet. 

Store Your Dataset

If you are testing an integration, you should store your dataset in the datastore with which the product is integrated.

Tip: Uploading datasets is always available as a means of importing datasets.

 

  • You may need to create a connection between the platform and the datastore.
  • Read and write permissions must be enabled for the connecting user to the datastore.
  • For more information, see Connections Page.

Verification Steps

Steps:

  1. Log in to the application. See Login.
  2. In the application menu bar, click Library.
  3. Click Import Data. See Import Data Page.
    1. Select the connection where the dataset is stored. For datasets stored on your local desktop, click Upload.
    2. Select the dataset.
    3. In the right panel, click the Add Dataset to a Flow checkbox. Enter a name for the new flow.
    4. Click Import and Add to Flow.
    5. Troubleshooting: At this point, you have verified read access to your datastore from the platform. If the import fails, please check the logs, permissions, and your Trifacta® configuration.

  4. In the left menu bar, click the Flows icon. In the Flows page, open the flow you just created. See Flows Page.
  5. In the Flows page, click the dataset you just imported. Click Add new Recipe.
  6. Select the recipe. Click Edit Recipe.
  7. The initial sample of the dataset is opened in the Transformer page, where you can edit your recipe to transform the dataset.
    1. In the Transformer page, some steps are automatically added to the recipe for you, so you can run the job immediately.
    2. You can add additional steps if desired. See Transformer Page.
  8. Click Run Job.
    1. If options are presented, select the defaults.

    2. To generate results in other formats or output locations, click Add Publishing Destination. Configure the output formats and locations. 
    3. To test dataset profiling, click the Profile Results checkbox. Note that profiling runs as a separate job and may take considerably longer. 
    4. See Run Job Page.
    5. Troubleshooting: Later, you can re-run this job on a different running environment. Some formats are not available across all running environments.

  9. When the job completes, you should see a success message under the Jobs tab in the Flow View page. 
    1. Troubleshooting: Either the Transform job or the Profiling job may fail. To localize the problem, try re-running the job with the failing job type deselected, or run the job on a different running environment (if available). You can also download the log files to try to identify the problem. See Job Details Page.
  10. Click View Results from the context menu for the job listing. In the Job Details page, you can see a visual profile of the generated results. See Job Details Page.
  11. In the Output Destinations tab, click a link to download the results to your local desktop. 
  12. Load these results into a local application to verify that the content looks correct.

Checkpoint: You have verified importing from the selected datastore and transforming a dataset. If your job was successfully executed, you have verified that the product is connected to the job running environment and can write results to the defined output location. Optionally, you may have tested profiling of job results. If all of the above tasks completed, the product is operational end-to-end.

Documentation

Tip: You should access online documentation through the product. Online content may receive updates that are not present in PDF content.

You can access complete product documentation online and in PDF format. From within the Trifacta application, select Help menu > Product Docs.

Next Steps

After you have accessed the documentation, the following topics are relevant to AWS enterprise infrastructure deployments.

NOTE: These materials are located in the Configuration Guide.

Please review them in order.

Required Platform Configuration

This section covers the following topics, some of which should already be completed:

  • Set Base Storage Layer - The base storage layer must be set once and never changed. Set this value to s3.

  • Create Encryption Key File - If you plan to integrate the platform with any relational sources, including Redshift, you must create an encryption key file and store it on the Trifacta node.
  • Running Environment Options - Depending on your scenario, you may need to perform additional configuration for your available running environment(s) for executing jobs.
  • Profiling Options - In some environments, tweaks to the settings for visual profiling may be required. You can disable visual profiling if needed.
  • Configure for Spark - If you are enabling the Spark running environment, please review and verify the configuration for integrating the platform with the Spark running environment.

Configure for EMR

Set up for a new EMR cluster. Some content may apply to existing EMR clusters.

Enable Integration with Compressed Clusters

If the Hadoop cluster uses compression, additional configuration is required.
Enable Integration with Cluster High Availability

If you are integrating with high availability on the Hadoop cluster, please complete these steps. In this case, HttpFS must be enabled in the platform; HttpFS is also required in other, less-common cases. See Enable HttpFS.
Enable Relational Connections

Enable integration with relational databases, including Redshift.

Configure for KMS

Integration with the Hadoop cluster's key management system (KMS) for encrypted transport. Instructions are provided for distribution-specific versions of Hadoop.
Configure Security

A list of topics on applying additional security measures to the Trifacta platform and how it integrates with Hadoop.

Configure SSO for AD-LDAP

Please complete these steps if you are integrating with your enterprise's AD/LDAP Single Sign-On (SSO) system.

Upgrade

For more information on upgrading Trifacta Wrangler Enterprise on AWS, please contact Trifacta Customer Success Services.
