This documentation applies to installation from a supported Marketplace. Please use the installation instructions provided with your deployment.


If you are installing or upgrading a Marketplace deployment, please use the available PDF content. You must use the install and configuration PDF available through the Marketplace listing.

Scenario Description

CloudFormation templates enable you to install Trifacta® Wrangler Enterprise with a minimal amount of effort.

  • After install, customizations can be applied by tweaking the resources that were created by the CloudFormation process.
  • If you have additional requirements or a complex environment, please contact Trifacta Support for assistance with your solution.

Install

The CloudFormation template creates a complete working instance of Trifacta Wrangler Enterprise, including the following:

  • VPC and all required networking infrastructure
  • EC2 instance with all supporting policies/roles
  • S3 bucket
  • EMR cluster
    • Configurable autoscaling instance groups
    • All supporting policies/roles

Pre-requisites

If you are integrating the Trifacta platform with an EMR cluster, you must acquire a Trifacta license first. Additional configuration is required. For more information, please contact aws-marketplace@trifacta.com.

Before you begin:

  1. Read: Please read this entire document before you begin.

  2. EULA. Before you begin, please review the End-User License Agreement. See https://docs.trifacta.com/display/PUB/End-User+License+Agreement+-+Trifacta+Wrangler+Enterprise.

  3. Trifacta license file. If you have not done so already, please acquire a Trifacta license file from your Trifacta representative.

Internet access

From AWS, the Trifacta platform requires Internet access for the following services:

NOTE: Depending on your AWS deployment, some of these services may not be required.

 

  • AWS S3
  • Key Management Service [KMS] (if sse-kms server-side encryption is enabled)
  • Security Token Service [STS] (if a temporary credential provider is used)
  • EMR (if integration with EMR cluster is enabled)

NOTE: If the Trifacta platform is hosted in a VPC where Internet access is restricted, access to S3, KMS and STS services must be provided by creating a VPC endpoint. If the platform is accessing an EMR cluster, a proxy server can be configured to provide access to the AWS ElasticMapReduce regional endpoint.
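
As a rough sketch of the VPC endpoint approach described in the note above, the following AWS CLI wrappers create a gateway endpoint for S3 and interface endpoints for KMS or STS. The function names, the AWS_CMD variable, and all resource IDs are placeholders introduced for illustration and are not part of the product. Your deployment may need additional options, such as endpoint policies.

```shell
#!/bin/sh
# Assumption: the AWS CLI is installed and configured. Set AWS_CMD=echo to
# preview the commands without calling AWS.
AWS_CMD="${AWS_CMD:-aws}"

# Build a regional service name, for example com.amazonaws.us-east-1.s3
endpoint_service_name() {
  echo "com.amazonaws.$1.$2"
}

# Create a gateway endpoint for S3 and attach it to a route table.
create_s3_endpoint() {
  vpc_id="$1"
  region="$2"
  route_table_id="$3"
  "$AWS_CMD" ec2 create-vpc-endpoint \
    --vpc-id "$vpc_id" \
    --service-name "$(endpoint_service_name "$region" s3)" \
    --route-table-ids "$route_table_id"
}

# Create an interface endpoint for a service such as kms or sts.
create_interface_endpoint() {
  vpc_id="$1"
  region="$2"
  service="$3"
  subnet_id="$4"
  security_group_id="$5"
  "$AWS_CMD" ec2 create-vpc-endpoint \
    --vpc-id "$vpc_id" \
    --vpc-endpoint-type Interface \
    --service-name "$(endpoint_service_name "$region" "$service")" \
    --subnet-ids "$subnet_id" \
    --security-group-ids "$security_group_id"
}
```

For example, `create_interface_endpoint vpc-0abc us-east-1 kms subnet-0abc sg-0abc` would request a KMS endpoint, where every ID shown is hypothetical.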

SELinux

By default, Trifacta Wrangler Enterprise is installed on a server with SELinux enabled. Security-enhanced Linux (SELinux) provides a set of security features for, among other things, managing access controls.

Tip: The following may be applied to other deployments of the Trifacta platform on servers where SELinux has been enabled.

In some cases, SELinux can interfere with normal operations of platform software. If you are experiencing connectivity problems related to SELinux, you can do either one of the following:

  1. Disable SELinux on the server. For more information, please see the CentOS documentation.
  2. Apply the following commands on the server, as root:
    1. Open ports on the server for listening. 
      1. By default, the Trifacta application listens on port 3005. The following opens that port when SELinux is enabled:

        semanage port -a -t http_port_t -p tcp 3005
      2. Repeat the above step for any other ports that you wish to open on the server.
    2. Permit nginx, the proxy on the Trifacta node, to open websockets:

      setsebool -P httpd_can_network_connect 1
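
If you need to open several ports, the commands above can be generated in one pass. The sketch below only prints the commands so that you can review them before running them as root. The function name is hypothetical, and any port other than 3005 is an example only.

```shell
#!/bin/sh
# Print the SELinux commands from the steps above for each port given as an
# argument, followed by the boolean that lets nginx open network connections.
# Nothing is executed here; review the output, then run it as root.
selinux_commands() {
  for port in "$@"; do
    echo "semanage port -a -t http_port_t -p tcp $port"
  done
  echo "setsebool -P httpd_can_network_connect 1"
}
```

For example, `selinux_commands 3005` prints the two commands shown in the steps. After reviewing the output, you could pipe it to a root shell, such as `selinux_commands 3005 | sudo sh`.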

Product Limitations

  • The EC2 instance, S3 buckets, and any connected Redshift databases must be located in the same Amazon region. Cross-region integrations are not supported at this time.
  • No support for Hive integration
  • No support for secure impersonation or Kerberos
  • No support for high availability and failover
  • Job cancellation is not supported on EMR.
  • When publishing single files to S3, you cannot apply an append publishing action.

Install

Desktop Requirements

  • All desktop users of the platform must have the latest version of Google Chrome installed on their desktops.
  • All desktop users must be able to connect to the EC2 instance through the enterprise infrastructure.

Steps:

 

  1. In the Marketplace listing, click Deploy into a new VPC.
  2. Choose a Template: The template path is automatically populated for you.
  3. Specify Details:
    1. Stack Name: The display name of the stack. This name is used in the names of resources created by the stack and as an identifier for the stack.

      NOTE: Each instance of the Trifacta platform must have a separate name.

    2. Instance Type: Please select the appropriate instance depending on the number of users and data volumes of your environment. For more information, see the Sizing Guide above.

    3. Key Pair: This SSH key pair is used to access the Trifacta Instance and the EMR cluster instances.

    4. Allowed HTTP Source: This range of addresses is permitted access to the Trifacta Instance on ports 80, 443, and 3005.

      1. By default, no services listen on ports 80 and 443, but you may modify the Trifacta configuration to enable access via these ports.

    5. Allowed SSH Source: This range of addresses is permitted access to port 22 on the Trifacta Instance.

    6. EMR Cluster Node Configuration: Allows you to customize the configuration of the deployed EMR nodes.
      1. Reasonable values are used as defaults.
      2. If you customize these values, increase them. Avoid decreasing them.
    7. EMR Cluster Autoscaling Configuration: Allows you to customize the autoscaling settings used by the EMR cluster.
      1. Reasonable values are used as defaults.
  4. Options: None of these is required for installation. Specify any options as needed for your environment.
  5. Review: Review your installation and configured options.
    1. Select the checkbox at the end of the page.
    2. To launch the stack, click Create.
  6. Please wait while the stack creates all required resources.
  7. In the Stacks list, select the name of your application. Click the Outputs tab and collect the following information. Instructions on how to use this information are provided later.

    Parameter: Trifacta URL value
      Description: URL and port number to which to connect to the Trifacta application.
      Use: Users must connect to this IP address and port number to access the application. By default, the port is set to 3005. The access port can be moved to 80 or 443 if desired. Please contact us for more details.

    Parameter: Trifacta Bucket
      Description: The address of the default S3 bucket.
      Use: This value must be applied through the application after it has been deployed.

    Parameter: Trifacta Instance Id
      Description: The identifier for the instance of the platform.
      Use: This value is the default password for the admin account.

      NOTE: You must change this password on the first login to the application.

  8. After the Trifacta instance has been created, you must add a license file before starting the Trifacta service. In the following steps, you SSH into the server, create the license file, paste in the license content, and update the ownership and permissions of that file:

    1. SSH into the server as the CentOS user, using the key you specified.

    2. Change to root user:

      sudo su
    3. Add your license:

      vi /opt/trifacta/license/license.json
    4. Into the above file, paste the contents of the license.json file that was provided to you by your Trifacta representative.

    5. Set the ownership and permissions on the file:

      chown trifacta:trifacta /opt/trifacta/license/license.json
      chmod 644 /opt/trifacta/license/license.json
  9. Start the Trifacta service:

    service trifacta start
  10. It may take some time for the server to finish coming online. Navigate to the Trifacta application.
  11. When the login screen appears, enter the following:
    1. Username: admin@trifacta.local
    2. Password: (the TrifactaInstanceId value)

      NOTE: After you log in as an admin for the first time, you must change the password.

  12. From the application menu, select the Settings menu. Then, click Settings > Admin Settings.
  13. In the Admin Settings page, you can configure many aspects of the platform, including user management tasks, and perform restarts to apply the changes.

  14. Add the S3 bucket that was automatically created for the stack to store EMR and other content. Search for:

    "aws.s3.bucket.name"

     

    1. Update the value with the Trifacta Bucket value provided when you created the stack in AWS.
  15. Verify your Spark version. If the cluster was launched from AWS, this value should be set to 2.3.0. Search for:

    "spark.version"

     

    1. Update its value to 2.3.0, if necessary.

  16. Enable the Run in EMR option within the platform. Search for:

    "webapp.runinEMR"
    1. Select the checkbox to enable it.

  17. Click Save underneath the Platform Settings section.

  18. In the Admin Settings page, locate the External Service Settings section.

    1. AWS EMR Cluster ID: Paste the value for the EMR Cluster ID for the cluster to which the platform is connecting.

      1. Verify that there are no extra spaces in any copied value.
    2. AWS Region: Enter the region where your EMR cluster is located.
    3. Resource Bucket: You may use the Trifacta Bucket that was already created.
      1. Verify that there are no extra spaces in any copied value.
    4. Resource Path: Use a value such as EMRLOGS.
  19. Click Save underneath the External Service Settings section.

  20. When the platform restarts, you can begin using the product.
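
The stack outputs collected from the Outputs tab in the steps above can also be read from the command line. This is a minimal sketch, assuming the AWS CLI is installed and configured for the account and region of the stack. The function name, the AWS_CMD variable, and the stack name are placeholders and are not part of the product.

```shell
#!/bin/sh
# Set AWS_CMD=echo to preview the commands without calling AWS.
AWS_CMD="${AWS_CMD:-aws}"

# Wait until stack creation finishes, then list every output key and value.
show_stack_outputs() {
  stack_name="$1"
  "$AWS_CMD" cloudformation wait stack-create-complete \
    --stack-name "$stack_name" &&
  "$AWS_CMD" cloudformation describe-stacks \
    --stack-name "$stack_name" \
    --query 'Stacks[0].Outputs[].[OutputKey,OutputValue]' \
    --output table
}
```

For example, `show_stack_outputs my-trifacta-stack` would print the outputs for a stack with that hypothetical name.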

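If you have already copied license.json to the server, the license steps above can be done without an editor. The sketch below assumes that it runs as root on the Trifacta node. The function name and its optional arguments are hypothetical; the defaults match the path, owner, and mode used in the steps.

```shell
#!/bin/sh
# Copy a license file into place, then set its ownership and permissions.
#   $1: path to the license.json file that you received
#   $2: destination path (optional)
#   $3: owner:group for the file (optional)
install_license() {
  src="$1"
  dest="${2:-/opt/trifacta/license/license.json}"
  owner="${3:-trifacta:trifacta}"
  # Refuse to install a missing or empty file.
  if [ ! -s "$src" ]; then
    echo "license file missing or empty: $src" >&2
    return 1
  fi
  cp "$src" "$dest" &&
  chown "$owner" "$dest" &&
  chmod 644 "$dest"
}
```

For example, `install_license /tmp/license.json` copies a file from a hypothetical upload location into the default path.
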
Note about deleting the CloudFormation stack

If you must delete the CloudFormation stack, please be aware of the following.

  1. The S3 bucket that was created for the stack is not removed. If you want to delete it, you must empty it first and then delete it.
  2. Any EMR security groups created for the stack cannot be deleted, due to circular references. The stack deletion process informs you of the security groups that it failed to delete. To complete the deletion:
    1. Remove all rules from the security groups.
    2. Delete the security groups manually.
    3. Re-run the stack deletion, which should complete successfully.
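
The cleanup described above can be sketched with the AWS CLI as follows. All function names, the AWS_CMD variable, and the resource identifiers are placeholders. The sketch removes inbound rules only; if a group still cannot be deleted, outbound rules may need to be revoked in a similar way. Note that `aws s3 rm --recursive` does not remove older object versions from a versioned bucket.

```shell
#!/bin/sh
# Set AWS_CMD=echo to preview the commands without calling AWS.
AWS_CMD="${AWS_CMD:-aws}"

# Empty the bucket left behind by the stack, then delete it.
delete_stack_bucket() {
  bucket="$1"
  "$AWS_CMD" s3 rm "s3://$bucket" --recursive &&
  "$AWS_CMD" s3 rb "s3://$bucket"
}

# Remove all inbound rules from a security group.
revoke_ingress_rules() {
  group_id="$1"
  rules=$("$AWS_CMD" ec2 describe-security-groups --group-ids "$group_id" \
    --query 'SecurityGroups[0].IpPermissions' --output json) || return 1
  # An empty JSON list means there is nothing to revoke.
  if [ "$rules" != "[]" ]; then
    "$AWS_CMD" ec2 revoke-security-group-ingress \
      --group-id "$group_id" --ip-permissions "$rules"
  fi
}

# Delete a security group after its rules have been removed.
delete_security_group() {
  "$AWS_CMD" ec2 delete-security-group --group-id "$1"
}

# Re-run the stack deletion and wait for it to finish.
retry_stack_delete() {
  "$AWS_CMD" cloudformation delete-stack --stack-name "$1" &&
  "$AWS_CMD" cloudformation wait stack-delete-complete --stack-name "$1"
}
```

Because the groups reference each other, run `revoke_ingress_rules` for every group that the stack deletion reported before calling `delete_security_group` on any of them, and then call `retry_stack_delete`.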

Verify

Start and Stop the Platform

Use the following commands to start, stop, and restart the platform.

Start:

sudo service trifacta start

Stop:

sudo service trifacta stop

Restart:

sudo service trifacta restart
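
After a start or restart, the application may take some time to come online. The following sketch polls the application URL with curl until something responds. The function name, the retry defaults, and the example address are assumptions; use the Trifacta URL value from your stack outputs.

```shell
#!/bin/sh
# Poll a URL until it answers or the attempts run out.
#   $1: URL to poll
#   $2: maximum number of attempts (default 30)
#   $3: seconds to wait between attempts (default 10)
# Any HTTP response counts as online, including an error page.
wait_for_app() {
  url="$1"
  attempts="${2:-30}"
  delay="${3:-10}"
  i=1
  while [ "$i" -le "$attempts" ]; do
    if curl --silent --output /dev/null --max-time 5 "$url"; then
      echo "online after $i attempt(s)"
      return 0
    fi
    i=$((i + 1))
    # Sleep only if another attempt follows.
    [ "$i" -le "$attempts" ] && sleep "$delay"
  done
  echo "not reachable after $attempts attempt(s)" >&2
  return 1
}
```

For example, `sudo service trifacta restart && wait_for_app http://10.0.0.10:3005/` restarts the service and then waits for a node at that hypothetical address.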

Verify Operations

After you have installed or made changes to the platform, you should verify end-to-end operations.

NOTE: The Trifacta® platform is not operational until it is connected to a supported backend datastore.

Steps:

  1. Log in to the application as an administrator. See Login.

  2. Through the Admin Settings page, run Tricheck, which performs tests on the Trifacta node and any connected cluster. See Admin Settings Page.

  3. In the application menu bar, click Library. Click Import Dataset. Select your backend datastore.
     

  4. Navigate your datastore directory structure to locate a small CSV or JSON file. 
     
  5. Select the file.  In the right panel, click Create and Transform.
    1. Troubleshooting: If the steps so far work, then you have read access to the datastore from the platform. If not, please check permissions for the Trifacta user and its access to the appropriate directories.
    2. See Import Data Page.
  6. In the Transformer page, some steps have already been added to your recipe, so you can run the job right away. Click Run Job.
    1. See Transformer Page.

  7. In the Run Job Page: 
    1. For Running Environment, some of these options may not be available. Choose according to the running environment you wish to test.
      1. Photon: Runs job on the Photon running environment hosted on the Trifacta node. This method of job execution does not utilize any integrated cluster.
      2. Spark: Runs the job on Spark on the integrated cluster.
      3. Databricks: If the platform is integrated with an Azure Databricks cluster, you can test job execution on the cluster.

        NOTE: Use of Azure Databricks is not supported for Marketplace installs.


    2. Select CSV and JSON output. 
    3. Select the Profile Results checkbox. 
    4. Troubleshooting: At this point, you are able to initiate a job for execution on the selected running environment. Later, you can verify operations by running the same job on other available environments.
    5. See Run Job Page.
       
  8. When the job completes, you should see a success message in the Jobs tab of the Flow View page. 
    1. Troubleshooting: Either the Transform job or the Profiling job may fail. To localize the problem, mouse over the Job listing in the Jobs page. Try re-running the job after deselecting the failed job type, or run the job in a different environment. You can also download the log files to try to identify the problem. See Jobs Page.
       
  9. Click View Results in the Jobs page. In the Profile tab of the Job Details page, you can see a visual profile of the generated results. 
    1. See Job Details Page.
  10. In the Output Destinations tab, click the CSV and JSON links to download the results to your local desktop. See Import Data Page.
     
  11. Load these results into a local application to verify that the content is correct.
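
Before opening the downloaded CSV file in a local application, a quick structural check can catch truncated or malformed output. The sketch below is a naive check that splits on commas and does not handle quoted fields that contain commas. The function name and the file name in the example are hypothetical.

```shell
#!/bin/sh
# Report the line count (header included), the column count of the first
# line, and how many lines have a different column count.
csv_summary() {
  awk -F',' '
    NR == 1 { cols = NF }
    NF != cols { bad++ }
    END { printf "lines=%d cols=%d inconsistent=%d\n", NR, cols, bad }
  ' "$1"
}
```

For example, `csv_summary results.csv` prints one summary line. A nonzero inconsistent count suggests that the file needs a closer look.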

Upgrade

For more information, see Upgrade for AWS Marketplace with EMR.

Documentation

You can access complete product documentation in online and PDF format. After the platform has been installed, select Help menu > Product Docs from the menu in the Trifacta application.
