

This documentation applies to installation from a supported Marketplace. Please use the installation instructions provided with your deployment.

If you are installing or upgrading a Marketplace deployment, please use the available PDF content. You must use the install and configuration PDF available through the Marketplace listing.

Scenario Description

This scenario assumes the following about the Trifacta® platform deployment:

  • The platform is to be installed via an Amazon AMI onto an EC2 instance.
  • It is to be connected to an EMR cluster.
  • No security features are applied to the platform and its use of the datastore.
  • You have acquired a Trifacta license key. The license key must be deployed to the Trifacta node before you start the platform.

NOTE: This scenario does not provide information on installing and configuring optional components, including security features. It is intended to get the Trifacta platform installed, operational, and connected to the EMR cluster.


If you are integrating the Trifacta platform with an EMR cluster, you must acquire a license first. Additional configuration is required. For more information, please contact

Before you begin:

  1. Read: Please read this entire document before you create the EMR cluster or install the Trifacta platform.

  2. Cluster sizing: Before you begin, allocate sufficient resources for the EMR cluster. For sizing guidance, please contact your Trifacta representative.

Internet access

From AWS, the Trifacta platform requires Internet access for the following services:

NOTE: Depending on your AWS deployment, some of these services may not be required.


  • AWS S3
  • Key Management Service [KMS] (if SSE-KMS server-side encryption is enabled)
  • Security Token Service [STS] (if a temporary credential provider is used)
  • EMR (if integration with EMR cluster is enabled)

NOTE: If the Trifacta platform is hosted in a VPC where Internet access is restricted, access to S3, KMS and STS services must be provided by creating a VPC endpoint. If the platform is accessing an EMR cluster, a proxy server can be configured to provide access to the AWS ElasticMapReduce regional endpoint.


By default, Trifacta Wrangler Pro is installed on a server with SELinux enabled. Security-enhanced Linux (SELinux) provides a set of security features for, among other things, managing access controls.

Tip: The following may be applied to other deployments of the Trifacta platform on servers where SELinux has been enabled.


In some cases, SELinux can interfere with normal operations of platform software. If you are experiencing connectivity problems related to SELinux, you can do either one of the following:

  1. Disable SELinux on the server. For more information, please see the CentOS documentation.
  2. Apply the following commands on the server, as root:
    1. Open ports on the server for listening. 
      1. By default, the Trifacta application listens on port 3005. The following opens that port when SELinux is enabled:

        semanage port -a -t http_port_t -p tcp 3005
      2. Repeat the above step for any other ports that you wish to open on the server.
    2. Permit nginx, the proxy on the Trifacta node, to open websockets:

      setsebool -P httpd_can_network_connect 1

Product Limitations

  • The EC2 instance, S3 buckets, and any connected Redshift databases must be located in the same Amazon region. Cross-region integrations are not supported at this time.
  • No support for Hive integration
  • No support for secure impersonation or Kerberos
  • No support for high availability and failover
  • Job cancellation is not supported on EMR.
  • When publishing single files to S3, you cannot apply an append publishing action.


NOTE: Before you install, you should review the configuration content for specific instructions on setting up the Trifacta node. See below.

  1. Create the EC2 instance for the Trifacta platform.
  2. Download and deploy the AMI into the EC2 instance.

Desktop Requirements

  • All desktop users of the platform must have the latest version of Google Chrome installed on their desktops.
    • Google Chrome must have the PNaCl client installed and enabled.
    • PNaCl Version:  0.50.x.y or later
  • All desktop users must be able to connect to the EC2 instance through the enterprise infrastructure.


Before you install the platform, please verify that the following steps have been completed.

  1. EULA. Before you begin, please review the End-User License Agreement. See

  2. S3 bucket. Please create an S3 bucket to store Trifacta assets. In the bucket, the platform stores metadata in the following location:



  3. IAM policies. Create IAM policies for access to the S3 bucket. Required permissions are the following: 
    • The system account or individual user accounts must have full permissions for the S3 bucket:

      Delete*, Get*, List*, Put*, Replicate*, Restore*
    • These policies must apply to the bucket and its contents. Example:

    • See
  4. EC2 instance role. Create an EC2 instance role for this policy. See
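
The permissions listed in step 3 can be sketched as an IAM policy document. The following Python snippet is a minimal illustration only, not the official policy from the product documentation. The bucket name my-trifacta-bucket and the helper build_s3_policy are hypothetical, and the action list assumes that the permissions named above map to S3 action wildcards with an s3: prefix.

```python
import json


def build_s3_policy(bucket_name):
    """Build an IAM policy granting the listed S3 permissions on a bucket and its contents."""
    # Permission families listed in step 3, expressed as S3 action wildcards (assumption).
    actions = ["s3:" + name for name in
               ("Delete*", "Get*", "List*", "Put*", "Replicate*", "Restore*")]
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": actions,
                # The policy must cover both the bucket itself and the objects in it.
                "Resource": [
                    "arn:aws:s3:::" + bucket_name,
                    "arn:aws:s3:::" + bucket_name + "/*",
                ],
            }
        ],
    }


# "my-trifacta-bucket" is a placeholder; substitute the bucket you created.
policy = build_s3_policy("my-trifacta-bucket")
print(json.dumps(policy, indent=2))
```

Listing both the bucket ARN and the bucket ARN followed by /* is what makes the policy apply to the bucket and to its contents.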

Install Steps

  1. Launch the product.
  2. In the EC2 Console:
    1. Instance size: Select the instance size.
    2. Network: Configure the VPC, subnet, firewall and other configuration settings necessary to communicate with the instance. 
    3. Auto-assigned Public IP: You must create a public IP to access the Trifacta platform.
    4. EC2 role: Select the EC2 role that you created.
    5. Local storage: Select a local EBS volume. The default volume includes 100GB storage.

      NOTE: The local storage environment contains the Trifacta databases, the product installation, and its log files. No source data is ever stored within the product.

    6. Security group: Use a security group that exposes access to port 3005, which is the default port for the platform. 
    7. Create an AWS key-pair for access: This key is used to provide SSH access to the platform, which may be required for some admin tasks.
    8. Save your changes.
  3. Apply license key:

    1. Acquire the license.json license key file that was provided to you by your Trifacta representative.

    2. Transfer the license key file to the EC2 node that is hosting the Trifacta platform. Navigate to the directory where you stored it.

    3. Make the Trifacta user the owner of the file:

      sudo chown trifacta:trifacta license.json
    4. Make sure that the Trifacta user has read permissions on the file:

      sudo chmod 644 license.json

    5. Copy the license key file to the proper location:

      cp license.json /opt/trifacta/license/
  4. Launch the configured platform.

    NOTE: From the EC2 Console, please acquire the instanceId, which is needed in a later step.

  5. When the instance is spinning up for the first time, performance may be slow. When the instance is up, navigate to the following:

  6. When the login screen appears, enter the following:
    1. Username: admin@trifacta.local
    2. Password: (the instanceId value)

      NOTE: As soon as you log in as an admin for the first time, you should immediately change the password. Select the User Profile menu item in the upper-right corner. Change the password and click Save to restart the platform.

  7. From the application menu, select Settings menu > Admin Settings
  8. In the Admin Settings page, you can configure many aspects of the platform, including user management tasks, and perform restarts to apply the changes.
    1. In the Search bar, enter the following:
    2. Set the value of this setting to be the bucket that you created.

  9. The following setting must be specified.


    You can set the above value to either of the following:

    aws.mode value | Description
    system | Set the mode to system to enable use of EC2 instance-based authentication for access.
    user | Set the mode to user to utilize user-based credentials to access the EMR cluster.

    Details on the above configuration are described later.

  10. Click Save.

  11. When the platform restarts, you can begin using the product.

SSH Access

If you need to SSH to the Trifacta node, you can use the following command:

ssh -i <path_to_key_file> <userId>@<tri_node_DNS_or_IP>

Parameter | Description
<path_to_key_file> | Path to the key file stored on your local computer.
<userId> | The user ID is always centos.
<tri_node_DNS_or_IP> | DNS or IP address of the Trifacta node.

If you are integrating with an EMR cluster, additional configuration is required. 

NOTE: Please review these steps with your Trifacta representative.

Set up EMR Cluster

Use the following section to set up your EMR cluster for use with the Trifacta platform.

  • Via AWS EMR UI: This method is assumed in this documentation.
  • Via AWS command line interface: For this method, it is assumed that you know the required steps to perform the basic configuration. For custom configuration steps, additional documentation is provided below.

NOTE: It is recommended that you set up your cluster for exclusive use by the Trifacta platform.

Cluster options

In the Amazon EMR console, click Create Cluster. Click Go to advanced options. Complete the sections listed below.

NOTE: Please be sure to read all of the cluster options before setting up your EMR cluster.

NOTE: Please perform your configuration through the Advanced Options workflow.

For more information on setting up your EMR cluster, see

Advanced Options

In the Advanced Options screen, please select the following:

  • Software Configuration:

    • Release: EMR 5.6 - 5.15

    • Select:
      • Hadoop 2.7.3
      • Hue 3.12.0
      • Ganglia 3.7.2

        Tip: Although it is optional, Ganglia is recommended for monitoring cluster performance.

      • Spark:
        • For EMR 5.6 - EMR 5.7: Spark 2.1.1
        • For EMR 5.8 - EMR 5.12.1: Spark 2.2.x

        • For EMR 5.13 - EMR 5.15: Spark 2.3.0

          NOTE: You must apply the Spark version number in the spark.version property in Admin Settings. Additional configuration is required. See Configure for Spark.

    • Deselect everything else.
  • Edit the software settings:
    • Copy and paste the following into Enter Configuration:

       [{
         "Classification": "capacity-scheduler",
         "Properties": {
           "yarn.scheduler.capacity.resource-calculator": "org.apache.hadoop.yarn.util.resource.DominantResourceCalculator"
         }
       }]
  • Auto-terminate cluster after the last step is completed: Disable this option.
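
The release-to-Spark mapping above can be expressed as a small lookup, which may help when scripting checks before you set spark.version. This is a sketch under stated assumptions: the function name spark_version_for_emr is hypothetical, and the ranges are read strictly from the list above, so a release that falls between the listed ranges is rejected.

```python
def spark_version_for_emr(release):
    """Return the Spark version listed above for an EMR release string such as "5.12.1"."""
    parts = tuple(int(p) for p in release.split("."))
    # Tuples compare element by element, so (5, 7, 0) sorts after (5, 7) and before (5, 8).
    if (5, 6) <= parts < (5, 8):
        return "2.1.1"
    if (5, 8) <= parts <= (5, 12, 1):
        return "2.2.x"
    if (5, 13) <= parts < (5, 16):
        return "2.3.0"
    raise ValueError("EMR release %s is outside the listed range 5.6 - 5.15" % release)


for release in ("5.6", "5.12.1", "5.15.0"):
    print(release, "->", spark_version_for_emr(release))
```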

Hardware configuration

NOTE: Please apply the sizing information for your EMR cluster that was recommended for you. If you have not done so, please contact your Trifacta representative.

General Options

  • Cluster name: Provide a descriptive name.
  • Logging: Enable logging on the cluster. 
    • S3 folder: Please specify the S3 bucket and path to the logging folder.

      NOTE: Please verify that this location is read accessible to all users of the platform. See below for details.

  • Debugging: Enable.
  • Termination protection: Enable.
  • Scale down behavior: Terminate at instance hour.
  • Tags:
    • No options required.
  • Additional Options:
    • EMRFS consistent view: You should enable this setting. Doing so may incur additional costs. For more information, see "EMRFS consistent view is recommended" below.
    • Custom AMI ID: None.
    • Bootstrap Actions:
      • If you are using the default credential provider, you must create a bootstrap action. 

        NOTE: This configuration must be completed before you create the EMR cluster. For more information, see Authentication below.

Security Options

  • EC2 key pair: Please select a key pair to use if you wish to access EMR nodes via SSH.
  • Permissions: Set to Custom to reduce the scope of permissions. For more information, see EMR cluster policies below.

    NOTE: Default permissions give access to everything in the cluster.

  • Encryption Options
    • No requirements.
  • EC2 Security Groups:

    • The selected security group for the master node on the cluster must allow traffic on port 8088. For more information, see System Ports.

Create cluster and acquire cluster ID

If you performed all of the configuration, including the sections below, you can create the cluster.

NOTE: You must acquire your EMR cluster ID for use in configuration of the Trifacta platform.

Specify cluster roles

The following cluster roles and their permissions are required. For more information on the specifics of these policies, see EMR cluster policies

  • EMR Role: 
    • Read/write access to log bucket
    • Read access to resource bucket
  • EC2 instance profile:
    • If using instance mode: 
      • EC2 profile should have read/write access for all users. 
      • EC2 profile should have same permissions as EC2 Edge node role. 
    • Read/write access to log bucket
    • Read access to resource bucket
  • Auto-scaling role:
    • Read/write access to log bucket
    • Read access to resource bucket
    • Standard auto-scaling permissions


You can use one of two methods for authenticating the EMR cluster:

  • Role-based IAM authentication (recommended): This method leverages your IAM roles on the EC2 instance. 
  • Custom credential provider JAR file: This method utilizes a JAR file provided with the platform. This JAR file must be deployed to all nodes on the EMR cluster through a bootstrap action script.

Role-based IAM authentication

You can leverage your IAM roles to provide role-based authentication to the S3 buckets.

NOTE: The IAM role that is assigned to the EMR cluster and to the EC2 instances on the cluster must have access to the data of all users on S3.

For more information, see Configure for EC2 Role-Based Authentication.

Specify the custom credential provider JAR file

If you are not using IAM roles for access, you can manage access using either of the following:

  • AWS key and secret values specified in 
  • AWS user mode

In either scenario, you must use the custom credential provider JAR provided in the installation. This JAR file must be available to all nodes of the EMR cluster.

After you have installed the platform and configured the S3 buckets, please complete the following steps to deploy this JAR file.

NOTE: These steps must be completed before you create the EMR cluster.

NOTE: This section applies if you are using the default credential provider mechanism for AWS and are not using the IAM instance-based role authentication mechanism.



  1. From the installation of the Trifacta platform, retrieve the following file:


    NOTE: Do not remove the timestamp value from the filename. This information is useful for support purposes.

  2. Upload this JAR file to an S3 bucket location where the EMR cluster can access it:

    1. Via AWS Console S3 UI: See
    2. Via AWS command line:

      aws s3 cp trifacta-aws-emr-credential-provider[TIMESTAMP].jar s3://<YOUR-BUCKET>/
  3. Create a bootstrap action script. The contents must be the following:

    sudo aws s3 cp s3://<YOUR-BUCKET>/trifacta-aws-emr-credential-provider[TIMESTAMP].jar  /usr/share/aws/emr/emrfs/auxlib/
  4. This script must be uploaded into S3 in a location that can be accessed from the EMR cluster. Retain the full path to this location.
  5. Add bootstrap action to EMR cluster configuration.
    1. Via AWS Console S3 UI: Create the bootstrap action to point to the script you uploaded on S3.


    2. Via AWS command line: 
      1. Upload the file to the accessible S3 bucket.
      2. In the command line cluster creation script, add a custom bootstrap action, such as the following:

        --bootstrap-actions '[
        {"Path":"s3://<YOUR-BUCKET>/","Name":"Custom action"}
        ]'

When the EMR cluster is launched with the above custom bootstrap action, the cluster does one of the following:

  • Interacts with S3 using the credentials specified in
  • If aws.mode = user, the credentials registered by the user are used.
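
The bootstrap script and the --bootstrap-actions value described in the steps above can be generated programmatically, which avoids quoting mistakes on the command line. The following is a rough sketch, not a supported tool: the helper names, the bucket name my-bucket, and the script name copy-credential-provider.sh are all hypothetical, and the JAR name keeps the [TIMESTAMP] placeholder used above.

```python
import json


def bootstrap_script(bucket, jar_name):
    """Return the one-line bootstrap script that copies the JAR into the EMRFS auxlib directory."""
    return "sudo aws s3 cp s3://%s/%s /usr/share/aws/emr/emrfs/auxlib/\n" % (bucket, jar_name)


def bootstrap_actions_arg(bucket, script_name):
    """Return the JSON value to pass to --bootstrap-actions for the uploaded script."""
    actions = [{"Path": "s3://%s/%s" % (bucket, script_name), "Name": "Custom action"}]
    return json.dumps(actions)


# All names below are placeholders for illustration.
jar = "trifacta-aws-emr-credential-provider[TIMESTAMP].jar"
print(bootstrap_script("my-bucket", jar), end="")
print(bootstrap_actions_arg("my-bucket", "copy-credential-provider.sh"))
```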

For more information about AWSCredentialsProvider for EMRFS please see:

EMRFS consistent view is recommended

Although it is not required, you should enable the consistent view feature for EMRFS on your cluster.

During job execution, including profiling jobs, on EMR, the Trifacta platform writes files in rapid succession, and these files are quickly read back from storage for further processing. However, Amazon S3 does not provide a guarantee of a consistent file listing until a later time.

To ensure that the Trifacta platform does not begin reading back an incomplete set of files, you should enable EMRFS consistent view. 

NOTE: If EMRFS consistent view is enabled, additional policies must be added for users and the EMR cluster. Details are below.

NOTE: If EMRFS consistent view is not enabled, profiling jobs may not get a consistent set of files at the time of execution. Jobs can fail or generate inconsistent results.

For more information on EMRFS consistent view, see


Amazon's DynamoDB is automatically enabled to store metadata for EMRFS consistent view.

NOTE: DynamoDB incurs costs while it is in use. For more information, see

NOTE: DynamoDB does not automatically purge metadata after a job completes. You should configure periodic purges of the database during off-peak hours.

Set up S3 Buckets

Bucket setup

You must set up S3 buckets for read and write access. 

NOTE: Within the Trifacta platform, you must enable use of S3 as the default storage layer. This configuration is described later.

For more information, see Enable S3 Access.

Set up EMR resources buckets

On the EMR cluster, all users of the platform must have access to the following locations:

Location | Description | Required Access
EMR Resources bucket and path

The S3 bucket and path where resources can be stored by the Trifacta platform for execution of Spark jobs on the cluster.

NOTE: If server-side encryption is in use, only SSE-S3 encryption type is supported for the resources bucket. If you are using the same bucket for resources and data and SSE-KMS is in use, you may need to deploy a second bucket for EMR resources. For more information on server-side encryption, see Enable S3 Access.

The locations are configured separately in the Trifacta platform.

EMR Logs bucket and path

The S3 bucket and path where logs are written for cluster job execution.  


These locations are configured on the Trifacta platform later.

Access Policies

EC2 instance profile

Trifacta users require the following policies to run jobs on the EMR cluster:

    "Statement": [
            "Effect": "Allow",
            "Action": [
            "Resource": [
            "Effect": "Allow",
            "Action": [
            "Resource": [


EMR roles

The following policies should be assigned to the EMR roles listed below for read/write access:

            "Effect": "Allow",
            "Action": [
            "Resource": [

EMRFS consistent view policies

If EMRFS consistent view is enabled, the following policy must be added for users and the EMR cluster permissions:

  "Version": "2012-10-17",
  "Statement": [
      "Action": [
      "Effect": "Allow",
      "Resource": [

Configure Trifacta platform for EMR

Please complete the following sections to configure the Trifacta platform to communicate with the EMR cluster.


Change admin password

As soon as you have installed the software, you should log in to the application and change the admin password. The initial admin password is the instanceId for the EC2 instance. For more information, see Change Password.

Verify S3 as base storage layer

EMR integration requires use of S3 as the base storage layer.

NOTE: The base storage layer must be set during initial installation and set up of the Trifacta node.

See Set Base Storage Layer.

Set up S3 integration

To integrate with S3, additional configuration is required. See Enable S3 Access.

Enable EMR integration

After you have configured S3 to be the base storage layer, you must enable EMR integration.


You can apply this change through the Admin Settings Page (recommended) or

. For more information, see Platform Configuration Methods.

  1. Search for the following setting:

    "webapp.runInEMR": false,
  2. Set the above value to true
  3. Set the following value to false:

    "webapp.runInHadoop": false,
  4. Verify the following property values:

    "webapp.runInTrifactaServer": true,
    "webapp.runInEMR": true,
    "webapp.runInHadoop": false,
    "webapp.runInDataflow": false,
    "photon.enabled": true,

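The verification in step 4 can be automated with a short check. This sketch assumes the platform configuration has already been loaded into a flat Python dictionary keyed by the dotted property names shown above; how the configuration is loaded is not covered here, and the helper name find_mismatches is hypothetical.

```python
# Expected values, copied from the property list in step 4.
EXPECTED_FLAGS = {
    "webapp.runInTrifactaServer": True,
    "webapp.runInEMR": True,
    "webapp.runInHadoop": False,
    "webapp.runInDataflow": False,
    "photon.enabled": True,
}


def find_mismatches(config):
    """Return {property: (expected, actual)} for every flag that differs from the expected value."""
    mismatches = {}
    for key, expected in EXPECTED_FLAGS.items():
        actual = config.get(key)  # None if the property is missing entirely
        if actual != expected:
            mismatches[key] = (expected, actual)
    return mismatches


# Example: a configuration where the Hadoop flag was left enabled.
sample = dict(EXPECTED_FLAGS)
sample["webapp.runInHadoop"] = True
print(find_mismatches(sample))
```

An empty result means that all five properties match the values listed above.
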
Apply EMR cluster ID

The Trifacta platform must be aware of the EMR cluster to which to connect.


  1. Administrators can apply this configuration change through the Admin Settings Page in the application. If the application is not available, the settings are available in

    . For more information, see Platform Configuration Methods.

  2. Under External Service Settings, enter your AWS EMR Cluster ID. Click the Save button below the textbox.

For more information, see Admin Settings Page.

Extract IP address of master node in private sub-net

If you have deployed your EMR cluster on a private sub-net that is accessible outside of AWS, you must enable this property, which permits the extraction of the IP address of the master cluster node through DNS.

NOTE: This feature must be enabled if your EMR is accessible outside of AWS on a private network.


  1. You can apply this change through the Admin Settings Page (recommended) or
    . For more information, see Platform Configuration Methods.
  2. Set the following property to true:

    "emr.extractIPFromDNS": false,
  3. Save your changes and restart the platform.

EMR Authentication for the Trifacta platform

Depending on the authentication method you used, you must set the following properties.

You can apply this change through the Admin Settings Page (recommended) or

. For more information, see Platform Configuration Methods.

Authentication method | Properties and values

Use default credential provider for all Trifacta access including EMR.

NOTE: This method requires the deployment of a custom credential provider JAR.


Use default credential provider for all Trifacta access. However, EC2 role-based IAM authentication is used for EMR.


EC2 role-based IAM authentication for all Trifacta access


Configure Spark for EMR

For EMR, you can configure a set of Spark-related properties to manage the integration and its performance.

Configure Spark version

Depending on the version of EMR with which you are integrating, the Trifacta platform must be modified to use the appropriate version of Spark to connect to EMR. For more information, see Configure for Spark.

Specify YARN queue for Spark jobs

Through the Admin Settings page, you can specify the YARN queue to which to submit your Spark jobs. All Spark jobs from the Trifacta platform are submitted to this queue.


  1. In platform configuration, locate the following:

  2. Specify the name of the queue. 
  3. Save your changes.

Allocation properties

The following properties must be passed from the Trifacta platform to Spark for proper execution on the EMR cluster. 

To apply this configuration change, login as an administrator to the Trifacta node. Then, edit

. Some of these settings may not be available through the Admin Settings Page. For more information, see Platform Configuration Methods.

NOTE: Do not modify these properties through the Admin Settings page. These properties must be added as extra properties through the Spark configuration block. Ignore any references in

to these properties and their settings.

"spark": {
  "props": {
    "spark.dynamicAllocation.enabled": "true",
    "spark.shuffle.service.enabled": "true",
    "spark.executor.instances": "0",
    "spark.executor.memory": "2048M",
    "spark.executor.cores": "2",
    "spark.driver.maxResultSize": "0"
  }
}

Property | Description | Value
spark.dynamicAllocation.enabled | Enable dynamic allocation on the Spark cluster, which allows Spark to dynamically adjust the number of executors. | true
spark.shuffle.service.enabled | Enable Spark shuffle service, which manages the shuffle data for jobs, instead of the executors. | true
spark.executor.instances | Default count of executor instances. | See Sizing Guide.
spark.executor.memory | Default memory allocation of executor instances. | See Sizing Guide.
spark.executor.cores | Default count of executor cores. | See Sizing Guide.
spark.driver.maxResultSize | Enable serialized results of unlimited size by setting this parameter to zero (0). | 0
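
Before pasting the allocation properties into the configuration, it can be useful to sanity-check them. The sketch below is illustrative only. It assumes that all values are written as strings, as in the sample block, and it checks one dependency that holds for the Spark 2.x releases listed earlier: dynamic allocation on YARN relies on the external shuffle service. The helper name check_allocation_props is hypothetical.

```python
import json

# Values copied from the configuration block above; all values are strings.
spark_props = {
    "spark.dynamicAllocation.enabled": "true",
    "spark.shuffle.service.enabled": "true",
    "spark.executor.instances": "0",
    "spark.executor.memory": "2048M",
    "spark.executor.cores": "2",
    "spark.driver.maxResultSize": "0",
}


def check_allocation_props(props):
    """Return a list of problems found in a Spark props mapping (empty if none)."""
    problems = []
    for key, value in props.items():
        if not isinstance(value, str):
            problems.append("%s should be a string, got %r" % (key, value))
    dynamic = props.get("spark.dynamicAllocation.enabled") == "true"
    shuffle = props.get("spark.shuffle.service.enabled") == "true"
    if dynamic and not shuffle:
        problems.append("dynamic allocation is enabled but the shuffle service is not")
    return problems


print(check_allocation_props(spark_props))
# Emit the block wrapped in an outer JSON object so the output is valid JSON on its own.
print(json.dumps({"spark": {"props": spark_props}}, indent=2))
```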

Default Hadoop job results format

For smaller datasets, the platform recommends using the Trifacta Server.

For larger datasets, if the size information is unavailable, the platform recommends by default that you run the job on the Hadoop cluster. For these jobs, the default publishing action for the job is specified to run on the Hadoop cluster, generating the output format defined by this parameter. Publishing actions, including output format, can always be changed as part of the job specification. 

As needed, you can change this default format. You can apply this change through the Admin Settings Page (recommended) or

. For more information, see Platform Configuration Methods.

"webapp.defaultHadoopFileFormat": "csv",

Accepted values: csv, json, avro, pqt

For more information, see Run Job Page.

Additional configuration for EMR

You can set the following parameters as needed:


You can apply this change through the Admin Settings Page (recommended) or

. For more information, see Platform Configuration Methods.


S3 bucket name where Trifacta executables, libraries, and other resources can be stored that are required for Spark execution.

NOTE: If server-side encryption is in use, only SSE-S3 encryption type is supported for the resources bucket. If you are using the same bucket for resources and data and SSE-KMS is in use, you may need to deploy a second bucket for EMR resources. For more information on server-side encryption, see Enable S3 Access.


S3 path within the bucket where resources can be stored for job execution on the EMR cluster.

NOTE: Do not include leading or trailing slashes for the path value.


This value defines the user for the Trifacta users to use for connecting to the cluster.

NOTE: Do not modify this value.

aws.emr.maxLogPollingRetries (required: N): Configure the maximum number of retries when polling for log files from EMR after job success or failure. Minimum value is 5.

Defines the timeout for EMR jobs in milliseconds. By default, this value is set to -1, which allows jobs to run for an infinite length of time.

NOTE: This setting should be modified only if you are experiencing problems with jobs hanging during execution on the EMR cluster.


Defines the number of days that temporary files in the /trifacta/tempfiles directory on EMR HDFS are permitted to age.

By default, this value is set to 0, which means that cleanup is disabled.

If needed, you can set this to a positive integer value. During each job run, the platform scans this directory for temp files older than the specified number of days and removes any that are found. This cleanup provides an additional level of system hygiene.

Before enabling this secondary cleanup process, please execute the following command to clear the tempfiles directory:

hdfs dfs -rm -r -skipTrash /trifacta/tempfiles
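
The age-based cleanup rule described above can be illustrated with a short sketch. This is not the platform's implementation, which is not shown in this document. It only models the stated behavior: a value of 0 disables cleanup, and a positive value selects temp files older than that number of days. The function name files_to_remove and the sample file names are hypothetical.

```python
from datetime import datetime, timedelta


def files_to_remove(files, cleanup_age_days, now):
    """Return names of temp files older than cleanup_age_days; 0 disables cleanup.

    files is a list of (name, last_modified) pairs.
    """
    if cleanup_age_days <= 0:
        return []
    cutoff = now - timedelta(days=cleanup_age_days)
    return [name for name, modified in files if modified < cutoff]


# Hypothetical listing of /trifacta/tempfiles with last-modified times.
now = datetime(2020, 1, 10)
listing = [
    ("job-1.tmp", datetime(2020, 1, 1)),
    ("job-2.tmp", datetime(2020, 1, 9)),
]
print(files_to_remove(listing, 3, now))
print(files_to_remove(listing, 0, now))
```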

Optional Configuration

Configure for Redshift

For more information on configuring the platform to integrate with Redshift, see Create Redshift Connections.

Switch EMR Cluster

If needed, you can switch to a different EMR cluster through the application. For example, if the original cluster suffers a prolonged outage, you can switch clusters by entering the cluster ID of a new cluster. For more information, see Admin Settings Page.

Configure Batch Job Runner

Batch Job Runner manages jobs executed on the EMR cluster. You can modify aspects of how jobs are executed and how logs are collected. For more information, see Configure Batch Job Runner.

Modify Job Tag Prefix

In environments where the EMR cluster is shared with other job-executing applications, you can review and specify the job tag prefix, which is prepended to job identifiers to avoid conflicts with other applications.


  1. You can apply this change through the Admin Settings Page (recommended) or
    . For more information, see Platform Configuration Methods.
  2. Locate the following and modify if needed:

    "aws.emr.jobTagPrefix": "TRIFACTA_JOB_",
  3. Save your changes and restart the platform.

Configure for EC2 Role-Based Authentication

This configuration is optional.

When you are running the Trifacta platform on an EC2 instance, you can leverage your enterprise IAM roles to manage permissions on the instance for the Trifacta platform. When this type of authentication is enabled, Trifacta administrators can apply a role to the EC2 instance where the platform is running. That role's permissions apply to all users of the platform.

IAM roles

Before you begin, your IAM roles should be defined and attached to the associated EC2 instance.

NOTE: The IAM instance role used for S3 access should have access to resources at the bucket level.


For more information, see

AWS System Mode

To enable role-based instance authentication, the following parameter must be enabled.

"aws.mode": "system",

Additional AWS Configuration

The following additional parameters must be specified:


aws.credentialProvider | Set this value to instance. IAM instance role is used for providing access.

Set this value to true for CDH. The class information is provided below.

Shared instance provider class information



In the future:

CDH is moving back to using the Instance class in a future release. For details, see

Use of S3 Sources

To access S3 for storage, additional configuration for S3 may be required.

NOTE: Do not configure the properties that apply to user mode.

Output sizing recommendations:

Start and Stop the Platform


For more information, see Upgrade for Amazon Marketplace with EMR.


You can access complete product documentation in online and PDF format. From within the product, select Help menu > Product Docs.
