This documentation applies to installations from a supported Marketplace. Please use the installation instructions provided with your deployment.


If you are installing or upgrading a Marketplace deployment, use the install and configuration PDF available through the Marketplace listing.

This guide steps through the requirements and process for installing Trifacta® Wrangler Enterprise from the Azure Marketplace.

Product Limitations

  • HDInsight 3.6 only
  • HDInsight Hadoop, Spark and HBase cluster types

Documentation Scope

This document guides you through the process of installing the product and beginning to use it.

  • If you are creating a new cluster as part of this process: You can begin running jobs. Instructions are provided in this document for testing job execution.
  • If you are integrating the product with a pre-existing cluster: Additional configuration is required after you complete the installation process. Relevant topics are listed in the Documentation section at the end of this document. 

Install

Desktop Requirements

  • All desktop users of the platform must have the latest version of Google Chrome installed on their desktops.
    • Google Chrome must have the PNaCl client installed and enabled.
    • PNaCl Version:  0.50.x.y or later
  • All desktop users must be able to connect to the created Trifacta node instance through the enterprise infrastructure.

Sizing Guide

NOTE: The following guidelines apply only to Trifacta Wrangler Enterprise on the Azure Marketplace.

Use the following guidelines to select your instance size:

Azure virtual machine type | vCPUs | RAM (GB) | Max recommended concurrent users | Avg. input data size of jobs on Trifacta Server (GB)
Standard_A4                | 8     | 14       | 30                               | 1
Standard_A6                | 4     | 28       | 30                               | 2
Standard_A7                | 8     | 56       | 60                               | 5
Standard_A10               | 8     | 56       | 60                               | 5
Standard_A11               | 16    | 112      | 90                               | 11
Standard_D4_v2             | 8     | 28       | 30                               | 2
Standard_D5_v2             | 16    | 56       | 90                               | 5
Standard_D12_v2            | 4     | 28       | 30                               | 2
Standard_D13_v2            | 8     | 56       | 60                               | 5
Standard_D14_v2            | 16    | 112      | 90                               | 11
Standard_D15_v2            | 20    | 140      | 120                              | 14

Pre-requisites

Before you install the platform, please verify that the following steps have been completed.

  1. License: When you install the software, the included license is valid for 24 hours. Before it expires, you must acquire a license key file from Trifacta. For more information, please contact Sales.

  2. Supported Cluster Types:
    1. Hadoop
    2. HBase
    3. Spark
  3. Data Lake Store only: If you are integrating with Azure Data Lake Store, please review and complete the following section.

Register application with read/write access to Data Lake Store

You must create an Azure Active Directory registered application with appropriate permissions for the Trifacta platform, with read/write access to the Azure Key Vault resources and Data Lake Store resource.

  1. To create a new service principal, see https://docs.microsoft.com/en-us/azure/azure-resource-manager/resource-group-create-service-principal-portal#create-an-azure-active-directory-application.
  2. After you have registered, please acquire the following property values prior to install.
    1. For an existing service principal, see https://docs.microsoft.com/en-us/azure/azure-resource-manager/resource-group-create-service-principal-portal#get-application-id-and-authentication-key to obtain the property values.

These values are applied during the install process.

Azure Property   | Location                                                                 | Use
Application ID   | Acquire this value from the Registered app blade of the Azure Portal.   | Applied to Trifacta platform configuration: azure.applicationId.
Service User Key | Create a key for the Registered app in the Azure Portal.                 | Applied to Trifacta platform configuration: azure.secret.
Directory ID     | Copy the Directory ID from the Properties blade of Azure Active Directory. | Applied to Trifacta platform configuration: azure.directoryId.

Install Process

Methods

You can install from the Azure Marketplace using one of the following methods:

  1. Create cluster: Create a brand-new HDI cluster and add Trifacta Wrangler Enterprise as an application.

    Tip: This method is easiest and fastest to deploy.

  2. Add application: Use an existing HDI cluster and add Trifacta Wrangler Enterprise as an application.

    Tip: This method does not support choosing a non-default size for the Trifacta node. If you need more flexibility, please choose the following option.

  3. Create a custom ARM template: Use an existing HDI cluster and configure a custom application via Azure Resource Manager (ARM) template.

    Tip: Use the third method only if your environment requires additional configuration flexibility or automated deployment via ARM template.

Depending on your selection, please follow the steps listed in one of the following sections.

NOTE: These steps include required settings or recommendations only for configuring a cluster and the application for use with Trifacta Wrangler Enterprise. Any other settings should be specified for your enterprise requirements.

Install Method - New cluster

Please use the following steps if you are creating a new HDI cluster and adding the Trifacta application to it.

Steps:

  1. From the Azure Marketplace listing, click Get it Now.
  2. In Microsoft Azure Portal, click the New blade.
  3. Select Trifacta Wrangler Enterprise .
  4. Click Create. Then, click the Quick Create tab.
    1. Please configure any settings that are not listed below according to your enterprise requirements.
    2. For more information on the Quick Create settings, see https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-hadoop-provision-linux-clusters.
  5. Basics tab:
    1. Cluster type:
      1. Hadoop: One of the following:
        1. Hadoop
        2. HBase
        3. Spark
      2. Version:
        1. HDI 3.6
      3. Click Select.

    2. Click Next.
  6. Storage tab:
    1. Select your primary storage:

      NOTE: Trifacta Wrangler Enterprise does not support Additional storage options. Additional configuration is required after install is complete for the supported storage options.

       

      1. Azure Storage
      2. Azure Data Lake Store
    2. Select a SQL database for Hive: If you plan to use Hive, you can choose to specify a database. If not, the platform creates one for you.
    3. Click Next.
  7. Trifacta Wrangler Enterprise tab:
    1. Please review and accept the terms for using the product.
    2. Click Create. Click Ok.
    3. Click Next.
  8. Custom tab:

    1. By default, the Trifacta node is defined as: Standard_D13_v2.
    2. If you need to change the edge node size, please click the Custom settings tab and make your selections. The following virtual machine types are supported:

      Standard_A4
      Standard_A6
      Standard_A7
      Standard_A10
      Standard_A11
      Standard_D4_v2
      Standard_D5_v2
      Standard_D12_v2
      Standard_D13_v2
      Standard_D14_v2
      Standard_D15_v2
    3. Click Next.
  9. Advanced Settings tab:

    1. No changes are needed here.
  10. Summary tab:

    1. Review the specification of the cluster you are creating.
    2. Make modifications as needed.
    3. To create the cluster, click Create.
  11. After the cluster has been started and deployed:
    1. Log in to the application.
    2. Change the admin password.
    3. Perform required additional configuration.
    4. Instructions are provided below.

Install Method - Add application to a cluster

Please use the following steps if you are adding the Trifacta application to a pre-existing HDI cluster.

NOTE: The Trifacta node is set to Standard_D13_v2 by default. The size cannot be modified.

Steps:

  1. In the Microsoft Azure Portal, select the HDInsight cluster to which you are adding the application.
  2. In the Portal, select the Applications blade. Click + Add.
  3. From the list of available applications, select Trifacta Wrangler Enterprise .
  4. Please accept the legal terms.
  5. Click Next.
  6. The application is created for you.
  7. After the cluster has been started and deployed:
    1. Log in to the application.
    2. Change the admin password.
    3. Perform required additional configuration.
    4. Instructions are provided below.

Install Method - Build custom ARM template

Please use the following steps if you are creating a custom application template for later deployment. This method provides more flexible configuration options and can be used for deployments in the future.

NOTE: You must have a pre-existing HDI cluster for which to create the application template.

NOTE: Before you begin, you should review the End-User License Agreement. See End-User License Agreement.

Steps:

  1. Start here:
    1. https://github.com/trifacta/azure-deploy/tree/release/5.0
    2. Click Deploy from Azure.
  2. From the Microsoft Azure Portal, select the custom deployment link.
  3. Resource Group: Create or select one.
  4. Cluster Name: Select an existing cluster name.
  5. Edge Node Size: Select the instance type. For more information, see the Sizing Guide above.
  6. Trifacta version: For the version, select the latest listed version of Trifacta Wrangler Enterprise.
  7. Application Name: If desired, modify the application name as needed. This name must be unique per cluster.
  8. Subdomain Application URI Suffix: If desired, modify the three-character alphanumeric string used in the DNS name of the application. This suffix must be unique per cluster.
  9. Please specify values for the following:
    1. Application ID
    2. Directory ID
    3. Secret
  10. Gallery Package Identifier: please leave the default value.
  11. Please accept the Microsoft terms of use.
  12. To create the template, click Purchase.
  13. The custom template can be used to create the Trifacta Wrangler Enterprise application. For more information, please see the Azure documentation.
  14. After the application has been started and deployed:
    1. Log in to the application.
    2. Change the admin password.
    3. Perform required additional configuration.
    4. Instructions are provided below.

Log in to Trifacta Wrangler Enterprise

Steps:

  1. In the Azure Portal, select the HDI cluster.
  2. Select the Applications blade.
  3. Select the Trifacta Wrangler Enterprise application.
  4. Click the Portal link.
  5. You may be required to enter the cluster username and password.

    NOTE: You can create a local user of the cluster to avoid enabling application users to use the administrative user's cluster credentials. To create such a user:

    1. Navigate to the cluster's Ambari console.
    2. In the user menu, select the Manage Ambari page.
    3. Select Users.
    4. Select Create Local User.
    5. Enter a unique (lowercase) user name.
    6. Enter a password and confirm that password.
    7. Click Save.

  6. You are connected to the Trifacta application.
  7. In the login screen, enter the default username and password:
    1. Username: admin@trifacta.local
    2. Password: admin
    3. Click Login.

NOTE: If this is your first login to the application, please be sure to reset the admin password. Steps are provided below.

Change admin password

Steps:

  1. If you haven't done so already, log in to the Trifacta application as an administrator.
  2. In the menu bar, select Settings menu > Administrator.
  3. In the User Profile, enter and re-enter a new password.
  4. Click Save.
  5. Log out and log in again using the new password.

Post-install configuration

Base parameter settings

If you are integrating with HDI and did not install via a custom ARM template, the following settings must be specified within the Trifacta application.

NOTE: These settings are specified as part of the cluster definition. If you have not done so already, you should acquire the corresponding values for the Trifacta application in the Azure Portal.

Steps:

  1. Log in to the Trifacta application as an administrator.
  2. In the menu bar, select Settings menu > Admin Settings.
  3. In the Admin Settings page, specify the values for the following parameters:

    "azure.secret"
    "azure.applicationId"
    "azure.directoryId"
  4. Save your changes and restart the platform.

  5. When the platform is restarted, continue the following configuration.
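If you prefer to edit trifacta-conf.json directly instead of using the Admin Settings page, the same parameters appear as top-level properties. A minimal sketch, with placeholder values you must replace with the values acquired from the Azure Portal:

```json
"azure.applicationId": "<application ID from the Registered app blade>",
"azure.directoryId": "<directory ID from Azure Active Directory properties>",
"azure.secret": "<service user key created for the Registered app>"
```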

Apply license file

When the application is first created, the license is valid for 24 hours. Before the license expires, you must apply the license key file to the Trifacta node. Please complete the following general steps.

Steps:

  1. Locate the license key file that was provided to you by Trifacta. Please store this file in a safe location that is not on the Trifacta node.
  2. In the Azure Portal, select the HDI cluster.
  3. Select the Applications blade.
  4. Select the Trifacta Wrangler Enterprise application.
  5. From the application properties, acquire the SSH endpoint.
  6. Connect via SSH to the Trifacta node. For more information, see https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-hadoop-linux-use-ssh-unix.
  7. Drop the license key file in the following directory on the node:

    /opt/trifacta/license
  8. Restart the platform.

Review refresh token encryption key

By default, the Trifacta platform includes a static refresh token encryption key for the secure token service. The same default key is used for all instances of the platform.

NOTE: A valid base64 value must be configured for the platform, or the platform fails to start.

If preferred, you can generate your own key value, which is unique to your instance of the platform.

Steps:

  1. Log in at the command line to the Trifacta node.
  2. To generate a new refresh token encryption key, please execute the following command:

    cd /opt/trifacta/services/secure-token-service/ && \
    java -cp 'server/build/install/secure-token-service/secure-token-service.jar:server/build/install/secure-token-service/lib/*' \
    com.trifacta.services.secure_token_service.tools.RefreshTokenEncryptionKeyGeneratorTool
  3. The refresh token encryption key is printed to the screen. Please copy this value to the clipboard.

  4. You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.
  5. Paste the value in the following property:

    "com.trifacta.services.secure_token_service.refresh_token_encryption_key": "<generated_refresh_token_encryption_key>",
  6. Save your changes.

Configure Spark Execution (Advanced)

The cluster instance of Spark is used to execute larger jobs across the nodes of the cluster. Spark processes run multiple executors per job. Each executor must run within a YARN container. Therefore, resource requests must fit within YARN’s container limits.

As with other YARN containers, multiple executors can run on a single node. More executors provide additional computational power and decreased runtime.

Spark’s dynamic allocation adjusts the number of executors to launch based on the following:

  • job size

  • job complexity

  • available resources

The per-executor resource request sizes can be specified by setting the following properties in the spark.props section.

Parameter             | Description
spark.executor.memory | Amount of memory to use per executor process (in a specified unit)
spark.executor.cores  | Number of cores to use on each executor. Limit to 5 cores per executor for best performance.

A single special process (the application driver) also runs in a container. Its resources are specified in the spark.props section:

Parameter           | Description
spark.driver.memory | Amount of memory to use for the driver process (in a specified unit)
spark.driver.cores  | Number of cores to use for the driver process
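Together, these per-executor and driver settings can be sketched as a spark.props fragment in trifacta-conf.json. The values below mirror the "Small" column of the example configurations later in this section and are illustrative only; size them for your own cluster:

```json
"spark.props": {
  "spark.executor.memory": "6GB",
  "spark.executor.cores": "2",
  "spark.driver.memory": "4GB",
  "spark.driver.cores": "1"
}
```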

Optimizing "Small" Joins (Advanced)

Broadcast, or map-side, joins materialize one side of the join and send it to all executors to be stored in memory. This technique can significantly accelerate joins by skipping the sort and shuffle phases during a "reduce" operation. However, there is also a cost in communicating the table to all executors. Therefore, only "small" tables should be considered for broadcast join. The definition of "small" is set by the spark.sql.autoBroadcastJoinThreshold parameter which can be added to the spark.props section of trifacta-conf.json. By default, Spark sets this to 10485760 (10MB).

NOTE: You should set this parameter between 20MB and 100MB. It should not exceed 200MB.
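The threshold value is specified in bytes. As a quick sketch, converting megabyte figures to the byte values that spark.sql.autoBroadcastJoinThreshold expects:

```shell
# Convert MB figures to the byte values used by
# spark.sql.autoBroadcastJoinThreshold (1 MB = 1048576 bytes).
for mb in 20 50 100; do
  echo "${mb}MB = $(( mb * 1024 * 1024 )) bytes"
done
```

For example, applying a 50MB threshold means setting the parameter to 52428800 in the spark.props section.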

Example Spark configuration parameters for given YARN configurations:

Property settings marked with an asterisk (*) are provided by the cluster.

Property                                   | Small    | Medium   | Large    | Extra large
YARN NodeManager node (v)CPUs              | 4        | 8        | 16       | 40
YARN NodeManager node memory (GB)          | 16       | 32       | 64       | 160
yarn.nodemanager.resource.memory-mb *      | 12288    | 24576    | 57344    | 147456
yarn.nodemanager.resource.cpu-vcores *     | 3        | 6        | 13       | 32
yarn.scheduler.maximum-allocation-mb *     | 12288    | 24576    | 57344    | 147456
yarn.scheduler.maximum-allocation-vcores * | 3        | 6        | 13       | 32
spark.executor.memory                      | 6GB      | 6GB      | 16GB     | 20GB
spark.executor.cores                       | 2        | 2        | 4        | 5
spark.driver.memory                        | 4GB      | 4GB      | 4GB      | 4GB
spark.driver.cores                         | 1        | 1        | 1        | 1
spark.sql.autoBroadcastJoinThreshold       | 20971520 | 20971520 | 52428800 | 104857600

Additional Configuration

Integration with a pre-existing cluster

If you did not create an HDI cluster as part of this install process, you must perform additional configuration to integrate the Trifacta platform with your cluster. Please see the links under Documentation below.

Exploring and Wrangling Data in Azure

NOTE: If you are integrating the Trifacta platform with an existing cluster, these steps do not work. Additional configuration is required. See the Documentation section below.

Basic steps:

  1. When data is imported to the Trifacta platform, a reference to it is stored by the platform as an imported dataset. The source data is not modified.
  2. In the application, you modify the recipe associated with a dataset to transform the imported data.
  3. When the recipe is ready, you define and run a job, which executes the recipe steps across the entire dataset. 
  4. The source of the dataset is untouched, and the results are written to the specified location in the specified format.

Steps:

NOTE: Any user with a valid user account can import data from a local file.

  1. Log in.
  2. In the menu bar, click Datasets. Click Import Data.
  3. To add a dataset:

    1. Select the connection where your source is located:
      1. WASB (Blob Storage)
      2. ADL (Azure Data Lake Store)
      3. Hive
    2. Navigate to the file or files for your source.
    3. To add the dataset, click the Plus icon next to its name.
  4. To begin working with a dataset, you must first add it into a flow, which is a container for datasets. Click the Add Dataset to a Flow checkbox and enter the name for a new flow.

    Tip: If you have selected a single file, you can begin wrangling it immediately. Click Import and Wrangle. The flow is created for you, and your dataset is added to it.

  5. Click Import & Add to Flow.
  6. After the flow has been created, the flow is displayed. Select the dataset, which is on the left side of the screen. 
  7. Click Add New Recipe. Click Edit Recipe.
  8. The dataset is opened in the Transformer page, where you can begin building your recipe steps.

Upgrade

Please complete the instructions in this section if you are upgrading from a previous version of Trifacta® Wrangler Enterprise.

NOTE: These instructions apply only to Trifacta® Wrangler Enterprise available through the Azure Marketplace.

 

Overview of the upgrade process

Upgrading your instance of Trifacta Wrangler Enterprise for Azure follows these basic steps:

  1. Back up the databases and configuration for your existing platform instance.
  2. Download the latest version of Trifacta Wrangler Enterprise.
  3. Uninstall the existing version of Trifacta Wrangler Enterprise. Install the version you downloaded in the previous step.
  4. Start up Trifacta Wrangler Enterprise. This step automatically performs the DB migrations from the older version to the latest version and upgrades the configurations.
  5. Within the application, perform required configuration updates in the upgraded instance.
  6. Apply the custom migrations to migrate paths in the DBs.

Instructions for these steps are provided below.

Acquire resources

Before you begin, please acquire the following publicly available resources.

Backup script:

https://raw.githubusercontent.com/trifacta/trifacta-utils/release/5.0/azure/trifacta-backup-config-and-db.sh

DB migration script:

https://github.com/trifacta/azure-deploy/blob/release/5.0/bin/migrations/path-migrations-4_2-5_0.sql

Restore script:

https://raw.githubusercontent.com/trifacta/trifacta-utils/release/5.0/azure/trifacta-restore-from-backup.sh

Latest Release DEB file:

NOTE: Below, some values are too long for a single line. Single lines that overflow to additional lines are marked with a \. The backslash should not be included if the line is used as input.

https://trifactamarketplace.blob.core.windows.net/artifacts/trifacta-server-5.0.0-110~xenial_amd64.deb \
?sr=c&si=trifacta-deploy-public-read&sig=ksMPhDkLpJYPEXnRNp4vAdo6QQ9ulpP%2BM4Gsi/nea%2Bg%3D&sv=2016-05-31
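Because the value above wraps across two lines, a sketch of reassembling it in a shell session; the backslash marks the wrap only and is removed, and the URL must be quoted so the shell does not interpret its & characters:

```shell
# Rejoin the wrapped download URL into a single string.
base='https://trifactamarketplace.blob.core.windows.net/artifacts/trifacta-server-5.0.0-110~xenial_amd64.deb'
query='?sr=c&si=trifacta-deploy-public-read&sig=ksMPhDkLpJYPEXnRNp4vAdo6QQ9ulpP%2BM4Gsi/nea%2Bg%3D&sv=2016-05-31'
url="${base}${query}"
# Always single-quote "$url" when passing it to wget or curl.
echo "$url"
```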

Backup data from platform instance

Before you begin, you should back up your current instance.

  1. SSH to your current Marketplace instance.
  2. Stop the Trifacta platform on your current Marketplace instance:

    sudo service trifacta stop
  3. Replace the installed backup script with the current version.
    1. If you have not done so already, download the 5.0.0 backup script from the following location:

      https://raw.githubusercontent.com/trifacta/trifacta-utils/release/5.0/azure/trifacta-backup-config-and-db.sh
    2. Example command to download the script:

      curl --output trifacta-backup-config-and-db.sh https://raw.githubusercontent.com/trifacta/trifacta-utils/release/5.0/azure/trifacta-backup-config-and-db.sh
    3. Move the downloaded script to the following location, overwriting the existing version:

      /opt/trifacta/bin/setup-utils/trifacta-backup-config-and-db.sh
    4. Verify that this script is executable:

      sudo chmod 775 /opt/trifacta/bin/setup-utils/trifacta-backup-config-and-db.sh
  4. Run the backup script:

    sudo /opt/trifacta/bin/setup-utils/trifacta-backup-config-and-db.sh

     

    1. When the script is complete, the output identifies the location of the backup. Example:

      /opt/trifacta-backups/trifacta-backup-4.2.1+126.20171217124021.a8ed455-20180514213601.tgz
  5. Store the backup in a safe location. 

Download latest version of Trifacta Wrangler Enterprise

Download the latest version of Trifacta Wrangler Enterprise using the command below. The URL must be quoted; otherwise, the shell interprets the & characters in the query string:

wget -O trifacta-server-5.0.0-110~xenial_amd64.deb 'https://trifactamarketplace.blob.core.windows.net/artifacts/trifacta-server-5.0.0-110~xenial_amd64.deb?sr=c&si=trifacta-deploy-public-read&sig=ksMPhDkLpJYPEXnRNp4vAdo6QQ9ulpP%2BM4Gsi/nea%2Bg%3D&sv=2016-05-31'

Uninstall older version of Trifacta instance and install latest version

  1. To uninstall the older version of Trifacta® Wrangler Enterprise, execute the following command as the root user on the Trifacta node:

    apt-get remove trifacta
  2. Install the latest version that was downloaded. Execute the following command as root:

    dpkg -i <location of the 5.0 deb file>

Start up latest version of Trifacta instance

To migrate the DBs and upgrade configs from the older version of Trifacta® Wrangler Enterprise to the latest version, the platform must be started. To start the Trifacta platform on the instance:

sudo service trifacta start

Post-upgrade configuration

Create Key Vault and use true WASB protocol

NOTE: This section applies only if you are upgrading from Release 4.2.x to Release 5.0 or later and were using WASB as your base storage layer.

Prior to Release 5.0, the Trifacta platform supported the use of WASB as a base storage layer by connecting through WebWASB using the HDFS protocol.

In Release 5.0 and later, the above workaround has been replaced by true WASB support. To migrate your installation, please complete the following sections.

Create Key Vault: To enable access to the WASB base storage layer, you must create a Key Vault to manage authentication via secure token. For more information, see Configure for Key Vault in Configure for Azure.

Set base storage layer to WASB: After the Key Vault has been created, you must set the base storage layer to wasb or wasbs. For more information, see Configure base storage layer in Configure for Azure.
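As an illustration only, the base storage layer setting in trifacta-conf.json takes a form like the following; the property name webapp.storageProtocol is an assumption here and should be confirmed against Configure for Azure:

```json
"webapp.storageProtocol": "wasbs"
```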

Configure secure token service

NOTE: This section applies only if you are upgrading from Release 4.2.x to Release 5.0 or later and were using WASB or plan to use SSO to access ADLS.

To manage access, you must enable and configure the secure token service. For more information, see Configure Secure Token Service in Configure for Azure.

Migrate the paths from WebWASB to WASB

After you have restarted the Trifacta platform, the databases retain the old version of the WebWASB paths, which must be migrated to the WASB paths. Please complete the following steps to migrate your existing databases to use the new paths.

Steps:

  1. If you have not done so already, acquire the DB migration script:

    https://github.com/trifacta/azure-deploy/blob/release/5.0/bin/migrations/path-migrations-4_2-5_0.sql
  2. Store this script in an executable location on the Trifacta node.

  3. Acquire the values for the following parameters from the platform configuration. You can review these values through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods:

    "webapp.db.port"
    "azure.wasb.defaultStore.blobHost"
    "azure.wasb.defaultStore.container"
  4. Update the blobHost and container parameters of the migration script with the values you acquired.

    SET LOCAL myvars.blobHost = '<azure.wasb.defaultStore.blobHost value>';
    SET LOCAL myvars.container = '<azure.wasb.defaultStore.container value>';
  5. After updating the parameters, run the migration script:

    psql -U trifacta -d trifacta -p <webapp.db.port> -a -f <location for the sql migration script>

Verify 

The upgrade is complete. To verify:

Steps:

  1. Restart the platform:

    sudo service trifacta start
  2. Run a simple job with profiling. 
  3. Verify that the job has successfully completed. 
    1. In the Jobs page, locate the job results. See Jobs Page.
    2. Click View Details next to the job to review the profile.

 

Documentation

You can access complete product documentation online and in PDF format. From within the product, select Help menu > Product Docs.

After you have accessed the documentation, the following topics are relevant to Azure deployments. Please review them in order.

Topic                                    | Description
Supported Deployment Scenarios for Azure | Matrix of supported Azure components.
Configure for Azure                      | Top-level configuration topic on integrating the platform with Azure. Tip: You should review this page.
Configure for HDInsight                  | Review this section if you are integrating the Trifacta platform with a pre-existing HDI cluster.
Enable ADLS Access                       | Configuration to enable access to ADLS.
Enable WASB Access                       | Configuration to enable access to WASB.
Configure SSO for Azure AD               | How to integrate the Trifacta platform with Azure Active Directory for Single Sign-On.