To install Trifacta® Wrangler Enterprise inside your enterprise infrastructure, please review and complete the following sections in the order listed below.

Scenario Description

  • Installation of Trifacta Wrangler Enterprise on a server on-premises
  • Installation of Trifacta databases on a server on-premises
  • Integration with a supported Hadoop cluster on-premises
  • Base storage layer of HDFS

Preparation

  1. Review Planning Guide: Please review and verify Install Preparation and sub-topics.
  2. Acquire Assets: Acquire the installation package for your operating system and your license key. For more information, contact Trifacta Support.
    1. If you are completing the installation without Internet access, you must also acquire the offline versions of the system dependencies. See Install Dependencies without Internet Access.
  3. Deploy Hadoop cluster: In this scenario, the Trifacta platform does not create a Hadoop cluster. See below.

    NOTE: Installation and maintenance of a working Hadoop cluster is the responsibility of the Trifacta Wrangler Enterprise customer. Guidance is provided below on the requirements for integrating the platform with the cluster.

  4. Deploy Trifacta node: Trifacta Wrangler Enterprise must be installed on an edge node of the cluster. Details are below.

Limitations: For more information on limitations of this scenario, see Product Limitations in the Install Preparation area.

Deploy the Cluster

In your enterprise infrastructure, you must deploy a cluster using a supported version of Hadoop to manage the expected data volumes of your Trifacta jobs. For more information on suggested sizing, see Sizing Guidelines in the Install Preparation area.

When you configure the platform to integrate with the cluster, you must acquire information about the cluster configuration. For more information on the set of information to collect, see Pre-Install Checklist in the Install Preparation area.

NOTE: By default, smaller jobs are executed on the Trifacta Photon running environment. Larger jobs are executed using Spark on the integrated Hadoop cluster. Spark must be installed on the cluster. For more information, see System Requirements in the Install Preparation area.

The Trifacta platform supports integration with the following cluster types. For more information on the supported versions, please see the listed sections below.

Prepare the cluster

If you are integrating with a Hadoop cluster, please verify or complete the following before installing the software:

  1. On the Hadoop cluster: 
    1. Create a user [hadoop.user (default=trifacta)] and a group for it [hadoop.group (default=trifactausers)].
    2. Create the following directories: 
      1. /trifacta
      2. /user/trifacta
    3. Change the ownership of /trifacta and /user/trifacta to trifacta:trifacta or the corresponding values for the Hadoop user in your environment.

      NOTE: You must verify that the [hadoop.user] user has complete ownership of these directories, with recursive read, write, and execute access.

  2. Verify that WebHDFS is configured and running on the cluster.

     

  3. Software installation is completed on a dedicated node in the cluster. The user installing the Trifacta software must have sudo access.


  4. If you are installing on a server with an older instance of Postgres, you should remove the older version or change the default ports. 

For more information, see Prepare Hadoop for Integration with the Platform.

Additional users may be required. For more information, see Required Users and Groups in the Install Preparation area.
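As a quick sanity check for the WebHDFS requirement above, the namenode exposes a simple REST endpoint. The following Python sketch only builds a status-check URL for a directory such as /trifacta; the host name and port in the example comment are illustrative assumptions, so substitute the values for your cluster.

```python
from urllib.parse import urlencode

def webhdfs_url(host, port, path, op="GETFILESTATUS", user="trifacta"):
    """Build a WebHDFS REST URL, e.g. to confirm that /trifacta exists
    and is reachable. `user` defaults to the default hadoop.user value."""
    query = urlencode({"op": op, "user.name": user})
    return "http://{}:{}/webhdfs/v1{}?{}".format(host, port, path, query)

# Example (hypothetical namenode host; 50070 is a common WebHDFS port):
# webhdfs_url("namenode.example.com", 50070, "/trifacta")
```

You can paste the resulting URL into a browser or curl command to confirm that WebHDFS responds.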

Deploy the Trifacta node

An edge node of the cluster is required to host the Trifacta platform software. For more information on the requirements of this node, see System Requirements.

Install Workflow

Please complete these steps in the order listed:

  1. Install software: Install the Trifacta platform software on the cluster edge node. See Install Software.

  2. Install databases: The platform requires several databases for storage.

    NOTE: The default configuration assumes that you are installing the databases on a PostgreSQL server on the same edge node as the software using the default ports. If you are changing the default configuration, additional configuration is required as part of this installation process.

    For more information, see Install Databases.

  3. Start the platform: For more information, see Start and Stop the Platform.
  4. Log in to the application: After the software and databases are installed, you can log in to the application to complete configuration:
    1. See Login.
    2. As soon as you log in, you should change the password on the admin account. In the left menu bar, select Settings > Admin Settings. Scroll down to Manage Users. For more information, see Change Admin Password.

Tip: At this point, you can access the online documentation through the application. In the left menu bar, select Help menu > Product Docs. All of the following content, plus updates, is available online. See Documentation below.

Configure for Hadoop

After you have performed the base installation of the Trifacta® platform, please complete the following steps if you are integrating with a Hadoop cluster.

Apply cluster configuration files - non-edge node

NOTE: Installation on a non-edge node is not supported. Legacy customers can continue to use a non-edge node, but this deployment is not recommended.

If the Trifacta platform is being installed on a non-edge node, you must copy over the Hadoop Client Configuration files from the cluster. 

NOTE: When these files change, you must update the local copies. For this reason, it is best to install on an edge node.

  1. Download the Hadoop Client Configuration files from the Hadoop cluster. The required files are the following:
    1. core-site.xml
    2. hdfs-site.xml
    3. mapred-site.xml
    4. yarn-site.xml
    5. hive-site.xml (if you are using Hive)
  2. Copy these configuration files to the Trifacta deployment. By default, the cluster stores these files in /etc/hadoop/conf:

    sudo cp <location>/*.xml /opt/trifacta/conf/hadoop-site/
    sudo chown trifacta:trifacta /opt/trifacta/conf/hadoop-site/*.xml 
    

For more information, see Configure for Hadoop.
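One way to confirm that the copy succeeded is to check that every required file landed in the target directory. This is a hypothetical helper, not part of the Trifacta software:

```python
import os

SITE_FILES = ["core-site.xml", "hdfs-site.xml", "mapred-site.xml", "yarn-site.xml"]

def missing_site_files(conf_dir, using_hive=False):
    """Return the required Hadoop client configuration files that are
    absent from conf_dir (e.g. /opt/trifacta/conf/hadoop-site)."""
    required = SITE_FILES + (["hive-site.xml"] if using_hive else [])
    return [name for name in required
            if not os.path.isfile(os.path.join(conf_dir, name))]
```

An empty return value means all required files are present; remember to re-run the copy whenever the cluster configuration changes.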

Apply cluster configuration files - edge node

If the Trifacta platform is being installed on an edge node of the cluster, you can create a symlink from a local directory to the source cluster files so that they are automatically updated as needed.

  1. Navigate to the following directory on the Trifacta node:

    cd /opt/trifacta/conf/hadoop-site
  2. Create a symlink for each of the Hadoop Client Configuration files referenced in the previous steps. Example:

    ln -s /etc/hadoop/conf/core-site.xml core-site.xml
  3. Repeat the above steps for each of the Hadoop Client Configuration files.

For more information, see Configure for Hadoop.
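The symlink loop above can be scripted. This sketch assumes the default directories shown in the steps (/etc/hadoop/conf and /opt/trifacta/conf/hadoop-site) and skips files that are already linked or missing on the cluster side:

```python
import os

SITE_FILES = ["core-site.xml", "hdfs-site.xml", "mapred-site.xml",
              "yarn-site.xml", "hive-site.xml"]

def link_site_files(cluster_conf, trifacta_conf, files=SITE_FILES):
    """Symlink each Hadoop Client Configuration file from the cluster
    directory into the Trifacta hadoop-site directory. Returns the
    names of the links that were created."""
    created = []
    for name in files:
        src = os.path.join(cluster_conf, name)
        dst = os.path.join(trifacta_conf, name)
        # lexists() also detects existing (possibly broken) symlinks
        if os.path.exists(src) and not os.path.lexists(dst):
            os.symlink(src, dst)
            created.append(name)
    return created
```

Because symlinks point back at the cluster files, local copies never go stale, which is the advantage of the edge-node deployment noted earlier.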

Modify Trifacta configuration

  1. To apply configuration changes, log in as an administrator to the Trifacta node. Then, edit trifacta-conf.json. Some of these settings may not be available through the Admin Settings Page. For more information, see Platform Configuration Methods.

  2. HDFS: Change the host and port information for HDFS as needed. Please apply the port numbers for your distribution:

    "hdfs.namenode.host": "<namenode>",
    "hdfs.namenode.port": <hdfs_port_num>,
    "hdfs.yarn.resourcemanager": {
      "host": "<resourcemanager_host>",
      "port": <resourcemanager_port>,
      "schedulerPort": 8030,
      "adminPort": 8033,
      "webappPort": 8088
    }

    For more information, see Configure for Hadoop.
     

  3. Save your changes and restart the platform.
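If you prefer to script the edit, the change can be applied as a small JSON patch. The helper below is purely illustrative: it assumes the flattened dotted-key form shown above, and you should back up trifacta-conf.json before writing to it.

```python
import json

def set_hdfs_endpoints(conf_path, namenode_host, namenode_port):
    """Patch the HDFS namenode settings in trifacta-conf.json.
    Hypothetical helper; assumes flattened dotted keys as shown above."""
    with open(conf_path) as f:
        conf = json.load(f)
    conf["hdfs.namenode.host"] = namenode_host
    conf["hdfs.namenode.port"] = namenode_port
    with open(conf_path, "w") as f:
        json.dump(conf, f, indent=2)
    return conf
```

After patching, restart the platform so the new endpoints take effect.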

Configure Spark Job Service

The Spark Job Service must be enabled for both execution and profiling jobs to work in Spark.

Below is a sample configuration, with a description of each property. You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.

"spark-job-service": {
  "systemProperties": {
    "java.net.preferIPv4Stack": "true",
    "SPARK_YARN_MODE": "true"
  },
  "sparkImpersonationOn": false,
  "optimizeLocalization": true,
  "mainClass": "com.trifacta.jobserver.SparkJobServer",
  "jvmOptions": [
    "-Xmx128m"
  ],
  "hiveDependenciesLocation": "%(topOfTree)s/hadoop-deps/cdh-5.4/build/libs",
  "env": {
    "SPARK_JOB_SERVICE_PORT": "4007",
    "SPARK_DIST_CLASSPATH": "",
    "MAPR_TICKETFILE_LOCATION": "<MAPR_TICKETFILE_LOCATION>",
    "MAPR_IMPERSONATION_ENABLED": "0",
    "HADOOP_USER_NAME": "trifacta",
    "HADOOP_CONF_DIR": "%(topOfTree)s/conf/hadoop-site/"
  },
  "enabled": true,
  "enableHiveSupport": true,
  "enableHistoryServer": false,
  "classpath": "%(topOfTree)s/services/spark-job-server/server/build/libs/spark-job-server-bundle.jar:%(topOfTree)s/conf/hadoop-site/:%(topOfTree)s/services/spark-job-server/build/bundle/*:%(topOfTree)s/%(hadoopBundleJar)s",
  "autoRestart": false
}

The following properties can be modified based on your needs:

NOTE: Unless explicitly told to do so, do not modify any of the above properties that are not listed below.

sparkImpersonationOn
  Set this value to true if secure impersonation is enabled on your cluster. See Configure for Secure Impersonation.

jvmOptions
  This array of values can be used to pass parameters to the JVM that manages the Spark Job Service.

hiveDependenciesLocation
  If Spark is integrated with a Hive instance, set this value to the path where the Hive dependencies are installed on the Trifacta node. For more information, see Configure for Hive.

env.SPARK_JOB_SERVICE_PORT
  Set this value to the listening port number on the cluster for Spark. Default value is 4007. For more information, see System Ports.

env.HADOOP_USER_NAME
  The username of the Hadoop principal used by the platform. By default, this value is trifacta.

env.HADOOP_CONF_DIR
  The directory on the Trifacta node where the Hadoop cluster configuration files are stored. Do not modify unless necessary.

enabled
  Set this value to true to enable the Spark Job Service.

enableHiveSupport
  See "Configure service for Hive" below.

After making any changes, save the file and restart the platform. See Start and Stop the Platform.
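Before restarting, a quick consistency check of the block can catch typos. The following is a hypothetical helper that inspects a parsed spark-job-service block; the set of required env keys reflects the properties described above:

```python
REQUIRED_ENV = ("SPARK_JOB_SERVICE_PORT", "HADOOP_USER_NAME", "HADOOP_CONF_DIR")

def check_spark_job_service(block):
    """Return a list of problems found in a parsed spark-job-service block."""
    problems = []
    if not block.get("enabled"):
        problems.append("spark-job-service.enabled should be true")
    env = block.get("env", {})
    for key in REQUIRED_ENV:
        if not env.get(key):
            problems.append("env.%s is not set" % key)
    return problems
```

An empty list means the block passes these basic checks; it does not verify that the port or paths are actually valid on your cluster.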

Configure service for Hive

Depending on the environment, please apply the following configuration changes to manage Spark interactions with Hive:

Environment                          spark.enableHiveSupport
Hive is not present                  false
Hive is present but not enabled      false
Hive is present and enabled          true

If Hive is present on the cluster, whether enabled or disabled, the hive-site.xml file must be copied to the correct directory:

cp /etc/hive/conf/hive-site.xml /opt/trifacta/conf/hadoop-site/hive-site.xml

At this point, the platform only expects that a hive-site.xml file has been installed on the Trifacta node. A valid connection is not required. For more information, see Configure for Hive.
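The decision table above reduces to a single condition, sketched here:

```python
def spark_enable_hive_support(hive_present, hive_enabled):
    """Per the table above: enable Hive support for Spark only when
    Hive is both present on the cluster and enabled."""
    return bool(hive_present and hive_enabled)
```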

Configure Spark

After the Spark Job Service has been enabled, please complete the following sections to configure it for the Trifacta platform.

Yarn cluster mode

All jobs submitted to the Spark Job Service are executed in YARN cluster mode. No other cluster mode is supported for the Spark Job Service.

Configure access for secure impersonation

The Spark Job Service can run under secure impersonation. For more information, see Configure for Secure Impersonation.

When running under secure impersonation, the Spark Job Service requires access to the following folders. Read, write, and execute access must be provided to the Trifacta user and the impersonated user.

Trifacta Libraries folder
  Property: "hdfs.pathsConfig.libraries"
  Default value: /trifacta/libraries
  Maintains JAR files and other libraries required by Spark. No sensitive information is written to this location.

Trifacta Temp files folder
  Property: "hdfs.pathsConfig.tempFiles"
  Default value: /trifacta/tempfiles
  Holds temporary progress information files for YARN applications. Each file contains a number indicating the progress percentage. No sensitive information is written to this location.

Trifacta Dictionaries folder
  Property: "hdfs.pathsConfig.dictionaries"
  Default value: /trifacta/dictionaries
  Contains definitions of dictionaries created for the platform.

Identify Hadoop libraries on the cluster

The Spark Job Service does not require additional installation on the Trifacta node or on the Hadoop cluster. Instead, it references the spark-assembly JAR that is provided with the Trifacta distribution.

This JAR file does not include the Hadoop client libraries. You must point the Trifacta platform to the appropriate libraries.

Steps:

  1. In platform configuration, locate the spark-job-service configuration block.
  2. Set the following property:

    "spark-job-service.env.HADOOP_CONF_DIR": "<path_to_Hadoop_conf_dir_on_Hadoop_cluster>",
    spark-job-service.env.HADOOP_CONF_DIR
      Path to the Hadoop configuration directory on the Hadoop cluster.
  3. In the same block, the SPARK_DIST_CLASSPATH property must be set depending on your Hadoop distribution.
    1. For Cloudera 5.x: This property can be left blank.
    2. For Hortonworks 2.x: This property configuration is covered later in this section.

  4. Save your changes.

Locate Hive dependencies location

If the Trifacta platform is also connected to a Hive instance, please verify the location of the Hive dependencies on the Trifacta node. The following example is from Cloudera 5.10:

NOTE: This parameter value is distribution-specific. Please update based on your Hadoop distribution.

"spark-job-service.hiveDependenciesLocation":"%(topOfTree)s/hadoop-deps/cdh-5.10/build/libs",

For more information, see Configure for Spark.

Enable High Availability

NOTE: If high availability is enabled on the Hadoop cluster, it must be enabled on the Trifacta platform, even if you are not planning to rely on it. See Enable Integration with Cluster High Availability.

Configure for Trifacta platform

Set base storage layer

The platform requires that one backend datastore be configured as the base storage layer. This base storage layer is used for storing uploaded data and writing results and profiles. 

NOTE: By default, the base storage layer for Trifacta Wrangler Enterprise is set to HDFS. You can change it now, if needed. After this base storage layer is defined, it cannot be changed again.

See Set Base Storage Layer.
 

Verify Operations

NOTE: You can try to verify operations using the Trifacta Photon running environment at this time. While you can also try to run a job on the Hadoop cluster, additional configuration may be required to complete the integration. These steps are listed under Next Steps below.

 

Prepare Your Sample Dataset

To complete this test, you should locate or create a simple dataset. Your dataset should be created in the format that you wish to test.

Characteristics:

  • Two or more columns. 
  • If there are specific data types that you would like to test, please be sure to include them in the dataset.
  • A minimum of 25 rows is required for best results of type inference.
  • Ideally, your dataset is a single file or sheet. 
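If you do not have a dataset at hand, a small CSV meeting the characteristics above can be generated. The column names and values here are arbitrary examples:

```python
import csv
import io

def sample_dataset(rows=30):
    """Generate a CSV with four columns of mixed types (integer, string,
    decimal, date) and at least 25 rows, suitable for the verification test."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["id", "name", "amount", "date"])
    for i in range(rows):
        writer.writerow([i, "item-%d" % i, round(1.5 * i, 2),
                         "2020-01-%02d" % ((i % 28) + 1)])
    return buf.getvalue()
```

Save the output to a .csv file on your local desktop and upload it during the verification steps below.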


Store Your Dataset

If you are testing an integration, you should store your dataset in the datastore with which the product is integrated.

Tip: Uploading datasets is always available as a means of importing datasets.

 

  • You may need to create a connection between the platform and the datastore.
  • Read and write permissions must be enabled for the connecting user to the datastore.
  • For more information, see Connections Page.

Verification Steps

Steps:

  1. Log in to the application. See Login.
  2. In the application menu bar, click Library.
  3. Click Import Data. See Import Data Page.
    1. Select the connection where the dataset is stored. For datasets stored on your local desktop, click Upload.
    2. Select the dataset.
    3. In the right panel, click the Add Dataset to a Flow checkbox. Enter a name for the new flow.
    4. Click Import and Add to Flow.

    5. Troubleshooting: At this point, you have verified read access to your datastore from the platform. If the import fails, please check the logs, permissions, and your Trifacta® configuration.

  4. In the left menu bar, click the Flows icon. In the Flows page, open the flow you just created. See Flows Page.
  5. In the Flows page, click the dataset you just imported. Click Add new Recipe.
  6. Select the recipe. Click Edit Recipe.
  7. The initial sample of the dataset is opened in the Transformer page, where you can edit your recipe to transform the dataset.
    1. In the Transformer page, some steps are automatically added to the recipe for you, so you can run the job immediately.
    2. You can add additional steps if desired. See Transformer Page.
  8. Click Run Job.

    1. If options are presented, select the defaults.
    2. To generate results in other formats or output locations, click Add Publishing Destination. Configure the output formats and locations. 
    3. To test dataset profiling, click the Profile Results checkbox. Note that profiling runs as a separate job and may take considerably longer. 
    4. See Run Job Page.

    5. Troubleshooting: Later, you can re-run this job on a different running environment. Some formats are not available across all running environments.

  9. When the job completes, you should see a success message under the Jobs tab in the Flow View page. 
    1. Troubleshooting: Either the Transform job or the Profiling job may break. To localize the problem, try re-running a job by deselecting the broken job type or running the job on a different running environment (if available). You can also download the log files to try to identify the problem. See Job Details Page.
  10. Click View Results from the context menu for the job listing. In the Job Details page, you can see a visual profile of the generated results. See Job Details Page.
  11. In the Output Destinations tab, click a link to download the results to your local desktop. 
  12. Load these results into a local application to verify that the content looks correct.

Checkpoint: You have verified importing from the selected datastore and transforming a dataset. If your job was successfully executed, you have verified that the product is connected to the job running environment and can write results to the defined output location. Optionally, you may have tested profiling of job results. If all of the above tasks completed, the product is operational end-to-end.

Documentation

Tip: You should access online documentation through the product. Online content may receive updates that are not present in PDF content.

You can access complete product documentation online and in PDF format. From within the Trifacta application, select Help menu > Product Docs.

Next Steps

After you have accessed the documentation, the following topics are relevant to on-premises deployments. Please review them in order.

NOTE: These materials are located in the Configuration Guide.

Required Platform Configuration
  This section covers the following topics, some of which should already be completed:
    • Set Base Storage Layer - The base storage layer must be set once and never changed.
    • Create Encryption Key File - If you plan to integrate the platform with any relational sources, you must create an encryption key file and store it on the Trifacta node.
    • Running Environment Options - Depending on your scenario, you may need to perform additional configuration for your available running environment(s) for executing jobs.
    • Profiling Options - In some environments, tweaks to the settings for visual profiling may be required. You can disable visual profiling if needed.
    • Configure for Spark - If you are enabling the Spark running environment, please review and verify the configuration for integrating the platform with the Hadoop cluster instance of Spark.

Configure for Hadoop

Enable Integration with Compressed Clusters
  If the Hadoop cluster uses compression, additional configuration is required.

Enable Integration with Cluster High Availability
  If you are integrating with high availability on the Hadoop cluster, please complete these steps. In that case, HttpFS must be enabled in the platform; HttpFS is also required in other, less-common cases. See Enable HttpFS.

Configure for Hive
  Integration with the Hadoop cluster's instance of Hive.

Configure for KMS
  Integration with the Hadoop cluster's key management system (KMS) for encrypted transport. Instructions are provided for distribution-specific versions of Hadoop.

Configure Security
  A list of topics on applying additional security measures to the Trifacta platform and how it integrates with Hadoop.

Configure SSO for AD-LDAP
  Please complete these steps if you are integrating with your enterprise's AD/LDAP Single Sign-On (SSO) system.
