
NOTE: This documentation applies to an outdated release. The latest docs are Release 8.7: Configure for HDInsight.



When deployed to Microsoft Azure, the Trifacta® platform must be integrated with Microsoft HDInsight, a Hadoop-based platform for data storage and analytics. This section describes the configuration steps required to integrate with a pre-existing HDI cluster.  This section applies only if you have installed the Trifacta platform onto a node of a pre-existing HDI cluster.

If you created a new HDI cluster as part of your deployment of the platform from the Azure Marketplace, please skip this section. You may use it as reference in the future.

Supported Versions

Supported Versions: This release supports integration with HDI 3.5 and HDI 3.6.


For this release, the following limitations apply:

  • The Trifacta platform must be installed on Azure.

  • HDI does not support the client-server web sockets configuration used by the platform. As a result, the platform surfaces fewer of the suggestions that are normally prompted by user activities.


This section makes the following assumptions:

  1. You have installed and configured the Trifacta platform onto an edge node of a pre-existing HDI cluster.
  2. You have installed WebWASB on the platform edge node.

Before You Begin

Create Trifacta user account on HDI cluster

The Trifacta platform interacts with the cluster through a single system user account.  A user for the platform must be added to the cluster.


If possible, please create the user ID ([hdi.user]) as: trifacta.

This user must be created on each data node in the cluster.

This user should belong to the group: trifactausers.

User requirements:

  • (if integrating with WASB) Access to WASB
  • Permission to run YARN jobs on the cluster. 


  1. You can apply this change through the Admin Settings Page (recommended) or through direct edits to the platform configuration files. For more information, see Platform Configuration Methods.
  2. Set the cluster paths in the following locations:

    "hdfs.pathsConfig.fileUpload": "/trifacta/uploads",
    "hdfs.pathsConfig.dictionaries": "/trifacta/dictionaries",
    "hdfs.pathsConfig.batchResults": "/trifacta/queryResults",

    Do not read from or write to the /trifacta/uploads directory outside of the application. This directory is used for storing uploads and metadata, which may be used by multiple users. Manipulating files outside of the Trifacta application can destroy other users' data. Please use the tools provided through the interface for managing uploads to WASB.

    Individual users can configure the output directory where exported results are stored. See Storage Config Page.

  3. Save your changes.
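When editing the configuration directly, step 2 amounts to setting three flat keys. A minimal Python sketch of that merge, assuming the platform configuration is loaded as a flat dictionary (the helper name is illustrative, not part of the product):

```python
import json

# Cluster path settings from the step above.
CLUSTER_PATHS = {
    "hdfs.pathsConfig.fileUpload": "/trifacta/uploads",
    "hdfs.pathsConfig.dictionaries": "/trifacta/dictionaries",
    "hdfs.pathsConfig.batchResults": "/trifacta/queryResults",
}

def set_cluster_paths(config: dict) -> dict:
    """Merge the cluster path settings into a flat config dictionary."""
    config.update(CLUSTER_PATHS)
    return config

# Demo on an in-memory config; a real run would load and re-save the
# platform configuration file instead.
print(json.dumps(set_cluster_paths({}), indent=2))
```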

Acquire cluster configuration files  

You must acquire the configuration files from the HDI cluster and apply them to the Trifacta node.

Tip: Later, you configure the platform settings for accessing various components in the cluster. Host, port, and other information for those components is available in these cluster configuration files.


  1. On any edge node of the cluster, acquire the XML configuration files from the following directory:


    NOTE: If you are integrating with an instance of Hive on the cluster, you must also acquire the Hive configuration file: /etc/hive/conf/hive-site.xml.

  2. These files must be copied to the following directory on the Trifacta node:

  3. Replace any existing files with these files.
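The copy itself can be scripted. A sketch in Python, assuming the cluster files are staged locally; the directory names in the usage example are assumptions for illustration (the classpath settings later in this section reference /etc/hadoop/conf and conf/hadoop-site):

```python
import shutil
from pathlib import Path

def copy_site_files(src_dir: str, dst_dir: str) -> list:
    """Copy cluster *.xml configuration files into the Trifacta node's
    configuration directory, replacing any existing copies."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    for xml in sorted(Path(src_dir).glob("*.xml")):
        shutil.copy2(xml, dst / xml.name)  # copy2 preserves timestamps
        copied.append(xml.name)
    return copied

# Example (both paths are hypothetical; adjust for your install):
# copy_site_files("/etc/hadoop/conf", "/opt/trifacta/conf/hadoop-site")
```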

Acquire build number

You must acquire the full version and build number of the underlying Hortonworks distribution. On any of the cluster nodes, navigate to /usr/hdp. The version and build number is referenced as a directory in this location, named in the following form:


For the rest of the configuration, the sample values for HDI 3.6 are referenced. Use the appropriate values for your distribution.

Supported HDInsight Distribution | Short Hortonworks Version | Example Full Hortonworks Version
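The short version used in later settings (for example, hdp-2.6 in the bundle JAR path) is just the first two components of the directory name. A small sketch; the sample build string is made up for illustration:

```python
def hdp_version_parts(dirname: str):
    """Split a /usr/hdp directory name into (short version, full version).

    For example, a directory named '2.6.2.25-1' yields short version '2.6'."""
    short = ".".join(dirname.split(".")[:2])
    return short, dirname

# The build string below is a hypothetical example, not a real HDI build.
print(hdp_version_parts("2.6.2.25-1"))
```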

Configure the HDI Cluster

The following configuration sections must be reviewed and completed. 

Specify Storage Layer

In the Azure console, you must specify and retain the type of storage to apply. In the Storage tab of the cluster specification, the following storage layers are supported.

NOTE: After the base storage layer has been defined in the Trifacta platform, it cannot be changed. Reinstallation is required.

NOTE: If possible, you should reserve a dedicated cluster for the Trifacta platform processing. If there is a mismatch between the storage layer of your existing HDI cluster and the required storage for your Trifacta deployment, you can create a new HDI cluster as part of the installation process. For more information, see Install for Azure.

Tip: During installation of the Trifacta platform, you must define the base storage layer. Retain your selection of the Azure Storage Layer and its mapped base storage layer for the Trifacta platform installation.

Azure Storage Layer | Description | Trifacta Base Storage Layer

Azure Storage | Azure storage leverages WASB, an abstraction layer on top of HDFS. | wasbs
Data Lake Store | Data Lake Store maps to ADLS in the Trifacta platform. | hdfs
ADLS Gen2 | ADLS Gen2 storage is not supported for HDInsight clusters. | (not supported)

Specify Protocol

In the Ambari console, you must specify the communication protocol to use in the cluster. 

NOTE: The cluster protocol must match the protocol in use by the Trifacta platform.


  1. In the Ambari console, navigate to the following location: HDFS > Configs > Advanced > Advanced Core Site > fs.defaultFS.
  2. Set the value according to the following table:

    Azure Storage Layer | Protocol (fs.defaultFS) value | Trifacta platform config value

    Azure Storage | wasbs://<containername>@<accountname> | "webapp.storageProtocol": "wasbs" (see Set Base Storage Layer)
    Data Lake Store | adl://home | "webapp.storageProtocol": "hdfs" (see Set Base Storage Layer)
  3. Save your changes.
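The mapping in the table above can be kept as a small lookup when scripting configuration checks. The values come straight from the table; the container and account names are placeholders:

```python
# Protocol and platform settings per Azure storage layer, from the table above.
STORAGE_PROTOCOLS = {
    "Azure Storage": {
        "fs.defaultFS": "wasbs://<containername>@<accountname>",
        "webapp.storageProtocol": "wasbs",
    },
    "Data Lake Store": {
        "fs.defaultFS": "adl://home",
        "webapp.storageProtocol": "hdfs",
    },
}

def platform_protocol(layer: str) -> str:
    """Return the webapp.storageProtocol value for a given storage layer."""
    return STORAGE_PROTOCOLS[layer]["webapp.storageProtocol"]
```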

Define Script Action for domain-joined clusters

If you are integrating with a domain-joined cluster, you must specify a script action to set some permissions on cluster directories. 

For more information, see


  1. In the Advanced Settings tab of the cluster specification, click Script actions.
  2. In the textbox, insert the following URL:
  3. Save your changes.

Configure the Platform

These changes must be applied after the Trifacta platform has been installed.

Perform base configuration for HDI

The Trifacta platform can be configured to integrate with supported versions of HDInsight clusters to run jobs in Spark. 

NOTE: Before you attempt to integrate, you should review the limitations around this integration. For more information, see Configure for HDInsight.

Specify running environment options:

  1. You can apply this change through the Admin Settings Page (recommended) or through direct edits to the platform configuration files. For more information, see Platform Configuration Methods.

  2. Configure the following parameters to enable job execution on the specified HDI cluster:

    "webapp.runInDatabricks": false,
    "webapp.runWithSparkSubmit": true,

    webapp.runInDatabricks: Defines whether the platform runs jobs in Azure Databricks. Set this value to false.

    webapp.runWithSparkSubmit: For HDI deployments, this value should be set to true.

Specify Trifacta user:

Set the Hadoop username for the Trifacta platform to use for executing jobs [hadoop.user (default=trifacta)]:  

"hdfs.username": "[hadoop.user]",

Specify location of client distribution bundle JAR:

The Trifacta platform ships with client bundles supporting a number of major Hadoop distributions.  You must configure the jarfile for the distribution to use.  These distributions are stored in the following directory:


Configure the bundle distribution property (hadoopBundleJar):

  "hadoopBundleJar": "hadoop-deps/hdp-2.6/build/libs/hdp-2.6-bundle.jar"
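The bundle path follows a predictable layout under hadoop-deps, so it can be derived from the short HDP version. A sketch, following the layout shown in the example above:

```python
def bundle_jar_path(short_version: str) -> str:
    """Build the hadoopBundleJar value for an HDP-based distribution,
    following the hadoop-deps layout shown above."""
    return (f"hadoop-deps/hdp-{short_version}/build/libs/"
            f"hdp-{short_version}-bundle.jar")

print(bundle_jar_path("2.6"))
```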

Configure component settings:

For each of the following components, please explicitly set the following settings.

  1. You can apply this change through the Admin Settings Page (recommended) or through direct edits to the platform configuration files. For more information, see Platform Configuration Methods.

  2. Configure Batch Job Runner:

      "batch-job-runner": {
        "autoRestart": true,
        "classpath": "%(topOfTree)s/services/batch-job-runner/build/install/batch-job-runner/batch-job-runner.jar:%(topOfTree)s/services/batch-job-runner/build/install/batch-job-runner/lib/*:%(topOfTree)s/%(hadoopBundleJar)s:/etc/hadoop/conf:%(topOfTree)s/conf/hadoop-site:/usr/lib/hdinsight-datalake/*:/usr/hdp/current/hadoop-client/client/*:/usr/hdp/current/hadoop-client/*"
      },
  3. Configure the following environment variables:

    "env.PATH": "${HOME}/bin:$PATH:/usr/local/bin:/usr/lib/zookeeper/bin",
    "env.TRIFACTA_CONF": "/opt/trifacta/conf",
    "env.JAVA_HOME": "/usr/lib/jvm/java-1.8.0-openjdk-amd64",
  4. Configure the following properties for various Trifacta components:

      "ml-service": {
        "autoRestart": true
      },
      "monitor": {
        "autoRestart": true,
        "port": <your_cluster_monitor_port>
      },
      "proxy": {
        "autoRestart": true
      },
      "udf-service": {
        "autoRestart": true
      },
      "webapp": {
        "autoRestart": true
      },
  5. Disable S3 access:

    "aws.s3.enabled": false,
  6. Configure the following Spark Job Service properties:

    "spark-job-service.classpath": "%(topOfTree)s/services/spark-job-server/server/build/install/server/lib/*:%(topOfTree)s/conf/hadoop-site/:%(topOfTree)s/services/spark-job-server/build/bundle/*:/usr/hdp/current/hadoop-client/hadoop-azure.jar:/usr/hdp/current/hadoop-client/lib/azure-storage-2.2.0.jar",
    "spark-job-service.env.SPARK_DIST_CLASSPATH": "/usr/hdp/current/hadoop-client/*:/usr/hdp/current/hadoop-mapreduce-client/*",
  7. Save your changes.

Configure High Availability

If you are integrating the platform with an HDI cluster that has high availability enabled, please complete the following steps so that the platform is aware of the failover nodes.


  1. You can apply this change through the Admin Settings Page (recommended) or through direct edits to the platform configuration files. For more information, see Platform Configuration Methods.

  2. Enable the high availability feature for the namenode and resourceManager:

    "feature.highAvailability.namenode": true,
    "feature.highAvailability.resourceManager": true,
  3. For each YARN resource manager, you must configure its high availability settings. The following are two example node configurations, including the default port numbers for HDI:

    Tip: Host and port settings should be available in the cluster configuration files you copied to the Trifacta node. Or you can acquire the settings through the cluster's admin console.

      "yarn": {
        "highAvailability": {
          "resourceManagers": {
            "rm1": {
              "port": <your_cluster_rm1_port>,
              "schedulerPort": <your_cluster_rm1_scheduler_port>,
              "adminPort": <your_cluster_rm1_admin_port>,
              "webappPort": <your_cluster_rm1_webapp_port>
            },
            "rm2": {
              "port": <your_cluster_rm2_port>,
              "schedulerPort": <your_cluster_rm2_scheduler_port>,
              "adminPort": <your_cluster_rm2_admin_port>,
              "webappPort": <your_cluster_rm2_webapp_port>
            }
          }
        }
      },
  4. Configure the high availability namenodes. The following example configures two namenodes (nn1 and nn2), including the default port numbers for HDI: 

    Tip: Host and port settings should be available in the cluster configuration files you copied to the Trifacta node. Or you can acquire the settings through the cluster's admin console.

      "hdfs": {
        "highAvailability": {
          "namenodes": {
            "nn1": {
              "port": <your_cluster_namenode1_port>
            },
            "nn2": {
              "port": <your_cluster_namenode2_port>
            }
          }
        }
      },
  5. Save your changes.
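As the tips above note, the host and port values can be read out of the copied cluster configuration files. A sketch of extracting the per-resource-manager properties from yarn-site.xml content; the hostnames in the test data are hypothetical:

```python
import re
import xml.etree.ElementTree as ET

def yarn_rm_settings(yarn_site_xml: str) -> dict:
    """Group yarn.resourcemanager.* properties by resource manager ID.

    Returns {rm_id: {property_prefix: value}} for properties whose
    names end in an rm ID such as '.rm1' or '.rm2'."""
    settings = {}
    root = ET.fromstring(yarn_site_xml)
    for prop in root.iter("property"):
        name = prop.findtext("name", default="")
        value = prop.findtext("value", default="")
        prefix, _, rm_id = name.rpartition(".")
        if name.startswith("yarn.resourcemanager.") and re.fullmatch(r"rm\d+", rm_id):
            settings.setdefault(rm_id, {})[prefix] = value
    return settings
```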

NOTE: If you are deploying high availability failover, you must use HttpFS, instead of WebHDFS, for communicating with HDI. Additional configuration is required. HA in a Kerberized environment for HDI is not supported. See Enable Integration with Cluster High Availability.

Create Hive connection


  1. The platform only supports HTTP connections to Hive on HDI. TCP connections are not supported.
  2. The Hive port must be set to 10001 for HTTP.

For more information, see Create Hive Connections.

Hive integration requires additional configuration.

NOTE: Natively, HDI supports high availability for Hive via a Zookeeper quorum.

For more information, see Configure for Hive .

Configure for Spark Profiling

If you are using Spark for profiling, you must add environment properties to your cluster configuration. See Configure for Spark.

Configure for UDFs

If you are using user-defined functions (UDFs) on your HDInsight cluster, additional configuration is required. See Java UDFs.

Configure Storage

Before you begin running jobs, you must specify your base storage layer, which can be WASB or ADLS. For more information, see Set Base Storage Layer.

Additional configuration is required.

Starting the Platform

NOTE: In an Azure HDI environment, you must perform platform start and stop operations from /opt/trifacta. Running these commands from other directories, such as /root, can cause service issues.
