NOTE: This content is shared across multiple pages and should be reviewed where it is shared. For an example where this content is used, see Deploy.

After you have performed the base installation of the Trifacta® platform, please complete the following steps if you are integrating with a Hadoop cluster.

Apply cluster configuration files - non-edge node

If the Trifacta platform is being installed on a non-edge node, you must copy over the Hadoop Client Configuration files from the cluster. 

NOTE: When these files change, you must update the local copies. For this reason, it is best to install on an edge node.

  1. Download the Hadoop Client Configuration files from the Hadoop cluster. The required files are the following:
    1. core-site.xml
    2. hdfs-site.xml
    3. mapred-site.xml
    4. yarn-site.xml
    5. hive-site.xml (if you are using Hive)
  2. Copy these configuration files to the Trifacta deployment. By default, these files are located in /etc/hadoop/conf on the cluster:

    sudo cp <location>/*.xml /opt/trifacta/conf/hadoop-site/
    sudo chown trifacta:trifacta /opt/trifacta/conf/hadoop-site/*.xml 
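
To confirm that the files are in place and owned by the service account, you can list the directory; this is an optional check, not a required step:

    ls -l /opt/trifacta/conf/hadoop-site/*.xml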
    

For more information, see Configure for Hadoop.

Apply cluster configuration files - edge node

If the Trifacta platform is being installed on an edge node of the cluster, you can create a symlink from a local directory to the source cluster files so that they are automatically updated as needed.

  1. Navigate to the following directory on the Trifacta node:

    cd /opt/trifacta/conf/hadoop-site
  2. Create a symlink for each of the Hadoop Client Configuration files referenced in the previous steps. Example:

    ln -s /etc/hadoop/conf/core-site.xml core-site.xml
  3. Repeat the above steps for each of the Hadoop Client Configuration files.
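
If you prefer to create all of the symlinks at once, the following sketch does so in a loop. It assumes the default client configuration directory of /etc/hadoop/conf; adjust the path and the file list to match your cluster:

    cd /opt/trifacta/conf/hadoop-site
    for f in core-site.xml hdfs-site.xml mapred-site.xml yarn-site.xml hive-site.xml; do
      # Skip any file that is not present on this cluster (for example, hive-site.xml when Hive is not used)
      [ -f "/etc/hadoop/conf/$f" ] && ln -s "/etc/hadoop/conf/$f" "$f"
    done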

For more information, see Configure for Hadoop.

Modify Trifacta configuration

  1. To apply this configuration change, log in to the Trifacta node as an administrator. Then, edit trifacta-conf.json. Some of these settings may not be available through the Admin Settings Page. For more information, see Platform Configuration Methods.

  2. HDFS: Change the host and port information for HDFS as needed. Please apply the port numbers for your distribution:

    "hdfs.namenode.host": "<namenode>",
    "hdfs.namenode.port": <hdfs_port_num>
    "hdfs.yarn.resourcemanager": {
    "hdfs.yarn.webappPort": 8088,
    "hdfs.yarn.adminPort": 8033,
    "hdfs.yarn.host": "<resourcemanager_host>",
    "hdfs.yarn.port": <resourcemanager_port>,
    "hdfs.yarn.schedulerPort": 8030

    For more information, see Configure for Hadoop.
     

  3. Save your changes and restart the platform.
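
For example, a minimal restart from the command line on the Trifacta node might look like the following; the exact method depends on your installation, so see Start and Stop the Platform for the supported commands:

    sudo service trifacta restart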

Configure Spark Job Service

The Spark Job Service must be enabled for both execution and profiling jobs to work in Spark.

NOTE: Beginning in Release 4.0, the Spark Job Service and running environment are enabled by default. If you are upgrading from an earlier release, you may be required to enable the service through the following configuration changes.

Below is a sample configuration and description of each property. You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.

"spark-job-service" : {
  "systemProperties" : {
    "java.net.preferIPv4Stack": "true",
    "SPARK_YARN_MODE": "true"
  },
  "sparkImpersonationOn": false,
  "optimizeLocalization": true,
  "mainClass": "com.trifacta.jobserver.SparkJobServer",
  "jvmOptions": [
"-Xmx128m"
],
  "hiveDependenciesLocation": "%(topOfTree)s/hadoop-deps/cdh-5.4/build/libs",
  "env": {
    "SPARK_JOB_SERVICE_PORT": "4007",
    "SPARK_DIST_CLASSPATH": "",
    "MAPR_TICKETFILE_LOCATION": "<MAPR_TICKETFILE_LOCATION>",
    "MAPR_IMPERSONATION_ENABLED": "0",
    "HADOOP_USER_NAME": "trifacta",
    "HADOOP_CONF_DIR": "%(topOfTree)s/conf/hadoop-site/"
  },
  "enabled": true,
  "enableHiveSupport": true,
  "enableHistoryServer": false,
  "classpath": "%(topOfTree)s/services/spark-job-server/server/build/libs/spark-job-server-bundle.jar:%(topOfTree)s/conf/hadoop-site/:%(topOfTree)s/services/spark-job-server/build/bundle/*:%(topOfTree)s/%(hadoopBundleJar)s",
  "autoRestart": false,
},

The following properties can be modified based on your needs:

NOTE: Unless explicitly told to do so, do not modify any of the above properties that are not listed below.

sparkImpersonationOn
    Set this value to true if secure impersonation is enabled on your cluster. See Configure for Secure Impersonation.

jvmOptions
    This array of values can be used to pass parameters to the JVM that manages the Spark Job Service.

hiveDependenciesLocation
    If Spark is integrated with a Hive instance, set this value to the path where the Hive dependencies are installed on the Trifacta node. For more information, see Configure for Hive.

env.SPARK_JOB_SERVICE_PORT
    Set this value to the listening port number on the cluster for Spark. Default value is 4007. For more information, see System Ports.

env.HADOOP_USER_NAME
    The username of the Hadoop principal used by the platform. By default, this value is trifacta.

env.HADOOP_CONF_DIR
    The directory on the Trifacta node where the Hadoop cluster configuration files are stored. Do not modify unless necessary.

enabled
    Set this value to true to enable the Spark Job Service.

enableHiveSupport
    See "Configure service for Hive" below.

After making any changes, save the file and restart the platform. See Start and Stop the Platform.

Configure service for Hive

Depending on the environment, please apply the following configuration changes to manage Spark interactions with Hive:

Environment                        spark.enableHiveSupport
Hive is not present                false
Hive is present but not enabled    false
Hive is present and enabled        true

If Hive is present on the cluster, whether enabled or disabled, the hive-site.xml file must be copied to the correct directory:

cp /etc/hive/conf/hive-site.xml /opt/trifacta/conf/hadoop-site/hive-site.xml

At this point, the platform only expects that a hive-site.xml file has been installed on the Trifacta node. A valid connection is not required. For more information, see Configure for Hive.
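
If the files under /opt/trifacta/conf/hadoop-site are owned by the trifacta service account, as in the earlier copy step, you may also need to adjust ownership of the copied file. A sketch, assuming the default trifacta:trifacta account:

sudo chown trifacta:trifacta /opt/trifacta/conf/hadoop-site/hive-site.xml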

Configure Spark

After the Spark Job Service has been enabled, please complete the following sections to configure it for the Trifacta platform.

YARN cluster mode

All jobs submitted to the Spark Job Service are executed in YARN cluster mode. No other cluster mode is supported for the Spark Job Service.

Configure access for secure impersonation

The Spark Job Service can run under secure impersonation. For more information, see Configure for Secure Impersonation.

When running under secure impersonation, the Spark Job Service requires access to the following folders. Read, write, and execute access must be provided to the Trifacta user and the impersonated user.

Trifacta Libraries folder
    Platform configuration property: "hdfs.pathsConfig.libraries"
    Default value: /trifacta/libraries
    Description: Maintains JAR files and other libraries required by Spark. No sensitive information is written to this location.

Trifacta Temp files folder
    Platform configuration property: "hdfs.pathsConfig.tempFiles"
    Default value: /trifacta/tempfiles
    Description: Holds temporary progress information files for YARN applications. Each file contains a number indicating the progress percentage. No sensitive information is written to this location.

Trifacta Dictionaries folder
    Platform configuration property: "hdfs.pathsConfig.dictionaries"
    Default value: /trifacta/dictionaries
    Description: Contains definitions of dictionaries created for the platform.
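
As an illustration only, the default folders could be created and opened to the platform with standard HDFS commands. The ownership and permission scheme below is an assumption; adjust the paths, owner, and mode to match your secure impersonation policy:

hdfs dfs -mkdir -p /trifacta/libraries /trifacta/tempfiles /trifacta/dictionaries
hdfs dfs -chown -R trifacta:trifacta /trifacta/libraries /trifacta/tempfiles /trifacta/dictionaries
hdfs dfs -chmod -R 770 /trifacta/libraries /trifacta/tempfiles /trifacta/dictionaries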

Identify Hadoop libraries on the cluster

The Spark Job Service does not require additional installation on the Trifacta node or on the Hadoop cluster. Instead, it references the spark-assembly JAR that is provided with the Trifacta distribution.

This JAR file does not include the Hadoop client libraries. You must point the Trifacta platform to the appropriate libraries.

Steps:

  1. In platform configuration, locate the spark-job-service configuration block.
  2. Set the following property:

    "spark-job-service.env.HADOOP_CONF_DIR": "<path_to_Hadoop_conf_dir_on_Hadoop_cluster>",
    spark-job-service.env.HADOOP_CONF_DIR
        Path to the Hadoop configuration directory on the Hadoop cluster.
  3. In the same block, the SPARK_DIST_CLASSPATH property must be set depending on your Hadoop distribution.
    1. For Cloudera 5.x: This property can be left blank.
    2. For Hortonworks 2.x: This property configuration is covered later in this section.

  4. Save your changes.

Locate Hive dependencies location

If the Trifacta platform is also connected to a Hive instance, please verify the location of the Hive dependencies on the Trifacta node. The following example is from Cloudera 5.10:

NOTE: This parameter value is distribution-specific. Please update based on your Hadoop distribution.

"spark-job-service.hiveDependenciesLocation":"%(topOfTree)s/hadoop-deps/cdh-5.10/build/libs",

For more information, see Configure for Spark.

Enable High Availability

NOTE: If high availability is enabled on the Hadoop cluster, it must be enabled on the Trifacta platform, even if you are not planning to rely on it. See Enable Integration with Cluster High Availability.

