
The Trifacta platform can be configured to integrate with supported versions of HDInsight clusters to run jobs in Spark. 

NOTE: Before you attempt to integrate, you should review the limitations around this integration. For more information, see Configure for HDInsight.

Specify running environment options:

  1. You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.

  2. Configure the following parameters to enable job execution on the specified HDI cluster:

    "webapp.runInDatabricks": false,
    "webapp.runWithSparkSubmit": true,

    Parameter                    Description
    webapp.runInDatabricks       Defines whether the platform runs jobs in Azure Databricks. Set this value to false.
    webapp.runWithSparkSubmit    For HDI deployments, set this value to true.
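
If you edit trifacta-conf.json directly rather than using the Admin Settings Page, these flat parameter names typically correspond to nested JSON keys, as in the other per-component blocks shown later on this page. A minimal sketch, assuming the standard nesting of the webapp section:

  "webapp": {
    ...
    "runInDatabricks": false,
    "runWithSparkSubmit": true,
    ...
  },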

Specify Trifacta user:

Set the Hadoop username for the Trifacta platform to use for executing jobs [hadoop.user (default=trifacta)]:  

"hdfs.username": "[hadoop.user]",

Specify location of client distribution bundle JAR:

The Trifacta platform ships with client bundles supporting a number of major Hadoop distributions. You must configure the JAR file for the distribution that you are using. These bundles are stored in the following directory:

/trifacta/hadoop-deps

Configure the bundle distribution property (hadoopBundleJar):

  "hadoopBundleJar": "hadoop-deps/hdp-2.6/build/libs/hdp-2.6-bundle.jar"

Configure component settings:

For each of the following components, explicitly set the settings listed below.

  1. You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.

  2. Configure Batch Job Runner:

      "batch-job-runner": {
       "autoRestart": true,
        ...
        "classpath": "%(topOfTree)s/hadoop-data/build/install/hadoop-data/hadoop-data.jar:%(topOfTree)s/hadoop-data/build/install/hadoop-data/lib/*:%(topOfTree)s/conf/hadoop-site:/usr/hdp/current/hadoop-client/hadoop-azure.jar:/usr/hdp/current/hadoop-client/lib/azure-storage-2.2.0.jar"
      },
  3. Configure the following environment variables:

    "env.PATH": "${HOME}/bin:$PATH:/usr/local/bin:/usr/lib/zookeeper/bin",
    "env.TRIFACTA_CONF": "/opt/trifacta/conf"
    "env.JAVA_HOME": "/usr/lib/jvm/java-1.8.0-openjdk-amd64",
  4. Configure the following properties for various Trifacta components:

      "ml-service": {
       "autoRestart": true
      },
      "monitor": {
       "autoRestart": true,
        ...
       "port": <your_cluster_monitor_port>
      },
      "proxy": {
       "autoRestart": true
      },
      "udf-service": {
       "autoRestart": true
      },
      "webapp": {
        "autoRestart": true
      },
  5. Disable S3 access (see the sketch after this list for the nested trifacta-conf.json form):

    "aws.s3.enabled": false,
  6. Configure the following Spark Job Service properties:

    "spark-job-service.classpath": "%(topOfTree)s/services/spark-job-server/server/build/install/server/lib/*:%(topOfTree)s/conf/hadoop-site/:%(topOfTree)s/services/spark-job-server/build/bundle/*:/usr/hdp/current/hadoop-client/hadoop-azure.jar:/usr/hdp/current/hadoop-client/lib/azure-storage-2.2.0.jar",
    "spark-job-service.env.SPARK_DIST_CLASSPATH": "/usr/hdp/current/hadoop-client/*:/usr/hdp/current/hadoop-mapreduce-client/*",
  7. Save your changes.
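
As noted in step 5, the flat aws.s3.enabled name typically maps to nested keys when set directly in trifacta-conf.json. A minimal sketch, assuming the same nesting convention as the other settings on this page:

  "aws": {
    "s3": {
      "enabled": false
    }
  },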
