...

Excerpt

The D s platform can be configured to integrate with supported versions of HDInsight clusters to run jobs in Spark.

Info

NOTE: Before you attempt to integrate, review the limitations of this integration. For more information, see Configure for HDInsight.

Specify running environment options:

  1. D s config

  2. Configure the following parameters to enable job execution on the specified HDI cluster:

    Code Block
    "webapp.runInDatabricks": false,
    "webapp.runWithSparkSubmit": true,


    Parameter: webapp.runInDatabricks
    Description: Defines if the platform runs jobs in Azure Databricks. Set this value to false.

    Parameter: webapp.runWithSparkSubmit
    Description: For HDI deployments, this value should be set to true.
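
    The dotted names above are flattened references to nested entries in the platform configuration file. If you edit the nested form directly, the two settings would typically appear in a structure like the following (a sketch of the assumed nesting only; other webapp settings omitted):

    Code Block
    "webapp": {
      ...
      "runInDatabricks": false,
      "runWithSparkSubmit": true,
      ...
    },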


Specify D s item user:

Set the Hadoop username ([hadoop.user]) for the D s platform to use for executing jobs:

Code Block
"hdfs.username": "[hadoop.user]",

Specify location of client distribution bundle JAR:

The D s platform ships with client bundles supporting a number of major Hadoop distributions. You must configure the JAR file for the distribution to use. These distributions are stored in the following directory:

/trifacta/hadoop-deps

Configure the bundle distribution property (hadoopBundleJar):

Code Block
  "hadoopBundleJar": "hadoop-deps/hdp-2.6/build/libs/hdp-2.6-bundle.jar"

Configure component settings:

For each of the components below, explicitly set the following settings.

  1. D s config

  2. Configure Batch Job Runner:

    Code Block
      "batch-job-runner": {
       "autoRestart": true,
        ...
        "classpath": "%(topOfTree)s/hadoop-dataservices/batch-job-runner/build/install/hadoopbatch-job-datarunner/hadoopbatch-job-datarunner.jar:%(topOfTree)s/hadoop-dataservices/batch-job-runner/build/install/hadoopbatch-job-datarunner/lib/*:%(topOfTree)s/%(hadoopBundleJar)s:/etc/hadoop/conf:%(topOfTree)s/conf/hadoop-site:/usr/lib/hdinsight-datalake/*:/usr/hdp/current/hadoop-client/hadoop-azure.jarclient/*:/usr/hdp/current/hadoop-client/lib/azure-storage-2.2.0.jar*"
      },
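
    The %(...)s tokens in this classpath are interpolated from other configuration values: %(topOfTree)s expands to the platform install root, and %(hadoopBundleJar)s expands to the bundle path configured earlier. With the example bundle setting above, that entry resolves to a path like the following (a sketch; the actual install root depends on your deployment):

    Code Block
    <install root>/hadoop-deps/hdp-2.6/build/libs/hdp-2.6-bundle.jar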


  3. Configure the following environment variables:

    Code Block
    "env.PATH": "${HOME}/bin:$PATH:/usr/local/bin:/usr/lib/zookeeper/bin",
    "env.TRIFACTA_CONF": "/opt/trifacta/conf"
    "env.JAVA_HOME": "/usr/lib/jvm/java-1.8.0-openjdk-amd64",


  4. Configure the following properties for various D s item components:

    Code Block
      "ml-service": {
       "autoRestart": true
      },
      "monitor": {
       "autoRestart": true,
        ...
       "port": <your_cluster_monitor_port>
      },
      "proxy": {
       "autoRestart": true
      },
      "udf-service": {
       "autoRestart": true
      },
      "webapp": {
        "autoRestart": true
      },


  5. Disable S3 access:

    Code Block
    "aws.s3.enabled": false,


  6. Configure the following Spark Job Service properties:

    Code Block
    "spark-job-service.classpath": "%(topOfTree)s/services/spark-job-server/server/build/install/server/lib/*:%(topOfTree)s/conf/hadoop-site/:%(topOfTree)s/services/spark-job-server/build/bundle/*:/usr/hdp/current/hadoop-client/hadoop-azure.jar:/usr/hdp/current/hadoop-client/lib/azure-storage-2.2.0.jar",
    "spark-job-service.env.SPARK_DIST_CLASSPATH": "/usr/hdp/current/hadoop-client/*:/usr/hdp/current/hadoop-mapreduce-client/*",


  7. Save your changes.

...