The platform can be configured to integrate with supported versions of HDInsight clusters to run jobs in Spark.
NOTE: Before you attempt to integrate, you should review the limitations around this integration. For more information, see Configure for HDInsight.
Specify running environment options: Configure the following parameters to enable job execution on the specified HDI cluster:
"webapp.runInDatabricks": false,
"webapp.runWithSparkSubmit": true, |
Parameter | Description
---|---
webapp.runInDatabricks | Defines whether the platform runs jobs in Azure Databricks. Set this value to false.
webapp.runWithSparkSubmit | For HDI deployments, set this value to true.
Specify Hadoop username: Set the Hadoop username for the platform to use for executing jobs:
"hdfs.username": "[hadoop.user]", |
Specify location of client distribution bundle JAR: The platform ships with client bundles supporting a number of major Hadoop distributions. You must configure the JAR file for the distribution to use. These bundles are stored in the following directory: /trifacta/hadoop-deps
Configure the bundle distribution property (hadoopBundleJar):
"hadoopBundleJar": "hadoop-deps/hdp-2.6/build/libs/hdp-2.6-bundle.jar" |
Configure component settings: For each of the following components, explicitly set the indicated settings. Configure Batch Job Runner:
"batch-job-runner": {
"autoRestart": true,
...
"classpath": "%(topOfTree)s/hadoop-dataservices/batch-job-runner/build/install/hadoopbatch-job-datarunner/hadoopbatch-job-datarunner.jar:%(topOfTree)s/hadoop-dataservices/batch-job-runner/build/install/hadoopbatch-job-datarunner/lib/*:%(topOfTree)s/%(hadoopBundleJar)s:/etc/hadoop/conf:%(topOfTree)s/conf/hadoop-site:/usr/lib/hdinsight-datalake/*:/usr/hdp/current/hadoop-client/hadoop-azure.jarclient/*:/usr/hdp/current/hadoop-client/lib/azure-storage-2.2.0.jar*"
}, |
Configure the following environment variables:
"env.PATH": "${HOME}/bin:$PATH:/usr/local/bin:/usr/lib/zookeeper/bin",
"env.TRIFACTA_CONF": "/opt/trifacta/conf"
"env.JAVA_HOME": "/usr/lib/jvm/java-1.8.0-openjdk-amd64", |
Configure the following properties for various platform components:
"ml-service": {
"autoRestart": true
},
"monitor": {
"autoRestart": true,
...
"port": <your_cluster_monitor_port>
},
"proxy": {
"autoRestart": true
},
"udf-service": {
"autoRestart": true
},
"webapp": {
"autoRestart": true
},
Disable S3 access:
"aws.s3.enabled": false, |
Configure the following Spark Job Service properties:
"spark-job-service.classpath": "%(topOfTree)s/services/spark-job-server/server/build/install/server/lib/*:%(topOfTree)s/conf/hadoop-site/:%(topOfTree)s/services/spark-job-server/build/bundle/*:/usr/hdp/current/hadoop-client/hadoop-azure.jar:/usr/hdp/current/hadoop-client/lib/azure-storage-2.2.0.jar",
"spark-job-service.env.SPARK_DIST_CLASSPATH": "/usr/hdp/current/hadoop-client/*:/usr/hdp/current/hadoop-mapreduce-client/*", |
- Save your changes.