When deployed to Microsoft Azure, the platform must be integrated with Microsoft HDInsight, a Hadoop-based platform for data storage and analytics. This section describes the configuration steps required to integrate with a pre-existing HDI cluster.
This section applies only if you have installed the platform on Microsoft Azure.
For this release, the following limitations apply:
HDI does not support the client-server web sockets configuration used by the platform. This limitation has the following impacts:
Diminished suggestions prompted by platform activities
This section makes the following assumptions:
The platform interacts with the cluster through a single system user account. A user for the platform must be added to the cluster.
UserID:
If possible, create a dedicated user ID for the platform. This value is applied later in the configuration as the Hadoop username ([hadoop.user]).
This user must be created on each data node in the cluster.
This user should belong to a group designated for platform users.
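As a sketch, the account can be created with standard Linux commands on each data node. The user and group names below are placeholders; substitute the values chosen for your deployment.
# Placeholder names; substitute the user and group values for your deployment.
sudo groupadd hadoopgroup
sudo useradd -m -g hadoopgroup hadoopuser
# Repeat on every data node in the cluster.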
User requirements:
Steps:
Set the cluster paths in the following locations:
"hdfs.pathsConfig.fileUpload": "/trifacta/uploads", "hdfs.pathsConfig.dictionaries": "/trifacta/dictionaries", "hdfs.pathsConfig.batchResults": "/trifacta/queryResults", |
Do not use the /trifacta/uploads directory for any other purpose.
Individual users can configure the output directory where exported results are stored. See User Profile Page.
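If the directories referenced by the path settings above do not already exist in cluster storage, they can be created and assigned to the platform's system user. The following is a minimal sketch using standard HDFS commands, run as a cluster administrator; the user and group names are placeholders.
# Create the platform directories and assign ownership to the platform's system user.
hdfs dfs -mkdir -p /trifacta/uploads /trifacta/dictionaries /trifacta/queryResults
hdfs dfs -chown -R hadoopuser:hadoopgroup /trifacta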
You must acquire the configuration files from the HDI cluster and apply them to the node where the platform is installed.
Tip: Later, you configure the platform settings for accessing various components in the cluster. The host, port, and other information you need is available in these cluster configuration files.
Steps:
On any edge node of the cluster, acquire the *.xml files from the following directory:
/etc/hadoop/conf
NOTE: If you are integrating with an instance of Hive on the cluster, you must also acquire the Hive configuration file (hive-site.xml).
These files must be copied to the following directory on the node where the platform is installed:
/trifacta/conf/hadoop-site
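The following is one way to copy the files, assuming SSH access from the node where the platform is installed; the hostname and SSH user are placeholders.
# Run from the node where the platform is installed.
scp sshuser@<edge-node>:/etc/hadoop/conf/*.xml /trifacta/conf/hadoop-site/
# If integrating with Hive, also copy hive-site.xml (commonly found under /etc/hive/conf).
scp sshuser@<edge-node>:/etc/hive/conf/hive-site.xml /trifacta/conf/hadoop-site/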
You must acquire the full version and build number of the underlying Hortonworks distribution. On any of the cluster nodes, navigate to /usr/hdp. The version and build number appear as a directory name in this location, in the following form:
A.B.C.D-X
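For example, listing this directory on a cluster node shows the versioned directory; the sample value shown is the HDI 3.6 example from the table below.
ls /usr/hdp
# Typical output includes the versioned directory (for HDI 3.6, for example: 2.6.2.2-5) and the current symlink.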
For the rest of the configuration, the sample values for HDI 3.6 are referenced. Use the appropriate values for your distribution.
Supported HDInsight Distribution | Short Hortonworks Version | Example Full Hortonworks Version |
---|---|---|
3.5 | 2.5 | 2.5.6.2-9 |
3.6 | 2.6 | 2.6.2.2-5 |
The following configuration sections must be reviewed and completed.
In the Azure console, you must specify and retain the type of storage to apply. In the Storage tab of the cluster specification, the following storage layers are supported.
NOTE: After the base storage layer has been defined for the platform, it cannot be modified.
NOTE: If possible, you should reserve a dedicated cluster for the platform.
Tip: During installation of the platform, you must configure the base storage layer to match the storage type selected here.
Azure Storage Layer | Description | Platform Storage Protocol |
---|---|---|
Azure Storage | Azure Storage leverages WASB, an abstraction layer on top of HDFS. | wasbs |
Data Lake Store | Data Lake Store maps to ADLS in the platform. | hdfs |
In the Ambari console, you must specify the communication protocol to use in the cluster.
NOTE: The cluster protocol must match the protocol in use by the platform.
Steps:
Set the fs.defaultFS value according to the following table:
Azure Storage Layer | Protocol (fs.defaultFS) value | webapp.storageProtocol value | Link |
---|---|---|---|
Azure Storage | wasbs://<containername>@<accountname>.blob.core.windows.net | "webapp.storageProtocol" : "wasbs", | See Set Base Storage Layer. |
Data Lake Store | adl://home | "webapp.storageProtocol" : "hdfs", | See Set Base Storage Layer. |
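To verify which protocol is currently in effect, you can query the Hadoop configuration from any cluster node. This is a sketch; the example output corresponds to the Azure Storage row in the table above.
hdfs getconf -confKey fs.defaultFS
# Example output for Azure Storage: wasbs://<containername>@<accountname>.blob.core.windows.net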
If you are integrating with a domain-joined cluster, you must specify a script action to set some permissions on cluster directories.
For more information, see https://docs.microsoft.com/en-us/azure/hdinsight/domain-joined/apache-domain-joined-configure-using-azure-adds.
Steps:
In the textbox, insert the following URL:
https://raw.githubusercontent.com/trifacta/azure-deploy/master/bin/set-key-permissions.sh
If you haven't done so already, you can install the platform on an edge node of the HDI cluster. For more information, see Installation Steps.
These changes must be applied after the platform has been installed.
Set the Hadoop username for the platform to use when executing jobs:
"hdfs.username": "[hadoop.user]",
The platform ships with client bundles supporting a number of major Hadoop distributions. You must configure the jarfile for the distribution to use. These distributions are stored in the following directory:
/trifacta/hadoop-deps
Configure the bundle distribution property (hadoopBundleJar):
"hadoopBundleJar": "hadoop-deps/hdp-2.6/build/libs/hdp-2.6-bundle.jar"
For each of the following components, explicitly set these settings.
Configure Batch Job Runner:
"batch-job-runner": { "autoRestart": true, ... "classpath": "%(topOfTree)s/hadoop-data/build/install/hadoop-data/hadoop-data.jar:%(topOfTree)s/hadoop-data/build/install/hadoop-data/lib/*:%(topOfTree)s/conf/hadoop-site:/usr/hdp/current/hadoop-client/hadoop-azure.jar:/usr/hdp/current/hadoop-client/lib/azure-storage-2.2.0.jar" }, |
Configure the following environment variables:
"env.PATH": "${HOME}/bin:$PATH:/usr/local/bin:/usr/lib/zookeeper/bin", "env.TRIFACTA_CONF": "/opt/trifacta/conf" "env.JAVA_HOME": "/usr/lib/jvm/java-1.8.0-openjdk-amd64", |
Configure the following properties for the various platform services:
"ml-service": { "autoRestart": true }, "monitor": { "autoRestart": true, ... "port": <your_cluster_monitor_port> }, "proxy": { "autoRestart": true }, "udf-service": { "autoRestart": true }, "webapp": { "autoRestart": true }, |
Disable S3 access:
"aws.s3.enabled": false, |
Configure the following Spark Job Service properties:
"spark-job-service.classpath": "%(topOfTree)s/services/spark-job-server/server/build/libs/spark-job-server-bundle.jar:%(topOfTree)s/conf/hadoop-site/:%(topOfTree)s/services/spark-job-server/build/bundle/*:/usr/hdp/current/hadoop-client/hadoop-azure.jar:/usr/hdp/current/hadoop-client/lib/azure-storage-2.2.0.jar", "spark-job-service.env.SPARK_DIST_CLASSPATH": "/usr/hdp/current/hadoop-client/*:/usr/hdp/current/hadoop-mapreduce-client/*", |
If you are integrating the platform with an HDI cluster that has high availability enabled, complete the following steps so that the platform is aware of the failover nodes.
Steps:
Enable the high availability feature for the namenode and resourceManager nodes:
"feature.highAvailability.namenode": true, "feature.highAvailability.resourceManager": true, |
For each YARN resource manager, you must configure its high availability settings. The following are two example node configurations, including the default port numbers for HDI:
Tip: Host and port settings should be available in the cluster configuration files you copied to the platform node.
"yarn": { "highAvailability": { "resourceManagers": { "rm1": { "port": <your_cluster_rm1_port>, "schedulerPort": <your_cluster_rm1_scheduler_port>, "adminPort": <your_cluster_rm1_admin_port>, "webappPort": <your_cluster_rm1_webapp_port> }, "rm2": { "port": <your_cluster_rm2_port>, "schedulerPort": <your_cluster_rm2_scheduler_port>, "adminPort": <your_cluster_rm2_admin_port>, "webappPort": <your_cluster_rm2_webapp_port> } } } }, |
Configure the high availability namenodes. The following example configures two namenodes (nn1 and nn2), including the default port numbers for HDI:
Tip: Host and port settings should be available in the cluster configuration files you copied to the platform node.
"hdfs": { ... "highAvailability": { "namenodes": { "nn1": { "port": <your_cluster_namenode1_port> }, "nn2": { "port": <your_cluster_namenode2_port> } } ... |
NOTE: If you are deploying high availability failover, you must use HttpFS, instead of WebHDFS, for communicating with HDI. Additional configuration is required. HA in a Kerberized environment for HDI is not supported. See Enable Integration with Cluster High Availability.
Limitations:
The default port for the Hive connection is 10001 for HTTP. For more information, see Create Hive Connections.
Hive integration requires additional configuration.
NOTE: Natively, HDI supports high availability for Hive via a Zookeeper quorum.
For more information, see Configure for Hive.
If you are using Spark for profiling, you must add environment properties to your cluster configuration. See Configure for Spark.
If you are using user-defined functions (UDFs) on your HDInsight cluster, additional configuration is required. See Java UDFs.
Before you begin running jobs, you must specify your base storage layer, which can be WASB or ADLS. For more information, see Set Base Storage Layer.
Additional configuration is required: