This section describes how to enable the
| Hive Version | Master namenode | Notes |
| Hive 1.x | HiveServer2 | All supported Hadoop deployments |
| Hive 2.x | HiveServer2 (Interactive) | Supported on HDP 2.6 only. |
NOTE: For JDBC interactions, the
hive-site.xml configuration file into your
|D s item|
- Only one global connection to Hive is supported.
- Changes to the underlying Hive schema are not picked up by the platform and will break the source and datasets that use it.
- During import, the JDBC data types from Hive are converted to platform data types. When data is written back to Hive, the original Hive data types may not be preserved. For more information, see Type Conversions.
- Publish to unmanaged tables in Hive is supported, except for the following actions:
- Create table
- Drop & load table
- Publish to partitioned tables in Hive is supported.
- The schema of the results and the partitioned table must be the same.
- If they do not match, you may see a SchemaMismatched Exception error in the UI. You can try a drop and republish action on the data. However, the newly generated table does not have partitions.
- For errors publishing to partitioned columns, additional information may be available in the logs.
- Writing or publishing to ORC tables through Hive is not supported.
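The schema-match requirement for partitioned tables can be checked before a publish is attempted. The following is a minimal sketch, not part of the platform: it assumes each schema is available as an ordered list of (column name, Hive type) pairs, and the function name `schema_mismatches` is hypothetical. Names and types are compared case-insensitively, which is an assumption of the sketch.

```python
def schema_mismatches(result_schema, table_schema):
    """Return a list of differences between two schemas.

    Each schema is an ordered list of (column_name, hive_type) tuples.
    An empty list means the schemas line up column by column.
    """
    problems = []
    if len(result_schema) != len(table_schema):
        problems.append(
            f"column count differs: {len(result_schema)} vs {len(table_schema)}"
        )
    # Compare the columns that both schemas have, position by position.
    for position, (left, right) in enumerate(zip(result_schema, table_schema)):
        left_norm = (left[0].lower(), left[1].lower())
        right_norm = (right[0].lower(), right[1].lower())
        if left_norm != right_norm:
            problems.append(f"position {position}: {left} does not match {right}")
    return problems
```

An empty result suggests the publish should not fail on a schema mismatch. Any reported difference is worth resolving before resorting to a drop and republish, since that action loses the partitions.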
NOTE: Running user-defined functions for an external service, such as Hive, is not supported from within a recipe step. As a workaround, you may be able to execute recipes containing such external UDFs on the Photon running environment. Performance issues should be expected on larger datasets.
Configure for Hive
The user with which Hive connects to read from the backend datastore should be a member of the user group
|D s defaultuser|
|D s platform|
Verify that the Unix or LDAP group
has read access to the Hive warehouse directory.
D s defaultuser Type os.group Full true
Hive user for Spark:
NOTE: If you are executing jobs in the Spark running environment, additional permissions may be required. If the Hive source is a reference or references to files stored elsewhere in backend storage, the Hive user or its group must have read and execute permissions on the source directories or files.
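The read and execute requirement for Spark can be reasoned about with ordinary POSIX-style permission bits. The sketch below is illustrative only: the function name `can_read_and_execute` is hypothetical, the mode is assumed to be available as an integer such as `0o750`, and ACLs or other cluster-side authorization layers are not considered.

```python
import stat


def can_read_and_execute(mode, is_owner, in_group):
    """Check whether a user has both read and execute bits on a path.

    mode      -- permission bits as an integer, for example 0o750
    is_owner  -- True if the user owns the path
    in_group  -- True if the user belongs to the path's group
    """
    if is_owner:
        needed = stat.S_IRUSR | stat.S_IXUSR
    elif in_group:
        needed = stat.S_IRGRP | stat.S_IXGRP
    else:
        needed = stat.S_IROTH | stat.S_IXOTH
    # Both bits must be set for the applicable permission class.
    return (mode & needed) == needed
```

For example, a directory with mode `0o750` passes this check for a group member, while `0o740` does not because the group execute bit is missing.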
Enable Data Service
In platform configuration, the
|D s item|
|D s config|
Please verify the following:
Locate the Hive JDBC Jar
In platform configuration, you must verify that the following parameter is pointing to the proper location for the Hive JDBC JAR file. The example below identifies the location for Cloudera 5.10:
NOTE: This parameter varies for each supported distribution and version.
NOTE: If you are integrating with HDP 18.104.22.1687-2 or later, there is a known incompatibility between the cluster and the version of the Hadoop bundle JAR that is shipped with the product. The solution is to use bundle JARs from an earlier compatible version. For more information, see Configure for Hortonworks.
Enable Hive Support for Spark Job Service
If you are using the Spark running environment for execution and profiling jobs, you must enable Hive support within the Spark Job Service configuration block.
NOTE: The Spark running environment is the default running environment. When this change is made, the platform requires that a valid
Locate the following setting and verify that it is set to true:
"spark-job-service.enableHiveSupport" : true,
Modify the following parameter to point to the location where Hive dependencies are installed. This example points to the location for Cloudera 5.10:
NOTE: This parameter value is distribution-specific. Please update based on your specific distribution.
- Save your changes.
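The `spark-job-service.enableHiveSupport` setting can also be verified with a short script. This is a minimal sketch under stated assumptions: the configuration is assumed to be JSON text in which the dotted key maps to nested objects, and the helper names `get_dotted` and `hive_support_enabled` are hypothetical.

```python
import json


def get_dotted(config, dotted_key):
    """Walk a nested dict using a dotted key; return None if any part is missing."""
    node = config
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node


def hive_support_enabled(config_text):
    """Return True only if the setting is the JSON boolean true."""
    config = json.loads(config_text)
    return get_dotted(config, "spark-job-service.enableHiveSupport") is True
```

The check is deliberately strict: a string value such as `"true"` is treated as not enabled, so a quoting mistake is reported instead of being silently accepted.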
Enable Hive Database Access for Spark Job Service
The Spark Job Service requires read access to the Hive databases. Please verify that the Spark user can access the required Hive databases and tables.
For more information, please contact your Hive administrator.
Configure managed table format
By default, the platform publishes to managed tables in Parquet format. As needed, you can apply the following values in platform configuration to change the format in which the platform writes when publishing a managed table:
To change the format, please modify the following parameter.
Locate the following parameter and modify it using one of the above values:
- Save your changes and restart the platform.
Additional configuration for HDP 3.x
If you are integrating with an HDP 3.x cluster, please add the following to the Spark Job Service classpath:
Add the following value to the Spark Job Service classpath:
Example (No LLAP or Hive Warehouse):
"classpath": "%(topOfTree)/etc/hive/conf:%(topOfTree)s/ \
Save your changes. Before restarting the platform, please review the following section.
Additional configuration for Hive 3.0 on HDP 3.x
NOTE: Hive 3.0 is supported only on Hortonworks HDP 3.x using the Hive Warehouse Connector to read from Hive.
Tables in Hive 3.0 are ACID-compliant, transactional tables. Since Spark cannot natively read transactional tables, the platform must use the Hive Warehouse Connector to read them.
- The Hive Warehouse Connector connects to LLAP, which can run the Hive queries.
- Low Latency Analytics Processing (LLAP) is a Hortonworks framework that uses long-lived daemons in YARN for Hive query execution and in-memory caching of Hive data.
- For more information, see https://hortonworks.com/blog/top-5-performance-boosters-with-apache-hive-llap/.
NOTE: If Ranger is deployed on the cluster, Spark respects any column- or row-level security that Ranger enforces on the Hive tables. Queries for unauthorized data in a table fail in the
Please complete the following steps to integrate the platform with Hive 3.0:
NOTE: Before you begin, please verify that you have performed the extra configuration for using Spark on HDP 3.x. For more information, see Configure for Spark.
Enable use of the Hive Warehouse Connector:
Add the Hive Warehouse Connector to the Spark Job Service classpath. Example:
NOTE: If you have already configured for HDP 3.x, then the (sparkBundleJar) update below may have already been added.
The following properties and values must be inserted in the
NOTE: These properties must be added to the platform configuration. They cannot be read from Ambari.
"spark.datasource.hive.warehouse.load.staging.dir": "/tmp",
"spark.datasource.hive.warehouse.metastoreUri": "thrift://hdp30.example:9083",
"spark.driver.extraLibraryPath": "/usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64",
"spark.executor.extraJavaOptions": "-XX:+UseNUMA",
"spark.executor.extraLibraryPath": "/usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64",
"spark.hadoop.hive.llap.daemon.service.hosts": "@llap0",
"spark.hadoop.hive.zookeeper.quorum": "hdp30.example:2181",
"spark.sql.hive.hiveserver2.jdbc.url": "jdbc:hive2://hdp30.example:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2-interactive",
"spark.sql.hive.hiveserver2.jdbc.url.principal": "hive/_HOST@HORTONWORKS",
"spark.yarn.security.credentials.hiveserver2.enabled": "true",
"spark.yarn.jars": "local:/usr/hdp/current/spark2-client/jars/*",
"spark.driver.extraClassPath": "/usr/hdp/current/spark2-client/jars/guava-14.0.1.jar",
"spark.executor.extraClassPath": "/usr/hdp/current/spark2-client/jars/guava-14.0.1.jar"
The properties listed below require information from your HDP cluster. For the other properties, please use the listed values, unless otherwise required.
"spark.datasource.hive.warehouse.metastoreUri" URI for the Hive metastore. Copy the value from hive.metastore.uris. Example value:
"spark.hadoop.hive.zookeeper.quorum" A list of ZooKeeper hosts used by LLAP. Copy the value from Advanced hive-site in Ambari:
"spark.sql.hive.hiveserver2.jdbc.url" The URL for HiveServer2 Interactive. In Ambari, copy the value from the following: Services > Hive > Summary > HIVESERVER2 INTERACTIVE JDBC URL.
"spark.sql.hive.hiveserver2.jdbc.url.principal" This property must be equal to
hive.server2.authentication.kerberos.principal. In Ambari, copy the value for this property from the following: . The property value is in
For more information on these properties, see https://docs.hortonworks.com/HDPDocuments/HDP3/HDP-3.1.0/integrating-hive/content/hive_configure_a_spark_hive_connection.html.
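Because the cluster-specific properties are copied by hand from Ambari, a quick sanity check before restarting can catch omissions. The sketch below is one possible approach, not a platform feature: the function name `check_hwc_properties` is hypothetical, the properties are assumed to be available as a Python dict, and the prefix checks are based only on the example values shown in this section.

```python
# Properties that require values from your HDP cluster.
REQUIRED = (
    "spark.datasource.hive.warehouse.metastoreUri",
    "spark.hadoop.hive.zookeeper.quorum",
    "spark.sql.hive.hiveserver2.jdbc.url",
    "spark.sql.hive.hiveserver2.jdbc.url.principal",
)


def check_hwc_properties(props):
    """Return a list of problems found in the Spark properties dict."""
    problems = []
    for key in REQUIRED:
        if not props.get(key):
            problems.append(f"missing or empty: {key}")

    # Prefix checks mirror the example values; adjust if your cluster differs.
    uri = props.get("spark.datasource.hive.warehouse.metastoreUri", "")
    if uri and not uri.startswith("thrift://"):
        problems.append("metastoreUri does not start with thrift://")

    url = props.get("spark.sql.hive.hiveserver2.jdbc.url", "")
    if url and not url.startswith("jdbc:hive2://"):
        problems.append("jdbc.url does not start with jdbc:hive2://")
    return problems
```

An empty list means the four cluster-specific properties are present and look plausible. It does not confirm that the hosts are reachable or that the Kerberos principal is correct.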
Save your changes and restart the platform.
Create Hive Connection
NOTE: High availability for Hive is supported through configuration of the Hive connection.
For more information, see Create Hive Connections.
Depending on your Hadoop environment, you may need to perform additional configuration to enable connectivity with your Hadoop cluster.
Additional Configuration for Secure Environments
For secure impersonation
NOTE: You should have already configured the
You must add the Hive principal value to your Hive connection. Add the following principal value to the Connect String Options textbox.
For Kerberos with secure impersonation