...

Excerpt

This section describes how to enable the 

D s platform
rtrue
 to read sources in Hive and write results back to Hive.

  • A Hive source is a single table in a selected Hive database.
  • Apache Hive is a data warehouse system for managing queries against large datasets distributed across a Hadoop cluster. Queries are managed using HiveQL, a SQL-like querying language. See https://hive.apache.org/
  • The platform can publish results to Hive as part of any normal job or on an ad-hoc basis for supported output formats.
  • Hive is also used by the 
    D s platform
     to publish metadata results. This capability shares the same configuration described below.

Supported Versions:

Hive Version | Master namenode | Notes
Hive 1.x | HiveServer2 | All supported Hadoop deployments
Hive 2.x | HiveServer2 (Interactive) | Supported on HDP 2.6 only.

 

Prerequisites

  1. HDFS has been specified as your base storage layer. See Set Base Storage Layer.

  2. HiveServer2 and your Hive databases must already be installed in your Hadoop cluster.

    Info

    NOTE: For JDBC interactions, the 

    D s platform
     supports HiveServer2 only.


  3. You have verified that Hive is working correctly. A connectivity check sketch is provided after this list.
  4. You have acquired and deployed the hive-site.xml configuration file into your 
    D s item
    itemdeployment
    . See Configure for Hadoop.
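
To verify HiveServer2 connectivity over JDBC (items 2 and 3 above), you can run a simple check outside the platform. The following is a minimal sketch, assuming an unsecured (non-Kerberos) HiveServer2 endpoint, placeholder host, port, and username, and the Hive JDBC driver on the classpath; adjust the URL and credentials for your cluster.

Code Block
// Minimal HiveServer2 connectivity check over JDBC.
// Host, port, and username are placeholders for your environment.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveConnectivityCheck {
    public static void main(String[] args) throws Exception {
        // Older Hive JDBC drivers may require explicit driver registration.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        String url = "jdbc:hive2://hive-server-host:10000/default"; // placeholder
        try (Connection conn = DriverManager.getConnection(url, "hive-user", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SHOW DATABASES")) {
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        }
    }
}

If the check fails, confirm the HiveServer2 host and port with your Hadoop administrator before proceeding.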

Known Limitations

  1. Changes to the underlying Hive schema are not picked up by the 
    D s platform
     and will break the source and datasets that use it.
  2. High availability is not supported for Hive. 
  3. During import, the JDBC data types from Hive are converted to 
    D s item
    itemdata types
    . When data is written back to Hive, the original Hive data types may not be preserved. For more information, see Type Conversions.
  4. Publishing to partitioned tables in Hive is supported.
    1. The schema of the results must match the schema of the partitioned table.
    2. If the schemas do not match, you may see a SchemaMismatched Exception error in the UI. You can try dropping and republishing the data; however, the newly generated table does not have partitions. A sketch for inspecting a table's columns and partitions is provided after this list.
    3. For errors publishing to partitioned columns, additional information may be available in the logs. 
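
To troubleshoot a schema mismatch, it can help to compare the job's output schema against what Hive reports for the target table. The following is a minimal sketch, assuming an unsecured HiveServer2 endpoint and hypothetical database and table names; it prints the column and partition definitions that Hive returns for the table.

Code Block
// Print the column and partition definitions Hive reports for a target table.
// Host, port, username, database, and table name are placeholders.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class DescribeHiveTable {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        String url = "jdbc:hive2://hive-server-host:10000/default"; // placeholder
        try (Connection conn = DriverManager.getConnection(url, "hive-user", "");
             Statement stmt = conn.createStatement();
             // DESCRIBE lists regular columns first, then a "# Partition Information" block.
             ResultSet rs = stmt.executeQuery("DESCRIBE mydb.my_partitioned_table")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getString(2));
            }
        }
    }
}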
Info

NOTE: Running user-defined functions for an external service, such as Hive, is not supported from within a recipe step. As a workaround, you may be able to execute recipes containing such external UDFs on the Photon running environment. Performance issues should be expected on larger datasets.

Configure for Hive

Hive user

The user with which Hive connects to read from HDFS should be a member of the user group

D s defaultuser
Typehive.group
Fulltrue
 or whatever group is used to access HDFS from the 
D s platform
.

Verify that the Unix or LDAP group

D s defaultuser
Typeos.group
Fulltrue
has read access to the Hive warehouse directory.

Hive user for Spark:

Info

NOTE: If you are executing jobs in the Spark running environment, additional permissions may be required. If the Hive source is a reference or references to files stored elsewhere in HDFS, the Hive user or its group must have read and execute permissions on the source HDFS directories or files.
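
One way to spot-check these HDFS permissions is with the Hadoop FileSystem API. The following is a minimal sketch, assuming the Hadoop client configuration (core-site.xml and hdfs-site.xml) is on the classpath and using placeholder paths for the Hive warehouse directory and an example source directory; it prints the owner, group, and permission bits reported by HDFS, which you can compare against the Hive user's group.

Code Block
// Print owner, group, and permissions for HDFS paths the Hive user must read.
// Paths are placeholders; substitute your warehouse and source locations.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsPermissionCheck {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml and hdfs-site.xml from the classpath.
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(conf)) {
            String[] paths = {
                "/user/hive/warehouse",        // Hive warehouse directory (placeholder)
                "/data/source/external_files"  // HDFS files referenced by a Hive source (placeholder)
            };
            for (String p : paths) {
                FileStatus status = fs.getFileStatus(new Path(p));
                System.out.println(p + " owner=" + status.getOwner()
                        + " group=" + status.getGroup()
                        + " perms=" + status.getPermission());
            }
        }
    }
}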

 

Enable Data Service

In platform configuration, the 

D s item
itemdata service
 must be enabled.
D s config

Please verify the following:

Code Block
"data-service.enabled": true,

Locate the Hive JDBC Jar

In platform configuration, you must verify that the following parameter points to the correct location of the Hive JDBC JAR file. The example below identifies the location for Cloudera 5.10:

Info

NOTE: This parameter varies for each supported distribution and version.


Code Block
"data-service.hiveJdbcJar": "hadoop-deps/cdh-5.10/build/libs/cdh-5.10-hive-jdbc.jar",

Enable Hive Support for Spark Job Service

If you are using the Spark running environment to execute and profile jobs, you must enable Hive support within the Spark Job Service configuration block.

Info

NOTE: The Spark running environment is the default running environment. When this change is made, the platform requires that a valid hive-site.xml cluster configuration file be installed on the

D s item
itemnode
.

Steps:

  1. D s config
  2. Locate the following setting and verify that it is set to true:

    Code Block
    "spark-job-service.enableHiveSupport" : true,


  3. Modify the following parameter to point to the location where Hive dependencies are installed. This example points to the location for Cloudera 5.10:

    Info

    NOTE: This parameter value is distribution-specific. Please update based on your specific distribution.


    Code Block
    "spark-job-service.hiveDependenciesLocation":"%(topOfTree)s/hadoop-deps/cdh-5.10/build/libs",


  4. Save your changes.

Enable Hive Database Access for Spark Job Service

The Spark Job Service requires read access to the Hive databases. Please verify that the Spark user can access the required Hive databases and tables.
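
One way to spot-check this access is to connect over JDBC as the user that the Spark Job Service runs as and query the relevant database. The following is a minimal sketch with placeholder host, username, database, and table names; a JDBC check only approximates the access the service needs, so treat it as a quick sanity test.

Code Block
// Spot-check that the Spark Job Service user can list and read Hive tables.
// Host, port, username, database, and table names are placeholders.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class SparkUserHiveAccessCheck {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        String url = "jdbc:hive2://hive-server-host:10000/default"; // placeholder
        try (Connection conn = DriverManager.getConnection(url, "spark-service-user", "");
             Statement stmt = conn.createStatement()) {
            // List the tables visible to this user in the target database.
            try (ResultSet rs = stmt.executeQuery("SHOW TABLES IN mydb")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1));
                }
            }
            // Confirm read access to a specific source table.
            try (ResultSet rs = stmt.executeQuery("SELECT * FROM mydb.my_source_table LIMIT 1")) {
                System.out.println("Read check " + (rs.next() ? "returned a row." : "returned no rows."));
            }
        }
    }
}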

For more information, please contact your Hive administrator.

Configure managed table format

The

D s platform
publishes to Hive using managed tables. When writing to Hive, the platform first writes to an external staging table. From this staging table, the platform then selects the results and inserts them into a managed table.
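
For illustration only, the general stage-then-insert pattern can be expressed as HiveQL statements issued over JDBC, as in the sketch below. The table names, host, and username are hypothetical, and this is not the platform's internal implementation; it only shows the shape of the operation.

Code Block
// Illustration of the stage-then-insert pattern: results land in a staging
// table and are then selected into a managed table stored as Parquet.
// Table names, host, and username are hypothetical.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class StageAndInsertExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        String url = "jdbc:hive2://hive-server-host:10000/mydb"; // placeholder
        try (Connection conn = DriverManager.getConnection(url, "hive-user", "");
             Statement stmt = conn.createStatement()) {
            // Create an empty managed table with the same schema as the staging table.
            stmt.execute("CREATE TABLE results_managed STORED AS PARQUET "
                    + "AS SELECT * FROM results_staging WHERE 1 = 0");
            // Copy the staged results into the managed table.
            stmt.execute("INSERT INTO TABLE results_managed SELECT * FROM results_staging");
        }
    }
}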

By default, the platform publishes to managed tables in Parquet format. As needed, you can set one of the following values in platform configuration to change the format that the platform uses when publishing to a managed table:

  • PARQUET (default)
  • AVRO

To change the format, please modify the following parameter.

Steps:

  1. D s config
  2. Locate the following parameter and modify it using one of the above values:

    Code Block
    "data-service.hiveManagedTableFormat": "PARQUET",


  3. Save your changes and restart the platform.

Create Hive Connection

For more information, see Create Hive Connections.

Optional Configuration

Depending on your Hadoop environment, you may need to perform additional configuration to enable connectivity with your Hadoop cluster.

Additional Configuration for Secure Environments

...