Release 9.2



This section describes the basics of creating a connection to Hive from Designer Cloud powered by Trifacta® Enterprise Edition and modifying it based on your cluster environment.

For more information on supported versions of Hive, see Configure for Hive.

Enable

After you have configured your HDFS base storage layer, you can enable and configure your connection to the associated instance of Hive.

NOTE: Additional configuration is required.

For more information, see Configure for Hive.


Limitations

NOTE: The platform supports a single, global connection to Hive. All users must use this connection.

Create Encryption Key File

Before you create your Hive connection, you must create and reference an encryption key file. If you have not created one already, see Create Encryption Key File.

Create Hive Connection

You can create the connection through the application or the APIs.

NOTE: The following configuration methods apply to creation of an insecure connection to a Hive 1.x instance. If you are applying security, using HTTP, or connecting to Hive 2.x, additional configuration is required. Before creating these connections, please review the Additional Configuration Options section below.

Create through application

A Trifacta administrator can create the Hive connection through the application.

Steps:

  1. Log in to the application.
  2. In the menu, select User menu > Preferences > Connections.
  3. In the Create Connection page, click the Hive connection card.
  4. Specify the properties for your Hive database.

    1. Before you create the connection, review the rest of this section to verify that you are using the proper connection string options for the Hive connection to work in your environment.

      NOTE: If secure impersonation is enabled on your cluster, you must include the Hive principal as part of your connection string. For more information, see Configure for Hive.


    2. For more information, see Create Connection Window.

  5. Click Save.

Create via API

For more information, see the API Reference: https://api.trifacta.com/ee/9.2/index.html#operation/createConnection

  • Type: hive
  • Vendor: hive
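
The exact request schema is defined in the API Reference linked above. As an illustrative sketch only, a request body for a global Hive connection might look like the following; the name, host, port, and params values shown here are placeholder assumptions, not documented requirements:

```json
{
  "name": "hive-global",
  "type": "hive",
  "vendor": "hive",
  "host": "hive-server.example.com",
  "port": 10000,
  "isGlobal": true,
  "params": {
    "jdbc": "hive2",
    "defaultDatabase": "default",
    "connectStrOpts": ""
  }
}
```

Because the platform supports only a single, global connection to Hive, the connection would typically be created as a global one.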

Additional Configuration Options

Configure for HTTP

By default, the Designer Cloud powered by Trifacta platform uses TCP to connect to Hive. As needed, the platform can be configured to interact with HiveServer2 over HTTP.

NOTE: HTTP connections are enabled by inserting additional parameters as part of the connection string. In some environments, the JDBC driver may automatically insert these parameters, based on configuration in hive-site.xml. If these parameters are inserted automatically, the Designer Cloud powered by Trifacta platform does not need additional configuration. For more information, see https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Clients#HiveServer2Clients-ConnectionURLWhenHiveServer2IsRunninginHTTPMode.

NOTE: If you are planning to use SSL, additional configuration is required. See below.


Steps:

To enable the Designer Cloud powered by Trifacta platform to use HTTP when connecting to Hive, complete the following steps.

  1. Create a params file for your Hive connection. This file must contain at least the following entry:

        "connectStrOpts": ";transportMode=http;httpPath=cliservice"
  2. Save the file. 
  3. In your command, set the connection port value to 10001.
  4. Execute the command.

  5. Test the Hive connection.

  6. If the connection fails, delete it and recreate it with corrected settings.
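
Assembled as a complete params file, the HTTP settings might look like the following sketch; the defaultDatabase and jdbc values shown are common defaults, not requirements:

```json
{
  "connectStrOpts": ";transportMode=http;httpPath=cliservice",
  "defaultDatabase": "default",
  "jdbc": "hive2"
}
```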

Configure for SASL

NOTE: Cloudera supports an additional method for communicating over SSL with Hive. For more information on how to identify the method used by your Cloudera cluster, see Configure for Cloudera.

The steps below describe how to enable SASL-QOP (Simple Authentication and Security Layer quality of protection) for communications with Hive. To enable it, you must add a connection string parameter.

Steps:

  1. Create or edit the params file for your connection to Hive. This file must include the following setting for connectStrOpts:

    {
      "connectStrOpts": ";sasl.qop=auth-conf",
      "defaultDatabase": "default",
      "jdbc": "hive2"
    }
  2. Save the file.
  3. In your command, set the connection port value to 10001.
  4. Execute the command.
  5. Test the Hive connection.

  6. If the connection fails, delete it and recreate it with corrected settings.

Configure for Kerberos

For a Kerberos-enabled cluster, you must include the Kerberos principal value as part of the Hive connection string options.

Complete the following steps to add the Kerberos principal value for Hive to your Hive connection.

Steps:

  1. Retrieve the Kerberos principal value for your deployment.

  2. Apply the Kerberos principal to your Hive connection. Add the Hive principal to the Connect String Options in your Hive connection. See Create Connection Window.

  3. Save your changes.
  4. Test the connection.
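
As an illustrative sketch, the resulting Connect String Options might use the standard HiveServer2 JDBC principal option. The principal value below is a placeholder; substitute the value retrieved in step 1:

```json
{
  "connectStrOpts": ";principal=hive/_HOST@EXAMPLE.COM",
  "defaultDatabase": "default",
  "jdbc": "hive2"
}
```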

Run Hive jobs on a specific YARN queue

If needed, you can route any Spark jobs sourced from Hive to a specific YARN job queue.

Steps:

  1. Add the following option to the Connection String Options (connectStrOpts) in your Hive connection:

      "connectStrOpts": ";sess_var_list?mapred.job.queue.name=<your_queue>",
    

    where:
    <your_queue> is the name of the YARN job queue.

  2. Save the change.
  3. Test the connection. 
  4. Run a job, and verify that it is routed to the appropriate YARN job queue.
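
Assembled as a complete params file, the queue-routing settings might look like the following sketch; etl_queue is a placeholder queue name, and the defaultDatabase and jdbc values shown are common defaults:

```json
{
  "connectStrOpts": ";sess_var_list?mapred.job.queue.name=etl_queue",
  "defaultDatabase": "default",
  "jdbc": "hive2"
}
```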

Using Hive Connections

This section describes how you interact with your Hive data warehouse through the Designer Cloud powered by Trifacta® platform.

  • Hive is an open-source, scalable data warehouse built on top of the Hadoop infrastructure to enable SQL-like access to a datastore where processing is converted to Hadoop map/reduce tasks. Hive users can interact directly with the databases and tables using HiveQL, a querying language similar to SQL. For more information, see https://en.wikipedia.org/wiki/Apache_Hive.
  • Hive sources can be stored as Hive database tables and views, or you can select or create tables to which to publish job results. 

Uses of Hive

The Designer Cloud powered by Trifacta platform can use Hive for the following tasks:

  1. Create datasets by reading from Hive tables.
  2. Write data to Hive.

Before you begin using Hive

  • Read Access: Your Hadoop administrator must configure read permissions to Hive databases.

    • Your Hadoop administrator should provide a database table or tables for data upload to your Hive datastore.

  • Write Access: You can write job results directly to Hive or publish results to Hive ad hoc at a later time. See Writing to Hive below.

Secure Access

Depending on the security features you've enabled, the technical methods by which Trifacta users access Hive may vary. For more information, see Configure Hadoop Authentication.

Reading partitioned data

The Designer Cloud powered by Trifacta platform can read in partitioned tables. However, it cannot read individual partitions of partitioned tables.

Tip: If you are reading data from a partitioned table, one of your early recipe steps in the Transformer page should filter out the unneeded table data so that you are reading only the records of the individual partition.

Storing data in Hive

Your Hadoop administrator should provide datasets or locations and access for storing datasets within Hive. 

  • Users should know where shared data is located and where personal data can be saved without interfering with or confusing other users. 

NOTE: The Designer Cloud powered by Trifacta platform does not modify source data in Hive. Datasets sourced from Hive are read without modification from their source locations.

Reading from Hive

You can create a Trifacta dataset from a table or view stored in Hive.

For more information on how data types are imported from Hive, see Hive Data Type Conversions.

Notes on reading from Hive views using custom SQL

If you have enabled custom SQL and are reading data from a Hive view, nested functions are assigned temporary column names, unless they are explicitly aliased.

Tip: If your custom SQL uses nested functions, you should create an explicit alias for the results. Otherwise, the job is likely to fail.

Problematic example:

SELECT
 UPPER(`t1`.`column1`),
 TRIM(`t1`.`column2`),...

When these columns are read from a Hive view, the temporary column names are _c0, _c1, etc. During job execution, Spark ignores the column1 and column2 references.

Improved Example:

SELECT
 UPPER(`t1`.`column1`) as col1,
 TRIM(`t1`.`column2`) as col2,...

In this improved example, the two Hive view columns are aliased to the explicit column names, which are correctly interpreted and used by the Spark running environment during job execution.

Writing to Hive

You can write data back to Hive using one of the following methods:

NOTE: You cannot publish to a Hive database that is empty. The database must contain at least one table.

NOTE: If you are writing to unmanaged tables in Hive, the create and drop-and-load publishing actions are not supported.


  • Job results can be written directly to Hive as part of the normal job execution. Create a new publishing action to write to Hive. See Run Job Page.
  • As needed, you can publish results to Hive for previously executed jobs. See Publishing Dialog.
  • For more information on how data is converted to Hive, see Hive Data Type Conversions.

SQL Syntax

The following syntax requirements apply to this connection.

Object delimiter: backtick

Example syntax:

SELECT `column1`,`column2` FROM `databaseName`.`tableName`;

For more information on SQL in general, see Supported SQL Syntax.

Reference

Create New Connection:

NOTE: A single public Hive connection is supported.
