Configure for Hive

This section describes how to enable the Designer Cloud Powered by Trifacta platform to read sources in Hive and write results back to Hive.

  • A Hive source is a single table in a selected Hive database.

  • Apache Hive is a data warehouse system for managing queries against large datasets distributed across a Hadoop cluster. Queries are managed using HiveQL, a SQL-like querying language. See https://hive.apache.org/.

  • The platform can publish results to Hive as part of any normal job or on an ad-hoc basis for supported output formats.

  • Hive is also used by the Designer Cloud Powered by Trifacta platform to publish metadata results. This capability shares the same configuration described below.

  • ORC tables managed through Hive can be used as datasources for the platform.

Prerequisites

  1. HiveServer2 and your Hive databases must already be installed in your Hadoop cluster.

    Note

    For JDBC interactions, the Designer Cloud Powered by Trifacta platform supports HiveServer2 only.

  2. You have verified that Hive is working correctly.

  3. You have acquired and deployed the hive-site.xml configuration file into your Alteryx deployment. See Configure for Hadoop.

Limitations

  1. Only one global connection to Hive is supported.

  2. Changes to the underlying Hive schema are not picked up by the Designer Cloud Powered by Trifacta platform and will break the source and datasets that use it.

  3. During import, the JDBC data types from Hive are converted to Alteryx data types. When data is written back to Hive, the original Hive data types may not be preserved. For more information, see Type Conversions.

  4. Publishing to unmanaged tables in Hive is supported, except for the following actions:

    1. Create table

    2. Drop & load table

  5. Publishing to partitioned tables in Hive is supported.

    1. The schema of the results and the partitioned table must be the same.

    2. If they do not match, you may see a SchemaMismatchedException error in the UI. You can try a drop-and-republish action on the data. However, the newly generated table does not have partitions.

    3. For errors publishing to partitioned columns, additional information may be available in the logs.

  6. Writing or publishing to ORC tables through Hive is not supported.

Note

Running user-defined functions for an external service, such as Hive, is not supported from within a recipe step. As a workaround, you may be able to execute recipes containing such external UDFs on the Photon running environment. Performance issues should be expected on larger datasets.

Configure for Hive

Hive user

The user that Hive uses to connect to the backend datastore should be a member of the user group [hive.group (default=trifactausers)], or whatever group is used to access storage from the Designer Cloud Powered by Trifacta platform.

Verify that the Unix or LDAP group [os.group (default=trifacta)] has read access to the Hive warehouse directory.
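One way to spot-check this is to list the warehouse directory and inspect its group permissions. This is a sketch only: `/user/hive/warehouse` is the common default warehouse location and may differ in your cluster.

```shell
# Confirm the actual location first via hive.metastore.warehouse.dir
# in hive-site.xml; /user/hive/warehouse is only the common default.
hdfs dfs -ls /user/hive/warehouse
# In the listing, the group (e.g. trifacta) should have read (r) access.
```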

Hive user for Spark:

Note

If you are executing jobs in the Spark running environment, additional permissions may be required. If the Hive source is a reference or references to files stored elsewhere in backend storage, the Hive user or its group must have read and execute permissions on the source directories or files.

Enable Data Service

In platform configuration, the Alteryx data service must be enabled. You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.

Please verify the following:

"data-service.enabled": true,

Locate the Hive JDBC Jar

In platform configuration, you must verify that the following parameter points to the proper location of the Hive JDBC JAR file. The example below identifies the location for Cloudera 6.2:

Note

This parameter varies for each supported distribution and version.

"data-service.hiveJdbcJar": "hadoop-deps/cdh-6.2/build/libs/cdh-6.2-hive-jdbc.jar",

Enable Hive Support for Spark Job Service

If you are using the Spark running environment for job execution and profiling, you must enable Hive support within the Spark Job Service configuration block.

Note

The Spark running environment is the default running environment. When this change is made, the platform requires that a valid hive-site.xml cluster configuration file be installed on the Trifacta node.

Steps:

  1. You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.

  2. Locate the following setting and verify that it is set to true:

    "spark-job-service.enableHiveSupport" : true,
  3. Modify the following parameter to point to the location where Hive dependencies are installed. This example points to the location for Cloudera 6.2:

    Note

    This parameter value is distribution-specific. Please update based on your specific distribution.

    "spark-job-service.hiveDependenciesLocation":"%(topOfTree)s/hadoop-deps/cdh-6.2/build/libs",
  4. Save your changes.

Enable Hive Database Access for Spark Job Service

The Spark Job Service requires read access to the Hive databases. Please verify that the Spark user can access the required Hive databases and tables.

For more information, please contact your Hive administrator.

Configure managed table format

The Designer Cloud Powered by Trifacta platform publishes to Hive using managed tables. When writing to Hive, the platform pushes to an externally staged table. Then, from this staging table, the platform selects and inserts into a managed table.
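The write path above can be sketched in HiveQL. All table names, columns, and the staging location here are hypothetical; the platform performs the equivalent statements internally.

```sql
-- Hypothetical names and location, for illustration only.
-- 1. Job results land in an externally staged table.
CREATE EXTERNAL TABLE results_staging (id BIGINT, name STRING)
  STORED AS PARQUET
  LOCATION '/trifacta/queryResults/results_staging';

-- 2. The platform selects from the staging table into the managed table.
INSERT INTO TABLE results_managed
  SELECT * FROM results_staging;
```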

By default, the platform publishes to managed tables in Parquet format. As needed, you can apply one of the following values in platform configuration to change the format that the platform writes when publishing a managed table:

  • PARQUET (default)

  • AVRO

To change the format, please modify the following parameter.

Steps:

  1. You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.

  2. Locate the following parameter and modify it using one of the above values:

    "data-service.hiveManagedTableFormat": "PARQUET",
  3. Save your changes and restart the platform.

Create Hive Connection

Note

High availability for Hive is supported through configuration of the Hive connection.

For more information, see Hive Connections.

Optional Configuration

Depending on your Hadoop environment, you may need to perform additional configuration to enable connectivity with your Hadoop cluster.

Additional Configuration for Secure Environments

For secure impersonation

Note

You should have already configured the Designer Cloud Powered by Trifacta platform to use secure impersonation. For more information on basic configuration, see Configure for Secure Impersonation.

You must add the Hive principal value to your Hive connection. Add the following principal value to the Connect String Options textbox.

  "connectStrOpts": ";principal=<principal_value>",

For Kerberos with secure impersonation

Note

You should have already enabled basic Kerberos integration. For more information, see Configure for Kerberos Integration.

Note

If you are enabling Hive in a Kerberized environment, you must also enable secure impersonation. When connecting to Hive, Kerberos without secure impersonation is not supported. You should have already configured the Designer Cloud Powered by Trifacta platform to use secure impersonation. For more information on basic configuration, see Configure for Secure Impersonation.

Additional Configuration for Sentry

The Designer Cloud Powered by Trifacta platform can be configured to use Sentry to authorize access to Hive. See Configure for Hive with Sentry.

Enable appending to Hive tables without full permissions

Optionally, you can enable users to publish to Hive tables for which they do not have CREATE or DROP permissions.

If they have read (SELECT) and append (INSERT) permissions on a Hive schema, they can be permitted to append to the production schema using a separate schema that matches the production one. The Designer Cloud Powered by Trifacta platform does the following:

  1. CREATE a staging table in the schema specified in the User Profile.

    Note

    These schemas must be created. Users must be given CREATE and DROP permissions on them.

  2. INSERT the output data into the staging table.

  3. Via INSERT, copy over data from the staging table to the production schema, effectively performing an append on the production table.
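The three steps above correspond roughly to the following HiveQL. The schema and table names are hypothetical; the platform issues the equivalent statements on the user's behalf.

```sql
-- Hypothetical names, for illustration only.
-- 1. Staging table created in the schema specified in the User Profile.
CREATE TABLE user_staging.job_results_tmp (id BIGINT, name STRING);

-- 2. The platform inserts the generated output into the staging table.

-- 3. Copy from staging into production, effectively an append.
INSERT INTO TABLE production.job_results
  SELECT * FROM user_staging.job_results_tmp;
```

Note that the user needs CREATE and DROP permissions only on the staging schema, and SELECT and INSERT permissions on the production schema.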

Steps:

Note

This feature must be enabled.

  1. You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.

  2. Search for the following setting:

    "feature.showHiveStagingDB": true,
  3. Save your changes and restart the platform.

Usage:

  1. Staging schemas must be created in Hive.

  2. Each user must insert the name of the staging schema in their user profile once. For more information, see User Profile Page.

  3. When users generate results to Hive, they choose to publish to the production schema as an append operation.

    1. For more information, see Run Job Page.

    2. For more information, see Publishing Dialog.

  4. With this feature enabled, the platform uses the specified staging schema and publishing mechanism to perform the append to the production schema.

Validate Configuration

Note

The platform cannot publish to a default database in Hive that is empty. Please create at least one table in your default database.

Build example Hive dataset

Steps:

  1. Download and unzip the following dataset: Dataset-HiveExampleData.

  2. Store the dataset in the following example directory:

    /tmp/hiveTest_5mb
  3. Use the following command to create your table:

    create table test (name string, id bigint, id2 bigint, randomName string, description string, dob string, title string, corp string, fixedOne bigint, fixedTwo int) row format delimited fields terminated by ',' STORED AS TEXTFILE;
  4. Add the example dataset to the above test table:

    load data local inpath '/tmp/hiveTest_5mb' into table test;
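To confirm the load succeeded, you can run a quick query against the new table:

```sql
select * from test limit 5;
select count(*) from test;
```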

Check basic connectivity

Steps:

  1. After restart, log in to the Trifacta Application. See Login.

  2. If the platform can connect to Hive, a Hive tab appears in the Import Data page. Create a new dataset and verify that the Hive data has been ingested in the Transformer page.

  3. If not, please check /opt/trifacta/logs/data-service.log for errors.

  4. For more information, see Verify Operations.
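If the Hive tab does not appear and the logs are inconclusive, it can help to test whether HiveServer2 itself is reachable over JDBC from the Trifacta node. This is a sketch: the host and port below are placeholders for your environment.

```shell
# Placeholders: substitute your HiveServer2 host and port. For Kerberized
# clusters, append ";principal=<principal_value>" to the JDBC URL.
beeline -u "jdbc:hive2://<hiveserver2_host>:10000/default" -e "SHOW TABLES;"
```

If this command fails, the problem lies with the cluster or network rather than with the platform configuration.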