This section describes how to enable the Trifacta® platform to read sources in Hive and write results back to Hive.

Pre-requisites
- HDFS has been specified as your base storage layer. See Set Base Storage Layer.
- Hive Server 2 and your Hive databases must already be installed in your Hadoop cluster.
NOTE: For JDBC interactions, the Trifacta platform supports HiveServer2 only.
- The hive-site.xml cluster configuration file must be installed into your Trifacta deployment. See Configure for Hadoop.

Known Limitations
NOTE: Running user-defined functions for an external service, such as Hive, is not supported from within a recipe step. As a workaround, you may be able to execute recipes containing such external UDFs on the Photon running environment. Performance issues should be expected on larger datasets.

Configure for Hive

Hive user
The user with which Hive connects to read from HDFS should be a member of the user group [hive.group (default=trifactausers)], or whatever group is used to access HDFS from the Trifacta platform.
Verify that the Unix or LDAP group [os.group (default=trifacta)] has read access to the Hive warehouse directory.

Hive user for Spark:
NOTE: If you are executing jobs in the Spark running environment, additional permissions may be required. If the Hive source is a reference or references to files stored elsewhere in HDFS, the Hive user or its group must have read and execute permissions on the source HDFS directories or files.
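One way to spot-check the warehouse directory permissions is to list the directory from a Hive CLI session (the equivalent hdfs dfs command from a shell works as well). This is only a sketch: it assumes the default warehouse location of /user/hive/warehouse, while your cluster's actual location is set by hive.metastore.warehouse.dir.
-- Assumes the default warehouse path; check hive.metastore.warehouse.dir for your cluster's actual value.
dfs -ls -d /user/hive/warehouse;
The group shown on the directory should match the group described above and include read permission.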
Enable Data Service
In platform configuration, the Trifacta data service must be enabled. You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.
"data-service.enabled": true,
Locate the Hive JDBC Jar
In platform configuration, verify that the following parameter points to the proper location for the Hive JDBC JAR file. The example below identifies the location for Cloudera 5.10:
NOTE: This parameter varies for each supported distribution and version.
"data-service.hiveJdbcJar": "hadoop-deps/cdh-5.10/build/libs/cdh-5.10-hive-jdbc.jar",
Enable Hive Support for Spark Job Service
If you are using the Spark running environment for execution and profiling jobs, you must enable Hive support within the Spark Job Service configuration block.
NOTE: The Spark running environment is the default running environment. When this change is made, the platform requires that a valid hive-site.xml cluster configuration file be installed on the Trifacta node.
Steps:
You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.
- Locate the following setting and verify that it is set to true:
"spark-job-service.enableHiveSupport" : true,
- Modify the following parameter to point to the location where Hive dependencies are installed. This example points to the location for Cloudera 5.10:
NOTE: This parameter value is distribution-specific. Please update based on your specific distribution.
"spark-job-service.hiveDependenciesLocation":"%(topOfTree)s/hadoop-deps/cdh-5.10/build/libs",
Enable Hive Database Access for Spark Job Service
The Spark Job Service requires read access to the Hive databases. Please verify that the Spark user can access the required Hive databases and tables. For more information, please contact your Hive administrator.
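As a quick spot-check, assuming you can open a Hive session (for example, through beeline) as the user that the Spark Job Service uses, statements such as the following should succeed for each database and table that your jobs reference:
-- mydb and mytable below are placeholders; substitute your own database and table names.
SHOW DATABASES;
SHOW TABLES IN mydb;
SELECT * FROM mydb.mytable LIMIT 5;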
Configure managed table format
The Trifacta platform publishes to Hive using managed tables. When writing to Hive, the platform pushes to an externally staged table. Then, from this staging table, the platform selects and inserts into a managed table.
By default, the platform publishes to managed tables in Parquet format. As needed, you can apply one of the following values in platform configuration to change the format to which the platform writes when publishing a managed table:
- PARQUET (default)
- AVRO
Steps:
You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.
Locate the following parameter and modify it using one of the above values (an Avro example follows):
"data-service.hiveManagedTableFormat": "PARQUET",
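For example, to publish managed tables in Avro format instead of the default Parquet, the same parameter would be set as follows (shown only as an illustration):
"data-service.hiveManagedTableFormat": "AVRO",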
Create Hive Connection
For more information, see Create Hive Connections.
Optional Configuration
Depending on your Hadoop environment, you may need to perform additional configuration to enable connectivity with your Hadoop cluster.
Additional Configuration for Secure Environments
For secure impersonation
NOTE: You should have already configured the Trifacta platform to use secure impersonation. For more information on basic configuration, see Configure for Hive.
You must add the Hive principal value to your Hive connection:
- If you created your connection via the application: Add the following principal value to the Connect String Options textbox.
- If you created your connection via the CLI: Add it to the connectStrOpts entry in your params file:
"connectStrOpts": ";principal=<principal_value>",
For Kerberos with secure impersonation
NOTE: If you are enabling Hive in a Kerberized environment, you must also enable secure impersonation. When connecting to Hive, Kerberos without secure impersonation is not supported.
NOTE: You should have already enabled basic Kerberos integration. For more information, see Set up for a Kerberos-enabled Hadoop cluster.
NOTE: You should have already configured the Trifacta platform to use secure impersonation and added the Hive principal to the connectStrOpts value in your params file. For more information on basic configuration, see Configure for secure impersonation.
Additional Configuration for Sentry
The Trifacta platform can be configured to use Sentry to authorize access to Hive. See Configure for Hive with Sentry.
Validate Configuration
NOTE: The platform cannot publish to a default database in Hive that is empty. Please create at least one table in your default database.
Build example Hive dataset
Steps:
- Download and unzip the following dataset: Dataset-HiveExampleData.
- Store the dataset in the following example directory:
/tmp/hiveTest_5mb
- Use the following command to create your table:
create table test (name string, id bigint, id2 bigint, randomName string, description string, dob string, title string, corp string, fixedOne bigint, fixedTwo int) row format delimited fields terminated by ',' STORED AS TEXTFILE;
- Add the example dataset to the above test table (see the optional check below):
load data local inpath '/tmp/hiveTest_5mb' into table test;
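Optionally, to confirm that the table was created and loaded before testing through the application, you can run a quick query in the same Hive session:
select * from test limit 10;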
Check basic connectivity
Steps:
- After restart, log in to the Trifacta application. See Login.
- If the platform is able to connect to Hive, you should see a Hive tab in the Import Data page. Create a new dataset and verify that the data from the Hive source has been ingested in the Transformer page.
- If not, please check /opt/trifacta/logs/data-service.log for errors.
- For more information, see Verify Operations.