AWS Glue Access

If the Designer Cloud Powered by Trifacta platform is integrated with an EMR cluster running version 5.8.0 or later, you can configure your Hive instance to use the Amazon Glue Data Catalog to store and access Hive metadata.

Tip

For metastores that are used across a set of services, accounts, and applications, Amazon Glue is the recommended method of access.

For more information on Amazon Glue, see https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hive-metastore-glue.html.

This section describes how to enable integration with your Amazon Glue deployment.

Supported Deployment

Amazon Glue tables can be read under the following conditions:

  • The Designer Cloud Powered by Trifacta platform uses S3 as the base storage layer.

  • The Designer Cloud Powered by Trifacta platform is integrated with an EMR cluster:

    • EMR version 5.8.0 or later

    • EMR cluster has been configured with HiveServer2

  • The Hive deployment must be integrated with Amazon Glue.

    Note

    connections are supported when S3 is the backend datastore.

  • For HiveServer2 connectivity, the Trifacta node has direct access to the Master node of the EMR cluster.
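
The version requirement above can be checked before you attempt the integration. The following is a minimal sketch, assuming the cluster version is available as an EMR release label such as emr-5.8.0. The function names are illustrative and are not part of the platform.

```python
import re
from typing import Tuple

# Minimum EMR version stated in the Supported Deployment conditions.
MIN_EMR_VERSION = (5, 8, 0)


def parse_emr_release(label: str) -> Tuple[int, int, int]:
    """Parse a release label such as 'emr-5.8.0' (or '5.8.0') into (5, 8, 0)."""
    match = re.fullmatch(r"(?:emr-)?(\d+)\.(\d+)\.(\d+)", label.strip())
    if match is None:
        raise ValueError(f"Unrecognized EMR release label: {label!r}")
    major, minor, patch = (int(part) for part in match.groups())
    return (major, minor, patch)


def supports_glue_catalog(label: str) -> bool:
    """Return True if the release is 5.8.0 or later."""
    return parse_emr_release(label) >= MIN_EMR_VERSION
```

Comparing integer tuples instead of strings avoids ordering mistakes, such as treating 5.10.0 as earlier than 5.8.0.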

EMR Settings

When you create the EMR cluster, please verify the following in the Amazon Glue Data Catalog settings:

  • Use for Hive table metadata

  • Use for Spark table metadata
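
If you create the cluster from the command line instead of the console, the AWS documentation linked above describes these two settings as configuration classifications (hive-site and spark-hive-site) that set the metastore client factory class. The following sketch generates that JSON so it can be passed to the --configurations option of aws emr create-cluster. The function name is illustrative, and your deployment may require additional properties that are not shown here.

```python
import json
from typing import Dict, List

# Factory class named in the AWS EMR documentation for the Glue Data Catalog.
GLUE_FACTORY_CLASS = (
    "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
)


def glue_catalog_configurations() -> List[Dict]:
    """Build classifications for Hive and Spark table metadata in Glue."""
    return [
        {
            "Classification": classification,
            "Properties": {
                "hive.metastore.client.factory.class": GLUE_FACTORY_CLASS
            },
        }
        for classification in ("hive-site", "spark-hive-site")
    ]


if __name__ == "__main__":
    # Print JSON that can be saved to a file and referenced at cluster creation.
    print(json.dumps(glue_catalog_configurations(), indent=2))
```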

Required Glue table properties

Each Glue table must be created with the following properties specified:

  • InputFormat

  • OutputFormat

  • Serde

These properties must be specified for the Hive JDBC driver to read the Amazon Glue tables.
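
To find tables that are missing these properties, you can inspect the table definition. The following sketch assumes a dictionary shaped like the Table element returned by the Glue GetTable API (for example, through boto3), where the formats are stored under StorageDescriptor and the Serde under SerdeInfo.SerializationLibrary. That mapping of "Serde" to SerializationLibrary is an assumption, and the function name is illustrative.

```python
from typing import Dict, List


def missing_glue_table_properties(table: Dict) -> List[str]:
    """Return the names of required properties that are absent or empty."""
    storage = table.get("StorageDescriptor") or {}
    serde_info = storage.get("SerdeInfo") or {}
    values = {
        "InputFormat": storage.get("InputFormat"),
        "OutputFormat": storage.get("OutputFormat"),
        "Serde": serde_info.get("SerializationLibrary"),
    }
    return [name for name, value in values.items() if not value]


# Example table definition for a delimited text table (sample values only).
sample_table = {
    "Name": "mytable",
    "StorageDescriptor": {
        "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
        "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
        "SerdeInfo": {
            "SerializationLibrary": "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"
        },
    },
}
```

An empty list means that all three properties are present. Any names in the list identify what to add to the table definition.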

For additional limitations on accessing Hive tables through Amazon Glue, see https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hive-metastore-glue.html#emr-hive-glue-considerations-hive.

Deploy Credentials JAR to S3

To enable integration between the Designer Cloud Powered by Trifacta platform and Amazon Glue, a JAR file for managing the Alteryx credentials for AWS access must be deployed to S3 in a location that is accessible to the EMR cluster.

When the EMR cluster is launched with the custom bootstrap action described below, the cluster does one of the following:

  • Interacts with Amazon Glue using the credentials specified in trifacta-conf.json

  • If aws.mode = user, then the credentials registered by the user are used to connect to Amazon Glue.

Steps:

  1. From the installation of the Designer Cloud Powered by Trifacta platform, retrieve the following file:

    [TRIFACTA_INSTALL_DIR]/aws/glue-credential-provider/build/libs/trifacta-aws-glue-credential-provider.jar
  2. Upload this JAR file to an S3 bucket location where the EMR cluster can access it:

    1. Via AWS Console S3 UI: See http://docs.aws.amazon.com/cli/latest/reference/s3/index.html.

    2. Via AWS command line:

      aws s3 cp trifacta-aws-glue-credential-provider.jar s3://<YOUR-BUCKET>/
  3. Create a bootstrap action script named configure_glue_lib.sh. The contents must be the following:

    sudo aws s3 cp s3://<YOUR-BUCKET>/trifacta-aws-glue-credential-provider.jar  /usr/share/aws/emr/emrfs/auxlib/
    sudo aws s3 cp s3://<YOUR-BUCKET>/trifacta-aws-glue-credential-provider.jar  /usr/lib/hive/auxlib/
  4. This script must be uploaded into S3 in a location that can be accessed from the EMR cluster. Retain the full path to this location.

  5. Add a bootstrap action to EMR cluster configuration.

    1. Via AWS Console S3 UI: Create the bootstrap action to point to the script that you uploaded on S3.

    2. Via AWS command line:

      1. Upload the configure_glue_lib.sh file to the accessible S3 bucket.

      2. In the command line cluster creation script, add a custom bootstrap action. Example:

        --bootstrap-actions '[
        {"Path":"s3://<YOUR-BUCKET>/configure_glue_lib.sh","Name":"Custom action"}
        ]'
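
Hand-written JSON inside a shell argument is easy to misquote. As an optional aid, the following sketch generates the bootstrap action argument shown above from a bucket name. The function name and defaults are illustrative. The script name and action name match the example in these steps.

```python
import json
import shlex


def bootstrap_actions_argument(
    bucket: str,
    script: str = "configure_glue_lib.sh",
    name: str = "Custom action",
) -> str:
    """Return a shell-quoted --bootstrap-actions argument for the script."""
    actions = [{"Path": f"s3://{bucket}/{script}", "Name": name}]
    # shlex.quote wraps the JSON in single quotes so the shell passes it intact.
    return "--bootstrap-actions " + shlex.quote(json.dumps(actions))


if __name__ == "__main__":
    # Replace my-bucket with the bucket that holds the uploaded script.
    print(bootstrap_actions_argument("my-bucket"))
```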

Authentication

Authentication methods and required permissions are based on the AWS authentication mode:

"aws.mode": "system",

| aws.mode value | Permissions | Doc |
|----------------|-------------|-----|
| system | IAM role assigned to the cluster must provide access to Amazon Glue. | See Configure for AWS. |
| user | The user role must provide access to Amazon Glue. See below for an example IAM role access control. | See Configure AWS Per-User Auth for Temporary Credentials. |

Example fine-grained access control for IAM policy:

If you are using IAM roles to provide access to Amazon Glue, you can review the following fine-grained access control, which includes the permissions required to access Amazon Glue tables. Please add this to the Permissions section of your Amazon Glue Catalog Settings page.

Note

Please verify that access is granted in the IAM policy to the default database for Amazon Glue, as noted below.

{
    "Sid" : "accessToAllTables",
    "Effect" : "Allow",
    "Principal" : {
      "AWS" : [  "arn:aws:iam::<accountId>:role/glue-read-all" ]
    },
    "Action" : [ "glue:GetDatabases", "glue:GetDatabase", "glue:GetTables", "glue:GetTable", "glue:GetUserDefinedFunctions", "glue:GetPartitions" ],
    "Resource" : [ "arn:aws:glue:us-west-2:<accountId>:catalog", "arn:aws:glue:us-west-2:<accountId>:database/default", "arn:aws:glue:us-west-2:<accountId>:database/global_temp", "arn:aws:glue:us-west-2:<accountId>:database/mydb", "arn:aws:glue:us-west-2:<accountId>:table/mydb/*" ]
}
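
If you maintain this statement for several databases or regions, it can help to generate it instead of editing the ARNs by hand. The following is a minimal sketch that reproduces the statement above from an account ID, a region, and a list of database names. The function and parameter names are illustrative. The default and global_temp databases are always included, per the note above.

```python
from typing import Dict, List

# Read-only Glue actions listed in the example statement.
READ_ACTIONS = [
    "glue:GetDatabases",
    "glue:GetDatabase",
    "glue:GetTables",
    "glue:GetTable",
    "glue:GetUserDefinedFunctions",
    "glue:GetPartitions",
]


def glue_read_statement(
    account_id: str,
    region: str,
    databases: List[str],
    role_name: str = "glue-read-all",
) -> Dict:
    """Build a read-only statement covering the catalog and given databases."""
    prefix = f"arn:aws:glue:{region}:{account_id}"
    resources = [
        f"{prefix}:catalog",
        f"{prefix}:database/default",
        f"{prefix}:database/global_temp",
    ]
    for database in databases:
        resources.append(f"{prefix}:database/{database}")
        resources.append(f"{prefix}:table/{database}/*")
    return {
        "Sid": "accessToAllTables",
        "Effect": "Allow",
        "Principal": {"AWS": [f"arn:aws:iam::{account_id}:role/{role_name}"]},
        "Action": list(READ_ACTIONS),
        "Resource": resources,
    }
```

For example, glue_read_statement("<accountId>", "us-west-2", ["mydb"]) yields the same resources as the statement shown above.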

S3 access

Amazon Glue crawls available data that is stored on S3. When you import a dataset through Amazon Glue:

  • Any samples of your data that are generated by the Designer Cloud Powered by Trifacta platform are stored in S3. Sample data is read by the platform directly from S3.

  • Source data is read through Amazon Glue.

Warning

You should review and, if needed, apply additional read restrictions on your IAM policies so that users are limited to reading data from their own S3 directories. If all users have access to the same areas of the same S3 bucket, then it may be possible for users to access datasets through the platform when it is forbidden through Amazon Glue.
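
One way to apply such a restriction is an IAM policy that allows listing and reading only under a single directory. The following sketch builds that policy document. It assumes that each user has a dedicated directory in the bucket, which is a hypothetical layout. The bucket name, prefix, and function name are illustrative, and your policies may need additional statements for locations that the platform itself must read.

```python
from typing import Dict


def user_read_policy(bucket: str, user_prefix: str) -> Dict:
    """Build a policy that permits S3 reads only under one directory."""
    prefix = user_prefix.strip("/")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing is granted on the bucket, limited to the prefix.
                "Sid": "ListOwnDirectory",
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket}"],
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}/*"]}},
            },
            {
                # Object reads are granted only for keys under the prefix.
                "Sid": "ReadOwnObjects",
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/{prefix}/*"],
            },
        ],
    }
```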

Limitations

  • Access is read-only. Publishing to Glue hosted on EMR is not supported.

  • When using per-user IAM role-based authentication, EMR Spark jobs on Amazon Glue datasources may fail if the job is still running when the session limit defined for the IAM role has elapsed since job submission.

    • In the AWS Console, this limit is defined in hours as the Maximum CLI/API session duration assigned to the IAM role.

    • In the Amazon Glue catalog client for the Hive Metadata store, the temporary credentials generated for the IAM role expire after this limit in hours and cannot be renewed.
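
To estimate whether a long-running job is at risk, you can compare its expected runtime with the session limit. The following sketch assumes that the temporary credentials are issued at job submission time, as described above, and that you have your own estimate of the job runtime. The function names are illustrative.

```python
from datetime import datetime, timedelta, timezone


def credentials_expire_at(submitted_at: datetime, session_limit_hours: float) -> datetime:
    """Return the time at which the temporary credentials expire."""
    return submitted_at + timedelta(hours=session_limit_hours)


def job_outlives_credentials(
    submitted_at: datetime,
    expected_runtime: timedelta,
    session_limit_hours: float,
) -> bool:
    """Return True if the job is expected to finish after the credentials expire."""
    expected_end = submitted_at + expected_runtime
    return expected_end > credentials_expire_at(submitted_at, session_limit_hours)


if __name__ == "__main__":
    submitted = datetime.now(timezone.utc)
    # A 90-minute job with a 1-hour session limit is at risk of failing.
    print(job_outlives_credentials(submitted, timedelta(minutes=90), 1))
```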

Enable

Please verify that the following have been enabled and configured.

  1. Your deployment has been configured to meet the Supported Deployment guidelines above.

  2. You must integrate the platform with Hive.

    Note

    For the Hive hostname and port number, use the Master public DNS values. For more information, see https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hive-metastore-glue.html.

    For more information, see Configure for Hive.

  3. If you plan to use custom SQL queries, the custom SQL query feature must be enabled. For more information, see Enable Custom SQL Query.
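
Before configuring the Hive hostname and port noted in step 2, you can confirm that the Trifacta node reaches the EMR master node. The following sketch opens a TCP connection to the HiveServer2 port and composes a JDBC URL in the standard jdbc:hive2 form. Port 10000 is the HiveServer2 default and is an assumption for your cluster. The function names and the example hostname are illustrative.

```python
import socket

# HiveServer2 default port; confirm the value used by your cluster.
DEFAULT_HIVESERVER2_PORT = 10000


def hive_jdbc_url(
    master_public_dns: str,
    port: int = DEFAULT_HIVESERVER2_PORT,
    database: str = "default",
) -> str:
    """Compose a HiveServer2 JDBC URL from the Master public DNS value."""
    return f"jdbc:hive2://{master_public_dns}:{port}/{database}"


def can_reach(host: str, port: int, timeout_seconds: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout_seconds):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    host = "ec2-203-0-113-10.compute-1.amazonaws.com"  # placeholder hostname
    print(hive_jdbc_url(host))
    print("reachable:", can_reach(host, DEFAULT_HIVESERVER2_PORT))
```

A successful connection shows only that the port is open from this node. It does not verify authentication or the Glue integration.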

Configure

When accessing Amazon Glue using temporary per-user credentials, the credentials are given a duration of 1 hour. As needed, you can modify this duration.

Note

This value cannot exceed the Maximum Session Duration value for IAM roles, as configured in the IAM Console.

  1. You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.

  2. Locate the following parameter. By default, this value is set to 1:

    "data-service.sqlOptions.glueTempCredentialTimeoutInHours": 1
  3. Save your changes and restart the platform.
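
Before you save a new value, you can check it against the limit in the note above. IAM expresses the maximum session duration of a role in seconds (between 3600 and 43200), while this parameter is in hours. The following sketch performs that comparison. The function name is illustrative, and treating only positive values as valid is an assumption.

```python
def timeout_is_valid(timeout_hours: int, role_max_session_seconds: int) -> bool:
    """Return True if the timeout fits within the role's maximum session duration.

    timeout_hours: candidate value for glueTempCredentialTimeoutInHours.
    role_max_session_seconds: Maximum Session Duration of the IAM role,
        as reported by IAM in seconds.
    """
    if timeout_hours <= 0:
        return False
    return timeout_hours * 3600 <= role_max_session_seconds


if __name__ == "__main__":
    # A role limited to 1 hour (3600 seconds) cannot support a 2-hour timeout.
    print(timeout_is_valid(1, 3600))
    print(timeout_is_valid(2, 3600))
```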

Create Connection

See AWS Glue Connections.