This section provides information on how to enable Redshift connectivity and create one or more connections to Redshift sources. 

  • Amazon Redshift is a hosted data warehouse available through Amazon Web Services. It is frequently used to host datasets consumed by downstream analytic tools such as Tableau and Qlik. For more information, see https://aws.amazon.com/redshift/.

Prerequisites

NOTE: The Admin Settings page contains some deprecated parameters pertaining to Redshift. Ignore these parameters and their settings; they do not apply to this release.


Before you begin, please verify that your Trifacta® environment meets the following requirements:

  1. S3 base storage layer: Redshift access requires use of S3 as the base storage layer, which must be enabled. See Set Base Storage Layer.

    NOTE: After you define your storage deployment options, you cannot change them for your deployment. For more information, see Storage Deployment Options.

  2. Redshift version: For more information on supported versions, see Connection Types.
  3. Same region: The Redshift cluster must be in the same region as the default S3 bucket.
  4. Integration: Your Trifacta instance is connected to a running environment supported by your product edition.
  5. Hadoop deployment: Hadoop is deployed either on-premises or in EC2.
  6. Per-user mode: If per-user mode is enabled for S3, each user must have permission to access the manifest file (if used), the LIST permission on the Amazon S3 bucket to which job results are written, and the GET permission on the Amazon S3 objects that are loaded from that bucket.
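The per-user S3 permissions above can be sketched as an IAM policy. This is a minimal illustration only; the bucket name is a hypothetical placeholder, and your AWS administrator should determine the actual policy your deployment requires.

```python
import json

# Hypothetical bucket name -- replace with your deployment's default S3 bucket.
BUCKET = "my-trifacta-bucket"

# Minimal policy sketch: LIST on the bucket, GET on its objects,
# matching the per-user mode requirements described above.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:ListBucket"],
            "Resource": [f"arn:aws:s3:::{BUCKET}"],
        },
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": [f"arn:aws:s3:::{BUCKET}/*"],
        },
    ],
}

print(json.dumps(policy, indent=2))
```

Note that LIST applies to the bucket ARN itself, while GET applies to the object paths under it; mixing up the two resource forms is a common cause of access errors.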

Limitations

  1. When publishing to Redshift, output must be in Avro or JSON format.
  2. You can publish any specific job to Redshift once through the export window. See Export Results Window.
    1. To publish the same results again, use the command line interface. See CLI for Jobs.
  3. Management of nulls:
    1. Nulls are displayed as expected in the Trifacta application.
    2. When Redshift jobs are run, the UNLOAD SQL command in Redshift converts all nulls to empty strings. Null values appear as empty strings in generated results, which can be confusing. This is a known issue with Redshift.
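Because of this UNLOAD behavior, downstream consumers sometimes post-process generated results to restore nulls. A minimal sketch, assuming CSV-style results and accepting that "empty string" and "null" can no longer be distinguished after UNLOAD:

```python
import csv
import io

def empty_to_null(rows):
    """Map empty-string fields back to None. Lossy by necessity:
    fields that were genuinely empty strings are also converted."""
    return [[None if field == "" else field for field in row] for row in rows]

# Example: the second data row has an empty 'name' field.
sample = io.StringIO("id,name\n1,alice\n2,\n")
rows = list(csv.reader(sample))
print(empty_to_null(rows))  # [['id', 'name'], ['1', 'alice'], ['2', None]]
```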

Create Connection

You can create Redshift connections through either of the following methods:

Tip: SSL connections are recommended. Details are below.

 

  1. Through the application
  2. Through the command line interface
  3. Through the API

Create through application

Any user can create a Redshift connection through the application.

Steps:

  1. Log in to the application.
  2. To create an import-only connection:
    1. In the menu, click Datasets.
    2. Click Import Data.
    3. In the left nav bar, click the New icon.
  3. To create an import-and-publish connection:
    1. In the menu, click Settings menu > Connections.
  4. In the Create Connection page, click the Redshift connection card.
  5. Specify the properties for your Redshift database connection. The following parameters are specific to Redshift connections:

    IAM Role ARN for Redshift-S3 Connectivity:

    (Optional) You can specify an IAM role ARN that enables role-based connectivity between Redshift and the S3 bucket that is used as intermediate storage during Redshift bulk COPY/UNLOAD operations. Example:

    arn:aws:iam::1234567890:role/MyRedshiftRole

    NOTE: If a Redshift connection has an invalid iamRoleArn, you can browse, import datasets, and open the data in the Transformer page. However, any jobs executed using this connection fail. If the iamRoleArn is invalid, the only samples that you can generate are Quick Random samples; other sampling jobs fail.

     

    For more information, see Configure for EC2 Role-Based Authentication.


    For more information, see Create Connection Window.

  6. Click Save.
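Because an invalid iamRoleArn only surfaces when jobs fail, it can help to sanity-check the ARN's shape before saving the connection. A minimal sketch; the pattern below is an assumption about well-formed role ARNs, and a format check cannot verify that the role actually exists or is assumable:

```python
import re

# Rough shape of an IAM role ARN: arn:aws:iam::<account-id>:role/<role-name>.
# The 10-12 digit range accommodates the doc's example account ID.
ARN_PATTERN = re.compile(r"^arn:aws:iam::\d{10,12}:role/[\w+=,.@/-]+$")

def looks_like_role_arn(arn):
    """Format check only -- does not confirm the role exists in AWS."""
    return bool(ARN_PATTERN.match(arn))

print(looks_like_role_arn("arn:aws:iam::1234567890:role/MyRedshiftRole"))  # True
print(looks_like_role_arn("arn:aws:iam::1234567890:user/Alice"))           # False
```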

Enable SSL connections

To enable SSL connections to Redshift, you must first enable SSL on your Redshift cluster. For more information, see https://docs.aws.amazon.com/redshift/latest/mgmt/connecting-ssl-support.html.

In your Trifacta Wrangler Enterprise connection to Redshift, add the following string to your Connect String Options:

;ssl=true

Save your changes.
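If you script connection edits, the SSL flag can be appended defensively. A sketch, assuming the Connect String Options value is a plain semicolon-delimited string as shown above (the with_ssl helper is illustrative, not part of the product):

```python
def with_ssl(connect_opts):
    """Append ;ssl=true to a connect-string options value,
    unless the flag is already present."""
    if "ssl=true" in connect_opts:
        return connect_opts
    return connect_opts + ";ssl=true"

print(with_ssl(""))                    # ;ssl=true
print(with_ssl(";tcpKeepAlive=true"))  # ;tcpKeepAlive=true;ssl=true
print(with_ssl(";ssl=true"))           # ;ssl=true (unchanged)
```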

Create through CLI

The CLI tools are stored in the following directory:

/opt/trifacta/bin/

Example command (all one command):

./trifacta_cli.py create_connection --user_name <platform_admin_username> --password <platform_admin_password> 
--conn_type amazon_redshift --conn_name aRedshiftConnection 
--conn_description "This is my Redshift connection." 
--conn_host example.com --conn_port 10000 
--conn_credential_type aws 
--conn_credential_location ~/.trifacta/config_conn.json --conn_params_location ~/.trifacta/p.json 
--conn_is_global --cli_output_path ./conn_create.out


create_connection: CLI action type. Do not modify this value.

--user_name: Username of the Trifacta administrator account to use. Specify the full username.

For this connection type, it must be an admin user account.

--password: Password of the account.

--conn_type: The type of connection. Set this value to amazon_redshift.

--conn_name: The internal name of this connection.

NOTE: This value must be unique among all of your connections.

--conn_description: A user-friendly description for this connection, which appears in the application.

--conn_host: Host of the Redshift instance.

--conn_port: Port number of the Redshift instance.

--conn_credential_type: The type of credentials to use. Set this value to aws for Redshift.

--conn_credential_location: Path to the file containing the credentials used to access Redshift. See below.

--conn_params_location: Path to the file containing the parameters to pass to Redshift during interactions. See below.

--conn_is_global: This flag makes the connection available to all Trifacta users. This flag requires no value.

NOTE: After a connection has been made global, it cannot be made private. You must delete the connection through the CLI and then recreate it. When a Redshift connection is deleted, all publication information associated with it is dropped.

--conn_skip_test: By default, any connection is tested as part of a create or edit action. Add this flag to skip testing the connection. This flag requires no value.

Tip: After creation, you can test the connection and modify it through the application. See Flows Page.

--cli_output_path: The path to the file where results of this command are written.

NOTE: The user issuing the command must also have execute permissions on all parent folders in the specified cli_output_path.

Example credentials file:

A credentials file for AWS is structured in the following manner. This file must be accessible within the Trifacta deployment directory. Please add a reference to this file in your command.

{
  "username": "<your_user>",
  "password": "<your_password>",
  "iamRoleArn": "<your_IAM_role_ARN>"
}

NOTE: iamRoleArn is optional. For more information, see Configure for EC2 Role-Based Authentication.
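Before running the CLI command, you can verify that the credentials file parses and contains the required keys. A sketch; the check_credentials helper below is illustrative and not part of the CLI:

```python
import json

def check_credentials(text):
    """Parse a credentials file for the aws credential type and verify
    required keys. 'iamRoleArn' is optional, per the NOTE above."""
    creds = json.loads(text)  # also catches JSON syntax errors, e.g. a missing comma
    missing = {"username", "password"} - creds.keys()
    if missing:
        raise ValueError(f"credentials file is missing: {sorted(missing)}")
    return creds

creds = check_credentials('{"username": "u", "password": "p"}')
print(sorted(creds))  # ['password', 'username']
```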

Example params file:

A parameter file containing the following information must be accessible within the Trifacta deployment directory. Please add a reference to this file in your command.

Example:

{
 "defaultDatabase":"<your_database>",
 "extraLoadParams": "BLANKSASNULL EMPTYASNULL TRIMBLANKS TRUNCATECOLUMNS"
}

The first parameter defines the default database.

The second parameter is used when you publish results to Redshift. For more information on these values, see http://docs.aws.amazon.com/redshift/latest/dg/copy-parameters-data-conversion.html.
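To illustrate, here is a hedged sketch of how extraLoadParams might be appended to a Redshift COPY statement at publish time. The build_copy helper, table name, and S3 path are hypothetical; the actual SQL the platform issues may differ:

```python
def build_copy(table, s3_path, iam_role, extra_load_params):
    """Sketch of a Redshift COPY statement for Avro output,
    with the params file's extraLoadParams appended verbatim."""
    return (
        f"COPY {table} FROM '{s3_path}' "
        f"IAM_ROLE '{iam_role}' FORMAT AS AVRO 'auto' "
        f"{extra_load_params}"
    )

print(build_copy(
    "public.results",                               # hypothetical target table
    "s3://my-bucket/results/part.avro",             # hypothetical staging path
    "arn:aws:iam::1234567890:role/MyRedshiftRole",
    "BLANKSASNULL EMPTYASNULL TRIMBLANKS TRUNCATECOLUMNS",
))
```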

For more information on the Trifacta command line interface, see CLI for Connections.

Create via API

For more information, see API Connections Create v3.

Testing

Import a dataset from Redshift. Add it to a flow, and specify a publishing action. Run a job.

NOTE: You can publish job results to Redshift in JSON or Avro format only.

After you have run your job, you can publish the results to Redshift through the Export Results Window. See Export Results Window.

You can also publish results through the command line interface. See CLI for Jobs.
