Release 8.7.1

By default, Microsoft Azure deployments integrate with Azure Data Lake Store (ADLS). Optionally, you can configure your deployment to integrate with WASB.

Supported Environments:

Operation | Designer Cloud Powered by Trifacta Enterprise Edition | Amazon | Microsoft Azure
Read | Not supported | Not supported | Supported
Write | Not supported | Not supported | Supported (only if ADLS Gen1 is base storage layer)

Limitations

  • A single public connection to ADLS Gen1 is supported.
  • In this release, the Designer Cloud powered by Trifacta platform supports integration with the default store only. Additional stores are not supported.

Read-only access

If the base storage layer has been set to WASB, you can follow these instructions to set up read-only access to ADLS Gen1.

NOTE: To enable read-only access to ADLS Gen1, do not set the base storage layer to adl. For read-only access, the base storage layer must remain wasbs.

Prerequisites

General

  • The Designer Cloud powered by Trifacta platform has already been installed and integrated with an Azure Databricks cluster. See Configure for Azure Databricks.
  • adl must be set as the base storage layer for the Designer Cloud powered by Trifacta platform instance. See Set Base Storage Layer.

Create a registered application

Before you integrate with Azure ADLS Gen1, you must register the Designer Cloud powered by Trifacta platform as an application in Azure Active Directory. See Configure for Azure.

Azure properties

The following properties should already be specified in the Admin Settings page. Verify that they have been set:

  • azure.applicationId
  • azure.secret
  • azure.directoryId

The above properties are needed for this configuration. For more information, see Configure for Azure.
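In trifacta-conf.json, these dotted property names map to a nested JSON block. As a sketch, assuming that convention, the settings might look like the following; the identifier values are hypothetical placeholders for your registered application's credentials:

```json
"azure": {
    "applicationId": "00000000-0000-0000-0000-000000000000",
    "secret": "<your_application_secret>",
    "directoryId": "11111111-1111-1111-1111-111111111111"
}
```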

Key Vault Setup

An Azure Key Vault has already been set up and configured for use by the  Designer Cloud powered by Trifacta platform . For more information, see Configure for Azure.

Configure ADLS Authentication

Authentication to ADLS storage is supported in the following modes.

System

All users authenticate to ADLS using a single system key/secret combination. This combination is specified in the following parameters, which you should have already defined:

  • azure.applicationId
  • azure.secret
  • azure.directoryId

These properties define the registered application in Azure Active Directory. System authentication mode uses the registered application identifier as the service principal for authentication to ADLS Gen1. All users have the same permissions in ADLS Gen1.

For more information on these settings, see Configure for Azure.

User

Per-user mode allows individual users to authenticate to ADLS Gen1 through their Azure Active Directory login.

NOTE: Additional configuration for AD SSO is required. Details are below.

Steps:

Please complete the following steps to specify the ADLS Gen1 access mode.

  1. You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.
  2. Set the following parameter to the preferred mode (system or user):

    "azure.adl.mode": "<your_preferred_mode>",
  3. Save your changes.
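For example, to enable system mode, the parameter would read as follows (a sketch; per the step above, user is the other accepted value):

```json
"azure.adl.mode": "system",
```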

System mode access

When access to ADLS is requested, the platform uses the combination of Azure directory ID, Azure application ID, and Azure secret to complete access.

After defining the properties in the  Designer Cloud powered by Trifacta platform , system mode access requires no additional configuration.

User mode access

In user mode, a user ID hash is generated from the Key Vault key/secret and the user's AD login. This hash is used to generate the access token, which is stored in the Key Vault.

Set up for Azure AD SSO

NOTE: User mode access to ADLS requires Single Sign On (SSO) to be enabled for integration with Azure Active Directory. For more information, see Configure SSO for Azure AD.

Configure the Designer Cloud powered by Trifacta platform

Configure storage protocol

You must configure the platform to use the ADL storage protocol when accessing ADLS.

NOTE: Per earlier configuration, base storage layer must be set to adl for read/write access to ADLS. See Set Base Storage Layer.

Steps:

  1. You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.
  2. Locate the following parameter and change its value to  adl:

    "webapp.storageProtocol": "adl",
  3. Set the following parameter to false:

    "hdfs.enabled": false,
  4. Save your changes and restart the platform.
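Assuming the nested layout of trifacta-conf.json, the two settings above would appear together as in this sketch:

```json
"webapp": {
    "storageProtocol": "adl"
},
"hdfs": {
    "enabled": false
}
```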

Define storage locations

You must define the base storage location and supported protocol for storing data on ADLS.

NOTE: You can specify only one storage location for ADLS.


Steps:

  1. You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.
  2. Locate the following configuration block. Specify the listed changes:

    "fileStorage": {
        "defaultBaseUris": [
          "<baseURIOfYourLocation>"
        ],
        "whitelist": ["adl"]
      }
    Parameter: defaultBaseUris

    For each supported protocol, this array must contain a top-level path to the location where Designer Cloud powered by Trifacta platform files can be stored. These files include uploads, samples, and temporary storage used during job execution.

    NOTE: The adl:// protocol identifier must be included.

    Example value:

    adl://<YOUR_STORE_NAME>.azuredatalakestore.net

    Parameter: whitelist

    A comma-separated list of protocols that are permitted to read and write with ADLS storage.

    NOTE: This array of values must include adl.

  3. Save your changes and restart the platform.
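As a sketch, a completed fileStorage block might look like the following; the store name examplestore is a hypothetical placeholder for your ADLS Gen1 store:

```json
"fileStorage": {
    "defaultBaseUris": [
      "adl://examplestore.azuredatalakestore.net"
    ],
    "whitelist": ["adl"]
}
```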

Testing

Restart services. See Start and Stop the Platform.

After the configuration has been specified, an ADLS connection appears in the Import Data page. Select it to begin navigating for data sources.

Specify ADLS Gen1 Path:

In the ADLS Gen1 browser, you can specify an explicit path to resources. Click the Pencil icon, paste the path value, and click Go. For example, suppose your resources are located at the following path:

/trifacta/input/username@example.com

In this case, you should paste the following in the Path textbox:

hdfs://trifacta/input/username@example.com

NOTE: When inserting values directly into the Path textbox, you must use the hdfs:// protocol identifier. Do not use the adl:// protocol identifier.

Tip: You can retrieve your home directory from your profile. See Storage Config Page.

Try running a simple job from the  Designer Cloud application . For more information, see Verify Operations.

Troubleshooting

For additional troubleshooting information, see ADLS Gen2 Access.

Using ADLS Gen 1 Connection

Uses of ADLS

The  Designer Cloud powered by Trifacta platform  can use ADLS for the following reading and writing tasks:

  1. Creating Datasets from ADLS Files: You can read in from a data source stored in ADLS. A source may be a single ADLS file or a folder of identically structured files. See Reading from Sources in ADLS below.
  2. Reading Datasets: When creating a dataset, you can pull your data from another dataset defined in ADLS. See Creating Datasets below.
  3. Writing Job Results: After a job has been executed, you can write the results back to ADLS. See Writing Job Results below.

In the  Designer Cloud application , ADLS is accessed through the ADLS browser.

NOTE: When the Designer Cloud powered by Trifacta platform executes a job on a dataset, the source data is untouched. Results are written to a new location, so that no data is disturbed by the process.

Before you begin using ADLS

  • Read/Write Access: Your cluster administrator must configure read/write permissions to locations in ADLS. Please see the ADLS documentation.

    Avoid using /trifacta/uploads for reading and writing data. This directory is used by the Designer Cloud application .

  • Your cluster administrator should provide a place or mechanism for raw data to be uploaded to your datastore. 

  • Your cluster administrator should provide a writeable home output directory for you, which you can review. See Storage Config Page.

Secure access

Depending on the security features you've enabled, the technical methods by which Trifacta users access ADLS may vary. For more information, see ADLS Gen1 Access.

Storing data in ADLS

Your cluster administrator should provide raw data or locations and access for storing raw data within ADLS. All Trifacta users should have a clear understanding of the folder structure within ADLS where each individual can read from and write their job results. 

  • Users should know where shared data is located and where personal data can be saved without interfering with or confusing other users.

NOTE: The Designer Cloud powered by Trifacta platform does not modify source data in ADLS. Sources stored in ADLS are read without modification from their source locations, and sources that are uploaded to the platform are stored in /trifacta/uploads.

Reading from sources in ADLS

You can create a dataset from one or more files stored in ADLS.

Wildcards:

You can parameterize your input paths to import multiple source files as part of the same imported dataset. For more information, see Overview of Parameterization.

Folder selection:

When you select a folder in ADLS to create your dataset, you select all files in the folder to be included. Notes:

  • This option selects all files in all sub-folders. If your sub-folders contain separate datasets, you should be more specific in your folder selection.
  • All files used in a single dataset must be of the same format and have the same structure. For example, you cannot mix and match CSV and JSON files if you are reading from a single directory. 
  • When a folder is selected from ADLS, the following file types are ignored:
    • *_SUCCESS and *_FAILED files, which may be present if the folder has been populated by the running environment.
    • Files stored in ADLS that begin with an underscore (_) cannot be read during batch transformation and are ignored. Rename these files through ADLS so that they do not begin with an underscore.

Creating Datasets

When creating a dataset, you can choose to read in data from a source stored in ADLS or from a local file.

  • ADLS sources are not moved or changed.
  • Local file sources are uploaded to /trifacta/uploads where they remain and are not changed.

Data may be individual files or all of the files in a folder. For more information, see Reading from Sources in ADLS above.

In the Import Data page, click the ADLS tab. See Import Data Page.

Writing job results

When your job results are generated, they can be stored back in ADLS for you at the location defined for your user account.

  • The ADLS location is available through the Publishing dialog in the Output Destinations tab of the Job Details page. See Publishing Dialog.
  • Each set of job results must be stored in a separate folder within your ADLS output home directory. 
  • For more information on your output home directory, see Storage Config Page.

If your deployment is using ADLS, do not use the /trifacta/uploads directory. This directory stores uploads and metadata, which may be used by multiple users. Manipulating files outside of the Designer Cloud application can destroy other users' data. Please use the tools provided through the interface for managing uploads from ADLS.

Users can specify a default output home directory and, during job execution, an output directory for the current job.

Access to results:

Depending on how the platform is integrated with ADLS, other users may or may not be able to access your job results.

  • If user mode is enabled, results are written to ADLS through the ADLS account configured for your use. Depending on the permissions of your ADLS account, you may be the only person who can access these results.

  • If user mode is not enabled, then each Trifacta user writes results to ADLS using a shared account. Depending on the permissions of that account, your results may be visible to all platform users.

Creating a new dataset from results

As part of writing job results, you can choose to create a new dataset, so that you can chain together data wrangling tasks.

NOTE: When you create a new dataset as part of your job results, the file or files are written to the designated output location for your user account. Depending on how your cluster permissions are configured, this location may not be accessible to other users.

Reference

Supported Versions: n/a

Supported Environments:

Operation | Designer Cloud Powered by Trifacta Enterprise Edition | Amazon | Microsoft Azure
Read | Not supported | Not supported | Supported
Write | Not supported | Not supported | Supported (only if ADLS Gen1 is base storage layer)


Create New Connection: 
n/a

NOTE: A single public connection to ADLS Gen1 is supported.
