Release 8.7.1



By default, Microsoft Azure deployments integrate with Azure Data Lake Store (ADLS). Optionally, you can configure your deployment to integrate with WASB.

  • Windows Azure Storage Blob (WASB) is an abstraction layer on top of HDFS, which enables persistent storage, access without a Hadoop cluster being present, and access from multiple Hadoop clusters.

Supported Environments:

Operation | Designer Cloud Powered by Trifacta Enterprise Edition | Amazon | Microsoft Azure
Read | Not supported | Not supported | Supported
Write | Not supported | Not supported | Supported (only if WASB is base storage layer)

Limitations

  • A single public connection to WASB is supported.
  • If a directory is created on the cluster through WASB, the directory includes a Size=0 blob. The Designer Cloud powered by Trifacta platform does not list Size=0 blobs and does not support interaction with them.

Read-only access

If the base storage layer has been set to ADLS Gen1 or ADLS Gen2, you can follow these instructions to set up read-only access to WASB. 

NOTE: If you are adding WASB as a secondary integration, your WASB blob container or containers must contain at least one folder. This is a known issue.

NOTE: To enable read-only access to WASB, do not set the base storage layer to wasbs. The base storage layer must remain set to ADLS Gen1 or ADLS Gen2.

Prerequisites

General

  • The Designer Cloud powered by Trifacta platform has already been installed and integrated with an Azure Databricks cluster.
  • For read/write access, WASB must be set as the base storage layer for the Designer Cloud powered by Trifacta platform instance. See Set Base Storage Layer. For read-only access, leave the base storage layer set to ADLS Gen1 or ADLS Gen2, as described above.
  • For each combination of blob host and container, a separate Azure Key Vault Store entry must be created. For more information, please contact your Azure admin.

Create a registered application

Before you integrate with Azure WASB, you must create a registered application in Azure for the Designer Cloud powered by Trifacta platform. See Configure for Azure.

Other Azure properties

The following properties should already be specified in the Admin Settings page. Please verify that the following have been set:

  • azure.applicationId
  • azure.secret
  • azure.directoryId

The above properties are needed for this configuration. For more information, see Configure for Azure.
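The verification step above can also be scripted. The following is a minimal sketch, assuming that you have loaded trifacta-conf.json into a Python dictionary and that dotted property names such as azure.applicationId correspond to nested JSON objects. That layout and the helper names lookup and missing_azure_properties are assumptions for illustration and are not part of the platform.

```python
from typing import Any, List

# Property names taken from the list above.
REQUIRED_AZURE_PROPERTIES = ("azure.applicationId", "azure.secret", "azure.directoryId")


def lookup(config: dict, dotted_key: str) -> Any:
    """Return the value stored at a dotted path, or None if any part is missing."""
    node: Any = config
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node


def missing_azure_properties(config: dict) -> List[str]:
    """List the required properties that are absent or empty."""
    return [key for key in REQUIRED_AZURE_PROPERTIES if not lookup(config, key)]
```

An empty list means that all three properties have a value. The sketch does not check whether the values are correct.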

Key Vault Setup

For new installs, an Azure Key Vault has already been set up and configured for use by the Designer Cloud powered by Trifacta platform.

NOTE: An Azure Key Vault is required. Upgrading customers who do not have a Key Vault in their environment must create one.

For more information, see Configure for Azure.

Configure WASB Authentication

Authentication to WASB storage is managed by specifying the appropriate host, container, and token ID in the Designer Cloud powered by Trifacta platform configuration. When access to WASB is requested, the platform passes the information through the Secure Token Service to query the specified Azure Key Vault Store using the provided values. The keystore returns the value for the secret. The combination of the key (token ID) and secret is used to access WASB.

NOTE: Per-user authentication is not supported for WASB.

For more information on creating the Key Vault Store and accessing it through the Secure Token Service, see Configure for Azure.

Configure the Designer Cloud powered by Trifacta platform

Enable

 For more information, see WASB Access.

Define location of SAS token

The SAS token required for accessing WASB can be retrieved from either of the following locations:

  1. Key Vault
  2. Trifacta configuration

SAS token from Key Vault

To use a SAS token stored in the Key Vault, specify the following parameter in platform configuration. You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.

Parameter | Description
"azure.wasb.fetchSasTokensFromKeyVault": true, | Instructs the Designer Cloud powered by Trifacta platform to query the Key Vault for SAS tokens.

NOTE: The Key Vault must already be set up. See "Key Vault Setup" above.


SAS token from Trifacta configuration

To specify the SAS token in the Designer Cloud powered by Trifacta platform configuration, set the following flag to false:

Parameter | Description
"azure.wasb.fetchSasTokensFromKeyVault": false, | Instructs the Designer Cloud powered by Trifacta platform to acquire per-container SAS tokens from the platform configuration.

Define WASB stores

The WASB stores that users can access are specified as an array of configuration values. Users of the platform can use all of them for reading sources and writing results.

Steps:

  1. To apply this configuration change, log in as an administrator to the Trifacta node. Then, edit trifacta-conf.json. For more information, see Platform Configuration Methods.
  2. Locate the azure.wasb.stores configuration block.

  3. Apply the appropriate configuration as specified below.

    Tip: The default container must be specified as the first set of elements in the array. All containers listed after the first one are treated as extra stores.

    "azure.wasb.stores": [
      {
        "sasToken": "<DEFAULT_VALUE1_HERE>",
        "keyVaultSasTokenSecretName": "<DEFAULT_VALUE1_HERE>",
        "container": "<DEFAULT_VALUE1_HERE>",
        "blobHost": "<DEFAULT_VALUE1_HERE>"
      },
      {
        "sasToken": "<VALUE2_HERE>",
        "keyVaultSasTokenSecretName": "<VALUE2_HERE>",
        "container": "<VALUE2_HERE>",
        "blobHost": "<VALUE2_HERE>"
      }
    ]

    The settings for each parameter depend on where the SAS token is stored:

    sasToken
      • Description: Set this value to the SAS token to use, if applicable.
        Example value: ?sv=2019-02-02&ss=bfqt&srt=sco&sp=rwdlacup&se=2022-02-13T00:00:00Z&st=2020-02-13T00:00:00Z&spr=https&sig=<redacted>
      • SAS token from Azure Key Vault: Set this value to an empty string. NOTE: Do not delete the entire line. Leave the value empty.
      • SAS token from platform configuration: See below for the command to execute to generate a SAS token.

    keyVaultSasTokenSecretName
      • Description: Set this value to the secret name of the SAS token in the Azure Key Vault to use for the specified blob host and container.
      • SAS token from Azure Key Vault: If needed, you can generate and apply a per-container SAS token for use in this field for this specific store. Details are below.
      • SAS token from platform configuration: Set this value to an empty string. NOTE: Do not delete the entire line. Leave the value empty.

    container
      • Description: Apply the name of the WASB container.
      • SAS token from Azure Key Vault: NOTE: If you are specifying different blob host and container combinations for your extra stores, you must create a new Key Vault store. See above for details.

    blobHost
      • Description: Specify the blob host of the container.
        Example value: storage-account.blob.core.windows.net
      • SAS token from Azure Key Vault: NOTE: If you are specifying different blob host and container combinations for your extra stores, you must create a new Key Vault store. See above for details.

  4. Save your changes and restart the platform.
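The rules in the steps above can be sketched as a small consistency check to run before you restart the platform. This is an illustrative sketch only: the function name validate_stores and its messages are hypothetical, and it assumes that you have already extracted the azure.wasb.stores array and the azure.wasb.fetchSasTokensFromKeyVault flag from your configuration.

```python
from typing import List


def validate_stores(stores: List[dict], fetch_from_key_vault: bool) -> List[str]:
    """Return a list of problems found; an empty list means no problem was detected."""
    if not stores:
        return ["azure.wasb.stores is empty; the first entry must be the default container"]

    # Which token field is in use depends on azure.wasb.fetchSasTokensFromKeyVault.
    used = "keyVaultSasTokenSecretName" if fetch_from_key_vault else "sasToken"
    unused = "sasToken" if fetch_from_key_vault else "keyVaultSasTokenSecretName"

    problems: List[str] = []
    for index, store in enumerate(stores):
        # The first entry is the default container; the rest are extra stores.
        label = "default store" if index == 0 else f"extra store {index}"
        if not store.get(used):
            problems.append(f"{label}: '{used}' must be set")
        if unused not in store:
            problems.append(f"{label}: '{unused}' line is missing; keep it with an empty value")
        elif store[unused]:
            problems.append(f"{label}: '{unused}' should be an empty string")
        for key in ("container", "blobHost"):
            if not store.get(key):
                problems.append(f"{label}: '{key}' must be set")
    return problems
```

The check only covers the structure of the array. It does not contact Azure or confirm that a token or secret name is valid.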

Generate per-container SAS token

Execute the appropriate command at the command line to generate a SAS token for a specific container. The following Windows PowerShell command generates a SAS token that is valid for a full year:

Set-AzureRmStorageAccount -Name 'name'
$sasToken = New-AzureStorageContainerSASToken -Permission r -ExpiryTime (Get-Date).AddYears(1) -Name '<container_name>'

Tip: You can also generate a Shared Access Signature token for your Storage Account and Container from the Azure Portal.
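Before you paste a SAS token into the configuration, it can help to confirm its permissions and expiry. The following is a minimal sketch that parses a token shaped like the example value shown earlier, using only the Python standard library. The function names are hypothetical, and the sketch assumes that the se field uses the full YYYY-MM-DDTHH:MM:SSZ form shown in that example.

```python
from datetime import datetime, timezone
from typing import Optional
from urllib.parse import parse_qs


def inspect_sas_token(token: str) -> dict:
    """Extract the permissions (sp), expiry (se), and protocol (spr) fields."""
    fields = parse_qs(token.lstrip("?"))
    expires = datetime.strptime(fields["se"][0], "%Y-%m-%dT%H:%M:%SZ")
    return {
        "permissions": fields["sp"][0],
        "expires": expires.replace(tzinfo=timezone.utc),
        "https_only": fields.get("spr", [""])[0] == "https",
    }


def is_expired(token: str, now: Optional[datetime] = None) -> bool:
    """Return True if the token's expiry time is not in the future."""
    now = now or datetime.now(timezone.utc)
    return inspect_sas_token(token)["expires"] <= now
```

A token generated with -Permission r, as in the command above, would report r as its permissions, which is sufficient for read-only access.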


Configure storage protocol

You must configure the platform to use the WASBS (secure) storage protocol when accessing WASB.

Steps:

  1. You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.
  2. Locate the following parameter and change its value to wasbs for secure access:

    "webapp.storageProtocol": "wasbs",
  3. Set the following:

    "hdfs.enabled": false,
  4. Save your changes and restart the platform.

Define storage locations

You must define the base blob locations and supported protocol for storing data on WASB.

Steps:

  1. You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.
  2. Locate the following configuration block. Specify the listed changes:

    "fileStorage": {
        "defaultBaseUris": [
          "<baseURIOfYourBlob>"
        ],
        "whitelist": ["wasbs"]
      }
    defaultBaseUris
      • Description: For each supported protocol, this array must contain a top-level path to the location where Designer Cloud powered by Trifacta platform files can be stored. These files include uploads, samples, and temporary storage used during job execution.
      • NOTE: The wasbs:// protocol identifier must be included. The unsecured WASB protocol is not supported.
      • Example value: wasbs://container@storage-account.blob.core.windows.net/

    whitelist
      • Description: A comma-separated list of protocols that are permitted to read and write with WASB storage.
      • NOTE: This array of values must include wasbs.

  3. Save your changes and restart the platform.
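As an illustration of the fileStorage settings above, the following sketch builds a base URI from a container name and blob host and checks that every base URI uses a protocol listed in whitelist. The function names are hypothetical, and the sketch assumes that you have extracted the fileStorage block into a Python dictionary.

```python
from typing import List
from urllib.parse import urlsplit


def build_base_uri(container: str, blob_host: str) -> str:
    """Compose a base URI in the form shown in the example value above."""
    return f"wasbs://{container}@{blob_host}/"


def check_file_storage(file_storage: dict) -> List[str]:
    """Return a list of problems found in a fileStorage block."""
    problems: List[str] = []
    whitelist = file_storage.get("whitelist", [])
    if "wasbs" not in whitelist:
        problems.append("whitelist must include wasbs")
    for uri in file_storage.get("defaultBaseUris", []):
        scheme = urlsplit(uri).scheme
        if scheme not in whitelist:
            problems.append(f"{uri}: protocol '{scheme}' is not in whitelist")
    return problems
```

For example, build_base_uri("container", "storage-account.blob.core.windows.net") yields the example value listed for defaultBaseUris.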

Testing

After the configuration has been specified, a WASB connection appears in the Import Data page. Select it to begin navigating through the WASB Browser for data sources.

Try running a simple job from the Designer Cloud application. For more information, see Verify Operations.

Troubleshooting

For additional troubleshooting information, see ADLS Gen2 Access.

Using WASB Connection

Uses of WASB

The Designer Cloud powered by Trifacta platform can use WASB for the following tasks for reading and writing data:

  1. Importing datasets from WASB Files: You can read in source data stored in WASB. An imported dataset may be a single WASB file or a folder of identically structured files. See Reading from sources in WASB below.
  2. Reading Datasets: When creating a dataset, you can pull your data from a source in WASB. See Creating datasets below.
  3. Writing Results: You can publish your results directly to WASB. See Writing job results below.
  4. Publishing Job Results: After a job has been executed, you can write the results back to WASB. See Writing job results below.

In the Designer Cloud application, WASB is accessed through the WASB browser. See File System Browser.

NOTE: When the Designer Cloud powered by Trifacta platform executes a job on a dataset, the source data is untouched. Results are written to a new location, so that no data is disturbed by the process.

Before you begin using WASB

  • Read/Write Access: Your administrator must configure read/write permissions to locations in WASB. Please see the WASB documentation provided with your cluster software distribution.

    Avoid using /trifacta/uploads for reading and writing data. This directory is used by the Designer Cloud application.

    NOTE: If a directory is created on the cluster through WASB, the directory includes a Size=0 blob. The Designer Cloud powered by Trifacta platform does not list Size=0 blobs and does not support interaction with them.

  • Your cluster administrator should provide a place or mechanism for raw data to be uploaded to your datastore.

  • Your cluster administrator should provide a writeable home output directory for you, which you can review. See Storage Config Page.

Secure access

Client-side encryption is not supported. Through WASBS, HTTPS is supported.

Storing data in WASB

Your cluster administrator should provide raw data or locations and access for storing raw data within WASB. All Trifacta users should have a clear understanding of the folder structure within WASB where each individual can read from and write their job results. 

  • Users should know where shared data is located and where personal data can be saved without interfering with or confusing other users.
  • The Designer Cloud application stores the results of each job in a separate folder in WASB.

NOTE: The Designer Cloud powered by Trifacta platform does not modify source data in WASB. Data stored in WASB is read without modification from source locations, and source data that is uploaded to the platform is stored in /trifacta/uploads.

Reading from sources in WASB

You can import a dataset from one or more files stored in WASB.

Wildcards:

You can parameterize your input paths to import multiple source files as part of the same imported dataset. For more information, see Overview of Parameterization.

Folder selection:

When you select a folder in WASB for your imported dataset, you select all files in the folder to be included. Notes:

  • This option selects all files in all sub-folders. If your sub-folders contain separate datasets, you should be more specific in your folder selection.
  • All files used in a single imported dataset must be of the same format and have the same structure. For example, you cannot mix and match CSV and JSON files if you are reading from a single directory. Files of the same format must have identical column structures.
  • When a folder is selected from WASB, the following files are ignored:
    • *_SUCCESS and *_FAILED files, which may be present if the folder has been populated from the cluster.
    • Files whose names begin with an underscore (_). These files cannot be read during batch transformation. Please rename these files through WASB so that they do not begin with an underscore.
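To preview which files in a folder would be skipped under the rules above, you could apply the same name patterns to a file listing. This is a rough sketch and not the platform's own logic: the function names are hypothetical, and paths are assumed to use forward slashes.

```python
from fnmatch import fnmatchcase
from posixpath import basename
from typing import Iterable, List


def is_ignored(path: str) -> bool:
    """Return True for file names matching the ignore rules listed above."""
    name = basename(path)
    return (
        name.startswith("_")
        or fnmatchcase(name, "*_SUCCESS")
        or fnmatchcase(name, "*_FAILED")
    )


def importable_files(paths: Iterable[str]) -> List[str]:
    """Keep only the paths that would not be ignored on folder selection."""
    return [path for path in paths if not is_ignored(path)]
```

A file such as sales/_tmp.csv would be reported as ignored, which signals that it should be renamed before the folder is imported.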

Creating datasets

When creating a dataset, you can choose to read source data stored in WASB or from a local file.

  • WASB sources are not moved or changed.
  • Local file sources are uploaded to /trifacta/uploads where they remain and are not changed.

Data may be individual files or all of the files in a folder.

Writing job results

When your job results are generated, they can be stored back in WASB at the location defined for your user account.

  • As part of the job specification, you can create a publishing target in WASB. See Run Job Page.
  • For ad-hoc publication to WASB, the target location is available through the Job Details page. See Job Details Page.
  • Each set of job results must be stored in a separate folder within your WASB output home directory. 
  • For more information on your output home directory, see Storage Config Page.

If your deployment is using WASB, do not use the /trifacta/uploads directory. This directory is used for storing uploads and metadata, which may be used by multiple users. Manipulating files in this directory outside of the Designer Cloud application can destroy other users' data. Please use the tools provided through the interface for managing uploads from WASB.

Creating a new dataset from results

As part of writing job results, you can choose to create a new dataset, so that you can chain together data wrangling tasks.

NOTE: When you create a new dataset as part of your job results, the file or files are written to the designated output location for your user account. Depending on how your WASB permissions are configured, this location might not be accessible to other users.

Reference

Supported Versions: n/a

Supported Environments:

Operation | Designer Cloud Powered by Trifacta Enterprise Edition | Amazon | Microsoft Azure
Read | Not supported | Not supported | Supported
Write | Not supported | Not supported | Supported (only if WASB is base storage layer)

Create New Connection: n/a

NOTE: A single public connection to WASB is supported.


