
After you have installed the Trifacta platform software and databases in your Microsoft Azure infrastructure, complete these steps to perform the basic integration between the Trifacta node and Azure resources, such as the backend storage layer and the running environment cluster.

NOTE: This section includes only basic configuration for required platform functions and integrations with Azure. Please use the links in this section to access additional details on these key features.

Tip: When you save changes from within the Trifacta platform, your configuration is automatically validated, and the platform is automatically restarted.


Configure in Azure

These steps require admin access to your Azure deployment.

Create registered application

To create an Azure Active Directory (AAD) application, please complete the following steps in the Azure console.

Steps:

  1. Create registered application:

    1. In the Azure console, navigate to Azure Active Directory > App Registrations.

    2. Create a New App. Name it trifacta.

      NOTE: Retain the Application ID and Directory ID for configuration in the Trifacta platform.

  2. Create a client secret:
    1. Navigate to Certificates & secrets.
    2. Create a new Client secret.

      NOTE: Retain the value of the Client secret for configuration in the Trifacta platform.

  3. Add API permissions:
    1. Navigate to API Permissions.
    2. Add Azure Key Vault with the user_impersonation permission.

For additional details, see Configure for Azure.
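
The Application ID, Directory ID, and Client secret that you retain here are supplied to the platform later in this procedure. As a preview, they map to the following settings in trifacta-conf.json:

    "azure.applicationId": "<azure_application_id>",
    "azure.directoryId": "<azure_directory_id>",
    "azure.secret": "<azure_secret>",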

Please complete the following steps in the Azure portal to create a Key Vault and to associate it with the Trifacta registered application.

NOTE: A Key Vault is required for use with the Trifacta platform.

Create Key Vault in Azure

Steps:

  1. Log into the Azure portal.
  2. Go to: https://portal.azure.com/#create/Microsoft.KeyVault
  3. Complete the form for creating a new Key Vault resource:
    1. Name: Provide a reasonable name for the resource. Example:

      <clusterName>-<applicationName>-<group/organizationName>

      Or, you can use trifacta.

    2. Location: Pick the location used by the HDI cluster.
    3. For other fields, add appropriate information based on your enterprise's preferences.
  4. To create the resource, click Create.

    NOTE: Retain the DNS Name value for later use.
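
The DNS Name is supplied to the platform later as the azure.keyVaultUrl setting. For example, if you named the vault trifacta as suggested above, the value would look similar to the following (your actual DNS Name may differ):

    "azure.keyVaultUrl": "https://trifacta.vault.azure.net/",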

Enable Key Vault access for the Trifacta platform

In the Azure portal, you must assign access policies so that the application principal of the Trifacta registered application can access the Key Vault.

Steps:

  1. In the Azure portal, select the Key Vault you created. Then, select Access Policies.
  2. In the Access Policies window, select the Trifacta registered application.
  3. Click Add Access Policy.
  4. Select the following secret permissions (at a minimum):
    1. Get
    2. Set
    3. Delete
  5. Select the Trifacta application principal.
  6. Assign the policy you just created to that principal.

For additional details, see Configure Azure Key Vault.

Create or modify Azure backend datastore

In the Azure console, you must create or modify the backend datastore for use with the Trifacta platform. Supported datastores:

NOTE: You should review the limitations for your selected datastore before configuring the platform to use it. After the base storage layer has been defined in the platform, it cannot be modified.


Datastore | Notes
ADLS Gen2 | Supported for use with Azure Databricks clusters only. See Enable ADLS Gen2 Access.
ADLS | See Enable ADLS Access.
WASB | Only the wasbs protocol is supported. See Enable WASB Access.


Create or modify running environment cluster

In the Azure console, you must create or modify the running environment where jobs are executed by the Trifacta platform. Supported running environments:

NOTE: You should review the limitations for your selected running environment before configuring the platform to use it.


Running Environment | Notes
Azure Databricks | See Configure for Azure Databricks.
HDI | See Configure for HDInsight.

Configure the Platform

Please complete the following steps to configure the Trifacta platform and to integrate it with Azure resources.

Base platform configuration

Please complete the following configuration steps in the Trifacta® platform.

NOTE: If you are integrating with Azure Databricks and are using Managed Identities for authentication, please skip this section. That configuration is covered in a later step.

NOTE: Except as noted, these configuration steps are required for all Azure installs. These values must be extracted from the Azure portal.

Steps:

  1. You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.
  2. Azure registered application values:

    "azure.applicationId": "<azure_application_id>",
    "azure.directoryId": "<azure_directory_id>",
    "azure.secret": "<azure_secret>",
    Parameter | Description
    azure.applicationId | Application ID for the Trifacta registered application that you created in the Azure console.
    azure.directoryId | The directory ID for the Trifacta registered application.
    azure.secret | The value of the Client secret for the Trifacta registered application.

  3. Configure Key Vault:

    "azure.keyVaultUrl": "<url_of_key_vault>",
    Parameter | Description
    azure.keyVaultUrl | URL of the Azure Key Vault that you created in the Azure console.
  4. Save your changes and restart the platform.
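
For reference, a completed fragment might look similar to the following sketch. The IDs, secret, and Key Vault name shown here are placeholder values only; substitute the values you retained from the Azure console:

    "azure.applicationId": "a1b2c3d4-1111-2222-3333-444455556666",
    "azure.directoryId": "d4c3b2a1-5555-6666-7777-888899990000",
    "azure.secret": "<your_client_secret_value>",
    "azure.keyVaultUrl": "https://trifacta.vault.azure.net/",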

For additional details, see Configure for Azure and Configure Azure Key Vault.

Set base storage layer

The Trifacta platform supports integration with the following backend datastores on Azure.

  • ADLS Gen2
  • ADLS
  • WASB

ADLS Gen2

Please complete the following configuration steps in the Trifacta® platform.

NOTE: Integration with ADLS Gen2 is supported only on Azure Databricks.


Steps:

  1. You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.
  2. Enable ADLS Gen2 as the base storage layer:

    "webapp.storageProtocol": "abfss",
    "hdfs.enabled": false,
    "hdfs.protocolOverride": "",
    Parameter | Description
    webapp.storageProtocol | Sets the base storage layer for the platform. Set this value to abfss. NOTE: After this parameter has been saved, you cannot modify it. You must re-install the platform to change it.
    hdfs.enabled | For ADLS Gen2 access, set this value to false.
    hdfs.protocolOverride | For ADLS Gen2 access, leave this parameter empty. It is ignored when the storage protocol is set to abfss.
  3. Configure ADLS Gen2 access mode. The following parameter must be set to system.

    "azure.adlsgen2.mode": "system",
  4. The platform must be configured to use the CDH 6.2 bundle JARs:

    NOTE: If you have integrated with Databricks Tables, do not overwrite the value for data-service.hiveJdbcJar with the following value, even if it's set to a different distribution JAR file.

    "hadoopBundleJar": "hadoop-deps/cdh-6.2/build/libs/cdh-6.2-bundle.jar",
    "spark-job-service.hiveDependenciesLocation": "%(topOfTree)s/hadoop-deps/cdh-6.2/build/libs",
    "data-service.hiveJdbcJar": "hadoop-deps/cdh-6.2/build/libs/cdh-6.2-hive-jdbc.jar",
  5. Set the protocol whitelist and base URIs for ADLS Gen2:

    "fileStorage.whitelist": ["abfss"],
    "fileStorage.defaultBaseUris": ["abfss://filesystem@storageaccount.dfs.core.windows.net/"],
    Parameter | Description
    fileStorage.whitelist | A comma-separated list of protocols that are permitted to read from and write to ADLS Gen2 storage. NOTE: The protocol identifier "abfss" must be included in this list.
    fileStorage.defaultBaseUris | For each supported protocol, this parameter must contain a top-level path to the location where Trifacta platform files can be stored. These files include uploads, samples, and temporary storage used during job execution. NOTE: A separate base URI is required for each supported protocol. You may have only one base URI for each protocol.

  6. Save your changes.
  7. The Java VFS service must be enabled for ADLS Gen2 access. For more information, see Configure Java VFS Service in the Configuration Guide.
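
As a reference, the combined ADLS Gen2 settings from the steps above might look similar to the following sketch; the filesystem and storage account names in the base URI are hypothetical placeholders:

    "webapp.storageProtocol": "abfss",
    "hdfs.enabled": false,
    "hdfs.protocolOverride": "",
    "azure.adlsgen2.mode": "system",
    "hadoopBundleJar": "hadoop-deps/cdh-6.2/build/libs/cdh-6.2-bundle.jar",
    "spark-job-service.hiveDependenciesLocation": "%(topOfTree)s/hadoop-deps/cdh-6.2/build/libs",
    "data-service.hiveJdbcJar": "hadoop-deps/cdh-6.2/build/libs/cdh-6.2-hive-jdbc.jar",
    "fileStorage.whitelist": ["abfss"],
    "fileStorage.defaultBaseUris": ["abfss://trifacta@examplestorageacct.dfs.core.windows.net/"],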

For additional details, see Enable ADLS Gen2 Access.

ADLS

ADLS access leverages the HDFS protocol and storage layer, so additional HDFS configuration is required.

Steps:

  1. You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.
  2. Enable ADLS as the base storage layer:

    "webapp.storageProtocol": "hdfs",
    "hdfs.enabled": true,
    "hdfs.protocolOverride": "adl",
    Parameter | Description
    webapp.storageProtocol | Sets the base storage layer for the platform. Set this value to hdfs. NOTE: After this parameter has been saved, you cannot modify it. You must re-install the platform to change it.
    hdfs.enabled | For ADLS blob storage, set this value to true.
    hdfs.protocolOverride | For ADLS blob storage, this parameter must be set to adl.
  3. These parameters specify the Azure Data Lake for the platform:

    "azure.adl.enabled": "true",
    "azure.adl.store": "adl://xxx.azuredatalakestore.net",
    Parameter | Description
    azure.adl.enabled | To enable access to the Azure Data Lake, set this value to true.
    azure.adl.store | Specify the value of the Azure Data Lake Store here. NOTE: The protocol should be set to adl://.

  4. Configure the appropriate Hadoop bundle JAR to use:

    "hadoopBundleJar": "hadoop-deps/hdp-2.6/build/libs/hdp-2.6-bundle.jar",
  5. Configure access to HDFS resources:

    "hdfs.namenode.host": "xxx.azuredatalakestore.net",
    "hdfs.namenode.port": "443",
    "hdfs.webhdfs.host": "xxx.azuredatalakestore.net",
    "hdfs.webhdfs.ssl.enabled": "true",
    "hdfs.webhdfs.port": "443",
    "hdfs.highavailability.serviceName": "xxx.azuredatalakestore.net",
    Parameter | Description
    hdfs.namenode.host | Hostname of the HDFS namenode.
    hdfs.namenode.port | Port number for the HDFS namenode.
    hdfs.webhdfs.host | Hostname of the WebHDFS service.
    hdfs.webhdfs.ssl.enabled | If SSL has been enabled on the WebHDFS host, set this value to true. You are likely to need to set the port to a non-default value as well.
    hdfs.webhdfs.port | Port number of the WebHDFS service.
    hdfs.highavailability.serviceName | Set this value to the high availability service name for HDFS, if you have enabled integration with cluster high availability. For more information, see Enable Integration with Cluster High Availability in the Configuration Guide.
  6. Save your changes.
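
As a reference, the combined ADLS settings from the steps above might look similar to the following sketch, where xxx.azuredatalakestore.net stands in for your own Data Lake Store host:

    "webapp.storageProtocol": "hdfs",
    "hdfs.enabled": true,
    "hdfs.protocolOverride": "adl",
    "azure.adl.enabled": "true",
    "azure.adl.store": "adl://xxx.azuredatalakestore.net",
    "hadoopBundleJar": "hadoop-deps/hdp-2.6/build/libs/hdp-2.6-bundle.jar",
    "hdfs.namenode.host": "xxx.azuredatalakestore.net",
    "hdfs.namenode.port": "443",
    "hdfs.webhdfs.host": "xxx.azuredatalakestore.net",
    "hdfs.webhdfs.ssl.enabled": "true",
    "hdfs.webhdfs.port": "443",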

For additional details, see Enable ADLS Access.

WASB

Steps:

  1. You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.
  2. Enable WASB as the base storage layer:

    "webapp.storageProtocol": "wasbs",
    "hdfs.enabled": false,
    Parameter | Description
    webapp.storageProtocol | Sets the base storage layer for the platform. Set this value to wasbs. The wasb protocol is not supported. NOTE: After this parameter has been saved, you cannot modify it. You must re-install the platform to change it.
    hdfs.enabled | For WASB blob storage, set this value to false.
  3. Save your changes.
  4. In the following sections, you configure how the platform acquires the SAS token that it uses for WASB access. The token can come from one of the following sources:
    1. Platform configuration
    2. The Azure Key Vault

Configure SAS token for WASB

When integrating with WASB, the platform must be configured to use a SAS token to gain access to WASB resources. This token can be made available in either of the following ways, each of which requires separate configuration.

Via Trifacta platform configuration:

  1. You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.
  2. Locate and specify the following parameters:

    "azure.wasb.defaultStore.blobHost": "<xxxxxxx.blob.core.windows.net>",
    "azure.wasb.defaultStore.container": "<name_of_container>",
    "azure.wasb.defaultStore.keyVaultSasTokenSecretName": "",
    "azure.wasb.defaultStore.sasToken": "<sas_token_of_your_blob_storage>",
    "azure.wasb.enabled": true,
    "azure.wasb.extraStores": [],
    "azure.wasb.fetchSasTokensFromKeyVault": false,
    Parameter | Description
    azure.wasb.defaultStore.blobHost | Host of the WASB storage container. Acquire this value from your Azure console. NOTE: Do not add a protocol identifier such as https:// to the front of this value.
    azure.wasb.defaultStore.container | Name of the default storage container.
    azure.wasb.defaultStore.keyVaultSasTokenSecretName | For acquiring the SAS token from platform configuration, leave this value empty.
    azure.wasb.defaultStore.sasToken | Enter the SAS token to access the WASB blob storage container.
    azure.wasb.enabled | Set this value to true.
    azure.wasb.extraStores | You can configure extra WASB storage blobs to access. For additional details, see Enable WASB Access.
    azure.wasb.fetchSasTokensFromKeyVault | For acquiring the SAS token from platform configuration, set this value to false.
  3. Save your changes and restart the platform.

Via Azure Key Vault:

To require the Trifacta platform to acquire the SAS token from the Azure key vault, please complete the following configuration steps.

  1. You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.
  2. Locate and specify the following parameters:

    "azure.wasb.defaultStore.blobHost": "<xxxxxxx.blob.core.windows.net>",
    "azure.wasb.defaultStore.container": "<name_of_container>",
    "azure.wasb.defaultStore.keyVaultSasTokenSecretName": "<your_key_vault_secret_name>",
    "azure.wasb.defaultStore.sasToken": "",
    "azure.wasb.enabled": true,
    "azure.wasb.extraStores": [],
    "azure.wasb.fetchSasTokensFromKeyVault": true,
    Parameter | Description
    azure.wasb.defaultStore.blobHost | Host of the WASB storage container. Acquire this value from your Azure console. NOTE: Do not add a protocol identifier such as https:// to the front of this value.
    azure.wasb.defaultStore.container | Name of the default storage container.
    azure.wasb.defaultStore.keyVaultSasTokenSecretName | For acquiring the SAS token from the Key Vault, paste in the value of the Azure Key Vault secret name, which you defined when creating the registered application in the Azure console.
    azure.wasb.defaultStore.sasToken | Leave this value empty.
    azure.wasb.enabled | Set this value to true.
    azure.wasb.extraStores | You can configure extra WASB storage blobs to access. For additional details, see Enable WASB Access.
    azure.wasb.fetchSasTokensFromKeyVault | For acquiring the SAS token from the Key Vault, set this value to true.
  3. Save your changes and restart the platform.

For additional details, see Enable WASB Access.
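
Putting the pieces together, a minimal WASB setup that stores the SAS token directly in platform configuration might look similar to the following sketch; the storage account host and container name are hypothetical placeholders:

    "webapp.storageProtocol": "wasbs",
    "hdfs.enabled": false,
    "azure.wasb.enabled": true,
    "azure.wasb.defaultStore.blobHost": "examplestorageacct.blob.core.windows.net",
    "azure.wasb.defaultStore.container": "trifacta",
    "azure.wasb.defaultStore.sasToken": "<sas_token_of_your_blob_storage>",
    "azure.wasb.defaultStore.keyVaultSasTokenSecretName": "",
    "azure.wasb.fetchSasTokensFromKeyVault": false,
    "azure.wasb.extraStores": [],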

Checkpoint: At this point, you should be able to load data from your backend datastore, if data is available. You can try to run a small job on Photon, which is native to the Trifacta node. You cannot yet run jobs on an integrated cluster.


Integrate with running environment

The Trifacta platform can run jobs on the following running environments.

NOTE: You may integrate with only one of these environments.

Base configuration for Azure running environments

The following parameters should be configured for all Azure running environments.

Steps:

  1. You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.
  2.  Parameters:

    "webapp.runInTrifactaServer": true,
    "webapp.runinEMR": false,
    "webapp.runInDataflow": false,
    "photon.enabled": true,
    Parameter | Description
    webapp.runInTrifactaServer | When set to true, the platform recommends and can run smaller jobs on the Trifacta node, which uses the embedded Photon running environment. Tip: Unless otherwise instructed, the Photon running environment should be enabled.
    webapp.runinEMR | For Azure, set this value to false.
    webapp.runInDataflow | For Azure, set this value to false.
    photon.enabled | Set this value to true.
  3. Save your changes.

Azure Databricks

The Trifacta platform can be configured to integrate with supported versions of Azure Databricks clusters to run jobs in Spark. 

NOTE: Before you attempt to integrate, you should review the limitations around this integration. For more information, see Configure for Azure Databricks.

Steps:

  1. You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.

  2. Configure the following parameters to enable job execution on the specified Azure Databricks cluster:

    "webapp.runInDatabricks": true,
    "webapp.runWithSparkSubmit": false,
    Parameter | Description
    webapp.runInDatabricks | Defines whether the platform runs jobs in Azure Databricks. Set this value to true.
    webapp.runWithSparkSubmit | For all Azure Databricks deployments, this value should be set to false.
  3. Configure the following Azure Databricks-specific parameters:

    "databricks.serviceUrl": "<url_to_databricks_service>",
    Parameter | Description
    databricks.serviceUrl | URL to the Azure Databricks service where Spark jobs will be run (example: https://westus2.azuredatabricks.net).

    NOTE: If you are using instance pooling on the cluster, additional configuration is required. See Configure for Azure Databricks.

  4. Save your changes and restart the platform.
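
Combined with the base running environment settings from earlier in this section, an Azure Databricks configuration might look similar to the following sketch; the region-specific service URL is an example only:

    "webapp.runInTrifactaServer": true,
    "webapp.runinEMR": false,
    "webapp.runInDataflow": false,
    "photon.enabled": true,
    "webapp.runInDatabricks": true,
    "webapp.runWithSparkSubmit": false,
    "databricks.serviceUrl": "https://westus2.azuredatabricks.net",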

For additional details, see Configure for Azure Databricks.

HDInsight

The Trifacta platform can be configured to integrate with supported versions of HDInsight clusters to run jobs in Spark. 

NOTE: Before you attempt to integrate, you should review the limitations around this integration. For more information, see Configure for HDInsight.

Specify running environment options:

  1. You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.

  2. Configure the following parameters to enable job execution on the specified HDI cluster:

    "webapp.runInDatabricks": false,
    "webapp.runWithSparkSubmit": true,
    Parameter | Description
    webapp.runInDatabricks | Defines whether the platform runs jobs in Azure Databricks. Set this value to false.
    webapp.runWithSparkSubmit | For HDI deployments, this value should be set to true.

Specify Trifacta user:

Set the Hadoop username for the Trifacta platform to use for executing jobs [hadoop.user (default=trifacta)]:  

"hdfs.username": "[hadoop.user]",

Specify location of client distribution bundle JAR:

The Trifacta platform ships with client bundles supporting a number of major Hadoop distributions. You must configure the platform to use the bundle JAR file for your distribution. These bundles are stored in the following directory:

/trifacta/hadoop-deps

Configure the bundle distribution property (hadoopBundleJar):

  "hadoopBundleJar": "hadoop-deps/hdp-2.6/build/libs/hdp-2.6-bundle.jar"

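Taken together, the HDI-specific run environment settings above might look similar to the following sketch; the trifacta username is the default noted above:

    "webapp.runInDatabricks": false,
    "webapp.runWithSparkSubmit": true,
    "hdfs.username": "trifacta",
    "hadoopBundleJar": "hadoop-deps/hdp-2.6/build/libs/hdp-2.6-bundle.jar",
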
Configure component settings:

For each of the following components, please explicitly set the following settings.

  1. You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.

  2. Configure Batch Job Runner:

      "batch-job-runner": {
       "autoRestart": true,
        ...
        "classpath": "%(topOfTree)s/services/batch-job-runner/build/install/batch-job-runner/batch-job-runner.jar:%(topOfTree)s/services/batch-job-runner/build/install/batch-job-runner/lib/*:%(topOfTree)s/%(hadoopBundleJar)s:/etc/hadoop/conf:%(topOfTree)s/conf/hadoop-site:/usr/lib/hdinsight-datalake/*:/usr/hdp/current/hadoop-client/client/*:/usr/hdp/current/hadoop-client/*"
      },
  3. Configure the following environment variables:

    "env.PATH": "${HOME}/bin:$PATH:/usr/local/bin:/usr/lib/zookeeper/bin",
    "env.TRIFACTA_CONF": "/opt/trifacta/conf",
    "env.JAVA_HOME": "/usr/lib/jvm/java-1.8.0-openjdk-amd64",
  4. Configure the following properties for various Trifacta components:

      "ml-service": {
       "autoRestart": true
      },
      "monitor": {
       "autoRestart": true,
        ...
       "port": <your_cluster_monitor_port>
      },
      "proxy": {
       "autoRestart": true
      },
      "udf-service": {
       "autoRestart": true
      },
      "webapp": {
        "autoRestart": true
      },
  5. Disable S3 access:

    "aws.s3.enabled": false,
  6. Configure the following Spark Job Service properties:

    "spark-job-service.classpath": "%(topOfTree)s/services/spark-job-server/server/build/install/server/lib/*:%(topOfTree)s/conf/hadoop-site/:%(topOfTree)s/services/spark-job-server/build/bundle/*:/usr/hdp/current/hadoop-client/hadoop-azure.jar:/usr/hdp/current/hadoop-client/lib/azure-storage-2.2.0.jar",
    "spark-job-service.env.SPARK_DIST_CLASSPATH": "/usr/hdp/current/hadoop-client/*:/usr/hdp/current/hadoop-mapreduce-client/*",
  7. Save your changes.

For additional details, see Configure for HDInsight.

Checkpoint: At this point, you should be able to load data from your backend datastore and run jobs on an integrated cluster.

Configure platform authentication

The Trifacta platform supports the following methods of authentication when hosted in Azure.

Integrate with Azure AD SSO

The platform can be configured to integrate with your enterprise's Azure Active Directory provider. For more information, see Configure SSO for Azure AD.

Non-SSO authentication

If you are not applying your enterprise SSO authentication to the Trifacta platform, platform users must be created and managed through the application. 

Self-managed:

Users can be permitted to self-register their accounts and manage their password reset requests:

NOTE: Self-created accounts are permitted to import data, generate samples, run jobs, and generate and download results. Admin roles must be assigned manually through the application.

Admin-managed:

If users are not permitted to create their accounts, an admin must do so: 

via API:

For more information on creating user accounts via API, see API People Create v4 in the Developer Guide.

Checkpoint: Users who are authenticated or who have been provisioned user accounts should be able to log in to the Trifacta application and begin using the product.

Documentation

You can access complete product documentation online and in PDF format. From within the product, select Help menu > Documentation.
