- All desktop users of the platform must have the latest version of Google Chrome installed on their desktops.
- All desktop users must be able to connect to the created instance through the enterprise infrastructure.
NOTE: The following guidelines apply only to deployments from the Azure Marketplace.
Use the following guidelines to select your instance size:

| Azure virtual machine type | vCPUs | RAM (GB) | Max recommended concurrent users | Avg. input data size of jobs (GB) |
|---|---|---|---|---|
| Standard_A4 | 8 | 14 | 30 | 1 |
| Standard_A6 | 4 | 28 | 30 | 2 |
| Standard_A7 | 8 | 56 | 60 | 5 |
| Standard_A10 | 8 | 56 | 60 | 5 |
| Standard_A11 | 16 | 112 | 90 | 11 |
| Standard_D4_v2 | 8 | 28 | 30 | 2 |
| Standard_D5_v2 | 16 | 56 | 90 | 5 |
| Standard_D12_v2 | 4 | 28 | 30 | 2 |
| Standard_D13_v2 | 8 | 56 | 60 | 5 |
| Standard_D14_v2 | 16 | 112 | 90 | 11 |
| Standard_D15_v2 | 20 | 140 | 120 | 14 |
Before you install the platform, verify that the following prerequisites have been met.
License: When you install the software, the installed license is valid for 24 hours. You must acquire a license key file before this period expires. For more information, please contact Sales.
- Supported Cluster Types:
- Hadoop
- HBase
- Spark
Data Lake Store only: If you are integrating with Azure Data Lake Store, please review and complete the following section.
- Azure Active Directory registered application permissions are listed below.
- If you are connecting to a SQL DW database, additional permissions are required and are described in the full Install Guide. See Create SQL DW Connections.
- If you are integrating with a WASB cluster, you must generate a SAS token with appropriate permissions for each WASB cluster. Instructions are available in the full Install Guide. See Enable WASB Access.
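If you prefer to script the SAS token step, the following Azure CLI sketch shows one way to generate a token for a single storage container. The account name, container name, permissions, and expiry are placeholder assumptions; consult the full Install Guide for the exact permissions the platform requires.

```bash
# Hypothetical example: generate a SAS token for one WASB container.
# Replace the account name, container name, and expiry with your own values.
# Depending on your login, you may also need --account-key or --connection-string.
az storage container generate-sas \
  --account-name mystorageaccount \
  --name mycontainer \
  --permissions rwdl \
  --expiry 2030-01-01T00:00:00Z \
  --output tsv
```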
Register application and read/write access
You must create an Azure Active Directory registered application with appropriate permissions for the following:
- Read/write access to the Azure Key Vault resources
- Read/write access to the Data Lake Store resource
This can be either the same service principal used for the HDInsight cluster or a new one created specifically for the platform. This service principal is used by the platform for access to all Azure resources. For more information, see https://docs.microsoft.com/en-us/azure/azure-resource-manager/resource-group-create-service-principal-portal.
- To create a new service principal, see https://docs.microsoft.com/en-us/azure/azure-resource-manager/resource-group-create-service-principal-portal#create-an-azure-active-directory-application.
- Obtain property values:
- If you created a new service principal, acquire the following property values prior to install.
- For an existing service principal, see https://docs.microsoft.com/en-us/azure/azure-resource-manager/resource-group-create-service-principal-portal#get-application-id-and-authentication-key to obtain the property values.
These values are applied during the install process.

| Azure Property | Location | Use |
|---|---|---|
| Application ID | Acquire this value from the Registered app blade of the Azure Portal. | Applied to configuration: azure.applicationId |
| Service User Key | Create a key for the Registered app in the Azure Portal. | Applied to configuration: azure.secret |
| Directory ID | Copy the Directory ID from the Properties blade of Azure Active Directory. | Applied to configuration: azure.directoryId |
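As a hedged alternative to the Portal, the following Azure CLI sketch illustrates one way to create a service principal and capture the three values above. The display name is a placeholder, and any role assignments for Key Vault or Data Lake Store still need to be granted according to your environment.

```bash
# Hypothetical example: create a registered application / service principal
# and print the values needed during install.
az ad sp create-for-rbac --name "my-platform-sp" --output json
# In the JSON output:
#   appId    -> Application ID   (azure.applicationId)
#   password -> Service User Key (azure.secret)
#   tenant   -> Directory ID     (azure.directoryId)

# The Directory (tenant) ID can also be read from the current subscription:
az account show --query tenantId --output tsv
```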
You can install the platform from the Azure Marketplace using one of the following methods:
Create cluster: Create a brand-new HDI cluster and add the platform as an application.
Tip: This method is the easiest and fastest to deploy.
Add application: Use an existing HDI cluster and add the platform as an application.
Tip: This method does not support choosing a non-default size for the application's edge node. If you need more flexibility, choose the following option.
Create a custom ARM template: Use an existing HDI cluster and configure a custom application via an Azure Resource Manager (ARM) template.
Tip: Use this method only if your environment requires additional configuration flexibility or automated deployment via ARM template.
Depending on your selection, follow the steps listed in one of the following sections.
NOTE: These steps include only the settings required or recommended for configuring a cluster and the application for use with the platform. Specify any other settings according to your enterprise requirements.
Use the following steps if you are creating a new HDI cluster and adding the platform to it.
Steps: - From the Azure Marketplace listing, click Get it Now.
- In the Microsoft Azure Portal, click the New blade.
- Select the application.
- Click Create. Then, click the Quick Create tab.
- Please configure any settings that are not listed below according to your enterprise requirements.
- For more information on the Quick Create settings, see https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-hadoop-provision-linux-clusters.
- Basics tab:
- Cluster type: One of the following:
- Hadoop
- HBase
- Spark
- Version:
- HDI 3.6
Click Select.
- Click Next.
- Storage tab:
Select your primary storage:
NOTE: The platform does not support the Additional storage options. Additional configuration is required after install is complete for the supported storage options.
- Azure Storage
- Azure Data Lake Store
- Select a SQL database for Hive: If you plan to use Hive, you can choose to specify a database. If not, the platform creates one for you.
- Click Next.
- In the Applications tab, select the application.
- In the application's tab:
- Review and accept the terms for using the product.
- Click Create.
- Click Ok.
- Click Next.
Cluster size tab: - Select from the available cluster sizes.
Click Select.
Script actions tab: - No changes are needed here.
Summary tab: - Review the specification of the cluster you are creating.
- Make modifications as needed.
- To create the cluster, click Create.
- After the cluster has been started and deployed:
- Log in to the application. See "Login" below.
- Change the admin password. See "Change admin password" below.
- Perform required additional configuration. See "Post-install configuration" below.
Use the following steps if you are adding the platform to a pre-existing HDI cluster.
NOTE: The edge node size is set to Standard_D13_v2 by default. The size cannot be modified.
Steps: - In the Microsoft Azure Portal, select the HDInsight cluster to which you are adding the application.
- In the Portal, select the Applications blade. Click + Add.
- From the list of available applications, select the application.
- Please accept the legal terms. Click Create.
- Click Ok.
- Click Next.
- The application is created for you.
- After the cluster has been started and deployed:
- Log in to the application. See "Login" below.
- Change the admin password. See "Change admin password" below.
- Perform required additional configuration. See "Post-install configuration" below.
Use the following steps if you are creating a custom application template for later deployment. This method provides more flexible configuration options and can be reused for future deployments.
NOTE: You must have a pre-existing HDI cluster for which to create the application template.
Steps: - Start here:
- https://github.com/trifacta/azure-deploy/
- Select the appropriate branch.
- Click Deploy from Azure.
- From the Microsoft Azure Portal, select the custom deployment link.
- Resource Group: Create or select one.
- Cluster Name: Select an existing cluster name.
- Edge Node Size: Select the instance type. For more information, see the Sizing Guide above.
- Version: Select the latest listed version of the application.
- Application Name: If desired, modify the application name as needed. This name must be unique per cluster.
- Subdomain Application URI Suffix: If desired, modify the three-character alphanumeric string used in the DNS name of the application. This suffix must be unique per cluster.
- Please specify values for the following. See the Pre-requisites section for details.
- Application ID
- Directory ID
- Secret
- Gallery Package Identifier: please leave the default value.
- Please accept the Microsoft terms of use.
- To create the template, click Purchase.
- The custom template can be used to create the application. For more information, see the Azure documentation; a command-line sketch of deploying the saved template appears after these steps.
- After the application has been started and deployed:
- Log in to the application. See "Login" below.
- Change the admin password. See "Change admin password" below.
- Perform required additional configuration. See "Post-install configuration" below.
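As referenced above, here is a hedged sketch of deploying the saved custom template from the command line instead of the Portal. It assumes you have exported the template and its parameters to local files; the resource group and file names are placeholders.

```bash
# Hypothetical example: deploy a previously saved custom ARM template
# into the resource group that contains the existing HDI cluster.
az deployment group create \
  --resource-group my-hdi-resource-group \
  --template-file azuredeploy.json \
  --parameters @azuredeploy.parameters.json
```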
Steps: - In the Azure Portal, select the HDI cluster.
- Select the Applications blade.
- Select the application.
- Click the Portal link.
You may be required to enter the cluster username and password.
NOTE: You can create a local cluster user so that application users do not need to use the administrative user's cluster credentials. To create such a user:
1. Navigate to the cluster's Ambari console.
2. In the user menu, select the Manage Ambari page.
3. Select Users.
4. Select Create Local User.
5. Enter a unique (lowercase) user name.
6. Enter a password and confirm that password.
7. For User Access, select Cluster User.
8. Verify that the user is not an Ambari Admin and has Active status.
9. Click Save.
A command-line sketch of this step appears after the login steps below.
- You are connected to the application.
- In the login screen, enter the default username and password:
- Username: admin@trifacta.local
- Password: admin
- Click Login.
NOTE: If this is your first login to the application, be sure to reset the admin password. Steps are provided below.
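As referenced in the note on local cluster users above, the same user can also be created through the Ambari REST API rather than the Ambari console. This is a sketch only: the cluster DNS name, admin credentials, and new user name are placeholder assumptions, and you may still need to grant the Cluster User role in Manage Ambari afterward.

```bash
# Hypothetical example: create a local (non-admin) Ambari user via the REST API.
# Replace <clustername>, the admin password, and the new user's credentials.
curl -u admin:'<cluster-admin-password>' \
  -H "X-Requested-By: ambari" \
  -X POST "https://<clustername>.azurehdinsight.net/api/v1/users" \
  -d '{"Users/user_name": "wrangleruser",
       "Users/password": "<new-user-password>",
       "Users/active": true,
       "Users/admin": false}'
```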
Steps: - If you haven't done so already, log in to the application as an administrator.
- In the menu bar, select Settings menu > Administrator.
- In the User Profile, enter and re-enter a new password.
- Click Save.
- Log out and log in again using the new password.
If you are integrating with HDI and did not install via a custom ARM template, the following settings must be specified in the platform.
NOTE: These settings are specified as part of the cluster definition. If you have not done so already, acquire the corresponding values from the Azure Portal.
Steps: - Log in to the platform as an administrator.
- In the menu bar, select Settings menu > Admin Settings.
In the Admin Settings page, specify values for the following parameters:
```
"azure.secret"
"azure.applicationId"
"azure.directoryId"
```
- Save your changes and restart the platform.
- When the platform has restarted, continue with the following configuration.
When the application is first created, the license is valid for 24 hours. Before the license expires, you must apply the license key file to the platform. Complete the following general steps.
Steps: - Locate the license key file that was provided to you. Store this file in a safe location that is not on the node itself.
- In the Azure Portal, select the HDI cluster.
- Select the Applications blade.
- Select the application.
- From the application properties, acquire the SSH endpoint.
- Connect via SSH to the edge node. For more information, see https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-hadoop-linux-use-ssh-unix.
Drop the license key file in the following directory on the node:
```
/opt/trifacta/license
```
- Restart the platform.
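A minimal sketch of that transfer from your workstation follows. It assumes the SSH endpoint acquired above, an SSH user with sudo rights, a license file named license.json, and that the platform service is managed with the standard service command; all of these are assumptions to adjust for your environment.

```bash
# Hypothetical example: copy the license key file to the edge node and restart.
scp license.json sshuser@<edge-node-ssh-endpoint>:/tmp/

ssh sshuser@<edge-node-ssh-endpoint> \
  "sudo mv /tmp/license.json /opt/trifacta/license/ && sudo service trifacta restart"
# The service name 'trifacta' is an assumption; use your platform's restart procedure.
```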
By default, the platform includes a static refresh token encryption key for the secure token service. The same default key is used for all instances of the platform.
NOTE: A valid base64 value must be configured for the platform, or the platform fails to start.
If preferred, you can generate your own key value, which is unique to your instance of the platform.
Steps: - Log in at the command line to the edge node.
- To generate a new refresh token encryption key, execute the following command:
```
cd /opt/trifacta/services/secure-token-service/ && java -cp server/build/install/secure-token-service/secure-token-service.jar:server/build/install/secure-token-service/lib/* com.trifacta.services.secure_token_service.tools.RefreshTokenEncryptionKeyGeneratorTool
```
- The refresh token encryption key is printed to the screen. Copy this value to the clipboard.
- Paste the value into the following property:
```
"com.trifacta.services.secure_token_service.refresh_token_encryption_key": "<generated_key>",
```
- Save your changes.
The cluster instance of Spark is used to execute larger jobs across the nodes of the cluster. Spark processes run multiple executors per job. Each executor must run within a YARN container, so resource requests must fit within YARN's container limits. Like YARN containers, multiple executors can run on a single node. More executors provide additional computational power and decreased runtime. Spark's dynamic allocation adjusts the number of executors to launch based on the following:
- job size
- job complexity
- available resources
The per-executor resource request sizes can be specified by setting the following properties in the spark.props section.
NOTE: In the platform configuration, all values in the spark.props section must be quoted values.
| Property | Description |
|---|---|
| spark.executor.memory | Amount of memory to use per executor process (in a specified unit) |
| spark.executor.cores | Number of cores to use on each executor. Limit to 5 cores per executor for best performance. |
A single special process (the application driver) also runs in a container. Its resources are specified in the spark.props section:

| Property | Description |
|---|---|
| spark.driver.memory | Amount of memory to use for the driver process (in a specified unit) |
| spark.driver.cores | Number of cores to use for the driver process |
Optimizing "Small" Joins (Advanced) Broadcast, or map-side, joins materialize one side of the join and send it to all executors to be stored in memory. This technique can significantly accelerate joins by skipping the sort and shuffle phases during a "reduce" operation. However, there is also a cost in communicating the table to all executors. Therefore, only "small" tables should be considered for broadcast join. The definition of "small" is set by the spark.sql.autoBroadcastJoinThreshold parameter which can be added to the spark.props section of . By default, Spark sets this to 10485760 (10MB). Info |
---|
NOTE: You should set this parameter between 20 and 100MB. It should not exceed 200MB. |
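Because the threshold is specified in bytes, the following shell sketch shows one way to convert a megabyte target into the value to place in spark.props. The 50 MB figure is an arbitrary example within the recommended range, not a prescribed setting.

```bash
# Convert a 50 MB broadcast-join threshold to bytes for
# spark.sql.autoBroadcastJoinThreshold.
echo $((50 * 1024 * 1024))   # 52428800
```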
Example Spark configuration parameters for given YARN configurations. The YARN property settings are provided by the cluster.

| Property | Example 1 | Example 2 | Example 3 | Example 4 |
|---|---|---|---|---|
| YARN NodeManager node (v)CPUs | 4 | 8 | 16 | 40 |
| YARN NodeManager node memory (GB) | 16 | 32 | 64 | 160 |
| yarn.nodemanager.resource.memory-mb | 12288 | 24576 | 57344 | 147456 |
| yarn.nodemanager.resource.cpu-vcores | 3 | 6 | 13 | 32 |
| yarn.scheduler.maximum-allocation-mb | 12288 | 24576 | 57344 | 147456 |
| yarn.scheduler.maximum-allocation-vcores | 3 | 6 | 13 | 32 |
| spark.executor.memory | 6GB | 6GB | 16GB | 20GB |
| spark.executor.cores | 2 | 2 | 4 | 5 |
| spark.driver.memory | 4GB | 4GB | 4GB | 4GB |
| spark.driver.cores | 1 | 1 | 1 | 1 |
| spark.sql.autoBroadcastJoinThreshold | 20971520 | 20971520 | 52428800 | 104857600 |
If you did not create an HDI cluster as part of this install process, you must perform additional configuration to integrate the platform with your cluster. See the links under Documentation below.
Exploring and Wrangling Data in Azure
NOTE: If you are integrating the platform with an existing cluster, these steps do not work. Additional configuration is required. See the Documentation section below.
Basic steps: - When data is imported to the platform, a reference to it is stored by the platform as an imported dataset. The source data is not modified.
- In the application, you modify the recipe associated with a dataset to transform the imported data.
- When the recipe is ready, you define and run a job, which executes the recipe steps across the entire dataset.
- The source of the dataset is untouched, and the results are written to the specified location in the specified format.
Steps:
NOTE: Any user with a valid user account can import data from a local file.
- Log in.
- In the menu bar, click Datasets. Click Import Data.
To add a dataset: - Select the connection where your source is located:
- WASB (Blob Storage)
- ADL (Azure Data Lake Store)
- Hive
- Navigate to the file or files for your source.
- To add the dataset, click the Plus icon next to its name.
To begin working with a dataset, you must first add it to a flow, which is a container for datasets. Click the Add Dataset to a Flow checkbox and enter a name for the new flow.
Tip: If you have selected a single file, you can begin wrangling it immediately. Click Import and Wrangle. The flow is created for you, and your dataset is added to it.
- Click Import & Add to Flow.
- After the flow has been created, the flow is displayed. Select the dataset, which is on the left side of the screen.
- Click Add New Recipe. Click Edit Recipe.
- The dataset is opened in the Transformer page, where you can begin building your recipe steps.