...
- HDInsight 3.6
- HDInsight Hadoop, Spark and HBase cluster types
Documentation Scope
This document guides you through the process of installing the product and beginning to use it.
- If you are creating a new cluster as part of this process: You can begin running jobs. Instructions are provided in this document for testing job execution.
- If you are integrating the product with a pre-existing cluster: Additional configuration is required after you complete the installation process. Relevant topics are listed in the Documentation section at the end of this document.
...
Desktop Requirements
Sizing Guide
Use the following guidelines to select your instance size:
Pre-requisites
Before you install the platform, please verify that the following steps have been completed.
Register application and r/w access to Data Lake Store
You must have an Azure Active Directory registered application with appropriate permissions, such as read/write access to the Data Lake Store resource. This can be either the same service principal used for the HDInsight cluster or a new one created specifically for the platform.
These values are applied during the install process.
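If you prefer to script this prerequisite, the following Azure CLI commands are a minimal sketch of one way to register an application and grant it read/write access to a Data Lake Store account. The application name, store name, and object ID are placeholders; your environment may require different role assignments or ACLs.
Code Block
# Register an Azure AD application / service principal (name is a placeholder)
az ad sp create-for-rbac --name trifacta-adls-app
# Grant that principal read/write/execute on the Data Lake Store root
az dls fs access set-entry --account mydatalakestore --path / --acl-spec "user:<object-id>:rwx"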
Install Process
Methods
You can install from the Azure Marketplace using one of the following methods:
Depending on your selection, please follow the steps listed in one of the following sections.
Install Method - New cluster
Please use the following steps if you are creating a new HDI cluster and adding the application to it as part of the cluster creation process.
Steps:
Install Method - Add application to a cluster
Please use the following steps if you are adding the application to a pre-existing HDI cluster.
Steps:
Install Method - Build custom ARM template
Please use the following steps if you are creating a custom application template for later deployment. This method provides more flexible configuration options and can be reused for future deployments.
Steps:
Login to the application
Steps:
Change admin password
Steps:
Post-install configuration
Base parameter settings
If you are integrating with HDI and did not install via a custom ARM template, the following settings must be specified within the application.
Steps:
Apply license file
When the application is first created, the license is valid for 24 hours. Before the license expires, you must apply the license key file to the platform.
Steps:
Review refresh token encryption key
By default, the platform is installed with a default refresh token encryption key. If preferred, you can generate your own key value, which is unique to your instance of the platform.
Steps:
Configure Spark Execution (Advanced)
The cluster instance of Spark is used to execute larger jobs across the nodes of the cluster. Spark processes run multiple executors per job. Each executor must run within a YARN container, so resource requests must fit within YARN's container limits. Like YARN containers, multiple executors can run on a single node. More executors provide additional computational power and decreased runtime. Spark's dynamic allocation adjusts the number of executors to launch based on the following:
The per-executor resource request sizes can be specified by setting the following properties in the platform configuration:
A single special process (the application driver) also runs in a container. Its resources are specified in the platform configuration as well.
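As a minimal sketch only, the following standard Spark properties illustrate per-executor and driver resource requests sized to fit within an 8 GB YARN container; the values are placeholders rather than recommendations, and how they are surfaced in the platform configuration depends on your deployment.
Code Block
# Per-executor request (memory plus overhead must fit the YARN container limit)
spark.executor.memory = 6g
spark.executor.cores = 2
# Application driver, which also runs in a container
spark.driver.memory = 4g
spark.driver.cores = 1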
Optimizing "Small" Joins (Advanced) Broadcast, or map-side, joins materialize one side of the join and send it to all executors to be stored in memory. This technique can significantly accelerate joins by skipping the sort and shuffle phases during a "reduce" operation. However, there is also a cost in communicating the table to all executors. Therefore, only "small" tables should be considered for broadcast join. The definition of "small" is set by the
10485760 (10MB).
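For example, the threshold can be raised to broadcast larger tables, or set to -1 to disable broadcast joins entirely; the values below are illustrative only.
Code Block
# Broadcast tables up to roughly 50 MB (value is in bytes)
spark.sql.autoBroadcastJoinThreshold = 52428800
# Or disable broadcast joins entirely
spark.sql.autoBroadcastJoinThreshold = -1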
Example Spark configuration parameters for given YARN configurations: Property settings in bold are provided by the cluster.
...
Basic steps:
- When data is imported to the platform, a reference to it is stored by the platform as an imported dataset. The source data is not modified.
- In the application, you modify the recipe associated with a dataset to transform the imported data.
- When the recipe is ready, you define and run a job, which executes the recipe steps across the entire dataset.
- The source of the dataset is untouched, and the results are written to the specified location in the specified format.
...
- Log in.
- In the menu bar, click Datasets. Click Import Data.
To add a dataset:
- Select the connection where your source is located:
- WASB (Blob Storage)
- ADL (Azure Data Lake Store)
- Hive
- Navigate to the file or files for your source.
- To add the dataset, click the Plus icon next to its name.
To begin working with a dataset, you must first add it into a flow, which is a container for datasets. Click the Add Dataset to a Flow checkbox and enter the name for a new flow.
Tip: If you have selected a single file, you can begin wrangling it immediately. Click Import and Wrangle. The flow is created for you, and your dataset is added to it.
- Click Import & Add to Flow.
- After the flow has been created, it is displayed. Select the dataset, which is on the left side of the screen.
- Click Add New Recipe. Click Edit Recipe.
- The dataset is opened in the Transformer page, where you can begin building your recipe steps.
...
Upgrade
Please complete the instructions in this section if you are upgrading from a previous version of the product.
Info
NOTE: These instructions apply only to the product deployed from the Azure Marketplace.
Overview of the upgrade process
Upgrading your instance of the product involves the following basic steps:
- Back up the databases and configuration for your existing platform instance.
- Download the latest version of the product.
- Uninstall the existing version of the product. Install the version you downloaded in the previous step.
- Start up the platform. This step automatically performs the DB migrations from the older version to the latest version and upgrades the configurations.
- Within the application, perform required configuration updates in the upgraded instance.
- Apply the custom migrations to migrate paths in the DBs.
Instructions for these steps are provided below.
Acquire resources
Before you begin, please acquire the following publicly available resource:
Back up data from platform instance
Before you begin, you should back up your current instance.
- SSH to your current Marketplace instance.
Stop the platform on your current Marketplace instance:
Code Block sudo service trifacta stop
- Update the backup script with a more current version.
If you have not done so already, download the 5.0.0 backup script from the following location:
Code Block https://raw.githubusercontent.com/trifacta/trifacta-utils/release/5.0/azure/trifacta-backup-config-and-db.sh
Example command to download the script:
Code Block curl --output trifacta-backup-config-and-db.sh https://raw.githubusercontent.com/trifacta/trifacta-utils/release/5.0/azure/trifacta-backup-config-and-db.sh
Copy the downloaded script to the following location, overwriting the existing version:
Code Block /opt/trifacta/bin/setup-utils/trifacta-backup-config-and-db.sh
Make sure this script is executable:
Code Block sudo chmod 775 /opt/trifacta/bin/setup-utils/trifacta-backup-config-and-db.sh
Run the backup script:
Code Block sudo /opt/trifacta/bin/setup-utils/trifacta-backup-config-and-db.sh
When the script is complete, the output identifies the location of the backup. Example:
Code Block /opt/trifacta-backups/trifacta-backup-4.2.1+126.20171217124021.a8ed455-20180514213601.tgz
Store the backup in a safe location.
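For example, one option is to copy the backup off the node to an Azure storage container with the Azure CLI; the storage account, container, and file names below are placeholders, and the command assumes you have appropriate storage credentials.
Code Block
az storage blob upload --account-name mystorageaccount --container-name backups --name trifacta-backup.tgz --file /opt/trifacta-backups/<backup-file>.tgz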
Download latest version of the product
Download the latest version of the product by using the command below:
Code Block
wget -O trifacta-server-5.0.0-110~xenial_amd64.deb 'https://trifactamarketplace.blob.core.windows.net/artifacts/trifacta-server-5.0.0-110~xenial_amd64.deb?sr=c&si=trifacta-deploy-public-read&sig=ksMPhDkLpJYPEXnRNp4vAdo6QQ9ulpP%2BM4Gsi/nea%2Bg%3D&sv=2016-05-31'
Uninstall older version of the product
To uninstall the older version of the product, execute the following command as the root user on the node:
Code Block apt-get remove trifacta
Install the latest version that was downloaded. Execute the following command as root:
Code Block dpkg -i <location of the 5.0 deb file>
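For example, if the package downloaded in the previous step was saved to /tmp, the command might look like the following (path and filename are illustrative):
Code Block
dpkg -i /tmp/trifacta-server-5.0.0-110~xenial_amd64.deb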
Start up latest version of the product
To migrate the DBs and upgrade configs from the older version of the product, start up the platform:
Code Block service trifacta start
Post-upgrade configuration
Create Key Vault and use true WASB protocol
Info
NOTE: This section applies only if you are upgrading from Release 4.2.x to Release 5.0 or later and were using WASB as your base storage layer.
Prior to Release 5.0, the platform accessed WASB indirectly through the WebWASB service as a workaround. In Release 5.0 and later, this workaround has been replaced by true WASB support. To migrate your installation, please complete the following sections.
Create Key Vault: To enable access to the WASB base storage layer, you must create a Key Vault to manage authentication via secure token. For more information, see Configure for Key Vault in Configure for Azure.
Set base storage layer to WASB: After the Key Vault has been created, you must set the base storage layer to wasb or wasbs. For more information, see Configure base storage layer in Configure for Azure.
Configure secure token service
Info
NOTE: This section applies only if you are upgrading from Release 4.2.x to Release 5.0 or later and were using WASB or plan to use SSO to access ADLS.
To manage access, you must enable and configure the secure token service. For more information, see Configure Secure Token Service in Configure for Azure.
Migrate the paths from WebWASB to WASB
After you have restarted the platform, you must migrate the path references stored in the DBs from WebWASB to WASB.
Steps:
If you have not done so already, acquire the DB migration script:
Code Block https://github.com/trifacta/azure-deploy/blob/release/5.0/bin/migrations/path-migrations-4_2-5_0.sql
Store this script in an executable location on the node.
Acquire the values for the following parameters from the platform configuration:
Code Block "webapp.db.port" "azure.wasb.defaultStore.blobHost" "azure.wasb.defaultStore.container"
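If your platform configuration is stored as JSON with nested keys (for example at /opt/trifacta/conf/trifacta-conf.json; both the path and the structure are assumptions), a quick way to read the three values is with jq:
Code Block
jq -r '.webapp.db.port, .azure.wasb.defaultStore.blobHost, .azure.wasb.defaultStore.container' /opt/trifacta/conf/trifacta-conf.json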
Update the blobHost and container parameters of the migration script with the values you acquired.
Code Block SET LOCAL myvars.blobHost = '<azure.wasb.defaultStore.blobHost value>'; SET LOCAL myvars.container = '<azure.wasb.defaultStore.container value>';
After updating the parameters, run the migration script:
Code Block psql -U trifacta -d trifacta -p <webapp.db.port> -a -f <location for the sql migration script>
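For example, if webapp.db.port is 5432 and the script was stored under /opt/trifacta/bin (both values are illustrative), the command would be:
Code Block
psql -U trifacta -d trifacta -p 5432 -a -f /opt/trifacta/bin/path-migrations-4_2-5_0.sql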
Verify
The upgrade is complete. To verify:
Steps:
Restart the platform:
Code Block sudo service trifacta start
- Run a simple job with profiling.
- Verify that the job has successfully completed.
- In the Jobs page, locate the job results. See Jobs Page.
- Click View Details next to the job to review the profile.
Documentation
You can access complete product documentation online and in PDF format. From within the product, select Help menu > Product Docs.
After you have accessed the documentation, the following topics are relevant to Azure deployments. Please review them in order.
Topic | Description
---|---
Supported Deployment Scenarios for Azure | Matrix of supported Azure components.
Configure for Azure | Top-level configuration topic on integrating the platform with Azure.
Configure for HDInsight | Review this section if you are integrating the platform with a pre-existing HDInsight cluster.
Enable ADLS Access | Configuration to enable access to ADLS.
Enable WASB Access | Configuration to enable access to WASB.
Configure SSO for Azure AD | How to integrate the platform with Azure AD single sign-on.
...