ADLS Gen2 Access
Deployments of the Designer Cloud Powered by Trifacta platform on Microsoft Azure can integrate with the next generation of Azure Data Lake Store (ADLS Gen2).
Microsoft Azure Data Lake Store Gen2 (ADLS Gen2) combines the power of a high-performance file system with massive scale and economy. Azure Data Lake Storage Gen2 extends Azure Blob Storage capabilities and is optimized for analytics workloads.
For more information, see https://azure.microsoft.com/en-us/services/storage/data-lake-storage/.
Supported Environments:
Operation | Designer Cloud Powered by Trifacta Enterprise Edition | Amazon | Microsoft Azure |
---|---|---|---|
Read | Not supported | Not supported | Supported |
Write | Not supported | Not supported | Supported (only if ABFSS is base storage layer) |
Limitations
A single public connection to ADLS Gen2 is supported.
Read-only access
If the base storage layer has been set to WASB, you can follow these instructions to set up read-only access to ADLS Gen2.
Note
To enable read-only access to ADLS Gen2, do not set the base storage layer to abfss.
Prerequisites
General
The Designer Cloud Powered by Trifacta platform has already been installed and integrated with an Azure Databricks cluster. See Configure for Azure Databricks.
For each combination of blob host and container, a separate Azure Key Vault Store entry must be created. For more information, please contact your Azure admin.
When running against ADLS Gen2, the product requires that you create a filesystem object in your ADLS Gen2 storage account.
ABFSS must be set as the base storage layer for the Designer Cloud Powered by Trifacta platform instance. See Set Base Storage Layer.
Create a registered application
Before you integrate with ADLS Gen2, you must register the Designer Cloud Powered by Trifacta platform as an application in Azure. See Configure for Azure.
Azure properties
The following properties should already be specified in the Admin Settings page. Please verify that the following have been set:
azure.applicationId
azure.secret
azure.directoryId
The above properties are needed for this configuration.
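The verification above can be sketched as a small check. This is a minimal illustration only: it assumes the settings have been loaded into a flat Python dict keyed by the dotted property names, which may not match how your deployment stores them, and the helper name `missing_azure_properties` and the placeholder values are hypothetical.

```python
# Property names taken from the list above.
REQUIRED_AZURE_KEYS = ("azure.applicationId", "azure.secret", "azure.directoryId")


def missing_azure_properties(settings: dict) -> list:
    """Return the required keys that are absent or empty in a flat settings dict."""
    return [key for key in REQUIRED_AZURE_KEYS if not settings.get(key)]


# Hypothetical settings: the secret is missing and the directory ID is empty.
example = {
    "azure.applicationId": "00000000-0000-0000-0000-000000000000",  # placeholder value
    "azure.directoryId": "",
}

print(missing_azure_properties(example))
# prints ['azure.secret', 'azure.directoryId']
```

An empty list means all three properties have a value. The check does not confirm that the values are correct in Azure.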
Tip
ADLS Gen2 also works if you are using Azure Managed Identity.
Registered application role
Note
The Storage Blob Data Contributor role or its equivalent roles must be assigned in the ADLS Gen2 storage account.
For more information, see Configure for Azure.
Key Vault Setup
An Azure Key Vault has already been set up and configured for use by the Designer Cloud Powered by Trifacta platform. Properties must be specified in the platform, if they have not been configured already.
For more information on configuration for Azure key vault, see Configure for Azure.
Configure the Designer Cloud Powered by Trifacta platform
Define base storage layer
Per earlier configuration:
webapp.storageProtocol must be set to abfss.
Note
Base storage layer must be configured when the platform is first installed and cannot be modified later.
Note
To enable read-only access to ADLS Gen2, do not set the base storage layer to abfss.
hdfs.protocolOverride is ignored.
Review Java VFS Service
Use of ADLS Gen2 requires the Java VFS service in the Designer Cloud Powered by Trifacta platform.
Note
This service is enabled by default.
For more information on configuring this service, see Configure Java VFS Service.
Configure file storage protocols and locations
The Designer Cloud Powered by Trifacta platform must be provided with the list of protocols and locations for accessing ADLS Gen2 blob storage.
Steps:
You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.
Locate the following parameters and set their values according to the table below:
"fileStorage.whitelist": ["abfss"],
"fileStorage.defaultBaseUris": ["abfss://filesystem@storageaccount.dfs.core.windows.net/"],
Parameter | Description |
---|---|
fileStorage.whitelist | A comma-separated list of protocols that are permitted to read and write with ADLS Gen2 storage. Note: The protocol identifier "abfss" must be included in this list. |
fileStorage.defaultBaseUris | For each supported protocol, this parameter must contain a top-level path to the location where Designer Cloud Powered by Trifacta platform files can be stored. These files include uploads, samples, and temporary storage used during job execution. Note: A separate base URI is required for each supported protocol. You may only have one base URI for each protocol. |
Save your changes and restart the platform.
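The rules for these two parameters (the abfss protocol must be listed, and each protocol needs exactly one base URI) can be illustrated with a short check. This is a sketch under assumptions, not part of the platform: the function name `check_file_storage` is hypothetical, and it takes the two parameter values as plain Python lists.

```python
from urllib.parse import urlsplit


def check_file_storage(whitelist, default_base_uris):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    if "abfss" not in whitelist:
        problems.append('"abfss" is missing from fileStorage.whitelist')

    seen = set()
    for uri in default_base_uris:
        scheme = urlsplit(uri).scheme  # the protocol part, for example "abfss"
        if scheme not in whitelist:
            problems.append(f"base URI uses a protocol not in the whitelist: {uri}")
        if scheme in seen:
            problems.append(f"more than one base URI for protocol: {scheme}")
        seen.add(scheme)

    # Every permitted protocol needs its own base URI.
    for protocol in whitelist:
        if protocol not in seen:
            problems.append(f"no base URI for protocol: {protocol}")
    return problems


# Values copied from the configuration example above.
print(check_file_storage(
    ["abfss"],
    ["abfss://filesystem@storageaccount.dfs.core.windows.net/"],
))
# prints []
```

The check only inspects the shape of the configuration. It does not contact the storage account or confirm that the filesystem exists.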
Configure access mode
Mode | Description |
---|---|
System | All users authenticate to ADLS using a single system key/secret combination. This combination is specified in the following parameters, which you should have already defined: azure.applicationId, azure.secret, azure.directoryId. These properties define the registered application in Azure Active Directory. System authentication mode uses the registered application identifier as the service principal for authentication to ADLS. All users have the same permissions in ADLS. For more information on these settings, see Configure for Azure. |
User | In user mode, per-user access is governed by Azure AD SSO. A set of tokens is acquired during SSO login for the user and is stored in the Azure Key Vault against the user's masked identifier. Additional configuration is required. See below. |
System mode access
When access to ADLS Gen2 is requested, the platform uses the combination of Azure directory ID, Azure application ID, and Azure secret to complete access.
Steps:
Complete the following steps to specify the ADLS access mode.
You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.
Verify that the following parameter is set to system:
"azure.adlsgen2.mode": "system",
Save your changes.
User mode access
In user mode, a set of tokens is acquired during SSO login for the user and is stored in the Azure Key Vault against the user's masked identifier.
Prerequisites:
Designer Cloud Powered by Trifacta platform must be integrated with a Databricks 10.x cluster. For more information, see Configure for Azure Databricks.
User mode access to ADLS requires Single Sign On (SSO) to be enabled for integration with Azure Active Directory. For more information, see Configure SSO for Azure AD.
Steps:
Complete the following steps to specify the ADLS access mode.
You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.
Set the following parameter to user:
"azure.adlsgen2.mode": "user",
Save your changes.
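As a rough illustration of the two access modes, the sketch below reads the mode from a settings dict and summarizes what that mode implies, based on the descriptions in the access mode table. The function name `describe_adls_gen2_mode` and the flat-dict layout are assumptions made for this example.

```python
# The two values documented for azure.adlsgen2.mode.
VALID_MODES = ("system", "user")


def describe_adls_gen2_mode(settings: dict) -> str:
    """Summarize the configured ADLS Gen2 access mode, or raise on an unknown value."""
    mode = settings.get("azure.adlsgen2.mode")
    if mode not in VALID_MODES:
        raise ValueError(f"azure.adlsgen2.mode must be one of {VALID_MODES}, got {mode!r}")
    if mode == "system":
        return "system: all users share the registered application's credentials"
    return "user: per-user tokens from Azure AD SSO, stored in the Azure Key Vault"


print(describe_adls_gen2_mode({"azure.adlsgen2.mode": "user"}))
# prints user: per-user tokens from Azure AD SSO, stored in the Azure Key Vault
```

Rejecting unknown values early is a design choice for the sketch. How the platform itself handles an unexpected value is not stated in this document.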
Testing
Restart services. See Start and Stop the Platform.
After the configuration has been specified, an ADLS Gen2 connection appears in the Import Data page. Select it to begin navigating for data sources.
Note
If you have multiple ADLS Gen2 file systems or storage accounts, you can access the secondary ones through the ADLS Gen2 browser. Edit the URL path in the browser and paste in the URI for other locations.
Try running a simple job from the Trifacta Application. For more information, see Verify Operations.
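When you paste a URI for another file system or storage account into the ADLS Gen2 browser, it can help to see how the URI breaks down. The sketch below assumes the same layout as the base URI in the configuration example (abfss://filesystem@storageaccount.dfs.core.windows.net/). The helper name `split_abfss_uri` and the sample names `otherfs` and `otheraccount` are hypothetical.

```python
from urllib.parse import urlsplit


def split_abfss_uri(uri: str) -> dict:
    """Split an abfss URI into its filesystem, storage account host, and path."""
    parts = urlsplit(uri)
    if parts.scheme != "abfss":
        raise ValueError(f"not an abfss URI: {uri}")
    return {
        "filesystem": parts.username,  # the part before "@"
        "host": parts.hostname,        # storage account endpoint
        "path": parts.path,            # location inside the filesystem
    }


print(split_abfss_uri("abfss://otherfs@otheraccount.dfs.core.windows.net/raw/sales.csv"))
# prints {'filesystem': 'otherfs', 'host': 'otheraccount.dfs.core.windows.net', 'path': '/raw/sales.csv'}
```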
Troubleshooting
Unsupported curveId: 29 error when retrieving Databricks token
This issue occurs when the Designer Cloud Powered by Trifacta platform sends a known set of elliptic curve algorithms to Microsoft during the SSL handshake, but the Microsoft server negotiates and uses an unsupported curve algorithm.
This issue applies only when SSL is enabled when accessing the base storage layer.
This issue applies to all Azure-based base storage layers supported by the Designer Cloud Powered by Trifacta platform.
A similar issue is described here: https://bugs.openjdk.java.net/browse/JDK-8171279
Solution:
Microsoft should fix the problem.
Within the Designer Cloud Powered by Trifacta platform, you can apply the following workaround:
Note
This solution disables the use of the listed algorithms for all Java services installed on the Trifacta node and is satisfactory for all Java services of the Designer Cloud Powered by Trifacta platform.
Log in to the Trifacta node as an administrator.
Edit the following file:
$JAVA_HOME/jre/lib/security/java.security
Locate the following parameter:
jdk.tls.disabledAlgorithms
To the above parameter, add the following algorithm references to disable them:
TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA, TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA
Save your changes and restart the platform.
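The edit to jdk.tls.disabledAlgorithms amounts to appending two entries to a comma-separated list without duplicating anything already there. A minimal sketch of that merge is shown below. It works on the parameter value as a single string. The starting value `SSLv3, RC4` is a made-up example, and in a real java.security file the value may span several lines with backslash continuations, which this sketch does not parse.

```python
def add_disabled_algorithms(current_value: str, extra: list) -> str:
    """Append entries from `extra` that are not already in a comma-separated value."""
    entries = [item.strip() for item in current_value.split(",") if item.strip()]
    for name in extra:
        if name not in entries:
            entries.append(name)
    return ", ".join(entries)


# The two algorithm references listed in the workaround above.
CIPHERS = [
    "TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA",
    "TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA",
]

# Hypothetical existing value; check your own java.security file for the real one.
print(add_disabled_algorithms("SSLv3, RC4", CIPHERS))
# prints SSLv3, RC4, TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA, TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA
```

Running the function twice on its own output leaves the value unchanged, so the workaround is safe to reapply.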
Using ADLS Gen2 Connection
Uses of ADLS
The Designer Cloud Powered by Trifacta platform can use ADLS for the following reading and writing tasks:
Creating Datasets from ADLS Files: You can read in from a data source stored in ADLS. A source may be a single ADLS file or a folder of identically structured files. See Reading from Sources in ADLS below.
Reading Datasets: When creating a dataset, you can pull your data from another dataset defined in ADLS. See Creating Datasets below.
Writing Job Results: After a job has been executed, you can write the results back to ADLS. See Writing Job Results below.
In the Trifacta Application, ADLS is accessed through the ADLS browser.
Note
When the Designer Cloud Powered by Trifacta platform executes a job on a dataset, the source data is untouched. Results are written to a new location, so that no data is disturbed by the process.
Before you begin using ADLS
Read/Write Access: Your cluster administrator must configure read/write permissions to locations in ADLS. Please see the ADLS documentation.
Warning
Avoid using /trifacta/uploads for reading and writing data. This directory is used by the Trifacta Application.
Your cluster administrator should provide a place or mechanism for raw data to be uploaded to your datastore.
Your cluster administrator should provide a writeable home output directory for you, which you can review. See Storage Config Page.
Secure access
Depending on the security features you've enabled, the technical methods by which Alteryx users access ADLS may vary. For more information, see ADLS Gen1 Access.
Storing data in ADLS
Your cluster administrator should provide raw data or locations and access for storing raw data within ADLS. All Alteryx users should have a clear understanding of the folder structure within ADLS where each individual can read from and write their job results.
Users should know where shared data is located and where personal data can be saved without interfering with or confusing other users.
Note
The Designer Cloud Powered by Trifacta platform does not modify source data in ADLS. Sources stored in ADLS are read without modification from their source locations, and sources that are uploaded to the platform are stored in /trifacta/uploads.
Reading from sources in ADLS
You can create a dataset from one or more files stored in ADLS.
Folder selection:
When you select a folder in ADLS to create your dataset, you select all files in the folder to be included. Notes:
This option selects all files in all sub-folders. If your sub-folders contain separate datasets, you should be more specific in your folder selection.
All files used in a single dataset must be of the same format and have the same structure. For example, you cannot mix and match CSV and JSON files if you are reading from a single directory.
When a folder is selected from ADLS, the following file types are ignored:
*_SUCCESS and *_FAILED files, which may be present if the folder has been populated by the running environment.
If you have stored files in ADLS that begin with an underscore (_), these files cannot be read during batch transformation and are ignored. Please rename these files through ADLS so that they do not begin with an underscore.
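To preview which files in a folder would be skipped under the rules above, you could apply the same name tests yourself. The sketch below is an approximation written for illustration, not the platform's own logic. The function names and sample paths are hypothetical, and it assumes forward-slash paths.

```python
from posixpath import basename


def is_ignored(path: str) -> bool:
    """True if the file name matches *_SUCCESS, *_FAILED, or starts with an underscore."""
    name = basename(path)
    return name.startswith("_") or name.endswith(("_SUCCESS", "_FAILED"))


def readable_files(paths):
    """Keep only the paths that the folder-selection rules would not ignore."""
    return [p for p in paths if not is_ignored(p)]


# Hypothetical folder listing.
listing = [
    "sales/part-0001.csv",
    "sales/_SUCCESS",
    "sales/_tmp.csv",
    "sales/job_FAILED",
]

print(readable_files(listing))
# prints ['sales/part-0001.csv']
```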
Parameters:
You can parameterize parts of your paths to source files for your imported dataset. Parameters can be applied to:
user info value
host value
path values
For more information:
Creating datasets
When creating a dataset, you can choose to read data in from a source stored in ADLS or from a local file.
ADLS sources are not moved or changed.
Local file sources are uploaded to /trifacta/uploads where they remain and are not changed.
Data may be individual files or all of the files in a folder. For more information, see Reading from Sources in ADLS above.
In the Import Data page, click the ADLS tab. See Import Data Page.
Writing job results
When your job results are generated, they can be stored back in ADLS for you at the location defined for your user account.
The ADLS location is available through the Publishing dialog in the Output Destinations tab of the Job Details page. See Publishing Dialog.
Each set of job results must be stored in a separate folder within your ADLS output home directory.
For more information on your output home directory, see Storage Config Page.
Warning
If your deployment is using ADLS, do not use the /trifacta/uploads directory. This directory is used for storing uploads and metadata, which may be used by multiple users. Manipulating files outside of the Trifacta Application can destroy other users' data. Please use the tools provided through the interface for managing uploads from ADLS.
Users can specify a default output home directory and, during job execution, an output directory for the current job.
Access to results:
Depending on how the platform is integrated with ADLS, other users may or may not be able to access your job results.
If user mode is enabled, results are written to ADLS through the ADLS account configured for your use. Depending on the permissions of your ADLS account, you may be the only person who can access these results.
If user mode is not enabled, then each Alteryx user writes results to ADLS using a shared account. Depending on the permissions of that account, your results may be visible to all platform users.
Creating a new dataset from results
As part of writing job results, you can choose to create a new dataset, so that you can chain together data wrangling tasks.
Note
When you create a new dataset as part of your job results, the file or files are written to the designated output location for your user account. Depending on how your cluster permissions are configured, this location may not be accessible to other users.
Reference
Supported Versions: n/a
Create New Connection: n/a
Note
A single public connection to ADLS Gen2 is supported.