Microsoft Azure deployments can integrate with the next generation of Azure Data Lake Store (ADLS Gen2).
- Microsoft Azure Data Lake Store Gen2 (ADLS Gen2) combines the power of a high-performance file system with massive scale and economy. Azure Data Lake Storage Gen2 extends Azure Blob Storage capabilities and is optimized for analytics workloads.
- For more information, see https://azure.microsoft.com/en-us/services/storage/data-lake-storage/.
Limitations of ADLS Gen2 Integration
- This version requires a specific CDH 6.1 bundle JAR. Details are described later.
Read-only access
If the base storage layer has been set to WASB, you can follow these instructions to set up read-only access to ADLS Gen2.
Info |
---|
NOTE: To enable read-only access to ADLS Gen2, do not set the base storage layer to abfss. |
Pre-requisites
General
- The platform has already been installed and integrated with an Azure Databricks cluster. See Configure for Azure Databricks.
- For each combination of blob host and container, a separate Azure Key Vault Store entry must be created. For more information, please contact your Azure admin.
- When running against ADLS Gen2, the product requires that you create a filesystem object in your ADLS Gen2 storage account.
- ABFSS must be set as the base storage layer for the platform instance. See Set Base Storage Layer.
Create a registered application
Before you integrate with Azure ADLS Gen2, you must create the platform as a registered application.
Azure properties
The following properties should already be specified in the Admin Settings page. Please verify that the following have been set:
azure.applicationId
azure.secret
azure.directoryId
The above properties are needed for this configuration.
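The verification above can also be scripted. The following is a minimal sketch in Python, assuming the settings have been exported into a flat dictionary keyed by the dotted property names. The function name and the sample values are hypothetical and are not part of the platform.

```python
# Hypothetical helper: not part of the platform. Assumes the settings are
# available as a flat dict keyed by the dotted property names.
REQUIRED_AZURE_PROPERTIES = (
    "azure.applicationId",
    "azure.secret",
    "azure.directoryId",
)


def missing_azure_properties(settings):
    """Return the required Azure properties that are absent or blank."""
    missing = []
    for name in REQUIRED_AZURE_PROPERTIES:
        value = settings.get(name)
        if value is None or not str(value).strip():
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Placeholder values for illustration only.
    sample = {
        "azure.applicationId": "<application-id>",
        "azure.secret": "",
        "azure.directoryId": "<directory-id>",
    }
    print(missing_azure_properties(sample))
```

An empty result means that all three properties are present. Any names returned should be set before you continue.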
Tip |
---|
Tip: ADLS Gen2 also works if you are using Azure Managed Identity. |
Registered application role
Info |
---|
NOTE: The Storage Blob Data Contributor role or its equivalent roles must be assigned in the ADLS Gen2 storage account. |
For more information, see Configure for Azure.
Key Vault Setup
An Azure Key Vault has already been set up and configured for use by the platform.
For more information on configuration for Azure key vault, see Configure for Azure.
Configure the platform
Specify CDH 6.1 bundle JARs
Info |
---|
NOTE: For this release, use of the CDH 6.1 bundle JARs is required for ADLS Gen2 integration. Please specify all of these properties, even if you are not integrating with Hive. |
Please complete the following steps to review and modify if necessary the bundle JAR properties and dependency locations.
Steps:
- In the platform configuration, locate the following parameters and set them to the values listed below:
  NOTE: If you have integrated with Databricks Tables, do not overwrite the value for data-service.hiveJdbcJar with the following value, even if it's set to a different distribution JAR file.
      "hadoopBundleJar": "hadoop-deps/cdh-6.2/build/libs/cdh-6.2-bundle.jar",
      "spark-job-service.hiveDependenciesLocation": "%(topOfTree)s/hadoop-deps/cdh-6.2/build/libs",
      "data-service.hiveJdbcJar": "hadoop-deps/cdh-6.2/build/libs/cdh-6.2-hive-jdbc.jar",
- Save your changes and restart the platform.
Define base storage layer
Per earlier configuration:
- webapp.storageProtocol must be set to abfss.
  NOTE: Base storage layer must be configured when the platform is first installed and cannot be modified later.
  NOTE: To enable read-only access to ADLS Gen2, do not set the base storage layer to abfss.
- hdfs.protocolOverride is ignored.
Review Java VFS Service
Use of ADLS Gen2 requires the Java VFS service in the platform.
Info |
---|
NOTE: This service is enabled by default. |
For more information on configuring this service, see Configure Java VFS Service.
Configure file storage protocols and locations
The platform must be configured with the file storage protocols and base locations that it is permitted to use.
Steps:
- In the platform configuration, locate the following parameters and set their values according to the table below:
      "fileStorage.whitelist": ["abfss"],
      "fileStorage.defaultBaseUris": ["abfss://filesystem@storageaccount.dfs.core.windows.net/"],

Parameter | Description |
---|---|
fileStorage.whitelist | A comma-separated list of protocols that are permitted to read and write with ADLS Gen2 storage. NOTE: The protocol identifier "abfss" must be included in this list. |
fileStorage.defaultBaseUris | For each supported protocol, this parameter must contain a top-level path to the location where platform files can be stored. These files include uploads, samples, and temporary storage used during job execution. NOTE: A separate base URI is required for each supported protocol. You may only have one base URI for each protocol. |
- Save your changes and restart the platform.
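The two rules above (the whitelist must include "abfss", and each supported protocol needs exactly one base URI) can be checked before you restart. The following Python sketch is illustrative only. It assumes that the two values are passed in as lists, and the function name check_file_storage is hypothetical rather than a platform API.

```python
from urllib.parse import urlsplit


def check_file_storage(whitelist, default_base_uris):
    """Return a list of problems found in the two fileStorage settings."""
    problems = []
    if "abfss" not in whitelist:
        problems.append('"abfss" is missing from fileStorage.whitelist')

    # Count how many base URIs were supplied for each protocol (URI scheme).
    uris_per_protocol = {}
    for uri in default_base_uris:
        protocol = urlsplit(uri).scheme
        uris_per_protocol[protocol] = uris_per_protocol.get(protocol, 0) + 1

    # Each permitted protocol needs exactly one base URI.
    for protocol in whitelist:
        count = uris_per_protocol.get(protocol, 0)
        if count != 1:
            problems.append(
                f"expected exactly one base URI for {protocol}, found {count}"
            )
    return problems


if __name__ == "__main__":
    # Values taken from the example configuration above.
    print(check_file_storage(
        ["abfss"],
        ["abfss://filesystem@storageaccount.dfs.core.windows.net/"],
    ))
```

The sketch compares URI schemes only. It does not confirm that the filesystem or storage account exists or is reachable.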
Configure access mode
Authentication to ADLS Gen2 storage is supported for system mode only.
Mode | Description |
---|---|
System | All users authenticate to ADLS using a single system key/secret combination. This combination is specified in the following parameters, which you should have already defined: azure.applicationId, azure.secret, azure.directoryId. These properties define the registered application in Azure Active Directory. System authentication mode uses the registered application identifier as the service principal for authentication to ADLS. All users have the same permissions in ADLS. For more information on these settings, see Configure for Azure. |
User | |
Steps:
Please complete the following steps to specify the ADLS access mode.
- In the platform configuration, verify that the following parameter is set to system:
      "azure.adlsgen2.mode": "system",
- Save your changes.
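Before restarting, it can help to confirm that the access mode and the base storage layer agree with the notes in this section. The following is a minimal Python sketch, assuming the settings are available as a flat dictionary keyed by the dotted property names. The function name and the read_only flag are hypothetical and exist only for illustration.

```python
def check_adls_gen2_access(settings, read_only=False):
    """Return a list of problems with the access mode and base storage layer.

    read_only=True models the read-only setup, where the base storage layer
    must not be abfss.
    """
    problems = []

    # Only system mode is supported for ADLS Gen2 authentication.
    mode = settings.get("azure.adlsgen2.mode")
    if mode != "system":
        problems.append(f"azure.adlsgen2.mode should be 'system', found {mode!r}")

    protocol = settings.get("webapp.storageProtocol")
    if read_only and protocol == "abfss":
        problems.append(
            "read-only access requires a base storage layer other than abfss"
        )
    if not read_only and protocol != "abfss":
        problems.append(
            f"webapp.storageProtocol should be 'abfss', found {protocol!r}"
        )
    return problems


if __name__ == "__main__":
    sample = {"azure.adlsgen2.mode": "system", "webapp.storageProtocol": "abfss"}
    print(check_adls_gen2_access(sample))
```

An empty list indicates that the two settings are consistent with the configuration described above.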
Testing
Restart services. See Start and Stop the Platform.
After the configuration has been specified, an ADLS Gen2 connection appears in the Import Data page. Select it to begin navigating for data sources.
Info |
---|
NOTE: If you have multiple ADLS Gen2 file systems or storage accounts, you can access the secondary ones through the ADLS Gen2 browser. Edit the URL path in the browser and paste in the URI for other locations. |
Try running a simple job from the application.
- Except as noted above, the ADLS Gen2 browser is identical to the ADLS one. See ADLS Gen2 Browser.
- Except as noted above, the basic usage of ADLS Gen2 is identical to the ADLS (Gen1) version. See Using ADLS.