...
- This feature is supported for the Enterprise Edition of the product only.
- The platform must be deployed on an AWS EC2 instance that is joined to the same domain as the EMR cluster.
- The EMR cluster must be kerberized using the Cross-Realm Trust method. Additional information is below.
Create EMR Cluster
Use the following section to set up your EMR cluster for use with the platform.
Cluster options
In the Amazon EMR console, click Create Cluster. Click Go to advanced options. Complete the sections listed below.
For more information on setting up your EMR cluster, see http://docs.aws.amazon.com/cli/latest/reference/emr/create-cluster.html.
Advanced Options
In the Advanced Options screen, please select the following:
Hardware configuration
General Options
Security Options
Create cluster and acquire cluster ID
If you have performed all of the configuration, including the sections below, you can create the cluster.
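If you prefer to create the cluster from the command line, a sketch like the following can be used with the AWS CLI. The release label, instance settings, key pair, bucket, and region below are placeholder assumptions; replace them with the values you selected in the sections above. The command returns the cluster ID that you apply to the platform later.
Code Block
# Hypothetical example only: adjust release label, instance types and count,
# key name, log bucket, and region to match your environment.
aws emr create-cluster \
  --name "emr-for-platform" \
  --release-label emr-5.20.0 \
  --applications Name=Hadoop Name=Spark \
  --instance-type m4.xlarge \
  --instance-count 3 \
  --use-default-roles \
  --ec2-attributes KeyName=my-key-pair \
  --log-uri s3://my-log-bucket/emr-logs/ \
  --region us-east-1
# The command prints the new cluster ID, for example: {"ClusterId": "j-XXXXXXXXXXXXX"}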
Specify cluster roles
The following cluster roles and their permissions are required. For more information on the specifics of these policies, see EMR cluster policies.
Authentication
You can use one of two methods for authenticating the EMR cluster:
Role-based IAM authentication
You can leverage your IAM roles to provide role-based authentication to the S3 buckets.
Specify the custom credential provider JAR file
If you are not using IAM roles for access, you can manage access using either of the following:
In either scenario, you must use the custom credential provider JAR provided in the installation. This JAR file must be available to all nodes of the EMR cluster. After you have installed the platform and configured the S3 buckets, please complete the following steps to deploy this JAR file.
Steps:
When the EMR cluster is launched with the above custom bootstrap action, the cluster does one of the following:
For more information about AWSCredentialsProvider for EMRFS, please see:
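As an illustration only, a custom bootstrap action along the following lines could be used to copy the credential provider JAR to each node at cluster launch. The bucket, path, JAR name, and destination directory here are placeholder assumptions, not the actual installation artifacts; use the JAR and deployment instructions provided with your installation.
Code Block
#!/bin/bash
# Hypothetical bootstrap action sketch: copy the custom credential provider JAR
# from S3 onto this node. Replace the bucket, path, JAR name, and destination
# directory with the values for your installation.
set -e
sudo aws s3 cp s3://my-resource-bucket/libs/custom-credential-provider.jar /usr/share/aws/emr/emrfs/auxlib/
Register the script as a bootstrap action when launching the cluster (for example, with --bootstrap-actions Path=s3://my-resource-bucket/bootstrap/deploy-jar.sh) so that it runs on every node.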
Set up S3 Buckets
Bucket setup
You must set up S3 buckets for read and write access.
For more information, see Enable S3 Access.
Set up EMR resources buckets
On the EMR cluster, all users of the platform must have access to the following locations:
These locations are configured in platform configuration later in this section.
Access Policies
EC2 instance profile
EMR roles
The following policies should be assigned to the EMR roles listed below for read/write access:
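As a sketch of how such a policy might be applied with the AWS CLI, the following attaches an inline policy to a role. The role name EMR_EC2_DefaultRole and the policy file name are assumptions; use the roles you created for your cluster and the policy documents described in EMR cluster policies.
Code Block
# Hypothetical example: attach an inline read/write policy to an EMR role.
# Replace the role name and policy document with your own.
aws iam put-role-policy \
  --role-name EMR_EC2_DefaultRole \
  --policy-name platform-emr-s3-access \
  --policy-document file://emr-s3-access-policy.json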
General configuration for the platform
Please complete the following sections to configure the platform.
Change admin password
As soon as you have installed the software, you should log in to the application and change the admin password. The initial admin password is the instanceId for the EC2 instance. For more information, see Change Password.
Verify S3 as base storage layer
EMR integration requires the use of S3 as the base storage layer.
Set up S3 integration
To integrate with S3, additional configuration is required. See Enable S3 Access.
Configure EMR authentication mode
Authentication to AWS and to EMR supports the following basic modes:
The authentication mode for your access to EMR can be configured independently from the base authentication mode for AWS, with the following exception:
Authentication mode configuration matrix:

AWS mode (aws.mode) | EMR mode (aws.emr.authMode) = system | EMR mode (aws.emr.authMode) = user
---|---|---
system | AWS and EMR use a single key-secret combination. For the parameters to set, see Configure for AWS. | AWS access uses a single key-secret combination. EMR access is governed by per-user credentials, which can be provided from one of several different providers. For more information on configuring per-user access, see Configure for AWS.
user | Not supported | AWS and EMR use the same per-user credentials for access. Per-user credentials can be provided from one of several different providers. For more information on configuring per-user access, see Configure AWS Per-User Authentication.
In platform configuration, locate the following setting:
Code Block "aws.emr.authMode": "user",
Configure the mode to use to authenticate to the EMR cluster:
- system - In system mode, the specified AWS key and secret combination are used to authenticate to the EMR cluster. These credentials are used for all users.
- user - In user mode, user configuration is retrieved from the database.
NOTE: User mode for EMR authentication requires that aws.mode be set to user.
EMR per-user authentication for the platform
If you have enabled per-user authentication for EMR (aws.emr.authMode=user), you must set the following properties in platform configuration, based on the credential provider for your AWS per-user credentials.
EMR Authentication for the platform
Depending on the authentication method you used, you must set the following properties in platform configuration.

Authentication method | Properties and values
---|---
Use default credential provider for all platform access |
Use default credential provider for all platform access |
EC2 role-based IAM authentication for all platform access |
Configure the platform for EMR
NOTE: This section assumes that you are integrating with an EMR cluster that has not been kerberized. If you are integrating with a kerberized cluster, please skip to "Configure for EMR with Kerberos".
Enable EMR integration
After you have configured S3 to be the base storage layer, you must enable EMR integration.
Steps:
- In platform configuration, set the following value:
Code Block "webapp.runInEMR": true,
- Set the following value:
Code Block "webapp.runWithSparkSubmit": false,
- Verify the following property values:
Code Block "webapp.runInTrifactaServer": true, "webapp.runWithSparkSubmit": false, "webapp.runInDataflow": false,
Apply EMR cluster ID
The platform must be configured with the ID of the EMR cluster to which it connects.
Steps:
- In the Admin Settings page, under External Service Settings, enter your AWS EMR Cluster ID.
- Click the Save button below the textbox.
For more information, see Admin Settings Page.
Extract IP address of master node in private subnet
If you have deployed your EMR cluster on a private subnet that is accessible outside of AWS, you must enable this property, which permits the extraction of the IP address of the master cluster node through DNS.
NOTE: This feature must be enabled if your EMR cluster is accessible outside of AWS on a private network.
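To illustrate what this lookup involves, the following sketch shows how you might manually resolve the master node's private IP address from its DNS name. The cluster ID and DNS name are placeholders, and this is only a verification aid, not the platform's internal mechanism.
Code Block
# Hypothetical verification sketch: look up the master node DNS name for a
# cluster, then resolve it to its IP address.
MASTER_DNS=$(aws emr describe-cluster --cluster-id j-XXXXXXXXXXXXX \
  --query 'Cluster.MasterPublicDnsName' --output text)
dig +short "$MASTER_DNS"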
Steps:
- In platform configuration, locate the following property and set it to true:
Code Block "emr.extractIPFromDNS": true,
- Save your changes and restart the platform.
Configure authentication mode
You can authenticate to the EMR cluster using either of the following authentication modes:
- System: A single set of credentials is used to connect to EMR.
- User: Each user has a separate set of credentials.
Steps:
- In platform configuration, locate the following setting and apply the appropriate value. See the table below:
Code Block "aws.emr.authMode": "user",

Setting | Description
---|---
aws.emr.authMode | Configure the mode to use to authenticate to the EMR cluster. system - In system mode, the specified AWS key and secret combination are used to authenticate to the EMR cluster. These credentials are used for all users. user - In user mode, user configuration is retrieved from the database. NOTE: User mode for EMR authentication requires that aws.mode be set to user.

- Save your changes.
Configure Spark for EMR
For EMR, you can configure a set of Spark-related properties to manage the integration and its performance.
Configure Spark version
Depending on the version of EMR with which you are integrating, the platform must be configured to use the appropriate Spark version.
NOTE: You should have already acquired the value to apply. See "Supported Spark Versions" above.
Steps:
- In platform configuration, locate the following:
Code Block "spark.version": "<SparkVersionForMyEMRVersion>",
- Save your changes.
Use vendor libraries
If you are using EMR 5.20 or later (Spark 2.4 or later), you must configure the vendor libraries provided by the cluster. Please set the following parameter.
Steps:
- In platform configuration, locate the following:
Code Block "spark.useVendorSparkLibraries": true,
- Save your changes.
Disable Spark job service
The Spark job service is not used for EMR job execution. Please complete the following to disable it:
Steps:
- In platform configuration, locate the following and set it to false:
Code Block "spark-job-service.enabled": false,
- Locate the following and set it to false:
Code Block "spark-job-service.enableHiveSupport": false,
- Save your changes.
Specify YARN queue for Spark jobs
Through the Admin Settings page, you can specify the YARN queue to which to submit your Spark jobs. All Spark jobs from the platform are submitted to this queue.
Steps:
In platform configuration, locate the following:
Code Block "spark.props.spark.yarn.queue"
- Specify the name of the queue.
Save your changes.
Allocation properties
The following properties must be passed from the platform to Spark for proper execution of jobs on the EMR cluster.
NOTE: Do not modify these properties through the Admin Settings page. These properties must be added as extra properties through the Spark configuration block in platform configuration. Ignore any other references to these properties in platform configuration.
Code Block
"spark": {
  ...
  "props": {
    "spark.dynamicAllocation.enabled": "true",
    "spark.shuffle.service.enabled": "true",
    "spark.executor.instances": "0",
    "spark.executor.memory": "2048M",
    "spark.executor.cores": "2",
    "spark.driver.maxResultSize": "0"
  }
  ...
}
Property | Description | Value |
---|---|---|
spark.dynamicAllocation.enabled | Enable dynamic allocation on the Spark cluster, which allows Spark to dynamically adjust the number of executors. | true |
spark.shuffle.service.enabled | Enable Spark shuffle service, which manages the shuffle data for jobs, instead of the executors. | true |
spark.executor.instances | Default count of executor instances. | See Sizing Guidelines.
spark.executor.memory | Default memory allocation of executor instances. | See Sizing Guidelines.
spark.executor.cores | Default count of executor cores. | See Sizing Guidelines.
spark.driver.maxResultSize | Enable serialized results of unlimited size by setting this parameter to zero (0). | 0 |
Configure the platform for EMR with Kerberos
NOTE: This section applies only if you are integrating with a kerberized EMR cluster. If you are not, please skip to "Additional Configuration for EMR".
Disable standard EMR integration
When running jobs against a kerberized EMR cluster, you utilize the Spark-submit method of job submission. You must disable the standard EMR integration.
Steps:
- In platform configuration, search for the following setting and set it to false:
Code Block "webapp.runInEMR": false,
- Set the following value:
Code Block "webapp.runWithSparkSubmit": true,
- Disable use of Hive, which is not supported with EMR:
Code Block "spark-job-service.enableHiveSupport": false,
- Verify the following property values:
Code Block "webapp.runInTrifactaServer": true, "webapp.runInDataflow": false,
- Save your changes.
Enable YARN
To use Spark-submit, the Spark master must be set to use YARN.
Steps:
- In platform configuration, search for the following setting and set it to yarn:
Code Block "spark.master": "yarn",
Save your changes.
Acquire site config files
For integrating with an EMR cluster with Kerberos, the EMR cluster site XML configuration files must be downloaded from the EMR master node to the EC2 instance hosting the platform.
NOTE: This step is not required for non-kerberized EMR clusters.
NOTE: When these files change, you must update the local copies.
- Download the Hadoop Client Configuration files from the EMR master node. The required files are the following:
core-site.xml
hdfs-site.xml
mapred-site.xml
yarn-site.xml
- These configuration files must be moved to the platform deployment. By default, these files are in /etc/hadoop/conf:
Code Block
sudo cp <location>/*.xml /opt/trifacta/conf/hadoop-site/
sudo chown trifacta:trifacta /opt/trifacta/conf/hadoop-site/*.xml
- (Optional) To support user impersonation, you must also copy the *.keytab files from the /etc directory on the EMR master node to the same directory on the EC2 instance, as shown in the sketch below.
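A minimal sketch of that copy, assuming SSH access from the EMR master node to the EC2 instance; the host name is a placeholder.
Code Block
# Hypothetical example: run from the EMR master node. Stage the keytab files in
# /tmp on the EC2 instance, then move them into /etc with elevated privileges.
scp -p /etc/*.keytab <EC2 instance>:/tmp/
ssh <EC2 instance> 'sudo mv /tmp/*.keytab /etc/'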
Unused properties for EMR with Kerberos
When integrating with a kerberized EMR cluster, the following platform settings are not used:
- External Service Settings: In the Admin Settings page, this section of configuration does not apply to EMR with Kerberos.
- Unused EMR settings: In the Admin Settings page, the following EMR settings do not apply to EMR with Kerberos:
Code Block
aws.emr.tempfilesCleanupAge
aws.emr.proxyUser
aws.emr.maxLogPollingRetries
aws.emr.jobTagPrefix
aws.emr.getLogsOnFailure
aws.emr.getLogsForAllJobs
aws.emr.extractIPFromDNS
aws.emr.connectToRmEnabled
Additional Configuration for EMR
Default Hadoop job results format
For smaller datasets, the platform recommends using the Photon running environment.
For larger datasets, if the size information is unavailable, the platform recommends by default that you run the job on the Hadoop cluster. For these jobs, the default publishing action for the job is specified to run on the Hadoop cluster, generating the output format defined by this parameter. Publishing actions, including output format, can always be changed as part of the job specification.
As needed, you can change this default format.
In platform configuration, locate the following setting:
Code Block "webapp.defaultHadoopFileFormat": "csv",
Accepted values: csv, json, avro, pqt
For more information, see Run Job Page.
Configure Snappy publication
If you are publishing using Snappy compression for jobs run on an EMR cluster, you may need to perform the following additional configuration.
Steps:
SSH into EMR cluster (master) node:
Code Block ssh <EMR master node>
Create tarball of native Hadoop libraries:
Code Block tar -C /usr/lib/hadoop/lib -czvf emr-hadoop-native.tar.gz native
Copy the tarball to the /tmp directory on the EC2 instance used by the platform:
Code Block scp -p emr-hadoop-native.tar.gz <EC2 instance>:/tmp
SSH to the EC2 instance:
Code Block ssh <EC2 instance>
Create path values for libraries:
Code Block sudo -u trifacta mkdir -p /opt/trifacta/services/batch-job-runner/build/libs
Untar the tarball to the platform installation path:
Code Block sudo -u trifacta tar -C /opt/trifacta/services/batch-job-runner/build/libs -xzf /tmp/emr-hadoop-native.tar.gz
Verify that the libhadoop.so* and libsnappy.so* libraries exist and are owned by the trifacta user:
Code Block ls -l /opt/trifacta/services/batch-job-runner/build/libs/native/
Verify that the /tmp directory has the proper permissions for publication. For more information, see Supported File Formats.
- A platform restart is not required.
Additional parameters
You can set the following parameters as needed:
Steps:
In platform configuration, locate the following properties and modify them as needed:
Property | Required | Description
---|---|---
aws.emr.resource.bucket | Y | S3 bucket name where platform resources can be stored for job execution on the EMR cluster.
aws.emr.resource.path | Y | S3 path within the bucket where resources can be stored for job execution on the EMR cluster.
aws.emr.proxyUser | Y | This value defines the user for platform users to use when connecting to the cluster.
aws.emr.maxLogPollingRetries | N | Configure the maximum number of retries when polling for log files from EMR after job success or failure. Minimum value is 5.
aws.emr.tempfilesCleanupAge | N | Defines the number of days that temporary files are allowed to age in the temporary files directory on S3 before they are removed. By default, this cleanup is disabled. If needed, you can set this property to a positive integer value. During each job run, the platform scans this directory for temp files older than the specified number of days and removes any that are found. This cleanup provides an additional level of system hygiene. Before enabling this secondary cleanup process, please clear the temporary files directory of any files that should be retained.
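For example, once aws.emr.resource.bucket and aws.emr.resource.path are set, you might confirm from the EC2 instance that the location is reachable. The bucket and path below are placeholders for your own values.
Code Block
# Hypothetical check: list the EMR resources location configured above.
aws s3 ls s3://my-resource-bucket/emr-resources/ --recursive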
Optional Configuration
Configure for Redshift
For more information on configuring the platform to integrate with Redshift, see Create Redshift Connections.
Switch EMR Cluster
If needed, you can switch to a different EMR cluster through the application. For example, if the original cluster suffers a prolonged outage, you can switch clusters by entering the cluster ID of a new cluster. For more information, see Admin Settings Page.
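If you do not have the new cluster ID at hand, one way to look it up is with the AWS CLI, as in this sketch; the output fields shown are only an example of what you may want to display.
Code Block
# List active EMR clusters and their IDs to find the cluster to switch to.
aws emr list-clusters --active \
  --query 'Clusters[].{Id:Id,Name:Name,State:Status.State}' --output table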
Configure Batch Job Runner
Batch Job Runner manages jobs executed on the EMR cluster. You can modify aspects of how jobs are executed and how logs are collected. For more information, see Configure Batch Job Runner.
Modify Job Tag Prefix
In environments where the EMR cluster is shared with other job-executing applications, you can review and specify the job tag prefix, which is prepended to job identifiers to avoid conflicts with other applications.
Steps:
- In platform configuration, locate the following and modify if needed:
Code Block "aws.emr.jobTagPrefix": "TRIFACTA_JOB_",
- Save your changes and restart the platform.
...