You can configure your instance of the platform to integrate with Amazon Elastic MapReduce (EMR), a highly scalable, Hadoop-based execution environment.
NOTE: This section applies only to installations of the platform on AWS.
This section outlines how to create a new EMR cluster and integrate the platform with it. The platform can also be integrated with existing EMR clusters.
Supported Versions: EMR 5.6 - EMR 5.21
NOTE: EMR 5.20 - EMR 5.21 requires Spark 2.4. For more information, see Configure for Spark.
NOTE: Job cancellation is not supported on an EMR cluster.
Create EMR Cluster

Use the following section to set up your EMR cluster for use with the platform.
Cluster options

In the Amazon EMR console, click Create Cluster. Click Go to advanced options. Complete the sections listed below.
For more information on setting up your EMR cluster, see http://docs.aws.amazon.com/cli/latest/reference/emr/create-cluster.html.

Advanced Options

In the Advanced Options screen, please select the following:
Hardware configuration
General Options
Security Options
Create cluster and acquire cluster ID

If you have performed all of the configuration, including the sections below, you can create the cluster.
Specify cluster roles

The following cluster roles and their permissions are required. For more information on the specifics of these policies, see EMR cluster policies.
Authentication

You can use one of two methods for authenticating the EMR cluster:
Role-based IAM authentication

You can leverage your IAM roles to provide role-based authentication to the S3 buckets.
Specify the custom credential provider JAR file

If you are not using IAM roles for access, you can manage access using either of the following:
In either scenario, you must use the custom credential provider JAR provided in the installation. This JAR file must be available to all nodes of the EMR cluster. After you have installed the platform and configured the S3 buckets, please complete the following steps to deploy this JAR file.
Steps:
When the EMR cluster is launched with the above custom bootstrap action, the cluster does one of the following:
For more information about AWSCredentialsProvider for EMRFS, please see:
EMRFS consistent view is recommended

Although it is not required, you should enable the consistent view feature for EMRFS on your cluster. During job execution on EMR, including profiling jobs, the platform writes files to S3 and then reads them back for further processing. To ensure that the platform reads back a complete and consistent set of files, enable EMRFS consistent view on the cluster.
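If you create the cluster through the AWS CLI or API, one way to turn on consistent view is the emrfs-site configuration classification. The following is a minimal sketch; nothing in it is specific to the platform, and the classification applies to the whole cluster:

[
  {
    "Classification": "emrfs-site",
    "Properties": {
      "fs.s3.consistent": "true"
    }
  }
]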
For more information on EMRFS consistent view, see http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-consistent-view.html.

DynamoDB

Amazon's DynamoDB is automatically enabled to store metadata for EMRFS consistent view.
Enable output job manifest

When EMRFS consistent view is enabled on the cluster, the platform must be configured to use it. During job execution, the platform can use consistent view to create a manifest file of all files generated during job execution. When the job results are published to an external target, this manifest file ensures proper publication.

Steps:
Set up S3 Buckets

Bucket setup

You must set up S3 buckets for read and write access.
For more information, see Enable S3 Access.

Set up EMR resources buckets

On the EMR cluster, all users of the platform must have access to the following locations:
These locations are configured as part of the platform configuration.

Access Policies

EC2 instance profile
EMR roles

The following policies should be assigned to the EMR roles listed below for read/write access:
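As a hedged illustration only (the bucket name is a placeholder, and the authoritative action list is documented in EMR cluster policies), a read/write S3 policy of this kind typically looks like the following:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject"
      ],
      "Resource": "arn:aws:s3:::your-bucket-name/*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket",
        "s3:GetBucketLocation"
      ],
      "Resource": "arn:aws:s3:::your-bucket-name"
    }
  ]
}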
EMRFS consistent view policies

If EMRFS consistent view is enabled, the following policy must be added to the user and EMR cluster permissions:
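For reference, consistent view stores its metadata in a DynamoDB table (named EmrFSMetadata by default), so the permissions must allow DynamoDB access. The following is a hedged sketch; the action list and resource ARN are illustrative, not a definitive policy:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "dynamodb:CreateTable",
        "dynamodb:DescribeTable",
        "dynamodb:GetItem",
        "dynamodb:PutItem",
        "dynamodb:UpdateItem",
        "dynamodb:DeleteItem",
        "dynamodb:Query",
        "dynamodb:Scan",
        "dynamodb:BatchGetItem",
        "dynamodb:BatchWriteItem"
      ],
      "Resource": "arn:aws:dynamodb:*:*:table/EmrFSMetadata"
    }
  ]
}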
Configure the platform
NOTE: The base storage layer must be set during initial installation and setup of the platform.
To integrate with S3, additional configuration is required. See Enable S3 Access.
After you have configured S3 to be the base storage layer, you must enable EMR integration.
Steps:
Search for the following setting:
"webapp.runInEMR": false, |
true
. Set the following value to false
:
"webapp.runInHadoop": false, |
Verify the following property values:
"webapp.runInTrifactaServer": true, "webapp.runInEMR": true, "webapp.runInHadoop": false, "webapp.runInDataflow": false, |
The platform must be aware of the EMR cluster to which to connect.
Steps:
Under External Service Settings, enter your AWS EMR Cluster ID. Click the Save button below the textbox.
For more information, see Admin Settings Page.
If you have deployed your EMR cluster on a private subnet that is accessible outside of AWS, you must enable this property, which permits the extraction of the IP address of the master cluster node through DNS.
NOTE: This feature must be enabled if your EMR cluster is accessible outside of AWS on a private network.
Steps:
Set the following property to true:

"emr.extractIPFromDNS": false,
Depending on the authentication method you used, you must set the following properties.
Authentication method | Properties and values
---|---
Use default credential provider for all platform access, including EMR |
Use default credential provider for platform access; EC2 role-based IAM authentication is used for EMR |
EC2 role-based IAM authentication for all platform access |
For EMR, you can configure a set of Spark-related properties to manage the integration and its performance.
Depending on the version of EMR with which you are integrating, the platform configuration must be modified to use the appropriate version of Spark to connect to EMR. For more information, see Configure for Spark.
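As an illustration only, and assuming the property name below is how the Spark version is specified in platform configuration (see Configure for Spark for the authoritative setting), integrating with EMR 5.20 or 5.21 might require a value along these lines:

"spark.version": "2.4.0",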
Through the Admin Settings page, you can specify the YARN queue to which to submit your Spark jobs. All Spark jobs from the platform are submitted to this queue.
Steps:
In platform configuration, locate the following:
"spark.props.spark.yarn.queue" |
Save your changes.
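For example, to route platform jobs to a dedicated queue, the setting would look like the following; the queue name wrangling is hypothetical and must already exist on the cluster:

"spark.props.spark.yarn.queue": "wrangling",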
The following properties must be passed from the platform to Spark for proper execution on the EMR cluster.
NOTE: Do not modify these properties through the Admin Settings page. These properties must be added as extra properties through the Spark configuration block. Ignore any references to these properties elsewhere in the platform configuration.
"spark": { ... "props": { "spark.dynamicAllocation.enabled": "true", "spark.shuffle.service.enabled": "true", "spark.executor.instances": "0", "spark.executor.memory": "2048M", "spark.executor.cores": "2", "spark.driver.maxResultSize": "0" } ... } |
Property | Description | Value |
---|---|---|
spark.dynamicAllocation.enabled | Enable dynamic allocation on the Spark cluster, which allows Spark to dynamically adjust the number of executors. | true |
spark.shuffle.service.enabled | Enable Spark shuffle service, which manages the shuffle data for jobs, instead of the executors. | true |
spark.executor.instances | Default count of executor instances. | See Sizing Guidelines. |
spark.executor.memory | Default memory allocation of executor instances. | See Sizing Guidelines. |
spark.executor.cores | Default count of executor cores. | See Sizing Guidelines. |
spark.driver.maxResultSize | Enable serialized results of unlimited size by setting this parameter to zero (0). | 0 |
For smaller datasets, the platform recommends using the Trifacta Server running environment.
For larger datasets, if the size information is unavailable, the platform recommends by default that you run the job on the Hadoop cluster. For these jobs, the default publishing action for the job is specified to run on the Hadoop cluster, generating the output format defined by this parameter. Publishing actions, including output format, can always be changed as part of the job specification.
As needed, you can change this default format.
"webapp.defaultHadoopFileFormat": "csv", |
Accepted values: csv, json, avro, pqt
For more information, see Run Job Page.
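For example, to generate Parquet output by default for these jobs, the setting would be changed as follows:

"webapp.defaultHadoopFileFormat": "pqt",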
You can set the following parameters as needed:
Steps:
Property | Required | Description
---|---|---
aws.emr.resource.bucket | Y | S3 bucket name where platform resources can be stored for job execution on the EMR cluster.
aws.emr.resource.path | Y | S3 path within the bucket where resources can be stored for job execution on the EMR cluster.
aws.emr.proxyUser | Y | This value defines the user that the platform uses to connect to the EMR cluster.
aws.emr.maxLogPollingRetries | N | Configure the maximum number of retries when polling for log files from EMR after job success or failure. Minimum value is 5.
aws.emr.tempfilesCleanupAge | N | Defines the number of days that temporary files created during job execution are retained. By default, this cleanup is not performed. If needed, you can set this to a positive integer value. During each job run, the platform scans this directory for temp files older than the specified number of days and removes any that are found. This cleanup provides an additional level of system hygiene. Before enabling this secondary cleanup process, you should first clear the temporary files directory.
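As a hedged sketch of how these settings might appear together in platform configuration (the bucket name, path, user, and numeric values are all placeholders, not recommendations):

"aws.emr.resource.bucket": "my-emr-resources",
"aws.emr.resource.path": "/emr/resources",
"aws.emr.proxyUser": "trifacta",
"aws.emr.maxLogPollingRetries": 5,
"aws.emr.tempfilesCleanupAge": 7,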
For more information on configuring the platform to integrate with Redshift, see Create Redshift Connections.
If needed, you can switch to a different EMR cluster through the application. For example, if the original cluster suffers a prolonged outage, you can switch clusters by entering the cluster ID of a new cluster. For more information, see Admin Settings Page.
Batch Job Runner manages jobs executed on the EMR cluster. You can modify aspects of how jobs are executed and how logs are collected. For more information, see Configure Batch Job Runner.
In environments where the EMR cluster is shared with other job-executing applications, you can review and specify the job tag prefix, which is prepended to job identifiers to avoid conflicts with other applications.
Steps:
Locate the following and modify if needed:
"aws.emr.jobTagPrefix": "TRIFACTA_JOB_", |