You can configure your instance of the platform to integrate with Amazon Elastic MapReduce (EMR), a highly scalable, Hadoop-based execution environment.
NOTE: This section applies only to installations of the platform that are hosted on AWS.
This section can be used to integrate with new or existing EMR cluster deployments.
This section outlines how to create a new EMR cluster and integrate the platform with it. The platform can also be integrated with existing EMR clusters.
Supported Versions: EMR 6.2.1
NOTE: EMR 6.2 requires Spark 3.0.1. For more information, see Configure for Spark.
NOTE: Do not use EMR 6.2.0. Use EMR 6.2.1.
NOTE: EMR 6.0 and EMR 6.1 are not supported.
Supported Versions: EMR 5.13 - EMR 5.30.2
NOTE: EMR 5.28.0 is not supported due to Spark compatibility issues. Please use EMR 5.28.1 or later.
NOTE: Do not use EMR 5.30.0 or EMR 5.30.1. Use EMR 5.30.2.
NOTE: EMR 5.20 - EMR 5.30 requires Spark 2.4. For more information, see Configure for Spark.
Depending on the version of EMR in use, you must configure the platform to use the appropriate version of Spark. Note the appropriate configuration settings in the table below for later use.
NOTE: The version of Spark to use for the platform is defined in the spark.version property in platform configuration.
EMR versions | Spark version | Additional configuration and notes
---|---|---
EMR 6.2.1 | Spark 3.0.1 | Please also set additional properties. See Configure for Spark.
EMR 5.20 - EMR 5.30.2 | Spark 2.4 |
EMR 5.13 - EMR 5.19 | |
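For example, for an EMR 6.2.1 cluster, the Spark version setting in platform configuration would resemble the following. This snippet is illustrative; the full procedure is described in the Spark configuration steps later in this section and in Configure for Spark:
"spark.version": "3.0.1",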
If you are integrating with a kerberized EMR cluster, additional limitations apply. See "Configure for EMR with Kerberos" below.
Create EMR Cluster
Use the following section to set up your EMR cluster for use with the platform.
Cluster options
In the Amazon EMR console, click Create Cluster. Click Go to advanced options. Complete the sections listed below.
For more information on setting up your EMR cluster, see http://docs.aws.amazon.com/cli/latest/reference/emr/create-cluster.html.
Advanced Options
In the Advanced Options screen, select the following. (An equivalent AWS CLI example appears after these options.)
Hardware configuration
General Options
Security Options
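If you prefer to script cluster creation, you can create an equivalent cluster from the AWS CLI. The following is a minimal sketch only; the cluster name, instance types, instance counts, and log bucket are placeholders that you must adapt to your environment and to the options described above:
aws emr create-cluster \
  --name "emr-cluster-for-platform" \
  --release-label emr-5.30.2 \
  --applications Name=Hadoop Name=Spark \
  --instance-type m5.xlarge \
  --instance-count 3 \
  --use-default-roles \
  --log-uri s3://<your-log-bucket>/emr-logs/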
Create cluster and acquire cluster ID
If you have performed all of the configuration, including the sections below, you can create the cluster.
Specify cluster roles
The following cluster roles and their permissions are required. For more information on the specifics of these policies, see "Access Policies" below.
Authentication
You can use one of two methods for authenticating the EMR cluster's access to S3:
Role-based IAM authentication
You can leverage your IAM roles to provide role-based authentication to the S3 buckets.
Custom credential provider
If you are not using IAM roles for access, you can manage access using either of the following:
In either scenario, the platform must be configured for the chosen method of authentication. See "Configure EMR authentication mode" below.
Set up S3 Buckets
Bucket setup
You must set up S3 buckets for read and write access.
For more information, see S3 Access.
Set up EMR resources buckets
On the EMR cluster, all users of the platform must have access to the following locations:
These locations are configured in platform configuration; see the EMR resource settings under "Additional Configuration for EMR" below.
Access Policies
EC2 instance profile
Permissions for EMR high availability
If the platform is integrated with an EMR cluster in high availability mode, an additional permission must be added to the role used to access EMR. Additional configuration is required. See "Configure for EMR High Availability" below.
EMR roles
The following policies should be assigned to the EMR roles listed below for read/write access:
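The exact policies depend on your environment. As a minimal sketch, an S3 read/write policy statement for a single bucket might look like the following; the bucket name is a placeholder:
{
  "Effect": "Allow",
  "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject", "s3:ListBucket"],
  "Resource": ["arn:aws:s3:::<your-bucket>", "arn:aws:s3:::<your-bucket>/*"]
}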
General configuration for the platform
Please complete the following sections to configure the platform.
Change admin password
As soon as you have installed the software, you should log in to the application and change the admin password. The initial admin password is the instanceId of the EC2 instance. For more information, see Change Password.
Verify S3 as base storage layer
EMR integration requires the use of S3 as the base storage layer.
Set up S3 integration
To integrate with S3, additional configuration is required. See S3 Access.
Configure EMR authentication mode
Authentication to AWS and to EMR supports the following basic modes:
The authentication mode for your access to EMR can be configured independently from the base authentication mode for AWS, with the following exception:
Authentication mode configuration matrix:
Please apply the following configuration to set the EMR authentication mode.
Steps:
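The exact property names can vary by release. As an illustrative sketch only (the property name aws.emr.authMode below is an assumption and should be confirmed against your release documentation), a system-mode configuration might look like this:
"aws.emr.authMode": "system",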
EMR per-user authentication for the platform
If you have enabled per-user authentication for EMR, additional per-user configuration is required.
Configure the platform for EMR
NOTE: This section assumes that you are integrating with an EMR cluster that has not been kerberized. If you are integrating with a kerberized cluster, please skip to "Configure for EMR with Kerberos".
After you have configured S3 to be the base storage layer, you must enable EMR integration.
Steps:
Set the following value:
"webapp.runInEMR": true,
Set the following value:
"webapp.runWithSparkSubmit": false,
Verify the following property values:
"webapp.runWithSparkSubmit": false,
"webapp.runInDataflow": false,
Save your changes and restart the platform.
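When these changes are complete, the relevant settings in platform configuration read as follows:
"webapp.runInEMR": true,
"webapp.runWithSparkSubmit": false,
"webapp.runInDataflow": false,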
To enable Photon, an in-memory running environment for small- to medium-sized jobs, please do the following:
In the application, navigate to User menu > Admin console > Workspace settings.
In the Workspace Settings page, set Photon execution to Enabled.
Photon is available for job execution when you next log in to the application.
The platform must be made aware of the EMR cluster to which to connect.
Steps:
Under External Service Settings, enter your AWS EMR Cluster ID. Click the Save button below the textbox.
For more information, see Admin Settings Page.
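If you need to look up the cluster ID, you can list your active clusters from the AWS CLI; EMR cluster IDs begin with j-:
aws emr list-clusters --active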
If you have deployed your EMR cluster on a private subnet that is accessible outside of AWS, you must enable this property, which permits the extraction of the IP address of the master cluster node through DNS.
NOTE: This feature must be enabled if your EMR cluster is accessible outside of AWS on a private network.
Steps:
Set the following property to true:
"emr.extractIPFromDNS": true,
For EMR, you can configure a set of Spark-related properties to manage the integration and its performance.
Depending on the version of EMR with which you are integrating, the platform must be configured to use the appropriate version of Spark to connect to EMR.
NOTE: You should have already acquired the value to apply. See "Supported Spark Versions" above.
Steps:
Locate the following property and set it to the Spark version for your EMR version:
"spark.version": "<SparkVersionForMyEMRVersion>",
Depending on your version of EMR, you may also need to set spark.useVendorSparkLibraries. For more information, see Configure for Spark.
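As a sketch, the resulting settings for an EMR 5.30.x cluster might look like the following; treat the exact version string and the vendor-libraries flag as illustrative values to confirm for your release:
"spark.version": "2.4.6",
"spark.useVendorSparkLibraries": true,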
The Spark job service is not used for EMR job execution. Please complete the following to disable it:
Steps:
Locate the following and set it to false:
"spark-job-service.enabled": false,
Locate the following and set it to false:
"spark-job-service.enableHiveSupport": false,
Through the Admin Settings page, you can specify the YARN queue to which to submit your Spark jobs. All Spark jobs from the platform are submitted to this queue.
Steps:
In platform configuration, locate the following property and set it to the name of the target YARN queue:
"spark.props.spark.yarn.queue"
Save your changes.
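For reference, a configured value might look like the following; the queue name prod_queue is a placeholder for a queue that exists on your EMR cluster:
"spark.props.spark.yarn.queue": "prod_queue",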
The following properties must be passed from the platform to Spark for proper execution on the EMR cluster.
NOTE: Do not modify these properties through the Admin Settings page. These properties must be added as extra properties through the Spark configuration block. Ignore any other references to these properties.
"spark": {
  ...
  "props": {
    "spark.dynamicAllocation.enabled": "true",
    "spark.shuffle.service.enabled": "true",
    "spark.executor.instances": "0",
    "spark.executor.memory": "2048M",
    "spark.executor.cores": "2",
    "spark.driver.maxResultSize": "0"
  }
  ...
}
Property | Description | Value
---|---|---
spark.dynamicAllocation.enabled | Enable dynamic allocation on the Spark cluster, which allows Spark to dynamically adjust the number of executors. | true
spark.shuffle.service.enabled | Enable the Spark shuffle service, which manages the shuffle data for jobs instead of the executors. | true
spark.executor.instances | Default count of executor instances. | See Sizing Guide.
spark.executor.memory | Default memory allocation of executor instances. | See Sizing Guide.
spark.executor.cores | Default count of executor cores. | See Sizing Guide.
spark.driver.maxResultSize | Enable serialized results of unlimited size by setting this parameter to zero (0). | 0
When profiling is enabled for a Spark job, the transform and profiling tasks are combined by default. As needed, you can separate these two tasks. Publishing behaviors vary depending on the approach. For more information, see Configure for Spark.
NOTE: This section applies only if you are integrating with a kerberized EMR cluster. If you are not, please skip to "Additional Configuration for EMR".
When running jobs against a kerberized EMR cluster, you utilize the Spark-submit method of job submission. You must disable the standard EMR integration.
Steps:
Search for the following setting and set it to false:
"webapp.runInEMR": false,
Set the following value:
"webapp.runWithSparkSubmit": true,
Disable use of Hive, which is not supported with EMR:
"spark-job-service.enableHiveSupport": false,
Verify the following property value:
"webapp.runInDataflow": false,
To use Spark-submit, the Spark master must be set to use YARN.
Steps:
Search for the following setting and set it to yarn:
"spark.master": "yarn",
Save your changes.
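After these changes, the relevant settings read as follows:
"webapp.runInEMR": false,
"webapp.runWithSparkSubmit": true,
"spark-job-service.enableHiveSupport": false,
"webapp.runInDataflow": false,
"spark.master": "yarn",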
For integration with an EMR cluster with Kerberos, the EMR cluster site XML configuration files must be downloaded from the EMR master node to the EC2 instance hosting the platform.
NOTE: This step is not required for non-kerberized EMR clusters.
NOTE: When these files change, you must update the local copies.
core-site.xml
hdfs-site.xml
mapred-site.xml
yarn-site.xml
These configuration files must be moved to the platform instance. By default, these files are located on the EMR master node in /etc/hadoop/conf:
sudo cp <location>/*.xml /opt/trifacta/conf/hadoop-site/
sudo chown trifacta:trifacta /opt/trifacta/conf/hadoop-site/*.xml
When integrating with a kerberized EMR cluster, the following are unused:
Unused EMR settings: In the Admin Settings page, the following EMR settings do not apply to EMR with Kerberos:
aws.emr.tempfilesCleanupAge
aws.emr.proxyUser
aws.emr.maxLogPollingRetries
aws.emr.jobTagPrefix
aws.emr.getLogsOnFailure
aws.emr.getLogsForAllJobs
aws.emr.extractIPFromDNS
aws.emr.connectToRmEnabled
By default, a job that is started on EMR cannot be canceled through the application. Optionally, you can enable users to cancel their EMR jobs in progress.
Prerequisites:
The following permission must be added to the IAM role that is used to interact with EMR:
elasticmapreduce:CancelSteps
For more information, see "EC2 instance profile" above.
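For example, the added policy statement might resemble the following sketch; the Resource value is a placeholder and should be scoped to your cluster as appropriate:
{
  "Effect": "Allow",
  "Action": "elasticmapreduce:CancelSteps",
  "Resource": "*"
}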
Steps:
Please complete the following configuration changes to enable job cancellation on EMR.
Please locate the following parameter and verify that it has been set to true:
NOTE: Enabled by default, this parameter is optional for basic EMR connectivity. This parameter must be enabled for EMR job cancellation.
"aws.emr.connectToRmEnabled": true,
Please locate the following parameter and verify that it is set to true:
"aws.emr.cancelEnabled": true,
For smaller datasets, the platform recommends using the Photon running environment.
For larger datasets, or when size information is unavailable, the platform recommends by default that you run the job on the Hadoop cluster. For these jobs, the default publishing action is specified to run on the Hadoop cluster, generating the output format defined by this parameter. Publishing actions, including output format, can always be changed as part of the job specification.
As needed, you can change this default format.
"webapp.defaultHadoopFileFormat": "csv",
Accepted values: csv, json, avro, pqt
For more information, see Run Job Page.
If you are publishing using Snappy compression for jobs run on an EMR cluster, you may need to perform the following additional configuration.
Steps:
SSH into EMR cluster (master) node:
ssh <EMR master node>
Create tarball of native Hadoop libraries:
tar -C /usr/lib/hadoop/lib -czvf emr-hadoop-native.tar.gz native
Copy the tarball to the /tmp directory on the EC2 instance used by the platform:
scp -p emr-hadoop-native.tar.gz <EC2 instance>:/tmp
SSH to the instance:
ssh <EC2 instance>
Create the directory path for the libraries:
sudo -u trifacta mkdir -p /opt/trifacta/services/batch-job-runner/build/libs
Untar the tarball into this directory on the platform instance:
sudo -u trifacta tar -C /opt/trifacta/services/batch-job-runner/build/libs -xzf /tmp/emr-hadoop-native.tar.gz
Verify that the libhadoop.so* and libsnappy.so* libraries exist in this location and are owned by the platform user (trifacta):
ls -l /opt/trifacta/services/batch-job-runner/build/libs/native/
Verify that the /tmp directory has the proper permissions for publication. For more information, see Supported File Formats.
You can set the following parameters as needed:
Steps:
Property | Required | Description
---|---|---
aws.emr.resource.bucket | Y | S3 bucket name where resources can be stored for job execution on the EMR cluster.
aws.emr.resource.path | Y | S3 path within the bucket where resources can be stored for job execution on the EMR cluster.
aws.emr.proxyUser | Y | This value defines the user for the platform to use when running jobs on the EMR cluster.
aws.emr.maxLogPollingRetries | N | Maximum number of retries when polling for log files from EMR after job success or failure. Minimum value is 5.
aws.emr.tempfilesCleanupAge | N | Defines the number of days that temporary files are permitted to age before cleanup. By default, this cleanup is disabled. If needed, you can set this to a positive integer value. During each job run, the platform scans this directory for temp files older than the specified number of days and removes any that are found. This cleanup provides an additional level of system hygiene. Before enabling this secondary cleanup process, clear the directory of any existing temporary files.
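For example, the EMR resource settings in platform configuration might look like the following; the bucket name and path are placeholders:
"aws.emr.resource.bucket": "<your-resource-bucket>",
"aws.emr.resource.path": "/emr-resources",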
For more information on configuring the platform to integrate with Redshift, see Amazon Redshift Connections.
The platform can be configured to integrate with multiple EMR master nodes, which are deployed in a highly available environment.
Deploying additional instances of EMR may result in increased costs.
Integration with EMR high availability is supported for EMR 5.23.0 and later.
An additional permission must be added to the ARN used to access EMR. For more information, see "Permissions for EMR high availability" above.
For more information, see https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-ha-launch.html.
When you create the cluster, you must select the "Use multiple master nodes to improve cluster availability" checkbox. For more information, see "Create EMR Cluster" above.
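If you create the cluster from the AWS CLI instead, high availability is requested by specifying three master instances. The following is a minimal sketch; the cluster name, instance types, counts, and subnet are placeholders:
aws emr create-cluster \
  --name "emr-ha-cluster-for-platform" \
  --release-label emr-5.30.2 \
  --applications Name=Hadoop Name=Spark \
  --instance-groups InstanceGroupType=MASTER,InstanceCount=3,InstanceType=m5.xlarge InstanceGroupType=CORE,InstanceCount=3,InstanceType=m5.xlarge \
  --ec2-attributes SubnetId=<your-subnet-id> \
  --use-default-roles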
If needed, you can switch to a different EMR cluster through the application. For example, if the original cluster suffers a prolonged outage, you can switch clusters by entering the cluster ID of a new cluster. For more information, see Admin Settings Page.
Batch Job Runner manages jobs executed on the EMR cluster. You can modify aspects of how jobs are executed and how logs are collected. For more information, see Configure Batch Job Runner.
In environments where the EMR cluster is shared with other job-executing applications, you can review and specify the job tag prefix, which is prepended to job identifiers to avoid conflicts with other applications.
Steps:
Locate the following and modify if needed:
"aws.emr.jobTagPrefix": "TRIFACTA_JOB_",