...
- This feature is supported for ... only.
- The ... must be deployed on an AWS EC2 instance that is joined to the same domain as the EMR cluster.
- The EMR cluster must be kerberized using the Cross-Realm Trust method. Additional information is below.
Create EMR Cluster
Use the following section to set up your EMR cluster for use with the
Cluster options
In the Amazon EMR console, click Create Cluster. Click Go to advanced options. Complete the sections listed below.
For more information on setting up your EMR cluster, see http://docs.aws.amazon.com/cli/latest/reference/emr/create-cluster.html.

Advanced Options
In the Advanced Options screen, select the following:
Hardware configuration
General Options
Security Options
Create cluster and acquire cluster ID
If you have completed all of the configuration, including the sections below, you can create the cluster.
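For readers who script cluster creation instead of using the console, the following is a minimal sketch of how the equivalent `aws emr create-cluster` call could be assembled and how the cluster ID could be read from its JSON output. The instance type, instance count, application list, and every argument value are illustrative assumptions, not values taken from this page. The command is only built, not executed.

```python
import json


def build_create_cluster_command(name, release_label, subnet_id, key_name,
                                 security_configuration, realm,
                                 kdc_admin_password, cross_realm_password):
    """Return the argument list for `aws emr create-cluster` (not executed)."""
    kerberos = ",".join([
        f"Realm={realm}",
        f"KdcAdminPassword={kdc_admin_password}",
        f"CrossRealmTrustPrincipalPassword={cross_realm_password}",
    ])
    return [
        "aws", "emr", "create-cluster",
        "--name", name,
        "--release-label", release_label,
        # Assumption: Spark is the only application requested explicitly.
        "--applications", "Name=Spark",
        "--ec2-attributes", f"KeyName={key_name},SubnetId={subnet_id}",
        # Assumption: placeholder hardware sizing.
        "--instance-type", "m5.xlarge",
        "--instance-count", "3",
        "--use-default-roles",
        "--security-configuration", security_configuration,
        "--kerberos-attributes", kerberos,
    ]


def parse_cluster_id(cli_output):
    """Extract the cluster ID from the JSON printed by create-cluster."""
    return json.loads(cli_output)["ClusterId"]


command = build_create_cluster_command(
    name="example-kerberized-cluster",          # hypothetical
    release_label="emr-5.23.0",
    subnet_id="subnet-EXAMPLE",                 # hypothetical
    key_name="example-key",                     # hypothetical
    security_configuration="example-kerberos",  # hypothetical
    realm="EC2.INTERNAL",
    kdc_admin_password="<kdc-admin-password>",
    cross_realm_password="<cross-realm-password>",
)
print(" ".join(command))
```

The cluster ID returned by `parse_cluster_id` is the value that is later applied in the platform configuration.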
Specify cluster roles
The following cluster roles and their permissions are required. For more information on the specifics of these policies, see "Access Policies" below.

Authentication
You can use one of two methods for authenticating the EMR cluster's access to S3:

Role-based IAM authentication
You can leverage your IAM roles to provide role-based authentication to the S3 buckets.

Custom credential provider
If you are not using IAM roles for access, you can manage access using either of the following:
In either scenario, the
Set up S3 Buckets
Bucket setup
You must set up S3 buckets for read and write access.
For more information, see S3 Access.

Set up EMR resources buckets
On the EMR cluster, all users of the platform must have access to the following locations:
These locations are configured on the
Access Policies
EC2 instance profile

Permissions for EMR high availability
If the
Additional configuration is required. See "Configure for EMR High Availability" below.

EMR roles
The following policies should be assigned to the EMR roles listed below for read/write access:
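The specific policies for each role are not reproduced here. As a generic illustration only, the sketch below builds an IAM policy document that grants read/write access to a set of S3 buckets. The bucket name is hypothetical, and the chosen set of S3 actions is an assumption about what "read/write access" typically requires, not the platform's documented policy.

```python
import json


def s3_read_write_policy(bucket_names):
    """Build an IAM policy document granting read/write access to buckets."""
    statements = []
    for bucket in bucket_names:
        # Listing applies to the bucket itself.
        statements.append({
            "Effect": "Allow",
            "Action": ["s3:ListBucket"],
            "Resource": [f"arn:aws:s3:::{bucket}"],
        })
        # Object-level reads and writes apply to the keys inside the bucket.
        statements.append({
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
            "Resource": [f"arn:aws:s3:::{bucket}/*"],
        })
    return {"Version": "2012-10-17", "Statement": statements}


# Hypothetical bucket name, for illustration only.
policy = s3_read_write_policy(["example-emr-resources-bucket"])
print(json.dumps(policy, indent=2))
```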
General configuration for
Please complete the following sections to configure the
Change admin password
As soon as you have installed the software, log in to the application and change the admin password. The initial admin password is the instanceId for the EC2 instance. For more information, see Change Password.

Verify S3 as base storage layer
EMR integration requires use of S3 as the base storage layer.

Set up S3 integration
To integrate with S3, additional configuration is required. See S3 Access.

Configure EMR authentication mode
Authentication to AWS and to EMR supports the following basic modes:
The authentication mode for your access to EMR can be configured independently from the base authentication mode for AWS, with the following exception:
Authentication mode configuration matrix:
Apply the following configuration to set the EMR authentication mode:
Steps:
EMR per-user authentication for the
If you have enabled per-user authentication for EMR (
Configure
Enable EMR integration
After you have configured S3 to be the base storage layer, you must enable EMR integration.
Steps:

Apply EMR cluster ID
The
Steps:
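Before pasting a cluster ID into the configuration, a quick format check can catch copy-and-paste mistakes. The sketch below assumes that EMR cluster IDs have the shape `j-` followed by uppercase letters and digits, as in the examples shown in AWS documentation. That pattern is an assumption based on observed IDs, not a documented guarantee.

```python
import re

# Assumption: cluster IDs look like "j-" plus uppercase alphanumerics.
CLUSTER_ID_PATTERN = re.compile(r"j-[A-Z0-9]+")


def looks_like_cluster_id(value):
    """Return True if the value has the expected shape of an EMR cluster ID."""
    return CLUSTER_ID_PATTERN.fullmatch(value.strip()) is not None


print(looks_like_cluster_id("j-3SD91U2E1L2QX"))
print(looks_like_cluster_id("3SD91U2E1L2QX"))
```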
For more information, see Admin Settings Page.

Extract IP address of master node in private subnet
If you have deployed your EMR cluster on a private subnet that is accessible outside of AWS, you must enable this property, which permits the extraction of the IP address of the master cluster node through DNS.
Steps:
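To illustrate what extracting the master node's IP address through DNS can look like, the sketch below derives the address from an AWS-style private DNS name of the form `ip-A-B-C-D.<domain>` and falls back to a regular DNS lookup otherwise. This is an illustration of the idea under that naming assumption, not the platform's actual implementation. The function names are hypothetical.

```python
import re
import socket

# Assumption: IP-based private DNS names start with "ip-A-B-C-D.".
_IP_HOST = re.compile(r"^ip-(\d{1,3})-(\d{1,3})-(\d{1,3})-(\d{1,3})\.")


def ip_from_private_dns(dns_name):
    """Return the IPv4 address embedded in the DNS name, or None."""
    match = _IP_HOST.match(dns_name)
    if match is None:
        return None
    octets = match.groups()
    if any(int(octet) > 255 for octet in octets):
        return None
    return ".".join(octets)


def master_ip(dns_name):
    """Use the embedded address if present, else resolve the name via DNS."""
    return ip_from_private_dns(dns_name) or socket.gethostbyname(dns_name)


print(master_ip("ip-10-0-1-23.ec2.internal"))
```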
Configure Spark for EMR
For EMR, you can configure a set of Spark-related properties to manage the integration and its performance.

Configure Spark version
Depending on the version of EMR with which you are integrating, the
Steps:
Disable Spark job service
The Spark job service is not used for EMR job execution. Complete the following to disable it:
Steps:

Specify YARN queue for Spark jobs
Through the Admin Settings page, you can specify the YARN queue to which to submit your Spark jobs. All Spark jobs from the
Steps:
Allocation properties
The following properties must be passed from the
...
...
...
Combine transform and profiling for Spark jobs
When profiling is enabled for a Spark job, the transform and profiling tasks are combined by default. As needed, you can separate these two tasks. Publishing behaviors vary depending on the approach. For more information, see Configure for Spark.

Configure

Disable standard EMR integration
When running jobs against a kerberized EMR cluster, you use the Spark-submit method of job submission. You must disable the standard EMR integration.
Steps:

Enable YARN
To use Spark-submit, the Spark master must be set to use YARN.
Steps:
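The platform settings that control this are not shown here. As a rough picture of what a Spark-submit invocation against YARN looks like, the sketch below assembles a `spark-submit` command with the master set to YARN, a target YARN queue, and a set of Spark allocation properties passed through `--conf`. The deploy mode, queue name, property values, and application path are illustrative assumptions. The command is built but not run.

```python
def build_spark_submit(app_path, queue="default", conf=None, app_args=()):
    """Return a spark-submit argument list that targets YARN (not executed)."""
    command = [
        "spark-submit",
        "--master", "yarn",
        # Assumption: the driver runs inside the cluster.
        "--deploy-mode", "cluster",
        "--queue", queue,
    ]
    # Each Spark property is passed as its own --conf key=value pair.
    for key, value in sorted((conf or {}).items()):
        command += ["--conf", f"{key}={value}"]
    command.append(app_path)
    command.extend(app_args)
    return command


# Hypothetical queue, property values, and application path.
command = build_spark_submit(
    "example-job.jar",
    queue="example-queue",
    conf={"spark.executor.memory": "4g", "spark.executor.cores": "2"},
)
print(" ".join(command))
```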
Acquire site config files
For integrating with an EMR cluster with Kerberos, the EMR cluster site XML configuration files must be downloaded from the EMR master node to the
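One way to fetch the site files is with `scp` from the master node. The sketch below builds one `scp` command per file. It assumes the usual EMR defaults of the `hadoop` login user and the `/etc/hadoop/conf` directory. The list of files, the destination directory, the host name, and the key file are assumptions for illustration, so substitute the files and target location that your deployment requires.

```python
import posixpath

# Assumption: the standard Hadoop site files are the ones needed.
SITE_FILES = ("core-site.xml", "hdfs-site.xml", "mapred-site.xml", "yarn-site.xml")


def build_scp_commands(master_host, dest_dir, key_file,
                       remote_dir="/etc/hadoop/conf", user="hadoop"):
    """Return one scp argument list per site file (commands are not executed)."""
    return [
        [
            "scp", "-i", key_file,
            f"{user}@{master_host}:{posixpath.join(remote_dir, name)}",
            posixpath.join(dest_dir, name),
        ]
        for name in SITE_FILES
    ]


# Hypothetical host, destination directory, and key file.
commands = build_scp_commands(
    "ip-10-0-1-23.ec2.internal", "/tmp/emr-site-files", "example-key.pem"
)
for command in commands:
    print(" ".join(command))
```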
Unused properties for EMR with Kerberos
When integrating with a kerberized EMR cluster, the following

Additional Configuration for EMR
Enable EMR job cancellation
By default, a job that is started on EMR cannot be canceled through the application. Optionally, you can enable users to cancel their EMR jobs in progress.
Prerequisites:
Steps:
Complete the following configuration changes to enable job cancellation on EMR.

Default Hadoop job results format
For smaller datasets, the platform recommends using the
For larger datasets, if the size information is unavailable, the platform recommends by default that you run the job on the Hadoop cluster. For these jobs, the default publishing action for the job is specified to run on the Hadoop cluster, generating the output format defined by this parameter. Publishing actions, including output format, can always be changed as part of the job specification. As needed, you can change this default format.
Accepted values:
For more information, see Run Job Page.

Configure Snappy publication
If you are publishing using Snappy compression for jobs run on an EMR cluster, you may need to perform the following additional configuration.
Steps:

Additional parameters
You can set the following parameters as needed:
Steps:

Optional Configuration
Configure for Redshift
For more information on configuring the platform to integrate with Redshift, see Amazon Redshift Connections.

Configure for EMR high availability
The
Versions
Integration with EMR high availability is supported for EMR 5.23.0 and later.

Permissions
An additional permission must be added to the ARN used to access EMR. For more information, see "Permissions for EMR high availability" above. For more information, see https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-ha-launch.html.

Create cluster
When you create the cluster, you must select the "Use multiple master nodes to improve cluster availability" checkbox. For more information, see "Create Clusters" above.

Switch EMR Cluster
If needed, you can switch to a different EMR cluster through the application. For example, if the original cluster suffers a prolonged outage, you can switch clusters by entering the cluster ID of a new cluster. For more information, see Admin Settings Page.

Configure Batch Job Runner
Batch Job Runner manages jobs executed on the EMR cluster. You can modify aspects of how jobs are executed and how logs are collected. For more information, see Configure Batch Job Runner.

Modify Job Tag Prefix
In environments where the EMR cluster is shared with other job-executing applications, you can review and specify the job tag prefix, which is prepended to job identifiers to avoid conflicts with other applications.
Steps:
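Returning to the version requirement for EMR high availability (EMR 5.23.0 and later): the following minimal sketch shows how a release label could be checked against that minimum before you plan a multi-master cluster. It assumes release labels of the form `emr-X.Y.Z`. The function names are hypothetical.

```python
# Minimum release stated for EMR high availability integration.
MIN_HA_RELEASE = (5, 23, 0)


def parse_release_label(label):
    """Convert a label such as 'emr-5.23.0' into a tuple of integers."""
    if not label.startswith("emr-"):
        raise ValueError(f"unexpected release label: {label!r}")
    return tuple(int(part) for part in label[len("emr-"):].split("."))


def supports_high_availability(label):
    """Return True if the release is at or above the stated minimum."""
    return parse_release_label(label) >= MIN_HA_RELEASE


for label in ("emr-5.22.0", "emr-5.23.0", "emr-6.2.0"):
    print(label, supports_high_availability(label))
```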
Testing
- Load a dataset from the EMR cluster.
- Perform a few simple steps on the dataset.
- Click Run in the Transformer page.
- When specifying the job:
  - Click the Profile Results checkbox.
  - Select Spark.
- When the job completes, verify that the results have been written to the appropriate location.