You can configure your instance of the Designer Cloud Powered by Trifacta platform to integrate with Amazon Elastic MapReduce (EMR), a highly scalable Hadoop-based execution environment.
NOTE: This section applies only to installations of the Designer Cloud Powered by Trifacta platform where a license key file has been acquired from Alteryx Inc and applied to the platform. This section does not apply to Data Preparation for Amazon Redshift and S3.
- Amazon EMR (Elastic MapReduce) provides a managed Hadoop framework that makes it easy, fast, and cost-effective to process vast amounts of data across dynamically scalable Amazon EC2 instances. For more information on EMR, see http://docs.aws.amazon.com/cli/latest/reference/emr/.
Limitations
NOTE: Job cancellation is not supported on an EMR cluster.
Use the following section to set up your EMR cluster for use with the Designer Cloud Powered by Trifacta platform.

NOTE: It is recommended that you set up your cluster for exclusive use by the Designer Cloud Powered by Trifacta platform.

Steps:
- In the Amazon EMR console, click Create Cluster.
- Click Go to advanced options.
- Complete the sections listed below.

NOTE: Be sure to read all of the cluster options before setting up your EMR cluster.

NOTE: Perform your configuration through the Advanced Options workflow.

For more information on setting up your EMR cluster, see http://docs.aws.amazon.com/cli/latest/reference/emr/create-cluster.html.

In the Advanced Options screen, select the following:
- Software Configuration:
  - Release: EMR 5.6 - 5.12
  - For EMR 5.8 - EMR 5.12.1: Spark 2.2.x. NOTE: You must apply the Spark version number in the spark.version property in Admin Settings.
  - Copy and paste the following into Enter Configuration:
- Auto-terminate cluster after the last step is completed: Disable this option.
- NOTE: Apply the sizing information for your EMR cluster that was recommended for you. If you have not done so, contact your Alteryx representative.
- S3 folder: Specify the S3 bucket and path to the logging folder. NOTE: Verify that this location is read accessible to all users of the platform. See below for details.
- If you are using the default credential provider, you must create a bootstrap action. NOTE: This configuration must be completed before you create the EMR cluster. For more information, see Authentication below.
- Permissions: Set to Custom to reduce the scope of permissions. NOTE: Default permissions give access to everything in the cluster. For more information, see EMR cluster policies below.
- EC2 Security Groups:

If you performed all of the configuration, including the sections below, you can create the cluster.

NOTE: You must acquire your EMR cluster ID for use in configuration of the Designer Cloud Powered by Trifacta platform.
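As a quick sanity check before creating the cluster, you can verify that a chosen release label falls inside the supported EMR 5.6 - 5.12 range. This is an illustrative sketch; the release label value is a placeholder.

```shell
# Sketch: check that an EMR release label is in the supported 5.6 - 5.12 range.
# The "release" value below is a placeholder -- substitute your own.
release="emr-5.8.2"

ver="${release#emr-}"        # strip the "emr-" prefix -> "5.8.2"
major="${ver%%.*}"           # "5"
rest="${ver#*.}"             # "8.2"
minor="${rest%%.*}"          # "8"

if [ "$major" -eq 5 ] && [ "$minor" -ge 6 ] && [ "$minor" -le 12 ]; then
  echo "supported"
else
  echo "unsupported"
fi
```

For emr-5.8.2 this prints "supported"; an emr-5.4.0 label would print "unsupported".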
The following cluster roles and their permissions are required. For more information on the specifics of these policies, see EMR cluster policies.

You can use one of two methods for authenticating the EMR cluster:

- Role-based IAM authentication: You can leverage your IAM roles to provide role-based authentication to the S3 buckets. NOTE: The IAM role that is assigned to the EMR cluster and to the EC2 instances on the cluster must have access to the data of all users on S3.
- Custom credential provider: If you are not using IAM roles for access, you can manage access using either of the following:

In either scenario, you must use the custom credential provider JAR provided in the installation. This JAR file must be available to all nodes of the EMR cluster. After you have installed the platform and configured the S3 buckets, complete the following steps to deploy this JAR file.

NOTE: These steps must be completed before you create the EMR cluster.

NOTE: This section applies if you are using the default credential provider mechanism for AWS and are not using the IAM instance-based role authentication mechanism.

Steps:
- From the installation of the Designer Cloud Powered by Trifacta platform, retrieve the following file: NOTE: Do not remove the timestamp value from the filename. This information is useful for support purposes.
- Upload this JAR file to an S3 bucket location where the EMR cluster can access it:
  - Via AWS command line:
  - Via AWS Console S3 UI:
- Create a bootstrap action script named configure_emrfs_lib.sh.
- Create the bootstrap action to point to the script you uploaded on S3.
- In the command line cluster creation script, add a custom bootstrap action, such as the following:

When the EMR cluster is launched with the above custom bootstrap action, the cluster does one of the following:

For more information about AWSCredentialsProvider for EMRFS, please see:

Although it is not required, you should enable the consistent view feature for EMRFS on your cluster.
During job execution, including profiling jobs, on EMR, the Designer Cloud Powered by Trifacta platform writes files in rapid succession, and these files are quickly read back from storage for further processing. However, Amazon S3 does not provide a guarantee of a consistent file listing until a later time. To ensure that the Designer Cloud Powered by Trifacta platform does not begin reading back an incomplete set of files, you should enable EMRFS consistent view.

NOTE: If EMRFS consistent view is enabled, additional policies must be added for users and the EMR cluster. Details are below.

NOTE: If EMRFS consistent view is not enabled, profiling jobs may not get a consistent set of files at the time of execution. Jobs can fail or generate inconsistent results.

For more information on EMRFS consistent view, see http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-consistent-view.html.

Amazon's DynamoDB is automatically enabled to store metadata for EMRFS consistent view.

NOTE: DynamoDB incurs costs while it is in use. For more information, see https://aws.amazon.com/dynamodb/pricing/.

NOTE: DynamoDB does not automatically purge metadata after a job completes. You should configure periodic purges of the database during off-peak hours.

You must set up S3 buckets for read and write access.

NOTE: Within the Designer Cloud Powered by Trifacta platform, you must enable use of S3 as the default storage layer. This configuration is described later. For more information, see Enable S3 Access.

On the EMR cluster, all users of the platform must have access to the following locations:

- The S3 bucket and path where resources can be stored by the Designer Cloud Powered by Trifacta platform for execution of Spark jobs on the cluster. NOTE: If server-side encryption is in use, only the SSE-S3 encryption type is supported for the resources bucket.
If you are using the same bucket for resources and data and SSE-KMS is in use, you may need to deploy a second bucket for EMR resources. For more information on server-side encryption, see Enable S3 Access.

- The S3 bucket and path where logs are written for cluster job execution.

These locations are configured on the Designer Cloud Powered by Trifacta platform later.

Alteryx users require the following policies to run jobs on the EMR cluster:

The following policies should be assigned to the EMR roles listed below for read/write access:

If EMRFS consistent view is enabled, the following policy must be added for users and the EMR cluster permissions:

Complete the following sections to configure the Designer Cloud Powered by Trifacta platform to communicate with the EMR cluster.

As soon as you have installed the software, you should log in to the application and change the admin password. The initial admin password is the instanceId for the EC2 instance. For more information, see Change Password.

EMR integration requires use of S3 as the base storage layer.

NOTE: The base storage layer must be set during initial installation and setup of the Alteryx node.

To integrate with S3, additional configuration is required. See Enable S3 Access.

After you have configured S3 to be the base storage layer, you must enable EMR integration.

Steps:
- You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json.
- Search for the following setting:
- Set the following value to true.
- Verify the following property values:

The Designer Cloud Powered by Trifacta platform must be aware of the EMR cluster to which to connect.

Steps:
- Administrators can apply this configuration change through the Admin Settings Page in the application. If the application is not available, the settings are available in trifacta-conf.json.
- Under External Service Settings, enter your AWS EMR Cluster ID.
- Click the Save button below the textbox.

For more information, see Admin Settings Page.
If you have deployed your EMR cluster on a private sub-net that is accessible outside of AWS, you must enable this property, which permits the extraction of the IP address of the master cluster node through DNS.

NOTE: This feature must be enabled if your EMR cluster is accessible outside of AWS on a private network.

Steps:
- Set the following property to true:

Depending on the authentication method you used, you must set the following properties. You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json.

- Use the default credential provider for all Alteryx access, including EMR. NOTE: This method requires the deployment of a custom credential provider JAR.
- Use the default credential provider for all Alteryx access. However, EC2 role-based IAM authentication is used for EMR.
- Use EC2 role-based IAM authentication for all Alteryx access.

For EMR, you can configure a set of Spark-related properties to manage the integration and its performance. For more information on how Spark is implemented in the platform, see Configure for Spark.

Through the Admin Settings page, you can specify the YARN queue to which to submit your Spark jobs. All Spark jobs from the Designer Cloud Powered by Trifacta platform are submitted to this queue.

Steps:
- In platform configuration, locate the following:
- Save your changes.

The following properties must be passed from the Designer Cloud Powered by Trifacta platform to Spark for proper execution on the EMR cluster. To apply this configuration change, log in as an administrator to the Alteryx node. Then, edit trifacta-conf.json.

NOTE: Do not modify these properties through the Admin Settings page. These properties must be added as extra properties through the Spark configuration block. Ignore any references in trifacta-conf.json to these properties and their settings.

For smaller datasets, the platform recommends using the Alteryx Server. For larger datasets, if the size information is unavailable, the platform recommends by default that you run the job on the Hadoop cluster.
For these jobs, the default publishing action for the job is specified to run on the Hadoop cluster, generating the output format defined by this parameter. Publishing actions, including output format, can always be changed as part of the job specification.

As needed, you can change this default format. You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json.

Accepted values: For more information, see Run Job Page.

You can set the following parameters as needed:

Steps:
- You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json.

- aws.emr.resource.path: S3 path where resources can be stored for job execution on the EMR cluster. NOTE: Do not include leading or trailing slashes for the path value.
- aws.emr.resource.bucket: S3 bucket where Alteryx executables, libraries, and other resources can be stored that are required for Spark execution. NOTE: If server-side encryption is in use, only the SSE-S3 encryption type is supported for the resources bucket. If you are using the same bucket for resources and data and SSE-KMS is in use, you may need to deploy a second bucket for EMR resources. For more information on server-side encryption, see Enable S3 Access.
- aws.emr.proxyUser: Defines the user that Alteryx users use for connecting to the cluster. NOTE: Do not modify this value.
- aws.emr.maxJobTimeoutMillis: Defines the timeout for EMR jobs in milliseconds. By default, this value is set to -1. NOTE: This setting should be modified only if you are experiencing problems with jobs hanging during execution on the EMR cluster.

For more information on configuring the platform to integrate with Redshift, see Create Redshift Connections.

If needed, you can switch to a different EMR cluster through the application. For example, if the original cluster suffers a prolonged outage, you can switch clusters by entering the cluster ID of a new cluster. For more information, see Admin Settings Page.

Batch Job Runner manages jobs executed on the EMR cluster. You can modify aspects of how jobs are executed and how logs are collected.
For more information, see Configure Batch Job Runner.

In environments where the EMR cluster is shared with other job-executing applications, you can review and specify the job tag prefix, which is prepended to job identifiers to avoid conflicts with other applications.

Steps:
- Locate the following and modify if needed:

Set up EMR Cluster
Cluster options
Advanced Options
spark.version property in Admin Settings. Additional configuration is required. See Configure for Spark.
[
  {
    "Classification": "capacity-scheduler",
    "Properties": {
      "yarn.scheduler.capacity.resource-calculator": "org.apache.hadoop.yarn.util.resource.DominantResourceCalculator"
    }
  }
]
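Before pasting the block above into the Enter Configuration box, you can confirm it is well-formed JSON locally. This is a sketch; the local filename is arbitrary.

```shell
# Sketch: syntax-check the capacity-scheduler configuration before pasting
# it into the EMR console's Enter Configuration box.
cat > emr-classification.json <<'EOF'
[
  {
    "Classification": "capacity-scheduler",
    "Properties": {
      "yarn.scheduler.capacity.resource-calculator": "org.apache.hadoop.yarn.util.resource.DominantResourceCalculator"
    }
  }
]
EOF

# json.tool exits non-zero on malformed input, so a typo fails loudly here.
python3 -m json.tool emr-classification.json > /dev/null && echo "JSON OK"
```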
Hardware configuration
General Options
Security Options
Create cluster and acquire cluster ID
Specify cluster roles
Authentication
Role-based IAM authentication
For more information, see Configure for EC2 Role-Based Authentication.

Specify the custom credential provider JAR file
trifacta-conf.json
[TRIFACTA_INSTALL_DIR]/aws/emr/build/libs/trifacta-aws-emr-credential-provider[TIMESTAMP].jar
aws s3 cp trifacta-aws-emr-credential-provider[TIMESTAMP].jar s3://<YOUR-BUCKET>/
configure_emrfs_lib.sh. The contents must be the following:
sudo aws s3 cp s3://<YOUR-BUCKET>/trifacta-aws-emr-credential-provider[TIMESTAMP].jar /usr/share/aws/emr/emrfs/auxlib/
Upload the configure_emrfs_lib.sh file to the accessible S3 bucket.
--bootstrap-actions '[
{"Path":"s3://<YOUR-BUCKET>/configure_emrfs_lib.sh","Name":"Custom action"}
]'
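The script-creation and upload steps above can be sketched end to end. The bucket name and the JAR filename below are placeholders; the live `aws s3 cp` upload commands are shown commented out because they require AWS credentials.

```shell
# Sketch: build the configure_emrfs_lib.sh bootstrap script locally.
# BUCKET and JAR are placeholders -- use your own bucket and the real
# timestamped JAR name from the Alteryx installation.
BUCKET="YOUR-BUCKET"
JAR="trifacta-aws-emr-credential-provider[TIMESTAMP].jar"

cat > configure_emrfs_lib.sh <<EOF
sudo aws s3 cp s3://${BUCKET}/${JAR} /usr/share/aws/emr/emrfs/auxlib/
EOF

# Then upload both artifacts so the EMR cluster can reach them, e.g.:
#   aws s3 cp "${JAR}" "s3://${BUCKET}/"
#   aws s3 cp configure_emrfs_lib.sh "s3://${BUCKET}/"
```

The generated script copies the credential provider JAR into the EMRFS auxlib directory on every node when the bootstrap action runs.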
If aws.mode = user in trifacta-conf.json, then the credentials registered by the user are used.

EMRFS consistent view is recommended
DynamoDB
Set up S3 Buckets
Bucket setup
Set up EMR resources buckets
Location                    Description    Required Access
EMR Resources bucket/path                  Read/Write
EMR Logs bucket/path                       Read

Access Policies
EC2 instance profile
{
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "elasticmapreduce:AddJobFlowSteps",
        "elasticmapreduce:DescribeStep",
        "elasticmapreduce:DescribeCluster",
        "elasticmapreduce:ListInstanceGroups"
      ],
      "Resource": [
        "*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:*"
      ],
      "Resource": [
        "arn:aws:s3:::__EMR_LOG_BUCKET__",
        "arn:aws:s3:::__EMR_LOG_BUCKET__/*",
        "arn:aws:s3:::__EMR_RESOURCE_BUCKET__",
        "arn:aws:s3:::__EMR_RESOURCE_BUCKET__/*"
      ]
    }
  ]
}
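The __EMR_LOG_BUCKET__ and __EMR_RESOURCE_BUCKET__ placeholders in the policy above must be replaced with your actual bucket names before the policy is attached. A sketch of performing the substitution locally; the bucket names and filenames here are examples only.

```shell
# Sketch: substitute example bucket names into a trimmed-down copy of the
# policy template, then confirm the result is still well-formed JSON.
cat > policy-template.json <<'EOF'
{
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:*"],
      "Resource": [
        "arn:aws:s3:::__EMR_LOG_BUCKET__/*",
        "arn:aws:s3:::__EMR_RESOURCE_BUCKET__/*"
      ]
    }
  ]
}
EOF

sed -e 's/__EMR_LOG_BUCKET__/my-emr-logs/g' \
    -e 's/__EMR_RESOURCE_BUCKET__/my-emr-resources/g' \
    policy-template.json > policy.json

# Fail loudly if the substituted document is not valid JSON.
python3 -m json.tool policy.json > /dev/null && echo "policy.json OK"
```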
EMR roles
{
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:*"
      ],
      "Resource": [
        "arn:aws:s3:::__EMR_LOG_BUCKET__",
        "arn:aws:s3:::__EMR_LOG_BUCKET__/*",
        "arn:aws:s3:::__EMR_RESOURCE_BUCKET__",
        "arn:aws:s3:::__EMR_RESOURCE_BUCKET__/*"
      ]
    }
  ]
}
EMRFS consistent view policies
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": [
        "dynamodb:*"
      ],
      "Effect": "Allow",
      "Resource": [
        "*"
      ]
    }
  ]
}
Configure Designer Cloud Powered by Trifacta platform for EMR
Change admin password
Verify S3 as base storage layer
Set up S3 integration
Enable EMR integration
trifacta-conf.json. For more information, see Platform Configuration Methods.
"webapp.runInEMR": false,
"webapp.runInHadoop": false,
"webapp.runInTrifactaServer": true,
"webapp.runInEMR": true,
"webapp.runInHadoop": false,
"webapp.runInDataflow": false,
"photon.enabled": true,
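After editing, you can spot-check the execution-engine flags. A minimal sketch using a local stand-in file; the real values live in trifacta-conf.json on the Alteryx node.

```shell
# Sketch: verify that EMR execution is enabled in a config fragment.
# conf-fragment.json stands in for the relevant block of trifacta-conf.json.
cat > conf-fragment.json <<'EOF'
"webapp.runInTrifactaServer": true,
"webapp.runInEMR": true,
"webapp.runInHadoop": false,
"webapp.runInDataflow": false,
"photon.enabled": true,
EOF

if grep -q '"webapp.runInEMR": true' conf-fragment.json; then
  echo "EMR enabled"
fi
```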
Apply EMR cluster ID
trifacta-conf.json. For more information, see Platform Configuration Methods.

Extract IP address of master node in private sub-net
trifacta-conf.json. For more information, see Platform Configuration Methods.
Set the following property to true:
"emr.extractIPFromDNS": false,
EMR Authentication for the Designer Cloud Powered by Trifacta platform
trifacta-conf.json. For more information, see Platform Configuration Methods.

Authentication method: Default credential provider for all Alteryx access, including EMR.
"aws.credentialProvider":"default",
"aws.emr.forceInstanceRole":false,

Authentication method: Default credential provider for all Alteryx access; EC2 role-based IAM authentication for EMR.
"aws.credentialProvider":"default",
"aws.emr.forceInstanceRole":true,

Authentication method: EC2 role-based IAM authentication for all Alteryx access.
"aws.credentialProvider":"instance",
Configure Spark for EMR
Specify YARN queue for Spark jobs
"spark.props.spark.yarn.queue"
Allocation properties
trifacta-conf.json. For more information, see Platform Configuration Methods.
"spark": {
  ...
  "props": {
    "spark.dynamicAllocation.enabled": "true",
    "spark.shuffle.service.enabled": "true",
    "spark.executor.instances": "0",
    "spark.executor.memory": "2048M",
    "spark.executor.cores": "2",
    "spark.driver.maxResultSize": "0"
  }
  ...
}
spark.dynamicAllocation.enabled: Enable dynamic allocation on the Spark cluster, which allows Spark to dynamically adjust the number of executors. Value: true

spark.shuffle.service.enabled: Enable the Spark shuffle service, which manages the shuffle data for jobs instead of the executors. Value: true

spark.executor.instances: Default count of executor instances. See Sizing Guide.

spark.executor.memory: Default memory allocation of executor instances. See Sizing Guide.

spark.executor.cores: Default count of executor cores. See Sizing Guide.

spark.driver.maxResultSize: Enable serialized results of unlimited size by setting this parameter to zero (0). Value: 0
Default Hadoop job results format
trifacta-conf.json. For more information, see Platform Configuration Methods.
"webapp.defaultHadoopFileFormat": "csv",
csv, json, avro, pqt
Additional EMR configuration
trifacta-conf.json. For more information, see Platform Configuration Methods.

Property                      Required  Description
aws.emr.resource.path         Y
aws.emr.resource.bucket       Y
aws.emr.proxyUser             Y
aws.emr.maxLogPollingRetries  N         Configure the maximum number of retries when polling for log files from EMR after job success or failure. Minimum value is 5.
aws.emr.maxJobTimeoutMillis   N         Defaults to -1, which allows jobs to run for an infinite length of time.

Optional Configuration
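Two of the aws.emr.* values above are easy to get wrong: aws.emr.resource.path must not carry leading or trailing slashes, and aws.emr.maxJobTimeoutMillis is expressed in milliseconds. A small sketch illustrating both checks; the values are examples only.

```shell
# Sketch: validate an example resource path and derive a timeout in millis.
path="trifacta/emr-resources"   # example value for aws.emr.resource.path

case "$path" in
  /*|*/ ) echo "invalid: remove leading/trailing slashes" ;;
  *     ) echo "path OK" ;;
esac

# aws.emr.maxJobTimeoutMillis takes milliseconds; e.g. a 2-hour ceiling:
hours=2
timeout_millis=$((hours * 60 * 60 * 1000))
echo "$timeout_millis"   # 7200000
```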
Configure for Redshift
Switch EMR Cluster
Configure Batch Job Runner
Modify Job Tag Prefix
trifacta-conf.json. For more information, see Platform Configuration Methods.
"aws.emr.jobTagPrefix": "TRIFACTA_JOB_",
Testing
- Load a dataset from the EMR cluster.
- Perform a few simple steps on the dataset.
- Click Run Job in the Transformer page.
- When specifying the job:
- Click the Profile Results checkbox.
- Select Hadoop on EMR.
- When the job completes, verify that the results have been written to the appropriate location.