Configure for EMR
You can configure your instance of the Designer Cloud Powered by Trifacta platform to integrate with Amazon Elastic MapReduce (EMR), a highly scalable Hadoop-based execution environment.
Note
This section applies only to installations of Designer Cloud Powered by Trifacta Enterprise Edition where a valid license key file has been applied to the platform.
Amazon EMR (Elastic MapReduce) provides a managed Hadoop framework that makes it easy, fast, and cost-effective to process vast amounts of data across dynamically scalable Amazon EC2 instances. For more information on EMR, see http://docs.aws.amazon.com/cli/latest/reference/emr/.
This section can be used to integrate with the following cluster deployments:
EMR: default
EMR with Kerberos: a separate configuration path is documented in this section
This section outlines how to create a new EMR cluster and integrate the Designer Cloud Powered by Trifacta platform with it. The platform can also be integrated with existing EMR clusters.
Supported Versions
EMR 6
Supported Versions: EMR 6.2.1, 6.3
Note
EMR 6.2.1 and EMR 6.3 require Spark 3.0.1. For more information, see Configure for Spark.
Note
Do not use EMR 6.2.0. Use EMR 6.2.1.
Note
EMR 6.0 and EMR 6.1 are not supported.
EMR 5
Supported Versions: EMR 5.13 - EMR 5.30.2
Note
EMR 5.28.0 is not supported, due to Spark compatibility issues. Please use 5.28.1 or later.
Note
Do not use EMR 5.30.0 or EMR 5.30.1. Use EMR 5.30.2.
Note
EMR 5.20 - EMR 5.30 requires Spark 2.4. For more information, see Configure for Spark.
Supported Spark Versions
Depending on the version of EMR in use, you must configure the Designer Cloud Powered by Trifacta platform to use the appropriate version of Spark. Please note the appropriate configuration settings below for later use.
Note
The version of Spark to use for the platform is defined in the spark.version property. This configuration step is covered later.
EMR versions | Spark version | Additional configuration and notes |
---|---|---|
EMR 6.2.1, EMR 6.3 | "spark.version": "3.0.1", | Please also set the following: "spark.scalaVersion": "2.12", |
EMR 5.20 - EMR 5.30.2 | "spark.version": "2.4.6", | |
EMR 5.13 - EMR 5.19 | "spark.version": "2.3.0", | |
Limitations
The Designer Cloud Powered by Trifacta platform must be installed on AWS.
Limitations for Kerberos
If you are integrating with a kerberized EMR cluster, the following additional limitations apply:
This feature is supported for Designer Cloud Powered by Trifacta Enterprise Edition only.
The Designer Cloud Powered by Trifacta platform must be deployed on an AWS EC2 Instance that is joined to the same domain as the EMR cluster.
The EMR cluster must be kerberized using the Cross-Realm Trust method. Additional information is below.
Create EMR Cluster
Use the following section to set up your EMR cluster for use with the Designer Cloud Powered by Trifacta platform.
Via AWS EMR UI: This method is assumed in this documentation.
Via AWS command line interface: For this method, it is assumed that you know the required steps to perform the basic configuration. For custom configuration steps, additional documentation is provided below.
Note
If you are integrating with a kerberized EMR cluster, the cluster must be kerberized using the Cross-Realm Trust method. The KDC on the EMR cluster must establish a cross-realm trust with the external KDC. No other Kerberos method is supported.
For more information, see https://docs.amazonaws.cn/en_us/emr/latest/ManagementGuide/emr-kerberos-options.html.
Note
It is recommended that you set up your cluster for exclusive use by the Designer Cloud Powered by Trifacta platform.
Cluster options
Note
If you are deploying EMR in a highly available environment, you must create each EMR cluster from the command line. For more information, see "Configure for EMR High Availability" below.
In the Amazon EMR console, click Create Cluster. Click Go to advanced options. Complete the sections listed below.
Note
Please be sure to read all of the cluster options before setting up your EMR cluster.
Note
Please perform your configuration through the Advanced Options task.
For more information on setting up your EMR cluster, see http://docs.aws.amazon.com/cli/latest/reference/emr/create-cluster.html.
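If you are using the AWS command line interface instead, a cluster with a comparable software configuration can be created with a command along the following lines. This is a sketch only: the cluster name, key pair name, log bucket, configurations file, and instance sizing are placeholders, and the release label must match a supported EMR version listed above.

```shell
aws emr create-cluster \
  --name "trifacta-emr" \
  --release-label emr-5.30.2 \
  --applications Name=Hadoop Name=Spark Name=Ganglia \
  --instance-type m5.xlarge \
  --instance-count 3 \
  --use-default-roles \
  --ec2-attributes KeyName=my-key-pair \
  --log-uri s3://my-emr-log-bucket/logs/ \
  --configurations file://emr-configurations.json \
  --no-auto-terminate
```

The `--configurations` file would contain the software settings JSON described under "Advanced Options" below, and `--no-auto-terminate` corresponds to leaving auto-termination disabled.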
Advanced Options
In the Advanced Options screen, please select the following:
Software Configuration:
Release: EMR version to select.
Select:
Hadoop 2.8.3
Hue 3.12.0
Ganglia 3.7.2
Tip
Although it is optional, Ganglia is recommended for monitoring cluster performance.
Spark version should be set accordingly. See "Supported Spark Versions" above.
Deselect everything else.
Multiple master nodes (optional):
To deploy your EMR cluster in a highly available configuration, select the "Use multiple master nodes" checkbox.
When selected, this option creates a total of three master nodes. Two of the nodes are configured as failover options if the first one fails.
Warning
Deploying EMR with multiple master nodes may incur additional costs.
Additional configuration is required. For more information, see "Configure for EMR High Availability" below.
Edit the software settings:
Copy and paste the following into Enter Configuration:
[
  {
    "Classification": "capacity-scheduler",
    "Properties": {
      "yarn.scheduler.capacity.resource-calculator": "org.apache.hadoop.yarn.util.resource.DominantResourceCalculator"
    }
  }
]
EMR job cancellation: The following configuration entry is required if you wish to enable EMR job cancellation:
{
  "Classification": "core-site",
  "Properties": {
    "hadoop.http.filter.initializers": "org.apache.hadoop.security.HttpCrossOriginFilterInitializer,org.apache.hadoop.http.lib.StaticUserWebFilter"
  }
}
Additional configuration is required later to enable job cancellation. For more information, see "Enable EMR job cancellation" below.
Auto-terminate cluster after the last step is completed: Leave this option disabled.
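If you are enabling both the capacity-scheduler setting and the core-site entry for EMR job cancellation, the two classifications are combined into a single JSON array in Enter Configuration, for example:

```json
[
  {
    "Classification": "capacity-scheduler",
    "Properties": {
      "yarn.scheduler.capacity.resource-calculator": "org.apache.hadoop.yarn.util.resource.DominantResourceCalculator"
    }
  },
  {
    "Classification": "core-site",
    "Properties": {
      "hadoop.http.filter.initializers": "org.apache.hadoop.security.HttpCrossOriginFilterInitializer,org.apache.hadoop.http.lib.StaticUserWebFilter"
    }
  }
]
```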
Hardware configuration
Note
Please apply the sizing information for your EMR cluster that was recommended for you. If you have not done so, please contact your Alteryx representative.
General Options
Cluster name: Provide a descriptive name.
Logging: Enable logging on the cluster.
S3 folder: Please specify the S3 bucket and path to the logging folder.
Note
Please verify that this location is read accessible to all users of the platform. See below for details.
Debugging: Enable.
Termination protection: Enable.
Tags:
No options required.
Additional Options:
EMRFS consistent view: Do not enable.
Custom AMI ID: None.
Bootstrap Actions:
If you are using a custom credential provider JAR, you must create a bootstrap action.
Note
This configuration must be completed before you create the EMR cluster. For more information, see Authentication below.
Security Options
EC2 key pair: Please select a key/pair to use if you wish to access EMR nodes via SSH.
Permissions: Set to Custom to reduce the scope of permissions. For more information, see "Access Policies" below.
Note
Default permissions give access to everything in the cluster.
Encryption Options
No requirements.
EC2 Security Groups:
The selected security group for the master node on the cluster must allow TCP traffic from the Alteryx instance on port 8088. For more information, see System Ports.
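As a sketch, this ingress rule can be added with the AWS CLI. The security group IDs below are placeholders for the EMR master security group and the security group of the Alteryx instance:

```shell
aws ec2 authorize-security-group-ingress \
  --group-id sg-0123456789emrmaster \
  --protocol tcp \
  --port 8088 \
  --source-group sg-0123456789alteryx
```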
Create cluster and acquire cluster ID
When you have performed all of the configuration, including the sections below, you can create the cluster.
Note
You must acquire your EMR cluster ID for use in configuration of the Designer Cloud Powered by Trifacta platform.
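The cluster ID (of the form j-XXXXXXXXXXXXX) is shown in the EMR console on the cluster summary page. It can also be retrieved from the command line; the cluster name below is a placeholder:

```shell
aws emr list-clusters --active \
  --query "Clusters[?Name=='trifacta-emr'].Id" \
  --output text
```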
Specify cluster roles
The following cluster roles and their permissions are required. For more information on the specifics of these policies, see "Access Policies" below.
EMR Role:
Read/write access to log bucket
Read access to resource bucket
EC2 instance profile:
If using instance mode:
EC2 profile should have read/write access for all users.
EC2 profile should have same permissions as EC2 Edge node role.
Read/write access to log bucket
Read access to resource bucket
Auto-scaling role:
Read/write access to log bucket
Read access to resource bucket
Standard auto-scaling permissions
Authentication
You can use one of two methods for authenticating the EMR cluster's access to S3:
Role-based IAM authentication (recommended): This method leverages your IAM roles on the EC2 instance.
Custom credential provider: This method utilizes a custom credential provider JAR file provided with the platform. This custom credential provider is automatically deployed by the Designer Cloud Powered by Trifacta platform during job submission.
Role-based IAM authentication
You can leverage your IAM roles to provide role-based authentication to the S3 buckets.
Note
The IAM role that is assigned to the EMR cluster and to the EC2 instances on the cluster must have access to the data of all users on S3.
For more information, see Configure for EC2 Role-Based Authentication.
Custom credential provider
If you are not using IAM roles for access, you can manage access using either of the following:
AWS key and secret values specified in trifacta-conf.json
AWS user mode
In either scenario, the Designer Cloud Powered by Trifacta platform deploys a custom credential provider JAR file to the EMR cluster before the job is executed.
Note
If you are also integrating with AWS Glue, you must provide a separate custom credential JAR file as part of that integration. For more information, see AWS Glue Access.
Set up S3 Buckets
Bucket setup
You must set up S3 buckets for read and write access.
Note
Within the Designer Cloud Powered by Trifacta platform, you must enable use of S3 as the default storage layer. This configuration is described later.
For more information, see S3 Access.
Set up EMR resources buckets
Note
If you are connecting to a kerberized EMR cluster, please skip to the next section. This section is not required.
On the EMR cluster, all users of the platform must have access to the following locations:
Location | Description | Required Access |
---|---|---|
EMR Resources bucket and path | The S3 bucket and path where resources can be stored by the Designer Cloud Powered by Trifacta platform for execution of Spark jobs on the cluster. The locations are configured separately in the Designer Cloud Powered by Trifacta platform. | Read/Write |
EMR Logs bucket and path | The S3 bucket and path where logs are written for cluster job execution. | Read |
These locations are configured on the Designer Cloud Powered by Trifacta platform later.
Access Policies
EC2 instance profile
Alteryx users require the following policies to run jobs on the EMR cluster:
{
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "elasticmapreduce:AddJobFlowSteps",
        "elasticmapreduce:DescribeStep",
        "elasticmapreduce:DescribeCluster",
        "elasticmapreduce:ListInstanceGroups",
        "elasticmapreduce:CancelSteps"
      ],
      "Resource": ["*"]
    },
    {
      "Effect": "Allow",
      "Action": ["s3:*"],
      "Resource": [
        "arn:aws:s3:::__EMR_LOG_BUCKET__",
        "arn:aws:s3:::__EMR_LOG_BUCKET__/*",
        "arn:aws:s3:::__EMR_RESOURCE_BUCKET__",
        "arn:aws:s3:::__EMR_RESOURCE_BUCKET__/*"
      ]
    }
  ]
}
Permissions for EMR high availability
If the Designer Cloud Powered by Trifacta platform is integrating with a highly available EMR cluster through multiple master nodes, the following permission must be included in the ARN used for accessing EMR:
"elasticmapreduce:ListInstances",
Additional configuration is required. See "Configure for EMR High Availability" below.
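For illustration, the first statement of the instance profile policy would then include the additional action as follows. This is a sketch; the other actions and the S3 statement are unchanged:

```json
{
  "Effect": "Allow",
  "Action": [
    "elasticmapreduce:AddJobFlowSteps",
    "elasticmapreduce:DescribeStep",
    "elasticmapreduce:DescribeCluster",
    "elasticmapreduce:ListInstanceGroups",
    "elasticmapreduce:ListInstances",
    "elasticmapreduce:CancelSteps"
  ],
  "Resource": ["*"]
}
```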
EMR roles
The following policies should be assigned to the EMR roles listed below for read/write access:
{
  "Effect": "Allow",
  "Action": ["s3:*"],
  "Resource": [
    "arn:aws:s3:::__EMR_LOG_BUCKET__",
    "arn:aws:s3:::__EMR_LOG_BUCKET__/*",
    "arn:aws:s3:::__EMR_RESOURCE_BUCKET__",
    "arn:aws:s3:::__EMR_RESOURCE_BUCKET__/*"
  ]
}
General configuration for Designer Cloud Powered by Trifacta platform
Please complete the following sections to configure the Designer Cloud Powered by Trifacta platform to communicate with the EMR cluster.
Change admin password
As soon as you have installed the software, you should log in to the application and change the admin password. The initial admin password is the instanceId for the EC2 instance. For more information, see Change Password.
Verify S3 as base storage layer
EMR integration requires use of S3 as the base storage layer.
Note
The base storage layer must be set during initial installation and set up of the Trifacta node.
Set up S3 integration
To integrate with S3, additional configuration is required. See S3 Access.
Configure EMR authentication mode
Authentication to AWS and to EMR supports the following basic modes:
System: A single set of credentials is used to connect to resources.
User: Each user has a separate set of credentials. The user can choose to submit key-secret combinations or role-based authentication.
Note
Your method of authentication to AWS should already be configured. For more information, see Configure for AWS.
The authentication mode for your access to EMR can be configured independently from the base authentication mode for AWS, with the following exception:
Note
If aws.emr.authMode is set to user, then aws.mode must also be set to user.
Authentication mode configuration matrix:
AWS mode (aws.mode) | system | user |
---|---|---|
EMR mode (aws.emr.authMode) | ||
system | AWS and EMR use a single key-secret combination. Parameters to set: "aws.s3.key" "aws.s3.secret" See Configure for AWS. | AWS access uses a single key-secret combination. EMR access is governed by per-user credentials. Per-user credentials can be provided from one of several different providers. Note Per-user access requires additional configuration for EMR. See the following section. For more information on configuring per-user access, see Configure for AWS. |
user | Not supported | AWS and EMR use the same per-user credentials for access. Per-user credentials can be provided from one of several different providers. Note Per-user access requires additional configuration for EMR. See the following section. For more information on configuring per-user access, see Configure AWS Per-User Auth for Temporary Credentials. |
Please apply the following configuration to set the EMR authentication mode:
Steps:
You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.
Locate the following settings and apply the appropriate values. See the table below:
"aws.emr.authMode": "user",
Setting | Description |
---|---|
aws.emr.authMode | Configure the mode to use to authenticate to the EMR cluster: system - In system mode, the specified AWS key and secret combination are used to authenticate to the EMR cluster. These credentials are used for all users. user - In user mode, user configuration is retrieved from the database. Note: User mode for EMR authentication requires that aws.mode be set to user. Additional configuration for EMR is below. |
Save your changes.
EMR per-user authentication for the Designer Cloud Powered by Trifacta platform
If you have enabled per-user authentication for EMR (aws.emr.authMode=user), you must set the following properties based on the credential provider for your AWS per-user credentials.
You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.
Authentication method | Properties and values |
---|---|
Use default credential provider for all Alteryx access including EMR. Note This method requires the deployment of a custom credential provider JAR. | "aws.credentialProvider":"default", "aws.emr.forceInstanceRole":false, |
Use default credential provider for all Alteryx access. However, EC2 role-based IAM authentication is used for EMR. | "aws.credentialProvider":"default", "aws.emr.forceInstanceRole":true, |
EC2 role-based IAM authentication for all Alteryx access | "aws.credentialProvider":"instance", |
Configure Designer Cloud Powered by Trifacta platform for EMR
Note
This section assumes that you are integrating with an EMR cluster that has not been kerberized. If you are integrating with a Kerberized cluster, please skip to "Configure for EMR with Kerberos".
Enable EMR integration
After you have configured S3 to be the base storage layer, you must enable EMR integration.
Steps:
You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.
Set the following value:
"webapp.runInEMR": true,
Set the following values:
"webapp.runWithSparkSubmit": false,
Verify the following property values:
"webapp.runWithSparkSubmit": false, "webapp.runInDataflow": false,
Save your changes and restart the platform.
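Taken together, the settings from the steps above form the following fragment of trifacta-conf.json (a sketch; surrounding configuration is omitted):

```json
"webapp.runInEMR": true,
"webapp.runWithSparkSubmit": false,
"webapp.runInDataflow": false,
```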
To enable Trifacta Photon, an in-memory running environment for small- to medium-sized jobs, please do the following:
In the Trifacta Application, navigate to User menu > Admin console > Workspace settings.
In the Workspace Settings page, set Photon execution to Enabled.
Trifacta Photon is available for job execution when you next log in to the Trifacta Application.
Apply EMR cluster ID
The Designer Cloud Powered by Trifacta platform must be aware of the EMR cluster to which it connects.
Steps:
Administrators can apply this configuration change through the Admin Settings Page in the application. If the application is not available, the settings are available in trifacta-conf.json. For more information, see Platform Configuration Methods.
Under External Service Settings, enter your AWS EMR Cluster ID. Click the Save button below the textbox.
For more information, see Admin Settings Page.
Extract IP address of master node in private subnet
If you have deployed your EMR cluster on a private subnet that is accessible outside of AWS, you must enable this property, which permits extraction of the IP address of the master cluster node through DNS.
Note
This feature must be enabled if your EMR is accessible outside of AWS on a private network.
Steps:
You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.
Set the following property to true:
"emr.extractIPFromDNS": true,
Save your changes and restart the platform.
Configure Spark for EMR
For EMR, you can configure a set of Spark-related properties to manage the integration and its performance.
Configure Spark version
Depending on the version of EMR with which you are integrating, the Designer Cloud Powered by Trifacta platform must be modified to use the appropriate version of Spark to connect to EMR.
Note
You should have already acquired the value to apply. See "Supported Spark Versions" above.
Steps:
You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.
Locate the following:
"spark.version": "<SparkVersionForMyEMRVersion>",
This setting is ignored for EMR:
spark.useVendorSparkLibraries
Save your changes.
Disable Spark job service
The Spark job service is not used for EMR job execution. Please complete the following to disable it:
Steps:
You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.
Locate the following and set it to false:
"spark-job-service.enabled": false,
Locate the following and set it to false:
"spark-job-service.enableHiveSupport": false,
Save your changes.
Specify YARN queue for Spark jobs
Through the Admin Settings page, you can specify the YARN queue to which to submit your Spark jobs. All Spark jobs from the Designer Cloud Powered by Trifacta platform are submitted to this queue.
Steps:
In platform configuration, locate the following:
"spark.props.spark.yarn.queue"
Specify the name of the queue.
Save your changes.
Allocation properties
The following properties must be passed from the Designer Cloud Powered by Trifacta platform to Spark for proper execution on the EMR cluster.
To apply this configuration change, log in as an administrator to the Trifacta node. Then, edit trifacta-conf.json. For more information, see Platform Configuration Methods.
Note
Do not modify these properties through the Admin Settings page. These properties must be added as extra properties through the Spark configuration block. Ignore any references in trifacta-conf.json to these properties and their settings.
"spark": {
  ...
  "props": {
    "spark.dynamicAllocation.enabled": "true",
    "spark.shuffle.service.enabled": "true",
    "spark.executor.instances": "0",
    "spark.executor.memory": "2048M",
    "spark.executor.cores": "2",
    "spark.driver.maxResultSize": "0"
  }
  ...
}
Property | Description | Value |
---|---|---|
spark.dynamicAllocation.enabled | Enable dynamic allocation on the Spark cluster, which allows Spark to dynamically adjust the number of executors. | true |
spark.shuffle.service.enabled | Enable Spark shuffle service, which manages the shuffle data for jobs, instead of the executors. | true |
spark.executor.instances | Default count of executor instances. | See Sizing Guidelines. |
spark.executor.memory | Default memory allocation of executor instances. | See Sizing Guidelines. |
spark.executor.cores | Default count of executor cores. | See Sizing Guidelines. |
spark.driver.maxResultSize | Enable serialized results of unlimited size by setting this parameter to zero (0). | 0 |
Combine transform and profiling for Spark jobs
When profiling is enabled for a Spark job, the transform and profiling tasks are combined by default. As needed, you can separate these two tasks. Publishing behaviors vary depending on the approach. For more information, see Configure for Spark.
Configure Designer Cloud Powered by Trifacta platform for EMR with Kerberos
Note
This section applies only if you are integrating with a kerberized EMR cluster. If you are not, please skip to "Additional Configuration for EMR".
Disable standard EMR integration
When running jobs against a kerberized EMR cluster, you utilize the Spark-submit method of job submission. You must disable the standard EMR integration.
Steps:
You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.
Search for the following setting and set it to false:
"webapp.runInEMR": false,
Set the following value:
"webapp.runWithSparkSubmit": true,
Disable use of Hive, which is not supported with EMR:
"spark-job-service.enableHiveSupport": false,
Verify the following property value:
"webapp.runInDataflow": false,
Save your changes.
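Taken together, the settings from the steps above form the following fragment of trifacta-conf.json for a kerberized EMR cluster (a sketch; surrounding configuration is omitted):

```json
"webapp.runInEMR": false,
"webapp.runWithSparkSubmit": true,
"spark-job-service.enableHiveSupport": false,
"webapp.runInDataflow": false,
```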
Enable YARN
To use Spark-submit, the Spark master must be set to use YARN.
Steps:
You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.
Search for the following setting and set it to yarn:
"spark.master": "yarn",
Save your changes.
Acquire site config files
For integrating with an EMR cluster with Kerberos, the EMR cluster site XML configuration files must be downloaded from the EMR master node to the Alteryx node.
Note
This step is not required for non-Kerberized EMR clusters.
Note
When these files change, you must update the local copies.
Download the Hadoop Client Configuration files from the EMR master node. The required files are the following:
core-site.xml
hdfs-site.xml
mapred-site.xml
yarn-site.xml
These configuration files must be moved to the Alteryx deployment. By default, these files are in /etc/hadoop/conf:
sudo cp <location>/*.xml /opt/trifacta/conf/hadoop-site/
sudo chown trifacta:trifacta /opt/trifacta/conf/hadoop-site/*.xml
(Optional) To support impersonation, you must also copy the *.keytab files from the /etc directory on the EMR master node to the same directory on the EC2 instance.
Unused properties for EMR with Kerberos
When integrating with a kerberized EMR cluster, the following Alteryx settings are unused:
External Service Settings: In the Admin Settings page, this section of configuration does not apply to EMR with Kerberos.
Unused EMR settings: In the Admin Settings page, the following EMR settings do not apply to EMR with Kerberos:
aws.emr.tempfilesCleanupAge
aws.emr.proxyUser
aws.emr.maxLogPollingRetries
aws.emr.jobTagPrefix
aws.emr.getLogsOnFailure
aws.emr.getLogsForAllJobs
aws.emr.extractIPFromDNS
aws.emr.connectToRmEnabled
Additional Configuration for EMR
Enable EMR job cancellation
By default, a job that is started on EMR cannot be canceled through the application. Optionally, you can enable users to cancel their EMR jobs in progress.
Prerequisites:
The following permission must be added to the IAM role that is used to interact with EMR:
elasticmapreduce:CancelSteps
For more information, see "EC2 instance profile" above.
You should add an additional software setting to the cluster definition through the EMR console. For more information, see "Advanced Options" above.
Steps:
Please complete the following configuration changes to enable job cancellation on EMR.
You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.
Please locate the following parameter and verify that it has been set to true:
Note
Enabled by default, this parameter is optional for basic EMR connectivity. This parameter must be enabled for EMR job cancellation.
"aws.emr.connectToRmEnabled": true,
Please verify that the Trifacta node can connect to the EMR master node on port 8088. For more information, see "Create EMR Cluster" above.
Please locate the following parameter and verify that it is set to true:
"aws.emr.cancelEnabled": true,
Save your changes and restart the platform.
To verify, launch an EMR job. In the Flow View context menu or in the Job History page, you should see a Cancel job option.
Default Hadoop job results format
For smaller datasets, the platform recommends using the Trifacta Photon running environment.
For larger datasets, if the size information is unavailable, the platform recommends by default that you run the job on the Hadoop cluster. For these jobs, the default publishing action for the job is specified to run on the Hadoop cluster, generating the output format defined by this parameter. Publishing actions, including output format, can always be changed as part of the job specification.
As needed, you can change this default format. You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.
"webapp.defaultHadoopFileFormat": "csv",
Accepted values: csv, json, avro, pqt
For more information, see Run Job Page.
Configure Snappy publication
If you are publishing using Snappy compression for jobs run on an EMR cluster, you may need to perform the following additional configuration.
Steps:
SSH into EMR cluster (master) node:
ssh <EMR master node>
Create tarball of native Hadoop libraries:
tar -C /usr/lib/hadoop/lib -czvf emr-hadoop-native.tar.gz native
Copy the tarball into the /tmp directory of the Alteryx EC2 instance:
scp -p emr-hadoop-native.tar.gz <EC2 instance>:/tmp
SSH to Alteryx EC2 instance:
ssh <EC2 instance>
Create the directory path for the libraries:
sudo -u trifacta mkdir -p /opt/trifacta/services/batch-job-runner/build/libs
Untar the tarball to the Alteryx installation path:
sudo -u trifacta tar -C /opt/trifacta/services/batch-job-runner/build/libs -xzf /tmp/emr-hadoop-native.tar.gz
Verify that the libhadoop.so* and libsnappy.so* libraries exist and are owned by the Alteryx user:
ls -l /opt/trifacta/services/batch-job-runner/build/libs/native/
Verify that the /tmp directory has the proper permissions for publication. For more information, see Supported File Formats.
A platform restart is not required.
Additional parameters
You can set the following parameters as needed:
Steps:
You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.
Property | Required | Description |
---|---|---|
aws.emr.resource.bucket | Y | S3 bucket name where Alteryx executables, libraries, and other resources can be stored that are required for Spark execution. |
aws.emr.resource.path | Y | S3 path within the bucket where resources can be stored for job execution on the EMR cluster. Note Do not include leading or trailing slashes for the path value. |
aws.emr.proxyUser | Y | This value defines the user that Alteryx users use for connecting to the cluster. Note Do not modify this value. |
aws.emr.maxLogPollingRetries | N | Configure maximum number of retries when polling for log files from EMR after job success or failure. Minimum value is |
aws.emr.tempfilesCleanupAge | N | Defines the number of days that temporary files in the /trifacta/tempfiles directory are permitted to age. If needed, you can set this to a positive integer value. During each job run, the platform scans this directory for temp files older than the specified number of days and removes any that are found. This cleanup provides an additional level of system hygiene. Before enabling this secondary cleanup process, please execute the following command to clear the directory: hdfs dfs -rm -r -skipTrash /trifacta/tempfiles |
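For example, the required resource properties might look like the following. The bucket and path values are hypothetical; note that the path value has no leading or trailing slashes:

```json
"aws.emr.resource.bucket": "my-emr-resource-bucket",
"aws.emr.resource.path": "trifacta/resources",
```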
Optional Configuration
Configure for Redshift
For more information on configuring the platform to integrate with Redshift, see Amazon Redshift Connections.
Configure for EMR high availability
The Designer Cloud Powered by Trifacta platform can be configured to integrate with multiple master EMR nodes, which are deployed in a highly available environment.
Warning
Deploying additional instances of EMR may result in increased costs.
Versions
Integration with EMR high availability is supported for EMR 5.23.0 and later.
Permissions
An additional permission must be added to the ARN used to access EMR. For more information, see "Permissions for EMR high availability" above.
For more information, see https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-ha-launch.html.
Create cluster
When you create the cluster, you must select the "Use multiple master nodes to improve cluster availability" checkbox. For more information, see "Create EMR Cluster" above.
Switch EMR Cluster
If needed, you can switch to a different EMR cluster through the application. For example, if the original cluster suffers a prolonged outage, you can switch clusters by entering the cluster ID of a new cluster. For more information, see Admin Settings Page.
Configure Batch Job Runner
Batch Job Runner manages jobs executed on the EMR cluster. You can modify aspects of how jobs are executed and how logs are collected. For more information, see Configure Batch Job Runner.
Modify Job Tag Prefix
In environments where the EMR cluster is shared with other job-executing applications, you can review and specify the job tag prefix, which is prepended to job identifiers to avoid conflicts with other applications.
Steps:
You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.
Locate the following and modify if needed:
"aws.emr.jobTagPrefix": "TRIFACTA_JOB_",
Save your changes and restart the platform.
Testing
Load a dataset from the EMR cluster.
Perform a few simple steps on the dataset.
Click Run in the Transformer page.
When specifying the job:
Click the Profile Results checkbox.
Select Spark.
When the job completes, verify that the results have been written to the appropriate location.