AWS Deployment Scenarios

The following are the basic AWS deployment scenarios.

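Each scenario below lists the installation location, base storage layer, S3 access, Redshift access, supported cluster, and any notes.

On-premises install with S3 read access

  • Installation: On-premises
  • Base Storage Layer: HDFS
  • Storage - S3: read only
  • Storage - Redshift: Not supported
  • Cluster: Hadoop
  • Notes: When HDFS is the base storage layer, the only accessible AWS resource is read-only access to S3.

AWS install with S3 read access

  • Installation: EC2
  • Base Storage Layer: HDFS
  • Storage - S3: read only
  • Storage - Redshift: Not supported
  • Cluster: Hadoop
  • Notes: When HDFS is the base storage layer, the only accessible AWS resource is read-only access to S3.

AWS install with S3 read/write access

  • Installation: EC2
  • Base Storage Layer: S3
  • Storage - S3: read/write
  • Storage - Redshift: read/write
  • Cluster: Hadoop or EMR

AWS install through Amazon Marketplace

  • Installation: EC2
  • Base Storage Layer: S3
  • Storage - S3: read/write
  • Storage - Redshift: read/write
  • Cluster: None
  • Notes: This scenario does not support integration with any running environment cluster. All job execution occurs on the EC2 instance. This scenario is suitable for smaller user groups and data volumes.

AWS install through Amazon Marketplace with integration to EMR cluster

  • Installation: EC2
  • Base Storage Layer: S3
  • Storage - S3: read/write
  • Storage - Redshift: read/write
  • Cluster: EMR
  • Notes: This deployment scenario integrates by default with an EMR cluster. It does not support integration with a Hadoop cluster.

Microsoft Azure

  • Notes: Integration with AWS-based resources is not supported. See Install from Azure Marketplace.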

Legend and Notes:

Each column is described below.

Deployment Scenario

Description of the AWS-connected deployment.

Installation

Location where the product is installed in this scenario.

All AWS installations are installed on EC2 instances.

Base Storage Layer

When the product is first installed, the base storage layer must be set.

NOTE: In marketplace deployments, the base storage layer is set for you. After you have begun using the product, you cannot change the base storage layer.


NOTE: Read/write access to AWS-based resources requires that S3 be set as the base storage layer.


Storage - S3

The product supports read access to S3 when the base storage layer is set to HDFS.

For read/write access to S3, the base storage layer must be set to S3.

Storage - Redshift

For access to Redshift, the base storage layer must be set to S3.

Cluster

List of cluster types that are supported for integration and job execution at scale.

  • The product can integrate with at most one cluster. It cannot integrate with two different clusters at the same time.
  • Access to an EMR cluster requires S3 to be the base storage layer.
  • Smaller jobs can be executed on a running environment that is hosted on the product node itself.
  • For more information, see Running Environment Options.
Notes

Any additional notes.
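The storage and cluster rules in the legend can be sketched as a small validation function. This is a minimal illustration only: the `Deployment` class, its field names, and `check_deployment` are hypothetical and are not part of the product or of any AWS API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Deployment:
    """Hypothetical description of one deployment; field names are illustrative."""
    base_storage_layer: str          # "S3" or "HDFS"
    s3_access: str                   # "read" or "read/write"
    uses_redshift: bool = False
    # A single optional value models the "at most one cluster" rule.
    cluster: Optional[str] = None    # "Hadoop", "EMR", or None


def check_deployment(d: Deployment) -> List[str]:
    """Return the legend rules that this deployment violates (empty if none)."""
    problems = []
    if d.base_storage_layer not in ("S3", "HDFS"):
        problems.append("base storage layer must be S3 or HDFS")
    if d.s3_access == "read/write" and d.base_storage_layer != "S3":
        problems.append("read/write access to S3 requires S3 as the base storage layer")
    if d.uses_redshift and d.base_storage_layer != "S3":
        problems.append("Redshift access requires S3 as the base storage layer")
    if d.cluster == "EMR" and d.base_storage_layer != "S3":
        problems.append("EMR integration requires S3 as the base storage layer")
    return problems


if __name__ == "__main__":
    # HDFS base layer with read-only S3 and a Hadoop cluster: no violations.
    print(check_deployment(Deployment("HDFS", "read", cluster="Hadoop")))
    # HDFS base layer combined with features that all require S3: three violations.
    for problem in check_deployment(
        Deployment("HDFS", "read/write", uses_redshift=True, cluster="EMR")
    ):
        print(problem)
```

Because the base storage layer cannot be changed after you begin using the product, a check of this kind is most useful before installation.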

AWS Installations

On an EC2 Instance

When the product is installed on AWS, it is deployed on an EC2 instance. You must specify a few key parameters through the EC2 console.

NOTE: After you have created the instance, retain the instanceId from the console. It must be applied to the product configuration.

For more information, see Install from Amazon Marketplace.
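Before you apply the retained instanceId to the configuration, you may want to check that it was copied correctly. The following is a minimal sketch, assuming the standard EC2 identifier format of `i-` followed by 8 or 17 lowercase hexadecimal characters. The function name `is_valid_instance_id` is hypothetical and is not part of the product.

```python
import re

# Assumption: EC2 instance IDs are "i-" followed by 8 (older, short form)
# or 17 (current, long form) lowercase hexadecimal characters.
_INSTANCE_ID = re.compile(r"i-(?:[0-9a-f]{8}|[0-9a-f]{17})")


def is_valid_instance_id(value: str) -> bool:
    """Return True if the value looks like an EC2 instance ID.

    Surrounding whitespace is ignored, since it is a common copy-and-paste
    artifact. This checks the format only, not whether the instance exists.
    """
    return _INSTANCE_ID.fullmatch(value.strip()) is not None


if __name__ == "__main__":
    for candidate in ("i-1234567890abcdef0", "i-0abc1234", "1234567890abcdef0"):
        print(candidate, is_valid_instance_id(candidate))
```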

On Amazon Marketplace (AMI)

Through the Amazon Marketplace, you can license and deploy an AMI of the product, a self-contained version that does not require integration with a clustered running environment. All job execution happens within the AMI on the EC2 instance that you deploy.

With EMR

You can deploy an AMI of the product onto an EC2 instance and then integrate it with your pre-configured EMR cluster for Spark-based job execution.

AWS Integrations

The following table describes the AWS components that can host or integrate with the product. Combinations of one or more of these items make up the deployment scenarios listed under AWS Deployment Scenarios.

Each entry lists the AWS service, its description, the supported base storage layer, and any other required AWS services.
EC2

Amazon Elastic Compute Cloud (EC2) can be used to host the product in a scalable cloud-based environment. The following deployments are supported:

  • with or without access to an EMR cluster
  • on an AMI

Base storage layer can be S3 or HDFS.

If set to HDFS, only read access to S3 is permitted.

 
S3

Amazon Simple Storage Service (S3) can be used for reading data sources, writing job results, and hosting the .

Base storage layer can be S3 or HDFS.

If set to HDFS, only read access to S3 is permitted.

 
Redshift

Amazon Redshift provides a scalable data warehouse platform designed for big data analytics applications. The product can be configured to read from and write to Amazon Redshift database tables.

Base Storage Layer = S3

Other Required AWS Services: S3

AMI

Through the Amazon Marketplace, you can license and install an Amazon Machine Image (AMI) instance of the product. It is intended for smaller user groups that do not need the large-scale processing of Hadoop-based clusters.

Base Storage Layer = S3

NOTE: HDFS is not supported.


Other Required AWS Services: EC2 instance

EMR

Through the Amazon Marketplace, you can license and install an AMI specifically configured to work with Amazon Elastic MapReduce (EMR), a Hadoop-based data processing platform.

Base Storage Layer = S3

Other Required AWS Services: EC2 instance, AMI

Amazon RDS

Optionally, the product databases can be installed on Amazon RDS. For more information, see Install Databases on Amazon RDS.

Base Storage Layer = S3
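The "Other Required AWS Services" information above can be read as a small dependency graph. The sketch below transcribes it into a dictionary and computes everything a chosen set of services depends on. The names `REQUIRES` and `required_services` are hypothetical, and treating "EC2 instance" as the `EC2` entry is an assumption made for illustration.

```python
from typing import Dict, Iterable, Set

# Transcribed from the table above. Assumption: "EC2 instance" in the
# "Other Required AWS Services" column refers to the EC2 entry.
REQUIRES: Dict[str, Set[str]] = {
    "EC2": set(),
    "S3": set(),
    "Redshift": {"S3"},
    "AMI": {"EC2"},
    "EMR": {"EC2", "AMI"},
    "Amazon RDS": set(),
}


def required_services(selected: Iterable[str]) -> Set[str]:
    """Return the selected services plus everything they transitively require."""
    result: Set[str] = set()
    stack = list(selected)
    while stack:
        service = stack.pop()
        if service in result:
            continue  # already expanded
        if service not in REQUIRES:
            raise KeyError(f"unknown service: {service}")
        result.add(service)
        stack.extend(REQUIRES[service])
    return result


if __name__ == "__main__":
    print(sorted(required_services(["EMR"])))       # ['AMI', 'EC2', 'EMR']
    print(sorted(required_services(["Redshift"])))  # ['Redshift', 'S3']
```

A lookup like this only restates the table. It does not capture the base storage layer constraints, which still apply to each service.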