Creating a Scalable Amazon EMR Cluster on AWS in Minutes
Amazon EMR lets you set up a cluster to process and analyse data in minutes. This page covers three workflow stages: Plan and Configure, Manage, and Clean Up.
This detailed guide begins with cluster setup:
Amazon EMR Cluster Configuration
The tutorial launches an example cluster with Spark and runs a PySpark script on it. You must complete the "Before you set up Amazon EMR" exercises before starting.
While it is running, the sample cluster incurs small per-second charges under Amazon EMR pricing, which varies by Region. To avoid further charges, complete the tutorial's final cleanup steps.
The setup procedure has numerous steps:
Amazon EMR Cluster and Data Resources Configuration
This initial stage prepares your application and input data, creates your data storage location, and starts the cluster.
Setting Up Amazon EMR Storage:
Amazon EMR supports several file systems, but this article uses EMRFS to store data in an S3 bucket. EMRFS is an implementation of the Hadoop file system that reads and writes files directly to Amazon S3.
This tutorial requires a dedicated S3 bucket. Follow the Amazon Simple Storage Service Console User Guide to create one.
Create the bucket in the same AWS Region where you plan to launch your Amazon EMR cluster, for example US West (Oregon) us-west-2.
Bucket and folder names used with Amazon EMR are restricted. They may contain lowercase letters, numerals, periods (.), and hyphens (-), and they cannot end in a number. Bucket names must also be unique across all AWS accounts.
The output folder in the bucket must be empty.
Small files stored in Amazon S3 may incur modest charges, but these may be waived if you are within the AWS Free Tier usage limits.
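The naming rules above can be checked before you try to create the bucket. The following is a minimal sketch that covers only the rules listed in this article; the function name check_emr_bucket_name is hypothetical, and S3 itself enforces further rules (and global uniqueness) that cannot be checked locally.

```python
import re

# Rules taken from the text above: lowercase letters, numerals, periods and
# hyphens only, and the name must not end in a number. Global uniqueness can
# only be verified by S3 itself, so it is not covered here.
_ALLOWED = re.compile(r"[a-z0-9.-]+")


def check_emr_bucket_name(name: str) -> list[str]:
    """Return a list of problems found in `name`; an empty list means it passed."""
    problems = []
    if not _ALLOWED.fullmatch(name):
        problems.append("use only lowercase letters, numerals, periods and hyphens")
    if name and name[-1].isdigit():
        problems.append("name must not end in a number")
    return problems
```

For example, check_emr_bucket_name("emr-data-2024") reports the trailing number, while a name such as "my-emr-tutorial-bucket" returns an empty list.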
Prepare an Application with Input Data for Amazon EMR:
The most common way to prepare an application is to upload it and its input data to Amazon S3. When you submit work to the cluster, you specify the S3 locations of the script and the data.
The PySpark script examines food establishment inspection data from King County, Washington, for 2006 to 2020 to identify the top ten restaurants with the most "Red" violations. Sample rows of the dataset are presented.
To prepare the PySpark script, create a new file called health_violations.py and copy the source code into it. Next, upload this file to your new S3 bucket. Uploading instructions are in the Amazon Simple Storage Service Getting Started Guide.
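The tutorial's PySpark source is not reproduced here. To make the aggregation concrete, the following plain-Python sketch performs the same kind of count on a tiny made-up sample. The column names "name" and "violation_type", the value "RED", and the sample rows are assumptions for illustration, not taken from the real dataset or script.

```python
import csv
import io
from collections import Counter


def top_red_violations(csv_text: str, limit: int = 10) -> list[tuple[str, int]]:
    """Count rows marked RED per establishment and return the top `limit`."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        # "name" and "violation_type" are assumed column names.
        if row["violation_type"] == "RED":
            counts[row["name"]] += 1
    return counts.most_common(limit)


# Tiny made-up sample, not real inspection data.
sample = """name,violation_type
Cafe A,RED
Cafe A,BLUE
Cafe B,RED
Cafe A,RED
"""
print(top_red_violations(sample))  # [('Cafe A', 2), ('Cafe B', 1)]
```

On the cluster, Spark does the equivalent filtering, grouping, and sorting in a distributed way over the full CSV file in S3.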
To prepare the example input data, download and unzip food_establishment_data.zip, save the CSV file to your computer as food_establishment_data.csv, and upload it to the same S3 bucket. Again, see the Amazon Simple Storage Service Getting Started Guide for uploading instructions.
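If you prefer to script the two uploads instead of using the console, one option is the upload_file method of a boto3 S3 client. The sketch below assumes you pass in such a client yourself; the helper name upload_to_bucket is hypothetical, and the files are placed at the bucket root under their own file names.

```python
from pathlib import Path


def upload_to_bucket(s3_client, bucket: str, local_paths: list[str]) -> list[str]:
    """Upload each file to the bucket root and return the resulting S3 URIs."""
    uris = []
    for path in local_paths:
        key = Path(path).name  # object key: just the file name
        # upload_file(Filename, Bucket, Key) is the boto3 S3 client call.
        s3_client.upload_file(path, bucket, key)
        uris.append(f"s3://{bucket}/{key}")
    return uris
```

With boto3 installed and credentials configured, a call might look like upload_to_bucket(boto3.client("s3"), "amzn-s3-demo-bucket", ["health_violations.py", "food_establishment_data.csv"]). The returned URIs are the S3 locations you supply when submitting work.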
"Prepare input data for processing with Amazon EMR" explains EMR data configuration.
Create an Amazon EMR Cluster:
After setting up storage and your application, you can launch the example cluster with Apache Spark on the latest Amazon EMR release. You can do this with the AWS Management Console or the AWS CLI.
Console Launch:
Sign in to the AWS Management Console and open the Amazon EMR console.
Under "EMR on EC2", choose "Clusters", then "Create cluster". Note the default options for "Release", "Instance type", "Number of instances", and "Permissions".
Enter a unique "Cluster name" that does not contain <, >, $, |, or `. Under "Applications", select "Spark" to install Spark. Note: Applications must be chosen before launching the cluster. Under "Cluster logs", select the check box to publish cluster-specific logs to Amazon S3. The default destination is s3://amzn-s3-demo-bucket/logs. Replace the bucket name with your own S3 bucket. A new "logs" subfolder is created for log files.
Under "Security configuration and permissions", choose your EC2 key pair. Choose "EMR_DefaultRole" for the service role and "EMR_EC2_DefaultRole" for the IAM role for the instance profile.
Choose "Create cluster".
The cluster information page appears. As Amazon EMR provisions the cluster, its "Status" changes from "Starting" to "Running" to "Waiting". You may need to refresh the console view. The status switches to "Waiting" when the cluster is ready to accept work.
CLI Launch:
Run the AWS CLI command aws emr create-default-roles to create the default IAM roles.
Create a Spark cluster with aws emr create-cluster. Name the cluster with --name, specify your EC2 key pair with --ec2-attributes, and set --instance-type, --instance-count, and --use-default-roles. The sample command's Linux line continuation characters (\) may need to be removed or replaced with a caret (^) on Windows.
Output will include ClusterId and ClusterArn. Remember your ClusterId for later.
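One way to avoid the line continuation problem altogether is to assemble the command as an argument list in a script. The sketch below builds an aws emr create-cluster invocation with the options named above and reads ClusterId from the JSON output. The helper names, the default instance type and count, and the "emr-x.y.z" release label are placeholders and assumptions; substitute the values you actually want.

```python
import json
import shlex

# Characters the article says a cluster name must not contain.
FORBIDDEN_NAME_CHARS = set("<>$|`")


def create_cluster_command(name, release_label, key_name,
                           instance_type="m5.xlarge", instance_count=3):
    """Build the argument list for `aws emr create-cluster` with Spark."""
    if FORBIDDEN_NAME_CHARS & set(name):
        raise ValueError("cluster name must not contain <, >, $, | or `")
    return [
        "aws", "emr", "create-cluster",
        "--name", name,
        "--release-label", release_label,
        "--applications", "Name=Spark",
        "--ec2-attributes", f"KeyName={key_name}",
        "--instance-type", instance_type,      # assumed default, adjust as needed
        "--instance-count", str(instance_count),
        "--use-default-roles",
    ]


def parse_cluster_id(cli_output: str) -> str:
    """Pull ClusterId out of the JSON that create-cluster prints."""
    return json.loads(cli_output)["ClusterId"]


# "emr-x.y.z" is a placeholder release label, not a real release.
cmd = create_cluster_command("My First EMR Cluster", "emr-x.y.z", "myEMRKeyPairName")
print(shlex.join(cmd))
```

The list can be passed to subprocess.run(cmd, capture_output=True, text=True, check=True) on any operating system, and the captured standard output handed to parse_cluster_id.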
Check your cluster status using aws emr describe-cluster --cluster-id <myClusterId>.
The result shows a Status object containing State. As Amazon EMR provisions the cluster, State changes from STARTING to RUNNING to WAITING. The cluster reaches WAITING when it is up, running, and ready to accept work.
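Rather than re-running the status command by hand, you could poll it from a script. The following sketch reads the State field from describe-cluster JSON and loops until the cluster reaches WAITING. The fetch_output argument is a hypothetical callable that you supply (for instance, one that runs the CLI command and returns its standard output); the polling interval and limits are arbitrary assumptions.

```python
import json
import time


def cluster_state(describe_output: str) -> str:
    """Read Cluster.Status.State from `aws emr describe-cluster` JSON."""
    return json.loads(describe_output)["Cluster"]["Status"]["State"]


def wait_until_waiting(fetch_output, poll_seconds=30.0, max_polls=60, sleep=time.sleep):
    """Poll until the state is WAITING; return the list of states seen."""
    seen = []
    for _ in range(max_polls):
        state = cluster_state(fetch_output())
        seen.append(state)
        if state == "WAITING":
            return seen
        # TERMINATING, TERMINATED and TERMINATED_WITH_ERRORS never become WAITING.
        if state.startswith("TERMINAT"):
            raise RuntimeError(f"cluster ended in state {state}")
        sleep(poll_seconds)
    raise TimeoutError("cluster did not reach WAITING in time")
```

Passing the sleep function as a parameter keeps the loop easy to try out with canned JSON before pointing it at a real cluster.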
Open SSH Connections
Before connecting to your running cluster via SSH, update the cluster security groups to allow inbound SSH connections. Amazon EC2 security groups act as virtual firewalls. When the cluster started, Amazon EMR created default security groups: ElasticMapReduce-master for the primary node and ElasticMapReduce-slave for core and task nodes.
Console-based SSH authorisation:
You need permission to manage the security groups of the VPC that the cluster runs in.
Sign in to the AWS Management Console and open the Amazon EMR console.
Under "Clusters", select the cluster you want to update. The cluster details page opens with the "Properties" tab selected.
Under "Networking" in the "Properties" tab, expand "EC2 security groups (firewall)". Select the security group link under "Primary node".
The EC2 console opens. Choose the "Inbound rules" tab, then select "Edit inbound rules".
Find and delete any inbound rule that allows public access (Type: SSH, Port: 22, Source: Custom 0.0.0.0/0). Warning: The ElasticMapReduce-master group may include a pre-configured rule that allows public SSH access. Remove this rule and limit traffic to trusted sources.
Scroll down and click "Add Rule".
Choose "SSH" for "Type". This automatically sets Port Range to 22 and Protocol to TCP.
For "Source", choose "My IP", or choose "Custom" and enter a range of trusted client IP addresses. Remember that dynamic IP addresses may need updating later. Select "Save".
Return to the EMR console, choose "Core and task nodes", and repeat these steps to provide SSH access to those nodes.
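The rule changes in the steps above can also be reasoned about in code. This sketch flags inbound rules that open TCP port 22 to 0.0.0.0/0 and builds a replacement rule limited to one IPv4 address. It assumes rules are dictionaries shaped like the IpPermissions entries returned by the EC2 API (IpProtocol, FromPort, ToPort, IpRanges); the function names are hypothetical, and rules that allow all protocols are not examined.

```python
import ipaddress


def is_public_ssh_rule(rule: dict) -> bool:
    """True if the rule opens TCP port 22 to every IPv4 address."""
    covers_ssh = (
        rule.get("IpProtocol") == "tcp"
        and rule.get("FromPort", -1) <= 22 <= rule.get("ToPort", -1)
    )
    open_to_all = any(r.get("CidrIp") == "0.0.0.0/0" for r in rule.get("IpRanges", []))
    return covers_ssh and open_to_all


def my_ip_ssh_rule(client_ip: str) -> dict:
    """Build an SSH rule limited to one client address (the "My IP" choice)."""
    # IPv4Network validates the input; a bare address becomes a /32 network.
    network = ipaddress.IPv4Network(client_ip)
    return {
        "IpProtocol": "tcp",
        "FromPort": 22,
        "ToPort": 22,
        "IpRanges": [{"CidrIp": str(network)}],
    }
```

A rule built this way has the same shape that the EC2 API accepts when authorising security group ingress, so it could be applied with a client of your choice once you have reviewed it.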
Connecting with AWS CLI:
SSH connections may be made using the AWS CLI on any operating system.
Use the command aws emr ssh --cluster-id <myClusterId> --key-pair-file <~/mykeypair.key>. Replace the placeholders with your ClusterId and the full path to your key pair file.
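Because the command expects the full path to the key pair file, a small helper can expand a path such as ~/mykeypair.key before the command is run. This is a minimal sketch; the function name emr_ssh_command is hypothetical and the cluster ID shown in any call is a placeholder.

```python
import os


def emr_ssh_command(cluster_id: str, key_pair_file: str) -> list[str]:
    """Build the argument list for `aws emr ssh` with an absolute key path."""
    full_path = os.path.abspath(os.path.expanduser(key_pair_file))
    return ["aws", "emr", "ssh",
            "--cluster-id", cluster_id,
            "--key-pair-file", full_path]
```

The resulting list can be handed to subprocess.run without shell quoting concerns, which helps when the key path contains spaces.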
After connecting, navigate to /mnt/var/log/spark to examine the Spark logs on the primary node.
With the cluster set up and access configured, the next stage is submitting work to the cluster as steps.