Cloudera HDP Sandbox on AWS EC2
In this tutorial, we will install the Hortonworks Data Platform (HDP) Sandbox (HDP 3.0.1 or HDP 2.6.5) through Docker on a single-node AWS EC2 instance.
Apache Hadoop is a layered framework for storing and processing massive amounts of data. In our case, Apache Hadoop comes packaged as an enterprise distribution in the form of HDP. At the base of HDP sits the data storage layer, the Hadoop Distributed File System (HDFS). When data files are accessed by Hive, Pig or another tool, YARN is the data operating system that lets them analyze, manipulate or process that data. HDP includes various components that open new opportunities and efficiencies in healthcare, finance, insurance and other industries that impact people.
Steps
1. Log in to the AWS Console, search for the EC2 (Virtual Servers in the Cloud) service and launch an instance:
https://aws.amazon.com/it/console/
2. Boot an EC2 instance with the Amazon Linux 2 AMI:
- Use >= 16 GB of RAM (t2.xlarge or above).
3. Keep the default settings.
- Add >= 20 GB of storage.
4. For HDP 3.0.1, more than 30 GB of storage is required.
5. Add a tag with key Name and value Hadoop.
- Configure the security group: add a rule with [Type=All TCP, Source=My IP]. It is important to restrict access to your IP only.
6. Choose your key pair or create a new one.
- Keep note of the key pair you are using for the EC2 instance.
- Note down the public IP of the instance. This is the IP address you will use to access the HDP services and web UIs.
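If you prefer the command line, the same launch settings can be expressed with the AWS CLI. This is only a sketch and assumes the CLI is configured; the AMI ID, key-pair name and security-group ID are placeholders to replace with your own values.

```bash
# Sketch only: placeholder AMI ID, key pair and security group.
# 40 GB covers HDP 3.0.1 (which needs more than 30 GB); 20 GB suffices for HDP 2.6.5.
aws ec2 run-instances \
  --image-id ami-xxxxxxxxxxxxxxxxx \
  --instance-type t2.xlarge \
  --key-name my-key \
  --security-group-ids sg-xxxxxxxx \
  --block-device-mappings 'DeviceName=/dev/xvda,Ebs={VolumeSize=40,VolumeType=gp2}' \
  --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=Hadoop}]'
```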
7. After the EC2 instance is up and running, select the instance and click Connect. Specify the path to your key-pair file and connect; I connected via ssh from the terminal.
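For reference, the terminal connection looks roughly like this, assuming the key file is called my-key.pem and the default ec2-user account of Amazon Linux 2:

```bash
chmod 400 my-key.pem                        # ssh refuses keys with open permissions
ssh -i my-key.pem ec2-user@<ec2-public-ip>  # use the public IP noted above
```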
8. Enter the EC2 instance via ssh and install git with sudo yum install -y git.
9. Check out the scripts with git clone https://github.com/ruslanmv/HDP-Sandbox-AWS.git and then enter the folder with cd HDP-Sandbox-AWS.
10. Install Docker on the EC2 instance with bash install_docker.sh (a sketch of what this script roughly does is shown after these steps). Log out of the SSH session and log in again.
11. Run docker info to confirm Docker is working without sudo.
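I have not reproduced install_docker.sh here; on Amazon Linux 2 a Docker installation of this kind typically boils down to something like the following sketch, which also explains why you need to log out and back in before docker works without sudo:

```bash
# Rough equivalent of a Docker install on Amazon Linux 2 (sketch, not the exact script).
sudo yum update -y
sudo amazon-linux-extras install -y docker   # install the Docker engine
sudo systemctl enable --now docker           # start Docker and enable it at boot
sudo usermod -aG docker ec2-user             # let ec2-user run docker without sudo
# Log out and log in again so the new group membership takes effect.
```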
HDP Installation
12. In the GitHub repository of the installation, https://github.com/ruslanmv/HDP-Sandbox-AWS, there are two HDP versions, 3.0.1 and 2.6.5, that we can install on the AWS Cloud. You should choose one; in this tutorial I selected 2.6.5 because Hive View is not present in HDP 3.0.1 or Ambari 2.7.
a) HDP 3.0.1
cd HDP-Sandbox-AWS/HDP_3.0.1
Install HDP through Docker with bash docker-deploy-hdp30.sh.
b) HDP 2.6.5
cd HDP-Sandbox-AWS/HDP_2.6.5
Install HDP through Docker with bash docker-deploy-hdp265.sh.
It will take a while to install. The advantage of HDP 2.6.5 is that you have Hive View. When it is finished, the deployment output appears in the terminal.
13. You can check with the command docker ps whether the sandbox container is running.
14. After it finishes, access Ambari through http://your-ec2-public-ip:8080/ to confirm it is working. Copy the IPv4 public IP of the instance and paste it into the web browser. By default the credentials are:
Username: maria_dev
Password: maria_dev
Just after the server is deployed the services are still starting; after a few minutes (something like 7 minutes) they should be ready.
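You can also confirm from the terminal that Ambari is answering on port 8080, for example:

```bash
# Prints the HTTP status code once the Ambari server is reachable.
curl -s -o /dev/null -w "%{http_code}\n" http://<ec2-public-ip>:8080/
```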
Great, our HDP server is finally running on the AWS Cloud.
Hive View
Just as an example, let us visualize a dataset in Hive.
We will use a dataset from GroupLens, which has collected and made available rating data sets from the MovieLens web site (http://movielens.org). The data sets were collected over various periods of time, depending on the size of the set. As a test we will use the MovieLens 100K movie ratings, a stable benchmark dataset of 100,000 ratings from 1000 users on 1700 movies, released 4/1998:
http://files.grouplens.org/datasets/movielens/ml-100k.zip
We extract all the files from the zip archive. There are two files that we will use, u.data and u.item.
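On a machine with wget and unzip available, downloading and extracting the archive looks like this:

```bash
# Download the MovieLens 100K dataset and extract it.
wget http://files.grouplens.org/datasets/movielens/ml-100k.zip
unzip ml-100k.zip
ls ml-100k/u.data ml-100k/u.item   # the two files used in this tutorial
```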
The next step is to open Hive View.
We proceed to upload a file from local storage: we click on the settings for the file type, choose TAB as the field delimiter, and select the file u.data.
We rename the table and the columns:
table u : ratings
column1: user_id
column2: movie_id
column3: rating
column4: rating_time
and we click Upload Table.
Now we repeat the same procedure: we click on the settings for the file type, choose the field delimiter, and select the file u.item. We rename the table and the columns:
table u : movies_names
column1: movie_id
column2: name
and we upload the table.
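If you prefer the query editor over the upload wizard, roughly equivalent tables can also be created in HiveQL. This is only a sketch for the ratings table and assumes u.data has already been copied into HDFS (the path below is a placeholder); the movies_names table would be defined the same way with its own delimiter.

```sql
-- Sketch: a ratings table over the tab-delimited u.data file.
CREATE TABLE ratings (
  user_id     INT,
  movie_id    INT,
  rating      INT,
  rating_time BIGINT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

-- Placeholder HDFS path; adjust to wherever you uploaded u.data.
LOAD DATA INPATH '/user/maria_dev/u.data' INTO TABLE ratings;
```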
Query in Hive
After the tables are uploaded, we can open the query editor and execute a SQL query against them.
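For example, a query along these lines joins the ratings with the movie names defined above and aggregates them:

```sql
-- Example query: rating count and average rating per movie.
SELECT m.name,
       COUNT(*)      AS num_ratings,
       AVG(r.rating) AS avg_rating
FROM ratings r
JOIN movies_names m ON r.movie_id = m.movie_id
GROUP BY m.name
ORDER BY num_ratings DESC
LIMIT 20;
```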
Data Visualisation
In addition, we can visualize the data easily: we choose as the x and y variables the columns that we want to plot.
Great, now we are able to plot features of a dataset by using Hive.
Important: after you are done with the AWS instance, if you will not use the server any more, please terminate the instance to avoid charges.
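If you prefer the CLI, termination is a single command (the instance ID below is a placeholder):

```bash
# Terminate the sandbox instance to stop incurring charges.
aws ec2 terminate-instances --instance-ids i-0123456789abcdef0
```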
HDP Sandbox Port Forwards
- 2122 -> 22 HostSSH
- 2200 -> 22 HostSSH2
- 2222 -> 2222 DockerSSH
- 7777 -> 7777 Streaming Analytics Manager
- 8585 -> 8585 Streams Messaging Manager
- 7788 -> 7788 Schema Registry
- 8000 -> 8000 Storm Logviewer
- 9995 -> 9995 Zeppelin1
- 9996 -> 9996 Zeppelin2
- 9088 -> 9088 NiFi Protocol
- 61080 -> 61080 NiFi Registry
- 8886 -> 8886 AmbariInfra
- 61888 -> 61888 Log Search
- 10500 -> 10500 HS2v2
- 4040 -> 4040 Spark
- 4200 -> 4200 AmbariShell
- 8983 -> 8983 SolrAdmin
- 42080 -> 80 Apache
- 42111 -> 42111 nfs
- 8020 -> 8020 HDFS
- 8040 -> 8040 nodemanager
- 8032 -> 8032 RM
- 8080 -> 8080 ambari
- 8443 -> 8443 Knox
- 8744 -> 8744 StormUI
- 1080 -> 1080 Splash Page
- 8993 -> 8993 Solr
- 10000 -> 10000 HS2
- 10001 -> 10001 HS2Http
- 10002 -> 10002 HiveJDBCJar
- 30800 -> 30800 DAS
- 11000 -> 11000 Oozie
- 15000 -> 15000 Falcon
- 19888 -> 19888 JobHistory
- 50070 -> 50070 WebHdfs
- 50075 -> 50075 Datanode
- 50095 -> 50095 Accumulo
- 50111 -> 50111 WebHcat
- 16010 -> 16010 HBaseMaster
- 16030 -> 16030 HBaseRegion
- 60080 -> 60080 WebHBase
- 6080 -> 6080 XASecure
- 18080 -> 18080 SparkHistoryServer
- 8042 -> 8042 NodeManager
- 21000 -> 21000 Atlas
- 8889 -> 8889 Jupyter
- 8088 -> 8088 YARN
- 2181 -> 2181 Zookeeper
- 9090 -> 9090 Nifi
- 4557 -> 4557 NiFi DistributedMapCacheServer
- 6627 -> 6627 Storm Nimbus Thrift
- 9000 -> 9000 HST
- 6667 -> 6667 Kafka
- 9091 -> 9091 NiFi UI HTTPS
- 2202 -> 2202 Sandbox SSH 2
- 8188 -> 8188 YarnATS
- 8198 -> 8198 YarnATSR
- 9089 -> 9089 Druid1
- 8081 -> 8081 Druid2
- 2201 -> 2201 SSH HDP CDA
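The table above also tells you which port a given service needs if you prefer not to open all TCP ports. As a sketch with placeholder values, a security-group rule for Ambari only would look like this:

```bash
# Allow only the Ambari port (8080) from your own IP instead of all TCP ports.
aws ec2 authorize-security-group-ingress \
  --group-id sg-xxxxxxxx \
  --protocol tcp --port 8080 \
  --cidr 203.0.113.10/32
```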
The default Ambari credentials are raj_ops:raj_ops and maria_dev:maria_dev. The default AmbariShell login credential is root:hadoop.
Congratulations! We have installed and executed Hive visualizations on an EC2 instance using Cloudera HDP Sandbox.
For further information on HDP you can follow the official tutorials:
https://hortonworks.com/tutorial/sandbox-deployment-and-install-guide/section/3/
https://www.cloudera.com/tutorials/getting-started-with-hdp-sandbox/1.html
https://hortonworks.com/tutorial/hortonworks-sandbox-guide/section/3/