Deploy and Configure a Hadoop Cluster
Hello, today we are going to discuss Hadoop: we will install and configure it on a cloud server.
Introduction
Hadoop has become a staple technology in the big data industry by enabling the storage and analysis of datasets so large that working with them would otherwise be impossible with traditional data systems.
We are going to deploy and configure a single-node pseudo-distributed Hadoop cluster. Then, after we spin it all up, we will execute a MapReduce job using the pre-packaged sample MapReduce applications.
Installation and Configuration
To show how to use Hadoop, we are going to use a cloud server running CentOS 7.
Task 1 Install Java
First we have to find which Java version we need to install.
[cloud_user@498069248d1c ~]$ yum search jdk
and then we install it:
sudo yum install java-1.8.0-openjdk -y
We check that it was installed:
[cloud_user@498069248d1c ~]$ java -version
openjdk version "1.8.0_262"
OpenJDK Runtime Environment (build 1.8.0_262-b10)
OpenJDK 64-Bit Server VM (build 25.262-b10, mixed mode)
Task 2 Set up SSH credentials
We need to set up passwordless SSH, so we generate a key pair:
[cloud_user@498069248d1c ~]$ ssh-keygen
Generating public/private rsa key pair.
Enter file in which to save the key (/home/cloud_user/.ssh/id_rsa):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/cloud_user/.ssh/id_rsa.
Your public key has been saved in /home/cloud_user/.ssh/id_rsa.pub
We can enter our .ssh folder:
[cloud_user@498069248d1c ~]$ cd .ssh
[cloud_user@498069248d1c .ssh]$ ls
authorized_keys id_rsa id_rsa.pub
To authorize logins to this node with our key pair, we append the public key to authorized_keys:
[cloud_user@498069248d1c .ssh]$ cat id_rsa.pub >> authorized_keys
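If SSH still prompts for a password after this, the permissions on the .ssh directory and the authorized_keys file are a common culprit. As a precaution (not part of the original lab steps), they can be tightened like this:
chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys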
and then we connect:
[cloud_user@498069248d1c .ssh]$ ssh localhost
The authenticity of host 'localhost (::1)' can't be established.
ECDSA key fingerprint is SHA256:058tWZxfDfNI+dkPSFBp5jZ7OqieG1+ctM5yJtkiv1I.
ECDSA key fingerprint is MD5:67:93:60:ca:e3:77:b9:91:45:e8:86:a5:73:33:1f:d8.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (ECDSA) to the list of known hosts
[cloud_user@498069248d1c ~]$ exit
logout
Connection to localhost closed.
[cloud_user@498069248d1c .ssh]$ ssh 127.0.0.1
The authenticity of host '127.0.0.1 (127.0.0.1)' can't be established.
ECDSA key fingerprint is SHA256:058tWZxfDfNI+dkPSFBp5jZ7OqieG1+ctM5yJtkiv1I.
ECDSA key fingerprint is MD5:67:93:60:ca:e3:77:b9:91:45:e8:86:a5:73:33:1f:d8.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '127.0.0.1' (ECDSA) to the list of known hosts.
[cloud_user@498069248d1c ~]$ ssh 127.0.0.1
[cloud_user@498069248d1c ~]$ exit
logout
Connection to 127.0.0.1 closed.
Now we have passwordless SSH configured for localhost and for the IP 127.0.0.1.
Task 3 Download and Deploy Hadoop
The general architecture of Hadoop has at least one NameNode and one or more DataNodes.
NameNodes manage the file system namespace and regulate access to files by clients.
DataNodes store and manage the data files, which are split into blocks. Blocks can be replicated to allow for fault tolerance.
We install Hadoop in pseudo-distributed mode; the same method also works for a single node or for multiple nodes.
To find a download mirror, we can visit this link:
http://www.apache.org/dyn/closer.cgi/hadoop/common/
[cloud_user@498069248d1c ~]$ curl -O https://downloads.apache.org/hadoop/common/hadoop-2.9.2/hadoop-2.9.2.tar.gz
We extract the files:
[cloud_user@498069248d1c ~]$ tar -xzf hadoop-2.9.2.tar.gz
We check the unpacked directory:
[cloud_user@498069248d1c ~]$ ll
total 357868
drwxr-xr-x. 2 cloud_user cloud_user 6 Oct 15 2018 Desktop
drwxrwxr-x. 2 cloud_user cloud_user 4096 Mar 8 2019 Documents
drwxr-xr-x. 9 cloud_user cloud_user 4096 Nov 13 2018 hadoop-2.9.2
-rw-rw-r--. 1 cloud_user cloud_user 366447449 Oct 4 10:02 hadoop-2.9.2.tar.gz
We don't need the tar file anymore:
[cloud_user@498069248d1c ~]$ rm hadoop-2.9.2.tar.gz
Let us rename the folder and look inside:
[cloud_user@498069248d1c ~]$ mv hadoop-2.9.2 hadoop
[cloud_user@498069248d1c ~]$ cd hadoop/
[cloud_user@498069248d1c hadoop]$ ll
total 136
drwxr-xr-x. 2 cloud_user cloud_user 4096 Nov 13 2018 bin
drwxr-xr-x. 3 cloud_user cloud_user 19 Nov 13 2018 etc
drwxr-xr-x. 2 cloud_user cloud_user 101 Nov 13 2018 include
drwxr-xr-x. 3 cloud_user cloud_user 19 Nov 13 2018 lib
drwxr-xr-x. 2 cloud_user cloud_user 4096 Nov 13 2018 libexec
-rw-r--r--. 1 cloud_user cloud_user 106210 Nov 13 2018 LICENSE.txt
-rw-r--r--. 1 cloud_user cloud_user 15917 Nov 13 2018 NOTICE.txt
-rw-r--r--. 1 cloud_user cloud_user 1366 Nov 13 2018 README.txt
drwxr-xr-x. 3 cloud_user cloud_user 4096 Nov 13 2018 sbin
drwxr-xr-x. 4 cloud_user cloud_user 29 Nov 13 2018 share
[cloud_user@498069248d1c hadoop]$ which java
/bin/java
What Hadoop wants for JAVA_HOME is the actual Java installation behind the alternatives symlink, so we follow the links:
[cloud_user@498069248d1c hadoop]$ ll /bin/java
lrwxrwxrwx. 1 root root 22 Oct 4 08:42 /bin/java -> /etc/alternatives/java
[cloud_user@498069248d1c hadoop]$ ll /etc/alternatives/java
lrwxrwxrwx. 1 root root 73 Oct 4 08:42 /etc/alternatives/java -> /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.262.b10-0.el7_8.x86_64/jre/bin/java
In our case that is the directory /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.262.b10-0.el7_8.x86_64/jre:
[cloud_user@498069248d1c hadoop]$ ll /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.262.b10-0.el7_8.x86_64/jre
total 8
drwxr-xr-x. 2 root root 4096 Oct 4 08:43 bin
drwxr-xr-x. 10 root root 4096 Oct 4 08:42 lib
This is what Hadoop wants JAVA_HOME to be.
So, to point Hadoop at this directory, we have to modify the hadoop-env.sh file:
[cloud_user@498069248d1c hadoop]$ vim etc/hadoop/hadoop-env.sh
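In hadoop-env.sh we set JAVA_HOME to the JRE directory we just found. The exact comments around the line may differ in your copy of the file, but it should end up looking like this:
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.262.b10-0.el7_8.x86_64/jre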
Task 4 Configure Hadoop
[cloud_user@498069248d1c hadoop]$ vim etc/hadoop/core-site.xml
and we add the following:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
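Optionally, we can confirm that Hadoop picked up the setting (run from the hadoop directory):
./bin/hdfs getconf -confKey fs.defaultFS
which should print hdfs://localhost:9000.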
Task 5 Configure HDFS
We want to use only one node, so we set the replication factor to 1 in hdfs-site.xml:
[cloud_user@498069248d1c hadoop]$ vim etc/hadoop/hdfs-site.xml
with the following setting:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
Then we format the NameNode:
[cloud_user@498069248d1c hadoop]$ ./bin/hdfs namenode -format
Now that we have Hadoop downloaded and installed, and an HDFS cluster configured in pseudo-distributed mode, let's go ahead and start things up.
Execution
Task 6 Start Services
[cloud_user@498069248d1c hadoop]$ ./sbin/start-dfs.sh
Starting namenodes on [localhost]
localhost: starting namenode, logging to /home/cloud_user/hadoop/logs/hadoop-cloud_user-namenode-498069248d1c.mylabserver.com.out
localhost: starting datanode, logging to /home/cloud_user/hadoop/logs/hadoop-cloud_user-datanode-498069248d1c.mylabserver.com.out
Starting secondary namenodes [0.0.0.0]
The authenticity of host '0.0.0.0 (0.0.0.0)' can't be established.
ECDSA key fingerprint is SHA256:058tWZxfDfNI+dkPSFBp5jZ7OqieG1+ctM5yJtkiv1I.
ECDSA key fingerprint is MD5:67:93:60:ca:e3:77:b9:91:45:e8:86:a5:73:33:1f:d8.
Are you sure you want to continue connecting (yes/no)? yes
0.0.0.0: Warning: Permanently added '0.0.0.0' (ECDSA) to the list of known hosts.
0.0.0.0: starting secondarynamenode, logging to /home/cloud_user/hadoop/logs/hadoop-cloud_user-secondarynamenode-498069248d1c.mylabserver.com.out
[cloud_user@498069248d1c hadoop]$
We can check that it is running:
[cloud_user@498069248d1c hadoop]$ ./bin/hdfs dfs -ls /
If there is nothing in the output, that is fine: HDFS is running but still empty. Next we need to configure MapReduce.
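Another optional way to verify the cluster (not shown in the original lab) is the dfsadmin report, which prints the cluster capacity and the number of live DataNodes:
./bin/hdfs dfsadmin -report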
Task 7 Configure MapReduce
We create a home directory for our user in HDFS:
[cloud_user@498069248d1c hadoop]$ ./bin/hdfs dfs -mkdir -p /user/cloud_user
[cloud_user@498069248d1c hadoop]$ ./bin/hdfs dfs -ls /
Found 1 items
drwxr-xr-x - cloud_user supergroup 0 2020-10-04 13:03 /user
We copy LICENSE.txt into HDFS:
[cloud_user@498069248d1c hadoop]$ ./bin/hdfs dfs -put LICENSE.txt LICENSE.txt
[cloud_user@498069248d1c hadoop]$ ./bin/hdfs dfs -ls /user/cloud_user
Found 1 items
-rw-r--r-- 1 cloud_user supergroup 106210 2020-10-04 13:05 /user/cloud_user/LICENSE.txt
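Since DataNodes store files as blocks (as described in the architecture section above), we can optionally inspect how LICENSE.txt is stored. This is an extra check, not part of the original steps:
./bin/hdfs fsck /user/cloud_user/LICENSE.txt -files -blocks
This reports the number of blocks and the replication factor, which should be 1 as configured in hdfs-site.xml.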
Task 8 Run a MapReduce Job
Let us take the bundled examples jar:
[cloud_user@498069248d1c hadoop]$ ll share/hadoop/mapreduce/hadoop-mapreduce-examples-2.9.2.jar
-rw-r--r--. 1 cloud_user cloud_user 303323 Nov 13 2018 share/hadoop/mapreduce/hadoop-mapreduce-examples-2.9.2.jar
[cloud_user@498069248d1c hadoop]$
To execute a MapReduce jar, we run:
[cloud_user@498069248d1c hadoop]$ ./bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.9.2.jar
Running the jar without arguments lists the example programs we can choose from:
Valid program names are:
aggregatewordcount: An Aggregate based map/reduce program that counts the words in the input files.
aggregatewordhist: An Aggregate based map/reduce program that computes the histogram of the words in the input files.
bbp: A map/reduce program that uses Bailey-Borwein-Plouffe to compute exact digits of Pi.
dbcount: An example job that count the pageview counts from a database.
distbbp: A map/reduce program that uses a BBP-type formula to compute exact bits of Pi.
grep: A map/reduce program that counts the matches of a regex in the input.
join: A job that effects a join over sorted, equally partitioned datasets
multifilewc: A job that counts words from several files.
pentomino: A map/reduce tile laying program to find solutions to pentomino problems.
pi: A map/reduce program that estimates Pi using a quasi-Monte Carlo method.
randomtextwriter: A map/reduce program that writes 10GB of random textual data per node.
randomwriter: A map/reduce program that writes 10GB of random data per node.
secondarysort: An example defining a secondary sort to the reduce.
sort: A map/reduce program that sorts the data written by the random writer.
sudoku: A sudoku solver.
teragen: Generate data for the terasort
terasort: Run the terasort
teravalidate: Checking results of terasort
wordcount: A map/reduce program that counts the words in the input files.
wordmean: A map/reduce program that counts the average length of the words in the input files.
wordmedian: A map/reduce program that counts the median length of the words in the input files.
wordstandarddeviation: A map/reduce program that counts the standard deviation of the length of the words in the input files.
Let us select wordmean, a map/reduce program that counts the average length of the words in the input files, as our example:
[cloud_user@498069248d1c hadoop]$ ./bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.9.2.jar wordmean LICENSE.txt license_wordmean_output
We obtain the mean:
The mean is: 5.761225944404846
The full output is saved in our HDFS home directory:
[cloud_user@498069248d1c hadoop]$ ./bin/hdfs dfs -ls /user/cloud_user
Found 2 items
-rw-r--r-- 1 cloud_user supergroup 106210 2020-10-04 13:05 /user/cloud_user/LICENSE.txt
drwxr-xr-x - cloud_user supergroup 0 2020-10-04 13:17 /user/cloud_user/license_wordmean_output
We can view the output files:
[cloud_user@498069248d1c hadoop]$ ./bin/hdfs dfs -cat license_wordmean_output/*
count 15433
length 88913
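The output records the total number of words and the total character length; dividing the two reproduces the mean reported above. A quick sanity check with plain shell arithmetic (no Hadoop involved):
awk 'BEGIN { printf "%.12f\n", 88913 / 15433 }'
which prints approximately 5.761225944405.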
Another example we can use is pi, a map/reduce program that estimates Pi using a quasi-Monte Carlo method:
[cloud_user@498069248d1c hadoop]$ ./bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.9.2.jar pi 10 100
and we obtain
Job Finished in 2.999 seconds
Estimated value of Pi is 3.14800000000000000000
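Here the first argument (10) is the number of map tasks and the second (100) is the number of samples per map; more samples generally give a closer estimate at the cost of a longer run. For example, the same jar can be invoked with larger values:
./bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.9.2.jar pi 16 100000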
This is a general example of how to execute a MapReduce job.
Task 9 Stop Services
[cloud_user@498069248d1c hadoop]$ ./sbin/stop-dfs.sh
Stopping namenodes on [localhost]
localhost: stopping namenode
localhost: stopping datanode
Stopping secondary namenodes [0.0.0.0]
0.0.0.0: stopping secondarynamenode
[cloud_user@498069248d1c hadoop]$
Thank you for reading; we have deployed and configured a Hadoop cluster.
Credits to LinuxAcademy