How to configure a Hadoop cluster using an Ansible playbook


Automation is the future of the industry. Most of the industry is heading toward automation because it reduces human effort and the errors that come with manual work. Companies today have to configure a large number of servers daily; doing this manually on every server is practically impossible and time-consuming, and in today's era we don't have time to waste on it. We therefore need an automation tool that performs all the tasks on every server in a single click, and one such tool is Ansible.

First, let's talk about what Hadoop is.

Apache Hadoop is open-source software that lets us build our own distributed storage and computing clusters using HDFS and the MapReduce programming model.

So what is a Hadoop cluster?

A Hadoop cluster is a great program that joins the storage of many DataNodes and presents it to the client as a single pool, creating the illusion that the client has one large storage device; behind the scenes it uses HDFS cluster technology.

We can also create a computing cluster with Hadoop, which combines the RAM of all the TaskTrackers so that they perform as a single unit, but we will discuss that concept later. In this article we will talk about Hadoop's HDFS cluster, which has the following nodes:

  • NameNode: The NameNode is the master of the cluster. Its job is to give the client the address of a DataNode, and it also maintains the metadata of the HDFS file system, keeping the directory tree of all files; the NameNode itself does not store any data.
  • DataNode: The DataNode is the system that provides the storage. The client stores data directly on the DataNode whose address the NameNode has handed out.
  • Client node: The client node is the one that connects to the NameNode and stores data in the HDFS cluster. The client can also use the MapReduce cluster at the same time, since a MapReduce program only computes on data that is already present in the HDFS cluster.

In this article, we will configure a Hadoop cluster using Ansible. The steps to achieve this configuration are:

→ Copy the required software from the controller node to the managed nodes

→ Install the Hadoop and JDK software on the managed nodes

→ Copy and configure the hdfs-site and core-site files

→ Create the NameNode and DataNode directories and format the NameNode

→ Start the Hadoop services for the NameNode and the DataNode

Step 1: Updating the inventory and configuration file on the Ansible controller node

I have created my inventory at the “/root/ip.txt” location on the controller node; it mainly consists of a few details about the target nodes.

vim /root/ip.txt

Here we have provided the username, the password, and the protocol with which Ansible can log in to the system for configuration management.
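A minimal sketch of such an inventory is shown below. The IP addresses and the password are placeholders, and the group names namenode and datanode are my own choice (we will refer to them again later in the playbook); substitute the details of your own target nodes.

[namenode]
192.168.1.10  ansible_user=root  ansible_ssh_pass=redhat  ansible_connection=ssh

[datanode]
192.168.1.11  ansible_user=root  ansible_ssh_pass=redhat  ansible_connection=ssh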

Updating the configuration file

This is a one-time process; after this, we only add the IPs of new nodes to the inventory file.

vim /etc/ansible/ansible.cfg
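The relevant entries might look like the sketch below: the inventory key points at our /root/ip.txt file, and host key checking is disabled (an assumption on my part, so that Ansible does not stop at the SSH fingerprint prompt).

[defaults]
inventory = /root/ip.txt
host_key_checking = false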

Now we have to check whether the nodes are reachable, and for that we will use Ansible's ping module.
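For example, to ping every host in the inventory:

ansible all -m ping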

Step 2: Copying the Hadoop and JDK software to both managed nodes

We will write our code in a YAML file, since Ansible supports this format.

vim hadoop.yml

Here we used the Ansible copy module, which copies the required software from the controller node to the managed nodes. We specified src (where to copy the file from) and dest (where to paste it); our managed nodes are one NameNode and one DataNode.
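A sketch of what these copy tasks could look like is below. The RPM file names and paths are assumptions based on Hadoop 1.2 and JDK 8; replace them with the versions you actually downloaded.

- hosts: all
  tasks:
  - name: Copy the Hadoop software to the managed nodes
    copy:
      src: /root/hadoop-1.2.1-1.x86_64.rpm
      dest: /root/hadoop-1.2.1-1.x86_64.rpm
  - name: Copy the JDK software to the managed nodes
    copy:
      src: /root/jdk-8u171-linux-x64.rpm
      dest: /root/jdk-8u171-linux-x64.rpm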

Step 3: Installing the required software on the managed nodes

For installing the JDK software we used the package module, where we give the path of the software and state: present, which defines the final state we want to achieve when the task runs.

For installing the Hadoop software we used Ansible's command module. The worst thing about the command module is that it is OS-specific and not idempotent, but we have to use it here because of a little issue with Hadoop: this is Hadoop 1.2 and it conflicts with some files, so we install it with the --force option.
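A sketch of the install tasks, under the same assumptions about file names:

- hosts: all
  tasks:
  - name: Install the JDK
    package:
      name: /root/jdk-8u171-linux-x64.rpm
      state: present
  - name: Install Hadoop 1.2 with --force, since a normal install conflicts
    command: rpm -ivh /root/hadoop-1.2.1-1.x86_64.rpm --force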

Now let's run the playbook to check whether the code above is working:
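ansible-playbook hadoop.yml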

As you can see, the playbook completed successfully, which means the steps of copying and installing the software are done.

Step 4: Copying hdfs-site and core-site from the controller node to the managed nodes

The configuration files are different for the NameNode and the DataNode.

1: For the NameNode:

This is the hdfs-site file, where we specify the directory in which the NameNode will map all the DataNode storage; in my case I have given the “/nn” directory.
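A sketch of that hdfs-site.xml, using the Hadoop 1.x property name and the /nn directory from above:

<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>/nn</value>
  </property>
</configuration>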

This is the core-site file, where we give information such as the port on which we want to run our services and who can connect to the NameNode. Here we give 0.0.0.0 so that, for now, anyone can connect to the NameNode.
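A sketch of the NameNode's core-site.xml; port 9001 matches the port we open in the firewall later, and 0.0.0.0 lets any IP connect:

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://0.0.0.0:9001</value>
  </property>
</configuration>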

We also have to write the code to copy these files to the managed node.

Here we have used the template module to copy the files from the controller node to the managed node; the template module has more functionality than Ansible's copy module.
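A sketch of those tasks follows. The source paths on the controller node are assumptions; /etc/hadoop is where the Hadoop 1.2 RPM keeps its configuration.

- hosts: namenode
  tasks:
  - name: Copy hdfs-site.xml to the NameNode
    template:
      src: /root/namenode/hdfs-site.xml
      dest: /etc/hadoop/hdfs-site.xml
  - name: Copy core-site.xml to the NameNode
    template:
      src: /root/namenode/core-site.xml
      dest: /etc/hadoop/core-site.xml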

2: For the DataNode:

This is the hdfs-site.xml file, the same as for the NameNode with some small changes: in place of name we use data, since it is a DataNode, and we also specify the folder that the DataNode will share with the NameNode.
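A sketch of the DataNode's hdfs-site.xml; the /dn directory is an assumption, so use whichever folder the DataNode should contribute:

<configuration>
  <property>
    <name>dfs.data.dir</name>
    <value>/dn</value>
  </property>
</configuration>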

This is the core-site file, with some changes: we have tried to make the file more dynamic by providing the IP of the NameNode through Ansible's groups variable, which automatically finds the NameNode's IP in the inventory file and inserts it here.
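A sketch of that templated core-site.xml, assuming the NameNode sits in an inventory group named namenode, as in the inventory sketch above:

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://{{ groups['namenode'][0] }}:9001</value>
  </property>
</configuration>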

Here the template module comes into play: while copying the file it also parses the variables present in core-site.xml, an operation the copy module cannot perform.

Now let us run the playbook and check whether the code is working.

Step 5: Creating directories for the NameNode and DataNode and formatting the NameNode

Here we used the file module to create the directory on the managed node, and for formatting we used the shell module, since the command module does not support the pipe symbol that is used while formatting the directory.
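A sketch of these tasks for the NameNode; echo Y pipes the answer to the re-format confirmation prompt, which is exactly why the shell module is needed instead of command:

- hosts: namenode
  tasks:
  - name: Create the NameNode directory
    file:
      path: /nn
      state: directory
  - name: Format the NameNode directory
    shell: echo Y | hadoop namenode -format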

Likewise, we did the same on the DataNode.

Step 6: Let's start the services for both the NameNode and the DataNode

Here we again used the command module to start the Hadoop services, as there is no dedicated module for this operation, and we have also opened port 9001 in the firewall, since my Hadoop services run on that port.
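A sketch of the final tasks; firewalld is assumed as the firewall, and hadoop-daemon.sh is the standard way of starting Hadoop 1.x daemons:

- hosts: namenode
  tasks:
  - name: Open port 9001 in the firewall
    firewalld:
      port: 9001/tcp
      state: enabled
      permanent: yes
      immediate: yes
  - name: Start the NameNode daemon
    command: hadoop-daemon.sh start namenode

- hosts: datanode
  tasks:
  - name: Start the DataNode daemon
    command: hadoop-daemon.sh start datanode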

Now let's run the playbook to check whether it actually works:

ansible-playbook hadoop.yml

Our playbook has run successfully without any errors, and with this our Hadoop cluster is ready.

So let's check the cluster status using the dfsadmin command:

hadoop dfsadmin -report | less

Here it shows that one DataNode is connected and contributing 6.1 GB of storage to the Hadoop cluster.

We have now successfully configured the Hadoop cluster using Ansible automation.

Here is the GitHub link to the playbook and the other files used above.

Conclusion:

So in this article we have created a Hadoop HDFS cluster, and the best part is that we did it with the help of Ansible, which is an automation tool; now the entire environment can be set up in a single click with no effort.

I will be writing more about Ansible, so stay tuned! Hopefully you learned something new from this article and enjoyed it as well.

Thanks for reading this article! Leave a comment below if you have any questions.
