In this internet world, a huge amount of data is generated from various sources, like social media platforms, emails, audio files, videos, documents, etc. Such type of data is referred to as unstructured data.
Traditional data storage systems, like relational database management systems (RDBMS), cannot efficiently manage the storage and processing of unstructured data since such data is complex and voluminous. This is where Hadoop comes into play. Hadoop is an open-source framework for processing of big data.
It supports distributed processing of big data across clusters of computers using the MapReduce programming model. This article explores the architecture of the Hadoop framework and discusses each component of the Hadoop architecture in detail. But before that, let’s understand what exactly the Hadoop framework is.
What is Hadoop?
Hadoop, also known as Apache Hadoop, is an open-source big data processing framework. It enables distributed storage and processing of large volumes of datasets across the network of computers using the MapReduce programming model.
The Hadoop framework does not use a single computer to store and process large data sets or big data. Instead, it focuses on clustering multiple computers and distributing large data sets across these computers. Such distribution facilitates the parallel processing of enormous volumes of data. Licensed under the Apache License 2.0, Apache Hadoop is a cross-platform framework developed and maintained by the Apache Software Foundation.
Many reputed companies, including Meta (Facebook), Netflix, Yahoo, and eBay, use the Hadoop framework to store and process big data. The core of Apache Hadoop consists of two parts. One is the storage part which is Hadoop Distributed File System (HDFS), and the other is the processing part, called the MapReduce programming model.
Hadoop takes a large data set, splits it into multiple blocks, and stores and processes it across multiple computers or nodes in a cluster. Therefore, Apache Hadoop makes it possible to process the entire data set efficiently and in less time.
The Hadoop architecture consists of four major components, namely, Hadoop Distributed File System (HDFS), Hadoop MapReduce, Hadoop YARN, and Hadoop Common.
- Hadoop Distributed File System (HDFS): It is a distributed file system that stores data on commodity hardware. Commodity hardware is an inexpensive IT component or a computer device widely available.
- Hadoop MapReduce: It is a programming model that processes large data sets across multiple computers in a cluster.
- Hadoop YARN: It is a platform responsible for allocating resources to different applications running in a Hadoop cluster. Also, it schedules tasks to be executed on different nodes in a Hadoop cluster.
- Hadoop Common: It contains utilities and libraries required by all the above Hadoop components.
Let us now discuss each of these Hadoop components in detail below.
Hadoop Distributed File System (HDFS)
Hadoop Distributed File System or Apache HDFS is a block-structured file system developed using the distributed file system design. It divides a single file into multiple chunks of blocks and stores them across several computers in a Hadoop cluster.
The Apache HDFS follows the Master/Slave architecture.
A Hadoop cluster consists of one NameNode and multiple DataNodes, where NameNode is the master node and DataNodes are slave nodes. Each node in a Hadoop cluster has one DataNode. NameNodes and DataNodes are the data storage nodes. Let us look at these data storage nodes in detail below.
NameNode is the master node in a Hadoop cluster responsible for managing and maintaining the blocks present on DataNodes. It stores the metadata of the data that resides on DataNodes. Metadata is the data about data.
It can be the name of a file, the size of a file, and the information about a DataNode’s location. Using metadata, NameNode can find the closest DataNode for faster communication. Also, the metadata can be transaction logs that track a user’s activity in a Hadoop cluster.
Features of NameNode:
- Besides managing and maintaining DataNodes, a NameNode is responsible for tracking and recording every change in file system metadata.
- NameNode regularly receives a heartbeat message from every DataNode of a Hadoop cluster indicating that they are alive.
- It maintains the record of all blocks, i.e., which block is present in which node of a Hadoop cluster.
- It is responsible for managing the replication factor of all the data blocks.
- When the NameNode is down, the entire HDFS or Hadoop cluster is considered down and becomes inaccessible.
DataNodes are the slave nodes in a Hadoop cluster. They are nothing but commodity hardware that stores actual data. NameNode instructs DataNodes about block creation, replication, and deletion. The greater the number of DataNodes, the more the amount of data the Hadoop cluster stores.
Features of DataNodes:
- DataNodes are responsible for receiving clients’ read and write requests.
- NameNode and DataNodes communicate constantly. DataNodes send heartbeat messages to the NameNode after every 3 seconds to report the overall health of the HDFS.
- There is no effect on the availability of data in a cluster even if a DataNode is down. The NameNode arranges the replication of the blocks associated with a specific DataNode, which is down or not available.
- DataNodes require a lot of hard disk space since they store the actual data.
File Block in HDFS
As we discussed earlier, data in HDFS is stored in the form of blocks. A single data block is split into multiple small blocks, with each block having a size of 128 MB. It is the default size; however, we can change it manually.
Now, let us understand the breaking down of files in blocks with a straightforward example. Consider a file of size 400MB, and we need to upload it to HDFS. When we upload this file to HDFS, it gets divided into blocks of 128 MB + 128 MB + 128 MB + 16 MB = 400 MB size. Therefore, four blocks are created, where the first three are 128 MB in size and the last one is 16 MB in size. But Hadoop considers the last block as a partial record.
Replication in HDFS
In general terms, replication implies making a copy of something. Replication in HDFS means making a copy of blocks, which ensures the availability of data. Moreover, the term ‘Replication factor’ in HDFS represents the number of times the copies of blocks are made. The replication factor for Hadoop is set to 3 by default.
Like the size of each block, we can also change the replication factor manually as per our requirements. In our example above, we have created four blocks, and each block has 3 replicas or copies. Therefore, 4*3 = 12 blocks are created for backup purposes. Since the actual data in HDFS is stored on commodity hardware, inexpensive system hardware, it may crash anytime.
Therefore, making such copies of blocks becomes essential in Hadoop for backup purposes. This feature of Hadoop is referred to as fault-tolerance.
Rack Awareness in HDFS
Rack in HDFS is the collection of DataNodes in a Hadoop cluster, maybe 40 to 50 DataNodes. A Hadoop cluster consists of multiple racks. The term ‘Rack Awareness’ can be defined as choosing the closest node depending upon the rack information.
NameNode follows an in-built Rack Awareness Algorithm to ensure that all the replicas of a block are not stored on the same rack. Considering 3 as the replication factor, then according to the Rack Awareness Algorithm:
- The first replica of a block gets stored on a local rack.
- The second replica of a block gets stored on another DataNode of the same rack.
- Finally, the third replica of a block gets stored on a different rack.
YARN stands for Yet Another Resource Negotiator. It is a framework on which Hadoop MapReduce works. It is responsible for performing two operations, Resource Management, and Job Scheduling. In job scheduling, a big task is split into multiple smaller jobs, and each job is assigned to slave nodes (DataNodes) in a Hadoop cluster. In addition, a job scheduler is responsible for keeping track of jobs, determining the priorities of jobs and dependencies between jobs, and other information.
A resource manager is responsible for managing all the resources that are made available for different applications running in a Hadoop cluster. The architecture of Hadoop YARN consists of five different components, namely Client, Resource Manager, Node Manager, Application Master, and Container.
A client in YARN is responsible for submitting map-reduce jobs. Now, let us understand each component of YARN in detail below:
A Resource Manager is the master daemon of YARN. It is responsible for allocating resources and managing them across all applications. It consists of two major components, namely Scheduler and Application Manager.
A scheduler in YARN is responsible for allocating resources to multiple applications depending upon their resource requirements. We can call it a pure scheduler since it does not perform other tasks, like monitoring or tracking the status of applications. In addition, it uses plug-ins, such as Capacity Scheduler and Fair Scheduler, for partitioning the resources among various applications.
An application manager is responsible for managing the operation of the Application Master in a Hadoop cluster. In case of the failure of the Application Master container, the application manager helps to restart it.
A Node Manager is responsible for monitoring each node in a Hadoop cluster. Also, it manages user jobs and workflow on a specific node in the cluster. By default, the Node Manager in YARN transmits heartbeats to the Resource Manager periodically to report the health status of the node. Furthermore, the Node Manager supervises containers assigned by the Resource Manager and monitors their resource usage.
It starts the containers requested by the Application Master by creating container processes and also kills them as instructed by the Resource Manager.
An application is considered a single job submitted to a framework. Each application has an Application Master, which is responsible for monitoring the execution of an application in a Hadoop cluster. In addition, it is responsible for negotiating resources with the Resource Manager, tracking their status, and monitoring progress.
The Application Master sends a Container Launch Context (CLC) to request a container from the Node Manager. This container includes everything required for an application to run. Once the application starts running, the Application Master transmits heartbeats to the Resource Manager periodically to report an application’s health report.
A container in YARN is a collection of physical resources, such as CPU cores, RAM, and a disk on a single node. The Node Manager is responsible for monitoring containers, and the Resource Manager is responsible for scheduling containers. All YARN containers are managed by Container Launch Context, also known as Container Life Cycle (CLC). It is a record that contains commands required to create the process, the payload for Node Manager services, a map of environment variables, security tokens, and dependencies.
Application Workflow in Hadoop YARN
Let us understand the application workflow in Hadoop YARN.
- Firstly, a client submits a job or an application.
- The Resource Manager then allocates a container to start the Application Manager.
- Once the Application Manager starts, it registers itself with the Resource Manager.
- Later, the Application Manager negotiates a container from the Resource Manager and notifies the Node Manager to launch that container.
- Once the container is launched, the application’s code is executed in it.
- The client then contacts the Resource Manager or Application Manager to know the status of the application.
- Once the process completes, the Application Manager unregisters with the Resource Manager.
Hadoop MapReduce is the heart of the Hadoop architecture, and it is based on the YARN framework. It is a programming model and a software framework that enables us to write applications to process enormous data volumes.
The principal feature of Hadoop MapReduce is to perform distributed processing simultaneously in a Hadoop cluster. It divides a large dataset into small chunks, and each chunk of data is processed on multiple machines in a Hadoop commodity cluster.
MapReduce involves two different tasks, namely map task and reduce task. The map task is followed by the reduce task. Initially, a large dataset is provided as an input to the Map() function. It breaks the input into smaller chunks, processes them on different machines, and transforms them into tuples or key-value pairs.
Now, these key-value pairs are given as an input to the Reduce() function. The Reduce() function combines these key-value pairs based on their key values, performs operations like sorting and aggregation, and produces the output as a small set of tuples. Let us now discuss the different phases in the Map task and Reduce Task.
The Map task involves the following phases:
InputFiles consist of input datasets that need to be processed by the MapReduce programming model. These files are stored in the Hadoop Distributed File System.
The dataset or record in an InputFile is broken down into multiple InputSplits. The Inputsplit size depends upon the block size of HDFS. Each split is assigned with an individual mapper for processing.
RecordReader is responsible for reading the InputSplit, converting it into tuples or key-value pairs, and passing them to the Mapper. A key in a key-value pair is the location, and value is the data associated with it.
Mapper takes the input as key-value pairs from RecordReader, processes them, and generates a set of intermediate key-value pairs. Alternatively, we can state that Mapper maps the input key-value pairs to a set of intermediate key-value pairs. These intermediate key-value pairs are not stored in HDFS. Instead, they are written to the local disk. The reason is that the intermediate key-value pairs are temporary, and storing them on HDFS will result in the creation of unnecessary copies. They are passed to the Combiner for further processing.
A Combiner is also referred to as a Mini-Reducer. It performs aggregation on the output produced by Mapper. The principal objective of Combiner is to minimize the data transfer between the Mapper and the Reducer.
When there is more than one reducer in MapReduce, the role of the partitioner comes into play. It takes the output of Combiner and performs partitioning. It uses a hash function to partition the intermediate key-value pairs. The number of partitions is equal to the number of reduced tasks.
The Reduce task involves the following phases:
Shuffling and Sorting
After combining and partitioning the intermediate key-value pairs of each mapper, the MapReduce framework collects all combined and partitioned intermediate output from all the mappers and shuffles them. Later, it sorts the shuffled output (intermediate key-value pairs) based on their keys and provides it to the Reducer.
Reducer reduces the sorted output (intermediate key-value pairs) to a smaller set of values. It is the final output that is stored in the Hadoop Distributed File System.
RecordWriter writes the output (key-value pairs) of the Reducer to an output file.
Hadoop Common is a collection of libraries and utilities required by all other components of the Hadoop architecture. In other words, it is a collection of Java files and Java libraries required by Hadoop YARN, HDFS, and MapReduce to run a Hadoop cluster.
Hadoop is a popular big data framework that is well-known for offering high performance, all thanks to its distributed storage architecture and distributed processing. Also, it is a cost-effective solution for big data processing since it uses network commodity hardware to store data.
We hope this article has helped you develop a better understanding of Hadoop architecture. Also, the detailed explanation of each component of Hadoop architecture, i.e. HDFS, MapReduce, YARN, and Hadoop Common, can come in handy for you to work with Hadoop effectively.
People are also reading:
Leave a Comment on this Post