We know that Apache Hadoop is a framework that allows us to process very large datasets in a distributed way on commodity computers. This makes the framework highly scalable: it can grow from a single machine to thousands of machines. Most importantly, Hadoop is open source, and each node in the cluster offers local computation and storage. Above all, it provides data locality, which is the main reason for its good performance. So, in this post, we will compare the major versions of the Hadoop framework.
As of this writing, Hadoop 3.3.4 has been released and is available for download. We can visit the Hadoop download page to download and install it. The link to download Hadoop 3.3.4 is here.
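As a quick reference, a minimal sketch of downloading and unpacking a binary release on Linux is shown below. The mirror URL and archive name are assumptions based on the usual Apache layout, so verify them against the download page before using them.

```bash
# Download a Hadoop 3.3.4 binary release (confirm the mirror URL on the Apache
# download page; older releases may only be available from the Apache archive).
wget https://downloads.apache.org/hadoop/common/hadoop-3.3.4/hadoop-3.3.4.tar.gz

# Unpack the tarball and print the version to confirm the installation works.
tar -xzf hadoop-3.3.4.tar.gz
cd hadoop-3.3.4
bin/hadoop version
```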
Comparison: Hadoop 1.X vs Hadoop 2.X vs Hadoop 3.X
| Hadoop 1.X | Hadoop 2.X | Hadoop 3.X |
| --- | --- | --- |
| Hadoop 1.x was released in 2011. | Hadoop 2.x was released in 2012. | Hadoop 3.x was released in 2017. |
| It introduced MapReduce and HDFS. That is to say, the MapReduce framework is used both for data processing and for resource management. | YARN (Yet Another Resource Negotiator) was added for better resource management. As a result, it enabled multi-tenancy, so the same cluster can be used by MapReduce as well as by other processing engines running on YARN. | In Hadoop 3.x, the YARN resource model is generalized to support user-defined resource types beyond CPU and memory. For example, the administrator can define resources like GPUs, software licenses, or locally attached storage, and YARN tasks can then be scheduled based on the availability of these resources (see the configuration sketch after the table). |
| Supports a single tenant only. | Supports multiple tenants using YARN. | Multiple tenants are supported here as well. |
| Hadoop 1.x uses a master-slave architecture that consists of a single master and multiple slaves. If the master node fails, the entire cluster becomes unavailable. | Hadoop 2.x also uses a master-slave architecture, but with multiple masters: an active namenode and a standby namenode. If the active master fails, the standby takes over, so Hadoop 2.x fixes the single point of failure. | It adds support for more than two namenodes: one active namenode backed by multiple standby namenodes, which further improves fault tolerance (see the HA sketch after the table). |
| Hadoop 1.x is limited to about 4,000 nodes per cluster. | It supports up to about 10,000 nodes in a cluster. | Scalability is improved further in Hadoop 3.x, and a single cluster can have more than 10,000 nodes. |
| Manual intervention is needed for namenode recovery. | No manual intervention is needed for namenode recovery. | Automatic namenode failover is supported here as well. |
| Java 6 is the minimum supported Java version. | Java 7 is the minimum supported Java version. | Java 8 is the minimum supported Java version. |
| Supports HDFS (default), FTP, and Amazon S3 file systems. | Supports HDFS (default), FTP, Amazon S3, and Windows Azure Storage Blobs (WASB) file systems. | All of these file systems, plus newer ones such as the Microsoft Azure Data Lake filesystem, are compatible with Hadoop 3.x. |
| Uses a 3x replication scheme, which results in 200% storage overhead. | Also relies on 3x replication, with the same 200% storage overhead. | Hadoop 3.x adds erasure coding in HDFS, which reduces the storage overhead to about 50% (see the erasure coding commands after the table). |
| No native support for GPU scheduling. | No native support for GPU scheduling. | It adds support for GPU hardware, which can be used to run deep learning workloads on a Hadoop cluster. |
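To make the generalized YARN resource model from the table concrete, here is a minimal sketch of declaring a custom countable resource. It assumes a Hadoop 3.x tarball install with $HADOOP_HOME pointing at it, and the resource name "licenses" is purely illustrative; check the YARN resource model documentation for your release before relying on the exact property names.

```bash
# Minimal sketch: declare a custom countable resource ("licenses") for YARN in Hadoop 3.x.
# Assumes $HADOop_HOME points at the installation directory; property names follow the
# Hadoop 3.x resource model documentation, but verify them for your release.

# 1. Tell YARN about the new resource type.
cat > "$HADOOP_HOME/etc/hadoop/resource-types.xml" <<'EOF'
<configuration>
  <property>
    <name>yarn.resource-types</name>
    <value>licenses</value>
  </property>
</configuration>
EOF

# 2. Advertise how many units of that resource each NodeManager offers.
cat > "$HADOOP_HOME/etc/hadoop/node-resources.xml" <<'EOF'
<configuration>
  <property>
    <name>yarn.nodemanager.resource-type.licenses</name>
    <value>4</value>
  </property>
</configuration>
EOF
```

With this in place, applications can request the new resource alongside memory and vcores, and the scheduler can place containers only on nodes that still report units available.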
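The fault-tolerance row can be illustrated in the same way. Hadoop 2.x HA pairs exactly one active and one standby namenode, while Hadoop 3.x allows more than two namenodes per nameservice. The sketch below writes the relevant properties to a scratch file; the nameservice name mycluster and the nn1/nn2/nn3 aliases are hypothetical, and the properties are meant to be merged into hdfs-site.xml.

```bash
# Minimal sketch: an HA nameservice with one active and two standby namenodes,
# which Hadoop 3.x supports (Hadoop 2.x HA was limited to two namenodes).
# "mycluster" and nn1/nn2/nn3 are hypothetical; merge these properties into the
# <configuration> element of hdfs-site.xml on every node.
cat > ha-namenodes-snippet.xml <<'EOF'
<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2,nn3</value>
</property>
EOF
```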
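Finally, the storage-overhead row maps to the hdfs ec subcommand that ships with Hadoop 3.x. A minimal sketch against a running 3.x cluster might look like this; the /data/cold directory is hypothetical, and the built-in RS-6-3-1024k policy (6 data blocks plus 3 parity blocks) is what gives roughly 50% overhead instead of 200%.

```bash
# Minimal sketch of HDFS erasure coding in Hadoop 3.x; /data/cold is a hypothetical path.

# List the erasure coding policies known to the cluster.
hdfs ec -listPolicies

# Enable the Reed-Solomon 6+3 policy (6 data + 3 parity blocks, ~50% overhead).
hdfs ec -enablePolicy -policy RS-6-3-1024k

# Apply the policy to a directory; new files written there use erasure coding
# instead of 3x replication.
hdfs ec -setPolicy -path /data/cold -policy RS-6-3-1024k

# Verify which policy is in effect.
hdfs ec -getPolicy -path /data/cold
```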
Now, to summarize the above points, the comparison table can be used as a quick reference for the differences between the three major Hadoop versions.
Thanks for reading. Please share your inputs in the comment section.