Big Data: Hadoop

Nachi Keta
9 min read · Mar 9, 2024

Hadoop is an open-source framework that allows for the distributed processing of large data sets across clusters of computers. It is a fundamental tool in big data analytics.

One of the key features of Hadoop is its ability to handle massive amounts of data. Traditional databases often struggle with processing and analyzing large datasets, but Hadoop’s distributed architecture allows it to scale horizontally by adding more machines to the cluster. This enables organizations to store and process petabytes of data efficiently.

KEY IDEA

The core Hadoop project consists of two parts: a way to store data, known as the Hadoop Distributed File System (HDFS), and a way to process that data, known as MapReduce.

The key concept is that we split the data up and store it across a collection of machines known as a cluster. Then, when we want to process the data, we process it where it’s actually stored. Instead of pulling the data from a central server, we run the computation on the machines that already hold it, so the data is processed in place.
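To make "process it where it's stored" concrete, here is a minimal sketch in plain Python. It is not Hadoop code: the `cluster` dictionary, the node names, and both helper functions are made up for illustration. Each "node" reduces its own chunk of data to one small partial result, and only those partial results are combined centrally.

```python
# Toy model of a cluster: each (hypothetical) node holds its own chunk of data.
cluster = {
    "node1": [3, 5, 8],
    "node2": [13, 21],
    "node3": [34, 55, 89],
}


def local_sum(values):
    """Stands in for work done on the node that stores the data."""
    return sum(values)


def cluster_total(nodes):
    """Compute a partial result per node, then combine the partials."""
    # Only one small number per node has to travel, not the raw data.
    partials = {name: local_sum(chunk) for name, chunk in nodes.items()}
    return partials, sum(partials.values())


partials, total = cluster_total(cluster)
print(partials)  # {'node1': 16, 'node2': 34, 'node3': 178}
print(total)     # 228
```

In a real cluster the chunks would be far too large to move cheaply, which is why sending the computation to the data pays off.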

You can add more machines to the cluster as the amount of data you’re storing grows.

To summarize, Hadoop has two main components: the Hadoop Distributed File System (HDFS) and the MapReduce programming model.

  • HDFS is a distributed file system that provides high-throughput access to data across multiple machines. It breaks large files into smaller blocks and distributes them across the cluster. Each block is replicated on several machines (three by default), which provides data redundancy and fault tolerance.
  • The MapReduce programming model is the heart of Hadoop. It allows users to write parallelizable algorithms that can process large datasets in a distributed manner.
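The MapReduce model can be sketched in a few lines of plain Python. This is a single-process illustration of the idea, not the Hadoop API: the function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are assumptions chosen for this example. In Hadoop, the map and reduce steps would run in parallel on many machines and the framework would handle the grouping step between them.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group all emitted values by key, between the map and reduce steps."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values collected for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence on a list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data big cluster", "data on the cluster"])
print(counts)  # {'big': 2, 'data': 2, 'cluster': 2, 'on': 1, 'the': 1}
```

Because each call to `map_phase` looks at only one line and each call to `reduce_phase` looks at only one key, both steps can be spread across machines without the calls needing to talk to each other.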

HDFS

Imagine we want to store a 150-megabyte file called mydata.txt in HDFS. When a file is loaded into HDFS, it is split into chunks called blocks. Each block is large: the default size was 64 megabytes in Hadoop 1.x and has been 128 megabytes since Hadoop 2.x. With 64-megabyte blocks, mydata.txt becomes three blocks of 64, 64, and 22 megabytes. Each block is given a unique name of the form blk_ followed by a large number.
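The block arithmetic for the 150-megabyte file can be checked with a short Python sketch. The helper `split_into_blocks` is hypothetical and only models the sizes; it does not talk to HDFS. It assumes the 64-megabyte block size used in the example.

```python
def split_into_blocks(file_size_mb, block_size_mb=64):
    """Return the size in MB of each block a file would be split into."""
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full_blocks
    if remainder:
        # The last block holds only what is left over, not a full block.
        sizes.append(remainder)
    return sizes


blocks = split_into_blocks(150)
print(blocks)  # [64, 64, 22]
```

With a 128-megabyte block size, `split_into_blocks(150, 128)` would give two blocks of 128 and 22 megabytes instead.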
