Core Concept

Distributed File Systems: GFS & HDFS

Google File System (GFS) and Hadoop Distributed File System (HDFS) are master-worker storage clusters optimized for sequentially reading and appending massive Multi-Gigabyte log assets across commodity server fleets.


What:

Master-worker distributed filesystems (GFS, HDFS) designed to store and stream massive, Multi-Terabyte files across clusters of commodity hardware servers.

Primary purpose:

Providing high-throughput sequential data access for analytics engines while surviving frequent physical server disk failures.

Usually used for:

Analytical data lakes, MapReduce/Spark data sources, and LSM-tree database storage backings (HBase, Cassandra SSTables).

How should I think about this inside system architectures?

👑 Decoupled Master Control

The NameNode manages directory mappings *only*. Clients query the Master for block addresses, then read data bytes directly from DataNodes.

🧱 Massive Block Segmentation

Files are cut into massive 64 MB or 128 MB blocks (rather than standard 4 KB OS filesystem sectors) to minimize NameNode metadata RAM footprints.

🧱 Pipeline replication

During file writes, clients stream block chunks to DataNode A in a pipeline. Node A forwards bytes to B, which forwards to C, minimizing network bottlenecks.