Distributed File Systems: GFS & HDFS – System Design Core Concept

What:

Master-worker distributed filesystems (GFS, HDFS) designed to store and stream massive, Multi-Terabyte files across clusters of commodity hardware servers.

Primary purpose:

Providing high-throughput sequential data access for analytics engines while surviving frequent physical server disk failures.

Usually used for:

Analytical data lakes, MapReduce/Spark data sources, and LSM-tree database storage backings (HBase, Cassandra SSTables).

How should I think about this inside system architectures?

👑 Decoupled Master Control

The NameNode manages directory mappings *only*. Clients query the Master for block addresses, then read data bytes directly from DataNodes.

🧱 Massive Block Segmentation

Files are cut into massive 64 MB or 128 MB blocks (rather than standard 4 KB OS filesystem sectors) to minimize NameNode metadata RAM footprints.

🧱 Pipeline replication

During file writes, clients stream block chunks to DataNode A in a pipeline. Node A forwards bytes to B, which forwards to C, minimizing network bottlenecks.

Distributed Cluster Node Matrix: Structuring master and worker boundaries:

Node Type	Operational Mechanic	Architectural Role
Master / NameNode	Maintains directory tree hierarchy and maps file paths to block arrays in RAM.	Single Coordinator (saves metadata edit logs to disk; does NOT touch file data bytes).
Worker / DataNode	Stores raw 64 MB / 128 MB data block segments on local Linux file disks.	Data Store (streams block bytes directly to clients; executes 3x replication loops).

Benefit	Cost
Sequential Throughput Economy (streaming massive 128 MB blocks directly from DataNodes achieves bare-metal disk throughput speeds)	No Random Modifies (files are strictly append-only; editing bytes inside the middle of a file is completely unsupported)
Commodity Fleet Durability (built-in 3x replication across distinct hardware server racks shields against routine host failures)	Master Metadata RAM Limits (because the NameNode maps every block in RAM, small-file proliferation exhausts Master memory)

Big Data Workload	Distributed File System Use	Architectural Rationale
Hadoop MapReduce / Spark	HDFS (Distributed File System)	High-volume sequential data-analytics jobs stream blocks locally from neighboring DataNodes, conserving cross-rack network bandwidth.
BigTable / HBase	GFS / HDFS Storage Layer	Appends database WAL records and SSTable segments sequentially to immutable DFS blocks, utilizing native 3x replica durability.