What:
Master-worker distributed filesystems (GFS, HDFS) designed to store and stream massive, Multi-Terabyte files across clusters of commodity hardware servers.
Primary purpose:
Providing high-throughput sequential data access for analytics engines while surviving frequent physical server disk failures.
Usually used for:
Analytical data lakes, MapReduce/Spark data sources, and LSM-tree database storage backings (HBase, Cassandra SSTables).
How should I think about this inside system architectures?
👑 Decoupled Master Control
The NameNode manages directory mappings *only*. Clients query the Master for block addresses, then read data bytes directly from DataNodes.
🧱 Massive Block Segmentation
Files are cut into massive 64 MB or 128 MB blocks (rather than standard 4 KB OS filesystem sectors) to minimize NameNode metadata RAM footprints.
🧱 Pipeline replication
During file writes, clients stream block chunks to DataNode A in a pipeline. Node A forwards bytes to B, which forwards to C, minimizing network bottlenecks.