HDFS

HDFS is short for Hadoop Distributed File System, a distributed file system designed to run on commodity hardware. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. It provides high-throughput access to application data and is suitable for applications that have large data sets.

HDFS relaxes a few POSIX requirements to enable streaming access to file system data.

HDFS is part of the Apache Hadoop Core project.
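
As a concrete illustration, here is a minimal sketch of a client reading a file from HDFS through the Hadoop Java FileSystem API. The NameNode URI and file path are placeholders, not values from this document, and the snippet assumes a reachable cluster; treat it as a sketch rather than a complete client.

```java
import java.io.InputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsCat {
    public static void main(String[] args) throws Exception {
        // Placeholder NameNode URI and file path; point these at your own cluster.
        String uri = "hdfs://namenode:8020";
        Path path = new Path("/user/alice/input.txt");

        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(URI.create(uri), conf);
             InputStream in = fs.open(path)) {
            // Stream the file's contents to stdout in 4 KB chunks.
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
    }
}
```

The command-line equivalent is `hdfs dfs -cat /user/alice/input.txt`.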

Assumptions and goals

  1. Hardware failure
    Hardware failure is the norm rather than the exception. Therefore, detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS.

  2. Streaming data access
    Applications that run on HDFS need streaming access to their datasets. They are not general purpose applications that typically run on general purpose file systems. HDFS is designed for batch processing rather than interactive use by users. The emphasis is on high throughput of data access rather than low latency of data access.

  3. Large data sets

  4. Simple coherency model
    HDFS applications need a write-once-read-many access model for files. A file, once created, written, and closed, need not be changed except for appends and truncates. Appending content to the end of a file is supported, but a file cannot be updated at an arbitrary point (see the sketch after this list).

  5. Moving computation is cheaper than moving data

  6. Portability across heterogeneous hardware and software platforms
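
The coherency model in item 4 maps directly onto the Hadoop Java FileSystem API: a file is created, written, and closed, and can later be appended to, but there is no call for overwriting bytes at an arbitrary offset. The sketch below assumes a reachable cluster with append support enabled; the URI and path are placeholders.

```java
import java.net.URI;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteOnceReadMany {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode URI; use your cluster's address.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
        Path file = new Path("/data/events.log");

        // Write once: create, write, close. After close() the content is fixed.
        try (FSDataOutputStream out = fs.create(file, /* overwrite */ true)) {
            out.write("first record\n".getBytes(StandardCharsets.UTF_8));
        }

        // Appends are allowed: new bytes go only to the end of the file.
        try (FSDataOutputStream out = fs.append(file)) {
            out.write("second record\n".getBytes(StandardCharsets.UTF_8));
        }

        // There is no API to overwrite bytes at an arbitrary offset in an
        // existing HDFS file; updating in place means rewriting the file.
        fs.close();
    }
}
```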

NameNodes and DataNodes

The File System Namespace

Data Replication

The Persistence of File System Metadata

The Communication Protocols

Robustness

Data Organization

Accessibility

Space Reclamation
