Parallel File Systems

Parallel File Systems are distributed file systems that are built for large computer clusters. Like in most scalable distributed systems, the key research challenges of parallel file systems include naming, state, and resiliency. In 2005 Ceph started out as a research prototype to meet these challenges and is now an open source project under active development. While testing Ceph we found it difficult to find the infrastructure to do so at scale. With FLAMBES we try to improve upon that situation: we would like to be able to simulate very large parallel file systems on a laptop (well, maybe more than one). In Gocfs we explore lockless designs. And finally, we explored ways to scalably monitor and display the state of very large scale file systems.

Project 1: Ceph

Project 2: IMPIOUS

Project 3: Gocfs

Project 4: GUI Monitor


Project 1


Members: Scott Brandt, Carlos Maltzahn

Alumni: Sage Weil

Ceph is a parallel file system that provides excellent performance, reliability, and scalability. Ceph maximizes the separation between data and metadata management by replacing allocation tables with a pseudo-random data distribution function (CRUSH) designed for heterogeneous and dynamic clusters of unreliable object storage devices (OSDs). We leverage device intelligence by distributing data replication, failure detection and recovery to semi-autonomous OSDs running a specialized local object file system. A dynamic distributed metadata cluster provides extremely efficient metadata management and seamlessly adapts to a wide range of general purpose and scientific computing file system workloads. Performance measurements under a variety of workloads show that Ceph has excellent I/O performance and scalable metadata management, supporting more than 250,000 metadata operations per second. 



Project 2

IMPIOUS:  Scalable Simulation of Parallel File Systems

Members: Carlos Maltzahn, John Bent

Alumni: Esteban Molina-Estolano

Parallel file systems are gaining in popularity in high-end computing centers as well as commercial data centers. High-end computing systems are expected to scale exponentially and to pose new challenges to their storage scalability in terms of cost and power. To address these challenges, scientists and file system designers will need a thorough understanding of the design space of parallel file systems.  Yet there exist few systematic studies of parallel file system behavior at petabyte- and exabyte scale. An important reason is the significant cost of getting access to large-scale hardware to test parallel file systems.

To contribute to this understanding, we have attempted to build a parallel file system simulator that can simulate parallel file systems at very large scale. We aimed to simulate petabyte-scale parallel file systems on a small cluster or even a single machine in reasonable time, sacrificing some degree of fidelity in exchange for simulation performance and scalability. Such a simulator would allow file system experts to tune existing file systems for specific workloads; scientists and file system deployment engineers to better communicate workload requirements; file system designers and researchers to try out design alternatives and innovations at scale; and instructors to study very large-scale parallel file system behavior in the classroom.

Earlier preliminary results were encouraging for HPC workloads running under our simulator. Compared with experimental results on the Panasas file system, our earlier simulation results were qualitatively similar: we reproduced the performance effects of aligned and unaligned N-1 check pointing chunk sizes; and the drastic performance hit of N-1 workloads, in comparison to N-N workloads, with increasing numbers of clients. However, when we attempted a different sort of validation --- maintaining a constant check pointing workload, and finding the optimal file system parameters for that workload --- we were unable to match in simulation the performance effects of changing file system parameters for the experimental workload.


Project 3

Gocfs: a lockless cooperative cache for the cluster

Members: Latchesar (Lucho) Ionkov, Scott Brandt

When a job runs on a cluster, all processes, running on multiple nodes, access the same files on the parallel file system, most likely using the same access pattern. Instead of each node keeping its own private cache, it is beneficial for all the nodes assigned to a job to cooperate and reduce the duplication of cache data. In the Gocfs project we explore the design and implementation of a cooperative caching file system, and study the advantages the lockless design achieves and the restrictions it imposes.


Project 4

A GUI Monitoring Tool for the Ceph Distributed File System

Members: Michael McThrow, Carlos Maltzahn

Ceph is designed to run on computer clusters consisting of as many as tens of thousands of object-based storage devices (OSDs).  OSDs can be added to and removed from the storage system at any time.  The challenge of designing a graphical monitoring tool for Ceph is developing a user interface that scales with the potentially large amount of OSDs in the storage system and also adapts to changes in the storage system.  This project is about the design of an easy-to-use tool for monitoring Ceph storage systems that concisely and in real-time displays overall file system statistics and a "bird's-eye" view of the file system in a manner that allows users to "zoom into" a cluster of nodes to obtain an individual node's statistics.  The monitoring tool also allows users to issue commands to the storage system.