SOSP19 File Systems Unfit as Distributed Storage Backends: Lessons from 10 Years of Ceph Evolution, Hacker News

This paper is by Abutalib Aghayev (Carnegie Mellon University), Sage Weil (Red Hat Inc.), Michael Kuchnik (Carnegie Mellon University), Mark Nelson (Red Hat Inc.), Gregory R. Ganger (Carnegie Mellon University), George Amvrosiadis (Carnegie Mellon University)

Ceph started as a research project in 2004 at UCSC. At the core of Ceph is a distributed object store called RADOS. Its storage backend was implemented on top of an already mature filesystem, which handles block allocation, metadata management, and crash recovery. The Ceph team built the backend on an existing filesystem because they didn’t want to write a storage layer from scratch: a complete filesystem takes a long time (about 10 years) to develop, stabilize, optimize, and mature.

However, having a filesystem in the path to storage adds a lot of overhead. It makes efficient transactions hard to implement, and it introduces bottlenecks for metadata operations: a filesystem directory with millions of small files, for example, becomes a metadata bottleneck. Paging and the like also create problems. To circumvent these problems, the Ceph team tried hooking into filesystem internals by implementing a write-ahead log (WAL) in userspace and using the NewStore database to perform transactions. But it was hard to wrestle with the filesystem, and they spent seven years patching problems. Abutalib likens this to the stages of grief: denial, anger, bargaining, …, and acceptance!
Finally, the Ceph team abandoned the filesystem approach and started writing their own storage backend, BlueStore, which does not use a filesystem. They were able to finish and mature the storage layer in just two years! This is because a small, custom backend matures faster than a POSIX filesystem.
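To make the overhead concrete, here is a minimal sketch of the "userspace WAL on top of a filesystem" pattern described above. It is not Ceph's code: `JournaledStore`, its JSON record format, and the one-file-per-object layout are all hypothetical choices made for illustration. The point is only that every write reaches the filesystem twice, once in the log and once in the object file.

```python
import json
import os
import tempfile


class JournaledStore:
    """Toy object store (hypothetical): log each write, then apply it to a file."""

    def __init__(self, root):
        self.root = root
        self.wal_path = os.path.join(root, "wal.log")
        self.bytes_written = 0  # total payload bytes handed to the filesystem

    def _object_path(self, name):
        return os.path.join(self.root, "obj_" + name)

    def _apply(self, name, payload):
        # Second write: the object file itself, one file per object.
        with open(self._object_path(name), "wb") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())
        self.bytes_written += len(payload)

    def write(self, name, data):
        record = (json.dumps({"name": name, "data": data}) + "\n").encode()
        # First write: make the intent durable in the log before touching the object.
        with open(self.wal_path, "ab") as wal:
            wal.write(record)
            wal.flush()
            os.fsync(wal.fileno())
        self.bytes_written += len(record)
        self._apply(name, data.encode())

    def recover(self):
        # After a crash, replay the log so logged-but-unapplied writes are redone.
        if not os.path.exists(self.wal_path):
            return
        with open(self.wal_path, "rb") as wal:
            for line in wal:
                try:
                    rec = json.loads(line)
                except ValueError:
                    break  # torn final record: stop replaying
                self._apply(rec["name"], rec["data"].encode())

    def read(self, name):
        with open(self._object_path(name), "rb") as f:
            return f.read().decode()


if __name__ == "__main__":
    store = JournaledStore(tempfile.mkdtemp())
    store.write("greeting", "hello")
    print(store.read("greeting"), store.bytes_written)
```

Even in this toy version the data is written more than twice over once the log record's framing is counted, and the filesystem underneath may add its own journaling on top of that.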

The new storage layer, BlueStore, achieves much higher performance than earlier versions. By avoiding data journaling, BlueStore is able to achieve higher throughput than FileStore / XFS.
When using a filesystem, the write-back of dirty metadata and data interferes with WAL writes and causes high tail latency. In contrast, by controlling writes and using a write-through policy, BlueStore ensures that no background writes interfere with foreground writes. This way BlueStore avoids high tail latency for writes.
Finally, having full control of the I/O stack accelerates new hardware adoption. For example, while filesystems have a hard time adapting to shingled magnetic recording (SMR) drives, the authors were able to add metadata storage support for them to BlueStore, with data storage support in the works.

To sum up, the lesson learned is that for distributed storage it was easier and better to implement a custom backend than to shoehorn a filesystem into this role.

Here is the architecture diagram of the BlueStore storage backend. All metadata is maintained in RocksDB, which layers on top of BlueFS, a minimal userspace filesystem.
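As a rough illustration of that split, the sketch below keeps object metadata in a key-value map and object data on a raw block device, with no filesystem in between. Everything here is a stand-in and an assumption for illustration: a Python dict plays the role of RocksDB, a `bytearray` plays the role of the raw device, and `ToyBlueStore` with its free-list allocator is hypothetical. New data goes to freshly allocated blocks first, and a single metadata update then makes it visible, which is how a design like this can avoid journaling the data itself.

```python
class ToyBlueStore:
    """Hypothetical sketch: data on a raw device, metadata in a key-value store."""

    BLOCK = 4096

    def __init__(self, num_blocks):
        self.device = bytearray(num_blocks * self.BLOCK)  # stand-in for a raw block device
        self.free = list(range(num_blocks))               # toy allocator: list of free blocks
        self.kv = {}                                      # stand-in for RocksDB: name -> (blocks, length)

    def write(self, name, data):
        needed = max(1, -(-len(data) // self.BLOCK))      # ceiling division, at least one block
        if needed > len(self.free):
            raise OSError("no space left on toy device")
        blocks = [self.free.pop() for _ in range(needed)]
        # Step 1: write the data into newly allocated blocks (never in place).
        for i, block in enumerate(blocks):
            chunk = data[i * self.BLOCK:(i + 1) * self.BLOCK]
            offset = block * self.BLOCK
            self.device[offset:offset + len(chunk)] = chunk
        # Step 2: one metadata update publishes the new version of the object.
        old = self.kv.get(name)
        self.kv[name] = (blocks, len(data))
        # Blocks of the previous version become reusable only after the switch.
        if old is not None:
            self.free.extend(old[0])

    def read(self, name):
        blocks, length = self.kv[name]
        out = bytearray()
        for block in blocks:
            out += self.device[block * self.BLOCK:(block + 1) * self.BLOCK]
        return bytes(out[:length])


if __name__ == "__main__":
    store = ToyBlueStore(num_blocks=8)
    store.write("obj", b"x" * 5000)
    print(len(store.read("obj")), len(store.free))
```

A real backend also has to persist the allocator state and make the metadata update crash-safe; in this sketch both are simply assumed to be the key-value store's job.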

Abutalib, the first author on the paper, did an excellent job presenting it. He is a final-year PhD student with a lot of experience and expertise in storage systems. He is on the job market.
