The Rapid Growth of io_uring

One year ago, the io_uring subsystem did not exist in the mainline kernel; it showed up in the 5.1 release in May 2019. At its core, io_uring is a mechanism for performing asynchronous I/O, but it has been steadily growing beyond that use case and adding new capabilities. Herein we catch up with the current state of io_uring, where it is headed, and an interesting question or two that will come up along the way.

Classic Unix I/O is inherently synchronous. As far as an application is concerned, an operation is complete once a system call like read() or write() returns, even if some processing may continue behind its back. There is no way to launch an operation asynchronously and wait for its completion at some future time, a feature that many other operating systems had for many years before Unix was created.

In the Linux world, this gap was eventually filled with the asynchronous I/O (AIO) subsystem, but that solution has never proved to be entirely satisfactory. AIO requires specific support at the lower levels, so it never worked well outside of a couple of core use cases (direct file I/O and networking). Over the years there have been recurring conversations about better ways to solve the asynchronous-I/O problem. Various proposals with names like fibrils, threadlets, syslets, acall, and work-queue-based AIO have been discussed, but none have made it into the mainline.

The latest attempt in that series is io_uring, which did manage to get merged. Unlike its predecessors, io_uring is built around a ring buffer in memory shared between user space and the kernel; that allows the submission of operations (and the collection of their results) without the need to call into the kernel in many cases. The interface is somewhat complex, but for many applications that perform massive amounts of I/O, that complexity is paid back in increased performance. See this document [PDF] for a detailed description of the io_uring API. Use of this API can be somewhat simplified with the liburing library.
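
The shared-ring idea can be illustrated with a toy model. The sketch below is written in Python purely for illustration and does not use the real io_uring API; the names Ring and kernel_side are hypothetical, and the "kernel" is just an ordinary function. It assumes a power-of-two ring whose producer advances only the tail and whose consumer advances only the head, which is the general shape of a single-producer, single-consumer ring.

```python
class Ring:
    """Fixed-size single-producer, single-consumer ring (conceptual model)."""

    def __init__(self, entries):
        # A power-of-two size lets an index be wrapped with a cheap bit mask.
        if entries <= 0 or entries & (entries - 1):
            raise ValueError("entries must be a power of two")
        self.slots = [None] * entries
        self.mask = entries - 1
        self.head = 0  # advanced only by the consumer
        self.tail = 0  # advanced only by the producer

    def push(self, item):
        if self.tail - self.head == len(self.slots):
            return False  # ring is full
        self.slots[self.tail & self.mask] = item
        self.tail += 1  # publishing the new tail makes the entry visible
        return True

    def pop(self):
        if self.head == self.tail:
            return None  # ring is empty
        item = self.slots[self.head & self.mask]
        self.head += 1
        return item


def kernel_side(sq, cq):
    """Stand-in for the kernel: drain submissions, post one completion each."""
    handled = 0
    while (sqe := sq.pop()) is not None:
        opcode, user_data = sqe
        cq.push((user_data, 0))  # pretend every request succeeds with result 0
        handled += 1
    return handled


sq, cq = Ring(4), Ring(8)
for tag in (1, 2, 3):
    sq.push(("NOP", tag))   # queue three requests; nothing is handed off yet
kernel_side(sq, cq)         # one hand-off processes the whole batch

completions = []
while (cqe := cq.pop()) is not None:
    completions.append(cqe)
```

The point of the model is that three requests are queued and three results are collected with a single hand-off in between, rather than one call per operation.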

What io_uring can do

Every entry placed into the io_uring submission ring carries an opcode telling the kernel what is to be done. When io_uring was added to the 5.1 kernel, the available opcodes included:

IORING_OP_NOP
This operation does nothing at all; the benefits of doing nothing asynchronously are minimal, but sometimes a placeholder is useful.

IORING_OP_FSYNC
Issue an fsync() call: asynchronous synchronization, in other words.

IORING_OP_POLL_ADD, IORING_OP_POLL_REMOVE
IORING_OP_POLL_ADD performs a poll() operation on a set of file descriptors. It's a one-shot operation that must be resubmitted after it completes; it can be explicitly canceled with IORING_OP_POLL_REMOVE. Polling this way can be used to asynchronously keep an eye on a set of file descriptors. The io_uring subsystem also supports a concept of dependencies between operations; a poll could be used to hold off on issuing another operation until the underlying file descriptor is ready for it.

That functionality was enough to drive some significant interest in io_uring; its creator, Jens Axboe, could have stopped there and taken a break for a while. That, however, is not what happened. Since the 5.1 release, the following operations have been added:

IORING_OP_SENDMSG, IORING_OP_RECVMSG
These operations support the asynchronous sending and receiving of packets over the network with sendmsg() and recvmsg().

IORING_OP_TIMEOUT
This operation completes after a given period of time, as measured either in seconds or number of completed io_uring operations. It is a way of forcing a waiting application to wake up even if it would otherwise continue sleeping for more completions.

IORING_OP_READ, IORING_OP_WRITE
These resemble the vectored read and write operations, but they use the simpler interface that can only handle a single buffer.

Further opcodes perform other system calls asynchronously.

Axboe has outlined a way in which support for specific ioctl() operations could be added on a case-by-case basis. One can imagine that, for example, the media subsystem, which supports a number of performance-sensitive ioctl() operations, would benefit from this mechanism.

There is also an early patch set adding support for splice().

An asynchronous world

All told, it would appear that io_uring is quickly growing the sort of capabilities that were envisioned many years ago when the developers were talking about thread-based asynchronous mechanisms. The desire to avoid blocking in event loops is strong; it seems likely that this API will continue to grow until a wide range of tasks can be performed with almost no risk of blocking at all. Along the way, though, there may be a couple of interesting issues to deal with.

One of those is that the field for io_uring commands is only eight bits wide, meaning that up to 256 opcodes can be defined. As of 5.6, 30 opcodes will exist, so there is still plenty of room for growth. There are more than 256 system calls implemented in Linux, though. If io_uring were to grow to the point where it supported most of them, that space would run out.
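
The arithmetic behind that limit is easy to check. The sketch below is only an illustration in Python; OPCODE_BITS and fits_in_opcode_field are hypothetical names, and packing a value with the struct module's "B" (unsigned byte) format stands in for storing it in an eight-bit field.

```python
import struct

OPCODE_BITS = 8  # width of the opcode field, as described in the text


def opcode_space(bits=OPCODE_BITS):
    """Number of distinct values an unsigned field of this width can hold."""
    return 1 << bits


def fits_in_opcode_field(opcode):
    """True if the value can be stored in a single unsigned byte."""
    try:
        struct.pack("B", opcode)
        return True
    except struct.error:
        return False
```

Here opcode_space() evaluates to 256, so the valid values run from 0 to 255; fits_in_opcode_field(255) is true while fits_in_opcode_field(256) is false.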

A different issue was raised by Stefan Metzmacher. Dependencies between commands are supported by io_uring now, so it is possible to hold the initiation of an operation until some previous operation has completed. What is rather more difficult is moving information between operations. In Metzmacher’s case, he would like to call openat() asynchronously, then submit I/O operations on the resulting file descriptor without waiting for the open to complete.
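
A toy model shows why ordering alone does not solve this problem. The sketch below is plain Python, not the io_uring API; run_chain and UNKNOWN_FD are hypothetical names. It assumes that every request's arguments are fixed at submission time, and that a failure cancels the rest of the chain with ECANCELED, which is modeled on how linked io_uring requests are understood to behave.

```python
import errno
import os
import tempfile


def run_chain(chain):
    """Run requests strictly in order; a failure cancels everything after it."""
    results = []
    for index, (name, func, args) in enumerate(chain):
        try:
            results.append((name, func(*args)))
        except OSError as exc:
            results.append((name, -exc.errno))
            for later_name, _, _ in chain[index + 1:]:
                results.append((later_name, -errno.ECANCELED))
            break
    return results


# The descriptor that the open will return is not known when the chain is
# built, so the later requests can only carry a placeholder.
UNKNOWN_FD = -1

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.txt")
    chain = [
        ("openat", os.open, (path, os.O_WRONLY | os.O_CREAT, 0o600)),
        ("write", os.write, (UNKNOWN_FD, b"hello")),
        ("fsync", os.fsync, (UNKNOWN_FD,)),
    ]
    results = run_chain(chain)
    os.close(results[0][1])  # the open succeeded; release its descriptor
```

In this model the open succeeds, the write fails with EBADF because it was queued before the descriptor existed, and the fsync is canceled. The operations run in the right order, but nothing carries the result of the first into the second.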

It turns out that there is a plan for this: inevitably it calls for … wait for it … using BPF to make the connection from one operation to the next. The ability to run bits of code in the kernel at appropriate places in a chain of asynchronous operations would clearly open up a number of interesting new possibilities. “There’s a lot of potential there”, Axboe said. Indeed, one can imagine a point where an entire program is placed into a ring by a small C “driver”, then mostly allowed to run on its own.

There is one potential hitch here, though, in that io_uring is an unprivileged interface; any necessary privilege checks are performed on the actual operations performed. But the plans to make BPF safe for unprivileged users have been sidelined, with explicit statements that unprivileged use will not be supported in the future. That could make BPF hard to use with io_uring. There may be plans for how to resolve this issue lurking deep within Facebook, but they have not yet found their way onto the public lists. It appears that the BPF topic in general will be discussed at the 2020 Linux Storage, Filesystem, and Memory-Management Summit.

In summary, though, io_uring appears to be on a roll with only a relatively small set of growing pains. It will be interesting to see how much more functionality finds its way into this subsystem in the coming releases. Recent history suggests that the growth of io_uring will not be slowing down anytime soon.
