
Timeouts and Cancellation for Humans (2018), Hacker News

Your code might be perfect and never fail, but unfortunately the outside world is less reliable. Sometimes, other people's programs crash or freeze. Networks go down; printers catch on fire. Your code needs to be prepared for this: every time you read from the network, attempt to acquire an inter-process lock, or send an HTTP request, there are at least three possibilities you need to think about:

  • It might succeed.
  • It might fail.
  • It might hang forever, never succeeding or failing: days pass, leaves fall, winter comes, yet still our request waits, yearning for a response that will never come.

    The first two are straightforward enough. To handle that last case, though, you need timeouts. Pretty much every place your program interacts with another program or person or system, it needs a timeout, and if you don’t have one, that’s a latent bug.

    Let’s be honest: if you’re like most developers, your code probably has tons of bugs caused by missing timeouts. Mine certainly does. And it’s weird – since this need is so ubiquitous, and so fundamental to doing I/O correctly, you’d think that every programming environment would provide easy and robust ways to apply timeouts to arbitrary operations. But … they don’t. In fact, most timeout APIs are so tedious and error-prone that it’s just not practical for developers to reliably get this right. So don’t feel bad – it’s not your fault your code has all those timeout bugs, it’s the fault of those I/O libraries!

    But now I’m, uh, writing an I/O library. And not just any I/O library, but one whose whole selling point is that it’s obsessed with being easy to use. So I wanted to make sure that in my library – Trio – you can easily and reliably apply timeouts to arbitrary I/O operations. But designing a user-friendly timeout API is a surprisingly tricky task, so in this blog post I’m going to do a deep dive into the landscape of possible designs – and in particular the many precursors that inspired me – and then explain what I came up with, and why I think it’s a real improvement on the old state-of-the-art. And finally, I’ll discuss how Trio’s ideas could be applied more broadly, and in particular, I’ll demonstrate a prototype implementation for good old synchronous Python.

    So – what’s so hard about timeout handling?

    Simple timeouts don’t support abstraction

    The simplest and most obvious way to handle timeouts is to go through each potentially-blocking function in your API, and give it a timeout argument. In the Python standard library you'll see this in APIs like threading.Lock.acquire:

        import threading

        lock = threading.Lock()

        # Wait at most 10 seconds for the lock to become available
        lock.acquire(timeout=10)

    If you use the socket module for networking, it works the same way, except that the timeout is set on the socket object instead of passed to every call:

        import socket

        sock = socket.socket()

        # Set the timeout once
        sock.settimeout(10)
        # Wait at most 10 seconds to establish a connection to the remote host
        sock.connect(...)
        # Wait at most 10 seconds for data to arrive from the remote host
        sock.recv(...)

    This is a little more convenient than having to remember to pass in explicit timeouts every time (and we'll discuss the convenience issue more below) but it's important to understand that this is a purely cosmetic change. The semantics are the same as we saw with threading.Lock: each method call gets its own separate 10 second timeout.

    So what's wrong with this? It seems straightforward enough. And if we always wrote code directly against these low-level APIs, then it would probably be sufficient. But - programming is about abstraction. Say we want to fetch a file from S3. We might do that with boto3, using S3.Client.get_object. What does S3.Client.get_object do? It makes a series of HTTP requests to the S3 servers, by calling into the requests library for each one. And then each call to requests internally makes a series of calls to the socket module to do the actual network communication [1].

    From the user's point of view, these are three different APIs that fetch data from a remote service:

        s3client.get_object(...)
        requests.get(...)
        sock.recv(...)

    Sure, they're at different levels of abstraction, but the whole idea of abstracting away such details is that the user doesn't have to care. So if our plan is to use timeout= arguments everywhere, then we should expect these each to take a timeout= argument:

        s3client.get_object(..., timeout=10)
        requests.get(..., timeout=10)
        sock.recv(..., timeout=10)

    Now here's the problem: if this is how we're doing things, then actually implementing these functions is a pain in the butt. Why? Well, let's take a simplified example. When processing an HTTP response, there comes a point when we've seen the Content-Length header, and now we need to read that many bytes to fetch the actual response body. So somewhere inside requests there is a loop like:

        def read_body(sock, content_length):
            body = bytearray()
            while len(body) < content_length:
                max_to_receive = content_length - len(body)
                body += sock.recv(max_to_receive)
            assert len(body) == content_length
            return body

    Now we'll modify this loop to add timeout support. We want to be able to say "I'm willing to wait at most 10 seconds to read the response body". But we can't just pass the timeout argument through to recv, because imagine the first call to recv takes 6 seconds - now for our overall operation to complete in 10 seconds, our second recv call has to be given a timeout of 4 seconds. With the timeout= approach, every time we pass between levels of abstraction we need to write some annoying gunk to recalculate timeouts:

        import time

        def read_body(sock, content_length, timeout):
            # Convert the relative timeout into an absolute deadline once...
            read_body_deadline = timeout + time.monotonic()
            body = bytearray()
            while len(body) < content_length:
                max_to_receive = content_length - len(body)
                # ...and back into a relative timeout before every call
                recv_timeout = read_body_deadline - time.monotonic()
                body += sock.recv(max_to_receive, timeout=recv_timeout)
            assert len(body) == content_length
            return body

    (And even this is actually simplified because we're pretending that sock.recv takes a timeout argument - if you wanted to do this for real you'd have to call settimeout before every socket method, and then probably use some try/finally to set it back afterwards.)

    I'm not picking on requests here - this problem is everywhere in Python APIs. I'm using requests as the example because Kenneth Reitz is famous for his obsession with making its API as obvious and intuitive as possible, and this is one of the rare places where he's failed: its timeout= argument is not a time limit on the whole request, it only bounds how long the underlying socket may go without making progress. I think this is the only part of the requests API that gets a big box in the documentation warning you that it's counterintuitive. So like ... if even Kenneth Reitz can't get this right, I think we can conclude that "just slap a timeout= argument on it" does not lead to APIs fit for human consumption.

    Absolute deadlines are composable (but kinda annoying to use)

    If timeout= arguments don't work, what can we do instead? Well, here is one option that some people advocate. Notice how in our read_body example above, we converted the incoming relative timeout ("10 seconds from the moment I called this function") into an absolute deadline ("when the clock reads 65.3673"), and then converted back before each socket call. This code would get simpler if we wrote the whole API in terms of deadline= arguments, instead of timeout= arguments. This makes things simple for library implementors, because you can just pass the deadline down your abstraction stack:

        import time

        def read_body(sock, content_length, deadline):
            body = bytearray()
            while len(body) < content_length:
                max_to_receive = content_length - len(body)
                # Pretend recv accepts deadline=; real sockets don't
                body += sock.recv(max_to_receive, deadline=deadline)
            assert len(body) == content_length
            return body

        # Wait 10 seconds total for the response body to be downloaded
        deadline = time.monotonic() + 10
        read_body(sock, content_length, deadline)

    (A well-known API that works like this is Go's socket layer.)

    But this approach also has a downside: it succeeds in moving the annoying bit out of the library internals, and instead puts it on the person using the API. At the outermost level where timeout policy is being set, your library's users probably want to say something like "give up after 10 seconds", and if all you take is a deadline= argument then they have to do the conversion by hand every time. Or you could have every function take both timeout= and deadline= arguments, but then you need some boilerplate in every function to normalize them, raise an error if both are specified, and so forth. Deadlines are an improvement over raw timeouts, but it feels like there's still some missing abstraction here.
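To make that "boilerplate in every function" concrete, here's a rough sketch of what the normalizing step could look like. The fetch function is a made-up stand-in for any public entry point, and the choice to raise TypeError when both arguments are given is my assumption, not something any particular library does:

```python
import time


def normalize_deadline(deadline=None, timeout=None):
    """Collapse the two ways of expressing a time limit into one absolute deadline.

    Returns None when neither argument is given, meaning "wait forever".
    """
    if deadline is not None and timeout is not None:
        raise TypeError("pass either deadline= or timeout=, not both")
    if timeout is not None:
        if timeout < 0:
            raise ValueError("timeout must be non-negative")
        # Relative timeout -> absolute deadline on the monotonic clock.
        return time.monotonic() + timeout
    return deadline


def fetch(url, *, deadline=None, timeout=None):
    # Hypothetical entry point: every public function would need this
    # same first line before it can pass the deadline further down.
    deadline = normalize_deadline(deadline, timeout)
    return url, deadline
```

It's only a dozen lines, but it has to be repeated (or at least called) at the top of every function that can block, which is exactly the kind of chore that gets skipped.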

    Cancel tokens

    Cancel tokens encapsulate cancellation state

    Here's the missing abstraction: instead of supporting two different arguments:

        # What users do:
        requests.get(..., timeout=...)
        # What libraries do:
        requests.get(..., deadline=...)

        # How we implement it:
        def get(url, deadline=None, timeout=None):
            deadline = normalize_deadline(deadline, timeout)
            ...

    we can encapsulate the timeout expiration information into an object with a convenience constructor:

        import time

        class Deadline:
            def __init__(self, deadline):
                self.deadline = deadline

        def after(timeout):
            return Deadline(time.monotonic() + timeout)

        # Wait 10 seconds total for the URL to be fetched
        requests.get("https://...", deadline=after(10))

    That looks nice and natural for users, but since it uses an absolute deadline internally, it's easy for library implementors too.

    And once we've gone this far, we might as well make things a bit more abstract. After all, a timeout isn't the only reason you might want to give up on some blocking operation; "give up after 10 seconds have passed" is a special case of "give up after some arbitrary condition becomes true". If you were using requests to implement a web browser, you'd want to be able to say "start fetching this URL, but give up when the 'stop' button gets pressed". And libraries mostly treat this Deadline object as totally opaque in any case - they just pass it through to lower-level calls, and trust that eventually some low-level primitives will interpret it appropriately. So instead of thinking of this object as encapsulating a deadline, we can start thinking of it as encapsulating an arbitrary "should we give up now" check. And in honor of its more abstract nature, instead of calling it a Deadline, let's call this new thing a CancelToken:

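        # cancel_on_callback is a hypothetical helper: it returns a callback
        # together with the cancel token that the callback cancels
        cancel_callback, cancel_token = cancel_on_callback()
        # Arrange for the callback to be called if someone clicks "stop"
        stop_button.on_press = cancel_callback
        # So this request gives up if someone clicks 'stop'
        requests.get("https://...", cancel_token=cancel_token)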

    So promoting the cancellation condition to a first-class object makes our timeout API easier to use, and at the same time makes it dramatically more powerful: now we can handle not just timeouts, but also arbitrary cancellations, which is a very common requirement when writing concurrent code. (For example, it lets us express things like: "run these two redundant requests in parallel, and as soon as one of them finishes then cancel the other one".) This is a great idea. As far as I know, it originally comes from Joe Duffy's cancellation tokens work in C#, and Go context objects are essentially the same idea. Those folks are pretty smart! In fact, cancel tokens also solve some other problems that show up in traditional cancellation systems.

    Cancel tokens are level-triggered and can be scoped to match your program's needs

    In our little tour of timeout and cancellation APIs, we started with timeouts. If you start with cancellation instead, then there's another common pattern you'll see in lots of systems: a method that lets you cancel a single thread (or task, or whatever your framework uses as a thread-equivalent), by waking it up and throwing in some kind of exception. Examples include asyncio's Task.cancel, Curio's Task.cancel, pthread cancellation, Java's Thread.interrupt, C#'s Thread.Interrupt, and so forth. In their honor, I'll call this the "thread interrupt" approach to cancellation.

    In the thread-interrupt approach, cancellation is a point-in-time event that's directed at a fixed-size entity: one call → one exception in one thread/task. There are two issues here.

    The problem with scale is fairly obvious: if you have a single function you'd like to call normally but you might need to cancel it, then you have to spawn a new thread/task/whatever just for that:

        http_thread = spawn_new_thread(requests.get, "https://...")
        # Arrange that http_thread.interrupt() will be called if someone
        # clicks the stop button
        stop_button.on_click = http_thread.interrupt
        try:
            http_response = http_thread.wait_for_result()
        except Interrupted:
            ...

    Here the thread isn't being used for concurrency; it's just an awkward way of letting you delimit the scope of the cancellation.

    Or, what if you have a big complicated piece of work that you want to cancel - for example, something that internally spawns multiple worker threads? In our example above, if requests.get spawned some additional background threads, they might be left hanging when we cancel the first thread. Handling this correctly would require some complex and delicate bookkeeping.

    Cancel tokens solve this problem: the work they cancel is "whatever the token was passed into", which could be a single function, or a complex multi-tiered set of thread pools, or anything in between.
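Here's a minimal sketch of that scoping property using plain threads. CancelToken, worker and run_workers are illustrative names I made up for this example (this is not C#'s or Go's actual API), and the token is just a thin wrapper around threading.Event:

```python
import threading


class CancelToken:
    """Illustrative token: starts out uncancelled, can be cancelled once."""

    def __init__(self):
        self._cancelled = threading.Event()

    def cancel(self):
        self._cancelled.set()

    def wait(self, timeout=None):
        # Returns True as soon as the token is cancelled, False on timeout.
        return self._cancelled.wait(timeout)


def worker(token, results, name):
    steps = 0
    # Do small units of "work" until someone cancels the token.
    while not token.wait(timeout=0.01):
        steps += 1
    results[name] = steps


def run_workers(n):
    token = CancelToken()
    results = {}
    threads = [
        threading.Thread(target=worker, args=(token, results, f"worker-{i}"))
        for i in range(n)
    ]
    for t in threads:
        t.start()
    # One call cancels everything the token was passed into.
    token.cancel()
    for t in threads:
        t.join(timeout=5)
    return results, [t.is_alive() for t in threads]
```

Nobody had to keep a list of threads to interrupt one by one: the scope of the cancellation is simply the set of places the token was handed to.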

    The other problem with the thread-interrupt approach is more subtle: it treats cancellation as an event. Cancel tokens, on the other hand, model cancellation as a state: they start out in the uncancelled state, and eventually transition into the cancelled state.

    This is subtle, but it makes cancel tokens less error-prone. One way to think of this is the edge-triggered/level-triggered distinction: thread-interrupt APIs provide edge-triggered notification of cancellations, as compared to level-triggered for cancel tokens. Edge-triggered APIs are notoriously tricky to use. You can see an example of this in Python's threading.Event: even though it's called "event", it actually has an internal boolean state; cancelling a cancel token is like setting an Event.
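If the distinction still feels fuzzy, the standard library lets you see both behaviours side by side. This is just a small demonstration I put together: threading.Event plays the level-triggered role, and threading.Condition stands in for an edge-triggered notification (my choice of analogy, not something the cancellation APIs above are built on):

```python
import threading

# Level-triggered: an Event remembers that set() happened.
event = threading.Event()
event.set()                                  # nobody is waiting yet
level_first = event.wait(timeout=0.05)       # True: the state is still "set"
level_second = event.wait(timeout=0.05)      # True again, every time we ask

# Edge-triggered: a Condition notification only reaches current waiters.
cond = threading.Condition()
with cond:
    cond.notify_all()                        # nobody is waiting: it is lost
with cond:
    edge_result = cond.wait(timeout=0.05)    # False: we just time out
```

With the Event, it doesn't matter whether you started waiting before or after the interesting thing happened. With the Condition, arriving a moment too late means you never hear about it.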

    That's all pretty abstract. Let's make it more concrete. Consider the common pattern of using a try/finally to make sure that a connection is shut down properly. Here's a rather artificial example of a function that makes a Websocket connection, sends a message, and then makes sure to close it, regardless of whether send_message raises an exception: [2]

        def send_websocket_message(url, message):
            # open_websocket_connection stands in for whatever call your
            # websocket library uses to connect
            ws = open_websocket_connection(url)
            try:
                ws.send_message(message)
            finally:
                ws.close()

    Now suppose we start this function running, but at some point the other side drops off the network and our send_message call hangs forever. Eventually, we get tired of waiting, and cancel it.

    With a thread-interrupt style edge-triggered API, this causes the send_message call to immediately raise an exception, and then our connection cleanup code automatically runs. So far so good. But here's an interesting fact about the websocket protocol: it has a "close" message you're supposed to send before closing the connection. In general this is a good thing; it allows for cleaner shutdowns. So when we call ws.close(), it'll try to send this message. But ... in this case, the reason we're trying to close the connection is because we've given up on the other side accepting any new messages. So now ws.close() also hangs forever.

    If we used a cancel token, this does not happen:

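        def send_websocket_message(url, message, cancel_token):
            # open_websocket_connection stands in for whatever call your
            # websocket library uses to connect
            ws = open_websocket_connection(url, cancel_token=cancel_token)
            try:
                ws.send_message(message, cancel_token=cancel_token)
            finally:
                ws.close(cancel_token=cancel_token)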

    Once the cancel token is triggered, then all future operations on that token are cancelled, so the call to ws.close doesn't get stuck. It's a less error-prone paradigm.
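You can try this out without a real network. The sketch below fakes a websocket whose peer has vanished, so both send_message and the closing handshake would block forever; FakeWebSocket, Cancelled and the give_up_after safety net are all inventions for this demo, and the cancel token is simply a threading.Event:

```python
import threading
import time


class Cancelled(Exception):
    """Raised by an operation that gave up because its token was cancelled."""


class FakeWebSocket:
    """Stand-in for a connection whose peer is gone: nothing ever completes."""

    def __init__(self):
        self.closed = False

    def _block(self, cancel_token, give_up_after):
        # give_up_after only exists so that the demo itself cannot hang.
        if cancel_token.wait(timeout=give_up_after):
            raise Cancelled
        raise RuntimeError("peer never answered and nobody cancelled")

    def send_message(self, message, cancel_token, give_up_after=5.0):
        self._block(cancel_token, give_up_after)

    def close(self, cancel_token, give_up_after=5.0):
        self.closed = True                        # local teardown
        self._block(cancel_token, give_up_after)  # the "close" handshake


def send_websocket_message(ws, message, cancel_token):
    try:
        ws.send_message(message, cancel_token)
    finally:
        ws.close(cancel_token)


def demo():
    token = threading.Event()
    # Simulate the user giving up shortly after the send starts hanging.
    timer = threading.Timer(0.05, token.set)
    timer.start()
    ws = FakeWebSocket()
    start = time.monotonic()
    try:
        send_websocket_message(ws, "hello", token)
        outcome = "sent"
    except Cancelled:
        outcome = "cancelled"
    timer.join()
    return outcome, ws.closed, time.monotonic() - start
```

Because the token stays cancelled, the handshake inside close gives up immediately instead of starting a fresh wait, so demo() comes back in a fraction of a second rather than sitting there until the safety net fires.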

    It's kind of interesting how so many older APIs could get this wrong. If you follow the path we did in this blog post, and start by thinking about applying a timeout to a complex operation composed out of multiple blocking calls, then it's obvious that if the first call uses up the whole timeout budget, then any future calls should fail immediately. Timeouts are naturally level-triggered. And then when we generalize from timeouts to arbitrary cancellations, the insight carries over. But if you only think about timeouts for primitive operations then this never arises; or if you start with a generic cancellation API and then use it to implement timeouts (like e.g. Twisted and asyncio do), then the advantages of level-triggered cancellation are easy to miss.

    Cancel tokens are unreliable in practice because humans are lazy

    So cancel tokens have really great semantics, and are certainly better than raw timeouts or deadlines, but they still have a usability problem: to write a function that supports cancellation, you have to accept this boilerplate argument and then make sure to pass it on to every subroutine you call. And remember, a correct and robust program has to support cancellation in every function that ever does I/O, anywhere in your stack. If you ever get lazy and leave it out, or just forget to pass it through to any particular subroutine call, then you have a latent bug.

    Humans suck at this kind of boilerplate. I mean, not you, I'm sure you're a very diligent programmer who makes sure to implement correct cancellation support in every function and also flosses every day. But ... perhaps some of your co-workers are not so diligent? Or maybe you depend on some library that someone else wrote - how much do you trust your third-party vendors to get this right? As the size of your stack grows, the chance that everyone everywhere always gets this right approaches zero.

    Can I back that up with any real examples? Well, consider this: in both C# and Go, the most prominent languages that use this approach and have been advocating it for a number of years, the underlying networking primitives still do not have cancel token support [3]. These are like ... THE fundamental operations that might hang for reasons outside your control and that you need to be prepared to time out or cancel, but ... I guess they just haven't gotten around to implementing it yet? Instead their socket layers support an older mechanism for setting timeouts or deadlines on their socket objects, and if you want to use cancel tokens you have to figure out how to bridge between the two different systems yourself.

    The Go standard library does provide one example of how to do this: their function for establishing a network connection (basically the equivalent of Python's socket.connect) does accept a cancel token. Implementing this requires 231 lines of source code, a background task, and the first try had a race condition that took a year to be discovered in production. So ... in Go if you want to use cancel tokens (or Contexts, in Go parlance), then I guess that's what you need to implement every time you use any socket operation? Good luck?

    I don't mean to make fun. This stuff is hard. But C# and Go are huge projects maintained by teams of highly-skilled full-time developers and backed by Fortune 50 companies. If they can't get it right, who can? Not me. I'm one human trying to reinvent I/O in Python. I can't afford to make things that complicated.

    Cancel scopes: Trio's human-friendly solution for timeouts and cancellation

    Remember way back at the beginning of this post, we noted that Python socket methods don't take individual timeout arguments, but instead let you set the timeout once on the socket so it's implicitly passed to every method you call? And in the section just above, we noticed that C# and Go do pretty much the same thing? I think they're on to something. Maybe we should accept that when you have some data that has to be passed through to every function you call, that's something the computer should handle, rather than making flaky humans do the work - but in a general way that supports complex abstractions, not just sockets.

    How cancel scopes work

    Here's how you impose a 10 second timeout on an HTTP request in Trio:

        # The primitive API:
        with trio.open_cancel_scope() as cancel_scope:
            cancel_scope.deadline = trio.current_time() + 10
            await request.get("https://...")

    Of course normally you'd use a convenience wrapper, like:

        # An equivalent but more idiomatic formulation:
        with trio.move_on_after(10):
            await request.get("https://...")

    But since this post is about the underlying design, we'll focus on the primitive version. (Credit: the idea of using with blocks for timeouts is something I first saw in Dave Beazley's Curio, though I changed a bunch. I'll hide the details in a footnote: [4])

    You should think of with open_cancel_scope() as creating a cancel token, but it doesn't actually expose any CancelToken object publicly. Instead, the cancel token is pushed onto an invisible internal stack, and automatically applied to any blocking operations called inside the with block. So requests does not have to do anything to pass this through - when it eventually sends and receives data over the network, those primitive calls will automatically have the deadline applied.

    The cancel_scope object lets us control cancellation status: you can change the deadline, issue an explicit cancellation by calling cancel_scope.cancel(), and so on. If you know C#, it's analogous to a CancellationTokenSource. One useful trick it allows is implementing the kind of raise-an-error-if-the-timeout-fires API that people are used to, on top of the more primitive cancel scope unwinding semantics.

    When an operation is cancelled, it raises a Cancelled exception, which is used to unwind the stack back out to the appropriate with open_cancel_scope block. Cancel scopes can be nested; Cancelled exceptions know which scope triggered them, and will keep propagating until they reach the corresponding with block. (As a consequence, you should always let the Trio runtime take care of raising and catching Cancelled exceptions, so that it can properly keep track of these relationships.)

    Supporting nesting is important because some operations may want to use timeouts internally as an implementation detail. For example, when you ask Trio to make a TCP connection to a hostname that has multiple IP addresses associated with it, it uses a "happy eyeballs" algorithm to run multiple connection attempts in parallel with a staggered start. This requires an internal timeout to decide when it's time to initiate the next connection attempt. But users shouldn't have to care about that! If you want to say "try to connect to example.com:443, but give up after 10 seconds", then that's just:

        with trio.move_on_after(10):
            tcp_stream = await trio.open_tcp_stream("example.com", 443)

    And everything works; thanks to the cancel scope nesting rules, it turns out open_tcp_stream