Million WebSockets and Go (2017), Hacker News


Flying Gopher with MailRu logo

Hi everyone! My name is Sergey Kamardin and I’m a developer at Mail.Ru.

This article is about how we developed a high-load WebSocket server with Go.

If you are familiar with WebSocket, but know little about Go, I hope you will still find this article interesting in terms of ideas and techniques for performance optimization.

1. Introduction

To define the context of our story, a few words should be said about why we need this server.

Mail.Ru has a lot of stateful systems. User email storage is obviously one of them. There are several ways to learn about changes in system state, that is, about system events. Mostly this is done either through periodic polling of the system or, the other way round, through notifications that the system sends about its state changes.

Both ways have their pros and cons. However, when it comes to mail, the faster a user receives new mail, the better. Mail polling involves about […] HTTP queries per second, […]% of which return the 304 status, meaning there are no changes in the mailbox.

Therefore, in order to reduce the load on the servers and to speed up mail delivery to users, the decision was made to re-invent the wheel by writing a publisher-subscriber server (aka bus, message broker, event channel, etc.) that would receive notifications about state changes on the one hand and subscriptions for such notifications on the other.

Previously:

-----------   (2)    -----------   (1)    -----------
|         | ◄------- |         | ◄------- |         |
| Storage |          |   API   |   HTTP   | Browser |
|         | -------► |         | -------► |         |
-----------   (3)    -----------   (4)    -----------

Now:

-------------     ---------   WebSocket   -----------
|  Storage  |     |  API  | ------------► | Browser |
-------------     ---------      (3)      -----------
                  (2) ▲
      |               |
  (1) ▼
---------------------------------
|              Bus              |
---------------------------------

The first scheme shows what it was like before. The browser periodically polled the API and asked about Storage (mailbox service) changes.

The second scheme describes the new architecture. The browser establishes a WebSocket connection with the notification API, which is a client of the Bus server. Upon receipt of new email, Storage sends a notification about it to Bus (1), and Bus forwards it to its subscribers (2). The API determines which connection the received notification belongs to and sends it to the user's browser (3).

As you can guess, today we're going to talk about the API, that is, the WebSocket server. Looking ahead, I'll tell you that the server will have about 3 million online connections. This figure will come up more than once in the rest of our story about optimization.

2. Idiomatic way

Let's see how we would implement certain parts of our server using plain Go features without any optimizations.

Before we proceed with net/http, let's talk about how we will send and receive data. The data which stands above the WebSocket protocol (e.g. JSON objects) will hereinafter be referred to as packets.

Let's begin implementing the Channel structure that will contain the logic of sending and receiving such packets over the WebSocket connection.

2.1. Channel struct

    // Packet represents application level data.
    type Packet struct {
        // Application-specific fields.
    }

    // Channel wraps user connection.
    type Channel struct {
        conn net.Conn    // WebSocket connection.
        send chan Packet // Outgoing packets queue.
    }

    func NewChannel(conn net.Conn) *Channel {
        c := &Channel{
            conn: conn,
            send: make(chan Packet, N),
        }

        go c.reader()
        go c.writer()

        return c
    }

I'd like to draw your attention to the launch of the two goroutines, one reading and one writing. Each goroutine requires its own stack, which may have an initial size of 2 to 8 KB depending on the operating system and Go version. With the above-mentioned 3 million online connections, we will need 24 GB of memory (with a 4 KB stack) for all connections. And that's without the memory allocated for the Channel structure, the outgoing packets queue ch.send and other internal fields.

2.2. I/O goroutines

Let's have a look at the implementation of the “reader”:

    func (c *Channel) reader() {
        // We make a buffered read to reduce read syscalls.
        buf := bufio.NewReader(c.conn)

        for {
            pkt, _ := readPacket(buf)
            c.handle(pkt)
        }
    }

Pretty simple, right? We use bufio.Reader to reduce the number of read() syscalls and to read as much as the buf buffer size allows. Within the infinite loop, we expect new data to come. Please remember the words "expect new data to come": we will return to them later.
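To make the effect of buffering concrete, here is a minimal sketch that counts calls to the underlying Read, each of which would be one read() syscall on a real socket. The countingReader type and the countReads helper are illustrative names that do not come from the article, and an in-memory string stands in for the connection.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// countingReader counts how many times Read is called on the wrapped
// reader. Each call stands in for one read() syscall on a real socket.
type countingReader struct {
	r     io.Reader
	calls int
}

func (c *countingReader) Read(p []byte) (int, error) {
	c.calls++
	return c.r.Read(p)
}

// countReads consumes n bytes one at a time, first directly and then
// through a bufio.Reader, and returns the number of underlying Read
// calls made in each case.
func countReads(n int) (direct, buffered int) {
	data := strings.Repeat("x", n)

	d := &countingReader{r: strings.NewReader(data)}
	one := make([]byte, 1)
	for i := 0; i < n; i++ {
		if _, err := io.ReadFull(d, one); err != nil {
			panic(err)
		}
	}

	b := &countingReader{r: strings.NewReader(data)}
	br := bufio.NewReader(b) // default buffer size is 4096 bytes
	for i := 0; i < n; i++ {
		if _, err := br.ReadByte(); err != nil {
			panic(err)
		}
	}
	return d.calls, b.calls
}

func main() {
	direct, buffered := countReads(1000)
	fmt.Println("direct reads:", direct, "buffered reads:", buffered)
}
```

The price of the saved calls is the buffer itself, which is exactly the per-connection cost discussed next.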

We will leave aside the parsing and processing of incoming packets, as it is not important for the optimizations we will talk about. However, buf is worth our attention now: by default, it is 4 KB, which means another 12 GB of memory for our connections. There is a similar situation with the “writer”:

    func (c *Channel) writer() {
        // We make buffered write to reduce write syscalls.
        buf := bufio.NewWriter(c.conn)

        for pkt := range c.send {
            _ = writePacket(buf, pkt)
            buf.Flush()
        }
    }

We iterate across the outgoing packets channel c.send and write them to the buffer. This is, as our attentive readers can already guess, another 4 KB per connection and 12 GB of memory for our 3 million connections.

2.3. HTTP

We already have a simple Channel implementation; now we need to get a WebSocket connection to work with. As we are still under the Idiomatic way heading, let's do it in the corresponding way.

If you don't know how WebSocket works, it should be mentioned that the client switches to the WebSocket protocol by means of a special HTTP mechanism called Upgrade. After the successful processing of an Upgrade request, the server and the client use the TCP connection to exchange binary WebSocket frames.

Here is a description of the frame structure inside the connection.

Note that net/http allocates a bufio.Reader and a bufio.Writer (each with a 4 KB buffer) per connection, for *http.Request initialization and further response writing.

Regardless of the WebSocket library used, after a successful response to the Upgrade request, the server receives the I/O buffers together with the TCP connection after the responseWriter.Hijack() call.
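As an illustration of what Hijack() hands back, the following sketch takes over a connection with the standard http.Hijacker interface and reports the sizes of the buffers that came with it. It is not a WebSocket upgrade: the handler answers with a hand-written plain HTTP response, and hijackHandler and fetch are hypothetical names used only for this example.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// hijackHandler takes the TCP connection over from net/http and replies
// with a hand-written HTTP response that reports the buffer sizes.
func hijackHandler(w http.ResponseWriter, r *http.Request) {
	hj, ok := w.(http.Hijacker)
	if !ok {
		http.Error(w, "hijacking not supported", http.StatusInternalServerError)
		return
	}
	// conn is the raw connection; brw carries the bufio.Reader and
	// bufio.Writer that now belong to the caller.
	conn, brw, err := hj.Hijack()
	if err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}
	defer conn.Close()

	body := fmt.Sprintf("r=%d w=%d", brw.Reader.Size(), brw.Writer.Size())
	fmt.Fprintf(brw, "HTTP/1.1 200 OK\r\nContent-Length: %d\r\nConnection: close\r\n\r\n%s",
		len(body), body)
	brw.Flush()
}

// fetch starts a throwaway local server, performs one request against
// the hijacking handler and returns the response body.
func fetch() (string, error) {
	srv := httptest.NewServer(http.HandlerFunc(hijackHandler))
	defer srv.Close()

	resp, err := http.Get(srv.URL)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	b, err := io.ReadAll(resp.Body)
	return string(b), err
}

func main() {
	body, err := fetch()
	if err != nil {
		panic(err)
	}
	fmt.Println("buffer sizes reported by the handler:", body)
}
```

After Hijack() the handler owns both the connection and the buffers, so a long-lived connection keeps them alive for its whole lifetime unless the application releases or reuses them.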

Hint: in some cases go:linkname can be used to return the buffers to the sync.Pool inside net/http through the call net/http.putBufio{Reader,Writer}.

Thus, we need another 24 GB of memory for 3 million connections.

So, a total of 72 GB of memory (24 + 12 + 12 + 24) for an application that does nothing yet!
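The running total can be checked with a few lines of arithmetic. This sketch assumes the figures used in this section (3 million connections, 4 KB stacks, 4 KB buffers) and the same rounding of 1 GB to 1,000,000 KB; the helper names are illustrative.

```go
package main

import "fmt"

const connections = 3000000

// gb converts a per-connection cost in KB into a total in GB for all
// connections, rounding 1 GB to 1,000,000 KB as the text does.
func gb(perConnKB int) int {
	return connections * perConnKB / 1000000
}

// total returns each component of the memory estimate and their sum.
func total() (stacks, readBufs, writeBufs, httpBufs, sum int) {
	stacks = gb(2 * 4)   // reader + writer goroutine, 4 KB stack each
	readBufs = gb(4)     // bufio.Reader inside Channel.reader()
	writeBufs = gb(4)    // bufio.Writer inside Channel.writer()
	httpBufs = gb(2 * 4) // bufio.Reader + bufio.Writer kept from net/http
	sum = stacks + readBufs + writeBufs + httpBufs
	return
}

func main() {
	stacks, readBufs, writeBufs, httpBufs, sum := total()
	fmt.Println("stacks:", stacks, "read:", readBufs, "write:", writeBufs,
		"net/http:", httpBufs, "total GB:", sum)
}
```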

3. Optimizations

Let's review what we talked about in the introduction and remember how a user connection behaves. After switching to WebSocket, the client sends a packet with the relevant events or, in other words, subscribes to events. Then (not taking into account technical messages such as ping/pong), the client may send nothing else for the whole connection lifetime.

The connection lifetime may last from several seconds to several days.

So most of the time our Channel.reader() and Channel.writer() are waiting for data to receive or send. Waiting along with them are the I/O buffers of 4 KB each.

Now it is clear that certain things could be done better, couldn't they?

3.1. Netpoll

Do you remember the Channel.reader() implementation that expected new data to come by getting locked on the conn.Read() call inside bufio.Reader.Read()? If there was data in the connection, the Go runtime “woke up” our goroutine and allowed it to read the next packet. After that, the goroutine got locked again while expecting new data. Let's see how the Go runtime understands that the goroutine must be “woken up”.

If we look at the conn.Read() implementation, we'll see the net.netFD.Read() call inside it:

    func (fd *netFD) Read(p []byte) (n int, err error) {
        //...
        for {
            n, err = syscall.Read(fd.sysfd, p)
            if err != nil {
                n = 0
                if err == syscall.EAGAIN {
                    if err = fd.pd.waitRead(); err == nil {
                        continue
                    }
                }
            }
            //...
            break
        }
        //...
    }

Go uses sockets in non-blocking mode. EAGAIN says there is no data in the socket; instead of blocking on a read from the empty socket, the OS returns control to us.

We see a read() syscall on the connection file descriptor. If read returns the EAGAIN error, the runtime makes the pollDesc.waitRead() call:

    // net/fd_poll_runtime.go

    func (pd *pollDesc) waitRead() error {
        return pd.wait('r')
    }

    func (pd *pollDesc) wait(mode int) error {
        res := runtime_pollWait(pd.runtimeCtx, mode)
        //...
    }

If we dig deeper, we'll see that netpoll is implemented using epoll in Linux and kqueue in BSD. Why not use the same approach for our connections? We could allocate a read buffer and start the reading goroutine only when it is really necessary: when there is really readable data in the socket.

On github.com/golang/go, there is an issue about exporting netpoll functions.
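To show the readiness idea in isolation, here is a minimal Linux-only sketch that asks epoll whether a descriptor is readable before any buffer or goroutine is spent on it. It uses the standard syscall package directly rather than the runtime's internal netpoll, and a Unix socket pair stands in for a client connection; waitReadable and demo are hypothetical helper names.

```go
//go:build linux

package main

import (
	"fmt"
	"syscall"
)

// waitReadable registers fd with a fresh epoll instance and reports
// whether it became readable within timeoutMs milliseconds.
func waitReadable(fd, timeoutMs int) (bool, error) {
	epfd, err := syscall.EpollCreate1(0)
	if err != nil {
		return false, err
	}
	defer syscall.Close(epfd)

	ev := syscall.EpollEvent{Events: syscall.EPOLLIN, Fd: int32(fd)}
	if err := syscall.EpollCtl(epfd, syscall.EPOLL_CTL_ADD, fd, &ev); err != nil {
		return false, err
	}

	events := make([]syscall.EpollEvent, 1)
	for {
		n, err := syscall.EpollWait(epfd, events, timeoutMs)
		if err == syscall.EINTR {
			continue // interrupted by a signal, try again
		}
		if err != nil {
			return false, err
		}
		return n > 0, nil
	}
}

// demo checks readiness of one end of a socket pair before and after
// the other end writes to it.
func demo() (before, after bool, err error) {
	fds, err := syscall.Socketpair(syscall.AF_UNIX, syscall.SOCK_STREAM, 0)
	if err != nil {
		return false, false, err
	}
	defer syscall.Close(fds[0])
	defer syscall.Close(fds[1])

	if before, err = waitReadable(fds[0], 0); err != nil {
		return
	}
	if _, err = syscall.Write(fds[1], []byte("ping")); err != nil {
		return
	}
	after, err = waitReadable(fds[0], 1000)
	return
}

func main() {
	before, after, err := demo()
	if err != nil {
		panic(err)
	}
	fmt.Println("readable before write:", before, "after write:", after)
}
```

A real server would keep one epoll instance for all connections and dispatch on the returned events; creating an instance per call here only keeps the example short.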

3.2. Getting rid of goroutines

Suppose we have a netpoll implementation for Go. Now we can avoid starting the Channel.reader() goroutine with the inside buffer, and subscribe for the event of readable data in the connection:

    ch := NewChannel(conn)

    // Make conn to be observed by netpoll instance.
    poller.Start(conn, netpoll.EventRead, func() {
        // We spawn goroutine here to prevent poller wait loop
        // to become locked during receiving packet from ch.
        go ch.Receive()
    })

    // Receive reads a packet from conn and handles it somehow.
    func (ch *Channel) Receive() {
        buf := bufio.NewReader(ch.conn)
        pkt, _ := readPacket(buf)
        ch.handle(pkt)
    }

It is easier with the Channel.writer() because we can run the goroutine and allocate the buffer only when we are going to send the packet:

    func (ch *Channel) Send(p Packet) {
        if ch.noWriterYet() {
            go ch.writer()
        }
        ch.send <- p
    }

Note that we do not handle cases when the operating system returns EAGAIN on write() system calls. We lean on the Go runtime for such cases, because they are actually rare for this kind of server.

After reading the outgoing packets from ch.send (one or several), the writer will finish its operation and free the goroutine stack and the send buffer.
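One way to sketch a writer that exists only while there is something to send is shown below. This is a simplified stand-in rather than the article's implementation: it swaps the send channel for a mutex-protected slice so that "is a writer already running" can be decided atomically, uses raw byte slices instead of Packet, and ignores write errors. All names are illustrative.

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"io"
	"sync"
)

// Channel is a simplified stand-in for the article's Channel that keeps
// only what the on-demand writer needs.
type Channel struct {
	out io.Writer

	mu      sync.Mutex
	queue   [][]byte
	writing bool
	wg      sync.WaitGroup
}

// Send queues a packet and starts a writer goroutine only if none is
// currently running.
func (c *Channel) Send(p []byte) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.queue = append(c.queue, p)
	if !c.writing {
		c.writing = true
		c.wg.Add(1)
		go c.writer()
	}
}

// writer drains the queue and exits, releasing its stack and buffer.
func (c *Channel) writer() {
	defer c.wg.Done()
	buf := bufio.NewWriter(c.out) // lives only while there is work
	for {
		c.mu.Lock()
		batch := c.queue
		c.queue = nil
		if len(batch) == 0 {
			c.writing = false
			c.mu.Unlock()
			return
		}
		c.mu.Unlock()

		for _, p := range batch {
			buf.Write(p) // errors are ignored in this sketch
		}
		buf.Flush()
	}
}

// sendAll sends the packets through a fresh Channel and returns what was
// written and whether a writer is still marked as running afterwards.
func sendAll(packets ...string) (string, bool) {
	var out bytes.Buffer
	c := &Channel{out: &out}
	for _, p := range packets {
		c.Send([]byte(p))
	}
	c.wg.Wait()
	c.mu.Lock()
	defer c.mu.Unlock()
	return out.String(), c.writing
}

func main() {
	written, running := sendAll("a", "b", "c")
	fmt.Println("written:", written, "writer still running:", running)
}
```

The flag is cleared under the same lock that Send takes, so a packet queued while the writer is exiting always finds either a running writer or none at all, never a writer that has already decided to stop.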

Perfect! We have saved 48 GB by getting rid of the stack and I/O buffers inside the two continuously running goroutines.

3.3. Control of resources

A great number of connections involves not only high memory consumption. When developing the server, we experienced repeated race conditions and deadlocks often followed by the so-called self-DDoS - a situation when the application clients rampantly tried to connect to the server thus breaking it even more.

For example, if for some reason we suddenly could not handle ping/pong messages, but the handler of idle connections continued to close such connections (supposing that the connections were broken and therefore provided no data), the client appeared to lose connection every N seconds and tried to connect again instead of waiting for events.

It would be great if the locked or overloaded server just stopped accepting new connections, and the balancer in front of it (for example, nginx) passed the request to the next server instance.

Moreover, regardless of the server load, if all clients suddenly want to send us a packet for any reason (presumably because of a bug), the previously saved 48 GB will be in use again, as we will actually get back to the initial state of a goroutine and a buffer per connection.

3.3.1. Goroutine pool

We can restrict the number of packets handled simultaneously using a goroutine pool. This is what a naive implementation of such pool looks like:

    package gopool

    // Pool is a naive goroutine pool.
    type Pool struct {
        work chan func()
        sem  chan struct{}
    }

    func New(size int) *Pool {
        return &Pool{
            work: make(chan func()),
            sem:  make(chan struct{}, size),
        }
    }

    func (p *Pool) Schedule(task func()) error {
        select {
        case p.work <- task:
        case p.sem <- struct{}{}:
            go p.worker(task)
        }
        return nil
    }

    func (p *Pool) worker(task func()) {
        defer func() { <-p.sem }()
        for {
            task()
            task = <-p.work
        }
    }

Now our code with netpoll hands the receive over to the pool instead of spawning a new goroutine for every read event.

So now we read the packet not only upon readable data appearance in the socket, but also upon the first opportunity to take up the free goroutine in the pool.
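The bounding effect can be sketched without a poller: below, a pool of the same shape as the naive one above runs many short tasks, and a counter records how many ever ran at once. In the server the scheduled task would be the packet receive; here it is a short sleep. NewPool and runBounded are hypothetical names for this example.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// Pool runs tasks on at most cap(sem) goroutines.
type Pool struct {
	work chan func()
	sem  chan struct{}
}

func NewPool(size int) *Pool {
	return &Pool{work: make(chan func()), sem: make(chan struct{}, size)}
}

// Schedule blocks until an idle worker takes the task or a new worker
// may be started.
func (p *Pool) Schedule(task func()) {
	select {
	case p.work <- task:
	case p.sem <- struct{}{}:
		go p.worker(task)
	}
}

func (p *Pool) worker(task func()) {
	defer func() { <-p.sem }()
	for {
		task()
		task = <-p.work
	}
}

// runBounded schedules the given number of tasks on a pool of the given
// size and returns the highest concurrency observed and the number of
// completed tasks.
func runBounded(size, tasks int) (maxSeen, done int32) {
	p := NewPool(size)
	var wg sync.WaitGroup
	var cur int32
	for i := 0; i < tasks; i++ {
		wg.Add(1)
		p.Schedule(func() {
			defer wg.Done()
			n := atomic.AddInt32(&cur, 1)
			for {
				m := atomic.LoadInt32(&maxSeen)
				if n <= m || atomic.CompareAndSwapInt32(&maxSeen, m, n) {
					break
				}
			}
			time.Sleep(time.Millisecond) // stands in for reading a packet
			atomic.AddInt32(&cur, -1)
			atomic.AddInt32(&done, 1)
		})
	}
	wg.Wait()
	return atomic.LoadInt32(&maxSeen), atomic.LoadInt32(&done)
}

func main() {
	maxSeen, done := runBounded(4, 50)
	fmt.Println("max concurrent tasks:", maxSeen, "completed:", done)
}
```

Because Schedule blocks when every worker is busy, calling it from the poller callback slows the poller's wait loop down under load, which is the back-pressure this section is after.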

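Similarly, we'll change Send() so that the writer runs in one of the pooled goroutines rather than in a freshly spawned one.

The goroutine pool also lets us limit Accept() and Upgrade() of new connections and avoid most situations with DDoS.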
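A pool can only shed load if scheduling is allowed to fail. The sketch below adds a timeout variant of Schedule to a pool of the same shape as above; the ScheduleTimeout method and the ErrScheduleTimeout value are assumptions made for this illustration. An accept loop could close or delay a new connection when it gets the timeout error.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// ErrScheduleTimeout is returned when no worker became available in time.
var ErrScheduleTimeout = errors.New("schedule timeout")

type Pool struct {
	work chan func()
	sem  chan struct{}
}

func NewPool(size int) *Pool {
	return &Pool{work: make(chan func()), sem: make(chan struct{}, size)}
}

// ScheduleTimeout hands the task to a worker, or gives up after d.
func (p *Pool) ScheduleTimeout(d time.Duration, task func()) error {
	t := time.NewTimer(d)
	defer t.Stop()
	select {
	case p.work <- task:
		return nil
	case p.sem <- struct{}{}:
		go p.worker(task)
		return nil
	case <-t.C:
		return ErrScheduleTimeout
	}
}

func (p *Pool) worker(task func()) {
	defer func() { <-p.sem }()
	for {
		task()
		task = <-p.work
	}
}

// demo occupies the only worker, tries to schedule while it is busy,
// then frees the worker and schedules again.
func demo() (first, second, third error) {
	p := NewPool(1)
	release := make(chan struct{})
	first = p.ScheduleTimeout(time.Second, func() { <-release })
	second = p.ScheduleTimeout(20*time.Millisecond, func() {})
	close(release)
	third = p.ScheduleTimeout(time.Second, func() {})
	return
}

func main() {
	first, second, third := demo()
	fmt.Println("first:", first, "second:", second, "third:", third)
}
```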

3.4. Zero-copy upgrade

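Let's deviate a little from the WebSocket protocol. As was already mentioned, the client switches to the WebSocket protocol using an HTTP Upgrade request: a GET request carrying the Upgrade: websocket and Connection: Upgrade headers, to which the server replies with the 101 Switching Protocols status.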
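The only computation the server has to perform during this handshake is deriving the Sec-WebSocket-Accept response header from the client's Sec-WebSocket-Key, as defined in RFC 6455. A minimal sketch follows, with acceptKey as an illustrative name; the key in main is the sample key from the RFC.

```go
package main

import (
	"crypto/sha1"
	"encoding/base64"
	"fmt"
)

// wsGUID is the fixed GUID that RFC 6455 appends to the client key.
const wsGUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"

// acceptKey computes the Sec-WebSocket-Accept value for the given
// Sec-WebSocket-Key: base64(SHA-1(key + GUID)).
func acceptKey(key string) string {
	sum := sha1.Sum([]byte(key + wsGUID))
	return base64.StdEncoding.EncodeToString(sum[:])
}

func main() {
	fmt.Println(acceptKey("dGhlIHNhbXBsZSBub25jZQ=="))
}
```

Everything else in the request, apart from a handful of headers, is irrelevant to the upgrade.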

That is, in our case we need the HTTP request and its headers only to switch to the WebSocket protocol. This knowledge, together with what is stored inside http.Request, suggests that for the sake of optimization we could probably avoid unnecessary allocations and copying when processing HTTP requests and abandon the standard net/http server.

For example, http.Request contains a field of the same-named Header type that is unconditionally filled with all request headers by copying data from the connection into string values. Imagine how much extra data could be kept inside this field, for example for a large Cookie header.

But what to take in return?

3.4.1. WebSocket implementation

Unfortunately, all the libraries that existed at the time of our server optimization allowed us to do the upgrade only for the standard net/http server. Moreover, neither of the (two) libraries made it possible to use all of the above read and write optimizations. For these optimizations to work, we must have a rather low-level API for working with WebSocket. To reuse the buffers, we need the protocol functions to look like this:

    func ReadFrame(io.Reader) (Frame, error)
    func WriteFrame(io.Writer, Frame) error
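To give a feel for what a function with such a signature has to do, here is a minimal sketch of a frame reader that follows the RFC 6455 wire format. It is not the API of any particular library: the Frame fields, the 1 MB payload cap and the decode helper are assumptions for this example, and it allocates the payload on every call, which a production version would avoid.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"errors"
	"fmt"
	"io"
)

// Frame holds the parts of a WebSocket frame used in this sketch.
type Frame struct {
	Fin     bool
	OpCode  byte
	Payload []byte
}

// maxPayload is an arbitrary safety cap chosen for this example.
const maxPayload = 1 << 20

// ReadFrame reads a single frame from r, unmasking the payload if needed.
func ReadFrame(r io.Reader) (Frame, error) {
	var f Frame
	var hdr [2]byte
	if _, err := io.ReadFull(r, hdr[:]); err != nil {
		return f, err
	}
	f.Fin = hdr[0]&0x80 != 0
	f.OpCode = hdr[0] & 0x0f
	masked := hdr[1]&0x80 != 0
	length := uint64(hdr[1] & 0x7f)

	switch length {
	case 126: // 16-bit extended length
		var ext [2]byte
		if _, err := io.ReadFull(r, ext[:]); err != nil {
			return f, err
		}
		length = uint64(binary.BigEndian.Uint16(ext[:]))
	case 127: // 64-bit extended length
		var ext [8]byte
		if _, err := io.ReadFull(r, ext[:]); err != nil {
			return f, err
		}
		length = binary.BigEndian.Uint64(ext[:])
	}
	if length > maxPayload {
		return f, errors.New("frame payload too large")
	}

	var key [4]byte
	if masked {
		if _, err := io.ReadFull(r, key[:]); err != nil {
			return f, err
		}
	}
	f.Payload = make([]byte, length)
	if _, err := io.ReadFull(r, f.Payload); err != nil {
		return f, err
	}
	if masked {
		for i := range f.Payload {
			f.Payload[i] ^= key[i%4]
		}
	}
	return f, nil
}

// decode parses one frame from raw bytes and returns its payload as text.
func decode(raw []byte) (string, error) {
	f, err := ReadFrame(bytes.NewReader(raw))
	return string(f.Payload), err
}

func main() {
	// Unmasked single-frame text message from the RFC 6455 examples.
	text, err := decode([]byte{0x81, 0x05, 0x48, 0x65, 0x6c, 0x6c, 0x6f})
	fmt.Println(text, err)
}
```

Because the function takes a plain io.Reader, the caller decides where the buffer comes from and when it is released.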

If we had a library with such API, we could read packets from the connection as follows (the packet writing would look the same):

    // getReadBuf, putReadBuf are intended to
    // reuse *bufio.Reader (with sync.Pool for example).
    func getReadBuf(io.Reader) *bufio.Reader
    func putReadBuf(*bufio.Reader)

    // readPacket must be called when data could be read from conn.
    func readPacket(conn io.Reader) error {
        buf := getReadBuf(conn)
        defer putReadBuf(buf)