High Performance Linux

Friday, March 29, 2013

What's Wrong With Sockets Performance And How to Fix It

The socket API is a nice thing which allows you to easily write network programs. But sockets have a fundamental problem from the performance point of view: they are asynchronous with respect to network interrupts. And this is true regardless of whether you're using blocking or nonblocking I/O.

Let's consider a multi-threaded application which works with a number of sockets and reads some data from them. Typically, it does the following (pseudo code):

    int n = epoll_wait(e->fd, e->event, e->max_events, 1000);
    for (int i = 0; i < n; ++i) {
        unsigned char buf[4096];
        read(e->event[i].data.fd, buf, 4096);
    }

The polled sockets could be either blocking or non-blocking. Let's forget about the buffer copying for a moment and concentrate on what happens with arriving packets.


The figure depicts two processes (there is no difference between processes and threads in our discussion) which read from three sockets. The processes are working on different CPUs. Presumably Receive Flow Steering (RFS) is used, so packets destined for the first process go to the first CPU and packets for the second process are processed by softirq on the second CPU. Each socket has a receive queue where incoming packets are placed before the reading process consumes them.

If we look at the code sample carefully, we find two system calls, which are relatively slow operations. The process can also be rescheduled and/or preempted between the syscalls. So if the process is woken up in the epoll_wait() call by a socket event (when the socket gets a packet), it reads data from the socket with some delay. There is a bold arrow between the second socket's queue and the first process which depicts reading data from the socket. There are two complications:
  • the process can be preempted by softirq between waking up in epoll_wait() and reading from the socket (however, it's easy to prevent this by binding the process and the NIC interrupts to different cores);
  • during high load Linux switches to polling mode and grabs batches of packets very quickly, so during the delay between the two syscalls softirq can process a lot of other packets.
The problem is that while the process goes to handle the packet, softirq can receive other packets (a lot of packets, actually; see the bold arrow from softirq to the queue of the first socket in the figure). With packet lengths from 64 to 1500 bytes on a common Ethernet link, it's obvious that the packet which the process is reading now may no longer be in the CPU cache. The packet is simply pushed out of the CPU cache by the other packets. Thus, even with zero-copy networking, user-space applications cannot achieve good performance.

In fact, the Linux firewall works in softirq context. It means that a packet is processed synchronously, immediately when it is received. Moreover, synchronous packet processing is not limited to network-level operations (at which the firewall works). Fortunately, Linux also assembles the TCP stream in softirq context. The Linux kernel also provides a few callbacks in struct sock (see include/net/sock.h):

    void (*sk_state_change)(struct sock *sk);
    void (*sk_data_ready)(struct sock *sk, int bytes);
    void (*sk_write_space)(struct sock *sk);
    void (*sk_error_report)(struct sock *sk);
    int  (*sk_backlog_rcv)(struct sock *sk, struct sk_buff *skb);


For example, sk_data_ready() is called when new data is received on the socket. So it is simple to read TCP data synchronously in deferred interrupt context. Writing to the socket is a bit harder, but still possible. Of course, your application must live in the kernel now.
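As a sketch of the "writing is harder" part: from kernel context you can't use the user-space send(2), but you can call kernel_sendmsg() directly on the struct socket. This is a hedged illustration of one possible approach, not the exact code from the article; th_send() is a hypothetical helper:

```c
#include <linux/net.h>
#include <linux/socket.h>
#include <linux/uio.h>

/* Hypothetical helper: send @len bytes of kernel memory over @sock.
 * kernel_sendmsg() wraps the kvec and dispatches it to the socket's
 * sendmsg operation, so no user-space buffers are involved. */
static int
th_send(struct socket *sock, void *data, size_t len)
{
    struct kvec vec = { .iov_base = data, .iov_len = len };
    struct msghdr msg = { .msg_flags = MSG_DONTWAIT };

    return kernel_sendmsg(sock, &msg, &vec, 1, len);
}
```

Since this runs in kernel context, it is a module fragment rather than a runnable program.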

Let's have a look at a simple example of how to use the hooks for TCP data reading. First of all, we need a listening socket (these are kernel sockets, so the socket API is different):

    struct socket *l_sock;

    sock_create_kern(AF_INET, SOCK_STREAM, IPPROTO_TCP, &l_sock);
    inet_sk(l_sock->sk)->freebind = 1;

    /* addr is some address packed into struct sockaddr_in */
    l_sock->ops->bind(l_sock, (struct sockaddr *)&addr,
                      sizeof(addr));

    l_sock->sk->sk_state_change = th_tcp_state_change;

    l_sock->ops->listen(l_sock, 100);


sk_state_change() is called by the Linux TCP code when the socket state changes. We need the socket of a newly established connection, so we need to handle the transition to TCP_ESTABLISHED. TCP_ESTABLISHED will be set on the child socket of course, but we set the callback on the listening socket because the child socket inherits the callback pointers from its parent. th_tcp_state_change() can be defined as:

    void
    th_tcp_state_change(struct sock *sk)
    {
        if (sk->sk_state == TCP_ESTABLISHED)
            sk->sk_data_ready = th_tcp_data_ready;
    }

Here we set another callback, this time for the child socket. th_tcp_data_ready() is called when new data is available in the socket receive queue (sk_receive_queue). So in the function we need to do what the standard Linux tcp_recvmsg() does: traverse the queue and pick packets with the appropriate sequence numbers from it:

    void
    th_tcp_data_ready(struct sock *sk, int bytes)
    {
        unsigned int processed = 0, off;
        struct sk_buff *skb, *tmp;
        struct tcp_sock *tp = tcp_sk(sk);

        skb_queue_walk_safe(&sk->sk_receive_queue, skb, tmp) {
            off = tp->copied_seq - TCP_SKB_CB(skb)->seq;
            if (tcp_hdr(skb)->syn)
                off--;
            if (off < skb->len) {
                int n = skb_headlen(skb);

                printk(KERN_INFO "Received: %.*s\n",
                       n - off, skb->data + off);
                tp->copied_seq += n - off;
                processed += n - off;
            }
        }

        /*
         * Send ACK to the client and recalculate
         * the appropriate TCP receive buffer space.
         */
        tcp_cleanup_rbuf(sk, processed);
        tcp_rcv_space_adjust(sk);

        /* Release the skb - it's no longer needed. */
        sk_eat_skb(sk, skb, 0);
    }

The function would have to be more complicated to properly handle the skb's paged data and fragments, release the skbs, process TCP sequence numbers more accurately, and so on, but the basic idea should be clear.
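For instance, to handle paged data as mentioned above, one would copy across fragments instead of reading skb->data directly. A hedged sketch using the in-kernel skb_copy_bits() helper (the buffer size and error handling are illustrative only):

```c
#include <linux/skbuff.h>

/* Instead of touching skb->data (which covers only the linear part),
 * copy @len bytes starting at @off into a flat buffer, spanning
 * paged fragments if necessary. */
char buf[1500];
int len = min_t(int, skb->len - off, (int)sizeof(buf));

if (skb_copy_bits(skb, off, buf, len) < 0)
    return; /* malformed skb */
```

This fragment would replace the skb_headlen()/skb->data access inside the queue walk; as kernel-context code it is not runnable standalone.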


UPD: You can find source code of the Linux kernel synchronous socket API at  https://github.com/krizhanovsky/sync_socket .

8 comments:

  1. Thank you for the information.
    IMO something is missing here: The high level objective and solution design.

    If I understand correctly, your proposal is to move the "application" to the kernel and run it in a softirq context, is this correct?

    ReplyDelete
    Replies
    1. Hi,

      yes, I probably should have been more thorough in the post's introduction and described the issue and the proposed solution more verbosely.

      Actually, just moving a socket application to kernel space doesn't help much. In the kernel we still have the socket API, and if you look at any kernel-space proxy (like TUX or khttpd), you'll see something very similar to user-space sockets.

      The solution to the problem is to use the socket callbacks instead of socket API calls to handle packets synchronously in softirq context.

      Delete
    2. Hi,

      Usually, soft interrupt context is very limited in the functionality that can run under it. The "application" will have to be very (very) focused, with small actions. In many cases, an application will want to perform an amount of work which may be too much to put in an interrupt context.

      But this is well applicable to small actions, like offloading calculations.
      In fact, some are already done at such a low level (checksum offloading is the simplest one), even at the HW level (depending on the NIC used).

      In the end, it is the conflict between the cost of delaying work for higher layers (at lower priority) and serializing work into one single context. Both have pros and cons, the solution will usually fit somewhere in between.

      Delete
      Yes, I absolutely agree - deferred interrupts (softirq) are not a place to run complicated logic. We can recall the firewall (which sometimes executes a very large rule set) and XFRM (on top of which IPsec is implemented, and which can run relatively slow encryption/decryption operations on packets). So we actually can do some logic in deferred interrupts. However, if we tried to implement a complicated web application server on top of the TCP callbacks, it would probably behave worse than the traditional scheme with "asynchronous" packet processing.

      So it is worth stressing that only relatively tiny servers should be implemented with the callbacks.

      Delete
  2. Thank you for a great article! It is useful information that is sometimes not so easy to find.

    ReplyDelete
  3. I am trying to find a method for sending an immediate answer to some TCP message (for example, GET /). Can I hook sk_data_ready() with my function and, if I get an uninteresting message, just hand it back to tcp_recvmsg()?
    Won't that break the TCP sequence?

    ReplyDelete
  4. What about writing to the socket? You say: "Writing to the socket is a bit harder". Can you explain how?

    ReplyDelete
  5. Hi Mikhail,

    I'm glad that the article was useful for you.

    Actually, the article is pretty old, but the core of the technology is used in Tempesta FW, where you can find the current state of the network I/O technology. The main code is available here: https://github.com/tempesta-tech/tempesta/blob/master/tempesta_fw/sock.c

    Yes, hooking sk_data_ready() doesn't affect TCP receive queue. In Tempesta's case if we hadn't removed skb from the receive queue at https://github.com/tempesta-tech/tempesta/blob/master/tempesta_fw/sock.c#L799, TCP receive queue wouldn't have been affected.

    To write data to a TCP socket we can't use a callback and have to directly call Linux TCP routines, see ss_do_send() (https://github.com/tempesta-tech/tempesta/blob/master/tempesta_fw/sock.c#L348) for example.

    ReplyDelete