High Performance Linux

Friday, March 29, 2013

What's Wrong With Sockets Performance And How to Fix It

The socket API is a nice thing which allows you to easily write network programs. But sockets have a fundamental problem from the performance point of view: they are asynchronous with respect to network interrupts. And this is true regardless of whether you're using blocking or nonblocking I/O.

Let's consider a multi-threaded application which works with a number of sockets and reads some data from them. Typically, it does the following (pseudo code):

    int n = epoll_wait(e->fd, e->event, e->max_events, 1000);
    for (int i = 0; i < n; ++i) {
        unsigned char buf[4096];
        read(e->event[i].data.fd, buf, 4096);
    }

The polled sockets could be either blocking or non-blocking. Let's forget about the buffer copying for a moment and concentrate on what happens with arriving packets.


The figure depicts two processes (there is no difference between processes and threads in our discussion) which read from three sockets. The processes are working on different CPUs. Presumably Receive Flow Steering (RFS) is used, so packets destined for the first process go to the first CPU and packets for the second process are processed by softirq on the second CPU. Each socket has a receive queue where incoming packets are placed before the reading process consumes them.

If we look at the code sample carefully, we find two system calls, which are relatively slow operations. The process can also be rescheduled and/or preempted between the syscalls. So if the process is woken up in the epoll_wait() call by a socket event (when the socket gets a packet), it reads data from the socket with some delay. There is a bold arrow between the second socket's queue and the first process which depicts reading data from the socket. There are two complications:
  • the process can be preempted by softirq between waking up in epoll_wait() and reading from the socket (however, it's easy to prevent this by binding the process and the NIC interrupts to different cores);
  • during high load Linux switches to polling mode and grabs batches of packets very quickly, so during the delay between the two syscalls softirq can process a lot of other packets.
The problem is that while the process goes to handle the packet, softirq can receive other packets (a lot of packets, actually; see the bold arrow from softirq to the queue of the first socket in the figure). With packet lengths from 64 to 1500 bytes on a common Ethernet link, it's obvious that the packet which the process is reading now may no longer be in the CPU cache. The packet is simply pushed out of the CPU cache by the other packets. Thus, even with zero-copy networking, user-space applications cannot achieve good performance.

In fact, the Linux firewall works in softirq context. It means that a packet is processed synchronously, immediately when it is received. Moreover, synchronous packet processing is not limited to network-level operations (at which the firewall works). Fortunately, Linux also assembles the TCP stream in softirq context. The Linux kernel also provides a few callbacks in struct sock (see include/net/sock.h):

    void (*sk_state_change)(struct sock *sk);
    void (*sk_data_ready)(struct sock *sk, int bytes);
    void (*sk_write_space)(struct sock *sk);
    void (*sk_error_report)(struct sock *sk);
    int  (*sk_backlog_rcv)(struct sock *sk, struct sk_buff *skb);


For example, sk_data_ready() is called when new data is received on the socket. So it is simple to read TCP data synchronously in deferred interrupt context. Writing to the socket is a bit harder, but still possible. Of course, your application must live in the kernel now.
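As a sketch of the "writing is harder" part: from kernel context you can't use the user-space send(2), but you can call kernel_sendmsg() directly on the struct socket. This is a hedged illustration of one possible approach, not the exact code from the article; th_send() is a hypothetical helper:

```c
#include <linux/net.h>
#include <linux/socket.h>
#include <linux/uio.h>

/* Hypothetical helper: send @len bytes of kernel memory over @sock.
 * kernel_sendmsg() wraps the kvec and dispatches it to the socket's
 * sendmsg operation, so no user-space buffers are involved. */
static int
th_send(struct socket *sock, void *data, size_t len)
{
    struct kvec vec = { .iov_base = data, .iov_len = len };
    struct msghdr msg = { .msg_flags = MSG_DONTWAIT };

    return kernel_sendmsg(sock, &msg, &vec, 1, len);
}
```

Since this runs in kernel context, it is a module fragment rather than a runnable program.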

Let's have a look at a simple example of how to use the hooks for TCP data reading. First of all, we need a listening socket (these are kernel sockets, so the socket API is different):

    struct socket *l_sock;

    sock_create_kern(AF_INET, SOCK_STREAM, IPPROTO_TCP, &l_sock);
    inet_sk(l_sock->sk)->freebind = 1;

    /* addr is some address packed into struct sockaddr_in */
    l_sock->ops->bind(l_sock, (struct sockaddr *)&addr,
                      sizeof(addr));

    l_sock->sk->sk_state_change = th_tcp_state_change;

    l_sock->ops->listen(l_sock, 100);


sk_state_change() is called by the Linux TCP code when the socket state changes. We need the socket of a newly established connection, so we need to handle the transition to TCP_ESTABLISHED. TCP_ESTABLISHED will be set on the child socket of course, but we set the callback on the listening socket because the child socket inherits the callback pointers from its parent. th_tcp_state_change() can be defined as:

    void
    th_tcp_state_change(struct sock *sk)
    {
        if (sk->sk_state == TCP_ESTABLISHED)
            sk->sk_data_ready = th_tcp_data_ready;
    }

Here we set another callback, this time for the child socket. th_tcp_data_ready() is called when new data is available in the socket receive queue (sk_receive_queue). So in the function we need to do what the standard Linux tcp_recvmsg() does: traverse the queue and pick packets with the appropriate sequence numbers from it:

    void
    th_tcp_data_ready(struct sock *sk, int bytes)
    {
        unsigned int processed = 0, off;
        struct sk_buff *skb, *tmp;
        struct tcp_sock *tp = tcp_sk(sk);

        skb_queue_walk_safe(&sk->sk_receive_queue, skb, tmp) {
            off = tp->copied_seq - TCP_SKB_CB(skb)->seq;
            if (tcp_hdr(skb)->syn)
                off--;
            if (off < skb->len) {
                int n = skb_headlen(skb);

                printk(KERN_INFO "Received: %.*s\n",
                       n - off, skb->data + off);
                tp->copied_seq += n - off;
                processed += n - off;
            }
        }

        /*
         * Send ACK to the client and recalculate
         * the appropriate TCP receive buffer space.
         */
        tcp_cleanup_rbuf(sk, processed);
        tcp_rcv_space_adjust(sk);

        /* Release the skb - it's no longer needed. */
        sk_eat_skb(sk, skb, 0);
    }

The function would have to be more complicated to properly handle the skb's paged data and fragments, release the skbs, process TCP sequence numbers more accurately, and so on, but the basic idea should be clear.
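For instance, to handle paged data as mentioned above, one would copy across fragments instead of reading skb->data directly. A hedged sketch using the in-kernel skb_copy_bits() helper (the buffer size and error handling are illustrative only):

```c
#include <linux/skbuff.h>

/* Instead of touching skb->data (which covers only the linear part),
 * copy @len bytes starting at @off into a flat buffer, spanning
 * paged fragments if necessary. */
char buf[1500];
int len = min_t(int, skb->len - off, (int)sizeof(buf));

if (skb_copy_bits(skb, off, buf, len) < 0)
    return; /* malformed skb */
```

This fragment would replace the skb_headlen()/skb->data access inside the queue walk; as kernel-context code it is not runnable standalone.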


UPD: You can find source code of the Linux kernel synchronous socket API at  https://github.com/krizhanovsky/sync_socket .

8 comments:

  1. Thank you for the information.
    IMO something is missing here: The high level objective and solution design.

    If I understand correctly, your proposal is to move the "application" to the kernel and run it in a softirq context, is this correct?

    ReplyDelete
    Replies
    1. Hi,

      yes, I probably should have been more thorough in the post's introduction and described the issue and the proposed solution more verbosely.

      Actually, just moving a socket application to kernel space doesn't help much. In the kernel we still have the socket API, and if you look at any kernel-space proxy (like TUX or khttpd), you'll see something very similar to user-space sockets.

      The solution to the problem is to use the socket callbacks instead of socket API calls to handle packets synchronously in softirq context.

      Delete
    2. Hi,

      Usually, soft interrupt context is very limited in the functionality that can run under it. The "application" will have to be very (very) focused, with small actions. In many cases, an application will want to perform an amount of work which may be too much to put in an interrupt context.

      But this is well applicable to small actions, like offloading calculations.
      In fact, some are already done at such a low level (checksum offloading is the simplest one), even at the HW level (depending on the NIC used).

      In the end, it is the conflict between the cost of delaying work for higher layers (at lower priority) and serializing work into one single context. Both have pros and cons, the solution will usually fit somewhere in between.

      Delete
      Yes, I absolutely agree - deferred interrupts (softirq) are not a place to run complicated logic. We can recall the firewall (which sometimes executes a very large rule set) and XFRM (on top of which IPsec is implemented, and which can run relatively slow encryption/decryption operations on packets). So we actually can do some logic in deferred interrupts. However, if we tried to implement a complicated web application server on top of the TCP callbacks, it would probably behave worse than the traditional scheme with "asynchronous" packet processing.

      So it is worth stressing that only relatively tiny servers should be implemented with the callbacks.

      Delete
  2. Thank you for a great article! It is useful information that is sometimes not so easy to find.

    ReplyDelete
  3. I am trying to find a method for sending an immediate answer to some TCP message (for example, GET /). Can I hook sk_data_ready() with my function and, if I get an uninteresting message, just hand it back to tcp_recvmsg()?
    Won't that break the TCP sequence?

    ReplyDelete
  4. What about writing to the socket? You say: "Writing to the socket is a bit harder". Can you explain how?

    ReplyDelete
  5. Hi Mikhail,

    I'm glad that the article was useful for you.

    Actually, the article is pretty old, but the core of the technology is used in Tempesta FW, where you can find the current state of the network I/O technology. The main code is available here: https://github.com/tempesta-tech/tempesta/blob/master/tempesta_fw/sock.c

    Yes, hooking sk_data_ready() doesn't affect TCP receive queue. In Tempesta's case if we hadn't removed skb from the receive queue at https://github.com/tempesta-tech/tempesta/blob/master/tempesta_fw/sock.c#L799, TCP receive queue wouldn't have been affected.

    To write data to a TCP socket we can't use a callback and have to directly call Linux TCP routines, see ss_do_send() (https://github.com/tempesta-tech/tempesta/blob/master/tempesta_fw/sock.c#L348) for example.

    ReplyDelete