High Performance Linux

Wednesday, December 14, 2011

Linux Zero Copy Output

Recently we implemented a high-performance network traffic processing server. It uses the splice(2) and vmsplice(2) Linux system calls, which provide zero-copy data transfers between file descriptors and user space. On modern Linux kernels this only makes a difference for network output, since network input is still implemented with the usual copy_to_user() call.
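As a rough illustration of how the two calls fit together on the output path, here is a minimal sketch in C: user pages are attached to a pipe with vmsplice() and then moved from the pipe to the output descriptor with splice(). The helper name zc_send and its error handling are assumptions made for this illustration; this is not the code of our server or of Jens's tests.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Hypothetical helper: send len bytes from buf to out_fd through the
 * pipe pfd[] without copying the data into the kernel.
 * Returns 0 on success and -1 on failure.
 * The caller must not modify buf while the kernel may still use the pages.
 */
static int
zc_send(int pfd[2], int out_fd, const void *buf, size_t len)
{
    struct iovec iov;

    iov.iov_base = (void *)buf;
    iov.iov_len = len;

    while (iov.iov_len > 0) {
        /* Attach the user pages to the pipe; only a part may fit. */
        ssize_t n = vmsplice(pfd[1], &iov, 1, 0);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        if (n == 0)
            return -1;
        iov.iov_base = (char *)iov.iov_base + n;
        iov.iov_len -= (size_t)n;

        /* Move everything just queued from the pipe to the output. */
        while (n > 0) {
            ssize_t m = splice(pfd[0], NULL, out_fd, NULL, (size_t)n,
                               SPLICE_F_MOVE);
            if (m < 0) {
                if (errno == EINTR)
                    continue;
                return -1;
            }
            if (m == 0)
                return -1;
            n -= m;
        }
    }
    return 0;
}
```

The pipe is created once with pipe(2) and reused for every call; out_fd would normally be a connected socket.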

Before starting with the system calls, I ran the tests provided by Jens Axboe (the initial developer of splice) on two servers and got a 3.7 times performance improvement in sender sys time (sys=6864 versus sys=25497). The results are below (please consider only the sys time of the xmit (sender) programs, since the data copy is performed in kernel space). The large real time is due to TCP buffer overflow, so the real bottleneck is network throughput (even for the loopback interface). Still, these calls can have a noticeable performance impact when large data transfers coincide with heavy local CPU usage (e.g. a replication process running in parallel with a heavy local read load).

Splice() output:

# ./nettest/xmit -s65536 -p1000000  127.0.0.1 5500
opt packets=1000000
use port 5500
xmit: msg=64kb, packets=1000000 vmsplice() -> splice()
Connecting to 127.0.0.1/5500
usr=9259, sys=6864, real=27973

# ./nettest/recv -s65536 -r 5500
recv: msg=64kb, recv()
Waiting for connect...
Got connect!
usr=219, sys=27746, real=27973


Sendmsg() output:

# ./nettest/xmit -s65536 -p1000000 -n 127.0.0.1 5500
opt packets=1000000
use normal io
use port 5500
xmit: msg=64kb, packets=1000000 send()
Connecting to 127.0.0.1/5500
usr=8762, sys=25497, real=34261

# ./nettest/recv -s65536 -r 5500
recv: msg=64kb, recv()
Waiting for connect...
Got connect!
usr=198, sys=30100, real=34261 
 
Usage examples can be found in Jens's test code. This is really fast. However, the one big drawback of the technique is the need for double buffering. The data pages passed to the kernel by vmsplice() are used directly by the network stack (by the way, you can use these syscalls not only with TCP but also with SCTP and UDP, since the kernel implementation is generic), so you cannot reuse the pages until the kernel has completely sent them to the network. When does that happen? No later than when you have written twice the size of the network output buffer. Thus you can reuse the pages only after you have sent data amounting to double the network output buffer size. In practice this means that you need a special memory (page) allocator.
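The reuse rule above hints at what such an allocator has to track. Below is a minimal sketch in C, assuming a small pool of equal-sized buffers and taking the "twice the output buffer" bound at face value. The names zc_pool, zc_pool_get and zc_pool_submit are hypothetical, and sndbuf is whatever output buffer size the application has determined for its socket.

```c
#include <assert.h>
#include <stddef.h>

#define ZC_SLOTS 4  /* assumption: a small fixed pool of equal-sized buffers */

struct zc_pool {
    unsigned long long sndbuf;        /* network output buffer size, bytes */
    unsigned long long sent;          /* total bytes submitted so far */
    unsigned long long end[ZC_SLOTS]; /* 'sent' right after slot i was submitted */
    int in_flight[ZC_SLOTS];          /* non-zero while the kernel may use slot i */
};

static void
zc_pool_init(struct zc_pool *p, size_t sndbuf)
{
    int i;

    p->sndbuf = sndbuf;
    p->sent = 0;
    for (i = 0; i < ZC_SLOTS; ++i) {
        p->end[i] = 0;
        p->in_flight[i] = 0;
    }
}

/*
 * Return the index of a buffer that is safe to fill, or -1 if every
 * buffer may still be referenced by the network stack.
 */
static int
zc_pool_get(struct zc_pool *p)
{
    int i;

    for (i = 0; i < ZC_SLOTS; ++i) {
        /* Release a slot once 2 * sndbuf bytes were written after it. */
        if (p->in_flight[i] && p->sent - p->end[i] >= 2 * p->sndbuf)
            p->in_flight[i] = 0;
        if (!p->in_flight[i])
            return i;
    }
    return -1;
}

/* Record that len bytes from the given slot were passed to vmsplice(). */
static void
zc_pool_submit(struct zc_pool *p, int slot, size_t len)
{
    p->sent += len;
    p->end[slot] = p->sent;
    p->in_flight[slot] = 1;
}
```

Each zc_pool_get() is expected to be followed by a zc_pool_submit() for the same slot. When zc_pool_get() returns -1, the sender either waits or falls back to an ordinary copying send.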

C macro, goto and other "ugly" stuff

As Stroustrup mentioned in "The Design and Evolution of C++", a general-purpose programming language should be suitable for different users. This means it has to provide a number of features that may be useful to only a small subset of users. It does not mean that complex should be a built-in data type of the language, but fundamental features that cannot be implemented as library extensions should be built in.

An example of such a feature in C++ is the C macro. C macros migrated to C++ only for compatibility with C. The C macro is bad. However, it is very useful in a number of cases and is used in many projects by many people (even in such C++ projects as Boost).
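To make "very useful" a bit more concrete, here is a small sketch of things a macro can do that a function cannot: work on the array type itself, turn source text into a string, and capture the location of the call site. The macro names are chosen for this illustration only.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Number of elements of a real array (not of a pointer). */
#define ARRAY_SIZE(a)   (sizeof(a) / sizeof((a)[0]))

/* Two-step stringification, so that macro arguments are expanded first. */
#define STR_(x)         #x
#define STR(x)          STR_(x)

/* "file:line" of the place where the macro is used. */
#define HERE            __FILE__ ":" STR(__LINE__)

/* Log a message together with the location of the caller. */
#define LOG(msg)        fprintf(stderr, "%s: %s\n", HERE, (msg))

static int
sum_table(void)
{
    static const int table[] = { 1, 2, 3, 4 };
    size_t i;
    int sum = 0;

    /* The loop bound follows the initializer automatically. */
    for (i = 0; i < ARRAY_SIZE(table); ++i)
        sum += table[i];
    LOG("table summed");
    return sum;
}
```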

I just came across one more link about C macros in the D programming language. It quickly reminded me of the hot debates about the goto operator: actually useful in real life, but maybe not so beautiful. It greatly helps to keep code clean (e.g. by moving error handling code to the end of a function in a C program). I think of the Linux and FreeBSD kernels and OpenSSL as examples of great coding style in plain C, and they use goto and macros regularly. On the other hand, constructions like

do {
    /* do something */
    break;
    /* do something else */
    break;
    /* ........ */
} while (0);

look much uglier and more confusing.
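For contrast, the goto style mentioned above, with error handling moved to the end of the function, can be sketched as follows. The function read_file is a hypothetical example written for this post, not code taken from any of the projects named.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical example: read a whole file into a NUL-terminated buffer.
 * Returns the buffer (to be released with free()) or NULL on any error.
 */
static char *
read_file(const char *path, size_t *len_out)
{
    char *buf = NULL;
    long size;
    FILE *f = fopen(path, "rb");

    if (!f)
        goto err;
    if (fseek(f, 0, SEEK_END) != 0)
        goto err_close;
    size = ftell(f);
    if (size < 0)
        goto err_close;
    rewind(f);

    buf = malloc((size_t)size + 1);
    if (!buf)
        goto err_close;
    if (fread(buf, 1, (size_t)size, f) != (size_t)size)
        goto err_free;
    buf[size] = '\0';

    fclose(f);
    *len_out = (size_t)size;
    return buf;

    /* Error path: undo the steps in reverse order of acquisition. */
err_free:
    free(buf);
err_close:
    fclose(f);
err:
    return NULL;
}
```

The main path reads top to bottom without nesting, and each label releases exactly what had been acquired before the jump.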

I believe such "ugly" but very useful things as goto and macros should stay in modern programming languages. If some people do not like them, they can simply not use them. For instance, C++ exceptions are still frequently criticized, so some projects simply do not use them.

Tuesday, October 18, 2011

How to debug memory corruptions on old Solaris 8

There is no problem coping with memory corruption on modern UNIX thanks to such great tools as valgrind. But it can be a hard task if we're running an old OS like Solaris 8. In that case it's possible to use the following approach (a proof-of-concept program is below): just remove write permission from the pages that hold the data being corrupted and get SIGSEGV on the actual write to that data. This way we fail exactly at the first occurrence of data corruption instead of failing later when the wrong data is used.

Please keep in mind that mprotect() operates only with page granularity, so the page that holds the data of interest may also hold other data, and you may get a segmentation fault on a legitimate write to that other data instead of on the corruption you are looking for. In such cases we recommend allocating the debugged data on its own page(s), which are then protected.
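Allocating the debugged data on its own pages could look roughly like the sketch below. It assumes that anonymous mmap() is available (older systems may need to map /dev/zero or use a page-aligned allocator instead), and all helper names are hypothetical. The last helper only demonstrates the effect by performing the write in a forked child.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Round len up to a whole number of pages. */
static size_t
round_to_pages(size_t len)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);

    return (len + page - 1) / page * page;
}

/* Allocate len bytes on pages that hold nothing else. NULL on failure. */
static void *
guarded_alloc(size_t len)
{
    void *p = mmap(NULL, round_to_pages(len), PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    return p == MAP_FAILED ? NULL : p;
}

/* Make the object read-only (writable == 0) or writable again. */
static int
guarded_protect(void *p, size_t len, int writable)
{
    return mprotect(p, round_to_pages(len),
                    writable ? PROT_READ | PROT_WRITE : PROT_READ);
}

/*
 * Demonstration only: write one byte in a forked child.
 * Returns 1 if the child was killed by the write, 0 if the write
 * succeeded, -1 if fork() or waitpid() failed.
 */
static int
write_faults(volatile char *p)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        signal(SIGSEGV, SIG_DFL);
        *p = 0x12;
        _exit(0);
    }
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    /* Some systems may report SIGBUS instead of SIGSEGV. */
    return WIFSIGNALED(status) &&
           (WTERMSIG(status) == SIGSEGV || WTERMSIG(status) == SIGBUS);
}
```

With such an allocation, any fault inside the protected range can only come from a write to the debugged object itself.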

Also pay attention to the pstack trick, which can be very useful in a number of other cases.

Here is an example of how to do the trick, with sample output in the top comment:

/*
 * Debug memory corruptions with mprotect(). Could be useful on old UNIX.
 *
 * Compile on Solaris 8 with:
 *  $ CC -g -o mprotect mprotect.c
 *
 * Linux Backtrace output:
 *  write (corrupt) data by 0x607124
 *  segmentation fault at 0x607124
 *  ./mprotect [0x400997]   // signal handler
 *  /lib/libc.so.6 [0x7fcfe99f33a0]
 *  ./mprotect [0x400a5d]   // the culprit!
 *  ./mprotect [0x400b88]
 *  /lib/libc.so.6(__libc_start_main+0xe6) [0x7fcfe99dfa26]
 *  ./mprotect [0x400879]
 *
 * Solaris Backtrace output:
 *  write (corrupt) data by 808650c
 *  segmentation fault at 808650c
 *  3866:   ./mprotect
 *   d0141e75 read     (4, 80894b4, 1400)
 *   d010c25c _filbuf  (8061858, b, d0072a00, d0110846) + d3
 *   d011091c fread    (8046580, 1000, 1, 8061858, 0, 0) + e4
 *   0805113e sigsegv_handler (b, 8047898, 8047698) + 9e
 *   d013d0cf __sighndlr (b, 8047898, 8047698, 80510a0) + f
 *   d01301bf call_user_handler (b) + 2af
 *   d01303ef sigacthandler (b, 8047898, 8047698) + df
 *   --- called from signal handler with signal 11 (SIGSEGV) ---
 *   080511c0 __1cQmemory_corruptor6F_v_ (8047980, d03fc7b4, a, 8061828, d01c0000, 804797c) + 50
 *             ^^^^^^^^^^^^^^^^ the culprit!
 *   080513cf main     (1, 80479c4, 80479cc, 8050f80) + 1ff
 *   0805100d _start   (1, 8047ae0, 0, 8047aeb, 8047b28, 8047b4c) + 7d
 *
 * GDB core dump backtrace (Linux):
 *  Program terminated with signal 11, Segmentation fault.
 *  #0  0x0000000000400a0d in memory_corruptor () at mprotect.c:111
 *  68                      data[PAGE_SIZE * 5 + i] = 0x12;
 */
// #include <execinfo.h> // Only applicable for modern UNIX
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

#define PAGE_SIZE   getpagesize()
#define PAGE_MASK   (~(PAGE_SIZE - 1))
#define SIZE        (PAGE_SIZE * 8)

char *data;

extern "C" {
void
sigsegv_handler(int sig, siginfo_t *si, void *ctx) /* ctx is unused */
{
    int i, ret = 1;
    char **btrace;

    printf("segmentation fault at %p\n", si->si_addr);
    fflush(NULL);

    /*
     * The ugly way to print stack on old Solaris.
     */
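    char cmd[64], stack[4096];
    size_t n;
    FILE *f;

    snprintf(cmd, sizeof(cmd), "pstack %ld", (long)getpid());
    f = popen(cmd, "r");
    if (!f) {
        perror("popen");
        goto out;
    }
    /*
     * Read whatever pstack printed (usually less than the buffer size)
     * and print it as plain data, never as a format string.
     */
    n = fread(stack, 1, sizeof(stack) - 1, f);
    stack[n] = '\0';
    fputs(stack, stdout);
    pclose(f);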

    /*
     * The better way to do it on modern UNIX (Solaris, Linux, FreeBSD).
     */
#if 0
    void *trace_addrs[200];
    int n_addr = backtrace(trace_addrs,
            sizeof(trace_addrs) / sizeof(trace_addrs[0]));
    if (!n_addr || n_addr == 200) {
        perror("backtrace");
        ret = 2;
        goto out;
    }
    btrace = backtrace_symbols(trace_addrs, n_addr);
    if (!btrace) {
        perror("backtrace_symbols");
        ret = 2;
        goto out;
    }

    for (i = 0; i < n_addr; ++i)
        printf("%s\n", btrace[i]);
    free(btrace);
#endif
out:
    fflush(NULL);
    _exit(ret);
}
}

void
memory_corruptor(void)
{
    int i;

    printf("write (corrupt) data by %p\n", data + PAGE_SIZE * 5 + 100);
    for (i = 100; i < 150; ++i)
        data[PAGE_SIZE * 5 + i] = 0x12;
}

int
main(int argc, char *argv[])
{
    int i;

    struct sigaction sa;
    sigemptyset(&sa.sa_mask);
    sigaddset(&sa.sa_mask, SIGSEGV);
    sa.sa_flags = SA_SIGINFO;
    sa.sa_sigaction = sigsegv_handler;
    sigaction(SIGSEGV, &sa, NULL);

    data = (char*)malloc(SIZE);
    if (!data) {
        perror("malloc");
        exit(1);
    }
    for (i = 0; i < SIZE; ++i)
        data[i] = 0x0a;

    printf("Mapped (%p). Press any key...\n", data);
    fflush(NULL);
    getchar();

    /* That's ok to write into memory without protection. */
    memory_corruptor();

    /*
     * Set memory protection to catch memory writes.
     * Usually only PROT_READ should be set.
     */
    if (mprotect((char*)((long)(data + PAGE_SIZE * 4) & PAGE_MASK),
                PAGE_SIZE * 4, PROT_READ|PROT_EXEC))
    {
        perror("mprotect");
        exit(1);
    }
    printf("Protected (%p). Press any key...\n",
            (void*)((long)(data + PAGE_SIZE * 4) & PAGE_MASK));
    fflush(NULL);
    getchar();

    memory_corruptor();

    return 0;
}

Wednesday, October 5, 2011

Speaking at HighLoad 2011

Yesterday I spoke at HighLoad.

In the presentation (in Russian) I concentrated on atomic operations, lock-free data structures, Linux zero-copy network I/O and CPU binding. We have gained great experience implementing this cool stuff in our current project (high-performance clustering software to process Cisco RDRv1 traffic) for Video International, and I was pleased to share the basic principles of developing high-performance Linux server software.