High Performance Linux

Wednesday, December 14, 2011

Linux Zero Copy Output

Recently we implemented a high-performance network traffic processing server. It uses the splice(2)/vmsplice(2) Linux system calls, which provide zero-copy data transfers between file descriptors and user space. On modern Linux kernels this only makes a difference for network output, since network input is still implemented with a common copy_to_user() call.

Before starting with the system calls I ran the tests provided by Jens Axboe (the original developer of splice) on two servers and got a 3.7x performance improvement. The results are below (consider only the sys time of the xmit (sender) programs, since the data copy is performed in kernel space). The large real time is due to TCP buffer overflow, so the real bottleneck is network throughput (even for the loopback interface). Still, these calls can have a noticeable performance impact on large data transfers with heavy local CPU usage (e.g. a replication process running in parallel with a heavy local read load).

Splice() output:

# ./nettest/xmit -s65536 -p1000000  127.0.0.1 5500
opt packets=1000000
use port 5500
xmit: msg=64kb, packets=1000000 vmsplice() -> splice()
Connecting to 127.0.0.1/5500
usr=9259, sys=6864, real=27973

# ./nettest/recv -s65536 -r 5500
recv: msg=64kb, recv()
Waiting for connect...
Got connect!
usr=219, sys=27746, real=27973


Sendmsg() output:

# ./nettest/xmit -s65536 -p1000000 -n 127.0.0.1 5500
opt packets=1000000
use normal io
use port 5500
xmit: msg=64kb, packets=1000000 send()
Connecting to 127.0.0.1/5500
usr=8762, sys=25497, real=34261

# ./nettest/recv -s65536 -r 5500
recv: msg=64kb, recv()
Waiting for connect...
Got connect!
usr=198, sys=30100, real=34261 
 
Usage examples can be found in Jens's test code. This is really fast. However, the one big drawback of the technique is the need for double buffering. The data pages sent to the kernel by vmsplice() are used directly by the network stack (by the way, you can use the syscalls not only with TCP, but also with SCTP and UDP, since the kernel implementation is generic), so you cannot reuse the pages until the kernel has completely sent them to the network. When does that happen? No later than after you have written twice the size of the network output buffer. Thus only after sending twice the network output buffer size of data can you safely reuse the pages. In practice this means you need a special memory (page) allocator.

C macro, goto and other "ugly" stuff

As Stroustrup mentioned in "The Design and Evolution of C++", a general-purpose programming language must be suitable for different kinds of users. This means it has to provide a number of features which may be useful only to a small subset of users. It does not mean that complex numbers should be a built-in language data type, but fundamental features which cannot be implemented as library extensions should be.

An example of such a feature in C++ is the C macro. C macros migrated to C++ only for compatibility with C. C macros are bad. However, they are very useful in a number of cases and are used in many projects by many people (even in such C++ projects as Boost).

I just came across one more link about C macros: C macro in D programming language. It quickly reminded me of the hot debates about the GOTO operator: genuinely useful in real life, though maybe not so beautiful. It greatly helps to keep code clean (e.g. by moving error handling code to the end of a function in a C program). I think of the Linux and FreeBSD kernels and OpenSSL as examples of great coding style in plain C, and they use goto and macros regularly. On the other hand, constructions like

do {
    /* do something */
    break;
    /* do something else */
    break;
    /* ........ */
} while (0)

look much uglier and more confusing.

I believe such "ugly" but very useful things as goto and macros should stay in modern programming languages. If some people do not like them, they can simply not use them. For instance, C++ exceptions are still frequently criticized, so some projects simply do not use them.