High Performance Linux

Sunday, June 5, 2016

Tempesta FW: First Installation

Dear colleagues, I haven't written posts in this blog for a long time. That's because we were busy releasing Tempesta FW. Now I'm happy to announce that http://tempesta-tech.com is from now on accelerated by Tempesta FW. This is the first installation of our Web accelerator. It was such a long way!

Now the site is powered by Apache HTTPD behind Tempesta FW. However, we'll shortly implement Web-server features as well, so sites will be able to run fully under Tempesta FW.

Stay tuned!

Sunday, April 17, 2016

x86-64 Wastes TLB Entries

Basically, the TLB caches page table translations. We need a page table because a contiguous virtual memory area can in fact consist of many physical fragments. So the page table is used to map the physical memory fragments, pages, to some virtual address space. x86-64 has a 4-level page table, so if you access a page, you in fact perform 5 memory accesses (4 to walk the page table and resolve the virtual address, and the last one is your actual access). To mitigate the performance overhead caused by page table translations, the TLB caches virtual-to-physical address translations.

Surprisingly, when the OS is loaded it maps the whole physical memory to a single virtual address range. It's important that both the physical and the virtual address spaces are contiguous here. The OS kernel works with exactly this virtual address space. For example, kmalloc() in the Linux kernel returns a pointer from this virtual address space. However, vmalloc() maps physical pages to a different virtual address range, named the vmalloc area.
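
As a quick illustration, here is a hedged kernel-module sketch (hypothetical code, not from the Tempesta sources): kmalloc() returns an address from the direct mapping, so __pa() resolves it with simple arithmetic, while vmalloc() returns an address from the separate vmalloc area, for which __pa() is not valid:

    #include <linux/module.h>
    #include <linux/slab.h>
    #include <linux/vmalloc.h>
    #include <linux/mm.h>

    static int __init addr_demo_init(void)
    {
        void *k = kmalloc(64, GFP_KERNEL);   /* direct mapping */
        void *v = vmalloc(PAGE_SIZE);        /* vmalloc area */

        if (k)
            pr_info("kmalloc: va=%p pa=%lx\n", k, (unsigned long)__pa(k));
        if (v)
            pr_info("vmalloc: va=%p (separate area, __pa() not valid here)\n", v);

        kfree(k);
        vfree(v);
        return 0;
    }

    static void __exit addr_demo_exit(void) { }

    module_init(addr_demo_init);
    module_exit(addr_demo_exit);
    MODULE_LICENSE("GPL");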

Thus kernel addresses can be resolved by simple offsets (see linux/arch/x86/include/asm/page_64.h):

    static inline unsigned long
    __phys_addr_nodebug(unsigned long x)
    {
        unsigned long y = x - __START_KERNEL_map;

        /* use the carry flag to determine if
           x was < __START_KERNEL_map */
        x = y + ((x > y)
                 ? phys_base
                 : (__START_KERNEL_map - PAGE_OFFSET));    

        return x;
    }

Since virtual to physical address translation is trivial here, why on earth do we need to go through the page table and waste invaluable TLB entries for these translations? However, there is nothing like MIPS's directly mapped kernel space segments on x86-64. The sad story about x86-64 is that these trivial mappings waste TLB entries and require extra memory transfers. The only thing which x86-64 does to optimize TLB usage is global pages (used, for example, for kernel space mappings), which are never invalidated in the TLB on context switches. But still, when you switch to the kernel, the kernel mappings evict your user-space mappings from the TLB. Meantime, if your application is memory greedy, then syscalls can take a long time due to TLB cache misses.

Wednesday, September 30, 2015

Fast Memory Pool Allocators: Boost, Nginx & Tempesta FW

Memory Pools and Region-based Memory Management

Memory pools and region-based memory management allow you to improve your program's performance by avoiding unnecessary memory freeing calls. Moreover, pure memory pools gain even more performance thanks to simpler internal mechanisms. The techniques are widely used in Web servers, where they let you write code like the following (pseudo code for an imaginary Web server):

    http_req->pool = pool_create();
    while (read_request & parse_request) {
        http_req->hdr[i] = pool_alloc(http_req->pool);
        // Do other stuff, don't free allocated memory.
    }
    // Send the request.
    // Now destroy all allocated items at once.
    pool_destroy(http_req->pool);

This reduces the number of memory allocator calls, simplifies the allocator's internals (e.g. since you don't free individual memory chunks, it doesn't need to care about memory fragmentation and many other problems which common memory allocators must solve at a relatively high cost) and makes the program run faster.

Probably you've noticed that we call pool_alloc() in the example above without specifying an allocation size. And here we come to the difference between memory pools and region-based memory management: memory pools can allocate memory chunks of only one fixed size, while region-based memory allocators are able to allocate chunks of different sizes. Meantime, both of them allow you not to care about freeing memory and to just drop all the allocated chunks at once.
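
To make the difference concrete, here are hypothetical prototypes for the two kinds of allocators (just an illustration of the signatures, not the API of any particular library):

    #include <stddef.h>

    /* Pure memory pool: the chunk size is fixed at creation time,
     * so pool_alloc() takes no size argument. */
    typedef struct pool pool_t;
    pool_t *pool_create(size_t chunk_size);
    void *pool_alloc(pool_t *p);
    void pool_destroy(pool_t *p);           /* drops all chunks at once */

    /* Region-based allocator: chunks of arbitrary sizes,
     * but the whole region is still destroyed at once. */
    typedef struct region region_t;
    region_t *region_create(void);
    void *region_alloc(region_t *r, size_t size);
    void region_destroy(region_t *r);       /* drops all chunks at once */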

The Boost C++ library provides a memory pool library, boost::pool; let's have a closer look at it. By the way, there is a nice description of memory pool concepts.


Boost::pool

If we run the following loop for N = 20,000,000 (this is the body of a template function with parameter T; you can find the full source code of the benchmark in our repository on GitHub):

    boost::object_pool<T> p;

    for (size_t i = 0; i < N * sizeof(Small) / sizeof(T); ++i) {
        T *o = p.malloc();
        touch_obj(o);
    }

then it takes about 260ms when T is 24 bytes in size (this is the size of structure Small) and about 170ms for an object of size 240 bytes (the size of structure Big) on my Intel Core i7-4650U 1.70GHz. Please note that I don't measure pool construction and destruction times, to concentrate on the memory management techniques themselves without additional expenses. Also, I limit the number of allocations by

    N * sizeof(Small) / sizeof(T)

to avoid exhausting memory in the tests with large object sizes.

Hereafter touch_obj() is just a simple macro which checks the result of the memory allocation and breaks the current loop on failure:

    #define touch_obj(o)                                   \
        if (__builtin_expect(!o, 0)) {                     \
                std::cerr << "failed alloc" << std::endl;  \
                break;                                     \
        } else {                                           \
                *(long *)o = 1;                            \
        }

Now if we free each 4th allocated element:

    for (size_t i = 0; i < N * sizeof(Small) / sizeof(T); ++i) {
        T *o = p.malloc();
        touch_obj(o);
        if (__builtin_expect(!(i & 3), 0))
            p.free(o);
    }

the function takes about 165ms and 155ms for T sizes of 24 and 240 bytes respectively. This is almost nothing for big objects, but more than a 50% performance improvement for small objects. Results for the boost::pool benchmark:

    $ ./a.out 
                 small object size:    24
                   big object size:    240

               boost::pool (Small):    258ms
       boost::pool w/ free (Small):    165ms
                 boost::pool (Big):    174ms
         boost::pool w/ free (Big):    154ms


(I take the best numbers from 3 runs of the test.) The loop now contains an extra conditional jump and a free() call, which are surely not free of charge. So why do we get a performance improvement for the version which frees objects? There are two reasons why it's faster:

  1. the main reason is that in the first version of the program we allocate more than 450MB of RAM, so we put serious pressure on the system memory allocator, which must allocate a lot of memory for us, while the second version of the program requires 25% less system memory;
  2. each write operation touches a new memory region, so cache lines are constantly evicted from the CPU cache and new memory is loaded into it. Thus, we get serious CPU cache starvation.

Therefore, periodic or even occasional memory freeing reduces the overall work (while we do more free() calls, the underlying system allocator is called more rarely) and effectively reuses CPU cache lines.

But why doesn't the big-allocations test win from freeing? The short answer is the exponential growth of the memory storage used by boost::pool and the particular numbers (N and sizeof(T)) in the test. The key function of boost::object_pool is pool<UserAllocator>::ordered_malloc_need_resize() (simplified for brevity):

    template <typename UserAllocator>
    void *
    pool<UserAllocator>::ordered_malloc_need_resize()
    {
        size_type POD_size = next_size * partition_size + ...
        char * ptr = (UserAllocator::malloc)(POD_size);
        ....
        if (!max_size)
            next_size <<= 1;
        ....
    }

The function requests memory from the system allocator in exponentially growing steps. If we run strace -e trace=mmap against the test for small objects without freeing, then we can see the following mmap() calls (glibc uses mmap() for large allocations and brk() for smaller ones):

    mmap(NULL, 200704,....) = 0x7fdb428bd000
    mmap(NULL, 397312, ...) = 0x7fdb4285c000
    mmap(NULL, 790528, ...) = 0x7fdb4279b000
    .....
    mmap(NULL, 201330688, ....) = 0x7fdb295e3000
    mmap(NULL, 402657280, ....) = 0x7fdb115e2000

There are 12 calls in total for the benchmark code itself. The benchmark for small objects with freeing produces 11 calls, ending with

    mmap(NULL, 201330688, ....) = 0x7f91ed60b000

So the freeing really reduces the amount of allocated memory and improves performance. However, this is not the case for the big-allocation benchmarks: both benchmarks issue the same number of identical mmap() calls:

    mmap(NULL, 249856, ....) = 0x7f34431be000
    ....
    mmap(NULL, 251662336, ....) = 0x7f3423f50000

I.e. in this particular test the last very large memory allocation is enough to satisfy the benchmark memory requirements even without freeing.

Boost::pool is a pure memory pool which allows you to allocate memory chunks of one fixed size only. This is not the case for the Nginx memory pool, which is able to allocate objects of different sizes.


Nginx Pool

Wikipedia says: "the web server Nginx, use the term memory pool to refer to a group of variable-size allocations which can be later deallocated all at once. This is also known as a region". In fact, Nginx's pool allows you to allocate variable-size memory chunks of up to 4095 bytes on x86-64. You can request larger memory areas, but they're allocated using plain malloc().
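
To illustrate this split (small chunks carved from pool blocks, larger ones falling back to plain malloc()), here is a toy region allocator. This is hypothetical code, not copied from nginx, and it ignores alignment and the bookkeeping of large chunks for brevity:

    #include <stdlib.h>

    #define BLK_DATA_SZ 4096

    struct blk {
        struct blk *next;
        char *last;                 /* first free byte in the block */
        char *end;                  /* end of the block's data */
        char data[BLK_DATA_SZ];
    };

    struct region {
        struct blk *head;
        size_t max;                 /* e.g. 4095: bigger requests bypass the blocks */
    };

    static struct blk *
    blk_new(struct region *r)
    {
        struct blk *b = malloc(sizeof(*b));

        if (!b)
            return NULL;
        b->last = b->data;
        b->end = b->data + BLK_DATA_SZ;
        b->next = r->head;
        r->head = b;
        return b;
    }

    void *
    region_alloc(struct region *r, size_t n)
    {
        struct blk *b = r->head;

        if (n > r->max)
            return malloc(n);       /* large chunks go to plain malloc() */
        if (!b || (size_t)(b->end - b->last) < n)
            if (!(b = blk_new(r)))
                return NULL;
        b->last += n;
        return b->last - n;
    }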

To measure the performance of the Nginx pool I copied some pieces of src/core/ngx_palloc.c from nginx-1.9.5. The benchmark code for small and large memory allocations now looks like the following (you can find the full code here):

    ngx_pool_t *p = ngx_create_pool(PAGE_SIZE);

    for (size_t i = 0; i < N * sizeof(Small) / sizeof(T); ++i) {
        T *o = (T *)ngx_palloc(p, sizeof(T));
        touch_obj(o);
    }

    ngx_destroy_pool(p);

Nginx has ngx_pfree(), but it works for large data blocks only, so small memory chunks aren't freeable and we test periodic memory freeing for large chunks only. Nginx's pool supports different allocation sizes, so let's test a mixed workload with 1/4 big (240B) and 1/4096 huge (8305B) chunks:

    ngx_pool_t *p = ngx_create_pool(PAGE_SIZE);

    for (size_t i = 0; i < N; ++i) {
        if (__builtin_expect(!(i & 0xfff), 0)) {
            Huge *o = (Huge *)ngx_palloc(p, sizeof(*o));
            touch_obj(o);
            if (!(i & 1))
                ngx_pfree(p, o);
        }
        else if (__builtin_expect(!(i & 3), 0)) {
            Big *o = (Big *)ngx_palloc(p, sizeof(*o));
            touch_obj(o);
        }
        else {
            Small *o = (Small *)ngx_palloc(p, sizeof(*o));
            touch_obj(o);
        }
    }

    ngx_destroy_pool(p);

The code above gives us the following results:

                  ngx_pool (Small):    305ms
                    ngx_pool (Big):    132ms
            ngx_pool w/ free (Mix):    542ms

Let's review the Nginx pool implementation and quickly analyze the results:

  1. the implementation is very simple (the ported allocator code takes fewer than 200 lines), without the complex OOP wrapper code which boost::pool has in plenty;
  2. ngx_palloc() scans each memory block at most 4 times, trying to better utilize the allocated memory blocks. There is a small constant factor, but the algorithm is still O(1). If you do a small allocation after a few large allocations, then it's likely that there is some room for the new chunk somewhere in the previously allocated blocks, without requesting a new memory block. This is the cost of the ability to allocate memory chunks of different sizes. Boost::pool just places the current chunk at the end of the allocated data, so it doesn't do any scans at all;
  3. Nginx grows its pool by blocks of fixed size; it doesn't use exponential growth like boost::pool.

Update. Previously I explained the difference between the results for Big and Small allocations by O(n) complexity of ngx_palloc(); in particular, I was confused by this code:

    do {
        m = ngx_align_ptr(p->d.last, NGX_ALIGNMENT);
        if ((size_t)(p->d.end - m) >= size) {
            p->d.last = m + size;
            return m;
        }
        p = p->d.next;
    } while (p);

However, ngx_palloc_block() does

    for (p = pool->current; p->d.next; p = p->d.next) {
        if (p->d.failed++ > 4)
            pool->current = p->d.next;
    }

So when a new block is allocated, all existing blocks increment their failed counters, which means that we could not satisfy the current request from any of them. Thus, we try to allocate a chunk from a block at most 4 times; after that we move pool->current forward and never touch the head of the list again. So typically we have a very long head of the list, which is skipped, and a relatively small tail which is scanned in ngx_palloc().

Thanks to Maxim Dounin and Rinat Ibragimov for pointing out the mistake to me. Meantime, Rinat tried the following change in ngx_palloc_block():

    -    for (p = pool->current; p->d.next; p = p->d.next) {
    -        if (p->d.failed++ > 4)
    -            pool->current = p->d.next;
    -    }
    +    pool->current = p_new;

I.e. he removed the scan altogether. After the change, the benchmark results on his machine changed from

                  ngx_pool (Small):    475ms
                    ngx_pool (Big):    198ms
            ngx_pool w/ free (Mix):    817ms
             ngx_pool cr. & destr.:    161ms

to

                  ngx_pool (Small):    249ms
                    ngx_pool (Big):    163ms
            ngx_pool w/ free (Mix):    585ms
             ngx_pool cr. & destr.:    476ms

So the scans really do degrade performance in the particular workload of the benchmark; however, strictly speaking, the algorithm is still constant time.

The results are somewhat worse than boost::pool's. Does it mean that boost::pool is better? Nope. Nginx is designed to process thousands of HTTP requests per second or more, and each request has a memory pool associated with it. Obviously, a regular HTTP request doesn't have 20 million chunks, headers and so on, so our test is somewhat unnatural for Nginx's pool. To compare Nginx's pool with boost::pool under a more natural workload for Nginx, I did another benchmark:

    for (size_t i = 0; i < N / 100; ++i) {
        ngx_pool_t *p = ngx_create_pool(PAGE_SIZE);
        for (size_t j = 0; j < 100; ++j) {
            if (__builtin_expect(!(i & 3), 0)) {
                Big *o = (Big *)ngx_palloc(p, sizeof(*o));
                touch_obj(o);
            } else {
                Small *o = (Small *)ngx_palloc(p, sizeof(*o));
                touch_obj(o);
            }
        }
        ngx_destroy_pool(p);
    }       

Since boost::pool can't allocate chunks of various sizes, I just ran 1/4 of all the loops with Big and 3/4 with Small object allocations. The results are:

          boost::pool cr. & destr.:    133ms
             ngx_pool cr. & destr.:    98ms

And this time Nginx pool is more than 30% faster than boost::pool.

By the way, the APR library used in the Apache HTTPD server uses mechanisms somewhat similar to the Nginx pool.


TfwPool

In Tempesta FW we tried to make a more flexible pool allocator (actually a region-based memory allocator, but we follow the Apache HTTPD and Nginx developers in the allocator's name), such that it's able to free allocated memory like boost::pool and to allocate variable-size memory chunks like Nginx, while being as fast as possible.

TfwPool uses a stack-like freeing approach. In fact, there are two common cases when we need to free or reallocate the most recently allocated chunks. The first one is very trivial: when you just need a temporary buffer to do some work (e.g. snprintf()).

For example, checking the very short string of an HTTP Host header looks like:

    buf = tfw_pool_alloc(req->pool, len);     
    tfw_str_to_cstr(&field, buf, len);
    if (!tfw_addr_pton(buf, &addr))
        // Process the bad Host header
    tfw_pool_free(req->pool, buf, len);                        

tfw_str_to_cstr() copies a chunked string into the contiguous buf. If you design your application to be purely zero-copy, then such copies are typically not what you like to do. However, zero-copy techniques can be very expensive for small chunked data, when you need to handle a list of small data chunks. So sometimes it's faster to do a small copy rather than do complex things with lists of small chunks. Nevertheless, tfw_str_to_cstr() is considered deprecated and will be replaced with smarter techniques in future versions.

The second case when you typically need to free exactly the last allocated memory chunk is realloc(). Imagine that you need to maintain a descriptor of the string chunks of HTTP message headers, body etc. The descriptor is effectively an array of pointers to the chunks of the message field. Arrays are better than linked lists because of better memory locality, but you must pay with a heavy realloc() call to be able to grow an array. So if we had a very fast method to grow an array, then we could reach very good performance. And there is a way to grow arrays quickly without the memory copies usual for realloc(). It's worth mentioning that if you're reading, for example, an HTTP message body, then you're growing the array of that field's descriptor while other allocated chunks are not touched. So at some point you just execute realloc() many times against the same memory chunk. If the allocator has some room just after that chunk, then it only needs to move its data end pointer forward: no real reallocations or copies are necessary.
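
Here is a minimal sketch of this "grow in place" fast path (hypothetical allocator structures, not the actual TfwPool code):

    #include <stdlib.h>
    #include <string.h>

    struct mini_pool {
        char *d_end;        /* end of the last allocated chunk */
        char *blk_end;      /* end of the current block */
    };

    static void *
    mini_pool_alloc(struct mini_pool *p, size_t n)
    {
        if ((size_t)(p->blk_end - p->d_end) < n)
            return NULL;    /* a real allocator would grab a new block here */
        p->d_end += n;
        return p->d_end - n;
    }

    void *
    mini_pool_realloc(struct mini_pool *p, void *ptr, size_t old_n, size_t new_n)
    {
        /* Fast path: @ptr is the last chunk and the block has room behind it,
         * so we only move the data end pointer - no copy at all. */
        if ((char *)ptr + old_n == p->d_end
            && (char *)ptr + new_n <= p->blk_end)
        {
            p->d_end = (char *)ptr + new_n;
            return ptr;
        }

        /* Slow path: allocate a new chunk and copy the data. */
        void *new_ptr = mini_pool_alloc(p, new_n);
        if (new_ptr)
            memcpy(new_ptr, ptr, old_n);
        return new_ptr;
    }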

Probably you noticed in the previous code examples that I call free() just after alloc(); actually, this is a relatively frequent case which we kept in mind while designing TfwPool.

As in the Nginx case, I just copied the original code into the benchmark with small adjustments. You can find the original allocator code on GitHub. The results for the allocator are shown below:

                  tfw_pool (Small):    279ms
          tfw_pool w/ free (Small):    101ms
                    tfw_pool (Big):    106ms
            tfw_pool w/ free (Big):    50ms
            tfw_pool w/ free (Mix):    107ms
             tfw_pool cr. & destr.:    53ms

Well, it's faster than Nginx's pool, but the Small test is still slower than for boost::pool. Surely, boost::pool shows amazing numbers thanks to its exponential growth. We don't use that technique because our allocator was designed for the Linux kernel, where using large contiguous allocations is not the most efficient way to go.

Our allocator is designed to operate on 4096-byte-aligned memory pages. Page alignment (x86-64 has 4096-byte pages) is good since fewer pages are used, which improves TLB usage. In our benchmark we do the allocations using posix_memalign() with an alignment argument of 4096, and that significantly hurts the benchmark performance...


Performance Issue with posix_memalign(3)

To understand the posix_memalign() issue, let's run the memalign test which compares the performance of 4KB allocations aligned to 16 and to 4096 bytes:

    $ ./memalign_benchmark 
                       16B aligned:    110ms
                      page aligned:    203ms

Page-aligned allocations are almost 2 times slower; that's too expensive. Why does this happen?

strace shows that the program spends more than 90% of its time in the brk() system call:

    $ strace -c ./memalign_benchmark
                       16B aligned:    233ms
                      page aligned:    334ms
    % time     seconds  usecs/call     calls    errors syscall
    ------ ----------- ----------- --------- --------- ----------------
     93.81    0.010849           1     17849           brk
      1.98    0.000229          13        17           mmap
      1.03    0.000119          12        10           mprotect
    ...............

If we run each part of the benchmark independently, then we find the following (first I run the 16-byte alignment test and then the page alignment test):

    $ strace -e trace=brk ./memalign_benchmark 2>&1 |wc -l
    6087

    $ strace -e trace=brk ./memalign_benchmark 2>&1 |wc -l
    11769

The bigger number of brk() calls is expected, since glibc requests more memory from the system only to throw almost half of it away for alignment. So a large alignment argument is very expensive for posix_memalign().

Meantime, the operating system kernel manages memory in pages, which are naturally 4096-byte aligned, so in the kernel you can get properly aligned memory chunks without the alignment overhead. Moreover, when you do malloc() or posix_memalign(), the kernel must allocate a VMA (Virtual Memory Area, the kernel descriptor holding information about an address mapping) for it. Then, when you write to the area, a page fault happens, which allocates a new page and populates it into the current page table. A lot of things happen when we work with memory in user space. The nice thing which glibc does to mitigate the impact of all this complexity is caching the memory requested from the kernel.


Tempesta FW: Native Kernel Implementation

It was also interesting to see how much a regular application can gain from moving to kernel space. First, to demonstrate how slow user-space memory allocation is (when the application has just started and glibc has no cache yet), I made a very simple test:

    static const size_t N = 100 * 1000;
    static void *p_arr[N];

    for (size_t i = 0; i < N; ++i) {
        r = posix_memalign((void **)&p_arr[i], 16, PAGE_SIZE);
        touch_obj(p_arr[i]);
    }
    for (size_t i = 0; i < N; ++i)
        free(p_arr[i]);

It takes about 112ms on my system:

    $ ./memalign_benchmark 
                      alloc & free:    112ms

Now let's try the kernel test; you can find it next to the other tests. You can run the test by loading the kernel module (it's built by the Makefile like the other benchmarks), and dmesg will show you the results (don't forget to remove the module):

    $ sudo insmod pool.ko
    $ sudo rmmod pool
    $ lsmod |grep pool

Firstly, let's have a look at how fast memory allocation is in the kernel:

    $ dmesg |grep 'alloc & free'
    alloc & free:  22ms

This is 5 times faster!

It's very curious how much a regular application can gain from moving to the kernel, so I also ported Nginx's pool to a kernel module. Here are the results:

    ngx_pool (Small):               229ms
    ngx_pool (Big):                 49ms
    ngx_pool w/ free (Mix):         291ms
    ngx_pool cr. & destr.:          150ms

This is 30% faster for small allocations, 270% (!) faster for big objects and 86% faster for the mixed workload... Wait... What happened to our "real life" test with many pools and a low number of allocations? It's more than 50% slower!..

The problem is that when you free a page using the free_pages() interface, some pages go to the per-CPU page cache, but most of them go to the buddy allocator, where they can be coalesced with their buddies to satisfy further requests for large allocations. Meantime, glibc does a good job of aggressively caching allocated memory. So the benchmark causes significant pressure on the Linux buddy system.

TfwPool uses its own lightweight per-CPU page cache, so we get much better results:

    tfw_pool (Small):               95ms
    tfw_pool w/ free (Small):       90ms
    tfw_pool (Big):                 45ms
    tfw_pool w/ free (Big):         35ms
    tfw_pool w/ free (Mix):         99ms
    tfw_pool cr. & destr.:          54ms
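
For illustration, such a per-CPU page cache can be as simple as the following sketch (hypothetical kernel code, not the actual TfwPool implementation; the caller is assumed to run with preemption disabled):

    #include <linux/gfp.h>

    #define PG_CACHE_SZ 256

    /* One instance per CPU, e.g. declared with DEFINE_PER_CPU(). */
    struct pg_cache {
        unsigned long pages[PG_CACHE_SZ];
        unsigned int n;
    };

    static unsigned long
    pg_cache_alloc(struct pg_cache *pc)
    {
        if (pc->n)
            return pc->pages[--pc->n];  /* reuse a cached page, no buddy call */
        return __get_free_pages(GFP_ATOMIC, 0);
    }

    static void
    pg_cache_free(struct pg_cache *pc, unsigned long addr)
    {
        if (pc->n < PG_CACHE_SZ)
            pc->pages[pc->n++] = addr;  /* keep the page for this CPU */
        else
            free_pages(addr, 0);        /* cache is full, return to buddy */
    }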

In short: for small allocations it's 80-170% faster than the fastest user-space allocator (boost::pool) and more than 3 times faster than Nginx's pool; it's 3-5 times faster for big allocations, more than 5 times faster for the mixed and 2 times faster for the "real life" workload in comparison with Nginx. Recall that the allocator was designed for kernel space, so it shows much better results in the Linux kernel than in user space.


Conclusion

As a conclusion, I'd make a couple of points about fast memory pools:

  • while page-aligned memory operations are good due to better TLB usage, they are impractical in user space due to the poor behavior of posix_memalign(); you can use your own page cache to cope with the issue;
  • exponential growth is a good old technique, but people still don't use it for some reason;
  • any memory allocator must solve many problems (time complexity, memory utilization and fragmentation, time to create and destroy the pool, etc.), so it's likely that you can develop a much faster allocator than any existing one for your particular project; there is no ideal memory allocator for all tasks;
  • just moving from user space to the kernel can give your program some performance improvement, but you can also hit performance degradation: user space and the kernel use very different mechanisms, so kernel applications must be designed specifically for kernel space.

Don't you love kernel programming as much as we do?


Update

Rinat Ibragimov added a benchmark for the GLib memory pool. Now the benchmark output looks like this (the best numbers from 4 runs):

                 small object size:    24
                   big object size:    240
                  huge object size:    8305

                mallocfree (Small):    711ms
        mallocfree w/ free (Small):    478ms
                  mallocfree (Big):    319ms
          mallocfree w/ free (Big):    204ms

               GLib slices (Small):    943ms
       GLib slices w/ free (Small):    621ms
                 GLib slices (Big):    227ms
         GLib slices w/ free (Big):    111ms

               boost::pool (Small):    262ms
       boost::pool w/ free (Small):    169ms
                 boost::pool (Big):    181ms
         boost::pool w/ free (Big):    157ms
          boost::pool cr. & destr.:    136ms

                  ngx_pool (Small):    312ms
                    ngx_pool (Big):    134ms
            ngx_pool w/ free (Mix):    554ms
             ngx_pool cr. & destr.:    99ms

                  tfw_pool (Small):    273ms
          tfw_pool w/ free (Small):    102ms
                    tfw_pool (Big):    107ms
            tfw_pool w/ free (Big):    46ms
            tfw_pool w/ free (Mix):    103ms
             tfw_pool cr. & destr.:    53ms


Monday, March 9, 2015

Linux Netlink Mmap: Bulk Data Transfer for Kernel Database

Netlink is a relatively new IPC mechanism in Linux, usually considered a replacement for ioctl(2). The Wikipedia article gives a nice explanation of the subject. Communication over Netlink is very similar to ordinary sockets. So if you transfer data between a user-space program and the Linux kernel, then typically you copy the data.

Linux 3.10 introduced the Netlink mmap interface from Patrick McHardy. The patch set shares a mapped memory region between a user-space program and the kernel in a circular-buffer manner, so you can do huge zero-copy data transfers between user and kernel space. Patrick's presentation motivates the interface with: "Performance doesn't matter much in many cases since most netlink subsystems have very low bandwidth demands. Notable exceptions are nfnetlink_queue, ctnetlink and possibly nfnetlink_log". The patches were merged into the kernel in 2012, but it seems no kernel subsystem uses the interface, and libnl (Netlink Protocol Library Suite) still doesn't implement any API for the feature. Probably, this is just because nobody needs to load tons of firewall rules, packets or statistics to/from the kernel...

Meantime, it is a known fact that embedded databases deliver the best performance for simple workloads due to lower overhead in comparison with client-server databases. However, what if you need top performance for many user-space processes (not threads)? Usually embedded databases serialize access to shared data using file locks, and file operations are surely slow. To solve the problem, process-oriented server software like PostgreSQL, Nginx or Apache allocates a shared memory region and places spinlocks in that area.

What happens if a process holding a spinlock dies? All the others just spin on the lock in vain. So you need to introduce some lock control mechanism which manages the release of locks acquired by dead processes. Ok, you control the locks now, but what's the state of the shared memory region if a process dies while holding the lock and updating some data? We can use something like a write-ahead log (WAL) for safe updates. Who will be responsible for undoing and/or redoing the log? Some designated management process? Maybe so. However, this is a good task for the operating system: it can do, and in fact usually does, all the cleanup if an application process dies abnormally. I/O operations for database persistency are also good candidates to move from the user-space database process to the OS. By the way, there was an old LKML discussion on O_DIRECT, which is used by database developers and considered braindamaged by OS developers.

Furthermore, in our case most of the logic is done in kernel space, but we need to transfer tons of filtering rules, network analytics, logging and other data between user and kernel space. So we have started to develop Tempesta DB, a database shared by the kernel and many user-space processes, but still very fast and reliable. This is where Netlink mmap shines for data transfers.

Patrick wrote nice examples of how to use the interface in user space. Kevin Kaichuan He described the generic Netlink kernel interfaces in his Linux Journal article, but unfortunately after 10 years the article is outdated. During these years Linux has significantly developed its Netlink interfaces, and the amazing thing about Netlink mmap is that the whole zero-copy magic happens in linux/net/netlink/af_netlink.c, so you don't need to bother about skbs with mmaped buffers. You just need to switch on the CONFIG_NETLINK_MMAP kernel configuration option and do almost ordinary Netlink kernel calls. So let's have a look at an example of how to use the new Netlink kernel interface. I hope the example will be useful in addition to Patrick's and Kevin's articles.


Firstly, we need to define NETLINK_TEMPESTA in linux/include/uapi/linux/netlink.h, so a small kernel patch is required. Next, we should register our Netlink hooks:

    netlink_kernel_create(&init_net, NETLINK_TEMPESTA, &tdb_if_nlcfg);


We register the input hook only:

    static struct netlink_kernel_cfg tdb_if_nlcfg = {
        .input  = tdb_if_rcv,
    };


The hook just calls the standard netlink_rcv_skb, which executes our message-processing hook in a loop over all the Netlink messages in the skb:

    static void
    tdb_if_rcv(struct sk_buff *skb)
    {
        netlink_rcv_skb(skb, &tdb_if_proc_msg);
    }


Now we can process our own data structure, TdbMsg, which describes messages passed over Netlink (we still must pass small message descriptors over the standard Linux system call interface, but the message bodies are transferred in a zero-copy fashion):

    static int
    tdb_if_proc_msg(struct sk_buff *skb, struct nlmsghdr *nlh)
    {
        TdbMsg *m;

        if (nlh->nlmsg_len < sizeof(*nlh) + sizeof(TdbMsg)) {
            TDB_ERR("too short netlink msg\n");
            return -EINVAL;
        }

        m = nlmsg_data(nlh);


        {
                struct netlink_dump_control c = {
                        .dump = tdb_if_call_tbl[m->type].dump,
                        .data = m,
                        .min_dump_alloc = NL_FR_SZ / 2,
                };
                return netlink_dump_start(nls, skb, nlh, &c);
        }
    }


Here we again call a standard Linux routine, netlink_dump_start, which does the helper work for message processing and prepares an skb for the kernel's answer. netlink_dump_control describes the message processing handler. min_dump_alloc specifies the number of bytes to allocate for the response skb; we use half of the Netlink mmap frame size, which we defined as 16KB. data is just a pointer to the message received from user space. And dump is the callback to process the message. We look up the callback by the message type, which is used as an index into the callback table. The table is defined as:

    static const struct {
        int (*dump)(struct sk_buff *, struct netlink_callback *);
    } tdb_if_call_tbl[__TDB_MSG_TYPE_MAX] = {
        [TDB_MSG_INFO]    = { .dump = tdb_if_info },
        /* .... */
    };


And finally we define the dump callbacks as:

    static int
    tdb_if_info(struct sk_buff *skb, struct netlink_callback *cb)
    {
        TdbMsg *m;
        struct nlmsghdr *nlh;

        /* Allocate space at response skb. */
        nlh = nlmsg_put(skb, NETLINK_CB(cb->skb).portid,
                        cb->nlh->nlmsg_seq, cb->nlh->nlmsg_type,
                        TDB_NLMSG_MAXSZ, 0);
        m = nlmsg_data(nlh);

        /* Fill in response... */

        return 0; /* end transfer,
                     return skb->len if you have more data to send */
    }

 
You can find a real-life example of user-space Netlink mmap usage in libtdb, and the kernel-side implementation in the Tempesta DB core interface.
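
For completeness, the user-space side of the ring setup looks roughly like the sketch below. This is a hedged example based on the kernel's netlink mmap documentation: it assumes a kernel built with CONFIG_NETLINK_MMAP, headers which export struct nl_mmap_req and the ring socket options, and the NETLINK_TEMPESTA id from the patch above; the actual frame processing loop is omitted.

    #include <stdio.h>
    #include <sys/socket.h>
    #include <sys/mman.h>
    #include <linux/netlink.h>

    #ifndef SOL_NETLINK
    #define SOL_NETLINK 270
    #endif

    int
    main(void)
    {
        /* The ring geometry is just an example: 16KB frames as mentioned above. */
        struct nl_mmap_req req = {
            .nm_block_size = 64 * 1024,
            .nm_block_nr   = 16,
            .nm_frame_size = 16 * 1024,
            .nm_frame_nr   = 16 * (64 * 1024 / (16 * 1024)),
        };
        size_t ring_sz = (size_t)req.nm_block_size * req.nm_block_nr;
        void *rx_ring, *tx_ring;

        int fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_TEMPESTA);
        if (fd < 0)
            return 1;

        if (setsockopt(fd, SOL_NETLINK, NETLINK_RX_RING, &req, sizeof(req))
            || setsockopt(fd, SOL_NETLINK, NETLINK_TX_RING, &req, sizeof(req)))
            return 1;

        /* The RX ring is mapped first, the TX ring follows it. */
        rx_ring = mmap(NULL, ring_sz * 2, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, 0);
        if (rx_ring == MAP_FAILED)
            return 1;
        tx_ring = (char *)rx_ring + ring_sz;

        printf("rings mapped at %p and %p\n", rx_ring, tx_ring);
        /* ... walk the nl_mmap_hdr frames in the rings here ... */
        return 0;
    }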

Sunday, November 23, 2014

Fast Finite State Machine for HTTP Parsing

There is a special type of DDoS attack, application-level DDoS, which is quite hard to combat. The analyzing logic which filters this type of DDoS attack must operate at the HTTP message level. So in most cases the logic is implemented as custom modules for application-layer (nowadays usually user-space) HTTP accelerators. And surely Nginx is the most widespread platform for such solutions. However, common HTTP servers and reverse proxies were not designed for DDoS mitigation; they are simply the wrong tools for this task. One of the reasons is that they are too slow to cope with massive traffic (see my recent paper and presentation for other reasons).

If logging is switched off and all content is in the cache, then the HTTP parser becomes the hottest spot. Simplified perf output for Nginx under a simple DoS is shown below (Nginx's calls begin with the 'ngx' prefix; memcpy and recv are standard GLIBC calls):


      %            symbol name
    1.5719    ngx_http_parse_header_line
    1.0303    ngx_vslprintf
    0.6401    memcpy
    0.5807    recv
    0.5156    ngx_linux_sendfile_chain
    0.4990    ngx_http_limit_req_handler


The next hot spots are linked to complicated application logic (ngx_vslprintf) and I/O.

During Tempesta FW development we have studied several HTTP servers and proxies (Nginx, Apache Traffic Server, Cherokee, node.js, Varnish and userver) and learned that all of them use switch- and/or if-else-driven state machines.

The problem with this approach is that the HTTP parsing code is comparable in size with the L1i cache and processes one character at a time with a significant number of branches. Modern compilers optimize large switch statements into lookup tables that minimize the number of conditional jumps, but branch mispredictions and instruction cache misses still hurt the performance of the state machine. So the method probably has poor performance.

The other well-known approach is a table-driven automaton. However, a simple HTTP parser can have more than 200 states and an alphabet cardinality of 72. That gives 200 x 72 = 14400 bytes for the table, which is about half of the L1d cache of modern microprocessors. So this approach can also be considered inefficient due to its high memory consumption.
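
For reference, the inner loop of a table-driven automaton is trivial; the sketch below uses the state and alphabet sizes estimated above, and the tables themselves are hypothetical (they would be filled by a generator):

    #include <stddef.h>

    #define ST_NUM  200     /* number of states */
    #define ALPH_SZ 72      /* alphabet cardinality */

    static const unsigned char char_class[256];          /* maps a byte to [0, ALPH_SZ) */
    static const unsigned char trans[ST_NUM][ALPH_SZ];   /* the 14400-byte table */

    static unsigned char
    fsm_run(const unsigned char *data, size_t len, unsigned char st)
    {
        size_t i;

        for (i = 0; i < len; ++i)
            st = trans[st][char_class[data[i]]];  /* two table loads per character */

        return st;
    }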

The first obvious alternative to the plain state machine is the Hybrid State Machine (HSM) described in our paper, which combines a very small table with an equally small switch statement. In our case we tried to encode the outgoing transitions of a state with at most 4 character ranges. If a state has more outgoing transitions, then all the transitions beyond those 4 must be encoded in the switch. All actions (like storing HTTP header names and values) must also be performed in the switch. Using this technique we can encode each state with only 16 bytes, i.e. one cache line can contain 4 states. Given this, the approach should significantly improve the data cache hit rate.
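
A possible layout of such a 16-byte state (just an illustration of the encoding described above, not the exact structure from the paper) could be:

    typedef struct {
        struct {
            unsigned char begin;   /* first character of the range */
            unsigned char sub;     /* range length minus one */
            unsigned char st;      /* target state for the range */
        } rng[4];                  /* at most 4 outgoing ranges per state */
        unsigned char flags;       /* e.g. "go to the switch for the rest" */
        unsigned char _pad[3];     /* pad the state to 16 bytes */
    } HsmState;                    /* 4 such states fit into a 64-byte cache line */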

We also know that Ragel generates perfect automatons and combines case labels in a switch statement with direct goto labels (it seems the switch is used to be able to enter the FSM from any state, i.e. to be able to process chunked data). Such automatons have fewer loop iterations and are a bit faster than the traditional one-loop-iteration-per-transition approach. There was a successful attempt to generate simple HTTP parsers using Ragel, but the parsers are limited in functionality.

However, there are also several research papers which say that automaton states are just auxiliary information, and that an automaton can be significantly accelerated if the state information is dropped. So the second interesting opportunity to generate the fastest HTTP parser is to encode the automaton directly using simple goto statements, even without any explicit loop.
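
A toy example of a goto-driven automaton (a hypothetical grammar, only to show the shape of the code: no state variable and no explicit loop, just direct jumps between labeled states):

    #include <ctype.h>

    /* Accepts "<token>:" where <token> is [A-Za-z0-9-]+.
     * Returns 0 on success, -1 if more data is needed, -2 on a parse error. */
    static int
    parse_field_name(const unsigned char *p, const unsigned char *end)
    {
        if (p == end)
            return -1;
        if (!isalnum(*p) && *p != '-')
            return -2;
        ++p;
        goto s_name;

    s_name:
        if (p == end)
            return -1;
        if (*p == ':')
            goto s_done;
        if (!isalnum(*p) && *p != '-')
            return -2;
        ++p;
        goto s_name;

    s_done:
        return 0;
    }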

Basically, an HTTP parser just matches a string against a set of characters (e.g. [A-Za-z_-] for header names), which is what strspn(3) does. SSE 4.2 provides the PCMPSTR instruction family for this purpose (GLIBC since 2.16 uses an SSE 4.2 implementation for strspn()). However, these are vector instructions which don't support accept or reject sets of more than 16 characters, so they aren't too usable for HTTP parsers.

Results
I made a simple benchmark for the four approaches described above (http_ngx.c - Nginx HTTP parsing routines, http_table.c - table-driven FSM, http_hsm.c - hybrid state machine, and http_goto.c - simple goto-driven FSM). Here are the results (routines with 'opt' or 'lw' are optimized or lightweight versions of the functions):

Haswell (i7-4650U)

    Nginx HTTP parser:
        ngx_request_line:       730ms
        ngx_header_line:        422ms
        ngx_lw_header_line:     428ms
        ngx_big_header_line:    1725ms

    HTTP Hybrid State Machine:
        hsm_header_line:        553ms

    Table-driven Automaton (DPI)
        tbl_header_line:        473ms
        tbl_big_header_line:    840ms

    Goto-driven Automaton:
        goto_request_line:      470ms
        goto_opt_request_line:  458ms
        goto_header_line:       237ms
        goto_big_header_line:   589ms

Core (Xeon E5335)

    Nginx HTTP parser:
        ngx_request_line:       909ms
        ngx_header_line:        583ms
        ngx_lw_header_line:     661ms
        ngx_big_header_line:    1938ms

    HTTP Hybrid State Machine:
        hsm_header_line:        433ms

    Table-driven Automaton (DPI)
        tbl_header_line:        562ms
        tbl_big_header_line:    1570ms

    Goto-driven Automaton:
        goto_request_line:      747ms
        goto_opt_request_line:  736ms
        goto_header_line:       375ms
        goto_big_header_line:   975ms

The goto-driven automaton shows the best performance in all the tests on both architectures. Also, it's much easier to implement in comparison with HSM. So in Tempesta FW we migrated from HSM to a goto-driven automaton, but with some additional optimizations.

Lessons Learned
Haswell has very good BPU
The Core microarchitecture has shown that HSM behaves much better than the switch-driven and table-driven automatons. However, this is not the case for Haswell, where the approach loses to both of them. I've tried many optimization techniques to improve HSM performance, but the results above are the best, and they are still worse than the simple FSM approaches.
The profiler shows that the problem (the hot spot) of HSM on Haswell is in the following code:
 
    if (likely((unsigned char)(c - RNG_CB(s, 0)) <= RNG_SUB(s, 0))) {
        st = RNG_ST(s, 0);
        continue;
    }

Here we extract the transition information and compare the current character against the range. In most cases only this first branch is taken in the test; the 3rd and 4th branches are never observed. The whole automaton was encoded in only 2 cache lines.

In the first test case, when the XTrans.x structure is dereferenced to get access to the ranges, the compiler generates 3 pointer dereferences. In fact, these instructions (part of the disassembled branch)
 
    sub    0x4010c4(%rax),%bl
    cmp    0x4010c5(%rax),%bl
    movzbl 0x4010cc(%rax),%eax

produce 3 accesses to L1d, and the cache has quite limited bandwidth (64 bytes for reading and 32 bytes for writing per cycle) with a minimal latency of 4 cycles on Haswell, even though only one cache line is accessed by all the instructions. So the bottleneck of this test case is L1d bandwidth.

If we use the XTrans.l longs (we need only l[0], which can be loaded with only one L1d access, in all the cases) and use bitwise operations to extract the data, then we get a lower number of L1d accesses (4G vs 6.7G for the previous case), but branch mispredictions increase.

The problem is that the more complex statements in the conditions make it harder for the Branch Prediction Unit to predict branches.

However, we can see that simple branches (in the switch-driven and goto-driven automatons) show perfect performance on Haswell. So the advanced Haswell BPU handles simple automatons perfectly, making the complex HSM inadequate.

In fact, HSM is the only test which is slower on Haswell than on the Core Xeon. Probably, this is the difference between server and mobile chips: even an old server processor beats a modern mobile CPU on complex loads...

-O3 is ambiguous
Sometimes -O3 (GCC 4.8.2) generates slower code than -O2. Also, benchmarks built with -O3 show very strange and unexpected results. For example, below are the results for -O2:

    goto_request_line: 470ms

However, -O3 shows worse results:

    goto_request_line: 852ms



Automata must be encoded statically whenever possible
The table-driven and HSM automatons are encoded using static constant tables (in contrast to the run-time generated tables of the current DPI parser). This was done during the HSM optimizations. Sometimes the compiler can't optimize code which uses run-time generated tables, and this is crucial for real hot spots (for HSM the table is used in the if-statement described above, which takes about 50-70% of the whole function's execution time): after moving to static data the code can get up to a 50% performance improvement (as was the case for HSM).

Wednesday, January 15, 2014

NatSys Lock-free Queue vs LMAX Disruptor vs boost::lockfree::queue

I've been asked many times how our lock-free queue (you can read more about it here) differs from the LMAX Disruptor. The last time was in a discussion at Hacker News (it seems one of the Disruptor authors was asking). In LinkedIn discussions I was also asked about boost::lockfree::queue (it appeared in Boost recently). So I'm writing this post to share the answers.

LMAX Disruptor

Its source code is available at https://github.com/redjack/varon-t.

The algorithms are simply different and solve different problems: the Disruptor is a messaging queue (if a producer P0 emits M0 into the queue, then all consumers C0..CN receive M0), while our queue is a classical work queue (if P0 and P1 emit messages M0 and M1, then only one consumer Ci receives M0 and only one consumer Cj receives M1, where i may be equal to j).


Our implementation competes with boost::lockfree::queue, and it's much faster since the Boost implementation uses heavier synchronization techniques. The benchmark on GitHub also includes the Boost implementation, so you can compare both queues.

Unfortunately, there is no adequate algorithm description for the Disruptor queue, only some vague descriptions more suitable for business people than for engineers. So it was not easy to dig through its source code. However, I studied it, and here are some notes about the implementation.

The implementation is a bit sloppy: there are a lot of branches without branch prediction information available at compile time, and to avoid cache line bouncing it wastes two cache lines instead of simply aligning an item on a cache line (I mean vrt_padded_int). I didn't pay too much attention to the memory barrier usage, but given that x86-64 provides relatively strict memory ordering, probably some of them could also be eliminated. The Disruptor uses very clever ideas, but I believe its performance could be improved after a good code review.

One more point is that while our queue implementation is C++, it's still self-sufficient and can be easily ported to C for use in kernel space. It's questionable (in my humble opinion) that a generic container depends on the non-standard libcork and, moreover, on a logging library (clogger).

One more thing to note about our queue and the Disruptor: the Disruptor uses naive yielding logic. The problem with this logic is that if a thread has no job for a long time, it falls asleep for a period which grows from one yield call to the next. So if the queue has no work for a long time but then suddenly gets a big spike, the consumers typically spend some time waking up before they start to work. We solved the issue with a lock-free condition wait.


Boost::lockfree::queue

It uses at least one CAS operation (see, for example, do_push() in boost/lockfree/queue.hpp), which even in the best case is slower than a plain RMW operation (an atomic increment in our case). You can use the benchmark for both queues at GitHub; see also the small illustration at the end of this post. For my Intel Core i7-4650U it gives:

     $ g++ -Wall -std=c++0x -O2 -D DCACHE1_LINESIZE=`getconf LEVEL1_DCACHE_LINESIZE` -I/opt/boost_1_55_0/include/ lockfree_rb_q.cc -lpthread
    $ ./a.out
    13969ms 
    check X data... 
    Passed 
    69357ms 
    check X data... 
    Passed 
    20240ms 
    check X data... 
    Passed


Note that I use the latest Boost version. The first result is for our queue, the second is for a naive queue implementation, and the last one is for Boost. I.e. our implementation is more than 30% faster than boost::lockfree::queue.
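
To illustrate the point about CAS versus a plain atomic RMW, here is a small standalone sketch (not code from either library): fetch-and-add always succeeds in a single operation, while compare-and-swap may have to retry under contention.

    #include <stdatomic.h>

    static atomic_ulong tail;

    unsigned long
    ticket_rmw(void)
    {
        /* One atomic instruction (LOCK XADD on x86-64), never retries. */
        return atomic_fetch_add_explicit(&tail, 1, memory_order_relaxed);
    }

    unsigned long
    ticket_cas(void)
    {
        unsigned long t = atomic_load_explicit(&tail, memory_order_relaxed);

        /* May spin when other producers update the tail concurrently. */
        while (!atomic_compare_exchange_weak_explicit(&tail, &t, t + 1,
                                                      memory_order_relaxed,
                                                      memory_order_relaxed))
            ;
        return t;
    }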