High Performance Linux

Wednesday, September 30, 2015

Fast Memory Pool Allocators: Boost, Nginx & Tempesta FW

Memory Pools and Region-based Memory Management

Memory pools and region-based memory management allow you to improve your program's performance by avoiding unnecessary memory freeing calls. Moreover, pure memory pools gain even more performance thanks to their simpler internal mechanisms. These techniques are widely used in Web servers, and with them you can do the following (pseudo code for an imaginary Web server):

    http_req->pool = pool_create();
    while (read_request && parse_request) {
        http_req->hdr[i] = pool_alloc(http_req->pool);
        // Do other stuff, don't free allocated memory.
    }
    // Send the request.
    // Now destroy all allocated items at once.
    pool_destroy(http_req->pool);

This reduces the number of memory allocator calls, simplifies the allocator's internals (e.g. since you don't free individual memory chunks, it doesn't need to care about memory fragmentation and many other problems which common memory allocators must solve at relatively high cost) and makes the program run faster.

Probably you've noticed that we call pool_alloc() in the example above without specifying an allocation size. And here we come to the difference between memory pools and region-based memory management: memory pools can allocate chunks of only one fixed size, while region-based memory allocators are able to allocate chunks of different sizes. Meantime, both of them allow you not to care about freeing memory and to just drop all the allocated chunks at once.
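The distinction can be sketched in a few lines of C++ (toy models for illustration only; the class names and layouts are mine, not Boost's or Nginx's):

```cpp
#include <cstddef>
#include <vector>

// A *pool* hands out chunks of one fixed size; a *region* hands out
// variable-size chunks. Both free everything at once on destruction.
template <size_t ChunkSize>
class FixedPool {
    static constexpr size_t kChunksPerBlock = 64;
    std::vector<char *> blocks_;
    size_t used_ = kChunksPerBlock;   // chunks used in the last block
public:
    void *alloc() {                   // no size argument: the size is fixed
        if (used_ == kChunksPerBlock) {
            blocks_.push_back(new char[ChunkSize * kChunksPerBlock]);
            used_ = 0;
        }
        return blocks_.back() + ChunkSize * used_++;
    }
    ~FixedPool() {                    // drop all the chunks at once
        for (char *b : blocks_)
            delete[] b;
    }
};

class Region {
    std::vector<char *> blocks_;
public:
    void *alloc(size_t n) {           // variable-size allocation
        blocks_.push_back(new char[n]);
        return blocks_.back();
    }
    ~Region() {
        for (char *b : blocks_)
            delete[] b;
    }
};
```

Both destructors implement exactly the "drop all the allocated chunks at once" semantics described above.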

The Boost C++ libraries provide the memory pool library boost::pool; let's have a closer look at it. By the way, there is a nice description of memory pool concepts.


If we run the following loop for N = 20,000,000 (this is the body of a template function with argument T; you can find the full source code of the benchmark in our repository on GitHub):

    boost::object_pool<T> p;

    for (size_t i = 0; i < N * sizeof(Small) / sizeof(T); ++i) {
        T *o = p.malloc();
        touch_obj(o);
    }

then it takes about 260ms for a T of 24 bytes (the size of structure Small) and about 170ms for an object of 240 bytes (the size of structure Big) on my Intel Core i7-4650U 1.70GHz. Please note that I don't measure pool construction and destruction times, to concentrate on the memory management techniques themselves without additional expenses. Also, I limit the number of allocations by

    N * sizeof(Small) / sizeof(T)

to avoid memory exhaustion in tests with large object sizes.

Hereafter, touch_obj() is just a simple macro which checks the result of the memory allocation and breaks the current loop on failure:

    #define touch_obj(o)                                   \
        if (__builtin_expect(!o, 0)) {                     \
                std::cerr << "failed alloc" << std::endl;  \
                break;                                     \
        } else {                                           \
                *(long *)o = 1;                            \
        }

Now if we free each 4th allocated element:

    for (size_t i = 0; i < N * sizeof(Small) / sizeof(T); ++i) {
        T *o = p.malloc();
        touch_obj(o);
        if (__builtin_expect(!(i & 3), 0))
            p.free(o);
    }

the function takes about 165ms and 155ms for T sizes of 24 and 240 bytes respectively. This is almost no change for big objects, but more than a 50% performance improvement for small objects. Results for the boost::pool benchmark:

    $ ./a.out 
                 small object size:    24
                   big object size:    240

               boost::pool (Small):    258ms
       boost::pool w/ free (Small):    165ms
                 boost::pool (Big):    174ms
         boost::pool w/ free (Big):    154ms

(I take the best numbers from 3 runs of the test). The loop now contains an extra conditional jump and a free() call, which surely isn't free. So why do we get a performance improvement in the version with object freeing? There are two reasons why it's faster:

  1. the main reason is that the first version of the program allocates more than 450MB of RAM, so it puts serious pressure on the system memory allocator, which must allocate a lot of memory for us. The second version of the program requires 25% less system memory;
  2. each write operation touches a new memory region, so cache lines are constantly evicted from the CPU cache and new memory is loaded into it. Thus, we get serious CPU cache starvation.

Therefore, periodic or even occasional memory freeing reduces the overall work (while we have more free() calls, the underlying system allocator is called less often) and reuses CPU cache lines more effectively.

But why doesn't the big allocations test benefit from freeing? The short answer is the exponential growth of the memory storage used by boost::pool and the particular numbers (N and sizeof(T)) in the test. The key function of boost::object_pool is pool<UserAllocator>::ordered_malloc_need_resize() (simplified for brevity):

    template <typename UserAllocator>
    void *
    pool<UserAllocator>::ordered_malloc_need_resize()
    {
        size_type POD_size = next_size * partition_size + ...;
        char *ptr = (UserAllocator::malloc)(POD_size);
        ...
        if (!max_size)
            next_size <<= 1;
        ...
    }

The function requests memory from the underlying system allocator in exponentially growing steps. If we run strace -e trace=mmap against the test for small objects without freeing, then we see the following mmap() calls (glibc uses mmap() for large allocations and brk() for smaller ones):

    mmap(NULL, 200704, ...) = 0x7fdb428bd000
    mmap(NULL, 397312, ...) = 0x7fdb4285c000
    mmap(NULL, 790528, ...) = 0x7fdb4279b000
    mmap(NULL, 201330688, ...) = 0x7fdb295e3000
    mmap(NULL, 402657280, ...) = 0x7fdb115e2000

There are 12 calls in total for the benchmark code itself. The benchmark for small objects with freeing produces 11 calls, ending at

    mmap(NULL, 201330688, ....) = 0x7f91ed60b000

So freeing really does reduce the amount of allocated memory and improves performance. However, this is not the case for the big allocation benchmarks - both of them issue the same number of equal mmap() calls:

    mmap(NULL, 249856, ....) = 0x7f34431be000
    mmap(NULL, 251662336, ....) = 0x7f3423f50000

I.e. in this particular test the last very large memory allocation is enough to satisfy the benchmark memory requirements even without freeing.
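The growth schedule itself is easy to model (a simplified sketch: the real POD_size also includes some bookkeeping overhead, ignored here, and the starting next_size and partition_size values below are arbitrary):

```cpp
#include <cstddef>
#include <vector>

// Model of boost::pool's growth policy: every time the pool is exhausted it
// requests next_size * partition_size bytes from the system allocator and
// then doubles next_size, so request sizes grow geometrically.
std::vector<size_t> growth_schedule(size_t next_size, size_t partition_size,
                                    int resizes)
{
    std::vector<size_t> sizes;
    for (int i = 0; i < resizes; ++i) {
        sizes.push_back(next_size * partition_size);
        next_size <<= 1;    // the next request will be twice as large
    }
    return sizes;
}
```

Starting from the ~200KB first request in the trace above, about 11 doublings are enough to reach the ~400MB request, which matches the 12 mmap() calls observed.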

Boost::pool is a pure memory pool which allows you to allocate memory chunks of one fixed size only. This is not the case for the Nginx memory pool, which is able to allocate objects of different sizes.

Nginx Pool

Wikipedia says: "the web server Nginx, use the term memory pool to refer to a group of variable-size allocations which can be later deallocated all at once. This is also known as a region". In fact, Nginx's pool allows you to allocate variable-size memory chunks up to 4095 bytes on x86-64. You can request larger memory areas, but they're allocated using plain malloc().
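This small/large split can be sketched as follows (a toy model under my own names, not the actual ngx_palloc() code):

```cpp
#include <cstdlib>
#include <vector>

// Chunks up to `max` bytes are carved from page-sized blocks by advancing a
// bump pointer; larger requests fall back to plain malloc() and are only
// linked into the pool so they can be released with everything else.
struct SketchPool {
    static constexpr size_t kPageSize = 4096;
    size_t max = kPageSize - 1;     // the largest "small" allocation
    std::vector<char *> blocks;     // page-sized blocks for small chunks
    std::vector<void *> large;      // plain malloc()'ed large chunks
    char *last = nullptr, *end = nullptr;

    void *alloc(size_t n) {
        if (n > max) {              // large path: plain malloc()
            large.push_back(std::malloc(n));
            return large.back();
        }
        if (!last || last + n > end) {  // small path: need a fresh block
            blocks.push_back(new char[kPageSize]);
            last = blocks.back();
            end = last + kPageSize;
        }
        char *p = last;
        last += n;
        return p;
    }
    ~SketchPool() {                 // destroy the whole pool at once
        for (void *p : large)
            std::free(p);
        for (char *b : blocks)
            delete[] b;
    }
};
```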

To measure the performance of the Nginx pool I copied some pieces of src/core/ngx_palloc.c from nginx-1.9.5. The benchmark code for small and large memory allocations now looks like the following (you can find the full code here):

    ngx_pool_t *p = ngx_create_pool(PAGE_SIZE);

    for (size_t i = 0; i < N * sizeof(Small) / sizeof(T); ++i) {
        T *o = (T *)ngx_palloc(p, sizeof(T));
        touch_obj(o);
    }


Nginx has ngx_pfree(), but it works for large data blocks only, so small memory chunks aren't freeable; therefore we test periodic freeing for large memory chunks only. Nginx's pool supports different allocation sizes, so let's test a mixed workload with 1/4 big (240B) and 1/4096 huge (8305B) chunks:

    ngx_pool_t *p = ngx_create_pool(PAGE_SIZE);

    for (size_t i = 0; i < N; ++i) {
        if (__builtin_expect(!(i & 0xfff), 0)) {
            Huge *o = (Huge *)ngx_palloc(p, sizeof(*o));
            touch_obj(o);
            if (!(i & 1))
                ngx_pfree(p, o);
        } else if (__builtin_expect(!(i & 3), 0)) {
            Big *o = (Big *)ngx_palloc(p, sizeof(*o));
            touch_obj(o);
        } else {
            Small *o = (Small *)ngx_palloc(p, sizeof(*o));
            touch_obj(o);
        }
    }


The code above gives us the following results:

                  ngx_pool (Small):    305ms
                    ngx_pool (Big):    132ms
            ngx_pool w/ free (Mix):    542ms

Let's review the Nginx pool implementation and quickly analyze the results:

  1. the implementation is very simple (the ported allocator code takes less than 200 lines) without the complex OOP wrapper code which boost::pool has in plenty;
  2. ngx_palloc() scans each memory block at most 4 times, trying to utilize allocated memory blocks better. There is a small multiplication factor, but the algorithm is still O(1). If you do a small allocation after a few large allocations, then it's likely that there is room to place the new chunk somewhere in a previously allocated block without requesting a new one. This is the cost of the ability to allocate memory chunks of different sizes. Boost::pool just places the current chunk at the end of the allocated data, so it does no scans at all;
  3. Nginx grows its pool by blocks of fixed size; it doesn't use exponential growth like boost::pool.

Update. Previously I explained the difference between the results for Big and Small allocations by O(n) complexity of ngx_palloc(); in particular, I was confused by this code:

    do {
        m = ngx_align_ptr(p->d.last, NGX_ALIGNMENT);
        if ((size_t)(p->d.end - m) >= size) {
            p->d.last = m + size;
            return m;
        }
        p = p->d.next;
    } while (p);

However, ngx_palloc_block() does

    for (p = pool->current; p->d.next; p = p->d.next) {
        if (p->d.failed++ > 4)
            pool->current = p->d.next;
    }

So when a new block is allocated, all the existing blocks increment their failed counters, meaning that the current request could not be satisfied from any of them. Thus we try to allocate a chunk from a block at most 4 more times; after that pool->current moves past it and we never touch the head of the list again. So typically we have a very long untouched head of the list and a relatively small tail which is scanned by ngx_palloc().
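The described behavior can be modeled in user space (a simulation of the failed-counter logic, not the nginx code itself):

```cpp
#include <cstddef>
#include <iterator>
#include <list>

// When a new block is appended, every block from `current` onward gets its
// failed counter incremented; after a block has failed more than 4 times,
// `current` moves past it, so only the short tail of the list is scanned.
struct Block { int failed = 0; };

struct PoolModel {
    std::list<Block> blocks;
    std::list<Block>::iterator current;

    PoolModel() {
        blocks.push_back(Block());
        current = blocks.begin();
    }
    void add_block() {              // mimics the loop in ngx_palloc_block()
        for (auto p = current; p != blocks.end(); ++p)
            if (p->failed++ > 4)
                ++current;          // stop scanning exhausted blocks
        blocks.push_back(Block());
    }
    size_t scan_len() {             // blocks ngx_palloc() would scan next time
        return (size_t)std::distance(current, blocks.end());
    }
};
```

After enough new blocks are added, `current` leaves the exhausted head behind and the scanned tail stays short.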

Thanks to Maxim Dounin and Rinat Ibragimov for pointing out the mistake. Meantime, Rinat tried the following change to ngx_palloc_block():

    -    for (p = pool->current; p->d.next; p = p->d.next) {
    -        if (p->d.failed++ > 4)
    -            pool->current = p->d.next;
    -    }
    +    pool->current = p_new;

I.e. he removed the scan entirely. After the change, the benchmark results on his machine changed from

                  ngx_pool (Small):    475ms
                    ngx_pool (Big):    198ms
            ngx_pool w/ free (Mix):    817ms
             ngx_pool cr. & destr.:    161ms

to

                  ngx_pool (Small):    249ms
                    ngx_pool (Big):    163ms
            ngx_pool w/ free (Mix):    585ms
             ngx_pool cr. & destr.:    476ms

So the scans really do degrade performance on the particular workload from the benchmark; strictly speaking, though, the algorithm is still constant-time.

The results are somewhat worse than boost::pool's. Does that mean boost::pool is better? Nope. Nginx is designed to process thousands or more HTTP requests per second, and each request has a memory pool associated with it. Obviously, a regular HTTP request doesn't contain 20 million chunks or headers, so our test is somewhat unnatural for Nginx's pool. To compare Nginx's pool with boost::pool under a workload more natural for Nginx, I wrote another benchmark:

    for (size_t i = 0; i < N / 100; ++i) {
        ngx_pool_t *p = ngx_create_pool(PAGE_SIZE);

        for (size_t j = 0; j < 100; ++j) {
            if (__builtin_expect(!(i & 3), 0)) {
                Big *o = (Big *)ngx_palloc(p, sizeof(*o));
                touch_obj(o);
            } else {
                Small *o = (Small *)ngx_palloc(p, sizeof(*o));
                touch_obj(o);
            }
        }

        ngx_destroy_pool(p);
    }

Since boost::pool can't allocate variable-size chunks, I just ran 1/4 of all the loops with Big and 3/4 with Small object allocations. The results are:

          boost::pool cr. & destr.:    133ms
             ngx_pool cr. & destr.:    98ms

And this time Nginx's pool is more than 30% faster than boost::pool.

By the way, the APR library used in the Apache HTTPD server uses mechanisms somewhat similar to Nginx's pool.


In Tempesta FW we tried to make a more flexible pool allocator (actually a region-based memory allocator, but we follow the Apache HTTPD and Nginx developers in the allocator's name): one that is able to free allocated memory like boost::pool and to allocate variable-size memory chunks like Nginx - and to be as fast as possible.

TfwPool uses a stack-like freeing approach. In fact, there are two common cases when we need to free or reallocate the last allocated chunk. The first one is trivial - when you just need a temporary buffer to do some work (e.g. snprintf()).

For example, checking a very short string from the HTTP Host header looks like:

    buf = tfw_pool_alloc(req->pool, len);
    tfw_str_to_cstr(&field, buf, len);
    if (!tfw_addr_pton(buf, &addr)) {
        // Process the bad Host header.
    }
    tfw_pool_free(req->pool, buf, len);

tfw_str_to_cstr() copies a chunked string to the contiguous buf. If you design your application as purely zero-copy, then such copies are typically not what you'd like to do. However, zero-copy techniques can be very expensive for small chunked data, when you need to handle a list of small data chunks. So sometimes it's faster to do a small copy rather than do complex things with lists of small chunks. Nevertheless, tfw_str_to_cstr() is considered deprecated and will be replaced with smarter techniques in future versions.

The second case when you typically need to free exactly the last allocated chunk is realloc(). Imagine that you handle a descriptor of the string chunks of HTTP message headers, body, etc. The descriptor is effectively an array of pointers to the chunks of the message field. Arrays are better than linked lists because of better memory locality, but you pay for the ability to grow an array with a heavy realloc() call. So if we had some very fast way to grow an array, we could reach very good performance. And there is a way to grow arrays quickly, without the memory copies usual for realloc(). It is worth mentioning that if you're reading, for example, an HTTP message body, then you're growing the array of that field's descriptor while the other allocated chunks stay untouched. So at some point you just execute realloc() many times against the same memory chunk. If the allocator has room just after that chunk, then it only needs to move its data end pointer forward - no real reallocations or copies are necessary.
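The stack-like idea can be sketched as follows (an illustrative model with an assumed interface, not the real TfwPool API):

```cpp
#include <cstddef>
#include <cstring>

// The pool is a bump region: freeing the most recently allocated chunk just
// rewinds the bump pointer, and growing the last chunk just advances it, so
// no data is copied.
struct StackPool {
    char buf[4096];
    size_t off = 0;

    void *alloc(size_t n) {
        if (off + n > sizeof(buf))
            return nullptr;
        void *p = buf + off;
        off += n;
        return p;
    }
    // Freeing works in O(1) for the last chunk only: rewind the pointer.
    void free_last(void *p, size_t n) {
        if ((char *)p + n == buf + off)
            off -= n;
    }
    // Grow the last chunk in place: same pointer, no memcpy() needed.
    void *realloc_last(void *p, size_t old_n, size_t new_n) {
        if ((char *)p + old_n == buf + off
            && off - old_n + new_n <= sizeof(buf)) {
            off = off - old_n + new_n;
            return p;               // the data stayed in place
        }
        void *q = alloc(new_n);     // fallback: ordinary copy
        if (q && p)
            std::memcpy(q, p, old_n);
        return q;
    }
};
```

Note that the free and realloc operations take the chunk length, which is presumably why tfw_pool_free() in the example above takes len: the allocator needs the size to tell whether the chunk is the last one.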

Probably you noticed in the previous code examples that I call free() just after alloc() - actually, this is a relatively frequent pattern which we kept in mind while designing TfwPool.

As in the Nginx case, I just copied the original code with small adjustments into the benchmark. You can find the original allocator code on GitHub. The results for the allocator are shown below:

                  tfw_pool (Small):    279ms
          tfw_pool w/ free (Small):    101ms
                    tfw_pool (Big):    106ms
            tfw_pool w/ free (Big):    50ms
            tfw_pool w/ free (Mix):    107ms
             tfw_pool cr. & destr.:    53ms

Well, it's faster than Nginx's pool, but the Small test is still slower than boost::pool. Surely, boost::pool shows amazing numbers thanks to its exponential growth. We don't use that technique because our allocator was designed for the Linux kernel, where large contiguous allocations are not the most efficient way to go.

Our allocator is designed to operate on 4096-byte aligned memory pages. Page alignment (x86-64 has 4096-byte pages) is good since fewer pages are used, improving TLB usage. In our benchmark we do the allocations using posix_memalign() with an alignment argument of 4096, and this significantly hurts the benchmark performance...

Performance Issue with posix_memalign(3)

To understand the posix_memalign() issue, let's run the memalign test, which compares the performance of 4KB allocations aligned to 16 and 4096 bytes respectively:

    $ ./memalign_benchmark 
                       16B aligned:    110ms
                      page aligned:    203ms

Page-aligned allocations are almost 2 times slower - that's too expensive. Why does this happen?

strace shows that the program spends more than 90% of its time in the brk() system call:

    $ strace -c ./memalign_benchmark
                       16B aligned:    233ms
                      page aligned:    334ms
    % time     seconds  usecs/call     calls    errors syscall
    ------ ----------- ----------- --------- --------- ----------------
     93.81    0.010849           1     17849           brk
      1.98    0.000229          13        17           mmap
      1.03    0.000119          12        10           mprotect

If we run each part of the benchmark independently, then we find the following (first I run the 16-byte alignment test and then the page alignment test):

    $ strace -e trace=brk ./memalign_benchmark 2>&1 |wc -l

    $ strace -e trace=brk ./memalign_benchmark 2>&1 |wc -l

The bigger number of brk() calls is expected, since glibc requests more memory from the system only to throw almost half of it away for alignment. So a large alignment argument is very expensive for posix_memalign().
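The waste is easy to see if you implement the alignment by hand on top of malloc() (an illustration of the general technique, not glibc's actual algorithm; `align` is assumed to be a power of two):

```cpp
#include <cstdint>
#include <cstdlib>

// To guarantee an align-byte aligned pointer inside an arbitrary malloc()'ed
// area, the allocator must over-allocate up to align-1 extra bytes and skip
// the unaligned prefix - pure waste, up to a whole page for align == 4096.
void *aligned_from_heap(size_t n, size_t align, void **raw_out)
{
    void *raw = std::malloc(n + align - 1);      // worst-case padding
    if (!raw)
        return nullptr;
    uintptr_t a = (uintptr_t)raw;
    uintptr_t aligned = (a + align - 1) & ~(uintptr_t)(align - 1);
    *raw_out = raw;                              // the caller frees raw
    return (void *)aligned;
}
```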

Meantime, the operating system kernel manages memory in pages, which are certainly 4096-byte aligned, so you can get properly aligned memory chunks without any alignment overhead. Moreover, when you do malloc() or posix_memalign(), the kernel must allocate a VMA (Virtual Memory Area - the kernel descriptor holding information about an address mapping) for it. Then, when you first write to the area, a page fault happens, which allocates a new page and installs it into the current page table. A lot of things happen when we work with memory in user space. The nice thing which glibc does to mitigate the impact of all this is caching memory requested from the kernel.

Tempesta FW: Native Kernel Implementation

It was also interesting to see how much a regular application can gain from moving to kernel space. First, to demonstrate how slow user-space memory allocation is (when the application has just started and glibc has no cache), I made a very simple test:

    static const size_t N = 100 * 1000;
    static void *p_arr[N];

    for (size_t i = 0; i < N; ++i)
        r = posix_memalign((void **)&p_arr[i], 16, PAGE_SIZE);
    for (size_t i = 0; i < N; ++i)
        free(p_arr[i]);

It takes about 112ms on my system:

    $ ./memalign_benchmark 
                      alloc & free:    112ms

Now let's try the kernel test; you can find it next to the other tests. You can run it by loading the kernel module (it's built by the Makefile like the other benchmarks), and dmesg will show you the results (don't forget to remove the module):

    $ sudo insmod pool.ko
    $ sudo rmmod pool
    $ lsmod |grep pool

First, let's have a look at how fast memory allocation is in the kernel:

    $ dmesg |grep 'alloc & free'
    alloc & free:  22ms

This is 5 times faster!

It's very curious how much a regular application can gain from moving to the kernel. So I also ported Nginx's pool to a kernel module, and here are the results:

    ngx_pool (Small):               229ms
    ngx_pool (Big):                 49ms
    ngx_pool w/ free (Mix):         291ms
    ngx_pool cr. & destr.:          150ms

This is 30% faster for small allocations, 2.7 times (!) faster for big objects and 86% faster for the mixed workload... Wait... What happened to our "real life" test with many pools and a low number of allocations? It's more than 50% slower!..

The problem is that when you free pages using the free_pages() interface, some of them go to the per-CPU page cache, but the main point is that most go to the buddy allocator, where they can be coalesced with their buddies to satisfy future requests for large allocations. Meantime, glibc does a good job of aggressively caching allocated memory. So the benchmark puts significant pressure on the Linux buddy system.

TfwPool uses its own lightweight per-CPU page cache, so we get much better results:

    tfw_pool (Small):               95ms
    tfw_pool w/ free (Small):       90ms
    tfw_pool (Big):                 45ms
    tfw_pool w/ free (Big):         35ms
    tfw_pool w/ free (Mix):         99ms
    tfw_pool cr. & destr.:          54ms

In short: 80-170% faster for small allocations than the fastest user-space allocator (boost::pool) and more than 3 times faster than Nginx's pool; 3-5 times faster for big allocations; more than 5 times faster for the mixed workload and 2 times faster for the "real life" workload in comparison with Nginx. Recall that the allocator was designed for kernel space, so it shows much better results in the Linux kernel than in user space.


In conclusion, I'd make a couple of points about fast memory pools:

  • while page-aligned memory operations are good due to better TLB usage, they are impractical in user space due to poor posix_memalign() performance. You can use your own page cache to cope with the issue;
  • exponential growth is a good old technique, but people still don't use it for some reason;
  • any memory allocator must solve many problems (time complexity, memory utilization and fragmentation, time to create and destroy the pool, etc.), so it's likely that you can develop an allocator much faster than any existing one for your particular project - there is no ideal memory allocator for all tasks;
  • just moving from user space to the kernel can give your program some performance improvement, but it can also cause performance degradation - user space and the kernel use very different mechanisms, so kernel applications must be designed specifically for kernel space.

Don't you love kernel programming as much as we love it?


Rinat Ibragimov added a benchmark for the GLib memory pool. Now the benchmark output looks like this (the best numbers from 4 runs):

                 small object size:    24
                   big object size:    240
                  huge object size:    8305

                mallocfree (Small):    711ms
        mallocfree w/ free (Small):    478ms
                  mallocfree (Big):    319ms
          mallocfree w/ free (Big):    204ms

               GLib slices (Small):    943ms
       GLib slices w/ free (Small):    621ms
                 GLib slices (Big):    227ms
         GLib slices w/ free (Big):    111ms

               boost::pool (Small):    262ms
       boost::pool w/ free (Small):    169ms
                 boost::pool (Big):    181ms
         boost::pool w/ free (Big):    157ms
          boost::pool cr. & destr.:    136ms

                  ngx_pool (Small):    312ms
                    ngx_pool (Big):    134ms
            ngx_pool w/ free (Mix):    554ms
             ngx_pool cr. & destr.:    99ms

                  tfw_pool (Small):    273ms
          tfw_pool w/ free (Small):    102ms
                    tfw_pool (Big):    107ms
            tfw_pool w/ free (Big):    46ms
            tfw_pool w/ free (Mix):    103ms
             tfw_pool cr. & destr.:    53ms

Monday, March 9, 2015

Linux Netlink Mmap: Bulk Data Transfer for Kernel Database

Netlink is a relatively new IPC mechanism in Linux, usually considered a replacement for ioctl(2). The Wikipedia article gives a nice explanation of the subject. Communication over Netlink is very similar to ordinary sockets. So if you transfer data between a user-space program and the Linux kernel, then typically you copy the data.

Linux 3.10 introduced the Netlink mmap interface from Patrick McHardy. The patch set shares a mapped memory region between a user-space program and the kernel in a circular-buffer manner, so you can do huge zero-copy data transfers between user and kernel space. Patrick's presentation motivates the interface as follows: "Performance doesn't matter much in many cases since most netlink subsystems have very low bandwidth demands. Notable exceptions are nfnetlink_queue, ctnetlink and possibly nfnetlink_log". The patches were merged into the kernel in 2012, but it seems no kernel subsystem uses the interface, and libnl (the Netlink Protocol Library Suite) still doesn't implement any API for the feature. Probably this is just because nobody needs to move tons of firewall rules, packets or statistics to/from the kernel...

Meantime, it is a known fact that embedded databases deliver the best performance for simple workloads due to lower overhead in comparison with client-server databases. However, what if you need perfect performance for many user-space processes (not threads)? Usually embedded databases serialize access to shared data using file locks. And file operations are surely slow. To solve the problem, process-oriented server software like PostgreSQL, Nginx or Apache allocates a shared memory region and uses spinlocks in that area.

What happens if a process holding the spinlock dies? All the others just spin on the lock in vain. So you need to introduce some lock control mechanism which releases locks acquired by dead processes. Ok, you control the locks now, but what's the state of the shared memory region if a process dies while holding the lock and updating some data? We can use something like a write-ahead log (WAL) for safe updates. Who will be responsible for undoing and/or redoing the log? Some designated management process? Maybe so. However, this is a good task for the operating system - it can do, and in fact usually does, all the cleanup if an application process dies abnormally. I/O operations for database persistence are also good to move from the user-space database process to the OS. By the way, there was an old LKML discussion on O_DIRECT, which is used by database developers but considered braindamaged by OS developers.

Furthermore, in our case most of the logic is done in kernel space, but we need to transfer tons of filtering rules, network analytics, logging and other data between user and kernel space. So we have started to develop Tempesta DB, a database shared by the kernel and many user-space processes, yet still very fast and reliable. This is where Netlink mmap shines for data transfers.

Patrick wrote nice examples of how to use the interface from user space. Kevin Kaichuan He described the generic Netlink kernel interfaces in his Linux Journal article, but unfortunately the article is 10 years old and outdated. During these years Linux has significantly developed its Netlink interfaces, and the amazing thing about Netlink mmap is that the whole zero-copy magic happens in linux/net/netlink/af_netlink.c, so you don't need to bother with skbs backed by mmaped buffers. You just need to switch on the CONFIG_NETLINK_MMAP kernel configuration option and make almost ordinary Netlink kernel calls. So let's have a look at an example of how to use the new Netlink kernel interface. I hope the example will be a useful addition to Patrick's and Kevin's articles.

First, we need to define NETLINK_TEMPESTA in linux/include/uapi/linux/netlink.h, so a small kernel patch is required. Next, we register our Netlink hooks:

    netlink_kernel_create(&init_net, NETLINK_TEMPESTA, &tdb_if_nlcfg);

We register the input hook only:

    static struct netlink_kernel_cfg tdb_if_nlcfg = {
        .input  = tdb_if_rcv,
    };

The hook just calls the standard netlink_rcv_skb(), which executes our other hook in a loop over all the skb data fragments:

    static void
    tdb_if_rcv(struct sk_buff *skb)
    {
        netlink_rcv_skb(skb, &tdb_if_proc_msg);
    }

Now we can process our own data structure TdbMsg, describing the messages passed over Netlink (we still must pass small message descriptors over the standard Linux system call interface, but the message bodies are transferred in a zero-copy fashion):

    static int
    tdb_if_proc_msg(struct sk_buff *skb, struct nlmsghdr *nlh)
    {
        TdbMsg *m;

        if (nlh->nlmsg_len < sizeof(*nlh) + sizeof(TdbMsg)) {
            TDB_ERR("too short netlink msg\n");
            return -EINVAL;
        }

        m = nlmsg_data(nlh);

        {
            struct netlink_dump_control c = {
                .dump = tdb_if_call_tbl[m->type].dump,
                .data = m,
                .min_dump_alloc = NL_FR_SZ / 2,
            };
            return netlink_dump_start(nls, skb, nlh, &c);
        }
    }

Here we call a standard Linux routine again, netlink_dump_start(), which does helper work for message processing and prepares the skb for the kernel's answer. netlink_dump_control describes the message processing handler. min_dump_alloc specifies the number of bytes to allocate for the response skb; we use half of the Netlink mmap frame size, which we defined as 16KB. data is just a pointer to the skb received from user space. And dump is the callback which processes the message. We get the callback by the message type, used as an index into a callback table, which is defined as:

    static const struct {
        int (*dump)(struct sk_buff *, struct netlink_callback *);
    } tdb_if_call_tbl[__TDB_MSG_TYPE_MAX] = {
        [TDB_MSG_INFO]    = { .dump = tdb_if_info },
        /* .... */
    };

And finally define dump callbacks as:

    static int
    tdb_if_info(struct sk_buff *skb, struct netlink_callback *cb)
    {
        TdbMsg *m;
        struct nlmsghdr *nlh;

        /* Allocate space in the response skb. */
        nlh = nlmsg_put(skb, NETLINK_CB(cb->skb).portid,
                        cb->nlh->nlmsg_seq, cb->nlh->nlmsg_type,
                        TDB_NLMSG_MAXSZ, 0);

        m = nlmsg_data(nlh);

        /* Fill in the response... */

        return 0; /* end the transfer;
                     return skb->len if you have more data to send */
    }
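The table-driven dispatch used above is a generally useful pattern; here is a self-contained user-space sketch of it (the message types and handler names are made up for illustration, they are not the Tempesta DB symbols):

```cpp
// The message type is the table index, the value is the callback, so
// dispatch costs one bounds check plus an indirect call.
enum MsgType { MSG_INFO, MSG_OPEN, __MSG_TYPE_MAX };

struct Msg { MsgType type; };

static int handled = -1;   // records which handler ran (demo only)

static int msg_info(const Msg &) { handled = MSG_INFO; return 0; }
static int msg_open(const Msg &) { handled = MSG_OPEN; return 0; }

static int (*const call_tbl[__MSG_TYPE_MAX])(const Msg &) = {
    msg_info,   // MSG_INFO
    msg_open,   // MSG_OPEN
};

static int
dispatch(const Msg &m)
{
    if ((int)m.type < 0 || m.type >= __MSG_TYPE_MAX || !call_tbl[m.type])
        return -1;          // analogous to returning -EINVAL
    return call_tbl[m.type](m);
}
```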

You can find a real-life example of user-space Netlink mmap usage in libtdb, and the kernel-side implementation in the Tempesta DB core interface.