High Performance Linux

Monday, March 9, 2015

Linux Netlink Mmap: Bulk Data Transfer for Kernel Database

Netlink is a relatively new IPC mechanism in Linux, usually considered a replacement for ioctl(2). The Wikipedia article gives a nice explanation of the subject. Communication over Netlink is very similar to communication over usual sockets, so if you transfer data between a user-space program and the Linux kernel, then typically you copy the data.
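To illustrate how socket-like the API is, here is a minimal user-space sketch (nl_open is just an illustrative helper of mine; NETLINK_ROUTE is used only because it is available on every kernel):

```c
#include <linux/netlink.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Open and bind a Netlink socket for the given protocol,
 * e.g. NETLINK_ROUTE. Returns the fd, or -1 on error. */
int
nl_open(int protocol)
{
    struct sockaddr_nl sa;
    int fd = socket(AF_NETLINK, SOCK_RAW, protocol);

    if (fd < 0)
        return -1;

    memset(&sa, 0, sizeof(sa));
    sa.nl_family = AF_NETLINK; /* the kernel assigns nl_pid on bind */
    if (bind(fd, (struct sockaddr *)&sa, sizeof(sa))) {
        close(fd);
        return -1;
    }
    return fd;
}
```

sendmsg(2) and recvmsg(2) on the returned descriptor then copy every message across the user/kernel boundary, which is exactly the overhead the mmap interface removes.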

Linux 3.10 introduced the Netlink mmap interface from Patrick McHardy. The patch set shares a mapped memory region between a user-space program and the kernel, organized as a circular buffer, so you can do huge zero-copy data transfers between user and kernel space. Patrick's presentation motivates the interface: "Performance doesn't matter much in many cases since most netlink subsystems have very low bandwidth demands. Notable exceptions are nfnetlink_queue, ctnetlink and possibly nfnetlink_log". The patches were merged into the kernel in 2012, but it seems no kernel subsystem uses the interface, and libnl (the Netlink Protocol Library Suite) still doesn't implement any API for the feature. Probably this is just because nobody needs to load tons of firewall rules, packets or statistics to/from the kernel...
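For reference, setting up the rings from user space looks roughly like this. This is a sketch under the assumption that the kernel was built with CONFIG_NETLINK_MMAP; the ring request structure is reproduced here under a local name (nl_ring_req, mirroring the patch set's struct nl_mmap_req) so the snippet compiles even against headers without the feature:

```c
#include <stddef.h>
#include <sys/mman.h>
#include <sys/socket.h>

#ifndef SOL_NETLINK
#define SOL_NETLINK 270
#endif
/* Socket option values from the CONFIG_NETLINK_MMAP patch set. */
#define NL_MMAP_RX_RING 6
#define NL_MMAP_TX_RING 7

/* Local copy of the patch set's struct nl_mmap_req. */
struct nl_ring_req {
    unsigned int nm_block_size;
    unsigned int nm_block_nr;
    unsigned int nm_frame_size;
    unsigned int nm_frame_nr;
};

/* Total bytes mmap(2) must map: the RX ring followed by the TX ring. */
size_t
ring_bytes(const struct nl_ring_req *req)
{
    return 2UL * req->nm_block_size * req->nm_block_nr;
}

/* Configure both rings on a Netlink socket and map them. Returns the
 * ring base, or MAP_FAILED (e.g. on a kernel without NETLINK_MMAP). */
void *
netlink_mmap_rings(int fd, struct nl_ring_req *req)
{
    if (setsockopt(fd, SOL_NETLINK, NL_MMAP_RX_RING, req, sizeof(*req)) ||
        setsockopt(fd, SOL_NETLINK, NL_MMAP_TX_RING, req, sizeof(*req)))
        return MAP_FAILED;
    return mmap(NULL, ring_bytes(req), PROT_READ | PROT_WRITE,
                MAP_SHARED, fd, 0);
}
```

After this, user space reads and writes messages directly in the mapped frames instead of calling recvmsg(2) and sendmsg(2) with copying.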

Meanwhile, it is a known fact that embedded databases deliver the best performance for simple workloads due to lower overhead in comparison with client-server databases. However, what if you need top performance for many user-space processes (not threads)? Usually embedded databases serialize access to shared data using file locks, and file operations surely are slow. To solve the problem, process-oriented server software like PostgreSQL, Nginx or Apache allocates a shared memory region and places spinlocks in that area.

What happens if a process holding the spinlock dies? All the others just spin on the lock in vain. So you need to introduce some lock control mechanism which releases locks acquired by dead processes. OK, you control the locks now, but what is the state of the shared memory region if a process dies while holding the lock and updating some data? We can use something like a write-ahead log (WAL) for safe updates. But who will be responsible for undoing and/or redoing the log? Some designated management process? Maybe so. However, this is a good task for the operating system - it can do, and in fact usually does, all cleanup operations when an application process dies abnormally. I/O operations for database persistency are also better moved from the user-space database process to the OS. By the way, there was an old LKML discussion on O_DIRECT, which is used by database developers and which is considered braindamaged by OS developers.

Furthermore, in our case most of the logic is done in kernel space, but we need to transfer tons of filtering rules, network analytics, logs and other data between user and kernel space. So we have started to develop Tempesta DB, a database shared by the kernel and many user-space processes, but still very fast and reliable. This is where Netlink mmap shines for data transfers.

Patrick wrote nice examples of how to use the interface from user space. Kevin Kaichuan He, in his Linux Journal article, described the generic Netlink kernel interfaces, but unfortunately the article has become outdated after 10 years. During these years Linux has significantly developed its Netlink interfaces, and the amazing thing about Netlink mmap is that the whole zero-copy magic happens in linux/net/netlink/af_netlink.c, so you don't need to bother with skbs backed by mmaped buffers. You just need to switch on the CONFIG_NETLINK_MMAP kernel configuration option and make almost the usual Netlink kernel calls. So let's have a look at an example of how to use the new Netlink kernel interface. I hope the example will be a useful addition to Patrick's and Kevin's articles.

Firstly, we need to define NETLINK_TEMPESTA in linux/include/uapi/linux/netlink.h, so a small kernel patch is required. Next, we register our Netlink hooks:

    nls = netlink_kernel_create(&init_net, NETLINK_TEMPESTA, &tdb_if_nlcfg);

We register the input hook only:

    static struct netlink_kernel_cfg tdb_if_nlcfg = {
        .input  = tdb_if_rcv,
    };

The hook just calls standard netlink_rcv_skb, which executes our other hook in a loop over all skb data fragments:

    static void
    tdb_if_rcv(struct sk_buff *skb)
    {
        netlink_rcv_skb(skb, &tdb_if_proc_msg);
    }

Now we can process our own data structure, TdbMsg, describing messages passed over Netlink (we still must pass small message descriptors over the standard Linux system call interface, but the message bodies are transferred in a zero-copy fashion):

    static int
    tdb_if_proc_msg(struct sk_buff *skb, struct nlmsghdr *nlh)
    {
        TdbMsg *m;

        if (nlh->nlmsg_len < sizeof(*nlh) + sizeof(TdbMsg)) {
            TDB_ERR("too short netlink msg\n");
            return -EINVAL;
        }

        m = nlmsg_data(nlh);

        {
            struct netlink_dump_control c = {
                .dump = tdb_if_call_tbl[m->type].dump,
                .data = m,
                .min_dump_alloc = NL_FR_SZ / 2,
            };
            return netlink_dump_start(nls, skb, nlh, &c);
        }
    }

Here we again call a standard Linux routine, netlink_dump_start, which does the housekeeping for message processing and prepares the skb for the kernel's answer. netlink_dump_control describes the message processing handler. min_dump_alloc specifies the number of bytes to allocate for the response skb; we use half of the Netlink mmap frame size, which we defined as 16KB. data is just a pointer to the message received from user space. And dump is the callback that processes the message. We look up the callback by the message type, which is used as an index into the callback table. The table is defined as:

    static const struct {
        int (*dump)(struct sk_buff *, struct netlink_callback *);
    } tdb_if_call_tbl[__TDB_MSG_TYPE_MAX] = {
        [TDB_MSG_INFO]    = { .dump = tdb_if_info },
        /* .... */
    };
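The TdbMsg structure itself is not shown here; purely as an illustration, a descriptor driving the dispatch above could look like the following (only the type field is implied by the dispatch code, the rest is a guess - the real definition lives in the Tempesta DB sources):

```c
#include <stdint.h>

/* Hypothetical sketch of the TdbMsg descriptor passed over the
 * regular Netlink syscall path, while the bulky message bodies
 * travel through the mmaped frames. */
typedef struct {
    uint32_t type;  /* TDB_MSG_INFO etc., index into tdb_if_call_tbl */
    uint32_t rec_n; /* e.g. number of records in the mmaped frame */
} TdbMsg;
```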

And finally we define the dump callbacks:

    static int
    tdb_if_info(struct sk_buff *skb, struct netlink_callback *cb)
    {
        TdbMsg *m;
        struct nlmsghdr *nlh;

        /* Allocate space in the response skb. */
        nlh = nlmsg_put(skb, NETLINK_CB(cb->skb).portid,
                        cb->nlh->nlmsg_seq, cb->nlh->nlmsg_type,
                        TDB_NLMSG_MAXSZ, 0);
        m = nlmsg_data(nlh);

        /* Fill in the response... */

        return 0; /* End the transfer; return skb->len
                     if you have more data to send. */
    }

You can find a real-life example of user-space Netlink mmap usage in libtdb, and the kernel-side implementation in the Tempesta DB core interface.