High Performance Linux

Monday, September 12, 2016

How To Analyze Performance of Your Program

Is your program fast? This question seems easy to answer, and typically people run some benchmarks to find out. However, believe it or not, even experienced engineers often do the wrong things in performance analysis. There are many tutorials about profiling (e.g. for perf). There are also tutorials and presentations about Linux performance monitoring (e.g. Brendan Gregg's or mine, in Russian) and tuning (e.g. from RedHat). However, I haven't seen any consolidated tutorial on Linux server software performance analysis so far. So this article is an example of the performance analysis of one program, Tempesta FW, and is intended as an introductory tutorial for software engineers on Linux software performance analysis.

Please feel free to write comments with additional references and good links about benchmarking, performance monitoring and profiling. Additions, corrections, etc. to make the "tutorial" more complete are much appreciated.

Define the Reference and the Target

The first thing to do is to clearly define what "fast" means for your program. Probably you have business requirements like "the software must process at least 50,000 requests per second (RPS)". Or probably you reference some existing software, e.g. "be as fast as the competitor's solution".

Nginx is so cool and so widely used that we use it as the reference in our development. Actually, we could reference Varnish or Apache Traffic Server as well (and we do keep their design in mind), but they are very close to Nginx in performance, just because they all are traditional user-space daemons. However, since Tempesta FW is a pure kernel solution, with its pros and cons, it a priori delivers much more performance than any traditional user-space solution (solutions based on user-space TCP/IP are a different story). So we defined our target as "be 10 times faster than a traditional user-space Web accelerator" and use Nginx as the example of a "traditional user-space Web accelerator" throughout the article.

Define the Measurements

The next step is the definition of performance measurements. The truth is that only real production usage will show whether the software is fast enough. However, it's usually undesirable to deploy software in production without preliminary performance measurements. So we're doing preliminary, in other words synthetic, performance testing.

Why can't such tests usually show real performance? Because under a real workload you'll face very different use cases. For example, a Web server in a real deployment sees various DDoS attacks, users with old and/or RFC-incompatible Web browsers, very different robots from Web search engines and many, many other clients. Usually it's just impossible to emulate all the usage patterns in tests.

Thus, we define a very reduced set of test cases, or benchmarks, to analyze the performance of our software in particular use cases. For example, we can define the following benchmark scenarios:
  1. Analyze performance for 1, 8, 64, 512, 2048, 8192 and 32768 connections. This test case analyzes how well the program scales with an increasing number of connections. A small number of connections stresses single CPU core performance, while the largest numbers of connections stress I/O multiplexing and probably the Web accelerator's and operating system's internal synchronization mechanisms.
  2. Test minimal requests like 'GET / HTTP/1.0\n\n' (simple HTTP flood) as well as normal browser requests. The software should be fast for normal users as well as resistant to application-layer DDoS attacks;
  3. Test small (16 bytes), medium (3KB) and large (16KB) Web content files. 16 bytes may seem an unrealistic payload size, but it efficiently tests networking I/O, HTTP processing and other logic. Meantime, very large payloads, like tens of megabytes or more, just test network throughput and minimize the Web accelerator's role, so we don't use such large payloads. The interesting case is the medium size: 3KB plus some space for HTTP headers and metadata fits a 4KB system page - a very comfortable case for memory allocators.
  4. Test normal Web content processing and the efficiency of blocking rate-limited clients. The motivation for this test case is the same as for (2): be fast under normal conditions and DDoS-resistant.
  5. Test keep-alive HTTP sessions as well as a new TCP connection for each request. A new TCP connection for each HTTP request stresses network I/O, Web accelerator session initialization and so on, while keep-alive connections move the focus to HTTP processing.
One of our clients, a VPN vendor, knew their bottleneck - it was hard for them to scale a single heavyweight TCP connection across many CPU cores for encryption. So the analysis case for them was exactly to pass one TCP connection, as large as possible, through the VPN agent and process it as quickly as possible using all available CPU cores. So knowing or assuming a priori the bottlenecks of your software is very useful for defining good benchmark test cases.

Another example could be a business requirement. A very common requirement is to show performance results as high as possible, but without cheating (e.g. without using jumbo frames). Usually the requirement comes from the marketing department, since they must show where the product outperforms competitors. In this case you choose whichever benchmark test is the most promising for your solution and run fair performance tests on it.

In our example, test No. 4, blocking of rate-limited clients, is exactly that case. Any modern user-space Web accelerator runs rate limiting at the application layer: e.g. if Nginx rate-limits a client, it still has to accept the TCP connection from the client, read and parse the request, and only after that do the limiting modules get control and block the request. Meantime, Tempesta FW passes blocked IP addresses directly to the IP layer, so all following requests from the malicious client are immediately blocked at the IP layer (you can find details in my presentation - it's in Russian, but the slides are in English). This is a unique feature where the solution shines, and the performance results clearly show that.
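
For illustration, application-layer rate limiting in Nginx is configured with the limit_req module; this is just a sketch of the approach, not a configuration used in the tests below:

    http {
        # allow at most 10 requests per second per client IP
        limit_req_zone $binary_remote_addr zone=perip:10m rate=10r/s;
        server {
            location / {
                # excess requests are rejected, but only after the TCP
                # connection has been accepted and the request parsed
                limit_req zone=perip burst=20 nodelay;
            }
        }
    }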

Setup the Testbed

The last preparation step is defining and setting up the testbed. The testbed configuration is crucial since it must emulate a real-life use case. Meantime, the testbed must be able to explicitly show the software's performance. I'll describe hardware and configuration issues with the testbed setup below.

In our case we're using the testbed configuration depicted in the figure. There are 2 servers: one runs the Web server performance measurement tool and the second runs the Web accelerator, Tempesta FW or Nginx in our tests. Each server has 8 CPU cores (Intel Xeon E5620, a very old processor) and the servers are linked with 10Gbps Intel network adapters. The Nginx backend runs on port 8080, Tempesta FW uses port 80 and Nginx as a Web accelerator uses port 9000.

Note that we run the Nginx backend (upstream) on the same host as the analyzed Tempesta FW or Nginx Web accelerator. While a 3rd machine running the backend server looks more practical, it's indistinguishable from running the backend on the same host in our test scenario, because we use single-page Web content which is cached by the Web accelerator on the first hit. Testing the Web cache requires large Web content, such that the Web accelerator constantly evicts some cache entries and fetches new ones from the backend server, which introduces more network I/O, probably disk I/O in the Nginx case, and certainly exercises the cache eviction algorithm. This is a relatively complex topic which merits its own benchmark and independent analysis. We'll do this for Tempesta FW in the future. Meantime I want to share an interesting Nginx vs Varnish performance study from the BBC which discusses complex topics in Web accelerator performance.

Warm Caches

So far so good; let's start benchmarking. ApacheBench (ab) is the first tool Wikipedia mentions, so let's give it a try. We can make the following simple run against Nginx as a first try (all irrelevant output like the tool banner is skipped for brevity):

    # ab -n 10

    Server Software:        nginx/1.6.2
    Server Hostname:
    Server Port:            9000

    Document Path:          /
    Document Length:        867 bytes

    Concurrency Level:      1
    Time taken for tests:   0.076 seconds
    Complete requests:      10
    Failed requests:        0
    Total transferred:      10990 bytes
    HTML transferred:       8670 bytes
    Requests per second:    131.40 [#/sec] (mean)
    Time per request:       7.610 [ms] (mean)
    Time per request:       7.610 [ms] (mean, across all concurrent requests)
    Transfer rate:          141.03 [Kbytes/sec] received

    Connection Times (ms)
                  min  mean[+/-sd] median   max
    Connect:        0    0   0.2      0       1
    Processing:     0    7  21.2      0      68
    Waiting:        0    7  21.2      0      68
    Total:          1    8  21.4      1      69

    Percentage of the requests served within a certain time (ms)
      50%      1
      66%      1
      75%      1
      80%      1
      90%     69
      95%     69
      98%     69
      99%     69
     100%     69 (longest request)

One request took 69ms while the others are within 1ms - too big a difference; what's the reason? The answer is simple - the cache. To service the first request the Web accelerator must forward the request to the backend server, and when it receives the response, it sends the response to the client and caches it. All following requests are serviced directly from the cache. So if you don't test caches, ensure that all the caches are properly warmed up before the start of the real benchmark. As mentioned before, we don't test the cache, so I send one request to the Web accelerator before each benchmark. Also keep in mind that cache entries can go stale, so I use proxy_cache_valid any 30d in the Nginx configuration to make sure there is no cache invalidation during the tests. Actually, there are many different caches which can affect benchmark results; e.g. the CPU cache plays a significant role in Intel TSX benchmarks. So the rule of thumb is to make a test run before the actual benchmark to warm all the caches.
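
For example, a minimal warm-up before the measured run (the URL here is a placeholder for the Web accelerator's address) could look like:

    # warm the Web accelerator cache with a single request
    curl -s -o /dev/null http://$HOST:9000/
    # and only then start the measured run
    ab -n 100000 -c 64 http://$HOST:9000/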

Performance of Performance Measurement Tool

Now the caches are warm and we can start the real measurement for concurrent sessions (several output lines are skipped as they're the same as in the previous run):

    # ab -n 100000 -c 64

    Concurrency Level:      64
    Time taken for tests:   20.906 seconds
    Complete requests:      100000
    Failed requests:        0
    Total transferred:      109900000 bytes
    HTML transferred:       86700000 bytes
    Requests per second:    4783.21 [#/sec] (mean)
    Time per request:       13.380 [ms] (mean)
    Time per request:       0.209 [ms] (mean, across all concurrent requests)
    Transfer rate:          5133.54 [Kbytes/sec] received

    Connection Times (ms)
                  min  mean[+/-sd] median   max
    Connect:        1    6   0.7      6      10
    Processing:     2    8   1.2      7      14
    Waiting:        2    6   1.3      5      12
    Total:          7   13   0.8     13      19

    Percentage of the requests served within a certain time (ms)
      50%     13
      66%     14
      75%     14
      80%     14
      90%     14
      95%     15
      98%     15
      99%     15
     100%     19 (longest request)

Only 4783 RPS for Nginx on an 8-core machine? Yes, and that's just because ab can't produce more load; it's essentially the bottleneck of our installation. If you run top -b -p `pidof ab` -n 10 on the load generation machine, you'll typically find something like:

       PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
     9120 root      20   0   34416   5984   3892 R  99.3  0.3   0:11.57 ab

The problem is that ab is unable to use more than one CPU core. If you need more load, you have to run many ab instances in parallel and merge their results somehow.
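
A rough sketch of such a parallel run (the URL is a placeholder, and the per-instance numbers still have to be summed by hand) could be:

    # one ab instance per CPU core
    for i in $(seq 8); do
        ab -n 100000 -c 8 http://$HOST:9000/ > ab.$i.log &
    done
    wait
    grep 'Requests per second' ab.*.log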

So it makes sense to try another popular HTTP benchmarking tool, wrk, which is able to generate load from many threads. Running wrk with 8 threads (equal to the number of CPUs) gives us much better results:

    # ./wrk -t 8 -c 64 -d 30s
    Running 30s test @
      8 threads and 64 connections
      Thread Stats   Avg      Stdev     Max   +/- Stdev
        Latency     4.66ms   10.93ms 285.48ms   99.14%
        Req/Sec     2.09k   248.31     2.83k    90.03%
      498152 requests in 30.02s, 524.48MB read
    Requests/sec:  16594.13
    Transfer/sec:     17.47MB

Tune the Reference System

17KRPS isn't such an impressive result for Nginx on an 8-core system, right? So the next step is to tune the operating system and the reference program to get more performance, so that our performance comparison becomes fair. There are many resources on the Internet about Linux and Nginx performance tuning, e.g. Tuning NGINX for Performance. This Linux Web Server Performance Benchmark can also give you an idea of how to properly set up the system and which numbers to expect from the benchmarks.

The previous results were for the default Debian Nginx 1.6.2 configuration with almost 1KB of index.html content. So I downloaded Nginx 1.11.3, which introduces performance improvements, replaced the original index.html with "0123456789abcde" (16 bytes including the trailing '\n') and used the same sysctl settings as in the benchmark mentioned above:

    sysctl -w fs.file-max=5000000
    sysctl -w net.core.netdev_max_backlog=400000
    sysctl -w net.core.optmem_max=10000000
    sysctl -w net.core.rmem_default=10000000
    sysctl -w net.core.rmem_max=10000000
    sysctl -w net.core.somaxconn=100000
    sysctl -w net.core.wmem_default=10000000
    sysctl -w net.core.wmem_max=10000000
    sysctl -w net.ipv4.conf.all.rp_filter=1
    sysctl -w net.ipv4.conf.default.rp_filter=1
    sysctl -w net.ipv4.ip_local_port_range='1024 65535'
    sysctl -w net.ipv4.tcp_congestion_control=bic
    sysctl -w net.ipv4.tcp_ecn=0
    sysctl -w net.ipv4.tcp_max_syn_backlog=12000
    sysctl -w net.ipv4.tcp_max_tw_buckets=2000000
    sysctl -w net.ipv4.tcp_mem='30000000 30000000 30000000'
    sysctl -w net.ipv4.tcp_rmem='30000000 30000000 30000000'
    sysctl -w net.ipv4.tcp_sack=1
    sysctl -w net.ipv4.tcp_syncookies=0
    sysctl -w net.ipv4.tcp_timestamps=1
    sysctl -w net.ipv4.tcp_wmem='30000000 30000000 30000000'
    sysctl -w net.ipv4.tcp_tw_recycle=1
    sysctl -w net.ipv4.tcp_tw_reuse=1

My nginx.conf is

    # Don't use gzip module for small content files.
    user www-data;
    pid /run/nginx-proxy.pid;
    worker_processes 8;
    worker_rlimit_nofile 100000;
    error_log /opt/nginx-1.11.3/logs/error-proxy.log crit;
    events {
        worker_connections 16384;
        use epoll;
        multi_accept on;
        accept_mutex off;
    }
    http {
        proxy_cache_path /opt/nginx/cache keys_zone=one:10m;
        server {
            listen 9000 backlog=131072 deferred reuseport;
            location / {
                access_log off;
                # the backend (upstream) Nginx runs on port 8080 on the same
                # host, see the testbed description above
                proxy_pass http://127.0.0.1:8080;
                proxy_cache one;
                proxy_cache_valid any 30d;
                proxy_cache_valid 200 302 10m;
                open_file_cache max=100000 inactive=30d;
                open_file_cache_valid 30d;
                open_file_cache_min_uses 2;
                open_file_cache_errors on;
                # `sendfile on` makes no sense for small files
                etag off;
                tcp_nopush on;
                tcp_nodelay on;
                keepalive_timeout 65;
                keepalive_requests 100000000;
                reset_timedout_connection on;
                client_body_timeout 40;
                send_timeout 10;
            }
        }
    }

So now we've dramatically increased our results to almost 95KRPS for HTTP keep-alive connections:

    # ./wrk -t 8 -c 64 -d 30s
    Running 30s test @
      8 threads and 64 connections
      Thread Stats   Avg      Stdev     Max   +/- Stdev
        Latency   674.67us  632.24us  26.02ms   97.67%
        Req/Sec    11.83k     1.09k   20.05k    57.67%
      2830565 requests in 30.10s, 680.26MB read
    Requests/sec:  94793.24
    Transfer/sec:     22.60MB

And 17KRPS for "Connection: close" requests:

    # ./wrk -t 8 -c 64 -d 30s --header 'Connection: close'
    Running 30s test @
      8 threads and 64 connections
      Thread Stats   Avg      Stdev     Max   +/- Stdev
        Latency     3.27ms    2.26ms  26.19ms   71.41%
        Req/Sec     2.16k   105.14     2.56k    69.83%
      516962 requests in 30.03s, 121.77MB read
    Requests/sec:  17215.52
    Transfer/sec:      4.06MB

Compare Performance Measurements

So now we have data about the performance of the reference system, and it's time to measure the performance of our program. Comprehensive performance benchmarks covering all the mentioned test cases are complex and too large, so I use only 3 of them, probably the most notable ones. All the tests are run for 1, 8, 64, 512, 2048, 8192 and 32768 connections and use the same 16-byte index.html.

I used the following configuration file for Tempesta FW:

    server conns_n=8;
    cache 1;
    cache_fulfill * *;

The first test uses keep-alive HTTP connections and sends the smallest possible GET requests:

    ./wrk -t $T -c $C -d 30s http://$HOST:$P/

where $C is the number of connections, from 1 to 32,768; $T is the number of threads, 1 for $C = 1 and 8 for larger values; $HOST is the address of the server under test; and $P is the port: 9000 for Nginx and 80 for Tempesta FW.
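
So the whole series for the first test boils down to a simple driver loop, roughly:

    # sketch of the driver loop for the keep-alive test
    for C in 1 8 64 512 2048 8192 32768; do
        T=$(( C == 1 ? 1 : 8 ))
        ./wrk -t $T -c $C -d 30s http://$HOST:$P/
    done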

The second test adds the 'Connection: close' header to the requests:

    ./wrk -t $T -c $C -d 30s --header 'Connection: close' http://$HOST:$P/

And the last test mimics real-life requests with a large Cookie header; keep-alive connections are used:

    ./wrk -t $T -c $C -d 30s --header 'Connection: keep-alive' --header 'Upgrade-Insecure-Requests: 1' --header 'User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36' --header 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8' --header 'Accept-Encoding: gzip, deflate, sdch' --header 'Accept-Language: en-US,en;q=0.8,ru;q=0.6' --header 'Cookie: a=sdfasd; sdf=3242u389erfhhs; djcnjhe=sdfsdafsdjfb324te1267dd; sdaf=mo2u8943478t67437461746rfdgfcdc; ityu=9u489573484duifhd; GTYFT=nsdjhcbyq3te76ewgfcZ; uityut=23Y746756247856425784657; GA=URHUFVHHVSDNFDHGYSDGF; a=%45345%dfdfg%4656%4534sdfjhsdb.sdfsg.sdfgsf.; nsdjhfb=4358345y; jkbsdff=aaaa; aa=4583478; aaaaa=34435345; rrr=iy7t67t6tsdf; ggg=234i5y24785y78ry534785; mmm=23uy47fbhdsfbgh; bsdfhbhfgdqqwew=883476757%345345; jksdfb=2348y; ndfsgsfdg=235trHHVGHFGC; a=sdfasd; sdf=3242u389erfhhs; djcnjhe=sdfsdafsdjfb324te1267dd; sdaf=mo2u8943478t67437461746rfdgfcdc; ityu=9u489573484duifhd; GTYFT=nsdjhcbyq3te76ewgfcZ; uityut=23Y746756247856425784657; GA=URHUFVHHVSDNFDHGYSDGF; a=%45345%dfdfg%4656%4534sdfjhsdb.sdfsg.sdfgsf.; nsdjhfb=4358345y; jkbsdff=aaaa; aa=4583478; aaaaa=34435345; rrr=iy7t67t6tsdf; ggg=234i5y24785y78ry534785; mmm=23uy47fbhdsfbgh; bsdfhbhfgdqqwew=883476757%345345; jksdfb=2348y; ndfsgsfdg=235trHHVGHFGC; erertrt=3242342343423324234; ggggggggg=8888888888888888888888888888888888888888888888888888888888888788' http://$HOST:$P/

The plots show that we beat the user-space Web accelerator by at most 2 times, and only about 1.5 times on average. Does it mean that we did a poor job and didn't reach our targets? Not at all. Let's understand the results and figure out the next steps.

Understand the Results

As the first step, let's see the top output on the Nginx server. It shows

    %Cpu(s): 28.0 us, 44.7 sy,  0.0 ni,  0.2 id,  0.0 wa,  0.0 hi, 27.0 si,  0.0 st

for small requests and keep-alive connections or

    %Cpu(s):  35.7 us, 42.3 sy,  0.0 ni,  0.2 id,  0.0 wa,  0.0 hi, 21.8 si,  0.0 st

for real-life requests with keep-alive connections. In all these cases the traffic-generating machine is idle for about 15-30% of the time:

    %Cpu(s):  5.3 us, 52.6 sy,  0.0 ni, 28.0 id,  0.0 wa,  0.1 hi, 13.9 si,  0.0 st

Meantime Tempesta FW benchmarks behave very differently:

    %Cpu(s):  0.0 us,  1.2 sy,  0.0 ni, 58.4 id,  0.0 wa,  0.0 hi, 40.3 si,  0.0 st

for real life requests and

    %Cpu(s):  0.0 us,  0.2 sy,  0.0 ni, 94.3 id,  0.0 wa,  0.0 hi,  5.4 si,  0.0 st

for small requests. Note that in the last case 94% of the CPU is idle, i.e. Tempesta FW uses only about 5% of the CPU to process almost 140KRPS! That's not bad at all. By the way, Tempesta FW works fully in softirq context, which is why the top output shows the CPU running in softirq only.

And now let's have a look at traffic generating machine:

    %Cpu(s):  6.0 us, 62.2 sy,  0.0 ni,  0.1 id,  0.0 wa,  0.0 hi, 31.7 si,  0.0 st

So now the traffic generator is the bottleneck again. This is why we could not show results better than 2 times faster than Nginx. So this time, to put more load onto our Web accelerator, we need much stronger hardware to generate more HTTP load. However, simply increasing the number of CPU cores doesn't change the results for small connection numbers, like 1 or 8, because only a few CPU cores are able to generate that load. To show the best results for small connection numbers we need a machine with much better single-core performance.

What to Improve

OK, with more powerful hardware to generate HTTP traffic we could show numbers 5-10 times better for keep-alive connections and small requests. However, our top results for real-life requests weren't so promising: we ate almost 60% of the CPU. So this is the point for improvement.

perf top (again, I refer you to the perf tutorial and Brendan Gregg's nice article) shows us the following profile:

     13.67%  [tempesta_fw]  [k] tfw_http_parse_req
      4.37%  [kernel]       [k] __memset
      3.80%  [tempesta_fw]  [k] __str_grow_tree
      3.52%  [tempesta_fw]  [k] tfw_http_msg_field_chunk_fixup
      2.98%  [kernel]       [k] debug_lockdep_rcu_enabled
      2.80%  [kernel]       [k] lock_acquire
      2.67%  [kernel]       [k] lock_release
      2.14%  [kernel]       [k] check_poison_obj
      2.03%  [tempesta_fw]  [k] tfw_pool_realloc

So the bottleneck is the HTTP parser - actually a known issue.

Rate Limiting

I didn't measure the 4th case, blocking of rate-limited clients. While the test is very notable, it's not easy to do properly. The reason is that traditional Web server performance measurement tools like ab or wrk can't measure how many requests or connections were blocked, only how many of them were processed. To address the issue we're developing our own benchmark tool, tfw_bomber, which "bombs" a Web server with normal as well as malicious requests and is able to measure blocked connections and requests. I'll describe the tool in upcoming articles.

Sunday, June 5, 2016

Tempesta FW: First Installation

Dear colleagues, I haven't written posts in this blog for a long time. That's because we were releasing Tempesta FW. And now I'm happy to announce that http://tempesta-tech.com is from now on accelerated by Tempesta FW. This is the first installation of our Web accelerator. It was such a long way!

Now the site is powered by Apache HTTPD behind Tempesta FW. However, we'll shortly implement Web server features as well, and then we'll be able to serve sites entirely with Tempesta FW.

Stay tuned!

Sunday, April 17, 2016

x86-64 Wastes TLB Entries

Basically, the TLB caches page table translations. We need the page table since a contiguous virtual memory area can in fact consist of many physical fragments. So the page table is used to map the physical memory fragments, pages, to some virtual address space. x86-64 has a 4-level page table, so if you access a page, you in fact make 5 memory accesses (4 to walk the page table to resolve the virtual address, and the last one is your actual access). To mitigate the performance overhead caused by page table translations, the TLB is used to cache virtual-to-physical address translations.

Surprisingly, when the OS boots it maps the whole physical memory to some virtual address space. It's important that the physical and virtual address spaces are both contiguous here. And the OS kernel works with exactly this virtual address space. For example, kmalloc() in the Linux kernel returns a pointer from this virtual address space. However, vmalloc() maps physical pages to some other virtual address space, named the vmalloc area.

Thus, kernel addresses can be resolved by simple offsets (see linux/arch/x86/include/asm/page_64.h):

    static inline unsigned long
    __phys_addr_nodebug(unsigned long x)
    {
        unsigned long y = x - __START_KERNEL_map;

        /* use the carry flag to determine if
           x was < __START_KERNEL_map */
        x = y + ((x > y)
                 ? phys_base
                 : (__START_KERNEL_map - PAGE_OFFSET));

        return x;
    }

Since virtual-to-physical address translation is trivial here, why on earth do we need to use the page table and waste invaluable TLB entries for these translations? Unfortunately, there is nothing like MIPS's directly mapped kernel space segments on x86-64. The sad story about x86-64 is that trivial mappings waste TLB entries and require extra memory transfers. The only thing x86-64 does to optimize TLB usage is global pages, e.g. kernel mappings, which are never invalidated in the TLB on context switches. But still, if you switch to the kernel, kernel mappings evict your user-space mappings from the TLB. Meantime, if your application is memory greedy, then syscalls can take a long time due to TLB cache misses.

Wednesday, September 30, 2015

Fast Memory Pool Allocators: Boost, Nginx & Tempesta FW

Memory Pools and Region-based Memory Management

Memory pools and region-based memory management allow you to improve your program's performance by avoiding unnecessary memory freeing calls. Moreover, pure memory pools gain even more performance due to simpler internal mechanisms. The techniques are widely used in Web servers, and with them you can do the following (pseudo code for some imaginary Web server):

    http_req->pool = pool_create();
    while (read_request && parse_request) {
        http_req->hdr[i] = pool_alloc(http_req->pool);
        // Do other stuff, don't free allocated memory.
    }
    // Send the request.
    // Now destroy all allocated items at once.
    pool_destroy(http_req->pool);

This reduces the number of memory allocator calls, simplifies the allocator's internal mechanisms (e.g. since you don't free individual memory chunks, it doesn't need to care about memory fragmentation and many other problems which common memory allocators must solve at relatively high cost) and makes the program run faster.

You've probably noticed that we call pool_alloc() in the example above without specifying an allocation size. And here we come to the difference between memory pools and region-based memory management: memory pools can allocate memory chunks of only one fixed size, while region-based memory allocators are able to allocate chunks of different sizes. Meantime, both of them allow you not to care about freeing memory and to just drop all the allocated memory chunks at once.
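
To make the difference concrete, here is a toy region allocator - just the core idea of bump-pointer allocation, not how boost::pool, Nginx or Tempesta FW actually implement it:

    #include <stdlib.h>

    /* A toy region: one big block and a bump pointer. */
    typedef struct {
        char   *data;
        size_t  off, size;
    } region_t;

    static void *
    region_alloc(region_t *r, size_t n)
    {
        /* No per-chunk headers and no free lists - just move the pointer. */
        if (r->off + n > r->size)
            return NULL; /* a real allocator would grow the region here */
        void *p = r->data + r->off;
        r->off += n;
        return p;
    }

    /* Dropping the whole region frees all the chunks at once. */
    static void
    region_destroy(region_t *r)
    {
        free(r->data);
    }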

The Boost C++ library provides the memory pool library boost::pool; let's have a closer look at it. By the way, there is a nice description of memory pool concepts.


If we run the following loop for N = 20,000,000 (this is the body of a template function with argument T; you can find the full source code of the benchmark in our repository on GitHub):

    boost::object_pool<T> p;

    for (size_t i = 0; i < N * sizeof(Small) / sizeof(T); ++i) {
        T *o = p.malloc();
        touch_obj(o);
    }

it takes about 260ms for a T of 24 bytes (the size of structure Small) and about 170ms for an object of 240 bytes (the size of structure Big) on my Intel Core i7-4650U 1.70GHz. Please note that I don't measure pool construction and destruction times, to concentrate on the memory management techniques without additional expenses. Also, I limit the number of allocations by

    N * sizeof(Small) / sizeof(T)

to avoid exhausting memory in the tests with large object sizes.

Hereafter touch_obj() is just a simple macro checking the result of the memory allocation and breaking the current loop on failure:

    #define touch_obj(o)                                   \
        if (__builtin_expect(!o, 0)) {                     \
                std::cerr << "failed alloc" << std::endl;  \
                break;                                     \
        } else {                                           \
                *(long *)o = 1;                            \
        }

Now if we free each 4th allocated element:

    for (size_t i = 0; i < N * sizeof(Small) / sizeof(T); ++i) {
        T *o = p.malloc();
        touch_obj(o);
        if (__builtin_expect(!(i & 3), 0))
            p.free(o);
    }

the function takes about 165ms and 155ms for T sizes of 24 and 240 bytes respectively. This is almost nothing for big objects, but more than a 50% performance improvement for small objects. Results for the boost::pool benchmark:

    $ ./a.out 
                 small object size:    24
                   big object size:    240

               boost::pool (Small):    258ms
       boost::pool w/ free (Small):    165ms
                 boost::pool (Big):    174ms
         boost::pool w/ free (Big):    154ms

(I take the best numbers from 3 runs of the test.) The loop now contains an extra conditional jump and a free() call, which is surely not free. So why do we get a performance improvement in the version with object freeing? There are two reasons why it's faster:

  1. the main reason is that in the first version of the program we're allocating more than 450MB of RAM, so we put serious pressure on the system memory allocator, which must allocate a lot of memory for us, while the second version of the program requires 25% less system memory;
  2. each write operation touches a new memory region, so some cache lines are evicted from the CPU cache and the new memory is loaded into the cache. Thus, we get serious CPU cache starvation.

Therefore, periodic or occasional memory freeing reduces the overall work (while we have more free() calls, the underlying system allocator is called more rarely) and effectively reuses CPU cache lines.

But why doesn't the big allocation test gain from freeing? The short answer is the exponential growth of the memory storage used by boost::pool and the particular numbers (N and sizeof(T)) in the test. The key function of boost::object_pool is pool<UserAllocator>::ordered_malloc_need_resize() (simplified for brevity):

    template <typename UserAllocator>
    void *
    pool<UserAllocator>::ordered_malloc_need_resize()
    {
        size_type POD_size = next_size * partition_size + ...;
        char * ptr = (UserAllocator::malloc)(POD_size);
        ...
        if (!max_size)
            next_size <<= 1;
        ...
    }

The function requests memory from the system allocator in exponentially growing steps. If we run strace -e trace=mmap against the test for small objects without freeing, we can see the following mmap() calls (glibc uses mmap() for large allocations and brk() for smaller ones):

    mmap(NULL, 200704,....) = 0x7fdb428bd000
    mmap(NULL, 397312, ...) = 0x7fdb4285c000
    mmap(NULL, 790528, ...) = 0x7fdb4279b000
    mmap(NULL, 201330688, ....) = 0x7fdb295e3000
    mmap(NULL, 402657280, ....) = 0x7fdb115e2000

12 calls in total for the benchmark code itself. The benchmark for small objects with freeing produces 11 calls, ending with

    mmap(NULL, 201330688, ....) = 0x7f91ed60b000

So the freeing really reduces the amount of allocated memory and improves performance. However, this is not the case for the big allocation benchmarks - both benchmarks make the same number of equal mmap() calls:

    mmap(NULL, 249856, ....) = 0x7f34431be000
    mmap(NULL, 251662336, ....) = 0x7f3423f50000

I.e. in this particular test the last very large memory allocation is enough to satisfy the benchmark memory requirements even without freeing.

Boost::pool is a pure memory pool which allows you to allocate memory chunks of only one fixed size, but this is not the case for the Nginx memory pool, which is able to allocate objects of different sizes.

Nginx Pool

Wikipedia says: "the web server Nginx, use the term memory pool to refer to a group of variable-size allocations which can be later deallocated all at once. This is also known as a region". In fact, Nginx's pool allows you to allocate variable-size memory chunks of up to 4095 bytes on x86-64. You can request larger memory areas too, but they're allocated using plain malloc().

To study the performance of the Nginx pool I copied & pasted some pieces of src/core/ngx_palloc.c from nginx-1.9.5. The benchmark code for small and large memory allocations now looks like the following (you can find the full code here):

    ngx_pool_t *p = ngx_create_pool(PAGE_SIZE);

    for (size_t i = 0; i < N * sizeof(Small) / sizeof(T); ++i) {
        T *o = (T *)ngx_palloc(p, sizeof(T));
        touch_obj(o);
    }


Nginx has ngx_pfree(), but it's for large data blocks only, so small memory chunks aren't freeable and we test periodic memory freeing for large memory chunks only. Nginx's pool supports different allocation sizes, so let's test mixed allocations with 1/4 big (240B) and 1/4096 huge (8305B) chunks:

    ngx_pool_t *p = ngx_create_pool(PAGE_SIZE);

    for (size_t i = 0; i < N; ++i) {
        if (__builtin_expect(!(i & 3), 0)) {
            if (__builtin_expect(!(i & 0xfff), 0)) {
                Huge *o = (Huge *)ngx_palloc(p, sizeof(*o));
                touch_obj(o);
                if (!(i & 1))
                    ngx_pfree(p, o);
            } else {
                Big *o = (Big *)ngx_palloc(p, sizeof(*o));
                touch_obj(o);
            }
        } else {
            Small *o = (Small *)ngx_palloc(p, sizeof(*o));
            touch_obj(o);
        }
    }


The code above gives us the following results:

                  ngx_pool (Small):    305ms
                    ngx_pool (Big):    132ms
            ngx_pool w/ free (Mix):    542ms

Let's review the Nginx pool implementation and quickly analyze the results:

  1. the implementation is very simple (the ported allocator code takes less than 200 lines), without the complex OOP wrapper code which boost::pool has in plenty;
  2. ngx_palloc() scans each memory block at most 4 times, trying to better utilize the allocated memory blocks. There is a small multiplication factor, but the algorithm is still O(1). If you do a small allocation after a few large allocations, it's likely that there is some room to place the new chunk somewhere in the previously allocated blocks without requesting a new memory block. This is the cost of the ability to allocate memory chunks of different sizes. Boost::pool just puts the current memory chunk at the end of the allocated data, so it doesn't do any scans at all;
  3. Nginx grows its pool by blocks of fixed size; it doesn't use exponential growth like boost::pool.

Update. Previously I explained the difference between the results for Big and Small allocations by O(n) complexity in ngx_palloc(); in particular, I was confused by this code:

    do {
        m = ngx_align_ptr(p->d.last, NGX_ALIGNMENT);
        if ((size_t)(p->d.end - m) >= size) {
            p->d.last = m + size;
            return m;
        }
        p = p->d.next;
    } while (p);

However, ngx_palloc_block() does

    for (p = pool->current; p->d.next; p = p->d.next) {
        if (p->d.failed++ > 4)
            pool->current = p->d.next;
    }

So when a new block is allocated, the rest of the blocks increment their failed counters, meaning that the current request could not be satisfied from any of them. Thus, we try to allocate a chunk from a block at most 4 times; after that we move pool->current forward and never touch the head of the list again. So typically we have a very long head of the list and a relatively small tail which is scanned in ngx_palloc().

Thanks to Maxim Dounin and Rinat Ibragimov for pointing out the mistake to me. Meantime, Rinat tried the following change to ngx_palloc_block():

    - for (p = pool->current; p->d.next; p = p->d.next) {
    -     if (p->d.failed++ > 4)
    -         pool->current = p->d.next;
    - }
    + pool->current = p_new;

I.e. he removed the scans entirely. After the change, the benchmark results on his machine changed from

                  ngx_pool (Small):    475ms
                    ngx_pool (Big):    198ms
            ngx_pool w/ free (Mix):    817ms
             ngx_pool cr. & destr.:    161ms

to

                  ngx_pool (Small):    249ms
                    ngx_pool (Big):    163ms
            ngx_pool w/ free (Mix):    585ms
             ngx_pool cr. & destr.:    476ms

So the scans really do degrade performance in the particular workload from the benchmark; however, strictly speaking, the algorithm is still constant time.

The results are somewhat worse than boost::pool's. Does it mean that boost::pool is better? Nope. Nginx is designed to process thousands or more HTTP requests per second, and each request has a memory pool associated with it. Obviously, a regular HTTP request doesn't have 20 million chunks, headers or so on, so our test is somewhat unnatural for Nginx's pool. To compare Nginx's pool with boost::pool in a more natural workload for Nginx, I did another benchmark:

    for (size_t i = 0; i < N / 100; ++i) {
        ngx_pool_t *p = ngx_create_pool(PAGE_SIZE);
        for (size_t j = 0; j < 100; ++j) {
            if (__builtin_expect(!(i & 3), 0)) {
                Big *o = (Big *)ngx_palloc(p, sizeof(*o));
                touch_obj(o);
            } else {
                Small *o = (Small *)ngx_palloc(p, sizeof(*o));
                touch_obj(o);
            }
        }
        ngx_destroy_pool(p);
    }

Since boost::pool can't allocate chunks of various sizes, I just ran 1/4 of all loops with Big and 3/4 with Small object allocations. The results are:

          boost::pool cr. & destr.:    133ms
             ngx_pool cr. & destr.:    98ms

And this time Nginx pool is more than 30% faster than boost::pool.

By the way, the APR library used in the Apache HTTPD server uses mechanisms somewhat similar to the Nginx pool.


In Tempesta FW we tried to make a more flexible pool allocator (actually a region-based memory allocator, but we follow the Apache HTTPD and Nginx developers in the allocator's name), such that it's able to free allocated memory like boost::pool and allocate variable-size memory chunks like Nginx - but make it as fast as possible.

TfwPool uses a stack-like freeing approach. In fact, there are two common cases when we need to free or reallocate the last allocated chunk. The first one is very trivial - when you just need a temporary buffer to do some stuff (e.g. snprintf()).

For example, checking a very short string with the HTTP Host header looks like:

    buf = tfw_pool_alloc(req->pool, len);
    tfw_str_to_cstr(&field, buf, len);
    if (!tfw_addr_pton(buf, &addr)) {
        // Process the bad Host header
    }
    tfw_pool_free(req->pool, buf, len);

tfw_str_to_cstr() copies a chunked string into the contiguous buf. If you design your application as purely zero-copy, such copies are typically not what you'd like to do. However, zero-copy techniques can be very expensive for small chunked data, when you need to handle a list of small data chunks. So sometimes it's faster to do a small copy rather than do complex things on lists of small chunks. Nevertheless, tfw_str_to_cstr() is considered deprecated and will be replaced with smarter techniques in future versions.

The second case when you typically need to free or reallocate exactly the last allocated memory chunk is realloc(). Imagine that you need to handle a descriptor of the string chunks of HTTP message headers, body, etc. The descriptor is effectively an array of pointers to chunks of the message field. Arrays are better than linked lists because of better memory locality, but you have to pay with a heavy realloc() call to be able to grow an array. So if we had a very fast method to grow an array, we could reach very good performance. And there is a way to grow arrays quickly without the memory copies usual for realloc(). It's worth mentioning that if you're reading, for example, an HTTP message body, then you're growing the array of that field's descriptor while other allocated chunks are not touched. So at some point you just execute realloc() many times against the same memory chunk. If the allocator has some room just after that memory chunk, it only needs to move its data end pointer forward - no real reallocations or copies are necessary.
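
Reusing the toy region_t from the sketch above (again, this only illustrates the principle, not the actual TfwPool code), growing the last allocated chunk is just a pointer move:

    /* Grow the last allocated chunk in place; no data is copied. */
    static void *
    region_realloc_last(region_t *r, void *ptr, size_t old_n, size_t new_n)
    {
        /* The chunk is the last one iff it ends exactly at the bump pointer. */
        if ((char *)ptr + old_n == r->data + r->off
            && r->off - old_n + new_n <= r->size)
        {
            r->off = r->off - old_n + new_n;
            return ptr;
        }
        return NULL; /* otherwise fall back to a normal alloc + copy */
    }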

You probably noticed in the previous code examples that I call free() just after alloc() - actually, this is a relatively frequent case which we kept in mind when designing TfwPool.

As in the Nginx case, I just copied the original code into the benchmark with small adjustments. You can find the original allocator code on GitHub. The results for the allocator are shown below:

                  tfw_pool (Small):    279ms
          tfw_pool w/ free (Small):    101ms
                    tfw_pool (Big):    106ms
            tfw_pool w/ free (Big):    50ms
            tfw_pool w/ free (Mix):    107ms
             tfw_pool cr. & destr.:    53ms

Well, it's faster than Nginx's pool, but the Small test is still slower than for boost::pool. Surely, boost::pool shows amazing numbers thanks to its exponential growth. We don't use that technique because our allocator was designed for the Linux kernel, where large contiguous allocations are not the most efficient way to go.

Our allocator is designed to operate on 4096-byte aligned memory pages. Page alignment (x86-64 pages are 4096 bytes) is good since fewer pages are used, which improves TLB cache usage. In our benchmark we do the allocations using posix_memalign() with an alignment argument of 4096, and this significantly hurts the benchmark performance...
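
That is, the benchmark obtains its pages with calls like the following (a sketch; error handling and the allocation loop are omitted):

    #include <stdlib.h>

    void *p16, *p4k;
    /* ordinary 16-byte alignment */
    posix_memalign(&p16, 16, 4096);
    /* page alignment, as TfwPool wants - this is the expensive variant */
    posix_memalign(&p4k, 4096, 4096);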

Performance Issue with posix_memalign(3)

To understand the posix_memalign() issue, let's run the memalign test, which compares the performance of 4KB allocations aligned to 16 and 4096 bytes respectively:

    $ ./memalign_benchmark 
                       16B aligned:    110ms
                      page aligned:    203ms

Page-aligned allocations are almost 2 times slower - that's too expensive. Why does this happen?

strace shows that the program spends more than 90% of its time in the brk() system call:

    $ strace -c ./memalign_benchmark
                       16B aligned:    233ms
                      page aligned:    334ms
    % time     seconds  usecs/call     calls    errors syscall
    ------ ----------- ----------- --------- --------- ----------------
     93.81    0.010849           1     17849           brk
      1.98    0.000229          13        17           mmap
      1.03    0.000119          12        10           mprotect

If we run each part of the benchmark independently, we find the following (first I run the 16-byte alignment test and then the page alignment test):

    $ strace -e trace=brk ./memalign_benchmark 2>&1 |wc -l

    $ strace -e trace=brk ./memalign_benchmark 2>&1 |wc -l

The bigger number of brk() calls is expected, since glibc requests more memory from the system and throws almost half of it away for the alignment. So a large alignment argument is very expensive for posix_memalign().

Meantime, the operating system kernel manages memory in pages, which are naturally 4096-byte aligned, so you can get properly aligned memory chunks without the alignment overhead. Moreover, when you do malloc() or posix_memalign(), the kernel may have to allocate a VMA (Virtual Memory Area - the kernel descriptor holding information about an address mapping) for it. Next, when you write to the area, a page fault happens, which allocates a new page and installs it into the current page table. A lot of things happen when we work with memory in user space. The nice thing which glibc does to mitigate the impact of all this complexity is caching the memory requested from the kernel.

Tempesta FW: Native Kernel Implementation

It was also interesting to see how much a regular application can gain from moving to kernel space. First, to demonstrate how slow user-space memory allocation is (when the application has just started and glibc has no cache yet), I made a very simple test:

    static const size_t N = 100 * 1000;
    static void *p_arr[N];

    for (size_t i = 0; i < N; ++i) {
        r = posix_memalign((void **)&p_arr[i], 16, PAGE_SIZE);
        touch_obj(p_arr[i]);
    }
    for (size_t i = 0; i < N; ++i)
        free(p_arr[i]);

It takes about 112ms on my system:

    $ ./memalign_benchmark 
                      alloc & free:    112ms

Now let's try the kernel test; you can find it next to the other tests. You can run the test by loading the kernel module (it's built by the Makefile like the other benchmarks) and dmesg will show you the results (don't forget to remove the module):

    $ sudo insmod pool.ko
    $ sudo rmmod pool
    $ lsmod |grep pool

First, let's have a look at how fast memory allocation is in the kernel:

    $ dmesg |grep 'alloc & free'
    alloc & free:  22ms

This is 5 times faster!
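
For reference, the kernel-side counterpart of the alloc & free test is presumably little more than a loop over the page allocator; here is a sketch (not the actual module code, which lives in the repository):

    #include <linux/kernel.h>
    #include <linux/gfp.h>

    static unsigned long pages[100 * 1000];

    static void
    kernel_alloc_free_test(void)
    {
        size_t i;

        /* allocate single (order-0) pages straight from the page allocator */
        for (i = 0; i < ARRAY_SIZE(pages); ++i)
            pages[i] = __get_free_pages(GFP_KERNEL, 0);
        /* ...and give them back */
        for (i = 0; i < ARRAY_SIZE(pages); ++i)
            if (pages[i])
                free_pages(pages[i], 0);
    }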

It's very curious how much a regular application can gain from moving to the kernel, so I also ported Nginx's pool to a kernel module; here are the results:

    ngx_pool (Small):               229ms
    ngx_pool (Big):                 49ms
    ngx_pool w/ free (Mix):         291ms
    ngx_pool cr. & destr.:          150ms

This is 30% faster for small allocations, 270% (!) faster for big objects and 86% faster for the mixed workload... Wait... What happened to our "real life" test with many pools and a low number of allocations? It's more than 50% slower!..

The problem is that when you free a page using the free_pages() interface, some pages go to the per-CPU page cache, but the main point is that they go back to the buddy allocator, where they can be coalesced with their buddies to satisfy further requests for large allocations. Meantime, glibc does a good job of aggressively caching allocated memory. So the benchmark causes significant pressure on the Linux buddy system.

TfwPool uses its own lightweight per-CPU page cache, so we get much better results:

    tfw_pool (Small):               95ms
    tfw_pool w/ free (Small):       90ms
    tfw_pool (Big):                 45ms
    tfw_pool w/ free (Big):         35ms
    tfw_pool w/ free (Mix):         99ms
    tfw_pool cr. & destr.:          54ms

In short: for small allocations it's 80-170% faster than the fastest user-space allocator (boost::pool) and more than 3 times faster than Nginx's pool; it's 3-5 times faster for big allocations, more than 5 times faster for mixed and 2 times faster for "real life" workloads in comparison with Nginx. Recall that the allocator was designed for kernel space, so it shows much better results in the Linux kernel than in user space.


In conclusion, I'd make a couple of points about fast memory pools:

  • while page-aligned memory operations are good due to better TLB usage, they are impractical in user space due to posix_memalign()'s poor performance. You can use your own page cache to cope with the issue;
  • exponential growth is a good and old technique, but people still don't use it for some reason;
  • any memory allocator must solve many problems (time complexity, memory utilization and fragmentation, time to create and destroy a pool, etc.), so it's likely that you can develop a much faster allocator than any existing one for your particular project - there is no ideal memory allocator for all tasks;
  • just moving from user space to the kernel can give your program some performance improvement, but you can also hit performance degradation - user space and the kernel use very different mechanisms, so kernel applications must be designed specifically for kernel space.

Don't you love kernel programming like we do?


Rinat Ibragimov added a benchmark for the GLib memory pool. Now the benchmark output looks like this (the best numbers from 4 runs):

                 small object size:    24
                   big object size:    240
                  huge object size:    8305

                mallocfree (Small):    711ms
        mallocfree w/ free (Small):    478ms
                  mallocfree (Big):    319ms
          mallocfree w/ free (Big):    204ms

               GLib slices (Small):    943ms
       GLib slices w/ free (Small):    621ms
                 GLib slices (Big):    227ms
         GLib slices w/ free (Big):    111ms

               boost::pool (Small):    262ms
       boost::pool w/ free (Small):    169ms
                 boost::pool (Big):    181ms
         boost::pool w/ free (Big):    157ms
          boost::pool cr. & destr.:    136ms

                  ngx_pool (Small):    312ms
                    ngx_pool (Big):    134ms
            ngx_pool w/ free (Mix):    554ms
             ngx_pool cr. & destr.:    99ms

                  tfw_pool (Small):    273ms
          tfw_pool w/ free (Small):    102ms
                    tfw_pool (Big):    107ms
            tfw_pool w/ free (Big):    46ms
            tfw_pool w/ free (Mix):    103ms
             tfw_pool cr. & destr.:    53ms