High Performance Linux

Monday, November 26, 2012

Coding Style

I believe that coding style in a project is important thing. MySQL, PostgreSQL, Linux and FreeBSD kernels, glibc and many other excelent open source projects have their own coding style. The styles are different and that's expected because they were developed by different people with different preferences. The only crucial thing is code consistency. I guess you saw a code where spaces and tabs were mixed for indentations - it is always disgusting to work with such code. So the code of a project can use any style his owner likes which, but it must be consistent.

In our projects we follow Linux kernel coding style (you can find it at linux/Documentation/CodingStyle) with some additions for C++. We use the same coding style for kernel C and application C++ programming, so we have adjusted the original guide with some C++ specific things. Here it is.

Friday, September 21, 2012

Linux: scaling softirq among many CPU cores

Some years ago I have tested network interrupts affinity - you set ~0 as a CPU mask to balance network interrupts among all your CPU cores and you get all softirq instances running in parallel. Such interrupts distribution among CPU cores sometimes is a bad idea due to CPU caches computational burden and probable packets reordering. In most cases it is not recommended for servers performing some TCP application (e.g. web server). However this ability is crucial for some low level packet applications like firewalls, routers or Anti-DDoS solutions (in last cases most of the packets must be dropped as quick as possible), which do a lot of work in softirq. So for some time I was thinking that there is no problem to share softirq load between CPU cores.

To get softirq sharing between CPU cores you just need to do

    $ for irq in `grep eth0 /proc/interrupts | cut -d: -f1`; do \
        echo ffffff > /proc/irq/$irq/smp_affinity; \

This makes (as I thought) your APIC to distribute interrupts between all your CPUs in round-robin fashion (or probably using some more cleaver technique). And this really was working in my tests.

Recently our client concerned about this ability, so I wrote very simple testing kernel module which just makes more work in softirq:

#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/netfilter.h>
#include <linux/netfilter_ipv4.h>


 * Just eat some local CPU time and accept the packet.
static unsigned int
st_hook(unsigned int hooknum, struct sk_buff *skb,
        const struct net_device *in,
        const struct net_device *out,
        int (*okfn)(struct sk_buff *))
    unsigned int i;
    for (i = 0; i <= 1000 * 1000; ++i)

    return NF_ACCEPT;

static struct nf_hook_ops st_ip_ops[] __read_mostly = {
        .hook = st_hook,
        .owner = THIS_MODULE,
        .pf = PF_INET,
        .hooknum = NF_INET_PRE_ROUTING,
        .priority = NF_IP_PRI_FIRST,

static int __init
    if (nf_register_hooks(st_ip_ops, ARRAY_SIZE(st_ip_ops))) {
        printk(KERN_ERR "%s: can't register nf hook\n",
        return 1;
    printk(KERN_ERR "%s: loaded\n", __FILE__);

    return 0;

static void
    nf_unregister_hooks(st_ip_ops, ARRAY_SIZE(st_ip_ops));
    printk(KERN_ERR "%s: unloaded\n", __FILE__);


I loaded the system with iperf over 1Gbps channel. And I was very confused when see that only one CPU of 24-cores machine was doing whole the work and all other CPUs was doing nothing!

To understand what's going on lets have a look how Linux handles incoming packets and interrupts from network card (e.g. Intel 10 Gigabit PCI Express which is placed at drivers/net/ixgbe). Softirq works in per-cpu kernel threads, ksoftirqd (kernel/softirq.c: ksoftirqd()), i.e. if you have 4-cores machine, then you have 4 ksoftirqd threads (ksoftirqd/0, ksoftirqd/1, ksoftirqd/2 and ksoftirqd/3). ksoftirqd() calls do_softirq(), which by-turn calls __do_softirq(). The last one uses softirq_vec vector to get required hadler for current softirq type (e.g. NET_RX_SOFTIRQ for receiving or NET_TX_SOFTIRQ for sending softirqs correspondingly). The next step is to call virtual function action() for the handler. For NET_RX_SOFTIRQ net_rx_action() (net/core/dev.c) is called here. net_rx_action() reads napi_struct from per-cpu queue softnet_data and calls virtual function poll() - a NAPI callback (ixgbe_poll() in our case) which actually reads packets from the device ring queues. The driver processes interrupts in ixgbe_intr(). This function runs NAPI through call __napi_schedule(), which pushes current napi_struct to per-cpu softnet_data->poll_list, which net_rx_action() reads packets (on the same CPU) from. Thus softirq runs on the same core which received hardware interrupt.

This way theoretically if harware interrupts are going to N cores, then these and only these N cores are doing softirq. So I had a look at /proc/interrupts statistics and saw that only one 0th core is actually receiving interrupts from NIC while I set ~0 mask in smp_affinity for the interupt (actually I had MSI-X card, so I set the mask to all the interrupt vectors for the card).

I started googling for the answers why on earth interupts do not distribute among all the cores. The first topics which I found were nice articles by Alexander Sandler:

Following these articles not all hardware is actually able to spread interrupts between CPU cores. During my tests I was using IBM servers of particular model, but this is not the case of the client - they use very different hardware. This is why I saw one nice picture on my previous tests, but faced quite different behaviour on other hardware.

The good news is that linux 2.6.35 has introduced nice feature -  RPS (Receive Packet Steering). The core of the feature is get_rps_cpu() from dev/net/core.c, which computes a hash from IP source and destination addresses of an incoming packet and determines a which CPU send the packet to based on the hash. netif_receive_skb() or netif_rx() which call the function puts the packet to appropriate per-cpu queue for further processing by softirq. So there are two important consequences:
  1. packets are processed by different CPUs (with processing I mostly mean Netfilter pre-routing hooks);
  2. it is unlikely that packets belonging to the same TCP stream are reordered (packets reordering is a well-known problem for TCP performance, see for example Beyond softnet).
To enable the feature you should specify CPUs mask as following (the adapter from the example is connected via MSI-X and has 8 tx-rx queues, so we need to update masks for all the queues):

    $ for i in `seq 0 7`; do \
        echo fffffff > /sys/class/net/eth0/queues/rx-$i/rps_cpus ; \

After runnign linux-2.6.35 and setting all CPUs to be able to process softirq I got following nice picture in top:

  2238 root      20   0  411m  888  740 S  152  0.0   2:38.94 iperf
    10 root      20   0     0    0    0 R  100  0.0   0:35.44 ksoftirqd/2
    19 root      20   0     0    0    0 R  100  0.0   0:46.48 ksoftirqd/5
    22 root      20   0     0    0    0 R  100  0.0   0:29.10 ksoftirqd/6
    25 root      20   0     0    0    0 R  100  0.0   2:47.36 ksoftirqd/7
    28 root      20   0     0    0    0 R  100  0.0   0:33.73 ksoftirqd/8
    31 root      20   0     0    0    0 R  100  0.0   0:46.63 ksoftirqd/9
    40 root      20   0     0    0    0 R  100  0.0   0:45.33 ksoftirqd/12
    46 root      20   0     0    0    0 R  100  0.0   0:29.10 ksoftirqd/14
    49 root      20   0     0    0    0 R  100  0.0   0:47.35 ksoftirqd/15
    52 root      20   0     0    0    0 R  100  0.0   2:33.74 ksoftirqd/16
    55 root      20   0     0    0    0 R  100  0.0   0:46.92 ksoftirqd/17
    58 root      20   0     0    0    0 R  100  0.0   0:32.07 ksoftirqd/18
    67 root      20   0     0    0    0 R  100  0.0   0:46.63 ksoftirqd/21
    70 root      20   0     0    0    0 R  100  0.0   0:28.95 ksoftirqd/22
    73 root      20   0     0    0    0 R  100  0.0   0:45.03 ksoftirqd/23
     7 root      20   0     0    0    0 R   99  0.0   0:47.97 ksoftirqd/1
    37 root      20   0     0    0    0 R   99  0.0   2:42.29 ksoftirqd/11
    34 root      20   0     0    0    0 R   77  0.0   0:28.78 ksoftirqd/10
    64 root      20   0     0    0    0 R   76  0.0   0:30.34 ksoftirqd/20

So as we see almost all of the cores are doing softirqs.

Saturday, May 26, 2012

Software Transactional Memory (STM) in GCC-4.7

GCC-4.7 introduces new amazing feature - Software Transactional Memory (STM). It is still experimental and not yet optimized feature, however we already can have a look how STM works. Currently GCC implements pure Software TM, i.e. without hardware support. Intel announced hardware support for TM (HTM) in Haswell microarchitecture as Transactional Synchronization Extension (TSX), so probably in next year we'll have hybrid TM - software transactional memory with hardware optimizations.

Firstly, to understand what STM is, lets consider following simple program:

    #include <iostream> 
    #include <thread>

    static const auto THR_NUM = 4;
    static const auto ITER_NUM = 1000 * 1000;

    static auto a = 0, b = 0, c = 0;

    static void
            for (auto i = 0; i < ITER_NUM; ++i) {
                    b += 2;
                    c = a + b;

    main(int argc, char *argv[])
            std::thread thr[THR_NUM];

            for (auto &t : thr)
                    t = std::thread(thr_func);

            for (auto &t : thr)

            std::cout << "a=" << a << " b=" << b
                    << " c=" << c << std::endl;

            return 0;

Now try to compile (don't forget -std=c++11 since C++11 is still not default option for g++) and run the program. Probably you'll see that a, b and c contains ugly values which change from run to run, e.g.:

        $ ./a.out
        a=2139058 b=4316262 c=6455320
        $ ./a.out
        a=2077152 b=4463948 c=6541100

Result is expected because 4 threads concurrently updates all the three variables and all the variables are updated in RMW (Read-Modify-Write) manner. Now lets place operations on all the three variables into one transaction (yes, this is very like database transactions), so all the variables will be read and written in atomic manner:

        static void
                for (auto i = 0; i < ITER_NUM; ++i)
                        __transaction_atomic {
                                b += 2;
                                c = a + b;

Lets compile the code with -fgnu-tm to enable STM in GCC and rerun the program. This time you'll see nice numbers, which stay the same regardless of the run try:

        $ ./a.out
        a=4000000 b=8000000 c=12000000
        $ ./a.out
        a=4000000 b=8000000 c=12000000

This is quite simple case and you'll probably prefer to use mutex here. But you can refer to Ulrich Drepper's "Parallel Programming with Transactional Memory" for more complicated example when mutex alternative is not so obvious. It's easy to see that STM would be quite useful for example to implement highly-concurrent self-balancing binary search tree which could need to lock number of nodes for rotation on insertion or deletion (traditionally such data structures are implemented by introducing per-node mutex and are prone to deadlocks).

You may noticed that STM version of the program runs much more slowly. So lets analyze what it's doing so long. For basic investigation lets run the program under strace and print system calls statistics:

        $ strace -f -c ./a.out
        % time   seconds  usecs/call   calls  errors syscall
        ------ --------- ----------- ------- ------- ---------
         99.39  0.021295          11    1920     390 futex

So it means that STM in libitm (GCC implements STM as libitm library which you can see in ldd output) is implemented via futex() system call, like common mutex. Before going deeper into libitm internals lets see at the transaction code more carefully and split the code into basic read and write operations. We have 3 memory locations, variables a, b and c, which we perform read and write operations on. First operation is ++a which is actually read a value from memory, update it and write back, so we have two operations here - one read and one write. Next b += 2 - exactly the same: read the value, add 2 and write it back. And the last one, c = a + b, is two reads (a and b) and one write to c. Moreover, all these operation are inside transaction, so we have to start and commit a transaction.

To understand what's going on inside thr_func() lets simplify it as follows:

        static void
               __transaction_atomic {

and disassemble it:

        push   %rbp
        mov    %rsp,%rbp
        mov    $0x29,%edi
        mov    $0x0,%eax
        callq  400fd8 <_ITM_beginTransaction@plt>
        mov    $0x6052ec,%edi
        callq  4010b8 <_ITM_RU4@plt>
        add    $0x1,%eax
        mov    %eax,%esi
        mov    $0x6052ec,%edi
        callq  400fe8 <_ITM_WU4@plt>
        callq  400f48 <_ITM_commitTransaction@plt>
        pop    %rbp

Now we see four calls of _ITM_* functions (as explained in info libitm, GCC follows the Intel's Draft Specification of Transactional LanguageConstructs for C++ (v1.1) in its implementation of transactions, so _ITM_ prefix is just Intel's naming convention) for transaction begin, transaction commit and the pair of read (RU4) and write (WU4) operations.

_ITM_beginTransaction() saves the machine state (for x86 see  libitm/config/x86/sjlj.S) and calls GTM::gtm_thread::begin_transaction() (see libitm/beginend.cc) which initializes the transaction data, checks transaction nesting and performs other preparation steps.

_ITM_commitTransaction() is defined in  libitm/beginend.cc and tries to commit the transaction by calling GTM::gtm_thread::trycommit() and if it fails restarts the transaction. GTM::gtm_thread::trycommit() is the place where all the threads are sleeping in futex() (which we saw in strace output) to write all modified data. So this is most heavy part of transaction.

The most interesting stuff is in read and write operations. 0x6052ec is address of variable a. _ITM_RU4 and _ITM_WU4 are just a sequence of jumps which lead (in this particular case) to ml_wt_dispatch::load() and ml_wt_dispatch::store() correspondingly. First one accepts only the variable address and the second one - the variable address and the stored value. Load() reads a memory region by specified address, but before that it calls ml_wt_dispatch::pre_load() function which verifies that the memory location is not locked or recent and restarts the transaction (these service data is taken from global table indexed by hash function over the address). Store() by-turn calls ml_wt_dispatch::pre_write() which locks the memory location (all service data for the memory location also is taken by the same global table) and updated the release (version) of the memory location before the write (the release version is checked in pre_load() as 'recent').

Friday, March 30, 2012

Dynamic Strong Type System

Dynamic type system means that data types are defined in run-time. This is opposite to static type system, when data types are defined in compile-time. Examples of static type programming languages are C/C++ or Java, dynamic type - Python, Perl and Erlang. Strong type system requires explicit types conversions (like static_cast in C++), from other side weak type system allows implicit type conversions (e.g. "5" + 9 is valid Perl expression).

I argue against dynamic but strong type system. It makes large programs difficult to maintain.Let's write simple function which just sums two integer variables and prints the result in Perl, C++ and Erlang:

    % Erlang
    sum(A, B) ->
        io:format("~p~n", [A + B]).

    // C++
    void sum(int a, int b)
        std:cout << a + b << std::end;

    # Perl
    sub sum
        my ($a, $b) = @_;
        print $a + $b

What happen if we decide to change the function argument types, say from integer to string? It will still work for Perl and broke for C++ on compile time since you can not pass std::string or char * as integer arguments to the function. However, Erlang example will crash in runtime with bad argument error.

Changing type of arguments from integer to string seems very unnatural at first glance. But converting types from atom to list for Erlang is not so unnatural and such adjustments in code requires careful investigation of whole data flow from the level where the types were changed to the lower lever which uses the data. 

This way a programming language is more pleasant to use (at least for me) if it uses dynamic and weak type system or static and strong, but not dynamic and strong.

Saturday, March 24, 2012

Simple Cluster Management Using Erlang Mnesia

    If you need to run some application in cluster to get high availability (through redundancy) or high performance (though scaling out), then you need some cluster management system. Basically such system performs two operations: load balancing between the cluster nodes and nodes failovering. Also usually dynamic cluster reconfiguration is required, i.e. we desire to be able to introduce a new node to the cluster or shutdown a node at any given time.

    The logic for load balancing and failovering itself is not trivial, but there is also one not obvious problem - synchronization between the cluster nodes. The problem is crucial in shared-nothing cluster architecture with shared state, i.e. there is no single physical point of failure in the cluster, but all the nodes operate on a shared data. Lets consider that you have a cluster of working nodes and you have a jobs on all of the nodes, which you need to reassign depending on each node load. Traditionally you'd do this with some director node - this node collects load factors from all the nodes and reschedule the jobs between the nodes. The director itself is a cluster of at least two machines for redundancy. But this works only for large clusters - it has not so much sense to have two directors for small cluster of two or little bit more working nodes which works in active-active scenario. For such small clusters we would wish that all the nodes are able to participate in cluster load balancing and failovering itself.

    Let's consider a quick example. You have cluster of two active working nodes, A and B, and both the nodes are loaded equally. Let J1 and J2 be a jobs (e.g. client connections) on node A and J3 and J4 a jobs on node B respectively. At some point of time clients which makes jobs J3 and J4 reduced their activity, but J1 and J2 increased (i.e. load factors produced by the jobs are increased). You would expect that the nodes will reschedule the jobs/connections between the nodes in manner like J1 and J3 to node A and J2 and J4 to node B. To do this both the nodes have to track current load on itself and other node and reassign the jobs in accordance with the loads ratio. Since both the nodes work independently, then we can not guarantee that the node won't try to reassign the jobs in different ways at the same time (e.g. node A tries to assign J1 and J3 to itself and J2 and J4 to B while B is trying to assign J1 and J4 to itself and J2 and J3 to A). If we have a cluster of bit more more machines (say 3) and the machines can fail at any time then we also have a change that some node crash during the jobs reassignment and we loose some jobs.

    To synchronize such kind of communications between the nodes usually we use algorithms of distributed state machines like Paxos or Virtual Synchrony. The last is greatly implemented in Spread Toolkit, which you can use with C/C++ or Perl bindings to build reliable cluster management system.

    However Erlang provides distributed database, Mnesia, from the box, which is useful in solving such issues. In fact, Mnesia have distributed transactions, but it is not a heavy disk-bases database. Mnesia has special kind of tables, ram_copies, which is actually just a hash tables. So the key point: with Erlang and Mnesia we can atomically and reliably execute distributed operations on hash tables stored in RAM on different cluster nodes.

    Now lets have a look how to solve the jobs scheduling problem with Erlang (I suppose that the reader is familiar with basic Erlang syntax). First of all let's create two table with assigned jobs and current load:

        -record(jobs, {job_id, weight, node_name}).
        -record(cluster_view, {node_name}).

            [{ram_copies, [node()]},
             {attributes, record_info(fields, cluster_view)}

            [{ram_copies, [node()]},
             {attributes, record_info(fields, jobs)},
             {index, [node_name]}

    (I defined secondary index for node_name in jobs table to be able to select all jobs assigned to a node).

    Each node periodically updates its load status in transactional manner:

        mnesia:transaction(fun() ->
            mnesia:write(#cluster_view{node_name = node(), 
                                       load = CurrentLoad})

    where CurrentLoad is the value of current load of the node. I don't do any error handling here for brevity, but it should be done in production code.

    And load balancing can be done in following way (this is not the most effective method, but simple enough):

        mnesia:transaction(fun() ->
            % QLC Query Handler to get sorted list of Jobs by
            % weight field in descending order.
            % Returns list of tuples of form {weight, job_id}
            JLQH = qlc:sort(qlc:q([{J#jobs.weight, J#jobs.job_id}
                                   || J <- mnesia:table(jobs)]),
                            [{order, descending}]).

            % Get two lists of jobs assigned to each node.
            % E.g. if it returns {10,[1],8,[2,4,3]}, then
            % job 1 with weight 10 has to be assigned to
            % node A and jobs 1, 4 and 3 with total weight 8
            % to node B.
            qlc:fold(fun(J, D) ->
                    if element(1, D) =< element(3, D) ->
                        {element(1, D) + element(1, J),
                         lists:append(element(2, D),[element(2, J)]),
                         element(3, D),
        element(4, D)};
        true ->
                        {element(1, D),
                         element(2, D),
                     element(3, D) + element(1, J),
                 lists:append(element(4, D), [element(2, J)])}
                {0, [], 0, []},

    This way only one node at each given time can check current nodes load and redistribute jobs atomically.

    Thus, distributed Mnesia transactions are very useful to build simple cluster management system for small clusters (distributed transactions are very expensive in general, so they are absolutely not suitable for large clusters with intensive management operations), but it has number of drawbacks which make it hard to develop flexible cluster operations. The one of such things is that Mnesia does not have normal mechanism to detach the cluster. So if you need to detach a node from the cluster such that it will keep current cluster data snapshot, but won't replicate to and from other cluster nodes, then you have to dump all tables with dump_tables() to local text file, recreate local database schema and load the tables back. Other thing is that you have no enough flexibility in Mnesia setting to manage database schema in node failure case.

    P.S. Ulf Wiger has given very good presentation about Mnesia ablilities in cluster environment.

    Thursday, March 15, 2012

    GCC Inline Assembly: Easy Way to Build-in Your Shellcode Into Exploit

    In most cases when developing an exploit people independently develop the shell code, compile it and only after that build it into the exploit in hex format as a string. So usually you can find following in an exploit code:

    char shell_code[] =

    int main()
        // Do something with shell_code

    This is easier can be done with GCC inline assembly. You just need to define the shell code as a function and one other, dummy, function just after it:

    __asm__ (
            ".p2align 4, 0x90\n"
            ".local shellcode\n"
            ".type shellcode,@function\n"
                /* The function body */
            ".p2align 4, 0x90\n"
            ".local end_of_shellcode\n"
            ".type end_of_shellcode,@function\n"
            "end_of_shellcode:\n"      /* just empty function */

    The shellcode is defined as shellcode() function. The function end_of_shellcode() is required only to get the shellcode length. You can work with the shellcode exactly the same way as for string. E.g. to get the shellcode address:

    printf("shellcode address is %p\n", &shellcode);

    Or to get length of the shellcode:

    printf("shellcode length is %lu\n",
        (unsigned long)&shellcode - (unsigned long)&end_of_shellcode);

    This way you just need to use inline assembly instead of compile and decode the shellcode separately. This is a common practice and Linux kernel widely uses such mixing of assembly and C function declarations.

    Master Class in Ukraine

    On April 28-29 I'll be giving master class (in Russian) about development of high performance server applications in UNIX/Linux environment. The main topic of the class will be methods and technologies to build HTTP-, Instant Messaging-, database and other servers, which processes hundreds of thousands requests per second and gigabits of incoming traffic on commodity hardware. Mostly I'll be concentrating on following topics:
    • the server architecture (threads, processes, queues and IO);
    • concurrency (synchronization, context switches);
    • fast data structures and properties of classic data structures from performance point of view;
    • data alignment and optimization of CPU caches usage;
    • lock-free data structures, atomic operations and memory barriers;
    • memory allocators (when you need to develop your own allocator, types of memory allocators)
    • internal server caches (data structures, replacement policies);
    • zero-copy for disk and network IO and how to reduce copying in your code;
    • CPU-binding for interrupts and threads/processes;
    • working with huge files (mmap(), O_DIRECT, blocks allocation);
    • how Linus works with files (page cache, disk synchronization, IO scheduling);
    • profiling;
    • and finally, when do you need to move to kernel.

    Tuesday, January 10, 2012

    List Comprehensions in Perl

    Some time ago I found List comprehension article in Wikipedia. There are many examples for languages which natively support List Comprehension (mainly functional languages), but also few examples of similar constructs for languages which have not embedded list comprehension.I wonder why there is no Perl examples.

    Although Perl doesn't have native list comprehension it is very easy to do with map and grep constructions. Let's consider couple of examples.

    At first lets see how to implement example of list comprehension construction for Python from the article (I replaced 101 by 5 for shorter output):

    $ python -c 'print [2 * x for x in range(5) if x ** 2 > 3]'
    [4, 6, 8]

    In Perl it would be (Dumper output is reformatted for brevity):

    $ perl -MData::Dumper -e 'print Dumper([map { $_ * 2 } \
                                            grep { $_ ** 2 > 3 } (0..4)])'
    $VAR1 = [4, 6, 8];

    Other example for list of lists generation

    $ python -c 'print [(i,j) for j in range(2, 5) for i in range(1,j)]'
    [(1, 2), (1, 3), (2, 3), (1, 4), (2, 4), (3, 4)]

    and its Perl analog is (Dumper output is reformatted for brevity)

    $ perl -MData::Dumper -e 'print Dumper([map { $n = $_; \
                                            map {[$_,$n]} (1..$n - 1)} \
    $VAR1 = [
              [1, 2],
              [1, 3],
              [2, 3],
              [1, 4],
              [2, 4],
              [3, 4]

    Thursday, January 5, 2012

    Tail Recursion in Perl and Python (or why do we prefer Perl to Python)

    Few years ago I was making several Python projects. I moved from Perl to Python because of its clearer syntax and it was just interesting for me to know why so much people (like Google and Yandex) moved to Python.

    However I back to Perl after two years with Python. Python is good mature language with great modules and scalable preemptive multi-threading support (unlike Ruby which uses cooperative multi-threading for 1.8 and global interpreter lock (GIL) for current 1.9 version). But programming in Python becomes quite tedious if you wish to do some unusual things.

    Lets consider following simple example of recursive factorial calculation:

    sub tail_factorial
            my ($total, $n) = @_;

            return $total unless $n;
            return tail_factorial($total * $n, $n - 1);

    In Python this code looks like (also quite short and straightforward):

    def tail_factorial(total, n):
            if not n:
                    return total
            return tail_factorial(total * n, n - 1)

    Perl5 does not have tail recursion optimization (unlike Perl6). The same story with Python3.However there is simple solution for Perl (example is borrowed from Pure Perl tail call optimization, changed lines are in italic):

    sub tail_factorial2
            my ($total, $n) = @_;

            return $total unless $n;

            @_ = ($n * $total, $n - 1);
            goto &tail_factorial2;


    However I was wondered that there is no clear solution for Python. There is decorators implementations (see Tail Call Optimization Decorator (Python recipe) or Tail recursion decorator revisited), but all of them uses additional functions and internal loops. Other solutions are Tail Recursion in Python using Pysistence and Tail recursion in Python also uses internal cycles and helping functions.

    Guido has stated about such solutions in his blog as "it's simply unpythonic". This is bit unnatural example and really there is no big deal to rewrite the function to use a loop (we live with the practice in kernel programming for a long time). However actually there are plenty of such limitations with Python (one more good example is that you can not call Python program as a single line, so you can not buili it into shell scripts and use it as extended sed/awk for advanced grep'ing). Thus I prefer to work with powerful and flexible tool rather than with limited and poor. This is why we use Perl in our daily work.