High Performance Linux

Saturday, May 26, 2012

Software Transactional Memory (STM) in GCC-4.7

GCC-4.7 introduces an amazing new feature - Software Transactional Memory (STM). It is still experimental and not yet optimized, but we can already have a look at how STM works. Currently GCC implements pure software TM, i.e. without hardware support. Intel has announced hardware support for TM (HTM) in the Haswell microarchitecture as Transactional Synchronization Extensions (TSX), so next year we'll probably have hybrid TM - software transactional memory with hardware optimizations.

Firstly, to understand what STM is, let's consider the following simple program:

    #include <iostream> 
    #include <thread>

    static const auto THR_NUM = 4;
    static const auto ITER_NUM = 1000 * 1000;

    static auto a = 0, b = 0, c = 0;

    static void
    thr_func()
    {
            for (auto i = 0; i < ITER_NUM; ++i) {
                    ++a;
                    b += 2;
                    c = a + b;
            }
    }

    int
    main(int argc, char *argv[])
    {
            std::thread thr[THR_NUM];

            for (auto &t : thr)
                    t = std::thread(thr_func);

            for (auto &t : thr)
                    t.join();

            std::cout << "a=" << a << " b=" << b
                    << " c=" << c << std::endl;

            return 0;
    }

Now try to compile the program (don't forget -std=c++11, since C++11 is still not the default mode for g++, and -pthread for std::thread) and run it. You'll probably see that a, b and c contain ugly values which change from run to run, e.g.:

        $ ./a.out
        a=2139058 b=4316262 c=6455320
        $ ./a.out
        a=2077152 b=4463948 c=6541100

The result is expected: 4 threads concurrently update all three variables, and each update is an RMW (Read-Modify-Write) operation. Now let's place the operations on all three variables into one transaction (yes, this is very much like a database transaction), so that all the variables are read and written atomically:

        static void
        thr_func()
        {
                for (auto i = 0; i < ITER_NUM; ++i)
                        __transaction_atomic {
                                ++a;
                                b += 2;
                                c = a + b;
                        }
        }

Let's compile the code with -fgnu-tm to enable STM in GCC and rerun the program. This time you'll see nice numbers, which stay the same from run to run:

        $ ./a.out
        a=4000000 b=8000000 c=12000000
        $ ./a.out
        a=4000000 b=8000000 c=12000000

This is quite a simple case, and you'd probably prefer a mutex here. But you can refer to Ulrich Drepper's "Parallel Programming with Transactional Memory" for a more complicated example where the mutex alternative is not so obvious. It's easy to see that STM would be quite useful, for example, to implement a highly concurrent self-balancing binary search tree, which may need to lock a number of nodes for a rotation on insertion or deletion (traditionally such data structures are implemented with a per-node mutex and are prone to deadlocks).

You may have noticed that the STM version of the program runs much more slowly. So let's analyze what it spends its time on. For a basic investigation, let's run the program under strace and print system call statistics:

        $ strace -f -c ./a.out
        ........
        % time   seconds  usecs/call   calls  errors syscall
        ------ --------- ----------- ------- ------- ---------
         99.39  0.021295          11    1920     390 futex
        .......

So STM in libitm (GCC implements STM in the libitm library, which you can see in ldd output) uses the futex() system call, like a common mutex. Before going deeper into libitm internals, let's look at the transaction code more carefully and split it into basic read and write operations. We have 3 memory locations - the variables a, b and c - which we perform read and write operations on. The first operation is ++a, which actually reads the value from memory, updates it and writes it back, so we have two operations here - one read and one write. Next, b += 2, is exactly the same: read the value, add 2 and write it back. The last one, c = a + b, is two reads (a and b) and one write (to c). Moreover, all these operations are inside a transaction, so we also have to start and commit the transaction.

To understand what's going on inside thr_func(), let's simplify it as follows:

        static void
        thr_func()
        {
               __transaction_atomic {
                        ++a;
               }
        }

and disassemble it:

        push   %rbp
        mov    %rsp,%rbp
        mov    $0x29,%edi
        mov    $0x0,%eax
        callq  400fd8 <_ITM_beginTransaction@plt>
        mov    $0x6052ec,%edi
        callq  4010b8 <_ITM_RU4@plt>
        add    $0x1,%eax
        mov    %eax,%esi
        mov    $0x6052ec,%edi
        callq  400fe8 <_ITM_WU4@plt>
        callq  400f48 <_ITM_commitTransaction@plt>
        pop    %rbp
        retq

Now we see four calls to _ITM_* functions (as explained in info libitm, GCC follows Intel's Draft Specification of Transactional Language Constructs for C++ (v1.1) in its implementation of transactions, so the _ITM_ prefix is just Intel's naming convention): transaction begin, transaction commit, and a pair of read (RU4) and write (WU4) operations.
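In rough C pseudocode (illustrative only - _ITM_beginTransaction() has setjmp-like semantics and cannot simply be called from uninstrumented code), the disassembly above corresponds to:

```cpp
/* Approximate C rendering of the instrumented thr_func() body. */
_ITM_beginTransaction(0x29);        /* 0x29 is the property bitmask in %edi */
uint32_t tmp = _ITM_RU4(&a);        /* transactional 4-byte read of 'a'     */
_ITM_WU4(&a, tmp + 1);              /* transactional 4-byte write of 'a'    */
_ITM_commitTransaction();           /* commit, or restart on conflict       */
```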

_ITM_beginTransaction() saves the machine state (for x86 see libitm/config/x86/sjlj.S) and calls GTM::gtm_thread::begin_transaction() (see libitm/beginend.cc), which initializes the transaction data, checks transaction nesting and performs other preparation steps.

_ITM_commitTransaction() is defined in libitm/beginend.cc. It tries to commit the transaction by calling GTM::gtm_thread::trycommit() and, if that fails, restarts the transaction. GTM::gtm_thread::trycommit() is the place where all the threads sleep in futex() (which we saw in the strace output) to write all the modified data. So this is the heaviest part of a transaction.


The most interesting stuff is in the read and write operations. 0x6052ec is the address of variable a. _ITM_RU4 and _ITM_WU4 are just sequences of jumps which lead (in this particular case) to ml_wt_dispatch::load() and ml_wt_dispatch::store() correspondingly. The first accepts only the variable's address, the second the variable's address and the stored value. load() reads the memory region at the specified address, but before that it calls ml_wt_dispatch::pre_load(), which verifies that the memory location is not locked or too recent, and restarts the transaction otherwise (this service data is taken from a global table indexed by a hash function over the address). store() in turn calls ml_wt_dispatch::pre_write(), which locks the memory location (all service data for the memory location is taken from the same global table) and updates the version of the memory location before the write (this version is what pre_load() checks as 'recent').

11 comments:

  1. Hi

    A very interesting post.
    I want to tell about one situation.

    While debugging my high-load kernel driver, I declared a debug counter which should count general stats on queues. This counter should equal one known value. I was in a state of shock when the value after a test was less than expected. I spent a day finding the bottleneck (the profiler said OK, CPU load was low). Then I remembered that I use multithreading. Threads, due to the high rate, were updating this one value with no lock. :-) After declaring separate counters for every thread, everything went fine. As I remember, in earlier GCC versions, in this case, I should have gotten a segfault? If that's true, then my question: "Is it good that the kernel would control the read/write lock?" From my point of view, if I made a stupid mistake, it's better to receive a segfault than some ugly values.

    Regards,
    Igor

    ReplyDelete
  2. Hi Igor,

    it seems you didn't use atomic_inc() (or atomic_add()) to update the counter in your driver. atomic_inc() in the kernel, or __sync_fetch_and_add() in userspace (the latter is a GCC intrinsic, described in the GCC documentation), allows multiple processors to update a variable in an atomic, synchronous manner. However, keep in mind that atomic operations are not free - they are implemented through the CPU cache coherence protocol, so a simple variable increment (like i++) is faster than an atomic increment, and you should use atomic operations carefully in heavily loaded code.

    So this is all about CPU operations - just read, modify and write the value back (the so-called RMW operation), atomically or not. In any case, if you access a valid memory location, it is a valid operation from the CPU's point of view. A segmentation fault, on the other hand, is the result of an exception raised by the CPU (due to an access to an invalid address) and handled by the OS kernel. Thus the answer to your question is: no, you should not see a segmentation fault.

    ReplyDelete
  3. About the segfault I was mistaken. I wanted to ask: why not use a mutex? Now reading http://queue.acm.org/detail.cfm?id=1454464

    Regards,
    Igor

    ReplyDelete
  4. On natsys-lab.com the link "new technology of data structures construction based on x86-64 virtual memory paging" is broken. Is there any way to read this paper?

    Regards,
    Igor

    ReplyDelete
  5. A mutex is a relatively heavy synchronization primitive. Basically, mutexes are used for long operations and/or operations which could sleep.

    Currently there is no link to the technology of data structures construction based on x86-64 virtual memory paging on our web site (or did I miss something?). I presented it at HighLoad++ 2009 and I can send the presentation by request.

    ReplyDelete
  6. Hi Alexander

    Broken link screenshot
    Could you send me the presentation.

    igor.korynkevych at gmail.com

    Regards,
    Igor

    ReplyDelete
  7. Hi Igor,

    I sent you the presentation.

    Many thanks for pointing out the broken link!

    ReplyDelete
  8. Missing "-pthread" in the compiler options. My GCC version is 4.8.3.

    $ g++ -std=c++11 a.cpp
    $ ./a.out
    terminate called after throwing an instance of 'std::system_error'
    what(): Enable multithreading to use std::thread: Operation not permitted
    Aborted (core dumped)

    $ g++ -pthread -std=c++11 a.cpp
    $ ./a.out
    a=1635863 b=3348722 c=4984585

    ReplyDelete
  9. Hi Alexander, thanks for this good work. Does only gcc-4.7 have '__transaction_atomic'? I use gcc-4.8.5; when I compile the program it tells me 'cannot find -litm'.

    ReplyDelete
    Replies
    1. Hi Casa,

      this link may help you: https://flanaras.wordpress.com/2016/10/06/transactional-memory-on-fedora-24/ - it seems you just need to install the library on your system.

      Delete
    2. Thanks Krizhanovsky, it works. And '__transaction_atomic' is slower than a spinlock; it depends on the size of the memory the two threads access.

      Delete