High Performance Linux

Saturday, May 26, 2012

Software Transactional Memory (STM) in GCC-4.7

GCC-4.7 introduces an amazing new feature - Software Transactional Memory (STM). It is still experimental and not yet optimized, but we can already have a look at how STM works. Currently GCC implements pure software TM, i.e. without hardware support. Intel has announced hardware support for TM (HTM) in the Haswell microarchitecture as Transactional Synchronization Extensions (TSX), so next year we'll probably have hybrid TM - software transactional memory with hardware optimizations.

Firstly, to understand what STM is, let's consider the following simple program:

    #include <iostream> 
    #include <thread>

    static const auto THR_NUM = 4;
    static const auto ITER_NUM = 1000 * 1000;

    static auto a = 0, b = 0, c = 0;

    static void
    thr_func()
    {
            for (auto i = 0; i < ITER_NUM; ++i) {
                    ++a;
                    b += 2;
                    c = a + b;
            }
    }

    int
    main(int argc, char *argv[])
    {
            std::thread thr[THR_NUM];

            for (auto &t : thr)
                    t = std::thread(thr_func);

            for (auto &t : thr)
                    t.join();

            std::cout << "a=" << a << " b=" << b
                    << " c=" << c << std::endl;

            return 0;
    }

Now try to compile the program (don't forget -std=c++11, since C++11 is still not the default mode for g++, and -pthread for std::thread) and run it. You'll probably see that a, b and c contain ugly values which change from run to run, e.g.:

        $ ./a.out
        a=2139058 b=4316262 c=6455320
        $ ./a.out
        a=2077152 b=4463948 c=6541100

The result is expected: 4 threads concurrently update all three variables, and each update is an RMW (Read-Modify-Write) operation. Now let's place the operations on all three variables into one transaction (yes, this is very much like a database transaction), so that all the variables are read and written atomically:

        static void
        thr_func()
        {
                for (auto i = 0; i < ITER_NUM; ++i)
                        __transaction_atomic {
                                ++a;
                                b += 2;
                                c = a + b;
                        }
        }

Let's compile the code with -fgnu-tm to enable STM in GCC and rerun the program. This time you'll see nice numbers, which stay the same from run to run:

        $ ./a.out
        a=4000000 b=8000000 c=12000000
        $ ./a.out
        a=4000000 b=8000000 c=12000000

This is quite a simple case, and you'd probably prefer a mutex here. But you can refer to Ulrich Drepper's "Parallel Programming with Transactional Memory" for a more complicated example where the mutex alternative is not so obvious. It's easy to see that STM would be quite useful, for example, to implement a highly concurrent self-balancing binary search tree, which may need to lock a number of nodes for a rotation on insertion or deletion (traditionally such data structures are implemented with a per-node mutex and are prone to deadlocks).

You may have noticed that the STM version of the program runs much more slowly. So let's analyze what it spends its time on. For a basic investigation, let's run the program under strace and print system call statistics:

        $ strace -f -c ./a.out
        ........
        % time   seconds  usecs/call   calls  errors syscall
        ------ --------- ----------- ------- ------- ---------
         99.39  0.021295          11    1920     390 futex
        .......

So STM in libitm (GCC implements STM in the libitm library, which you can see in ldd output) uses the futex() system call, like a common mutex. Before going deeper into libitm internals, let's look at the transaction code more carefully and split it into basic read and write operations. We have 3 memory locations - the variables a, b and c - which we perform read and write operations on. The first operation is ++a, which actually reads the value from memory, updates it and writes it back, so we have two operations here - one read and one write. Next, b += 2, is exactly the same: read the value, add 2 and write it back. The last one, c = a + b, is two reads (a and b) and one write (to c). Moreover, all these operations are inside a transaction, so we also have to start and commit the transaction.

To understand what's going on inside thr_func(), let's simplify it as follows:

        static void
        thr_func()
        {
               __transaction_atomic {
                        ++a;
               }
        }

and disassemble it:

        push   %rbp
        mov    %rsp,%rbp
        mov    $0x29,%edi
        mov    $0x0,%eax
        callq  400fd8 <_ITM_beginTransaction@plt>
        mov    $0x6052ec,%edi
        callq  4010b8 <_ITM_RU4@plt>
        add    $0x1,%eax
        mov    %eax,%esi
        mov    $0x6052ec,%edi
        callq  400fe8 <_ITM_WU4@plt>
        callq  400f48 <_ITM_commitTransaction@plt>
        pop    %rbp
        retq

Now we see four calls to _ITM_* functions (as explained in info libitm, GCC follows Intel's Draft Specification of Transactional Language Constructs for C++ (v1.1) in its implementation of transactions, so the _ITM_ prefix is just Intel's naming convention): transaction begin, transaction commit, and a pair of read (RU4) and write (WU4) operations.
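In rough C pseudocode (illustrative only - _ITM_beginTransaction() has setjmp-like semantics and cannot simply be called from uninstrumented code), the disassembly above corresponds to:

```cpp
/* Approximate C rendering of the instrumented thr_func() body. */
_ITM_beginTransaction(0x29);        /* 0x29 is the property bitmask in %edi */
uint32_t tmp = _ITM_RU4(&a);        /* transactional 4-byte read of 'a'     */
_ITM_WU4(&a, tmp + 1);              /* transactional 4-byte write of 'a'    */
_ITM_commitTransaction();           /* commit, or restart on conflict       */
```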

_ITM_beginTransaction() saves the machine state (for x86 see libitm/config/x86/sjlj.S) and calls GTM::gtm_thread::begin_transaction() (see libitm/beginend.cc), which initializes the transaction data, checks transaction nesting and performs other preparation steps.

_ITM_commitTransaction() is defined in libitm/beginend.cc. It tries to commit the transaction by calling GTM::gtm_thread::trycommit() and, if that fails, restarts the transaction. GTM::gtm_thread::trycommit() is the place where all the threads sleep in futex() (which we saw in the strace output) to write all the modified data. So this is the heaviest part of a transaction.


The most interesting stuff is in the read and write operations. 0x6052ec is the address of variable a. _ITM_RU4 and _ITM_WU4 are just sequences of jumps which lead (in this particular case) to ml_wt_dispatch::load() and ml_wt_dispatch::store() correspondingly. The first accepts only the variable's address, the second the variable's address and the stored value. load() reads the memory region at the specified address, but before that it calls ml_wt_dispatch::pre_load(), which verifies that the memory location is not locked or too recent, and restarts the transaction otherwise (this service data is taken from a global table indexed by a hash function over the address). store() in turn calls ml_wt_dispatch::pre_write(), which locks the memory location (all service data for the memory location is taken from the same global table) and updates the version of the memory location before the write (this version is what pre_load() checks as 'recent').

11 comments:

  1. Hi

    A very interesting post.
    I want to tell about one situation.

    While debugging my high-load kernel driver, I declared a debug counter which should count general stats on queues. This counter should equal one known value. I was in a state of shock when the value after a test was less than expected. I spent a day finding the bottleneck (the profiler said OK, CPU load was low). Then I remembered that I use multithreading. Threads, due to the high rate, were updating this one value with no lock. :-) After declaring separate counters for every thread, everything went fine. As I remember, in earlier GCC versions, in this case, I should have gotten a segfault? If that's true, then my question: "Is it good that the kernel would control the read/write lock?" From my point of view, if I made a stupid mistake, it's better to receive a segfault than some ugly values.

    Regards,
    Igor

    ReplyDelete
  2. Hi Igor,

    it seems you didn't use atomic_inc() (or atomic_add()) to update the counter in your driver. atomic_inc() in the kernel, or __sync_fetch_and_add() in userspace (the latter is a GCC intrinsic, described in the GCC documentation), allows multiple processors to update a variable in an atomic, synchronous manner. However, keep in mind that atomic operations are not free - they are implemented through the CPU cache coherence protocol, so a simple variable increment (like i++) is faster than an atomic increment, and you should use atomic operations carefully in heavily loaded code.

    So this is all about CPU operations - just read, modify and write the value back (the so-called RMW operation), atomically or not. In any case, if you access a valid memory location, it is a valid operation from the CPU's point of view. A segmentation fault, on the other hand, is the result of an exception raised by the CPU (due to an access to an invalid address) and handled by the OS kernel. Thus the answer to your question is: no, you should not see a segmentation fault.

    ReplyDelete
  3. About the segfault I was mistaken. I wanted to ask: why not use a mutex? Now reading http://queue.acm.org/detail.cfm?id=1454464

    Regards,
    Igor

    ReplyDelete
  4. On natsys-lab.com the link "new technology of data structures construction based on x86-64 virtual memory paging" is broken. Is there any way to read this paper?

    Regards,
    Igor

    ReplyDelete
  5. A mutex is a relatively heavy synchronization primitive. Basically, mutexes are used for long operations and/or operations which could sleep.

    Currently there is no link to the technology of data structures construction based on x86-64 virtual memory paging on our web site (or did I miss something?). I presented it at HighLoad++ 2009 and I can send the presentation by request.

    ReplyDelete
  6. Hi Alexander

    Broken link screenshot
    Could you send me the presentation.

    igor.korynkevych at gmail.com

    Regards,
    Igor

    ReplyDelete
  7. Hi Igor,

    I sent you the presentation.

    Many thanks for pointing out the broken link!

    ReplyDelete
  8. Missing "-pthread" in the compiler options. My GCC version is 4.8.3.

    $ g++ -std=c++11 a.cpp
    $ ./a.out
    terminate called after throwing an instance of 'std::system_error'
    what(): Enable multithreading to use std::thread: Operation not permitted
    Aborted (core dumped)

    $ g++ -pthread -std=c++11 a.cpp
    $ ./a.out
    a=1635863 b=3348722 c=4984585

    ReplyDelete
  9. Hi Alexander, thanks for this good work. Does only gcc-4.7 have '__transaction_atomic'? I use gcc-4.8.5; when I compile the program it tells me 'cannot find -litm'.

    ReplyDelete
    Replies
    1. Hi Casa,

      this link may help you: https://flanaras.wordpress.com/2016/10/06/transactional-memory-on-fedora-24/ - it seems you just need to install the library on your system.

      Delete
    2. Thanks Krizhanovsky, it works. And '__transaction_atomic' is slower than a spinlock; it depends on the size of the memory the two threads access.

      Delete