Lock-free ordered list

Here is a method for inserting an item into an ordered list. Is it correct as a lock-free operation?

#include <algorithm>    // std::find_if
#include <list>         // std::list

template< typename _Pred >
inline void lock_free_push( sll_node*tail, sll_node*n, _Pred Pred ){
    std::list< sll_node* > Local;
    sll_node*head,*next;
    // keep iterating while the local list is non-empty
    for( Local.push_back( n ) ; !Local.empty() ; ){
        n       = Local.back();
        head    = tail->next;
        if( ( head == tail ) || Pred( n, head ) ){
            // the stack is empty or n <= head: try to place n on top
            n->next = head;
            if( _InterlockedCompareExchangePtr( &tail->next, n, head ) == head ){
                Local.pop_back();
            }
        }else if( head != tail ){
            // n > head: try to pop the top and keep it in Local
            next    = head->next;
            if( _InterlockedCompareExchangePtr( &tail->next, next, head ) == head ){
                Local.insert(
                    std::find_if(
                        Local.begin(), Local.end(), [ = ]( sll_node*e ){
                            return Pred( head, e );
                        }
                    ), head
                );
            }
        }
    }
}

In a nutshell: the local element n is inserted at the head of the interlocked ring list with the boundary element tail, provided that n <= head. If n > head, an attempt is made to pop the head element; if that succeeds, the popped element is placed into the local list Local. Iterations continue while Local is non-empty. As a result, we should get an ordered lock-free list out of a lock-free stack.

tail - the boundary element of the lock-free stack (it serves as both tail and head of the ring)
sll_node - a singly linked list node
Pred( a, b ) - an arbitrary predicate that returns TRUE if a <= b

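For context, a minimal sketch of the node type, a predicate, and a call site as assumed above (the key field and the initialization of the sentinel ring are my guesses, not part of the original code):

struct sll_node{
    sll_node*   next;   // single link; the sentinel `tail` points to itself when the stack is empty
    int         key;    // payload used only by the sample predicate below
};

// sample predicate: returns true if a <= b
inline bool pred_le( sll_node*a, sll_node*b ){ return a->key <= b->key; }

// hypothetical usage:
//   sll_node tail;  tail.next = &tail;          // empty lock-free stack
//   sll_node item;  item.key = 42;
//   lock_free_push( &tail, &item, pred_le );    // item ends up in its sorted position
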
You can do without the STL; the container is used only to simplify the implementation. Does anyone see any errors?

UPD1: removed the conversions so as not to break the lines.
UPD2: the same implementation, but without the STL:

template< typename _Pred >
inline void lock_free_push( sll_node*tail, sll_node*n, _Pred Pred ){
    sll_node**p,*head,*next, local;
    // insert node `n` into `local` and keep iterating
    // while `local` is non-empty
    for( n->next = &local, local.next = n ; local.next != &local ; ){
        // take the maximum element (the first one) from `local`
        n   = local.next;
        // remember the current top of the stack in `head`
        head    = tail->next;
        if( ( head == tail ) || Pred( n, head ) ){
            // the stack is empty or `n <= head` - remember the local tail that follows `n`
            next    = n->next;
            // try to atomically place `n` on top of the stack,
            // provided the top has not changed
            n->next = head;
            if( _InterlockedCompareExchangePtr( &tail->next, n, head ) == head ){
                // success - remove `n` from `local`
                local.next  = next;
            }else{
                // failure - restore the local tail
                // and continue the lock-free iterations
                n->next = next;
            }
        }else if( head != tail ){
            // the stack is non-empty and `n > head` - remember the tail after `head` in `next`
            // and try to atomically pop the top of the stack `head`,
            // provided it has not changed
            next    = head->next;
            if( _InterlockedCompareExchangePtr( &tail->next, next, head ) == head ){
                // success - find the position for `head` in `local`:
                // sorted in descending order
                for( p = &local.next ; ( *p != &local ) && Pred( head, *p ) ; p = &(*p)->next );
                // place `head` into `local`
                head->next  = *p;
                *p          = head;
            }
        }
    }
}

In the worst case, a single thread will take all the elements, arrange them, and insert them back.

Comments:

Do I understand correctly that in the process, some of the items from the tail list are moved to local?

If so, then they (the elements) will become invisible to other threads (and not only to the threads that are inserting new elements).

Is this accounted for in the overall data-processing algorithm?

I am trying to use such an object for distributing tasks on SMP, i.e. yes, it is implied that there are other means of telling a thread that other list items exist and it just needs to "wait".

In general, this is done in an elementary way: just keep an interlocked counter of the nodes in the list. If a node is deleted, the counter is decremented. If a node is merely taken (used exclusively inside the body of one thread), the counter keeps accounting for it, so the other threads are not released. Something like that. :)

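A minimal sketch of what I mean by such an interlocked node counter (my own illustration, not part of the code above; it assumes MSVC's _InterlockedIncrement/_InterlockedDecrement on a volatile long):

#include <intrin.h>

// shared, interlocked count of nodes currently belonging to the list
volatile long g_node_count = 0;

inline void node_added()   { _InterlockedIncrement( &g_node_count ); } // a new node was linked in
inline void node_removed() { _InterlockedDecrement( &g_node_count ); } // a node was deleted for good

// a node that is only taken (used exclusively inside one thread) is not
// decremented here, so other threads still see that work exists and simply wait
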
Just imagine assigning the data-processing tasks in such a system to different programmers.

One of them receives data from outside, adds it, and keeps it ordered. He will call this lock_free_push().

Another searches the sorted lists and computes sums (or something else).

Here the customer will be surprised when he sees that the sum grows, then decreases (while, by the physics of the process, it can only grow).

@avp, what I mean here is to work only in the usual lock-free stack paradigm, but with one small addition that may be key: such a stack allows the "minimum" element to be removed from the top. I.e., of course, it can be ordered in either direction, but of all the nodes currently present, from the algorithm's point of view the minimum one will always be at the top of the stack.

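To make the "minimum on top" point concrete, here is a sketch of the corresponding pop (my illustration, reusing the sentinel-ring layout and the _InterlockedCompareExchangePtr wrapper from the code above; it ignores ABA just like the push does):

inline sll_node* lock_free_pop_min( sll_node*tail ){
    sll_node*head,*next;
    for( ; ; ){
        head    = tail->next;
        if( head == tail )
            return 0;           // empty stack: the sentinel points to itself
        next    = head->next;
        // try to atomically unlink the current top (the minimum), if it has not changed
        if( _InterlockedCompareExchangePtr( &tail->next, next, head ) == head )
            return head;
    }
}
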
P.S.: the Hash Code rules about comments are still strange; why can't their number be controlled individually?

Author: Qwertiy, 2012-10-25

2 answers

And what is the fundamental difference between AVL and red-black trees? I think in practice it's all the same.

The result, of course, may be the same, but for me it is easier to explicitly track the difference in the heights of the subtrees. In my opinion, this approach best reveals the essence of the balancing algorithm.

Something else is interesting. When is it possible (necessary) to apply lock-free algorithms to sufficiently complex structures?

As for "sufficiently complex" - I'm not sure.

Apparently, only in cases where simultaneous access is unlikely.

That's exactly what I'm interested in right now.

There are N threads and one parallel task. The task is solved in iterations. Each iteration runs for an approximately fixed time (the duration of an iteration [the discretization of the task] is tuned externally, statically). Each thread performs its own iteration. And since there is initially only one task, it has to be split among the remaining threads. Let each thread that did not get a task take the task with the largest remaining resource and divide it in two. If a thread has finished its iteration, it looks for a new free task, or tries to split the largest one if the number of tasks is less than the number of threads. A task is considered divisible if its size is above a specified limit. Thus, while the task is being divided, the threads will always be busy solving it (a sketch of such a per-thread loop follows below).

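Roughly, the per-thread loop looks like this (the task type, the SPLIT_LIMIT threshold and every task_* function here are placeholders of mine, not a real API):

#include <cstddef>

struct task{ std::size_t size; };

// placeholders standing in for the real lock-free list operations:
task*   task_take();                    // pop a free task, or nullptr if there is none
task*   task_peek_max();                // the largest free task, or nullptr
task*   task_split( task* );            // split a task in two, return the half that was taken
void    task_iterate( task* );          // solve one (roughly fixed-duration) iteration
void    task_release( task* );          // push the task back into the free list

const std::size_t SPLIT_LIMIT = 1024;   // made-up divisibility threshold

void worker_loop(){                     // one such loop per thread, pinned to a core
    for( ; ; ){
        task*t  = task_take();
        if( !t ){
            task*m  = task_peek_max();
            if( m && m->size > SPLIT_LIMIT )
                t   = task_split( m );  // split the largest task and take one half
        }
        if( t ){
            task_iterate( t );
            task_release( t );
        }
    }
}
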
This is the algorithm I'm trying to implement without locks. The list is the list of free tasks - tasks that no thread has taken at the moment. A free task can either be divided or have one of its iterations solved. Naturally, the number of tasks should not exceed the number of threads, though there may be small logical exceptions. Since the number of threads equals the number of cores, the list is small enough not to bother much with sorting, but optimizing the sort may have to be considered if there is a lot of contention between threads.

In theory, I can calculate the time the threads spend competing with one another, since I can "very cheaply" count the number of iterations of the idle loops, and I can roughly estimate how many clock cycles the loops themselves take.

I.e., ultimately, this will be the percentage of time spent on aggregate lock-free synchronization.

What I cannot count is the time spent on synchronization by the standard means (critical sections, mutexes). If you explicitly wrap such locks in clock measurements, the result will be very inaccurate. It seems the most logical way would be to measure the task execution time with both kinds of synchronization and determine whether lock-free wins in this particular case. But if I do take up such experiments, it will not be soon, because I am limited by my free time.

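For example, the "very cheap" accounting I have in mind could look roughly like this (a sketch of mine; __rdtsc() is the MSVC intrinsic, the per-thread structure is hypothetical):

#include <intrin.h>

// per-thread statistics, one instance per worker, so no sharing is needed
struct spin_stats{
    unsigned __int64    idle_iterations;    // how many idle (retry) loop passes ran
    unsigned __int64    idle_cycles;        // roughly how many TSC ticks they took
};

// called after every failed CAS / idle pass; t0 is taken right before the pass
inline void account_idle( spin_stats&s, unsigned __int64 t0 ){
    ++s.idle_iterations;
    s.idle_cycles   += __rdtsc() - t0;
}
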
Update: to be honest, I was never able to stabilize this algorithm on SMP.

But after long and hard research into lock-free, I came to the conclusion that in their pure form (inline) such methods are inoperable. For further experiments they need to be stuffed with barrier synchronization both at the compiler level (_ReadBarrier/_WriteBarrier/_ReadWriteBarrier) and at the hardware level (_mm_lfence/_mm_sfence/_mm_mfence).

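To illustrate what "stuffed with barriers" means, here is one CAS attempt with the fences spelled out (the placement here is only illustrative, not a verified recipe; it reuses the sll_node layout and the _InterlockedCompareExchangePtr wrapper from above):

#include <intrin.h>     // _ReadBarrier / _ReadWriteBarrier
#include <xmmintrin.h>  // _mm_sfence
#include <emmintrin.h>  // _mm_lfence / _mm_mfence

inline bool try_push_once( sll_node*tail, sll_node*n ){
    _ReadBarrier();                 // compiler: actually re-read tail->next here
    sll_node*head   = tail->next;
    _mm_lfence();                   // hardware: complete the load before using head
    n->next = head;
    _ReadWriteBarrier();            // compiler: keep the store to n->next above the CAS
    _mm_sfence();                   // hardware: make n->next visible before publishing n
    return _InterlockedCompareExchangePtr( &tail->next, n, head ) == head;
}
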
I would be very grateful for working examples with such an organization of the task queue, since I still have very specific expectations of lock-free:

I was running a test task of "clipping a set of points against a rectangular window". Each iteration of this task measures the time spent on the solution and accumulates it in a per-core accumulator (a thread pinned to the core). After the calculations are completed, the key parameter is computed - the number of points per second, equal to S / T, where T is the total running time of all iterations on the slowest core (the thread pinned to that core), and S is the total number of processed points. In the lock-free case this metric once accelerated to 300 million points per second on eight 3 GHz cores. I would like to keep that speed, instead of the 20-50 million I get with locks.

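(Purely illustrative arithmetic for the S / T metric: if the slowest core spends T = 10 s in total on its iterations while all cores together process S = 3×10^9 points, the result is S / T = 3×10^8, i.e. 300 million points per second.)
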
Moreover, "accelerated" is exactly the right word, since I ran this task in a loop 100 times, and on each subsequent run the speed metric increased.

On SMP, I was never able to stabilize this algorithm.

About a month ago I finally did get this implementation working. And immediately in 3 versions: on Interlocked Singly Linked Lists, on a dynamic stack (when the number of tasks grows unpredictably), and on a static stack (when the maximum number of tasks is known in advance, or a fixed pool of tasks can be used).

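The first variant is the standard Windows API, along these lines (a sketch; the task_entry layout and names are mine, the alignment requirement on SLIST_ENTRY is the API's):

#include <windows.h>

struct task_entry{
    SLIST_ENTRY link;       // must be MEMORY_ALLOCATION_ALIGNMENT-aligned (16 bytes on x64)
    int         payload;    // whatever the task actually carries
};

SLIST_HEADER    g_free_tasks;

void        init_tasks()             { InitializeSListHead( &g_free_tasks ); }
void        push_task( task_entry*t ){ InterlockedPushEntrySList( &g_free_tasks, &t->link ); }
task_entry* pop_task(){
    SLIST_ENTRY*e   = InterlockedPopEntrySList( &g_free_tasks );
    return e ? CONTAINING_RECORD( e, task_entry, link ) : 0;
}
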
The performance of the API version is lower, because we spend time on the function call and have no way to keep the nodes sorted.

The performance of the static stack is the highest, although my own implementation of the lock-free stack did not manage without SIMD extensions. I will not give exact numbers right now, but my preliminary estimate is that such sorting gives a gain of ~3-6% compared to the basic lock-free algorithm and ~20-30% compared to a "spin" lock; against "kernel" synchronization objects I did not even measure (the comparison is harder there, because a scheduler built on them would differ noticeably from the lock-free one).

I will describe the implementation in more detail in a separate topic, a little later, when there is time for a simple test.

 2
Author: mega, 2013-01-11 07:39:49

@mega, I completely agree about the rules, but it is unlikely that this can be changed. So I ran out of comments and had to start a "reply". It is quite unpleasant that notifications about an update of the question (you have updated it here) and, apparently, about the answer are not sent by mail. I only noticed your question "rise" in the general list.

Now to the point. I'm still not sure I understand your data structures correctly. In particular, I don't understand why you are talking about a stack (in my understanding, a stack is LIFO). I see them like this:

  1. There is a cyclic singly linked list, ordered in ascending order.
  2. When lock_free_push() is called, tail points to an arbitrary element in it.
  3. The purpose of the call is to insert a new element while maintaining the order of the list.
  4. Many threads perform this operation simultaneously without using "mutexes".
  5. For any other thread (one not calling lock_free_push()), the current composition of the list does not matter, but the list remains ordered.

Right?

If so, it is not clear to me how the correctness of tail is ensured in the threads not involved in the lock_free_push() call, unless it is a "global" pointer for each list. That is, for each list there is exactly one variable that points to an element of this list (the minimum one), and this variable is shared by all threads that use this list. This is assumption 2.

Which one is correct?

Of course, I don't know the overall purpose of your task in the SMP system, but in the Linux kernel a red-black tree is used for structures that need fast access to the minimum element.

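(As an aside, in user-space C++ the same "fast access to the minimum" is what a balanced-tree container already gives, e.g. std::set, which is typically a red-black tree; this is illustrative only, it is not lock-free and would need external locking:)

#include <set>

std::set<int> tasks;                    // typically implemented as a red-black tree

// tasks.insert( 42 );
// int smallest = *tasks.begin();       // begin() always yields the minimum element
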
 0
Author: avp, 2012-10-26 09:21:23