Optimization of operations for a binary indexed tree (Fenwick tree)

I am optimizing a project that is far from meeting its performance requirements. It makes heavy use of Fenwick trees to quickly obtain prefix sums (a.k.a. partial sums) over an array whose contents change constantly. The structure itself is very useful and helps a lot, but besides the standard operation of incrementing an arbitrary element, two additional non-standard operations are used very often (I have found no analogues in the literature), and they account for a significant part of the computational load.

  1. Getting the values (unloading into an external array) of the prefix sums for the elements of the contiguous range from i to i + k.

  2. Zeroing (or, more generally, filling with a constant) the elements of the contiguous range from i to i + k, with correct recalculation of all affected sums.

Important: For both operations the average value of k is much less than n (the number of elements in the tree).

At the moment, these operations are implemented crudely and simply:

  1. In a loop, the value of each element in the range is found separately (a standard Fenwick-tree query). Total complexity: O(k*log(n)) for k elements.

  2. In a loop, the value of each element is found separately, and then an increment by the negation of its current value is performed (to reset it to zero). The complexity, again, is O(k*log(n)).
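
For concreteness, the naive approach described above can be sketched as follows. This is my own minimal illustration (the names Fenwick, naive_prefixes and naive_zero are not from the project), using a standard 0-indexed tree:

```cpp
#include <cassert>
#include <vector>

// Minimal 0-indexed Fenwick tree, just enough to illustrate the point.
struct Fenwick {
    std::vector<int> in;
    explicit Fenwick(int n) : in(n, 0) {}
    void inc(int i, int delta) {                 // standard point update
        for (; i < (int)in.size(); i = i | (i + 1)) in[i] += delta;
    }
    int sum(int r) const {                       // prefix sum of elements 0..r
        int res = 0;
        for (; r >= 0; r = (r & (r + 1)) - 1) res += in[r];
        return res;
    }
};

// Operation 1, naively: k independent queries, O(k*log n) total.
std::vector<int> naive_prefixes(const Fenwick& t, int i, int k) {
    std::vector<int> out;
    for (int j = i; j < i + k; ++j) out.push_back(t.sum(j));
    return out;
}

// Operation 2, naively: recover each value as a difference of two
// prefix sums, then increment by its negation; O(k*log n) total.
void naive_zero(Fenwick& t, int i, int k) {
    for (int j = i; j < i + k; ++j) {
        int value = t.sum(j) - (j > 0 ? t.sum(j - 1) : 0);
        t.inc(j, -value);
    }
}
```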

I have been thinking about the structure of the tree and these operations, and it seems to me that both can be implemented with lower complexity. After all, log(n) is the depth of a walk from the root to an arbitrary element, whereas these operations traverse a group of consecutively arranged elements. It should be possible to exploit this fact, since in the best case the nearest common node of the tree lies at depth log(k) from any of the leaf nodes of a contiguous range of length k. This means the whole range can be traversed by descending from the root once in log(n), and then moving up and down between neighbouring elements in log(k). Total: the complexity should be O(log(n) + k*log(k)), which is much better than the current version. That is the best case, but the average estimate should be close to it, since the worst case O(k*log(n)) is unlikely when k ≪ n.

Now I am stuck on implementing the algorithms for these two operations. So far I do not quite see how to go from element i to element i+1 without returning to the root. The idea is to climb to the nearest common node and descend to the next element, accumulating along the way the sums needed to obtain the next element's value. And for operation (2), it seems the values of intermediate nodes should be adjusted on the same pass, as when performing a standard increment (if that is even possible to do simultaneously).

These are all general considerations for binary trees. How to implement this specifically for a Fenwick tree, I do not yet see.

I would be grateful for help with specific algorithms in pseudocode or any imperative language.

If my reasoning about complexity is wrong, then I would like to know what exactly I am wrong about.

Author: cridnirk, 2017-06-04

3 answers

Yes, you can try to speed up these operations. To begin with, I will give my implementation:

int next2n(int x) {
    // round x up to the next power of two (32-bit only)
    x = x - 1;
    x = x | (x >> 1);
    x = x | (x >> 2);
    x = x | (x >> 4);
    x = x | (x >> 8);
    x = x | (x >> 16);
    return x+1;
}
class BIT_plus {
    std::vector<int> in;
    std::vector<int> num;
    int n;

public:
    BIT_plus(int n) {
        init(n);
    }

    BIT_plus(const std::vector<int>& list) {
        init(list);
    }

    void init(const std::vector<int>& list) {
        init(list.size());
        for(auto i=0u; i < list.size(); ++i) {
            inc(i, list[i]);
        }    
    }

    // prefix sums of the elements [start, start + len)
    std::vector<int> sum_range(int start, int len) {
        std::vector<int> res;
        int summ = sum(start);      // prefix sum up to element `start`, O(log n)
        for (int i = start; i < start + len; i++) {
            res.push_back(summ);
            if (i + 1 < n)          // guard against reading past the end
                summ += num[i + 1]; // O(1) step to the next prefix sum
        }
        return res;
    }

    void init(int n) {
        in.assign(n, 0);
        num.assign(n, 0);
        this->n = n;
    }

    int size() {
        return n; 
    }

    void inc (int i, int delta) {
        num[i] += delta;

        for (; i < n; i = (i | (i+1)))
            in[i] += delta;
    }

    int sum (int r) const {
        int result = 0;
        for (; r >= 0; r = (r & (r+1)) - 1)
            result += in[r];
        return result;
    }

    int sum (int l, int r) const{
        return sum (r) - sum (l-1);
    }

    void fill_const(int start, int len, int val) {
        // first index of the form 2^m - 1 that lies on every update path
        int n2n = next2n(start + len + 1) - 1;
        int sumdelta = 0;
        for (int i = start; i < start + len; i++) {
            int delta = val - num[i];
            sumdelta += delta;
            // individual part of the update path, up to the common node
            for (int j = i; j < n2n && j < n; j = (j | (j + 1))) {
                in[j] += delta;
            }
            num[i] = val;
        }
        // shared tail of all update paths: apply the total delta once
        for (int i = n2n; i < n; i = (i | (i + 1))) {
            in[i] += sumdelta;
        }
    }

    auto get_num() const {
        return num;
    }
    auto get_in() const {
        return in;
    }
};

It is a little bloated with unnecessary methods, but everything essential is there. (As a basis I took the implementation from here.) Compared to the usual implementation there are two new methods: fill_const (fills a range of the original sequence with a given number) and sum_range (returns the prefix sums of the elements of a range).


The main idea of the implementation is quite simple: in addition to the Fenwick tree array itself (in), we also store the source elements themselves (num).
1. Finding all the prefix sums in the range [i, i+k] then goes like this: first we find the prefix sum of the i-th element, exactly as in a normal implementation, at a cost of O(log(n)). Then, to find sum(i+1), we simply add the number num[i+1]: indeed, if sum(i) is the sum of the first i elements, then sum(i) + num[i+1] is the sum of the first i+1 elements.
2. The second part, filling with a constant, is a little trickier. First let us work out which cells of the array in must change. If we change the element with index i, we must update the cells with indices i, f(i), f(f(i)), ..., where f(i) = i|(i+1) (| is bitwise OR). Note that once this chain reaches a number of the form 2^m-1, every number it visits afterwards is also of that form (ORing such a number with its successor carries into the next bit). Moreover, starting from any index, the chain cannot jump over the next number of the form 2^m-1 (the function is monotonically increasing and cannot set a higher bit before filling in the lower ones). So we use the following method: for each j in [i, i+k), we update the cells according to the standard rules (adding delta to them) only while the cell index is less than the first number of the form 2^m-1 past the end of the range, i.e. next2n(i+k+1)-1. From that index on, the set of cells to update is the same for every j, so we walk that shared tail once, adding to it the total of all the individual changes.
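
The convergence of the update paths used in item 2 can be checked directly: iterating f(j) = j|(j+1) from any j in [i, i+k), all chains pass exactly through next2n(i+k+1)-1 and coincide from there on. A small self-checking sketch (chain_meet is my own helper name; next2n is reproduced from the answer above):

```cpp
#include <cassert>

// next2n from the answer above: round x up to the next power of two.
int next2n(int x) {
    x = x - 1;
    x |= x >> 1;  x |= x >> 2;  x |= x >> 4;
    x |= x >> 8;  x |= x >> 16;
    return x + 1;
}

// First element of the update chain f(j) = j | (j + 1) that is >= limit.
int chain_meet(int j, int limit) {
    while (j < limit) j = j | (j + 1);
    return j;
}
```

For the range [10, 20), for example, next2n(21)-1 = 31, and the chain from every j in the range hits exactly 31; from that point the k update paths are identical, which is why the shared tail only has to be walked once.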

The second algorithm can probably be optimized further. At the moment it seems to run in O(k*log(i) + log(n)), which is asymptotically no faster (i can be of the order of n) but is faster in practice, especially for small i; even in this form it beats an independent update of each element.

 6
Author: retorta, 2017-06-06 21:24:20

The operation of unloading the answers into an external array can be done in O(k + log n) as follows:

  1. We store the source array separately.
  2. In O(log n), compute the prefix sum up to the first element (i).
  3. In O(1), add element i+1 to it, obtaining the prefix sum up to element i+1.
  4. Similarly compute the sums up to every element through i+k, for a total of O(k) operations.
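
A sketch of these four steps, under my own assumptions (a 0-indexed tree; fenwick_sum is the usual descent, num is the separately stored source array, and fenwick_inc only exists to build a tree for the demonstration):

```cpp
#include <cassert>
#include <vector>

// Standard 0-indexed Fenwick prefix sum (elements 0..r), O(log n).
static int fenwick_sum(const std::vector<int>& in, int r) {
    int s = 0;
    for (; r >= 0; r = (r & (r + 1)) - 1) s += in[r];
    return s;
}

// Standard point update, used here only to build a test tree.
static void fenwick_inc(std::vector<int>& in, int i, int delta) {
    for (; i < (int)in.size(); i = i | (i + 1)) in[i] += delta;
}

// Prefix sums for the k elements [i, i+k) in O(log n + k):
// one real tree query, then k-1 additions from the source array.
std::vector<int> unload(const std::vector<int>& in,
                        const std::vector<int>& num, int i, int k) {
    std::vector<int> res;
    int s = fenwick_sum(in, i);          // step 2: O(log n)
    res.push_back(s);
    for (int j = i + 1; j < i + k; ++j)  // steps 3-4: O(1) each
        res.push_back(s += num[j]);
    return res;
}
```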

Modification is harder - the Fenwick tree does not support such an operation easily. However, a segment tree with range modification can help you out here - it is very popular in competitive programming. There is, for example, a write-up on e-maxx, but it will not satisfy any reasonable standards of code readability and maintainability, so you will have to dig in and write your own implementation (perhaps a library one exists, but I have not looked into it).

A segment tree will let you compute the sum and perform a mass assignment on any segment in O(log n) (regardless of k), and the unloading operation can be done either "head-on" in O(k * log n), or in O(k + log n) with roughly the same optimization as for the Fenwick tree - first compute the sum up to element i, then obtain the values of the following elements one by one. This can be done in bulk, in O(k) operations (you end up fully traversing several subtrees of total size O(k)).
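
A compact sketch of such a segment tree with lazy range assignment (my own illustration, not from any particular library; ranges are half-open):

```cpp
#include <cassert>
#include <vector>

// Segment tree with range assignment and range-sum queries, both
// O(log n). Node 1 is the root; ranges are half-open [l, r).
struct SegTree {
    int n;
    std::vector<long long> sum;
    std::vector<int> val;     // pending assignment value per node
    std::vector<char> has;    // pending assignment flag per node
    explicit SegTree(int n_)
        : n(n_), sum(4 * n_, 0), val(4 * n_, 0), has(4 * n_, 0) {}

    void apply(int v, int len, int x) {    // assign x to a node of length len
        sum[v] = (long long)x * len;
        val[v] = x;
        has[v] = 1;
    }
    void push(int v, int lenL, int lenR) { // push a pending assignment down
        if (!has[v]) return;
        apply(2 * v, lenL, val[v]);
        apply(2 * v + 1, lenR, val[v]);
        has[v] = 0;
    }
    void update(int v, int l, int r, int ql, int qr, int x) {
        if (qr <= l || r <= ql) return;
        if (ql <= l && r <= qr) { apply(v, r - l, x); return; }
        int m = (l + r) / 2;
        push(v, m - l, r - m);
        update(2 * v, l, m, ql, qr, x);
        update(2 * v + 1, m, r, ql, qr, x);
        sum[v] = sum[2 * v] + sum[2 * v + 1];
    }
    long long query(int v, int l, int r, int ql, int qr) {
        if (qr <= l || r <= ql) return 0;
        if (ql <= l && r <= qr) return sum[v];
        int m = (l + r) / 2;
        push(v, m - l, r - m);
        return query(2 * v, l, m, ql, qr) + query(2 * v + 1, m, r, ql, qr);
    }
    void assign_range(int l, int r, int x) { update(1, 0, n, l, r, x); }
    long long sum_range(int l, int r) { return query(1, 0, n, l, r); }
};
```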

 6
Author: yeputons, 2017-06-06 21:52:44

Algorithms that use the array of source data (from which the tree was built) will, of course, work faster. So if the project allows you to store a copy of the source data alongside the Fenwick tree, I recommend turning first to the other participants' answers.

I was only interested in tinkering with the Fenwick tree itself. My solutions are therefore not as fast, but they do not require storing a copy of the original data.

Item 1 - getting the values (unloading into an external array) of the prefix sums for elements of the contiguous range from i to (i+k)

Going from the prefix sum of element i to the prefix sum of element (i+1) is actually easy.
I ended up with the following algorithm:
1) collect the partial sums that make up the i-th prefix sum and put them into a stack, in order of their indexes in the tree, from smallest to largest;
as an optimization, you can store on the stack not the partial sums themselves but their running totals;
the top of the stack will then hold the value of the i-th prefix sum (we return it in the answer);
for example, with i = 10 the stack will be:
stack[0] = tree_item[7]
stack[1] = tree_item[7] + tree_item[9]
stack[2] = tree_item[7] + tree_item[9] + tree_item[10] // the top of the stack, equal to the 10th prefix sum
by the way, the maximum size of this stack is quite small, log2(max_size): when the number of source elements is described by a 32-bit number, the stack holds at most 32 elements;
2) look at the binary representation of the number (i+1), writing the bits from the highest to the lowest;
if the binary representation ends in a zero, go to the next step;
if the binary representation ends in a contiguous run of ones, count those ones and remove that many elements from the top of the stack;
for example, with i = 10 we have:
(i+1) = 11 (decimal notation) = 0001011 (binary notation);
the binary representation ends in a run of 2 ones;
therefore we remove 2 elements from the stack, and its contents become:
stack[0] = tree_item[7]
3) push the (i+1)-th tree element onto the stack;
the top of the stack will now hold the value of the (i+1)-th prefix sum (return it);
for example, with i = 10 we have:
stack[0] = tree_item[7]
stack[1] = tree_item[7] + tree_item[11] // the top of the stack, equal to the 11th prefix sum
4) repeat steps 2-3 for the indexes (i+2), (i+3), ... (i+k).

for more clarity, I will write out a few more steps:
(i+2), binary notation 0001100; nothing is removed from the stack;
stack[0] = tree_item[7]
stack[1] = tree_item[7] + tree_item[11]
stack[2] = tree_item[7] + tree_item[11] + tree_item[12] // the 12th prefix sum
(i+3), binary notation 0001101; only 1 element is removed from the stack;
stack[0] = tree_item[7]
stack[1] = tree_item[7] + tree_item[11]
stack[2] = tree_item[7] + tree_item[11] + tree_item[13] // the 13th prefix sum
(i+4), binary notation 0001110; nothing is removed from the stack;
stack[0] = tree_item[7]
stack[1] = tree_item[7] + tree_item[11]
stack[2] = tree_item[7] + tree_item[11] + tree_item[13]
stack[3] = tree_item[7] + tree_item[11] + tree_item[13] + tree_item[14] // the 14th prefix sum
...
all these stack operations are very easy to understand if you look at a picture of the Fenwick tree in 2-dimensional form, something like here

I don't know Big-O notation well enough to determine the complexity of this algorithm precisely. I can only say that it will definitely be much better than O(k*log(i)) when i is large enough.
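
My reading of steps 1-4 above as code (a self-checking sketch; range_prefix_sums and the trailing-ones count are my own naming, and fenwick_inc exists only to build a test tree):

```cpp
#include <cassert>
#include <vector>

// Standard 0-indexed point update, used only to build a test tree.
static void fenwick_inc(std::vector<int>& in, int i, int delta) {
    for (; i < (int)in.size(); i = i | (i + 1)) in[i] += delta;
}

// Prefix sums for [i, i+k) using only the tree array and a stack of
// running totals, as in steps 1-4 above.
std::vector<int> range_prefix_sums(const std::vector<int>& in, int i, int k) {
    // step 1: descent path of i, collected from largest index to smallest
    std::vector<int> path;
    for (int r = i; r >= 0; r = (r & (r + 1)) - 1) path.push_back(r);
    std::vector<int> stk;                      // running totals
    int acc = 0;
    for (auto it = path.rbegin(); it != path.rend(); ++it)
        stk.push_back(acc += in[*it]);         // smallest index first
    std::vector<int> res;
    res.push_back(stk.back());                 // the i-th prefix sum
    for (int j = i + 1; j < i + k; ++j) {
        int t = 0;                             // step 2: trailing ones of j
        while ((j >> t) & 1) ++t;
        while (t--) stk.pop_back();            // pop that many entries
        int base = stk.empty() ? 0 : stk.back();
        stk.push_back(base + in[j]);           // step 3: push the j-th item
        res.push_back(stk.back());             // the j-th prefix sum
    }
    return res;
}
```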

Item 2 - filling the elements of the contiguous range from i to (i+k) with a constant value, with correct recalculation of all affected sums

Here my approach turned out very similar to what is described in retorta's answer, so I will not duplicate it.
I will only mention that here the original values additionally have to be reconstructed:
value[i] = prefix(i) - prefix(i-1)
A trivial implementation requires 2*k calls to the prefix-sum function, and it is easy to optimize down to (k+1) calls.
However, you can also come at it from the other side: the function that computes the i-th value of the original array optimizes well if you rely on the fact that every even-indexed tree element is an original value, every second odd-indexed one is the sum of the i-th and (i-1)-th original values, and so on...
In the general case, counting the number of one-bits at the bottom of the index comes into play again, as in item 1 :)
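
That observation can be turned into a point-value function that reads only tree items: in a 0-indexed tree, in[i] covers the range [i&(i+1), i], so the i-th original value is in[i] minus the tree items covering [i&(i+1), i-1], and the number of subtracted terms equals the number of trailing one-bits of i. A sketch under those assumptions (point_value is my own name; fenwick_inc only builds a test tree):

```cpp
#include <cassert>
#include <vector>

// Standard 0-indexed point update, used only to build a test tree.
static void fenwick_inc(std::vector<int>& in, int i, int delta) {
    for (; i < (int)in.size(); i = i | (i + 1)) in[i] += delta;
}

// Recover the i-th source value from tree items alone.
// in[i] covers [i & (i+1), i]; subtract the items covering the rest.
int point_value(const std::vector<int>& in, int i) {
    int res = in[i];
    int bottom = i & (i + 1);             // lowest index covered by in[i]
    for (int r = i - 1; r >= bottom; r = (r & (r + 1)) - 1)
        res -= in[r];
    return res;
}
```

For even i the loop body never runs (in[i] is the value itself), and the loop length grows with the number of trailing ones of i, matching the remark above.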

Implementation
The text has already grown quite long, so I decided to put the code on github; I hope this is not prohibited :)
I ended up writing it in C++ in MS Visual Studio Community 2015.
The code lacks some checks, but there are simple tests of the code's correctness and performance.
Separately, I want to note that the measurement results differ greatly between the Debug and Release configurations; in the latter the gain is usually smaller.
Link to the code

 3
Author: velial, 2017-06-13 11:15:32