Is it possible to avoid scanning an index in a large table?

I discovered that there is a ColumnStore index type.

I haven't worked with it before and don't know what kind of beast it is.

The official website says that this index can improve performance for analytical queries when working with large amounts of data, which is exactly what I want to achieve.

Scenario: there are two hefty tables that need to be joined to each other in order to, for example, calculate the total amount of money for a certain period.

Actually, the questions are:

  1. On which fields is it rational to put this type of index?
  2. If you put it on the fields involved in the JOIN, does it have to be on both tables, or can index types be mixed?
  3. If you put it on the field that is being summed, will it make any difference?
  4. I read somewhere that tables with this type of index become read-only. Is that true?

Table definitions:

Table 1: ID PK INT, Date DATETIME, bla bla bla

Table 2: ID PK INT, Money FLOAT, bla bla bla

These tables are joined by ID.

Right now, when these tables are joined, an index scan is performed, which takes a long time. Putting an index on the date doesn't make it much faster.

P.S. The query and tables are abstract examples. In reality there are a few more joins and aggregations.

So I'm thinking about how to make this whole thing faster. I could, of course, play with FORCESEEK, but surely it's not for nothing that the optimizer chooses a scan?

Author: iluxa1810, 2017-10-04

2 answers

[...] the ColumnStore index type [...] what kind of beast it is.

First, let's remember what a regular nonclustered index is. What is characteristic of it? Such an index has a key and a pointer to the data (a RID or the clustered index key), which is implicitly included in the index:

INDEX ([KeyColumn1], [KeyColumn2], ...)
    INDEX ([KeyColumn1], [KeyColumn2], ...) INCLUDE (RID)
    INDEX ([KeyColumn1], [KeyColumn2], ...) INCLUDE ([CI_KeyColumn1], [CI_KeyColumn2], ...)

To optimize Index Scan and Index Seek operations, nonclustered indexes can also have included columns:

INDEX ([KeyColumn1], [KeyColumn2], ...) INCLUDE ([SomeColumn1], [SomeColumn2], ...)

So, semantically, columnstore is a degenerate case of an index that has no key columns, and all of whose columns are included columns:

INDEX () INCLUDE ([SomeColumn1], [SomeColumn2], ...)

And if a nonclustered columnstore can still somehow be called an index, since it includes at least a pointer to the data (implicitly, like a regular nonclustered index), then a clustered columnstore can be called an index only with a very big stretch, because it has neither a key nor a pointer; in fact, it is simply storage (like a heap).

Now replace the word INCLUDE with COLUMNSTORE (in reality, of course, it is not a simple replacement; behind the new word there is also a special organization of the index), adapt the syntax, and here it is:

COLUMNSTORE INDEX ([SomeColumn1], [SomeColumn2], ...)

An index designed to optimize scan operations.


A columnstore index (not counting its delta-store part) stores data as follows. Suppose there is a table TABLE (A, B, C) with rows

R1 (A1, B1, C1),
R2 (A2, B2, C2),
R3 (A3, B3, C3),
...

The source data is divided into row groups (rowgroups). Each group includes up to 1,048,576 (2^20) rows. If the index is built under memory-pressure conditions, this number may be smaller. All values of one column within a group are stored together and form a segment.

RowGroup1 (
    A_Segment1 (A1, A2, ... A_n)
    B_Segment1 (B1, B2, ... B_n)
    C_Segment1 (C1, C2, ... C_n)
)
RowGroup2 (
    A_Segment2 (A_n+1, A_n+2, ... A_n+k)
    B_Segment2 (B_n+1, B_n+2, ... B_n+k)
    C_Segment2 (C_n+1, C_n+2, ... C_n+k)
)
...

The row groups and segments of a columnstore index can be read independently of each other. For example, for the query

SELECT COUNT(DISTINCT A)
FROM T;

only segments A_Segment1, A_Segment2, etc. will be read; the data of the columns not needed by the query will not be touched.

In the case of regular rowstore tables (and indexes), if a column is needed for a query, then the data pages containing the table rows are read into the cache completely (with data from all columns). A column can be read relatively independently only if the data is already in the cache.
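The difference between row-wise and column-wise access can be sketched in a few lines of Python (a toy model of the layouts, not how SQL Server actually stores pages):

```python
# Toy model: the same table stored row-wise and column-wise.
rows = [(1, "2010-01-01", 10.0),
        (2, "2010-01-02", 20.0),
        (3, "2010-01-03", 30.0)]

# Rowstore: to get column C you still touch every whole row,
# dragging the values of A and B along with it.
c_from_rowstore = [row[2] for row in rows]

# Columnstore: each column is a separate segment; read only what you need.
columnstore = {
    "A": [r[0] for r in rows],
    "B": [r[1] for r in rows],
    "C": [r[2] for r in rows],
}
c_from_columnstore = columnstore["C"]   # segments A and B stay untouched

assert c_from_rowstore == c_from_columnstore == [10.0, 20.0, 30.0]
```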

However, even when the table has no "extra" columns, the storage overhead of rowstore tables and indexes can be quite high (which then affects reading). This is especially true for tables consisting of a small number of columns of simple types. A typical example is tables that link two entities (a pair of int columns), in which the data itself takes up only about half of the storage space, while the other half is mostly row headers. In columnstore, the per-row overhead of all possible headers is significantly smaller.

The segment headers of a columnstore index store the minimum and maximum values of the column for that segment, so that if there are predicates, as in

SELECT COUNT(DISTINCT A)
FROM T
WHERE A BETWEEN @a1 AND @a2;

in some cases a partial scan can be performed, in which some row groups or segments are simply skipped (so-called rowgroup elimination or segment elimination).
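Segment elimination can be modeled like this (a sketch with hypothetical segment metadata, not SQL Server internals):

```python
# Each segment keeps min/max metadata for its column, as in the
# columnstore segment headers described above.
segments = [
    {"min": 1,   "max": 100, "values": list(range(1, 101))},
    {"min": 101, "max": 200, "values": list(range(101, 201))},
    {"min": 201, "max": 300, "values": list(range(201, 301))},
]

def count_between(lo, hi):
    """COUNT of values in [lo, hi], skipping segments whose min/max
    range cannot intersect the predicate (segment elimination)."""
    total, scanned = 0, 0
    for seg in segments:
        if seg["max"] < lo or seg["min"] > hi:
            continue                      # segment skipped, never read
        scanned += 1
        total += sum(1 for v in seg["values"] if lo <= v <= hi)
    return total, scanned

count, scanned = count_between(150, 180)
assert count == 31 and scanned == 1       # only the middle segment is read
```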

In addition, the data in the segments is stored in compressed form. Because the data in a column is all of the same type, good compressibility is achieved. For example, a column of 10 million int values with only a dozen distinct values (some FK column, say) can take up just a few kilobytes in columnstore. Non-repeating data can also compress well: an identity column of 10 million int values takes up a couple of tens of megabytes in columnstore, instead of ~38 MB of raw data.
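The effect of compressing a low-cardinality column can be illustrated with plain zlib in Python (just an analogy; SQL Server uses its own dictionary and run-length encodings):

```python
import struct
import zlib

# A "column" of 1,000,000 int values with only a dozen distinct values,
# like a typical FK column.
values = [i % 12 for i in range(1_000_000)]
raw = b"".join(struct.pack("<i", v) for v in values)   # ~4 MB of raw ints

compressed = zlib.compress(raw, level=9)
# Highly repetitive column data compresses drastically.
assert len(compressed) < len(raw) // 100
```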

All this significantly reduces the cost of scanning the columnstore index compared to the cost of scanning the rowstore tables and indexes.

Optimized storage isn't the only thing that adds performance to columnstore indexes. In addition, when processing data coming from a columnstore index, batch mode processing is used, in which query plan operators process rows not one at a time but in batches, which reduces per-row overhead (and, consequently, CPU costs). Thanks to this, joins and aggregations, for example, can run faster. The list of operators that support batch mode expands with each new version of SQL Server. Batch mode will apparently someday be supported for rowstore indexes as well; in the meantime, certain tricks are used to get batch mode for rowstore.
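The idea behind batch mode can be sketched as follows (a toy illustration of amortizing per-call overhead, not the actual operator model):

```python
def process_row(acc, value):
    # Row mode: one operator call (and all its bookkeeping) per row.
    return acc + value

def process_batch(acc, batch):
    # Batch mode: one call handles a whole batch of rows, so the
    # per-row overhead is amortized across the batch.
    return acc + sum(batch)

data = list(range(100_000))

row_total = 0
for v in data:                          # 100,000 operator calls
    row_total = process_row(row_total, v)

batch_total = 0
BATCH = 900                             # SQL Server batches hold up to ~900 rows
for i in range(0, len(data), BATCH):    # ~112 operator calls
    batch_total = process_batch(batch_total, data[i:i + BATCH])

assert row_total == batch_total == sum(data)
```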

In recent versions of SQL Server, columnstore indexes (including nonclustered ones) are updatable and do not block data modifications in the table; however, columnstore indexes are designed to work most efficiently when data is only added to them, not deleted or changed. They are therefore better suited to OLAP scenarios.


Scenario: there are two hefty tables that need to be joined to each other in order to, say, calculate the total amount of money for a certain period. [...] Table definitions [...] These tables are joined by ID.

Let's simulate the situation. So, our tables are:

CREATE TABLE T1
(
    [ID] int NOT NULL CONSTRAINT PK_T1 PRIMARY KEY,
    [Date] datetime NOT NULL,
    [BlaBlaBla] binary(200) NULL
);

CREATE TABLE T2
(
    [ID] int NOT NULL CONSTRAINT PK_T2 PRIMARY KEY,
    [Money] float NOT NULL,
    [BlaBlaBla] binary(200) NULL
);

ALTER TABLE T2
ADD CONSTRAINT FK_T2_T1 FOREIGN KEY ([ID]) REFERENCES T1 ([ID]);

Test data (10 million rows in each table):

WITH tally AS (
    SELECT N = row_number() over (order by @@spid)
    FROM sys.all_columns a, sys.all_columns b
)
INSERT INTO T1 WITH (TABLOCK) ([ID], [Date], [BlaBlaBla])
SELECT TOP (10000000)
    N, dateadd(minute, N, '2010-01-01'), 0x00
FROM tally;

WITH tally AS (
    SELECT N = row_number() over (order by @@spid)
    FROM sys.all_columns a, sys.all_columns b
)
INSERT INTO T2 WITH (TABLOCK) ([ID], [Money], [BlaBlaBla])
SELECT TOP (10000000)
    N, rand(checksum(newid())), 0x00
FROM tally;

Our sample query will be:

DECLARE @start datetime, @end datetime;
SET @start = ...;
SET @end = ...;

SELECT sum(T2.[Money])
FROM T1
    JOIN T2 ON T2.[ID] = T1.[ID]
WHERE T1.[Date] >= @start AND T1.[Date] < @end;

For such a query I will take the regular indexes

CREATE INDEX IX_T1_Date ON T1 ([Date]) INCLUDE ([ID]);
CREATE UNIQUE INDEX IX_T2 ON T2 ([ID]) INCLUDE ([Money]);

And columnstore indexes

CREATE COLUMNSTORE INDEX CX_T1 ON T1 ([ID], [Date]);
CREATE COLUMNSTORE INDEX CX_T2 ON T2 ([ID], [Money]);

Strictly speaking, the [ID] column does not need to be included in IX_T1_Date, CX_T1, and CX_T2, because PK_T1 and PK_T2 are clustered.

On T1, however, I will build the indexes in this order

CREATE COLUMNSTORE INDEX CX_T1 ON T1 ([Date]) WITH (MAXDOP = 1);
CREATE INDEX IX_T1_Date ON T1 ([Date]);

so that the data for CX_T1 is taken not from IX_T1_Date but from PK_T1 (an ordered index scan is used when scanning it). In this case, the [Date] column segments in CX_T1 will be filled with dates in ascending order (because in the source data the dates grow as [ID] grows), and segment elimination can be used more effectively.

First, let's estimate the worst case: we calculate the total sum, specifying a deliberately oversized date interval

DECLARE @start datetime, @end datetime;
SET @start = '2000-01-01';
SET @end = '2050-01-01';

The use of a particular index type is pinned with hints. Enable statistics output:

SET STATISTICS IO, TIME ON;

The sum using regular indexes:

SELECT sum(T2.[Money])
FROM T1 WITH (INDEX(IX_T1_Date))
    JOIN T2 WITH (INDEX(IX_T2)) ON T2.[ID] = T1.[ID]
WHERE T1.[Date] >= @start AND T1.[Date] < @end;

Costs:

Table 'T2'. Scan count 1, logical reads 22310, ...
Table 'T1'. Scan count 1, logical reads 22332, ...

CPU time = 6469 ms, elapsed time = 6516 ms.

Using columnstore indexes:

SELECT sum(T2.[Money])
FROM T1 WITH (INDEX(CX_T1))
    JOIN T2 WITH (INDEX(CX_T2)) ON T2.[ID] = T1.[ID]
WHERE T1.[Date] >= '2000-01-01' AND T1.[Date] < '2050-01-01';

Costs:

Table 'T2'. Scan count 2, ..., lob logical reads 26732, ...
Table 'T2'. Segment reads 10, segment skipped 0.
Table 'T1'. Scan count 2, ..., lob logical reads 26732, ...
Table 'T1'. Segment reads 10, segment skipped 0.

CPU time = 875 ms, elapsed time = 869 ms.

Next, the sum for 5 years:

SET @start = '2015-01-01 00:00';
SET @end = '2020-01-01 00:00';

Costs with regular indexes:

Table 'T2'. Scan count 1, logical reads 22310, ...
Table 'T1'. Scan count 1, logical reads 5875, ...

CPU time = 2672 ms, elapsed time = 2673 ms.

Costs with columnstore indexes:

Table 'T2'. Scan count 2, ..., lob logical reads 11208, ...
Table 'T2'. Segment reads 4, segment skipped 6.
Table 'T1'. Scan count 2, ..., lob logical reads 11208, ...
Table 'T1'. Segment reads 4, segment skipped 6.

CPU time = 282 ms,  elapsed time = 282 ms.

("segment skipped" in the output indicates the use of the segment elimination optimization.)

The sum for one year:

SET @start = '2017-01-01 00:00';
SET @end = '2018-01-01 00:00';

Costs with regular indexes:

Table 'T2'. Scan count 1, logical reads 22310, ...
Table 'T1'. Scan count 1, logical reads 1178, ...

CPU time = 1344 ms, elapsed time = 1351 ms.

Costs with columnstore indexes:

Table 'T2'. Scan count 2, ..., lob logical reads 5604, ...
Table 'T2'. Segment reads 2, segment skipped 8.
Table 'T1'. Scan count 2, ..., lob logical reads 5604, ...
Table 'T1'. Segment reads 2, segment skipped 8.

CPU time = 63 ms, elapsed time = 64 ms.

The sum for one month:

SET @start = '2017-01-01 00:00';
SET @end = '2017-02-01 00:00';

Costs with regular indexes:

Table 'T2'. Scan count 1, logical reads 22310, ...
Table 'T1'. Scan count 1, logical reads 104, ...

CPU time = 1125 ms, elapsed time = 1122 ms.

Costs with columnstore indexes:

Table 'T2'. Scan count 2, ..., lob logical reads 2802, ...
Table 'T2'. Segment reads 1, segment skipped 9.
Table 'T1'. Scan count 2, ..., lob logical reads 2802, ...
Table 'T1'. Segment reads 1, segment skipped 9.

CPU time = 0 ms, elapsed time = 10 ms.

In all cases above, the columnstore index showed noticeably better performance.

However, when querying the sum for a specific date, regular indexes won:

DECLARE @date datetime;
SET @date = '2017-08-01 12:45';

SELECT sum(T2.[Money])
FROM T1
    JOIN T2 ON T2.[ID] = T1.[ID]
WHERE T1.[Date] = @date;

Costs with regular indexes:

Table 'T2'. Scan count 0, logical reads 3, ...
Table 'T1'. Scan count 1, logical reads 3, ...

CPU time = 0 ms, elapsed time = 0 ms.

Costs with columnstore indexes:

Table 'T2'. Scan count 2, ..., lob logical reads 2802, ...
Table 'T2'. Segment reads 1, segment skipped 9.
Table 'T1'. Scan count 2, ..., lob logical reads 2802, ...
Table 'T1'. Segment reads 1, segment skipped 9.

CPU time = 0 ms, elapsed time = 5 ms.

Likewise, when querying the sum for a short period (on this data, a few hours or less), regular indexes performed better than columnstore. However, because of poor statistics on the date, I sometimes had to resort to the FORCESEEK hint on T2.

Score: 3
Author: i-one, 2017-10-20 15:45:38

A columnstore index is an index with a changed data storage mechanism. Roughly speaking, classical indexes store data row by row: the first value of the first row, the second value of the first row, ..., the last value of the first row, then the second row, and so on. A columnstore index stores data column by column: the first value of the first row, the first value of the second row, ..., then the second column, and so on.

This storage layout enables better data compression when the values in a column repeat often, and faster scanning of a table by a few fields: with a regular index you read all of the index's fields, while with a columnstore index you read only the fields you need.

Now the answers:

  1. On all fields that participate in queries requiring a full table scan.
  2. You can mix them; the index type affects how the data is read, not the join type.
  3. It will only affect the speed of reading the data, not the speed of aggregation.
  4. This applies to all columnstore indexes in SQL Server 2012 and to nonclustered columnstore indexes in SQL Server 2014: they are indeed non-updatable. SQL Server 2016 brought many improvements, including the ability to update such indexes.

For more information about working with columnstore indexes in different versions of SQL Server, see: https://docs.microsoft.com/en-us/sql/relational-databases/indexes/columnstore-indexes-what-s-new

And the last answer: SQL Server can indeed choose a Scan instead of a Seek if it thinks it is cheaper. This can happen when too many seeks would be needed and it is cheaper to read the entire table once. It is exactly this case where a columnstore index can help.
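The trade-off the optimizer weighs can be sketched with toy numbers (hypothetical costs, only to show the shape of the decision):

```python
def cheaper_plan(matching_rows, table_pages):
    """Toy cost model: a seek traverses ~3 B-tree pages per row looked up,
    a scan reads every page of the table once. The numbers are illustrative,
    not SQL Server's real costing."""
    seek_cost = matching_rows * 3      # one root-to-leaf traversal per lookup
    scan_cost = table_pages            # sequential read of the whole table
    return "seek" if seek_cost < scan_cost else "scan"

# 50 lookups against a 10,000-page table: seeks win.
assert cheaper_plan(50, 10_000) == "seek"
# 1,000,000 lookups against the same table: one scan is cheaper.
assert cheaper_plan(1_000_000, 10_000) == "scan"
```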

Score: 3
Author: minamoto, 2017-10-05 11:51:01