COUNT (*) x COUNT (1) x COUNT (id)

I would like to better understand the difference between these ways of using the counter:

select COUNT(*) from tabela

select COUNT(1) from tabela

select COUNT(id) from tabela

This question deals with the performance between two of the forms, but I would like to better detail what each differs, as well as add a third way to the question.

  • is there a (real) difference in performance?
  • can there be a difference in the results?
  • is it possible to map the best use for each of the shapes?
Author: rLinhares, 2018-09-17

1 answers

select COUNT(*) from tabela

Takes all the columns of the table during the query, then all the data will be available in memory for the SGDB to do the counting operation and stuff. If there is no optimization. But many DBs optimize this somehow for a simple expression so, some can do this even in constant time O(1). Otherwise it will be O (n).

select COUNT(1) from tabela

Takes a constant, that is, a value that is already in memory, so in thesis it is for being absurdly faster because you don't even have to load anything from the database, but it depends on having some optimization. It always depends on the implementation of the database. It may be the (1), but ma most will be the(n) same. The difference is that the load on each of the elements will potentially be less.

select COUNT(id) from tabela

Here reads only one column, it is usually faster than the first one (some cases of tables with few and short columns can be equal), and will be equal to the previous one by having that read all the lines, even if it does not bring anything other than the indicator to the counter, but again, unless it has some optimization, in this case this may be much slower than the previous one, but it may not be. It can be O (n) or o(1).

I put on GitHub for future reference.

More details

Some databases have an optimization where the total count without filter (without WHERE) has already stored automatically and is guaranteed which is always up to date, is in this case where the complexity will be the(1). The most scalable DBs do not usually have this for technical difficulties. I can cite the example of MySQL's MyISAM which always knows the COUNT() without filter or other criteria being used. But in most cases it will be O(n). If you have WHERE or other information in SELECT or make a JOIN or have other characteristics that can affect the count it will certainly be O(n).

If it is O (n) the performance difference will be small, especially in tables with very short lines. Because he will have to count himself, he will have to load all the lines to count. In very large lines there may be a difference when you use COUNT(*), but only in these cases. And then the other 2 examples should be the same, because reading the line just to count without having to use any information and reading a simple id gives the same.

To understand this all depends on understanding how the internal storage of the engine responsible for this in your SGDB and also how the file system works in general. And understand that each page that the software does hit the disk (or SSD or other form of storage) is absurdly slower (enough to be 3 orders of magnitude) that take a die that is already in memory, no matter the size of the die, although it changes a bit the proportion in some cases. And then understand when you need to go to secondary storage or not. So caching can make an absurd difference, in some extreme cases can range from less than 1 millisecond when everything is cached, to more than 1 second when nothing is cached.

Is there a (real) difference in performance?

A little implementation depends on whether it is a performance problem or not, because it is possible to optimize knowing that in practice, this example, makes no difference, after all the load of the data has zero function in this specific query. Stopping charging even saying to do it doesn't change anything. There just testing in each case to know if it does or not. And the test may change on each run. Contrary to what people imagine testing database performance is very complicated. People expect linear responses, but in something that has so many optimizations this doesn't happen.

Can there be a difference in the results?

Should not in this case in most DBs, in other cases can.

You can map the best use for each the shapes?

I believe so, whatever that means. But the basic thing is to test each one where it will use and see which one is faster (doing in various scenarios, considering the cache). In other examples it may be more a matter of doing what you expect or not, but you can only analyze on a case-by-case basis.

 7
Author: Maniero, 2020-08-06 16:56:28