External and internal memory sorting algorithms

I was researching about the difference between external and internal memory sorting algorithms and found the following answer in quora :

" in cases where we have to sort more data than can fit the main memory, we need an external algorithm classification. Instead of doing all the sorting in memory, we sort chunks of the data in memory at a time, dump the results for a file, and so on until we have a file fully ordered."

Until ai gave to understand the concept , I am having difficulty to understand which sorting algorithm that is used in each type, there are several types of sorting but how will I know if this sorting algorithm is about memory Primary or secondary ?

Author: Maniero, 2016-11-18

2 answers

Can actually be said to be the same for both external and internal use. It is likely that some algorithm will be slightly better in one case than other algorithms. I will not analyze one by one to come to that conclusion.

The choice of the most suitable algorithm depends on a number of factors. It is often said that the Quicksort is the most suitable for the general case. Which is not to say that he is the best always. It was created for use in RAM. But nothing prevents it from being used in secondary storage.

When going to secondary storage it is extremely important to access the disk as few times as possible. So in thesis I would need to find an algorithm that minimizes this, even if it has other higher costs.

But this is naivety. In practice every classification algorithm that needs a lot of performance will combine different techniques to achieve the goal. It is obvious that one of these techniques is to put the as much data as possible in memory and sort what gives there and go organizing all the portions. If you can put everything in secondary memory to primary, the best algorithm for that case will work well in memory. If you can not put everything, you need to manage the load as intelligently as possible. This has little to do with the algorithm of sort itself.

Note that the amount of memory that computers have today and the advent of SSD has greatly improved the condition of a not so optimized algorithm achieve good results. If you have multiple processors the choice may tend more towards a specific algorithm that makes use of parallelism. In fact the right hardware will make a lot of difference, probably more than the most suitable algorithm, as long as it does not use a very bad one for any case.

One can do this transparently. One can use a memory mapped file and do in memory everything you expect to do in file. The operating system manages what should be in memory and what needs to be on disk. This way you can focus on the algorithm and memory management is on the operating system. Reading and recording is something you would have to do. Of course it will be very efficient if it fits everything in memory or the data allows few page changes. Of course this facility has a cost, with the right algorithm can better control how you want to do the pagination, or can do much worse than the operating system would do . MMF is more used in computing than you might think. All your executables are loaded into memory like this. Every virtual memory works like this. In fact any memory allocation, one way or another ends up calling an MMF.

If really the hardware is very restricted I remember that the MergeSort is usually good in cases like this. I've been researching and it seems that the BucketSort is even better, and has some variants that they can help each case better. There's a study that seems to confirm this .

 10
Author: Maniero, 2020-03-23 18:52:16

The points exposed by @ Maniero are correct. An image might help visualize how such a process would work.

Any Sort algorithm can be used in both primary and secondary memory if you segment your data. In the following example we have a computer that can load and handle 6 main memory addresses, and a secondary memory structure that supports 12 addresses.

  • MP - memory main
  • MS - secondary memory
  • RO - sorting result

insert the description of the image here

In the first cycle, 6 addresses (1 - 6) are loaded to the main memory, sorted and returned to the secondary memory. (In the image above the addresses that have been altered are marked with a light blue background.)

In the second cycle, the 3 final addresses of the previous operation (4-6) plus 3 (7-9) undergo a same loading, sorting and storage operation.

This happens successively until all addresses in the secondary memory are loaded, sorted and written back. After all the secondary memory is sorted, the process evaluates whether the cycle should be run again; if there has been at least one change, then the process will be repeated.

Secondary memory can be considered ordered if no sorting operation in primary memory results in address changes during a full cycle:

insert the description of the image here

 8
Author: OnoSendai, 2019-09-05 17:32:08