Select from a list of 400 random items

There is a list:

List<string> list = new List<string>();

Its count is 10,000. You need to select 400 random (random) items from this list and put them in a new list:

List<string> listNew = new List<string>();

If you use r. Next(0, list.Count-1), that is, the probability that random can select the same element twice. How to avoid this and choose 400 random unique elements?

Author: default locale, 2018-06-14

2 answers

At first glance, this is a classic problem for the Reservoir Sampling algorithm. The task is solved in one consecutive pass through the list, and in order to achieve an equally probable selection, you do not even need to know in advance the number of elements in the list. At the same time, you do not need to resort to the terrible vicious practice of "trial and error" (checking "already taken - not yet taken"), which is often suggested to be used when compiling naive algorithms for solving this problem.

The algorithm such is:

  1. We start an array of 400 elements and fill it with the first 400 elements of the list.
  2. We read further elements of the list in the loop (from the 401st and onwards):

    2.1. Take the i-th element from the list and with probability 400/i decide to keep this element in our sample. If the decision is made to "not save", then just skip it.

    2.2. If you decide to save an element, save it to a random position in our array (by throwing it out previous element).

After going through the entire list to the end in this cycle, we will get 400 equally likely selected list items.

As you can see, for an effective implementation, it is desirable to store the selected 400 elements in an array, rather than in a list, during the operation of the algorithm. Here you can already think for yourself what is best to do: start an array initially, and then copy the selected elements to the list, or immediately store them in the list, sacrificing the efficiency of direct access when replacing with step 2.2.


However, if your source list provides efficient direct access (see @Andrey NOP's comment), then the "trial and error" option will be quite viable and even more efficient, since 400 is much less than 10000.

 4
Author: AnT, 2018-06-14 07:43:29

If the performance is not critical, shuffle the entire list

A suboptimal and crude, but fairly straightforward approach: shuffle the entire list in random order and select the first 400:

var random = new Random();
var listNew = list.OrderBy(s=> random.Next()).Take(400).ToList();

Suboptimal because it will sort the entire list.
Rough because collisions of random numbers will lead to the fact that the hit of the first elements in order will be some fraction higher.

Campaign for a small sample: selecting unique ones indexes

Let's write a method that generates an infinite sequence of random indexes:

private static Random random;

public static IEnumerable<int> RandomNumbers(int maxValue)
    {
        while (true)
        {
            yield return random.Next(maxValue);
        }
    } 

Using the method, we get 400 unique indexes and select the elements:

var uniqueIndices = RandomNumbers(list.Count).Distinct().Take(400);
var listNew = uniqueIndices.Select(i => list[i]).ToList();

The disadvantage of this approach is that it is only suitable for selecting a small (compared to the size of the original sample) number of elements. If you need to select 9999 elements, this method will be suboptimal due to frequent random number matches.

 3
Author: default locale, 2018-06-14 07:10:26