Weighted random draw of strings from an array

My question is this: I have an array with every city in the country, roughly 5000 cities, and a function that draws one of them at random and prints the result on the screen.

I would like large cities to come up more often because they are more populous. That is, I want to give each city a numerical weight based on its share of the population so that capitals, for example, are drawn more frequently.

I can assign the weights to the cities manually. What I need to know is how to build the logic of this weighted draw.

I could manually repeat São Paulo in the array, for example, working out how many times it must appear so that Serra da Saudade, which has only 822 inhabitants, comes up with a frequency of something like 0.0000...1%. But that would be crazy: it only makes sense for very small arrays, and my minimum unit would still be 1%.

Author: Leo, 2014-11-07

5 answers

Create a table from the number of inhabitants. For example, if São Paulo has 64,000 inhabitants, São Paulo owns the numbers 1 to 64,000. By the same logic, Rio would own, say, the range 64,001 to 128,000, and your 5000th city would own, say, 8,900,001 to 9,000,000.

So, instead of drawing a city, you draw a number from 1 to 9,000,000.

If you don't need that much precision, divide every range by the smallest range and keep only the integer part.

Example: the city with the fewest inhabitants has 100 residents.

It would then own a range of size 1.

São Paulo, with 64,000, would own a range of size 640, and so on.

Your program gets lighter and gives almost the same result.
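A minimal JavaScript sketch of the range table described above. The names `buildRanges` and `drawCity`, the `unit` parameter, and the sample populations are assumptions made for illustration, not part of the answer. Passing `unit = 1` keeps one number per inhabitant, while passing the smallest population gives the lighter table.

```javascript
// Hypothetical sample data; the populations are illustrative only.
const cities = [
  { name: "A", population: 100 },
  { name: "B", population: 6400 },
  { name: "C", population: 64000 },
];

// Build the range table: each city owns the numbers from
// (previous upper bound + 1) up to its own upper bound.
function buildRanges(list, unit) {
  let upper = 0;
  return list.map(function (city) {
    // Keep only the integer part of population / unit.
    upper += Math.floor(city.population / unit);
    return { name: city.name, upper: upper };
  });
}

// Draw a number from 1 to the last upper bound and find whose range holds it.
// `random` is optional so the function can be tested deterministically.
function drawCity(ranges, random) {
  const rnd = random || Math.random;
  const total = ranges[ranges.length - 1].upper;
  const ticket = Math.floor(rnd() * total) + 1;
  for (const range of ranges) {
    if (ticket <= range.upper) return range.name;
  }
  return ranges[ranges.length - 1].name; // not reached when rnd() < 1
}

const exact = buildRanges(cities, 1);    // one number per inhabitant
const coarse = buildRanges(cities, 100); // divided by the smallest population
console.log(drawCity(exact), drawCity(coarse));
```

The linear scan is fine for a few thousand cities. Because the upper bounds are sorted, a binary search could replace it if the draw runs very often.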

 13
Author: Joannis, 2014-11-07 11:13:42

Edit

I had not seen @Joannis's answer when I wrote this; it gives the same idea :(

I will keep mine only for the example, and I even upvoted that answer!


One idea is to use the concept of raffle tickets.

  1. Get the smallest population value.

  2. To keep things proportional, each city gets a number of entries equal to the integer part of its population divided by the smallest population.

For example:

City      Population  Entries  Draw numbers
City A         5,000        1  1
City B        18,000        3  2, 3, 4
City C       153,245       30  5, ..., 34
City D     2,162,301      432  35, ..., 466

  3. Draw a number between 1 and the highest number (466). Each city then has a chance weighted by its population: here the first city has a chance of 1/466 and the largest 432/466, which respects proportionality.

For example, the entries could be placed in an array with repetition:

var cidades = [
  { nome: 'cidadeA',
    populacao: 1000 },
  { nome: 'cidadeB',
    populacao: 3000 },
  { nome: 'cidadeC',
    populacao: 6000 }
];

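// Raffle array: each city name is repeated once per entry.
var arraySorteio = [];

// Smallest population, computed from the data instead of hard-coded.
var menorPopulacao = Math.min.apply(null, cidades.map(function (cidade) {
  return cidade.populacao;
}));

cidades.forEach(function (cidade) {
  // Number of entries = integer part of population / smallest population.
  var repeticoes = Math.floor(cidade.populacao / menorPopulacao);
  for (var i = 0; i < repeticoes; i++) {
    arraySorteio.push(cidade.nome);
  }
});

alert(JSON.stringify(arraySorteio));

// And now the draw: pick a random position in the raffle array.
var posicaoSorteada = Math.floor(Math.random() * arraySorteio.length);
alert(posicaoSorteada + ' - ' + arraySorteio[posicaoSorteada]);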
 7
Author: Caputo, 2014-11-07 13:21:36

I would make the choice in two phases:

1. Draw a value between 1 and the number of inhabitants of the most populous city.

2. Draw the city from among those whose number of inhabitants is equal to or greater than the value drawn in step 1.

The first draw favors the cities with the most inhabitants: the more inhabitants a city has, the more likely it is to advance to the second draw.

Although all cities are on an equal footing in the second draw, it was the number of inhabitants that determined their presence in it.
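The two phases can be sketched in JavaScript as follows. The function name `drawTwoPhase` and the sample data are assumptions made for illustration, not part of the answer.

```javascript
// Hypothetical sample data.
const sample = [
  { name: "cidadeA", population: 1000 },
  { name: "cidadeB", population: 3000 },
  { name: "cidadeC", population: 6000 },
];

// `random` is optional so the function can be tested deterministically.
function drawTwoPhase(cities, random) {
  const rnd = random || Math.random;
  const maxPopulation = Math.max(...cities.map((c) => c.population));

  // Phase 1: draw a threshold between 1 and the largest population.
  const threshold = Math.floor(rnd() * maxPopulation) + 1;

  // Phase 2: uniform draw among the cities that reach the threshold.
  // The most populous city always qualifies, so the list is never empty.
  const finalists = cities.filter((c) => c.population >= threshold);
  return finalists[Math.floor(rnd() * finalists.length)].name;
}

console.log(drawTwoPhase(sample));
```

This scheme favors large cities, but the odds are not strictly proportional to population. With the sample above, the three cities are drawn with probabilities 1/18, 4/18 and 13/18, whereas a strictly proportional draw would give 1/10, 3/10 and 6/10.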

 6
Author: ramaral, 2014-11-07 12:10:51

Compiling the answers so far, I propose the following:

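#include <algorithm>
#include <cstdlib>   // std::rand, std::srand
#include <ctime>     // std::time
#include <iostream>
#include <string>
#include <utility>   // std::pair
#include <vector>

// Each entry holds (cumulative population up to and including this city, name).
std::vector<std::pair<int,std::string>> cities;

void add_city(const std::string &name, int pop) {
    if (cities.empty()) {
        cities.emplace_back(pop, name);
    }
    else {
        cities.emplace_back(pop + cities.back().first , name);
    }
}

int total_population() {
    return cities.empty() ? 0 : cities.back().first;
}

// Requires at least one city with a positive population.
std::string select_city() {
    const int total = total_population();
    // Ticket in 1..total, so a city with population p owns exactly p tickets.
    // Note: RAND_MAX may be as small as 32767, too small for real populations.
    const int pos = std::rand() % total + 1;
    // Binary search for the first cumulative total >= pos.
    const auto iter = std::lower_bound(begin(cities), end(cities), std::make_pair(pos, std::string()));
    return iter->second;
}

int main() {
    // Seed the generator so each run produces a different sequence.
    std::srand(static_cast<unsigned>(std::time(nullptr)));

    add_city("A",   50);
    add_city("B",  500);
    add_city("C",  1000);

    for (unsigned i=0; i<15; ++i) {
        const std::string city = select_city();
        std::cout << city << ' ';
    }

    return 0;
}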

The idea is to keep a list in which each city is stored together with a cumulative population: its own population plus that of all previously added cities. To draw a city, pick a random number within the total population of all cities, then use binary search to find where that number falls in the list.

Result of the test, showing the proportion of cities selected:

C A B C B B B C C C C C B B C 
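Fifteen draws are too few to judge the proportions. The JavaScript sketch below tallies many draws over the same three test cities (A = 50, B = 500, C = 1000). The names `lowerBound` and `tally` are hypothetical, and the tickets run from 1 to the total so that each city owns exactly as many tickets as it has inhabitants.

```javascript
// Cumulative totals for the test cities A = 50, B = 500, C = 1000.
const names = ["A", "B", "C"];
const cumulative = [50, 550, 1550];

// Binary search: index of the first cumulative total that is >= ticket.
function lowerBound(sorted, ticket) {
  let lo = 0;
  let hi = sorted.length - 1;
  while (lo < hi) {
    const mid = (lo + hi) >> 1;
    if (sorted[mid] < ticket) lo = mid + 1;
    else hi = mid;
  }
  return lo;
}

// Count how often each city wins over `draws` selections.
// `random` is optional so the function can be tested deterministically.
function tally(draws, random) {
  const rnd = random || Math.random;
  const total = cumulative[cumulative.length - 1];
  const counts = { A: 0, B: 0, C: 0 };
  for (let i = 0; i < draws; i++) {
    const ticket = Math.floor(rnd() * total) + 1; // uniform in 1..total
    counts[names[lowerBound(cumulative, ticket)]]++;
  }
  return counts;
}

console.log(tally(100000));
```

With a uniform generator, the shares should approach 50/1550, 500/1550 and 1000/1550, that is, roughly 3.2%, 32.3% and 64.5%.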
 5
Author: C. E. Gesser, 2014-11-07 14:39:15

I would aim for an exact solution which, depending on the quality of the random number generator, would be ideal for this case.

I would simulate a "roulette wheel", like the ones you spin by hand (at events or on TV shows), where each slot on the circle stands for a single individual of the population, considering the total population of all the cities added together. This total is exactly the number of slots the wheel passes over, one or more times, before stopping on one of them, which indicates which individual from which city was drawn.

[Image: roulette wheel diagram]

The colored bands represent the total population of each city, while the inner divisions within each band represent each of its individuals.

If it were possible to physically build such a wheel with millions or billions of individuals, with exactly equal spacing between slots and "calibrated" so that spinning neither favors nor hinders any individual, the draw would be perfect, since every individual would have an identical chance of being drawn each time the wheel was spun.

How do you build and spin this roulette wheel mathematically?

I built and tested this case in an Excel spreadsheet.

[Image: screenshot of the Excel spreadsheet]

There is a "cities" column (text) and a "population" column (value, with the total population of each city).

Cell C5 holds the total sum of the populations. The "% of the total" column calculates each city's share of this total (city population / total population).

The "accumulated" column adds the city's own population to the accumulated value of the previous row. At the end, it reaches the same total population shown in cell C5.

The field "spin roulette until ... times" must receive an integer: the maximum number of full turns the wheel can make. For example, starting from individual 1 of city A, a turn is counted each time that same individual passes the marker again.

The "Result" field tells you exactly how far the wheel turned, that is, how many slots (individuals) passed the marker before it stopped. It is calculated like this:

=INT(ALEATÓRIO()*$C$5*($F$3-1))+$C$5

One can debate the quality of Excel's "random" function (ALEATÓRIO() is RAND() in Portuguese-language Excel; I address this subject later in this answer), but it is what we have at hand for now, and it serves the purpose well as applied here, as can be confirmed below.

Note that the formula adds the total population at the end; this ensures that at least one full turn of the wheel is made.

For that reason, 1 is subtracted from the number of turns, since the total is added once at the end.

The "stopped at" field indicates where the marker stopped, and its formula is:

=(F6/$C$5-INT(F6/$C$5))*$C$5

The result is divided by the total population and the integer part of that division is subtracted, leaving only the fractional part (the decimal places), which indicates how far the wheel went after the last complete turn. For example, if the division gives 15.25, removing the integer part leaves 0.25, which means that after fifteen full turns the wheel passed over a further 25% of the individuals. Multiplying this percentage by the total population gives exactly which individual, from which city, the wheel stopped on.
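The two spreadsheet formulas can be sketched in JavaScript as follows. The names `spinRoulette` and `cityAt` and the sample data are assumptions made for illustration, and the city lookup uses half-open ranges, which is a choice of this sketch rather than a copy of the spreadsheet's comparison.

```javascript
// Hypothetical data: city names and their accumulated populations.
const cityNames = ["A", "B", "C"];
const accumulated = [50, 550, 1550]; // the last value plays the role of cell C5

// `maxSpins` plays the role of the "spin roulette until ... times" field
// and should be at least 2, otherwise the result is always exactly one turn.
function spinRoulette(total, maxSpins, random) {
  const rnd = random || Math.random;
  // Mirrors =INT(ALEATÓRIO()*$C$5*($F$3-1))+$C$5
  const result = Math.floor(rnd() * total * (maxSpins - 1)) + total;
  // Mirrors =(F6/$C$5-INT(F6/$C$5))*$C$5, the remainder of result / total.
  const stoppedAt = result % total;
  return { result: result, stoppedAt: stoppedAt };
}

// Find the city whose accumulated range contains the stopping position.
// Positions run from 0 to total - 1 (assumption of this sketch).
function cityAt(stoppedAt) {
  for (let i = 0; i < accumulated.length; i++) {
    if (stoppedAt < accumulated[i]) return cityNames[i];
  }
  return cityNames[cityNames.length - 1];
}

const spin = spinRoulette(accumulated[accumulated.length - 1], 10);
console.log(spin.result, spin.stoppedAt, cityAt(spin.stoppedAt));
```

Using the integer remainder avoids the small rounding errors that the divide, subtract and multiply sequence can introduce with floating-point numbers.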

The "roulette stop" column points to the city of this individual, which is the city drawn. The formula is:

=SE(OU(E($F$8>E17;$F$8<=E18);E(F14=0;E17=""));"<=== " & " Cdade sorteada: "& B18;"")

If the value falls within the "accumulated" population range of a city, the message appears, pointing to this city and showing its name.

The red background comes from conditional formatting: if the cell is not empty, its background turns red and its text white.

If the value is outside a city's range, the formula returns "" (an empty string), which leaves the cell empty with its original white background.

Pressing F9 recalculates with new random numbers. The example below shows what happened right after pressing F9 (following the previous example). The same city may occasionally come up twice in a row, but the draw will rarely land on the same individual (it would have to produce exactly the same value as the previous draw).

[Image: screenshot of the spreadsheet after pressing F9]

How do you know whether all this works as expected?

You can go step by step, or hold down F9 so that one draw follows another, and you will see that the draws concentrate on the cities with the largest populations. The more draws you make, the clearer this becomes.

What's the problem with using random numbers to select values in a data range?

The "random" numbers used in programming languages and other platforms (such as the one I described) are in fact "pseudorandom": they try to approximate a truly unbiased draw but do not fully achieve it. See this page from Universidade Federal Fluminense, which explains the problem very well, and follow the links it indicates:

HOW DO COMPUTERS GENERATE RANDOM NUMBERS?

Because these numbers are generated by mathematical equations starting from an initial number (commonly called the "seed"), the same seed always produces exactly the same sequence of numbers with the same values.

To get around this, there are commands that set a new value for the seed, which selects another sequence of results. Those values, however, still show the same behavior.
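The seed behavior can be seen with a small generator written by hand. The sketch below uses a Lehmer (Park-Miller) generator purely as an illustration; it is not the algorithm behind `Math.random` or Excel's random function, and `makeGenerator` is a hypothetical name.

```javascript
// Minimal Lehmer (Park-Miller) generator, for illustration only.
function makeGenerator(seed) {
  // Keep the state inside 1..2147483646.
  let state = seed % 2147483647;
  if (state <= 0) state += 2147483646;
  return function next() {
    state = (state * 48271) % 2147483647;
    return (state - 1) / 2147483646; // value in [0, 1)
  };
}

const a = makeGenerator(2014);
const b = makeGenerator(2014); // same seed as `a`
const c = makeGenerator(2016); // different seed

console.log(a() === b()); // same seed: the sequences match step by step
console.log(a() === c()); // different seed: a different sequence
```

JavaScript's built-in `Math.random` does not accept a seed, which is why a hand-written generator is used here.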

The problem this causes is a tendency for results to concentrate at certain points, with few or no results at others (especially when only a few numbers are generated).

So it is to be expected that the results are not perfectly fair. A city may even fail to be drawn after many attempts, even if it is not the city with the smallest population. This depends on the seed and on the number of draws: with a significantly high number of draws the city may come up, but there will likely still be some disproportion among the other cities.

Why does the proposed "roulette" significantly reduce this problem?

Because each draw is not made rigidly within a single pass over the data range, but through the "spins" of the wheel, the range is traversed completely several times, which significantly reduces the effect of any such tendency. For the tendency to show up here, a gigantic coincidence would be needed: regardless of how many turns the wheel made, the numbers drawn would have to point to exactly the same individual. Given the large number of items (the population of the country) treated individually, this would be extremely rare.

 2
Author: Leo, 2016-07-09 16:32:16