What does the random_state parameter mean in sklearn.manifold.TSNE and other scikit-learn classes?

I tried three different values of random_state, namely None, 0, and 1.

I don't understand the essence of this parameter. I have read the documentation and an answer on this site, but I still don't understand it.

Author: MaxU, 2018-07-21

2 answers

The purpose of the random_state parameter (in all functions and methods of scikit-learn) is reproducibility of random values. That is, if you explicitly set random_state to a value other than None, the generated pseudo-random values will be the same every time you run the code.

Example:

In [1]: import numpy as np

In [2]: np.random.seed(31415)

In [3]: np.random.randint(10, size=(5,5))
Out[3]:
array([[7, 3, 5, 8, 2],
       [6, 6, 3, 5, 6],
       [0, 0, 8, 3, 6],
       [1, 6, 8, 5, 1],
       [4, 6, 9, 2, 7]])

In [4]: np.random.seed(31415)

In [5]: np.random.randint(10, size=(5,5))
Out[5]:
array([[7, 3, 5, 8, 2],
       [6, 6, 3, 5, 6],
       [0, 0, 8, 3, 6],
       [1, 6, 8, 5, 1],
       [4, 6, 9, 2, 7]])

In [6]: np.random.seed(31415)

In [7]: np.random.randint(10, size=(5,5))
Out[7]:
array([[7, 3, 5, 8, 2],
       [6, 6, 3, 5, 6],
       [0, 0, 8, 3, 6],
       [1, 6, 8, 5, 1],
       [4, 6, 9, 2, 7]])

P.S. If you run this code on your own computer, you will get exactly the same values in the matrices.

Why is this necessary?

In machine-learning tasks (and not only there), a pseudo-random number generator is often used to initialize various parameters and weights in neural networks and to randomly split a data set into training and test sets.

Accordingly, if we want to compare several methods or different sets of parameters, then for a fair comparison we need to use the same training and test sets.
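
For example, a minimal sketch (the toy data here is made up just for illustration; train_test_split is scikit-learn's standard splitting helper) showing that a fixed random_state yields an identical split on every call:

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # toy feature matrix
y = np.arange(10)                 # toy labels

# Same random_state -> the same "random" split on every call,
# so competing models are evaluated on identical test sets.
X_tr1, X_te1, y_tr1, y_te1 = train_test_split(X, y, test_size=0.3, random_state=42)
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X, y, test_size=0.3, random_state=42)

print((X_te1 == X_te2).all())  # True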

It can also be useful to create data sets in a random but reproducible way. For example, if you have built several different computing systems and want to compare them or check their correctness, you need to feed them the same input data.
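
A sketch of that idea, assuming the synthetic data is generated with scikit-learn's make_classification (any data generator that accepts random_state would work the same way):

from sklearn.datasets import make_classification

# Same random_state -> the same synthetic data set, so every
# system under test can be fed identical input.
X1, y1 = make_classification(n_samples=100, n_features=5, random_state=0)
X2, y2 = make_classification(n_samples=100, n_features=5, random_state=0)

print((X1 == X2).all(), (y1 == y2).all())  # True True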


UPD: if you set the same random_state, the result of t-SNE on the same input data will also be the same:

In [120]: from sklearn.manifold import TSNE

In [121]: a = np.random.rand(1000, 50)

In [122]: res1 = TSNE(n_components=2, random_state=123).fit_transform(a)

In [123]: res2 = TSNE(n_components=2, random_state=123).fit_transform(a)

In [124]: res1.sum()
Out[124]: -205.98636

In [125]: res2.sum()
Out[125]: -205.98636

In [126]: res1 == res2
Out[126]:
array([[ True,  True],
       [ True,  True],
       [ True,  True],
       ...,
       [ True,  True],
       [ True,  True],
       [ True,  True]])

In [127]: (res1 == res2).all()
Out[127]: True
Author: MaxU, 2020-06-19 17:02:55

Everything in MaxU's answer is correct, but the root cause here is that t-SNE is by its nature a stochastic algorithm; its very name stands for "t-distributed Stochastic Neighbor Embedding". Of course, when visualizing its results with a different random_state (or with random_state not explicitly set), you are unlikely to notice the difference by eye. But if you compare the numerical results of t-SNE, they will differ from run to run; only in a corner case of very little, well-separated data might the results happen to coincide. In general, if you want to get the same t-SNE result on the same data, fix random_state.

It should also be added that, of course, not every algorithm in scikit-learn and other machine-learning libraries has an element of randomness. But those that do usually let you fix that randomness so that the result is reproducible.
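
A small sketch of the behavior described above (the input data is random purely for illustration; the exact numbers will vary, but the True/False pattern should hold):

import numpy as np
from sklearn.manifold import TSNE

X = np.random.rand(200, 10)

# random_state left at its default of None -> a fresh seed each run,
# so two embeddings of the same data generally do not coincide
emb1 = TSNE(n_components=2).fit_transform(X)
emb2 = TSNE(n_components=2).fit_transform(X)
print(np.allclose(emb1, emb2))  # almost certainly False

# with a fixed random_state the runs are reproducible
emb3 = TSNE(n_components=2, random_state=42).fit_transform(X)
emb4 = TSNE(n_components=2, random_state=42).fit_transform(X)
print(np.allclose(emb3, emb4))  # True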

Author: CrazyElf, 2020-06-19 07:35:04