Why do activation functions in neural networks take on such small values?

It seems to me that if the activation function took values from, say, -10 to 10, that would make the network more flexible. Surely the problem can't just be the lack of a suitable formula. Please explain what I'm missing.

Author: kira, 2020-02-11

1 answer

I'm not a great expert in this, but as far as I can tell, the value ranges can differ. You could define your own sigmoid-like, differentiable function that returns values from -10 to 10.
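For example, a minimal sketch in Python/NumPy (the function name and the exact scaling are mine, purely for illustration):

```python
import numpy as np

# A sigmoid-like, differentiable function whose outputs lie in (-10, 10).
# It is just the logistic sigmoid rescaled, equivalent to 10 * tanh(x / 2).
def wide_sigmoid(x):
    return 20.0 / (1.0 + np.exp(-x)) - 10.0

print(wide_sigmoid(np.array([-50.0, 0.0, 50.0])))  # approximately [-10, 0, 10]
```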

Ranges of values of some functions (a quick numerical check follows after the list):

  • ReLU: [0, ∞)
  • Leaky ReLU: (-∞, ∞)
  • GELU: (-0.17, ∞) - surprisingly
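Here is my own sketch of that check, using NumPy/SciPy (the grid and slope values are assumptions of mine, not anything canonical):

```python
import numpy as np
from scipy.special import erf

x = np.linspace(-20.0, 20.0, 400001)

relu = np.maximum(0.0, x)
leaky_relu = np.where(x > 0, x, 0.01 * x)              # slope 0.01 on the negative side
gelu = x * 0.5 * (1.0 + erf(x / np.sqrt(2.0)))         # exact GELU: x * Phi(x)

print(relu.min())        # 0.0   -> range [0, inf)
print(leaky_relu.min())  # -0.2 on this grid; unbounded below, so (-inf, inf)
print(gelu.min())        # ~ -0.17, the surprising lower bound mentioned above
```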

In the abstraction borrowed from biological neural networks, the activation function represents the firing rate of the action potential in the cell (those are exactly our output values). That is, one signal is usually not enough to activate the neuron. If we use an unbounded function with a positive slope as the activation, it may turn out that we have to "feed" our neuron with signals all the way to infinity (people also say that such a function does not normalize the signal, and such networks have unstable convergence).
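Here is a toy illustration of mine (not from any paper) of what "feeding the neuron to infinity" looks like: pushing a signal through a stack of random layers with an unbounded, positive-slope activation makes it blow up, while a saturating one stays bounded:

```python
import numpy as np

rng = np.random.default_rng(0)

def run(depth, act):
    # Push one random input through `depth` random linear layers + activation
    # and report the average magnitude of the final activations.
    h = rng.normal(size=(1, 100))
    for _ in range(depth):
        w = rng.normal(scale=0.2, size=(100, 100))
        h = act(h @ w)
    return float(np.abs(h).mean())

# Unbounded activation with positive slope (here: identity) vs. a saturating one.
print(run(20, lambda z: z))   # grows by many orders of magnitude
print(run(20, np.tanh))       # stays inside (-1, 1)
```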

These problems are solved by any sigmoid-like function. A sigmoid is a smooth, monotonically increasing, nonlinear function. In a physical sense it behaves like this (a small numeric sketch follows after the list):

  1. At rest, the value sits near 0.
  2. A signal arrives, and the excitation rate rises sharply.
  3. If more such signals keep coming, the excitation rate approaches the asymptote, though no longer at the same speed (or, conversely, the rate first rises sharply).
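A small numeric sketch of those three regimes for the ordinary logistic sigmoid (the sample points are mine, just for illustration):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Near-zero at rest, a sharp rise around zero, saturation towards the asymptote at 1.
for x in (-8.0, -1.0, 0.0, 1.0, 8.0):
    slope = sigmoid(x) * (1.0 - sigmoid(x))   # derivative of the logistic sigmoid
    print(f"x={x:+.1f}  value={sigmoid(x):.4f}  slope={slope:.4f}")
```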

This seems to reflect reality, because in real life it takes some time and some extra resources to excite a neuron, and neurons cannot physically fire faster than a certain rate.

But what is important here is not that; what matters is that it is desirable for our function to have asymptotes, i.e. some limits it converges to.

And so it turns out that, by and large, we do not care what the exact speed limits are, because past some point the rate is already "over the limit" [like exceeding the speed of light, but in the world of neurons ;-)]. And if so, then we are more interested in the differentiability of our function over its domain, i.e. in how the rate changes. In other words, we care about how our function bends, not where it ends up. So values between 0 and 1 are clearly enough to know the slope of the tangent, and a range from -1 to 1 is enough to decide on the correction via backpropagation of the error. In other words, it's all for beauty and simplicity, nothing more.
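To see why the absolute output range doesn't matter, here is a small sketch (the "wide" rescaled sigmoid is my own hypothetical example): rescaling only multiplies the derivative by a constant, which the next layer's weights can absorb:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

x = 0.7

d_small = sigmoid(x) * (1.0 - sigmoid(x))          # slope of sigmoid(x)
d_wide = 20.0 * sigmoid(x) * (1.0 - sigmoid(x))    # slope of 20*sigmoid(x) - 10

# The derivatives differ only by a constant factor, so the next layer can undo
# the rescaling by learning weights 20 times smaller: the shape of the curve
# matters, not the absolute output range.
print(d_wide / d_small)   # 20.0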

My apologies if I didn't describe this clearly enough or, God forbid, got something wrong.

P.S. The function must be differentiable on its domain so that we can actually use the backpropagation of the error.
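For example, a toy single-neuron backpropagation step (all the numbers are mine), which only works because the sigmoid has an explicit derivative:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Differentiability is what lets the chain rule produce an explicit gradient.
w, x, target, lr = 0.5, 2.0, 1.0, 0.1
y = sigmoid(w * x)
loss = 0.5 * (y - target) ** 2
grad_w = (y - target) * y * (1.0 - y) * x   # dL/dy * dy/dz * dz/dw
w -= lr * grad_w                            # one backpropagation update
print(loss, grad_w, w)
```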

Author: contributorpw, 2020-02-14 06:04:26