What do the numbers in the "embedding" word vector mean?

Let's say there is an "embedding" vector for the word Watermelon = [-0.0415, -0.0079, -0.0261, ... 0.1022]. What do the numbers in this vector mean, and how are they obtained? Do they have something to do with the ratio of the number of times the word "watermelon" occurs in the i-th text to the total number of words in that text, or what is it? I've already looked through a lot of pages in Russian and English. I saw the examples with King-Queen, etc. I understand what the ready-made vectors mean and how the comparison is done. However, I did not find this information anywhere. I'm interested in exactly how these numbers are obtained!

Author: MaxU, 2018-08-30

1 answer

A word embedding vector consists of numbers describing the strength of features, computed automatically from the linguistic context. A list of the strongest features might look like:

  • gender
  • age category
  • fruit/not fruit
  • liquid/non-liquid
  • etc.

This list is selected from all the computed attributes. When the "word embedding matrix" is computed, the number of strongest features to include in the resulting matrix is chosen in advance; for example, GloVe: Global Vectors for Word Representation is computed for 50d, 100d, 200d and 300d features. The features in the "word embedding matrix" have no names in the human sense, but they can be interpreted by inspecting a ready-made / computed matrix.
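The idea of a word embedding matrix (rows = words, columns = unnamed learned features) can be sketched as follows. The vocabulary and the vector values below are made up for illustration; real matrices such as GloVe's are learned from large corpora and have 50-300 columns:

```python
import numpy as np

# Toy vocabulary mapping each word to its row in the matrix.
vocab = {"king": 0, "queen": 1, "watermelon": 2}

# Toy embedding matrix: one row per word, one column per learned
# (unnamed) feature. Real models use 50d/100d/200d/300d columns.
embedding_matrix = np.array([
    [0.80, 0.10, 0.05, 0.60],   # king
    [0.75, 0.90, 0.04, 0.58],   # queen
    [0.02, 0.05, 0.95, 0.10],   # watermelon
])

def embed(word):
    """Return the embedding vector (one row of the matrix) for a word."""
    return embedding_matrix[vocab[word]]

print(embed("watermelon"))  # the row of numbers the question asks about
```

A lookup in a trained model (e.g. via gensim's `KeyedVectors`) works the same way: the word indexes a row of a precomputed matrix.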

For example, if you take the index of the column where the values for the words King and Queen are both at their maximum, that column will correspond to a feature describing something like belonging to royalty.
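This column-inspection idea can be sketched with hypothetical vectors (the values are invented for illustration, not taken from any real model):

```python
import numpy as np

# Hypothetical 4d vectors mimicking the King/Queen example.
king  = np.array([0.80, 0.10, 0.05, 0.60])
queen = np.array([0.75, 0.90, 0.04, 0.58])

# Find the dimension where BOTH words score highest: take the
# elementwise minimum (how strong the weaker of the two is in each
# column), then the index of its maximum.
shared_strength = np.minimum(king, queen)
royal_dim = int(np.argmax(shared_strength))
print(royal_dim)  # -> 0: both "royal" words are strongest in column 0
```

With these made-up numbers, column 0 would be the candidate "royalty" feature, in the sense the answer describes.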

P.S. "An Intuitive Understanding of Word Embeddings: From Count Vectors to Word2Vec" is a good article in English explaining how "Word Embeddings" are calculated.

Author: MaxU, 2018-08-30 13:21:57