word embedding

In machine learning, word embedding is a natural language processing (NLP) technique that converts words into numerical vectors. The vectors are intended to capture each word’s meaning, usage or context, so that words used in similar ways end up with similar vectors.
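
As a rough illustration of what it means for vectors to represent meaning, the sketch below compares words by cosine similarity. The three-dimensional vectors and the names toy_vectors and cosine_similarity are made up for this example; real embeddings are learned from text and usually have hundreds of dimensions.

```python
import math

# Hypothetical, hand-picked vectors for illustration only.
# They are not taken from any trained model.
toy_vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.7, 0.2],
    "apple": [0.1, 0.2, 0.9],
}


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Words with related meanings are expected to score closer to 1.
sim_royal = cosine_similarity(toy_vectors["king"], toy_vectors["queen"])
sim_fruit = cosine_similarity(toy_vectors["king"], toy_vectors["apple"])
print(sim_royal > sim_fruit)
```

With these toy values, "king" is closer to "queen" than to "apple", which is the kind of relationship a trained embedding aims to encode.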

The following methods are commonly used to represent words as vectors:

  • One-hot encoding assigns a unique binary vector to each word in a vocabulary. The vector is as long as the vocabulary and has exactly one element with a value of 1 (the hot bit); all other elements are 0. Because every pair of distinct words is equally far apart, one-hot encoding does not capture any semantic or syntactic information about words.
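  • Word2vec is a prediction-based method which learns a dense and continuous vector representation for each word based on its context in a large corpus of text. It has two architectures: continuous bag-of-words (CBOW) and skip-gram.
  • Skip-gram is one of the two Word2vec architectures. It learns vector representations by predicting the context words that surround a given word; CBOW does the reverse and predicts a word from its surrounding context.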
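  • TF-IDF (term frequency-inverse document frequency) is a frequency-based (count-based) word weighting method. A word scores highly when it occurs often in a document but appears in few other documents in the corpus. It produces sparse weights rather than learned dense vectors.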
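  • GloVe stands for Global Vectors for Word Representation. It is trained on word-to-word co-occurrence counts gathered across the whole corpus, combining the advantages of count-based methods (which use corpus-wide statistics) and predictive methods (such as Word2vec) to create dense word vectors.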
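
Three of the ideas above can be sketched in a few lines of plain Python: a one-hot vector, a TF-IDF score, and the (target, context) pairs that a skip-gram model trains on. This is a minimal sketch, not a reference implementation. The tiny corpus and the names docs, one_hot, tf_idf and skipgram_pairs are assumptions made for illustration, and the TF-IDF formula shown is one common unsmoothed variant; libraries often use different smoothing.

```python
import math
from collections import Counter

# Hypothetical toy corpus: each document is a list of tokens.
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran"],
]

# Vocabulary in sorted order, so each word has a fixed position.
vocab = sorted({word for doc in docs for word in doc})
index = {word: i for i, word in enumerate(vocab)}


def one_hot(word):
    """Binary vector with a single 1 at the word's vocabulary position."""
    vec = [0] * len(vocab)
    vec[index[word]] = 1
    return vec


def tf_idf(word, doc):
    """Unsmoothed TF-IDF; assumes the word occurs somewhere in docs."""
    tf = Counter(doc)[word] / len(doc)
    df = sum(1 for d in docs if word in d)
    idf = math.log(len(docs) / df)
    return tf * idf


def skipgram_pairs(tokens, window=1):
    """(target, context) pairs within `window` positions of each target."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs


print(one_hot("cat"))
print(tf_idf("the", docs[0]), tf_idf("dog", docs[1]))
print(skipgram_pairs(docs[0]))
```

In this corpus "the" appears in every document, so its TF-IDF score is zero, while "dog" appears in only one document and scores highest there. The pairs returned by skipgram_pairs are only the training examples; learning the actual vectors from them requires a model such as Word2vec.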
