Embeddings

TLDR

In the section on RNNs, we used a "one-hot" representation. Basically, the letter "F" was represented as a 65-dimensional vector like [0, 0, 0, ..., 1, 0, 0], in which every entry is zero except the one at the index corresponding to "F". This is like a dummy variable or indicator variable in the social sciences.
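As a concrete sketch of what that looks like in code (the vocabulary size and the index assigned to "F" are made up here, just to mirror the RNN section):

```python
import torch

vocab_size = 65              # hypothetical character vocabulary, as in the RNN section
char_to_index = {"F": 6}     # imagine a full {character: index} lookup table; index 6 is made up

one_hot = torch.zeros(vocab_size)
one_hot[char_to_index["F"]] = 1.0

print(one_hot)        # tensor([0., 0., 0., 0., 0., 0., 1., 0., ..., 0.]) -- a single 1 at index 6
print(one_hot.shape)  # torch.Size([65])
```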

One-hot encoding/representation is super inefficient! Imagine that instead of characters, we had an RNN that output English words. In that case we’d have something like 30k dimensions. There are a few problems with this. First, our computer is going to run out of RAM fast: every word becomes a huge vector that is almost entirely zeros. Second, it doesn’t make a lot of sense. Consider the words "dog" and "cat". In a one-hot representation, their dot product is zero: they are orthogonal in vector space…"dog" is no more related to "cat" than it is to "semiconductor" or "volcano". But we know that "dog" and "cat" are quite related: one could often substitute "cat" for "dog" in a sentence just fine. Consider the sentence, "I sat on the couch after a tough day and pet my _____". Either "cat" or "dog" suffices here, whereas "semiconductor" and "volcano" do not. As the British linguist J.R. Firth famously said,

"You shall know a word by the company it keeps"

Here, "dog" and "cat" are often with "pet" and "couch"...and "volcano" rarely is. What we would prefer to the one-hot representation is some "dense" representation like [0.2, 0.232, 0.98, ..., 0.84] in which "dog" and "cat" might have a dot product (cosine) that is larger than their dot product with "semiconductor" or "volcano". In the LLM world, such representations are called "embeddings". That’s because we are taking a super high-dimension concept like "dog" and embedding it into a low-dimensional space, typically a few hundred dimensional.

Kyle's example code

Further reading

  1. The Illustrated Word2vec by Jay Alammar. This is a lucid summary of one of the most famous papers in the field of natural language processing: Google’s 2013 word2vec paper. Word2vec is a neural network that has two variants: (1) continuous bag-of-words (CBOW), in which we train the network to predict a target word given its surrounding words, and (2) skip-gram, in which we train the network to predict the surrounding words given a target word. In each case, we can extract from the network dense vector embeddings of words like "dog", "waiter", and "volcano" that have intuitive relationships in vector space...we can, in effect, do math on words. 🤯 (A minimal skip-gram sketch follows below.)
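To make the skip-gram idea concrete, here is a minimal, self-contained PyTorch sketch under toy assumptions: the corpus, window size, and hyperparameters are made up, and real word2vec uses efficiency tricks (such as negative sampling or hierarchical softmax) that are omitted here.

```python
import torch
import torch.nn as nn

# Skip-gram, sketched: given a target word, predict each word in its window.
corpus = "i pet my dog on the couch i pet my cat on the couch".split()
vocab = sorted(set(corpus))
word_to_index = {w: i for i, w in enumerate(vocab)}

# Build (target, context) training pairs from a +/-2 word window.
window = 2
pairs = []
for i, target in enumerate(corpus):
    for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
        if j != i:
            pairs.append((word_to_index[target], word_to_index[corpus[j]]))

embedding_dim = 16
model = nn.Sequential(
    nn.Embedding(len(vocab), embedding_dim),  # the dense vectors we actually want
    nn.Linear(embedding_dim, len(vocab)),     # scores over the vocabulary for the context word
)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

targets = torch.tensor([t for t, _ in pairs])
contexts = torch.tensor([c for _, c in pairs])
for epoch in range(200):
    logits = model(targets)          # shape: (num_pairs, vocab_size)
    loss = loss_fn(logits, contexts)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# After training, the embedding table holds one dense vector per word.
word_vectors = model[0].weight.detach()
print(word_vectors[word_to_index["dog"]])
```

Trained on a real corpus rather than this toy one, words that keep similar company (like "dog" and "cat") end up with nearby vectors, which is exactly the property the Firth quote gestures at.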

More advanced further reading

  1. Efficient Estimation of Word Representations in Vector Space by Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. This is the original word2vec paper. It’s a bit dense, but it’s worth a skim.