Embeddings

TLDR

In the section on RNNs, we used a "one-hot" representation. Basically, the letter "F" was represented as a 65-dimensional vector like [0, 0, 0, ..., 1, 0, 0], in which every entry is zero except the one at the index corresponding to "F". This is like a dummy variable or indicator variable in the social sciences.
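As a concrete sketch of what that looks like in code (the vocabulary size and the index assigned to "F" are made up here, just to mirror the RNN section):

```python
import torch

vocab_size = 65              # hypothetical character vocabulary, as in the RNN section
char_to_index = {"F": 6}     # imagine a full {character: index} lookup table; index 6 is made up

one_hot = torch.zeros(vocab_size)
one_hot[char_to_index["F"]] = 1.0

print(one_hot)        # tensor([0., 0., 0., 0., 0., 0., 1., 0., ..., 0.]) -- a single 1 at index 6
print(one_hot.shape)  # torch.Size([65])
```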

One-hot encoding/representation is super inefficient! Imagine that instead of characters, we had an RNN that output English words. In that case we’d have something like 30k dimensions. There are a few problems with this. First, our computer is going to run out of RAM fast: every word becomes a huge vector that is almost entirely zeros. Second, it doesn’t make a lot of sense. Consider the words "dog" and "cat". In a one-hot representation, their dot product is zero: they are orthogonal in vector space…"dog" is no more related to "cat" than it is to "semiconductor" or "volcano". But we know that "dog" and "cat" are quite related: one could often substitute "cat" for "dog" in a sentence just fine. Consider the sentence, "I sat on the couch after a tough day and pet my _____". Either "cat" or "dog" suffices here, whereas "semiconductor" and "volcano" do not. As the British linguist J.R. Firth famously said,

"You shall know a word by the company it keeps"

Here, "dog" and "cat" are often with "pet" and "couch"...and "volcano" rarely is. What we would prefer to the one-hot representation is some "dense" representation like [0.2, 0.232, 0.98, ..., 0.84] in which "dog" and "cat" might have a dot product (cosine) that is larger than their dot product with "semiconductor" or "volcano". In the LLM world, such representations are called "embeddings". That’s because we are taking a super high-dimension concept like "dog" and embedding it into a low-dimensional space, typically a few hundred dimensional.

Kyle's example code

Further reading

  1. The Illustrated Word2vec by Jay Alammar. This is a lucid summary of one of the most famous papers in the field of natural language processing: Google’s 2013 word2vec paper. Word2vec is a neural network that has two variants: (1) continuous bag-of-words (CBOW), in which we train the network to predict a target word given its surrounding words, and (2) skip-gram, in which we train the network to predict the surrounding words given a target word. In each case, we can extract from the network dense vector embeddings of words like "dog", "waiter", and "volcano" that have intuitive relationships in vector space...we can, in effect, do math on words. 🤯 (A minimal skip-gram sketch follows below.)
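To make the skip-gram idea concrete, here is a minimal, self-contained PyTorch sketch under toy assumptions: the corpus, window size, and hyperparameters are made up, and real word2vec uses efficiency tricks (such as negative sampling or hierarchical softmax) that are omitted here.

```python
import torch
import torch.nn as nn

# Skip-gram, sketched: given a target word, predict each word in its window.
corpus = "i pet my dog on the couch i pet my cat on the couch".split()
vocab = sorted(set(corpus))
word_to_index = {w: i for i, w in enumerate(vocab)}

# Build (target, context) training pairs from a +/-2 word window.
window = 2
pairs = []
for i, target in enumerate(corpus):
    for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
        if j != i:
            pairs.append((word_to_index[target], word_to_index[corpus[j]]))

embedding_dim = 16
model = nn.Sequential(
    nn.Embedding(len(vocab), embedding_dim),  # the dense vectors we actually want
    nn.Linear(embedding_dim, len(vocab)),     # scores over the vocabulary for the context word
)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

targets = torch.tensor([t for t, _ in pairs])
contexts = torch.tensor([c for _, c in pairs])
for epoch in range(200):
    logits = model(targets)          # shape: (num_pairs, vocab_size)
    loss = loss_fn(logits, contexts)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# After training, the embedding table holds one dense vector per word.
word_vectors = model[0].weight.detach()
print(word_vectors[word_to_index["dog"]])
```

Trained on a real corpus rather than this toy one, words that keep similar company (like "dog" and "cat") end up with nearby vectors, which is exactly the property the Firth quote gestures at.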

More advanced further reading

  1. Efficient Estimation of Word Representations in Vector Space by Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. This is the original word2vec paper. It’s a bit dense, but it’s worth a skim.