Attention
TLDR
I'm going to breeze over attention in class because...well, it would take us a long time to really get it. But, I'm including this content here so you can return to it after class!
Attention and transformers will take a while to sink in. Please do not get discouraged. After you see it a dozen times, it is highly likely to be fine.
We saw in the section on embeddings how we can represent words (or "tokens") as numeric vectors. We call these "embeddings". Embeddings allow us to use math to represent various linguistic concepts. For example, we can represent word similarity and word analogy ("Richmond" is to "Virginia" as _______ is to "Connecticut"). Word2vec is a very famous 2013 technique for finding word embeddings.
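To make the analogy idea concrete, here is a tiny sketch with invented toy vectors (the numbers are made up for illustration; real word2vec vectors have hundreds of dimensions and are learned from data):

```python
import numpy as np

# Invented toy 3-d "embeddings" (not real word2vec values).
vectors = {
    "Richmond":    np.array([0.9, 0.1, 0.3]),
    "Virginia":    np.array([0.8, 0.2, 0.9]),
    "Hartford":    np.array([0.7, 0.3, 0.2]),
    "Connecticut": np.array([0.6, 0.4, 0.8]),
}

def cosine(a, b):
    """Cosine similarity: 1.0 means 'pointing in the same direction'."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Analogy arithmetic: Richmond - Virginia + Connecticut ~= ?
target = vectors["Richmond"] - vectors["Virginia"] + vectors["Connecticut"]

# The word closest to the target vector fills in the blank.
print(max(vectors, key=lambda w: cosine(vectors[w], target)))  # Hartford
```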
When we use these embeddings instead of "one-hot" encodings (or dummy/indicator variables) in neural networks we simultaneously save space/memory and get more expressive networks. Free lunch!
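Here is a rough back-of-the-envelope comparison of the space involved (the vocabulary size and embedding dimension below are assumed, but typical):

```python
import numpy as np

vocab_size = 50_000   # assumed vocabulary size
embed_dim = 300       # typical word2vec-style dimension
sentence_len = 10     # tokens in one example sentence

# One-hot: each token is a vocab_size-long vector that is almost all zeros.
one_hot = np.zeros((sentence_len, vocab_size))

# Embedding: each token is a dense embed_dim-long vector.
embedded = np.zeros((sentence_len, embed_dim))

print(one_hot.size)   # 500,000 numbers, nearly all zero
print(embedded.size)  # 3,000 numbers, all carrying information
```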
But these embeddings are not a panacea. For example, we can’t just sum the embeddings for words in a sentence and get an embedding that makes sense. Recall, e.g., even though "butt" and "booty" are similar, and "call" and "dial" are similar, "booty call" is very different from "butt dial". Similarly, a word2vec style embedding for "chip" is a combination of concepts including both "Dorito" and "semiconductor".
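Here is a toy demonstration of why naive summing fails (vectors invented for illustration): if "butt" is near "booty" and "call" is near "dial", the two phrase sums land in nearly the same spot, even though the phrases mean very different things.

```python
import numpy as np

# Invented toy vectors; similar words get similar vectors.
butt  = np.array([0.90, 0.10])
booty = np.array([0.88, 0.12])
call  = np.array([0.10, 0.90])
dial  = np.array([0.12, 0.88])

butt_dial  = butt + dial    # naive embedding for "butt dial"
booty_call = booty + call   # naive embedding for "booty call"

# The summed vectors are nearly identical...
print(np.linalg.norm(butt_dial - booty_call))  # tiny (~0.06)
# ...so a model that just sums embeddings cannot tell the phrases apart.
```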
In a sentence like "the chips were fabricated in the clean room" we really want "chips" to discard "Dorito"-type information and double down on its semiconductor-type information. "Attention" is the name of the mechanism we use for this. As always, 1) it’s a series of matrix multiplications; and 2) if we give a neural network tools, it’s going to learn to use them!
Attention is the heart of the "transformer". Almost all the models you know and love are transformer-based: voice recognition, ChatGPT, image generation, you name it.
Attention allows each word, e.g., "chips", to create a vector "query" that you can think of like "hey you neighbors, does anybody know about semiconductors?" and each word also has a vector "key", like "fabricated" saying "I know about semiconductors!" Through this mechanism, "chips" pays attention to "fabricated". And then, "fabricated" also has a "value" that’s something like "here is what I know about semiconductors…." This lets "chips" steal some of that info and, in the process, become context-aware, forgetting its "Dorito" info and doubling down on its "semiconductor" info. (Query, Key, and Value come from the world of databases, and the analogy is intuitive for computer scientists. It seems less intuitive for others. Sorry.)
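Here is a minimal numpy sketch of that query/key/value machinery: scaled dot-product attention with random stand-in embeddings and projection matrices (a toy illustration, not a trained model).

```python
import numpy as np

np.random.seed(0)

# Stand-in embeddings for the 8 tokens of
# "the chips were fabricated in the clean room" (dimension 16).
# Random here; in a real model they are learned or pretrained.
n_tokens, d_model = 8, 16
X = np.random.randn(n_tokens, d_model)

# Learned projection matrices (random stand-ins here).
W_q = np.random.randn(d_model, d_model)
W_k = np.random.randn(d_model, d_model)
W_v = np.random.randn(d_model, d_model)

Q = X @ W_q   # queries: "does anybody know about semiconductors?"
K = X @ W_k   # keys:    "I know about semiconductors!"
V = X @ W_v   # values:  "here is what I know about semiconductors..."

# Scaled dot-product attention.
scores = Q @ K.T / np.sqrt(d_model)                     # how much each token attends to each other token
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights = weights / weights.sum(axis=1, keepdims=True)  # softmax over each row
output = weights @ V                                    # each token's new, context-aware vector

print(weights.shape)  # (8, 8): the "chips" row says how much it attends to every other token
print(output.shape)   # (8, 16): "chips" can now mix in info from "fabricated"
```

In a real transformer this runs in several "heads" at once and is stacked with more layers, but the core really is just these few matrix multiplications plus a softmax.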
Kyle's example code
Further reading
- MIT 6.S191: Recurrent Neural Networks, Transformers, and Attention by Ava Amini. Watch from minute 48 to the end. This is pretty high-level.
- Illustrated Guide to Transformers Neural Network: A step by step explanation by Michael Phi. Slightly more detail; mid-level.
- Visual Guide to Transformer Neural Networks - Multi-Head & Self-Attention by Batool Haider. Same level as above, just a different approach to the material.
More advanced further reading
- Understanding and Coding the Self-Attention Mechanism of Large Language Models From Scratch by Sebastian Raschka. Lucid explanation with code of the internal attention workings. Very helpful if you tend to think in code; might seem in the weeds if you don’t think in code.
- Attention Is All You Need by Google people. This is the original transformer paper. It’s readable and just ~10 pages. It has been cited roughly 108,000 times since its publication in 2017.
In that material you will see mention of "positional embedding". Don’t worry about it…it just means giving each word in the prompt/input knowledge of its position.
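(If you do get curious, here is a minimal sketch of the sinusoidal positional encoding used in the original paper; many newer models simply learn a positional embedding instead.)

```python
import numpy as np

def positional_encoding(n_positions, d_model):
    """Sinusoidal positional encoding from 'Attention Is All You Need'."""
    pos = np.arange(n_positions)[:, None]           # (n_positions, 1)
    i = np.arange(d_model // 2)[None, :]            # (1, d_model/2)
    angles = pos / (10000 ** (2 * i / d_model))     # (n_positions, d_model/2)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions get sines
    pe[:, 1::2] = np.cos(angles)   # odd dimensions get cosines
    return pe

# Added to the word embeddings so each token "knows" where it sits.
print(positional_encoding(n_positions=8, d_model=16).shape)  # (8, 16)
```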
Never forget---this is all just a bunch of matrix multiplications with some non-linearities sprinkled in. We impose a loss function at the end and tweak the matrix values (weights and biases) until the network gives us what we want. In the case of GPT-like "decoder-only" transformers, what we want is a "next word" that makes sense given the previous words, just like we did with the recurrent neural networks (RNNs).
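In case it helps to see it, the one extra ingredient a decoder-only (GPT-like) transformer adds to the attention sketch above is a "causal" mask: each word may only attend to the words before it, so predicting the next word isn't cheating. A minimal sketch:

```python
import numpy as np

np.random.seed(0)
n_tokens = 8
scores = np.random.randn(n_tokens, n_tokens)  # stand-in for Q @ K.T / sqrt(d)

# Causal mask: position i may only attend to positions 0..i.
mask = np.triu(np.ones((n_tokens, n_tokens), dtype=bool), k=1)  # True above the diagonal
scores = np.where(mask, -np.inf, scores)

# Softmax; the -inf entries get exactly zero attention weight.
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights = weights / weights.sum(axis=1, keepdims=True)

print(np.round(weights[3], 2))  # token 3 attends only to tokens 0-3; the rest are 0
```

At training time the loss just compares each position's prediction to the actual next word, which is exactly the "next word that makes sense" objective described above.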