Attention
TLDR
I'm going to breeze over attention in class because...well, it would take us a long time to really get it. But, I'm including this content here so you can return to it after class!
Attention and transformers will take a while to sink in. Please do not get discouraged. After you see it a dozen times, it is highly likely to be fine.
We saw in the section on embeddings how we can represent words (or "tokens") as numeric vectors. We call these "embeddings". Embeddings allow us to use math to represent various linguistic concepts. For example, we can represent word similarity and word analogy ("Richmond" is to "Virginia" as _______ is to "Connecticut"). Word2vec is a very famous 2013 technique for finding word embeddings.
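To make the analogy idea concrete, here is a tiny sketch with invented toy vectors (the numbers are made up for illustration; real word2vec vectors have hundreds of dimensions and are learned from data):

```python
import numpy as np

# Invented toy 3-d "embeddings" (not real word2vec values).
vectors = {
    "Richmond":    np.array([0.9, 0.1, 0.3]),
    "Virginia":    np.array([0.8, 0.2, 0.9]),
    "Hartford":    np.array([0.7, 0.3, 0.2]),
    "Connecticut": np.array([0.6, 0.4, 0.8]),
}

def cosine(a, b):
    """Cosine similarity: 1.0 means 'pointing in the same direction'."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Analogy arithmetic: Richmond - Virginia + Connecticut ~= ?
target = vectors["Richmond"] - vectors["Virginia"] + vectors["Connecticut"]

# The word closest to the target vector fills in the blank.
print(max(vectors, key=lambda w: cosine(vectors[w], target)))  # Hartford
```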
When we use these embeddings instead of "one-hot" encodings (or dummy/indicator variables) in neural networks we simultaneously save space/memory and get more expressive networks. Free lunch!
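Here is a rough back-of-the-envelope comparison of the space involved (the vocabulary size and embedding dimension below are assumed, but typical):

```python
import numpy as np

vocab_size = 50_000   # assumed vocabulary size
embed_dim = 300       # typical word2vec-style dimension
sentence_len = 10     # tokens in one example sentence

# One-hot: each token is a vocab_size-long vector that is almost all zeros.
one_hot = np.zeros((sentence_len, vocab_size))

# Embedding: each token is a dense embed_dim-long vector.
embedded = np.zeros((sentence_len, embed_dim))

print(one_hot.size)   # 500,000 numbers, nearly all zero
print(embedded.size)  # 3,000 numbers, all carrying information
```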
But these embeddings are not a panacea. For example, we can’t just sum the embeddings for words in a sentence and get an embedding that makes sense. Recall, e.g., even though "butt" and "booty" are similar, and "call" and "dial" are similar, "booty call" is very different from "butt dial". Similarly, a word2vec style embedding for "chip" is a combination of concepts including both "Dorito" and "semiconductor".
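Here is a toy demonstration of why naive summing fails (vectors invented for illustration): if "butt" is near "booty" and "call" is near "dial", the two phrase sums land in nearly the same spot, even though the phrases mean very different things.

```python
import numpy as np

# Invented toy vectors; similar words get similar vectors.
butt  = np.array([0.90, 0.10])
booty = np.array([0.88, 0.12])
call  = np.array([0.10, 0.90])
dial  = np.array([0.12, 0.88])

butt_dial  = butt + dial    # naive embedding for "butt dial"
booty_call = booty + call   # naive embedding for "booty call"

# The summed vectors are nearly identical...
print(np.linalg.norm(butt_dial - booty_call))  # tiny (~0.06)
# ...so a model that just sums embeddings cannot tell the phrases apart.
```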
In a sentence like "the chips were fabricated in the clean room" we really want "chips" to discard "Dorito"-type information and double down on its semiconductor-type information. "Attention" is the name of the mechanism we use for this. As always, 1) it’s a series of matrix multiplications; and 2) if we give a neural network tools, it’s going to learn to use them!
Attention is the heart of the "transformer". Almost all the models you know and love are transformer-based: voice recognition, ChatGPT, image generation, you name it.
Attention allows each word, e.g., "chips", to create a vector "query" that you can think of like "hey you neighbors, does anybody know about semiconductors?" and each word also has a vector "key", like "fabricated" saying "I know about semiconductors!" Through this mechanism, "chips" pays attention to "fabricated". And then, "fabricated" also has a "value" that’s something like "here is what I know about semiconductors…." This lets "chips" steal some of that info and, in the process, become context-aware, forgetting its "Dorito" info and doubling down on its "semiconductor" info. (Query, Key, and Value come from the world of databases, and the analogy is intuitive for computer scientists. It seems less intuitive for others. Sorry.)
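Here is a minimal numpy sketch of that query/key/value machinery: scaled dot-product attention with random stand-in embeddings and projection matrices (a toy illustration, not a trained model).

```python
import numpy as np

np.random.seed(0)

# Stand-in embeddings for the 8 tokens of
# "the chips were fabricated in the clean room" (dimension 16).
# Random here; in a real model they are learned or pretrained.
n_tokens, d_model = 8, 16
X = np.random.randn(n_tokens, d_model)

# Learned projection matrices (random stand-ins here).
W_q = np.random.randn(d_model, d_model)
W_k = np.random.randn(d_model, d_model)
W_v = np.random.randn(d_model, d_model)

Q = X @ W_q   # queries: "does anybody know about semiconductors?"
K = X @ W_k   # keys:    "I know about semiconductors!"
V = X @ W_v   # values:  "here is what I know about semiconductors..."

# Scaled dot-product attention.
scores = Q @ K.T / np.sqrt(d_model)                     # how much each token attends to each other token
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights = weights / weights.sum(axis=1, keepdims=True)  # softmax over each row
output = weights @ V                                    # each token's new, context-aware vector

print(weights.shape)  # (8, 8): the "chips" row says how much it attends to every other token
print(output.shape)   # (8, 16): "chips" can now mix in info from "fabricated"
```

In a real transformer this runs in several "heads" at once and is stacked with more layers, but the core really is just these few matrix multiplications plus a softmax.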
Kyle's example code
Further reading
- MIT 6.S191: Recurrent Neural Networks, Transformers, and Attention by Ava Amini. Watch from minute 48 to the end. This is pretty high-level.
- Illustrated Guide to Transformers Neural Network: A step by step explanation by Michael Phi. Slightly more detail; mid-level.
- Visual Guide to Transformer Neural Networks - Multi-Head & Self-Attention by Batool Haider. Same level as above, just a different approach to the material.
More advanced further reading
- Understanding and Coding the Self-Attention Mechanism of Large Language Models From Scratch by Sebastian Raschka. Lucid explanation with code of the internal attention workings. Very helpful if you tend to think in code; might seem in the weeds if you don’t think in code.
- Attention Is All You Need by Google people. This is the original transformer paper. It’s readable and just ~10 pages. It has been cited roughly 108,000 times since its publication in 2017.
In that material you will see mention of "positional embedding". Don’t worry about it…it just means giving each word in the prompt/input knowledge of its position.
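(If you do get curious, here is a minimal sketch of the sinusoidal positional encoding used in the original paper; many newer models simply learn a positional embedding instead.)

```python
import numpy as np

def positional_encoding(n_positions, d_model):
    """Sinusoidal positional encoding from 'Attention Is All You Need'."""
    pos = np.arange(n_positions)[:, None]           # (n_positions, 1)
    i = np.arange(d_model // 2)[None, :]            # (1, d_model/2)
    angles = pos / (10000 ** (2 * i / d_model))     # (n_positions, d_model/2)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions get sines
    pe[:, 1::2] = np.cos(angles)   # odd dimensions get cosines
    return pe

# Added to the word embeddings so each token "knows" where it sits.
print(positional_encoding(n_positions=8, d_model=16).shape)  # (8, 16)
```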
Never forget---this is all just a bunch of matrix multiplications with some non-linearities sprinkled in. We impose a loss function at the end and tweak the matrix values (weights and biases) until the network gives us what we want. In the case of GPT-like "decoder-only" transformers, what we want is a "next word" that makes sense given the previous words, just like we did with the recurrent neural networks (RNNs).
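In case it helps to see it, the one extra ingredient a decoder-only (GPT-like) transformer adds to the attention sketch above is a "causal" mask: each word may only attend to the words before it, so predicting the next word isn't cheating. A minimal sketch:

```python
import numpy as np

np.random.seed(0)
n_tokens = 8
scores = np.random.randn(n_tokens, n_tokens)  # stand-in for Q @ K.T / sqrt(d)

# Causal mask: position i may only attend to positions 0..i.
mask = np.triu(np.ones((n_tokens, n_tokens), dtype=bool), k=1)  # True above the diagonal
scores = np.where(mask, -np.inf, scores)

# Softmax; the -inf entries get exactly zero attention weight.
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights = weights / weights.sum(axis=1, keepdims=True)

print(np.round(weights[3], 2))  # token 3 attends only to tokens 0-3; the rest are 0
```

At training time the loss just compares each position's prediction to the actual next word, which is exactly the "next word that makes sense" objective described above.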