Transformers

TLDR

Retrieval-augmented generation (RAG) is a method for giving an LLM extra information beyond its training data. The RAG process works as follows.

  1. Find documents with which you wish to augment the LLM's knowledge.
  2. These documents get "chunked" into overlapping chunks of text. Maybe it's by paragraph, maybe larger. There will be a specificity/sensitivity trade-off.
  3. Each chunk is "embedded" to make an N-dimensional numeric vector. Each vector is stored in a vector database (we used chromadb in our workshop, but there are many options).
  4. Imagine we uploaded my email (something ChatGPT does not have in its training data...I hope). Then I ask the LLM "remind me with whom I’ve gone to see Yale women’s hockey games at the whale."
  5. My prompt is embedded into the same N-dimensional vector space as my documents (this doesn’t have to match the underlying vector space of your LLM—these two parts are independent).
  6. My vector database compares the vector of my prompt with the vectors of each chunk, typically using cosine similarity. It returns the top-K matches, where K is a number you specify.
  7. My prompt is augmented with text like "Use these tidbits in your answer…" followed by each of the K matches. This happens behind the scenes: you typically don't see the augmented prompt, just your original.
  8. Then the LLM just does its thing, but now it has access to K email chunks in which I discussed women’s hockey and the whale so it can tell me something that plain vanilla ChatGPT cannot.
  9. Because LLMs are getting longer context windows, you might not need RAG if you just want to chat with a few PDFs. But I'm skeptical RAG is going away soon: it scales to large document sets, and it keeps inference costs down because you're not sending your whole database each time you prompt your LLM.
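The retrieval steps above (2, 3, 5, 6, 7) can be sketched in plain Python. This is only a toy: the "embedding" just counts words, standing in for a real embedding model, and a plain list of (chunk, vector) pairs stands in for a vector database like chromadb. The sample emails are made up for illustration.

```python
import math
import re
from collections import Counter

def embed(text, vocab):
    """Toy bag-of-words 'embedding': count how often each vocab word
    appears. A real system would use a learned embedding model."""
    counts = Counter(re.findall(r"[a-z']+", text.lower()))
    return [counts[w] for w in vocab]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

# Step 2: "chunk" the documents (here, each email is already one chunk).
chunks = [
    "Saw Yale women's hockey at the whale with Alex on Friday",
    "Reminder: faculty meeting moved to Tuesday afternoon",
    "Great hockey game at the whale last night, thanks for coming Sam",
]

# A fixed shared vocabulary plays the role of the common embedding space.
vocab = sorted({w for c in chunks for w in re.findall(r"[a-z']+", c.lower())})

# Step 3: embed each chunk and store the vectors.
index = [(chunk, embed(chunk, vocab)) for chunk in chunks]

# Steps 5-6: embed the prompt and retrieve the top-K most similar chunks.
prompt = "with whom have I gone to see hockey at the whale?"
K = 2
query_vec = embed(prompt, vocab)
top_k = sorted(index, key=lambda item: cosine_similarity(query_vec, item[1]),
               reverse=True)[:K]

# Step 7: augment the prompt before sending it to the LLM.
augmented = ("Use these tidbits in your answer:\n"
             + "\n".join(c for c, _ in top_k)
             + "\n\n" + prompt)
print(augmented)
```

Run this and the two hockey emails beat the faculty-meeting email on cosine similarity, so only they end up in the augmented prompt.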

"Agents" are LLMs that are given skills or personalities. For example, we might have an agent that knows how to access our email or calendar. We might have an agent that knows how to turn our lights on at home. We might have one that is told it’s a great writer or coder, or one that is an obtuse contrarian.

It turns out a few LLMs are better than one. With agents, we can give vanilla LLMs some kind of higher order, "system 2" thinking ability. We just create a few agents, give them some ability to talk to each other, and set rules for doing so.

You can imagine creating a lead agent that has a few workers: one is a web researcher, one is a writer, one is an editor, and so on. Then we ask the leader to write a tenure letter. Together, they come up with and execute a plan. Maybe that plan involves finding example tenure letters, downloading the candidate's papers, studying citations, calculating an H-index, writing drafts of the letter, and editing those drafts.
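A minimal sketch of that leader/worker pattern, assuming each "agent" is just a role plus a respond step. Here the respond steps are canned functions standing in for LLM calls (a real version, e.g. with Autogen, would prompt an LLM per role), and the plan and role names are made up for illustration:

```python
# Each worker "agent" takes a task plus the shared context so far and
# appends its contribution. In a real system, each would call an LLM
# with role-specific instructions ("you are a great editor", etc.).

def researcher(task, context):
    return context + [f"[researcher] found material for: {task}"]

def writer(task, context):
    return context + [f"[writer] drafted text for: {task}"]

def editor(task, context):
    return context + [f"[editor] polished the draft for: {task}"]

def leader(goal):
    """The leader turns a goal into a plan, routes each step to the
    right worker, and passes the growing context along. In a real
    agent framework, the plan itself would also come from an LLM."""
    plan = [
        ("find example tenure letters", researcher),
        (f"write a draft: {goal}", writer),
        ("edit the draft", editor),
    ]
    context = []
    for task, worker in plan:
        context = worker(task, context)
    return context

transcript = leader("a tenure letter")
for line in transcript:
    print(line)
```

The interesting design choice is that coordination lives entirely in the leader: the workers stay dumb and single-purpose, which matches the "small bots coordinated by smarter agents" picture below.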

Agents can get expensive because they require a lot of computation. You will almost certainly see agents all over the place very soon. These will likely be small "bots" that can do one or two things, and then you'll use smarter agents/bots to coordinate them.

Kyle's example code

  1. Retrieval-augmented generation (RAG) with ChatGPT and ChromaDB
  2. LLM Agents with Microsoft's Autogen

Further reading

  1. What is RAG? by IBM.
  2. Retrieval Augmented Generation (RAG) by Pinecone.
  3. Getting started with AI Agents by Armand Ruiz.

Advanced reading

  1. AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation by Microsoft Research.
  2. Retrieval-Augmented Generation for Large Language Models: A Survey. (I haven't read this one, but it looked decent.)