Thoughts & Notes from Andrej Karpathy’s “Intro to Large Language Models (LLMs)”

Better late than never: I’m posting my notes, with supplemental research, from Andrej Karpathy’s “Intro to Large Language Models” YouTube tutorial. It does a great job covering the basics as of November 2023, although the space is evolving so quickly that some of it is already outdated. 

Karpathy led the development of Full Self-Driving (FSD) at Tesla and was a founding member of OpenAI. He did his undergraduate studies at the University of Toronto, where he took classes from Geoffrey Hinton (the “Godfather of AI”), and completed his PhD in Fei-Fei Li’s (the “Godmother of AI”) Stanford Vision Lab. 

Most of us use LLMs via applications such as ChatGPT or Copilot, while a much smaller group of people builds or customizes them. Watching this tutorial helped me appreciate the craft required to build them. The magical feeling of interacting with an LLM application suggests that everything under the hood must be a smoothly running machine. The reality, however, is that there is still a lot of manual work in building and updating these models. 

In 2024, the major trend we’re seeing is specialization:

Large vs. Small Models 
LLMs are scaling from millions to billions of users as distribution improves. Foundation models such as OpenAI’s GPT-4, Anthropic’s Claude, Google’s Gemini, and Meta’s Llama keep getting bigger and better for everyday ChatGPT-style use cases. In contrast, Apple announced plans to run much smaller, performant, low-latency models locally on iPhones. Just like that, LLMs will scale to a billion users. 

All-encompassing Chips vs. Inference Chips
NVIDIA announced its H200 chip, which is up to 45% more powerful than the previous H100 while maintaining the same energy consumption, so models can be trained faster and on larger datasets. Meanwhile, other chip companies, such as Groq, are cheaper and focus on low-latency inference as a specialized use case.

Capabilities are Scaling, While Becoming Cheaper 
LLMs are now multimodal, meaning they can process and generate text, images, video, and audio. In addition, GPT-4o’s context window is 128K tokens, a >10x increase over the original GPT-4, enabling longer, more complex conversations. It’s also faster and cheaper than previous models such as GPT-4.

Summary
It was a worthwhile exercise to combine the knowledge from the tutorial with additional research on related topics and industry trends. Putting it all together gave me a clearer picture of where this industry is heading. I hope you find this summary helpful as well. 

Notes Section

LLM Basic Structure
An LLM is just 2 files on your system (a toy sketch follows the list):

  • A parameters file, which contains the weights of the neural network (e.g. 140 GB for a 70B-parameter model). 
  • A run file that implements the neural network architecture (e.g. ~500 lines of C or Python).
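
To make the two-file picture concrete, here is a toy sketch (not a real implementation) of a “run file” loading a hypothetical parameters file and generating tokens; the forward pass is only a placeholder for the actual transformer code:

```python
# Toy sketch only: a "run file" that loads a hypothetical parameters file and
# generates tokens. Real run files (e.g. ~500 lines of C) implement the full
# transformer forward pass; here it is just a placeholder.
import numpy as np

weights = np.load("llama-2-70b.params.npz")   # hypothetical ~140 GB parameters file

def forward(token_ids, weights):
    # Placeholder for the transformer forward pass: returns next-token logits.
    vocab_size = 32_000                        # Llama 2's vocabulary size
    return np.random.randn(vocab_size)

def generate(prompt_ids, n_tokens=20):
    ids = list(prompt_ids)
    for _ in range(n_tokens):
        logits = forward(ids, weights)
        ids.append(int(np.argmax(logits)))     # greedy: take the most likely token
    return ids
```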

Model Building in 4 Steps
1. Pre-training
Take a large chunk of internet text, about 10 terabytes. Compress it down to a ~140 GB parameters file (think of it as a lossy zip of the data) by training on roughly 6,000 GPUs for 12 days, at a cost of about $2M. This lossy compression of internet data is what produces a model like Meta’s Llama 2 70B. Pre-training is the phase with high data quantity but low data quality.
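
The training objective is next-token prediction. A minimal sketch, assuming a generic PyTorch autoregressive model that maps token ids to next-token logits:

```python
# Minimal sketch of the pre-training objective: predict the next token at every
# position and minimize cross-entropy. "model" stands in for any autoregressive
# transformer mapping token ids to next-token logits.
import torch.nn.functional as F

def pretraining_step(tokens, model):
    # tokens: (batch, seq_len) ids drawn from the raw internet-text corpus
    inputs, targets = tokens[:, :-1], tokens[:, 1:]    # shift by one position
    logits = model(inputs)                             # (batch, seq_len - 1, vocab_size)
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),           # flatten batch and positions
        targets.reshape(-1),                           # labels are the next tokens
    )
    loss.backward()                                    # gradients for the optimizer step
    return loss.item()
```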

2. Fine-tuning
Fine-tuning transforms a base language model into an assistant model that you can interact with in a Q&A format. The optimization process remains the same as pre-training, using next-word prediction, but the key difference is that the dataset consists of roughly 100K human-written questions and answers. Fine-tuning is the phase with low data quantity but high data quality.
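
For illustration only, here is one hypothetical way a single human-written Q&A pair might be formatted into a training sequence (real chat templates vary by model):

```python
# Hypothetical chat template for turning one human-written Q&A pair into a
# single training sequence; real templates vary by model, this is illustrative.
def format_example(question: str, answer: str) -> str:
    return (
        "<|user|>\n" + question.strip() + "\n"
        "<|assistant|>\n" + answer.strip() + "<|end|>"
    )

sample = format_example(
    "What is a context window?",
    "It is the maximum number of tokens the model can attend to at once.",
)
# The same next-token-prediction loss from pre-training is then applied to `sample`.
```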

3. Assistant Model
A LLM that’s been pre-trained and fine-tuned knows that it should answer questions in the style of an assistant. It still can access and use all the knowledge built up during pre-training.

4. Reinforcement Learning from Human Feedback (RLHF)
The model is tuned further by generating multiple versions of an answer and having humans select which one is better. This process produces relative rankings and scores. Policies can also be trained so the model makes up facts less often and produces fewer toxic outputs. The overall goal is to align the model’s behaviour with desired traits such as helpfulness, truthfulness, and safety.
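
Those human comparisons are commonly used to train a reward model with a pairwise ranking loss. A minimal sketch, assuming `reward_model` maps a prompt and answer to a scalar score:

```python
# Sketch of the pairwise ranking loss often used to train an RLHF reward model:
# the human-preferred answer should score higher than the rejected one.
# `reward_model` is assumed to map (prompt, answer) text to a scalar tensor.
import torch.nn.functional as F

def ranking_loss(reward_model, prompt, chosen, rejected):
    r_chosen = reward_model(prompt, chosen)        # score for the preferred answer
    r_rejected = reward_model(prompt, rejected)    # score for the other answer
    # Push the preferred answer's score above the rejected one's.
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```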

Reinforcement Learning
Reinforcement learning aims to emulate the way humans learn: AI agents learn through trial and error, motivated by strong incentives to succeed. Reinforcement learning has 3 components (a toy example follows the list):

1. State Space
Available information about the task at hand that is relevant to decisions the AI agent might make, including both known and unknown variables. The state space changes with each decision the agent makes.

2. Action Space
Contains all the decisions the AI agent might make. In a board game, the space is well defined. In text generation, the space is the entire vocabulary of tokens available to an LLM.

3. Reward Function
Reward is a measure of success that incentivizes the AI agent. The feedback must be a scalar value, either positive or negative.
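
To tie the three components together, here is a toy tabular Q-learning loop (not from the tutorial) on a 5-cell corridor:

```python
# Toy tabular Q-learning on a 5-cell corridor: states are cell indices (state
# space), actions are step left/right (action space), and the reward function
# pays +1 only for reaching the rightmost cell.
import random

N_STATES, ACTIONS = 5, [-1, +1]
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
alpha, gamma, eps = 0.1, 0.9, 0.1              # learning rate, discount, exploration

for episode in range(500):
    s = 0
    while s != N_STATES - 1:
        # Epsilon-greedy: mostly exploit the best-known action, sometimes explore.
        a = random.choice(ACTIONS) if random.random() < eps else max(ACTIONS, key=lambda x: Q[(s, x)])
        s_next = min(max(s + a, 0), N_STATES - 1)
        r = 1.0 if s_next == N_STATES - 1 else 0.0          # scalar reward signal
        # Standard Q-learning update from the trial-and-error experience.
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s_next, b)] for b in ACTIONS) - Q[(s, a)])
        s = s_next
```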

Model Inference
Running LLMs, called inference, is relatively cheap once they’re trained. A model takes a sequence of tokens as input and predicts the next token in the sequence. However, it doesn’t always choose the single most likely next token; it often samples from a distribution of likely tokens, which allows for variability in outputs. Two common decoding strategies are (sketched after the list):

1. Greedy decoding - always choosing the most likely word
2. Beam search - considering multiple possible continuations
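
A small sketch contrasting greedy decoding with temperature sampling over one step’s logits (beam search is omitted for brevity):

```python
# Sketch of greedy decoding vs. temperature sampling over one step's logits.
# Greedy always takes the argmax; sampling draws from the softmax distribution,
# which is where the variability in outputs comes from.
import numpy as np

def softmax(logits, temperature=1.0):
    z = (logits - logits.max()) / temperature
    e = np.exp(z)
    return e / e.sum()

def next_token(logits, mode="sample", temperature=0.8):
    logits = np.asarray(logits, dtype=float)
    if mode == "greedy":
        return int(np.argmax(logits))                        # single most likely token
    probs = softmax(logits, temperature)
    return int(np.random.choice(len(probs), p=probs))        # sample from the distribution
```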

A model has a limited memory, called the context window, which is the maximum number of tokens it can consider when making predictions.

Larger models (e.g. 70B parameters) are slower than smaller ones (e.g. 7B) in terms of inference speed, but other factors also affect speed, such as load balancing, caching, and optimizing for low latency.

LLM Scaling
LLM performance scales in a smooth, predictable manner as a function of:

N = # of parameters in the network
D = the amount of text we train on

Trends do not show signs of performance topping out, meaning bigger models will continue improving with more and more data.
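
These scaling laws are often written in a form like L(N, D) = E + A/N^α + B/D^β, where loss falls smoothly as N or D grows. The sketch below uses made-up coefficients purely for illustration:

```python
# Illustrative only: a Chinchilla-style scaling law of the form
#   L(N, D) = E + A / N**alpha + B / D**beta
# The coefficients below are made up for demonstration, not published fits.
def predicted_loss(N, D, E=1.7, A=400.0, B=400.0, alpha=0.34, beta=0.28):
    # N: number of parameters, D: number of training tokens
    return E + A / N**alpha + B / D**beta

# Loss keeps falling smoothly as either N or D grows.
print(predicted_loss(N=7e9, D=2e12), predicted_loss(N=70e9, D=2e12))
```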

Model Reasoning
The book “Thinking, Fast and Slow” describes how human brains have 2 systems for processing information: “System 1” reacts quickly and uses heuristics, while “System 2” is slower and uses deliberate reasoning. LLMs currently only have a “System 1”.

For LLMs to have a System 2, we need to be able to trade time for accuracy, for example: “here’s my question, take 30 minutes to answer it.” The language model would create a tree-like structure of thoughts or reasoning paths and consider different branches before providing a final answer.
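
A toy sketch of this “trade time for accuracy” idea: sample several candidate reasoning paths and keep the best-scoring one. Both helper functions are hypothetical stand-ins, not a real method from the tutorial:

```python
# Toy sketch of trading time for accuracy: sample several candidate reasoning
# paths and keep the best-scoring one. Both helpers are hypothetical stand-ins
# (in practice they would be LLM calls and a learned verifier / self-evaluation).
import random

def generate_reasoning(question):
    return f"candidate reasoning path #{random.randint(0, 999)} for: {question}"

def score_answer(candidate):
    return random.random()     # placeholder score for how good the answer looks

def answer_with_deliberation(question, n_paths=8):
    candidates = [generate_reasoning(question) for _ in range(n_paths)]   # branch out
    return max(candidates, key=score_answer)                              # keep the best branch
```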

Security Challenges
There are multiple techniques users can use to bypass a model’s safety training and “jailbreak” it. For example, a user may ask the LLM to act as a deceased grandmother who used to be a chemical engineer at a napalm factory and to tell a bedtime story about how napalm is made. This frames the request as a comforting roleplay rather than a direct request for harmful information.

Other methods of attack include:

  • Base64-encoded prompts, which models can follow because they have learned Base64 from internet data.
  • Appending certain optimized, random-looking character sequences as a suffix to a prompt.
  • Optimizing an image’s background pattern so that it bypasses the model’s safety measures.
  • Prompt injection: embedding white text in an image, or hidden text in a document or web page, that the model reads and acts on as a new prompt.
  • Data poisoning: training a model on data that contains a hidden trigger phrase, so that outputs become random, or change in a specific way, when the trigger appears in a prompt.


YouTube Tutorial:  https://www.youtube.com/watch?v=zjkBMFhNj_g 
*Title image credit: Karpathy's conceptual model of LLMs as the kernel of an operating system for computing