Thoughts & Notes from Andrej Karpathy’s “Intro to Large Language Models (LLMs)”

Better late than never: I'm posting my notes, with supplemental research, from Andrej Karpathy's "Intro to Large Language Models" YouTube tutorial. It does a great job covering the basics as of November 2023, although the space is evolving so quickly that some of it is already outdated.

Karpathy led the development of Full Self-Driving (FSD) at Tesla and was a founding member of OpenAI. He completed his undergraduate studies at the University of Toronto, where he took classes with Geoffrey Hinton (the "Godfather of AI"), and his PhD in Fei-Fei Li's (the "Godmother of AI") vision lab at Stanford.

Most of us use LLMs via applications such as ChatGPT or Copilot, while a much smaller group of people builds or customizes them. Watching this tutorial helped me appreciate the craft required to build them. The magical feeling of interacting with an LLM application gives a sense of comfort that everything under the hood must be a smooth-running machine. The reality, however, is that there is still a lot of manual work in building and updating these models.

In 2024, the major trend we’re seeing is specialization:

Large vs. Small Models 
LLMs are scaling up from millions to billions of users with improved distribution. Foundation models such as OpenAI's GPT-4, Anthropic's Claude, Google's Gemini, and Meta's Llama continue getting bigger and better for our daily ChatGPT use-cases. In contrast, Apple announced plans to run much smaller, performant, low-latency models locally on iPhones. Just like that, LLMs will scale to a billion users.

All-encompassing Chips vs. Inference Chips
NVIDIA announced their H200 chip, which is up to 45% more powerful than their previous H100 model while maintaining the same energy consumption, so models can be trained faster and on larger datasets. Meanwhile, other chip companies, such as Groq, focus on cheaper, low-latency inference as a specialized use-case.

Capabilities are Scaling, While Becoming Cheaper 
LLMs are now multi-modal, which means they can process and generate text, images, video, and audio. In addition, GPT-4o's context window is 128K tokens, a >10x increase over GPT-4, enabling longer, more complex conversations. It's also faster and cheaper than previous models such as GPT-4.

Summary
It was a worthwhile exercise combining the knowledge from the tutorial, additional research of related topics, and industry trends. Putting it all together helped give me a clearer picture of where this industry is heading. I hope you find this summary helpful as well. 

Notes Section

LLM Basic Structure
An LLM is just 2 files on your system (a minimal sketch of how they fit together follows the list):

  • Parameters file, which holds the weights or parameters of the neural network (e.g. 140GB for a 70B parameter model).
  • Run file that implements the neural network architecture (e.g. 500 lines of C or Python)
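
To make the two-file idea concrete, here's a minimal sketch in Python. The structure is from the talk, but the names and the stubbed-out forward pass are mine:

  import numpy as np

  VOCAB_SIZE = 32_000
  weights = {}  # stand-in for the ~140GB parameters file loaded from disk

  def forward(weights, tokens):
      # the ~500 lines implementing the neural network architecture would go here;
      # this stub just returns a uniform distribution over the next token
      return np.full(VOCAB_SIZE, 1.0 / VOCAB_SIZE)

  tokens = [1]  # start from a beginning-of-sequence token
  for _ in range(20):
      probs = forward(weights, tokens)      # a score for every possible next token
      tokens.append(int(np.argmax(probs)))  # pick one and append it to the sequence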

Model Building in 4 Steps
1. Pre-training
Take a large portion of internet text, about 10 terabytes. Compress it down to a 140GB parameters file (a lossy "zip file" of the internet) by training with 6K GPUs over 12 days, costing about $2M. The output of this lossy compression is Meta's Llama 2 70B. Pre-training is the phase where there is high data quantity, but low data quality.
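
The underlying objective for this compression is next-word prediction (the same objective reappears in fine-tuning below). A toy PyTorch sketch of that objective, where the tiny model is a stand-in and nothing like Llama's actual architecture:

  import torch
  import torch.nn.functional as F

  vocab_size, dim = 1000, 64
  emb = torch.nn.Embedding(vocab_size, dim)   # toy model: an embedding table...
  head = torch.nn.Linear(dim, vocab_size)     # ...plus a linear layer over the vocabulary

  tokens = torch.randint(0, vocab_size, (1, 128))  # stand-in for a chunk of internet text
  logits = head(emb(tokens))                       # a prediction at every position
  loss = F.cross_entropy(
      logits[:, :-1].reshape(-1, vocab_size),      # predictions at positions 0..n-2
      tokens[:, 1:].reshape(-1),                   # targets: the same text shifted by one
  )
  loss.backward()  # gradients are what slowly shape the 140GB of parameters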

2. Fine-tuning
Fine-tuning transforms a base language model into an assistant model that you can interact with in a Q&A format. The optimization process remains the same as pre-training, using next-word prediction, but the key difference is that the dataset becomes roughly 100K human-written questions and answers. Fine-tuning is the phase where there is low data quantity, but high data quality.
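
A sketch of what a single fine-tuning example might look like once formatted for training; the chat template below is hypothetical, since the exact format varies by model:

  # One human-written Q&A pair, formatted into a single training document.
  question = "How do I reverse a list in Python?"
  answer = "Use reversed(my_list), or slice it with my_list[::-1]."

  training_example = (
      "<|user|>\n" + question + "\n"
      "<|assistant|>\n" + answer + "\n<|end|>"
  )
  # The same next-word-prediction loss is applied to this text, so the model learns
  # to continue a user's question with an answer written in the assistant's style.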

3. Assistant Model
A LLM that’s been pre-trained and fine-tuned knows that it should answer questions in the style of an assistant. It still can access and use all the knowledge built up during pre-training.

4. Reinforcement Learning from Human Feedback (RLHF)
The model is tuned further by generating multiple versions of an answer and having humans select which is better. This process produces relative rankings and scores. The resulting behaviour can also be shaped to make up facts less often and to reduce toxic outputs, aligning the model with desired traits such as helpfulness, truthfulness, and safety.
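
Those human rankings are typically used to train a reward model with a pairwise loss. A minimal sketch of that idea, where the linear "reward model" and the fixed answer embeddings are placeholders rather than how production systems are built:

  import torch
  import torch.nn.functional as F

  w = torch.randn(64, requires_grad=True)   # toy reward model: a single linear layer

  def reward_score(answer_embedding):
      return answer_embedding @ w            # scalar score for an answer

  preferred = torch.randn(64)                # embedding of the answer humans ranked higher
  rejected = torch.randn(64)                 # embedding of the answer humans ranked lower

  # Push the preferred answer's score above the rejected answer's score.
  loss = -F.logsigmoid(reward_score(preferred) - reward_score(rejected))
  loss.backward()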

Reinforcement Learning
Reinforcement learning aims to emulate the way that humans learn. AI agents learn holistically through trial and error, motivated by strong incentives to succeed. Reinforcement learning has 3 components (a toy loop tying them together is sketched after the list):

1. State Space
Available information about the task at hand that is relevant to decisions the AI agent might make, including both known and unknown variables. The state space changes with each decision the agent makes.

2. Action Space
Contains all the decisions the AI agent might make. In a board game, the space is well defined. In text generation the space is the entire vocabulary of tokens available to an LLM.

3. Reward Function
Reward is a measure of success that incentivizes the AI agent. The feedback must be a scalar positive or negative number.
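
A toy loop showing how the three components interact; the one-dimensional "environment" here is invented purely for illustration:

  import random

  state = 0                  # state space: the information available to the agent
  actions = [-1, +1]         # action space: every decision the agent can make
  total_reward = 0.0

  for step in range(10):
      action = random.choice(actions)        # a (deliberately dumb) random policy
      state += action                        # each decision changes the state
      reward = 1.0 if state == 5 else 0.0    # reward function: a scalar success signal
      total_reward += reward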

Model Inference
Running LLMs, called inference, is relatively cheap once they're trained. The model takes a sequence of words as input and predicts the next word in the sequence. However, it doesn't always choose the single most likely next word; it often samples from a distribution of likely words, which allows for variability in outputs. There are two common decoding modes (contrasted in the small sketch after the list):

1. Greedy decoding - always choosing the most likely word
2. Beam search - considering multiple possible continuations
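
A small sketch contrasting greedy decoding with sampling from the distribution (beam search is more involved and omitted here); the three-word "vocabulary" is made up:

  import numpy as np

  words = ["cat", "dog", "bird"]
  probs = np.array([0.5, 0.3, 0.2])   # the model's distribution over candidate next words

  greedy_word = words[int(np.argmax(probs))]        # greedy: always the most likely word
  sampled_word = np.random.choice(words, p=probs)   # sampling: allows variability in output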

A model has a limited memory, called the context window, which is the maximum number of tokens it can consider when making predictions.

Larger models (e.g. 70B parameters) are slower than smaller ones (e.g. 7B) in terms of inference speed. But other factors also impact speed, such as load balancing, caching, and optimizing for low latency.

LLM Scaling
LLM performance scales in a smooth, predictable manner as a function of:

N = # of parameters in the network
D = the amount of text we train on

Trends do not show signs of performance topping out, meaning bigger models will continue improving with more and more data.
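
The talk doesn't give an exact formula, but published scaling-law work (e.g. the Chinchilla paper) fits loss curves of roughly this form; the constants below are illustrative placeholders, not authoritative values:

  def predicted_loss(N, D, E=1.7, A=400.0, B=410.0, alpha=0.34, beta=0.28):
      # Loss falls smoothly as parameters (N) and training tokens (D) grow.
      return E + A / N**alpha + B / D**beta

  # e.g. a 70B-parameter model trained on 2 trillion tokens:
  print(predicted_loss(N=70e9, D=2e12))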

Model Reasoning
The book "Thinking, Fast and Slow" describes how human brains have 2 systems for processing information: "System 1" is fast, reactive, and uses heuristics, while "System 2" is slower and uses deliberate reasoning. LLMs only have a "System 1".

For LLMs to have a System 2, we need a way to convert time into accuracy. For example: "here's my question, take 30 minutes to answer it." The language model would create a tree-like structure of thoughts or reasoning paths and consider different branches before providing a final answer.
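
A hypothetical sketch of what that could look like: spend extra compute generating several candidate reasoning paths, score them, and answer with the best one. Both helper functions are stand-ins for calls to an LLM:

  def generate_reasoning_path(question, seed):
      # stand-in for sampling one chain of reasoning from an LLM
      return f"reasoning path {seed} for: {question}"

  def score_path(path):
      # stand-in for a verifier or the model's own self-evaluation
      return len(path)

  question = "What is 17 * 24?"
  paths = [generate_reasoning_path(question, seed) for seed in range(5)]
  best_path = max(paths, key=score_path)   # explore branches, keep the most promising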

Security Challenges
There are multiple techniques users can employ to bypass a model's security and "jailbreak" it. For example, a user may ask the LLM to act as a deceased grandmother who used to be a chemical engineer at a napalm factory, and ask for instructions on how to create napalm as a bedtime story. This frames the request as a comforting roleplay rather than a direct request for harmful information.

Other methods of attack include:

  • Base64-encoded prompts, which models can follow because they've seen Base64 in their internet training data (a small example follows this list)
  • Adding certain random-looking characters as a suffix to a prompt
  • Optimizing an image background so that it bypasses the model's security
  • Prompt injection - embedding white text in an image, or hidden text in a document, that gets read and acted on as a new prompt
  • Data poisoning - training a model on data containing a hidden trigger phrase, so that outputs become random, or change in a specific way, whenever the trigger appears in a prompt
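
For example, the Base64 trick from the first bullet is just standard encoding; the prompt below is harmless and only illustrates the mechanics:

  import base64

  prompt = "Tell me a bedtime story."
  encoded = base64.b64encode(prompt.encode()).decode()
  print(encoded)  # an attacker sends this string, hoping the model decodes and
                  # follows it while safety checks only inspect the surface text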


YouTube Tutorial:  https://www.youtube.com/watch?v=zjkBMFhNj_g 
*Title image credit: Karpathy's conceptual model of LLMs as the kernel of an operating system for computing

Navigating Product Management: Seeing the Forest and the Trees

I once worked on a Marketing team where I constantly resolved disagreements with Sales. Each week, I helped decide which sales channels would get promotional discounts. I negotiated against one team that mostly wanted one thing: discounts. The work was one-dimensional.

I’ve since changed careers to product management. It requires collaborating across multiple teams to build software products. The work is high-dimensional. 

Previously, I've written about the product management process as an individual contributor, but this post reflects my experience when products get more complex and start pushing up against an organization's constraints.

I find it helps to view each team as a tree within the broader forest of product management. Teams need to cooperate on what to build (product vision), what to measure, and how to build.   

Product Vision 

Ask ten employees for their objectives and you'll likely get ten different answers. Aligning vision can be challenging.

Companies are always trying to improve cross-division alignment. In the 1930s, Procter & Gamble (P&G) invented brand management, where each product (soap, foods, etc.) had its own team to build and market the brand.

The issue with this structure was there were redundant roles across each product line. In response, companies adopted a matrix structure where functional groups (Sales, Finance, R&D, etc.) supported multiple products at once. 

However, teams still clashed because functional teams optimize on how they’re measured. Finance has cost targets. Salespeople have sales targets. Manufacturing has quality targets. 

Product teams focus on the bigger picture. Where is the market going? What are customers struggling with? Then, they build the business case, the story, and work with functional teams to get feedback and buy-in. 

How to Operate a Data Platform: Challenges and Solutions

Why is operating a data platform hard? Because of the significant variability in the day-to-day.

A good data user experience is like running a Starbucks store - you walk in expecting a consistent-tasting drink from any location in the world. An analyst using a reporting dashboard or a data scientist using a machine learning model is no different: they should expect consistent, reliable data.

Now what happens when an ingredient is out of stock? Your drink can’t be made. There could be many reasons why it’s out of stock - maybe the forecast for ingredients was too low, maybe some stores were sent more than others, maybe there was a storm that created supply issues, etc. There are more ways for things to go wrong than we could ever list.

A data platform is the same - it collects data from a variety of sources, carefully mixes it with other data, and then ships it to a user via a web application, reporting platform, marketing platform, etc. There are more ways for things to go wrong than we could ever list.

This article will cover topics on how to operate a Data Platform, such as: process bottlenecks, data quality, and data monitoring systems. 

From Manufacturing to DevOps

A Starbucks drink and data differ in that the feedback loop is much faster for data. Moving physical goods takes days or hours; data can change within minutes or seconds. Humans process and ship physical goods, whereas machines complete these tasks for data.

Back in college, we read a book about plant operations called 'The Goal'. I never thought I'd have a career in plant manufacturing, yet here I am reminiscing about it. The only thing I remember was the importance of managing bottlenecks. Bottlenecks should be used at maximum capacity to ensure high throughput.

Pushed to the extreme, this concept meant that someone’s sole responsibility was monitoring bottlenecks to minimize downtime.

The spiritual child of 'The Goal' is a book focused on IT called 'The Phoenix Project'. It applied the same manufacturing principles to software development, which inspired the DevOps movement. The principles were maximizing throughput through bottlenecks, owning quality assurance to minimize rework, and examining software systems holistically across departments.

Managing Bottlenecks

Data comes in, data gets processed, data gets shipped. 

Running a data platform requires constantly monitoring bottlenecks, which come in two forms: people and technical.

People bottlenecks happen because people are juggling multiple projects, don't communicate regularly across departments, or don't see the downstream impact of their work. They context switch across projects and tasks - meetings, emails, or tickets. Teams optimize to improve their own productivity (which is a good thing), but delaying one task might be the bottleneck for the entire Data Platform.

Regular communication to align on prioritization with other teams is the solution and a large part of managing a Data Platform. Hard problems take focus, and major projects have endless hard problems that require prioritization or they won’t get solved. 

Technical bottlenecks are more straightforward because, hypothetically, you have more influence over the solution. It's about ensuring that the machines processing the data have minimal downtime. Straightforward, however, does not mean easy.

A Data Platform can process billions of rows of data a day. Consider a scenario where a bug fix needs to be applied to all the historical data. This change means reprocessing everything with the fix. Machines take a fixed amount of time to complete this work, which makes the bottleneck straightforward to calculate and manage (a back-of-envelope example follows).
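
A back-of-envelope version of that calculation, with every number invented for illustration:

  rows_to_reprocess = 5_000_000_000         # hypothetical backlog of historical rows
  rows_per_hour_per_machine = 50_000_000    # hypothetical throughput of one machine
  machines = 10

  hours_needed = rows_to_reprocess / (rows_per_hour_per_machine * machines)
  print(f"Backfill takes roughly {hours_needed:.0f} hours")   # 10 hours in this example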

What's not easy are the new bottlenecks that appear as more data and users come onboard. Engineers have a variety of methods to increase platform capacity for higher volume or faster processing. Sometimes the solution is clever engineering; other times it's unglamorous - a data engineer rolling up their sleeves and getting things done manually.

Data Rework: Differences Between Software vs. Data Engineering 

If software development is about managing libraries (open-source or 3rd-party) and logic, then data is about managing data sources and processing logic. There are some major differences between software and data engineering. 

Software can be more predictable. In software, you select the library versions, and breaking changes, hopefully, arrive only in a new version. In data, you're relying on the data source staying consistent. Unfortunately, that's not always the case.

Software compiles faster. In software, debugging can happen quickly because a program compiles in seconds. In data, you may start by debugging against a small dataset, but then query progressively larger and more complex data. Both are equally complex; it's just that data debugging can have longer feedback loops because of compilation and query time.

Software bug fixes are usually a few snippets of code. In software, the complexity in debugging is finding the issue across libraries, data, and servers, but the actual fix is usually fast. In data, billions of rows get backed up when things go down. It’s resource-intensive, manual work that requires pushing technical bottlenecks to their maximum capacity to clear things out. This compounding rework in data is why automated monitoring is so important. 

Monitoring Data Quality to Minimize Rework

Data monitoring systems are important because new bottlenecks can appear at any time and clearing out traffic jams is expensive in time and effort. 

The operating challenge is that we assume the organization producing the source data has high data quality standards. That's not always the case, and investing in a monitoring system is the best way to be proactive. A system can be built in-house first, which lets you nail down your use cases and edge cases before deciding to outsource to a vendor.

A sample checklist of items to monitor includes (a simple sketch of such checks follows the list):

  • Timing - data arriving on time consistently
  • Schema - data schema & data structure consistency
  • Volume - data size consistency
  • DateTime - date & timezone consistency 
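
A simple sketch of what such checks might look like; the field names, expected schema, and thresholds are all invented for illustration:

  from datetime import datetime, timezone

  EXPECTED_COLUMNS = {"order_id", "amount", "created_at"}   # hypothetical schema

  def check_batch(batch):
      issues = []
      if set(batch["columns"]) != EXPECTED_COLUMNS:                  # schema check
          issues.append("schema drift")
      if batch["row_count"] < 0.5 * batch["typical_row_count"]:      # volume check
          issues.append("row count dropped by more than half")
      lag = datetime.now(timezone.utc) - batch["arrived_at"]         # timing check
      if lag.total_seconds() > 3600:
          issues.append("data arrived more than an hour late")
      return issues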

Setting up monitoring is an investment in being able to provide reliable, consistent data to users, and peace of mind to the Data Platform teams. Over time, more edge-cases can be added to the monitoring as they come up. 

Summary

Data is everywhere in today’s world. Any digital system that you touch is powered by Data Platforms. Wrangling a consistent, reliable data experience is an exercise in smoothing out complexities. Running a Data Platform is an operational balancing act of being proactive with monitoring at a higher level, while managing people and technical bottlenecks at a daily level.


*Title image credit: https://ralphammer.com/make-me-think. I chose this image because it reflects how users should have simple, intuitive, data experiences. Those of us building Data Platforms should manage the complexity in the background.   

My Product Manager Survival Handbook


Developing software products is a messy process. It’s messy because it’s so flexible and fairly new, only about 60 years old. It’s also abstract and not limited by the physical world, only by our creativity. 

If you're building with more concrete mediums, whether it's furniture or a cake, there are well-documented rules about how parts work together and how it should look. In this sense, software engineers more closely resemble artists such as writers or musicians, where the rules are flexible and the result isn't judged by how it looks, but by how it makes us feel. Great technology delights - think of the raw excitement the first time Steve Jobs presented the iPhone in 2007. Or the sheer joy of finishing tedious paperwork in minutes instead of hours because a computer automated it.

But, building delightful products is a process littered with worries. Work experience can help because it closes the gap between the execution required and final product vision. Unfortunately, experience is a nice way of saying learning from mistakes and there are always new ways to make mistakes. Experience isn't always enough. 

So, what are my worries? 

Thoughts from Toronto Machine Learning Summit 2019

I attended the 2019 Toronto Machine Learning Summit (TMLS) last week - it was a great experience. The community was welcoming, the content was relevant, and it was well organized.

One thing that I appreciated about the event is how down-to-earth it felt. Many conversations started with the person that I sat beside during a session, which doesn't happen often at conferences. People had a range of experience with Machine Learning, but everyone came from a genuine place to learn more about the topic. It's quite different from other conferences that I've attended where the marketing glamour is cranked up - leading to a mass-produced approach, hit-or-miss content, and participants feeling more distant. 

How Amazon Uses Machine Learning to Drive the Customer Experience

I read all 22 Amazon Shareholder letters. 

I wanted to understand how Amazon used machine learning to drive the customer experience. Here's what I learned.

Amazon is a Platform-as-a-Service (PaaS) company that just happens to be a retailer

Jeff Bezos has been consistent about Amazon's goal since the beginning - deliver an amazing customer experience. That means providing vast selection, fast convenience, and price reductions. Amazon has invested in building massive platforms, whether it's fulfillment or cloud computing centers, to support the pillars of selection and convenience. The last pillar, price reductions, is a result of the efficiencies from scaling. 

Here's their playbook: (a) create a platform for their own business needs, (b) formalize this platform into an ecosystem by opening it up to 3rd parties, (c) refine this platform into an easy-to-use, self-service option with an interface. Lather, rinse, repeat.

Mythbusters: Were Overzealous Algorithms Responsible for Slow Sales at Loblaw Companies Ltd?

Well, this is new.

Retailers have blamed bad weather for poor sales before, but I don't think I've ever seen a retailer blame bad algorithms.

Loblaw Companies Ltd, Canada's largest grocery and pharmacy chain, which owns the mainstream brand Loblaws and discount brand No Frills, had a soft Q2 performance with same-store revenue growing 0.6%. Their President, Sarah Davis, blamed the performance on algorithms that prioritized increasing profit margins instead of promotional pricing to attract foot traffic. That is, Loblaw chose to increase the revenue from each customer instead of focusing on increasing the number of customers. She says:

We know exactly what we did and what we did was we focused on going for margin improvements...And in the excitement of seeing margin improvements in certain categories as we started to implement some of the algorithms, people were overzealous...You end up with fewer items on promotion in your flyer. 

The One Skill That Data Scientists Are Not Being Trained On

After attending the Toronto Machine Learning Micro-Summit this past week, I noticed one theme come up repeatedly during the presentations - communicate with the business team early, and often, or you'll need to go back and redo your work.

There was the story of an insurance company that created a model to recommend whether to replace or fix a car after a damage claim. It sounded great - the Data Scientists got a prototype up and running and had business team buy-in. But the problem was that their models weren't very accurate. Usually when that happens, it means your data is noisy or the algorithm isn't powerful enough. They went back to their business team, and it turned out they had missed 2 key features: the age of the vehicle and whether it's a luxury model.

Another example was a telecom that built a model to optimize call center efficiency. The data science team spent a month building the model and everyone was excited to get it in production. Then, they were told that the call center runs on an outdated application. It turns out that integrating with the application would cost more than the ROI of the project.

Are Data Scientists Actually Surveillance Scientists? - Part 1

There was of course no way of knowing whether you were being watched at any given moment. How often, or on what system, the Thought Police plugged in on any individual wire was guesswork. It was even conceivable that they watched everybody all the time. But at any rate they could plug in your wire whenever they wanted to. You had to live—did live, from habit that became instinct—in the assumption that every sound you made was overheard, and, except in darkness, every movement scrutinized. -1984, George Orwell

Last summer I had a conversation with an acquaintance who had recently visited China. There was discussion about China's Social Credit System (SCS) and its impact on people's daily lives. The Social Credit System assigns scores to citizens based on their reputation, and that score can impact someone's ability to be outside in the evening, their eligibility to book a travel ticket, or their suitability for a loan. It's similar to the Credit Score that North Americans are more familiar with, but more encompassing, as the SCS takes non-financial data into account. My acquaintance said that the initial feedback was positive - her friends and family felt safer walking the streets at night knowing that people deemed dangerous wouldn't be allowed outside.

Lessons from University of Toronto's Data Science Statistics Course (3251)

I recently completed a Statistics for Data Science course at the University of Toronto for Continuing Studies and I wanted to share my reflections about the experience. 

Overall, the course was mostly interesting, some parts boring, and always challenging. 

I should begin by admitting that I managed to avoid any heavy math or statistics classes in university even though I completed science and business degrees. I certainly felt behind when math notation and equations started popping up in class. But, where there's a will, there's a way (mostly).

Most lectures were challenging to follow because of the pace of learning. I even tried to read ahead for some lectures to be more prepared, but that only marginally helped. As a result, I often sat in class with material that was way over my head, questioning whether spending that weekday night on campus was a better use of my time than self-studying at home. A younger me might have panicked and wondered what I had gotten myself into, or worried that I would be outed as a fake. And I don't think I was the only one, as I saw classmates drop out in the initial few weeks.