Day 1 — Interactive Workshop

Understanding Large Language Models

From Classical ML to Transformers — and everything in between

Rikhil Nellimarla February 2026
Use the arrow keys or scroll to navigate

The Evolution

From rule-based systems to language models that write code

1950s–80s

Symbolic AI

Rule-based expert systems. Hand-coded if-then rules. Worked in narrow domains but couldn't generalize.

1990s–2000s

Classical ML

SVMs, Random Forests, Logistic Regression. The era of feature engineering — humans hand-crafting 100s of input features.

2012

Deep Learning

AlexNet crushes ImageNet. Neural networks learn their own features. GPUs make it possible.

2013–16

Word Embeddings & Seq2Seq

Word2Vec, GloVe — words become vectors. RNNs and LSTMs tackle sequences but struggle with long-range dependencies.

2017

Attention Is All You Need

The Transformer paper. Self-attention replaces recurrence entirely. Parallel training. Constant path length between any two tokens.

2020–Now

Large Language Models

GPT-3, GPT-4, Gemini, Claude. Billions of parameters. Emergent capabilities. The age of prompt engineering.

The Feature Engineering Problem

Classical ML required humans to decide what matters

🔧 Manual Features

word_count
avg_word_length
has_exclamation
uppercase_ratio
punctuation_count
sentence_length
stopword_ratio
noun_count
verb_count
adjective_ratio
... 90 more features

A human had to dream up each of these, test them, and hope they captured the right signal. For every new domain — start over.

🧠 Learned Features

A neural network learns what features matter directly from the data. The more data, the better the features. No domain expert required.
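A minimal sketch of what hand-crafted feature extraction looked like, using a few of the features from the list above (the function name and the exact formulas are illustrative, not from any specific system):

```python
import string

def extract_features(text: str) -> dict:
    """Hand-crafted features for a classical text classifier: a human chose each one."""
    words = text.split()
    return {
        "word_count": len(words),
        "avg_word_length": sum(len(w) for w in words) / max(len(words), 1),
        "has_exclamation": int("!" in text),
        "uppercase_ratio": sum(c.isupper() for c in text) / max(len(text), 1),
        "punctuation_count": sum(c in string.punctuation for c in text),
    }

print(extract_features("Stop everything and BUY NOW!"))
# These numbers, not the text itself, are what an SVM or logistic regression would see.
```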

Deep Learning

Stacking layers to learn hierarchical representations

Input Layer

Raw data goes in — pixels, words, numbers

Hidden Layers

Each layer learns increasingly abstract features. Edges → shapes → objects.

Output Layer

Final prediction — classification, regression, next token
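A minimal numpy sketch of the input → hidden → output stacking described above; the layer sizes and the ReLU/softmax choices are illustrative, not tied to any particular model:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0, x)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Input layer: raw data goes in (here, 4 numbers standing in for pixels or word features)
x = rng.normal(size=4)

# Hidden layers: each learns a more abstract representation of its input
W1, b1 = rng.normal(size=(8, 4)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 8)), np.zeros(8)
h1 = relu(W1 @ x + b1)
h2 = relu(W2 @ h1 + b2)

# Output layer: final prediction (here, probabilities over 3 classes)
W3, b3 = rng.normal(size=(3, 8)), np.zeros(3)
print(softmax(W3 @ h2 + b3))
```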

Gradient Descent

How neural networks learn — rolling down the loss landscape

Interactive demo: tracks the current loss and the number of steps taken.
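A minimal sketch of gradient descent on a one-parameter toy loss, (w - 3)²; the learning rate and step count are arbitrary choices for illustration:

```python
# Minimize loss(w) = (w - 3)**2 by repeatedly stepping downhill along the gradient.
w = 0.0
learning_rate = 0.1

for step in range(25):
    loss = (w - 3) ** 2
    grad = 2 * (w - 3)            # d(loss)/dw
    w -= learning_rate * grad     # step in the direction that lowers the loss
    if step % 5 == 0:
        print(f"step {step:2d}  w={w:.3f}  loss={loss:.4f}")

# w converges toward 3, the bottom of this loss landscape.
```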

Backpropagation

The chain rule in action — gradients flow backward to update weights

Click Forward Pass to send data through the network, then Backward Pass to see gradients flow back and update weights.
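A minimal sketch of one forward pass and one backward pass through a single hidden layer, with gradients derived by hand via the chain rule; the layer sizes and the squared-error loss are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)                 # one training example
y = 1.0                                # its target

W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=4)

# Forward pass: data flows through the network
h = np.tanh(W1 @ x)                    # hidden activations
y_hat = W2 @ h                         # prediction
loss = 0.5 * (y_hat - y) ** 2

# Backward pass: the chain rule, applied layer by layer from the loss back to the weights
d_yhat = y_hat - y                     # dL/dy_hat
grad_W2 = d_yhat * h                   # dL/dW2 = dL/dy_hat * dy_hat/dW2
d_h = d_yhat * W2                      # dL/dh
d_pre = d_h * (1 - h ** 2)             # through tanh: dL/d(W1 @ x)
grad_W1 = np.outer(d_pre, x)           # dL/dW1

# Update weights: one step of gradient descent
lr = 0.1
W1 -= lr * grad_W1
W2 -= lr * grad_W2
print(f"loss = {loss:.4f}")
```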

Why Language Is Hard

The same word means different things. Order matters. Context is everything.

The word "run"

Physical

"I decided to go for a run in the park"

Business

"She runs a successful business"

Software

"I need to run a few tests"

Bag-of-Words vs Context

❌ Bag-of-Words

{"the": 2, "run": 1, "park": 1, ...}

Loses all word order. "Dog bites man" = "Man bites dog"

✅ Contextual Embeddings

Each word gets a different vector based on surrounding words

This is what attention gives us
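A quick illustration of the bag-of-words claim above: both sentences collapse to the same word counts, so a model that sees only counts cannot tell them apart. This sketch uses the standard library's Counter:

```python
from collections import Counter

def bag_of_words(sentence: str) -> Counter:
    # lowercase, split on whitespace, count occurrences: all order is discarded
    return Counter(sentence.lower().split())

a = bag_of_words("Dog bites man")
b = bag_of_words("Man bites dog")
print(a)         # Counter({'dog': 1, 'bites': 1, 'man': 1})
print(a == b)    # True: word order is gone, so the two sentences look identical
```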

What an LLM Is and Isn't

Myth

"LLMs understand language"

Reality

LLMs are next-token prediction engines. They find statistical patterns in text. Understanding is debatable.

Myth

"LLMs have a database of facts"

Reality

Knowledge is compressed into weights during training. There's no lookup table — this is why they hallucinate.

Myth

"More parameters = smarter"

Reality

Data quality, training objective, and architecture matter more. A well-trained 7B model can beat a poorly trained 70B model.

Myth

"LLMs are just autocomplete"

Reality

Technically yes — but at scale, next-token prediction produces emergent capabilities: reasoning, code generation, translation. The mechanism is simple; the behavior is complex.

Myth

"LLMs will replace developers"

Reality

LLMs are tools that amplify developers. They can't architect systems, debug production issues, or understand business constraints. They accelerate the development loop.

Myth

"You need a PhD to use LLMs"

Reality

Using LLMs requires understanding their behavior patterns, not their math. Prompt engineering is about communication, not calculus.

👆 Click any card to flip

Tokenization

LLMs don't see words — they see tokens. Subword pieces that balance vocabulary size and coverage.

Interactive demo: type text to see its character count, approximate token count, and characters-per-token ratio.

GPT-3 Vocabulary

~50,257 tokens, learned via BPE (Byte-Pair Encoding). Not words, but subword pieces. GPT-4's tokenizer (cl100k_base) grows this to roughly 100k.

Why Subwords?

"unhappiness" → ["un", "happiness"]. Handles rare words without a massive vocabulary.

~4 chars ≈ 1 token

Rule of thumb for English. Code and other languages may differ significantly.
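A short sketch of tokenization in practice. It assumes the tiktoken package is installed; cl100k_base is the GPT-4-era encoding, and the example sentence is arbitrary:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

text = "Tokenization turns text into subword pieces."
token_ids = enc.encode(text)
pieces = [enc.decode([t]) for t in token_ids]

print(token_ids)
print(pieces)    # subword pieces, not whole words
print(f"{len(text)} chars / {len(token_ids)} tokens "
      f"= {len(text) / len(token_ids):.1f} chars per token")  # roughly the ~4:1 rule of thumb
```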

Embeddings

Turning tokens into vectors with meaning — where geometry meets semantics

E(King) - E(Man) + E(Woman) ≈ E(Queen)

Gender as a direction in vector space

E(Paris) - E(France) + E(Japan) ≈ E(Tokyo)

Country-capital relationships encoded geometrically

E(cats) - E(cat) = plural direction

Grammatical features emerge as learned dimensions
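A toy numpy sketch of the vector arithmetic above. The 4-dimensional vectors are hand-built so the analogy works exactly; real embeddings such as word2vec or GloVe live in hundreds of dimensions and satisfy it only approximately:

```python
import numpy as np

# hand-built toy embeddings: dimensions loosely mean [royalty, gender, person, plural]
E = {
    "king":  np.array([1.0,  1.0, 1.0, 0.0]),
    "queen": np.array([1.0, -1.0, 1.0, 0.0]),
    "man":   np.array([0.0,  1.0, 1.0, 0.0]),
    "woman": np.array([0.0, -1.0, 1.0, 0.0]),
}

def nearest(v):
    """Return the word whose embedding has the highest cosine similarity to v."""
    cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(E, key=lambda w: cos(E[w], v))

analogy = E["king"] - E["man"] + E["woman"]
print(nearest(analogy))   # queen: the "gender" direction was subtracted and re-added
```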

Attention

The core innovation — letting every token talk to every other token

Attention(Q, K, V) = softmax(QKᵀ / √dₖ) V
Q (Query): "What am I looking for?"
K (Key): "What do I have to offer?"
V (Value): "Here's my actual information"
√dₖ: scales the dot products so softmax doesn't saturate, which would shrink gradients toward zero
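A minimal numpy implementation of the scaled dot-product attention formula above (single head, no masking); the sequence length and dimensions are arbitrary:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (seq, seq): each query scored against every key
    weights = softmax(scores, axis=-1)   # each row sums to 1: how much each token attends to the others
    return weights @ V, weights          # weighted mix of values, plus the attention map

# toy example: 4 tokens, d_k = 8
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape, w.shape)  # (4, 8) (4, 4)
```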

The Transformer

The architecture behind every modern LLM

Input Embedding

Token → Vector (d=512)

+

Positional Encoding

sin/cos at different frequencies

× N Layers

Multi-Head Self-Attention
Add & Norm
Feed-Forward Network
Add & Norm

Unembedding

Vector → Logits → Softmax → Next Token Probability

GPT-3 by the numbers:
96 Attention Heads per layer
96 Transformer Blocks
175B Parameters
2,048-token Context Window
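A minimal sketch of the sinusoidal positional encoding mentioned in the diagram, following the sin/cos construction from the original Transformer paper; seq_len and d_model here just echo the GPT-3 and original-paper numbers above:

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(...)."""
    pos = np.arange(seq_len)[:, None]          # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]      # (1, d_model/2): even dimension indices
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)               # even dims get sine
    pe[:, 1::2] = np.cos(angles)               # odd dims get cosine
    return pe

pe = positional_encoding(seq_len=2048, d_model=512)
print(pe.shape)   # (2048, 512), added to the token embeddings before the first layer
```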

How LLMs Are Trained

Next-token prediction at massive scale

The cat sat on the

1. Pre-training

Predict the next token on trillions of tokens from the internet. The model learns language structure, facts, reasoning patterns.

Dataset: Common Crawl, books, Wikipedia, code

2. Fine-tuning (SFT)

Train on curated instruction-response pairs. The model learns to follow instructions rather than just complete text.

Dataset: Human-written Q&A pairs

3. RLHF

Reinforcement Learning from Human Feedback. Humans rank outputs; a reward model teaches the LLM which responses humans prefer.

Signal: Human preference rankings
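A minimal sketch of the pre-training objective itself: cross-entropy loss for next-token prediction over a shifted sequence. The tiny vocabulary and random logits stand in for a real model and are purely illustrative:

```python
import numpy as np

def next_token_loss(logits, token_ids):
    """Cross-entropy for next-token prediction.

    logits:    (seq_len, vocab_size) model outputs, one row per position
    token_ids: (seq_len,) the actual token sequence
    Position t is trained to predict token t+1, so the last position has no target.
    """
    targets = token_ids[1:]            # what each position should predict
    logits = logits[:-1]               # drop the final position (no next token to predict)
    # log-softmax for numerical stability
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    # average negative log-likelihood of the correct next token
    return -log_probs[np.arange(len(targets)), targets].mean()

# toy example: vocab of 10, a 6-token "sentence", random logits standing in for a model
rng = np.random.default_rng(0)
token_ids = np.array([3, 1, 4, 1, 5, 9])
logits = rng.normal(size=(6, 10))
print(next_token_loss(logits, token_ids))
```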

Prompting Playground

Live prompting techniques — edit, run, and observe

Click "Run Prompt" to see the response here...

Recap & Resources

📐 What We Covered

  • AI → ML → DL → NLP → LLMs evolution
  • Feature engineering → learned representations
  • Gradient descent & backpropagation
  • Tokenization, embeddings, attention
  • Transformer architecture
  • Pre-training, SFT, RLHF
  • Prompting techniques (zero-shot → CoT)

📅 Tomorrow — Day 2

  • Advanced prompting strategies
  • Tool/function calling
  • Retrieval Augmented Generation (RAG)
  • Live agent building