Day 1 — Interactive Workshop

Understanding Large Language Models

From Classical ML to Transformers — and everything in between

Rikhil Nellimarla February 2026
Use the arrow keys or scroll to navigate

The Evolution

From rule-based systems to language models that write code

1950s–80s

Symbolic AI

Rule-based expert systems. Hand-coded if-then rules. Worked in narrow domains but couldn't generalize.

1990s–2000s

Classical ML

SVMs, Random Forests, Logistic Regression. The era of feature engineering — humans hand-crafting 100s of input features.

2012

Deep Learning

AlexNet crushes ImageNet. Neural networks learn their own features. GPUs make it possible.

2013–16

Word Embeddings & Seq2Seq

Word2Vec, GloVe — words become vectors. RNNs and LSTMs tackle sequences but struggle with long-range dependencies.

2017

Attention Is All You Need

The Transformer paper. Self-attention replaces recurrence entirely. Parallel training. Constant path length between any two tokens.

2020–Now

Large Language Models

GPT-3, GPT-4, Gemini, Claude. Billions of parameters. Emergent capabilities. The age of prompt engineering.

The Feature Engineering Problem

Classical ML required humans to decide what matters

🔧 Manual Features

word_count
avg_word_length
has_exclamation
uppercase_ratio
punctuation_count
sentence_length
stopword_ratio
noun_count
verb_count
adjective_ratio
... 90 more features

A human had to dream up each of these, test them, and hope they captured the right signal. For every new domain — start over.

🧠 Learned Features

A neural network learns what features matter directly from the data. The more data, the better the features. No domain expert required.
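A minimal sketch of what hand-crafted feature extraction looked like, using a few of the features from the list above (the function name and the exact formulas are illustrative, not from any specific system):

```python
import string

def extract_features(text: str) -> dict:
    """Hand-crafted features for a classical text classifier: a human chose each one."""
    words = text.split()
    return {
        "word_count": len(words),
        "avg_word_length": sum(len(w) for w in words) / max(len(words), 1),
        "has_exclamation": int("!" in text),
        "uppercase_ratio": sum(c.isupper() for c in text) / max(len(text), 1),
        "punctuation_count": sum(c in string.punctuation for c in text),
    }

print(extract_features("Stop everything and BUY NOW!"))
# These numbers, not the text itself, are what an SVM or logistic regression would see.
```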

Deep Learning

Stacking layers to learn hierarchical representations

Input Layer

Raw data goes in — pixels, words, numbers

Hidden Layers

Each layer learns increasingly abstract features. Edges → shapes → objects.

Output Layer

Final prediction — classification, regression, next token
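A minimal numpy sketch of the input → hidden → output stacking described above; the layer sizes and the ReLU/softmax choices are illustrative, not tied to any particular model:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0, x)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Input layer: raw data goes in (here, 4 numbers standing in for pixels or word features)
x = rng.normal(size=4)

# Hidden layers: each learns a more abstract representation of its input
W1, b1 = rng.normal(size=(8, 4)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 8)), np.zeros(8)
h1 = relu(W1 @ x + b1)
h2 = relu(W2 @ h1 + b2)

# Output layer: final prediction (here, probabilities over 3 classes)
W3, b3 = rng.normal(size=(3, 8)), np.zeros(3)
print(softmax(W3 @ h2 + b3))
```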

Gradient Descent

How neural networks learn — rolling down the loss landscape

Interactive demo: tracks the current loss and the number of steps taken.
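A minimal sketch of gradient descent on a one-parameter toy loss, (w - 3)²; the learning rate and step count are arbitrary choices for illustration:

```python
# Minimize loss(w) = (w - 3)**2 by repeatedly stepping downhill along the gradient.
w = 0.0
learning_rate = 0.1

for step in range(25):
    loss = (w - 3) ** 2
    grad = 2 * (w - 3)            # d(loss)/dw
    w -= learning_rate * grad     # step in the direction that lowers the loss
    if step % 5 == 0:
        print(f"step {step:2d}  w={w:.3f}  loss={loss:.4f}")

# w converges toward 3, the bottom of this loss landscape.
```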

Backpropagation

The chain rule in action — gradients flow backward to update weights

Click Forward Pass to send data through the network, then Backward Pass to see gradients flow back and update weights.
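A minimal sketch of one forward pass and one backward pass through a single hidden layer, with gradients derived by hand via the chain rule; the layer sizes and the squared-error loss are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)                 # one training example
y = 1.0                                # its target

W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=4)

# Forward pass: data flows through the network
h = np.tanh(W1 @ x)                    # hidden activations
y_hat = W2 @ h                         # prediction
loss = 0.5 * (y_hat - y) ** 2

# Backward pass: the chain rule, applied layer by layer from the loss back to the weights
d_yhat = y_hat - y                     # dL/dy_hat
grad_W2 = d_yhat * h                   # dL/dW2 = dL/dy_hat * dy_hat/dW2
d_h = d_yhat * W2                      # dL/dh
d_pre = d_h * (1 - h ** 2)             # through tanh: dL/d(W1 @ x)
grad_W1 = np.outer(d_pre, x)           # dL/dW1

# Update weights: one step of gradient descent
lr = 0.1
W1 -= lr * grad_W1
W2 -= lr * grad_W2
print(f"loss = {loss:.4f}")
```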

Why Language Is Hard

The same word means different things. Order matters. Context is everything.

The word "run"

Physical

"I decided to go for a run in the park"

Business

"She runs a successful business"

Software

"I need to run a few tests"

Bag-of-Words vs Context

❌ Bag-of-Words

{"the": 2, "run": 1, "park": 1, ...}

Loses all word order. "Dog bites man" = "Man bites dog"

✅ Contextual Embeddings

Each word gets a different vector based on surrounding words

This is what attention gives us
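A quick illustration of the bag-of-words claim above: both sentences collapse to the same word counts, so a model that sees only counts cannot tell them apart. This sketch uses the standard library's Counter:

```python
from collections import Counter

def bag_of_words(sentence: str) -> Counter:
    # lowercase, split on whitespace, count occurrences: all order is discarded
    return Counter(sentence.lower().split())

a = bag_of_words("Dog bites man")
b = bag_of_words("Man bites dog")
print(a)         # Counter({'dog': 1, 'bites': 1, 'man': 1})
print(a == b)    # True: word order is gone, so the two sentences look identical
```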

What an LLM Is and Isn't

Myth

"LLMs understand language"

Reality

LLMs are next-token prediction engines. They find statistical patterns in text. Understanding is debatable.

Myth

"LLMs have a database of facts"

Reality

Knowledge is compressed into weights during training. There's no lookup table — this is why they hallucinate.

Myth

"More parameters = smarter"

Reality

Data quality, training objective, and architecture matter more. A well-trained 7B model can beat a poorly trained 70B model.

Myth

"LLMs are just autocomplete"

Reality

Technically yes — but at scale, next-token prediction produces emergent capabilities: reasoning, code generation, translation. The mechanism is simple; the behavior is complex.

Myth

"LLMs will replace developers"

Reality

LLMs are tools that amplify developers. They can't architect systems, debug production issues, or understand business constraints. They accelerate the development loop.

Myth

"You need a PhD to use LLMs"

Reality

Using LLMs requires understanding their behavior patterns, not their math. Prompt engineering is about communication, not calculus.

👆 Click any card to flip

Tokenization

LLMs don't see words — they see tokens. Subword pieces that balance vocabulary size and coverage.

Interactive demo: type text to see its character count, approximate token count, and characters-per-token ratio.

GPT-3 Vocabulary

~50,257 tokens, learned via BPE (Byte-Pair Encoding). Not words, but subword pieces. GPT-4's tokenizer (cl100k_base) grows this to roughly 100k.

Why Subwords?

"unhappiness" → ["un", "happiness"]. Handles rare words without a massive vocabulary.

~4 chars ≈ 1 token

Rule of thumb for English. Code and other languages may differ significantly.
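A short sketch of tokenization in practice. It assumes the tiktoken package is installed; cl100k_base is the GPT-4-era encoding, and the example sentence is arbitrary:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

text = "Tokenization turns text into subword pieces."
token_ids = enc.encode(text)
pieces = [enc.decode([t]) for t in token_ids]

print(token_ids)
print(pieces)    # subword pieces, not whole words
print(f"{len(text)} chars / {len(token_ids)} tokens "
      f"= {len(text) / len(token_ids):.1f} chars per token")  # roughly the ~4:1 rule of thumb
```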

Embeddings

Turning tokens into vectors with meaning — where geometry meets semantics

E(King) - E(Man) + E(Woman) ≈ E(Queen)

Gender as a direction in vector space

E(Paris) - E(France) + E(Japan) ≈ E(Tokyo)

Country-capital relationships encoded geometrically

E(cats) - E(cat) = plural direction

Grammatical features emerge as learned dimensions
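A toy numpy sketch of the vector arithmetic above. The 4-dimensional vectors are hand-built so the analogy works exactly; real embeddings such as word2vec or GloVe live in hundreds of dimensions and satisfy it only approximately:

```python
import numpy as np

# hand-built toy embeddings: dimensions loosely mean [royalty, gender, person, plural]
E = {
    "king":  np.array([1.0,  1.0, 1.0, 0.0]),
    "queen": np.array([1.0, -1.0, 1.0, 0.0]),
    "man":   np.array([0.0,  1.0, 1.0, 0.0]),
    "woman": np.array([0.0, -1.0, 1.0, 0.0]),
}

def nearest(v):
    """Return the word whose embedding has the highest cosine similarity to v."""
    cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(E, key=lambda w: cos(E[w], v))

analogy = E["king"] - E["man"] + E["woman"]
print(nearest(analogy))   # queen: the "gender" direction was subtracted and re-added
```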

Attention

The core innovation — letting every token talk to every other token

Attention(Q, K, V) = softmax(QKᵀ / √dₖ) V
Q (Query): "What am I looking for?"
K (Key): "What do I have to offer?"
V (Value): "Here's my actual information"
√dₖ: scales the dot products so softmax doesn't saturate, which would shrink gradients toward zero
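A minimal numpy implementation of the scaled dot-product attention formula above (single head, no masking); the sequence length and dimensions are arbitrary:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (seq, seq): each query scored against every key
    weights = softmax(scores, axis=-1)   # each row sums to 1: how much each token attends to the others
    return weights @ V, weights          # weighted mix of values, plus the attention map

# toy example: 4 tokens, d_k = 8
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape, w.shape)  # (4, 8) (4, 4)
```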

The Transformer

The architecture behind every modern LLM

Input Embedding

Token → Vector (d=512)

+

Positional Encoding

sin/cos at different frequencies

× N Layers

Multi-Head Self-Attention
Add & Norm
Feed-Forward Network
Add & Norm

Unembedding

Vector → Logits → Softmax → Next Token Probability

GPT-3 by the numbers:
96 Attention Heads per layer
96 Transformer Blocks
175B Parameters
2,048-token Context Window
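A minimal sketch of the sinusoidal positional encoding mentioned in the diagram, following the sin/cos construction from the original Transformer paper; seq_len and d_model here just echo the GPT-3 and original-paper numbers above:

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(...)."""
    pos = np.arange(seq_len)[:, None]          # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]      # (1, d_model/2): even dimension indices
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)               # even dims get sine
    pe[:, 1::2] = np.cos(angles)               # odd dims get cosine
    return pe

pe = positional_encoding(seq_len=2048, d_model=512)
print(pe.shape)   # (2048, 512), added to the token embeddings before the first layer
```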

How LLMs Are Trained

Next-token prediction at massive scale

The cat sat on the

1. Pre-training

Predict the next token on trillions of tokens from the internet. The model learns language structure, facts, reasoning patterns.

Dataset: Common Crawl, books, Wikipedia, code

2. Fine-tuning (SFT)

Train on curated instruction-response pairs. The model learns to follow instructions rather than just complete text.

Dataset: Human-written Q&A pairs

3. RLHF

Reinforcement Learning from Human Feedback. Humans rank outputs; a reward model teaches the LLM which responses humans prefer.

Signal: Human preference rankings
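A minimal sketch of the pre-training objective itself: cross-entropy loss for next-token prediction over a shifted sequence. The tiny vocabulary and random logits stand in for a real model and are purely illustrative:

```python
import numpy as np

def next_token_loss(logits, token_ids):
    """Cross-entropy for next-token prediction.

    logits:    (seq_len, vocab_size) model outputs, one row per position
    token_ids: (seq_len,) the actual token sequence
    Position t is trained to predict token t+1, so the last position has no target.
    """
    targets = token_ids[1:]            # what each position should predict
    logits = logits[:-1]               # drop the final position (no next token to predict)
    # log-softmax for numerical stability
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    # average negative log-likelihood of the correct next token
    return -log_probs[np.arange(len(targets)), targets].mean()

# toy example: vocab of 10, a 6-token "sentence", random logits standing in for a model
rng = np.random.default_rng(0)
token_ids = np.array([3, 1, 4, 1, 5, 9])
logits = rng.normal(size=(6, 10))
print(next_token_loss(logits, token_ids))
```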

Prompting Playground

Live prompting techniques — edit, run, and observe

Click "Run Prompt" to see the response here...

Recap & Resources

📐 What We Covered

  • AI → ML → DL → NLP → LLMs evolution
  • Feature engineering → learned representations
  • Gradient descent & backpropagation
  • Tokenization, embeddings, attention
  • Transformer architecture
  • Pre-training, SFT, RLHF
  • Prompting techniques (zero-shot → CoT)

📅 Tomorrow — Day 2

  • Advanced prompting strategies
  • Tool/function calling
  • Retrieval Augmented Generation (RAG)
  • Live agent building