import torch
torch.manual_seed(42)
print(torch.__version__)2.5.1
Practice companion for Chapter 01: Tensor Basics.
This notebook focuses on the mechanics that show up everywhere in ML code:
The problems start basic and gradually become more like the shape puzzles you see inside real model implementations.
For each problem:
The goal is not to memorize PyTorch APIs.
The goal is to build shape intuition.
These are warmups. Do not skip them. Most tensor bugs are shape bugs.
Given:
Predict the shape of a, b, c, d, and e.
Then verify using code.
# Problem 1 — your attempt
x = torch.randn(2, 3, 4)
a = x[0]
b = x[:, 1]
c = x[:, :, 2]
d = x[None]
e = x.unsqueeze(1)
# Fill these in before printing:
predicted_shapes = {
"a": None,
"b": None,
"c": None,
"d": None,
"e": None,
}
actual_shapes = {
"a": tuple(a.shape),
"b": tuple(b.shape),
"c": tuple(c.shape),
"d": tuple(d.shape),
"e": tuple(e.shape),
}
actual_shapesIndexing with an integer removes that dimension.
Slicing with : keeps the dimension.
None / unsqueeze adds a new dimension.
# Solution 1
x = torch.randn(2, 3, 4)
a = x[0] # choose first item from dim 0 -> (3, 4)
b = x[:, 1] # keep dim 0, choose index from dim 1 -> (2, 4)
c = x[:, :, 2] # keep dims 0 and 1, choose index from dim 2 -> (2, 3)
d = x[None] # add new leading dim -> (1, 2, 3, 4)
e = x.unsqueeze(1)# add dim at position 1 -> (2, 1, 3, 4)
assert a.shape == (3, 4)
assert b.shape == (2, 4)
assert c.shape == (2, 3)
assert d.shape == (1, 2, 3, 4)
assert e.shape == (2, 1, 3, 4)
print("All tests passed.")You have an LLM hidden-state tensor:
Create:
first_token: the first token embedding for every batch item.last_token: the last token embedding for every batch item.first_batch: all tokens for the first batch item.single_value: the scalar at batch 0, token 1, hidden dim 2.Use indexing only.
# Problem 2 — your attempt
batch, seq_len, hidden = 4, 6, 8
x = torch.randn(batch, seq_len, hidden)
first_token = None
last_token = None
first_batch = None
single_value = None
# Tests
# assert first_token.shape == (batch, hidden)
# assert last_token.shape == (batch, hidden)
# assert first_batch.shape == (seq_len, hidden)
# assert single_value.ndim == 0# Solution 2
batch, seq_len, hidden = 4, 6, 8
x = torch.randn(batch, seq_len, hidden)
first_token = x[:, 0, :]
last_token = x[:, -1, :]
first_batch = x[0, :, :]
single_value = x[0, 1, 2]
assert first_token.shape == (batch, hidden)
assert last_token.shape == (batch, hidden)
assert first_batch.shape == (seq_len, hidden)
assert single_value.ndim == 0
print("All tests passed.")Same tensor:
Extract the first token for every batch item, but keep the sequence dimension.
Expected shape:
This is common when you want the selected token to still broadcast against sequence-shaped tensors.
Using 0 drops the dimension.
Using 0:1 keeps the dimension.
These problems mimic common data and model preprocessing patterns.
Given:
Return only tokens at even positions:
Expected shape:
Vision Transformer-style tensors often prepend a class token:
Remove the first token and keep only patch tokens.
Expected shape:
Given:
Select the same token positions from every batch item.
Example:
Expected shape:
# Solution 6
batch, seq_len, hidden = 4, 8, 6
x = torch.randn(batch, seq_len, hidden)
positions = torch.tensor([0, 3, 5])
selected = x[:, positions, :]
assert selected.shape == (batch, len(positions), hidden)
assert torch.equal(selected[:, 0, :], x[:, 0, :])
assert torch.equal(selected[:, 1, :], x[:, 3, :])
assert torch.equal(selected[:, 2, :], x[:, 5, :])
print("All tests passed.")Broadcasting is one of the most important tensor ideas. It lets small tensors act like larger tensors without explicit copying.
Given:
Add the bias to every batch row.
Expected shape:
Given image tensor:
Add one bias value per channel.
Expected shape:
This is a classic broadcasting trap.
bias has shape (channels,), but the channel dimension in x is dimension 1.
So reshape bias to (1, channels, 1, 1).
# Solution 8
batch, channels, height, width = 2, 3, 4, 5
x = torch.randn(batch, channels, height, width)
bias = torch.randn(channels)
y = x + bias.view(1, channels, 1, 1)
assert y.shape == x.shape
assert torch.allclose(y[:, 0, :, :], x[:, 0, :, :] + bias[0])
assert torch.allclose(y[:, 1, :, :], x[:, 1, :, :] + bias[1])
print("All tests passed.")LLM-style hidden states:
Add pos_emb to every batch item.
Expected shape:
pos_emb broadcasts from (seq_len, hidden) to (batch, seq_len, hidden).
Given:
For each row independently, subtract that row’s mean.
Expected output:
Each row of y should have mean approximately zero.
Use keepdim=True so the row mean remains shape (batch, 1) and can broadcast back across hidden dimension.
Reductions collapse dimensions. The core skill is knowing which dimension disappears and when to keep it.
Given:
Compute the mean over the sequence dimension.
Expected shape:
Given:
Average over height and width.
Expected shape:
Given logits:
Find the most likely token id at every position.
Expected shape:
argmax(dim=-1) collapses the vocabulary dimension.
This is where shape thinking becomes real model implementation.
Given:
Reshape into:
This is the standard LLM attention-head layout.
First reshape hidden into (num_heads, head_dim), then move num_heads before seq_len.
Invert the previous operation.
Given:
Return:
where:
After permute, call .contiguous() before .view() because memory layout may not be contiguous.
Given image tensor:
Convert it into tokens:
This treats each spatial location as a token.
Move channels to the end, then flatten height and width.
Invert the previous operation.
Given:
Return:
These are simplified versions of patterns that appear in transformer implementations.
Create a causal attention mask for sequence length seq_len.
The mask should be shape:
And should contain:
True where a token is allowed to attendFalse where a token is not allowed to attendFor causal attention, position i can attend to positions j <= i.
torch.tril gives the lower triangle.
# Solution 18
seq_len = 6
mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
assert mask.shape == (seq_len, seq_len)
assert mask.dtype == torch.bool
assert mask[0, 0] == True
assert mask[0, 1] == False
assert mask[5, 0] == True
assert mask[5, 5] == True
print(mask.int())
print("All tests passed.")Given attention scores:
Apply a causal mask so future positions get a very negative value, e.g. -1e9.
Expected output shape:
The mask has shape (seq_len, seq_len).
It broadcasts over batch and heads.
# Solution 19
batch, num_heads, seq_len = 2, 3, 5
scores = torch.randn(batch, num_heads, seq_len, seq_len)
mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
masked_scores = scores.masked_fill(~mask, -1e9)
assert masked_scores.shape == scores.shape
assert masked_scores[0, 0, 0, 1] == -1e9
assert masked_scores[0, 0, 4, 0] != -1e9
print("All tests passed.")Autoregressive models often only need logits for the final position during generation.
Given:
Extract final-token logits.
Expected shape:
During autoregressive generation, key tensors may be stored as:
Append new_k along the sequence dimension.
Expected shape:
The sequence dimension is dimension 2.
# Solution 21
batch, num_heads, past_len, head_dim = 2, 4, 5, 8
past_k = torch.randn(batch, num_heads, past_len, head_dim)
new_k = torch.randn(batch, num_heads, 1, head_dim)
full_k = torch.cat([past_k, new_k], dim=2)
assert full_k.shape == (batch, num_heads, past_len + 1, head_dim)
assert torch.equal(full_k[:, :, :-1, :], past_k)
assert torch.equal(full_k[:, :, -1:, :], new_k)
print("All tests passed.")These are simplified patterns from image models, ViTs, U-Nets, and diffusion code.
Given:
Assume height and width are divisible by patch_size.
Convert image into patch tokens:
where:
# Problem 22 — your attempt
batch, channels, height, width = 2, 3, 8, 8
patch_size = 2
x = torch.randn(batch, channels, height, width)
patches = None
# num_patches = (height // patch_size) * (width // patch_size)
# patch_dim = channels * patch_size * patch_size
# assert patches.shape == (batch, num_patches, patch_dim)Use reshape to split height and width into grid dimensions and patch dimensions, then permute so patch-grid dimensions come before patch contents.
# Solution 22
batch, channels, height, width = 2, 3, 8, 8
patch_size = 2
x = torch.randn(batch, channels, height, width)
h_grid = height // patch_size
w_grid = width // patch_size
patches = x.view(batch, channels, h_grid, patch_size, w_grid, patch_size)
patches = patches.permute(0, 2, 4, 1, 3, 5).contiguous()
patches = patches.view(batch, h_grid * w_grid, channels * patch_size * patch_size)
num_patches = (height // patch_size) * (width // patch_size)
patch_dim = channels * patch_size * patch_size
assert patches.shape == (batch, num_patches, patch_dim)
print("All tests passed.")Invert the previous operation.
Given:
Return:
Use:
# Problem 23 — your attempt
batch, channels, height, width = 2, 3, 8, 8
patch_size = 2
h_grid = height // patch_size
w_grid = width // patch_size
num_patches = h_grid * w_grid
patch_dim = channels * patch_size * patch_size
patches = torch.randn(batch, num_patches, patch_dim)
x = None
# assert x.shape == (batch, channels, height, width)# Solution 23
batch, channels, height, width = 2, 3, 8, 8
patch_size = 2
h_grid = height // patch_size
w_grid = width // patch_size
num_patches = h_grid * w_grid
patch_dim = channels * patch_size * patch_size
patches = torch.randn(batch, num_patches, patch_dim)
x = patches.view(batch, h_grid, w_grid, channels, patch_size, patch_size)
x = x.permute(0, 3, 1, 4, 2, 5).contiguous()
x = x.view(batch, channels, height, width)
assert x.shape == (batch, channels, height, width)
print("All tests passed.")In diffusion-style models, a timestep embedding may be added to every spatial location.
Given:
Add t_emb to x.
Expected shape:
t_emb must become (batch, channels, 1, 1) to broadcast over height and width.
These problems are harder. They reward translating loops into tensor operations.
Given:
Compute:
Expected shape:
Do not use Python loops.
Given:
Compute the matrix:
Expected shape:
Do not use loops.
This is a matrix multiplication between x and x.T.
Given:
Find the indices of the k nearest neighbors of every row, excluding itself.
Expected shape:
Use squared Euclidean distance.
Hint: after computing distances, set diagonal to inf.
# Solution 27
n, d, k = 8, 3, 2
x = torch.randn(n, d)
dist = ((x[:, None, :] - x[None, :, :]) ** 2).sum(dim=-1)
dist.fill_diagonal_(float("inf"))
nearest_idx = dist.topk(k, largest=False).indices
assert nearest_idx.shape == (n, k)
# Make sure no row selects itself.
rows = torch.arange(n)[:, None]
assert not torch.any(nearest_idx == rows)
print("All tests passed.")Given:
Create one-hot vectors:
Avoid loops.
Use torch.nn.functional.one_hot.
# Solution 28
import torch.nn.functional as F
batch, seq_len, vocab_size = 3, 5, 10
ids = torch.randint(0, vocab_size, (batch, seq_len))
one_hot = F.one_hot(ids, num_classes=vocab_size).float()
assert one_hot.shape == (batch, seq_len, vocab_size)
assert torch.all(one_hot.sum(dim=-1) == 1)
print("All tests passed.")Given:
where mask == 1 means valid token and mask == 0 means padding.
Compute the mean hidden vector over valid tokens only.
Expected shape:
Avoid counting padded tokens in the denominator.
The mask must become (batch, seq_len, 1) so it can multiply hidden vectors.
# Solution 29
batch, seq_len, hidden = 3, 6, 4
x = torch.randn(batch, seq_len, hidden)
mask = torch.tensor([
[1, 1, 1, 0, 0, 0],
[1, 1, 0, 0, 0, 0],
[1, 1, 1, 1, 1, 0],
], dtype=torch.float32)
mask_expanded = mask[:, :, None]
summed = (x * mask_expanded).sum(dim=1)
counts = mask.sum(dim=1, keepdim=True).clamp(min=1)
pooled = summed / counts
assert pooled.shape == (batch, hidden)
# Manual check for first row:
assert torch.allclose(pooled[0], x[0, :3].mean(dim=0))
print("All tests passed.")You have token features:
Each token belongs to one segment. Compute sum of token features per segment.
Example:
Expected output:
This pattern appears in batching, graph neural networks, ragged tensors, and pooling.
index_add_ accumulates rows of x into rows of segment_sums according to segment_ids.
# Solution 30
num_tokens, hidden, num_segments = 7, 4, 3
x = torch.randn(num_tokens, hidden)
segment_ids = torch.tensor([0, 0, 1, 2, 2, 1, 0])
segment_sums = torch.zeros(num_segments, hidden)
segment_sums.index_add_(0, segment_ids, x)
assert segment_sums.shape == (num_segments, hidden)
assert torch.allclose(segment_sums[0], x[[0, 1, 6]].sum(dim=0))
assert torch.allclose(segment_sums[1], x[[2, 5]].sum(dim=0))
assert torch.allclose(segment_sums[2], x[[3, 4]].sum(dim=0))
print("All tests passed.")This combines several ideas from the notebook.
Given:
and projection matrices:
Do the following:
q, k, v.(batch, num_heads, seq_len, head_dim).This is not meant to be a production attention implementation.
It is a shape exercise.
# Problem 31 — your attempt
batch, seq_len, hidden = 2, 5, 12
num_heads = 3
head_dim = hidden // num_heads
x = torch.randn(batch, seq_len, hidden)
Wq = torch.randn(hidden, hidden)
Wk = torch.randn(hidden, hidden)
Wv = torch.randn(hidden, hidden)
out = None
# assert out.shape == (batch, seq_len, hidden)# Solution 31
import math
import torch.nn.functional as F
batch, seq_len, hidden = 2, 5, 12
num_heads = 3
head_dim = hidden // num_heads
x = torch.randn(batch, seq_len, hidden)
Wq = torch.randn(hidden, hidden)
Wk = torch.randn(hidden, hidden)
Wv = torch.randn(hidden, hidden)
# 1. Linear projections
q = x @ Wq
k = x @ Wk
v = x @ Wv
assert q.shape == (batch, seq_len, hidden)
assert k.shape == (batch, seq_len, hidden)
assert v.shape == (batch, seq_len, hidden)
# 2. Split heads
def split_heads(t):
t = t.view(batch, seq_len, num_heads, head_dim)
t = t.permute(0, 2, 1, 3).contiguous()
return t
q = split_heads(q)
k = split_heads(k)
v = split_heads(v)
assert q.shape == (batch, num_heads, seq_len, head_dim)
assert k.shape == (batch, num_heads, seq_len, head_dim)
assert v.shape == (batch, num_heads, seq_len, head_dim)
# 3. Attention scores
scores = q @ k.transpose(-2, -1)
scores = scores / math.sqrt(head_dim)
assert scores.shape == (batch, num_heads, seq_len, seq_len)
# 4. Causal mask
mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
scores = scores.masked_fill(~mask, -1e9)
# 5. Softmax
weights = F.softmax(scores, dim=-1)
assert weights.shape == (batch, num_heads, seq_len, seq_len)
# 6. Weighted sum over values
out_heads = weights @ v
assert out_heads.shape == (batch, num_heads, seq_len, head_dim)
# Merge heads
out = out_heads.permute(0, 2, 1, 3).contiguous()
out = out.view(batch, seq_len, hidden)
assert out.shape == (batch, seq_len, hidden)
print("All tests passed.")Use these to test whether you really understood the notebook.
x[:, 0, :] produce a different rank from x[:, 0:1, :]?(1, channels, 1, 1)?(batch, seq, heads, head_dim) to (batch, heads, seq, head_dim)?.contiguous() after permute() before calling .view()?keepdim=True useful in reductions?x[:, None, :] - x[None, :, :] create all pairs?02P — Matrix Multiplication Practice
Possible topics: