Chapter 1: Tensors, Shapes, and Views

The most important mental model

This series is not written for the “happy path” learner.

It is for the person who:

  • knows some math but still freezes at tensor shapes,
  • understands operations in isolation but gets lost in actual code,
  • does not want hand-wavy comfort,
  • wants a shape-first, debug-first, hacker-style understanding.

The thesis of this notebook is simple:

Most tensor confusion is not about algebra. It is about shape tracking.

If you build the habit of reading code as:

\[ [\text{shape of left}] \;\to\; \text{operation} \;\to\; [\text{shape of result}] \]

then a huge part of PyTorch becomes less magical and more mechanical.

We will stay practical, but we will not stay shallow.

Setup

We will use PyTorch, print shapes aggressively, and keep the examples small enough to inspect.

A recurring pattern in this notebook:

  1. state the shape,
  2. state the operation,
  3. state the resulting shape,
  4. then verify in code.

That habit matters more than memorizing functions.

import torch

torch.set_printoptions(sci_mode=False)
print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
PyTorch version: 2.5.1
CUDA available: False
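
# ANSI escape codes for bold terminal output, used by printbold below.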
START_BOLD, END_BOLD = "\033[1m", "\033[0m"

def printbold(text, val=""):
    print(START_BOLD + text + END_BOLD, val)

1. The real object is not the tensor. It is the tensor plus its shape

A tensor without its shape is half-known.

When you see:

x = torch.randn(2, 3, 4)

The important part is not the values. The important part is: \(x \in \mathbb{R}^{2 \times 3 \times 4}\)

That means:

  • the tensor has 3 dimensions,
  • the first dimension has size 2,
  • the second has size 3,
  • the third has size 4.

In code, always ask:

  • x.shape
  • x.ndim
  • x.numel()

These three tell you most of what you need.

x = torch.randn(2, 3, 4)

printbold("x.shape:", x.shape)
printbold("x.ndim :", x.ndim)
printbold("x.numel():", x.numel())
print(x)
x.shape: torch.Size([2, 3, 4])
x.ndim : 3
x.numel(): 24
tensor([[[-1.7615,  0.6696, -0.7249, -1.7323],
         [-0.0892, -1.5081, -0.3786,  1.5826],
         [ 0.2613, -0.6379,  0.1860, -0.2336]],

        [[-0.6420,  1.2414, -0.6550,  0.3160],
         [ 1.0707, -2.3500,  0.7318,  0.1067],
         [-1.2665, -0.3141,  1.4356, -0.1020]]])

2. Scalars, vectors, matrices, higher-order tensors

This naming is useful, but only if it helps you reason.

  • scalar: 0D tensor
  • vector: 1D tensor
  • matrix: 2D tensor
  • tensor: anything above that, though technically all of them are tensors

The most common early mistake is this:

thinking [3] means [1,3] or [3,1]

It does not.

\([3] \neq [1,3]\) and \([3] \neq [3,1]\)

Instead:

  • [3] means 1D tensor of length 3
  • [1,3] means 2D row-like matrix
  • [3,1] means 2D column-like matrix
scalar = torch.tensor(5.0)          # []
vector = torch.tensor([1.0, 2.0, 3.0]) # [3]
matrix = torch.tensor([[1.0, 2.0, 3.0]]) # [1, 3]
column = torch.tensor([[1.0], [2.0], [3.0]]) # [3, 1]

for name, t in [("scalar", scalar), ("vector", vector), ("matrix", matrix), ("column", column)]:
    print(f"{name:>6} -> shape={tuple(t.shape)}, ndim={t.ndim}")
    print(t)
    print("-" * 40)
scalar -> shape=(), ndim=0
tensor(5.)
----------------------------------------
vector -> shape=(3,), ndim=1
tensor([1., 2., 3.])
----------------------------------------
matrix -> shape=(1, 3), ndim=2
tensor([[1., 2., 3.]])
----------------------------------------
column -> shape=(3, 1), ndim=2
tensor([[1.],
        [2.],
        [3.]])
----------------------------------------

3. The shape-first reading habit

Do not read this:

C = A @ B

as “A times B”.

Read it as: \([m, n] @ [n, p] \to [m, p]\)

This is the habit that keeps you from guessing.

The inner dimensions must match: \([m, n] @ [n, p]\)

The result keeps the outer dimensions: \([m, p]\)

Matrix multiplication formula

If

\[ A \in \mathbb{R}^{m \times n}, \quad B \in \mathbb{R}^{n \times p} \]

then

\[ C = AB \in \mathbb{R}^{m \times p} \]

and each entry is:

\[ C_{ij} = \sum_{k=1}^{n} A_{ik} B_{kj} \]

That summation index \((k)\) is the contracted dimension.

A = torch.randn(2, 3)
B = torch.randn(3, 4)
C = A @ B

printbold("A.shape:", A.shape)
printbold("B.shape:", B.shape)
printbold("C.shape:", C.shape)
A.shape: torch.Size([2, 3])
B.shape: torch.Size([3, 4])
C.shape: torch.Size([2, 4])

4. The most useful distinction in practice: * vs @

A lot of confusion comes from mixing up these two.

Element-wise multiplication

\[ [a_{ij}] * [b_{ij}] = [a_{ij} b_{ij}] \]

Same position multiplied with same position.

Matrix multiplication

\[ C_{ij} = \sum_k A_{ik} B_{kj} \]

This is not entry-by-entry. It is row-by-column with a sum.

A = torch.tensor([[1., 2.],
                  [3., 4.]])
B = torch.tensor([[10., 20.],
                  [30., 40.]])

printbold("Element-wise A * B:")
print(A * B)
print()

printbold("Matrix multiply A @ B:")
print(A @ B)
Element-wise A * B:
tensor([[ 10.,  40.],
        [ 90., 160.]])

Matrix multiply A @ B:
tensor([[ 70., 100.],
        [150., 220.]])

5. Shape mechanics for 1D tensors in PyTorch

This is where PyTorch takes some convenience liberties.

Case A: vector on the left

\[ [n] @ [n, p] \to [p] \]

Internally, PyTorch behaves roughly like:

\[ [1, n] @ [n, p] \to [1, p] \to [p] \]

Case B: vector on the right

\[ [m, n] @ [n] \to [m] \]

Internally:

\[ [m, n] @ [n, 1] \to [m, 1] \to [m] \]

This is convenient in code, but less explicit than pure linear algebra notation.

a = torch.randn(4)
B = torch.randn(4, 6)

A = torch.randn(5, 4)
b = torch.randn(4)

left_result = a @ B
right_result = A @ b

printbold("a.shape:", a.shape)
printbold("B.shape:", B.shape)
printbold("a @ B shape:", left_result.shape)
print("", "")

printbold("A.shape:", A.shape)
printbold("b.shape:", b.shape)
printbold("A @ b shape:", right_result.shape)
a.shape: torch.Size([4])
B.shape: torch.Size([4, 6])
a @ B shape: torch.Size([6])

A.shape: torch.Size([5, 4])
b.shape: torch.Size([4])
A @ b shape: torch.Size([5])

6. Broadcasting: the rule that removes loops

Broadcasting is one of the most important concepts in PyTorch.

The rule is simple:

  • compare shapes from the right
  • dimensions are compatible if they are equal, or one of them is 1
  • missing dimensions are treated like leading 1s

Example

\([2,3] + [3]\)

Right-align the shapes:

\([2,3]\)

\([1,3]\)

Now apply the rule:

  • last dimension: \((3)\) vs \((3)\) → OK
  • second dimension: \((2)\) vs \((1)\) → stretch \((1 \rightarrow 2)\)

So the second tensor behaves like:

\([2,3]\)

Final result:

\([2,3]\)


Broadcasting is just implicit expansion along size-1 dimensions.

A = torch.tensor([[1., 2., 3.],
                  [4., 5., 6.]])   # [2, 3]
b = torch.tensor([10., 20., 30.])  # [3]

C = A + b

printbold("A.shape:", A.shape)
printbold("b.shape:", b.shape)
printbold("C.shape:", C.shape)
print()
printbold("Value of C is:")
print(C)
A.shape: torch.Size([2, 3])
b.shape: torch.Size([3])
C.shape: torch.Size([2, 3])

Value of C is:
tensor([[11., 22., 33.],
        [14., 25., 36.]])
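
Broadcasting also fails loudly when shapes cannot be reconciled. A minimal sketch: [2,3] + [2] right-aligns as 3 vs 2 in the last dimension, and since neither is 1, PyTorch raises instead of guessing.

A = torch.randn(2, 3)
bad = torch.randn(2)            # right-aligns as [1, 2]: last dims are 3 vs 2

try:
    A + bad
except RuntimeError as e:
    print("broadcast failed:", e)

# The fix is to state the intent: [2] -> [2, 1] broadcasts against [2, 3].
print((A + bad.unsqueeze(1)).shape)   # torch.Size([2, 3])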

7. Broadcasting is not copying in spirit. It is virtual expansion

Conceptually, PyTorch behaves as if a smaller tensor were expanded.

But you should think:

not “copy data everywhere”
but “treat it as repeatable along size-1 axes”

This matters because broadcasting is how tensor code stays compact and fast.
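
You can see the "virtual" part directly. As a sketch using expand (which returns a broadcast-style view rather than a copy), the stretched dimension gets stride 0, meaning every repeated row reads the same memory:

b = torch.tensor([10., 20., 30.])        # [3]
b_virtual = b.unsqueeze(0).expand(2, 3)  # [2, 3] view, no data copied

print(b_virtual)
print("strides:", b_virtual.stride())    # (0, 1): dim 0 repeats, it is not stored
print("shares storage:", b_virtual.data_ptr() == b.data_ptr())  # True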

Two-way axis broadcasting

This is one of the most important non-happy-path patterns.

Take:

\([3,5,1] * [3,1,7] \to [3,5,7]\)

Why?

  • first dimension: \((3)\) matches \((3)\)
  • second: \((5)\) with \((1)\) stretches to \((5)\)
  • third: \((1)\) with \((7)\) stretches to \((7)\)

You are not “removing” information.
You are letting each tensor contribute structure on different axes.

C = torch.randn(3, 5, 1)
D = torch.randn(3, 1, 7)
E = C * D

printbold("C.shape:", C.shape)
printbold("D.shape:", D.shape)
printbold("E.shape:", E.shape)
C.shape: torch.Size([3, 5, 1])
D.shape: torch.Size([3, 1, 7])
E.shape: torch.Size([3, 5, 7])

8. unsqueeze: how to make broadcasting intentional

A common debugging move is to explicitly insert a dimension of size 1.

If

\(a \in \mathbb{R}^{3}\)

then:

  • a.unsqueeze(0) gives shape [1,3]
  • a.unsqueeze(1) gives shape [3,1]

That is often the difference between “PyTorch error” and “exactly the structure I meant”.

a = torch.tensor([1., 2., 3.])

row = a.unsqueeze(0)
col = a.unsqueeze(1)

printbold("a.shape   :", a.shape)
printbold("row.shape :", row.shape)
printbold("col.shape :", col.shape)
print()
printbold("row:")
print(row)
print()
printbold("col:")
print(col)
a.shape   : torch.Size([3])
row.shape : torch.Size([1, 3])
col.shape : torch.Size([3, 1])

row:
tensor([[1., 2., 3.]])

col:
tensor([[1.],
        [2.],
        [3.]])

9. Outer-product thinking via broadcasting

The outer product is a perfect example of intentional broadcasting.

For vectors \((a \in \mathbb{R}^n)\) and \((b \in \mathbb{R}^m)\),

\(C_{ij} = a_i b_j\)

Shape:

\([n] \otimes [m] \to [n,m]\)

Using broadcasting:

  • turn \((a)\) into [n,1]
  • keep \((b)\) as [m], which right-aligns to [1,m]
  • result is [n,m]
a = torch.tensor([1., 2., 3.])
b = torch.tensor([4., 5.])

outer_via_broadcast = a.unsqueeze(1) * b
outer_direct = torch.outer(a, b)

printbold("broadcast version:")
print(outer_via_broadcast)
print()
printbold("torch.outer version:")
print(outer_direct)
broadcast version:
tensor([[ 4.,  5.],
        [ 8., 10.],
        [12., 15.]])

torch.outer version:
tensor([[ 4.,  5.],
        [ 8., 10.],
        [12., 15.]])

10. view, reshape, and the danger of pretending shape changes are free of meaning

A lot of people learn view and then use it like duct tape.

That works until it doesn’t.

Important distinction

Changing shape is not automatically changing meaning correctly.

If you flatten, split, permute, or reshape, you should still know:

  • what axis means batch?
  • what axis means feature?
  • what axis means sequence?
  • what axis means channel?

A tensor is not just storage. It has semantics.

view and reshape

Both can give you a new shape with the same total number of elements; they differ in how they treat memory layout, as sketched after the output below.

If:

\(2 \times 3 \times 4 = 24\)

then you can reshape into any form whose product is still 24.

But the shape may be legal while the interpretation is wrong.

x = torch.arange(24.0).view(2, 3, 4)

flat = x.view(24)
matrix = x.view(6, 4)
cube = x.view(4, 3, 2)

printbold("x.shape     :", x.shape)
printbold("flat.shape  :", flat.shape)
printbold("matrix.shape:", matrix.shape)
printbold("cube.shape  :", cube.shape)
x.shape     : torch.Size([2, 3, 4])
flat.shape  : torch.Size([24])
matrix.shape: torch.Size([6, 4])
cube.shape  : torch.Size([4, 3, 2])
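
One practical difference between the two, stated with care: view requires the new shape to be expressible over the tensor's existing memory layout and raises an error otherwise, while reshape quietly falls back to copying. A transposed tensor makes this visible:

x = torch.arange(6.0).view(2, 3)
xt = x.t()                      # [3, 2] view with swapped strides, non-contiguous

try:
    xt.view(6)                  # cannot reinterpret this layout in place
except RuntimeError as e:
    print("view failed:", e)

flat = xt.reshape(6)            # reshape copies when it must
print("reshape ok:", flat.shape)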

11. Views are about the same underlying data, not just the same numbers

A view usually means:

same storage, different interpretation of shape

This is powerful, but it means a view is not an independent tensor: a write through one is visible through the other.

x = torch.arange(12.0).view(3, 4)
y = x.view(2, 6)

printbold("Before modification:")
print("x:")
print(x)
print("y:")
print(y)

x[0, 0] = -999

printbold("\nAfter modifying x:")
print("x:")
print(x)
print("y:")
print(y)
Before modification:
x:
tensor([[ 0.,  1.,  2.,  3.],
        [ 4.,  5.,  6.,  7.],
        [ 8.,  9., 10., 11.]])
y:
tensor([[ 0.,  1.,  2.,  3.,  4.,  5.],
        [ 6.,  7.,  8.,  9., 10., 11.]])

After modifying x:
x:
tensor([[-999.,    1.,    2.,    3.],
        [   4.,    5.,    6.,    7.],
        [   8.,    9.,   10.,   11.]])
y:
tensor([[-999.,    1.,    2.,    3.,    4.,    5.],
        [   6.,    7.,    8.,    9.,   10.,   11.]])

12. Transpose and permute: same numbers, different axis meaning

This is a major point for anyone working with neural nets.

Sometimes the values are right, but the axes are wrong.

Matrix transpose

If

\(A \in \mathbb{R}^{m \times n}\)

then

\(A^T \in \mathbb{R}^{n \times m}\)

For higher tensors, transpose swaps two dimensions, and permute reorders multiple dimensions.

A = torch.randn(2, 3)
AT = A.transpose(0, 1)

x = torch.randn(2, 3, 4)
xp = x.permute(2, 0, 1)

printbold("A.shape :", A.shape)
printbold("AT.shape:", AT.shape)
print()
printbold("x.shape :", x.shape)
printbold("xp.shape:", xp.shape)
A.shape : torch.Size([2, 3])
AT.shape: torch.Size([3, 2])

x.shape : torch.Size([2, 3, 4])
xp.shape: torch.Size([4, 2, 3])
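
A related caution, sketched below: transpose and permute return views with reordered strides; the data itself is not moved. That is why a subsequent view usually fails until you call .contiguous() (or use reshape), which materializes the new layout:

x = torch.randn(2, 3, 4)
xp = x.permute(2, 0, 1)                    # [4, 2, 3] view, same storage

print("contiguous?", xp.is_contiguous())   # False
flat = xp.contiguous().view(24)            # copy into the new order, then view
print("flat.shape:", flat.shape)           # torch.Size([24])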

13. A hacker rule: every bug is a shape bug until proven otherwise

This is an exaggeration, but a useful one.

If something feels wrong in PyTorch, inspect:

  • .shape
  • .dtype
  • .device
  • whether you meant * or @
  • whether you needed unsqueeze
  • whether you accidentally permuted semantics

That debugging sequence saves a lot of pain.

def inspect_tensor(name, t):
    print(f"{name}: shape={tuple(t.shape)}, dtype={t.dtype}, device={t.device}, ndim={t.ndim}")

x = torch.randn(8, 16, 32)
w = torch.randn(32, 64)
b = torch.randn(64)

inspect_tensor("x", x)
inspect_tensor("w", w)
inspect_tensor("b", b)

y = x @ w + b
inspect_tensor("y", y)
x: shape=(8, 16, 32), dtype=torch.float32, device=cpu, ndim=3
w: shape=(32, 64), dtype=torch.float32, device=cpu, ndim=2
b: shape=(64,), dtype=torch.float32, device=cpu, ndim=1
y: shape=(8, 16, 64), dtype=torch.float32, device=cpu, ndim=3

14. GPU is not a different math world. It is the same tensor world on a different device

A lot of beginners over-mystify GPU.

The math does not change.

What changes is:

  • where tensors live,
  • how operations are executed,
  • and the requirement that participating tensors must be on the same device.

device = "cuda" if torch.cuda.is_available() else "cpu"

x = torch.randn(4, 5, device=device)
w = torch.randn(5, 3, device=device)
b = torch.randn(3, device=device)

y = x @ w + b

print("device:", device)
print("x.device:", x.device)
print("w.device:", w.device)
print("b.device:", b.device)
print("y.device:", y.device)
print("y.shape :", y.shape)
device: cpu
x.device: cpu
w.device: cpu
b.device: cpu
y.device: cpu
y.shape : torch.Size([4, 3])
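
The failure mode to internalize is mixing devices. A guarded sketch (the mismatch branch only runs on a machine where CUDA exists): tensors are moved explicitly with .to(), and an operation between a CPU tensor and a GPU tensor raises rather than copying behind your back.

x_cpu = torch.randn(4, 5)

if torch.cuda.is_available():
    w_gpu = torch.randn(5, 3, device="cuda")
    try:
        x_cpu @ w_gpu                  # devices differ -> RuntimeError
    except RuntimeError as e:
        print("device mismatch:", e)

    y = x_cpu.to("cuda") @ w_gpu       # move first, then operate
    print("y.device:", y.device)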

15. The mental model of a linear layer

A linear layer is usually written as:

\(y = xW + b\)

In batched tensor form:

  • \((x)\): [batch, in_features]
  • \((W)\): [in_features, out_features]
  • \((b)\): [out_features]

Then:

\([batch, in] @ [in, out] \to [batch, out]\)

and then the bias is broadcast:

\([batch, out] + [out] \to [batch, out]\)

This one pattern appears everywhere.

batch = 4
in_features = 3
out_features = 2

x = torch.randn(batch, in_features)
W = torch.randn(in_features, out_features)
b = torch.randn(out_features)

y = x @ W + b

print("x.shape:", x.shape)
print("W.shape:", W.shape)
print("b.shape:", b.shape)
print("y.shape:", y.shape)
x.shape: torch.Size([4, 3])
W.shape: torch.Size([3, 2])
b.shape: torch.Size([2])
y.shape: torch.Size([4, 2])
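
One caution before reaching for torch.nn.Linear: the built-in layer stores its weight as [out_features, in_features] and computes x @ W.T + b, so it matches the manual pattern above only after a transpose. A quick check:

import torch.nn as nn

layer = nn.Linear(in_features=3, out_features=2)
x = torch.randn(4, 3)

manual = x @ layer.weight.T + layer.bias        # weight is [out, in], hence .T

print("weight.shape:", tuple(layer.weight.shape))   # (2, 3)
print("matches layer(x):", torch.allclose(layer(x), manual))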

16. Sequence-style tensors: batch is not the only axis that matters

A common shape in transformer-style work is:

\([batch, seq, hidden]\)

This is a good place to stop thinking only in 2D.

When you do:

\([batch, seq, hidden] @ [hidden, out]\)

you get:

\([batch, seq, out]\)

That is just batched matmul with additional leading structure.

x = torch.randn(2, 5, 8)   # [batch, seq, hidden]
W = torch.randn(8, 4)           # [hidden, out]

y = x @ W

print("x.shape:", x.shape)
print("W.shape:", W.shape)
print("y.shape:", y.shape)
x.shape: torch.Size([2, 5, 8])
W.shape: torch.Size([8, 4])
y.shape: torch.Size([2, 5, 4])

17. Tensor contraction mindset without jargon overload

A deep but useful idea:

Many tensor operations are of the form:

keep some axes,
multiply along some axes,
sum over those axes.

That is what matrix multiplication already does.

For:

\(C_{ij} = \sum_k A_{ik}B_{kj}\)

the axis \((k)\) is the contracted axis.

You do not “throw away” that axis.
You collapse it into a scalar contribution for each remaining index pair \((i,j)\).
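
torch.einsum makes the "keep, multiply, sum" recipe explicit: name each axis, repeat the contracted one across inputs, and omit it from the output. As a small sketch, matrix multiplication is exactly the contraction of \(k\):

A = torch.randn(2, 3)
B = torch.randn(3, 4)

C = torch.einsum("ik,kj->ij", A, B)   # keep i and j, sum over the shared k

print("C.shape:", C.shape)                        # torch.Size([2, 4])
print("matches A @ B:", torch.allclose(C, A @ B))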

18. Why shape literacy matters more than memorizing APIs

You can forget function names and still survive if you know:

  • which axes should match,
  • which axes should remain,
  • which axes should broadcast,
  • and what each axis means.

But if you memorize functions without semantics, you’ll keep fighting the framework.

19. Anti-textbook checklist

Before writing a tensor operation, ask:

  1. What does each axis mean?
  2. What shape do I want at the end?
  3. Which dimensions should match?
  4. Which dimensions should broadcast?
  5. Am I changing storage layout or just interpretation?
  6. If I printed all shapes now, would the code still make sense?

That is how you stop “hoping PyTorch understands your intention”.
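
One way to make item 6 of the checklist mechanical is a shape assertion. The helper below is hypothetical (assert_shape is not part of PyTorch, just an illustration of the habit): declare the shape you believe in, and let the assertion name the exact line that lied.

def assert_shape(t, expected, name="tensor"):
    # Hypothetical helper: None means "any size is fine" on that axis.
    actual = tuple(t.shape)
    ok = len(actual) == len(expected) and all(
        e is None or a == e for a, e in zip(actual, expected)
    )
    assert ok, f"{name}: expected shape {expected}, got {actual}"

x = torch.randn(8, 16, 32)
assert_shape(x, (None, 16, 32), "x")   # passes: any batch size is accepted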

20. Mini summary

The core rules

  • A tensor is inseparable from its shape.
  • [3] is not [1,3] or [3,1].
  • * and @ are fundamentally different.
  • Broadcasting compares from the right.
  • Size-1 dimensions are stretchable.
  • unsqueeze makes broadcasting intentional.
  • view changes the interpretation of the same storage; it does not update the semantics for you.
  • transpose / permute change axis order.
  • GPU changes device, not math.
  • Most PyTorch confusion is really shape confusion.

The notebook thesis

If you track shapes aggressively, tensors become less scary and more programmable.

# Final compact cheat block

print("[m, n] @ [n, p] -> [m, p]")
print("[..., m, n] @ [..., n, p] -> [..., m, p]")
print("[2, 3] + [3] -> [2, 3]  (broadcast)")
print("[3, 1] * [4] -> [3, 4]  (outer-product style broadcast)")
print("view/reshape: same number of elements, new shape")
print("transpose/permute: same values, different axis order")
[m, n] @ [n, p] -> [m, p]
[..., m, n] @ [..., n, p] -> [..., m, p]
[2, 3] + [3] -> [2, 3]  (broadcast)
[3, 1] * [4] -> [3, 4]  (outer-product style broadcast)
view/reshape: same number of elements, new shape
transpose/permute: same values, different axis order