Transformers changed natural language processing by replacing recurrence with a mechanism that can compare every token with every other token in parallel. This capability is the backbone of modern large language models (LLMs), and understanding it at a mathematical level helps you reason about why these models scale so well, where they can fail, and how engineers tune them in practice. For learners taking a generative AI course in Bangalore, this deep-dive also builds the intuition needed to move from “using an LLM” to “engineering with LLMs”.
Why Transformers Work So Well for Language
Language is inherently contextual. The meaning of a word depends on what surrounds it, sometimes many tokens away. Older architectures (like RNNs) attempted to compress all past information into a fixed-size hidden state, which becomes difficult for long sequences. Transformers instead compute direct interactions between tokens at each layer.
At a high level, a transformer block repeats two ideas:
- Self-attention to mix information across tokens.
- Position-wise feed-forward layers to transform features independently per token.
The magic is that self-attention learns which tokens should influence which others—and it does so with clean, differentiable linear algebra.
The Mathematics of Self-Attention
Assume your input sequence has length n and embedding dimension d_model. Represent the token embeddings as a matrix:
X \in \mathbb{R}^{n \times d_{model}}
Self-attention projects X into three matrices using learned weights:
Q = XW_Q, \quad K = XW_K, \quad V = XW_V
where W_Q, W_K ∈ ℝ^{d_model × d_k} and W_V ∈ ℝ^{d_model × d_v}. Intuitively:
- Queries (Q) ask, “What am I looking for?”
- Keys (K) say, “What do I contain?”
- Values (V) carry the content to be aggregated.
The raw attention scores are pairwise dot products between queries and keys:
S = QK^\top \in \mathbb{R}^{n \times n}
Each entry S_ij measures how much token i attends to token j. To stabilise gradients and prevent overly peaky softmax outputs when d_k is large, transformers apply scaling:
\hat{S} = \frac{QK^\top}{\sqrt{d_k}}
Then apply softmax row-wise to obtain attention weights:
A = \text{softmax}(\hat{S})
Finally, compute the output as a weighted sum of values:
Attention(Q,K,V)=AV\text{Attention}(Q,K,V) = AVAttention(Q,K,V)=AV
This is the core operation: every token becomes a mixture of other tokens’ value vectors, with weights learned from query–key compatibility.
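To make the shapes concrete, here is a minimal NumPy sketch of scaled dot-product attention. The matrix names mirror the equations above; the dimensions and random inputs are illustrative only, not taken from any particular model.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row-wise max before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q, K: (n, d_k); V: (n, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # (n, n) pairwise query-key compatibilities
    A = softmax(scores, axis=-1)      # each row sums to 1
    return A @ V, A                   # every output row is a weighted mix of value rows

# Toy example with random data (illustrative shapes only).
rng = np.random.default_rng(0)
n, d_model, d_k, d_v = 5, 16, 8, 8
X = rng.normal(size=(n, d_model))
W_Q, W_K = rng.normal(size=(d_model, d_k)), rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_v))
out, A = scaled_dot_product_attention(X @ W_Q, X @ W_K, X @ W_V)
print(out.shape, A.shape)  # (5, 8) (5, 5)
```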
What This Means in Plain Terms
If token i strongly matches token j, then A_ij becomes large, and token i’s representation absorbs more information from V_j. This is how LLMs learn that “it” might refer to a noun several words earlier, or that a question mark changes the function of a sentence.
Multi-Head Attention and Causal Masking
A single attention computation might focus on only one type of relationship (for example, local syntax). Transformers expand capacity by using multi-head attention, which runs attention in parallel h times with different learned projections:
\text{MHA}(X) = \text{Concat}(\text{head}_1, \dots, \text{head}_h) W_O
Each head has its own W_Q, W_K, W_V, allowing different heads to specialise: one head might capture subject–verb agreement while another captures entity references.
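Building on the attention sketch above (and keeping the same illustrative dimensions), multi-head attention simply runs several independent projections and concatenates the results before a final output projection:

```python
def multi_head_attention(X, heads_qkv, W_O):
    # heads_qkv: list of (W_Q, W_K, W_V) tuples, one per head
    # W_O: (h * d_v, d_model) output projection
    outputs = []
    for W_Q, W_K, W_V in heads_qkv:
        head_out, _ = scaled_dot_product_attention(X @ W_Q, X @ W_K, X @ W_V)
        outputs.append(head_out)
    return np.concatenate(outputs, axis=-1) @ W_O  # (n, d_model)
```

In practice the per-head projections are usually fused into single large weight matrices for efficiency; the loop here is only meant to make the per-head structure explicit.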
For autoregressive LLMs, attention must not “peek” into the future. This is enforced with a causal mask M, where future positions are set to −∞ before the softmax:
A = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}} + M\right)
This ensures token i can only attend to tokens j ≤ i. Many learners in a generative AI course in Bangalore find that causal masking is the simplest explanation for why LLMs generate text left-to-right while still benefiting from deep context.
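Continuing the same NumPy sketch, the mask can be built as an upper-triangular matrix of −∞ added to the scores before the softmax, so each query row only distributes weight over positions at or before it:

```python
def causal_mask(n):
    # 0 on and below the diagonal, -inf strictly above it (the "future").
    return np.triu(np.full((n, n), -np.inf), k=1)

def masked_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k) + causal_mask(Q.shape[0])
    A = softmax(scores, axis=-1)  # exp(-inf) = 0, so future positions get zero weight
    return A @ V, A
```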
Positional Encodings: Giving Order to Attention
Self-attention alone is permutation-invariant: if you shuffle tokens, the dot products remain valid but the meaning changes. So transformers inject position information into the embeddings.
Sinusoidal Positional Encoding
A classic approach adds a deterministic vector P to each token embedding:
X' = X + P
with components defined as:
P_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right), \quad P_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)
This design encodes positions across multiple frequencies, letting the model infer relative distances through linear combinations. It also generalises to sequence lengths longer than those seen during training, because the functions are defined for any position.
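A short NumPy sketch of the sinusoidal table: the broadcasting simply evaluates the formula above for every position and every even/odd dimension pair (the even d_model and the choice of max_len are illustrative assumptions):

```python
def sinusoidal_positional_encoding(max_len, d_model):
    # Assumes d_model is even; each dimension pair (2i, 2i+1) shares one frequency.
    pos = np.arange(max_len)[:, None]            # (max_len, 1)
    two_i = np.arange(0, d_model, 2)[None, :]    # (1, d_model / 2), the "2i" in the formula
    angles = pos / np.power(10000.0, two_i / d_model)
    P = np.zeros((max_len, d_model))
    P[:, 0::2] = np.sin(angles)   # even dimensions
    P[:, 1::2] = np.cos(angles)   # odd dimensions
    return P

# X_prime = X + sinusoidal_positional_encoding(n, d_model)  # applied before the first block
```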
Learned and Relative Position Methods
Many modern models use learned positional embeddings (trainable vectors per position) or relative position approaches. A key idea is that language often cares more about relative distance (“the previous word”, “two tokens ago”) than absolute indices. Techniques like rotary position embeddings (RoPE) effectively rotate query and key vectors as a function of position, making dot products sensitive to relative offsets—useful for long-context behaviour.
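A simplified sketch of the rotary idea (the interleaved-pair variant; real implementations differ in layout and caching details): each pair of query/key dimensions is rotated by an angle proportional to the token’s position, so the dot product between a query at position m and a key at position n depends only on the offset m − n.

```python
def apply_rope(x, base=10000.0):
    # x: (n, d) queries or keys with even d; pair (2i, 2i+1) is rotated by pos * theta_i.
    n, d = x.shape
    pos = np.arange(n)[:, None]                            # (n, 1)
    theta = 1.0 / np.power(base, np.arange(0, d, 2) / d)   # (d / 2,) per-pair frequencies
    angles = pos * theta                                    # (n, d / 2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin   # standard 2-D rotation of each pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

# Q_rot, K_rot = apply_rope(Q), apply_rope(K)  # attention then proceeds exactly as before
```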
Practical Takeaways for Builders
- Scaling matters: attention is O(n²) in sequence length because of the n × n score matrix. This is why long-context LLMs require optimisation tricks.
- Stability is designed: the √d_k scaling is not cosmetic; it keeps training well-behaved.
- Positions are essential: without positional encoding (or a relative alternative), attention cannot distinguish “dog bites man” from “man bites dog”.
If you are applying these ideas in projects from a generative AI course in Bangalore, try inspecting attention maps for a small transformer. Seeing which tokens attend to which others makes the equations feel much more concrete.
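As one quick way in, without any deep-learning framework, you can reuse the NumPy sketches above (X, W_Q, W_K, W_V and masked_attention) and plot the weight matrix A. With random weights the map only shows the causal structure, but the same plotting code works for attention weights extracted from a trained model; the token labels here are purely illustrative:

```python
import matplotlib.pyplot as plt

tokens = ["the", "dog", "bit", "the", "man"]   # hypothetical labels for the n = 5 rows
_, A = masked_attention(X @ W_Q, X @ W_K, X @ W_V)

plt.imshow(A, cmap="viridis")
plt.xticks(range(len(tokens)), tokens)
plt.yticks(range(len(tokens)), tokens)
plt.xlabel("attended-to token (key)")
plt.ylabel("attending token (query)")
plt.colorbar(label="attention weight")
plt.show()
```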
Conclusion
Transformers succeed because self-attention provides a mathematically simple yet powerful way to model token-to-token relationships, while positional encodings supply the missing notion of order. Together, they enable deep contextual understanding, parallel training, and strong scaling—exactly the ingredients that power modern LLMs.
