Embedding and LM Head

Understand how token ids enter a continuous vector space, and how hidden states are projected back to vocabulary logits

Why are the input/output layers worth looking at separately?

Most of the computation in a decoder-only Transformer happens in the stacked blocks in the middle, but what the language model takes in and what it produces are defined by the two ends:

$$
\text{token ids} \xrightarrow{\text{embedding}} \text{hidden states} \xrightarrow{\text{LM Head}} \text{next-token logits}
$$
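The shapes in this pipeline can be sketched in a few lines of NumPy. This is a minimal illustration, not a real model: the sizes (`vocab_size`, `d_model`) and the random weights are assumptions made up for the example, and the Transformer blocks in the middle are skipped because they leave the shape unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only; they do not come from any specific model.
vocab_size, d_model = 10, 4

# Embedding: a (vocab_size, d_model) table. Lookup is plain row indexing.
embedding = rng.normal(size=(vocab_size, d_model))
# LM Head: a (d_model, vocab_size) projection back onto the vocabulary.
lm_head = rng.normal(size=(d_model, vocab_size))

token_ids = np.array([1, 5, 7])   # integer ids, shape (seq_len,)
hidden = embedding[token_ids]     # shape (seq_len, d_model)

# The Transformer blocks would update `hidden` here; omitted in this sketch.

logits = hidden @ lm_head         # shape (seq_len, vocab_size)
print(hidden.shape, logits.shape)
```

Each row of `logits` holds one score per vocabulary entry, so the output at a position is a full vector of scores rather than a single token id.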

If the two ends are not clearly understood, many later questions get tangled together: why the input is a sequence of integers, why the output is a vector of logits rather than a token id, why the loss needs shifted targets, why the vocabulary size has a large effect on the parameter count, and why some models share weights between the input embedding and the output LM Head.
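Two of these questions, shifted targets and the effect of vocabulary size on parameter count, can be made concrete with a small sketch. The helper names `next_token_loss` and `end_layer_params` are hypothetical, introduced only for illustration, and the parameter count assumes no bias terms in either layer.

```python
import numpy as np

def next_token_loss(logits, token_ids):
    """Mean cross-entropy where position t is scored against token t+1."""
    pred = logits[:-1]        # the last position has no next token to predict
    targets = token_ids[1:]   # the first token is never a prediction target
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = pred - pred.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

def end_layer_params(vocab_size, d_model, tied):
    """Parameters in embedding plus LM Head (assumes no bias terms)."""
    table = vocab_size * d_model
    return table if tied else 2 * table

# Uniform logits give a loss of log(vocab_size), a handy sanity check.
uniform_loss = next_token_loss(np.zeros((4, 10)), np.array([1, 2, 3, 4]))

# Example sizes chosen for illustration, not taken from the article.
untied = end_layer_params(32000, 4096, tied=False)
tied = end_layer_params(32000, 4096, tied=True)
print(uniform_loss, untied, tied)
```

Both end layers scale with `vocab_size * d_model`, which is why a larger vocabulary adds parameters quickly and why sharing one table between the two ends halves that cost.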

This chapter therefore covers only the two ends, the input and the output: the Embedding and the LM Head.

Why are the input/output layers worth looking at separately?

Embedding

Looking up a table by token id

What does the embedding learn after training?

Adding new special tokens: the cleverness of padded slots

What did SFT change in the embedding

Reading SFT priorities from the special tokens

Reading SFT content from the ordinary BPE tokens

The assembled SFT recipe

LM Head

Geometric view: why having the same shape is not a coincidence

In code it is just one line of assignment

Cost: tying is not free

After wrapping a layer with LoRA, does tying still hold?

Summary
