From token ids to logits

Understand how a Decoder-only Transformer turns token ids into next-token logits

Starting from token ids

In the tokenization section, we already turned text into token ids. By this point, what the model sees is no longer a string, but a sequence of integers.

The next question is: once these integers enter the Transformer, why does each position end up with a vector of scores over the entire vocabulary?

Let us first denote the complete model as $f_\theta$, where $\theta$ stands for all the trainable parameters of the model. For a batch, the input is:

$$x \in \mathbb{Z}^{B \times T}$$

and the output is:

$$z = f_\theta(x), \quad z \in \mathbb{R}^{B \times T \times V}$$

Here $B$ is the batch size, $T$ is the sequence length, and $V$ is the vocabulary size. In other words, the model does not directly output the next token id; instead, for every position it outputs a vector of logits of length $V$.
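The shape contract above can be made concrete with a toy stand-in for $f_\theta$. This is only a sketch under stated assumptions, not the real architecture: the Transformer blocks are left out entirely, the names `toy_forward`, `embedding`, and `lm_head` are hypothetical, and the sizes (including an assumed hidden width `D`) are made up for illustration.

```python
import numpy as np

# Toy sizes, chosen only for illustration.
B, T, D, V = 2, 5, 8, 11

rng = np.random.default_rng(0)
embedding = rng.normal(size=(V, D))  # hypothetical embedding table
lm_head = rng.normal(size=(V, D))    # hypothetical output projection


def toy_forward(x: np.ndarray) -> np.ndarray:
    """Map token ids of shape (B, T) to logits of shape (B, T, V)."""
    h = embedding[x]      # (B, T, D): one vector per token id
    # A real model would apply its Transformer blocks to h here.
    return h @ lm_head.T  # (B, T, V): one score per vocabulary entry


x = rng.integers(0, V, size=(B, T))  # integer token ids
z = toy_forward(x)
print(x.shape, z.shape)
```

The point is only the bookkeeping: integers of shape `(B, T)` go in, and real-valued scores of shape `(B, T, V)` come out, one length-`V` vector per position.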

What are logits, and how big are they?

Logits are the raw scores before softmax. They can be positive or negative, and their magnitude is not fixed. They are not probabilities themselves; only after softmax do they become a probability distribution.
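To illustrate the step from logits to probabilities, here is a small sketch of softmax for a single position. The four scores are invented for the example, and subtracting the maximum before exponentiating is a common numerical-stability choice rather than something the text prescribes.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, -1.0, 0.5, 0.0]  # made-up scores for a 4-token vocabulary
probs = softmax(logits)
print(probs)
print(sum(probs))
```

Negative logits are perfectly valid inputs: softmax depends only on the differences between scores, so the largest logit always receives the largest probability.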

A sense of typical scale, writing $D$ for the model's hidden dimension: Qwen3.5-0.8B has $D = 1024$ and $V = 248{,}320$; Qwen3.5-9B has $D = 4096$ and $V = 248{,}320$. $V$ is usually one to two orders of magnitude larger than $D$ (here about 240 and 60 times, respectively), so the $V \times D$ matrix of the LM Head can take up a substantial fraction of the model:

Qwen3.5-0.8B: embedding and LM Head share weights (weight tying), and this single $V \times D$ matrix accounts for about 32% (~254M).
Qwen3.5-9B: the two are not shared; the LM Head alone is about 11% (~1.02B), and together with the embedding totals about 22%.

The smaller the model and the larger the vocabulary, the more extreme this ratio becomes (the next chapter, Embedding and LM Head, expands on the trade-offs of weight tying).
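The parameter counts quoted above follow directly from $V \times D$. The snippet below is a quick arithmetic check using the $V$ and $D$ values given in the text; `lm_head_params` is a hypothetical helper name, and the percentages are not recomputed here because they depend on each model's total parameter count.

```python
def lm_head_params(vocab_size: int, hidden_dim: int) -> int:
    """Number of entries in a V x D projection matrix (no bias term)."""
    return vocab_size * hidden_dim


V = 248_320  # vocabulary size quoted in the text

for name, D in [("Qwen3.5-0.8B", 1024), ("Qwen3.5-9B", 4096)]:
    params = lm_head_params(V, D)
    print(f"{name}: V x D = {params:,} parameters, V / D = {V / D:.1f}")

# Qwen3.5-0.8B: V x D = 254,279,680 parameters, V / D = 242.5
# Qwen3.5-9B: V x D = 1,017,118,720 parameters, V / D = 60.6
```

These match the approximate figures of ~254M and ~1.02B given for the two models.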


