Encoder-Decoder Architecture: Cross-Attention, Autoregressive Decoding, and Seq2Seq Performance
The transformer encoder maps n input tokens to continuous representations z; the decoder autoregressively generates m output tokens via cross-attention over z; the big model achieves 28.4 BLEU on WMT 2014 EN-DE (Vaswani et al., 2017).
| Measure | Value | Unit | Notes |
|---|---|---|---|
| Encoder: N layers (base model) | 6 | layers | Each layer: multi-head self-attention + position-wise FFN + residual + LayerNorm |
| Decoder: N layers (base model) | 6 | layers | Each layer: masked self-attention + cross-attention + FFN + residual + LayerNorm |
| Cross-attention keys/values source | Encoder output z | — | Decoder queries (Q) come from decoder state; K,V come from encoder memory z |
| Autoregressive decoding steps | m | steps | One token per step (m = output length); each step attends to all previous outputs |
| WMT EN-DE BLEU (base model) | 27.3 | BLEU | Encoder-decoder base transformer; Vaswani et al. Table 2 |
| WMT EN-DE BLEU (big model) | 28.4 | BLEU | Single model; surpassed all prior ensembles by >2 BLEU |
| WMT EN-FR BLEU (big model) | 41.8 | BLEU | State-of-the-art single model at publication; trained 3.5 days on 8 GPUs |
| Training FLOPs (base model) | 3.3 × 10¹⁸ | FLOPs | Substantially lower than prior RNN/CNN models at equivalent quality |
The encoder-decoder architecture is the original formulation of the transformer, designed for sequence-to-sequence tasks such as machine translation, summarization, and text-to-text generation. The two components are coupled through cross-attention: the encoder processes the full input in parallel, and the decoder generates the output autoregressively while reading from the encoder’s output at each step.
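This coupling can be sketched as a greedy decoding loop: encode the input once, then call the decoder repeatedly, feeding back each generated token. Here `encode` and `decode_step` are hypothetical stand-ins for a trained encoder and decoder stack, not a real API.

```python
import numpy as np

def greedy_decode(encode, decode_step, src_ids, bos_id, eos_id, max_len=50):
    """Autoregressive generation: encode once, decode one token per step.

    encode(src_ids) -> encoder memory z (computed once, in parallel)
    decode_step(z, out_ids) -> next-token logits (reads z via cross-attention
    and the previously generated tokens via masked self-attention)
    """
    z = encode(src_ids)               # full input processed in a single pass
    out = [bos_id]
    for _ in range(max_len):
        logits = decode_step(z, out)  # conditions on z and on output so far
        nxt = int(np.argmax(logits))  # greedy choice; beam search also common
        out.append(nxt)
        if nxt == eos_id:
            break
    return out
```

The encoder runs once regardless of output length; only the decoder is invoked m times.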
Encoder: Parallel Input Processing
The encoder maps an input sequence (x₁, …, xₙ) to a continuous representation z = (z₁, …, zₙ):
- Token embeddings + positional encodings → input to layer 1
- Each of the N=6 layers applies: LayerNorm(x + MultiHeadSelfAttention(x)), then LayerNorm(x + FFN(x))
- All n positions are processed in parallel — no recurrence
The final encoder output z is a sequence of n vectors, each of dimension d_model=512, capturing contextual meaning for each input token.
Decoder: Autoregressive Output Generation
The decoder generates the output sequence (y₁, …, yₘ) one token at a time:
- At step t, inputs are the previously generated tokens (y₁, …, y_{t-1}) + positional encodings
- Each of the N=6 decoder layers applies three sublayers:
- Masked multi-head self-attention — attends to previously generated output tokens; scores for future positions are set to −∞ before the softmax
- Multi-head cross-attention — Q from decoder state, K and V from encoder memory z
- Position-wise FFN — same two-layer structure as encoder
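The three sublayers can be sketched as one NumPy function. Again a single-head, post-norm simplification with illustrative dimensions; `p` is just a dict of this sketch's projection matrices, not notation from the paper.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def attention(Q, K, V, mask=None):
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -np.inf)  # block disallowed positions
    w = np.exp(scores - scores.max(-1, keepdims=True))
    return (w / w.sum(-1, keepdims=True)) @ V

def decoder_layer(y, z, p):
    """y: (t, d) decoder states so far; z: (n, d) encoder memory;
    p: dict of single-head projection matrices (simplified from 8 heads)."""
    t = y.shape[0]
    causal = np.tril(np.ones((t, t), dtype=bool))  # row i sees columns 0..i
    # sublayer 1: masked self-attention over previous output positions
    y = layer_norm(y + attention(y @ p["Wq1"], y @ p["Wk1"], y @ p["Wv1"], causal))
    # sublayer 2: cross-attention — Q from decoder, K/V from encoder memory z
    y = layer_norm(y + attention(y @ p["Wq2"], z @ p["Wk2"], z @ p["Wv2"]))
    # sublayer 3: position-wise FFN (real model expands d_model to 4*d_model)
    ffn = np.maximum(0, y @ p["W1"] + p["b1"]) @ p["W2"] + p["b2"]
    return layer_norm(y + ffn)
```

Note that only the self-attention is masked; cross-attention may read the entire encoder memory, since the full input is always available.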
Architecture Comparison
| Configuration | Layers | d_model | Heads | Params | WMT EN-DE BLEU | Training FLOPs |
|---|---|---|---|---|---|---|
| Encoder-decoder base | 6+6 | 512 | 8 | 65M | 27.3 | 3.3×10¹⁸ |
| Encoder-decoder big | 6+6 | 1024 | 16 | 213M | 28.4 | 2.3×10¹⁹ |
| Decoder-only (matched) | ~12 | 512 | 8 | ~65M | ~26.0–26.5 | comparable |
Cross-Attention Mechanics
In decoder cross-attention, the projection matrices map from different sources:
- W_Q ∈ ℝ^{d_model × d_k} — applied to the decoder hidden state
- W_K ∈ ℝ^{d_model × d_k}, W_V ∈ ℝ^{d_model × d_v} — applied to encoder output z (d_k = d_v = 64 in the base model)
This means each decoder position can attend to any encoder position, creating a direct information path from any input token to any decoding step in O(1) sequential operations — something RNN encoder-decoder models (Cho et al., 2014) could not achieve without explicit attention mechanisms.
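The shape bookkeeping makes this concrete. In the sketch below (single head, random weights for illustration only), the attention weight matrix is (m × n): one row per decoder step, one column per encoder position, so any output step can weight any input token.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

rng = np.random.default_rng(0)
d_model, d_k, n, m = 512, 64, 10, 7        # n encoder positions, m decoder positions
z = rng.standard_normal((n, d_model))      # encoder memory: source of K, V
h = rng.standard_normal((m, d_model))      # decoder hidden states: source of Q
W_Q = rng.standard_normal((d_model, d_k)) * 0.02
W_K = rng.standard_normal((d_model, d_k)) * 0.02
W_V = rng.standard_normal((d_model, d_k)) * 0.02  # d_v = d_k here for simplicity

Q, K, V = h @ W_Q, z @ W_K, z @ W_V
weights = softmax(Q @ K.T / np.sqrt(d_k))  # (m, n): decoder-to-encoder attention
context = weights @ V                      # (m, d_k): per-step summaries of the input
```

Each of the m context vectors mixes all n encoder positions in one matrix product, which is the O(1)-sequential-operations path the text describes.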
Encoder-Decoder vs Decoder-Only
Recent work revisiting encoder-decoder LLMs (Zhang et al., 2025) found that encoder-decoder architectures offer superior training and inference efficiency: the encoder reads input bidirectionally (all tokens attend to all other tokens), allowing richer input representations. Decoder-only models read input causally, processing each input token as if it were generated left-to-right, which is sub-optimal for comprehension tasks. However, decoder-only models benefit from unified pretraining objectives and dominate at very large scales.
For the self-attention formula used in all sublayers, see self-attention-mechanism. For the feed-forward block in each layer, see feed-forward-layers. For how positions are encoded before entering the encoder, see positional-encoding.
Sources
- Vaswani et al. (2017) — Attention Is All You Need. NeurIPS 2017
- Cho et al. (2014) — Learning Phrase Representations using RNN Encoder-Decoder. EMNLP 2014
- Ding et al. (2024) — Machine Translation with Large Language Models: Decoder Only vs. Encoder-Decoder. arXiv 2024
- Zhang et al. (2025) — Encoder-Decoder or Decoder-Only? Revisiting Encoder-Decoder LLMs. arXiv 2025
Frequently Asked Questions
What is the role of cross-attention in the encoder-decoder architecture?
Cross-attention in each decoder layer connects the decoder to the encoder's output memory. The decoder's current hidden state forms the queries Q, while the encoder output z provides the keys K and values V. This allows every decoder step to directly inspect any part of the input sequence, which is critical for tasks like translation where output words can depend on non-local input words.
Why is decoder self-attention masked?
During training the full target sequence is available, but the decoder must not see future tokens when predicting position t. Masking the attention matrix (setting scores to −∞ for positions > t before softmax) ensures that predictions at position t only depend on tokens 0 through t−1, preserving the autoregressive property and making training compatible with inference.
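A small numerical sketch shows the effect: after setting disallowed scores to −∞, the softmax assigns them exactly zero weight, so row t distributes probability only over positions 0..t.

```python
import numpy as np

def causal_mask(t):
    # True where attention is allowed: row i may look at columns 0..i
    return np.tril(np.ones((t, t), dtype=bool))

scores = np.zeros((4, 4))                        # stand-in raw attention scores
masked = np.where(causal_mask(4), scores, -np.inf)
weights = np.exp(masked)                          # exp(-inf) == 0 exactly
weights /= weights.sum(-1, keepdims=True)         # row t is uniform over 0..t
```

Because masking happens inside every training step, one parallel pass over the whole target sequence yields the same per-position predictions that step-by-step inference would produce.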
When does encoder-decoder outperform decoder-only architectures?
Research by Ding et al. (2024) and Zhang et al. (2025) shows encoder-decoder architectures tend to outperform decoder-only models on structured seq2seq tasks like translation, especially at smaller parameter budgets. Decoder-only models can match performance with larger scale but require roughly 2× the parameters for equivalent quality on translation tasks.