Encoder-Decoder Architecture: Cross-Attention, Autoregressive Decoding, and Seq2Seq Performance
The transformer encoder maps n input tokens to continuous representations z; the decoder autoregressively generates m output tokens via cross-attention over z; the big model achieves 28.4 BLEU on WMT 2014 EN-DE (Vaswani et al., 2017).
| Measure | Value | Unit | Notes |
|---|---|---|---|
| Encoder: N layers (base model) | 6 | layers | Each layer: multi-head self-attention + position-wise FFN + residual + LayerNorm |
| Decoder: N layers (base model) | 6 | layers | Each layer: masked self-attention + cross-attention + FFN + residual + LayerNorm |
| Cross-attention keys/values source | Encoder output z | — | Decoder queries (Q) come from decoder state; K,V come from encoder memory z |
| Autoregressive decoding steps | m | steps | One token per step (m = output length); each step attends to all previous outputs |
| WMT EN-DE BLEU (base model) | 27.3 | BLEU | Encoder-decoder base transformer; Vaswani et al. Table 2 |
| WMT EN-DE BLEU (big model) | 28.4 | BLEU | Single model; surpassed all prior ensembles by >2 BLEU |
| WMT EN-FR BLEU (big model) | 41.8 | BLEU | State-of-the-art single model at publication; trained 3.5 days on 8 GPUs |
| Training FLOPs (base model) | 3.3 × 10¹⁸ | FLOPs | Substantially lower than prior RNN/CNN models at equivalent quality |
The encoder-decoder architecture is the original formulation of the transformer, designed for sequence-to-sequence tasks such as machine translation, summarization, and text-to-text generation. The two components are coupled through cross-attention: the encoder processes the full input in parallel, and the decoder generates the output autoregressively while reading from the encoder’s output at each step.
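This coupling can be sketched as a greedy decoding loop: encode the input once, then call the decoder repeatedly, feeding back each generated token. Here `encode` and `decode_step` are hypothetical stand-ins for a trained encoder and decoder stack, not a real API.

```python
import numpy as np

def greedy_decode(encode, decode_step, src_ids, bos_id, eos_id, max_len=50):
    """Autoregressive generation: encode once, decode one token per step.

    encode(src_ids) -> encoder memory z (computed once, in parallel)
    decode_step(z, out_ids) -> next-token logits (reads z via cross-attention
    and the previously generated tokens via masked self-attention)
    """
    z = encode(src_ids)               # full input processed in a single pass
    out = [bos_id]
    for _ in range(max_len):
        logits = decode_step(z, out)  # conditions on z and on output so far
        nxt = int(np.argmax(logits))  # greedy choice; beam search also common
        out.append(nxt)
        if nxt == eos_id:
            break
    return out
```

The encoder runs once regardless of output length; only the decoder is invoked m times.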
Encoder: Parallel Input Processing
The encoder maps an input sequence (x₁, …, xₙ) to a continuous representation z = (z₁, …, zₙ):
- Token embeddings + positional encodings → input to layer 1
- Each of the N=6 layers applies: LayerNorm(x + MultiHeadSelfAttention(x)), then LayerNorm(x + FFN(x))
- All n positions are processed in parallel — no recurrence
The final encoder output z is a sequence of n vectors, each of dimension d_model=512, capturing contextual meaning for each input token.
Decoder: Autoregressive Output Generation
The decoder generates the output sequence (y₁, …, yₘ) one token at a time:
- At step t, inputs are the previously generated tokens (y₁, …, y_{t-1}) + positional encodings
- Each of the N=6 decoder layers applies three sublayers:
- Masked multi-head self-attention — attends to previously generated output tokens; scores for future positions are set to −∞ before the softmax
- Multi-head cross-attention — Q from decoder state, K and V from encoder memory z
- Position-wise FFN — same two-layer structure as encoder
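The three sublayers can be sketched as one NumPy function. Again a single-head, post-norm simplification with illustrative dimensions; `p` is just a dict of this sketch's projection matrices, not notation from the paper.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def attention(Q, K, V, mask=None):
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -np.inf)  # block disallowed positions
    w = np.exp(scores - scores.max(-1, keepdims=True))
    return (w / w.sum(-1, keepdims=True)) @ V

def decoder_layer(y, z, p):
    """y: (t, d) decoder states so far; z: (n, d) encoder memory;
    p: dict of single-head projection matrices (simplified from 8 heads)."""
    t = y.shape[0]
    causal = np.tril(np.ones((t, t), dtype=bool))  # row i sees columns 0..i
    # sublayer 1: masked self-attention over previous output positions
    y = layer_norm(y + attention(y @ p["Wq1"], y @ p["Wk1"], y @ p["Wv1"], causal))
    # sublayer 2: cross-attention — Q from decoder, K/V from encoder memory z
    y = layer_norm(y + attention(y @ p["Wq2"], z @ p["Wk2"], z @ p["Wv2"]))
    # sublayer 3: position-wise FFN (real model expands d_model to 4*d_model)
    ffn = np.maximum(0, y @ p["W1"] + p["b1"]) @ p["W2"] + p["b2"]
    return layer_norm(y + ffn)
```

Note that only the self-attention is masked; cross-attention may read the entire encoder memory, since the full input is always available.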
Architecture Comparison
| Configuration | Layers | d_model | Heads | Params | WMT EN-DE BLEU | Training FLOPs |
|---|---|---|---|---|---|---|
| Encoder-decoder base | 6+6 | 512 | 8 | 65M | 27.3 | 3.3×10¹⁸ |
| Encoder-decoder big | 6+6 | 1024 | 16 | 213M | 28.4 | 2.3×10¹⁹ |
| Decoder-only (matched) | ~12 | 512 | 8 | ~65M | ~26.0–26.5 | comparable |
Cross-Attention Mechanics
In decoder cross-attention, the projection matrices map from different sources:
- W_Q ∈ ℝ^{d_model × d_k} — applied to the decoder hidden state
- W_K ∈ ℝ^{d_model × d_k}, W_V ∈ ℝ^{d_model × d_v} — applied to encoder output z (d_k = d_v = 64 in the base model)
This means each decoder position can attend to any encoder position, creating a direct information path from any input token to any decoding step in O(1) sequential operations — something RNN encoder-decoder models (Cho et al., 2014) could not achieve without explicit attention mechanisms.
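The shape bookkeeping makes this concrete. In the sketch below (single head, random weights for illustration only), the attention weight matrix is (m × n): one row per decoder step, one column per encoder position, so any output step can weight any input token.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

rng = np.random.default_rng(0)
d_model, d_k, n, m = 512, 64, 10, 7        # n encoder positions, m decoder positions
z = rng.standard_normal((n, d_model))      # encoder memory: source of K, V
h = rng.standard_normal((m, d_model))      # decoder hidden states: source of Q
W_Q = rng.standard_normal((d_model, d_k)) * 0.02
W_K = rng.standard_normal((d_model, d_k)) * 0.02
W_V = rng.standard_normal((d_model, d_k)) * 0.02  # d_v = d_k here for simplicity

Q, K, V = h @ W_Q, z @ W_K, z @ W_V
weights = softmax(Q @ K.T / np.sqrt(d_k))  # (m, n): decoder-to-encoder attention
context = weights @ V                      # (m, d_k): per-step summaries of the input
```

Each of the m context vectors mixes all n encoder positions in one matrix product, which is the O(1)-sequential-operations path the text describes.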
Encoder-Decoder vs Decoder-Only
Recent work revisiting encoder-decoder LLMs (Zhang et al., 2025) found that encoder-decoder architectures offer superior training and inference efficiency: the encoder reads input bidirectionally (all tokens attend to all other tokens), allowing richer input representations. Decoder-only models read input causally, processing each input token as if it were generated left-to-right, which is sub-optimal for comprehension tasks. However, decoder-only models benefit from unified pretraining objectives and dominate at very large scales.
For the self-attention formula used in all sublayers, see self-attention-mechanism. For the feed-forward block in each layer, see feed-forward-layers. For how positions are encoded before entering the encoder, see positional-encoding.
Sources
- Vaswani et al. (2017) — Attention Is All You Need. NeurIPS 2017
- Cho et al. (2014) — Learning Phrase Representations using RNN Encoder-Decoder. EMNLP 2014
- Ding et al. (2024) — Machine Translation with Large Language Models: Decoder Only vs. Encoder-Decoder. arXiv 2024
- Zhang et al. (2025) — Encoder-Decoder or Decoder-Only? Revisiting Encoder-Decoder LLMs. arXiv 2025
Frequently Asked Questions
What is the role of cross-attention in the encoder-decoder architecture?
Cross-attention in each decoder layer connects the decoder to the encoder's output memory. The decoder's current hidden state forms the queries Q, while the encoder output z provides the keys K and values V. This allows every decoder step to directly inspect any part of the input sequence, which is critical for tasks like translation where output words can depend on non-local input words.
Why is decoder self-attention masked?
During training the full target sequence is available, but the decoder must not see future tokens when predicting position t. Masking the attention matrix (setting scores to −∞ for positions > t before softmax) ensures that predictions at position t only depend on tokens 0 through t−1, preserving the autoregressive property and making training compatible with inference.
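A small numerical sketch shows the effect: after setting disallowed scores to −∞, the softmax assigns them exactly zero weight, so row t distributes probability only over positions 0..t.

```python
import numpy as np

def causal_mask(t):
    # True where attention is allowed: row i may look at columns 0..i
    return np.tril(np.ones((t, t), dtype=bool))

scores = np.zeros((4, 4))                        # stand-in raw attention scores
masked = np.where(causal_mask(4), scores, -np.inf)
weights = np.exp(masked)                          # exp(-inf) == 0 exactly
weights /= weights.sum(-1, keepdims=True)         # row t is uniform over 0..t
```

Because masking happens inside every training step, one parallel pass over the whole target sequence yields the same per-position predictions that step-by-step inference would produce.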
When does encoder-decoder outperform decoder-only architectures?
Research by Ding et al. (2024) and Zhang et al. (2025) shows encoder-decoder architectures tend to outperform decoder-only models on structured seq2seq tasks like translation, especially at smaller parameter budgets. Decoder-only models can match performance with larger scale but require roughly 2× the parameters for equivalent quality on translation tasks.