Multi-head attention

Definition

Multi-head attention runs the self-attention mechanism in parallel across multiple learned projection subspaces (heads), allowing the model to simultaneously attend to different aspects of the input sequence. Each head computes its own query, key, and value projections; outputs are concatenated and projected back to the model dimension.

With multiple heads, each head can learn a different attention pattern over the same sequence, whereas a single attention computation must fold every relationship into one set of weights.
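The definition above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name `multi_head_attention`, the weight names `w_q`, `w_k`, `w_v`, `w_o`, and the toy sizes are assumptions chosen for the example, and the sketch omits masking, biases, batching, and dropout.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """Self-attention over x of shape (seq_len, d_model), split into heads."""
    seq_len, d_model = x.shape
    assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
    d_head = d_model // num_heads

    def split_heads(t):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    # Project the input, then give each head its own slice of the projection.
    q = split_heads(x @ w_q)
    k = split_heads(x @ w_k)
    v = split_heads(x @ w_v)

    # Scaled dot-product attention, computed independently per head.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, seq, seq)
    weights = softmax(scores, axis=-1)
    heads = weights @ v                                   # (heads, seq, d_head)

    # Concatenate the heads and project back to the model dimension.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ w_o, weights

# Toy sizes, assumed for illustration only.
rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 4, 8, 2
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v, w_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))

out, weights = multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads)
print(out.shape)      # (4, 8)
print(weights.shape)  # (2, 4, 4)
```

Each head works in a subspace of size d_model / num_heads, so the output keeps the input's shape while every head produces its own attention weights over the sequence.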

Related Terms

Attention mechanism

Self-attention

Transformer
