The Mathematics of Video Compression: H.264, HEVC, and AV1
Explore the mathematical and scientific foundations powering modern video codecs. From the Discrete Cosine Transform to motion estimation algorithms, understand how redundancy elimination achieves 100:1 compression ratios while maintaining visual quality.

The Core Principle: Exploiting Redundancy
The remarkable efficiency of modern video compression, which enables the storage and transmission of high-definition video at manageable bitrates, is not the result of a single breakthrough but rather a sophisticated synergy of principles from information theory, mathematics, and the study of human perception.
Two Types of Redundancy
Spatial Redundancy (Intra-frame)
Within a single frame, adjacent pixels often have similar values:
- • Large areas of uniform color (sky, walls)
- • Smooth gradients and textures
- • Predictable patterns
Temporal Redundancy (Inter-frame)
Consecutive frames are highly correlated:
- • Static backgrounds remain unchanged
- • Moving objects undergo simple translation
- • Frame-to-frame similarity is high
Raw digital video contains an immense amount of redundant information. A 1080p video at 30fps requires approximately 1.5 Gbps uncompressed. Video compression algorithms systematically identify and eliminate these redundancies through a multi-stage process.
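That uncompressed figure is simple arithmetic over resolution, bit depth, and frame rate; a minimal sanity check in Python (8 bits per RGB channel assumed):

# Uncompressed bitrate of 1080p30 RGB video
width, height, fps = 1920, 1080, 30
bits_per_pixel = 24                          # 8 bits each for R, G, B
bitrate = width * height * bits_per_pixel * fps
print(f"{bitrate / 1e9:.2f} Gbit/s")         # ≈ 1.49 Gbit/s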
The Hybrid Coding Model: Frame Types
I-Frames (Intra-coded) are compressed independently without reference to other frames:
- • Complete, self-contained images
- • Reference points for subsequent frames
- • Enable random access and seeking
- • Largest frame type (least compression)
- • Use spatial redundancy reduction only
P-Frames (Predicted) are encoded by referencing preceding I or P frames:
- • Store only differences from reference
- • Use motion compensation
- • Dramatically reduce temporal redundancy
- • Medium compression ratio
- • Forward prediction only
B-Frames (Bidirectional) reference both preceding and future frames:
- • Highest compression efficiency
- • Bidirectional motion compensation
- • Find best match from two directions
- • Introduce encoding/decoding delay
- • Smallest frame type
Group of Pictures (GOP) Structure
Typical GOP sequence: IBBPBBPBBP...

Display order: I B B P B B P B B P
Decode order:  I P B B P B B P B B

GOP size affects:
• Random access granularity
• Compression efficiency
• Error propagation
• Encoding complexity
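The display-to-decode reordering above can be reproduced with a short sketch; this is a simplification that assumes each B frame references only the nearest surrounding I/P frames (real encoders track reference pictures explicitly):

# Reorder display order into decode order: each B frame needs the
# following I/P reference frame decoded before it can be decoded.
display = ["I", "B", "B", "P", "B", "B", "P", "B", "B", "P"]

decode, pending_b = [], []
for frame in display:
    if frame == "B":
        pending_b.append(frame)          # hold B frames back
    else:                                # I or P: a reference frame
        decode.append(frame)
        decode.extend(pending_b)         # held B frames can now follow
        pending_b = []
decode.extend(pending_b)

print(" ".join(decode))                  # I P B B P B B P B B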
Color Space Transformation: Y'CbCr
The first step in nearly every video compression pipeline is transforming the image's color representation from RGB to Y'CbCr, exploiting the human visual system's greater sensitivity to brightness than color.
// RGB to Y'CbCr conversion (BT.601 standard)
Y' =  0.299 × R' + 0.587 × G' + 0.114 × B'
Cb = -0.168736 × R' - 0.331264 × G' + 0.500 × B' + 128
Cr =  0.500 × R' - 0.418688 × G' - 0.081312 × B' + 128

// Y' = Luma (brightness)
// Cb = Chroma blue-difference
// Cr = Chroma red-difference
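The same matrix as a quick Python check (full-range 8-bit values assumed; a real pipeline would also clip results to the valid range):

def rgb_to_ycbcr(r, g, b):
    # BT.601 full-range conversion for 8-bit components
    y  =  0.299    * r + 0.587    * g + 0.114    * b
    cb = -0.168736 * r - 0.331264 * g + 0.5      * b + 128
    cr =  0.5      * r - 0.418688 * g - 0.081312 * b + 128
    return y, cb, cr

print(rgb_to_ycbcr(255, 0, 0))        # pure red: low Cb, very high Cr
print(rgb_to_ycbcr(128, 128, 128))    # neutral grey: Cb = Cr = 128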
4:4:4
No subsampling. Full resolution for all channels.
4:2:2
Horizontal chroma resolution halved.
4:2:0
Both H & V chroma resolution halved. Chroma data cut by 75%, total data by 50%.
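The savings follow from counting samples per pixel (one luma plane plus two subsampled chroma planes); the scale factors below are just a restatement of the three schemes above:

# Samples per pixel for each chroma subsampling scheme
formats = {
    "4:4:4": (1.0, 1.0),   # (horizontal, vertical) chroma scale factors
    "4:2:2": (0.5, 1.0),
    "4:2:0": (0.5, 0.5),
}
for name, (sh, sv) in formats.items():
    total = 1.0 + 2 * sh * sv            # luma + Cb + Cr
    saving = 100 * (1 - total / 3)
    print(f"{name}: {total:.1f} samples/pixel, {saving:.0f}% less than 4:4:4")
# 4:4:4 → 3.0 (0%), 4:2:2 → 2.0 (33%), 4:2:0 → 1.5 (50%)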
The Discrete Cosine Transform (DCT)
After color space conversion, spatial redundancy is tackled by transforming data from the spatial domain to the frequency domain using the DCT. This mathematical operation expresses pixel blocks as sums of cosine functions at different frequencies.
// Two-dimensional DCT for an M×N block
B[p][q] = α[p] × α[q] × ΣΣ A[m][n] × cos[π(2m+1)p / (2M)] × cos[π(2n+1)q / (2N)]

Where:
• A[m][n] = input pixel block
• B[p][q] = DCT coefficients
• α[p] = √(1/M) if p=0, else √(2/M)
• α[q] = √(1/N) if q=0, else √(2/N)

Key Properties:
• B[0][0] = DC coefficient (average intensity)
• Other coefficients = AC (frequencies)
• Energy compaction in low frequencies
• Most high-frequency coefficients ≈ 0
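A direct, deliberately slow transcription of this formula makes the energy-compaction property visible: for a perfectly flat block, the DC coefficient carries everything and the AC coefficients are zero. A minimal sketch using numpy:

import numpy as np

def dct2(block):
    # 2-D DCT-II computed straight from the formula above (O(M²N²); for clarity, not speed)
    M, N = block.shape
    out = np.zeros((M, N))
    for p in range(M):
        for q in range(N):
            ap = np.sqrt((1.0 if p == 0 else 2.0) / M)
            aq = np.sqrt((1.0 if q == 0 else 2.0) / N)
            s = sum(block[m, n]
                    * np.cos(np.pi * (2 * m + 1) * p / (2 * M))
                    * np.cos(np.pi * (2 * n + 1) * q / (2 * N))
                    for m in range(M) for n in range(N))
            out[p, q] = ap * aq * s
    return out

flat = np.full((8, 8), 128.0)              # a uniform block, e.g. a patch of clear sky
coeffs = dct2(flat)
print(round(coeffs[0, 0]))                 # 1024: all the energy sits in the DC coefficient
ac = coeffs.copy(); ac[0, 0] = 0.0
print(round(float(np.abs(ac).max()), 6))   # ≈ 0: every AC coefficient vanishes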
Why DCT for Video?
- • Superior energy compaction for natural images
- • Real-valued output (unlike FFT)
- • No complex arithmetic required
- • Decorrelates spatial data effectively
Integer Transforms in H.264/HEVC
Modern codecs replace floating-point DCT with integer approximations to eliminate encoder-decoder drift and ensure bit-exact reconstruction across all platforms.
// Forward transform matrix (H.264)
Cf = [ 1  1  1  1 ]
     [ 2  1 -1 -2 ]
     [ 1 -1 -1  1 ]
     [ 1 -2  2 -1 ]

// Transform: Y = Cf × X × CfT
// No multiplications required!
// Only additions, subtractions, bit-shifts

// Inverse transform matrix
Ci = [ 1   1    1   1/2 ]
     [ 1  1/2  -1   -1  ]
     [ 1 -1/2  -1    1  ]
     [ 1  -1    1  -1/2 ]

// HEVC extends this to larger blocks:
• 4×4, 8×8, 16×16, 32×32 transforms
• Discrete Sine Transform (DST) for 4×4 luma
• Better adaptation to content characteristics
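A minimal sketch of the 4×4 forward core transform Y = Cf × X × CfT, written with ordinary matrix math for readability; a production encoder uses only additions, subtractions, and shifts, and folds the missing normalization into the quantization stage. The sample residual block here is arbitrary:

import numpy as np

# H.264 4×4 forward core transform matrix
Cf = np.array([[1,  1,  1,  1],
               [2,  1, -1, -2],
               [1, -1, -1,  1],
               [1, -2,  2, -1]])

def forward_core_transform(X):
    # Y = Cf · X · Cfᵀ  (integer result; scaling is handled during quantization)
    return Cf @ X @ Cf.T

residual = np.array([[ 5, 11,  8, 10],
                     [ 9,  8,  4, 12],
                     [ 1, 10, 11,  4],
                     [19,  6, 15,  7]])
print(forward_core_transform(residual))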
Motion Estimation and Compensation
The greatest compression gains come from eliminating temporal redundancy through motion estimation—finding where blocks from the current frame appear in reference frames.
// Sum of Absolute Differences (SAD)
SAD(u,v) = ΣΣ |C(x+i, y+j) - R(x+u+i, y+v+j)|

Where:
• C = Current block at (x,y)
• R = Reference frame
• (u,v) = Motion vector displacement
• Search window typically ±16 to ±64 pixels
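A brute-force block-matching sketch built on this metric; it does an exhaustive search of the window, whereas real encoders rely on the faster search patterns listed below:

import numpy as np

def sad(block, ref, x, y):
    # Sum of absolute differences between the block and the reference patch at (x, y)
    h, w = block.shape
    return np.abs(block.astype(int) - ref[y:y+h, x:x+w].astype(int)).sum()

def full_search(block, ref, x0, y0, search_range=16):
    # Try every displacement (u, v) in the window; keep the minimum-SAD match
    h, w = block.shape
    best_mv, best_cost = (0, 0), float("inf")
    for v in range(-search_range, search_range + 1):
        for u in range(-search_range, search_range + 1):
            x, y = x0 + u, y0 + v
            if 0 <= x <= ref.shape[1] - w and 0 <= y <= ref.shape[0] - h:
                cost = sad(block, ref, x, y)
                if cost < best_cost:
                    best_mv, best_cost = (u, v), cost
    return best_mv, best_cost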
Process Steps
- 1. Partition frame into blocks
- 2. Search reference frame
- 3. Find minimum SAD match
- 4. Generate motion vector
- 5. Compute residual error
Optimization Techniques
- • Diamond search patterns
- • Hierarchical motion estimation
- • Early termination
- • Sub-pixel refinement
- • Hardware acceleration
H.264
- • Variable block sizes (16×16 to 4×4)
- • Quarter-pixel precision
- • Multiple reference frames
HEVC
- • CTUs up to 64×64
- • Advanced MV prediction
- • Merge mode
AV1
- • Warped motion
- • Global motion
- • Overlapped blocks
Quantization: Controlled Information Loss
Quantization is the primary stage where information is deliberately discarded, making it the core of lossy compression. It reduces the precision of DCT coefficients by dividing by a quantization step size.
// Scalar Quantization
Z[i][j] = round(Y[i][j] / Qstep)

// H.264 Logarithmic QP Relationship
Qstep(QP) ≈ 0.625 × 2^(QP/6)

// QP ranges from 0-51 in H.264
QP = 0  → Qstep = 0.625 (highest quality)
QP = 4  → Qstep = 1
QP = 10 → Qstep = 2
QP = 16 → Qstep = 4
QP = 51 → Qstep = 224 (lowest quality)

// Each 6-point QP increase doubles Qstep
// ~12.2% increase per QP unit (2^(1/6) ≈ 1.122)
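A small sketch of this relationship and its effect on a single coefficient, using the approximate exponential formula rather than the exact per-QP table:

def qstep(qp):
    # Approximate H.264 step size: ≈ 0.625 · 2^(QP/6)
    return 0.625 * 2 ** (qp / 6)

def quantize(coeff, qp):
    return round(coeff / qstep(qp))

def dequantize(level, qp):
    return level * qstep(qp)

for qp in (0, 4, 10, 16, 28, 51):
    print(f"QP {qp:2d}: Qstep ≈ {qstep(qp):6.2f}")

# The coarser the step, the larger the reconstruction error:
for qp in (10, 28, 40):
    rec = dequantize(quantize(100.0, qp), qp)
    print(f"QP {qp}: 100.0 → {rec:.1f}")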
Division-Free Implementation
H.264 uses integer multiplication and bit-shifts:
Z[i][j] = (W[i][j] × MF + f) >> qbits

Where MF and qbits are lookup tables indexed by QP
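The idea, in toy form: precompute a fixed-point reciprocal of the step size so the per-coefficient division becomes a multiply and a shift. The constants below are illustrative only, not the actual H.264 MF/qbits tables (which also absorb the transform's scaling factors):

qstep = 10.0                        # hypothetical quantizer step size
qbits = 16                          # fixed-point precision (illustrative)
MF = round((1 << qbits) / qstep)    # precomputed multiplication factor
f = 1 << (qbits - 1)                # rounding offset

W = 437                             # some transform coefficient
print(round(W / qstep))             # 44, using a division
print((W * MF + f) >> qbits)        # 44, using only multiply, add, shift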
Entropy Coding: Lossless Packing
CAVLC (Context-Adaptive Variable-Length Coding), used in the H.264 Baseline profile:
- • Assigns shorter codes to frequent symbols
- • Run-length encoding for zeros
- • Adapts tables based on neighboring blocks
- • Lower complexity, suitable for mobile
CABAC (Context-Adaptive Binary Arithmetic Coding):
CABAC Process:
1. Binarization → Convert to binary string
2. Context Selection → Choose probability model
3. Arithmetic Encoding → Fractional bit allocation

Benefits:
• 10-15% better compression than CAVLC
• Adapts to local statistics
• Near-optimal entropy coding
• Used in H.264 Main/High, all HEVC/AV1
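For a concrete taste of "shorter codes for frequent symbols", here is the unsigned Exp-Golomb code, which H.264 uses for many header syntax elements and as a building block of CABAC binarization. This sketch shows only the variable-length-code idea, not the full CAVLC/CABAC machinery:

def ue(n):
    # Unsigned Exp-Golomb: write (n + 1) in binary, prefixed by that many bits minus one zeros
    bits = bin(n + 1)[2:]
    return "0" * (len(bits) - 1) + bits

for n in range(5):
    print(n, ue(n))
# 0 → 1, 1 → 010, 2 → 011, 3 → 00100, 4 → 00101
# Small (frequent) values get the shortest codewords.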
Codec Evolution: H.264 → HEVC → AV1
| Feature         | H.264 (2003)      | HEVC (2013)              | AV1 (2018)                   |
|-----------------|-------------------|--------------------------|------------------------------|
| Block Structure | 16×16 macroblocks | CTUs up to 64×64         | Superblocks up to 128×128    |
| Transform       | 4×4, 8×8 DCT      | 4×4 to 32×32 DCT/DST     | 4×4 to 64×64, multiple types |
| Intra Modes     | 9 directions      | 35 directions            | 56 directions + advanced     |
| Compression     | Baseline          | ~50% better than H.264   | ~30% better than HEVC        |
| Complexity      | Low               | ~2-4× H.264              | ~3-5× HEVC                   |
| Licensing       | MPEG-LA pool      | Complex, multiple pools  | Royalty-free                 |
In-Loop Filtering
The deblocking filter reduces block artifacts at transform boundaries:
- • H.264: Simple edge-based filter
- • HEVC: Improved with parallel processing
- • Applied in prediction loop for better references
Sample Adaptive Offset (SAO), an HEVC addition for further artifact reduction:
- • Classifies pixels into categories
- • Applies offsets to reduce distortion
- • Band offset and edge offset modes
AV1's advanced filtering pipeline:
- • CDEF: Constrained Directional Enhancement
- • Loop Restoration: Wiener and self-guided filters
- • Superior quality at low bitrates
Practical Implementation in FFmpeg
# H.264 encoding with x264
ffmpeg -i input.mp4 -c:v libx264 -preset slow -crf 23 output_h264.mp4
# HEVC encoding with x265
ffmpeg -i input.mp4 -c:v libx265 -preset slow -crf 28 output_hevc.mp4
# AV1 encoding with libaom
ffmpeg -i input.mp4 -c:v libaom-av1 -crf 30 -b:v 0 output_av1.webm
# AV1 encoding with SVT-AV1 (faster)
ffmpeg -i input.mp4 -c:v libsvtav1 -crf 35 -preset 6 output_svt.mp4
# Two-pass encoding for optimal quality
ffmpeg -i input.mp4 -c:v libx264 -b:v 2M -pass 1 -f null /dev/null
ffmpeg -i input.mp4 -c:v libx264 -b:v 2M -pass 2 output.mp4
Performance Benchmarks
Encoding Speed (relative to H.264 baseline):

┌─────────────┬──────────┬─────────────┬──────────────┐
│ Codec       │ Speed    │ Quality     │ File Size    │
├─────────────┼──────────┼─────────────┼──────────────┤
│ H.264 fast  │ 1.0×     │ Baseline    │ 100%         │
│ H.264 slow  │ 0.3×     │ +5% PSNR    │ 85%          │
│ HEVC fast   │ 0.5×     │ +8% PSNR    │ 70%          │
│ HEVC slow   │ 0.1×     │ +12% PSNR   │ 55%          │
│ AV1 fast    │ 0.2×     │ +10% PSNR   │ 65%          │
│ AV1 slow    │ 0.02×    │ +15% PSNR   │ 45%          │
└─────────────┴──────────┴─────────────┴──────────────┘

Hardware Acceleration Impact:
• NVENC H.264: 5-10× faster than x264
• NVENC HEVC: 3-5× faster than x265
• Intel QSV: 3-8× speedup
• Apple VTB: 4-6× speedup
Future Directions
Emerging Technologies
- Machine Learning Integration: Neural networks for better motion estimation, intra-prediction, and post-processing.
- VVC (H.266): Next-generation codec targeting 50% improvement over HEVC, optimized for 8K and 360° video.
- Content-Adaptive Encoding: Per-scene optimization using AI to analyze content characteristics.
- Cloud-Native Codecs: Designed for distributed encoding and streaming at scale.