FFmpeg Series #2

The Mathematics of Video Compression: H.264, HEVC, and AV1

Explore the mathematical and scientific foundations powering modern video codecs. From the Discrete Cosine Transform to motion estimation algorithms, understand how redundancy elimination achieves 100:1 compression ratios while maintaining visual quality.

JewelMusic Engineering Team
February 5, 2025

The Core Principle: Exploiting Redundancy

Modern video compression makes it practical to store and transmit high-definition video at manageable bitrates. That efficiency is not the result of a single breakthrough. It comes from combining principles from information theory, mathematics, and the study of human perception.

Two Types of Redundancy

Spatial Redundancy (Intra-frame)

Within a single frame, adjacent pixels often have similar values:

  • Large areas of uniform color (sky, walls)
  • Smooth gradients and textures
  • Predictable patterns

Temporal Redundancy (Inter-frame)

Consecutive frames are highly correlated:

  • Static backgrounds remain unchanged
  • Moving objects undergo simple translation
  • Frame-to-frame similarity is high

Raw digital video contains an immense amount of redundant information. A 1080p video at 30 fps with 8-bit RGB samples requires approximately 1.5 Gbps uncompressed (1920 × 1080 pixels × 24 bits × 30 frames per second). Video compression algorithms systematically identify and eliminate these redundancies through a multi-stage process.
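The uncompressed figure is simple arithmetic, and a short sketch makes it easy to try other resolutions. The function name is illustrative, and 24 bits per pixel assumes 8-bit samples with no chroma subsampling.

```python
def raw_bitrate_bps(width, height, bits_per_pixel, fps):
    """Uncompressed video bitrate in bits per second."""
    return width * height * bits_per_pixel * fps

bps = raw_bitrate_bps(1920, 1080, 24, 30)
print(f"{bps / 1e9:.2f} Gbps uncompressed")      # 1.49 Gbps uncompressed
print(f"{bps / 100 / 1e6:.1f} Mbps at 100:1")    # 14.9 Mbps at 100:1
```

At a 100:1 ratio the same stream fits in roughly 15 Mbps, which is the order of magnitude that makes HD delivery practical.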

The Hybrid Coding Model: Frame Types

I-frames (Intra-coded frames)

Compressed independently without reference to other frames:

  • Complete, self-contained images
  • Reference points for subsequent frames
  • Enable random access and seeking
  • Largest frame type (least compression)
  • Use spatial redundancy reduction only

P-frames (Predictive-coded frames)

Encoded by referencing preceding I or P frames:

  • Store only differences from reference
  • Use motion compensation
  • Dramatically reduce temporal redundancy
  • Medium compression ratio
  • Forward prediction only

B-frames (Bi-directionally predictive-coded frames)

Reference both preceding and future frames:

  • Highest compression efficiency
  • Bidirectional motion compensation
  • Find best match from two directions
  • Introduce encoding/decoding delay
  • Smallest frame type

Group of Pictures (GOP) Structure

Typical GOP sequence: IBBPBBPBBP...

Display order:  I B B P B B P B B P
Decode order:   I P B B P B B P B B

GOP size affects:
• Random access granularity
• Compression efficiency
• Error propagation
• Encoding complexity
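
The reordering between display order and decode order follows one rule: a B-frame can only be decoded after both of its anchors (the I- or P-frames on either side) are available. A minimal sketch of that rule, assuming each B-frame references the nearest anchor on each side (real encoders allow more flexible reference structures):

```python
def display_to_decode_order(display):
    """Reorder frame types so each B-frame follows both of its anchors."""
    decode, pending_b = [], []
    for frame in display:
        if frame == "B":
            pending_b.append(frame)     # hold until the next anchor arrives
        else:
            decode.append(frame)        # the I/P anchor is decoded first
            decode.extend(pending_b)    # then the B-frames displayed before it
            pending_b.clear()
    decode.extend(pending_b)            # B-frames with no later anchor in this list
    return decode

print(" ".join(display_to_decode_order("IBBPBBPBBP")))
# I P B B P B B P B B
```

The held-back B-frames are the source of the encoding and decoding delay: the decoder must receive the following P-frame before it can output them.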

Color Space Transformation: Y'CbCr

The first step in nearly every video compression pipeline is transforming the image's color representation from RGB to Y'CbCr, exploiting the human visual system's greater sensitivity to brightness than color.

BT.601 Color Space Conversion
// RGB to Y'CbCr conversion (BT.601 coefficients, full-range 8-bit form as used in JFIF)
Y'  = 0.299 × R' + 0.587 × G' + 0.114 × B'
Cb  = -0.168736 × R' - 0.331264 × G' + 0.500 × B' + 128
Cr  = 0.500 × R' - 0.418688 × G' - 0.081312 × B' + 128

// Y' = Luma (brightness)
// Cb = Chroma blue-difference
// Cr = Chroma red-difference
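
The three equations translate directly into code. This sketch assumes 8-bit full-range input and output (0 to 255); the helper names are illustrative.

```python
def clip8(value):
    """Round to the nearest integer and clamp to the 8-bit range."""
    return max(0, min(255, int(round(value))))

def rgb_to_ycbcr(r, g, b):
    """Full-range BT.601 conversion for 8-bit R'G'B' values."""
    y  =  0.299    * r + 0.587    * g + 0.114    * b
    cb = -0.168736 * r - 0.331264 * g + 0.5      * b + 128
    cr =  0.5      * r - 0.418688 * g - 0.081312 * b + 128
    return clip8(y), clip8(cb), clip8(cr)

print(rgb_to_ycbcr(255, 255, 255))  # (255, 128, 128)
print(rgb_to_ycbcr(255, 0, 0))      # (76, 85, 255)
```

Neutral colors (white, gray, black) land on Cb = Cr = 128, so the chroma planes of low-saturation content are nearly flat and compress very well.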

4:4:4

No subsampling. Full resolution for all channels.

4:2:2

Horizontal chroma resolution halved.

4:2:0

Both horizontal and vertical chroma resolution halved. Chroma data drops by 75%, which cuts total data by 50%.
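
Counting samples per frame shows what each scheme saves overall. This is a sketch with an illustrative function name; it assumes even frame dimensions and counts one luma plane plus two chroma planes.

```python
def samples_per_frame(width, height, scheme):
    """Total samples (luma plus two chroma planes) for a subsampling scheme."""
    h_div, v_div = {"4:4:4": (1, 1), "4:2:2": (2, 1), "4:2:0": (2, 2)}[scheme]
    luma = width * height
    chroma = 2 * (width // h_div) * (height // v_div)
    return luma + chroma

full = samples_per_frame(1920, 1080, "4:4:4")
for scheme in ("4:4:4", "4:2:2", "4:2:0"):
    total = samples_per_frame(1920, 1080, scheme)
    print(scheme, total, f"{100 * total / full:.0f}% of 4:4:4")
```

For 1080p, 4:2:0 keeps a quarter of the chroma samples, so the frame as a whole shrinks to half its 4:4:4 size before any other compression step runs.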

The Discrete Cosine Transform (DCT)

After color space conversion, spatial redundancy is tackled by transforming data from the spatial domain to the frequency domain using the DCT. This mathematical operation expresses pixel blocks as sums of cosine functions at different frequencies.

2D-DCT Mathematical Formula
// Two-dimensional DCT for an M×N block
B[p][q] = α[p] × α[q] × Σ(m=0..M-1) Σ(n=0..N-1) A[m][n] ×
          cos[π(2m+1)p / (2M)] × cos[π(2n+1)q / (2N)]

Where:
• A[m][n] = Input pixel block
• B[p][q] = DCT coefficients, 0 ≤ p < M, 0 ≤ q < N
• α[p] = √(1/M) if p=0, else √(2/M)
• α[q] = √(1/N) if q=0, else √(2/N)

Key Properties:
• B[0][0] = DC coefficient (proportional to the block's average intensity)
• Other coefficients = AC (frequencies)
• Energy compaction in low frequencies
• Most high-frequency coefficients ≈ 0
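
A direct, unoptimized implementation of the formula helps confirm the properties listed above. This is a sketch for small blocks only; production encoders use fast factorizations or integer approximations rather than the four nested loops shown here.

```python
import math

def dct2(block):
    """Naive 2D DCT of an M×N block given as a list of rows."""
    M, N = len(block), len(block[0])

    def alpha(k, size):
        return math.sqrt((1 if k == 0 else 2) / size)

    out = [[0.0] * N for _ in range(M)]
    for p in range(M):
        for q in range(N):
            total = 0.0
            for m in range(M):
                for n in range(N):
                    total += (block[m][n]
                              * math.cos(math.pi * (2 * m + 1) * p / (2 * M))
                              * math.cos(math.pi * (2 * n + 1) * q / (2 * N)))
            out[p][q] = alpha(p, M) * alpha(q, N) * total
    return out

flat = [[10] * 8 for _ in range(8)]      # a perfectly uniform 8×8 block
coeffs = dct2(flat)
print(round(coeffs[0][0], 6))            # 80.0
print(max(abs(c) for row in coeffs for c in row[1:]) < 1e-9)  # True
```

A uniform block puts all of its energy into the DC coefficient (here 8 × the pixel value), which is the extreme case of energy compaction: 64 pixels become one nonzero number.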

Why DCT for Video?

  • Superior energy compaction for natural images
  • Real-valued output (unlike FFT)
  • No complex arithmetic required
  • Decorrelates spatial data effectively

Integer Transforms in H.264/HEVC

Modern codecs replace floating-point DCT with integer approximations to eliminate encoder-decoder drift and ensure bit-exact reconstruction across all platforms.

H.264 4×4 Integer Transform
// Forward transform matrix (H.264)
Cf = [ 1  1  1  1]
     [ 2  1 -1 -2]
     [ 1 -1 -1  1]
     [ 1 -2  2 -1]

// Transform: Y = Cf × X × Cf^T
// No multiplications required!
// Only additions, subtractions, bit-shifts

// Inverse transform matrix
Ci = [ 1    1    1   1/2]
     [ 1   1/2  -1   -1 ]
     [ 1  -1/2  -1    1 ]
     [ 1   -1    1  -1/2]

// HEVC extends this to larger blocks:
• 4×4, 8×8, 16×16, 32×32 transforms
• Discrete Sine Transform (DST) for 4×4 intra-predicted luma blocks
• Better adaptation to content characteristics
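
To illustrate the "additions, subtractions, bit-shifts" claim for the H.264 forward matrix Cf, the sketch below applies a one-dimensional butterfly to the rows and then the columns of a block and checks the result against the plain matrix product. The function names are illustrative, and the scaling that H.264 folds into quantization is left out.

```python
CF = [[1,  1,  1,  1],
      [2,  1, -1, -2],
      [1, -1, -1,  1],
      [1, -2,  2, -1]]

def butterfly_1d(a, b, c, d):
    """Multiply Cf by the vector (a, b, c, d) using only adds and shifts."""
    s0, s1 = a + d, b + c
    d0, d1 = a - d, b - c
    return [s0 + s1, (d0 << 1) + d1, s0 - s1, d0 - (d1 << 1)]

def forward_4x4(x):
    """Compute Cf × X × Cf^T for a 4×4 integer block."""
    rows = [butterfly_1d(*row) for row in x]            # X × Cf^T
    cols = [butterfly_1d(*col) for col in zip(*rows)]   # Cf applied to each column
    return [list(r) for r in zip(*cols)]                # transpose back

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

x = [[5, 11, 8, 10], [9, 8, 4, 12], [1, 10, 11, 4], [19, 6, 15, 7]]
reference = matmul(matmul(CF, x), [list(r) for r in zip(*CF)])
print(forward_4x4(x) == reference)   # True
```

Because every intermediate value is an integer, an encoder and a decoder on different hardware produce identical results, which is what removes drift.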

Motion Estimation and Compensation

The greatest compression gains come from eliminating temporal redundancy through motion estimation: finding where blocks from the current frame appear in reference frames.

Block-Matching Algorithm (BMA)
// Sum of Absolute Differences (SAD)
SAD(u,v) = ΣΣ |C(x+i, y+j) - R(x+u+i, y+v+j)|

Where:
• C = Current block at (x,y)
• R = Reference frame
• (u,v) = Motion vector displacement
• Search window typically ±16 to ±64 pixels

Process Steps

  1. Partition frame into blocks
  2. Search reference frame
  3. Find minimum SAD match
  4. Generate motion vector
  5. Compute residual error
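
The steps above can be sketched as an exhaustive (full) search over a small window. The frames are plain lists of rows, the block size and search range are illustrative defaults, and the test pattern is synthetic. Real encoders use the faster search patterns listed under optimization techniques rather than testing every displacement.

```python
def sad(cur, ref, x, y, u, v, n):
    """SAD between the n×n block of cur at (x, y) and ref displaced by (u, v)."""
    return sum(abs(cur[y + j][x + i] - ref[y + v + j][x + u + i])
               for j in range(n) for i in range(n))

def full_search(cur, ref, x, y, n=4, search_range=3):
    """Return the motion vector (u, v) with the minimum SAD and its cost."""
    height, width = len(ref), len(ref[0])
    best_mv, best_cost = (0, 0), sad(cur, ref, x, y, 0, 0, n)
    for v in range(-search_range, search_range + 1):
        for u in range(-search_range, search_range + 1):
            if not (0 <= x + u <= width - n and 0 <= y + v <= height - n):
                continue                      # candidate block leaves the frame
            cost = sad(cur, ref, x, y, u, v, n)
            if cost < best_cost:
                best_mv, best_cost = (u, v), cost
    return best_mv, best_cost

# Synthetic frames: the current frame is the reference shifted by (2, 1).
ref = [[(31 * r + 17 * c) % 251 for c in range(16)] for r in range(16)]
cur = [[ref[min(r + 1, 15)][min(c + 2, 15)] for c in range(16)] for r in range(16)]

mv, cost = full_search(cur, ref, x=4, y=4)
residual = [[cur[4 + j][4 + i] - ref[4 + mv[1] + j][4 + mv[0] + i]
             for i in range(4)] for j in range(4)]
print(mv, cost)   # (2, 1) 0
```

With a perfect match the residual block is all zeros, so the encoder only needs to transmit the motion vector for this block.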

Optimization Techniques

  • Diamond search patterns
  • Hierarchical motion estimation
  • Early termination
  • Sub-pixel refinement
  • Hardware acceleration

Advanced Motion Features

H.264

  • Variable block sizes (16×16 to 4×4)
  • Quarter-pixel precision
  • Multiple reference frames

HEVC

  • CTUs up to 64×64
  • Advanced MV prediction
  • Merge mode

AV1

  • Warped motion
  • Global motion
  • Overlapped blocks

Quantization: Controlled Information Loss

Quantization is the primary stage where information is deliberately discarded, making it the core of lossy compression. It reduces the precision of DCT coefficients by dividing by a quantization step size.

Quantization Process
// Scalar Quantization
Z[i][j] = round(Y[i][j] / Qstep)

// H.264 Logarithmic QP Relationship
Qstep(QP) ≈ 0.625 × 2^(QP/6)

// QP ranges from 0-51 in H.264 (8-bit video)
QP = 0  → Qstep = 0.625  (highest quality)
QP = 4  → Qstep = 1
QP = 10 → Qstep = 2
QP = 16 → Qstep = 4
QP = 51 → Qstep = 224    (lowest quality)

// Each 6-point QP increase doubles Qstep
// ~12.2% increase per QP unit (2^(1/6) ≈ 1.122)
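
A round trip through scalar quantization shows where the information is lost. This sketch uses a single step size passed in directly and rounds half away from zero; the coefficient values are made up for illustration, and real encoders add rounding offsets and per-coefficient scaling on top of this.

```python
def quantize(coeffs, qstep):
    """Z = round(Y / Qstep), rounding half away from zero."""
    levels = []
    for y in coeffs:
        level = int(abs(y) / qstep + 0.5)
        levels.append(-level if y < 0 else level)
    return levels

def dequantize(levels, qstep):
    """Decoder-side reconstruction: Y' = Z × Qstep."""
    return [z * qstep for z in levels]

coeffs = [100, -37, 12, 3, -1]
levels = quantize(coeffs, 8)
recon = dequantize(levels, 8)
print(levels)   # [13, -5, 2, 0, 0]
print(recon)    # [104, -40, 16, 0, 0]
```

Small coefficients collapse to zero and cannot be recovered, while larger ones come back with an error of at most half the step size. A larger Qstep produces more zeros and a larger error, which is the quality-for-bitrate trade that QP controls.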

Division-Free Implementation

H.264 uses integer multiplication and bit-shifts:

Z[i][j] = (W[i][j] × MF + f) >> qbits

Where MF is a multiplication factor looked up by QP mod 6 and coefficient position, qbits = 15 + floor(QP / 6), and f is a rounding offset
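
The multiply-and-shift idea can be shown with a generic fixed-point reciprocal. Note the assumptions: the multiplier here is simply round(2^15 / Qstep) for an illustrative step size, not the MF table from the H.264 standard, and the offset f is set for round-to-nearest.

```python
QBITS = 15  # fixed-point precision chosen for this sketch

def make_multiplier(qstep):
    """Precompute round(2^QBITS / qstep) once per step size."""
    return round((1 << QBITS) / qstep)

def quantize_shift(w, mf, f=1 << (QBITS - 1)):
    """(|w| × mf + f) >> QBITS, with the sign restored afterwards."""
    level = (abs(w) * mf + f) >> QBITS
    return -level if w < 0 else level

def quantize_divide(w, qstep):
    """Reference version: divide, then round half away from zero."""
    level = int(abs(w) / qstep + 0.5)
    return -level if w < 0 else level

mf = make_multiplier(8)
print(mf, quantize_shift(100, mf), quantize_divide(100, 8))   # 4096 13 13
```

The per-coefficient work becomes one integer multiply, one add, and one shift, which is cheap in both software and hardware and gives identical results on every platform.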

Entropy Coding: Lossless Packing

Variable-Length Coding (CAVLC)

Context-Adaptive VLC, the only entropy coder available in the H.264 Baseline profile:

  • Assigns shorter codes to frequent symbols
  • Run-length encoding for zeros
  • Adapts tables based on neighboring blocks
  • Lower complexity, suitable for mobile

Arithmetic Coding (CABAC)

Context-Adaptive Binary Arithmetic Coding:

CABAC Process:
1. Binarization → Convert to binary string
2. Context Selection → Choose probability model
3. Arithmetic Encoding → Fractional bit allocation

Benefits:
• 10-15% better compression than CAVLC
• Adapts to local statistics
• Near-optimal entropy coding
• Used in H.264 Main/High and as the only entropy coder in HEVC (AV1 uses a different, multi-symbol adaptive arithmetic coder)
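
The value of fractional bit allocation is easiest to see from the Shannon entropy of a binary decision. The sketch below compares the ideal cost per symbol with the one-bit floor of any code that spends at least one whole bit on each binary symbol; the probabilities are illustrative.

```python
import math

def entropy_bits(probabilities):
    """Shannon entropy in bits per symbol."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

for p in (0.5, 0.8, 0.95):
    h = entropy_bits([p, 1 - p])
    print(f"P(0) = {p}: ideal {h:.3f} bits/symbol vs. 1 bit/symbol minimum")
```

When one outcome is very likely (for example, "this coefficient is zero"), the ideal cost falls well below one bit. An arithmetic coder can approach that cost, and context modeling is what supplies the skewed probability estimates.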

Codec Evolution: H.264 → HEVC → AV1

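┌─────────────────┬───────────────────┬─────────────────────────┬──────────────────────────────┐
│ Feature         │ H.264 (2003)      │ HEVC (2013)             │ AV1 (2018)                   │
├─────────────────┼───────────────────┼─────────────────────────┼──────────────────────────────┤
│ Block Structure │ 16×16 macroblocks │ CTUs up to 64×64        │ Superblocks up to 128×128    │
│ Transform       │ 4×4, 8×8 DCT      │ 4×4 to 32×32 DCT/DST    │ 4×4 to 64×64, multiple types │
│ Intra Modes     │ 9 modes           │ 35 modes                │ 56 directional + advanced    │
│ Compression     │ Baseline          │ ~50% better             │ ~30% better than HEVC        │
│ Complexity      │ Low               │ ~2-4× H.264             │ ~3-5× HEVC                   │
│ Licensing       │ MPEG LA pool      │ Complex, multiple pools │ Royalty-free                 │
└─────────────────┴───────────────────┴─────────────────────────┴──────────────────────────────┘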

In-Loop Filtering

Deblocking Filter

Reduces block artifacts at transform boundaries:

  • H.264: Simple edge-based filter
  • HEVC: Improved with parallel processing
  • Applied in prediction loop for better references

Sample Adaptive Offset (SAO)

HEVC addition for further artifact reduction:

  • Classifies pixels into categories
  • Applies offsets to reduce distortion
  • Band offset and edge offset modes

AV1 Restoration Filters

Advanced filtering pipeline:

  • CDEF: Constrained Directional Enhancement Filter
  • Loop Restoration: Wiener and self-guided filters
  • Superior quality at low bitrates

Practical Implementation in FFmpeg

Codec Comparison Commands

# H.264 encoding with x264

ffmpeg -i input.mp4 -c:v libx264 -preset slow -crf 23 output_h264.mp4

# HEVC encoding with x265

ffmpeg -i input.mp4 -c:v libx265 -preset slow -crf 28 output_hevc.mp4

# AV1 encoding with libaom

ffmpeg -i input.mp4 -c:v libaom-av1 -crf 30 -b:v 0 output_av1.webm

# AV1 encoding with SVT-AV1 (faster)

ffmpeg -i input.mp4 -c:v libsvtav1 -crf 35 -preset 6 output_svt.mp4

# Two-pass encoding for a target average bitrate

ffmpeg -i input.mp4 -c:v libx264 -b:v 2M -pass 1 -an -f null /dev/null
ffmpeg -i input.mp4 -c:v libx264 -b:v 2M -pass 2 output.mp4
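
For batch comparisons, the single-pass commands above can be assembled from a small table. This is a sketch: the helper name and dictionary are illustrative, the flags are copied from the commands in this section, and nothing is executed until the list is passed to something like subprocess.run.

```python
import shlex

# Per-encoder quality settings, copied from the commands above.
ENCODER_FLAGS = {
    "libx264":    ["-preset", "slow", "-crf", "23"],
    "libx265":    ["-preset", "slow", "-crf", "28"],
    "libaom-av1": ["-crf", "30", "-b:v", "0"],
    "libsvtav1":  ["-crf", "35", "-preset", "6"],
}

def encode_command(src, encoder, dst):
    """Build the ffmpeg argument list for one encoder (does not run it)."""
    return ["ffmpeg", "-i", src, "-c:v", encoder, *ENCODER_FLAGS[encoder], dst]

outputs = {
    "libx264": "output_h264.mp4",
    "libx265": "output_hevc.mp4",
    "libaom-av1": "output_av1.webm",
    "libsvtav1": "output_svt.mp4",
}
for encoder, dst in outputs.items():
    print(shlex.join(encode_command("input.mp4", encoder, dst)))
```

Passing the list form to subprocess.run(cmd, check=True) avoids shell quoting problems with file names that contain spaces.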

Performance Benchmarks

Relative Performance Comparison
Speed, quality, and file size relative to the H.264 fast preset:
┌─────────────┬──────────┬─────────────┬──────────────┐
│ Codec       │ Speed    │ Quality     │ File Size    │
├─────────────┼──────────┼─────────────┼──────────────┤
│ H.264 fast  │ 1.0×     │ Baseline    │ 100%         │
│ H.264 slow  │ 0.3×     │ +5% PSNR    │ 85%          │
│ HEVC fast   │ 0.5×     │ +8% PSNR    │ 70%          │
│ HEVC slow   │ 0.1×     │ +12% PSNR   │ 55%          │
│ AV1 fast    │ 0.2×     │ +10% PSNR   │ 65%          │
│ AV1 slow    │ 0.02×    │ +15% PSNR   │ 45%          │
└─────────────┴──────────┴─────────────┴──────────────┘

Hardware Acceleration Impact:
• NVENC H.264: 5-10× faster than x264
• NVENC HEVC: 3-5× faster than x265
• Intel QSV: 3-8× speedup
• Apple VideoToolbox (VTB): 4-6× speedup
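
PSNR, the quality metric in the table, is computed from the mean squared error between the original and decoded samples and is usually reported in decibels. A minimal sketch for 8-bit samples held in flat lists (the sample values are made up):

```python
import math

def psnr(original, decoded, peak=255):
    """Peak signal-to-noise ratio in dB for two equal-length sample lists."""
    mse = sum((a - b) ** 2 for a, b in zip(original, decoded)) / len(original)
    if mse == 0:
        return math.inf               # identical signals
    return 10 * math.log10(peak ** 2 / mse)

print(round(psnr([100, 120, 140, 160], [101, 119, 141, 159]), 2))   # 48.13
```

Because the scale is logarithmic, halving the mean squared error raises PSNR by about 3 dB, so small numeric differences between encoders can correspond to visible quality changes.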

Future Directions

Emerging Technologies

  • Machine Learning Integration: Neural networks for better motion estimation, intra-prediction, and post-processing.
  • VVC (H.266): Successor to HEVC, finalized in 2020, targeting roughly 50% bitrate savings over HEVC at equal quality and optimized for 8K and 360° video.
  • Content-Adaptive Encoding: Per-scene optimization using AI to analyze content characteristics.
  • Cloud-Native Codecs: Designed for distributed encoding and streaming at scale.
