Audio Compression Science: From Psychoacoustics to AAC and Opus
Discover how the human auditory system shapes modern audio codec design. From auditory masking to the MDCT transform, explore the sophisticated interplay of perception science and mathematics that enables transparent audio compression.

Psychoacoustics: Compressing for the Human Ear
While video compression focuses on exploiting statistical redundancies in spatial and temporal signals, audio compression operates on 1D time-series signals and is fundamentally rooted in psychoacoustics—the science of how humans perceive sound. All modern lossy audio codecs are "perceptual audio codecs," designed not just to remove redundant data, but to identify and discard acoustical information that is irrelevant or inaudible to the human auditory system.
Core Psychoacoustic Principles
Frequency Perception
- Human hearing range: 20 Hz - 20 kHz
- Most sensitive: 2-4 kHz range
- Non-linear frequency resolution
- Age-related high-frequency loss
Loudness Perception
- Fletcher-Munson equal-loudness curves
- Logarithmic loudness scaling
- Dynamic range: ~120 dB
- Just-noticeable differences
The psychoacoustic model transforms the blind mathematical problem of quantization into a perceptually optimized one. It analyzes the audio signal and computes a "perceptual noise budget"—the maximum amount of quantization noise that can be introduced at different frequencies without being audible.
The Threshold of Hearing and Critical Bands
// Empirical approximation of the hearing threshold (in dB SPL)
Tq(f) = 3.64*(f/1000)^-0.8 - 6.5*exp(-0.6*(f/1000 - 3.3)^2) + 10^-3*(f/1000)^4

Key characteristics:
- Minimum at ~3.5 kHz (most sensitive region)
- Rises steeply below 500 Hz
- Rises gradually above 10 kHz
- Defines the quietest audible sounds
- Varies ±20 dB between individuals
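This threshold-in-quiet approximation can be evaluated directly; a minimal Python sketch of the formula above (function name is my own choice):

```python
import math

def hearing_threshold_db(f_hz):
    """Approximate threshold in quiet (dB SPL) at frequency f_hz."""
    f = f_hz / 1000.0  # work in kHz, as the formula expects
    return (3.64 * f ** -0.8
            - 6.5 * math.exp(-0.6 * (f - 3.3) ** 2)
            + 1e-3 * f ** 4)

# The threshold is high at low frequencies and dips below 0 dB SPL
# near the ear's most sensitive region around 3-4 kHz.
```

Plotting this over 20 Hz - 20 kHz reproduces the characteristic "bathtub" curve described above.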
Bark Scale
Maps frequency to critical bands:
Bark(f) = 13*arctan(0.00076*f) + 3.5*arctan((f/7500)^2)
Critical Bands
- 24 bands total (Bark scale)
- ~100 Hz wide below 500 Hz
- ~20% of center frequency above 500 Hz
- Sounds within a band interact strongly
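The Bark mapping given above is straightforward to compute; a minimal Python sketch:

```python
import math

def bark(f_hz):
    """Map frequency (Hz) to critical-band number on the Bark scale,
    using the arctangent approximation quoted in the text."""
    return 13.0 * math.atan(0.00076 * f_hz) + 3.5 * math.atan((f_hz / 7500.0) ** 2)

# 1 kHz lands around Bark 8.5; the full audible range spans roughly 24 Bark,
# which is why codecs group spectral lines into ~24 critical bands.
```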
Auditory Masking: The Key to Perceptual Coding
The most powerful psychoacoustic phenomenon exploited by audio codecs is auditory masking—where a loud sound renders a simultaneously occurring quieter sound inaudible, even if the quieter sound would be audible on its own.
A loud tone masks quieter tones at nearby frequencies:
Masking Spread Function:
- Asymmetric shape (more masking toward higher frequencies)
- Level-dependent (a louder masker spreads wider)
- Tonal and noise maskers behave differently

Example: tonal masker at 1 kHz, 60 dB SPL:
- 500 Hz: 15 dB of masking
- 1 kHz: 60 dB (masker level at center)
- 2 kHz: 25 dB of masking
- 4 kHz: 10 dB of masking
Masking extends through time:
Pre-masking
- Duration: ~20 ms
- Backward masking effect
- Caused by the brain's processing delay
Post-masking
- Duration: 100-200 ms
- Forward masking effect
- Decays exponentially
Calculating the Global Masking Threshold
1. Time-to-frequency mapping:
   FFT(audio_frame) → spectrum[1024]
2. Identify maskers:
   for each spectral_peak:
       if peak > neighbors + 7 dB: mark as TONAL masker
       else: group as NOISE masker
3. Calculate individual thresholds:
   for each masker:
       threshold = masker_level + spreading_function(Δz)
   spreading_function(Δz) = 15.81 + 7.5*(Δz + 0.474) - 17.5*sqrt(1 + (Δz + 0.474)^2)
   (Δz is the Bark-scale distance from the masker; the function is ≈0 dB
   at the masker and negative elsewhere, so it attenuates the threshold)
4. Combine thresholds:
   global_threshold = max(absolute_threshold, sum(individual_masking_thresholds))
5. Result: a frequency-dependent noise floor.
   Any quantization noise below this threshold is inaudible.
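Steps 3 and 4 can be sketched in a few lines of Python. This is a simplified illustration, not a codec-accurate model: maskers are given directly as (Bark position, level) pairs, and individual thresholds are combined by intensity addition before taking the max with the threshold in quiet.

```python
import math

def spreading_db(dz):
    """Schroeder-style spreading function (dB) vs. Bark distance dz from the masker.
    Roughly 0 dB at the masker, negative (less masking) farther away."""
    return 15.81 + 7.5 * (dz + 0.474) - 17.5 * math.sqrt(1.0 + (dz + 0.474) ** 2)

def global_threshold_db(maskers, absolute_db, z):
    """Masked threshold (dB) at Bark position z.
    maskers: list of (bark_position, level_db) pairs (simplified input)."""
    # Sum individual masking contributions in the intensity domain
    masked_sum = sum(10.0 ** ((level + spreading_db(z - mz)) / 10.0)
                     for mz, level in maskers)
    masked_db = 10.0 * math.log10(masked_sum) if masked_sum > 0 else -1e9
    # The global threshold can never fall below the threshold in quiet
    return max(absolute_db, masked_db)
```

Note the asymmetry: `spreading_db(+2)` (masking a higher frequency) is larger than `spreading_db(-2)`, matching the upward spread of masking described earlier.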
Signal-to-Mask Ratio (SMR)
SMR = Signal_Level - Global_Masking_Threshold
The SMR drives bit allocation: bands with a higher SMR need more bits to keep quantization noise below the masking threshold.
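A common rule of thumb connects SMR to bit depth: each additional bit of a uniform quantizer buys roughly 6.02 dB of SNR, so a band needs about SMR/6.02 bits before its quantization noise drops under the mask. A hedged sketch of this allocation rule (real encoders iterate this against a global bit budget):

```python
import math

def bits_for_band(smr_db, db_per_bit=6.02):
    """Bits needed so quantization noise stays below the masking threshold,
    assuming the ~6.02 dB-per-bit rule for a uniform quantizer."""
    if smr_db <= 0:
        return 0  # signal is already below the mask: spend no bits here
    return math.ceil(smr_db / db_per_bit)

# Example allocation across bands with different SMRs (hypothetical values)
print([bits_for_band(s) for s in [-3.0, 4.0, 12.0, 30.0]])  # → [0, 1, 2, 5]
```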
The Modified Discrete Cosine Transform (MDCT)
While the psychoacoustic model uses FFT for analysis, the actual audio data transformation uses the Modified Discrete Cosine Transform. The MDCT possesses unique properties that make it exceptionally well-suited for audio compression.
// Forward MDCT (2N samples → N coefficients)
X[k] = Σ(n=0 to 2N-1) x[n] * w[n] * cos[π/N * (n + 1/2 + N/2) * (k + 1/2)]

// Inverse IMDCT (N coefficients → 2N samples)
y[n] = (2/N) * w[n] * Σ(k=0 to N-1) X[k] * cos[π/N * (n + 1/2 + N/2) * (k + 1/2)]

// Window function must satisfy the Princen-Bradley condition:
w[n]² + w[n+N]² = 1 (for perfect reconstruction)

Common windows:
- Sine window: w[n] = sin(π*(n + 1/2)/(2N))
- Kaiser-Bessel Derived (KBD) window
- Vorbis window (a sine-based power-complementary window)
MDCT Advantages
- 50% overlap without redundancy
- Critically sampled (N coefficients per N new samples)
- No block boundary artifacts
- Perfect reconstruction
TDAC Property
- Time-Domain Aliasing Cancellation
- Aliasing in overlapping halves cancels exactly
- Enables seamless reconstruction
- Key to the MDCT's success
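The TDAC property is easy to verify numerically. The sketch below implements the MDCT/IMDCT pair from the formulas above with the sine window and overlap-adds 50%-overlapped frames; interior samples come back exactly. Pure Python and O(N²) per frame, so far slower than the FFT-based fast MDCT real encoders use:

```python
import math, random

def mdct(x, N):
    """Forward MDCT: 2N windowed samples -> N coefficients (sine window)."""
    w = [math.sin(math.pi * (n + 0.5) / (2 * N)) for n in range(2 * N)]
    return [sum(x[n] * w[n] * math.cos(math.pi / N * (n + 0.5 + N / 2) * (k + 0.5))
                for n in range(2 * N))
            for k in range(N)]

def imdct(X, N):
    """Inverse MDCT: N coefficients -> 2N samples, to be overlap-added."""
    w = [math.sin(math.pi * (n + 0.5) / (2 * N)) for n in range(2 * N)]
    return [(2.0 / N) * w[n]
            * sum(X[k] * math.cos(math.pi / N * (n + 0.5 + N / 2) * (k + 0.5))
                  for k in range(N))
            for n in range(2 * N)]

# TDAC demo: 50%-overlapped frames, hop size N, overlap-add reconstruction
random.seed(0)
N = 32
signal = [random.uniform(-1.0, 1.0) for _ in range(6 * N)]
out = [0.0] * len(signal)
for start in range(0, len(signal) - 2 * N + 1, N):
    frame = signal[start:start + 2 * N]
    for n, v in enumerate(imdct(mdct(frame, N), N)):
        out[start + n] += v

# Interior samples reconstruct exactly; only the first and last half-frame
# lack an overlap partner and are excluded from the check.
err = max(abs(signal[i] - out[i]) for i in range(N, len(signal) - N))
```

Within each frame the IMDCT output is a time-aliased copy of the input, yet the aliasing in consecutive overlapping halves has opposite sign and cancels on overlap-add, which is exactly the TDAC mechanism listed above.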
Advanced Audio Coding (AAC) Family
AAC, standardized as part of MPEG-2 and MPEG-4, was designed as the successor to MP3. It achieves superior audio quality through several technical improvements, including pure MDCT (eliminating MP3's hybrid filterbank), flexible window sizes, and a more sophisticated psychoacoustic model.
AAC-LC (Low Complexity)
The most widely used AAC profile:
- Transparent quality at 128-192 kbps stereo
- Pure MDCT with 1024/128-sample windows
- Temporal Noise Shaping (TNS)
- Perceptual Noise Substitution (PNS)
- Standard for iTunes, YouTube, and broadcasting
ffmpeg -i input.wav -c:a aac -b:a 192k output.m4a
HE-AAC (High Efficiency AAC)
Optimized for low bitrates with SBR:
Spectral Band Replication (SBR):
- Encode only 0-8 kHz with AAC-LC
- Transmit parametric data for 8-16 kHz
- Reconstruct the highs from the lows + parameters
- ~30% bitrate reduction at the same quality
Effective at 48-64 kbps stereo
ffmpeg -i input.wav -c:a libfdk_aac -profile:a aac_he -b:a 64k output.m4a
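The core SBR idea, rebuilding the high band by copying low-band structure and rescaling it to a transmitted energy envelope, can be illustrated with a toy spectrum. Everything here is a deliberately simplified sketch (names, patch layout, and band sizes are my own; real SBR works on QMF subbands with much richer parameters):

```python
import math

def sbr_reconstruct(low_band, envelope):
    """Toy SBR decoder step: copy low-band spectral magnitudes upward and
    rescale each patch so its energy matches a transmitted envelope value."""
    patch = len(low_band) // len(envelope)
    high = []
    for i, target_energy in enumerate(envelope):
        chunk = low_band[i * patch:(i + 1) * patch]   # replicate low-band shape
        energy = sum(c * c for c in chunk) or 1e-12
        gain = math.sqrt(target_energy / energy)      # match transmitted energy
        high.extend(c * gain for c in chunk)
    return high

low = [1.0, 0.5, 0.25, 0.125]   # toy low-band spectrum (the coded 0-8 kHz part)
env = [0.1, 0.01]               # transmitted high-band energies (the 8-16 kHz part)
high = sbr_reconstruct(low, env)
```

The decoder thus spends only a handful of parameters on the upper octave instead of coding its spectral lines directly, which is where the ~30% saving comes from.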
HE-AAC v2
Adds Parametric Stereo (PS) to HE-AAC:
Parametric Stereo (PS):
- Convert stereo → mono downmix
- Extract spatial parameters:
  - Inter-channel Intensity Difference (IID)
  - Inter-channel Phase Difference (IPD)
  - Inter-channel Coherence (ICC)
- Reconstruct stereo from mono + parameters
Effective at 24-32 kbps stereo
ffmpeg -i input.wav -c:a libfdk_aac -profile:a aac_he_v2 -b:a 32k output.m4a
Opus: The Hybrid Codec for Interactive Audio
Opus is a highly versatile, open, and royalty-free audio codec standardized by the IETF as RFC 6716. It was specifically designed for real-time, interactive applications over the internet, where low latency is critical.
Opus = SILK (speech) + CELT (music)

SILK mode (linear predictive coding):
- Optimized for speech (8-12 kHz)
- Based on Skype's SILK codec
- LPC analysis + residual coding
- Excellent at 6-40 kbps

CELT mode (MDCT-based):
- Optimized for music (full bandwidth)
- Low-latency MDCT (2.5-20 ms frames)
- Constrained-Energy Lapped Transform
- Excellent at 48-510 kbps

Hybrid mode:
- SILK encodes 0-8 kHz
- CELT encodes 8-20 kHz
- Seamless transition
- Optimal for mixed content

The encoder can switch modes dynamically per frame based on content.
Latency
- Default: 26.5 ms
- Minimum: 5 ms
- Maximum: 60 ms
- Configurable per use case
Bitrate Range
- 6 kbps (narrowband speech)
- 510 kbps (stereo music)
- Seamless rate switching
- VBR and CBR support
Applications
- WebRTC (default codec)
- Discord, WhatsApp
- Streaming (YouTube)
- In-game voice chat
Codec Comparison and Use Cases
Codec | Best Quality Range | Latency | Complexity | Primary Use
---|---|---|---|---
MP3 | 128-320 kbps | High (~100 ms) | Low | Legacy compatibility
AAC-LC | 96-256 kbps | Moderate (~50 ms) | Medium | Streaming, storage
HE-AAC v2 | 24-64 kbps | Moderate-High | High | Low-bitrate streaming
Opus | 6-510 kbps | Very low (5-60 ms) | Medium | Real-time communication
FLAC | 400-1000 kbps | Low | Low | Lossless archival
FFmpeg Audio Processing Examples
# Convert to AAC with quality-based VBR
ffmpeg -i input.wav -c:a aac -q:a 2 output.m4a
# Opus encoding for voice (optimized for speech)
ffmpeg -i input.wav -c:a libopus -b:a 32k -application voip output.opus
# HE-AAC v2 for low-bitrate streaming
ffmpeg -i input.wav -c:a libfdk_aac -profile:a aac_he_v2 -b:a 32k output.m4a
# Multi-codec output for adaptive streaming
ffmpeg -i input.wav \
  -c:a aac -b:a 128k high.m4a \
  -c:a libfdk_aac -profile:a aac_he -b:a 64k medium.m4a \
  -c:a libfdk_aac -profile:a aac_he_v2 -b:a 32k low.m4a
# Analyze audio with loudness measurement
ffmpeg -i input.wav -af loudnorm=print_format=json -f null -
# Apply dynamic range compression before encoding
ffmpeg -i input.wav -af "acompressor=threshold=0.5" output.wav
Advanced Audio Processing Techniques
Joint Stereo Coding
Techniques for efficient stereo encoding:
M/S Stereo
Mid = (L + R) / 2
Side = (L - R) / 2
- Better for centered content
- Adaptive per frequency band
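The M/S transform above is trivially invertible, and for near-centered content the side channel carries almost no energy, which is exactly what makes it cheap to code. A minimal sketch:

```python
def ms_encode(left, right):
    """Mid/side transform: Mid = (L+R)/2, Side = (L-R)/2."""
    mid = [(l + r) / 2.0 for l, r in zip(left, right)]
    side = [(l - r) / 2.0 for l, r in zip(left, right)]
    return mid, side

def ms_decode(mid, side):
    """Inverse transform: L = M + S, R = M - S."""
    left = [m + s for m, s in zip(mid, side)]
    right = [m - s for m, s in zip(mid, side)]
    return left, right

L = [0.5, -0.25, 0.1]
R = [0.5, -0.30, 0.1]          # nearly identical channels (centered image)
mid, side = ms_encode(L, R)    # side is tiny, so it needs very few bits
L2, R2 = ms_decode(mid, side)
```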
Intensity Stereo
High frequencies: mono + panning information
- Typically applied above 2-4 kHz
- Phase-insensitive region of hearing
- Major bitrate savings
Temporal Noise Shaping (TNS)
Controls the temporal distribution of quantization noise:
- Applies predictive coding in the frequency domain
- Prevents pre-echo artifacts on transients
- Shapes noise to follow the signal envelope
- Critical for percussion and speech
Perceptual Noise Substitution (PNS)
Replaces noise-like signals with synthetic noise:
- Detects noise-like spectral regions
- Transmits only their energy level
- The decoder generates matching noise
- Huge savings for applause, rain, etc.
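The four steps above can be sketched end to end. This toy version uses spectral flatness (geometric mean over arithmetic mean of the power spectrum) as the noise detector; the 0.6 threshold and all function names are my own illustrative choices, not values from any standard:

```python
import math, random

def spectral_flatness(power):
    """Geometric mean / arithmetic mean of a power spectrum:
    close to 1 for noise-like bands, close to 0 for tonal ones."""
    gm = math.exp(sum(math.log(p + 1e-12) for p in power) / len(power))
    am = sum(power) / len(power)
    return gm / am

def pns_encode(band):
    """If a band looks noise-like, transmit only its energy (one number)."""
    power = [c * c for c in band]
    if spectral_flatness(power) > 0.6:        # hypothetical decision threshold
        return ("noise", sum(power))
    return ("coeffs", band)                   # otherwise code coefficients as usual

def pns_decode(payload, length, rng=random.Random(0)):
    """Regenerate a noise band whose energy matches the transmitted level."""
    kind, data = payload
    if kind == "coeffs":
        return data
    noise = [rng.gauss(0.0, 1.0) for _ in range(length)]
    scale = math.sqrt(data / sum(n * n for n in noise))
    return [n * scale for n in noise]
```

The decoder's noise is not sample-identical to the original, but because the band was noise-like to begin with, matching its energy is perceptually sufficient.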
Quality Metrics and Evaluation
Objective Metrics:
- SNR: Signal-to-Noise Ratio (limited correlation with perceived quality)
- PEAQ: Perceptual Evaluation of Audio Quality (ITU-R BS.1387)
- PEMO-Q: perception-model-based quality assessment
- ViSQOL: Virtual Speech Quality Objective Listener

Subjective Testing:
- ITU-R BS.1116: double-blind testing methodology for small impairments
- MUSHRA: MUltiple Stimuli with Hidden Reference and Anchor (ITU-R BS.1534)
- ABX: double-blind comparison testing
- 5-point impairment scale: Imperceptible → Very annoying

Transparency Thresholds:
- AAC-LC: ~128 kbps (stereo, typical content)
- Opus: ~96 kbps (stereo, music mode)
- HE-AAC: ~64 kbps (stereo, with SBR)
- Critical content may require +50% bitrate
Future Directions in Audio Compression
Emerging Technologies
- Neural Audio Codecs: End-to-end learned compression with neural networks, achieving unprecedented quality at very low bitrates (3-6 kbps).
- Object-Based Audio: MPEG-H 3D Audio and Dolby Atmos for immersive, spatially-aware compression.
- Semantic Audio Coding: Separate and code different sources (voice, music, effects) with specialized models.
- AI-Enhanced Psychoacoustics: Machine learning models that adapt to individual hearing profiles and preferences.