Audio Compression
FFmpeg Series #3

Audio Compression Science: From Psychoacoustics to AAC and Opus

Discover how the human auditory system shapes modern audio codec design. From auditory masking to the MDCT transform, explore the sophisticated interplay of perception science and mathematics that enables transparent audio compression.

JewelMusic Engineering Team
February 6, 2025
22 min read

Psychoacoustics: Compressing for the Human Ear

While video compression focuses on exploiting statistical redundancies in spatial and temporal signals, audio compression operates on 1D time-series signals and is fundamentally rooted in psychoacoustics—the science of how humans perceive sound. All modern lossy audio codecs are "perceptual audio codecs," designed not just to remove redundant data, but to identify and discard acoustical information that is irrelevant or inaudible to the human auditory system.

Core Psychoacoustic Principles

Frequency Perception

  • Human hearing: 20 Hz - 20 kHz
  • Most sensitive: 2-4 kHz range
  • Non-linear frequency resolution
  • Age-related high-frequency loss

Loudness Perception

  • Equal-loudness contours (Fletcher-Munson curves)
  • Logarithmic loudness scaling
  • Dynamic range: ~120 dB
  • Just-noticeable differences
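To make the logarithmic scale concrete, here is a minimal Python sketch that converts a sound pressure to dB SPL. It assumes the conventional 20 µPa reference pressure; the names `P_REF` and `spl_db` are illustrative and not taken from any library.

```python
import math

P_REF = 20e-6  # assumed reference pressure in pascals (conventional 0 dB SPL)

def spl_db(pressure_pa: float) -> float:
    """Sound pressure level in dB SPL for an RMS pressure in pascals."""
    return 20.0 * math.log10(pressure_pa / P_REF)

# A 120 dB range corresponds to a million-fold ratio in sound pressure.
print(round(spl_db(P_REF), 1))        # the reference pressure itself
print(round(spl_db(P_REF * 1e6), 1))  # one million times the reference
```

Because every factor of 10 in pressure adds 20 dB, the roughly 120 dB range of hearing spans about six orders of magnitude in pressure.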

The psychoacoustic model transforms the blind mathematical problem of quantization into a perceptually optimized one. It analyzes the audio signal and computes a "perceptual noise budget"—the maximum amount of quantization noise that can be introduced at different frequencies without being audible.

The Threshold of Hearing and Critical Bands

Absolute Threshold of Hearing
// Empirical approximation of hearing threshold (in dB SPL)
Tq(f) = 3.64(f/1000)^-0.8 
        - 6.5*exp(-0.6*(f/1000 - 3.3)^2) 
        + 10^-3*(f/1000)^4

Key characteristics:
• Minimum at ~3.5 kHz (most sensitive)
• Rises steeply below 500 Hz
• Rises steeply above ~10 kHz
• Defines quietest audible sounds
• Varies ±20 dB between individuals
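The approximation above is easy to evaluate directly. The following Python sketch implements the formula as written and scans the audible range for its minimum; the function name `threshold_in_quiet_db` and the 10 Hz scan step are choices made for this illustration.

```python
import math

def threshold_in_quiet_db(f_hz: float) -> float:
    """Approximate absolute threshold of hearing in dB SPL (formula above)."""
    khz = f_hz / 1000.0
    return (3.64 * khz ** -0.8
            - 6.5 * math.exp(-0.6 * (khz - 3.3) ** 2)
            + 1e-3 * khz ** 4)

# Scan 20 Hz .. 20 kHz in 10 Hz steps and locate the most sensitive frequency.
freqs = range(20, 20001, 10)
most_sensitive = min(freqs, key=threshold_in_quiet_db)
print(most_sensitive, round(threshold_in_quiet_db(most_sensitive), 2))

# Thresholds at a few spot frequencies, in dB SPL.
for f in (100, 1000, 10000, 16000):
    print(f, round(threshold_in_quiet_db(f), 1))
```

The spot values show why a codec can spend almost nothing on content near the edges of the spectrum: a component has to be far louder at 100 Hz or 16 kHz than at 3 kHz before it is heard at all.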

Bark Scale

Maps frequency to critical bands:

Bark(f) = 13*arctan(0.00076*f) 
         + 3.5*arctan((f/7500)^2)

Critical Bands

  • 24 bands total (Bark scale)
  • ~100 Hz wide below 500 Hz
  • ~20% of center frequency above 500 Hz
  • Sounds within band interact strongly
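As a quick check of the Bark mapping, this short Python sketch evaluates the formula above at a few frequencies. The function name `hz_to_bark` is illustrative.

```python
import math

def hz_to_bark(f_hz: float) -> float:
    """Map frequency in Hz to the Bark scale using the formula above."""
    return 13.0 * math.atan(0.00076 * f_hz) + 3.5 * math.atan((f_hz / 7500.0) ** 2)

for f in (100, 500, 1000, 4000, 10000, 20000):
    print(f"{f:>6} Hz -> {hz_to_bark(f):5.2f} Bark")
```

The mapping is close to linear at low frequencies and compresses toward the top of the range, so the audible spectrum fits into roughly 24 to 25 Bark.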

Auditory Masking: The Key to Perceptual Coding

The most powerful psychoacoustic phenomenon exploited by audio codecs is auditory masking—where a loud sound renders a simultaneously occurring quieter sound inaudible, even if the quieter sound would be audible on its own.

Simultaneous Masking (Frequency Domain)

A loud tone masks quieter tones at nearby frequencies:

Masking Spread Function:
• Asymmetric shape (masking spreads further toward higher frequencies)
• Level-dependent (louder = wider spread)
• Tonal and noise maskers behave differently

Tonal masker at 1 kHz, 60 dB (illustrative levels):
  500 Hz: 15 dB masking
  1 kHz:  60 dB (center)
  2 kHz:  25 dB masking
  4 kHz:  10 dB masking

Temporal Masking (Time Domain)

Masking extends through time:

Pre-masking

  • Duration: ~20 ms
  • Backward masking effect
  • Brain processing delay

Post-masking

  • Duration: 100-200 ms
  • Forward masking effect
  • Exponential decay

Calculating the Global Masking Threshold

MPEG Psychoacoustic Model Process
1. Time-to-Frequency Mapping:
   FFT(audio_frame) → spectrum[1024]

2. Identify Maskers:
   for each spectral_peak:
      if (peak > neighbors + 7dB):
         mark as TONAL masker
      else:
         group as NOISE masker

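3. Calculate Individual Thresholds:
   for each masker:
      // Δz = distance from the masker in Bark; spreading_function(Δz) is ≤ 0 dB
      // masking_offset is larger for tonal maskers than for noise maskers
      threshold = masker_level + spreading_function(Δz) - masking_offset
      
   spreading_function(Δz) = 
      15.81 + 7.5*(Δz + 0.474) 
      - 17.5*sqrt(1 + (Δz + 0.474)^2)

4. Combine Thresholds:
   global_threshold = max(
      absolute_threshold,
      power_sum(individual_masking_thresholds)  // add linear powers, not dB values
   )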

5. Result: Frequency-dependent noise floor
   Any quantization noise below this is inaudible
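Steps 3 and 4 can be sketched in a few lines of Python. This is a simplified illustration, not the MPEG reference model. The spreading function is the one given above, which evaluates to roughly 0 dB at the masker and to negative values away from it, so the sketch adds it to the masker level. The 10 dB `offset_db` is a hypothetical value chosen for the example. For simplicity the sketch also folds the absolute threshold into the power sum, which is an assumption that gives nearly the same result as taking the maximum whenever one term dominates.

```python
import math

def spreading_db(dz: float) -> float:
    """Spreading function from the text; dz = maskee minus masker position in Bark."""
    return 15.81 + 7.5 * (dz + 0.474) - 17.5 * math.sqrt(1.0 + (dz + 0.474) ** 2)

def individual_threshold_db(masker_db: float, dz: float, offset_db: float) -> float:
    """Threshold one masker produces dz Bark away; offset_db is an assumed masking offset."""
    return masker_db + spreading_db(dz) - offset_db

def global_threshold_db(absolute_db: float, individual_db: list) -> float:
    """Combine thresholds by adding linear powers, then convert back to dB."""
    power = 10 ** (absolute_db / 10) + sum(10 ** (t / 10) for t in individual_db)
    return 10 * math.log10(power)

# Hypothetical example: a 60 dB masker with an assumed 10 dB offset.
for dz in (-2, -1, 0, 1, 2):
    print(dz, round(individual_threshold_db(60.0, dz, 10.0), 1))

# Two equal 40 dB thresholds combine to about 43 dB, not 80 dB.
print(round(global_threshold_db(-100.0, [40.0, 40.0]), 2))
```

The printed thresholds fall off more slowly for positive `dz` than for negative `dz`, which is the upward spread of masking described earlier.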

Signal-to-Mask Ratio (SMR)

SMR = Signal_Level - Global_Masking_Threshold
This determines bit allocation: Higher SMR → More bits needed
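As a rough worked example of how SMR drives bit allocation, the Python sketch below uses the common rule of thumb that each bit of a uniform quantizer buys about 6.02 dB of signal-to-noise ratio. Real encoders use iterative rate/distortion loops rather than this one-step rule, and the per-band SMR values are hypothetical.

```python
import math

DB_PER_BIT = 6.02  # approximate SNR gain per bit for a uniform quantizer

def bits_for_band(smr_db: float) -> int:
    """Bits per sample so quantization noise lands at or below the masking threshold."""
    if smr_db <= 0:
        return 0  # the signal itself is below the threshold: nothing audible to code
    return math.ceil(smr_db / DB_PER_BIT)

# Hypothetical per-band SMR values in dB.
for smr in (-5.0, 6.0, 13.0, 30.0):
    print(smr, bits_for_band(smr))
```

Bands whose signal sits below the masking threshold receive no bits at all, which is where much of a perceptual codec's saving comes from.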

The Modified Discrete Cosine Transform (MDCT)

While the psychoacoustic model uses FFT for analysis, the actual audio data transformation uses the Modified Discrete Cosine Transform. The MDCT possesses unique properties that make it exceptionally well-suited for audio compression.

MDCT Mathematical Foundation
// Forward MDCT (2N samples → N coefficients)
X[k] = Σ(n=0 to 2N-1) x[n] * w[n] * 
       cos[π/N * (n + 1/2 + N/2) * (k + 1/2)]

// Inverse MDCT (IMDCT): N coefficients → 2N samples, overlap-added with neighboring blocks
y[n] = (2/N) * w[n] * Σ(k=0 to N-1) X[k] * 
       cos[π/N * (n + 1/2 + N/2) * (k + 1/2)]

// Window function must satisfy Princen-Bradley:
w[n]² + w[n+N]² = 1  (for perfect reconstruction)

Common windows:
• Sine window: w[n] = sin(π(n + 0.5) / (2N))
• Kaiser-Bessel Derived (KBD)
• Vorbis window: w[n] = sin((π/2) * sin²(π(n + 0.5) / (2N)))

MDCT Advantages

  • 50% overlap without redundancy
  • Critical sampling (N new samples → N coefficients per block)
  • Overlap suppresses block-boundary artifacts
  • Perfect reconstruction (before quantization)

TDAC Property

  • Time-Domain Aliasing Cancellation
  • Aliasing in overlaps cancels perfectly
  • Enables seamless reconstruction
  • Key to MDCT's success
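The cancellation can be verified numerically. The sketch below is a direct, unoptimized O(N²) Python transcription of the MDCT and IMDCT formulas above with a sine window, followed by overlap-add at a hop of N samples. Production codecs use FFT-based fast algorithms instead. The helper names and the block size N = 8 are choices made for this illustration.

```python
import math
import random

def sine_window(N):
    """Sine window of length 2N; satisfies w[n]^2 + w[n+N]^2 = 1."""
    return [math.sin(math.pi * (n + 0.5) / (2 * N)) for n in range(2 * N)]

def mdct(block, w):
    """2N windowed samples -> N coefficients."""
    N = len(block) // 2
    return [sum(block[n] * w[n] * math.cos(math.pi / N * (n + 0.5 + N / 2) * (k + 0.5))
                for n in range(2 * N))
            for k in range(N)]

def imdct(X, w):
    """N coefficients -> 2N windowed samples that still contain time-domain aliasing."""
    N = len(X)
    return [(2.0 / N) * w[n] * sum(X[k] * math.cos(math.pi / N * (n + 0.5 + N / 2) * (k + 0.5))
                                   for k in range(N))
            for n in range(2 * N)]

def roundtrip(x, N):
    """Transform overlapping blocks (hop N) and overlap-add the inverse blocks."""
    w = sine_window(N)
    y = [0.0] * len(x)
    for start in range(0, len(x) - 2 * N + 1, N):
        block = imdct(mdct(x[start:start + 2 * N], w), w)
        for n, v in enumerate(block):
            y[start + n] += v
    return y

N = 8
random.seed(0)
x = [random.uniform(-1.0, 1.0) for _ in range(6 * N)]
y = roundtrip(x, N)
# Samples covered by two overlapping blocks are recovered; the first and last N are not.
max_err = max(abs(a - b) for a, b in zip(x[N:-N], y[N:-N]))
print(f"max interior reconstruction error: {max_err:.2e}")
```

Each individual inverse block is visibly wrong on its own. Only the sum of two neighboring blocks restores the signal, which is why the first and last N samples, covered by a single block, are excluded from the comparison.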

Advanced Audio Coding (AAC) Family

AAC, standardized as part of MPEG-2 and MPEG-4, was designed as the successor to MP3. It achieves superior audio quality through several technical improvements, including pure MDCT (eliminating MP3's hybrid filterbank), flexible window sizes, and a more sophisticated psychoacoustic model.

AAC-LC (Low Complexity)

The most widely used AAC profile:

  • Transparent quality at 128-192 kbps stereo
  • Pure MDCT with long (1024-coefficient) and short (128-coefficient) blocks
  • Temporal Noise Shaping (TNS)
  • Perceptual Noise Substitution (PNS)
  • Standard for iTunes, YouTube, broadcasting

ffmpeg -i input.wav -c:a aac -b:a 192k output.m4a

HE-AAC (High-Efficiency AAC)

Optimized for low bitrates with SBR:

Spectral Band Replication (SBR):
• Encode only 0-8 kHz with AAC-LC
• Transmit parametric data for 8-16 kHz
• Reconstruct highs from lows + parameters
• Roughly 30% bitrate reduction at comparable quality

Effective at 48-64 kbps stereo

# libfdk_aac is only available in FFmpeg builds configured with --enable-libfdk-aac
ffmpeg -i input.wav -c:a libfdk_aac -profile:a aac_he -b:a 64k output.m4a

HE-AAC v2

Adds Parametric Stereo (PS) to HE-AAC:

Parametric Stereo (PS):
• Convert stereo → mono downmix
• Extract spatial parameters:
  - Inter-channel Intensity Difference (IID)
  - Inter-channel Phase Difference (IPD)
  - Inter-channel Coherence (ICC)
• Reconstruct stereo from mono + params

Effective at 24-32 kbps stereo

ffmpeg -i input.wav -c:a libfdk_aac -profile:a aac_he_v2 -b:a 32k output.m4a
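To give a feel for the Parametric Stereo parameters, here is a simplified Python sketch that computes a mono downmix, the intensity difference, and the coherence for one block of samples. Real PS encoders compute these per time/frequency tile on subband signals, so this full-band version is only an illustration. The function name `ps_parameters` is hypothetical, and the code assumes both channels have non-zero energy.

```python
import math

def ps_parameters(left, right):
    """Return (mono downmix, IID in dB, ICC) for one block of stereo samples."""
    mono = [(l + r) / 2 for l, r in zip(left, right)]
    e_left = sum(l * l for l in left)
    e_right = sum(r * r for r in right)
    cross = sum(l * r for l, r in zip(left, right))
    iid_db = 10 * math.log10(e_left / e_right)  # level difference between channels
    icc = cross / math.sqrt(e_left * e_right)   # normalized cross-correlation
    return mono, iid_db, icc

# The same tone in both channels, with the right channel at half amplitude.
left = [math.sin(0.1 * n) for n in range(256)]
right = [0.5 * v for v in left]
mono, iid_db, icc = ps_parameters(left, right)
print(round(iid_db, 2), round(icc, 3))
```

For this input the channels are fully coherent and differ only in level, so a decoder could rebuild a plausible stereo image from the mono signal plus two numbers.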

Opus: The Hybrid Codec for Interactive Audio

Opus is a highly versatile, open, and royalty-free audio codec standardized by the IETF as RFC 6716. It was specifically designed for real-time, interactive applications over the internet, where low latency is critical.

Opus Hybrid Architecture
Opus = SILK (speech) + CELT (music)

SILK Mode (Linear Predictive Coding):
• Optimized for speech (narrowband to wideband, up to 8 kHz audio bandwidth)
• Based on Skype's SILK codec
• LPC analysis + residual coding
• Excellent at 6-40 kbps

CELT Mode (MDCT-based):
• Optimized for music (full bandwidth)
• Low-latency MDCT (2.5-20ms frames)
• Constrained energy lapped transform
• Excellent at 48-510 kbps

Hybrid Mode:
• SILK encodes 0-8 kHz
• CELT encodes 8-20 kHz
• Seamless transition
• Optimal for mixed content

Dynamic switching per frame based on content!

Latency

  • Default: 26.5 ms (20 ms frame + 6.5 ms lookahead)
  • Minimum: 5 ms (2.5 ms frame, restricted low-delay mode)
  • Maximum frame size: 60 ms
  • Configurable per use
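The latency figures follow from simple arithmetic: algorithmic delay is the frame duration plus the encoder lookahead. The Python sketch below assumes a lookahead of 6.5 ms in the default configuration and 2.5 ms in restricted low-delay mode. Both values are assumptions stated for illustration, and the function name is hypothetical.

```python
def opus_algorithmic_delay_ms(frame_ms: float, restricted_low_delay: bool = False) -> float:
    """Frame duration plus encoder lookahead, in milliseconds (assumed lookahead values)."""
    lookahead_ms = 2.5 if restricted_low_delay else 6.5
    return frame_ms + lookahead_ms

print(opus_algorithmic_delay_ms(20.0))                            # default 20 ms frames
print(opus_algorithmic_delay_ms(2.5, restricted_low_delay=True))  # smallest frame size
```

Network jitter buffers and audio I/O add further delay on top of this, so end-to-end latency in a real call is higher than the codec figure alone.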

Bitrate Range

  • 6 kbps (narrowband speech)
  • 510 kbps (stereo music)
  • Seamless rate switching
  • VBR/CBR support

Applications

  • WebRTC (default)
  • Discord, WhatsApp
  • Streaming (YouTube)
  • Game voice chat

Codec Comparison and Use Cases

Codec     | Best Quality Range | Latency            | Complexity | Primary Use
MP3       | 128-320 kbps       | High (~100 ms)     | Low        | Legacy compatibility
AAC-LC    | 96-256 kbps        | Moderate (~50 ms)  | Medium     | Streaming, storage
HE-AAC v2 | 24-64 kbps         | Moderate-High      | High       | Low bitrate streaming
Opus      | 6-510 kbps         | Very Low (5-60 ms) | Medium     | Real-time communication
FLAC      | 400-1000 kbps      | Low                | Low        | Lossless archival

FFmpeg Audio Processing Examples

Practical Audio Encoding Commands

# Convert to AAC with quality-based VBR

ffmpeg -i input.wav -c:a aac -q:a 2 output.m4a

# Opus encoding for voice (optimized for speech)

ffmpeg -i input.wav -c:a libopus -b:a 32k -application voip output.opus

# HE-AAC v2 for low-bitrate streaming

ffmpeg -i input.wav -c:a libfdk_aac -profile:a aac_he_v2 -b:a 32k output.m4a

# Multi-codec output for adaptive streaming

ffmpeg -i input.wav \
  -c:a aac -b:a 128k high.m4a \
  -c:a libfdk_aac -profile:a aac_he -b:a 64k medium.m4a \
  -c:a libfdk_aac -profile:a aac_he_v2 -b:a 32k low.m4a

# Analyze audio with loudness measurement

ffmpeg -i input.wav -af loudnorm=print_format=json -f null -

# Apply dynamic range compression, then FFmpeg's psychoacoustic clipper (apsyclip, recent FFmpeg releases only)

ffmpeg -i input.wav -af "acompressor=threshold=0.5,apsyclip" output.wav

Advanced Audio Processing Techniques

Joint Stereo Coding

Techniques for efficient stereo encoding:

M/S Stereo

Mid = (L + R) / 2
Side = (L - R) / 2
• Better for centered content
• Adaptive per frequency band
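The M/S transform is lossless and trivially invertible, as the following minimal Python sketch shows. The function names are illustrative, and the sample values are chosen so the arithmetic is exact in floating point.

```python
def ms_encode(left, right):
    """Convert left/right samples to mid/side using the formulas above."""
    mid = [(l + r) / 2 for l, r in zip(left, right)]
    side = [(l - r) / 2 for l, r in zip(left, right)]
    return mid, side

def ms_decode(mid, side):
    """Invert the transform: L = Mid + Side, R = Mid - Side."""
    left = [m + s for m, s in zip(mid, side)]
    right = [m - s for m, s in zip(mid, side)]
    return left, right

# Mostly centered content: the channels differ only in the last sample.
left = [0.5, -0.25, 1.0]
right = [0.5, -0.25, 0.75]
mid, side = ms_encode(left, right)
print(mid, side)
print(ms_decode(mid, side) == (left, right))
```

For centered content the side channel is zero or close to it, so it costs very few bits to code. That is why encoders switch M/S on only in bands where it pays off.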

Intensity Stereo

High freq: mono + panning
• Above 2-4 kHz typically
• Phase insensitive region
• Major bitrate savings

Temporal Noise Shaping (TNS)

Controls temporal distribution of quantization noise:

  • Applies predictive coding in frequency domain
  • Reduces pre-echo artifacts around transients
  • Shapes noise to follow signal envelope
  • Critical for percussion and speech

Perceptual Noise Substitution (PNS)

Replaces noise-like signals with synthetic noise:

  • Detects noise-like spectral regions
  • Transmits only energy level
  • Decoder generates matching noise
  • Huge savings for applause, rain, etc.
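The idea behind PNS can be sketched in Python as follows. This is a toy illustration of the principle, not the AAC bitstream syntax: the "encoder" keeps a single energy value for a noise-like band, and the "decoder" synthesizes fresh random values scaled to that energy. The function names and the fixed seed are assumptions made for the example.

```python
import math
import random

def pns_encode(band):
    """Encoder side: keep only the band's total energy."""
    return sum(v * v for v in band)

def pns_decode(energy, length, seed=0):
    """Decoder side: generate random noise and scale it to the transmitted energy."""
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in range(length)]
    scale = math.sqrt(energy / sum(v * v for v in noise))
    return [v * scale for v in noise]

# A hypothetical noise-like band of 64 spectral coefficients.
rng = random.Random(42)
band = [rng.gauss(0.0, 0.3) for _ in range(64)]
energy = pns_encode(band)
decoded = pns_decode(energy, len(band))
print(round(energy, 4), round(pns_encode(decoded), 4))
```

The decoded coefficients differ from the originals value by value, but they carry the same energy. For genuinely noise-like content, energy is most of what the ear registers.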

Quality Metrics and Evaluation

Objective and Subjective Metrics
Objective Metrics:
• SNR: Signal-to-Noise Ratio (limited correlation)
• PEAQ: Perceptual Evaluation of Audio Quality (ITU-R BS.1387)
• PEMO-Q: Perceptual Model Quality assessment
• ViSQOL: Virtual Speech Quality Objective Listener
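A small Python sketch shows why plain SNR correlates poorly with perceived quality. Two decoded signals with the same total error energy get the same SNR, even though one spreads the error evenly and the other concentrates it in a short burst. The signals and error values are hypothetical, chosen only for this example.

```python
import math

def snr_db(reference, decoded):
    """Signal-to-noise ratio in dB between a reference and a decoded signal."""
    signal = sum(r * r for r in reference)
    noise = sum((r - d) ** 2 for r, d in zip(reference, decoded))
    return 10 * math.log10(signal / noise)

reference = [math.sin(2 * math.pi * 5 * n / 200) for n in range(200)]
# Same total error energy: spread evenly vs. concentrated in a two-sample burst.
spread = [r + 0.01 for r in reference]
burst = [r + (0.1 if n < 2 else 0.0) for n, r in enumerate(reference)]
print(round(snr_db(reference, spread), 2), round(snr_db(reference, burst), 2))
```

SNR cannot tell these two cases apart, whereas a listener, or a perceptual metric that models masking, may judge them very differently.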

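Subjective Testing:
• ITU-R BS.1116: double-blind, triple-stimulus test with hidden reference, for small impairments
  (5-point impairment scale: Imperceptible → Very annoying)
• MUSHRA (ITU-R BS.1534): MUltiple Stimuli with Hidden Reference and Anchor, for intermediate quality
• ABX: Double-blind comparison testing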

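Approximate Transparency Thresholds (content- and listener-dependent):
• AAC-LC: ~128 kbps (stereo, typical content)
• Opus: ~96 kbps (stereo, music mode)
• HE-AAC: ~64 kbps gives good quality (stereo, with SBR), though parametric SBR is rarely fully transparent
• Critical content may require +50% bitrate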

Future Directions in Audio Compression

Emerging Technologies

  • Neural Audio Codecs: End-to-end learned compression with neural networks, achieving unprecedented quality at very low bitrates (3-6 kbps).
  • Object-Based Audio: MPEG-H 3D Audio and Dolby Atmos for immersive, spatially-aware compression.
  • Semantic Audio Coding: Separate and code different sources (voice, music, effects) with specialized models.
  • AI-Enhanced Psychoacoustics: Machine learning models that adapt to individual hearing profiles and preferences.
