Audio Feature Extraction

Understanding spectral, temporal, and timbral analysis for AI detection

Why Audio Analysis Matters

Audio feature extraction analyzes the acoustic properties of music to identify patterns characteristic of AI generation. v9.3 uses librosa, a professional-grade Python library for audio analysis.

Key Insight: AI-generated music has distinctive audio signatures that differ from human-created music, even when it sounds similar to the human ear.

Spectral Features

Spectral features analyze the frequency content of audio signals.

Spectral Centroid

The "center of mass" of the audio spectrum.

  • What it measures: Average frequency weighted by amplitude
  • Human music: Variable centroid (1500-4000 Hz typical)
  • AI music: More consistent centroid (lower variance)
  • Indicator: Low variance (<300) suggests AI
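The centroid has a simple closed form: the magnitude-weighted mean of the bin frequencies. Below is a minimal numpy sketch of that definition (not v9.3's actual code, which uses `librosa.feature.spectral_centroid()` per frame):

```python
import numpy as np

def spectral_centroid(signal, sr):
    """Amplitude-weighted mean frequency of the magnitude spectrum."""
    magnitude = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sr)
    return np.sum(freqs * magnitude) / np.sum(magnitude)

# A pure 1 kHz sine puts all of its energy at 1000 Hz,
# so its centroid lands at 1000 Hz.
sr = 22050
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 1000 * t)
```

librosa computes the same quantity per STFT frame; the AI indicator then looks at the variance of that per-frame series, not a single whole-signal value as above.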

Spectral Rolloff

The frequency below which 85% of the total spectral energy is concentrated.

  • What it measures: Distribution of energy across frequencies
  • Human music: Natural rolloff patterns with variation
  • AI music: Smooth, predictable rolloff
  • Typical range: 3000-8000 Hz
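Rolloff is the point where the cumulative power spectrum first reaches 85% of the total. A small illustrative sketch (the real pipeline uses `librosa.feature.spectral_rolloff()` frame by frame):

```python
import numpy as np

def spectral_rolloff(signal, sr, pct=0.85):
    """Lowest frequency below which `pct` of the spectral energy lies."""
    power = np.abs(np.fft.rfft(signal)) ** 2
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sr)
    cumulative = np.cumsum(power)
    idx = np.searchsorted(cumulative, pct * cumulative[-1])
    return freqs[idx]

sr = 22050
t = np.arange(sr) / sr
# Two tones: 500 Hz carries ~96% of the energy, 4000 Hz only ~4%,
# so the 85% rolloff point sits at the 500 Hz component.
sig = np.sin(2 * np.pi * 500 * t) + 0.2 * np.sin(2 * np.pi * 4000 * t)
```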

Spectral Flatness

A measure of how noise-like versus tone-like the audio is.

  • What it measures: Spectral energy distribution uniformity
  • Human music: Higher flatness (acoustic instruments = more noise)
  • AI music: Lower flatness (<0.05 = more tonal, synthetic)
  • Range: 0 (pure tone) to 1 (white noise)
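Flatness is defined as the ratio of the geometric mean to the arithmetic mean of the power spectrum. The sketch below illustrates the two extremes of the range (librosa's `librosa.feature.spectral_flatness()` computes this per frame, so exact values will differ):

```python
import numpy as np

def spectral_flatness(signal):
    """Geometric mean / arithmetic mean of the power spectrum (0..1)."""
    power = np.abs(np.fft.rfft(signal)) ** 2 + 1e-12  # floor avoids log(0)
    geometric = np.exp(np.mean(np.log(power)))
    return geometric / np.mean(power)

rng = np.random.default_rng(0)
noise = rng.standard_normal(22050)        # white noise -> high flatness
t = np.arange(22050) / 22050
tone = np.sin(2 * np.pi * 1000 * t)       # pure tone -> flatness near 0
```

Note that even ideal white noise does not score exactly 1 on a finite sample; what matters for detection is the contrast between noisy (acoustic) and tonal (synthetic) material.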

Temporal Features

Zero-Crossing Rate (ZCR)

How often the audio signal changes sign, crossing zero in either direction.

  • What it measures: Signal oscillation frequency
  • Human music: Variable ZCR with natural fluctuations
  • AI music: Consistent ZCR (lower standard deviation)
  • Indicator: Very consistent ZCR suggests synthetic generation
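ZCR is just the fraction of adjacent sample pairs whose signs differ. A minimal numpy version (v9.3 uses `librosa.feature.zero_crossing_rate()`, which reports this per frame):

```python
import numpy as np

def zero_crossing_rate(signal):
    """Fraction of consecutive-sample pairs where the signal changes sign."""
    signs = np.signbit(signal).astype(int)
    return np.mean(np.abs(np.diff(signs)))

# A 100 Hz sine crosses zero ~200 times per second, so the per-pair
# rate is roughly 2 * f / sr.
sr = 22050
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 100 * t)
```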

Tempo Detection

Beat tracking and rhythm analysis.

  • What it measures: Beats per minute (BPM) and rhythmic consistency
  • Human music: Slight tempo variations (natural performance)
  • AI music: Perfectly quantized tempo (machine precision)
  • Analysis: Tempo stability over time
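Real beat tracking (`librosa.beat.beat_track()`) is considerably more involved, but the stability check itself is simple once beat onset times are known. A sketch assuming onsets have already been detected:

```python
import numpy as np

def estimate_bpm(onset_times):
    """BPM from the mean interval between detected beat onsets."""
    intervals = np.diff(onset_times)
    return 60.0 / np.mean(intervals)

# A perfectly quantized click track at 120 BPM: one onset every 0.5 s.
quantized = np.arange(0, 10, 0.5)

# A human-like performance: the same beats with slight timing jitter.
rng = np.random.default_rng(1)
human = quantized + rng.normal(0, 0.01, size=quantized.size)
```

Both tracks estimate to roughly 120 BPM, but the standard deviation of the inter-beat intervals separates them: effectively zero for the quantized track, small but clearly nonzero for the jittered one.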

RMS Energy

Root Mean Square energy measures loudness over time.

  • What it measures: Signal amplitude and dynamics
  • Human music: Wide dynamic range (>0.05 range)
  • AI music: Compressed dynamics (<0.05 range)
  • Indicator: Very consistent loudness suggests AI compression
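The dynamic-range indicator is the spread of frame-wise RMS values. A crude stand-in for `librosa.feature.rms()` (frame and hop lengths below are illustrative defaults, not confirmed v9.3 settings):

```python
import numpy as np

def rms_frames(signal, frame_length=2048, hop_length=512):
    """Frame-wise RMS energy over a sliding window."""
    n_frames = 1 + (len(signal) - frame_length) // hop_length
    return np.array([
        np.sqrt(np.mean(signal[i * hop_length : i * hop_length + frame_length] ** 2))
        for i in range(n_frames)
    ])

sr = 22050
t = np.arange(2 * sr) / sr
# Amplitude ramps from quiet to loud: a wide dynamic range (> 0.05).
dynamic = np.linspace(0.05, 0.5, t.size) * np.sin(2 * np.pi * 440 * t)
rms = rms_frames(dynamic)
dynamic_range = rms.max() - rms.min()
```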

Timbral Features (MFCCs)

Mel-Frequency Cepstral Coefficients (MFCCs) capture the timbral texture of audio - essentially the "color" or "quality" of the sound.

What are MFCCs?

MFCCs represent the shape of the spectral envelope, which characterizes timbre.

  • Extraction: 13 coefficients per audio frame
  • Purpose: Capture unique timbral signature
  • Analysis: Mean and standard deviation across time

MFCC Analysis for AI Detection

  • Human music: Variable MFCCs (natural instrument variations)
  • AI music: Uniform MFCCs (consistent synthetic timbre)
  • Standard Deviation: <10 suggests AI generation
  • Pattern: AI shows repetitive timbral patterns

Why MFCCs Matter: They're extremely sensitive to subtle differences in sound production. Human performances and acoustic instruments produce more variable MFCCs than synthesized sounds.
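Computing MFCCs from scratch is involved (mel filterbank plus DCT; `librosa.feature.mfcc()` handles it), but the indicator itself operates on the per-coefficient standard deviation across frames. A sketch of that check, fed with random stand-in matrices (NOT real MFCCs) shaped like librosa's 13 × n_frames output:

```python
import numpy as np

def mfcc_uniformity_indicator(mfccs, threshold=10.0):
    """Flag low timbral variation: mean per-coefficient std below threshold."""
    per_coeff_std = mfccs.std(axis=1)  # spread of each coefficient over time
    mean_std = float(per_coeff_std.mean())
    return mean_std, mean_std < threshold

rng = np.random.default_rng(2)
varied  = rng.normal(0, 15, size=(13, 500))  # wide spread: human-like
uniform = rng.normal(0, 5,  size=(13, 500))  # narrow spread: synthetic-like
```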

Chroma Features

Chroma features represent the 12 pitch classes (C, C#, D, D#, E, F, F#, G, G#, A, A#, B) in music.

Chromagram Analysis

  • What it measures: Distribution of energy across 12 semitones
  • Purpose: Harmonic and melodic content analysis
  • Human music: Natural key modulations and harmonic complexity
  • AI music: Sometimes simpler harmonic structures
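A chromagram folds spectral energy into the 12 pitch classes regardless of octave. The simplified single-column sketch below maps each FFT bin to its nearest pitch class via the MIDI note scale (`librosa.feature.chroma_stft()` does this per frame with proper windowing):

```python
import numpy as np

PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def chroma_vector(signal, sr, fmin=55.0):
    """Fold spectral energy into 12 pitch classes (crude chromagram column)."""
    power = np.abs(np.fft.rfft(signal)) ** 2
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sr)
    valid = freqs >= fmin  # ignore DC and sub-bass bins
    # MIDI note 69 is A4 = 440 Hz; note % 12 gives the pitch class (C = 0).
    midi = 69 + 12 * np.log2(freqs[valid] / 440.0)
    classes = np.round(midi).astype(int) % 12
    chroma = np.zeros(12)
    np.add.at(chroma, classes, power[valid])
    return chroma / chroma.max()

sr = 22050
t = np.arange(sr) / sr
a4 = np.sin(2 * np.pi * 440 * t)  # concert A -> energy lands in class "A"
```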

Harmonic Complexity

AI-generated music may show:

  • Simpler chord progressions
  • Less adventurous modulations
  • More predictable harmonic patterns
  • Limited use of jazz/extended harmony

AI Detection Indicators Summary

Low Spectral Variance

A spectral-centroid variance below 300 indicates an unusually uniform frequency distribution.

Compressed Dynamics

An RMS energy range below 0.05 indicates over-compressed dynamics.

Low Spectral Flatness

A mean flatness below 0.05 indicates overly tonal, synthetic generation.

MFCC Uniformity

A mean MFCC standard deviation below 10 suggests a consistent synthetic timbre.
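The four thresholds above can be combined into a simple rule-of-thumb count. This is only a sketch of the idea; v9.3's actual feature weighting is not documented here. The example values come from the worked examples later on this page:

```python
def ai_indicator_count(features):
    """Count how many of the four rule-of-thumb AI indicators fire."""
    checks = [
        features["centroid_variance"] < 300,   # uniform frequency distribution
        features["rms_range"]         < 0.05,  # compressed dynamics
        features["spectral_flatness"] < 0.05,  # synthetic tonal generation
        features["mfcc_std"]          < 10,    # uniform timbre
    ]
    return sum(checks)

human = {"centroid_variance": 842, "rms_range": 0.12,
         "spectral_flatness": 0.08, "mfcc_std": 15.3}
ai    = {"centroid_variance": 180, "rms_range": 0.03,
         "spectral_flatness": 0.03, "mfcc_std": 6.8}
```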

Important Disclaimer

No single indicator is definitive. v9.3 combines multiple features with AI analysis for comprehensive assessment. Always consider confidence scores and context.

Technical Implementation

Audio Processing Pipeline

  1. Download: Fetch 30-second preview from Deezer (15s timeout)
  2. Load: Load audio with librosa (mono, 22050 Hz sample rate)
  3. Extract: Compute all feature vectors
  4. Aggregate: Calculate mean and standard deviation
  5. Analyze: Compare against AI detection thresholds
  6. Report: Return features with indicators
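Step 4 of the pipeline collapses each per-frame feature track into summary statistics. A minimal sketch of that aggregation step (the surrounding download/load/extract steps are omitted since they depend on librosa and the Deezer API):

```python
import numpy as np

def aggregate(feature_frames):
    """Step 4: collapse a per-frame feature track to (mean, std)."""
    arr = np.asarray(feature_frames, dtype=float)
    return {"mean": float(arr.mean()), "std": float(arr.std())}

# e.g. four frame-wise RMS values from step 3
stats = aggregate([0.10, 0.12, 0.08, 0.11])
```

The resulting mean/std pairs are what the threshold checks in step 5 operate on.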

librosa Functions Used

  • librosa.feature.spectral_centroid()
  • librosa.feature.spectral_rolloff()
  • librosa.feature.spectral_flatness()
  • librosa.feature.zero_crossing_rate()
  • librosa.beat.beat_track()
  • librosa.feature.mfcc()
  • librosa.feature.chroma_stft()
  • librosa.feature.rms()

Interpreting Audio Feature Results

Example: Human-Created Track

  • Spectral Centroid: 2847 Hz (mean), 842 Hz (std) - High variance ✓
  • Spectral Flatness: 0.08 (mean) - Higher flatness ✓
  • Zero-Crossing Rate: 0.12 (mean), 0.04 (std) - Natural variation ✓
  • MFCC Std: 15.3 - High timbral variation ✓
  • RMS Energy Range: 0.12 - Wide dynamic range ✓

Conclusion: Multiple indicators suggest human creation

Example: AI-Generated Track

  • Spectral Centroid: 2100 Hz (mean), 180 Hz (std) - Low variance
  • Spectral Flatness: 0.03 (mean) - Low flatness
  • Zero-Crossing Rate: 0.09 (mean), 0.01 (std) - Very consistent
  • MFCC Std: 6.8 - Low timbral variation
  • RMS Energy Range: 0.03 - Compressed dynamics

Conclusion: Multiple indicators suggest AI generation

Related Topics