Audio Feature Extraction
Understanding spectral, temporal, and timbral analysis for AI detection
Why Audio Analysis Matters
Audio feature extraction analyzes the acoustic properties of music to identify patterns characteristic of AI generation. v9.3 uses librosa, a professional-grade Python library for audio analysis.
Key Insight: AI-generated music has distinctive audio signatures that differ from human-created music, even when it sounds similar to the human ear.
Spectral Features
Spectral features analyze the frequency content of audio signals.
Spectral Centroid
The "center of mass" of the audio spectrum.
- What it measures: Average frequency weighted by amplitude
- Human music: Variable centroid (1500-4000 Hz typical)
- AI music: More consistent centroid (lower variance)
- Indicator: Low variance (<300) suggests AI
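The centroid is just the magnitude-weighted mean of the FFT bin frequencies. A minimal numpy sketch of the formula (librosa's actual implementation adds windowing and frame-by-frame analysis):

```python
import numpy as np

def spectral_centroid(frame, sr):
    """Magnitude-weighted mean frequency of a single frame."""
    mag = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    return np.sum(freqs * mag) / (np.sum(mag) + 1e-10)

sr = 22050
t = np.arange(sr) / sr
# A pure 440 Hz tone: the centroid should land at roughly 440 Hz
tone = np.sin(2 * np.pi * 440 * t)
centroid = spectral_centroid(tone, sr)
```

Computing this per frame and taking the variance across frames yields the low-variance indicator described above.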
Spectral Rolloff
The frequency below which 85% of the spectral energy is concentrated.
- What it measures: Distribution of energy across frequencies
- Human music: Natural rolloff patterns with variation
- AI music: Smooth, predictable rolloff
- Typical range: 3000-8000 Hz
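Rolloff is found by accumulating spectral energy from low to high frequency and stopping at the 85% mark. A sketch under the same single-frame simplification as above:

```python
import numpy as np

def spectral_rolloff(frame, sr, pct=0.85):
    """Frequency below which `pct` of the spectral energy lies."""
    energy = np.abs(np.fft.rfft(frame)) ** 2
    cum = np.cumsum(energy)
    idx = np.searchsorted(cum, pct * cum[-1])
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    return freqs[idx]

sr = 22050
t = np.arange(sr) / sr
# 440 Hz carries ~92% of the energy, 5000 Hz only ~8%,
# so the 85% rolloff point sits at the 440 Hz bin
y = np.sin(2 * np.pi * 440 * t) + 0.3 * np.sin(2 * np.pi * 5000 * t)
rolloff = spectral_rolloff(y, sr)
```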
Spectral Flatness
Measure of how noise-like versus tone-like the audio is.
- What it measures: Spectral energy distribution uniformity
- Human music: Higher flatness (acoustic instruments add noise-like energy)
- AI music: Lower flatness (<0.05 = more tonal, synthetic)
- Range: 0 (pure tone) to 1 (white noise)
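Flatness is the ratio of the geometric mean to the arithmetic mean of the power spectrum: a single dominant tone drags the geometric mean (and hence the ratio) toward zero, while noise keeps the two means comparable. A numpy sketch:

```python
import numpy as np

def spectral_flatness(frame):
    """Geometric mean / arithmetic mean of the power spectrum."""
    power = np.abs(np.fft.rfft(frame)) ** 2 + 1e-12
    geo = np.exp(np.mean(np.log(power)))
    return geo / np.mean(power)

sr = 22050
rng = np.random.default_rng(0)
noise = rng.standard_normal(sr)
tone = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)
flat_noise = spectral_flatness(noise)  # noise-like: much higher
flat_tone = spectral_flatness(tone)    # tonal: near zero
```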
Temporal Features
Zero-Crossing Rate (ZCR)
How often the audio signal changes sign, i.e. crosses zero in either direction.
- What it measures: Signal oscillation frequency
- Human music: Variable ZCR with natural fluctuations
- AI music: Consistent ZCR (lower standard deviation)
- Indicator: Very consistent ZCR suggests synthetic generation
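ZCR is simply the fraction of adjacent sample pairs whose signs differ. A sketch (librosa computes this per frame over a hop grid):

```python
import numpy as np

def zero_crossing_rate(frame):
    """Fraction of consecutive samples whose sign differs."""
    signs = np.signbit(frame)
    return np.mean(signs[1:] != signs[:-1])

sr = 22050
t = np.arange(sr) / sr
# A 441 Hz sine crosses zero twice per cycle: ~882 crossings/second,
# so the per-sample rate is about 882 / 22050 = 0.04
y = np.sin(2 * np.pi * 441 * t)
rate = zero_crossing_rate(y)
```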
Tempo Detection
Beat tracking and rhythm analysis.
- What it measures: Beats per minute (BPM) and rhythmic consistency
- Human music: Slight tempo variations (natural performance)
- AI music: Perfectly quantized tempo (machine precision)
- Analysis: Tempo stability over time
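v9.3 uses librosa's beat tracker for this step; the core idea can be sketched from scratch by autocorrelating a coarse energy envelope and converting the dominant lag to BPM. This is a deliberately crude estimator, not librosa's algorithm:

```python
import numpy as np

def estimate_bpm(y, sr, frame=256):
    """Crude tempo estimate: autocorrelate a coarse energy envelope."""
    n = len(y) // frame
    env = np.array([np.sum(y[i*frame:(i+1)*frame] ** 2) for i in range(n)])
    env -= env.mean()
    ac = np.correlate(env, env, mode="full")[n - 1:]  # lags 0..n-1
    env_sr = sr / frame
    # Search only 90-180 BPM to sidestep half/double-tempo ambiguity
    lo = int(env_sr * 60 / 180)
    hi = int(env_sr * 60 / 90)
    lag = lo + int(np.argmax(ac[lo:hi + 1]))
    return 60.0 * env_sr / lag

sr = 22050
# Synthetic click track at 120 BPM: one short click every 0.5 s
y = np.zeros(sr * 8)
for beat in np.arange(0.0, 8.0, 0.5):
    i = int(beat * sr)
    y[i:i + 100] = 1.0
bpm = estimate_bpm(y, sr)
```

The "machine precision" indicator comes from repeating such an estimate over windows of the track and checking how much the local tempo drifts.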
RMS Energy
Root Mean Square energy measures loudness over time.
- What it measures: Signal amplitude and dynamics
- Human music: Wide dynamic range (>0.05 range)
- AI music: Compressed dynamics (<0.05 range)
- Indicator: Very consistent loudness suggests AI compression
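The dynamic-range indicator is the spread of frame-wise RMS values. A sketch with illustrative frame/hop sizes:

```python
import numpy as np

def rms_energy(y, frame=2048, hop=512):
    """Frame-wise RMS: one loudness value per hop."""
    starts = range(0, len(y) - frame + 1, hop)
    return np.array([np.sqrt(np.mean(y[s:s + frame] ** 2)) for s in starts])

sr = 22050
t = np.arange(sr * 2) / sr
# Sine whose amplitude ramps 0.1 -> 0.9: a wide dynamic range
y = np.linspace(0.1, 0.9, t.size) * np.sin(2 * np.pi * 220 * t)
rms = rms_energy(y)
dyn_range = rms.max() - rms.min()  # well above the 0.05 AI threshold
```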
Timbral Features (MFCCs)
Mel-Frequency Cepstral Coefficients (MFCCs) capture the timbral texture of audio, essentially the "color" or "quality" of the sound.
What are MFCCs?
MFCCs represent the shape of the spectral envelope, which characterizes timbre.
- Extraction: 13 coefficients per audio frame
- Purpose: Capture unique timbral signature
- Analysis: Mean and standard deviation across time
MFCC Analysis for AI Detection
- Human music: Variable MFCCs (natural instrument variations)
- AI music: Uniform MFCCs (consistent synthetic timbre)
- Standard Deviation: <10 suggests AI generation
- Pattern: AI shows repetitive timbral patterns
Why MFCCs Matter: They're extremely sensitive to subtle differences in sound production. Human performances and acoustic instruments produce more variable MFCCs than synthesized sounds.
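The standard MFCC recipe is: power spectrum, then a triangular mel filterbank, then log, then a DCT-II, keeping the first 13 coefficients. A compact single-frame numpy sketch (librosa adds windowing, framing, and several tuning options):

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_frame(frame, sr, n_mels=26, n_mfcc=13):
    """MFCCs for one frame: power spectrum -> mel filterbank -> log -> DCT-II."""
    power = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    # Triangular filters spaced evenly on the mel scale
    mel_pts = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2))
    fb = np.zeros((n_mels, len(freqs)))
    for i in range(n_mels):
        lo, mid, hi = mel_pts[i], mel_pts[i + 1], mel_pts[i + 2]
        up = (freqs - lo) / (mid - lo)
        down = (hi - freqs) / (hi - mid)
        fb[i] = np.maximum(0.0, np.minimum(up, down))
    log_mel = np.log(fb @ power + 1e-10)
    # DCT-II of the log-mel energies; keep the first n_mfcc coefficients
    n = np.arange(n_mels)
    basis = np.cos(np.pi * np.outer(np.arange(n_mfcc), n + 0.5) / n_mels)
    return basis @ log_mel

sr = 22050
frame = np.sin(2 * np.pi * 440 * np.arange(2048) / sr)
coeffs = mfcc_frame(frame, sr)  # 13 coefficients for this frame
```

Running this over every frame and taking the standard deviation of each coefficient across time produces the MFCC-uniformity indicator above.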
Chroma Features
Chroma features represent the 12 pitch classes (C, C#, D, D#, E, F, F#, G, G#, A, A#, B) in music.
Chromagram Analysis
- What it measures: Distribution of energy across 12 semitones
- Purpose: Harmonic and melodic content analysis
- Human music: Natural key modulations and harmonic complexity
- AI music: Sometimes simpler harmonic structures
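Conceptually, a chromagram folds every FFT bin onto its nearest pitch class, octave-independent. A simplified single-frame sketch (librosa's chroma_stft uses a tuned filterbank rather than this nearest-note rounding):

```python
import numpy as np

def chroma_frame(frame, sr):
    """Fold spectral energy onto the 12 pitch classes (C=0 ... B=11)."""
    mag = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    chroma = np.zeros(12)
    for f, m in zip(freqs[1:], mag[1:]):  # skip the DC bin
        midi = 69 + 12 * np.log2(f / 440.0)  # MIDI 69 = A4 = 440 Hz
        chroma[int(round(midi)) % 12] += m
    return chroma / (chroma.sum() + 1e-10)

sr = 22050
t = np.arange(sr) / sr
# An A-major triad: A (440 Hz), C# (554.37 Hz), E (659.26 Hz)
y = sum(np.sin(2 * np.pi * f * t) for f in (440.0, 554.37, 659.26))
vec = chroma_frame(y, sr)
# The A (9), C# (1), and E (4) pitch classes dominate the vector
```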
Harmonic Complexity
AI-generated music may show:
- Simpler chord progressions
- Less adventurous modulations
- More predictable harmonic patterns
- Limited use of jazz/extended harmony
AI Detection Indicators Summary
Low Spectral Variance
<300 spectral centroid variance indicates an unnaturally consistent frequency distribution
Compressed Dynamics
<0.05 RMS range shows over-compression
Low Spectral Flatness
<0.05 indicates synthetic tonal generation
MFCC Uniformity
Std <10 suggests consistent synthetic timbre
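Taken together, the four thresholds above can be checked mechanically. A sketch where the feature dictionary keys and values are illustrative, not v9.3's actual schema; in practice these counts feed into a broader AI analysis rather than serving as a verdict on their own:

```python
def ai_indicator_count(features):
    """Count how many of the four AI-detection thresholds a track crosses."""
    checks = [
        features["spectral_centroid_var"] < 300,    # low spectral variance
        features["rms_range"] < 0.05,               # compressed dynamics
        features["spectral_flatness_mean"] < 0.05,  # synthetic tonal generation
        features["mfcc_std"] < 10,                  # uniform timbre
    ]
    return sum(checks)

# Illustrative values chosen so that all four indicators fire
suspect = {
    "spectral_centroid_var": 180.0,
    "rms_range": 0.03,
    "spectral_flatness_mean": 0.03,
    "mfcc_std": 6.8,
}
score = ai_indicator_count(suspect)  # 4 of 4 thresholds crossed
```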
Important Disclaimer
No single indicator is definitive. v9.3 combines multiple features with AI analysis for comprehensive assessment. Always consider confidence scores and context.
Technical Implementation
Audio Processing Pipeline
- Download: Fetch 30-second preview from Deezer (15s timeout)
- Load: Load audio with librosa (mono, 22050 Hz sample rate)
- Extract: Compute all feature vectors
- Aggregate: Calculate mean and standard deviation
- Analyze: Compare against AI detection thresholds
- Report: Return features with indicators
librosa Functions Used
- librosa.feature.spectral_centroid()
- librosa.feature.spectral_rolloff()
- librosa.feature.spectral_flatness()
- librosa.feature.zero_crossing_rate()
- librosa.beat.beat_track()
- librosa.feature.mfcc()
- librosa.feature.chroma_stft()
- librosa.feature.rms()
Interpreting Audio Feature Results
Example: Human-Created Track
- Spectral Centroid: 2847 Hz (mean), 842 Hz (std) - High variance ✓
- Spectral Flatness: 0.08 (mean) - Higher flatness ✓
- Zero-Crossing Rate: 0.12 (mean), 0.04 (std) - Natural variation ✓
- MFCC Std: 15.3 - High timbral variation ✓
- RMS Energy Range: 0.12 - Wide dynamic range ✓
Conclusion: Multiple indicators suggest human creation
Example: AI-Generated Track
- Spectral Centroid: 2100 Hz (mean), 180 Hz (std) - Low variance
- Spectral Flatness: 0.03 (mean) - Low flatness
- Zero-Crossing Rate: 0.09 (mean), 0.01 (std) - Very consistent
- MFCC Std: 6.8 - Low timbral variation
- RMS Energy Range: 0.03 - Compressed dynamics
Conclusion: Multiple indicators suggest AI generation