Audio Fundamentals - Voice & Audio

Audio forensics begins with understanding digital audio at a technical level. This module covers the fundamentals of digital audio representation, frequency-domain analysis, and the tools you need to analyze audio content for signs of synthesis, manipulation, or tampering.

Key takeaway: Audio forensics operates primarily in the frequency domain — analyzing spectrograms rather than waveforms. Understanding sampling rates, bit depth, frequency components, and compression artifacts gives you the foundation to detect voice cloning, audio splicing, and synthesis artifacts.

Digital Audio Basics

Sound is a continuous wave of air pressure variations. Digital audio represents this wave by measuring ("sampling") the pressure level at regular intervals and storing each measurement as a number. Two parameters define the quality of this representation.

Sample Rate

The number of times per second the audio is measured. CD quality is 44,100 Hz (44.1 kHz), meaning 44,100 measurements per second. By the Nyquist-Shannon sampling theorem, a given sample rate can represent frequencies only up to half that rate, so 44.1 kHz audio captures content up to 22,050 Hz, which covers the nominal range of human hearing (roughly 20 Hz to 20 kHz).

Common rates: 8 kHz (phone), 16 kHz (voice), 44.1 kHz (CD), 48 kHz (video), 96 kHz (studio)
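The half-the-sample-rate limit is easy to check numerically. The sketch below lists the Nyquist limit for the common rates above and shows what happens to a tone above that limit if no anti-aliasing filter is applied: it folds back to a lower frequency. The function names are illustrative, not taken from any library.

```python
def nyquist_hz(sample_rate_hz: float) -> float:
    """Highest frequency a given sample rate can represent."""
    return sample_rate_hz / 2.0


def alias_frequency_hz(tone_hz: float, sample_rate_hz: float) -> float:
    """Apparent frequency of a pure tone after sampling.

    Assumes an ideal sampler with no anti-aliasing filter in front of it.
    Tones at or below the Nyquist limit come back unchanged.
    """
    folded = tone_hz % sample_rate_hz
    return min(folded, sample_rate_hz - folded)


if __name__ == "__main__":
    for rate in (8_000, 16_000, 44_100, 48_000, 96_000):
        print(f"{rate:>6} Hz sample rate -> content up to {nyquist_hz(rate):>8.1f} Hz")

    # A 5 kHz tone on an 8 kHz phone channel exceeds the 4 kHz limit
    # and shows up as a 3 kHz tone instead.
    print(alias_frequency_hz(5_000, 8_000))
```

For forensic work, the practical point is that a recording's sample rate puts a hard ceiling on the frequencies you can expect to see in its spectrogram.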

Bit Depth

How precisely each sample is measured. 16-bit audio has 65,536 possible values per sample, giving ~96 dB of dynamic range. 24-bit audio has about 16.7 million values (~144 dB). Each additional bit adds roughly 6 dB, so higher bit depth means a lower quantization noise floor and more detail in quiet passages.

Common depths: 8-bit (low quality), 16-bit (CD), 24-bit (studio), 32-bit float (processing)
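The value counts and dynamic-range figures quoted above follow from two small formulas: an N-bit integer sample has 2^N levels, and the ratio between the largest and smallest step is 20·log10(2^N) dB. A minimal sketch, assuming linear integer PCM (the function names are illustrative, and 32-bit float is left out because it does not follow this formula):

```python
import math


def quantization_levels(bits: int) -> int:
    """Number of distinct values an N-bit integer sample can take."""
    return 2 ** bits


def dynamic_range_db(bits: int) -> float:
    """Theoretical dynamic range of N-bit linear PCM, in decibels."""
    return 20 * math.log10(2 ** bits)


if __name__ == "__main__":
    for bits in (8, 16, 24):
        print(
            f"{bits:>2}-bit: {quantization_levels(bits):>10,} levels, "
            f"about {dynamic_range_db(bits):.1f} dB"
        )
```

These are theoretical ceilings. The usable dynamic range of a real recording is normally lower, because microphones, preamps, and rooms add their own noise.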

Understanding Spectrograms

A spectrogram is the forensic analyst's primary visualization tool. It displays three dimensions of audio simultaneously: time (horizontal axis), frequency (vertical axis), and amplitude (color intensity). Learning to read spectrograms is the single most important skill in audio forensics.

Spectrograms reveal patterns invisible in waveform view. Human speech shows characteristic formant bands (resonant frequencies of the vocal tract), breathing patterns, lip sounds, and natural background noise. Synthetic speech may lack these natural features or display them with artificial uniformity.
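To make the three axes concrete, here is a minimal sketch of how a spectrogram can be computed: slice the audio into short overlapping frames, apply a window, and take the magnitude of each frame's Fourier transform. It uses only the Python standard library and a slow, direct DFT so that every step is visible. The frame length, hop size, and test tone are arbitrary choices for illustration, and real tools use a fast FFT instead.

```python
import cmath
import math


def hann(n: int) -> list[float]:
    """Hann window, which tapers each frame to reduce spectral leakage."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def dft_magnitudes(frame: list[float]) -> list[float]:
    """Magnitudes of the DFT bins from 0 Hz up to the Nyquist frequency."""
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(
            frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)
        )
        mags.append(abs(acc))
    return mags


def spectrogram(samples: list[float], frame_len: int = 256, hop: int = 128):
    """Return one row per time frame; each row holds one magnitude per bin."""
    window = hann(frame_len)
    rows = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = [samples[start + i] * window[i] for i in range(frame_len)]
        rows.append(dft_magnitudes(frame))
    return rows


# Test signal: a 1 kHz sine tone sampled at 8 kHz.
sample_rate = 8000
frame_len = 256
samples = [math.sin(2 * math.pi * 1000 * n / sample_rate) for n in range(1024)]

spec = spectrogram(samples, frame_len=frame_len, hop=128)

# Time axis: frame index. Frequency axis: bin index. Amplitude: cell value.
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * sample_rate / frame_len
print(len(spec), "frames x", len(spec[0]), "frequency bins")
print("strongest frequency in first frame:", peak_hz, "Hz")  # 1000.0 Hz
```

Each row of `spec` is one vertical slice of the picture a spectrogram view draws. The frame length sets the trade-off you see on screen: longer frames give finer frequency resolution but blur timing, and shorter frames do the opposite.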

What to Look For in Spectrograms

Natural Speech Indicators

• Irregular breath patterns between phrases
• Variable formant transitions
• Ambient noise floor continuity
• Micro-hesitations and filler sounds

Synthetic Speech Indicators

• Perfectly regular formant spacing
• Absence of natural breathing
• Clean noise floor (suspiciously quiet)
• Uniform prosody and pacing
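One of these indicators, the suspiciously quiet noise floor, can be roughly quantified. The sketch below measures the level of each short frame in dBFS and takes a low percentile as the noise-floor estimate. The frame length, the percentile, and both synthetic test signals are assumptions made for illustration. This is a screening aid rather than a detector, since noise reduction or a noise gate on a genuine recording produces the same result.

```python
import math
import random


def frame_levels_dbfs(samples, frame_len=400):
    """RMS level of each non-overlapping frame, in dB relative to full scale."""
    levels = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / frame_len)
        # Clamp so that digital silence maps to a finite value, not -infinity.
        levels.append(20 * math.log10(max(rms, 1e-10)))
    return levels


def noise_floor_dbfs(samples, frame_len=400, percentile=10):
    """Estimate the noise floor as a low percentile of the frame levels."""
    levels = sorted(frame_levels_dbfs(samples, frame_len))
    index = min(int(len(levels) * percentile / 100), len(levels) - 1)
    return levels[index]


def burst(n, sample_rate=16000):
    """Stand-in for a voiced phrase: a 200 Hz tone at half of full scale."""
    return [0.5 * math.sin(2 * math.pi * 200 * i / sample_rate) for i in range(n)]


random.seed(0)  # fixed seed so the synthetic room noise is repeatable

# "Natural" take: phrases separated by pauses that still contain room noise.
room_noise = [random.gauss(0, 0.001) for _ in range(8000)]
natural = burst(4000) + room_noise[:4000] + burst(4000) + room_noise[4000:]

# "Too clean" take: the same phrases separated by pure digital silence.
too_clean = burst(4000) + [0.0] * 4000 + burst(4000) + [0.0] * 4000

natural_floor = noise_floor_dbfs(natural)
clean_floor = noise_floor_dbfs(too_clean)
print(f"natural pauses:  {natural_floor:.1f} dBFS")
print(f"silent pauses:   {clean_floor:.1f} dBFS")
```

A floor that drops to digital silence between phrases is worth a closer look in the spectrogram, but it should be weighed alongside the other indicators in the list.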

Essential Tools

Tool	Type	Best For	Cost
Audacity	Audio editor	Spectrogram view, noise analysis, basic forensics	Free
Praat	Phonetics analysis	Formant analysis, pitch tracking, voice comparison	Free
SoX	CLI audio tool	Batch processing, format conversion, statistics	Free
iZotope RX	Pro forensics	Advanced spectral editing, noise profiling, repair	$399+

Audio Compression and Artifacts

Understanding audio compression is essential for forensic analysis because compression artifacts can reveal manipulation. Lossy formats (MP3, AAC, Ogg Vorbis) discard frequency content judged inaudible in order to reduce file size, which leaves characteristic artifacts, commonly including a sharp high-frequency cutoff. When audio is re-encoded multiple times (generation loss), these artifacts accumulate and become easier to detect. A splice between two audio segments compressed at different bitrates may show a visible boundary in a spectrogram, for example an abrupt change in the highest frequency present.
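One way such a boundary can show up is as a sudden change in bandwidth, because lossy encoders often remove more high-frequency content at lower bitrates. The sketch below takes spectrogram rows (one list of bin magnitudes per time frame), finds the highest bin with meaningful energy in each, and reports frames where that cutoff jumps. The -60 dB threshold, the minimum jump size, and the hand-built test frames are all assumptions chosen for illustration.

```python
def cutoff_bin(magnitudes, rel_db=-60.0):
    """Highest frequency bin whose magnitude is within rel_db of the frame peak."""
    peak = max(magnitudes)
    if peak <= 0:
        return 0
    threshold = peak * 10 ** (rel_db / 20)
    for k in range(len(magnitudes) - 1, -1, -1):
        if magnitudes[k] >= threshold:
            return k
    return 0


def cutoff_jumps(spec, min_jump_bins=8):
    """Indices of frames whose cutoff differs sharply from the previous frame."""
    cutoffs = [cutoff_bin(row) for row in spec]
    return [
        i
        for i in range(1, len(cutoffs))
        if abs(cutoffs[i] - cutoffs[i - 1]) >= min_jump_bins
    ]


def fake_frame(n_bins, cutoff):
    """Hypothetical frame: flat energy up to `cutoff`, near silence above it."""
    return [1.0 if k <= cutoff else 1e-6 for k in range(n_bins)]


# Five frames with content up to bin 100, then five that stop at bin 70,
# imitating a segment from a lower-bitrate source spliced onto the first.
spec = [fake_frame(129, 100) for _ in range(5)] + [
    fake_frame(129, 70) for _ in range(5)
]

print(cutoff_jumps(spec))  # [5]
```

In real speech the bandwidth also changes naturally from frame to frame, for instance between voiced sounds and pauses, so in practice you would average the cutoff over longer stretches before treating a jump as evidence of a splice.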

In the next module, Spectral Analysis Basics, you will apply these fundamentals to real-world voice analysis — reading spectrograms for formant patterns, detecting pitch manipulation, and identifying the spectral signatures of different voice synthesis technologies.