Biometric Defense
Intermediate
T1123 T1589

Voice Biometrics & Audio Tracking

Speaker recognition systems identify individuals from spectral characteristics of their voice — timbre, formant frequencies, pitch contour, and temporal speech dynamics. These voiceprints are increasingly used for authentication, surveillance, and cross-platform identity linking.

Scale of Voice Collection

Major tech platforms process billions of voice queries daily. Call centers routinely enroll voiceprints for authentication. Law enforcement agencies maintain voice databases for speaker identification. Every meeting recording, voice message, and phone call is a potential voiceprint source.

How Speaker Recognition Works

Feature Extraction

Raw audio is converted to spectral features — primarily MFCCs (Mel-Frequency Cepstral Coefficients), which capture the spectral envelope of the vocal tract.

  • 13–20 MFCC coefficients per frame
  • Delta and delta-delta features capture temporal dynamics
  • F0 (fundamental frequency) captures pitch characteristics
  • Formant frequencies (F1–F4) encode vocal tract shape
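The delta features above can be sketched with the HTK-style regression formula (librosa's `feature.delta` differs in implementation detail, so this is illustrative, not the library's exact code):

```python
import numpy as np

def delta(features, width=2):
    """First-order regression delta over +/- `width` frames.
    Simplified sketch of the standard HTK-style delta formula."""
    n_frames = features.shape[1]
    padded = np.pad(features, ((0, 0), (width, width)), mode="edge")
    num = sum(
        n * (padded[:, width + n : width + n + n_frames]
             - padded[:, width - n : width - n + n_frames])
        for n in range(1, width + 1)
    )
    denom = 2 * sum(n * n for n in range(1, width + 1))
    return num / denom

# Toy "MFCC" matrix: 3 coefficients x 5 frames, each rising by 1 per frame
mfcc = np.arange(15, dtype=float).reshape(3, 5)
d = delta(mfcc)
print(d[0])   # interior frames recover the slope: [0.5 0.8 1.  0.8 0.5]
```

The edge frames are damped because the signal is padded by repetition; the interior frames recover the per-frame rate of change, which is exactly the "temporal dynamics" signal speaker models exploit.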

Embedding Generation

Deep neural networks (d-vectors, x-vectors) compress variable-length audio into fixed-dimensional speaker embeddings for comparison.

  • d-vectors: GE2E (Generalized End-to-End) loss training
  • x-vectors: TDNN (Time-Delay Neural Network) architecture
  • ECAPA-TDNN: Current SOTA, uses attention mechanisms
  • Typical embedding dimension: 128–512
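The step that turns variable-length audio into a fixed-size vector can be sketched with statistics pooling (the mechanism x-vectors use after their TDNN layers). The real networks learn the frame-level transform; here random matrices stand in for frame features:

```python
import numpy as np

def stats_pool(frame_features):
    """Statistics pooling (x-vector style): concatenate per-dimension
    mean and standard deviation over time into one fixed-size vector."""
    mu = frame_features.mean(axis=1)
    sigma = frame_features.std(axis=1)
    return np.concatenate([mu, sigma])

rng = np.random.default_rng(42)
short_clip = rng.standard_normal((20, 100))    # 20 features x 100 frames (~1 s)
long_clip = rng.standard_normal((20, 1000))    # same features, 10x the frames

print(stats_pool(short_clip).shape)   # (40,)
print(stats_pool(long_clip).shape)    # (40,) — same size regardless of duration
```

This duration-invariance is what makes a 3-second probe comparable against a 30-second enrollment.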

Verification / Identification

Embeddings are compared using cosine similarity or PLDA (Probabilistic Linear Discriminant Analysis) against enrolled templates.

  • Verification: 1:1 — "Is this person who they claim?"
  • Identification: 1:N — "Who is this speaker?"
  • Diarization: "Who spoke when in this recording?"
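All three tasks reduce to the same embedding comparison. A minimal sketch with synthetic 256-dimensional embeddings; the 0.7 accept threshold is an illustrative assumption, not a calibrated operating point:

```python
import numpy as np

rng = np.random.default_rng(0)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Synthetic 256-dim "enrolled templates" for three speakers
enrolled = {name: rng.standard_normal(256) for name in ["alice", "bob", "carol"]}

# Probe: a noisy re-recording of alice (her template plus perturbation)
probe = enrolled["alice"] + 0.3 * rng.standard_normal(256)

# Verification (1:1): score against the claimed identity only
claimed = "alice"
score = cosine(probe, enrolled[claimed])
print(f"verify vs {claimed}: {score:.3f} -> {'accept' if score > 0.7 else 'reject'}")

# Identification (1:N): score against every enrolled template, take the best
scores = {name: cosine(probe, emb) for name, emb in enrolled.items()}
best = max(scores, key=scores.get)
print(f"identified as: {best}")
```

Diarization runs the same comparison per time segment: cluster segment embeddings, then assign each cluster a speaker label.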

Deployment Contexts

Voice biometrics are deployed across banking, call centers, smart assistants, law enforcement, and meeting platforms.

  • Banking: Nuance/Microsoft voice auth (300M+ voiceprints)
  • Smart assistants: continuous voice profiles
  • Law enforcement: NSA voice databases, wiretap matching
  • Enterprise: Zoom/Teams transcription with speaker ID

Voice Data Risk Surfaces

Source             | Data Quality            | Retention Risk                                 | Mitigation
Phone calls        | High (clean enrollment) | 8 kHz narrowband, but sufficient for matching  | Use encrypted VoIP, minimize call duration
Video meetings     | Very high (wideband)    | Recording + transcription with speaker labels  | Disable recording, audit AI transcription TOS
Smart speakers     | High (always-on mic)    | Cloud-stored voice clips, voice profiles       | Delete history, disable voice match, mute when idle
Voice messages     | Medium-high             | Stored on sender/recipient devices + cloud     | Use disappearing messages, prefer text
Ambient capture    | Low-medium (noisy)      | Variable, often incidental                     | Awareness of mic-equipped spaces, acoustic hygiene
Podcasts / social  | Very high (produced)    | Indefinite public availability                 | Assume enrollment-quality; limit if high-risk

Defensive Controls

🔇 Acoustic Hygiene

Control your recording environment: closed rooms with sound absorption, push-to-talk over always-on microphones, and physical mic switches for sensitive devices.

🔀 Channel Separation

Separate personal and professional voice channels. Use different communication platforms and personas to reduce cross-platform voiceprint linking.

🔊 Noise Masking

Lawful ambient noise generation can reduce capture quality for passive microphones. White/pink noise generators near meeting areas degrade distant recording quality.
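Masking tracks can be generated locally rather than streamed. A sketch using numpy and the standard-library wave module; the 0.1 amplitude, one-second duration, and output filenames are illustrative choices:

```python
import wave
import numpy as np

def write_wav(path, samples, sr=44100):
    """Write mono float samples (-1..1) as a 16-bit PCM WAV."""
    pcm = (np.clip(samples, -1.0, 1.0) * 32767).astype(np.int16)
    with wave.open(path, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)       # 16-bit
        w.setframerate(sr)
        w.writeframes(pcm.tobytes())

rng = np.random.default_rng()
sr, seconds = 44100, 1
n = sr * seconds

# White noise: flat spectrum, zero mean
white = rng.uniform(-1.0, 1.0, size=n)
write_wav("white_noise.wav", 0.1 * white, sr)

# Pink noise: shape white noise toward a 1/f power spectrum via FFT
spectrum = np.fft.rfft(rng.standard_normal(n))
freqs = np.fft.rfftfreq(n, 1 / sr)
spectrum /= np.sqrt(np.maximum(freqs, 1.0))    # 1/f power = 1/sqrt(f) amplitude
pink = np.fft.irfft(spectrum, n=n)
write_wav("pink_noise.wav", 0.1 * pink / np.max(np.abs(pink)), sr)
```

Pink noise concentrates energy in the low frequencies where speech lives, which tends to make it a more effective (and less fatiguing) masker than pure white noise.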

📋 Consent & Policy

Enforce explicit recording disclosures. Audit AI transcription services for hidden retention and model-training clauses. Challenge unauthorized voice capture where rights apply.

⏱️ Ephemeral Communication

Use disappearing messages and auto-delete policies. Prefer text for sensitive topics. When voice is required, use end-to-end encrypted platforms with minimal metadata retention.

🧹 Metadata Sanitization

Strip metadata from shared audio files. Normalize audio levels to prevent volume-based fingerprinting. Reduce sample rate for non-critical sharing.

Voice Feature Analysis

Extract and analyze the spectral features that speaker recognition systems use to build voiceprints.

voice_features.py
python
#!/usr/bin/env python3
# Prerequisites: pip install librosa numpy soundfile
"""Analyze voice features that speaker recognition systems use.
Compare speaker embeddings across different conditions."""
import librosa
import numpy as np

def extract_voice_features(audio_path):
    """Extract MFCC-based speaker features from an audio file."""
    y, sr = librosa.load(audio_path, sr=16000)  # 16kHz sample rate — telephony standard, sufficient for speech
    
    # MFCCs (primary speaker identity features)
    mfcc = librosa.feature.mfcc(
        y=y, sr=sr,
        n_mfcc=20,       # 20 cepstral coefficients — captures enough vocal tract detail for speaker ID
        n_fft=512,        # 512-sample FFT window (~32ms at 16kHz) — standard for speech
        hop_length=160    # 10ms hop between frames — standard for speech analysis
    )
    
    # Delta and delta-delta (temporal dynamics)
    mfcc_delta = librosa.feature.delta(mfcc)
    mfcc_delta2 = librosa.feature.delta(mfcc, order=2)
    
    # Pitch (F0) contour
    f0, _, _ = librosa.pyin(
        y, fmin=librosa.note_to_hz('C2'), fmax=librosa.note_to_hz('C7'), sr=sr
    )
    f0_clean = f0[~np.isnan(f0)]
    
    # Spectral features
    spectral_centroid = librosa.feature.spectral_centroid(y=y, sr=sr)
    spectral_bandwidth = librosa.feature.spectral_bandwidth(y=y, sr=sr)
    
    features = {
        "mfcc_mean": np.mean(mfcc, axis=1).tolist(),
        "mfcc_std": np.std(mfcc, axis=1).tolist(),
        "delta_mean": np.mean(mfcc_delta, axis=1).tolist(),
        "delta2_mean": np.mean(mfcc_delta2, axis=1).tolist(),
        "f0_mean": float(np.mean(f0_clean)) if len(f0_clean) > 0 else 0,
        "f0_std": float(np.std(f0_clean)) if len(f0_clean) > 0 else 0,
        "f0_range": float(np.ptp(f0_clean)) if len(f0_clean) > 0 else 0,
        "spectral_centroid_mean": float(np.mean(spectral_centroid)),
        "spectral_bandwidth_mean": float(np.mean(spectral_bandwidth)),
        "duration_sec": float(len(y) / sr),
    }
    return features

# Compare baseline vs conditions
conditions = {
    "baseline_quiet_room": "audio/baseline.wav",
    "noisy_background": "audio/noisy_bg.wav",
    "different_microphone": "audio/diff_mic.wav",
    "whispered": "audio/whispered.wav",
    "altered_pitch": "audio/pitch_shifted.wav",
}

for name, path in conditions.items():
    features = extract_voice_features(path)
    print(f"\n--- {name} ---")
    print(f"  F0 mean: {features['f0_mean']:.1f} Hz, range: {features['f0_range']:.1f} Hz")
    print(f"  MFCC[0]: {features['mfcc_mean'][0]:.2f} ± {features['mfcc_std'][0]:.2f}")
    print(f"  Spectral centroid: {features['spectral_centroid_mean']:.0f} Hz")

# --- Illustrative output (values depend on your recordings) ---
# --- baseline_quiet_room ---
#   F0 mean: 121.4 Hz, range: 78.3 Hz
#   MFCC[0]: -243.17 ± 58.42
#   Spectral centroid: 1847 Hz
#
# --- noisy_background ---
#   F0 mean: 124.8 Hz, range: 65.1 Hz
#   MFCC[0]: -198.53 ± 71.20
#   Spectral centroid: 2341 Hz
#
# --- whispered ---
#   F0 mean: 0.0 Hz, range: 0.0 Hz
#   MFCC[0]: -312.85 ± 43.07
#   Spectral centroid: 3102 Hz
#
# --- altered_pitch ---
#   F0 mean: 167.2 Hz, range: 91.7 Hz
#   MFCC[0]: -221.09 ± 62.38
#   Spectral centroid: 2054 Hz

Speaker Verification Testing

Measure how environmental conditions and vocal modifications affect speaker matching confidence.

speaker_verify.py
python
#!/usr/bin/env python3
# Prerequisites: pip install resemblyzer numpy
"""Speaker verification using Resemblyzer (d-vector approach).
Test how voice modifications affect speaker match confidence."""
from resemblyzer import VoiceEncoder, preprocess_wav
from pathlib import Path
import numpy as np

encoder = VoiceEncoder()

def get_speaker_embedding(audio_path):
    """Generate d-vector speaker embedding from audio file."""
    wav = preprocess_wav(Path(audio_path))
    return encoder.embed_utterance(wav)

def cosine_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Reference enrollment (high-quality sample)
ref_embedding = get_speaker_embedding("audio/enrollment_clean.wav")

# Test conditions
test_files = {
    "clean_match": "audio/test_clean.wav",
    "phone_quality": "audio/test_phone.wav",
    "background_noise": "audio/test_noisy.wav",
    "different_room": "audio/test_different_room.wav",
    "whispered": "audio/test_whisper.wav",
    "slow_speech": "audio/test_slow.wav",
    "masked_voice": "audio/test_masked.wav",
}

print(f"{'Condition':<25} {'Similarity':>12} {'Match':>8}")
print("-" * 48)
for condition, path in test_files.items():
    try:
        test_emb = get_speaker_embedding(path)
        sim = cosine_sim(ref_embedding, test_emb)
        # 0.75 = high-confidence match threshold (d-vector cosine similarity; range 0–1)
        # 0.60 = possible match zone — recommend manual review
        match = "YES" if sim > 0.75 else "MAYBE" if sim > 0.60 else "NO"
        print(f"{condition:<25} {sim:>12.4f} {match:>8}")
    except Exception as e:
        print(f"{condition:<25} {'ERROR':>12} {'N/A':>8}  # {e}")

# Illustrative output (values depend on your recordings):
# Condition                   Similarity    Match
# ------------------------------------------------
# clean_match                     0.8934      YES
# background_noise                0.7234    MAYBE
# whispered                       0.5102       NO
# masked_voice                    0.4213       NO

Audio Sanitization

Clean metadata and normalize audio properties before sharing files externally.

sanitize-audio.sh
bash
#!/bin/bash
# Prerequisites: apt install ffmpeg (or brew install ffmpeg on macOS)
# Audio metadata stripping and quality normalization
# Strip all metadata from audio files
ffmpeg -i recording.wav -map_metadata -1 -c copy sanitized.wav

# Normalize audio levels (prevents volume-based fingerprinting)
# EBU R128 broadcast loudness standard: I=-16 LUFS, True Peak=-1.5 dBTP, Loudness Range=11 LU
ffmpeg -i recording.wav -af "loudnorm=I=-16:TP=-1.5:LRA=11" normalized.wav

# Reduce sample rate to phone quality (degrades speaker features)
ffmpeg -i recording.wav -ar 8000 -ac 1 phone_quality.wav

# Add faint zero-mean white noise (~0.02 peak, roughly -34 dBFS) — masks speaker micro-patterns without audible distortion
ffmpeg -i recording.wav -af "aeval=val(0)+(random(0)-0.5)*0.04" noise_masked.wav

# Batch process a folder
for f in *.wav; do
    ffmpeg -y -i "$f" -map_metadata -1 -af "loudnorm" "clean_${f}"
done

Voice Cloning & Deepfake Threats

Modern TTS and voice cloning tools can replicate a person's voice from minutes of sample audio, creating serious risks for social engineering, fraud, and identity spoofing.

Voice Cloning Tools

  • ElevenLabs: Cloud API clones voice from ~60 seconds of audio; near-human quality
  • Coqui XTTS v2: Open-source multi-language TTS; 6-second voice cloning (self-hosted)
  • Bark (Suno AI): Open-source text-to-audio with voice presets and speaker prompts
  • RVC (Retrieval-based Voice Conversion): Real-time voice conversion; popular in live-call spoofing

Threat Scenarios

  • Vishing (voice phishing): Clone executive's voice for wire-transfer fraud
  • Speaker verification bypass: Defeat voiceprint auth with cloned sample
  • Deniability attacks: Generate fabricated audio of target saying anything
  • Ultrasonic cross-device tracking: Inaudible beacons embedded in audio streams link devices across locations
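A hedged sketch of how ultrasonic beacons can be flagged: compare spectral energy at or above roughly 18 kHz against total energy. Real beacons are modulated carriers, and the 18 kHz cutoff and the synthetic test tones below are illustrative assumptions, but a band-energy check is enough for a first alert on nominally speech-only audio:

```python
import numpy as np

def ultrasonic_ratio(samples, sr, cutoff_hz=18000):
    """Fraction of spectral energy at or above cutoff_hz (0..1)."""
    power = np.abs(np.fft.rfft(samples)) ** 2
    freqs = np.fft.rfftfreq(len(samples), 1 / sr)
    total = power.sum()
    return float(power[freqs >= cutoff_hz].sum() / total) if total > 0 else 0.0

sr = 48000
t = np.arange(sr) / sr                          # 1 second of audio
audible = np.sin(2 * np.pi * 220 * t)           # audible tone stands in for speech
beacon = 0.1 * np.sin(2 * np.pi * 19000 * t)    # faint 19 kHz carrier

print(f"clean:  {ultrasonic_ratio(audible, sr):.4f}")           # clean:  0.0000
print(f"beacon: {ultrasonic_ratio(audible + beacon, sr):.4f}")  # beacon: 0.0099
```

Note that detection requires a capture chain that preserves those frequencies: microphones sampling at 44.1 kHz or above hear the beacon, while 8 kHz telephony audio cannot.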

Defensive Controls Against Voice Cloning

  • Minimize public voice samples: Limit podcast appearances, social media voice posts, and public speaking recordings that provide cloning material.
  • Establish verbal verification codes: Use pre-shared code words for high-stakes phone calls (wire transfers, access requests) that can't be predicted by a cloning model.
  • Deploy audio watermarking: Tools like AudioSeal (Meta) and Resemble AI Detect embed imperceptible watermarks in generated audio that survive common transformations.
  • Block ultrasonic tracking: Use ultrasonic firewall apps or hardware high-pass filters to prevent cross-device beacon tracking via inaudible audio.
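Verbal verification codes should come from a cryptographically secure source, not from memorable (guessable) phrases. A sketch using Python's secrets module; the short wordlist is a hypothetical stand-in (in practice use a large curated list such as EFF's 7,776-word diceware list):

```python
import secrets

# Hypothetical short wordlist for illustration only; use a large
# curated list (e.g. EFF's diceware list) in practice.
WORDLIST = [
    "granite", "walnut", "harbor", "meadow", "copper", "lantern",
    "orchid", "timber", "falcon", "ember", "juniper", "quartz",
]

def make_challenge_phrase(n_words=3):
    """Draw n_words with a CSPRNG; share out-of-band before high-stakes calls."""
    return " ".join(secrets.choice(WORDLIST) for _ in range(n_words))

phrase = make_challenge_phrase()
print(phrase)
```

Rotate the phrase per transaction and treat it as challenge-response, not a static password: a cloned voice can replay anything the attacker has already heard once.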

Voice Cloning Reality Check

A 2024 McAfee study found that 77% of voice clone attempts were rated 'convincing' by listeners. ElevenLabs can produce usable clones from as little as 60 seconds of clean audio. Treat any phone-based authentication as low-assurance and layer with out-of-band verification.

Defense Strategy Summary

  • Reduce enrollment samples: limit public voice recordings, podcast appearances, and social media audio
  • Control recording environments: use acoustic isolation and push-to-talk for sensitive conversations
  • Sanitize shared audio: strip metadata, normalize levels, reduce quality when full fidelity isn't needed
  • Separate voice identities: use different platforms and personas for different risk contexts
  • Audit AI services: review transcription and voice-assistant ToS for training and retention clauses

Legal Boundaries

Avoid illegal jamming or unauthorized interference with communications systems. Active voice modification during lawful recordings may violate wiretapping statutes in some jurisdictions. Focus on lawful privacy controls: consent enforcement, channel separation, and policy-based protections.
🎯

Voice Privacy Labs

Hands-on exercises to understand and reduce your voice biometric exposure.

🔧
Voice Feature Extraction & Analysis Custom Lab medium
  • Record voice samples in 5 different conditions
  • Extract MFCC, pitch, and spectral features with librosa
  • Compute speaker embeddings with Resemblyzer
  • Compare similarity scores across conditions
  • Identify which conditions most reduce speaker match confidence
🔧
Audio Hygiene Workflow Custom Lab easy
  • Strip metadata from audio files using ffmpeg
  • Normalize audio levels across a set of recordings
  • Compare speaker verification before/after sanitization
  • Audit smart device voice history and delete stored samples
  • Review meeting platform recording and transcription settings