Dr. SHAP-AV:
Decoding Modality
Contributions via Shapley Attribution in AVSR

Umberto Cappellazzo1 · Stavros Petridis1,2 · Maja Pantic1,2
1 Imperial College London    2 NatWest AI Research

A Shapley-based framework revealing how audio-visual speech recognition models balance what they hear and what they see — across noise levels, decoding stages, and architectures.

📄 Read Paper 💻 GitHub 📋 BibTeX
// Abstract

What is Dr. SHAP-AV?

A unified framework for understanding how AVSR models balance audio and visual modalities via Shapley values.

Audio-Visual Speech Recognition (AVSR) leverages both acoustic and visual information for robust recognition under noise, yet how models balance these modalities remains unclear. We present Dr. SHAP-AV, a framework that uses Shapley values to analyze modality contributions in AVSR. We introduce three complementary analyses: Global SHAP (overall modality balance), Generative SHAP (contribution dynamics during decoding), and Temporal Alignment SHAP (input-output correspondence), and apply them to six models on two benchmarks across varying SNR levels. Our findings reveal that models shift toward visual reliance under noise yet maintain high audio contributions even under severe degradation. This persistent audio bias motivates ad-hoc modality-weighting mechanisms and the adoption of Shapley-based attribution as a standard AVSR diagnostic.
// Method

Three Axes of Analysis

From the Shapley matrix Φ, capturing each input feature's contribution to each generated token, we derive three complementary metrics.

🌐

Global SHAP

Aggregates contributions across all features and tokens to quantify overall audio vs. visual balance.

\( \begin{align} \text{A-SHAP} &= \frac{\displaystyle\sum_{j \in \mathcal{A}} \sum_{t=1}^{T} |\phi_{j,t}|}{\displaystyle\sum_{j \in \mathcal{F}} \sum_{t=1}^{T} |\phi_{j,t}|} \\ \text{V-SHAP} &= \frac{\displaystyle\sum_{j \in \mathcal{V}} \sum_{t=1}^{T} |\phi_{j,t}|}{\displaystyle\sum_{j \in \mathcal{F}} \sum_{t=1}^{T} |\phi_{j,t}|}\end{align} \)
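A minimal NumPy sketch of this aggregation (the function name and the dense features × tokens layout of Φ are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def global_shap(phi, audio_idx, visual_idx):
    """Aggregate |phi| over all features and tokens for the overall balance.

    phi: (F, T) Shapley matrix; rows are input features, columns are
    generated tokens. audio_idx / visual_idx partition the feature rows.
    Returns (A-SHAP, V-SHAP), which sum to 1 when the two index sets
    cover all features.
    """
    abs_phi = np.abs(phi)
    total = abs_phi.sum()
    a_shap = abs_phi[audio_idx].sum() / total
    v_shap = abs_phi[visual_idx].sum() / total
    return a_shap, v_shap
```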
📈

Generative SHAP

Tracks how modality reliance evolves across windowed stages of autoregressive decoding.

\( \begin{align} \text{A-SHAP}^{(w)} &= \frac{\displaystyle\sum_{j \in \mathcal{A}} \sum_{t \in \mathcal{T}_w} |\phi_{j,t}|}{\displaystyle\sum_{j \in \mathcal{F}} \sum_{t \in \mathcal{T}_w} |\phi_{j,t}|}, \\ \text{V-SHAP}^{(w)} &= 1 - \text{A-SHAP}^{(w)}. \end{align} \)
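The windowed variant restricts the same ratio to one decoding window at a time. A minimal NumPy sketch (function name and window representation are illustrative assumptions):

```python
import numpy as np

def generative_shap(phi, audio_idx, windows):
    """Per-window audio share of attribution during decoding.

    phi: (F, T) Shapley matrix. windows: list of token-index lists,
    the T_w in the paper's notation. Returns A-SHAP^(w) per window;
    V-SHAP^(w) follows as 1 - A-SHAP^(w).
    """
    abs_phi = np.abs(phi)
    a_shap_w = []
    for t_w in windows:
        num = abs_phi[np.ix_(audio_idx, t_w)].sum()  # audio rows, window cols
        den = abs_phi[:, t_w].sum()                  # all rows, window cols
        a_shap_w.append(num / den)
    return np.array(a_shap_w)
```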
🔗

Alignment SHAP

Examines temporal correspondence between input feature positions and output token positions.

\( H_{k,w}^{(m)} = \frac{\displaystyle\sum_{j \in \mathcal{F}_k^{(m)}} \sum_{t \in \mathcal{T}_w} |\phi_{j,t}|}{\displaystyle\sum_{w'=1}^{W} \sum_{j \in \mathcal{F}_k^{(m)}} \sum_{t \in \mathcal{T}_{w'}} |\phi_{j,t}|} \)
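Following the formula above, each entry of the alignment map normalizes a temporal feature bin's attribution mass over decoding windows, so each row sums to 1. A minimal NumPy sketch (function name and bin/window representation are illustrative assumptions):

```python
import numpy as np

def alignment_shap(phi, feature_bins, windows):
    """H[k, w]: share of temporal bin k's attribution that lands in
    decoding window w, normalized over windows (rows sum to 1).

    phi: (F, T) Shapley matrix for one modality. feature_bins: list of
    feature-index lists (the F_k). windows: list of token-index lists
    (the T_w).
    """
    abs_phi = np.abs(phi)
    H = np.array([[abs_phi[np.ix_(f_k, t_w)].sum() for t_w in windows]
                  for f_k in feature_bins])
    return H / H.sum(axis=1, keepdims=True)
```

A diagonal-dominant H indicates that early input frames mainly influence early tokens and late frames influence late tokens, i.e. preserved temporal alignment.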
Overview of the three proposed SHAP-based analyses
Overview of the three proposed SHAP-based analyses. From the Shapley matrix Φ, we compute: Global SHAP (overall modality balance), Generative SHAP (contribution dynamics across decoding), and Temporal Alignment SHAP (input-output correspondence).
// Models

Six Models, Two Paradigms

We evaluate models spanning both LLM-based and encoder-decoder AVSR architectures on LRS2 and LRS3. Hover over a model to see its architecture.

Llama-AVSR
LLM-based
First multimodal LLM for AVSR with linear projectors
Llama-SMoP
LLM-based
Sparse mixture-of-experts projectors for enhanced fusion
Omni-AVSR
LLM-based
Unified ASR/VSR/AVSR with matryoshka representations
AV-HuBERT
Encoder-Decoder
Self-supervised masked audio-visual prediction
Auto-AVSR
Encoder-Decoder
MLP-based fusion with CTC/attention training
Whisper-Flamingo
Encoder-Decoder
Gated cross-attention on top of Whisper backbone
// Findings

Six Key Discoveries

From analyzing modality contributions across six state-of-the-art AVSR models on LRS2 and LRS3. Hover over each finding to explore the corresponding result.

🔊01
Persistent Audio Bias
Models maintain 38–46% audio contribution even at −10 dB SNR, where substantially stronger visual dominance would be expected.
02
Dynamic Generation Shift
Whisper-Flamingo and Omni-AVSR progressively increase audio reliance during decoding; AV-HuBERT maintains stable balance throughout.
🔗03
Robust Temporal Alignment
Both modalities independently maintain input-output temporal correspondence, even under severe acoustic noise.
🔍 Hover over a finding to explore the corresponding figure
🎵04
Noise-Type Sensitivity
Different noise types induce varying degrees of visual reliance, with more challenging conditions producing larger shifts toward vision.
05
Architecture-Dependent Duration
Utterance duration affects modality contributions differently across architectures: no universal trend exists.
📊06
SNR Drives Balance
Acoustic conditions are the dominant factor driving modality balance; recognition difficulty has minimal effect.
// Citation

Cite This Work

@article{drshapav2026,
  title   = {Dr. SHAP-AV: Decoding Relative Modality Contributions via Shapley Attribution in Audio-Visual Speech Recognition},
  author  = {Cappellazzo, Umberto and Petridis, Stavros and Pantic, Maja},
  journal = {arXiv preprint arXiv:},
  year    = {2026}
}