Dr. SHAP-AV:
Decoding Modality
Contributions via Shapley Attribution in AVSR

Umberto Cappellazzo1 · Stavros Petridis1,2 · Maja Pantic1,2
1 Imperial College London    2 NatWest AI Research

A Shapley-based framework revealing how audio-visual speech recognition models balance what they hear and what they see — across noise levels, decoding stages, and architectures.

📄 Read Paper 💻 GitHub 📋 BibTeX
// Abstract

What is Dr. SHAP-AV?

A unified framework for understanding how AVSR models balance audio and visual modalities via Shapley values.

Audio-Visual Speech Recognition (AVSR) leverages both acoustic and visual information for robust recognition under noise, yet how models balance these modalities remains unclear. We present Dr. SHAP-AV, a framework that uses Shapley values to analyze modality contributions in AVSR. We introduce three complementary analyses: Global SHAP (overall modality balance), Generative SHAP (contribution dynamics during decoding), and Temporal Alignment SHAP (input-output correspondence), and apply them to six models on two benchmarks across varying SNR levels. Our findings reveal that models shift toward visual reliance under noise yet maintain high audio contributions even under severe degradation. This persistent audio bias motivates ad-hoc modality-weighting mechanisms and the adoption of Shapley-based attribution as a standard AVSR diagnostic.
// Method

Three Axes of Analysis

From the Shapley matrix Φ, capturing each input feature's contribution to each generated token, we derive three complementary metrics.

🌐

Global SHAP

Aggregates contributions across all features and tokens to quantify overall audio vs. visual balance.

\( \begin{align} \text{A-SHAP} &= \frac{\displaystyle\sum_{j \in \mathcal{A}} \sum_{t=1}^{T} |\phi_{j,t}|}{\displaystyle\sum_{j \in \mathcal{F}} \sum_{t=1}^{T} |\phi_{j,t}|} \\ \text{V-SHAP} &= \frac{\displaystyle\sum_{j \in \mathcal{V}} \sum_{t=1}^{T} |\phi_{j,t}|}{\displaystyle\sum_{j \in \mathcal{F}} \sum_{t=1}^{T} |\phi_{j,t}|}\end{align} \)
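A minimal NumPy sketch of this aggregation (the function name and the dense features × tokens layout of Φ are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def global_shap(phi, audio_idx, visual_idx):
    """Aggregate |phi| over all features and tokens for the overall balance.

    phi: (F, T) Shapley matrix; rows are input features, columns are
    generated tokens. audio_idx / visual_idx partition the feature rows.
    Returns (A-SHAP, V-SHAP), which sum to 1 when the two index sets
    cover all features.
    """
    abs_phi = np.abs(phi)
    total = abs_phi.sum()
    a_shap = abs_phi[audio_idx].sum() / total
    v_shap = abs_phi[visual_idx].sum() / total
    return a_shap, v_shap
```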
📈

Generative SHAP

Tracks how modality reliance evolves across windowed stages of autoregressive decoding.

\( \begin{align} \text{A-SHAP}^{(w)} &= \frac{\displaystyle\sum_{j \in \mathcal{A}} \sum_{t \in \mathcal{T}_w} |\phi_{j,t}|}{\displaystyle\sum_{j \in \mathcal{F}} \sum_{t \in \mathcal{T}_w} |\phi_{j,t}|}, \\ \text{V-SHAP}^{(w)} &= 1 - \text{A-SHAP}^{(w)}. \end{align} \)
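The windowed variant restricts the same ratio to one decoding window at a time. A minimal NumPy sketch (function name and window representation are illustrative assumptions):

```python
import numpy as np

def generative_shap(phi, audio_idx, windows):
    """Per-window audio share of attribution during decoding.

    phi: (F, T) Shapley matrix. windows: list of token-index lists,
    the T_w in the paper's notation. Returns A-SHAP^(w) per window;
    V-SHAP^(w) follows as 1 - A-SHAP^(w).
    """
    abs_phi = np.abs(phi)
    a_shap_w = []
    for t_w in windows:
        num = abs_phi[np.ix_(audio_idx, t_w)].sum()  # audio rows, window cols
        den = abs_phi[:, t_w].sum()                  # all rows, window cols
        a_shap_w.append(num / den)
    return np.array(a_shap_w)
```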
🔗

Alignment SHAP

Examines temporal correspondence between input feature positions and output token positions.

\( H_{k,w}^{(m)} = \frac{\displaystyle\sum_{j \in \mathcal{F}_k^{(m)}} \sum_{t \in \mathcal{T}_w} |\phi_{j,t}|}{\displaystyle\sum_{w'=1}^{W} \sum_{j \in \mathcal{F}_k^{(m)}} \sum_{t \in \mathcal{T}_{w'}} |\phi_{j,t}|} \)
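Following the formula above, each entry of the alignment map normalizes a temporal feature bin's attribution mass over decoding windows, so each row sums to 1. A minimal NumPy sketch (function name and bin/window representation are illustrative assumptions):

```python
import numpy as np

def alignment_shap(phi, feature_bins, windows):
    """H[k, w]: share of temporal bin k's attribution that lands in
    decoding window w, normalized over windows (rows sum to 1).

    phi: (F, T) Shapley matrix for one modality. feature_bins: list of
    feature-index lists (the F_k). windows: list of token-index lists
    (the T_w).
    """
    abs_phi = np.abs(phi)
    H = np.array([[abs_phi[np.ix_(f_k, t_w)].sum() for t_w in windows]
                  for f_k in feature_bins])
    return H / H.sum(axis=1, keepdims=True)
```

A diagonal-dominant H indicates that early input frames mainly influence early tokens and late frames influence late tokens, i.e. preserved temporal alignment.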
Overview of the three proposed SHAP-based analyses
Overview of the three proposed SHAP-based analyses. From the Shapley matrix Φ, we compute: Global SHAP (overall modality balance), Generative SHAP (contribution dynamics across decoding), and Temporal Alignment SHAP (input-output correspondence).
// Models

Six Models, Two Paradigms

We evaluate models spanning both LLM-based and encoder-decoder AVSR architectures on LRS2 and LRS3. Hover over a model to see its architecture.

Llama-AVSR
LLM-based
First multimodal LLM for AVSR with linear projectors
Llama-SMoP
LLM-based
Sparse mixture-of-experts projectors for enhanced fusion
Omni-AVSR
LLM-based
Unified ASR/VSR/AVSR with matryoshka representations
AV-HuBERT
Encoder-Decoder
Self-supervised masked audio-visual prediction
Auto-AVSR
Encoder-Decoder
MLP-based fusion with CTC/attention training
Whisper-Flamingo
Encoder-Decoder
Gated cross-attention on top of Whisper backbone
// Findings

Six Key Discoveries

From analyzing modality contributions across six state-of-the-art AVSR models on LRS2 and LRS3. Hover over each finding to explore the corresponding result.

🔊01
Persistent Audio Bias
Models maintain 38–46% audio contribution even at −10 dB SNR, where substantially stronger visual dominance would be expected.
02
Dynamic Generation Shift
Whisper-Flamingo and Omni-AVSR progressively increase audio reliance during decoding; AV-HuBERT maintains stable balance throughout.
🔗03
Robust Temporal Alignment
Both modalities independently maintain input-output temporal correspondence, even under severe acoustic noise.
🔍 Hover over a finding to explore the corresponding figure
🎵04
Noise-Type Sensitivity
Different noise types induce varying degrees of visual reliance, with more challenging conditions producing larger shifts toward vision.
05
Architecture-Dependent Duration
Utterance duration affects modality contributions differently across architectures: no universal trend exists.
📊06
SNR Drives Balance
Acoustic conditions are the dominant factor driving modality balance; recognition difficulty has minimal effect.
// Citation

Cite This Work

@article{drshapav2026,
  title   = {Dr. SHAP-AV: Decoding Relative Modality Contributions via Shapley Attribution in Audio-Visual Speech Recognition},
  author  = {Cappellazzo, Umberto and Petridis, Stavros and Pantic, Maja},
  journal = {arXiv preprint arXiv:},
  year    = {2026}
}