I'm Umberto Cappellazzo, a Gen AI Research Engineer at NatWest Group in London, UK, where I work in the CAIRO group led by Maja Pantic. My manager is Stavros Petridis. Previously, I was a Research Associate at Imperial College London. My work covers self-supervised audio representation learning, speech tokenizers, and multimodal LLMs. In particular, I have focused on advancing audio-visual speech recognition with Large Language Models, in close collaboration with Meta AI, and I have published several papers in this direction (IEEE ICASSP x3, Interspeech x3, IEEE ASRU, NeurIPS). Before that, I obtained my PhD in Information Engineering and Computer Science from the University of Trento, Italy.

RESEARCH

🗣️

Audio-Visual Speech Recognition

LLM-based AVSR that reads lips and listens at once — state-of-the-art on LRS2/LRS3 via modality-aware compression and LoRA.

Llama-AVSR · LRS3 · Multimodal LLM

🪆

Elastic & Matryoshka Models

One model, many granularities. Matryoshka representation learning and Mixture-of-Experts for adaptive inference.

MoME · Omni-AVSR · Elastic

🎛️

Parameter-Efficient Fine-Tuning

Adapters, LoRA, prompt-tuning, and soft Mixture-of-Adapters — matching full fine-tuning at a fraction of the cost.

PEFT · Soft-MoA · AST

🌊

Self-Supervised Audio Learning

Large-scale self-supervised audio pre-training via next-embedding auto-regressive objectives in latent space.

SSL · Spectrograms · Tokenizers

🔍

Interpreting Multimodal LLMs

Probing attention sinks, massive activations, and modality contributions via Shapley attribution.

Shapley Values · Attention Sinks · Massive Activations

♻️

Continual Learning for Speech

Learning sequentially without forgetting — rehearsal, distillation, and contrastive objectives for spoken language understanding (SLU).

Continual Learning · SLU · Distillation

BIO

July 2026 — Present

Gen AI Research Engineer · NatWest Group (CAIRO team)

Mar 2025 — July 2026

Research Associate · Imperial College London (iBUG team)

Advised by Stavros Petridis in the group led by Maja Pantic. Focus on multimodal LLMs and self-supervised audio representation learning.

Feb 2024 — Nov 2024

Visiting Researcher · Imperial College London

Nine-month visit with iBUG exploring LLMs for AVSR, advised by Stavros Petridis — the work behind Llama-AVSR.

Summer 2023

JSALT 2023 · Le Mans, France

Finite-state methods with modern neural architectures group; early-exit techniques for CTC/MMI.

Nov 2021 — Jan 2025

PhD · University of Trento

"Efficient Knowledge Transfer and Adaptation for Speech and Beyond." Defended cum laude, Jan 2025. Supervised by Daniele Falavigna and Alessio Brutti.

2016 — 2019

M.S. Telecommunication Engineering · University of Padova

Thesis: deep-learning-based ECG delineator. Supervised by Michele Rossi and Matteo Gadaleta.

2013 — 2016

B.S. Information Engineering · University of Padova

Thesis: message authentication over an ideal or noisy channel. Supervised by Nicola Laurenti.

PUBLICATIONS

Interspeech '26 [Long Paper Track]

Dr. SHAP-AV: Decoding Relative Modality Contributions via Shapley Attribution in AVSR

U. Cappellazzo, S. Petridis, M. Pantic

Web Paper Code

Interspeech '26

VIB-AVSR: Variational Information Bottleneck for Noise-Robust LLM-Based AVSR

P. Arora, N. Singh, U. Cappellazzo, S. Petridis, M. Pantic

Paper Code

Interspeech '26

MambAdapter: Lightweight Mamba-Based Adapters for PEFT in Speech and Audio

S. Ali, U. Cappellazzo, M. Ravanelli

Paper Code

ICASSP '26

Omni-AVSR: Towards Unified Multimodal Speech Recognition with LLMs

U. Cappellazzo, X. Liu, P. Ma, S. Petridis, M. Pantic

Paper Web Code

ICASSP '26

Mitigating Attention Sinks and Massive Activations in AVSR with LLMs

Anand, U. Cappellazzo, S. Petridis, M. Pantic

Paper Code

NeurIPS '25

MoME: Mixture of Matryoshka Experts for Audio-Visual Speech Recognition

U. Cappellazzo, M. Kim, P. Ma, H. Chen, X. Liu, S. Petridis, M. Pantic

Paper OpenReview

ASRU '25

Adaptive Audio-Visual Speech Recognition via Matryoshka-Based Multimodal LLMs

U. Cappellazzo, M. Kim, S. Petridis

Paper

Interspeech '25

Scaling and Enhancing LLM-based AVSR: A Sparse Mixture of Projectors Approach

U. Cappellazzo, M. Kim, S. Petridis, D. Falavigna, A. Brutti

Paper

ICASSP '25

Large Language Models Are Strong Audio-Visual Speech Recognition Learners

U. Cappellazzo, M. Kim, H. Chen, P. Ma, S. Petridis, D. Falavigna, A. Brutti, M. Pantic

Paper Code

MLSP '24

Parameter-Efficient Transfer Learning of Audio Spectrogram Transformers

U. Cappellazzo, D. Falavigna, A. Brutti, M. Ravanelli

Paper Code

ACL Findings '24

Continual Contrastive Spoken Language Understanding

U. Cappellazzo, E. Fini, M. Yang, D. Falavigna, A. Brutti, B. Raj

Paper

Interspeech '24

Efficient Fine-tuning of Audio Spectrogram Transformers via Soft Mixture of Adapters

U. Cappellazzo, D. Falavigna, A. Brutti

Paper Code

Interspeech '24

Evaluating and Improving Continual Learning in Spoken Language Understanding

M. Yang, X. Li, U. Cappellazzo, S. Watanabe, B. Raj

Paper

ICASSP '24

Improving Continual Learning of Acoustic Scene Classification via Mutual Information Optimization

M. Yang, U. Cappellazzo, X. Li, S. Watanabe, B. Raj

Paper

ICASSP '24 WS

Training Dynamic Models using Early Exits for ASR on Resource-Constrained Devices

G. A. Wright, U. Cappellazzo, S. Zaiem, D. Raj, L. Ondel Yang, D. Falavigna, M. Ali, A. Brutti

Paper Code

Interspeech '23

Sequence-Level Knowledge Distillation for Class-Incremental End-to-End SLU

U. Cappellazzo, M. Yang, D. Falavigna, A. Brutti

Paper Code

Interspeech '23

An Investigation of the Combination of Rehearsal and Knowledge Distillation in Continual Learning for SLU

U. Cappellazzo, D. Falavigna, A. Brutti

Paper Code

News

LATEST UPDATES

05 Jun '26

3/3 papers accepted to INTERSPEECH 2026 (1 long, 2 regular): Dr. SHAP-AV, VIB-AVSR, MambAdapter. See you in Sydney! 🇦🇺

13 Mar '26

New paper Dr. SHAP-AV — first comprehensive study of modality contributions in AVSR at scale. Project · Paper · Code

17 Jan '26

Two papers accepted to IEEE ICASSP 2026: Omni-AVSR and a study on attention sinks & massive activations in audio-visual LLMs.

19 Sep '25

MoME accepted to NeurIPS 2025 — unifying Matryoshka representation learning with sparse Mixture-of-Experts.

07 Aug '25

Llama-MTSK accepted to IEEE ASRU 2025. See you in Honolulu! 🌺

22 May '25

Llama-SMoP accepted to Interspeech 2025 — a sparse Mixture of Projectors for LLM-based AVSR.

11 Mar '25

Joined Imperial College London (iBUG) as a Research Associate, advised by Stavros Petridis.

15 Jan '25

Defended my PhD cum laude at the University of Trento. Dissertation · Slides
