Large language models (LLMs) have recently achieved impressive results in speech recognition across multiple modalities, including Auditory Speech Recognition (ASR), Visual Speech Recognition (VSR), and Audio-Visual Speech Recognition (AVSR). Despite this progress, current LLM-based approaches typically address each task independently, training separate models that increase computational and deployment costs while missing potential cross-task synergies. They also rely on fixed-rate token compression, which restricts flexibility in balancing accuracy with efficiency. These limitations highlight the need for a unified framework that supports ASR, VSR, and AVSR while enabling elastic inference. To this end, we present Omni-AVSR, a unified audio-visual LLM that combines efficient multi-granularity training with parameter-efficient adaptation. Specifically, we adapt the matryoshka representation learning paradigm to train efficiently across multiple audio and visual granularities, reducing the paradigm's inherent training cost. Furthermore, we explore three LoRA-based strategies for adapting the backbone LLM, balancing shared and task-specific specialization. Experiments on LRS2 and LRS3 show that Omni-AVSR achieves accuracy comparable or superior to state-of-the-art baselines while training a single model at substantially lower training and deployment cost. The model also remains robust under acoustic noise, and we analyze its scaling behavior as LLM size increases, providing insights into the trade-off between performance and efficiency.
Learn about Omni-AVSR's key components: efficient matryoshka training and LoRA-based fine-tuning.
Omni-AVSR results against state-of-the-art methods on the LRS2 and LRS3 datasets.
Explore Omni-AVSR's scaling behavior and the importance of each task in the training paradigm.
The goal of Omni-AVSR is to train a single unified LLM-based model capable of performing ASR, VSR, and AVSR. At the same time, it enables flexible control of audio-visual granularity at inference according to resource constraints. In this way, Omni-AVSR supports multiple modalities and granularities within a single set of weights, while reducing training and deployment costs and achieving performance on par with, or even surpassing, state-of-the-art models trained independently for specific tasks or granularities.
Following prior audio-visual LLMs (e.g., Llama-AVSR, Llama-SMoP), Omni-AVSR comprises pre-trained audio and video encoders, projection layers, and an LLM backbone (see Figure 1a). In the next sections, we detail how Omni-AVSR is endowed with 1) explicit control over audio-visual granularities during inference and 2) the ability to jointly support ASR, VSR, and AVSR within a single model.
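To make this composition concrete, below is a minimal structural sketch in PyTorch. The encoder choices, dimensions, and module names are placeholders for illustration rather than the exact configuration used in the paper; the multi-granularity compression and task-sequence assembly that complete the forward pass are sketched in the following sections.

```python
import torch.nn as nn

class OmniAVSRSketch(nn.Module):
    """Structural sketch only: encoders, dimensions, and names are placeholders."""

    def __init__(self, audio_encoder: nn.Module, video_encoder: nn.Module,
                 llm: nn.Module, d_audio: int, d_video: int, d_llm: int):
        super().__init__()
        self.audio_encoder = audio_encoder            # pre-trained, kept frozen
        self.video_encoder = video_encoder            # pre-trained, kept frozen
        self.audio_proj = nn.Linear(d_audio, d_llm)   # audio projection layer
        self.video_proj = nn.Linear(d_video, d_llm)   # video projection layer
        self.llm = llm                                # frozen backbone, adapted with LoRA
```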
Omni-AVSR introduces a multi-granularity training scheme that allows the model to flexibly trade off efficiency and accuracy during inference. Instead of relying on fixed audio-visual compression rates, Omni-AVSR builds on the Matryoshka Representation Learning (MRL) principle to support inference at multiple token granularities within a single unified model.
Given an input audio waveform \( \mathbf{a} \) and its corresponding lip-movement video \( \mathbf{v} \), the pre-trained encoders produce token sequences \( \mathbf{Z}^a \) and \( \mathbf{Z}^v \), respectively. During training, token sequences at varying granularities are generated by applying \( C_A \) audio compression rates {\(a_1, a_2,\cdots,a_{C_A} \)} and \( C_V \) video rates {\(v_1, v_2,\cdots,v_{C_V} \)}.
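As an illustration of how the multi-granularity sequences can be produced, the sketch below compresses a token sequence along time by an integer rate. Average pooling is an assumed stand-in for the actual compression operator, and the rate values are illustrative.

```python
import torch
import torch.nn.functional as F

def compress_tokens(z: torch.Tensor, rate: int) -> torch.Tensor:
    """Compress a (T, D) token sequence along time by an integer rate.

    Average pooling over non-overlapping windows is an assumption for
    illustration; the actual compression operator may differ.
    """
    T, D = z.shape
    pad = (-T) % rate                        # right-pad so T is divisible by rate
    if pad:
        z = F.pad(z, (0, 0, 0, pad))
    return z.view(-1, rate, D).mean(dim=1)   # shape: (ceil(T / rate), D)

# Token sequences at C_A audio and C_V video granularities (illustrative rates).
Za, Zv = torch.randn(200, 1024), torch.randn(50, 1024)
Za_multi = {a: compress_tokens(Za, a) for a in (4, 16)}   # audio rates a_1..a_{C_A}
Zv_multi = {v: compress_tokens(Zv, v) for v in (2, 5)}    # video rates v_1..v_{C_V}
```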
Naïvely combining these rates would require \( C_A \) LLM forward/backward passes per batch for ASR, \( C_V \) for VSR, and \( C_A \cdot C_V \) for AVSR. This would lead to prohibitive computational overhead and potential interference among multiple objectives.
To overcome this limitation, we introduce a key modification: during training, we randomly select one audio rate \( a_i \) and one video rate \( v_j \) at each iteration, yielding compressed sequences \( \mathbf{Z}^{a_i} \) and \( \mathbf{Z}^{v_j} \). This reduces the number of forward/backward LLM passes to only three, one per task, instead of \( C_A + C_V + C_A \cdot C_V \).
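A minimal sketch of this per-iteration sampling is shown below; the concrete rate pools are illustrative placeholders.

```python
import random

# Illustrative rate pools; the paper's exact rate sets may differ.
AUDIO_RATES = [4, 16]   # C_A = 2 candidate audio compression rates
VIDEO_RATES = [2, 5]    # C_V = 2 candidate video compression rates

def sample_iteration_rates():
    """Draw one (a_i, v_j) pair per training iteration, so each batch requires
    only three LLM passes (one per task) instead of C_A + C_V + C_A * C_V."""
    return random.choice(AUDIO_RATES), random.choice(VIDEO_RATES)
```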
These compressed sequences are then passed through modality-specific projection layers and concatenated with task-specific text tokens \( X_t^{\text{P}} \), where \( t \in \{\mathsf{ASR}, \mathsf{VSR}, \mathsf{AVSR}\} \) and \(X^\text{P}_t\) encodes both the task prompt and the transcription. Therefore, we obtain: \(\mathbf{Z}_{\mathsf{ASR}} = [\mathbf{Z}^{a_i}, X^\text{P}_{\mathsf{ASR}}]\), \(\mathbf{Z}_{\mathsf{VSR}} = [\mathbf{Z}^{v_j}, X^\text{P}_{\mathsf{VSR}}]\), and \(\mathbf{Z}_{\mathsf{AVSR}} = [\mathbf{Z}^{a_i}, \mathbf{Z}^{v_j} , X^\text{P}_{\mathsf{AVSR}}]\). This strategy preserves the flexibility of MRL at inference while substantially reducing its training cost.
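The assembly of the three task sequences can be sketched as follows. All tensors are assumed to be already projected into the LLM embedding space, and the prompt/transcription embeddings \(X_t^{\text{P}}\) are treated as pre-computed inputs.

```python
import torch

def build_task_sequences(za_i: torch.Tensor, zv_j: torch.Tensor,
                         prompts: dict) -> dict:
    """Concatenate compressed, projected modality tokens with the task-specific
    prompt/transcription embeddings X_t^P (all tensors of shape (T, D))."""
    return {
        "ASR":  torch.cat([za_i, prompts["ASR"]], dim=0),
        "VSR":  torch.cat([zv_j, prompts["VSR"]], dim=0),
        "AVSR": torch.cat([za_i, zv_j, prompts["AVSR"]], dim=0),
    }
```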
Omni-AVSR is trained by combining the auto-regressive next-token prediction losses of the three tasks for each input sample. For each task-specific sequence \(\mathbf{Z}_t\), the probability of the target \(\mathbf{Y}\) is computed as \(p(\mathbf{Y}|\mathbf{Z}_t) = \prod_{s=1}^{S}p_\theta(y_s|\mathbf{Z}_t, y_{< s})\), and the corresponding loss is defined as \(\mathcal{L}_t = - \log p(\mathbf{Y}|\mathbf{Z}_t)\), where \(y_{< s}\) denotes the output sequence generated up to token \(s-1\), \(\theta\) denotes the trainable parameters, and \(t \in \{\mathsf{ASR}, \mathsf{VSR}, \mathsf{AVSR}\}\). Overall, the final training objective is: $$\mathcal{L}_{\text{OMNI}} = \lambda_{\mathsf{ASR}}\mathcal{L}_{\mathsf{ASR}} + \lambda_{\mathsf{VSR}}\mathcal{L}_{\mathsf{VSR}} + \lambda_{\mathsf{AVSR}}\mathcal{L}_{\mathsf{AVSR}},$$ where \(\lambda_{\mathsf{ASR}}\), \(\lambda_{\mathsf{VSR}}\), and \(\lambda_{\mathsf{AVSR}}\) are task-specific weights.
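A compact sketch of the combined objective is given below, with the per-task cross-entropy computed only over transcription targets. The default weights reflect the best-performing setting reported later (\(\lambda_{\mathsf{VSR}} = 1.5\), others 1), and the exact shifting/masking conventions are assumptions.

```python
import torch
import torch.nn.functional as F

# Task weights; 1 / 1.5 / 1 is the best-performing setting from the ablation.
TASK_WEIGHTS = {"ASR": 1.0, "VSR": 1.5, "AVSR": 1.0}

def omni_loss(task_logits: dict, task_targets: dict) -> torch.Tensor:
    """L_OMNI = sum_t lambda_t * L_t, with L_t the next-token cross-entropy for task t.

    task_logits[t]:  (S, V) logits, already shifted so position s predicts y_s.
    task_targets[t]: (S,) token ids, with -100 marking ignored (prompt) positions.
    """
    loss = 0.0
    for t, lam in TASK_WEIGHTS.items():
        loss = loss + lam * F.cross_entropy(task_logits[t], task_targets[t],
                                            ignore_index=-100)
    return loss
```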
In Omni-AVSR, the pre-trained LLM is kept frozen while LoRA modules are employed to fine-tune it in a parameter-efficient manner. Given our multi-task setting, we explore three configurations: 1) Omni-LoRA-S, 2) Omni-LoRA-T, and 3) Omni-LoRA-ST, illustrated in Figure 1b.
During training, Omni-LoRA-T and Omni-LoRA-ST activate all task-specific modules. At inference, however, only the module corresponding to the selected task is used, ensuring efficiency.
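Below is a minimal sketch of how the three schemes could be wired around a frozen linear layer, assuming Omni-LoRA-S uses a single shared adapter, Omni-LoRA-T one adapter per task, and Omni-LoRA-ST both; ranks, scaling, and placement are illustrative choices, not the paper's exact configuration. At inference, passing the selected task routes the input through only that task's adapter.

```python
import torch.nn as nn

class LoRA(nn.Module):
    """Standard low-rank adapter branch: (alpha / r) * B A x."""
    def __init__(self, d_in: int, d_out: int, r: int = 8, alpha: int = 16):
        super().__init__()
        self.A = nn.Linear(d_in, r, bias=False)
        self.B = nn.Linear(r, d_out, bias=False)
        nn.init.zeros_(self.B.weight)            # zero-init: adapter starts as a no-op
        self.scale = alpha / r

    def forward(self, x):
        return self.scale * self.B(self.A(x))

class OmniLoRALinear(nn.Module):
    """Frozen linear layer with shared and/or task-specific LoRA branches.
    mode = "S" (shared only), "T" (task-specific only), or "ST" (both)."""
    def __init__(self, base: nn.Linear, mode: str = "ST",
                 tasks=("ASR", "VSR", "AVSR")):
        super().__init__()
        self.base = base.requires_grad_(False)   # frozen pre-trained weights
        d_in, d_out = base.in_features, base.out_features
        self.shared = LoRA(d_in, d_out) if "S" in mode else None
        self.per_task = nn.ModuleDict(
            {t: LoRA(d_in, d_out) for t in tasks}) if "T" in mode else None

    def forward(self, x, task: str):
        y = self.base(x)
        if self.shared is not None:
            y = y + self.shared(x)
        if self.per_task is not None:            # only the selected task's
            y = y + self.per_task[task](x)       # adapter is active at inference
        return y
```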
We conduct experiments on the LRS2 and LRS3 datasets. For a detailed description of the datasets, pre-processing, training/inference details, and more, please refer to our paper.
Table 1 reports the ASR/VSR/AVSR results of our three Omni-AVSR variants on LRS2 and LRS3. On LRS2, the task-specific variant Omni-AVSR-T achieves the best performance, while on LRS3 all three variants yield comparable results. Compared with the baselines, we observe the following: (1) all Omni-AVSR variants consistently outperform Llama-AVSR, which requires a separate model per rate and task; (2) Omni-AVSR-T on LRS2, and all three variants on LRS3, match or surpass Llama-MTSK and Llama-MT; (3) task-wise, Omni-AVSR particularly benefits VSR; and (4) performance trends remain consistent across compression rates.
Beyond delivering strong recognition performance, Omni-AVSR also offers significant computational advantages, as summarized in Table 2. (1) Omni-AVSR requires training only a single model, independent of the number of tasks \(T\) (ASR, VSR, and AVSR in our case, so \(T = 3\)) and the number of audio \(C_A\) and video \(C_V\) compression rates (\(C_A = C_V = 2\) in our setup). (2) In terms of the number of forward/backward passes required over the LLM, Omni-AVSR computes the loss only once per task, as it samples a single audio and video rate at each iteration, thus reducing the requirement to just \(T\) passes. Overall, Omni-AVSR requires only a single model and substantially reduces training computation compared to all baselines.
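As a concrete example with our configuration (\(T = 3\), \(C_A = C_V = 2\)): a naïve MRL scheme would require \(C_A + C_V + C_A \cdot C_V = 2 + 2 + 4 = 8\) LLM forward/backward passes per batch, whereas Omni-AVSR requires only \(T = 3\).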
To evaluate the robustness of Omni-AVSR under noisy conditions, we inject babble noise at varying SNRs. As shown in Table 3, Omni-AVSR-ST consistently outperforms Llama-AVSR and Llama-MTSK, and remains competitive with Llama-MT across noise levels, often surpassing it at lower SNRs.
In Table 4, we compare Omni-AVSR-ST with three state-of-the-art methods that train a single model for ASR, VSR, and AVSR: u-HuBERT, MultiAVSR, and USR. At the (4,2) compression setting, Omni-AVSR-ST achieves the best performance across all tasks while requiring significantly fewer parameters, and it surpasses u-HuBERT despite the latter being trained on 1759 hours of data.
Below we include some video clips from the LRS3 test set together with the transcriptions generated by Omni-AVSR-ST using audio-only, video-only, or audio-visual inputs at different compression rates. The *WER* shown is the one obtained by each configuration across the 5 videos.
In Table 5, we analyze the impact of varying the loss weight coefficients for each task on the LRS2 dataset. The best performance is given by \(\lambda_{\mathsf{ASR}} = \lambda_{\mathsf{AVSR}} = 1\) and \(\lambda_{\mathsf{VSR}} = 1.5\). Since VSR is the most challenging of the three tasks, assigning it a higher weight leads to improved overall results.
Figure 2 presents a comparison of Omni-AVSR-ST with recent state-of-the-art approaches, whose details can be found in our paper. Omni-AVSR-ST (evaluated at audio-video rates of (4,2)) achieves competitive WERs while requiring substantially fewer parameters and training data hours than all baselines, within one consistent framework.
We study how scaling the LLM impacts performance across ASR, VSR, and AVSR in Figure 3, using models of different sizes from the Llama and Qwen 2.5 families. As shown, performance improves with larger LLMs, with larger gains on more challenging tasks (e.g., VSR) or under higher compression (e.g., ASR at rate 16). However, larger models incur higher training computation and memory usage, as well as slower inference. Overall, LLMs in the 1–3B parameter range offer a favorable trade-off between accuracy and efficiency.
In this work, we introduce Omni-AVSR, the first unified audio-visual LLM that jointly supports ASR, VSR, and AVSR while enabling elastic inference under a single set of weights. By combining efficient matryoshka-based multi-granularity training with LoRA adaptation strategies, Omni-AVSR achieves strong performance while reducing training and deployment costs. Experiments on LRS2 and LRS3 show that Omni-AVSR matches or surpasses state-of-the-art baselines, remains robust in noisy conditions, and delivers favorable trade-offs when scaling LLM size. Furthermore, Omni-AVSR provides significant computational savings, requiring only one model and a reduced number of LLM passes during training.
If you find this work useful, please cite our paper using the following BibTeX entry:
@article{Omni-AVSR,
title={Omni-AVSR: Towards Unified Multimodal Speech Recognition with Large Language Models},
author={Umberto Cappellazzo and Xubo Liu and Pingchuan Ma and Stavros Petridis and Maja Pantic},
journal={arXiv 2025},
year={2025},
}