Large language models (LLMs) have recently achieved impressive results in speech recognition across multiple modalities, including Auditory Speech Recognition (ASR), Visual Speech Recognition (VSR), and Audio-Visual Speech Recognition (AVSR). Despite this progress, current LLM-based approaches typically address each task independently, training separate models that increase computational and deployment costs while missing potential cross-task synergies. They also rely on fixed-rate token compression, which restricts flexibility in balancing accuracy with efficiency. These limitations highlight the need for a unified framework that supports ASR, VSR, and AVSR while enabling elastic inference. To this end, we present Omni-AVSR, a unified audio-visual LLM that combines efficient multi-granularity training with parameter-efficient adaptation. Specifically, we adapt the matryoshka representation learning paradigm to train efficiently across multiple audio and visual granularities, reducing the paradigm's inherent training cost. Furthermore, we explore three LoRA-based strategies for adapting the backbone LLM, balancing shared and task-specific specialization. Experiments on LRS2 and LRS3 show that Omni-AVSR achieves accuracy comparable or superior to state-of-the-art baselines while training a single model at substantially lower training and deployment cost. The model also remains robust under acoustic noise, and we analyze its scaling behavior as LLM size increases, providing insights into the trade-off between performance and efficiency.
Learn about Omni-AVSR's key components: efficient matryoshka training and LoRA-based fine-tuning.
Omni-AVSR results against state-of-the-art methods on the LRS2 and LRS3 datasets.
Explore Omni-AVSR's scaling behavior and the importance of each task in the training paradigm.
The goal of Omni-AVSR is to train a single unified LLM-based model capable of performing ASR, VSR, and AVSR. At the same time, it enables flexible control of audio-visual granularity at inference according to resource constraints. In this way, Omni-AVSR supports multiple modalities and granularities within a single set of weights, while reducing training and deployment costs and achieving performance on par with, or even surpassing, state-of-the-art models trained independently for specific tasks or granularities.
Following prior audio-visual LLMs (e.g., Llama-AVSR, Llama-SMoP), Omni-AVSR comprises pre-trained audio and video encoders, projection layers, and an LLM backbone (see Figure 1a). In the next sections, we detail how Omni-AVSR is endowed with 1) explicit control over audio-visual granularities during inference and 2) the ability to jointly support ASR, VSR, and AVSR within a single model.
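To make this composition concrete, below is a minimal structural sketch in PyTorch. The encoder choices, dimensions, and module names are placeholders for illustration rather than the exact configuration used in the paper; the multi-granularity compression and task-sequence assembly that complete the forward pass are sketched in the following sections.

```python
import torch.nn as nn

class OmniAVSRSketch(nn.Module):
    """Structural sketch only: encoders, dimensions, and names are placeholders."""

    def __init__(self, audio_encoder: nn.Module, video_encoder: nn.Module,
                 llm: nn.Module, d_audio: int, d_video: int, d_llm: int):
        super().__init__()
        self.audio_encoder = audio_encoder            # pre-trained, kept frozen
        self.video_encoder = video_encoder            # pre-trained, kept frozen
        self.audio_proj = nn.Linear(d_audio, d_llm)   # audio projection layer
        self.video_proj = nn.Linear(d_video, d_llm)   # video projection layer
        self.llm = llm                                # frozen backbone, adapted with LoRA
```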
Omni-AVSR introduces a multi-granularity training scheme that allows the model to flexibly trade off efficiency and accuracy during inference. Instead of relying on fixed audio-visual compression rates, Omni-AVSR builds on the Matryoshka Representation Learning (MRL) principle to support inference at multiple token granularities within a single unified model.
Given an input audio waveform \( \mathbf{a} \) and its corresponding lip-movement video \( \mathbf{v} \), the pre-trained encoders produce token sequences \( \mathbf{Z}^a \) and \( \mathbf{Z}^v \), respectively. During training, token sequences at varying granularities are generated by applying \( C_A \) audio compression rates {\(a_1, a_2,\cdots,a_{C_A} \)} and \( C_V \) video rates {\(v_1, v_2,\cdots,v_{C_V} \)}.
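As an illustration of how the multi-granularity sequences can be produced, the sketch below compresses a token sequence along time by an integer rate. Average pooling is an assumed stand-in for the actual compression operator, and the rate values are illustrative.

```python
import torch
import torch.nn.functional as F

def compress_tokens(z: torch.Tensor, rate: int) -> torch.Tensor:
    """Compress a (T, D) token sequence along time by an integer rate.

    Average pooling over non-overlapping windows is an assumption for
    illustration; the actual compression operator may differ.
    """
    T, D = z.shape
    pad = (-T) % rate                        # right-pad so T is divisible by rate
    if pad:
        z = F.pad(z, (0, 0, 0, pad))
    return z.view(-1, rate, D).mean(dim=1)   # shape: (ceil(T / rate), D)

# Token sequences at C_A audio and C_V video granularities (illustrative rates).
Za, Zv = torch.randn(200, 1024), torch.randn(50, 1024)
Za_multi = {a: compress_tokens(Za, a) for a in (4, 16)}   # audio rates a_1..a_{C_A}
Zv_multi = {v: compress_tokens(Zv, v) for v in (2, 5)}    # video rates v_1..v_{C_V}
```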
Naïvely combining these rates would require \( C_A \) LLM forward/backward passes per batch for ASR, \( C_V \) for VSR, and \( C_A \cdot C_V \) for AVSR. This would lead to prohibitive computational overhead and potential interference among multiple objectives.
To overcome this limitation, we introduce a key modification: during training, we randomly select one audio rate \( a_i \) and one video rate \( v_j \) at each iteration, yielding compressed sequences \( \mathbf{Z}^{a_i} \) and \( \mathbf{Z}^{v_j} \). This reduces the number of forward/backward LLM passes to only three, one per task, instead of \( C_A + C_V + C_A \cdot C_V \).
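A minimal sketch of this per-iteration sampling is shown below; the concrete rate pools are illustrative placeholders.

```python
import random

# Illustrative rate pools; the paper's exact rate sets may differ.
AUDIO_RATES = [4, 16]   # C_A = 2 candidate audio compression rates
VIDEO_RATES = [2, 5]    # C_V = 2 candidate video compression rates

def sample_iteration_rates():
    """Draw one (a_i, v_j) pair per training iteration, so each batch requires
    only three LLM passes (one per task) instead of C_A + C_V + C_A * C_V."""
    return random.choice(AUDIO_RATES), random.choice(VIDEO_RATES)
```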
These compressed sequences are then passed through modality-specific projection layers and concatenated with task-specific text tokens \( X_t^{\text{P}} \), where \( t \in \{\mathsf{ASR}, \mathsf{VSR}, \mathsf{AVSR}\} \) and \(X^\text{P}_t\) encodes both the task prompt and the transcription. Therefore, we obtain: \(\mathbf{Z}_{\mathsf{ASR}} = [\mathbf{Z}^{a_i}, X^\text{P}_{\mathsf{ASR}}]\), \(\mathbf{Z}_{\mathsf{VSR}} = [\mathbf{Z}^{v_j}, X^\text{P}_{\mathsf{VSR}}]\), and \(\mathbf{Z}_{\mathsf{AVSR}} = [\mathbf{Z}^{a_i}, \mathbf{Z}^{v_j} , X^\text{P}_{\mathsf{AVSR}}]\). This strategy preserves the flexibility of MRL at inference while substantially reducing its training cost.
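The assembly of the three task sequences can be sketched as follows. All tensors are assumed to be already projected into the LLM embedding space, and the prompt/transcription embeddings \(X_t^{\text{P}}\) are treated as pre-computed inputs.

```python
import torch

def build_task_sequences(za_i: torch.Tensor, zv_j: torch.Tensor,
                         prompts: dict) -> dict:
    """Concatenate compressed, projected modality tokens with the task-specific
    prompt/transcription embeddings X_t^P (all tensors of shape (T, D))."""
    return {
        "ASR":  torch.cat([za_i, prompts["ASR"]], dim=0),
        "VSR":  torch.cat([zv_j, prompts["VSR"]], dim=0),
        "AVSR": torch.cat([za_i, zv_j, prompts["AVSR"]], dim=0),
    }
```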
Omni-AVSR is trained by combining the auto-regressive next-token prediction losses of the three tasks for each input sample. For each task-specific sequence \(\mathbf{Z}_t\), the probability of the target \(\mathbf{Y}\) is computed as \(p(\mathbf{Y}|\mathbf{Z}_t) = \prod_{s=1}^{S}p_\theta(y_s|\mathbf{Z}_t, y_{< s})\), and the corresponding loss is defined as \(\mathcal{L}_t = - \log p(\mathbf{Y}|\mathbf{Z}_t)\), where \(y_{< s}\) denotes the output sequence generated up to token \(s-1\), \(\theta\) denotes the trainable parameters, and \(t \in \{\mathsf{ASR}, \mathsf{VSR}, \mathsf{AVSR}\}\). Overall, the final training objective is: $$\mathcal{L}_{\text{OMNI}} = \lambda_{\mathsf{ASR}}\mathcal{L}_{\mathsf{ASR}} + \lambda_{\mathsf{VSR}}\mathcal{L}_{\mathsf{VSR}} + \lambda_{\mathsf{AVSR}}\mathcal{L}_{\mathsf{AVSR}},$$ where \(\lambda_{\mathsf{ASR}}\), \(\lambda_{\mathsf{VSR}}\), and \(\lambda_{\mathsf{AVSR}}\) are task-specific weights.
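A compact sketch of the combined objective is given below, with the per-task cross-entropy computed only over transcription targets. The default weights reflect the best-performing setting reported later (\(\lambda_{\mathsf{VSR}} = 1.5\), others 1), and the exact shifting/masking conventions are assumptions.

```python
import torch
import torch.nn.functional as F

# Task weights; 1 / 1.5 / 1 is the best-performing setting from the ablation.
TASK_WEIGHTS = {"ASR": 1.0, "VSR": 1.5, "AVSR": 1.0}

def omni_loss(task_logits: dict, task_targets: dict) -> torch.Tensor:
    """L_OMNI = sum_t lambda_t * L_t, with L_t the next-token cross-entropy for task t.

    task_logits[t]:  (S, V) logits, already shifted so position s predicts y_s.
    task_targets[t]: (S,) token ids, with -100 marking ignored (prompt) positions.
    """
    loss = 0.0
    for t, lam in TASK_WEIGHTS.items():
        loss = loss + lam * F.cross_entropy(task_logits[t], task_targets[t],
                                            ignore_index=-100)
    return loss
```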
In Omni-AVSR, the pre-trained LLM is kept frozen while LoRA modules are employed to fine-tune it in a parameter-efficient manner. Given our multi-task setting, we explore three configurations: 1) Omni-LoRA-S, 2) Omni-LoRA-T, and 3) Omni-LoRA-ST, illustrated in Figure 1b.
During training, Omni-LoRA-T and Omni-LoRA-ST activate all task-specific modules. At inference, however, only the module corresponding to the selected task is used, ensuring efficiency.
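Below is a minimal sketch of how the three schemes could be wired around a frozen linear layer, assuming Omni-LoRA-S uses a single shared adapter, Omni-LoRA-T one adapter per task, and Omni-LoRA-ST both; ranks, scaling, and placement are illustrative choices, not the paper's exact configuration. At inference, passing the selected task routes the input through only that task's adapter.

```python
import torch.nn as nn

class LoRA(nn.Module):
    """Standard low-rank adapter branch: (alpha / r) * B A x."""
    def __init__(self, d_in: int, d_out: int, r: int = 8, alpha: int = 16):
        super().__init__()
        self.A = nn.Linear(d_in, r, bias=False)
        self.B = nn.Linear(r, d_out, bias=False)
        nn.init.zeros_(self.B.weight)            # zero-init: adapter starts as a no-op
        self.scale = alpha / r

    def forward(self, x):
        return self.scale * self.B(self.A(x))

class OmniLoRALinear(nn.Module):
    """Frozen linear layer with shared and/or task-specific LoRA branches.
    mode = "S" (shared only), "T" (task-specific only), or "ST" (both)."""
    def __init__(self, base: nn.Linear, mode: str = "ST",
                 tasks=("ASR", "VSR", "AVSR")):
        super().__init__()
        self.base = base.requires_grad_(False)   # frozen pre-trained weights
        d_in, d_out = base.in_features, base.out_features
        self.shared = LoRA(d_in, d_out) if "S" in mode else None
        self.per_task = nn.ModuleDict(
            {t: LoRA(d_in, d_out) for t in tasks}) if "T" in mode else None

    def forward(self, x, task: str):
        y = self.base(x)
        if self.shared is not None:
            y = y + self.shared(x)
        if self.per_task is not None:            # only the selected task's
            y = y + self.per_task[task](x)       # adapter is active at inference
        return y
```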
We conduct experiments on the LRS2 and LRS3 datasets. For a detailed description of the datasets, pre-processing, training/inference details, and more, please refer to our paper.
Table 1 reports the ASR/VSR/AVSR results of our three Omni-AVSR variants on LRS2 and LRS3. On LRS2, the task-specific variant Omni-AVSR-T achieves the best performance, while on LRS3 all three variants yield comparable results. Compared with the baselines, we observe the following: (1) all Omni-AVSR variants consistently outperform Llama-AVSR, which requires a separate model per rate and task; (2) Omni-AVSR-T on LRS2, and all three variants on LRS3, match or surpass Llama-MTSK and Llama-MT; (3) task-wise, Omni-AVSR particularly benefits VSR; and (4) performance trends remain consistent across compression rates.
Beyond delivering strong recognition performance, Omni-AVSR also offers significant computational advantages, as summarized in Table 2. (1) Omni-AVSR requires training only a single model, independent of the number of tasks \(T\) (ASR, VSR, and AVSR in our case, so \(T = 3\)) and the number of audio \(C_A\) and video \(C_V\) compression rates (\(C_A = C_V = 2\) in our setup). (2) In terms of the number of forward/backward passes required over the LLM, Omni-AVSR computes the loss only once per task, as it samples a single audio and video rate at each iteration, thus reducing the requirement to just \(T\) passes. Overall, Omni-AVSR requires only a single model and substantially reduces training computation compared to all baselines.
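As a concrete example with our configuration (\(T = 3\), \(C_A = C_V = 2\)): a naïve MRL scheme would require \(C_A + C_V + C_A \cdot C_V = 2 + 2 + 4 = 8\) LLM forward/backward passes per batch, whereas Omni-AVSR requires only \(T = 3\).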
To evaluate the robustness of Omni-AVSR under noisy conditions, we inject babble noise at varying SNRs. As shown in Table 3, Omni-AVSR-ST consistently outperforms Llama-AVSR and Llama-MTSK, and remains competitive with Llama-MT across noise levels, often surpassing it at lower SNRs.
In Table 4, we compare Omni-AVSR-ST with three state-of-the-art methods that train a single model for ASR, VSR, and AVSR: u-HuBERT, MultiAVSR, and USR. At the (4,2) compression setting, Omni-AVSR-ST achieves the best performance across all tasks while requiring significantly fewer parameters, and it surpasses u-HuBERT despite the latter being trained on 1759 hours of data.
Below we include some video clips from the LRS3 test set together with the transcriptions generated by Omni-AVSR-ST using audio-only, video-only, or audio-visual inputs at different compression rates. The *WER* shown is the one obtained by each configuration across the 5 videos.
In Table 5, we analyze the impact of varying the loss weight coefficients for each task on the LRS2 dataset. The best performance is given by \(\lambda_{\mathsf{ASR}} = \lambda_{\mathsf{AVSR}} = 1\) and \(\lambda_{\mathsf{VSR}} = 1.5\). Since VSR is the most challenging of the three tasks, assigning it a higher weight leads to improved overall results.
Figure 2 presents a comparison of Omni-AVSR-ST with recent state-of-the-art approaches, whose details can be found in our paper. Omni-AVSR-ST (evaluated at audio-video rates of (4,2)) achieves competitive WERs while requiring substantially fewer parameters and training data hours than all baselines, within one consistent framework.
We study how scaling the LLM impacts performance across ASR, VSR, and AVSR in Figure 3, using models of different sizes from the Llama and Qwen 2.5 families. As shown, performance improves with larger LLMs, with larger gains on more challenging tasks (e.g., VSR) or under higher compression (e.g., ASR at rate 16). However, larger models incur higher training computation and memory usage, as well as slower inference. Overall, LLMs in the 1–3B parameter range offer a favorable trade-off between accuracy and efficiency.
In this work, we introduce Omni-AVSR, the first unified audio-visual LLM that jointly supports ASR, VSR, and AVSR while enabling elastic inference under a single set of weights. By combining efficient matryoshka-based multi-granularity training with LoRA adaptation strategies, Omni-AVSR achieves strong performance while reducing training and deployment costs. Experiments on LRS2 and LRS3 show that Omni-AVSR matches or surpasses state-of-the-art baselines, remains robust in noisy conditions, and delivers favorable trade-offs when scaling LLM size. Furthermore, Omni-AVSR provides significant computational savings, requiring only one model and a reduced number of LLM passes during training.
If you find this work useful, please cite our paper using the following BibTeX entry:
@article{Omni-AVSR,
title={Omni-AVSR: Towards Unified Multimodal Speech Recognition with Large Language Models},
author={Umberto Cappellazzo and Xubo Liu and Pingchuan Ma and Stavros Petridis and Maja Pantic},
journal={arXiv 2025},
year={2025},
}