Speech large language models (LLMs) have emerged as a prominent research focus in speech processing. We propose VocalNet-1B and VocalNet-8B, a series of high-performance, low-latency speech LLMs enabled by a scalable and model-agnostic training framework for real-time voice interaction. Departing from the conventional next-token prediction (NTP), we introduce multi-token prediction (MTP), a novel approach optimized for speech LLMs that simultaneously improves generation speed and quality. Experiments show that VocalNet outperforms mainstream Omni LLMs despite using significantly less training data, while also surpassing existing open-source speech LLMs by a substantial margin. To support reproducibility and community advancement, we will open-source all model weights, inference code, training data, and framework implementations upon publication.
Figure 1: On the left: The architecture of the VocalNet model. On the right: A depiction of VocalNet's dual-stage training strategy.
Figure 2: (a) Non-Streaming Attention Mask: each LLM hidden state v_i^LLM attends to all text positions, and each speech token s_i attends to all text positions together with its preceding speech positions; (b) Streaming Attention Mask: v_i^LLM attends to itself and its preceding text positions, and s_i attends to a chunk-limited set of text positions, itself, and its preceding speech positions.
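To make the streaming setup concrete, the following minimal PyTorch sketch constructs a chunk-limited attention mask that mirrors the visibility rules described in Figure 2(b). The function name `build_streaming_mask` and the `text_chunk`/`speech_chunk` scheduling parameters are illustrative assumptions, not VocalNet's released configuration.

```python
import torch

def build_streaming_mask(n_text: int, n_speech: int,
                         text_chunk: int = 5, speech_chunk: int = 15) -> torch.Tensor:
    """Boolean attention mask (True = may attend) over [text | speech] positions.

    Text positions attend causally to earlier text; speech position i sees only
    the text chunks released so far, plus itself and earlier speech positions.
    The text_chunk / speech_chunk schedule is a hypothetical choice.
    """
    n = n_text + n_speech
    mask = torch.zeros(n, n, dtype=torch.bool)

    # (1) Text side: causal attention over text positions only.
    mask[:n_text, :n_text] = torch.ones(n_text, n_text).tril().bool()

    # (2) Speech side.
    for i in range(n_speech):
        q = n_text + i
        # Chunk-limited text view: one more block of `text_chunk` text positions
        # becomes visible after every `speech_chunk` generated speech tokens.
        visible_text = min(n_text, (i // speech_chunk + 1) * text_chunk)
        mask[q, :visible_text] = True
        # Itself and all previous speech positions (causal over speech).
        mask[q, n_text:q + 1] = True

    return mask
```

For example, `build_streaming_mask(20, 60)` returns an 80x80 mask in which the first speech token sees only the first 5 text positions; setting `visible_text = n_text` for every speech position recovers the non-streaming mask of Figure 2(a).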
Figure 3: Illustration of various acceleration strategies. (a) Group Modeling; (b) MTP-Parallel-Linear; (c) MTP-DeepSeek; (d) our MTP implementation.
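As a rough illustration of the chained multi-token prediction idea sketched in panel (d), the PyTorch snippet below shows how K lightweight MTP modules could each emit one additional future speech token from a single backbone hidden state, so that accepting only k <= K of them per step trades quality for roughly a k-fold decoding speedup. The module structure (one Transformer layer per step, greedy decoding) is a hypothetical stand-in, not the actual VocalNet implementation.

```python
import torch
import torch.nn as nn

class MTPSpeechHead(nn.Module):
    """Hypothetical sketch of chained multi-token prediction for speech tokens.

    One backbone forward pass yields a hidden state h_t; each lightweight module
    refines the running state and emits one more future speech token. Accepting
    k <= K tokens per step gives roughly a k x decoding speedup.
    """
    def __init__(self, d_model: int, vocab_size: int, num_modules: int = 5):
        super().__init__()
        self.modules_ = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
            for _ in range(num_modules)
        )
        self.lm_head = nn.Linear(d_model, vocab_size)
        self.embed = nn.Embedding(vocab_size, d_model)

    @torch.no_grad()
    def decode_step(self, h: torch.Tensor, accept: int) -> list[int]:
        """h: (1, 1, d_model) hidden state from the speech decoder backbone."""
        tokens = []
        state = h
        for module in self.modules_[:accept]:      # use only `accept` modules
            state = module(state)                  # refine the running state
            tok = int(self.lm_head(state[:, -1]).argmax(-1))
            tokens.append(tok)
            # Fold the predicted token back in before the next module.
            state = state + self.embed(torch.tensor([[tok]], device=h.device))
        return tokens
```

Under this reading, a 3x entry in the speedup-ratio columns of the ablation tables below corresponds to accepting three predicted speech tokens per backbone step.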
👨 Male Voice Response
👩 Female Voice Response
I'm trying to teach myself to have nicer handwriting. Can you help?
What are the best exercises for beginners?
Is there a meaning for Christmas wreaths?
Which city is known as the 'City of Canals'?
What is the name of the largest rainforest in the world?
Which bird can fly faster than any other bird?
Could you please tell me how to live a healthy life?
What do you think a perfect society looks like? Please provide a brief description.
Which is more important: happiness or fun? Could you briefly share your thoughts?
Describe the experience of winning a prize in a joyful voice.
😊 Joy: Wow! I actually won the grand prize; the excitement is simply beyond words. This is the moment I have been dreaming of, and all the effort and waiting were completely worth it!
Describe, with sadness, the reluctance of parting with classmates at the graduation ceremony.
😢 Sadness: At the graduation ceremony we said our goodbyes, and tears welled up. The smiles on everyone's faces could not hide the sorrow of parting; knowing we will no longer gather together often makes it truly hard to let go.
Respond in an angry tone to someone's disrespectful behavior.
😠 Angry: Your disrespectful behavior is simply outrageous! I will not tolerate such rudeness, and I hope you realize your mistake and correct it.
Describe a midnight scare in a fearful voice.
😰 Fear: Late at night everything was terrifyingly quiet, when suddenly a strange gust of wind slammed the window shut with a loud bang, as if something were about to break in. I froze in fear, not daring to breathe, and could only hug myself tightly and pray for it all to end soon.
Complain about a terrible dinner experience in a disgusted tone.
🤢 Disgust: Last night's dinner was simply unbearable: the greasy texture combined with the pungent smell left my stomach churning afterwards.
Explain the reason for being late in an apologetic tone.
😔 Sorry: I am really sorry. I ran into a traffic jam on the way, which made me late. I know this caused you inconvenience, and I hope you can forgive my negligence.
I'm in a terrible mood today. I was reading up on John Locke's views on the treatment of slaves, and they are truly repugnant.
😢 Sadness: I'm sorry to hear how you are feeling. John Locke's views are indeed hard to accept, but his ideas are complicated by the historical context of his time. I hope you can find something to lift your spirits.
Sigh, I'm sick and feeling awful.
😢 Sadness: I'm sad to hear that you are ill. I hope you get better soon.
Thank you for understanding. I read a bit about John Locke's contributions to the American Revolution and the Constitution, and I feel a little better.
😊 Joy: I'm glad to hear that you are feeling better. John Locke's ideas did have an important influence on the American Revolution and the shaping of the Constitution, and his views were very progressive for his time. I hope this positive information makes you feel more optimistic.
How could you give me incorrect information!
😔 Sorry: I'm truly sorry for the trouble I caused. I will correct the incorrect information as soon as possible.
Bold indicates the best result in each subgroup and underline indicates the second-best result.
| Model | LLM size | Modality | AlpacaEval | Llama Questions | TriviaQA | Web Questions |
| --- | --- | --- | --- | --- | --- | --- |
| Mini-Omni | 0.5B | s→t | 1.84 | 2.7 | 0.12 | 0.22 |
| | | s→s | 1.80 | 2.7 | 0.08 | 0.20 |
| SLAM-Omni | 0.5B | s→t | 3.50 | 29.4 | 0.39 | 0.84 |
| | | s→s | 3.01 | 26.7 | 0.34 | 0.69 |
| VocalNet-1B (VA) | 1B | s→t | 5.38 | 70.3 | 3.38 | 4.93 |
| | | s→s | 4.83 | 61.0 | 2.78 | 4.47 |
| VocalNet-1B | 1B | s→t | 5.79 | 71.7 | 3.60 | 5.16 |
| | | s→s | 5.03 | 63.7 | 3.06 | 4.68 |
| LLaMA-Omni | 8B | s→t | 5.31 | 69.7 | 4.44 | 5.44 |
| | | s→s | 3.89 | 55.1 | 2.44 | 4.00 |
| Freeze-Omni | 7B | s→t | 4.51 | 77.7 | 5.32 | 6.41 |
| | | s→s | 2.99 | 60.2 | 3.53 | 4.78 |
| GLM-4-Voice | 9B | s→t | 5.86 | 77.4 | 4.95 | 5.56 |
| | | s→s | 5.27 | 64.3 | 4.63 | 5.40 |
| Baichuan-Omni-1.5 | 7B | s→t | 5.20 | 77.6 | 5.72 | 6.12 |
| | | s→s | 4.10 | 61.2 | 4.13 | 5.18 |
| MiniCPM-o | 8B | s→t | 6.13 | 77.2 | 6.43 | 7.16 |
| | | s→s | 4.95 | 65.8 | 4.99 | 6.22 |
| Minmo* | 8B | s→t | - | 78.9 | 4.83 | 5.50 |
| | | s→s | 6.48 | 64.1 | 3.75 | 3.99 |
| Qwen2.5-Omni | 8B | s→t | 6.01 | 79.0 | 5.89 | 6.88 |
| | | s→s | 5.73 | 76.3 | 5.59 | 6.70 |
| VocalNet-8B (VA) | 8B | s→t | 7.05 | 77.1 | 6.15 | 6.34 |
| | | s→s | 6.30 | 71.4 | 5.24 | 5.81 |
| VocalNet-8B | 8B | s→t | 7.12 | 79.5 | 6.24 | 6.48 |
| | | s→s | 6.37 | 73.1 | 5.67 | 6.16 |
Bold indicates the best result in each subgroup and underline indicates the second-best result.
| Model | AlpacaEval WER↓ | AlpacaEval UTMOS↑ | Llama Questions WER↓ | Llama Questions UTMOS↑ | TriviaQA WER↓ | TriviaQA UTMOS↑ | Web Questions WER↓ | Web Questions UTMOS↑ | Avg WER↓ | Avg UTMOS↑ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Mini-Omni | 20.78 | 4.429 | 5.20 | 4.428 | 7.43 | 4.428 | 8.51 | 4.433 | 8.66 | 4.430 |
| SLAM-Omni | 5.52 | 4.439 | 5.55 | 4.467 | 6.16 | 4.470 | 6.50 | 4.461 | 6.17 | 4.464 |
| VocalNet-1B (VA) | 3.43 | 4.495 | 3.65 | 4.498 | 5.97 | 4.499 | 6.40 | 4.489 | 5.66 | 4.495 |
| VocalNet-1B | 3.43 | 4.491 | 3.27 | 4.497 | 6.73 | 4.486 | 4.88 | 4.493 | 5.31 | 4.491 |
| LLaMA-Omni | 6.00 | 3.942 | 10.00 | 4.003 | 20.93 | 3.965 | 14.60 | 3.935 | 15.90 | 3.956 |
| Freeze-Omni | 14.33 | 4.377 | 14.20 | 4.417 | 20.39 | 4.404 | 18.25 | 4.398 | 18.31 | 4.401 |
| GLM-4-Voice | 18.71 | 4.025 | 14.45 | 4.152 | 8.33 | 4.306 | 6.08 | 4.214 | 8.99 | 4.228 |
| Baichuan-Omni-1.5 | 20.84 | 4.082 | 22.82 | 4.332 | 22.36 | 4.401 | 23.29 | 4.350 | 22.67 | 4.347 |
| MiniCPM-o | 15.35 | 4.102 | 5.73 | 4.228 | 8.08 | 4.128 | 8.94 | 4.125 | 8.72 | 4.137 |
| Qwen2.5-Omni | 2.41 | 4.299 | 0.93 | 4.315 | 1.13 | 4.339 | 4.68 | 4.363 | 2.63 | 4.342 |
| VocalNet-8B (VA) | 2.65 | 4.490 | 3.00 | 4.503 | 5.02 | 4.499 | 4.21 | 4.485 | 4.26 | 4.493 |
| VocalNet-8B | 4.71 | 4.489 | 2.68 | 4.500 | 4.04 | 4.482 | 3.11 | 4.492 | 3.56 | 4.489 |
Bold indicates the best result.
| Method | Group Size / Module Num | Speedup Ratio | WER↓ | UTMOS↑ |
| --- | --- | --- | --- | --- |
| Baseline (NTP) | - | 1× | 10.62 | 4.488 |
| Group-Linear | 3 | 3× | 11.50 | 4.488 |
| | 5 | 5× | 17.61 | 4.414 |
| Group-Trans | 3 | 3× | 14.34 | 4.489 |
| | 5 | 5× | 17.90 | 4.468 |
| MTP-Parallel-Linear | 5 | 1× | 8.61 | 4.492 |
| | | 3× | 8.00 | 4.494 |
| | | 5× | 10.57 | 4.467 |
| MTP-DeepSeek | 5 | 1× | 9.14 | 4.493 |
| | | 3× | 9.02 | 4.498 |
| | | 5× | 18.23 | 4.488 |
| MTP-VocalNet | 5 | 1× | 6.84 | 4.494 |
| | | 3× | 5.66 | 4.495 |
| | | 5× | 6.46 | 4.486 |
Bold indicates the best result and underline indicates the second-best result.
| Module Num | Speedup | AlpacaEval WER↓ | AlpacaEval UTMOS↑ | Llama Questions WER↓ | Llama Questions UTMOS↑ | TriviaQA WER↓ | TriviaQA UTMOS↑ | Web Questions WER↓ | Web Questions UTMOS↑ | Avg WER↓ | Avg UTMOS↑ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 3 | 1× | 5.38 | 4.489 | 5.24 | 4.504 | 7.59 | 4.500 | 9.23 | 4.484 | 7.79 | 4.493 |
| | 3× | 3.37 | 4.493 | 3.95 | 4.498 | 5.97 | 4.498 | 6.43 | 4.485 | 5.70 | 4.493 |
| 5 | 1× | 4.14 | 4.485 | 4.48 | 4.502 | 6.52 | 4.497 | 8.41 | 4.491 | 6.84 | 4.495 |
| | 3× | 3.43 | 4.495 | 3.65 | 4.498 | 5.97 | 4.499 | 6.40 | 4.489 | 5.66 | 4.495 |
| | 5× | 3.84 | 4.478 | 4.28 | 4.493 | 6.40 | 4.489 | 7.70 | 4.483 | 6.46 | 4.486 |
| 7 | 1× | 5.38 | 4.489 | 5.24 | 4.502 | 7.59 | 4.480 | 9.23 | 4.490 | 7.79 | 4.487 |
| | 3× | 3.40 | 4.490 | 3.92 | 4.499 | 5.91 | 4.498 | 7.57 | 4.494 | 6.14 | 4.496 |
| | 5× | 4.26 | 4.481 | 4.33 | 4.489 | 6.32 | 4.496 | 8.76 | 4.484 | 6.89 | 4.489 |
| | 7× | 5.50 | 4.470 | 5.19 | 4.474 | 8.28 | 4.478 | 9.20 | 4.462 | 8.06 | 4.470 |
Experiments are conducted on a single NVIDIA L20 GPU.
Model | Speech Encoder (ms) | LLM (ms) | Speech Decoder (ms) | Speech Vocoder (ms) | Sum (ms) |
---|---|---|---|---|---|
VocalNet-1B | 35.86 | 33.95 | 24.74 | 225.18 | 319.73 |
VocalNet-8B | 36.08 | 126.71 | 40.02 | 225.56 | 428.38 |
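For reference, per-stage timings like those above could be collected with a simple GPU-synchronized timer such as the hedged sketch below; the stage functions are placeholders for the actual VocalNet modules, and the end-to-end figure is simply the sum of the stage averages.

```python
import time
import torch

def timed(fn, *args, warmup: int = 3, iters: int = 20) -> float:
    """Average wall-clock latency in ms of one pipeline stage.

    torch.cuda.synchronize() ensures asynchronous CUDA kernels are
    fully accounted for before and after timing.
    """
    for _ in range(warmup):
        fn(*args)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) * 1000 / iters

# Hypothetical usage: the four stage functions stand in for the table columns.
# stages = [encode_speech, prefill_llm, run_speech_decoder, run_vocoder]
# total_ms = sum(timed(stage, sample) for stage in stages)
```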