VocalNet Logo

VocalNet: Speech LLM with Multi-Token Prediction for Faster and High-Quality Generation

Yuhao Wang1,2*, Heyang Liu1,2*, Ziyang Cheng3*, Ronghua Wu2, Qunshan Gu2, Yanfeng Wang1, Yu Wang1†
1 Shanghai Jiao Tong University
2 Ant Group
3 Wuhan University

Abstract

Speech large language models (LLMs) have emerged as a prominent research focus in speech processing. We propose VocalNet-1B and VocalNet-8B, a series of high-performance, low-latency speech LLMs enabled by a scalable and model-agnostic training framework for real-time voice interaction. Departing from the conventional next-token prediction (NTP), we introduce multi-token prediction (MTP), a novel approach optimized for speech LLMs that simultaneously improves generation speed and quality. Experiments show that VocalNet outperforms mainstream Omni LLMs despite using significantly less training data, while also surpassing existing open-source speech LLMs by a substantial margin. To support reproducibility and community advancement, we will open-source all model weights, inference code, training data, and framework implementations upon publication.

Overview

VocalNet
VocalNet Architecture and Training Strategy

Figure 1: On the left: The architecture of the VocalNet model. On the right: A depiction of VocalNet's dual-stage training strategy.

Streaming Speech Decoding
Non-Streaming and Streaming Attention Masks

Figure 2: (a) Non-Streaming Attention Mask: v_i^LLM attends to all text positions, and s_i attends to all text positions plus its preceding speech positions; (b) Streaming Attention Mask: v_i^LLM attends to itself and its preceding text positions, and s_i attends to a chunk-limited window of text positions plus itself and its preceding speech positions.
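The streaming rule in Figure 2(b) can be sketched as a boolean mask over a sequence of text positions followed by speech positions. This is an illustrative sketch only: the chunk sizes `text_chunk` and `speech_chunk` and the exact chunk-advance rule are assumptions for demonstration, not VocalNet's actual configuration.

```python
def streaming_mask(n_text, n_speech, text_chunk=2, speech_chunk=4):
    """Attention mask in the spirit of Figure 2(b): entry [q][k] is True when
    query position q may attend to key position k. Positions 0..n_text-1 hold
    text (LLM) positions; the remaining positions hold speech tokens."""
    n = n_text + n_speech
    mask = [[False] * n for _ in range(n)]
    # Text queries: ordinary causal attention over the text prefix.
    for q in range(n_text):
        for k in range(q + 1):
            mask[q][k] = True
    # Speech queries: causal over speech, but text visibility grows by one
    # text chunk per emitted speech chunk instead of covering all text.
    for j in range(n_speech):
        q = n_text + j
        visible_text = min(n_text, (j // speech_chunk + 1) * text_chunk)
        for k in range(visible_text):
            mask[q][k] = True
        for k in range(n_text, q + 1):
            mask[q][k] = True
    return mask
```

For the non-streaming mask of Figure 2(a), `visible_text` would simply be `n_text` for every speech query.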

Our MTP Implementation
Multi-Token Prediction Implementations

Figure 3: Illustration of various acceleration implementations. (a) Group Modeling; (b) MTP-Parallel-Linear; (c) MTP-DeepSeek; (d) our MTP implementation.
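The speed benefit shared by the schemes in Figure 3 is that one LLM forward pass is amortized over several speech tokens. A minimal sketch of that arithmetic, assuming all predicted tokens are kept (the `llm` and MTP modules below are stand-in functions, not VocalNet's real components):

```python
def ntp_decode(llm, prompt, n_tokens):
    """Next-token prediction: one LLM forward pass per generated token."""
    tokens, passes = list(prompt), 0
    while len(tokens) - len(prompt) < n_tokens:
        tokens.append(llm(tokens))  # one forward pass, one new token
        passes += 1
    return tokens[len(prompt):], passes

def mtp_decode(llm, mtp_modules, prompt, n_tokens):
    """Multi-token prediction: each LLM pass yields 1 + len(mtp_modules) tokens."""
    tokens, passes = list(prompt), 0
    while len(tokens) - len(prompt) < n_tokens:
        tokens.append(llm(tokens))       # main head predicts token t+1
        passes += 1
        for module in mtp_modules:       # lightweight heads: tokens t+2, t+3, ...
            if len(tokens) - len(prompt) >= n_tokens:
                break
            tokens.append(module(tokens))
    return tokens[len(prompt):], passes
```

With 4 MTP modules, a toy run that generates 20 tokens takes 20 NTP passes but only 4 MTP passes, matching the 5-token groups evaluated in Table 3.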

Video Demos

Chinese Dialogue Demo (with Interruption)

👨 Male Voice Response

English Dialogue Demo (with Interruption)

👩 Female Voice Response

Audio Demos

AlpacaEval for Instruction Following

I'm trying to teach myself to have nicer handwriting. Can you help?

Question Audio:
Answer Audio:

What are the best exercises for beginners?

Question Audio:
Answer Audio:

Is there a meaning for Christmas wreaths?

Question Audio:
Answer Audio:

Llama Questions for Knowledge Q&A

Which city is known as the 'City of Canals'?

Question Audio:
Answer Audio:

What is the name of the largest rainforest in the world?

Question Audio:
Answer Audio:

Which bird can fly faster than any other bird?

Question Audio:
Answer Audio:

Open-Ended Insights

Could you please tell me how to live a healthy life?

Question Audio:
Answer Audio:

What do you think a perfect society looks like? Please provide a brief description.

Question Audio:
Answer Audio:

Which is more important: happiness or fun? Could you briefly share your thoughts?

Question Audio:
Answer Audio:

Emotional Generation

Voice Control

用高兴的语音讲述中奖的经历。 (In a joyful voice, describe the experience of winning a prize.)

Question Audio:

😊 Joy: 哇塞!我竟然中了大奖,那种激动的心情简直无法用言语形容。这可是我梦寐以求的时刻啊,感觉所有的努力和等待都是值得的! (Wow! I actually won the grand prize; the excitement is simply beyond words. This is the moment I've been dreaming of, and all the effort and waiting were worth it!)

Answer Audio:

用悲伤的情绪描述毕业典礼上对同窗不舍的感情。 (In a sad tone, describe the reluctance to part with classmates at the graduation ceremony.)

Question Audio:

😢 Sadness: 毕业典礼上我们互道分别,泪水不禁涌出。大家脸上满载笑容却掩饰不住内心的离愁,想到今后不能常聚一起,实在感到十分不舍。 (At the graduation ceremony we said our goodbyes, and tears welled up. The smiles on everyone's faces could not hide the sorrow of parting; the thought that we can no longer gather often makes it so hard to let go.)

Answer Audio:

用生气的语气回应别人的不尊重行为。 (In an angry tone, respond to someone's disrespectful behavior.)

Question Audio:

😠 Angry: 你这种不尊重别人的行为真是太过分了!我不能容忍这样的无理举动,希望你能意识到自己的错误并改正。 (Your disrespectful behavior is simply outrageous! I will not tolerate such rudeness, and I hope you realize your mistake and correct it.)

Answer Audio:

用恐惧的声音描绘一次夜半惊魂。 (In a fearful voice, describe a midnight scare.)

Question Audio:

😰 Fear: 深夜四周静谧的可怕,突然一阵怪风吹来窗户砰的一声巨响,仿佛有什么东西要闯进来,我吓得浑身僵硬大气也不敢出,只能紧紧抱住自己,祈祷着这一切快点结束。 (The dead of night was frighteningly silent. Suddenly a strange gust of wind slammed the window with a bang, as if something were about to break in. I froze stiff with fear, not daring to breathe, and could only hug myself tightly, praying it would all end soon.)

Answer Audio:

用厌恶语气吐槽糟糕的晚餐体验。 (In a disgusted tone, complain about a terrible dinner experience.)

Question Audio:

🤢 Disgust: 昨晚的那顿饭简直让人难以忍受,油腻腻的口感加上刺鼻的味道,吃完后胃里翻江倒海。 (Last night's meal was simply unbearable: greasy in texture with a pungent smell, and afterwards my stomach was churning.)

Answer Audio:

用歉意的话语表达迟到的原因。 (Apologetically explain the reason for being late.)

Question Audio:

😔 Sorry: 真的很对不起,因为路上遇到了交通堵塞,导致我迟到了。我知道这给你带来了不便,请你原谅我的疏忽。 (I am really sorry; I was late because I ran into a traffic jam. I know this caused you inconvenience, and I ask you to forgive my negligence.)

Answer Audio:

Emotional Dialogue

今天心情糟透了,研究了一下约翰·洛克关于奴隶待遇的观点,实在是让人反感。 (I'm in a terrible mood today. I looked into John Locke's views on the treatment of slaves, and they are truly repugnant.)

Question Audio:

😢 Sadness: 听到你的心情,我感到很遗憾。约翰·洛克的观点确实让人难以接受,但他的思想在当时的历史背景下有其复杂性。希望你能找到一些让自己心情好转的方法。 (I'm sorry to hear how you feel. John Locke's views are indeed hard to accept, but his thought has its complexities in the historical context of his time. I hope you find some ways to lift your mood.)

Answer Audio:

唉,我生病了,好难受。 (Ugh, I'm sick and feel awful.)

Question Audio:

😢 Sadness: 听到你生病了,我很难过。希望你能快点好起来。 (I'm sad to hear you're sick. I hope you get better soon.)

Answer Audio:

谢谢你的理解。我读了点关于约翰·洛克对美国革命和宪法的贡献,感觉心情好了一些。 (Thank you for understanding. I read a bit about John Locke's contributions to the American Revolution and the Constitution, and I feel a little better.)

Question Audio:

😊 Joy: 很高兴听到你的心情有所好转。约翰·洛克的思想对美国革命和宪法的形成确实有重要影响。他的观点在当时是非常先进的。希望这些积极的信息能让你感到更加乐观。 (I'm glad to hear your mood has improved. John Locke's ideas did have an important influence on the American Revolution and the shaping of the Constitution; his views were very progressive for their time. I hope this positive information makes you feel more optimistic.)

Answer Audio:

你怎么给我提供了错误的信息! (How could you give me wrong information!)

Question Audio:

😔 Sorry: 真的很抱歉给你带来了困扰,我会尽快更正错误的信息。 (I'm really sorry for the trouble; I will correct the wrong information as soon as possible.)

Answer Audio:

Experimental Results

Table 1: Comparison with different speech LLMs and omni LLMs on OpenAudioBench

Bold indicates the best result in each subgroup and underline indicates the second-best result.

| Model | LLM size | Modality | AlpacaEval | Llama Questions | TriviaQA | Web Questions |
|---|---|---|---|---|---|---|
| Mini-Omni | 0.5B | s→t | 1.84 | 2.7 | 0.12 | 0.22 |
| | | s→s | 1.80 | 2.7 | 0.08 | 0.20 |
| SLAM-Omni | 0.5B | s→t | 3.50 | 29.4 | 0.39 | 0.84 |
| | | s→s | 3.01 | 26.7 | 0.34 | 0.69 |
| VocalNet-1B (VA) | 1B | s→t | 5.38 | 70.3 | 3.38 | 4.93 |
| | | s→s | 4.83 | 61.0 | 2.78 | 4.47 |
| VocalNet-1B | 1B | s→t | 5.79 | 71.7 | 3.60 | 5.16 |
| | | s→s | 5.03 | 63.7 | 3.06 | 4.68 |
| LLaMA-Omni | 8B | s→t | 5.31 | 69.7 | 4.44 | 5.44 |
| | | s→s | 3.89 | 55.1 | 2.44 | 4.00 |
| Freeze-Omni | 7B | s→t | 4.51 | 77.7 | 5.32 | 6.41 |
| | | s→s | 2.99 | 60.2 | 3.53 | 4.78 |
| GLM-4-Voice | 9B | s→t | 5.86 | 77.4 | 4.95 | 5.56 |
| | | s→s | 5.27 | 64.3 | 4.63 | 5.40 |
| Baichuan-Omni-1.5 | 7B | s→t | 5.20 | 77.6 | 5.72 | 6.12 |
| | | s→s | 4.10 | 61.2 | 4.13 | 5.18 |
| MiniCPM-o | 8B | s→t | 6.13 | 77.2 | 6.43 | 7.16 |
| | | s→s | 4.95 | 65.8 | 4.99 | 6.22 |
| Minmo* | 8B | s→t | - | 78.9 | 4.83 | 5.50 |
| | | s→s | 6.48 | 64.1 | 3.75 | 3.99 |
| Qwen2.5-Omni | 8B | s→t | 6.01 | 79.0 | 5.89 | 6.88 |
| | | s→s | 5.73 | 76.3 | 5.59 | 6.70 |
| VocalNet-8B (VA) | 8B | s→t | 7.05 | 77.1 | 6.15 | 6.34 |
| | | s→s | 6.30 | 71.4 | 5.24 | 5.81 |
| VocalNet-8B | 8B | s→t | 7.12 | 79.5 | 6.24 | 6.48 |
| | | s→s | 6.37 | 73.1 | 5.67 | 6.16 |
Table 2: Comparison with different models in response alignment and acoustic performance

Bold indicates the best result in each subgroup and underline indicates the second-best result.

| Model | AlpacaEval WER / UTMOS | Llama Questions WER / UTMOS | TriviaQA WER / UTMOS | Web Questions WER / UTMOS | Avg WER / UTMOS |
|---|---|---|---|---|---|
| Mini-Omni | 20.78 / 4.429 | 5.20 / 4.428 | 7.43 / 4.428 | 8.51 / 4.433 | 8.66 / 4.430 |
| SLAM-Omni | 5.52 / 4.439 | 5.55 / 4.467 | 6.16 / 4.470 | 6.50 / 4.461 | 6.17 / 4.464 |
| VocalNet-1B (VA) | 3.43 / 4.495 | 3.65 / 4.498 | 5.97 / 4.499 | 6.40 / 4.489 | 5.66 / 4.495 |
| VocalNet-1B | 3.43 / 4.491 | 3.27 / 4.497 | 6.73 / 4.486 | 4.88 / 4.493 | 5.31 / 4.491 |
| LLaMA-Omni | 6.00 / 3.942 | 10.00 / 4.003 | 20.93 / 3.965 | 14.60 / 3.935 | 15.90 / 3.956 |
| Freeze-Omni | 14.33 / 4.377 | 14.20 / 4.417 | 20.39 / 4.404 | 18.25 / 4.398 | 18.31 / 4.401 |
| GLM-4-Voice | 18.71 / 4.025 | 14.45 / 4.152 | 8.33 / 4.306 | 6.08 / 4.214 | 8.99 / 4.228 |
| Baichuan-Omni-1.5 | 20.84 / 4.082 | 22.82 / 4.332 | 22.36 / 4.401 | 23.29 / 4.350 | 22.67 / 4.347 |
| MiniCPM-o | 15.35 / 4.102 | 5.73 / 4.228 | 8.08 / 4.128 | 8.94 / 4.125 | 8.72 / 4.137 |
| Qwen2.5-Omni | 2.41 / 4.299 | 0.93 / 4.315 | 1.13 / 4.339 | 4.68 / 4.363 | 2.63 / 4.342 |
| VocalNet-8B (VA) | 2.65 / 4.490 | 3.00 / 4.503 | 5.02 / 4.499 | 4.21 / 4.485 | 4.26 / 4.493 |
| VocalNet-8B | 4.71 / 4.489 | 2.68 / 4.500 | 4.04 / 4.482 | 3.11 / 4.492 | 3.56 / 4.489 |
Table 3: Comparison of different MTP implementations

Bold indicates the optimal result.

| Method | Group Size / Module Num | Speedup Ratio | WER↓ | UTMOS↑ |
|---|---|---|---|---|
| Baseline (NTP) | - | | 10.62 | 4.488 |
| Group-Linear | 3 | | 11.50 | 4.488 |
| | 5 | | 17.61 | 4.414 |
| Group-Trans | 3 | | 14.34 | 4.489 |
| | 5 | | 17.90 | 4.468 |
| MTP-Parallel-Linear | 5 | | 8.61 | 4.492 |
| | | | 8.00 | 4.494 |
| | | | 10.57 | 4.467 |
| MTP-DeepSeek | 5 | | 9.14 | 4.493 |
| | | | 9.02 | 4.498 |
| | | | 18.23 | 4.488 |
| MTP-VocalNet | 5 | | 6.84 | 4.494 |
| | | | 5.66 | 4.495 |
| | | | 6.46 | 4.486 |
Table 4: Comparison of different numbers of MTP modules used in the training and inference phases

Bold indicates the best result and underline indicates the second-best result.

| Module Num | Speedup | AlpacaEval WER / UTMOS | Llama Questions WER / UTMOS | TriviaQA WER / UTMOS | Web Questions WER / UTMOS | Avg WER / UTMOS |
|---|---|---|---|---|---|---|
| 3 | | 5.38 / 4.489 | 5.24 / 4.504 | 7.59 / 4.500 | 9.23 / 4.484 | 7.79 / 4.493 |
| | | 3.37 / 4.493 | 3.95 / 4.498 | 5.97 / 4.498 | 6.43 / 4.485 | 5.70 / 4.493 |
| 5 | | 4.14 / 4.485 | 4.48 / 4.502 | 6.52 / 4.497 | 8.41 / 4.491 | 6.84 / 4.495 |
| | | 3.43 / 4.495 | 3.65 / 4.498 | 5.97 / 4.499 | 6.40 / 4.489 | 5.66 / 4.495 |
| | | 3.84 / 4.478 | 4.28 / 4.493 | 6.40 / 4.489 | 7.70 / 4.483 | 6.46 / 4.486 |
| 7 | | 5.38 / 4.489 | 5.24 / 4.502 | 7.59 / 4.480 | 9.23 / 4.490 | 7.79 / 4.487 |
| | | 3.40 / 4.490 | 3.92 / 4.499 | 5.91 / 4.498 | 7.57 / 4.494 | 6.14 / 4.496 |
| | | 4.26 / 4.481 | 4.33 / 4.489 | 6.32 / 4.496 | 8.76 / 4.484 | 6.89 / 4.489 |
| | | 5.50 / 4.470 | 5.19 / 4.474 | 8.28 / 4.478 | 9.20 / 4.462 | 8.06 / 4.470 |
Table 5: Speech generation latency of VocalNet.

Experiments are conducted on 1 NVIDIA L20 GPU.

| Model | Speech Encoder (ms) | LLM (ms) | Speech Decoder (ms) | Speech Vocoder (ms) | Sum (ms) |
|---|---|---|---|---|---|
| VocalNet-1B | 35.86 | 33.95 | 24.74 | 225.18 | 319.73 |
| VocalNet-8B | 36.08 | 126.71 | 40.02 | 225.56 | 428.38 |
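Reading Table 5: the Sum column is plain addition of the four stage latencies, and the vocoder dominates end-to-end latency for both model sizes. A quick arithmetic check (no VocalNet code involved):

```python
# Per-stage latencies from Table 5, in ms: encoder, LLM, decoder, vocoder.
stages = {
    "VocalNet-1B": [35.86, 33.95, 24.74, 225.18],
    "VocalNet-8B": [36.08, 126.71, 40.02, 225.56],
}
totals = {name: round(sum(ms), 2) for name, ms in stages.items()}
# totals["VocalNet-1B"] == 319.73; totals["VocalNet-8B"] == 428.37
# (the table's 428.38 presumably reflects rounding in the per-stage figures)
```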