VocalNet Logo

VocalNet: Speech LLM with Multi-Token Prediction for Faster and High-Quality Generation

Yuhao Wang1,2*, Heyang Liu1,2*, Ziyang Cheng3*, Ronghua Wu2, Qunshan Gu2, Yanfeng Wang1, Yu Wang1†
1 Shanghai Jiao Tong University
2 Ant Group
3 Wuhan University

Abstract

Speech large language models (LLMs) have emerged as a prominent research focus in speech processing. We propose VocalNet-1B and VocalNet-8B, a series of high-performance, low-latency speech LLMs enabled by a scalable and model-agnostic training framework for real-time voice interaction. Departing from the conventional next-token prediction (NTP), we introduce multi-token prediction (MTP), a novel approach optimized for speech LLMs that simultaneously improves generation speed and quality. Experiments show that VocalNet outperforms mainstream Omni LLMs despite using significantly less training data, while also surpassing existing open-source speech LLMs by a substantial margin. To support reproducibility and community advancement, we will open-source all model weights, inference code, training data, and framework implementations upon publication.

Overview

VocalNet
VocalNet Architecture and Training Strategy

Figure 1: On the left: The architecture of the VocalNet model. On the right: A depiction of VocalNet's dual-stage training strategy.

Streaming Speech Decoding
Non-Streaming and Streaming Attention Masks

Figure 2: (a) Non-Streaming Attention Mask: v_i^LLM attends to all text positions, and s_i attends to all text positions plus its preceding speech positions; (b) Streaming Attention Mask: v_i^LLM attends to itself and its preceding text positions, and s_i attends to a chunk-limited window of text positions plus itself and its preceding speech positions.
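The streaming rule in Figure 2(b) can be sketched as a boolean mask over a sequence of text positions followed by speech positions. This is an illustrative sketch only: the chunk sizes `text_chunk` and `speech_chunk` and the exact chunk-advance rule are assumptions for demonstration, not VocalNet's actual configuration.

```python
def streaming_mask(n_text, n_speech, text_chunk=2, speech_chunk=4):
    """Attention mask in the spirit of Figure 2(b): entry [q][k] is True when
    query position q may attend to key position k. Positions 0..n_text-1 hold
    text (LLM) positions; the remaining positions hold speech tokens."""
    n = n_text + n_speech
    mask = [[False] * n for _ in range(n)]
    # Text queries: ordinary causal attention over the text prefix.
    for q in range(n_text):
        for k in range(q + 1):
            mask[q][k] = True
    # Speech queries: causal over speech, but text visibility grows by one
    # text chunk per emitted speech chunk instead of covering all text.
    for j in range(n_speech):
        q = n_text + j
        visible_text = min(n_text, (j // speech_chunk + 1) * text_chunk)
        for k in range(visible_text):
            mask[q][k] = True
        for k in range(n_text, q + 1):
            mask[q][k] = True
    return mask
```

For the non-streaming mask of Figure 2(a), `visible_text` would simply be `n_text` for every speech query.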

Our MTP Implementation
Multi-Token Prediction Implementations

Figure 3: Illustration of various acceleration implementations. (a) Group Modeling; (b) MTP-Parallel-Linear; (c) MTP-DeepSeek; (d) our MTP implementation.
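The speed benefit shared by the schemes in Figure 3 is that one LLM forward pass is amortized over several speech tokens. A minimal sketch of that arithmetic, assuming all predicted tokens are kept (the `llm` and MTP modules below are stand-in functions, not VocalNet's real components):

```python
def ntp_decode(llm, prompt, n_tokens):
    """Next-token prediction: one LLM forward pass per generated token."""
    tokens, passes = list(prompt), 0
    while len(tokens) - len(prompt) < n_tokens:
        tokens.append(llm(tokens))  # one forward pass, one new token
        passes += 1
    return tokens[len(prompt):], passes

def mtp_decode(llm, mtp_modules, prompt, n_tokens):
    """Multi-token prediction: each LLM pass yields 1 + len(mtp_modules) tokens."""
    tokens, passes = list(prompt), 0
    while len(tokens) - len(prompt) < n_tokens:
        tokens.append(llm(tokens))       # main head predicts token t+1
        passes += 1
        for module in mtp_modules:       # lightweight heads: tokens t+2, t+3, ...
            if len(tokens) - len(prompt) >= n_tokens:
                break
            tokens.append(module(tokens))
    return tokens[len(prompt):], passes
```

With 4 MTP modules, a toy run that generates 20 tokens takes 20 NTP passes but only 4 MTP passes, matching the 5-token groups evaluated in Table 3.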

Video Demos

Chinese Dialogue Demo (with Interruption)

👨 Male Voice Response

English Dialogue Demo (with Interruption)

👩 Female Voice Response

Audio Demos

AlpacaEval for Instruction Following

I'm trying to teach myself to have nicer handwriting. Can you help?

Question Audio:
Answer Audio:

What are the best exercises for beginners?

Question Audio:
Answer Audio:

Is there a meaning for Christmas wreaths?

Question Audio:
Answer Audio:

Llama Questions for Knowledge Q&A

Which city is known as the 'City of Canals'?

Question Audio:
Answer Audio:

What is the name of the largest rainforest in the world?

Question Audio:
Answer Audio:

Which bird can fly faster than any other bird?

Question Audio:
Answer Audio:

Open-Ended Insights

Could you please tell me how to live a healthy life?

Question Audio:
Answer Audio:

What do you think a perfect society looks like? Please provide a brief description.

Question Audio:
Answer Audio:

Which is more important: happiness or fun? Could you briefly share your thoughts?

Question Audio:
Answer Audio:

Emotional Generation

Voice Control

用高兴的语音讲述中奖的经历。 (In a joyful voice, describe the experience of winning a prize.)

Question Audio:

😊 Joy: 哇塞!我竟然中了大奖,那种激动的心情简直无法用言语形容。这可是我梦寐以求的时刻啊,感觉所有的努力和等待都是值得的! (Wow! I actually won the grand prize; the excitement is simply beyond words. This is the moment I've been dreaming of, and all the effort and waiting were worth it!)

Answer Audio:

用悲伤的情绪描述毕业典礼上对同窗不舍的感情。 (In a sad tone, describe the reluctance to part with classmates at the graduation ceremony.)

Question Audio:

😢 Sadness: 毕业典礼上我们互道分别,泪水不禁涌出。大家脸上满载笑容却掩饰不住内心的离愁,想到今后不能常聚一起,实在感到十分不舍。 (At the graduation ceremony we said our goodbyes, and tears welled up. The smiles on everyone's faces could not hide the sorrow of parting; the thought that we can no longer gather often makes it so hard to let go.)

Answer Audio:

用生气的语气回应别人的不尊重行为。 (In an angry tone, respond to someone's disrespectful behavior.)

Question Audio:

😠 Angry: 你这种不尊重别人的行为真是太过分了!我不能容忍这样的无理举动,希望你能意识到自己的错误并改正。 (Your disrespectful behavior is simply outrageous! I will not tolerate such rudeness, and I hope you realize your mistake and correct it.)

Answer Audio:

用恐惧的声音描绘一次夜半惊魂。 (In a fearful voice, describe a midnight scare.)

Question Audio:

😰 Fear: 深夜四周静谧的可怕,突然一阵怪风吹来窗户砰的一声巨响,仿佛有什么东西要闯进来,我吓得浑身僵硬大气也不敢出,只能紧紧抱住自己,祈祷着这一切快点结束。 (The dead of night was frighteningly silent. Suddenly a strange gust of wind slammed the window with a bang, as if something were about to break in. I froze stiff with fear, not daring to breathe, and could only hug myself tightly, praying it would all end soon.)

Answer Audio:

用厌恶语气吐槽糟糕的晚餐体验。 (In a disgusted tone, complain about a terrible dinner experience.)

Question Audio:

🤢 Disgust: 昨晚的那顿饭简直让人难以忍受,油腻腻的口感加上刺鼻的味道,吃完后胃里翻江倒海。 (Last night's meal was simply unbearable: greasy in texture with a pungent smell, and afterwards my stomach was churning.)

Answer Audio:

用歉意的话语表达迟到的原因。 (Apologetically explain the reason for being late.)

Question Audio:

😔 Sorry: 真的很对不起,因为路上遇到了交通堵塞,导致我迟到了。我知道这给你带来了不便,请你原谅我的疏忽。 (I am really sorry; I was late because I ran into a traffic jam. I know this caused you inconvenience, and I ask you to forgive my negligence.)

Answer Audio:

Emotional Dialogue

今天心情糟透了,研究了一下约翰·洛克关于奴隶待遇的观点,实在是让人反感。 (I'm in a terrible mood today. I looked into John Locke's views on the treatment of slaves, and they are truly repugnant.)

Question Audio:

😢 Sadness: 听到你的心情,我感到很遗憾。约翰·洛克的观点确实让人难以接受,但他的思想在当时的历史背景下有其复杂性。希望你能找到一些让自己心情好转的方法。 (I'm sorry to hear how you feel. John Locke's views are indeed hard to accept, but his thought has its complexities in the historical context of his time. I hope you find some ways to lift your mood.)

Answer Audio:

唉,我生病了,好难受。 (Ugh, I'm sick and feel awful.)

Question Audio:

😢 Sadness: 听到你生病了,我很难过。希望你能快点好起来。 (I'm sad to hear you're sick. I hope you get better soon.)

Answer Audio:

谢谢你的理解。我读了点关于约翰·洛克对美国革命和宪法的贡献,感觉心情好了一些。 (Thank you for understanding. I read a bit about John Locke's contributions to the American Revolution and the Constitution, and I feel a little better.)

Question Audio:

😊 Joy: 很高兴听到你的心情有所好转。约翰·洛克的思想对美国革命和宪法的形成确实有重要影响。他的观点在当时是非常先进的。希望这些积极的信息能让你感到更加乐观。 (I'm glad to hear your mood has improved. John Locke's ideas did have an important influence on the American Revolution and the shaping of the Constitution; his views were very progressive for their time. I hope this positive information makes you feel more optimistic.)

Answer Audio:

你怎么给我提供了错误的信息! (How could you give me wrong information!)

Question Audio:

😔 Sorry: 真的很抱歉给你带来了困扰,我会尽快更正错误的信息。 (I'm really sorry for the trouble; I will correct the wrong information as soon as possible.)

Answer Audio:

Experimental Results

Table 1: Comparison with different speech LLMs and omni LLMs on OpenAudioBench

Bold indicates the best result in each subgroup and underline indicates the second-best result.

| Model | LLM size | Modality | AlpacaEval | Llama Questions | TriviaQA | Web Questions |
|---|---|---|---|---|---|---|
| Mini-Omni | 0.5B | s→t | 1.84 | 2.7 | 0.12 | 0.22 |
| | | s→s | 1.80 | 2.7 | 0.08 | 0.20 |
| SLAM-Omni | 0.5B | s→t | 3.50 | 29.4 | 0.39 | 0.84 |
| | | s→s | 3.01 | 26.7 | 0.34 | 0.69 |
| VocalNet-1B (VA) | 1B | s→t | 5.38 | 70.3 | 3.38 | 4.93 |
| | | s→s | 4.83 | 61.0 | 2.78 | 4.47 |
| VocalNet-1B | 1B | s→t | 5.79 | 71.7 | 3.60 | 5.16 |
| | | s→s | 5.03 | 63.7 | 3.06 | 4.68 |
| LLaMA-Omni | 8B | s→t | 5.31 | 69.7 | 4.44 | 5.44 |
| | | s→s | 3.89 | 55.1 | 2.44 | 4.00 |
| Freeze-Omni | 7B | s→t | 4.51 | 77.7 | 5.32 | 6.41 |
| | | s→s | 2.99 | 60.2 | 3.53 | 4.78 |
| GLM-4-Voice | 9B | s→t | 5.86 | 77.4 | 4.95 | 5.56 |
| | | s→s | 5.27 | 64.3 | 4.63 | 5.40 |
| Baichuan-Omni-1.5 | 7B | s→t | 5.20 | 77.6 | 5.72 | 6.12 |
| | | s→s | 4.10 | 61.2 | 4.13 | 5.18 |
| MiniCPM-o | 8B | s→t | 6.13 | 77.2 | 6.43 | 7.16 |
| | | s→s | 4.95 | 65.8 | 4.99 | 6.22 |
| Minmo* | 8B | s→t | - | 78.9 | 4.83 | 5.50 |
| | | s→s | 6.48 | 64.1 | 3.75 | 3.99 |
| Qwen2.5-Omni | 8B | s→t | 6.01 | 79.0 | 5.89 | 6.88 |
| | | s→s | 5.73 | 76.3 | 5.59 | 6.70 |
| VocalNet-8B (VA) | 8B | s→t | 7.05 | 77.1 | 6.15 | 6.34 |
| | | s→s | 6.30 | 71.4 | 5.24 | 5.81 |
| VocalNet-8B | 8B | s→t | 7.12 | 79.5 | 6.24 | 6.48 |
| | | s→s | 6.37 | 73.1 | 5.67 | 6.16 |
Table 2: Comparison with different models in response alignment and acoustic performance

Bold indicates the best result in each subgroup and underline indicates the second-best result.

| Model | AlpacaEval WER / UTMOS | Llama Questions WER / UTMOS | TriviaQA WER / UTMOS | Web Questions WER / UTMOS | Avg WER / UTMOS |
|---|---|---|---|---|---|
| Mini-Omni | 20.78 / 4.429 | 5.20 / 4.428 | 7.43 / 4.428 | 8.51 / 4.433 | 8.66 / 4.430 |
| SLAM-Omni | 5.52 / 4.439 | 5.55 / 4.467 | 6.16 / 4.470 | 6.50 / 4.461 | 6.17 / 4.464 |
| VocalNet-1B (VA) | 3.43 / 4.495 | 3.65 / 4.498 | 5.97 / 4.499 | 6.40 / 4.489 | 5.66 / 4.495 |
| VocalNet-1B | 3.43 / 4.491 | 3.27 / 4.497 | 6.73 / 4.486 | 4.88 / 4.493 | 5.31 / 4.491 |
| LLaMA-Omni | 6.00 / 3.942 | 10.00 / 4.003 | 20.93 / 3.965 | 14.60 / 3.935 | 15.90 / 3.956 |
| Freeze-Omni | 14.33 / 4.377 | 14.20 / 4.417 | 20.39 / 4.404 | 18.25 / 4.398 | 18.31 / 4.401 |
| GLM-4-Voice | 18.71 / 4.025 | 14.45 / 4.152 | 8.33 / 4.306 | 6.08 / 4.214 | 8.99 / 4.228 |
| Baichuan-Omni-1.5 | 20.84 / 4.082 | 22.82 / 4.332 | 22.36 / 4.401 | 23.29 / 4.350 | 22.67 / 4.347 |
| MiniCPM-o | 15.35 / 4.102 | 5.73 / 4.228 | 8.08 / 4.128 | 8.94 / 4.125 | 8.72 / 4.137 |
| Qwen2.5-Omni | 2.41 / 4.299 | 0.93 / 4.315 | 1.13 / 4.339 | 4.68 / 4.363 | 2.63 / 4.342 |
| VocalNet-8B (VA) | 2.65 / 4.490 | 3.00 / 4.503 | 5.02 / 4.499 | 4.21 / 4.485 | 4.26 / 4.493 |
| VocalNet-8B | 4.71 / 4.489 | 2.68 / 4.500 | 4.04 / 4.482 | 3.11 / 4.492 | 3.56 / 4.489 |
Table 3: Comparison of different MTP implementations

Bold indicates the optimal result.

| Method | Group Size / Module Num | Speedup Ratio | WER↓ | UTMOS↑ |
|---|---|---|---|---|
| Baseline (NTP) | - | | 10.62 | 4.488 |
| Group-Linear | 3 | | 11.50 | 4.488 |
| | 5 | | 17.61 | 4.414 |
| Group-Trans | 3 | | 14.34 | 4.489 |
| | 5 | | 17.90 | 4.468 |
| MTP-Parallel-Linear | 5 | | 8.61 | 4.492 |
| | | | 8.00 | 4.494 |
| | | | 10.57 | 4.467 |
| MTP-DeepSeek | 5 | | 9.14 | 4.493 |
| | | | 9.02 | 4.498 |
| | | | 18.23 | 4.488 |
| MTP-VocalNet | 5 | | 6.84 | 4.494 |
| | | | 5.66 | 4.495 |
| | | | 6.46 | 4.486 |
Table 4: Comparison of different numbers of MTP modules used in the training and inference phases

Bold indicates the best result and underline indicates the second-best result.

| Module Num | Speedup | AlpacaEval WER / UTMOS | Llama Questions WER / UTMOS | TriviaQA WER / UTMOS | Web Questions WER / UTMOS | Avg WER / UTMOS |
|---|---|---|---|---|---|---|
| 3 | | 5.38 / 4.489 | 5.24 / 4.504 | 7.59 / 4.500 | 9.23 / 4.484 | 7.79 / 4.493 |
| | | 3.37 / 4.493 | 3.95 / 4.498 | 5.97 / 4.498 | 6.43 / 4.485 | 5.70 / 4.493 |
| 5 | | 4.14 / 4.485 | 4.48 / 4.502 | 6.52 / 4.497 | 8.41 / 4.491 | 6.84 / 4.495 |
| | | 3.43 / 4.495 | 3.65 / 4.498 | 5.97 / 4.499 | 6.40 / 4.489 | 5.66 / 4.495 |
| | | 3.84 / 4.478 | 4.28 / 4.493 | 6.40 / 4.489 | 7.70 / 4.483 | 6.46 / 4.486 |
| 7 | | 5.38 / 4.489 | 5.24 / 4.502 | 7.59 / 4.480 | 9.23 / 4.490 | 7.79 / 4.487 |
| | | 3.40 / 4.490 | 3.92 / 4.499 | 5.91 / 4.498 | 7.57 / 4.494 | 6.14 / 4.496 |
| | | 4.26 / 4.481 | 4.33 / 4.489 | 6.32 / 4.496 | 8.76 / 4.484 | 6.89 / 4.489 |
| | | 5.50 / 4.470 | 5.19 / 4.474 | 8.28 / 4.478 | 9.20 / 4.462 | 8.06 / 4.470 |
Table 5: Speech generation latency of VocalNet.

Experiments are conducted on 1 NVIDIA L20 GPU.

| Model | Speech Encoder (ms) | LLM (ms) | Speech Decoder (ms) | Speech Vocoder (ms) | Sum (ms) |
|---|---|---|---|---|---|
| VocalNet-1B | 35.86 | 33.95 | 24.74 | 225.18 | 319.73 |
| VocalNet-8B | 36.08 | 126.71 | 40.02 | 225.56 | 428.38 |
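Reading Table 5: the Sum column is plain addition of the four stage latencies, and the vocoder dominates end-to-end latency for both model sizes. A quick arithmetic check (no VocalNet code involved):

```python
# Per-stage latencies from Table 5, in ms: encoder, LLM, decoder, vocoder.
stages = {
    "VocalNet-1B": [35.86, 33.95, 24.74, 225.18],
    "VocalNet-8B": [36.08, 126.71, 40.02, 225.56],
}
totals = {name: round(sum(ms), 2) for name, ms in stages.items()}
# totals["VocalNet-1B"] == 319.73; totals["VocalNet-8B"] == 428.37
# (the table's 428.38 presumably reflects rounding in the per-stage figures)
```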