Longterm Wiki

VALL-E: Neural Codec Language Models for Speech Synthesis (Microsoft Research)

web

Credibility Rating

High (4/5)

High quality. Established institution or organization with editorial oversight and accountability.

Rating inherited from publication venue: Microsoft

Relevant to AI safety discussions around synthetic media misuse, voice fraud, and the challenge of maintaining reliable identity authentication as voice cloning capabilities rapidly advance.

Metadata

Importance: 55/100 · tool page · homepage

Summary

VALL-E X is Microsoft Research's cross-lingual speech synthesis model that can clone voices and generate speech in multiple languages using only a short audio prompt. It extends the original VALL-E model to enable zero-shot cross-lingual voice transfer, meaning it can reproduce a speaker's voice characteristics even in a language they never spoke. This raises significant concerns about deepfake audio, voice fraud, and authentication systems.

Key Points

  • Enables zero-shot voice cloning across languages using only a 3-second audio sample as a prompt
  • Demonstrates high-fidelity speech synthesis that preserves speaker identity, tone, and acoustic environment
  • Highlights serious risks for voice authentication systems and audio-based identity verification
  • Represents a significant capability jump in synthetic media generation, lowering the barrier to voice fraud
  • Microsoft acknowledges misuse potential and states the model is not being released publicly without safeguards

Cited by 1 page

| Page | Type | Quality |
| --- | --- | --- |
| Deepfakes | Risk | 50.0 |

Cached Content Preview

HTTP 200 · Fetched Mar 15, 2026 · 4 KB
VALL-E Family

Neural codec language models for speech synthesis

We introduce a language modeling approach for text-to-speech synthesis (TTS). Specifically, we train a neural codec language model (called VALL-E) using discrete codes derived from an off-the-shelf neural audio codec model, and regard TTS as a conditional language modeling task rather than continuous signal regression as in previous work. VALL-E exhibits in-context learning capabilities and can synthesize high-quality personalized speech with only a 3-second enrolled recording of an unseen speaker as a prompt. VALL-E significantly outperforms the state-of-the-art zero-shot TTS system in terms of speech naturalness and speaker similarity. In addition, VALL-E can preserve the speaker's emotion and the acoustic environment of the acoustic prompt in synthesis.

Extending these capabilities, VALL-E X adapts to multilingual scenarios, enabling cross-lingual zero-shot TTS. Meanwhile, VALL-E R introduces a phoneme monotonic alignment strategy, bolstering the robustness of speech generation. With the integration of repetition-aware sampling and grouped code modeling techniques, VALL-E 2 achieves a groundbreaking milestone: human parity in zero-shot TTS performance on the LibriSpeech and VCTK datasets. This marks the first instance of such an achievement, setting a new standard for the field.

MELLE is a continuous-valued token-based language modeling approach to TTS. It autoregressively generates continuous mel-spectrogram frames directly from the text condition, bypassing vector quantization, which was originally designed for audio compression and sacrifices fidelity compared to mel-spectrograms. FELLE likewise predicts continuous-valued tokens (mel-spectrograms) by leveraging the autoregressive nature of language models and the generative efficacy of flow matching.
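The core idea above, treating TTS as conditional language modeling over discrete codec codes rather than waveform regression, can be sketched in a few lines. Everything here is a toy stand-in: the predictor, the vocabulary size, and the token IDs are invented for illustration and do not reflect Microsoft's actual model.

```python
import numpy as np

VOCAB = 1024  # illustrative codec-codebook size, not VALL-E's real configuration

def toy_codec_lm_logits(text_ids, prompt_codes, generated):
    # Stand-in for the trained neural codec language model: returns logits
    # over the acoustic-token vocabulary, conditioned on the phoneme ids,
    # the 3-second enrolled prompt's codec codes, and the tokens so far.
    seed = len(generated) + sum(text_ids) + sum(prompt_codes)
    rng = np.random.default_rng(seed)
    return rng.standard_normal(VOCAB)

def synthesize(text_ids, prompt_codes, max_tokens=20, eos=0):
    # TTS as conditional language modeling: extend the acoustic-token
    # sequence one discrete code at a time (greedy decoding for brevity).
    generated = []
    for _ in range(max_tokens):
        logits = toy_codec_lm_logits(text_ids, prompt_codes, generated)
        nxt = int(np.argmax(logits))
        if nxt == eos:  # model decided the utterance is finished
            break
        generated.append(nxt)
    # In the real system these codes would go to the neural codec's
    # decoder, which reconstructs a waveform in the prompt speaker's voice.
    return generated

codes = synthesize(text_ids=[5, 9, 12], prompt_codes=[101, 7, 330])
```

The point of the sketch is the decoding loop: both the text and the short enrolled prompt enter only as conditioning, which is what makes zero-shot speaker transfer a prompting problem rather than a fine-tuning problem.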
PALLE combines the explicit temporal modeling of autoregressive (AR) decoding with the parallel generation of non-autoregressive (NAR) decoding: it generates dynamic-length spans at fixed time steps, followed by a second stage in which low-confidence tokens are iteratively refined in parallel using global contextual information.
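The two-stage idea behind PALLE, a parallel first pass followed by iterative re-prediction of low-confidence positions, might look like the following toy sketch. The predictor, confidence scores, threshold, and sequence length are all assumptions made for illustration, not details from the paper.

```python
import numpy as np

def toy_parallel_predict(text_ids, draft, rng):
    # Stand-in for a NAR predictor: proposes a token and a confidence
    # score for every position at once, conditioned on text and draft.
    tokens = rng.integers(1, 1024, size=len(draft))
    confidence = rng.random(len(draft))
    return tokens, confidence

def refine_decode(text_ids, length=16, steps=3, threshold=0.5, seed=0):
    # Stage 1: one parallel pass produces a full draft.
    rng = np.random.default_rng(seed)
    draft = np.zeros(length, dtype=np.int64)
    draft, confidence = toy_parallel_predict(text_ids, draft, rng)
    # Stage 2: a few rounds re-predict only the least-trusted positions,
    # each round seeing the current draft as global context.
    for _ in range(steps):
        low = confidence < threshold           # low-confidence positions
        if not low.any():
            break
        new_tokens, new_conf = toy_parallel_predict(text_ids, draft, rng)
        draft[low] = new_tokens[low]           # refine only those slots
        confidence[low] = new_conf[low]
    return draft

codes = refine_decode(text_ids=[3, 1, 4])
```

The design trade-off the sketch makes visible: the parallel pass buys speed over token-by-token AR decoding, while the confidence-gated refinement recovers some of the quality that pure parallel generation gives up.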

Model versions

  • See VALL-E samples
  • See VALL-E X samples
  • See VALL-E R samples
  • See VALL-E 2 samples
  • See MELLE samples
  • See FELLE samples
  • See PALLE samples

 Ethics statement

 VALL-E could synthesize speech that maintains speaker identity and could be used for educational learning, entertainment, journalistic, self-authored content, accessibility features, interac

... (truncated, 4 KB total)
Resource ID: 0727e48c90269b22 | Stable ID: NGRiZGJjND