
Qwen3-TTS is a series of speech generation models developed by Qwen. It supports voice cloning, voice design, ultra-high-quality human-like speech generation, and natural-language voice control, giving developers and users a broad set of speech generation features.

Powered by the Qwen3-TTS-Tokenizer-12Hz multi-codebook speech encoder, Qwen3-TTS achieves efficient compression and robust representation of speech signals. This preserves paralinguistic information and acoustic environmental features, and it also enables high-speed, high-fidelity speech reconstruction via a lightweight non-DiT architecture. Using Dual-Track modeling, Qwen3-TTS supports very low-latency bidirectional streaming generation, in which the first audio packet is delivered after processing just a single character.

The entire Qwen3-TTS multi-codebook model series is now open-sourced in two sizes: 1.7B and 0.6B. The 1.7B model delivers peak performance and powerful control capabilities, while the 0.6B model balances performance and efficiency. The models support 10 mainstream languages (Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, and Italian) along with various dialects to meet global application demands. They also exhibit strong contextual understanding, adapting tone, rhythm, and emotional expression to instructions and text semantics, and they are significantly more robust to noisy input text.

Qwen3-TTS is open-sourced on GitHub and accessible via the Qwen API.

Model List

1.7B Models

Model	Features	Language Support	Streaming	Instruction Control
Qwen3-TTS-12Hz-1.7B-VoiceDesign	Performs voice design based on user-provided descriptions.	Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian	✅	✅
Qwen3-TTS-12Hz-1.7B-CustomVoice	Provides style control over target timbres via user instructions; supports 9 premium timbres covering various combinations of gender, age, language, and dialect.	Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian	✅	✅
Qwen3-TTS-12Hz-1.7B-Base	Base model capable of rapid voice cloning from 3 seconds of user audio input; can be used for fine-tuning (FT) other models.	Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian	✅

0.6B Models

Model	Features	Language Support	Streaming	Instruction Control
Qwen3-TTS-12Hz-0.6B-CustomVoice	Supports 9 premium timbres covering various combinations of gender, age, language, and dialect.	Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian	✅
Qwen3-TTS-12Hz-0.6B-Base	Base model capable of rapid voice cloning from 3 seconds of user audio input; can be used for fine-tuning (FT) other models.	Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian	✅
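
The two tables above can be read as a small lookup from task and size to checkpoint name. The sketch below encodes them as plain data; the `MODELS` layout, the task labels, and the `pick_model` helper are hypothetical names introduced for illustration, and a blank "Instruction Control" cell is assumed to mean "not supported".

```python
# Checkpoint names and check marks are copied from the two tables above.
# Assumption: a blank "Instruction Control" cell means "not supported".
MODELS = {
    "Qwen3-TTS-12Hz-1.7B-VoiceDesign": {
        "size": "1.7B", "task": "voice_design", "instruction_control": True},
    "Qwen3-TTS-12Hz-1.7B-CustomVoice": {
        "size": "1.7B", "task": "custom_voice", "instruction_control": True},
    "Qwen3-TTS-12Hz-1.7B-Base": {
        "size": "1.7B", "task": "voice_clone", "instruction_control": False},
    "Qwen3-TTS-12Hz-0.6B-CustomVoice": {
        "size": "0.6B", "task": "custom_voice", "instruction_control": False},
    "Qwen3-TTS-12Hz-0.6B-Base": {
        "size": "0.6B", "task": "voice_clone", "instruction_control": False},
}


def pick_model(task, size="1.7B", need_instructions=False):
    """Return the first checkpoint matching task and size (hypothetical helper)."""
    for name, info in MODELS.items():
        if info["task"] != task or info["size"] != size:
            continue
        if need_instructions and not info["instruction_control"]:
            continue
        return name
    raise ValueError(f"no {size} checkpoint for task {task!r}")


print(pick_model("voice_clone", size="0.6B"))
```

Under these assumptions, asking for voice design at 0.6B raises an error, which mirrors the fact that the 0.6B table lists no VoiceDesign checkpoint.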

Qwen3-TTS Key Features

Main Features:

Powerful Speech Representation: Powered by the self-developed Qwen3-TTS-Tokenizer-12Hz, it achieves efficient acoustic compression and high-dimensional semantic modeling of speech signals. It fully preserves paralinguistic information and acoustic environmental features, enabling high-speed, high-fidelity speech reconstruction through a lightweight non-DiT architecture.
Universal End-to-End Architecture: Utilizing a discrete multi-codebook LM architecture, it realizes full-information end-to-end speech modeling. This completely bypasses the information bottlenecks and cascading errors inherent in traditional LM+DiT schemes, significantly enhancing the model’s versatility, generation efficiency, and performance ceiling.
Extreme Low-Latency Streaming Generation: Based on the innovative Dual-Track hybrid streaming generation architecture, a single model supports both streaming and non-streaming generation. It can output the first audio packet immediately after a single character is input, with end-to-end synthesis latency as low as 97ms, meeting the rigorous demands of real-time interactive scenarios.
Intelligent Text Understanding and Voice Control: Supports speech generation driven by natural language instructions, allowing for flexible control over multi-dimensional acoustic attributes such as timbre, emotion, and prosody. By deeply integrating text semantic understanding, the model adaptively adjusts tone, rhythm, and emotional expression, achieving lifelike “what you imagine is what you hear” output.

Model Performance

We evaluated Qwen3-TTS comprehensively across voice cloning, voice design, and voice control. The results show state-of-the-art (SOTA) performance on multiple metrics. Specifically:

In voice design tasks: Qwen3-TTS-VoiceDesign outperformed the MiniMax-Voice-Design closed-source model in both instruction-following capability and generative expressiveness on the InstructTTS-Eval benchmark, while significantly leading other open-source models.
In voice control tasks: Qwen3-TTS-Instruct demonstrates single-speaker multilingual generalization with an average Word Error Rate (WER) of 2.34%. It also features the ability to maintain timbre while providing precise style control, achieving a score of 75.4% on InstructTTS-Eval. Furthermore, it shows exceptional long-form speech generation capabilities, with a WER of 2.36% (Chinese) and 2.81% (English) during continuous 10-minute synthesis.
In voice cloning tasks: Qwen3-TTS-VoiceClone surpassed MiniMax and SeedTTS in speech stability for both Chinese and English cloning on Seed-tts-eval. On the TTS multilingual test set across 10 languages, it achieved an average WER of 1.835% and a speaker similarity of 0.789, outperforming MiniMax and ElevenLabs. Its cross-lingual voice cloning capabilities also reached SOTA, surpassing CosyVoice3.
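
Several of the results above are reported as Word Error Rate (WER). As a reminder of what the metric measures, here is a minimal sketch that computes WER as the word-level edit distance between a reference transcript and a recognized transcript, divided by the number of reference words. The whitespace tokenization is an assumption; the source does not describe the ASR system or text normalization used in the evaluation, and Chinese is commonly scored per character instead.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("she" -> "he") and one deletion ("by") over 8 reference words.
print(word_error_rate("she said she would be here by noon",
                      "she said he would be here noon"))
```

In a TTS evaluation the hypothesis would come from running a speech recognizer on the synthesized audio, so a low WER indicates that the generated speech is intelligible and faithful to the input text.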

Tokenizer Performance

We evaluated Qwen-TTS-Tokenizer for speech reconstruction. Results on the LibriSpeech test-clean set demonstrate that it achieves SOTA performance across all key metrics. Specifically, in Perceptual Evaluation of Speech Quality (PESQ), Qwen-TTS-Tokenizer achieved scores of 3.21 and 3.68 in wideband and narrowband respectively, significantly leading similar tokenizers. In Short-Time Objective Intelligibility (STOI) and UTMOS, Qwen-TTS-Tokenizer achieved scores of 0.96 and 4.16, demonstrating superior reconstruction quality. In speaker similarity, Qwen-TTS-Tokenizer achieved a score of 0.95, significantly surpassing comparison models, indicating its near-lossless speaker information preservation capability.
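
The source does not say how the speaker similarity score is computed. A common convention, assumed here, is the cosine similarity between speaker embeddings extracted from the original and the reconstructed audio. The sketch below shows only that final step on plain lists of floats; the embedding extractor is out of scope, and the `cosine_similarity` helper is illustrative.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("vectors must be non-zero")
    return dot / (norm_a * norm_b)


# Toy vectors standing in for speaker embeddings (not real data).
original = [0.2, 0.9, 0.4]
reconstructed = [0.25, 0.85, 0.45]
print(round(cosine_similarity(original, reconstructed), 3))
```

Under this convention a score close to 1 means the reconstructed audio is nearly indistinguishable from the original in speaker identity.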

Samples

Qwen3-TTS-12Hz-1.7B-VoiceDesign

Voice Design

Qwen3-TTS supports generating customized timbre identities through natural language descriptions. Users can freely input acoustic attributes, persona descriptions, background information, and other free-form descriptions, easily creating their desired voice identities.

Control Type	Control Instruction	Text	Samples
Acoustic Attribute Control	采用高亢的男性嗓音，语调随兴奋情绪不断上扬，以快速而充满活力的节奏传达信息。音量要足够响亮，近乎喊叫，以体现紧迫感。发音务必清晰精准、字字分明，让每个词都铿锵有力。整体表达需流畅自然、明亮生动，富有戏剧性，展现出外向、自信且张扬的个性，同时传递出一种威严而宏大的宣告语气，洋溢着满溢的激动之情。	好了各位，往后退，往后退！我有个天大的好消息要宣布：Qwen-TTS正式开源啦！	00:07
	gender: Male. pitch: Low male pitch with significant upward inflections for emphasis and excitement. speed: Fast-paced delivery with deliberate pauses for dramatic effect. volume: Loud and projecting, increasing notably during moments of praise and announcements. age: Young adult to middle-aged adult. clarity: Highly articulate and distinct pronunciation. fluency: Very fluent speech with no hesitations. accent: British English. texture: Bright and clear vocal texture. emotion: Enthusiastic and excited, especially when complimenting. tone: Upbeat, authoritative, and performative. personality: Confident, extroverted, and engaging.	Nine different, exciting ways of cooking sausage. Incredible. There were three outstanding deliveries in terms of the sausage being the hero. The first dish that we want to dissect, this individual smartly combined different proteins in their sausage. Great seasoning. The blend was absolutely spot on. Congratulations. Please step forward. Natasha.	00:25
	展现出悲苦沙哑的声音质感,语速偏慢,情绪浓烈且带有哭腔,以标准普通话缓慢诉说,情感强烈,语调哀怨高亢,音高起伏大。	皇上啊！臣妾一片真心可昭日月，为何您竟信那毒妇谗言，将我打入冷宫？这心……比雪还凉啊……	00:18
	gender: Male. pitch: Artificially high-pitched, slightly lowering after the initial laugh. speed: Rapid during the laugh, then slowing to a deliberate pace. volume: Loud laugh transitioning to a standard conversational level. age: Young adult to middle-aged, performing a character voice. clarity: Clear and distinct articulation. fluency: Fluent delivery without hesitation. accent: American English. texture: Slightly strained and somewhat nasal quality. emotion: Forced amusement shifting to feigned resignation. tone: Initially playful, then shifts to a slightly put-upon tone. personality: Theatrical and expressive.	Good one. Okay, fine, I'm just gonna leave this sock monkey here. Goodbye.	00:06
Age Control	体现撒娇稚嫩的萝莉女声，音调偏高且起伏明显，营造出黏人、做作又刻意卖萌的听觉效果。	哥哥，你回来啦，人家等了你好久好久了，要抱抱！	00:05
	Speak as a sarcastic, assertive teenage girl: crisp enunciation, controlled volume, with vocal emphasis that conveys disdain and authority.	Blah, blah, blah. We're all very fascinated, Whitey, but we'd like to get paid.	00:07
	性别: 男性. 音高: 男性低沉音域，音高稳定. 语速: 语速稍快，节奏紧凑. 音量: 音量洪亮，力度强劲. 年龄: 中老年. 清晰度: 发音清晰，字句有力. 流畅度: 表达流畅，一气呵成. 口音: 标准普通话. 音色质感: 嗓音浑厚，略带沙哑感. 情绪: 严肃告诫，指令明确. 语调: 命令式语调，强调果断. 性格: 权威果断，不容置喙.	把你所有的表情都藏在面具里，保持你的中性状态，不用表情，只用身体的语言，要记住，要学会藏。	00:10
	gender: Male. pitch: Low male pitch, generally stable. speed: Deliberate pace, slowing slightly after the initial exclamation. volume: Starts loud, then transitions to a projected conversational volume. age: Middle-aged adult. clarity: High clarity with distinct pronunciation. fluency: Highly fluent. accent: American English. texture: Resonant and slightly gravelly. emotion: Initially commanding, shifting to narrative amusement. tone: Authoritative start, moving to an engaging, descriptive tone. personality: Confident and performative.	Older gentleman, 110, maybe 111 years old, sort of a surly Elvis thing happening with him. He smiles like this. Seen him around?	00:13
Gradual Control	性别: 男性. 音高: 男性低沉音区，偶有拔高. 语速: 初始平稳，后段因激动逐渐加快. 音量: 初始音量正常，后段逐渐提高至喊叫. 年龄: 中年男性. 清晰度: 吐字清晰，发音准确. 流畅度: 言语连贯，表达自然. 口音: 标准普通话发音. 音色质感: 音质略带粗砺，富有力量感. 情绪: 初始不耐烦，迅速转为恼怒斥责. 语调: 质问命令式，语带不悦与威慑. 性格: 急躁易怒，态度强硬.	你在干什么?有什么好看的?喂!我叫你走，你在干什么?给我走啊!	00:05
Gradual Control	gender: Female. pitch: Mid-range female pitch, rising sharply with frustration. speed: Starts measured, then accelerates rapidly during emotional outburst. volume: Begins conversational, escalates quickly to loud and forceful. age: Young adult to middle-aged. clarity: High clarity and distinct articulation throughout. fluency: Highly fluent with no significant pauses or fillers. accent: General American English. texture: Bright and clear vocal quality. emotion: Shifts abruptly from neutral acceptance to intense resentment and anger. tone: Initially accepting, becomes sharply accusatory and confrontational. personality: Assertive and emotionally expressive when provoked.	Okay. Yeah. I resent you. I love you. I respect you. But you know what? You blew it! And thanks to you-	00:08
Human-likeness	自然感的女声，语调活泼带笑意，模仿别人‘嘘’你时压低嗓音，就是平时聊天的感觉	我跟我闺蜜看电影就特别有画面感。就你知道吗，一紧张我就忍不住吃爆米花，吃得特别快，然后手里那杯可乐也跟着晃，差点就洒了，真的差一点点。然后旁边那个人就突然来一句，嘘——声音压得特别低。哎我当下那个情绪，既想笑又有点气，太尴尬了。	00:24
Human-likeness	A relaxed, naturally expressive male voice in his late twenties to early thirties, with a moderately low pitch, casual speaking rate, and conversational volume; deliver lines with a light, self-deprecating tone, breaking into genuine, easygoing laughter at moments of embarrassment, while maintaining clear articulation and an overall warm, approachable clarity.	Yeah, so—uh—I’m a digital nomad, right? So… pretty much all my communication is just, like, texts and messages. And now, you know, there’s these AI agents that can, uh… reply for you? Which is—heh—convenient, sure, I guess? But also… kinda delicate, you know? Like, you’ll type something super short—like, “Yep, sounds good”—and it’ll turn that into this whole… warm, polished paragraph. Like, way nicer than I’d ever write myself. huh… ha Seriously, I sound like a Hallmark card all of a sudden. But then… once you outsource that… what’s the other person actually hearing? Are they hearing me… or just some… generic, friendly-bot voice? Man, that’s weird to even say out loud.	00:53
Background Information	角色姓名：林怀岳音色信息：音量洪亮，音域低沉，力度感强的中年男性声音。身份背景：某国家重点科研项目首席顾问，年近七十的资深战略科学家。曾参与国家重大科技攻关工程，历经数十年风雨，见证了从落后追赶到自主创新的艰难历程。现任国家科技咨询委员会终身荣誉委员，仍坚持在一线培养青年人才，为国家战略发展建言献策。外貌特征：身形挺拔，两鬓斑白，眉宇间刻着岁月沉淀的坚毅。常着深色中山装或简洁正装，眼神沉静而锐利，举手投足间自带威严与从容。性格特质：意志如钢，信念坚定，面对挑战从不退缩；胸怀家国，心系民族未来，将个人命运与国家兴衰紧密相连；严谨自律，言出必行，话语中充满责任感与历史担当；外冷内热，表面严肃，实则对后辈寄予厚望，甘为人梯。人生信条：“我们这一代人，不是为了站在光里，而是为了把路铺到光里。”	有些事，只要国家需要，就得有人扛起来。我们那一代人，是背着泥土铺路的；你们要做的，是让这条路，通向星辰大海。	00:12
Background Information	Character Name: Marcus Cole Voice Profile: A bright, agile male voice with a natural upward lift, delivering lines at a brisk, energetic pace. Pitch leans high with spark, volume projects clearly—near-shouting at peaks—to convey urgency and excitement. Speech flows seamlessly, fluently, each word sharply defined, riding a current of dynamic rhythm. Background: Longtime broadcast booth announcer for national television, specializing in live interstitials and public engagement spots. His voice bridges segments, rallies action, and keeps momentum alive—from voter drives to entertainment news. Presence: Late 50s, neatly groomed, dressed in a crisp shirt under studio lights. Moves with practiced ease, eyes locked on the script, energy coiled and ready. Personality: Energetic, precise, inherently engaging. He doesn’t just read—he propels. Behind the speed is intent: to inform fast, to move people to act. Whether it’s “text VOTE to 5703” or a star-studded tease, he makes it feel immediate, vital.	Lot being you watching. 1-866-IDLE-03 for JPL. That's 1-866-436-5703. Or text the word VOTE to 5703. Diana DeGarmo's next with more from the movies right after this brief intermission on American Idol.	00:18
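
Many of the control instructions in the table above follow a structured "key: value." pattern (gender, pitch, speed, and so on). The sketch below assembles such an instruction string from a dictionary. The attribute order is taken from the English samples above; the `build_instruction` helper is hypothetical, and the table also shows that free-form descriptions work, so this format is one option and not a requirement.

```python
# Attribute order as it appears in the structured English samples above.
ATTRIBUTE_ORDER = [
    "gender", "pitch", "speed", "volume", "age", "clarity",
    "fluency", "accent", "texture", "emotion", "tone", "personality",
]


def build_instruction(attributes: dict) -> str:
    """Join the given attributes into a 'key: value.' instruction string."""
    unknown = set(attributes) - set(ATTRIBUTE_ORDER)
    if unknown:
        raise ValueError(f"unknown attributes: {sorted(unknown)}")
    parts = []
    for key in ATTRIBUTE_ORDER:
        if key in attributes:
            # Drop surrounding whitespace and any trailing period so that
            # exactly one period ends each attribute.
            value = attributes[key].strip().rstrip(".")
            parts.append(f"{key}: {value}.")
    return " ".join(parts)


print(build_instruction({
    "gender": "Male",
    "pitch": "Low male pitch, generally stable",
    "accent": "American English",
    "emotion": "Initially commanding, shifting to narrative amusement",
}))
```

Keeping the attributes in a dictionary makes it easy to vary one dimension, such as emotion, while holding the rest of the voice description fixed.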

Timbre Reuse

Users can also store timbres created by Qwen3-TTS and reuse them across calls to generate vivid, natural long-form dialogues with multiple turns and multiple characters.
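
The multi-character samples in this section pair a set of named voice descriptions with a script in which each turn is prefixed by a speaker name and a colon. A minimal sketch of splitting such a script into (speaker, text) turns is shown here, so that each turn could be synthesized with its stored timbre. The `split_turns` helper is hypothetical, and it assumes that a speaker name followed by a colon never occurs inside the spoken text itself.

```python
import re


def split_turns(script: str, speakers: list) -> list:
    """Split 'Name: text Name: text ...' into a list of (name, text) tuples."""
    if not speakers:
        raise ValueError("at least one speaker name is required")
    # Longest names first so that a name that is a prefix of another
    # does not shadow it. Accept ASCII and full-width colons.
    names = "|".join(re.escape(s) for s in sorted(speakers, key=len, reverse=True))
    pattern = re.compile(rf"({names})\s*[:：]\s*")

    # re.split with one capture group yields:
    # [preamble, name1, text1, name2, text2, ...]
    pieces = pattern.split(script)
    turns = []
    for name, text in zip(pieces[1::2], pieces[2::2]):
        text = text.strip()
        if text:
            turns.append((name, text))
    return turns


script = ("Lucas:H-hey! You dropped your... uh... calculus notebook? "
          "Mia:Oh wow, my mortal enemy. Lucas:No problem!")
for speaker, line in split_turns(script, ["Lucas", "Mia"]):
    print(speaker, "->", line)
```

Each resulting turn can then be sent to the model together with the voice description stored under that speaker's name.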

Control Instruction	Text	Samples
"旁白": "声音特征沉稳、客观、略带叙事感的女播音腔，普通话标准，语速适中，带有轻微的环境氛围渲染，语调平缓但富有感染力，在关键情节时稍作停顿，增强画面感。情感冷静旁观，偶尔带一丝微妙的反讽" "小林": "25岁男性上班族，声音清亮但时常犹豫，语速时快时慢，紧张时会轻微结巴。情绪波动明显，从低声呢喃到突然激动再到自我怀疑的叹气。肢体语言丰富，经常无意识的小动作" "御姐": "模拟成熟性感的御姐音色，声音略带磁性且沉稳，语速不快不慢，语调充满自信和一丝挑逗，尾音可以稍微拖长并上扬，给人一种游刃有余的掌控感。"	旁白: 小林今天第三次走神了。酒吧昏黄的灯光晃得他心跳加速，而吧台对面那个红唇微扬的女人，正用指尖轻轻摩挲着酒杯边缘。御姐: 小弟弟，有兴趣陪姐姐喝一杯吗？小林: 啊？我、我……我其实不太会喝酒…… 旁白: 他的手指无意识地抠着杯沿，喉结上下滚动，像被什么无形的东西掐住了呼吸。御姐: 不会喝？那正好——姐姐教你。这杯莫吉托，甜得刚好，就像你刚才偷看我的眼神。小林: 我、我没偷看！……好吧，看了一眼。就一眼！旁白: 他猛地坐直，又立刻缩回肩膀，仿佛那句话烫伤了自己的嘴。御姐: 紧张什么？你连坐姿都在发抖……要不要靠过来一点？这里太吵了。小林: 靠过去？可、可我们才第一次见面……你都不认识我…… 御姐: 名字不重要，感觉才重要。......而我感觉……你有点可爱。旁白: 小林的耳朵瞬间红透，连耳后那颗小痣都像在发烫。他想逃，脚却像钉在了高脚凳上。小林: 可爱？没人这么说过我……他们都说我太闷，连朋友圈都发不出手…… 御姐: 那现在呢？敢不敢发一条——'今晚，和一个危险又迷人的姐姐喝了一杯'？小林: ……我连配图都不敢选。你笑起来太……太有杀伤力了。御姐: 那就别发了。有些故事，只适合藏在两个人的记忆里——比如，接下来你打算请我跳支舞吗？旁白: 他张了张嘴，没发出声音。但这一次，他没有低头，而是轻轻推开了那杯没动过的苏打水，朝她伸出了手。	02:20
"Lucas": "Male, 17 years old, tenor range, gaining confidence - deeper breath support now, though vowels still tighten when nervous" "Mia": "Female, 16 years old, mezzo-soprano range, softening - lowering register to intimate speaking voice, consonants softening"	Lucas:H-hey! You dropped your... uh... calculus notebook? I mean, I think it's yours? Maybe? Mia:Oh wow, my mortal enemy - Mr. Thompson's problem sets. Thanks for rescuing me from that F. Lucas:No problem! I actually... kinda finished those already? If you want to compare answers or something... Mia:Is this your sneaky way of saying you want to study together, Lucas? Because I saw you staring during lab partners sign-up. Lucas:What? No! I mean yes but not like... I just think you're... your titration technique is really precise! Mia:That's the nerdiest compliment I've ever gotten. Tell you what - help me survive pre-calc and I'll teach you how to actually flirt. Lucas:Wow, harsh. And here I thought my titration line was smooth. Mia:It was adorable. Like when you tripped over your shoelaces in the hall yesterday. Or that time you— Lucas:Okay okay! I get it, I'm a disaster. So... library after school? I'll bring the graphing calculators? Mia:Only if you promise not to spill coffee on my notes again... though I guess watching you panic-clean was pretty cute.	01:16

Qwen3-TTS-12Hz-1.7B-CustomVoice

Timbre Control

After performing speaker-specific fine-tuning, Qwen3-TTS can maintain the target timbre while inheriting the style control capabilities and single-speaker multilingual capabilities of the base model.

Control Type	Timbres	Control Instruction	Text	Samples
Single Attribute Control	甜茶 Ryan	spoke with a very sad and tearful voice.	She said she would be here by noon.	00:02
		Very happy.		00:02
		用特别愤怒的语气说		00:02
		请特别小声的悄悄说		00:02
		Speaking at an extremely slow pace		00:03
		音调低沉		00:02
Multi-Attribute Control	十三 Vivian	性别: 女性声音. 音高: 女性中高音区，语调富于变化. 语速: 语速明快，偶有加速. 音量: 正常交谈音量，笑声响亮. 清晰度: 吐字清晰，发音标准. 流畅度: 表达流畅自如. 口音: 普通话. 音色质感: 音色明亮，略带爽朗. 情绪: 愉悦友好，伴随爽朗笑意. 语调: 语调上扬活泼，疑问时尤为明显. 性格: 外向开朗，热情健谈.	就算你自己不想治，你也得考虑考虑别人的感受吧。我们这些朋友的感受你不在乎无所谓，那你家人呢？你家人的感受你难道一点都不在乎吗！	00:12
		以极度悲伤、带着明显哭腔的语气，用较小的音量缓缓诉说，语速缓慢，仿佛每一个字都承载着沉重的痛楚，声音颤抖而压抑，吐字虽轻却清晰可辨，透出深藏心底的哀伤与无助。		00:23
		保持青年女性的声线特征，展现出一种清亮且略具紧迫感的音色，语速从平稳开始在叙述过程中逐渐加快，音量在情绪波动时增加，语调在句末调高以强调劝告的语气。		00:12
Single-speaker Cross-lingual Generalization	十三 Vivian	在语速偏快的情况下流畅自然地表达,音质清亮,音调略高,吐字清晰标准,给人一种开心愉悦的感觉。	(Korean) 안녕하세요, 오늘은 어떤 용건입니까?	00:02
		A deep, rich, and solid vocal register characteristic of a middle-aged woman, with full and powerful volume. Speech is delivered at a steady pace, articulation clear and precise, with fluent and confident intonation that rises slightly at the end of sentences.	(Japanese) こんにちは、本日はどのようなご用件でしょうか？	00:04
		语音应表现为直率且略显主观强势的中年女性,音色略带尖锐感,流畅表达中偶尔断句以凸显语气,情绪略带不满,音量随情感激动略有增强。	(Chinese Dialect - Sichuan Dialect) 我早就该下班了，就是跟你说我这事情干不完，我现在走不脱。	00:05

Timbre List

Qwen3-TTS has open-sourced a total of 9 timbres in this release, covering various combinations of gender, age, language, and dialect to meet personalized speech generation needs in different scenarios.

Timbres	Languages	Text	Samples
苏瑶 Serena	Chinese	其实我真的有发现，我是一个特别善于观察别人情绪的人。	00:04
福伯 Uncle Fu	Chinese	其实我真的有发现，我是一个特别善于观察别人情绪的人。	00:05
十三 Vivian	Chinese	其实我真的有发现，我是一个特别善于观察别人情绪的人。	00:05
艾登 Aiden	English	Then by the end of the movie, when Dorothy clicks her heels and says, “There’s no place like home,” I got a little bit teary, I’ll admit. You know, I don’t even know why—I just, I just felt.	00:15
甜茶 Ryan	English	Then by the end of the movie, when Dorothy clicks her heels and says, “There’s no place like home,” I got a little bit teary, I’ll admit. You know, I don’t even know why—I just, I just felt.	00:14
小野杏 Ono Anna	Japanese	やばい、明日のプレゼン資料まだ完成してない… 助けて！	00:05
素熙 Sohee	Korean	야, 오늘 점심에 뭐 먹을지 생각해 봤어? 근처에 새로 생긴 분식집 어때?	00:06
晓东 Dylan	Chinese Dialect - Beijing Dialect	我们就在山上啊，就是其实也没什么，就是在土坡上跑来跑去，然后谁捡个那个嗯比较威风的棍儿，完了我们就就瞎打，呃要不就是什么掏个洞啊什么的。	00:13
程川 Eric	Chinese Dialect - Sichuan Dialect	你龟儿太过分了，把我的东西都搞坏了，还晓不晓得认错，硬是要把我整冒火你才安逸嗦，莫再烦老子爬球开。	00:09

Qwen3-TTS-12Hz-1.7B-Base

Voice Clone

Control Type	Reference Samples	Text	Samples
Chinese Voice Clone	00:07	你眼中的太阳，只是我指间的玩物。	00:04
Chinese Voice Clone	00:08	祝您在马年里事业一马当先，业绩万马奔腾，在新的一年里快马加鞭，再创辉煌！	00:06
English Voice Clone	00:10	In the absence of confiscation, the Portuguese inquisitors were not earnest in tracing the heresies of ancestors or in following up the records of fugitives.	00:08
English Voice Clone	00:08	An ideal harmonious society. Humanism, is a lighthouse on this way to guide us in case we are getting lost.	00:07
Cross-lingual Voice Clone	00:07	(Japanese) 要約すれば、全米国民のためにアメリカを再興するという使命を、我々は開始したのである。	00:08
Cross-lingual Voice Clone	00:07	(Korean) 광활한 우주 속에, 지구라고 불리는 아름다운 푸른 행성이 있습니다.	00:05
Text Robustness	00:07	Qwen-TTS 是支持音色克隆、生成、控制的开源语音合成模型，不仅支持多语言multilingual，还支持各种复杂文本，如pin1 yin1，特殊符号等(◍•͈⌔•͈◍)；能读出各种生僻字詞。快来试试吧！	00:14
Text Robustness	00:08	I am solving the equation: x = [-b ± √(b²-4ac)] / 2a? Nobody can — it's a disaster (◍•͈⌔•͈◍), very sad!	00:11