Training and inference code for Irodori-TTS, a Flow Matching-based Text-to-Speech model. The architecture and training design largely follow Echo-TTS, using DACVAE continuous latents as the generation target.
Important
main tracks the v2 codebase and is intended for use with the Irodori-TTS-500M-v2 and Irodori-TTS-500M-v2-VoiceDesign model releases.
If you need the previous v1 code, use the v1 tag.
Checkpoints and preprocessing outputs are not compatible between v1 and v2.
The previous public v1 model is available at Aratako/Irodori-TTS-500M.
For model weights and audio samples, please refer to the base model card and the VoiceDesign model card.
- Flow Matching TTS: Rectified Flow Diffusion Transformer (RF-DiT) over continuous DACVAE latents
- Voice Cloning: Zero-shot voice cloning from reference audio
- Voice Design: Caption-conditioned style control
- Multi-GPU Training: Distributed training via uv run torchrun with gradient accumulation, mixed precision (bf16), and W&B logging
- PEFT LoRA Fine-Tuning: Parameter-efficient adaptation with PEFT/LoRA for released checkpoints
- Flexible Inference: CLI, Gradio Web UI, and HuggingFace Hub checkpoint support
The v2 codebase supports two closely related checkpoint families:
- Base model (Aratako/Irodori-TTS-500M-v2): Text encoder + reference latent encoder + diffusion transformer. The reference latent encoder consumes patched DACVAE latents from reference audio for speaker/style conditioning.
- VoiceDesign model (Aratako/Irodori-TTS-500M-v2-VoiceDesign): Text encoder + caption encoder + diffusion transformer. The caption encoder consumes style-control text, and the speaker/reference branch is disabled.
Shared building blocks:
- Text Encoder: Token embeddings initialized from a pretrained LLM, followed by self-attention + SwiGLU transformer layers with RoPE
- Condition Encoder: Either a reference latent encoder for the base model or a caption encoder for the VoiceDesign model
- Diffusion Transformer: Joint-attention DiT blocks with Low-Rank AdaLN (timestep-conditioned adaptive layer normalization), half-RoPE, and SwiGLU MLPs
Audio is represented as continuous latent sequences via the codec configured by the checkpoint. v2 uses the 32-dim Semantic-DACVAE-Japanese-32dim codec for 48 kHz waveform reconstruction.
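To make the sampling side of this design concrete, the sketch below integrates a rectified-flow velocity field with Euler steps in plain Python. It is an illustration only, not the code in irodori_tts/rf.py: the toy velocity function stands in for the diffusion transformer, the short list stands in for a latent sequence, and the convention that t runs from 1 (noise) to 0 (data) is an assumption.

```python
import random


def euler_sample(velocity, x_noise, num_steps=40):
    """Integrate dx/dt = velocity(x, t) from t=1 (noise) down to t=0 (data)."""
    x = list(x_noise)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = 1.0 - i * dt  # t never reaches 0 inside the loop
        v = velocity(x, t)
        x = [xi - dt * vi for xi, vi in zip(x, v)]
    return x


def make_toy_velocity(target):
    """Hypothetical stand-in for the trained model.

    On the straight path x_t = (1 - t) * data + t * noise the velocity is
    noise - data, which equals (x_t - data) / t for any t > 0.
    """
    def velocity(x, t):
        return [(xi - di) / t for xi, di in zip(x, target)]
    return velocity


target = [0.5, -1.0, 2.0]
rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in target]
sample = euler_sample(make_toy_velocity(target), noise, num_steps=40)
# sample now matches target up to floating-point error
```

Because the toy path is exactly straight, Euler integration recovers the target for any step count; a learned velocity field is only approximately straight, which is why the number of steps affects quality in practice.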
git clone https://github.com/Aratako/Irodori-TTS.git
cd Irodori-TTS
uv sync

Note: For Linux/Windows with CUDA, PyTorch is automatically installed from the cu128 index. For macOS (MPS) or CPU-only usage, uv sync will install the default PyTorch build.
Voice cloning from a reference audio:
uv run python infer.py \
--hf-checkpoint Aratako/Irodori-TTS-500M-v2 \
--text "今日はいい天気ですね。" \
--ref-wav path/to/reference.wav \
--output-wav outputs/sample.wav

Without a reference:
uv run python infer.py \
--hf-checkpoint Aratako/Irodori-TTS-500M-v2 \
--text "今日はいい天気ですね。" \
--no-ref \
--output-wav outputs/sample.wav

VoiceDesign with a style caption:
uv run python infer.py \
--hf-checkpoint Aratako/Irodori-TTS-500M-v2-VoiceDesign \
--text "今日はいい天気ですね。" \
--caption "落ち着いた女性の声で、近い距離感でやわらかく自然に読み上げてください。" \
--no-ref \
--output-wav outputs/sample_voice_design.wav

Gradio Web UI:
uv run python gradio_app.py --server-name 0.0.0.0 --server-port 7860

Then access the UI at http://localhost:7860.
The hosted v2 demo is available at Aratako/Irodori-TTS-500M-v2-Demo.
For the VoiceDesign checkpoint, use the dedicated UI:
uv run python gradio_app_voicedesign.py --server-name 0.0.0.0 --server-port 7861

The hosted VoiceDesign demo is available at Aratako/Irodori-TTS-500M-v2-VoiceDesign-Demo.
gradio_app.py is for Aratako/Irodori-TTS-500M-v2. gradio_app_voicedesign.py is for Aratako/Irodori-TTS-500M-v2-VoiceDesign.
uv run python infer.py \
--hf-checkpoint Aratako/Irodori-TTS-500M-v2 \
--text "今日はいい天気ですね。" \
--ref-wav path/to/reference.wav \
--output-wav outputs/sample.wav

Local checkpoints (.pt or .safetensors) are also supported:
uv run python infer.py \
--checkpoint outputs/checkpoint_final.safetensors \
--text "今日はいい天気ですね。" \
--ref-wav path/to/reference.wav \
--output-wav outputs/sample.wav

VoiceDesign checkpoints also support caption conditioning:
uv run python infer.py \
--hf-checkpoint Aratako/Irodori-TTS-500M-v2-VoiceDesign \
--text "今日はいい天気ですね。" \
--caption "落ち着いた、近い距離感の女性話者" \
--no-ref \
--output-wav outputs/sample_voice_design.wav

| Parameter | Default | Description |
|---|---|---|
| --checkpoint / --hf-checkpoint | (required, either one) | Local checkpoint file or Hugging Face repo id |
| --text | (required) | Text to synthesize |
| --caption | None | Optional style-control text for VoiceDesign checkpoints |
| --output-wav | output.wav | Output waveform path |
| --ref-wav | None | Reference waveform path for speaker conditioning |
| --ref-latent | None | Pre-computed reference latent (.pt) for speaker conditioning |
| --no-ref | False | Disable speaker reference conditioning |
| --max-ref-seconds | 30.0 | Maximum reference duration in seconds |
| --ref-normalize-db | -16.0 | Reference loudness target before DACVAE encode (set none to disable) |
| --ref-ensure-max | True | Scale reference down only when peak exceeds 1.0 (used when --ref-normalize-db is disabled) |
| --codec-repo | Aratako/Semantic-DACVAE-Japanese-32dim | Codec repo used for latent encode/decode |
| --codec-deterministic-encode | True | Use deterministic DACVAE encode path |
| --codec-deterministic-decode | True | Use deterministic DACVAE watermark-message decode path |
| --enable-watermark | False | Enable DACVAE watermark branch during decode |
| --max-text-len | checkpoint metadata or 256 | Maximum token length for text conditioning |
| --max-caption-len | checkpoint metadata or max_text_len | Maximum token length for caption conditioning |
| --num-steps | 40 | Number of Euler integration steps |
| --num-candidates | 1 | Number of candidates to generate in one pass |
| --decode-mode | sequential | Codec decode mode: sequential or batch |
| --cfg-scale-text | 3.0 | CFG scale for text conditioning |
| --cfg-scale-caption | 3.0 | CFG scale for caption conditioning |
| --cfg-scale-speaker | 5.0 | CFG scale for speaker conditioning |
| --cfg-guidance-mode | independent | CFG mode: independent, joint, alternating |
| --cfg-scale | None | Deprecated shared CFG override for all enabled conditions |
| --cfg-min-t | 0.5 | Lower timestep bound for CFG |
| --cfg-max-t | 1.0 | Upper timestep bound for CFG |
| --truncation-factor | None | Scale initial Gaussian noise before sampling |
| --rescale-k / --rescale-sigma | None | Temporal score rescaling parameters; must be set together |
| --context-kv-cache | True | Precompute context K/V projections for faster sampling |
| --speaker-kv-scale | None | Extra speaker K/V scaling for stronger speaker identity |
| --speaker-kv-min-t | 0.9 | Disable speaker K/V scaling after this timestep threshold |
| --speaker-kv-max-layers | None | Apply speaker K/V scaling only to first N diffusion layers |
| --model-device | auto | Device for model (cuda, mps, cpu) |
| --codec-device | auto | Device for DACVAE codec |
| --model-precision | fp32 | Model precision (fp32, bf16) |
| --codec-precision | fp32 | Codec precision (fp32, bf16) |
| --seed | random | Random seed for reproducibility |
| --compile-model | False | Enable torch.compile for faster inference |
| --compile-dynamic | False | Use dynamic=True for torch.compile |
| --trim-tail | True | Trim trailing silence via flattening heuristic |
| --tail-window-size | 20 | Window size used for tail trimming |
| --tail-std-threshold | 0.05 | Std threshold for tail trimming |
| --tail-mean-threshold | 0.1 | Mean threshold for tail trimming |
| --show-timings | True | Print per-stage timing breakdown |
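The three tail-trimming options (--tail-window-size, --tail-std-threshold, --tail-mean-threshold) can be read as a sliding-window test for a region that has gone flat and close to zero. The sketch below is one plausible reading of such a heuristic, written in plain Python over a 1-D list of values. The function name, the zero padding, the forward scan, and the use of population standard deviation are all assumptions; the implementation in this repository may differ.

```python
from statistics import fmean, pstdev


def find_flattening_point(values, window_size=20, std_threshold=0.05, mean_threshold=0.1):
    """Return the first index where a window looks flat and near zero.

    Returns len(values) when no window qualifies, i.e. nothing is trimmed.
    """
    # Pad with zeros so a flat tail shorter than one window can still be detected.
    padded = list(values) + [0.0] * window_size
    for start in range(len(values)):
        window = padded[start:start + window_size]
        is_flat = pstdev(window) < std_threshold
        is_near_zero = abs(fmean(window)) < mean_threshold
        if is_flat and is_near_zero:
            return start
    return len(values)


# 60 "active" values followed by 40 values of silence.
signal = [1.0, -1.0] * 30 + [0.0] * 40
cut = find_flattening_point(signal)
trimmed = signal[:cut]
```

With these defaults the window must be both low-variance and centered near zero, so a steady but non-zero region would not be treated as trailing silence.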
prepare_manifest.py encodes audio from a Hugging Face dataset into DACVAE latents and produces a JSONL manifest for training.
uv run python prepare_manifest.py \
--dataset myorg/my_dataset \
--split train \
--audio-column audio \
--text-column text \
--output-manifest data/train_manifest.jsonl \
--latent-dir data/latents \
--device cuda

To include speaker_id in the manifest (for speaker-conditioned training):
uv run python prepare_manifest.py \
--dataset myorg/my_dataset \
--split train \
--audio-column audio \
--text-column text \
--speaker-column speaker \
--output-manifest data/train_manifest.jsonl \
--latent-dir data/latents \
--device cuda

To include caption in the manifest (for caption-conditioned voice design training):
uv run python prepare_manifest.py \
--dataset myorg/my_dataset \
--split train \
--audio-column audio \
--text-column text \
--caption-column caption \
--speaker-column speaker \
--output-manifest data/train_manifest.jsonl \
--latent-dir data/latents \
--device cuda

When training the caption-conditioned voice-design model, speaker_id is optional. The voice-design path disables speaker/reference conditioning and learns from text + caption.
prepare_manifest.py produces a JSONL manifest with entries like:
{"text": "こんにちは", "caption": "落ち着いた、近い距離感の女性話者", "latent_path": "data/latents/00001.pt", "speaker_id": "myorg/my_dataset:speaker_001", "num_frames": 750}

Single-GPU training:
uv run python train.py \
--config configs/train_500m_v2.yaml \
--manifest data/train_manifest.jsonl \
--output-dir outputs/irodori_tts

VoiceDesign training uses a dedicated config:
uv run python train.py \
--config configs/train_500m_v2_voice_design.yaml \
--manifest data/train_manifest.jsonl \
--output-dir outputs/irodori_tts_voice_design

configs/train_500m_v2_voice_design.yaml sets use_caption_condition: true and disables the speaker/reference branch. Caption-free configs continue to use speaker conditioning when speaker_id / reference inputs are available.
The VoiceDesign config also enables caption_warmup: true for optional caption-branch warmup.
warmup_steps controls the LR scheduler, while caption_warmup_steps controls how long
non-caption gradients are discarded before normal joint training resumes.
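The difference between the two warmup settings can be sketched as two independent functions of the training step. This is an illustrative reading of the description above, not the trainer's code: the linear shape of the LR warmup, the parameter-name prefix used to identify the caption branch, and both function names are assumptions.

```python
def lr_multiplier(step, warmup_steps):
    """Scale factor applied to the base learning rate (linear warmup is assumed)."""
    if warmup_steps <= 0:
        return 1.0
    return min(1.0, (step + 1) / warmup_steps)


def keeps_gradient(param_name, step, caption_warmup_steps, caption_prefix="caption_encoder."):
    """Decide whether a parameter's gradient is used at this step.

    During caption warmup only caption-branch parameters are updated;
    afterwards every parameter trains jointly. The prefix is hypothetical.
    """
    if step >= caption_warmup_steps:
        return True
    return param_name.startswith(caption_prefix)


# Hypothetical parameter names, for illustration only.
names = ["caption_encoder.layers.0.weight", "text_encoder.layers.0.weight"]
updated_early = [n for n in names if keeps_gradient(n, step=10, caption_warmup_steps=100)]
updated_late = [n for n in names if keeps_gradient(n, step=100, caption_warmup_steps=100)]
```

The two schedules do not interact in this sketch: the learning rate can already be at its full value while non-caption gradients are still being discarded, or the reverse, depending on the two step counts.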
Multi-GPU DDP training:
uv run torchrun --nproc_per_node 4 train.py \
--config configs/train_500m_v2.yaml \
--manifest data/train_manifest.jsonl \
--output-dir outputs/irodori_tts \
--device cuda

Training supports YAML config files with model and train sections. CLI arguments take precedence over YAML values. See uv run python train.py --help for all available options.
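The precedence rule can be illustrated with a small merge function. This is a sketch of the general pattern, not train.py itself: the option names (learning_rate, batch_size) and the convention that an unset CLI option is None are assumptions, and a plain dict stands in for the parsed train section of a YAML file.

```python
import argparse


def merge_config(yaml_values, cli_args):
    """Overlay CLI options onto YAML values; options left unset (None) are ignored."""
    merged = dict(yaml_values)
    for key, value in vars(cli_args).items():
        if value is not None:
            merged[key] = value
    return merged


parser = argparse.ArgumentParser()
# default=None distinguishes "not passed" from an explicit value.
parser.add_argument("--learning-rate", type=float, default=None)
parser.add_argument("--batch-size", type=int, default=None)

yaml_train = {"learning_rate": 1e-4, "batch_size": 32}  # stand-in for the YAML file
args = parser.parse_args(["--batch-size", "8"])
config = merge_config(yaml_train, args)
# learning_rate comes from the YAML stand-in, batch_size from the CLI
```

Using None as the "not passed" marker keeps YAML values intact for every option the user did not type, which is what makes partial overrides on the command line safe.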
Start a new training run from released inference weights (.safetensors). This initializes only the model weights; optimizer / scheduler state starts fresh.
uv run python train.py \
--config configs/train_500m_v2.yaml \
--manifest data/train_manifest.jsonl \
--output-dir outputs/irodori_tts_ft \
--init-checkpoint path/to/Irodori-TTS-500M-v2.safetensors

LoRA fine-tuning:
uv run python train.py \
--config configs/train_500m_v2_lora.yaml \
--manifest data/train_manifest.jsonl \
--output-dir outputs/irodori_tts_lora \
--init-checkpoint path/to/Irodori-TTS-500M-v2.safetensors

Caption-conditioned voice-design LoRA fine-tuning:
uv run python train.py \
--config configs/train_500m_v2_voice_design_lora.yaml \
--manifest data/train_manifest.jsonl \
--output-dir outputs/irodori_tts_voice_design_lora \
--init-checkpoint path/to/Irodori-TTS-500M-v2.safetensors

Available LoRA target presets:
- text_attn_mlp: text encoder attention + attention gate + MLP
- caption_attn_mlp: caption encoder attention + attention gate + MLP
- speaker_attn_mlp: speaker encoder attention + attention gate + MLP, plus speaker_encoder.in_proj
- diffusion_attn: diffusion attention only, including text/speaker/caption context KV and attention gate
- diffusion_attn_mlp: diffusion_attn + diffusion MLP
- all_attn: all attention blocks across text/caption/speaker/diffusion, including attention gates
- diffusion_full: diffusion stack broadly: cond_module, in_proj/out_proj, diffusion attention, diffusion MLP, and AdaLN
- adaln: diffusion-block AdaLN layers only
- conditioning: conditioning-side projections only: cond_module, speaker_encoder.in_proj, and diffusion context KV projections
- all_attn_mlp: all_attn + text/caption/speaker/diffusion MLP, plus speaker_encoder.in_proj
- all_linear: all nn.Linear layers in the model; embeddings and norm weights are not included
--lora-target-modules also accepts a regex string or a comma-separated list of module suffixes. Resume automatically restores the saved LoRA config from the training checkpoint unless you explicitly override it.
When --lora is enabled, checkpoints are saved as adapter-only directories containing PEFT adapter weights plus trainer state for resume.
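How a regex or suffix-list target specification might select modules can be sketched with PEFT's documented matching rules: a string is treated as a regular expression matched against the full module name, and a list is matched by exact name or by dotted suffix. How this repository decides between the two forms is not stated here, so the comma-based split below, the function names, and the example module names are assumptions.

```python
import re


def parse_target_spec(raw):
    """Assumed CLI convention: commas mean a suffix list, otherwise a regex."""
    if "," in raw:
        return [part.strip() for part in raw.split(",") if part.strip()]
    return raw


def is_target(module_name, spec):
    """Mirror PEFT-style matching for a regex string or a list of suffixes."""
    if isinstance(spec, str):
        return re.fullmatch(spec, module_name) is not None
    return any(module_name == s or module_name.endswith("." + s) for s in spec)


# Hypothetical module names, for illustration only.
names = [
    "text_encoder.layers.0.attn.q_proj",
    "diffusion.layers.0.attn.q_proj",
    "diffusion.layers.0.mlp.up_proj",
]
by_suffix = [n for n in names if is_target(n, parse_target_spec("q_proj,k_proj"))]
by_regex = [n for n in names if is_target(n, parse_target_spec(r"diffusion\..*\.mlp\..*"))]
```

Suffix lists are convenient when the same layer name recurs across encoders, while a regex allows restricting adaptation to one part of the model.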
Resume an existing training run from a training checkpoint. Full-model runs use .pt; LoRA runs use checkpoint directories. Both restore optimizer, scheduler, and step state.
uv run python train.py \
--config configs/train_500m_v2.yaml \
--manifest data/train_manifest.jsonl \
--output-dir outputs/irodori_tts \
--resume outputs/irodori_tts/checkpoint_0010000.pt

LoRA resume example:
uv run python train.py \
--config configs/train_500m_v2_lora.yaml \
--manifest data/train_manifest.jsonl \
--output-dir outputs/irodori_tts_lora \
--resume outputs/irodori_tts_lora/checkpoint_0010000

If you move a LoRA checkpoint to another environment and the original base-checkpoint path is no longer valid, pass --init-checkpoint path/to/base_model.safetensors together with --resume to override the saved base-model path.
Convert a training checkpoint to inference-only safetensors format:
uv run python convert_checkpoint_to_safetensors.py outputs/checkpoint_final.pt

LoRA adapter checkpoints can also be converted directly:
uv run python convert_checkpoint_to_safetensors.py outputs/irodori_tts_lora/checkpoint_final

LoRA adapter checkpoints are merged into the base model automatically during conversion, so the exported .safetensors file is directly usable for inference.
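Merging a LoRA adapter amounts to folding the low-rank update into each adapted weight matrix. The sketch below shows the standard LoRA merge rule, W' = W + (alpha / r) * B A, on small nested lists. It is not the converter's code, and the helper names are illustrative.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Return weight + (alpha / rank) * (lora_b @ lora_a).

    Shapes: weight is (out, in), lora_b is (out, rank), lora_a is (rank, in).
    """
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


weight = [[1.0, 0.0], [0.0, 1.0]]
lora_a = [[1.0, 2.0]]    # rank 1, in_features 2
lora_b = [[0.5], [0.0]]  # out_features 2, rank 1
merged = merge_lora(weight, lora_a, lora_b, alpha=2.0, rank=1)
```

Once the update is folded in, the merged matrix has the same shape as the original weight, so inference needs neither the adapter matrices nor PEFT.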
Irodori-TTS/
├── train.py                              # Training entry point (DDP support)
├── infer.py                              # CLI inference
├── gradio_app.py                         # Gradio web UI
├── gradio_app_voicedesign.py             # Gradio web UI for VoiceDesign checkpoints
├── prepare_manifest.py                   # Dataset -> DACVAE latent preprocessing
├── convert_checkpoint_to_safetensors.py  # Checkpoint converter
│
├── irodori_tts/                          # Core library
│   ├── model.py                          # TextToLatentRFDiT architecture
│   ├── rf.py                             # Rectified Flow utilities & Euler CFG sampling
│   ├── codec.py                          # DACVAE codec wrapper
│   ├── dataset.py                        # Dataset and collator
│   ├── tokenizer.py                      # Pretrained LLM tokenizer wrapper
│   ├── config.py                         # Model / Train / Sampling config dataclasses
│   ├── inference_runtime.py              # Cached, thread-safe inference runtime
│   ├── lora.py                           # PEFT LoRA integration helpers
│   ├── text_normalization.py             # Japanese text normalization
│   ├── optim.py                          # Muon + AdamW optimizer
│   └── progress.py                       # Training progress tracker
│
└── configs/
    ├── train_500m_v2.yaml                    # 500M v2 model config
    ├── train_500m_v2_lora.yaml               # 500M v2 LoRA fine-tuning config
    ├── train_500m_v2_voice_design.yaml       # 500M v2 VoiceDesign full fine-tuning config
    ├── train_500m_v2_voice_design_lora.yaml  # 500M v2 VoiceDesign LoRA fine-tuning config
    ├── train_500m.yaml                       # 500M v1 model config
    └── train_2.5b.yaml                       # 2.5B parameter model config
- Code: MIT License
- Model Weights: Please refer to the base model card and the VoiceDesign model card for licensing details
This project builds upon Echo-TTS (architecture and training design) and DACVAE (continuous audio latents).
@misc{irodori-tts,
author = {Chihiro Arata},
title = {Irodori-TTS: A Flow Matching-based Text-to-Speech Model with Emoji-driven Style Control},
year = {2026},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/Aratako/Irodori-TTS}}
}