MoeSpeech

The Japanese version is below.

Changelog

Updates are planned on an ongoing basis: new characters, further audio filtering, additional audio files for existing characters, and so on.

2024-01-24: Added 24 characters (473 characters, 394k files, 622 hours, 368GB)
2024-01-22: Initial version (449 characters, 363k files, 581 hours, 343GB)

Dataset Summary

A high-quality dataset of in-character acted speech by professional Japanese voice actors, recorded in a studio and free of noise and background music.
Each audio file is a mono wav file of 2-15 seconds (almost all at 44.1kHz, a few at 48kHz).
The dataset is organized into one folder per character. Each character is anonymized and has a random 8-character identifier (lowercase hexadecimal) generated by uuid.uuid4().hex[:8].
It currently contains 473 characters and approximately 394k audio files, totaling about 622 hours and 368GB of audio.
The audio has been automatically filtered by quality so that it is suitable for tasks such as TTS.
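
As a small illustration of the identifier scheme, the sketch below generates an identifier with the expression quoted above and checks that a folder name has the expected shape. The function names `new_character_id` and `looks_like_character_id` are hypothetical and are not part of any dataset tooling.

```python
import string
import uuid


def new_character_id() -> str:
    """Return a random 8-character identifier via uuid.uuid4().hex[:8]."""
    # uuid4() is random; .hex is its 32-character lowercase hexadecimal form.
    return uuid.uuid4().hex[:8]


def looks_like_character_id(name: str) -> bool:
    """Check that a name is exactly 8 lowercase hexadecimal characters."""
    hex_chars = set(string.digits + "abcdef")
    return len(name) == 8 and all(ch in hex_chars for ch in name)
```

A check like this can help skip unrelated folders when scanning a download directory.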

Supported Tasks and Leaderboards

This dataset may be useful for research and development of speech tasks such as voice conversion and speech synthesis, particularly those focused on Japanese moe culture.
Transcribing the audio with automatic speech recognition might also make the dialogue content usable for language models and similar applications.

Languages

The language of the audio in the dataset is predominantly Japanese, with the possibility of a very small number of English phrases.

Dataset Structure

├── info.csv
└── data/
      ├── {uuid1}/
      │     └── wav/
      │          ├── {uuid1}_001.wav
      │          ├── {uuid1}_002.wav
      │          ├── {uuid1}_003.wav
      │       ...
      ├── {uuid2}/
      │     └── wav/
      │          ├── {uuid2}_0001.wav
      │          ├── {uuid2}_0002.wav
      │          ├── {uuid2}_0003.wav
      │       ...
      ...
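
Given the layout above, a minimal sketch for collecting each character's files could look like the following. It assumes the dataset has already been downloaded so that `root` contains the `data/` folder; the function name `list_wavs` is hypothetical.

```python
from pathlib import Path


def list_wavs(root: Path) -> dict[str, list[Path]]:
    """Map each character identifier to the sorted list of its wav files."""
    result: dict[str, list[Path]] = {}
    for char_dir in sorted((root / "data").iterdir()):
        if char_dir.is_dir():
            # Each character folder holds its audio under a "wav" subfolder.
            result[char_dir.name] = sorted((char_dir / "wav").glob("*.wav"))
    return result
```

Note that the file order within a folder carries no meaning, since the dataset randomizes it.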

info.csv:

name,num_files,total_duration_min,f0_mean
{uuid1},516,45.21,211.2
...

name: Character identifier (a random 8-character alphanumeric string)
num_files: Number of audio files
total_duration_min: Total duration of audio files (in minutes)
f0_mean: Average fundamental frequency (Hz) of the audio files, representing the pitch of the voice
Rows are sorted by name in ascending order.
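
The columns above can be used to pick characters before downloading anything. The following is a minimal sketch using only the standard library; the helper names and the sample rows are made up for illustration, and real identifiers and values come from info.csv itself.

```python
import csv
import io


def load_info(csv_file) -> list[dict]:
    """Parse info.csv rows into dicts with the numeric fields converted."""
    rows = []
    for row in csv.DictReader(csv_file):
        rows.append({
            "name": row["name"],
            "num_files": int(row["num_files"]),
            "total_duration_min": float(row["total_duration_min"]),
            "f0_mean": float(row["f0_mean"]),
        })
    return rows


def select_characters(rows, min_f0=0.0, max_f0=float("inf"), min_minutes=0.0):
    """Return names whose mean F0 and total duration fall in the given ranges."""
    return [
        r["name"] for r in rows
        if min_f0 <= r["f0_mean"] <= max_f0 and r["total_duration_min"] >= min_minutes
    ]


# Made-up rows for illustration only.
SAMPLE = """name,num_files,total_duration_min,f0_mean
0a1b2c3d,516,45.21,211.2
4e5f6a7b,120,16.50,118.9
"""
rows = load_info(io.StringIO(SAMPLE))
print(select_characters(rows, min_f0=180.0))  # ['0a1b2c3d']
```

With the real file, pass an open file object instead, for example `open("path/to/download/info.csv", newline="")`.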

Download

The huggingface-cli tool is a convenient way to download the data. Create an access token on the Hugging Face settings page, then log in with the following commands:

pip install -U "huggingface_hub[cli]"
huggingface-cli login

To download data for a specific identifier {uuid}:

huggingface-cli download litagin/moe-speech --repo-type dataset --local-dir path/to/download/ --local-dir-use-symlinks False --include "data/{uuid}/*"

To download all data (be mindful of the storage requirements):

huggingface-cli download litagin/moe-speech --repo-type dataset --local-dir path/to/download/ --local-dir-use-symlinks False

For more details, refer to the Hugging Face CLI documentation.
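
The same selective download can also be scripted from Python. The sketch below assumes the `huggingface_hub` package is installed and that you have already logged in; it uses `snapshot_download` with `allow_patterns`, which plays the role of `--include` on the command line. The helper names are hypothetical.

```python
def include_patterns(character_ids):
    """Build the same glob patterns that --include takes on the command line."""
    return [f"data/{cid}/*" for cid in character_ids]


def download_characters(character_ids, local_dir):
    """Download only the given characters' folders (requires prior login)."""
    # Imported here so include_patterns works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id="litagin/moe-speech",
        repo_type="dataset",
        local_dir=local_dir,
        allow_patterns=include_patterns(character_ids),
    )
```

For example, `download_characters(["{uuid}"], "path/to/download/")` with `{uuid}` replaced by a real identifier from info.csv would fetch a single character.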

Dataset Creation

Curation Rationale

This dataset was created to promote research and development in Japanese character voice synthesis and voice conversion, as there are not many large-scale, high-quality datasets of character lines voiced by voice actors.

Source Data

Recordings from PC games that were purchased through legal means and are personally owned.

Initial Data Collection and Normalization

The audio files were filtered on the following conditions:

- Mono only (stereo files are converted to mono if the left and right channels are identical)
- Duration: 2-15 seconds
- NISQA MOS quality score: the threshold is chosen per character from the score histogram, often the first quartile (Q1), because the MOS score depends heavily on the speaker
- Speech ratio (speech duration / total duration), measured with Silero VAD: at least 0.5
- After filtering, each character must have at least 100 audio files and a total duration of at least 15 minutes
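
The conditions above can be sketched as a single per-character filter. This is not the script that was actually used: it assumes the MOS score and speech ratio of each clip have already been computed (with NISQA and Silero VAD respectively), it always takes Q1 of the character's scores as the MOS threshold even though the text says only "often", and it leaves out the mono check. The `Clip` class and `filter_character` function are hypothetical names.

```python
from dataclasses import dataclass
from statistics import quantiles


@dataclass
class Clip:
    path: str
    duration_s: float    # clip length in seconds
    mos: float           # NISQA MOS score, assumed precomputed
    speech_ratio: float  # speech duration / total duration, assumed precomputed


def filter_character(clips, min_files=100, min_total_min=15.0):
    """Return the clips that pass all conditions, or [] if the character is dropped."""
    if len(clips) < 2:
        return []  # quantiles() needs at least two data points
    # Assumption: the first quartile of this character's scores is the threshold.
    q1 = quantiles([c.mos for c in clips], n=4)[0]
    kept = [
        c for c in clips
        if 2.0 <= c.duration_s <= 15.0 and c.mos >= q1 and c.speech_ratio >= 0.5
    ]
    total_min = sum(c.duration_s for c in kept) / 60.0
    # Drop the whole character if too little audio survives the filter.
    if len(kept) < min_files or total_min < min_total_min:
        return []
    return kept
```

Using a per-character quantile rather than a fixed MOS cutoff reflects the point made above that the score depends heavily on the speaker.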

Annotations

Fundamental frequencies were obtained using the dio function in pyworld, and their averages were calculated.
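
A minimal sketch of this annotation step for a single file is shown below. It assumes the `pyworld` and `soundfile` packages are installed and that frames where dio reports 0 Hz are unvoiced and excluded from the average; the source does not say how unvoiced frames were handled, so that part is an assumption. The function names are hypothetical.

```python
def mean_voiced_f0(f0_values):
    """Average F0 in Hz over voiced frames; returns 0.0 if no frame is voiced."""
    # Assumption: frames where the estimator reports 0 Hz are unvoiced and skipped.
    voiced = [float(v) for v in f0_values if v > 0]
    return sum(voiced) / len(voiced) if voiced else 0.0


def file_mean_f0(wav_path):
    """Estimate the mean F0 of one mono wav file with pyworld's dio."""
    # Imported lazily so mean_voiced_f0 can be used without these packages.
    import pyworld
    import soundfile

    # dio expects a one-dimensional float64 signal.
    samples, sample_rate = soundfile.read(wav_path, dtype="float64")
    f0, _times = pyworld.dio(samples, sample_rate)
    return mean_voiced_f0(f0)
```

The `f0_mean` column in info.csv would then correspond to an average of such values over a character's files.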

Personal and Sensitive Information

Due to the nature of the dataset's source, it may contain dialogues with the following characteristics:

Lines with sexual content.
Sexual sounds (though many of these should have been excluded during the filtering process).
(There is interest in creating a dataset focused solely on sexual sounds, but mechanical creation is challenging, and collaborators are sought.)

To prevent misuse for enjoyment purposes, the following measures have been taken:

Game names and character names are concealed, no categorization by game is done in folder organization, and random alphanumeric strings are used as character identifiers.
The order of audio files in each character folder is randomized to prevent identification of the sequence of dialogues.

Considerations for Using the Data

Discussion of Biases

Due to its nature, the dataset may exhibit certain biases, such as:

A tendency for a larger volume of data for female characters.

Other Known Limitations

The same voice actor may play multiple characters, or the same character may appear across multiple games. In such cases, they are assigned different identifiers.
Although the audio has been filtered for quality, not all files have been individually checked. Therefore, it's possible that some files may include:
- Audio processed to sound like it has an echo.
- Audio processed to sound as if it's coming through a phone or from behind a wall.

TODO: Exclude such audio through manual effort or some other means.

Additional Information

Licensing Information

Please refer to LICENSE for details. It is essential to read and understand the LICENSE before using this dataset to ensure compliance with its terms.

Disclaimer

The providers of this dataset accept no responsibility for any problems or damage arising from its use.
Users must comply with the laws of their country or region when using this dataset.

The legal basis for publishing this dataset is as follows: Copyright Law of Japan (Law No. 48 of May 6, 1970) Article 30-4:

(Quotation starts)

Article 30-4: It is permissible to exploit a work, in any way and to the extent considered necessary, in any of the following cases, or in any other case in which it is not a person's purpose to personally enjoy or cause another person to enjoy the thoughts or sentiments expressed in that work; provided, however, that this does not apply if the action would unreasonably prejudice the interests of the copyright owner in light of the nature or purpose of the work or the circumstances of its exploitation:

(i) if it is done for use in testing to develop or put into practical use technology that is connected with the recording of sounds or visuals of a work or other such exploitation;

(ii) if it is done for use in data analysis (meaning the extraction, comparison, classification, or other statistical analysis of the constituent language, sounds, images, or other elemental data from a large number of works or a large volume of other such data; the same applies in Article 47-5, paragraph (1), item (ii));

(iii) if it is exploited in the course of computer data processing or otherwise exploited in a way that does not involve what is expressed in the work being perceived by the human senses (for works of computer programming, such exploitation excludes the execution of the work on a computer), beyond as set forth in the preceding two items.

(End of quotation)

This dataset is considered to fall under the second category mentioned above.
The dataset is structured to meet the condition that "it is not a person's purpose to personally enjoy or cause another person to enjoy the thoughts or sentiments expressed in that work", and, as specified in LICENSE, users are prohibited from using it for enjoyment purposes.
Regarding "this does not apply if the action would unreasonably prejudice the interests of the copyright owner in light of the nature or purpose of the work or the circumstances of its exploitation": this dataset conceals the source game, voice actor names, and character names, and the order of the audio files is randomized. In addition, which identifiers share the same source is not disclosed. This makes it impossible to use the dataset for the original purpose of the work (enjoying the game's scenario together with voice and images), so publishing the dataset is not considered to unreasonably prejudice the interests of the copyright owner.
Even if a user were to ignore the license and attempt to use the dataset for enjoyment, the large number of speakers (over 400) makes it difficult to identify which speakers share a source, and each speaker has at least 100 audio files, so recreating the original game's scenario is practically impossible.
As stated in LICENSE, any use that "unfairly harms the interests of the copyright holder", as well as actions that hinder the considerations mentioned above (providing information to third parties about the voice actors or original works, or redistributing in a way that makes these associations identifiable), are prohibited.

MoeSpeech日本語版README [Japanese Version]

Dataset Description

Dataset Summary

日本人プロ声優による高音質（スタジオ録音）でノイズ・BGM等無しのキャラクター演技セリフ発話音声データセット
1音声は2-15秒のモノラルwavファイル（ほぼ全て44.1kHz、いくつかは48kHz）
キャラクターごとにフォルダ分けされている（各キャラクターは匿名化され、uuid.uuid4().hex[:8]によるランダムな8文字の英数字による識別子を持つ）
現在は合計473キャラクター、約39万の音声ファイル、合計約622時間、368GBの音声が含まれる
TTS等のタスクに使える質になるよう、機械的に音声の質によりフィルタリング済み

Changelog

随時更新予定：キャラクターの追加、音声のフィルタリング、既存キャラクターの音声ファイルの新たな追加等

2024-01-24: Added 24 characters (473 characters, 394k files, 622 hours, 368GB)
2024-01-22: Initial version (449 characters, 363k files, 581 hours, 343GB)

Supported Tasks and Leaderboards

日本の萌え文化に特化した音声変換や音声合成等の音声関連タスクの研究や開発に使えるかもしれない。
音声認識により書き起こしを行うことで、セリフ内容を言語モデル等に利用できるかもしれない。

Languages

データセット内の音声の言語は（極めて少数の英文セリフ等の可能性を除き）日本語のみ。

Dataset Structure

├── info.csv
└── data/
      ├── {uuid1}/
      │     └── wav/
      │          ├── {uuid1}_001.wav
      │          ├── {uuid1}_002.wav
      │          ├── {uuid1}_003.wav
      │       ...
      ├── {uuid2}/
      │     └── wav/
      │          ├── {uuid2}_0001.wav
      │          ├── {uuid2}_0002.wav
      │          ├── {uuid2}_0003.wav
      │       ...
      ...

info.csv:

name,num_files,total_duration_min,f0_mean
{uuid1},516,45.21,211.2
...

name: キャラクター識別子（8文字のランダムな英数字）
num_files: 音声ファイル数
total_duration_min: 音声ファイルの合計時間（分）
f0_mean: 音声ファイルの平均基本周波数（Hz）の平均、つまり声の高さ
並び順はnameの昇順

Download

huggingface-cliを使うと便利です。 Hugging Faceの設定ページからトークンを作り、以下でログインします。

pip install -U "huggingface_hub[cli]"
huggingface-cli login

識別名{uuid}のデータのみをダウンロードする場合。

huggingface-cli download litagin/moe-speech --repo-type dataset --local-dir path/to/download/ --local-dir-use-symlinks False --include "data/{uuid}/*"

全てのデータをダウンロードする場合（容量に注意してください）。

huggingface-cli download litagin/moe-speech --repo-type dataset --local-dir path/to/download/ --local-dir-use-symlinks False

詳細はHugging Face CLIのドキュメントを参照。

Dataset Creation

Curation Rationale

声優による高音質で大規模なキャラクターセリフ音声データセットはあまり存在していないので、日本語キャラクター音声合成や音声変換の研究発展の促進を目的として作成した。

Source Data

正規の手段で購入して個人的に所持しているPCゲームから録音したもの

Initial Data Collection and Normalization

以下のように音声ファイルをフィルタリングした。

Filter conditions:

- Mono only (stereo files are converted to mono if the left and right channels are identical)
- Duration: 2-15 seconds
- NISQA MOS quality score: the threshold is chosen per character from the score histogram, often the first quartile (Q1), because the MOS score depends heavily on the speaker
- Speech ratio (speech duration / total duration), measured with Silero VAD: at least 0.5
- After filtering, each character must have at least 100 audio files and a total duration of at least 15 minutes

Annotations

基本周波数はpyworldのdioで取得し、その平均を取った。

Personal and Sensitive Information

データセットの元の都合上、以下のようなセリフが含まれている可能性がある。

性的な内容のセリフ
性的な音声（ただしこれはフィルタリングの過程で多くが除外されているはず）
（性的な音声のみによるデータセットも作りたいが作成を機械的に行うのが困難そうなため分からず、協力者求む）

享受目的での利用を防ぐため、以下のような手段を取っている。

ゲーム名やキャラクター名を伏せ、ゲームによるフォルダ分け類別はせず、またキャラクター識別子としてランダムな英数字の名前を使用
各キャラクターフォルダ内の音声ファイルの並び順はランダムにし、セリフの順番を特定できないようにする

Considerations for Using the Data

Discussion of Biases

性質上、以下のようなバイアスがある可能性がある。

女性キャラクターの方がデータ量が多い傾向がある

Other Known Limitations

同一声優が複数のキャラクターを演じている場合や、同一キャラクターでも複数ゲームにまたがっている場合があり、その場合は別の識別子を持つ。
音声品質によりフィルタリングしているが、全てをチェックしたわけではないので、以下のような音声がたまに入っている可能性がある。
- エコーが入ったような加工がされた音声
- 電話越し・壁越しであるかのような加工がされた音声

TODO: このような音声を手作業か何らかの手段により除外する。

Additional Information

Licensing Information

LICENSEを参照。このデータセットを利用する場合は、必ずLICENSEを読んで利用条件を確認すること。

English:

Please refer to LICENSE. If you use this dataset, be sure to read the LICENSE and check the usage conditions.

Disclaimer

このデータセットの利用によって発生したいかなるトラブルや損害に対しても、データセットの提供者は責任を負わない。
このデータセットの利用に際して、自身の国または地域の法律に従うこと。

このデータセットを公開している根拠は、以下の通り。著作権法（昭和45年5月6日法律第48号）第三十条の四:

(以下引用)

著作物は、次に掲げる場合その他の当該著作物に表現された思想又は感情を自ら享受し又は他人に享受させることを目的としない場合には、その必要と認められる限度において、いずれの方法によるかを問わず、利用することができる。ただし、当該著作物の種類及び用途並びに当該利用の態様に照らし著作権者の利益を不当に害することとなる場合は、この限りでない。

　一　著作物の録音、録画その他の利用に係る技術の開発又は実用化のための試験の用に供する場合

　二　情報解析（多数の著作物その他の大量の情報から、当該情報を構成する言語、音、影像その他の要素に係る情報を抽出し、比較、分類その他の解析を行うことをいう。第四十七条の五第一項第二号において同じ。）の用に供する場合

　三　前二号に掲げる場合のほか、著作物の表現についての人の知覚による認識を伴うことなく当該著作物を電子計算機による情報処理の過程における利用その他の利用（プログラムの著作物にあつては、当該著作物の電子計算機における実行を除く。）に供する場合

(引用終わり)

このデータセットは、上記の第二号に該当すると考えられる。
「著作物に表現された思想又は感情を自ら享受し又は他人に享受させることを目的としない場合」という条件を満たすように配慮したデータセットの構造となっており、LICENSEにある通り、利用者は享受目的の利用を禁止されている。
「当該著作物の種類及び用途並びに当該利用の態様に照らし著作権者の利益を不当に害することとなる場合」については、本データセットでは参照元や声優名やキャラクター名を伏せている上に音声の順番もシャッフルされており、また同一参照元を持つ識別子も公開していないことから、当該著作物（ゲーム）の使用用途（シナリオを音声と絵をあわせて楽しむ）で利用することは不可能であり、このデータセットの公開によって著作権者の利益を不当に害することはないと考えられる。
もし仮に利用者がライセンスを無視し享受利用しようとしたとしても、話者数が400人以上で多数のため同じソースを持つ話者識別名の特定は困難であり、また音声ファイルも1話者につき100ファイル以上であるので、元のゲームのシナリオを再現することは事実上不可能である。
LICENSEにある通り、「当該著作物の種類及び用途並びに当該利用の態様に照らし著作権者の利益を不当に害することとなる」ような利用方法は禁じられており、また以上の配慮を妨げるような行為（声優や元著作物との対応についての第三者への情報提供やそれらが分かるような形での再配布）は禁止されている。
