Samples of Synthesized Speech
Samples for the paper submitted to ICASSP 2015
RNN | RNN_WR1_tuned | RNN_upper_bound |
Hyperion, however, tumbles erratically as gravity from nearby moons tugs on its irregular shape. | RNN | RNN_WR1_tuned | RNN_upper_bound |
They've all dried out; it's all carrot juice. | RNN | RNN_WR1_tuned | RNN_upper_bound |
That's why Kathy could not change Ruby's behavior. | RNN | RNN_WR1_tuned | RNN_upper_bound |
But to hear South African coach Kitch Christie talk, it's Lomu who should be worried. | RNN | RNN_WR1_tuned | RNN_upper_bound |
He is allowed brief visits on Friday mornings. | RNN | RNN_WR1_tuned | RNN_upper_bound |
When coaxing failed, the child's nose was plugged. | RNN | RNN_WR1_tuned | RNN_upper_bound |
Improve integration with VS, MSDN, Media SDK, DirectX SDK, DDK, etc. | RNN | RNN_WR1_tuned | RNN_upper_bound |
My wife has the showplace she always wanted. | RNN | RNN_WR1_tuned | RNN_upper_bound |
France, Japan and Germany all now give more aid to Africa than America does. | RNN | RNN_WR1_tuned | RNN_upper_bound |
Shoe the trainer never matched Shoe the jockey. | RNN | RNN_WR1_tuned | RNN_upper_bound |
--------------------------------------------------------------
Samples of Synthesized Speech
Please click HMM, DNN and RNN to play | HMM (MDL=1) | DNN (1k*3, MSE / SGE) | RNN (512*2_Sigmoid + 512*2_BLSTM) |
Hyperion, however, tumbles erratically as gravity from nearby moons tugs on its irregular shape. | HMM | DNN DNN_SGE | RNN |
They've all dried out; it's all carrot juice. | HMM | DNN DNN_SGE | RNN |
That's why Kathy could not change Ruby's behavior. | HMM | DNN DNN_SGE | RNN |
But to hear South African coach Kitch Christie talk, it's Lomu who should be worried. | HMM | DNN DNN_SGE | RNN |
When coaxing failed, the child's nose was plugged. | HMM | DNN DNN_SGE | RNN |
My wife has the showplace she always wanted. | HMM | DNN DNN_SGE | RNN |
France, Japan and Germany all now give more aid to Africa than America does. | HMM | DNN DNN_SGE | RNN |
Shoe the trainer never matched Shoe the jockey. | HMM | DNN DNN_SGE | RNN |
The Scottish club beat out bids from English teams Aston Villa, Leeds and Chelsea. | HMM | DNN DNN_SGE | RNN |
The drugs made her so tired she could barely stay awake during school . | HMM | DNN DNN_SGE | RNN |
Submitted to Signal Processing Letters
Sequence Generation Error (SGE) Minimization Based DNN Training for Text-to-Speech Synthesis
Yuchen Fan, Yao Qian and Frank K. Soong
Abstract – Feed-forward deep neural network (DNN) based TTS, which employs a multi-layered structure to exploit the statistical correlations between rich contextual information and the corresponding acoustic features, has been shown to outperform its decision tree-based, GMM-HMM counterpart. However, DNN TTS training has not taken the whole synthesized sequence, i.e., the sentence, into account in the optimization procedure, which results in an intrinsic inconsistency between training and testing. In this paper we propose a “sequence generation error” (SGE) minimization criterion for DNN-based TTS training. By incorporating whole-sequence parameter generation into the training process, the mismatch between training and testing is eliminated and the original constraints between the static and dynamic features are naturally embedded in the optimization. Experimental results on a 5-hour speech database show that DNN-based TTS trained with this new SGE minimization criterion can further improve the DNN baseline performance, particularly in subjective listening tests.
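As a sketch only (the symbols below are standard MLPG notation assumed here, not defined on this page), the idea is to let training see the same whole-sentence, closed-form parameter generation that synthesis uses, rather than a frame-level MSE:

```latex
% Sketch (notation assumed): \hat{o} is the DNN output sequence for one sentence
% (static + dynamic features), W the window matrix appending the dynamics,
% \Sigma a fixed covariance, and c the natural static-feature trajectory.
\begin{align}
  \hat{\mathbf{c}} &= \left(\mathbf{W}^{\top}\boldsymbol{\Sigma}^{-1}\mathbf{W}\right)^{-1}
                      \mathbf{W}^{\top}\boldsymbol{\Sigma}^{-1}\hat{\mathbf{o}}
                      && \text{(whole-sentence parameter generation)} \\
  D_{\mathrm{SGE}} &= \lVert \hat{\mathbf{c}} - \mathbf{c} \rVert^{2}
                      && \text{(minimized w.r.t.\ the DNN weights)}
\end{align}
```

Minimizing D_SGE by backpropagating through the generation step means training optimizes exactly the trajectory that synthesis will output, which is the training/testing consistency the abstract refers to.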
---------------------------------------------------------------------------------------------------------------
To appear in Interspeech 2014
TTS Synthesis with Bidirectional LSTM based Recurrent Neural Networks
Yuchen Fan, Yao Qian, Fenglong Xie, and Frank K. Soong
Abstract
Feed-forward, deep neural network (DNN) based TTS systems have recently been shown to outperform decision tree-based HMM TTS systems. However, the long-time-span contextual effects in a speech utterance are still not easy to accommodate, due to the intrinsically feed-forward nature of DNN-based modeling. Also, to synthesize a smooth speech trajectory, dynamic features are needed to constrain speech parameter trajectory generation in HMM-based TTS. In this paper, Recurrent Neural Networks (RNNs) with Bidirectional Long Short-Term Memory (BLSTM) cells are adopted to capture the correlation, or co-occurrence information, between any two instants in a speech utterance for parametric TTS synthesis. Experimental results show that a hybrid system of DNN and BLSTM-RNN, i.e., lower feed-forward hidden layers cascaded with upper bidirectional LSTM recurrent hidden layers, can outperform both the conventional decision tree-based HMM and a DNN TTS system, objectively and subjectively. The speech trajectory generated by the BLSTM-RNN TTS is fairly smooth, and no dynamic constraints are needed.
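A minimal sketch of such a hybrid stack, assuming PyTorch and the "512*2_Sigmoid + 512*2_BLSTM" sizes quoted in the sample-table header above (class name and input/output dimensions are illustrative, not from the paper):

```python
import torch
import torch.nn as nn

class HybridDNNBLSTM(nn.Module):
    """Hybrid TTS acoustic model sketch: lower feed-forward (sigmoid) layers
    cascaded with upper bidirectional LSTM layers, mapping frame-level
    linguistic features to acoustic features."""

    def __init__(self, in_dim: int, out_dim: int, hidden: int = 512):
        super().__init__()
        # Two feed-forward layers applied frame by frame.
        self.ff = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.Sigmoid(),
            nn.Linear(hidden, hidden), nn.Sigmoid(),
        )
        # Two bidirectional LSTM layers spanning the whole utterance.
        self.blstm = nn.LSTM(hidden, hidden, num_layers=2,
                             batch_first=True, bidirectional=True)
        # Linear output layer to acoustic features (e.g. LSP, F0, V/U).
        self.out = nn.Linear(2 * hidden, out_dim)

    def forward(self, x):  # x: (batch, frames, in_dim) linguistic features
        h = self.ff(x)
        h, _ = self.blstm(h)
        return self.out(h)

# Illustrative dimensions: 355-dim linguistic input, 127-dim acoustic output.
model = HybridDNNBLSTM(in_dim=355, out_dim=127)
y = model(torch.randn(2, 300, 355))  # two utterances of 300 frames each
```

Because the BLSTM layers see the whole utterance in both directions, smoothness comes from the recurrence itself, which is why no explicit dynamic-feature constraints are needed at generation time.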
---------------------------------------------------------------------------------------------------------------
ICASSP 2014
On the Training Aspects of Deep Neural Network (DNN) for Parametric TTS Synthesis
Yao Qian, Yuchen Fan, Wenping Hu, Frank Soong
Abstract
A Deep Neural Network (DNN), which can model a long-span, intricate transform compactly with a deep, layered structure, has previously been investigated for parametric TTS synthesis with a huge corpus (33,000 utterances). In this paper, we examine DNN TTS synthesis with a moderately sized corpus of 5 hours, which is more commonly used for parametric TTS training. The DNN is used to map input text features to output acoustic features (LSP, F0 and V/U). Experimental results show that the DNN can outperform the conventional HMM, which is first trained by ML and then refined by MGE. Both objective and subjective measures indicate that the DNN synthesizes speech better than the HMM-based baseline. The improvement is mainly in prosody: the RMSE between natural F0 trajectories and those generated by the DNN is improved by 2 Hz. This benefit likely comes from a key characteristic of the DNN, which can exploit feature correlations, e.g., between F0 and spectrum, without resorting to a more restricted probability density family such as a diagonal Gaussian. Our experimental results also show that layer-wise BP pre-training can drive weights to a better starting point than random initialization and result in a better DNN; that state boundary information is important for training a DNN that yields better synthesized speech; and that the hyperbolic tangent activation function in the DNN hidden layers helps training converge faster than the sigmoid.
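A minimal sketch of the kind of mapping described, again assuming PyTorch; the three tanh hidden layers echo the "1k*3" DNN label in the sample table above, and all dimensions and names are illustrative:

```python
import torch
import torch.nn as nn

class FeedForwardTTS(nn.Module):
    """Feed-forward acoustic model sketch: tanh hidden layers shared across
    the LSP, log-F0 and V/U outputs, so spectrum-F0 correlations are modeled
    jointly rather than in separate, diagonal-Gaussian streams."""

    def __init__(self, in_dim: int = 355, hidden: int = 1024, lsp_dim: int = 40):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
        )
        self.lsp = nn.Linear(hidden, lsp_dim)  # spectral envelope (LSP)
        self.f0 = nn.Linear(hidden, 1)         # interpolated log-F0
        self.vuv = nn.Linear(hidden, 1)        # voiced/unvoiced posterior

    def forward(self, x):  # x: (frames, in_dim) text/linguistic features
        h = self.body(x)
        return self.lsp(h), self.f0(h), torch.sigmoid(self.vuv(h))

# Illustrative usage: predict acoustic features for 300 frames of one utterance.
model = FeedForwardTTS()
lsp, f0, vuv = model(torch.randn(300, 355))
```

Sharing the hidden layers across all output streams is what lets the model exploit cross-feature correlations, the property the abstract credits for the F0 improvement.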