I. Introduction
Human interaction is one of the primal needs of human beings. In today’s globalized world, this need is more pronounced than ever, making better communication a significant goal. Computer-aided research in pursuit of greater communication efficiency focuses mostly on Natural Language Processing (NLP). In the telecommunications sub-domain, the Speech-to-Text (STT) and Text-to-Speech (TTS) models of the NLP family form the basis [1–7]. STT algorithms are closely tied to Speech and Language Recognition models: together they capture sound waveforms and output textual data. This functionality has mainly been targeted at enabling communication in medical settings, for example for hearing-impaired patients [8–9], or at providing better interaction among users in multilanguage environments [10–12]. TTS algorithms, on the other hand, consist of text normalization, phonetic analysis, and voice synthesis stages [2]. Like their STT counterparts, TTS systems mainly focus on users with medical conditions such as speech disorders or visual impairment [13–17], but they have also been adopted in other areas such as education [18–19], entertainment [20–22], and telecommunications.