Deep Learning for Videoconferencing: A Brief Examination of Speech to Text and Speech Synthesis

Publisher: IEEE

Abstract:
Integrating AI-powered applications into video conferencing systems is expected to expand rapidly across industrial scenarios. In the modern video conferencing industry, deep learning techniques are improving communication quality in use cases such as resolution enhancement, background noise reduction, video compression, face alignment, transcription, and speech synthesis. This paper reviews recent work on deep learning-based transcription and speech synthesis methods, classified into three categories: speech to text, text to speech, and speech to text to speech. We include experimental studies of two specific methods in speech to text and speech synthesis, and analyze experimental results of two state-of-the-art pre-trained models across various test scenarios. Finally, the future development trend of AI-powered video conferencing systems is predicted.
Date of Conference: 15-17 September 2021
Date Added to IEEE Xplore: 13 October 2021
INSPEC Accession Number: 21227105
Conference Location: Ankara, Turkey

I. Introduction

Human interaction is one of the primal needs of human beings. In today’s globalized world, this need reaches its peak, making better communication a significant goal. Computer-aided research in pursuit of this goal focuses largely on Natural Language Processing (NLP). In the telecommunications sub-domain, Speech-to-Text (STT) and Text-to-Speech (TTS) models from the NLP family form the basis [1–7]. In the former case, STT algorithms work closely with speech and language recognition models to capture sound waveforms and output textual data. This functionality has been targeted mainly at enabling communication in medical settings, such as for hearing-impaired patients [8–9], or at providing better interaction among users in multilanguage environments [10–12]. TTS algorithms, on the other hand, consist of text normalization, phonetic analysis, and voice synthesis stages [2]. Like their STT counterparts, TTS systems have mainly targeted users with medical conditions such as speech disorders or visual impairment [13–17], but they have also been adopted in other areas such as education [18–19], entertainment [20–22], and telecommunications.
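The three TTS stages mentioned above can be illustrated with a minimal sketch. This is not taken from any system cited here: the normalization rules, the one-letter-per-phone mapping, and the fixed frame duration are all hypothetical placeholders standing in for the lexica, learned grapheme-to-phoneme models, and neural vocoders used in practice.

```python
import re

# Hypothetical digit-to-word table for the normalization stage.
NUM_WORDS = {"0": "zero", "1": "one", "2": "two", "3": "three",
             "4": "four", "5": "five", "6": "six", "7": "seven",
             "8": "eight", "9": "nine"}

def normalize(text: str) -> str:
    """Stage 1: expand digits and abbreviations into plain words."""
    text = text.lower()
    text = re.sub(r"\bdr\.", "doctor", text)          # toy abbreviation rule
    text = "".join(NUM_WORDS.get(ch, ch) for ch in text)
    return re.sub(r"[^a-z ]", "", text).strip()       # keep letters and spaces

def phonetize(text: str) -> list[str]:
    """Stage 2: naive grapheme-to-phoneme mapping (one letter per
    pseudo-phone); real systems use pronunciation lexica or G2P models."""
    return [ch.upper() for word in text.split() for ch in word]

def synthesize(phones: list[str], frame_ms: int = 10) -> int:
    """Stage 3 stub: a real vocoder would emit a waveform; here we only
    report the nominal utterance duration in milliseconds."""
    return len(phones) * frame_ms

phones = phonetize(normalize("Dr. Smith has 2 cats"))
duration = synthesize(phones)
```

Running the cascade on "Dr. Smith has 2 cats" normalizes it to "doctor smith has two cats", yielding 21 pseudo-phones and a nominal 210 ms duration; the same staged structure holds when each stub is replaced by a learned model.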
