Deep Learning for Videoconferencing: A Brief Examination of Speech to Text and Speech Synthesis

Publisher: IEEE

Abstract:
Integrating AI-powered applications into video conferencing systems is expected to expand rapidly across industrial scenarios. In the modern video conferencing industry, deep learning techniques are improving communication quality in use cases such as resolution enhancement, background noise reduction, video compression, face alignment, transcription, and speech synthesis. This paper reviews recent work on deep learning-based transcription and speech synthesis methods, classified into three categories: speech to text, text to speech, and speech to text to speech. We include experimental studies of two specific methods in speech to text and speech synthesis, and analyze experimental results of two state-of-the-art pre-trained models across various test scenarios. Finally, the future development trend of AI-powered video conferencing systems is predicted.
Date of Conference: 15-17 September 2021
Date Added to IEEE Xplore: 13 October 2021
INSPEC Accession Number: 21227105
Conference Location: Ankara, Turkey

I. Introduction

Human interaction is one of the primal needs of human beings. In today’s globalized world, this need reaches its peak, making better communication a significant goal. Computer-aided research in pursuit of this goal focuses largely on Natural Language Processing (NLP). In the telecommunications sub-domain, Speech-to-Text (STT) and Text-to-Speech (TTS) models from the NLP family form the basis [1–7]. In the former case, STT algorithms work closely with speech and language recognition models to capture sound waveforms and output textual data. This functionality has been targeted mainly at enabling communication in medical settings, such as for hearing-impaired patients [8–9], or at providing better interaction among users in multilanguage environments [10–12]. TTS algorithms, on the other hand, consist of text normalization, phonetic analysis, and voice synthesis stages [2]. Like their STT counterparts, TTS systems have mainly targeted users with medical conditions such as speech disorders or visual impairment [13–17], but they have also been adopted in other areas such as education [18–19], entertainment [20–22], and telecommunications.
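The three TTS stages mentioned above can be illustrated with a minimal sketch. This is not taken from any system cited here: the normalization rules, the one-letter-per-phone mapping, and the fixed frame duration are all hypothetical placeholders standing in for the lexica, learned grapheme-to-phoneme models, and neural vocoders used in practice.

```python
import re

# Hypothetical digit-to-word table for the normalization stage.
NUM_WORDS = {"0": "zero", "1": "one", "2": "two", "3": "three",
             "4": "four", "5": "five", "6": "six", "7": "seven",
             "8": "eight", "9": "nine"}

def normalize(text: str) -> str:
    """Stage 1: expand digits and abbreviations into plain words."""
    text = text.lower()
    text = re.sub(r"\bdr\.", "doctor", text)          # toy abbreviation rule
    text = "".join(NUM_WORDS.get(ch, ch) for ch in text)
    return re.sub(r"[^a-z ]", "", text).strip()       # keep letters and spaces

def phonetize(text: str) -> list[str]:
    """Stage 2: naive grapheme-to-phoneme mapping (one letter per
    pseudo-phone); real systems use pronunciation lexica or G2P models."""
    return [ch.upper() for word in text.split() for ch in word]

def synthesize(phones: list[str], frame_ms: int = 10) -> int:
    """Stage 3 stub: a real vocoder would emit a waveform; here we only
    report the nominal utterance duration in milliseconds."""
    return len(phones) * frame_ms

phones = phonetize(normalize("Dr. Smith has 2 cats"))
duration = synthesize(phones)
```

Running the cascade on "Dr. Smith has 2 cats" normalizes it to "doctor smith has two cats", yielding 21 pseudo-phones and a nominal 210 ms duration; the same staged structure holds when each stub is replaced by a learned model.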
