Paper title
Deep Representation Learning in Speech Processing: Challenges, Recent Advances, and Future Trends
Paper authors
Abstract
Research on speech processing has traditionally treated the design of hand-engineered acoustic features (feature engineering) as a problem distinct from the design of efficient machine learning (ML) models that make prediction and classification decisions. This approach has two main drawbacks: first, manual feature engineering is cumbersome and requires human knowledge; second, the designed features might not be the best for the objective at hand. This has motivated a recent trend in the speech community towards representation learning techniques, which automatically learn an intermediate representation of the input signal that better suits the task at hand and hence leads to improved performance. The significance of representation learning has increased with advances in deep learning (DL), where the representations are more useful and less dependent on human knowledge, making them well suited to tasks such as classification and prediction. The main contribution of this paper is an up-to-date and comprehensive survey of speech representation learning techniques that brings together the scattered research across three distinct research areas: Automatic Speech Recognition (ASR), Speaker Recognition (SR), and Speaker Emotion Recognition (SER). Recent reviews in speech have covered ASR, SR, and SER; however, none of them has focused on representation learning from speech, a gap that our survey aims to bridge.