摘要翻译:
本文提出将话语级置换不变训练(uPIT)用于独立于说话人的多说话人语音分离和降噪。具体来说,我们使用uPIT训练深度双向长短期记忆(LSTM)递归
神经网络(RNNs),用于在多种噪声条件下,包括合成和现实噪声信号,进行单通道独立于说话人的多说话人语音分离。我们的实验集中于模型的推广性和噪声鲁棒性,这些模型依赖于不同类型的先验知识,例如噪声类型和同时说话者的数量。结果表明,在噪声环境中,利用uPIT训练的深度双向LSTM RNNs在不同噪声类型和信噪比下,在与说话人无关的多说话人语音分离和去噪任务中,可以提高信噪比(SDR)和扩展短时客观可懂度(ESTOI)测度。具体地说,我们首先表明,当使用已知噪声类型进行评估时,LSTM RNN可以实现较大的SDR和ESTOI改进,并且单个模型能够处理多种噪声类型,而性能仅略有下降。此外,我们还证明了一个LSTM RNN可以同时处理两个说话人和三个说话人的混合噪声,而不需要事先知道说话人的确切数目。最后,我们证明了使用uPIT训练的LSTM RNNs能够很好地泛化训练过程中未见的噪声类型。
---
英文标题:
《Joint Separation and Denoising of Noisy Multi-talker Speech using
Recurrent Neural Networks and Permutation Invariant Training》
---
作者:
Morten Kolb{\ae}k, Dong Yu, Zheng-Hua Tan, Jesper Jensen
---
最新提交年份:
2017
---
分类信息:
一级分类:Computer Science 计算机科学
二级分类:Sound 声音
分类描述:Covers all aspects of computing with sound, and sound as an information channel. Includes models of sound, analysis and synthesis, audio user interfaces, sonification of data, computer music, and sound signal processing. Includes ACM Subject Class H.5.5, and intersects with H.1.2, H.5.1, H.5.2, I.2.7, I.5.4, I.6.3, J.5, K.4.2.
涵盖了声音计算的各个方面,以及声音作为一种信息通道。包括声音模型、分析和合成、音频用户界面、数据的可听化、计算机音乐和声音信号处理。包括ACM学科类H.5.5,并与H.1.2、H.5.1、H.5.2、I.2.7、I.5.4、I.6.3、J.5、K.4.2交叉。
--
一级分类:Electrical Engineering and Systems Science 电气工程与系统科学
二级分类:Audio and Speech Processing 音频和语音处理
分类描述:Theory and methods for processing signals representing audio, speech, and language, and their applications. This includes analysis, synthesis, enhancement, transformation, classification and interpretation of such signals as well as the design, development, and evaluation of associated signal processing systems. Machine learning and pattern analysis applied to any of the above areas is also welcome. Specific topics of interest include: auditory modeling and hearing aids; acoustic beamforming and source localization; classification of acoustic scenes; speaker separation; active noise control and echo cancellation; enhancement; de-reverberation; bioacoustics; music signals analysis, synthesis and modification; music information retrieval; audio for multimedia and joint audio-video processing; spoken and written language modeling, segmentation, tagging, parsing, understanding, and translation; text mining; speech production, perception, and psychoacoustics; speech analysis, synthesis, and perceptual modeling and coding; robust speech recognition; speaker recognition and characterization; deep learning, online learning, and graphical models applied to speech, audio, and language signals; and implementation aspects ranging from system architecture to fast algorithms.
处理代表音频、语音和语言的信号的理论和方法及其应用。这包括分析、合成、增强、转换、分类和解释这些信号,以及相关信号处理系统的设计、开发和评估。机器学习和模式分析应用于上述任何领域也是受欢迎的。感兴趣的具体主题包括:听觉建模和助听器;声波束形成与声源定位;声场景分类;说话人分离;有源噪声控制和回声消除;增强;去混响;生物声学;音乐信号的分析、合成与修饰;音乐信息检索;多媒体音频和联合音视频处理;口语和书面语建模、切分、标注、句法分析、理解和翻译;文本挖掘;言语产生、感知和心理声学;语音分析、合成、感知建模和编码;鲁棒语音识别;说话人识别与特征描述;应用于语音、音频和语言信号的
深度学习、在线学习和图形模型;以及从系统架构到快速算法的实现方面。
--
---
英文摘要:
In this paper we propose to use utterance-level Permutation Invariant Training (uPIT) for speaker independent multi-talker speech separation and denoising, simultaneously. Specifically, we train deep bi-directional Long Short-Term Memory (LSTM) Recurrent Neural Networks (RNNs) using uPIT, for single-channel speaker independent multi-talker speech separation in multiple noisy conditions, including both synthetic and real-life noise signals. We focus our experiments on generalizability and noise robustness of models that rely on various types of a priori knowledge e.g. in terms of noise type and number of simultaneous speakers. We show that deep bi-directional LSTM RNNs trained using uPIT in noisy environments can improve the Signal-to-Distortion Ratio (SDR) as well as the Extended Short-Time Objective Intelligibility (ESTOI) measure, on the speaker independent multi-talker speech separation and denoising task, for various noise types and Signal-to-Noise Ratios (SNRs). Specifically, we first show that LSTM RNNs can achieve large SDR and ESTOI improvements, when evaluated using known noise types, and that a single model is capable of handling multiple noise types with only a slight decrease in performance. Furthermore, we show that a single LSTM RNN can handle both two-speaker and three-speaker noisy mixtures, without a priori knowledge about the exact number of speakers. Finally, we show that LSTM RNNs trained using uPIT generalize well to noise types not seen during training.
---
PDF链接:
https://arxiv.org/pdf/1708.09588