2022-04-05
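Abstract (translation):
This paper proposes a convolutional recurrent neural network for joint sound event localization and detection (SELD) of multiple overlapping sound events in three-dimensional space. The network takes a sequence of consecutive spectrogram time frames as input and maps it to two outputs in parallel. The first output performs sound event detection (SED) as a multi-label classification task on every time frame, yielding the temporal activity of all sound event classes. The second output performs localization by using multi-output regression to estimate the 3D Cartesian coordinates of the direction of arrival (DOA) for each sound event class. The method can associate multiple DOAs with their respective sound event labels and track that association over time. As features it uses the phase and magnitude components of the spectrogram, computed separately on each audio channel, which avoids any method-specific or array-specific feature extraction. It is evaluated on five Ambisonic and two circular-array datasets containing different overlapping sound events in anechoic, reverberant, and real-life scenarios, and compared against two SED baselines, three DOA estimation baselines, and one SELD baseline. The results show that the method is generic, applies to any array structure, and is robust to unseen DOA values, reverberation, and low SNR. Compared with the best baseline, it achieves consistently higher recall of the estimated number of DOAs across datasets, and this recall is significantly better than that of the best baseline when more sound events overlap.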
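To make the feature description concrete, here is a minimal sketch of taking the magnitude and phase of each channel's spectrum separately and stacking them as input features. It is only an illustration under stated assumptions: it uses a naive DFT on a single short frame with no windowing or overlap, and the function names `dft` and `magnitude_phase_features` are hypothetical, not taken from the paper.

```python
import cmath
import math


def dft(frame):
    """Naive DFT of one real-valued frame; returns the non-negative frequency bins."""
    n = len(frame)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))
        for k in range(n // 2 + 1)
    ]


def magnitude_phase_features(channels):
    """channels: list of equal-length frames, one per audio channel.

    Returns a list of feature vectors: the magnitude spectrum of every
    channel first, followed by the phase spectrum of every channel.
    """
    spectra = [dft(frame) for frame in channels]
    magnitudes = [[abs(c) for c in spec] for spec in spectra]
    phases = [[cmath.phase(c) for c in spec] for spec in spectra]
    return magnitudes + phases


# Toy two-channel frame (assumed data): a cosine and a sine at DFT bin 1.
frame_len = 8
ch0 = [math.cos(2 * math.pi * t / frame_len) for t in range(frame_len)]
ch1 = [math.sin(2 * math.pi * t / frame_len) for t in range(frame_len)]
features = magnitude_phase_features([ch0, ch1])
```

With two input channels this yields four feature vectors (two magnitude, two phase), each with `frame_len // 2 + 1` bins. Nothing in the sketch depends on the array geometry, which is the point the abstract makes about avoiding array-specific features.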
---
English title:
《Sound Event Localization and Detection of Overlapping Sources Using
  Convolutional Recurrent Neural Networks》
---
Authors:
Sharath Adavanne, Archontis Politis, Joonas Nikunen, Tuomas Virtanen
---
Latest submission year:
2018
---
Classification:

Primary category: Computer Science
Subcategory: Sound
Category description: Covers all aspects of computing with sound, and sound as an information channel. Includes models of sound, analysis and synthesis, audio user interfaces, sonification of data, computer music, and sound signal processing. Includes ACM Subject Class H.5.5, and intersects with H.1.2, H.5.1, H.5.2, I.2.7, I.5.4, I.6.3, J.5, K.4.2.
--
Primary category: Electrical Engineering and Systems Science
Subcategory: Audio and Speech Processing
Category description: Theory and methods for processing signals representing audio, speech, and language, and their applications. This includes analysis, synthesis, enhancement, transformation, classification and interpretation of such signals as well as the design, development, and evaluation of associated signal processing systems. Machine learning and pattern analysis applied to any of the above areas is also welcome. Specific topics of interest include: auditory modeling and hearing aids; acoustic beamforming and source localization; classification of acoustic scenes; speaker separation; active noise control and echo cancellation; enhancement; de-reverberation; bioacoustics; music signals analysis, synthesis and modification; music information retrieval; audio for multimedia and joint audio-video processing; spoken and written language modeling, segmentation, tagging, parsing, understanding, and translation; text mining; speech production, perception, and psychoacoustics; speech analysis, synthesis, and perceptual modeling and coding; robust speech recognition; speaker recognition and characterization; deep learning, online learning, and graphical models applied to speech, audio, and language signals; and implementation aspects ranging from system architecture to fast algorithms.
--

---
English abstract:
  In this paper, we propose a convolutional recurrent neural network for joint sound event localization and detection (SELD) of multiple overlapping sound events in three-dimensional (3D) space. The proposed network takes a sequence of consecutive spectrogram time-frames as input and maps it to two outputs in parallel. As the first output, the sound event detection (SED) is performed as a multi-label classification task on each time-frame producing temporal activity for all the sound event classes. As the second output, localization is performed by estimating the 3D Cartesian coordinates of the direction-of-arrival (DOA) for each sound event class using multi-output regression. The proposed method is able to associate multiple DOAs with respective sound event labels and further track this association with respect to time. The proposed method uses separately the phase and magnitude component of the spectrogram calculated on each audio channel as the feature, thereby avoiding any method- and array-specific feature extraction. The method is evaluated on five Ambisonic and two circular array format datasets with different overlapping sound events in anechoic, reverberant and real-life scenarios. The proposed method is compared with two SED, three DOA estimation, and one SELD baselines. The results show that the proposed method is generic and applicable to any array structures, robust to unseen DOA values, reverberation, and low SNR scenarios. The proposed method achieved a consistently higher recall of the estimated number of DOAs across datasets in comparison to the best baseline. Additionally, this recall was observed to be significantly better than the best baseline method for a higher number of overlapping sound events.
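The two parallel outputs described in the abstract can be illustrated with a small decoding sketch for a single time frame: the SED output gives one activity probability per class, the DOA output gives one (x, y, z) vector per class, and a DOA is reported only for classes the SED output marks as active. This is a sketch under stated assumptions rather than the authors' implementation: the 0.5 threshold, the azimuth/elevation convention, the class names, and the helper names are all hypothetical.

```python
import math


def doa_to_cartesian(azimuth_deg, elevation_deg):
    """Unit vector for a DOA (assumed convention: azimuth in the x-y plane, elevation toward +z)."""
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    return (math.cos(el) * math.cos(az), math.cos(el) * math.sin(az), math.sin(el))


def cartesian_to_doa(x, y, z):
    """Inverse mapping: recover (azimuth, elevation) in degrees from a Cartesian vector."""
    azimuth = math.degrees(math.atan2(y, x))
    elevation = math.degrees(math.atan2(z, math.hypot(x, y)))
    return azimuth, elevation


def decode_frame(sed_probs, doa_xyz, class_names, threshold=0.5):
    """Pair each active class (SED probability >= threshold) with its regressed DOA."""
    events = {}
    for name, prob, xyz in zip(class_names, sed_probs, doa_xyz):
        if prob >= threshold:
            events[name] = cartesian_to_doa(*xyz)
    return events


# Hypothetical network outputs for one frame with three classes.
class_names = ["speech", "phone", "door"]
sed_probs = [0.9, 0.2, 0.7]
doa_xyz = [doa_to_cartesian(30, 10), (0.0, 0.0, 0.0), doa_to_cartesian(-120, 0)]
events = decode_frame(sed_probs, doa_xyz, class_names)
```

Because both outputs are indexed by class, each DOA is tied to a sound event label by construction, and running the decoder frame by frame follows that association over time.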
---
PDF link:
https://arxiv.org/pdf/1807.00129