2022-03-04
Title:
Deep Cross-Modal Correlation Learning for Audio and Lyrics in Music Retrieval
---
Authors:
Yi Yu, Suhua Tang, Francisco Raposo, Lei Chen
---
Last submitted:
2017
---
Categories:

Primary: Computer Science
Secondary: Information Retrieval
Description: Covers indexing, dictionaries, retrieval, content and analysis. Roughly includes material in ACM Subject Classes H.3.0, H.3.1, H.3.2, H.3.3, and H.3.4.
--
Primary: Computer Science
Secondary: Sound
Description: Covers all aspects of computing with sound, and sound as an information channel. Includes models of sound, analysis and synthesis, audio user interfaces, sonification of data, computer music, and sound signal processing. Includes ACM Subject Class H.5.5, and intersects with H.1.2, H.5.1, H.5.2, I.2.7, I.5.4, I.6.3, J.5, K.4.2.
--
Primary: Electrical Engineering and Systems Science
Secondary: Audio and Speech Processing
Description: Theory and methods for processing signals representing audio, speech, and language, and their applications. This includes analysis, synthesis, enhancement, transformation, classification and interpretation of such signals as well as the design, development, and evaluation of associated signal processing systems. Machine learning and pattern analysis applied to any of the above areas is also welcome. Specific topics of interest include: auditory modeling and hearing aids; acoustic beamforming and source localization; classification of acoustic scenes; speaker separation; active noise control and echo cancellation; enhancement; de-reverberation; bioacoustics; music signals analysis, synthesis and modification; music information retrieval; audio for multimedia and joint audio-video processing; spoken and written language modeling, segmentation, tagging, parsing, understanding, and translation; text mining; speech production, perception, and psychoacoustics; speech analysis, synthesis, and perceptual modeling and coding; robust speech recognition; speaker recognition and characterization; deep learning, online learning, and graphical models applied to speech, audio, and language signals; and implementation aspects ranging from system architecture to fast algorithms.
--

---
Abstract:
  Little research focuses on cross-modal correlation learning where temporal structures of different data modalities, such as audio and lyrics, are taken into account. Stemming from the inherently temporal structure of music, we are motivated to learn the deep sequential correlation between audio and lyrics. In this work, we propose a deep cross-modal correlation learning architecture involving two-branch deep neural networks for the audio modality and the text modality (lyrics). Data from the different modalities are converted to the same canonical space, where inter-modal canonical correlation analysis is used as the objective function to calculate the similarity of temporal structures. This is the first study to understand the correlation between language and music audio through deep architectures that learn the paired temporal correlation of audio and lyrics. A pre-trained Doc2vec model followed by fully-connected layers (a fully-connected deep neural network) is used to represent lyrics. Two significant contributions are made in the audio branch: i) a pre-trained CNN followed by fully-connected layers is investigated for representing music audio; ii) we further propose an end-to-end architecture that simultaneously trains the convolutional layers and fully-connected layers to better learn the temporal structure of music audio. In particular, our end-to-end deep architecture has two properties: it implements feature learning and cross-modal correlation learning simultaneously, and it learns a joint representation that takes temporal structure into account. Experimental results, using audio to retrieve lyrics or using lyrics to retrieve audio, verify the effectiveness of the proposed deep correlation learning architectures in cross-modal music retrieval.
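The core idea in the abstract is that the two branches project audio and lyrics into a shared canonical space, where canonical correlation analysis (CCA) serves as the objective. A minimal NumPy sketch of that CCA computation is given below; the embedding shapes, the regularizer, and the function names are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def _inv_sqrt(S):
    # Inverse square root of a symmetric positive-definite matrix
    # via eigendecomposition.
    w, V = np.linalg.eigh(S)
    return V @ np.diag(w ** -0.5) @ V.T

def canonical_correlations(X, Y, reg=1e-4):
    """Canonical correlations between two views of the same n items.

    X: (n, d1) array, e.g. audio-branch embeddings.
    Y: (n, d2) array, e.g. lyrics-branch embeddings.
    Returns the min(d1, d2) canonical correlations, largest first.
    """
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    # Regularized covariance and cross-covariance matrices.
    Sxx = Xc.T @ Xc / (n - 1) + reg * np.eye(X.shape[1])
    Syy = Yc.T @ Yc / (n - 1) + reg * np.eye(Y.shape[1])
    Sxy = Xc.T @ Yc / (n - 1)
    # Singular values of the whitened cross-covariance matrix
    # are the canonical correlations.
    T = _inv_sqrt(Sxx) @ Sxy @ _inv_sqrt(Syy)
    return np.linalg.svd(T, compute_uv=False)
```

In a deep-CCA-style setup such as the one the abstract describes, the training loss would be the negative sum of these correlations, backpropagated through both branch networks so that paired audio and lyrics embeddings become maximally correlated.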
---
PDF:
https://arxiv.org/pdf/1711.08976