Title:
Decoding visemes: improving machine lipreading
---
Author:
Helen L Bear
---
Year of latest submission:
2017
---
Classification:
Primary classification: Computer Science
Secondary classification: Computer Vision and Pattern Recognition
Classification description: Covers image processing, computer vision, pattern recognition, and scene understanding. Roughly includes material in ACM Subject Classes I.2.10, I.4, and I.5.
--
Primary classification: Electrical Engineering and Systems Science
Secondary classification: Audio and Speech Processing
Classification description: Theory and methods for processing signals representing audio, speech, and language, and their applications. This includes analysis, synthesis, enhancement, transformation, classification and interpretation of such signals as well as the design, development, and evaluation of associated signal processing systems. Machine learning and pattern analysis applied to any of the above areas is also welcome. Specific topics of interest include: auditory modeling and hearing aids; acoustic beamforming and source localization; classification of acoustic scenes; speaker separation; active noise control and echo cancellation; enhancement; de-reverberation; bioacoustics; music signals analysis, synthesis and modification; music information retrieval; audio for multimedia and joint audio-video processing; spoken and written language modeling, segmentation, tagging, parsing, understanding, and translation; text mining; speech production, perception, and psychoacoustics; speech analysis, synthesis, and perceptual modeling and coding; robust speech recognition; speaker recognition and characterization; deep learning, online learning, and graphical models applied to speech, audio, and language signals; and implementation aspects ranging from system architecture to fast algorithms.
--
---
Abstract:
Machine lipreading (MLR) is speech recognition from visual cues and a niche research problem in speech processing and computer vision. Current challenges fall into two groups: the content of the video, such as the rate of speech; and the parameters of the video recording, e.g. video resolution. We show that HD video is not needed to lipread successfully with a computer. The term "viseme" is used in machine lipreading to represent a visual cue or gesture which corresponds to a subgroup of phonemes, where the phonemes are visually indistinguishable. A phoneme is the smallest unit of sound one can utter. Because there are several phonemes per viseme, maps between the two units show a many-to-one relationship. Many maps have been presented; we compare these, and our results show that Lee's is best. We propose a new method for constructing speaker-dependent phoneme-to-viseme maps and compare these to Lee's. Our results show the sensitivity of phoneme clustering, and we use this new knowledge to augment a conventional MLR system. It has been observed in MLR that classifiers need training on test subjects to achieve accuracy; thus machine lipreading is highly speaker-dependent. Conversely, speaker independence is robust classification of speakers not seen in training. We investigate the dependence of phoneme-to-viseme maps between speakers and show that there is not high variability in the visemes themselves, but there is high variability in the trajectory between visemes of individual speakers with the same ground truth. This implies a dependency upon the number of visemes within each set for each individual. We show that prior phoneme-to-viseme maps rarely have enough visemes, and that the optimal size, which varies by speaker, ranges from 11 to 35. Finally, we decode from visemes back to phonemes and into words. Our novel approach uses the optimal range of visemes within hierarchical training of phoneme classifiers and demonstrates a significant increase in classification accuracy.
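To make the many-to-one phoneme-to-viseme relationship concrete, the sketch below shows a small, hypothetical grouping (it is not Lee's map nor any map from the paper) and a toy confusion-based clustering in the spirit of the speaker-dependent mapping the abstract describes: phonemes that a phoneme classifier frequently confuses for a given speaker are merged into one viseme. All phoneme symbols, groupings, counts, and the greedy threshold rule here are illustrative assumptions, not the paper's method.

```python
# Illustrative sketch only: a hypothetical phoneme-to-viseme grouping and a
# toy confusion-based clustering. Neither the groups nor the clustering rule
# is taken from the paper; they demonstrate the many-to-one structure and the
# idea of deriving speaker-dependent visemes from classifier confusions.
from collections import defaultdict

# Hypothetical viseme classes: each viseme covers several phonemes that are
# hard to tell apart visually (e.g. the bilabials /p b m/ look alike on the lips).
VISEME_TO_PHONEMES = {
    "V_bilabial": ["p", "b", "m"],
    "V_labiodental": ["f", "v"],
    "V_alveolar": ["t", "d", "s", "z"],
}

# Invert to the phoneme-to-viseme map: many phonemes -> one viseme.
PHONEME_TO_VISEME = {
    p: v for v, phones in VISEME_TO_PHONEMES.items() for p in phones
}

def cluster_phonemes(confusions, threshold):
    """Greedily group phonemes whose confusion count meets a threshold.

    `confusions` maps phoneme pairs to how often a phoneme classifier
    confused them for one speaker. Frequently confused phonemes are taken
    to be visually indistinguishable and merged into one viseme. This
    greedy union-find rule is a stand-in for the paper's clustering.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for (a, b), count in confusions.items():
        if count >= threshold:
            parent[find(a)] = find(b)  # union: merge the two clusters

    clusters = defaultdict(list)
    for phone in list(parent):
        clusters[find(phone)].append(phone)
    return list(clusters.values())

if __name__ == "__main__":
    print(PHONEME_TO_VISEME["p"])  # -> V_bilabial (same class as b and m)
    # Toy per-speaker confusion counts between phoneme pairs.
    toy_confusions = {("p", "b"): 40, ("b", "m"): 35, ("f", "v"): 50, ("s", "t"): 2}
    print(cluster_phonemes(toy_confusions, threshold=10))
    # -> [['p', 'b', 'm'], ['f', 'v']] (order may vary)
```

Varying the merge threshold changes how many visemes the clustering yields per speaker, which is exactly the set-size question the abstract raises: the optimal number of visemes varies by speaker, between 11 and 35 in the paper's experiments.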
---
PDF link:
https://arxiv.org/pdf/1710.01288