Title:
Some observations on computer lip-reading: moving from the dream to the reality
---
Authors:
Helen L. Bear, Gari Owen, Richard Harvey, and Barry-John Theobald
---
Year of latest submission:
2017
---
Categories:
Top-level category: Computer Science
Subcategory: Computer Vision and Pattern Recognition
Description: Covers image processing, computer vision, pattern recognition, and scene understanding. Roughly includes material in ACM Subject Classes I.2.10, I.4, and I.5.
--
Top-level category: Electrical Engineering and Systems Science
Subcategory: Image and Video Processing
Description: Theory, algorithms, and architectures for the formation, capture, processing, communication, analysis, and display of images, video, and multidimensional signals in a wide variety of applications. Topics of interest include: mathematical, statistical, and perceptual image and video modeling and representation; linear and nonlinear filtering, de-blurring, enhancement, restoration, and reconstruction from degraded, low-resolution or tomographic data; lossless and lossy compression and coding; segmentation, alignment, and recognition; image rendering, visualization, and printing; computational imaging, including ultrasound, tomographic and magnetic resonance imaging; and image and video analysis, synthesis, storage, search and retrieval.
--
---
Abstract:
In the quest for greater computer lip-reading performance there are a number of tacit assumptions which are either present in the datasets (high resolution for example) or in the methods (recognition of spoken visual units called visemes for example). Here we review these and other assumptions and show the surprising result that computer lip-reading is not heavily constrained by video resolution, pose, lighting and other practical factors. However, the working assumption that visemes, which are the visual equivalent of phonemes, are the best unit for recognition does need further examination. We conclude that visemes, which were defined over a century ago, are unlikely to be optimal for a modern computer lip-reading system.
---
PDF link:
https://arxiv.org/pdf/1710.01084