Title:
TraMNet - Transition Matrix Network for Efficient Action Tube Proposals
---
Authors:
Gurkirt Singh, Suman Saha, Fabio Cuzzolin
---
Latest submission year:
2018
---
Classification:
Primary: Electrical Engineering and Systems Science
Secondary: Image and Video Processing
Description: Theory, algorithms, and architectures for the formation, capture, processing, communication, analysis, and display of images, video, and multidimensional signals in a wide variety of applications. Topics of interest include: mathematical, statistical, and perceptual image and video modeling and representation; linear and nonlinear filtering, de-blurring, enhancement, restoration, and reconstruction from degraded, low-resolution or tomographic data; lossless and lossy compression and coding; segmentation, alignment, and recognition; image rendering, visualization, and printing; computational imaging, including ultrasound, tomographic and magnetic resonance imaging; and image and video analysis, synthesis, storage, search and retrieval.
--
Primary: Computer Science
Secondary: Computer Vision and Pattern Recognition
Description: Covers image processing, computer vision, pattern recognition, and scene understanding. Roughly includes material in ACM Subject Classes I.2.10, I.4, and I.5.
--
Primary: Computer Science
Secondary: Robotics
Description: Roughly includes material in ACM Subject Class I.2.9.
--
---
Abstract:
Current state-of-the-art methods solve spatiotemporal action localisation by extending 2D anchors to 3D-cuboid proposals on stacks of frames, to generate sets of temporally connected bounding boxes called \textit{action micro-tubes}. However, they fail to consider that the underlying anchor proposal hypotheses should also move (transition) from frame to frame, as the actor or the camera does. Assuming we evaluate $n$ 2D anchors in each frame, then the number of possible transitions from each 2D anchor to the next, for a sequence of $f$ consecutive frames, is in the order of $O(n^f)$, expensive even for small values of $f$. To avoid this problem, we introduce a Transition-Matrix-based Network (TraMNet) which relies on computing transition probabilities between anchor proposals while maximising their overlap with ground truth bounding boxes across frames, and enforcing sparsity via a transition threshold. As the resulting transition matrix is sparse and stochastic, this reduces the proposal hypothesis search space from $O(n^f)$ to the cardinality of the thresholded matrix. At training time, transitions are specific to cell locations of the feature maps, so that a sparse (efficient) transition matrix is used to train the network. At test time, a denser transition matrix can be obtained either by decreasing the threshold or by adding to it all the relative transitions originating from any cell location, allowing the network to handle transitions in the test data that might not have been present in the training data, and making detection translation-invariant. Finally, we show that our network can handle sparse annotations such as those available in the DALY dataset. We report extensive experiments on the DALY, UCF101-24 and Transformed-UCF101-24 datasets to support our claims.
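A minimal sketch of the transition-matrix construction described above, assuming flat indexing of the $n$ 2D anchors and ground-truth tubes given as one box per frame (the helper names, the IoU matching rule and the default threshold value are illustrative assumptions, not the authors' implementation):

import numpy as np

def iou(anchors, gt_box):
    # IoU between an (n, 4) array of anchor boxes and one ground-truth box (x1, y1, x2, y2).
    x1 = np.maximum(anchors[:, 0], gt_box[0])
    y1 = np.maximum(anchors[:, 1], gt_box[1])
    x2 = np.minimum(anchors[:, 2], gt_box[2])
    y2 = np.minimum(anchors[:, 3], gt_box[3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (anchors[:, 2] - anchors[:, 0]) * (anchors[:, 3] - anchors[:, 1])
    area_g = (gt_box[2] - gt_box[0]) * (gt_box[3] - gt_box[1])
    return inter / (area_a + area_g - inter + 1e-9)

def transition_matrix(anchors, gt_tubes, threshold=0.05):
    # anchors:  (n, 4) 2D anchor boxes shared by all frames.
    # gt_tubes: list of (f, 4) arrays, one ground-truth box per frame of each tube.
    # Each ground-truth box is matched to its highest-IoU anchor, transitions between
    # the matched anchors of consecutive frames are counted, rows are normalised into
    # probabilities, and entries below the transition threshold are zeroed for sparsity.
    n = anchors.shape[0]
    counts = np.zeros((n, n))
    for tube in gt_tubes:
        best = [int(np.argmax(iou(anchors, box))) for box in tube]
        for a, b in zip(best[:-1], best[1:]):
            counts[a, b] += 1.0
    probs = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1.0)
    probs[probs < threshold] = 0.0
    return probs

The nonzero entries of the thresholded matrix, rather than all $n^f$ anchor chains over $f$ frames, then define the micro-tube proposal hypotheses; at test time the matrix can be densified by lowering the threshold or by pooling relative transitions over all cell locations, as described in the abstract above.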
---
PDF link:
https://arxiv.org/pdf/1808.00297