Paper Reading

Learning to Promote Saliency Detectors

https://github.com/lartpang/M…

Abbreviations:

  • SD: Saliency Detection
  • ZSL: Zero-Shot Learning

Key points:

  • No DNN is trained to map images directly to labels. Instead, the DNN is fitted as an embedding function that maps pixels and the attributes of the salient/background regions into a metric space. The attributes of the salient/background regions are mapped to anchors in that space. A nearest neighbor (NN) classifier is then constructed in this space, assigning each pixel the label of its nearest anchor.
  • Means of preserving resolution:

    1. Remove the pooling layers of the last two convolutional blocks, and use dilated convolutions to maintain the receptive field of the convolutional filters.
    2. Add a sub-pixel convolution layer after each convolutional block of the VGG feature extractor, to upsample the feature map of each block to the input image size.
  • An iterative training/testing strategy is used.

    • It is not mentioned how the number of training iterations is determined.
    • The number of testing iterations is set manually.
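The core mechanism described above — embed pixels and region attributes into a metric space, then give each pixel the label of its nearest anchor — can be sketched as follows. This is a toy numpy sketch, not the paper's implementation: all names are mine, and the hand-written 2-D vectors stand in for the DNN embedding.

```python
import numpy as np

def nearest_anchor_classify(pixel_embeddings, anchors, anchor_labels):
    """Assign each pixel the label of its nearest anchor in the metric space.

    pixel_embeddings: (N, D) array, one D-dim embedding per pixel.
    anchors: (K, D) array, embeddings of salient/background region attributes.
    anchor_labels: (K,) array of labels (1 = salient, 0 = background).
    """
    # Pairwise Euclidean distances between pixels and anchors: (N, K)
    dists = np.linalg.norm(pixel_embeddings[:, None, :] - anchors[None, :, :], axis=-1)
    # Nearest-neighbor rule: each pixel takes the label of its closest anchor
    return anchor_labels[np.argmin(dists, axis=1)]

# Toy 2-D metric space: one salient anchor at (1, 1), one background anchor at (-1, -1)
anchors = np.array([[1.0, 1.0], [-1.0, -1.0]])
labels = np.array([1, 0])
pixels = np.array([[0.9, 1.2], [-0.8, -1.1], [0.2, 0.1]])
pred = nearest_anchor_classify(pixels, anchors, labels)
```

Because the classifier is just a nearest-neighbor rule, all image-specific work lives in where the anchors land, which is what the iterative scheme later revises.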

Some thoughts:

Similar to R3Net, the number of times the appended structure is applied is only settled after repeated rounds of testing, and the related experiments show that after a certain number of iterations the improvement saturates. One can only say that the method proposed here offers some benefit on top of existing networks.
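The iterative testing strategy discussed above can be sketched roughly like this: re-estimate the anchors from the current prediction, re-classify the pixels, and repeat a manually chosen number of times. All names are mine, and mean-pooling of region embeddings is my stand-in for the paper's region-attribute mapping.

```python
import numpy as np

def iterate_anchors(pixel_emb, init_saliency, n_iters=3):
    """Toy sketch of the iterative testing scheme: anchors are re-estimated
    from the current prediction, then pixels are re-classified.

    pixel_emb: (N, D) pixel embeddings.
    init_saliency: (N,) initial binary saliency from an existing detector.
    """
    pred = init_saliency.astype(bool)
    for _ in range(n_iters):  # the iteration count is set manually, as in the paper
        if pred.all() or (~pred).all():
            break  # degenerate split, keep the current prediction
        # Re-estimate anchors as the mean embedding of each predicted region
        # (a stand-in for the paper's region-attribute embedding).
        sal_anchor = pixel_emb[pred].mean(axis=0)
        bg_anchor = pixel_emb[~pred].mean(axis=0)
        d_sal = np.linalg.norm(pixel_emb - sal_anchor, axis=1)
        d_bg = np.linalg.norm(pixel_emb - bg_anchor, axis=1)
        pred = d_sal < d_bg  # nearest-anchor rule
    return pred.astype(int)

# Two well-separated clusters; a noisy initial map gets cleaned up by iterating.
emb = np.vstack([np.full((5, 2), 2.0), np.full((5, 2), -2.0)])
noisy_init = np.array([1, 1, 1, 0, 1, 0, 0, 1, 0, 0])
refined = iterate_anchors(emb, noisy_init)
```

On such a toy example one pass already converges, which mirrors the saturation behavior noted above: once the anchors stop moving, further iterations change nothing.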

As mentioned here, this is akin to a ZSL approach: taking the results produced by existing SD algorithms ("past knowledge") and letting the newly added structure iterate on that past knowledge, so as to boost the final "post-processed" result ("to promote existing SD algorithms").

Some questions:

How can this method be applied to existing architectures? How should an existing architecture be modified?

When training the modified structure, do we also need to randomly flip the pixel labels in the ground truth, as done in the paper?
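As for the random label flipping mentioned above, a minimal sketch of what flipping ground-truth pixel labels during training could look like (`flip_prob` and every name here are my assumptions, not the paper's; presumably the flipping simulates the noisy saliency maps seen at test time):

```python
import numpy as np

def flip_labels(gt, flip_prob=0.1, rng=None):
    """Randomly flip a fraction of binary ground-truth pixel labels (0/1).

    gt: (H, W) binary ground-truth saliency mask.
    flip_prob: probability that any given pixel label is flipped.
    """
    rng = np.random.default_rng(rng)
    flip_mask = rng.random(gt.shape) < flip_prob
    # Flip where the mask is True, keep the original label elsewhere
    return np.where(flip_mask, 1 - gt, gt)

gt = np.zeros((4, 4), dtype=int)
noisy = flip_labels(gt, flip_prob=0.5, rng=0)
```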

[Figure]

Where does the 6th layer here come from? Is it aggregated from the C-channel feature outputs of the preceding 5 layers?

Abstract

The categories and appearance of salient objects vary from image to image; therefore, saliency detection is an image-specific task. Due to the lack of large-scale saliency training data, DNNs with pre-training have difficulty precisely capturing image-specific saliency cues. To solve this issue, we formulate a zero-shot learning problem to promote existing saliency detectors.

Concretely, a DNN is trained as an embedding function to map pixels and the attributes of the salient/background regions of an image into the same metric space, in which an image-specific classifier is learned to classify the pixels.

Since the image-specific task is performed by the classifier, the DNN embedding effectively plays the role of a general feature extractor.

Compared with transferring the learning to a new recognition task using limited data, this formulation makes the DNN learn more effectively from small data.

Extensive experiments on five data sets show that our method significantly improves the accuracy of existing methods and compares favorably against state-of-the-art approaches.

One point made here: a ZSL problem is formulated to promote existing SD detectors. How exactly are they promoted?

Background on ZSL (Zero-Shot Learning):

> Suppose Xiao'an (purely because I don't want to use Xiaoming) goes to the zoo with his father. They see a horse, and his father tells him this is a horse; then they see a tiger, and the father says: "Look, animals with stripes like these are tigers."; finally they visit a panda, and the father says: "See, the panda is black and white." The father then gives Xiao'an a task: find an animal he has never seen before, called a zebra, somewhere in the zoo, and gives him some information about it: "A zebra has the outline of a horse, stripes like a tiger, and is black and white like a panda." Following these hints, Xiao'an finds the zebra in the zoo (an ending that was to be expected...).

> The example above contains a human reasoning process: using past knowledge (the descriptions of the horse, the tiger, the panda and the zebra) to infer in one's mind what a new object looks like, so that the new object can be recognized.

> ZSL hopes to imitate this reasoning process, giving computers the ability to recognize new things.

Putting this together: past knowledge can be used to promote existing SD detectors.

> What does this process look like?

Introduction

Traditional saliency detection methods typically exploit low-level features and heuristic priors; they can neither discover salient objects in complex scenes nor capture semantic objects. With the rise of DNNs, higher-level semantic features can be learned from training samples, which is more effective for locating semantically salient regions and also works better in complex scenes.

Using DNNs raises the issue of data. DNNs are usually trained on large amounts of data, while SD data is rather limited. This is usually addressed by pre-training on large data sets from other tasks (such as classification), which however easily leads to other problems:

[Figure]

Related Work

  • … on an image graph model, where saliency of each region is defined as its absorbed time from boundary nodes.

  • Yang et al. [32] rank the similarity of the image regions with foreground cues or background cues via graph-based manifold ranking.
  • Since the conventional methods are neither robust in complex scenes nor capable of capturing semantic objects, deep neural networks (DNNs) are introduced to overcome these drawbacks.

    • Li et al. [16] train CNNs with fully connected layers to predict the saliency value of each superpixel, and enhance the spatial coherence of their saliency results using a refinement method.
    • Li et al. [18] propose an FCN trained under the multi-task learning framework for saliency detection.
    • Zhang et al. [34] present a generic framework to aggregate multi-level convolutional features for saliency detection.

    Although the proposed method is also based on DNNs, the main difference between ours and these methods is that they learn a general model that directly maps images to labels, while our method learns a general embedding function as well as an image-specific NN classifier.

    TD

    Top-down (TD) saliency aims at finding salient regions specified by a task, and is usually formulated as a supervised learning problem.

    • Yang and Yang [33] propose a supervised top-down saliency model that jointly learns a Conditional Random Field (CRF) and a discriminative dictionary.
    • Gao et al. [9] introduce a top-down saliency algorithm by selecting discriminant features from a pre-defined filter bank.

    TD+BU

    Integration of TD and BU saliency has been exploited by some methods.

    • Borji [3] combines low-level features and saliency maps of previous bottom-up models with top-down cognitive visual features to predict fixations.
    • Tong et al. [26] propose a top-down learning approach where the algorithm is bootstrapped with training samples generated using a bottom-up model, to exploit the strengths of both bottom-up contrast-based saliency models and top-down learning methods.

    Our method can also be viewed as an integration of TD and BU saliency. Although both our method and the method of Tong et al. [26] formulate the problem as top-down saliency detection specified by initial saliency maps, there are certain differences between the two.

    1. First, Tong's method trains a strong model via bootstrap learning with training samples generated by a weak model. In contrast, our method maps pixels and the approximate salient/background regions into a learned metric space, which is related to zero-shot learning.
    2. Second, thanks to deep learning, our method is capable of capturing semantically salient regions and does well on complex scenes, while Tong’s method uses hand-crafted features and heuristic priors, which are less robust.
    3. Third, our method produces pixel-level results, while Tong’s method computes saliency value of each image region to assemble a saliency map, which tends to be coarser.

    The Proposed Method

    [Figure]

    Our method consists of three components:

    1. a DNN as an embedding function, i.e. the anchor network, that maps pixels and regions of the input image into a learned metric space;
    2. a nearest neighbor (NN) classifier in the embedding space, learned specifically for this image to classify its pixels;
    3. an iterative testing scheme that utilizes the result of the NN classifier to revise the anchors, yielding increasingly more accurate results.

    The anchor network

    This part mainly performs two mappings: one maps the pixels of the image, the other maps the salient/background regions of the image.

    Each pixel is mapped, through an embedding function modeled by a DNN, to a vector in a D-dimensional metric space.

  • https://zhuanlan.zhihu.com/p/…
  • What is embedding | embedded space | feature embedding in deep neural architectures?: https://www.quora.com/What-is…
  • Can someone explain word embedding? – answer by 寒蝉鸣泣 – Zhihu: https://www.zhihu.com/questio…
  • Sub-pixel Convolution: https://blog.csdn.net/leviopk…
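The sub-pixel convolution linked above upsamples by first producing r*r*C channels with an ordinary convolution and then rearranging channels into space. A numpy sketch of that rearrangement step only (the learned convolution is omitted; function and variable names are mine):

```python
import numpy as np

def pixel_shuffle(x, r):
    """Rearrange a (C*r*r, H, W) tensor into (C, H*r, W*r).

    This channel-to-space step is the core of sub-pixel convolution:
    a convolution first produces r*r times more channels, then this
    rearrangement turns them into an r-times larger spatial grid.
    """
    c_rr, h, w = x.shape
    c = c_rr // (r * r)
    # (C, r, r, H, W) -> (C, H, r, W, r) -> (C, H*r, W*r)
    return x.reshape(c, r, r, h, w).transpose(0, 3, 1, 4, 2).reshape(c, h * r, w * r)

# 4 channels at one spatial location become a single 2x2 patch (r = 2)
x = np.arange(4).reshape(4, 1, 1)
y = pixel_shuffle(x, 2)
```

Unlike transposed convolution, this upsampling introduces no zero-padding artifacts, which is presumably why it suits restoring the feature maps to input resolution here.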
  • Original author: lart
  • Original article: https://segmentfault.com/a/1190000017845603