MobileNet V2 TensorFlow Code Walkthrough

July 14, 2019. Source: lonlon ago

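MobileNet V2 is the architecture friends have recommended to me most often, so I have long wanted to study it properly. This post again uses Google's official MobileNet V2 code as the case study. Before reading the code, it is worth understanding the key structural features of MobileNet V1 and V2, and why they can cut the parameter count and speed up inference while keeping relatively good accuracy. Sharpening the axe does not delay the woodcutting: the code is much easier to follow once the structure is clear.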

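What characterizes MobileNet V1?

Its main feature is replacing the standard convolution with a depthwise + pointwise convolution. The figure below shows a standard convolution: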

[Figure: standard convolution]

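C is the number of input channels and k is the number of output channels (which is also the number of filters in this layer). If the kernel has shape h*w*C*k and the output has shape H*W*k, a standard convolution costs H*W*C*k*h*w multiply-adds. To see why, start with a single 2D kernel sliding over a single 2D input feature map: each position costs h*w multiply-adds, and the output feature map has H*W positions, each produced by one such operation, so one 2D kernel over one 2D feature map costs h*w*H*W (the cost of one X, or one O, in the figure above). With C input feature maps and k filters, the total is H*W*C*k*h*w.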

The depthwise structure is shown on the right of the figure below:

[Figure: depthwise convolution]

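As the figure shows, the kernels of a classic convolution come in groups: each group fuses the features of all input channels, and the number of groups is the number of output channels. A depthwise convolution neither changes the number of channels nor fuses features across channels; it only extracts spatial features within each channel.

Its output is still H*W*C, so the shape is unchanged, and its cost is H*W*C*h*w, a factor of k less than the standard convolution above.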

The pointwise structure is shown on the right of the figure below:

[Figure: pointwise convolution]

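A pointwise convolution is just a standard convolution with a 1*1 kernel. Its kernels come in groups, each group fuses the features of all input channels, and the number of groups is the number of output channels.

Applying the standard convolution formula, its cost is H*W*C*k*1*1.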

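Added together, the two steps cost H*W*C*(h*w + k), which is a fraction 1/k + 1/(h*w) of the standard convolution's cost, while the output has the same shape and plays the same role. This is how the design balances speed against accuracy.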
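As a quick sanity check of the cost formulas above, the following sketch counts multiply-adds for a standard convolution and for the depthwise + pointwise pair. The function names and the example layer size (112*112 output, 32 input channels, 64 output channels, 3*3 kernel) are illustrative assumptions, not values taken from the official code.

```python
def standard_conv_cost(H, W, C, k, h, w):
    """Multiply-adds of a standard convolution: H*W*C*k*h*w."""
    return H * W * C * k * h * w


def depthwise_cost(H, W, C, h, w):
    """Multiply-adds of a depthwise convolution: H*W*C*h*w."""
    return H * W * C * h * w


def pointwise_cost(H, W, C, k):
    """Multiply-adds of a 1*1 (pointwise) convolution: H*W*C*k."""
    return H * W * C * k


# Hypothetical layer size, chosen only for illustration.
H, W, C, k, h, w = 112, 112, 32, 64, 3, 3

standard = standard_conv_cost(H, W, C, k, h, w)
separable = depthwise_cost(H, W, C, h, w) + pointwise_cost(H, W, C, k)
ratio = separable / standard  # algebraically equal to 1/k + 1/(h*w)

print(standard, separable, round(ratio, 4))
```

For a 3*3 kernel the 1/(h*w) term dominates, so the saving is roughly 8 to 9 times regardless of how many output channels there are.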

What characterizes MobileNet V2?

First, understand how ReLU makes low-dimensional data collapse:

[Figure: ReLU collapse of low-dimensional data embedded into m dimensions]

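Suppose n two-dimensional points X (2*n) are mapped to m dimensions by a random matrix T (m*2), passed through ReLU, and then mapped back. When m is small, information is lost.

The reason is that with only 2 channels, all the information is concentrated in those two channels, and any values below 0 are discarded by ReLU. With 30 channels the information is spread out and redundant, so the values that ReLU sets to 0 may not take much information with them.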
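The effect can be illustrated with a small simulation. This is a simplified proxy, not the paper's experiment: it assumes Gaussian random rows for T and Gaussian input points, and it treats a 2D point as recoverable when at least two rows of T give a positive pre-activation (two active rows are almost surely linearly independent, so they pin down the point exactly). The function name and sample sizes are assumptions made for illustration.

```python
import random


def recoverable_fraction(m, n_points=2000, seed=0):
    """Fraction of random 2D points that survive 'T then ReLU' with enough
    information left to be reconstructed (at least two active rows)."""
    rng = random.Random(seed)
    # Random embedding matrix T of shape (m, 2).
    T = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(m)]
    recoverable = 0
    for _ in range(n_points):
        x = (rng.gauss(0, 1), rng.gauss(0, 1))
        # A row is "active" when ReLU lets its value through unchanged.
        active = sum(1 for (a, b) in T if a * x[0] + b * x[1] > 0)
        if active >= 2:
            recoverable += 1
    return recoverable / n_points


frac_2 = recoverable_fraction(2)    # embed into 2 channels
frac_30 = recoverable_fraction(30)  # embed into 30 channels
print(frac_2, frac_30)
```

With m = 2 at most about half of the points keep both channels active, while with m = 30 nearly every point keeps several, which matches the intuition that redundancy protects the information from ReLU.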

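The authors therefore recommend a linear activation for layers with few channels. A bottleneck is a layer that reduces the channel count, and a linear bottleneck applies a linear activation to such a layer. To use ReLU, increase the number of channels first and apply ReLU afterwards.

[Figure: comparison of convolution block structures a to d]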

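The figure above compares several structures. The blocks are tensors and the red parts are convolutions. There are two kinds of convolution: the red convolution in a is a standard convolution, drawn as a red solid, and the first convolution in b is a depthwise convolution, drawn as a flat red patch:

a is a standard convolution;

b is a separable convolution, that is, MobileNet V1's depthwise + pointwise structure. The result of its first convolution has the same shape as the input, and the red part of that first convolution does not fuse all the input channels;

c is a separable convolution with a bottleneck. Compared with b it has an extra step that reduces the dimension. This structure can be ignored for now;

d is a separable convolution applied after expanding the bottleneck. Compared with b it has an extra dimension-expanding step, which is the first step of d. Every tensor drawn with dashed lines is followed by a linear activation, in line with the authors' advice to use linear activations on layers with few channels, since the dashed tensors are the ones with few channels;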

The figure below adds a residual connection to structure d above:

[Figure: residual block (a) vs. inverted residual block (b)]

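a is a standard residual block. It includes a dimension-reducing step, also called a bottleneck structure, whose purpose is to reduce computation;

b is an inverted residual block, which first expands the channels and then reduces them. This is the authors' improvement on the residual block. The only difference between b and structure d in the previous figure is the shortcut connection.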

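Convolution structure of V1 vs. V2:

[Figure: MobileNet V1 block vs. MobileNet V2 block]

Differences:

1. V2 adds a 1*1 "expansion" layer before the depthwise convolution, to increase the number of channels and capture more features;

2. The final layer uses a linear activation instead of ReLU, to keep ReLU from destroying features.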

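MobileNetV2 block vs. ResNet block:

[Figure: ResNet block vs. MobileNetV2 block]

Differences:

ResNet: "compress" → "convolve to extract features" → "expand", i.e. a bottleneck;

MobileNetV2: inverted residuals, i.e. "expand" → "convolve to extract features" → "compress".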

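OK, that covers the theory. On to the code.

The code is split across three files: mobilenet_v2.py, mobilenet.py, and conv_blocks.py

conv_blocks.py implements MobileNet's special convolution structures;

mobilenet.py holds the MobileNet base structure;

mobilenet_v2.py implements the V2 structure;

There is also an example notebook showing how to use the network: mobilenet_example.ipynb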

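What does the call flow look like?

The entry point for the network structure is mobilenet_v2.py:

1. The V2_DEF dictionary configures the network's parameters and structure;
2. The mobilenet function in mobilenet_v2.py then calls the mobilenet function in mobilenet.py (which can build both V1 and V2). That function first calls mobilenet_base to build the base network and then adds the global pooling layer and the softmax classification layer;
3. Finally, mobilenet_base (no pooling and no logits) builds the main body of the network layer by layer:

 for i, opdef in enumerate(conv_defs['spec']):
   params = dict(opdef.params)
   ...
   net = opdef.op(net, **params)

4. opdef.op mostly ends up calling the expanded_conv function, whose main steps are described later.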

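How is the network configuration passed in?

Mainly through V2_DEF.

Definition of V2_DEF:

V2_DEF = dict(
 defaults={...},  # arg_scope defaults, shown later
 spec=[
  op(slim.conv2d, stride=2, num_outputs=32, kernel_size=[3, 3]),
  op(ops.expanded_conv, stride=2, num_outputs=24),
  ...
 ],
)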

Definition of op as used in V2_DEF:

_Op = collections.namedtuple('Op', ['op', 'params', 'multiplier_func'])
 
def op(opfunc, **params):
 multiplier = params.pop('multiplier_transorm', depth_multiplier)
 return _Op(opfunc, params=params, multiplier_func=multiplier)

The network configuration is passed through the V2_DEF dictionary. Each element produced by the op function is an _Op, and each _Op represents one layer. _Op is a collections.namedtuple, that is, a tuple with named fields, so op(slim.conv2d, stride=2, num_outputs=32, kernel_size=[3, 3]) produces the tuple:

Op(
  op=slim.conv2d,
  params={'stride': 2, 'num_outputs': 32, 'kernel_size': [3, 3]},
  multiplier_func=depth_multiplier
)
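The configuration-driven build loop can be mimicked without TensorFlow. The sketch below is a toy under stated assumptions: toy_conv is a hypothetical stand-in that only tracks an (H, W, C) shape, the three-layer SPEC is made up, and the multiplier_func field is left out for brevity. It only shows how a list of named tuples turns into a chain of layer calls.

```python
import collections

# Simplified version of the named tuple: one entry per layer.
_Op = collections.namedtuple('Op', ['op', 'params'])


def op(opfunc, **params):
    """Bundle a layer function with the keyword arguments it should receive."""
    return _Op(opfunc, params=params)


def toy_conv(shape, num_outputs, stride=1, kernel_size=(3, 3)):
    """Hypothetical layer: returns the output shape of a 'SAME' convolution."""
    height, width, _ = shape
    # Ceiling division: 'SAME' padding gives ceil(size / stride).
    return (-(-height // stride), -(-width // stride), num_outputs)


# Made-up spec, in the same style as V2_DEF['spec'].
SPEC = [
    op(toy_conv, stride=2, num_outputs=32, kernel_size=(3, 3)),
    op(toy_conv, num_outputs=16, kernel_size=(1, 1)),
    op(toy_conv, stride=2, num_outputs=24),
]


def build(input_shape, spec):
    """Apply every layer in order, as the loop in mobilenet_base does."""
    net = input_shape
    for opdef in spec:
        params = dict(opdef.params)  # copy so the spec itself is not mutated
        net = opdef.op(net, **params)
    return net


final_shape = build((224, 224, 3), SPEC)
print(final_shape)
```

Keeping the architecture in data rather than in code is what lets the same mobilenet_base loop build different variants from different spec lists.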

What kind of convolution is slim.separable_conv2d?

From the method's docstring:

This op first performs a depthwise convolution that acts separately on
channels, creating a variable called `depthwise_weights`. If `num_outputs`
is not None, it adds a pointwise convolution that mixes channels, creating a
variable called `pointwise_weights`. Then, if `normalizer_fn` is None,
it adds bias to the result, creating a variable called `biases`, otherwise,
the `normalizer_fn` is applied. It finally applies an activation function
to produce the end result.

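It is effectively a depthwise + pointwise convolution. It differs from the V1 structure in that there is no activation after the depthwise step, only a single activation at the very end, whereas the V1 structure has an activation after both steps, as the figure below shows:

[Figure: slim.separable_conv2d vs. the MobileNet V1 separable block]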

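What is split_separable_conv2d in the code, and how does it differ from separable_conv2d above?

It is the MobileNet V1 style depthwise separable convolution, which replaces the classic convolution most networks use with a depthwise + pointwise structure;

Its structure is very similar to slim.separable_conv2d, except that the depthwise step is also followed by BN and a nonlinearity, which is what matches the network in the paper;

Because it splits slim.separable_conv2d into two separate convolutions, it is named split_separable_conv2d.

[Figure: structure of split_separable_conv2d]

The code shows the same structure: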

 # Depthwise part. Calling separable_conv2d with num_outputs=None performs only
 # the depthwise step, as the slim.separable_conv2d docstring above notes.
 dw_scope = scope + 'depthwise'
 net = slim.separable_conv2d(
   input_tensor,
   None,
   kernel_size,
   depth_multiplier=1,
   stride=stride,
   rate=rate,
   normalizer_fn=normalizer_fn,
   padding=padding,
   scope=dw_scope)

 # Pointwise part: an ordinary convolution with a [1, 1] kernel.
 pw_scope = scope + 'pointwise'
 net = slim.conv2d(
   net,
   num_outputs,
   [1, 1],
   stride=1,
   normalizer_fn=normalizer_fn,
   scope=pw_scope)

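This is the simplest way to implement the V1 style convolution: slim.separable_conv2d with no output dimension (num_outputs=None) implements the depthwise part, and a standard convolution with a 1*1 kernel implements the pointwise part.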
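To make the two stages concrete, here is a tiny plain-Python sketch of a V1 style block on nested lists: depthwise, ReLU6, pointwise, ReLU6. It is only an illustration under simplifying assumptions (stride 1, no padding, no batch norm), and all function names are made up for this example rather than taken from conv_blocks.py.

```python
def relu6(value):
    """ReLU6 clamps its input to the range [0, 6]."""
    return min(max(value, 0.0), 6.0)


def map3d(fn, x):
    """Apply fn to every element of a [C][H][W] nested list."""
    return [[[fn(v) for v in row] for row in chan] for chan in x]


def depthwise_conv(x, kernels):
    """x: [C][H][W], kernels: [C][kh][kw]. Each channel is filtered by its own
    kernel, so channels are never mixed. Stride 1, no padding."""
    out = []
    for chan, ker in zip(x, kernels):
        kh, kw = len(ker), len(ker[0])
        height, width = len(chan), len(chan[0])
        out.append([
            [sum(chan[i + a][j + b] * ker[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(width - kw + 1)]
            for i in range(height - kh + 1)
        ])
    return out


def pointwise_conv(x, weights):
    """x: [C][H][W], weights: [K][C]. A 1*1 convolution that mixes channels."""
    height, width = len(x[0]), len(x[0][0])
    return [[[sum(wk[c] * x[c][i][j] for c in range(len(x)))
              for j in range(width)]
             for i in range(height)]
            for wk in weights]


def v1_block(x, dw_kernels, pw_weights):
    """Depthwise then pointwise, with an activation after each stage."""
    net = map3d(relu6, depthwise_conv(x, dw_kernels))
    return map3d(relu6, pointwise_conv(net, pw_weights))


# Two 4*4 input channels, two 3*3 depthwise kernels, three output channels.
x = [[[1.0] * 4 for _ in range(4)], [[2.0] * 4 for _ in range(4)]]
dw = [[[1.0] * 3 for _ in range(3)], [[0.1] * 3 for _ in range(3)]]
pw = [[1.0, 1.0], [1.0, -1.0], [0.0, 0.5]]
y = v1_block(x, dw, pw)
print(len(y), len(y[0]), len(y[0][0]))
```

The depthwise stage keeps the channel count at 2, and only the pointwise stage changes it to 3, which mirrors how the real code uses num_outputs=None for the first call and num_outputs for the second.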

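What kind of convolution is expanded_conv?

It is the MobileNet V2 style convolution: first expand the dimension, then apply the V1 structure with a linear activation at the end,

expansion (1×1) -> depthwise (kernel_size) -> projection (1×1)

Here is the simplified code: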

 # Expansion step
 if inner_size > net.shape[3]:
   net = split_conv(
     net,
     inner_size,
     num_ways=split_expansion,
     scope='expand',
     stride=1,
     normalizer_fn=normalizer_fn)
 net = tf.identity(net, 'expansion_output')

 # Depthwise step
 if depthwise_location == 'expansion':
   if use_explicit_padding:
     net = _fixed_padding(net, kernel_size, rate)
   net = depthwise_func(net)
 net = tf.identity(net, name='depthwise_output')

 # Note in contrast with expansion, we always have projection to produce the desired output size.
 # Pointwise (projection) step. It reduces the dimension, so the activation
 # function is replaced by a linear one.
 net = split_conv(
   net,
   num_outputs,
   num_ways=split_projection,
   stride=1,
   scope='project',
   normalizer_fn=normalizer_fn,
   activation_fn=tf.identity)  # identity, i.e. a linear activation

 # Residual connection: only applied when stride == 1 and the input and
 # output channel counts match.
 if (residual and stride == 1 and
     net.get_shape().as_list()[3] == input_tensor.get_shape().as_list()[3]):
   net += input_tensor

A clear three-step structure.
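The three steps are easier to see as a trace of tensor shapes. The helper below is a hypothetical sketch, not part of conv_blocks.py: it assumes inner_size is simply the input channel count times an expansion factor of 6, a 3*3 depthwise kernel, and 'SAME' style output sizes, and it reports whether the residual condition from the code above would hold.

```python
def expanded_conv_shapes(in_shape, num_outputs, expansion=6, stride=1,
                         residual=True):
    """Trace (H, W, C) through expansion -> depthwise -> projection."""
    height, width, channels = in_shape
    inner_size = channels * expansion  # assumed expansion rule
    stages = [('input', in_shape)]
    if inner_size > channels:
        stages.append(('expand 1x1 + ReLU6', (height, width, inner_size)))
    # The depthwise step is the only one that applies the stride.
    out_h, out_w = -(-height // stride), -(-width // stride)
    stages.append(('depthwise 3x3 + ReLU6', (out_h, out_w, inner_size)))
    stages.append(('project 1x1, linear', (out_h, out_w, num_outputs)))
    # Same condition as in the simplified code: stride 1 and equal channels.
    uses_residual = residual and stride == 1 and num_outputs == channels
    return stages, uses_residual


same_stages, same_residual = expanded_conv_shapes((56, 56, 24), num_outputs=24)
down_stages, down_residual = expanded_conv_shapes((56, 56, 24), num_outputs=32,
                                                  stride=2)
for name, shape in same_stages:
    print(name, shape)
```

The wide tensor (144 channels in this example) only exists inside the block, while the tensors passed between blocks stay narrow, which is the "inverted" part of the inverted residual.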

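What kind of convolution is split_conv?

It is called from the expanded_conv code above, where both the expansion and the pointwise projection are implemented with it, so its main job is expanding and compressing dimensions. Its docstring says: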

Creates a split convolution.
Split convolution splits the input and output into
'num_blocks' blocks of approximately the same size each,
and only connects $i$-th input to $i$ output.

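It only connects the i-th input to the i-th output!

If num_ways is 1, this function is simply a 1*1 pointwise convolution, that is, a standard convolution with a 1*1 kernel;

If num_ways is greater than 1, it splits the input and output channels into num_ways roughly equal parts, runs a separate 1*1 convolution on each part, and concatenates the results in order. This is an odd-looking operation, so why do it?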

 for i, (input_tensor, out_size) in enumerate(zip(inputs, output_splits)):
   scope = base + '_part_%d' % (i,)
   n = slim.conv2d(input_tensor, out_size, [1, 1], scope=scope, **kwargs)
   n = tf.identity(n, scope + '_output')
   outs.append(n)
 return tf.concat(outs, 3, name=scope + '_concat')

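My reading is that it avoids having every new channel, after expanding or compressing the feature map's channels, be computed from all of the original channels. It wants the new channels to come from different subsets of the original channels, so each part of the input is processed by its own kernels, which artificially adds some feature diversity. I am not sure about this interpretation, though, and would welcome comments from anyone who knows better.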
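Whatever the intent, one measurable consequence is that splitting turns the 1*1 convolution into a grouped convolution with fewer weights. The sketch below counts them. The helper names and the even-splitting rule are assumptions made for illustration; the real code may split the channels differently (for example, to keep each part divisible by some number).

```python
def split_evenly(total, num_ways):
    """Split `total` channels into num_ways parts of near-equal size."""
    base, remainder = divmod(total, num_ways)
    return [base + (1 if i < remainder else 0) for i in range(num_ways)]


def split_conv_weights(in_channels, out_channels, num_ways):
    """Weight count of num_ways independent 1*1 convolutions, where the
    i-th input part is connected only to the i-th output part."""
    in_parts = split_evenly(in_channels, num_ways)
    out_parts = split_evenly(out_channels, num_ways)
    return sum(i * o for i, o in zip(in_parts, out_parts))


# Example: expanding 24 channels to 144 channels.
full = split_conv_weights(24, 144, num_ways=1)
two_way = split_conv_weights(24, 144, num_ways=2)
four_way = split_conv_weights(24, 144, num_ways=4)
print(full, two_way, four_way)
```

With evenly sized parts the weight count, and with it the multiply-add count, drops by a factor of num_ways. This is offered only as an observation, not as the authors' stated reason.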

each input will replicated (with different filters) that many times.

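Why use use_explicit_padding and _fixed_padding? Isn't SAME padding enough?

Let's look at the difference.

With SAME padding:

[Figure: SAME padding, case 1 and case 2]

The amount of padding depends heavily on the input size.

With fixed padding:

[Figure: fixed padding, case 1 and case 2]

In case 2 the result differs: the amount of padding no longer depends on the input size, only on kernel_size (and rate). For odd kernel sizes, _fixed_padding also adds the same number of zeros at the beginning and end of each spatial dimension, as the code shows: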

 kernel_size_effective = [kernel_size[0] + (kernel_size[0] - 1) * (rate - 1),
 kernel_size[0] + (kernel_size[0] - 1) * (rate - 1)]
 pad_total = [kernel_size_effective[0] - 1, kernel_size_effective[1] - 1]
 pad_beg = [pad_total[0] // 2, pad_total[1] // 2]
 pad_end = [pad_total[0] - pad_beg[0], pad_total[1] - pad_beg[1]]
 tf.pad(inputs, [[0, 0], [pad_beg[0], pad_end[0]], 
                 [pad_beg[1],pad_end[1]], [0, 0]])

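So the padded tensor is not always covered exactly by the kernel positions. In case 2 of the second figure, for example, one zero is left over and never enters the convolution.

How padding works: the padded size of dimension D is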

paddings[D, 0] + tensor.dim_size(D) + paddings[D, 1]
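The difference between the two schemes can be checked numerically along one spatial dimension. The sketch below uses the usual TensorFlow rule for 'SAME' padding (output size ceil(in / stride)) and the fixed-padding arithmetic from the code above; the function names are made up for this example.

```python
def same_padding(in_size, kernel_size, stride):
    """(pad_begin, pad_end) that 'SAME' padding uses along one dimension."""
    out_size = -(-in_size // stride)  # ceil(in_size / stride)
    pad_total = max((out_size - 1) * stride + kernel_size - in_size, 0)
    pad_beg = pad_total // 2
    return pad_beg, pad_total - pad_beg


def fixed_padding(kernel_size, rate=1):
    """(pad_begin, pad_end) from the _fixed_padding arithmetic: it depends
    only on the kernel size and dilation rate, not on the input size."""
    kernel_size_effective = kernel_size + (kernel_size - 1) * (rate - 1)
    pad_total = kernel_size_effective - 1
    pad_beg = pad_total // 2
    return pad_beg, pad_total - pad_beg


def unused_trailing(in_size, pads, kernel_size, stride):
    """Padded positions at the end that no 'VALID' window ever touches."""
    padded = pads[0] + in_size + pads[1]
    out_size = (padded - kernel_size) // stride + 1
    return padded - ((out_size - 1) * stride + kernel_size)


for size in (224, 225):
    print(size, same_padding(size, 3, 2), fixed_padding(3),
          unused_trailing(size, fixed_padding(3), 3, 2))
```

For a 3*3 kernel with stride 2, 'SAME' pads (0, 1) on an input of 224 but (1, 1) on 225, while fixed padding is always (1, 1); on the 224 input the last padded zero is never read, which is the leftover zero mentioned above.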

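How is the linear activation after the depthwise step implemented, and where is it used?

tf.identity;

It is only used when expanded_conv calls split_conv for the pointwise projection, by passing activation_fn=tf.identity;

Why is n = tf.identity(n, scope + '_output') added in so many places?

It returns an op that yields an identical tensor.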

Return a tensor with the same shape and contents as the input tensor or value

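One use relates to the fact that in TensorFlow a Variable and a tensor are different things: tf.identity adds a new op to the graph that reads the variable's current value, which helps when that value needs to be brought into a particular part of the graph or onto another device;

Another use is simply to give a tensor or op a name, which is what most of the calls in this code are doing;

The last is the one from the previous question: acting as a linear activation function.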

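What is depth_multiplier, and how does it take effect?

It is a factor applied to every layer's channel count, used to widen or thin the network

depth_multiplier: The multiplier applied to scale number of channels in each layer.

The multiplier_func is defined as follows: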

 d = output_params['num_outputs']
 # The key point: num_outputs becomes d * multiplier, rounded by _make_divisible
 output_params['num_outputs'] = _make_divisible(d * multiplier, divisible_by, min_depth)

Then multiplier_func is called before each layer is built:

opdef.multiplier_func(params, multiplier)
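The snippet above calls _make_divisible without showing it. Below is a minimal sketch of such a rounding helper. Its rules are assumptions made here for illustration rather than quotations from the source: round to the nearest multiple of the divisor, never go below a minimum, and bump up by one step if rounding would shrink the value by more than 10%.

```python
def make_divisible(value, divisor, min_value=None):
    """Round `value` to a multiple of `divisor` (assumed rounding rules)."""
    if min_value is None:
        min_value = divisor
    # Nearest multiple of divisor, but never below min_value.
    new_value = max(min_value, int(value + divisor / 2) // divisor * divisor)
    # Assumed safeguard: do not round down by more than 10%.
    if new_value < 0.9 * value:
        new_value += divisor
    return new_value


def scaled_channels(num_outputs, multiplier, divisible_by=8, min_depth=8):
    """Channel count after applying a depth multiplier."""
    return make_divisible(num_outputs * multiplier, divisible_by, min_depth)


wide = scaled_channels(32, 1.4)   # widen the layer
thin = scaled_channels(24, 0.5)   # thin the layer
tiny = scaled_channels(32, 0.35)  # rounding down would lose too much
print(wide, thin, tiny)
```

Rounding to a multiple of 8 keeps channel counts friendly to hardware, which is a common reason for this kind of helper.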

Why are the arg_scope parameters set in two places, with activation_fn and normalizer_fn for slim.conv2d, slim.fully_connected, and slim.separable_conv2d set more than once? And why does (slim.batch_norm,): {'center': True, 'scale': True} enable centering and scaling here?

arg_scope is mainly set in two places: V2_DEF and training_scope;

Settings in V2_DEF:

 defaults={
   # Note: these parameters of batch norm affect the architecture
   # that's why they are here and not in training_scope.
   # (This is the remark I did not understand...)
   (slim.batch_norm,): {'center': True, 'scale': True},

   (slim.conv2d, slim.fully_connected, slim.separable_conv2d): {
     'normalizer_fn': slim.batch_norm,
     'activation_fn': tf.nn.relu6
   },

   (ops.expanded_conv,): {
     'expansion_size': expand_input(6),
     'split_expansion': 1,
     'normalizer_fn': slim.batch_norm,
     'residual': True
   },

   (slim.conv2d, slim.separable_conv2d): {'padding': 'SAME'}
 },

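The comment says these parameters of batch norm affect the architecture, which is why they are placed here rather than in training_scope. I did not fully understand this point. One plausible reading is that center and scale decide whether batch norm creates its beta (offset) and gamma (scale) variables, so they change which variables the model contains rather than only how it is trained.

Settings in training_scope: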

 with slim.arg_scope(
     [slim.conv2d, slim.fully_connected, slim.separable_conv2d],  # 1st: conv2d
     weights_initializer=weight_intitializer,
     normalizer_fn=slim.batch_norm), \
     slim.arg_scope([mobilenet_base, mobilenet], is_training=is_training), \
     safe_arg_scope([slim.batch_norm], **batch_norm_params), \
     safe_arg_scope([slim.dropout], is_training=is_training,
                    keep_prob=dropout_keep_prob), \
     slim.arg_scope([slim.conv2d],  # 2nd: conv2d
                    weights_regularizer=slim.l2_regularizer(weight_decay)), \
     slim.arg_scope([slim.separable_conv2d],  # 3rd: separable_conv2d
                    weights_regularizer=None) as s:
   return s

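The convolution ops are configured three times here so that weights_regularizer can be set separately for conv2d and separable_conv2d. Note that the scopes are written as a single with statement listing several slim.arg_scope calls rather than as visibly nested blocks. They are entered in order, so the effect is the same as nesting, with later scopes adding to the earlier ones.

What does @slim.add_arg_scope do?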

slim.arg_scope([mobilenet_base, mobilenet], is_training=is_training)

A function with this decorator can be listed in slim.arg_scope, so its default arguments can be set through the context manager.

How is the network actually called?

Training:

with tf.contrib.slim.arg_scope(mobilenet_v2.training_scope()):
   logits, endpoints = mobilenet_v2.mobilenet(input_tensor)

Inference or testing:

with tf.contrib.slim.arg_scope(mobilenet_v2.training_scope(is_training=False)):
   logits, endpoints = mobilenet_v2.mobilenet(images)

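Other notes

Apart from the final avgpool, the network does not use pooling for downsampling; it downsamples with stride=2 convolutions instead

Restoring the model with exponential moving averages reportedly improves accuracy by 1.5-2%, which is a surprisingly large gain.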

# Restore using exponential moving average since it produces (1.5-2%) higher accuracy
ema = tf.train.ExponentialMovingAverage(0.999)
vars = ema.variables_to_restore()
saver = tf.train.Saver(vars)
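The restore snippet loads the averaged ("shadow") value of each weight instead of its last raw value. The update rule behind an exponential moving average is shadow = decay * shadow + (1 - decay) * value. The sketch below applies it to a made-up sequence of noisy weight values; the function names and the numbers are illustrative assumptions.

```python
def ema_update(shadow, value, decay):
    """One exponential-moving-average step."""
    return decay * shadow + (1.0 - decay) * value


def ema_track(values, decay):
    """Shadow value after each step, starting from the first value."""
    shadow = values[0]
    history = [shadow]
    for value in values[1:]:
        shadow = ema_update(shadow, value, decay)
        history.append(shadow)
    return history


# A weight that oscillates between 1 and 3 from step to step.
noisy_weights = [1.0, 3.0, 1.0, 3.0, 1.0, 3.0]
smoothed = ema_track(noisy_weights, decay=0.9)
simple = ema_track([0.0, 2.0, 4.0], decay=0.5)
print(smoothed)
print(simple)
```

The averaged sequence moves much less than the raw one, which is the usual argument for evaluating with the shadow weights: they smooth out the step-to-step noise of training.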

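Open questions:

Why does expanded_conv need depthwise_location to choose where depthwise_func runs, as one of None / input / output / expansion?
The two arg_scope settings and the claim that BN parameters affect the architecture, discussed above
Why split_conv splits the channels before convolving

Google code link: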

https://github.com/tensorflow/models/tree/1af55e018eebce03fb61bba9959a04672536107d/research/slim/nets/mobilenet


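    Original author: lonlon ago
    Original URL: https://zhuanlan.zhihu.com/p/51608073
    This article is reposted from the web for knowledge sharing only. If it infringes your rights, please contact the blogger to have it removed.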
