A Multigrid Method for Efficiently Training Video Models
The Problem
3D convolutional neural networks (CNNs) are state-of-the-art amongst deep learning models for videos. However, these models are extremely slow to train.
The mini-batch shape for training a deep learning model for videos is defined by:
- the number of clips
- the number of frames per clip
- the spatial size per frame.
State-of-the-art video models generally use a large mini-batch shape for the sake of accuracy, but this is precisely what makes them so slow.
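To make that concrete, here is a tiny PyTorch sketch (the specific numbers are illustrative, not from the paper) of a video mini-batch tensor whose shape is determined by exactly those three factors:

```python
import torch

clips_per_batch = 8        # number of clips
frames_per_clip = 16       # number of frames per clip
height, width = 224, 224   # spatial size per frame

# A video mini-batch: (clips, RGB channels, frames, height, width).
mini_batch = torch.randn(clips_per_batch, 3, frames_per_clip, height, width)
print(mini_batch.shape)    # torch.Size([8, 3, 16, 224, 224])
```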
Drawing on research from a fancy-ass field called numerical optimization, Wu, Girshick, He, Feichtenhofer & Krahenbuhl (2020) introduced a “multigrid method for efficiently training video models.” In essence, this method is a way to retain the accuracy of these video models, whilst saving time.
The Solution
The multigrid method exploits the trade-off between large spatial and temporal dimensions (i.e., the number of frames and the spatial size per frame) and the number of clips per mini-batch.
Before we get into the details, it intuitively makes sense that we can start with large mini-batches of coarse, low-resolution time and space data (coarse learning), and then switch to smaller mini-batches of more granular time and space data (fine learning).
Drawing on this intuition, Wu et al (2020) tried to answer two questions:
(i) is there a set of grids [spatial and temporal] with a grid schedule that can lead to faster training without a loss in accuracy?
(ii) if so, does it robustly generalize to new models and datasets without modification?
Formalizing the Intended Solution
First, let's state the constraint we are trying to satisfy as an equation:
b × t × h × w = B × T × H × W
So on the left, we are taking our newly scaled batch size, b, time, t, height, h, and width, w, and on the right, we have the original batch size, B, time, T, height, H, and width, W. Note that the scaled values on the left will be determined according to a schedule as training progresses, rather than being fixed throughout the training process like the values on the right would be — that is what makes this approach special.
You'll realize that if we are trying to speed up the training process, then our scaled b should, on average, be larger than B. This means that each iteration covers more clips under the scaled/scheduled approach, but we also want to make sure we don't diminish accuracy (which benefits from a larger t, h, and w).
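As a rough illustration (a minimal sketch with made-up base values, not the paper's code), here is how a scaled batch size b could be chosen so that b × t × h × w stays roughly equal to B × T × H × W:

```python
# Base (unscaled) mini-batch shape: B clips of T frames at H x W pixels.
B, T, H, W = 8, 16, 224, 224

def scaled_batch_size(t, h, w):
    """Pick b so that b*t*h*w is (approximately) B*T*H*W."""
    return max(1, round(B * T * H * W / (t * h * w)))

# A coarser grid: half the frames and half the resolution along each axis.
t, h, w = T // 2, H // 2, W // 2
b = scaled_batch_size(t, h, w)
print(b, t, h, w)  # 64 8 112 112 -> 8x more clips for roughly the same cost
```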
Although controlling b (the number of clips per mini-batch) might seem fairly easy, you might be wondering how we can control the size of t, h, and w.
We control t, h, and w by using the concept of a sampling grid. A sampling grid consists of two values: a span and a stride. According to Wu et al (2020),
“[t]he span is the support size of the grid and defines the duration or area that the grid covers. The stride is the spacing between sampling points.”
Both space (defined by h and w) and time (t) dimensions can be resampled to smaller sizes using the sampling grid.
The resampling process requires an operator. Wu et al (2020) describe an example operator: “a reconstruction filter applied to the source discrete signal followed by computing the values at the points specified by the grid (e.g., bilinear interpolation).”
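As a sketch of such a resampling operator (this uses trilinear interpolation as the reconstruction-and-sampling step; it is an illustration, not the authors' implementation):

```python
import torch
import torch.nn.functional as F

# A mini-batch of clips: (clips, channels, frames, height, width).
clips = torch.randn(8, 3, 16, 224, 224)

# Resample onto a coarser grid: half the frames, half the spatial size.
# Trilinear interpolation plays the role of the reconstruction filter plus
# evaluation at the points specified by the new grid.
coarse = F.interpolate(clips, size=(8, 112, 112),
                       mode="trilinear", align_corners=False)
print(coarse.shape)  # torch.Size([8, 3, 8, 112, 112])
```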
In addition, for a multigrid method to work, the baseline model that we use must be “compatible with inputs that are resampled on different grids, and therefore might have different shapes during training.” In other words, we can’t use the multigrid method with models that require a fixed input shape during training. As noted by Wu et al (2020), models composed of convolutions, recurrence, and self-attention work with the multigrid method, but fully-connected layers do not (unless their inputs are pooled to a fixed size). This is not a very restrictive rule, which is good. :)
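For example, one common way to make a convolutional video model tolerate variable input shapes is to put a global adaptive pooling layer in front of the fully-connected head; this is an illustrative sketch, not the architecture used in the paper:

```python
import torch
import torch.nn as nn

class ShapeAgnosticHead(nn.Module):
    """Toy classifier head: features of any (T, H, W) are pooled to a
    fixed size before the fully-connected layer."""
    def __init__(self, in_channels=64, num_classes=400):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool3d(1)       # -> (N, C, 1, 1, 1)
        self.fc = nn.Linear(in_channels, num_classes)

    def forward(self, x):                         # x: (N, C, T, H, W)
        return self.fc(self.pool(x).flatten(1))   # -> (N, num_classes)

head = ShapeAgnosticHead()
print(head(torch.randn(4, 64, 8, 7, 7)).shape)     # torch.Size([4, 400])
print(head(torch.randn(4, 64, 16, 14, 14)).shape)  # same head, larger grid
```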
The Grid Schedule
We still haven’t defined the schedule for adjusting the sampling grid (for both time and space)! This is super important because it is what empowers us to save time, without diminishing accuracy.
Wu et al (2020) use a
hierarchical schedule that involves alternating between mini-batch shapes at two different frequencies: a long cycle that moves through a set of base shapes, generated by a variety of grids, staying on each shape for several epochs, and a short cycle that moves through a set of shapes that are ‘nearby’ the current base shape, staying on each one for a single iteration.
Multigrid long cycles progressively decrease the mini-batch size, increasing the spatial grid size as training progresses. With short cycles, we cycle through smaller and larger spatial grids, but all within a small range. Wu et al (2020) used a combination of these two schedules, performing short cycle updates in between long cycle updates, which works well in practice to maintain accuracy whilst saving time!
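To make the hierarchical schedule a bit more tangible, here is a rough sketch (the scaling factors and stage layout are assumptions for illustration, not the exact schedule from the paper): the long cycle steps through base shapes from coarse to fine, and the short cycle hops between shapes near the current base shape on every iteration.

```python
BASE = (8, 16, 224, 224)  # base mini-batch shape: (B, T, H, W)

def scale_shape(base, t_scale, s_scale):
    """Scale T and H/W, then grow b so that b*t*h*w stays roughly constant."""
    B, T, H, W = base
    t, h, w = max(1, int(T * t_scale)), int(H * s_scale), int(W * s_scale)
    b = max(1, int(B * T * H * W / (t * h * w)))
    return (b, t, h, w)

# Long cycle: a few base shapes, each used for several epochs (assumed factors).
long_cycle = [scale_shape(BASE, ts, ss)
              for ts, ss in [(0.25, 0.5), (0.5, 0.5), (0.5, 0.7), (1.0, 1.0)]]

# Short cycle: every iteration, pick a shape "nearby" the current base shape.
def short_cycle_shape(base_shape, iteration):
    spatial = [0.5, 0.7, 1.0][iteration % 3]   # assumed nearby spatial factors
    return scale_shape(base_shape, 1.0, spatial)

for it in range(4):
    print(short_cycle_shape(long_cycle[-1], it))
```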
It’s almost time to look at our results with the multigrid method, but first, let’s talk about three more things: learning rate scaling, the fine-tuning phase, and batch normalization.
Learning Rate Scaling
Whenever the long cycle changes the mini-batch size, the learning rate is scaled along with it (so it shrinks as the mini-batch shrinks later in training). The authors found that also adjusting the learning rate for the short cycle hurt results in practice, so they tie it to the long cycle only.
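A minimal sketch of that adjustment, assuming the common linear scaling rule (learning rate proportional to the mini-batch size); the base values below are made up:

```python
BASE_LR = 0.1    # learning rate at the base batch size (illustrative)
BASE_BATCH = 8   # base number of clips per mini-batch (illustrative)

def lr_for_long_cycle(batch_size):
    """Scale the learning rate linearly with the long-cycle batch size."""
    return BASE_LR * batch_size / BASE_BATCH

# The long cycle moves from large coarse batches to small fine batches,
# so the learning rate shrinks along with the batch size.
for b in [64, 32, 16, 8]:
    print(b, lr_for_long_cycle(b))
```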
Fine-Tuning
Okay, you're probably wondering why I'm suddenly bringing up fine-tuning. Because this paper uses the multigrid method during the training phase and not the testing phase, the training and test data end up having different shapes. So, during the last part of training, we fine-tune the model on the shape of the test data.
Batch Normalization
Normally, when we are not using the multigrid method, the mini-batch size is a hyperparameter that impacts batch normalization behavior and the resulting BN statistics. When we use the multigrid method, we want to “decouple its [mini-batch size's] impact on batch normalization from its impact on training speedup.” So, to compute batch normalization statistics, Wu et al (2020) use standard sub-mini-batch sizes, increasing the sub-mini-batch size whenever the short cycle increases the mini-batch size. This seems to work well in practice.
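One way to compute BN statistics over fixed-size sub-mini-batches is to split the mini-batch into groups of clips and normalize each group on its own; this is an illustrative sketch, not the authors' implementation:

```python
import torch
import torch.nn as nn

class SubBatchNorm3d(nn.Module):
    """Compute BN statistics over sub-mini-batches of `sub_size` clips by
    splitting the batch dimension and normalizing each group separately."""
    def __init__(self, num_features, sub_size=8):
        super().__init__()
        self.sub_size = sub_size
        self.bn = nn.BatchNorm3d(num_features)   # applied to each sub-batch in turn

    def forward(self, x):                         # x: (N, C, T, H, W)
        n = x.shape[0]
        if n > self.sub_size and n % self.sub_size == 0:
            chunks = x.split(self.sub_size, dim=0)
            return torch.cat([self.bn(c) for c in chunks], dim=0)
        return self.bn(x)

bn = SubBatchNorm3d(64, sub_size=8)
out = bn(torch.randn(32, 64, 4, 14, 14))  # stats are computed per 8-clip group
print(out.shape)                          # torch.Size([32, 64, 4, 14, 14])
```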
Main Experiment Results
The authors compared the multigrid method to a baseline ResNet-50 (R50) SlowFast network on the Kinetics-400 dataset video classification task.
Overall, the multigrid method achieved a better tradeoff between batch size and spatial grid size, “iterat[ing] through 1.5× more epochs than [the] baseline method, while only requiring 1/3.4× the number of iterations, 1/4.5× training time, and achieving higher accuracy (75.6% → 76.4%).”
Next Steps
I've left out some of the other experiments from the paper, such as generalizations to different training settings (pre-training, different temporal shapes, different spatial shapes) and generalizations to different models: “a standard R50-I3D model and its extension with non-local blocks (I3D-NL).”
If this blog post has tickled your interest in the multigrid method, I'd encourage you to check out these other experiments in the main paper, and also to try out the code for this paper, which has been added to this Github repo.
Wu, C. Y., Girshick, R., He, K., Feichtenhofer, C., & Krahenbuhl, P. (2020). A Multigrid Method for Efficiently Training Video Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 153–162).
Translated from: https://towardsdatascience.com/a-multigrid-method-for-efficiently-training-video-models-bd7aab020411