TensorFlow Source Code Walkthrough: ctc loss

July 14, 2019. Source: Michael

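This week I finished walking through ctc decode and learned a lot from it. I had started because I long suspected that tf.nn.ctc_beam_search_decoder used the plain beam search described in "CTC实现代码中的一些图形化解释" (Some graphical explanations of a CTC implementation) rather than prefix beam search. Reading the code showed that it is prefix beam search after all; with Google's engineering capability it would indeed be odd to ship a second-best decoder. Once I had finished, though, I suddenly lost motivation. Google seemed to have sidestepped every question I had run into, my plan to improve my own code by reading the TensorFlow source came to nothing, and I lost much of the drive to study ctc loss.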

<1> Documentation

Let's start with the documentation, which lives under ctc_loss. It describes the function as:

Computes the CTC (Connectionist Temporal Classification) Loss.

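The paper the code draws on is also cited there. It is worth studying and is exceptionally well written.

I again recommend the blog post "CTC 原理及实现" (CTC: Principles and Implementation), which resolved most of my questions, apart from the one above.

The function signature is: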

tf.nn.ctc_loss(
    labels,
    inputs,
    sequence_length,
    preprocess_collapse_repeated=False,
    ctc_merge_repeated=True,
    ignore_longer_outputs_than_inputs=False,
    time_major=True
)

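Setting aside the four parameters that already have defaults, only three need to be set by the caller. One by one:

labels: an int32 SparseTensor.
inputs: a 3-D float tensor. With the default time_major its shape is [max_time, batch_size, num_classes], so it is enough to swap dimensions 0 and 1 of the LSTM output. Also, as discussed in "TensorFlow源码之greedy search" (TensorFlow source: greedy search), the input values are logits.
sequence_length: an int32 list of size batch_size, where each value is the length of the corresponding sequence.

The return value is a 1-D float tensor of size batch whose values are the negative log probabilities.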

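The probability here is $p(\mathbf{l} \mid \mathbf{x})$, the probability of the label sequence $\mathbf{l}$ given the input $\mathbf{x}$. It lies between 0 and 1, and the closer it is to 1, the more accurate the model.

Call the value TensorFlow returns loss. The relationship is $\text{loss} = -\ln p(\mathbf{l} \mid \mathbf{x})$, which ranges from 0 to inf; accordingly, the closer this value is to 0, the more accurate the model. Conversely, $p(\mathbf{l} \mid \mathbf{x}) = e^{-\text{loss}}$.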

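Here is a real example:

iter = 0, loss = 54.8419, so $p = e^{-54.8419} \approx 1.5 \times 10^{-24}$, which is essentially 0. This is the state before training has started.
As training proceeds the loss fluctuates downward and settles at loss = 0.0938, so $p = e^{-0.0938} \approx 0.91$, close to 1, which is a fairly well-trained state.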
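The conversion between the reported loss and the underlying probability can be checked with a few lines of Python. This is only a sketch of the relation loss = -ln(p); the helper name loss_to_probability is made up for illustration and is not part of TensorFlow.

```python
import math

def loss_to_probability(loss):
    """Invert loss = -ln(p) to recover p = exp(-loss)."""
    return math.exp(-loss)

# The two loss values quoted in the example above.
for loss in (54.8419, 0.0938):
    print(f"loss = {loss:<8} p = {loss_to_probability(loss):.4g}")
```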

The documentation also lists some requirements on the inputs, for example:

sequence_length(b) <= time for all b
max(labels.indices(labels.indices[:, 1] == b, 2))
<= sequence_length(b) for all b.

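The first condition says that sequence_length is at most time. This time should be the max_time defined in ctc_loss_op.cc, i.e. the time dimension of inputs, which matches the number of time steps the LSTM outputs. As for why it is "less than or equal": at first glance equality looks sufficient, but the inequality is presumably what lets one batch hold sequences of different true lengths padded to a common max_time.

The second condition is easy to understand. Since labels.indices[i, :] == [b, t], labels.indices[:, 1] refers to the batch dimension and labels.indices[:, 2] to the label position (the documentation appears to count columns from 1 here), so the condition says the output cannot be longer than the input.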
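The two conditions can be sketched as a small validation routine. This is an illustrative check, not TensorFlow code: the function name check_ctc_inputs is hypothetical, the sparse indices are taken as plain 0-based (b, t) pairs, and the second condition is read as "the label length of item b must not exceed sequence_length(b)", which is the interpretation given above.

```python
def check_ctc_inputs(indices, sequence_length, max_time):
    """Return True if both input conditions hold.

    indices: (b, t) pairs of a sparse labels tensor, 0-based.
    sequence_length: one input length per batch item.
    max_time: size of the time dimension of inputs.
    """
    for b, seq_len in enumerate(sequence_length):
        # Condition 1: sequence_length(b) <= time.
        if seq_len > max_time:
            return False
        # Condition 2: the labels of item b must fit into seq_len steps.
        positions = [t for (item, t) in indices if item == b]
        label_len = max(positions) + 1 if positions else 0
        if label_len > seq_len:
            return False
    return True

print(check_ctc_inputs([(0, 0), (0, 1), (1, 0)], [5, 3], max_time=5))
```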

After that come some notes. These two are the ones I find useful at this stage.

The first is:

This class performs the softmax operation for you, so inputs should be e.g. linear projections of outputs by an LSTM.

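This sentence is easy to misread. The first point of confusion is the softmax: the documentation is telling you not to apply softmax yourself, because one is performed when the ctc loss is computed. The second is that inputs should be a linear projection of the LSTM outputs, while inputs is also described as logits, which sounds like a nonlinear transform. In TensorFlow's usage, though, "logits" simply means the unnormalized scores fed to softmax, so the two descriptions agree. Let's confirm this when we reach the source code.

The second is: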

The inputs Tensor's innermost dimension size, num_classes, represents num_labels + 1 classes, where num_labels is the number of true labels, and the largest value (num_classes - 1) is reserved for the blank label.

In other words the last label is blank. The documentation also gives an example, which is easy to follow.
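The indexing convention can be illustrated with a hypothetical three-letter vocabulary. The variable names below are made up for the sketch; only the rule "blank takes the largest index, num_classes - 1" comes from the note quoted above.

```python
# Hypothetical vocabulary with num_labels = 3 true labels.
vocabulary = ["a", "b", "c"]

# One extra class is reserved for blank.
num_classes = len(vocabulary) + 1

# True labels take indices 0 .. num_labels - 1.
label_to_index = {ch: i for i, ch in enumerate(vocabulary)}

# The largest index is the blank label.
blank_index = num_classes - 1

print(label_to_index, "blank:", blank_index)
```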

<2>ctc_loss_op.cc

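As described in "TensorFlow源码解读之ctc_beam_search_decoder" (TensorFlow source walkthrough: ctc_beam_search_decoder), following the link in the documentation only takes you to ctc_ops.py, which, unsurprisingly, provides nothing but an interface. The real code is in ctc_loss_op.cc, which mainly relies on ctc_loss_calculator.h; that is analysed later.

ctc_loss_op.cc contains a single class, CTCLossOp. Looking at it, you can see that TensorFlow's matrices are backed by Eigen: the class first declares two Map type aliases, InputMap and OutputMap, followed by a public part and a private part. As usual, let's start with the private part.

1.private

It declares three bool members, which are exactly the default-valued parameters described in the ctc_loss documentation; all of them are here except time_major.

2.public

The main function is Compute(). Reading it, we first learn what the dimensions of inputs mean: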

const TensorShape& inputs_shape = inputs->shape();
const int64 max_time = inputs_shape.dim_size(0);
const int64 batch_size = inputs_shape.dim_size(1);
const int64 num_classes_raw = inputs_shape.dim_size(2);

So by this point the data is already in the time_major == True layout.

Next, the sparse tensor is converted into the LabelSequences array from ctc_loss_calculator.h, which is defined as:

typedef std::vector<std::vector<int>> LabelSequences;

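This is a dense (nested) vector, so this whole stretch of code converts the sparse tensor into a dense one for the computation that follows.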
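The idea of the sparse-to-dense conversion can be sketched in Python. This is not a translation of the C++ in ctc_loss_op.cc; the function name and the contiguity check are assumptions made for illustration, with the sparse tensor represented as 0-based (b, t) index pairs plus a parallel list of values.

```python
def sparse_to_label_sequences(indices, values, batch_size):
    """Turn sparse (b, t) -> value entries into one label list per batch item."""
    sequences = [[] for _ in range(batch_size)]
    # Visit entries in (b, t) order so each list is filled left to right.
    for (b, t), value in sorted(zip(indices, values)):
        if t != len(sequences[b]):
            raise ValueError(f"label positions for batch item {b} are not contiguous")
        sequences[b].append(value)
    return sequences

print(sparse_to_label_sequences([(0, 0), (0, 1), (1, 0)], [3, 1, 2], batch_size=2))
```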

Next, input_list_t and gradient_list_t are initialised. They are defined as:

std::vector<OutputMap> gradient_list_t;
std::vector<InputMap> input_list_t;

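They are vectors of OutputMap and InputMap respectively. input_list_t is initialised from inputs_t, so it holds real values, whereas gradient_t has no values yet, so gradient_list_t is initialised to 0.

After that, the class CTCLossCalculator defined in ctc_loss_calculator.h does the computation, which is covered in the sections below.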

<3>ctc_loss_calculator.cc

1.Function CalculateForwardVariables()

The comment on this function reads:

Calculates the alpha(t, u) as described in (GravesTh) Section 7.3. Starting with t = 0 instead of t = 1 used in the text. Based on Kanishka’s CTC.

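The $\alpha(t, u)$ defined in Section 7.3 is:

$$\alpha(t, u) = \sum_{\pi \in V(t, u)} \prod_{i=1}^{t} y^{i}_{\pi_i}$$

where

$$V(t, u) = \{\pi \in A'^{t} : \mathcal{F}(\pi) = \mathbf{l}_{1:u/2},\ \pi_t = l'_u\}$$

Here $A'$ is the alphabet extended with blank, $\mathcal{F}$ is the map that removes repeats and blanks, and $u/2$ is rounded down.

Happily, the code does not use the $\alpha_t(s)$ notation defined in the paper A. Graves, S. Fernandez, F. Gomez, J. Schmidhuber. Connectionist Temporal Classification: Labeling Unsegmented Sequence Data with Recurrent Neural Networks. ICML 2006, Pittsburgh, USA, pp. 369-376., but the simpler $\alpha(t, u)$. To be honest, I understand it, but not completely.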

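Also, this function is "Based on Kanishka's CTC". A quick search shows that Kanishka Rao is an engineer in Google's AI division; see his profile page KanishkaRao, which lists some of his papers. Bookmarking it for later.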

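First, the parameters passed in. l_prime is the label sequence with blanks inserted, of length $|\mathbf{l}'| = 2|\mathbf{l}| + 1$. y is the 2-D matrix of output probabilities from the network; judging by how the code indexes it, y(label, t), its first dimension is the class index and its second is time. log_alpha is what we need to compute. For numerical stability everything is kept in log form, so what is stored should be $\ln \alpha(t, u)$; the point of this is discussed later. Its first dimension is the position in the sequence, of length $2|\mathbf{l}| + 1$, and its second is time.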

Now the code. The first line is:

log_alpha->setConstant(kLogZero);

This is because alpha has to be initialised to 0, and the logarithm of 0 is negative infinity.

The comment sitting above that line:

Number of cols is the number of time steps = number of cols in target after the output delay.

should really belong to the line below it:

int T = log_alpha->cols();

This confirms that the second dimension of log_alpha is time.

Then there is an equality check:

CHECK_EQ(U, log_alpha->rows());

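This confirms that the first dimension of log_alpha is the sequence length, $U = 2|\mathbf{l}| + 1$.

Initialisation follows. In the paper, three cases are initialised (written here in the $\alpha(t, u)$ notation):

$$\alpha(1, 1) = y^{1}_{b}, \qquad \alpha(1, 2) = y^{1}_{l_1}, \qquad \alpha(1, u) = 0 \quad \forall u > 2$$

The third formula was already taken care of by the setConstant call, so what gets initialised here is the first and the second.

The first formula: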

log_alpha->coeffRef(0, 0) = log(y(blank_index_, output_delay_));

Here output_delay_ is defined in ctc_loss_calculator.h, where its comment reads:

Delay for target labels in time steps. The delay in time steps before the output sequence.

I could not find where it actually has an effect, so let's assume it is 0 for now.

The second formula:

auto label_0 = (l_prime.size() > 1) ? l_prime[1] : blank_index_;
log_alpha->coeffRef(1, 0) = log(y(label_0, output_delay_));

The accompanying comment is:

Below, l_prime[1] == labels[0]

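Because a blank is inserted at the head of the original sequence, l_prime[0] = blank, and the second element is the one we need.

label_0 exists for the case where the label sequence is empty: then $|\mathbf{l}'| = 1$ and l_prime[1] does not exist, so the label is set to blank directly. The check is there to avoid an error.

Then the loop begins. The values for t = 0 are already initialised, so it starts directly from t = 1 and, for each t, computes $\alpha(t, u)$ from top to bottom using this inner loop: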

for (int u = std::max(0, U - (2 * (T - t))); u < std::min(U, 2 * (t + 1)); ++u)

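The figure in the paper helps here:

[Figure: the CTC forward-backward lattice from the paper]

The bottom-left and top-right corners have no connections and do not need to be computed. std::max(0, U - (2 * (T - t))) expresses that the top-right corner is skipped, and std::min(U, 2 * (t + 1)) that the bottom-left corner is skipped.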
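To see which cells those two bounds leave in play, the loop range can be evaluated for a small lattice. The helper below is an illustrative sketch; it only reuses the two bound expressions from the C++ loop, with T time steps and U = 2|l| + 1 positions.

```python
def visited_positions(T, U):
    """For each time step t = 1 .. T-1, list the u values the inner loop visits."""
    return [
        list(range(max(0, U - 2 * (T - t)), min(U, 2 * (t + 1))))
        for t in range(1, T)
    ]

# Two labels (U = 5) over four time steps.
for t, positions in enumerate(visited_positions(4, 5), start=1):
    print(f"t = {t}: u in {positions}")
```

With T = 4 and U = 5 the last time step only visits u = 3 and u = 4, the two positions from which a valid path can still end.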

Now for $\alpha(t, u)$ itself. It is built from three terms: $\alpha(t-1, u)$, $\alpha(t-1, u-1)$ and $\alpha(t-1, u-2)$.

The first one handled is $\alpha(t-1, u)$:

if (ctc_merge_repeated || l_prime[u] == blank_index_) {
  sum_log_alpha = log_alpha->coeff(u, t - 1);
}

The ctc_loss documentation explains ctc_merge_repeated as follows:

If ctc_merge_repeated is set False, then deep within the CTC calculation, repeated non-blank labels will not be merged and are interpreted as individual labels. This is a simplified (non-standard) version of CTC. Default: True.

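You will rarely run into this case, so I won't analyse it further.

Next comes $\alpha(t-1, u-1)$. Note the condition u > 0: on the figure, this means the first row is excluded, since there is no earlier position in the sequence.

Then $\alpha(t-1, u-2)$: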

if (u > 1) {
  const bool matching_labels_merge =
      ctc_merge_repeated && (l_prime[u] == l_prime[u - 2]);
  if (l_prime[u] != blank_index_ && !matching_labels_merge) {
    sum_log_alpha =
        LogSumExp(sum_log_alpha, log_alpha->coeff(u - 2, t - 1));
  }
}

The conditions (with ctc_merge_repeated enabled) are $l'_u \neq \text{blank}$ and $l'_u \neq l'_{u-2}$.

The comment on the last line reads:

Multiply the summed alphas with the activation log probability.

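All that remains is to multiply by the current output probability (an addition in log space). The underlying formula is:

$$\alpha(t, u) = y^{t}_{l'_u} \sum_{i=f(u)}^{u} \alpha(t-1, i), \qquad f(u) = \begin{cases} u - 1 & \text{if } l'_u = \text{blank or } l'_{u-2} = l'_u \\ u - 2 & \text{otherwise} \end{cases}$$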
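The whole forward pass can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the TensorFlow implementation: it covers only the standard case (ctc_merge_repeated=True), takes output_delay_ as 0, assumes all probabilities are strictly positive, and guards the empty-label case instead of writing row 1. The names forward_log_alpha and log_sum_exp are chosen for illustration.

```python
import math

LOG_ZERO = float("-inf")

def log_sum_exp(a, b):
    """Return log(exp(a) + exp(b)) without leaving log space."""
    if a == LOG_ZERO:
        return b
    if b == LOG_ZERO:
        return a
    return max(a, b) + math.log1p(math.exp(-abs(a - b)))

def forward_log_alpha(l_prime, y, blank):
    """y[k][t] is the probability of class k at time t; returns log_alpha[u][t]."""
    U, T = len(l_prime), len(y[0])
    log_alpha = [[LOG_ZERO] * T for _ in range(U)]
    # Initialisation at t = 0: start on the leading blank or the first label.
    log_alpha[0][0] = math.log(y[blank][0])
    if U > 1:
        log_alpha[1][0] = math.log(y[l_prime[1]][0])
    for t in range(1, T):
        for u in range(max(0, U - 2 * (T - t)), min(U, 2 * (t + 1))):
            total = log_alpha[u][t - 1]                       # stay on u
            if u > 0:                                         # step from u - 1
                total = log_sum_exp(total, log_alpha[u - 1][t - 1])
            if u > 1 and l_prime[u] != blank and l_prime[u] != l_prime[u - 2]:
                total = log_sum_exp(total, log_alpha[u - 2][t - 1])  # skip a blank
            # Multiplying by y becomes an addition in log space.
            log_alpha[u][t] = math.log(y[l_prime[u]][t]) + total
    return log_alpha

# One label "a" (index 0), blank = 1, two time steps, uniform outputs.
l_prime = [1, 0, 1]
y = [[0.5, 0.5], [0.5, 0.5]]
log_alpha = forward_log_alpha(l_prime, y, blank=1)
total_probability = math.exp(log_sum_exp(log_alpha[-1][-1], log_alpha[-2][-1]))
print(total_probability)
```

In this toy case the three valid paths (aa, a-blank, blank-a) each have probability 0.25, so the total should come out at about 0.75.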

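Finally, why take logarithms?

The paper uses a trick of its own. The quantities it actually computes are:

$$C_t = \sum_{s} \alpha_t(s), \qquad \hat{\alpha}_t(s) = \frac{\alpha_t(s)}{C_t}$$

The authors explain the purpose like this: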

In practice, the above recursions will soon lead to underflows on any digital computer. One way of avoiding this is to rescale the forward and backward variables (Rabiner, 1989).

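Honestly, I get the idea, which is to prevent underflow, but I have not fully worked through this formula, so here I will only give my understanding of $\ln \alpha(t, u)$.

The final value is a product of a sequence of probabilities, each in the range [0, 1]. In practice the probability of each label at initialisation is typically about $1/N$ for $N$ classes. If $N = 10$, each probability is only 0.1 (it may rise to 0.9 as training proceeds, but early on it is certainly very small). With 10 time steps the final probability is then of order $10^{-10}$, a very small number, and with longer sequences a plain floating-point computation will eventually underflow. For instance, computing it in Python: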

import math
math.pow(0.1, 10)

The result is:

1.0000000000000006e-10

But after switching to logarithms:

import math
math.log(0.1) * 10

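The result is:

-23.025850929940454

which is still a number in a perfectly normal range.

p.s. I originally meant to write it this way, but reached the opposite conclusion: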

math.exp(math.log(0.1) * 10)

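The result is:

1.0000000000000031e-10

The error is larger than in the previous result, which puzzled me at first. The log form is not more precise at this scale; what it buys is range, since the log of a product of many small probabilities stays representable long after the product itself would have underflowed to 0.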
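Ten time steps are too few to trigger underflow in double precision, but a longer sequence shows the effect clearly. The sketch below uses 400 steps at probability 0.1, numbers picked purely for illustration; the helper name is hypothetical.

```python
import math

def product_and_log_sum(p, steps):
    """Accumulate p**steps directly and as a sum of logs."""
    product, log_sum = 1.0, 0.0
    for _ in range(steps):
        product *= p            # shrinks towards the smallest representable double
        log_sum += math.log(p)  # grows linearly in magnitude, stays representable
    return product, log_sum

product, log_sum = product_and_log_sum(0.1, 400)
print(product, log_sum)
```

The direct product collapses to 0.0 because 10^-400 is far below the smallest positive double, while the log sum is simply 400 * ln(0.1), a number of ordinary size.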

2.Function CalculateBackwardVariables()

Some of this is covered in "ctc loss中一些公式的推导" (Derivations of some formulas in ctc loss).

3.Function CalculateGradient()

Nothing much to say.

4.Function GetLPrimeIndices()

It expands $\mathbf{l}$ into $\mathbf{l}'$, which is easy to understand.
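The expansion itself is short enough to sketch directly. The function name expand_with_blanks is made up for this illustration and is not the C++ function; the behaviour shown is only the textbook rule of putting a blank before, between and after the labels.

```python
def expand_with_blanks(labels, blank_index):
    """Build l' from l: blank, l_1, blank, l_2, ..., l_n, blank."""
    l_prime = [blank_index]
    for label in labels:
        l_prime.append(label)
        l_prime.append(blank_index)
    return l_prime

print(expand_with_blanks([0, 1], blank_index=2))
```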

<4>ctc_loss_calculator.h

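1.Class CTCLossCalculator

The class starts with a description. Its central point is that the last token is used for the "no transition" output, i.e. the blank at the end, which should be uncontroversial. The author also gives the reference the code is based on:

GravesTh: Alex Graves, "Supervised Sequence Labeling with Recurrent Neural Networks" (PhD Thesis), Technische Universität München.

This is a PhD thesis rather than the commonly cited paper A. Graves, S. Fernandez, F. Gomez, J. Schmidhuber. Connectionist Temporal Classification: Labeling Unsegmented Sequence Data with Recurrent Neural Networks. ICML 2006, Pittsburgh, USA, pp. 369-376. You can search for the book on Google; the copy I found is an unpublished version, but the overall structure is much the same. Chapter 7, Connectionist Temporal Classification, goes through ctc loss and the decoding methods in detail and also covers applications in 5 areas. Compared with the paper it adds a little, but unfortunately the derivations are about the same as in the paper, so you still have to work through them by hand.

The class is divided into public and private parts. Let's look at the private part first.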

1.1 private

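The first four functions, CalculateForwardVariables(), CalculateBackwardVariables(), CalculateGradient() and GetLPrimeIndices(), are defined in ctc_loss_calculator.cc. PopulateLPrimes() is covered later in this section.

Two member variables are also defined, blank_index_ and output_delay_. The comment on blank_index_ reads: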

Utility indices for the CTC algorithm.

It is the index of blank, N - 1 by default.

The comment on output_delay_ reads:

Delay for target labels in time steps. The delay in time steps before the output sequence.

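For now I still don't know what this means.

1.2 public

It first defines the two-dimensional int type LabelSequences, then some Matrix and Array types, plus an initialisation method, and declares CalculateLoss.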

2.PopulateLPrimes()

According to its comment, this function is a:

Helper function that calculates the l_prime indices for all batches at the same time, and identifies errors for any given batch. Return value: max_{b in batch_size} l_primes[b].size()

It computes l_prime and checks for errors.

In the code, the first check reports an error if the size of labels differs from the batch size:

if (labels.size() != batch_size) {
  return errors::InvalidArgument(
      "labels.size() != batch_size: ", labels.size(), " vs. ", batch_size);
}

Then each item in the batch is processed, using this loop:

for (int b = 0; b < batch_size; b++)

and the label sequence of each batch item is extracted:

const std::vector<int>& label = labels[b];

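According to the comments, the code that follows converts the format from DistBelief, Google's first-generation machine learning system, into a vector. The comments also point to this error case: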

Saw an invalid sequence with non-null following null labels.

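That is, a non-null label following an invalid one (blank, or an index beyond blank). I don't quite follow this: by rights blank should not appear in the labels at all, so I don't see why a blank at the end of the sequence is acceptable. It may be a DistBelief quirk; noted for later.

Then there are two more checks: an index must not be negative and must not go beyond blank. I don't see why these are not done in the conversion code above, which should save a little time.

Next comes the question of whether to ignore outputs longer than inputs. For the meaning of ignore_longer_outputs_than_inputs, see ctc_loss: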

Boolean. Default: False. If True, sequences with longer outputs than inputs will be ignored.

Finally $\mathbf{l}$ is expanded into $\mathbf{l}'$, using the function in ctc_loss_calculator.cc.

max_u_prime records the maximum sequence length, i.e. the maximum length after expansion.
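The overall flow of these steps (size check, index checks, expansion, tracking the maximum expanded length) can be sketched as below. This is a simplified illustration rather than a port of PopulateLPrimes(): it treats any index at or beyond blank as invalid, and it leaves out the DistBelief trailing-null handling and the ignore_longer_outputs_than_inputs option. All names are hypothetical.

```python
def populate_l_primes_sketch(labels, batch_size, num_classes):
    """Validate each label sequence, expand it, and return (l_primes, max_u_prime)."""
    blank_index = num_classes - 1
    if len(labels) != batch_size:
        raise ValueError(f"labels size {len(labels)} != batch_size {batch_size}")
    l_primes = []
    for label in labels:
        for index in label:
            # Simplifying assumption: true labels lie in [0, blank_index).
            if index < 0 or index >= blank_index:
                raise ValueError(f"invalid label index {index}")
        l_prime = [blank_index]
        for index in label:
            l_prime += [index, blank_index]
        l_primes.append(l_prime)
    # Longest expanded sequence in the batch.
    max_u_prime = max((len(l_prime) for l_prime in l_primes), default=0)
    return l_primes, max_u_prime

print(populate_l_primes_sketch([[0, 1], [2]], batch_size=2, num_classes=4))
```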

3.CalculateLoss method

First the number of time steps, num_time_steps, is obtained:

auto num_time_steps = inputs.size();

inputs is the input, i.e. the LSTM output. The ctc_loss documentation describes it as:

inputs: 3-D float Tensor. If time_major == False, this will be a Tensor shaped: [batch_size, max_time, num_classes]. If time_major == True (default), this will be a Tensor shaped: [max_time, batch_size, num_classes]. The logits.

By default the first dimension is max_time, the second is batch size and the third is num classes, hence:

auto batch_size = inputs[0].rows();
auto num_classes = inputs[0].cols();

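Some checks follow, which I won't go into. A variable max_seq_len is also created to record the maximum of seq_len.

Then PopulateLPrimes() is used for the expansion.

After that the loss is computed with multiple threads. The comment says: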

Process each item in a batch in parallel, using at most kMaxThreads.

So let's first look at the code that calls ComputeLossAndGradients(). It is this sizeable block:

if (workers) {
    // *Rough* estimate of the cost for one item in the batch.
    // Forward, Backward: O(T * U (= 2L + 1)), Gradients: O(T * (U + L)).
    //
    // softmax: T * L * (Cost(Exp) + Cost(Div))softmax +
    // fwd,bwd: T * 2 * (2*L + 1) * (Cost(LogSumExp) + Cost(Log)) +
    // grad: T * ((2L + 1) * Cost(LogSumExp) + L * (Cost(Expf) + Cost(Add)).
    const int64 cost_exp = Eigen::internal::functor_traits<
        Eigen::internal::scalar_exp_op<float>>::Cost;
    const int64 cost_log = Eigen::internal::functor_traits<
        Eigen::internal::scalar_log_op<float>>::Cost;
    const int64 cost_log_sum_exp =
        Eigen::TensorOpCost::AddCost<float>() + cost_exp + cost_log;
    const int64 cost =
        max_seq_len * num_classes *
            (cost_exp + Eigen::TensorOpCost::DivCost<float>()) +
        max_seq_len * 2 * (2 * num_classes + 1) *
            (cost_log_sum_exp + cost_log) +
        max_seq_len *
            ((2 * num_classes + 1) * cost_log_sum_exp +
             num_classes * (cost_exp + Eigen::TensorOpCost::AddCost<float>()));
    Shard(workers->num_threads, workers->workers, batch_size, cost,
          ComputeLossAndGradients);
  } else {
    ComputeLossAndGradients(0, batch_size);
  }
  return Status::OK();

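My C++ is weak and I can't fully follow this code, so I won't pretend otherwise. The comments estimate the cost per batch item, and then the computation is dispatched, so the main job is still to analyse ComputeLossAndGradients().

In the branch without workers the call is ComputeLossAndGradients(0, batch_size), so there start_row = 0 and limit_row = batch_size.

The function uses this loop: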

for (int b = start_row; b < limit_row; b++)

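b runs from 0 through to the last item in the batch.

It first checks whether the sequence length is 0.

Then several variables are created: l_prime holds the sequence, log_alpha_b is log(alpha), log_beta_b is log(beta), and dy holds the gradients.

y_b holds the input after softmax. It is obtained through this transformation: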

for (int t = 0; t < seq_len(b); t++) {
  float max_coeff = inputs[t].row(b).maxCoeff();
  y_b_col = (inputs[t].row(b).array() - max_coeff).exp();
  y_b.col(t) = y_b_col / y_b_col.sum();
}

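This is the classic softmax. Note that max_coeff is subtracted along the way, the standard trick for numerical stability, since it keeps exp from overflowing; see "CTC实现代码中的一些图形化解释" (Some graphical explanations of a CTC implementation) for details. This code also settles my question from section 1: there is no need to apply softmax yourself.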
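The same max-subtraction trick looks like this in plain Python. It is a generic sketch of a stable softmax over one row of logits, not the Eigen code above; the function name is made up for the example.

```python
import math

def stable_softmax(logits):
    """Softmax that subtracts the maximum first so exp never sees a large argument."""
    max_logit = max(logits)
    exps = [math.exp(x - max_logit) for x in logits]  # every argument is <= 0
    total = sum(exps)
    return [e / total for e in exps]

# math.exp(1000.0) on its own would raise OverflowError.
probabilities = stable_softmax([1000.0, 1001.0, 1002.0])
print(probabilities)
```

Subtracting the maximum does not change the result, because the common factor cancels between numerator and denominator.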

After that, log(alpha) and log(beta) are each computed with the functions in ctc_loss_calculator.cc.

Then the log of the final probability is computed, i.e. $\ln p(\mathbf{l} \mid \mathbf{x})$.
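For reference, the total probability can be read off the forward variables alone, since a valid path must end either on the last label or on the trailing blank. The block below states this in the $\alpha(t, u)$ notation used above, with $U = 2|\mathbf{l}| + 1$; it is the textbook relation and is not claimed to be the exact expression the C++ code evaluates.

```latex
p(\mathbf{l} \mid \mathbf{x}) = \alpha(T, U) + \alpha(T, U - 1),
\qquad
\text{loss} = -\ln p(\mathbf{l} \mid \mathbf{x})
```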

Then come the gradients, which are also very simple and mostly rely on the functions in ctc_loss_calculator.cc.

    Original author: Michael
    Original URL: https://zhuanlan.zhihu.com/p/41331716