Some Notes on TensorFlow

April 28, 2023 · Source: PENG

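With so many deep learning frameworks around, choosing one is a real headache. I started with Keras, but as my models grew more complex I found it too abstract and not flexible enough, so I went looking for a lower-level framework. I wavered between PyTorch and TensorFlow for a while and finally settled on TensorFlow because PyTorch's cross-platform support was poor (sigh). Later I also found that plotting with TensorBoard is really great.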

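I ran into problems from time to time while using TF. I have to say the TF API is rather messy and the tutorials are not friendly to newcomers, so it is harder to get started with than other frameworks, but once you are familiar with it, it turns out to be quite pleasant to use. This post summarizes some problems I ran into, how I solved them, and a few interesting features.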

Loading data

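My dataset is fairly large, so I wanted to read it through TF's data input mechanism: it saves memory, and TF ships decode functions for various formats, which is convenient. Mostly, though, I was too lazy to write my own dataloader. Specifically, I used the tf.contrib.data API that was newly added in r1.2. The code is short: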

import tensorflow as tf  # written against TensorFlow r1.2 (tf.contrib.data)

def input_pipeline(filenames, batch_size):
    # Define a `tf.contrib.data.Dataset` for iterating over one epoch of the data.
    dataset = (tf.contrib.data.TextLineDataset(filenames)
               .map(lambda line: tf.decode_csv(
                    line, record_defaults=[['1'], ['1'], ['1']], field_delim='\t'))
               .shuffle(buffer_size=10000)  # Equivalent to min_after_dequeue=10000.
               .batch(batch_size))

    # Return an *initializable* iterator over the dataset, which will allow us to
    # re-initialize it at the beginning of each epoch.
    return dataset.make_initializable_iterator() 

filenames=['1.txt']
batch_size = 300
num_epochs = 10
iterator = input_pipeline(filenames, batch_size)

# `a1`, `a2`, and `a3` represent the next element to be retrieved from the iterator. 
a1, a2, a3 = iterator.get_next()

with tf.Session() as sess:
    for _ in range(num_epochs):
        # Resets the iterator at the beginning of an epoch.
        sess.run(iterator.initializer)

        try:
            while True:
                a, b, c = sess.run([a1, a2, a3])
                print(a, b, c)
        except tf.errors.OutOfRangeError:
            # This will be raised when you reach the end of an epoch (i.e. the
            # iterator has no more elements).
            pass                 

        # Perform any end-of-epoch computation here.
        print('Done training, epoch reached')

This API improves on tf.train.string_input_producer and is somewhat easier to use. Running sess.run(iterator.initializer) at the start of each epoch reshuffles the data.

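In practice, though, I found that this shuffle does not really shuffle the whole dataset. Taking the code above as an example of how TF works: buffer_size=10000 means the first 10,000 lines of the file are read into a buffer, and with batch_size=300, 300 of them are drawn at random. The buffer now holds only 9,700 items, so another 300 lines are read from the file to refill it, and then the next batch is drawn at random…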
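
The buffering behaviour described above can be imitated in a few lines of plain Python. This is only a simplified model, not TensorFlow's implementation: the function name `buffered_shuffle` is made up for illustration, and it assumes elements leave the buffer one at a time, each being replaced by the next line of input.

```python
import itertools
import random


def buffered_shuffle(items, buffer_size, rng):
    """Yield `items` in the order a fixed-size shuffle buffer would emit them.

    Hypothetical helper for illustration; it models the behaviour only.
    """
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)  # fill phase: nothing is emitted yet
            continue
        idx = rng.randrange(len(buffer))
        yield buffer[idx]        # emit a random element from the buffer...
        buffer[idx] = item       # ...and put the newly read item in its slot
    rng.shuffle(buffer)          # input exhausted: drain what is left
    for item in buffer:
        yield item


# Pretend the file has 100,000 lines, numbered in file order.
stream = buffered_shuffle(range(100000), 10000, random.Random(0))
first_batch = list(itertools.islice(stream, 300))
print(max(first_batch) < 10300)  # True: the first batch only sees the head of the file
```

In this model the k-th emitted element can only come from the first buffer_size + k lines, so if the file is sorted (for example by label), the early batches are far from a uniform sample of the dataset.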

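TensorFlow's buffered shuffle not only fails to randomize over the whole dataset, it also has to shuffle the buffer on every draw, which made my runs very slow. In the end I went back to a dataloader I wrote myself, which turned out to be five times faster than the method TF provides.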
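
The post does not show the hand-written dataloader, so the following is not the author's code. It is one possible sketch of a loader that shuffles over the whole file while keeping only line offsets in memory. The names `build_line_index`, `iterate_batches` and `demo` are hypothetical, and the tab-separated three-column format is assumed from the decode_csv call earlier.

```python
import os
import random
import tempfile


def build_line_index(path):
    """Return the byte offset at which each line of `path` starts."""
    offsets = []
    with open(path, "rb") as f:
        while True:
            pos = f.tell()
            if not f.readline():
                break
            offsets.append(pos)
    return offsets


def iterate_batches(path, offsets, batch_size, rng):
    """Yield batches of tab-split rows, visiting every line once in random order."""
    order = list(offsets)
    rng.shuffle(order)  # global shuffle: only the offsets are permuted
    with open(path, "rb") as f:
        for start in range(0, len(order), batch_size):
            batch = []
            for pos in order[start:start + batch_size]:
                f.seek(pos)
                line = f.readline().decode("utf-8").rstrip("\r\n")
                batch.append(line.split("\t"))
            yield batch


def demo():
    """Write a tiny 10-line file and read it for two epochs."""
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
        for i in range(10):
            tmp.write("q%d\ta%d\t%d\n" % (i, i, i % 2))
        path = tmp.name
    try:
        offsets = build_line_index(path)  # built once, reused every epoch
        rng = random.Random(0)
        return [list(iterate_batches(path, offsets, 4, rng)) for _ in range(2)]
    finally:
        os.remove(path)
```

Calling `iterate_batches` again at the start of each epoch gives a fresh permutation of the full file. The price is random seeks, so whether this is faster than a streaming reader depends on the storage and the file size.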

For details, see the Stack Overflow discussion "How to use TensorFlow tf.train.string_input_producer to produce several epochs data?"

Parameter sharing

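Take the model from the blog post Text Matching (II) as an example: if a model needs to share parameters across two inputs (such as Question and Answer), the graph has to be designed with some care. The usual approach is to declare parameters with tf.get_variable(), put the calls inside the same variable_scope, and mark the scope's variables as reusable. While building the graph, TF then checks whether a variable of that name already exists and reuses it instead of creating a new one. A simple sketch: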

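def nets(sequence):
    # `shape` stands for the weight shape, which depends on the model.
    W = tf.get_variable('W', shape, initializer=tf.contrib.layers.xavier_initializer())
    # ... build the rest of the network on `sequence` using W ...

def inference(question, answer):
    with tf.variable_scope("nets") as scope:
        q = nets(question)       # first call creates nets/W
        scope.reuse_variables()  # later get_variable() calls in this scope reuse nets/W
        a = nets(answer)         # second call gets the same nets/W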
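
To make the reuse rule concrete without needing TensorFlow installed, here is a toy model of it in plain Python. It is an illustration only: `VariableStore` is a made-up class, the weights are a plain list, and the only behaviour it imitates is that a scope either creates a named variable or, once reuse is switched on, hands back the existing one.

```python
class VariableStore:
    """Toy stand-in for a variable scope: maps full names to parameter objects."""

    def __init__(self, name):
        self.name = name
        self.reuse = False
        self._vars = {}

    def reuse_variables(self):
        self.reuse = True

    def get_variable(self, name, make_value):
        full_name = self.name + "/" + name
        if self.reuse:
            # Reuse mode: the variable must already exist.
            if full_name not in self._vars:
                raise ValueError("Variable %s does not exist" % full_name)
            return self._vars[full_name]
        # Create mode: a second creation under the same name is an error.
        if full_name in self._vars:
            raise ValueError("Variable %s already exists" % full_name)
        self._vars[full_name] = make_value()
        return self._vars[full_name]


def nets(scope, sequence):
    W = scope.get_variable("W", lambda: [0.5, -1.0])  # made-up weights
    return [w * x for w, x in zip(W, sequence)], W


def inference(question, answer):
    scope = VariableStore("nets")
    q, w_q = nets(scope, question)
    scope.reuse_variables()
    a, w_a = nets(scope, answer)
    return q, a, w_q is w_a  # the last item is True when both calls share W
```

Dropping the `reuse_variables()` call makes the second `nets` call raise a ValueError in this toy, which mirrors the kind of error TF reports when a variable name is declared twice in a scope without reuse.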

Plotting with TensorBoard

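Once I started using TensorBoard I found it extremely convenient for visualization; I hardly need to draw plots myself anymore. TensorFlow feeds TensorBoard through the tf.summary API, and the views I use most are: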

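Scalar: shows directly how loss, accuracy and similar values change at every step
Distribution, Histogram: show how the distribution of the parameters changes during training, which helps judge whether the model has learned enough
Graph: shows the model architecture visually, making it easy to follow how the graph was built. For example, if the parameter sharing described above is set up correctly, the Graph view shows question and answer using the same module
Embedding: can reduce dimensionality with PCA and map the input into a low-dimensional space, which looks very cool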

I strongly recommend using this feature. For details see summaries_and_tensorboard; the tutorials cover this feature fairly thoroughly.

    Original author: PENG
    Original URL: https://zhuanlan.zhihu.com/p/27625787