小白的tensorflow+CUDA编程踩坑记录

2023年10月17日 290次阅读来源: YukiRain

初学CUDA，记录下自己踩过的坑。

A. 编译GPU版本的tensorflow OpKernel

参考网页https://www.tensorflow.org/extend/adding_an_op。步骤如下：

在一个.cpp文件中写这个Op类的接口，这个自定义的类应当继承tensorflow::OpKernel基类，并重写其Compute方法；
在cpp文件中写Register代码，可以选择用CPU或者GPU实现，最后编译的时候把以下参数传到REGISTER_KERNEL_BUILDER中。

typedef Eigen::ThreadPoolDevice CPUDevice;
typedef Eigen::GpuDevice GPUDevice;

具体的Op计算过程只在cpp文件中留出函数接口，不具体实现；
在.cu文件中写cuda代码，实现并行化的计算；
（optional）重复以上过程，实现该Op所对应的gradient计算过程，同样封装成一个tensorflow::OpKernel的子类；
用g++和nvcc编译C++和CUDA代码，生成动态链接.so文件；
在Python中调用动态链接库并进行进一步的封装，用Python中的@tf.RegisterGradient做decorator把gradient的实现绑定到Op上；

Note：

为了编译方便，REGISTER_KERNEL_BUILDER的代码最好写在条件编译中；
tensorflow的OpKernel注册时的名字必须严格按照大驼峰法命名，对应Python中变成小写，下划线分隔。比如说：

REGISTER_KERNEL_BUILDER(
 Name("SparseGraphConvolution").Device(DEVICE_GPU).TypeConstraint<float>("T"),
    GraphAdjacencyGeneratorOp<GPUDevice, float>);

到了Python中的名字就叫做sparse_graph_convolution

用OP_REQUIRES检查输入合法性，用OP_REQUIRES_OK分配新的内存空间。在多GPU环境下，后者比cuda中的cudaMalloc效率更高，因为不会造成多GPU之间的资源竞争。所以在多GPU环境下应该尽可能不用cudaMalloc函数，最好也不要在CUDA kernel里面写new和delete；
可以仿照STL的设计思路，设计这样一个类，然后传到tensorflow::OpKernel子类的template参数中去：

struct GpuAlloc {
    GpuAlloc(OpKernelContext *context) {
    this->context = context;
}

    void alloc(void**data, int bytes) {
        int num_elems = std::ceil(bytes / 8.0);
        // Have to keep the tensors until end of op to avoid memory crash.         tmp_tensors.push_back(Tensor());
        Tensor &tmp = tmp_tensors.back();
        TensorShape shape({num_elems});
        OP_REQUIRES_OK(context, context->allocate_temp(DataTypeToEnum<double>::value, shape, &tmp));

        double *buf = &(tmp.flat<double>()(0));
        *data = (void*)buf;
    }
    OpKernelContext *context;
    std::vector<Tensor> tmp_tensors;
};

原则上，确实应该尽可能地多用CUDA中的share的memory，在一些情况下，shared memory的读取速度甚至和寄存器的速度差不多。问题在于，神经网络中每一层的输出节点数量用户是可以随便定义的，如果输出节点数量太大，shared memory就不够用，如果每个block中用到的shared memory数量超过了maxSharedMemoryPerBlock的值，kernel就不会启动。我暂时还没有搞清楚谷歌在tensorflow中是如何处理这个问题的；

B. CUDA中的kernel封装问题

这个问题起源于我的一次自作聪明：每次传给kernel的参数，无论实际维度是多少，总是一个一级指针，直接对这个一级指针操作很容易不小心粗心写错，那么可不可以对这个指针做一个简单的封装，让它稍微好用一点呢？

于是我写了这么一段简单的代码来测试这个想法：

template <typename T> class QueryBase {
public:
    QueryBase(T* _data): data(_data) {}
    __device__ virtual T& operator()(int, int) = 0;
protected:
    __device__ QueryBase(const QueryBase<T>&);
    T* data;
};

template <typename T> class ForwardQuery : public QueryBase<T> {
public:
    ForwardQuery(T* _data, int _size)
                : QueryBase<T>(_data), size(_size) {}
    __device__ T& operator()(int a, int b) { return this->data[a*size+b]; }
protected:
    int size;
};

template<typename T>
__global__ void testTemplate(QueryBase<T>& in, QueryBase<T>& out) {
    out(threadIdx.x， threadIdx.y) = in(threadIdx.x, threadIdx.y) + 1.0;
    __syncthreads();
}

思路就是在host上建一个类，这个类的构造函数是__host__函数，其中保存一个指向device内存的指针，把它的operator()写成__device__函数，最后把这个类传进kernel的参数列表中。这样，在kernel中就可以方便地用operator()解引用矩阵元素了。

结果：可以通过编译，但运行结果不对。即使我小心翼翼地设计这个类，把它写成线程安全的，结果依然是错误的。原因至今还不清楚。这说明写CUDA kernel的时候还是老老实实用下标比较好。

C. Trouble shooting for nvcc

Bug1：找不到op.h——没有包含tensorflow的include路径，参考网页https://www.tensorflow.org/extend/adding_an_op，加上这部分可以解决；

•-I/home/xxx/anaconda3/lib/python3.5/site-packages/tensorflow/include -I/home/xxx/anaconda3/lib/python3.5/site-packages/te/nsync/public
–D_GLIBCXX11_ABI=0

Bug2：version ‘GLIBCXX_3.4.21’ not found——参考网页https://blog.csdn.net/amor_tila/article/details/77976964，这个错误常见于新装的ubuntu系统，因为新ubuntu系统中的libgcc是老版本的，其他网站一般就教你用硬链接ln一个假的libgcc出来，但个人觉得最好的办法还是自己conda install libgcc装一个新的；
Bug3：找不到nsynccv.h——解决方案详见https://blog.csdn.net/qq_27637315/article/details/79114633
Bug4：Undefined symbol: _ZTIN10tensorflow8OpKernelE，详见网站https://blog.csdn.net/qq_17827079/article/details/79709674
Bug5：undefined symbol: _ZN10tensorflow8internal21CheckOpMessageBuilder9NewStringEv：参考网页https://github.com/tensorflow/tensorflow/issues/9137。需要注意不光是nvcc编译的时候要加-D_GLIBCXX_USE_CXX11_ABI=0，最后一步用g++联合编译生成动态链接库的时候，也要加这一条命令；
g++编译时如果用了比-O2更高级别的编译器优化，可能会有意料之外的bug。

D. C++版本CUDA代码示例

网上C风格的CUDA学习资源很多，很多人也提倡用C风格的代码，因为效率更高，编译器的坑也少一些。之所以要写C++的代码，主要还是因为C++的template写起来可以很轻松的控制OpKernel中的变量类型，可以同时支持很多不同类型参数的计算。下面的示例代码把tensorflow官网上的那个AddOneOp改写成C++风格，功能是把输入的前K个数字加1：

// cuda_op_kernel.cu.cc #if GOOGLE_CUDA #define EIGEN_USE_GPU 
#include "third_party/eigen3/unsupported/Eigen/CXX11/Tensor" #include "tensorflow/core/framework/op.h" #include "tensorflow/core/framework/op_kernel.h" #include "tensorflow/core/framework/register_types.h" #include "tensorflow/core/framework/shape_inference.h" #include "tensorflow/core/framework/common_shape_fns.h" 
using namespace tensorflow;
typedef Eigen::ThreadPoolDevice CPUDevice;
typedef Eigen::GpuDevice GPUDevice;

template <typename T>
__global__ void AddOneKernel(const T* in, const int N, T* out) {
	int idx = blockIdx.x * N + threadIdx.x;
	out[idx] = in[idx] + 1;
}

template <typename T>
void AddOneKernelLauncher(const T* in, const int batch_size, const int N, const int K, T* out) {
	AddOneKernel<T><<<batch_size, K>>>(in, N, out);
	cudaDeviceSynchronize();
}

template <typename Device, typename T>
class AddOneOp : public OpKernel {
public:
	explicit AddOneOp(OpKernelConstruction* context) : OpKernel(context) {
		OP_REQUIRES_OK(context, context->GetAttr("K", &K_));
	}

	void Compute(OpKernelContext* context) override {
		const Tensor& input_tensor = context->input(0);
		auto input = input_tensor.flat<T>();
		const int batch_size = input_tensor.shape().dim_size(0);
		const int N = input.size() / batch_size;

		Tensor* output_tensor = NULL;
		OP_REQUIRES_OK(context, context->allocate_output(0, input_tensor.shape(), &output_tensor));
		auto output = output_tensor->flat<T>();
		OP_REQUIRES(context, K_>0 && K_ <= N,
			::tensorflow::errors::InvalidArgument("Invalid K value"));
		AddOneKernelLauncher<T>(input.data(), batch_size, N, K_, output.data());
	}
private:
	int K_;
};

REGISTER_KERNEL_BUILDER(Name("AddOne").Device(DEVICE_GPU).TypeConstraint<int>("T"),
                        AddOneOp<GPUDevice, int>);
REGISTER_KERNEL_BUILDER(Name("AddOne").Device(DEVICE_GPU).TypeConstraint<float>("T"),
                        AddOneOp<GPUDevice, float>);
REGISTER_KERNEL_BUILDER(Name("AddOne").Device(DEVICE_GPU).TypeConstraint<double>("T"),
                        AddOneOp<GPUDevice, double>);

#endif

然后是C++代码：

// cuda_op_kernel.cc #include "tensorflow/core/framework/op.h"  #include "tensorflow/core/framework/op_kernel.h"  
using namespace tensorflow;

typedef Eigen::ThreadPoolDevice CPUDevice;
typedef Eigen::GpuDevice GPUDevice;

REGISTER_OP("AddOne")
.Attr("T: {int32, float, double}")
.Input("input: T")
.Output("output: T")
.Attr("K: int")
.Doc(R"doc(  Adds 1 to all elements of the tensor.  output: A Tensor.   out_{i<K} = in_{i<K} + 1 )doc");

template<typename T>
void AddOneKernelLauncher(const T* in, const int batch_size, const int N, const int K, T* out);

template <typename Device, typename T>
class AddOneOp : public OpKernel {
public:
    explicit AddOneOp(OpKernelConstruction* context) : OpKernel(context) {
        OP_REQUIRES_OK(context, context->GetAttr("K", &K_));
    }

    void Compute(OpKernelContext* context) override {
        const Tensor& input_tensor = context->input(0);
        auto input = input_tensor.flat<T>();
        const int batch_size = input_tensor.shape().dim_size(0);
        const int N = input.size() / batch_size;
 
        Tensor* output_tensor = NULL;
        OP_REQUIRES_OK(context, context->allocate_output(0, input_tensor.shape(),
                       &output_tensor));
        auto output = output_tensor->flat<T>();
        OP_REQUIRES(context, K_>0 && K_<=N,
            ::tensorflow::errors::InvalidArgument("Invalid K value"));
        AddOneKernelLauncher<T>(input.data(), batch_size, N, K_, output.data());
    }

private:
    int K_;
};

#ifndef GOOGLE_CUDA REGISTER_KERNEL_BUILDER(Name("AddOne").Device(DEVICE_CPU).TypeConstraint<int>("T"),
                        AddOneOp<CPUDevice, int>);
REGISTER_KERNEL_BUILDER(Name("AddOne").Device(DEVICE_CPU).TypeConstraint<float>("T"),
                        AddOneOp<CPUDevice, float>);
REGISTER_KERNEL_BUILDER(Name("AddOne").Device(DEVICE_CPU).TypeConstraint<double>("T"),
                        AddOneOp<CPUDevice, double>);
#endif

最后是bash脚本（踩了一堆坑，写的有点啰嗦了）：

TF_CFLAGS=( $(python -c 'import tensorflow as tf; print(" ".join(tf.sysconfig.get_compile_flags()[: -1]))') )
TF_LFLAGS=( $(python -c 'import tensorflow as tf; print(" ".join(tf.sysconfig.get_link_flags()))') )
TF_IFLAGS=( $(python -c 'import tensorflow as tf; print(tf.sysconfig.get_include())')  )

g++ -std=c++11 -shared cuda_op_kernel.cc -o cuda_op_kernel.so -fPIC ${TF_CFLAGS[@]} -D_GLIBCXX_USE_CXX11_ABI=0\
 ${TF_LFLAGS[@]} -O2

nvcc -std=c++11 -c -o cuda_op_kernel.cu.o cuda_op_kernel.cu.cc ${TF_CFLAGS[@]} -D_GLIBCXX_USE_CXX11_ABI=0\
 ${TF_LFLAGS[@]} -D GOOGLE_CUDA=1 -x cu -Xcompiler -fPIC

g++ -std=c++11 -shared -o cuda_op_kernel.so cuda_op_kernel.cc cuda_op_kernel.cu.o -D_GLIBCXX_USE_CXX11_ABI=0\
 -I$TF_IFLAGS -I$TF_IFLAGS/external/nsync/public -L/usr/local/cuda-9.1/lib64 ${TF_LFLAGS[@]} -fPIC -lcudart

    原文作者：YukiRain
    原文地址: https://zhuanlan.zhihu.com/p/40375792
    本文转自网络文章，转载此文章仅为分享知识，如有侵权，请联系博主进行删除。