PyTorch Tutorial Study Notes

May 14, 2024. Source: 咖喱

1. Some important concepts

Tensor

autograd

Variable

nn — high-level abstraction

The nn package defines a set of Modules, which are roughly equivalent to neural network layers.
torch.nn.Linear
torch.nn.ReLU
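
To make the idea of Modules as layers concrete, here is a minimal sketch that chains the two modules named above. The sizes (4 input features, 3 output features, a batch of 2) are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

linear = nn.Linear(in_features=4, out_features=3)  # computes x @ W.T + b
relu = nn.ReLU()                                   # elementwise max(0, x)

x = torch.randn(2, 4)      # batch of 2 samples, 4 features each
hidden = relu(linear(x))   # shape (2, 3), with no negative entries
```

Each Module is called like a function on a tensor and holds its own parameters (here, the weight and bias of the linear layer).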

saving_loading_models

A common PyTorch convention is to save models using either a .pt or .pth file extension.
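
A short sketch of that convention, assuming the usual approach of saving only the state_dict. The file name model.pt and the temporary directory are placeholders chosen for this example.

```python
import os
import tempfile

import torch
import torch.nn as nn

model = nn.Linear(4, 2)

# Save only the parameters (the state_dict), using the .pt extension.
path = os.path.join(tempfile.mkdtemp(), "model.pt")  # placeholder location
torch.save(model.state_dict(), path)

# Rebuild the same architecture, then load the saved parameters into it.
restored = nn.Linear(4, 2)
restored.load_state_dict(torch.load(path))
restored.eval()  # switch to inference mode before evaluating
```

Saving the state_dict rather than the whole model object keeps the file independent of how the model class is defined in the source tree.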

nn_tutorial

A trailing _ in PyTorch signifies that the operation is performed in-place.
View is PyTorch’s version of numpy’s reshape.
A Sequential object runs each of the modules contained within it, in a sequential manner.
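
The three points above can be sketched in a few lines. The tensor shapes and layer sizes are arbitrary and only serve the illustration.

```python
import torch
import torch.nn as nn

a = torch.ones(2, 3)
b = a.add(1)    # out-of-place: returns a new tensor, a is unchanged
a.add_(1)       # in-place (trailing underscore): a itself is modified

flat = a.view(6)        # same data, new shape, like numpy's reshape
grid = a.view(-1, 2)    # -1 lets PyTorch infer that dimension: (3, 2)

model = nn.Sequential(
    nn.Linear(3, 4),
    nn.ReLU(),
    nn.Linear(4, 1),
)
out = model(a)  # each module's output feeds the next one; shape (2, 1)
```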

When using torch.optim, setting requires_grad to False on a layer's parameters freezes that layer.
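
A minimal sketch of freezing, assuming a small made-up model in which only the first layer is frozen. Passing only the trainable parameters to the optimizer is one common pattern, not the only option.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))

# Freeze the first layer: autograd will not compute gradients for it.
for p in model[0].parameters():
    p.requires_grad = False

# Hand only the still-trainable parameters to the optimizer.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=0.1)

frozen_before = model[0].weight.clone()
loss = model(torch.randn(5, 4)).sum()
loss.backward()
optimizer.step()  # updates the last layer only
```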

Load data with a Dataset and a DataLoader.
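
As a rough illustration of the Dataset and DataLoader pairing in torch.utils.data, the sketch below defines a toy dataset. SquaresDataset is a hypothetical class invented for this example.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class SquaresDataset(Dataset):
    """Hypothetical toy dataset: item i is the pair (i, i squared)."""

    def __init__(self, n):
        self.n = n

    def __len__(self):
        return self.n

    def __getitem__(self, idx):
        x = torch.tensor([float(idx)])
        return x, x ** 2


# The loader groups individual items into batches of up to 4.
loader = DataLoader(SquaresDataset(10), batch_size=4, shuffle=False)
batch_shapes = [tuple(xb.shape) for xb, yb in loader]
```

With 10 items and a batch size of 4, the last batch is smaller than the others unless drop_last=True is passed.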

2. Some operations

Use Tensor.topk to get the index of the greatest value:

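def categoryFromOutput(output):
    # `output` and `all_categories` come from the surrounding tutorial code
    top_n, top_i = output.topk(1)  # .data is unnecessary now that Variable is merged into Tensor
    category_i = top_i[0].item()   # index of the top-scoring category as a Python int
    return all_categories[category_i], category_i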

print(categoryFromOutput(output))
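
For a version that runs on its own, here is a small sketch of the same topk call. The category labels and scores are made up for illustration.

```python
import torch

# One row of scores over four hypothetical categories.
all_categories = ["cat", "dog", "bird", "fish"]   # made-up labels
output = torch.tensor([[-2.1, -0.3, -1.7, -3.0]])

top_values, top_indices = output.topk(1)   # k=1: the largest entry in each row
category_i = top_indices[0].item()         # plain Python int
best_category = all_categories[category_i]
```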

The loss that pairs with nn.LogSoftmax is criterion = nn.NLLLoss()
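
The pairing can be checked numerically: applying nn.LogSoftmax and then nn.NLLLoss gives the same value as nn.CrossEntropyLoss on the raw scores. The logits and targets below are random placeholders.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
logits = torch.randn(3, 5)           # 3 samples, 5 classes
target = torch.tensor([1, 0, 4])     # class index for each sample

log_probs = nn.LogSoftmax(dim=1)(logits)
nll = nn.NLLLoss()(log_probs, target)

# CrossEntropyLoss applies log-softmax internally, so it takes raw logits.
ce = nn.CrossEntropyLoss()(logits, target)
```

In practice this means a network ending in LogSoftmax should be trained with NLLLoss, while a network that outputs raw logits should use CrossEntropyLoss.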

nn.LSTM

nn.GRU

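Selecting GPUs from Python, as follows

Suppose a server has several GPUs, but we only want to use the 2nd and 4th, and we still want the code to see exactly two GPUs, numbered 0 and 1. The CUDA_VISIBLE_DEVICES environment variable handles this.
For example:
CUDA_VISIBLE_DEVICES=1 Only GPU 1 is visible to the program; in code, gpu[0] refers to this GPU.
CUDA_VISIBLE_DEVICES=0,2,3 Only GPUs 0, 2, and 3 are visible to the program; in code, gpu[0] is GPU 0, gpu[1] is GPU 2, and gpu[2] is GPU 3.
CUDA_VISIBLE_DEVICES=2,0,3 Only GPUs 0, 2, and 3 are visible to the program, but in code gpu[0] is GPU 2, gpu[1] is GPU 0, and gpu[2] is GPU 3.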
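
The renumbering rule in these examples can be written as a tiny pure-Python sketch. logical_to_physical is a hypothetical helper made up for illustration (it is not part of PyTorch or CUDA), and it assumes the variable holds a comma-separated list of integer ids.

```python
def logical_to_physical(visible_devices):
    """Map each in-program GPU index to the physical GPU id it refers to."""
    ids = [int(tok) for tok in visible_devices.split(",") if tok.strip()]
    return dict(enumerate(ids))


mapping = logical_to_physical("2,0,3")      # {0: 2, 1: 0, 2: 3}
second_and_fourth = logical_to_physical("1,3")
```

If "the 2nd and 4th GPU" is read as zero-based ids 1 and 3, then CUDA_VISIBLE_DEVICES=1,3 makes them appear in code as devices 0 and 1.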

In a Python program, we can write:

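import os
os.environ["CUDA_VISIBLE_DEVICES"] = "1"  # set this before CUDA is initialized

import torch
# Use the GPU when one is visible, otherwise fall back to the CPU
dev = torch.device("cuda" if torch.cuda.is_available() else "cpu")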

torch.nn.Embedding

2.1 About DataParallel

There is a discussion of this here: dataparallel-imbalanced-memory-usage.

2.2 Data type conversion

PyTorch: data type conversion
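
The linked article's content is not reproduced here. As a rough sketch of the kind of conversion it covers, these are a few standard tensor dtype conversions.

```python
import torch

x = torch.tensor([1, 2, 3])           # integer literals give torch.int64
as_float = x.float()                  # torch.float32
as_double = x.to(torch.float64)       # explicit target dtype
as_long = as_float.long()             # back to torch.int64

# Converting floating point values to an integer dtype truncates toward zero.
truncated = torch.tensor([1.7, -1.7]).int()
```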

2.3 Miscellaneous things I have come across

ceil_mode in PyTorch max pooling
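
A small sketch of what ceil_mode changes: it decides whether the output size is rounded down or up when the pooling windows do not tile the input exactly. The 5x5 input and the kernel and stride of 2 are arbitrary choices.

```python
import torch
import torch.nn as nn

x = torch.arange(25, dtype=torch.float32).view(1, 1, 5, 5)  # (N, C, H, W)

floor_pool = nn.MaxPool2d(kernel_size=2, stride=2)                  # ceil_mode=False
ceil_pool = nn.MaxPool2d(kernel_size=2, stride=2, ceil_mode=True)

floor_out = floor_pool(x)   # floor((5 - 2) / 2) + 1 = 2 per spatial dim
ceil_out = ceil_pool(x)     # ceil((5 - 2) / 2) + 1 = 3 per spatial dim
```

With ceil_mode=True the last row and column of the input are still pooled, using a partial window, instead of being dropped.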

3. Pitfalls encountered

3.1 While testing pytorch-yolo2 I hit this error (now resolved)

PyTorch socket.error [Errno 111] Connection refused

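3.2 While testing CornerNet, with PyTorch and the rest of the environment installed through conda,

the gcc inside the conda environment had to be upgraded, because PyTorch requires gcc>=4.9.

I downloaded gcc 4.9 from Anaconda Cloud, installed it, and then created symlinks, as follows: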

ln -s /home/20xxx/anaconda2/envs/CornerNet/bin/gcc-4.9 /home/20xxx/anaconda2/envs/CornerNet/bin/gcc
ln -s /home/20xxx/anaconda2/envs/CornerNet/bin/g++-4.9 /home/20xxx/anaconda2/envs/CornerNet/bin/g++

3.3 RuntimeError

3.3.1 RuntimeError: Only Tensors of floating point dtype can require gradients

This error is raised when running xmfbit/captcha-recognition. Change this part of test() in main.py:

# Before:
# x, act_lengths, flatten_target, target_lengths = tensor_to_variable(
#     (x, act_lengths, flatten_target, target_lengths), volatile=True)
# After:
x, act_lengths, flatten_target, target_lengths = tensor_to_variable(
    (x, act_lengths, flatten_target, target_lengths), volatile=False)

I have not yet looked into how requires_grad and volatile differ or how they relate.
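
As a partial pointer, the sketch below reproduces the error message on a plain integer tensor and shows torch.no_grad(), which replaced volatile=True for inference once volatile was deprecated in PyTorch 0.4. Whether this is exactly what happens inside captcha-recognition is an assumption; the tensors here are made up.

```python
import torch

# Integer tensors cannot track gradients; this reproduces the RuntimeError.
int_tensor = torch.tensor([1, 2, 3])
try:
    int_tensor.requires_grad_(True)
    raised = False
except RuntimeError:
    raised = True

# Casting to a floating point dtype first works.
float_tensor = int_tensor.float().requires_grad_(True)

# Inference without gradient tracking, the modern stand-in for volatile=True.
with torch.no_grad():
    y = float_tensor * 2
```

requires_grad is a per-tensor flag saying gradients should flow to that tensor, while torch.no_grad() turns off graph building for everything computed inside the block.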

Also, when building the warpctc Python bindings for captcha-recognition, I used the CPU build of PyTorch 1.0. The GPU build, currently pytorch10-py36-cuda8.0-cudnn7.1.2, fails at runtime.

3.4 Error

ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm)

Fix: this error shows up when training inside a Docker container on a server with a batch size that is too large, so shared memory runs out (Docker limits shm). The solution is to lower the DataLoader's num_workers.

RuntimeError: DataLoader worker (pid 27) is killed by signal: Killed. Details are lost due to multiprocessing. Rerunning with num_workers=0 may give better error trace.
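
A minimal sketch of the num_workers fix, using a made-up in-memory dataset. Setting num_workers=0 is the extreme case of lowering it and is also what the second error message suggests for getting a clearer trace.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder data: 64 samples with 3 features and a binary label.
dataset = TensorDataset(torch.randn(64, 3), torch.randint(0, 2, (64,)))

# num_workers=0 loads batches in the main process: no worker processes,
# so no batches are passed between processes through shared memory.
loader = DataLoader(dataset, batch_size=16, num_workers=0)
n_batches = sum(1 for _ in loader)
```

Another commonly used option, not tried in these notes, is to give the container more shared memory with the --shm-size flag of docker run.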


    Original author: 咖喱
    Original URL: https://zhuanlan.zhihu.com/p/34811074