Where do workers and parameter servers reside in Distributed TensorFlow?

January 6, 2023

In this post, someone mentioned:

Also, there’s no built-in distinction between worker and ps devices —
it’s just a convention that variables get assigned to ps devices, and
ops are assigned to worker devices.

In another post, someone mentioned:

TL;DR: TensorFlow doesn’t know anything about “parameter servers”, but
instead it supports running graphs across multiple devices in
different processes. Some of these processes have devices whose names
start with "/job:ps", and these hold the variables. The workers drive
the training process, and when they run the train_op they will cause
work to happen on the "/job:ps" devices, which will update the shared
variables.

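Questions:

> Do variables in ps reside on the CPU or GPU? Also, are there any performance gains if "/job:ps" resides on CPU or GPU?
> Do the lower level libraries decide where to place a variable or operation?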

Best answer

Do variables in ps reside on the CPU or GPU? Also, are there any performance gains if “/job:ps” resides on CPU or GPU?

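You can pin the ps job to either one (with an exception, see below), but pinning it to a GPU is impractical. A ps is really just storage for the parameters plus the ops that update them. A CPU device can have far more memory (main RAM) than a GPU, and it is fast enough to update the parameters as the gradients come in. In most cases, matrix multiplications, convolutions, and other expensive ops are done by the workers, so placing the workers on GPUs makes sense. Placing a ps on a GPU is a waste of resources unless the ps job is doing something very specific and expensive.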
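The convention described above (variables on CPU-backed ps tasks, compute ops on a GPU worker) can be sketched without TensorFlow. The following is a minimal illustration in plain Python, not TensorFlow's own code: `make_device_chooser` and the op names are hypothetical, and the round-robin policy is an assumption loosely modeled on what `tf.train.replica_device_setter` does by default in TensorFlow 1.x.

```python
from itertools import cycle


def make_device_chooser(num_ps_tasks, worker_device="/job:worker/task:0/gpu:0"):
    """Return a function that picks a device string for an op.

    Hypothetical helper: variables are spread round-robin over the
    CPU of each ps task, everything else goes to the worker device.
    """
    ps_devices = cycle(
        ["/job:ps/task:%d/cpu:0" % i for i in range(num_ps_tasks)]
    )

    def choose(is_variable):
        # Variables take the next ps task in turn; compute ops stay
        # on the worker's GPU.
        return next(ps_devices) if is_variable else worker_device

    return choose


choose = make_device_chooser(num_ps_tasks=2)
placement = {
    "weights": choose(is_variable=True),
    "biases": choose(is_variable=True),
    "matmul": choose(is_variable=False),
    "embedding": choose(is_variable=True),
}

for name, device in placement.items():
    print(name, "->", device)
```

Nothing here is enforced by the runtime. As the quoted posts say, "ps" and "worker" are only job names, and the split is a convention expressed through device strings.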

However: TensorFlow currently has no GPU kernel for integer variables, so the following code fails when TensorFlow tries to place the variable i on GPU #0:

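import tensorflow as tf  # TensorFlow 1.x API (tf.Session)

with tf.device("/gpu:0"):
  i = tf.Variable(3)  # integer variable: no GPU kernel available

with tf.Session() as sess:
  sess.run(i.initializer)   # Fails!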

with the following message:

Could not satisfy explicit device specification '/device:GPU:0' 
because no supported kernel for GPU devices is available.

In that case there is no real choice of device for such parameters, and hence for the parameter servers: CPU only.
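One way to picture this failure is as a lookup in a kernel registry keyed by op, dtype, and device type. The sketch below is a toy model in plain Python, not TensorFlow's implementation: `KERNELS`, `place`, and the registry entries are illustrative assumptions. The `allow_soft_placement` parameter is named after the `tf.ConfigProto` option in TensorFlow 1.x, which lets the runtime fall back to another device instead of raising an error.

```python
# Toy kernel registry: which device types have a kernel for (op, dtype).
# The entries are illustrative assumptions, not TensorFlow's real registry.
KERNELS = {
    ("Variable", "float32"): {"CPU", "GPU"},
    ("Variable", "int32"): {"CPU"},
}


def place(op, dtype, requested, allow_soft_placement=False):
    """Return the device type the op ends up on, or raise if impossible."""
    supported = KERNELS.get((op, dtype), set())
    if requested in supported:
        return requested
    if allow_soft_placement and "CPU" in supported:
        # Fall back to the CPU rather than failing.
        return "CPU"
    raise ValueError(
        "Could not satisfy explicit device specification %r because no "
        "supported kernel for %s devices is available." % (requested, requested)
    )


print(place("Variable", "float32", "GPU"))
print(place("Variable", "int32", "GPU", allow_soft_placement=True))
```

Calling `place("Variable", "int32", "GPU")` without the fallback raises, which mirrors the error shown above.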

Do the lower level libraries decide where to place a variable or operation?

If I understand the question correctly, the node placement rules are pretty simple:

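> If a node was already placed on a device in a previous run of the graph, it stays on that device.
> Otherwise, if the user pinned the node to a device via tf.device, the placer puts it on that device.
> Otherwise, it defaults to GPU #0, or to the CPU if there is no GPU.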
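The three rules above can be read as a simple fallback chain. The following plain-Python sketch restates them under that reading. `place_node` and its arguments are hypothetical names chosen for illustration, and the real placer handles more cases than this.

```python
def place_node(node, previous_placements, pinned_devices, num_gpus):
    """Pick a device for `node` by applying the three rules in order."""
    # Rule 1: a node placed in a previous run stays where it was.
    if node in previous_placements:
        return previous_placements[node]
    # Rule 2: an explicit pin (as with tf.device) is honored.
    if node in pinned_devices:
        return pinned_devices[node]
    # Rule 3: default to GPU #0, or to the CPU if there is no GPU.
    return "/gpu:0" if num_gpus > 0 else "/cpu:0"


previous = {"w": "/cpu:0"}
pinned = {"w": "/gpu:0", "b": "/cpu:0"}

print(place_node("w", previous, pinned, num_gpus=1))       # earlier placement wins
print(place_node("b", previous, pinned, num_gpus=1))       # explicit pin
print(place_node("matmul", previous, pinned, num_gpus=1))  # default with a GPU
print(place_node("matmul", previous, pinned, num_gpus=0))  # default without a GPU
```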

The TensorFlow whitepaper also describes a more sophisticated dynamic placer, but it is not part of the open-source version of TensorFlow right now.