TensorFlow Debug: InvalidArgumentError: Cannot assign a device for operation 'gradients/embedding_lookup_grad/ToInt32': Could not satisfy explicit device specification '' because the node was colocated

December 27, 2023 · Source: 大胖子球花

The original code was:

#Learning Algorithm for CADE
# config = tf.ConfigProto(allow_soft_placement = True)
sess = tf.InteractiveSession()
maxIter = 100
ite = int(0)
sess.run(tf.global_variables_initializer())
sess.run(tf.local_variables_initializer())
while ite<maxIter:
    t1 = time()
    print('Iteration%d start at %.4f...'%(ite,t1))
    for i in range(train_usernums):
        _loss,_ = sess.run((loss,train),feed_dict={x:trainuimat.todense(),v:trainuni,_drop_rate:drop_rate})
        print('\t loss:%f'%(_loss))
    out = sess.run(out2,feed_dict={x:validuimat.todense(),v:validuni,_drop_rate:0})
    out = out*(validuimat.todense()==0)
    out = np.argsort(out)[:,::-1]
    for _k in [1,5,10]:
        _MAP = MAP(testuidict,out,_k)
        print('Iteration%d :  MAP@%d %f'%(ite,_k,_MAP))
    print('Iteration%d used time:%.4f s'%(ite,time()-t1))
    ite+=1

Running it raised this error:

---------------------------------------------------------------------------
InvalidArgumentError                      Traceback (most recent call last)
~/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py in _do_call(self, fn, *args)
   1322     try:
-> 1323       return fn(*args)
   1324     except errors.OpError as e:

~/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py in _run_fn(session, feed_dict, fetch_list, target_list, options, run_metadata)
   1301                                    feed_dict, fetch_list, target_list,
-> 1302                                    status, run_metadata)
   1303 

~/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py in __exit__(self, type_arg, value_arg, traceback_arg)
    472             compat.as_text(c_api.TF_Message(self.status.status)),
--> 473             c_api.TF_GetCode(self.status.status))
    474     # Delete the underlying status object from memory otherwise it stays alive

InvalidArgumentError: Cannot assign a device for operation 'gradients/embedding_lookup_grad/ToInt32': Could not satisfy explicit device specification '' because the node was colocated with a group of nodes that required incompatible device '/job:localhost/replica:0/task:0/device:GPU:0'
Colocation Debug Info:
Colocation group had the following types and devices: 
Const: GPU CPU 
VariableV2: GPU CPU 
UnsortedSegmentSum: GPU CPU 
Identity: GPU CPU 
L2Loss: GPU CPU 
Shape: GPU CPU 
Mul: GPU CPU 
Gather: GPU CPU 
SparseApplyAdagrad: CPU 
Cast: GPU CPU 
Unique: GPU CPU 
StridedSlice: GPU CPU 
     [[Node: gradients/embedding_lookup_grad/ToInt32 = Cast[DstT=DT_INT32, SrcT=DT_INT64, _class=["loc:@EmbeddingParams"]](gradients/embedding_lookup_grad/Shape)]]

During handling of the above exception, another exception occurred:

InvalidArgumentError                      Traceback (most recent call last)
<ipython-input-6-01dec1a8f42a> in <module>()
     10     print('Iteration%d start at %.4f...'%(ite,t1))
     11     for i in range(train_usernums):
---> 12         _loss,_ = sess.run((loss,train),feed_dict={x:trainuimat.todense(),v:trainuni,_drop_rate:drop_rate})
     13         print('\t loss:%f'%(_loss))
     14     out = sess.run(out2,feed_dict={x:validuimat.todense(),v:validuni,_drop_rate:0})

~/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py in run(self, fetches, feed_dict, options, run_metadata)
    887     try:
    888       result = self._run(None, fetches, feed_dict, options_ptr,
--> 889                          run_metadata_ptr)
    890       if run_metadata:
    891         proto_data = tf_session.TF_GetBuffer(run_metadata_ptr)

~/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py in _run(self, handle, fetches, feed_dict, options, run_metadata)
   1118     if final_fetches or final_targets or (handle and feed_dict_tensor):
   1119       results = self._do_run(handle, final_targets, final_fetches,
-> 1120                              feed_dict_tensor, options, run_metadata)
   1121     else:
   1122       results = []

~/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py in _do_run(self, handle, target_list, fetch_list, feed_dict, options, run_metadata)
   1315     if handle is None:
   1316       return self._do_call(_run_fn, self._session, feeds, fetches, targets,
-> 1317                            options, run_metadata)
   1318     else:
   1319       return self._do_call(_prun_fn, self._session, handle, feeds, fetches)

~/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py in _do_call(self, fn, *args)
   1334         except KeyError:
   1335           pass
-> 1336       raise type(e)(node_def, op, message)
   1337 
   1338   def _extend_graph(self):

InvalidArgumentError: Cannot assign a device for operation 'gradients/embedding_lookup_grad/ToInt32': Could not satisfy explicit device specification '' because the node was colocated with a group of nodes that required incompatible device '/job:localhost/replica:0/task:0/device:GPU:0'
Colocation Debug Info:
Colocation group had the following types and devices: 
Const: GPU CPU 
VariableV2: GPU CPU 
UnsortedSegmentSum: GPU CPU 
Identity: GPU CPU 
L2Loss: GPU CPU 
Shape: GPU CPU 
Mul: GPU CPU 
Gather: GPU CPU 
SparseApplyAdagrad: CPU 
Cast: GPU CPU 
Unique: GPU CPU 
StridedSlice: GPU CPU 
     [[Node: gradients/embedding_lookup_grad/ToInt32 = Cast[DstT=DT_INT32, SrcT=DT_INT64, _class=["loc:@EmbeddingParams"]](gradients/embedding_lookup_grad/Shape)]]

Caused by op 'gradients/embedding_lookup_grad/ToInt32', defined at:
  File "/home/likewise-open/SENSETIME/liupengcheng/anaconda3/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/likewise-open/SENSETIME/liupengcheng/anaconda3/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/likewise-open/SENSETIME/liupengcheng/anaconda3/lib/python3.6/site-packages/ipykernel_launcher.py", line 16, in <module>
    app.launch_new_instance()
  File "/home/likewise-open/SENSETIME/liupengcheng/anaconda3/lib/python3.6/site-packages/traitlets/config/application.py", line 658, in launch_instance
    app.start()
  File "/home/likewise-open/SENSETIME/liupengcheng/anaconda3/lib/python3.6/site-packages/ipykernel/kernelapp.py", line 486, in start
    self.io_loop.start()
  File "/home/likewise-open/SENSETIME/liupengcheng/anaconda3/lib/python3.6/site-packages/tornado/platform/asyncio.py", line 127, in start
    self.asyncio_loop.run_forever()
  File "/home/likewise-open/SENSETIME/liupengcheng/anaconda3/lib/python3.6/asyncio/base_events.py", line 422, in run_forever
    self._run_once()
  File "/home/likewise-open/SENSETIME/liupengcheng/anaconda3/lib/python3.6/asyncio/base_events.py", line 1432, in _run_once
    handle._run()
  File "/home/likewise-open/SENSETIME/liupengcheng/anaconda3/lib/python3.6/asyncio/events.py", line 145, in _run
    self._callback(*self._args)
  File "/home/likewise-open/SENSETIME/liupengcheng/anaconda3/lib/python3.6/site-packages/tornado/ioloop.py", line 759, in _run_callback
    ret = callback()
  File "/home/likewise-open/SENSETIME/liupengcheng/anaconda3/lib/python3.6/site-packages/tornado/stack_context.py", line 276, in null_wrapper
    return fn(*args, **kwargs)
  File "/home/likewise-open/SENSETIME/liupengcheng/anaconda3/lib/python3.6/site-packages/zmq/eventloop/zmqstream.py", line 536, in <lambda>
    self.io_loop.add_callback(lambda : self._handle_events(self.socket, 0))
  File "/home/likewise-open/SENSETIME/liupengcheng/anaconda3/lib/python3.6/site-packages/zmq/eventloop/zmqstream.py", line 450, in _handle_events
    self._handle_recv()
  File "/home/likewise-open/SENSETIME/liupengcheng/anaconda3/lib/python3.6/site-packages/zmq/eventloop/zmqstream.py", line 480, in _handle_recv
    self._run_callback(callback, msg)
  File "/home/likewise-open/SENSETIME/liupengcheng/anaconda3/lib/python3.6/site-packages/zmq/eventloop/zmqstream.py", line 432, in _run_callback
    callback(*args, **kwargs)
  File "/home/likewise-open/SENSETIME/liupengcheng/anaconda3/lib/python3.6/site-packages/tornado/stack_context.py", line 276, in null_wrapper
    return fn(*args, **kwargs)
  File "/home/likewise-open/SENSETIME/liupengcheng/anaconda3/lib/python3.6/site-packages/ipykernel/kernelbase.py", line 283, in dispatcher
    return self.dispatch_shell(stream, msg)
  File "/home/likewise-open/SENSETIME/liupengcheng/anaconda3/lib/python3.6/site-packages/ipykernel/kernelbase.py", line 233, in dispatch_shell
    handler(stream, idents, msg)
  File "/home/likewise-open/SENSETIME/liupengcheng/anaconda3/lib/python3.6/site-packages/ipykernel/kernelbase.py", line 399, in execute_request
    user_expressions, allow_stdin)
  File "/home/likewise-open/SENSETIME/liupengcheng/anaconda3/lib/python3.6/site-packages/ipykernel/ipkernel.py", line 208, in do_execute
    res = shell.run_cell(code, store_history=store_history, silent=silent)
  File "/home/likewise-open/SENSETIME/liupengcheng/anaconda3/lib/python3.6/site-packages/ipykernel/zmqshell.py", line 537, in run_cell
    return super(ZMQInteractiveShell, self).run_cell(*args, **kwargs)
  File "/home/likewise-open/SENSETIME/liupengcheng/anaconda3/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2662, in run_cell
    raw_cell, store_history, silent, shell_futures)
  File "/home/likewise-open/SENSETIME/liupengcheng/anaconda3/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2785, in _run_cell
    interactivity=interactivity, compiler=compiler, result=result)
  File "/home/likewise-open/SENSETIME/liupengcheng/anaconda3/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2903, in run_ast_nodes
    if self.run_code(code, result):
  File "/home/likewise-open/SENSETIME/liupengcheng/anaconda3/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2963, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-4-d4a6590ba166>", line 27, in <module>
    train = optimizer.minimize(loss)
  File "/home/likewise-open/SENSETIME/liupengcheng/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/optimizer.py", line 343, in minimize
    grad_loss=grad_loss)
  File "/home/likewise-open/SENSETIME/liupengcheng/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/optimizer.py", line 414, in compute_gradients
    colocate_gradients_with_ops=colocate_gradients_with_ops)
  File "/home/likewise-open/SENSETIME/liupengcheng/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/gradients_impl.py", line 581, in gradients
    grad_scope, op, func_call, lambda: grad_fn(op, *out_grads))
  File "/home/likewise-open/SENSETIME/liupengcheng/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/gradients_impl.py", line 353, in _MaybeCompile
    return grad_fn()  # Exit early
  File "/home/likewise-open/SENSETIME/liupengcheng/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/gradients_impl.py", line 581, in <lambda>
    grad_scope, op, func_call, lambda: grad_fn(op, *out_grads))
  File "/home/likewise-open/SENSETIME/liupengcheng/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/array_grad.py", line 367, in _GatherGrad
    params_shape = math_ops.to_int32(params_shape)
  File "/home/likewise-open/SENSETIME/liupengcheng/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/math_ops.py", line 826, in to_int32
    return cast(x, dtypes.int32, name=name)
  File "/home/likewise-open/SENSETIME/liupengcheng/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/math_ops.py", line 745, in cast
    return gen_math_ops.cast(x, base_type, name=name)
  File "/home/likewise-open/SENSETIME/liupengcheng/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/gen_math_ops.py", line 892, in cast
    "Cast", x=x, DstT=DstT, name=name)
  File "/home/likewise-open/SENSETIME/liupengcheng/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/home/likewise-open/SENSETIME/liupengcheng/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 2956, in create_op
    op_def=op_def)
  File "/home/likewise-open/SENSETIME/liupengcheng/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1470, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

...which was originally created as op 'embedding_lookup', defined at:
  File "/home/likewise-open/SENSETIME/liupengcheng/anaconda3/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
[elided 23 identical lines from previous traceback]
  File "/home/likewise-open/SENSETIME/liupengcheng/anaconda3/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2963, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-4-d4a6590ba166>", line 11, in <module>
    ve = tf.nn.embedding_lookup(embedding_params,v)
  File "/home/likewise-open/SENSETIME/liupengcheng/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/embedding_ops.py", line 328, in embedding_lookup
    transform_fn=None)
  File "/home/likewise-open/SENSETIME/liupengcheng/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/embedding_ops.py", line 150, in _embedding_lookup_and_transform
    result = _clip(_gather(params[0], ids, name=name), ids, max_norm)
  File "/home/likewise-open/SENSETIME/liupengcheng/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/embedding_ops.py", line 54, in _gather
    return array_ops.gather(params, ids, name=name)
  File "/home/likewise-open/SENSETIME/liupengcheng/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/array_ops.py", line 2486, in gather
    params, indices, validate_indices=validate_indices, name=name)
  File "/home/likewise-open/SENSETIME/liupengcheng/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/gen_array_ops.py", line 1834, in gather
    validate_indices=validate_indices, name=name)
  File "/home/likewise-open/SENSETIME/liupengcheng/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/home/likewise-open/SENSETIME/liupengcheng/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 2956, in create_op
    op_def=op_def)
  File "/home/likewise-open/SENSETIME/liupengcheng/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1470, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

InvalidArgumentError (see above for traceback): Cannot assign a device for operation 'gradients/embedding_lookup_grad/ToInt32': Could not satisfy explicit device specification '' because the node was colocated with a group of nodes that required incompatible device '/job:localhost/replica:0/task:0/device:GPU:0'
Colocation Debug Info:
Colocation group had the following types and devices: 
Const: GPU CPU 
VariableV2: GPU CPU 
UnsortedSegmentSum: GPU CPU 
Identity: GPU CPU 
L2Loss: GPU CPU 
Shape: GPU CPU 
Mul: GPU CPU 
Gather: GPU CPU 
SparseApplyAdagrad: CPU 
Cast: GPU CPU 
Unique: GPU CPU 
StridedSlice: GPU CPU 
     [[Node: gradients/embedding_lookup_grad/ToInt32 = Cast[DstT=DT_INT32, SrcT=DT_INT64, _class=["loc:@EmbeddingParams"]](gradients/embedding_lookup_grad/Shape)]]

Googling the message led to: https://github.com/tensorflow/tensorflow/issues/2292

According to that thread, it is a GPU device placement configuration problem:

I just follow mrry's suggestion here, adding "allow_soft_placement=True" as follows:

config = tf.ConfigProto(allow_soft_placement = True)
sess = tf.Session(config = config)

Then it works.

I reviewed the Using GPUs in tutorial. It mentions adding "allow_soft_placement" under the error "Could not satisfy explicit device specification '/gpu:X' ". But it not mentions it could also solve the error "no supported kernel for GPU devices is available". Maybe it's better to add this in tutorial text in order to avoid confusing future users.
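
As a rough mental model of why soft placement helps here, the sketch below imitates the "Colocation Debug Info" from the error in plain Python. It is not TensorFlow code and not how the real placer is implemented; `SUPPORTED` and `place_group` are hypothetical names, and the op-to-device table is copied from a few rows of the error message. The idea is that ops in one colocation group must share a device, and `SparseApplyAdagrad` only has a CPU kernel, so a GPU request for the group can only succeed by falling back to CPU.

```python
# Toy model of colocation-aware device placement. This is NOT TensorFlow code;
# every name here is hypothetical and only mirrors the debug info in the error.

# Devices each op type has a kernel for, copied from "Colocation Debug Info".
SUPPORTED = {
    "VariableV2": {"GPU", "CPU"},
    "Gather": {"GPU", "CPU"},
    "Cast": {"GPU", "CPU"},
    "SparseApplyAdagrad": {"CPU"},  # the only CPU-only op in the group
}


def place_group(op_types, requested, allow_soft_placement):
    """Pick one device for a group of ops that must be placed together."""
    # A colocated group can only run where every member has a kernel.
    feasible = set.intersection(*(SUPPORTED[t] for t in op_types))
    if requested in feasible:
        return requested
    if allow_soft_placement and feasible:
        # Soft placement: fall back to a device the whole group supports.
        return sorted(feasible)[0]
    raise ValueError(
        "Cannot assign a device: group supports %s, requested %s"
        % (sorted(feasible), requested)
    )


group = ["VariableV2", "Gather", "Cast", "SparseApplyAdagrad"]
print(place_group(group, "GPU", allow_soft_placement=True))  # CPU
```

With `allow_soft_placement=False` the same call raises, which corresponds to the failure in the traceback above.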

After adding the allow_soft_placement config (the commented-out line in the original code), I got a different error:

InvalidArgumentError                      Traceback (most recent call last)
~/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py in _do_call(self, fn, *args)
   1322     try:
-> 1323       return fn(*args)
   1324     except errors.OpError as e:

~/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py in _run_fn(session, feed_dict, fetch_list, target_list, options, run_metadata)
   1301                                    feed_dict, fetch_list, target_list,
-> 1302                                    status, run_metadata)
   1303 

~/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py in __exit__(self, type_arg, value_arg, traceback_arg)
    472             compat.as_text(c_api.TF_Message(self.status.status)),
--> 473             c_api.TF_GetCode(self.status.status))
    474     # Delete the underlying status object from memory otherwise it stays alive

InvalidArgumentError: AttrValue must not have reference type value of float_ref
     for attr 'tensor_type'
    ; NodeDef: EmbeddingParams/Adagrad/_41 = _Recv[_start_time=0, client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_470_EmbeddingParams/Adagrad", tensor_type=DT_FLOAT_REF, _device="/job:localhost/replica:0/task:0/device:CPU:0"](^Adagrad/learning_rate/_43, ^Adagrad/update_EmbeddingParams/UnsortedSegmentSum, ^Adagrad/update_EmbeddingParams/Unique); Op<name=_Recv; signature= -> tensor:tensor_type; attr=tensor_type:type; attr=tensor_name:string; attr=send_device:string; attr=send_device_incarnation:int; attr=recv_device:string; attr=client_terminated:bool,default=false; is_stateful=true>
     [[Node: EmbeddingParams/Adagrad/_41 = _Recv[_start_time=0, client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_470_EmbeddingParams/Adagrad", tensor_type=DT_FLOAT_REF, _device="/job:localhost/replica:0/task:0/device:CPU:0"](^Adagrad/learning_rate/_43, ^Adagrad/update_EmbeddingParams/UnsortedSegmentSum, ^Adagrad/update_EmbeddingParams/Unique)]]

During handling of the above exception, another exception occurred:

InvalidArgumentError                      Traceback (most recent call last)
<ipython-input-6-3171867edb26> in <module>()
     10     print('Iteration%d start at %s...'%(ite,t1))
     11     for i in range(train_usernums):
---> 12         _loss,_ = sess.run((loss,train),feed_dict={x:trainuimat.todense(),v:trainuni,_drop_rate:drop_rate})
     13         print('\t loss:%f'%(_loss))
     14     out = sess.run(out2,feed_dict={x:validuimat.todense(),v:validuni,_drop_rate:0})

~/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py in run(self, fetches, feed_dict, options, run_metadata)
    887     try:
    888       result = self._run(None, fetches, feed_dict, options_ptr,
--> 889                          run_metadata_ptr)
    890       if run_metadata:
    891         proto_data = tf_session.TF_GetBuffer(run_metadata_ptr)

~/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py in _run(self, handle, fetches, feed_dict, options, run_metadata)
   1118     if final_fetches or final_targets or (handle and feed_dict_tensor):
   1119       results = self._do_run(handle, final_targets, final_fetches,
-> 1120                              feed_dict_tensor, options, run_metadata)
   1121     else:
   1122       results = []

~/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py in _do_run(self, handle, target_list, fetch_list, feed_dict, options, run_metadata)
   1315     if handle is None:
   1316       return self._do_call(_run_fn, self._session, feeds, fetches, targets,
-> 1317                            options, run_metadata)
   1318     else:
   1319       return self._do_call(_prun_fn, self._session, handle, feeds, fetches)

~/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py in _do_call(self, fn, *args)
   1334         except KeyError:
   1335           pass
-> 1336       raise type(e)(node_def, op, message)
   1337 
   1338   def _extend_graph(self):

InvalidArgumentError: AttrValue must not have reference type value of float_ref
     for attr 'tensor_type'
    ; NodeDef: EmbeddingParams/Adagrad/_41 = _Recv[_start_time=0, client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_470_EmbeddingParams/Adagrad", tensor_type=DT_FLOAT_REF, _device="/job:localhost/replica:0/task:0/device:CPU:0"](^Adagrad/learning_rate/_43, ^Adagrad/update_EmbeddingParams/UnsortedSegmentSum, ^Adagrad/update_EmbeddingParams/Unique); Op<name=_Recv; signature= -> tensor:tensor_type; attr=tensor_type:type; attr=tensor_name:string; attr=send_device:string; attr=send_device_incarnation:int; attr=recv_device:string; attr=client_terminated:bool,default=false; is_stateful=true>
     [[Node: EmbeddingParams/Adagrad/_41 = _Recv[_start_time=0, client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_470_EmbeddingParams/Adagrad", tensor_type=DT_FLOAT_REF, _device="/job:localhost/replica:0/task:0/device:CPU:0"](^Adagrad/learning_rate/_43, ^Adagrad/update_EmbeddingParams/UnsortedSegmentSum, ^Adagrad/update_EmbeddingParams/Unique)]]

Googling this one led to: https://github.com/tensorflow/tensorflow/issues/13880

I used one of the approaches suggested there: replacing InteractiveSession with a regular Session. That fixed this error:

#Learning Algorithm for CADE
config = tf.ConfigProto(allow_soft_placement = True)
with tf.Session(config=config) as sess:
    maxIter = 100
    ite = int(0)
    sess.run(tf.global_variables_initializer())
    sess.run(tf.local_variables_initializer())
    while ite<maxIter:
        t1 = time()
        print('Iteration%d start at %.4f s...'%(ite,t1))
        for i in range(train_usernums):
            _loss,_ = sess.run((loss,train),feed_dict={x:trainuimat.todense(),v:trainuni,_drop_rate:drop_rate})
            print('\t loss:%f'%(_loss))
        out = sess.run(out2,feed_dict={x:validuimat.todense(),v:validuni,_drop_rate:0})
        out = out*(validuimat.todense()==0)
        out = np.argsort(out)[:,::-1]
        for _k in [1,5,10]:
            _MAP = MAP(testuidict,out,_k)
            print('Iteration%d :  MAP@%d %f'%(ite,_k,_MAP))
        print('Iteration%d used time:%.4f s'%(ite,time()-t1))
        ite+=1

But then a new problem appeared:

InvalidArgumentError: indices[69165,0] = 69166 is not in [0, 69166)
     [[Node: embedding_lookup = Gather[Tindices=DT_INT32, Tparams=DT_FLOAT, _class=["loc:@EmbeddingParams"], validate_indices=true, _device="/job:localhost/replica:0/task:0/device:CPU:0"](EmbeddingParams/read, _arg_Placeholder_1_0_1)]]

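The Node in the message shows that the embedding lookup is at fault: index 69166 was requested from a table whose valid row indices are 0 to 69165.

I remembered that 献文 mentioned yesterday that the embedding's input dimension needs a +1, so I changed the variable definition: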

embedding_params = tf.get_variable('EmbeddingParams',shape=[train_usernums+1,K],dtype=tf.float32,
                                   initializer=tf.glorot_normal_initializer(),
                                   regularizer=tf.contrib.layers.l2_regularizer(lamda))
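
The off-by-one can be checked without TensorFlow. The sketch below is a minimal illustration, assuming (as the error message suggests, though the post does not confirm it) that the ids fed to `v` are 1-based; `required_rows` and `out_of_range` are hypothetical helpers, and the `ids` list is made up apart from the value 69166 taken from the error.

```python
def required_rows(ids):
    """Smallest embedding table height that makes every id a valid row index."""
    # Gather accepts indices in [0, rows), so the largest id must equal rows - 1.
    return max(ids) + 1


def out_of_range(ids, rows):
    """Return the ids that a lookup into a table with `rows` rows would reject."""
    return [i for i in ids if not 0 <= i < rows]


rows = 69166                # table height reported in the error message
ids = [1, 2, 69165, 69166]  # hypothetical 1-based ids; 69166 comes from the error
print(out_of_range(ids, rows))  # [69166]
print(required_rows(ids))       # 69167
```

Running a check like this on the real id array before building the graph would show whether `train_usernums + 1` rows is enough.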

And yet another problem:

InternalError: Dst tensor is not initialized.
     [[Node: embedding_lookup/_15 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_38_embedding_lookup", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]
     [[Node: add/_33 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_414_add", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

A search suggested that GPU memory was exhausted. I guessed it might be because I had not enabled:

config.gpu_options.allow_growth = True

I added it, but it made no difference.

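Then I checked with nvidia-smi and found that more than 3 GB of GPU memory was in use. That is when it dawned on me: every sess.run call in the inner loop fed the entire dense matrix (trainuimat.todense()) and never used the loop index i. Shouldn't I be feeding the placeholder one row at a time instead of passing everything in at once?

With that change the problem was solved. What a silly mistake.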
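
The post does not show the final feeding code, so the following is only a sketch of one way to slice the input. `row_batches` and `dense_mib` are hypothetical helpers, the sizes are invented for the arithmetic, and the commented usage assumes that `trainuimat` is a row-sliceable SciPy sparse matrix and that `trainuni` is aligned with it row by row, neither of which is stated in the post.

```python
def row_batches(n_rows, batch_size):
    """Yield (start, stop) pairs that cover rows 0..n_rows-1 in order."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, n_rows, batch_size):
        yield start, min(start + batch_size, n_rows)


def dense_mib(n_rows, n_cols, bytes_per_value=4):
    """Size in MiB of a dense block of this shape (float32 by default)."""
    return n_rows * n_cols * bytes_per_value / 2**20


print(list(row_batches(10, 4)))  # [(0, 4), (4, 8), (8, 10)]

# Hypothetical sizes, chosen only to illustrate full-matrix vs per-batch cost.
n_users, n_items, batch_size = 1000, 5000, 128
print(dense_mib(n_users, n_items))     # whole matrix densified at once
print(dense_mib(batch_size, n_items))  # one batch densified at a time

# Inside the training loop this might be used roughly as follows (untested):
# for start, stop in row_batches(train_usernums, batch_size):
#     feed = {x: trainuimat[start:stop].todense(),
#             v: trainuni[start:stop],
#             _drop_rate: drop_rate}
#     _loss, _ = sess.run((loss, train), feed_dict=feed)
```

Setting `batch_size` to 1 gives the row-at-a-time feeding described above; a larger batch usually trades a little memory for fewer `sess.run` calls.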

    Original author: 大胖子球花
    Original URL: https://www.cnblogs.com/chason95/articles/9529027.html