The latest commit solved the following bug:
```
Instructions for updating:
renamed to `run`
  0%|          | 0/16 [00:37<?, ?it/s]
epochs:   0%|          | 0/6 [00:37<?, ?it/s]
Traceback (most recent call last):
  File "example_t5.py", line 47, in <module>
    trainer.train(model, strategy, tokenizer, inputs)
  File "/root/ttt/ttt/t2t_trainer.py", line 227, in train
    epoch_total_loss += loss.numpy()
  File "/anaconda3/envs/torch-xla-1.6/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1063, in numpy
    maybe_arr = self._numpy()  # pylint: disable=protected-access
  File "/anaconda3/envs/torch-xla-1.6/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1031, in _numpy
    six.raise_from(core._status_to_exception(e.code, e.message), None)  # pylint: disable=protected-access
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.UnavailableError: Socket closed
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/anaconda3/envs/torch-xla-1.6/lib/python3.6/site-packages/tensorflow/python/distribute/tpu_strategy.py", line 540, in async_wait
    context.async_wait()
  File "/anaconda3/envs/torch-xla-1.6/lib/python3.6/site-packages/tensorflow/python/eager/context.py", line 2319, in async_wait
    context().sync_executors()
  File "/anaconda3/envs/torch-xla-1.6/lib/python3.6/site-packages/tensorflow/python/eager/context.py", line 658, in sync_executors
    pywrap_tfe.TFE_ContextSyncExecutors(self._context_handle)
tensorflow.python.framework.errors_impl.UnavailableError: 2 root error(s) found.
  (0) Unavailable: Socket closed
  (1) Invalid argument: Unable to find a context_id matching the specified one (14250917626996280268). Perhaps the worker was restarted, or the context was GC'd?
0 successful operations.
0 derived errors ignored.
2020-10-23 19:51:06.239763: W 3876 ./tensorflow/core/distributed_runtime/eager/destroy_tensor_handle_node.h:57] Ignoring an error encountered when deleting remote tensors handles: Invalid argument: Unable to find a context_id matching the specified one (14250917626996280268). Perhaps the worker was restarted, or the context was GC'd?
Additional GRPC error information from remote target /job:worker/replica:0/task:0:
:{"created":"@1603482666.236322988","description":"Error received from peer ipv4:x.x.x.x:8470","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Unable to find a context_id matching the specified one (14250917626996280268). Perhaps the worker was restarted, or the context was GC'd?","grpc_status":3}
2020-10-23 19:51:06.241849: W 3781 tensorflow/core/distributed_runtime/eager/remote_tensor_handle_data.cc:76] Unable to destroy remote tensor handles. If you are running a tf.function, it usually indicates some op in the graph gets an error: Unable to find a context_id matching the specified one (14250917626996280268). Perhaps the worker was restarted, or the context was GC'd?
```
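For context, here is a minimal, self-contained sketch of the kind of TPU training loop the traceback points at. This is an illustration only, not the actual code in `ttt/t2t_trainer.py`; the model, dataset, and variable names are placeholders. The relevant point is the per-step `loss.numpy()` call: it forces a blocking device-to-host transfer, so when the gRPC connection to the TPU worker drops, that is the line where `UnavailableError: Socket closed` surfaces.

```python
import tensorflow as tf

# Connect to a TPU worker (assumes one is reachable from this host).
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu="")
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
# TF >= 2.4; on TF 2.3 use tf.distribute.experimental.TPUStrategy instead.
strategy = tf.distribute.TPUStrategy(resolver)

with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(2)])  # placeholder model
    optimizer = tf.keras.optimizers.Adam(1e-3)
    loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

@tf.function
def train_step(x, y):
    def step_fn(x, y):
        with tf.GradientTape() as tape:
            loss = loss_fn(y, model(x, training=True))
        grads = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
        return loss
    per_replica_loss = strategy.run(step_fn, args=(x, y))
    return strategy.reduce(tf.distribute.ReduceOp.MEAN, per_replica_loss, axis=None)

# Placeholder data; drop_remainder keeps shapes static for the TPU.
dataset = tf.data.Dataset.from_tensor_slices(
    (tf.random.normal([64, 4]), tf.zeros([64], dtype=tf.int32))
).batch(8, drop_remainder=True)
dist_dataset = strategy.experimental_distribute_dataset(dataset)

epoch_total_loss = 0.0
for x, y in dist_dataset:
    loss = train_step(x, y)
    # Blocking host transfer: if the TPU worker socket closes, the
    # UnavailableError is raised here, matching the traceback above.
    epoch_total_loss += loss.numpy()
```

With async TPU execution, an error on the worker is often only reported at the next synchronization point, which is why the failure shows up at the `.numpy()` call (and again in `async_wait` at exit) rather than at the op that actually lost the connection.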