
TPU error in Google GCP - fixed #2

@wangcongcong123

Description

The latest commit fixes the following bug, which showed up when training on a Cloud TPU in GCP:

```
Instructions for updating:
renamed to `run`
  0%|                                                  | 0/16 [00:37<?, ?it/s]
epochs:   0%|                                          | 0/6 [00:37<?, ?it/s]
Traceback (most recent call last):
  File "example_t5.py", line 47, in <module>
    trainer.train(model, strategy, tokenizer, inputs)
  File "/root/ttt/ttt/t2t_trainer.py", line 227, in train
    epoch_total_loss += loss.numpy()
  File "/anaconda3/envs/torch-xla-1.6/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1063, in numpy
    maybe_arr = self._numpy()  # pylint: disable=protected-access
  File "/anaconda3/envs/torch-xla-1.6/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1031, in _numpy
    six.raise_from(core._status_to_exception(e.code, e.message), None)  # pylint: disable=protected-access
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.UnavailableError: Socket closed
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/anaconda3/envs/torch-xla-1.6/lib/python3.6/site-packages/tensorflow/python/distribute/tpu_strategy.py", line 540, in async_wait
    context.async_wait()
  File "/anaconda3/envs/torch-xla-1.6/lib/python3.6/site-packages/tensorflow/python/eager/context.py", line 2319, in async_wait
    context().sync_executors()
  File "/anaconda3/envs/torch-xla-1.6/lib/python3.6/site-packages/tensorflow/python/eager/context.py", line 658, in sync_executors
    pywrap_tfe.TFE_ContextSyncExecutors(self._context_handle)
tensorflow.python.framework.errors_impl.UnavailableError: 2 root error(s) found.
  (0) Unavailable: Socket closed
  (1) Invalid argument: Unable to find a context_id matching the specified one (14250917626996280268). Perhaps the worker was restarted, or the context was GC'd?
0 successful operations.
0 derived errors ignored.
2020-10-23 19:51:06.239763: W    3876 ./tensorflow/core/distributed_runtime/eager/destroy_tensor_handle_node.h:57] Ignoring an error encountered when deleting remote tensors handles: Invalid argument: Unable to find a context_id matching the specified one (14250917626996280268). Perhaps the worker was restarted, or the context was GC'd?
Additional GRPC error information from remote target /job:worker/replica:0/task:0:
:{"created":"@1603482666.236322988","description":"Error received from peer ipv4:x.x.x.x:8470","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Unable to find a context_id matching the specified one (14250917626996280268). Perhaps the worker was restarted, or the context was GC'd?","grpc_status":3}
2020-10-23 19:51:06.241849: W    3781 tensorflow/core/distributed_runtime/eager/remote_tensor_handle_data.cc:76] Unable to destroy remote tensor handles. If you are running a tf.function, it usually indicates some op in the graph gets an error: Unable to find a context_id matching the specified one (14250917626996280268). Perhaps the worker was restarted, or the context was GC'd?
```
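For context on where this blows up: the traceback ends at `epoch_total_loss += loss.numpy()` in `t2t_trainer.py`. Under `TPUStrategy`, `.numpy()` pulls the step loss back from the remote TPU worker over gRPC, so a dropped worker connection surfaces at exactly that call as `UnavailableError: Socket closed`. Below is a minimal sketch (not the actual commit) of a custom `TPUStrategy` training loop that keeps the loss on-device and only materializes it once per epoch; the toy model, dataset, and `train_step` names are illustrative assumptions, not this repo's code.

```python
import tensorflow as tf

# Hypothetical TPU setup; the resolver arguments depend on your GCP environment.
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu="")
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)

GLOBAL_BATCH = 16

with strategy.scope():
    # Toy model standing in for the real T5 model used in example_t5.py.
    model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
    optimizer = tf.keras.optimizers.Adam(1e-3)
    loss_fn = tf.keras.losses.MeanSquaredError(
        reduction=tf.keras.losses.Reduction.NONE)

# Toy data standing in for the tokenized training inputs.
dataset = tf.data.Dataset.from_tensor_slices(
    (tf.random.normal([64, 8]), tf.random.normal([64, 1]))).batch(GLOBAL_BATCH)
dist_dataset = strategy.experimental_distribute_dataset(dataset)

@tf.function
def train_step(batch):
    def step_fn(inputs):
        features, labels = inputs
        with tf.GradientTape() as tape:
            per_example_loss = loss_fn(labels, model(features, training=True))
            loss = tf.nn.compute_average_loss(
                per_example_loss, global_batch_size=GLOBAL_BATCH)
        grads = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
        return loss
    per_replica_loss = strategy.run(step_fn, args=(batch,))
    # Aggregate across TPU cores on-device; no host round-trip here.
    return strategy.reduce(tf.distribute.ReduceOp.SUM,
                           per_replica_loss, axis=None)

epoch_total_loss = tf.constant(0.0)
for batch in dist_dataset:
    epoch_total_loss += train_step(batch)      # stays a tensor every step
print("epoch loss:", float(epoch_total_loss))  # single device-to-host transfer
```

The point of the pattern is simply to defer host synchronization: `strategy.reduce` keeps the per-replica aggregation on the accelerator, and the single `float(...)` at epoch end is the only place the host blocks on the TPU, so there are far fewer per-step device-to-host round-trips that can fail the way the traceback above does.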
