training ending prematurely #289
Description
I am trying to run Axon on a Supervisely dataset with the default training parameters. Here is the command-line output from the Axon container running in Docker:
[1] Successfully created the TFRecord: /wpi-data/projects/8b397942-b1e5-4dde-a69a-7a4c940396a8/train.recordLABELS ['cargo_red', 'cargo_blue', 'cargo_reflect_red', 'cargo_reflect_blue']
[1] .
[1] Successfully created the TFRecord: /wpi-data/projects/8b397942-b1e5-4dde-a69a-7a4c940396a8/eval.recordoutput_pbtxt in parse_meta.py: /wpi-data/projects/8b397942-b1e5-4dde-a69a-7a4c940396a8/map.pbtxt
[1] <open file '/wpi-data/projects/8b397942-b1e5-4dde-a69a-7a4c940396a8/map.pbtxt', mode 'w+' at 0x7f13f825be40>
[1] .
[1] Records generated
[1] 8b397942-b1e5-4dde-a69a-7a4c940396a8: Trainer extracted dataset
[1] 8b397942-b1e5-4dde-a69a-7a4c940396a8: Launching container wpilib/axon-metrics
[1] 8b397942-b1e5-4dde-a69a-7a4c940396a8: Launching container wpilib/axon-training
[1] /tensorflow/models/research/object_detection/utils/visualization_utils.py:26: UserWarning:
[1] This call to matplotlib.use() has no effect because the backend has already
[1] been chosen; matplotlib.use() must be called before pylab, matplotlib.pyplot,
[1] or matplotlib.backends is imported for the first time.
[1]
[1] The backend was originally set to 'TkAgg' by the following code:
[1] File "train.py", line 7, in <module>
[1] import modularized_model_main
[1] File "/tensorflow/models/research/modularized_model_main.py", line 10, in <module>
[1] from object_detection import model_lib
[1] File "/tensorflow/models/research/object_detection/model_lib.py", line 27, in <module>
[1] from object_detection import eval_util
[1] File "/tensorflow/models/research/object_detection/eval_util.py", line 27, in <module>
[1] from object_detection.metrics import coco_evaluation
[1] File "/tensorflow/models/research/object_detection/metrics/coco_evaluation.py", line 20, in <module>
[1] from object_detection.metrics import coco_tools
[1] File "/tensorflow/models/research/object_detection/metrics/coco_tools.py", line 47, in <module>
[1] from pycocotools import coco
[1] File "/tensorflow/models/research/pycocotools/coco.py", line 49, in <module>
[1] import matplotlib.pyplot as plt
[1] File "/usr/local/lib/python2.7/dist-packages/matplotlib/pyplot.py", line 71, in <module>
[1] from matplotlib.backends import pylab_setup
[1] File "/usr/local/lib/python2.7/dist-packages/matplotlib/backends/__init__.py", line 16, in <module>
[1] line for line in traceback.format_stack()
[1]
[1]
[1] import matplotlib; matplotlib.use('Agg') # pylint: disable=multiple-statements
[1] TensorBoard 1.12.0 at http://8975446b1d29:6006 (Press CTRL+C to quit)
[1] WARNING:tensorflow:Forced number of epochs for all eval validations to be 1.
[1] WARNING:tensorflow:Expected number of evaluation epochs is 1, but instead encountered eval_on_train_input_config.num_epochs = 0. Overwriting num_epochs to 1.
[1] WARNING:tensorflow:Estimator's model_fn (<function model_fn at 0x7f6555f308c0>) includes params argument, but params are not passed to Estimator.
[1] WARNING:tensorflow:num_readers has been reduced to 1 to match input file shards.
[1] WARNING:tensorflow:From /tensorflow/models/research/object_detection/builders/dataset_builder.py:80: parallel_interleave (from tensorflow.contrib.data.python.ops.interleave_ops) is deprecated and will be removed in a future version.
[1] Instructions for updating:
[1] Use tf.data.experimental.parallel_interleave(...).
[1] WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/sparse_ops.py:1165: sparse_to_dense (from tensorflow.python.ops.sparse_ops) is deprecated and will be removed in a future version.
[1] Instructions for updating:
[1] Create a tf.sparse.SparseTensor and use tf.sparse.to_dense instead.
[1] WARNING:tensorflow:From /tensorflow/models/research/object_detection/core/preprocessor.py:1218: calling squeeze (from tensorflow.python.ops.array_ops) with squeeze_dims is deprecated and will be removed in a future version.
[1] Instructions for updating:
[1] Use the axis argument instead
[1] WARNING:tensorflow:From /tensorflow/models/research/object_detection/builders/dataset_builder.py:148: batch_and_drop_remainder (from tensorflow.contrib.data.python.ops.batching) is deprecated and will be removed in a future version.
[1] Instructions for updating:
[1] Use tf.data.Dataset.batch(..., drop_remainder=True).
[1] WARNING:root:Variable [BoxPredictor_0/ClassPredictor/biases] is available in checkpoint, but has an incompatible shape with model variable. Checkpoint shape: [[273]], model variable shape: [[15]]. This variable will not be initialized from the checkpoint.
[1] WARNING:root:Variable [BoxPredictor_0/ClassPredictor/weights] is available in checkpoint, but has an incompatible shape with model variable. Checkpoint shape: [[1, 1, 576, 273]], model variable shape: [[1, 1, 576, 15]]. This variable will not be initialized from the checkpoint.
[1] WARNING:root:Variable [BoxPredictor_1/ClassPredictor/biases] is available in checkpoint, but has an incompatible shape with model variable. Checkpoint shape: [[546]], model variable shape: [[30]]. This variable will not be initialized from the checkpoint.
[1] WARNING:root:Variable [BoxPredictor_1/ClassPredictor/weights] is available in checkpoint, but has an incompatible shape with model variable. Checkpoint shape: [[1, 1, 1280, 546]], model variable shape: [[1, 1, 1280, 30]]. This variable will not be initialized from the checkpoint.
[1] WARNING:root:Variable [BoxPredictor_2/ClassPredictor/biases] is available in checkpoint, but has an incompatible shape with model variable. Checkpoint shape: [[546]], model variable shape: [[30]]. This variable will not be initialized from the checkpoint.
[1] WARNING:root:Variable [BoxPredictor_2/ClassPredictor/weights] is available in checkpoint, but has an incompatible shape with model variable. Checkpoint shape: [[1, 1, 512, 546]], model variable shape: [[1, 1, 512, 30]]. This variable will not be initialized from the checkpoint.
[1] WARNING:root:Variable [BoxPredictor_3/ClassPredictor/biases] is available in checkpoint, but has an incompatible shape with model variable. Checkpoint shape: [[546]], model variable shape: [[30]]. This variable will not be initialized from the checkpoint.
[1] WARNING:root:Variable [BoxPredictor_3/ClassPredictor/weights] is available in checkpoint, but has an incompatible shape with model variable. Checkpoint shape: [[1, 1, 256, 546]], model variable shape: [[1, 1, 256, 30]]. This variable will not be initialized from the checkpoint.
[1] WARNING:root:Variable [BoxPredictor_4/ClassPredictor/biases] is available in checkpoint, but has an incompatible shape with model variable. Checkpoint shape: [[546]], model variable shape: [[30]]. This variable will not be initialized from the checkpoint.
[1] WARNING:root:Variable [BoxPredictor_4/ClassPredictor/weights] is available in checkpoint, but has an incompatible shape with model variable. Checkpoint shape: [[1, 1, 256, 546]], model variable shape: [[1, 1, 256, 30]]. This variable will not be initialized from the checkpoint.
[1] WARNING:root:Variable [BoxPredictor_5/ClassPredictor/biases] is available in checkpoint, but has an incompatible shape with model variable. Checkpoint shape: [[546]], model variable shape: [[30]]. This variable will not be initialized from the checkpoint.
[1] WARNING:root:Variable [BoxPredictor_5/ClassPredictor/weights] is available in checkpoint, but has an incompatible shape with model variable. Checkpoint shape: [[1, 1, 128, 546]], model variable shape: [[1, 1, 128, 30]]. This variable will not be initialized from the checkpoint.
[1] WARNING:root:Variable [global_step] is not available in checkpoint
[1] 2022-01-18 00:53:38.619242: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
[1] W0118 00:53:59.300209 Reloader plugin_event_accumulator.py:286] Found more than one graph event per run, or there was a metagraph containing a graph_def, as well as one or more graph events. Overwriting the graph with the newest event.
[1] W0118 00:53:59.300209 140395273336576 plugin_event_accumulator.py:286] Found more than one graph event per run, or there was a metagraph containing a graph_def, as well as one or more graph events. Overwriting the graph with the newest event.
[1] 2022-01-18 00:54:19.306182: W tensorflow/core/framework/allocator.cc:122] Allocation of 46080000 exceeds 10% of system memory.
[1] 2022-01-18 00:54:19.641319: W tensorflow/core/framework/allocator.cc:122] Allocation of 46080000 exceeds 10% of system memory.
[1] 2022-01-18 00:54:19.814552: W tensorflow/core/framework/allocator.cc:122] Allocation of 46080000 exceeds 10% of system memory.
[1] 2022-01-18 00:54:19.847873: W tensorflow/core/framework/allocator.cc:122] Allocation of 46080000 exceeds 10% of system memory.
[1] 2022-01-18 00:54:20.033703: W tensorflow/core/framework/allocator.cc:122] Allocation of 46080000 exceeds 10% of system memory.
[1] 8b397942-b1e5-4dde-a69a-7a4c940396a8: Training complete
[1] Checkpoint update routine terminated.
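
A note on the allocator warnings just before "Training complete": taken at face value, each one says a single 46080000-byte allocation is more than 10% of the memory TensorFlow sees, which puts a ceiling on that memory. The sketch below only does that arithmetic on the log text. The function name `implied_memory_ceiling` is made up for illustration, and whether "system memory" means total or free memory inside the container is an assumption I have not checked. It does not prove that low memory is why training stopped, but it may be worth comparing the result against the memory available to Docker.

```python
import re

# Two of the allocator warnings from the log above, copied verbatim.
LOG = """\
2022-01-18 00:54:19.306182: W tensorflow/core/framework/allocator.cc:122] Allocation of 46080000 exceeds 10% of system memory.
2022-01-18 00:54:19.641319: W tensorflow/core/framework/allocator.cc:122] Allocation of 46080000 exceeds 10% of system memory.
"""

# Captures the allocation size in bytes and the percentage named in the warning.
PATTERN = re.compile(r"Allocation of (\d+) exceeds (\d+)% of system memory")


def implied_memory_ceiling(log_text):
    """Return the tightest upper bound (in bytes) implied by the warnings.

    If an allocation of `size` bytes exceeds `percent`% of memory, then
    memory < size * 100 / percent. Returns None when no warning is found.
    Hypothetical helper: it reads the log message literally and nothing more.
    """
    bounds = [
        int(size) * 100 // int(percent)
        for size, percent in PATTERN.findall(log_text)
    ]
    return min(bounds) if bounds else None


ceiling = implied_memory_ceiling(LOG)
print("memory seen by TensorFlow is below", ceiling, "bytes")
print("that is roughly", round(ceiling / 2**20, 1), "MiB")
```

For this log the bound works out to 460800000 bytes, a bit under 440 MiB.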