Habana BERT Training on AWS Gaudi1 #1
Description
Hi, I need to run the MLPerf 2.0 Intel-Habana BERT training on an AWS Gaudi1 processor with the image Deep Learning AMI Habana PyTorch 1.12.0 SynapseAI 1.6.0 (Ubuntu 20.04) 20220928. I followed the readme below.
Readme: https://github.com/mlcommons/training_results_v2.0/tree/main/Intel-HabanaLabs/benchmarks
Dataset: https://github.com/mlcommons/training/tree/master/language_model/tensorflow/bert#download-and-preprocess-datasets
Instance: AWS Deep Learning AMI Habana PyTorch 1.12.0 SynapseAI 1.6.0 (Ubuntu 20.04) 20220928, Gaudi1 VM, 742 GB RAM and 96 cores
I am having trouble converting the dataset into TFRecords, and the packing script pack_pretraining_data_tfrec has never succeeded because it runs out of memory.
To convert the unzipped dataset results_text.zip into TFRecords, I ran pretraining/create_pretraining_data.py with the option --input_file=/root/datasets/results4/part-0000*, and it took almost all of the available 742 GB of memory plus swap.
So I converted each part file into a TFRecord file separately, using a for loop.
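A minimal sketch of such a per-file loop, so that only one part file is processed at a time. The output directory, the `.tfrecord` suffix, and the helper names `build_commands` and `run_all` are hypothetical; `--output_file` is assumed to be accepted by the reference script alongside `--input_file`, and the other flags the readme requires (vocabulary file, sequence length, and so on) are omitted here and would need to be appended.

```python
import glob
import os
import subprocess


def build_commands(input_glob, output_dir,
                   script="pretraining/create_pretraining_data.py"):
    """Build one conversion command per part file (nothing is executed here)."""
    commands = []
    for part in sorted(glob.glob(input_glob)):
        # Hypothetical naming scheme: part-00000 -> part-00000.tfrecord
        out = os.path.join(output_dir, os.path.basename(part) + ".tfrecord")
        commands.append([
            "python3", script,
            f"--input_file={part}",
            f"--output_file={out}",  # assumed flag; add the remaining required flags
        ])
    return commands


def run_all(commands):
    """Run the conversions sequentially, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


# Example (paths are illustrative):
# run_all(build_commands("/root/datasets/results4/part-0000*", "/root/datasets/tfrecords"))
```

Running the files sequentially keeps peak memory close to what a single part file needs, at the cost of a longer total conversion time.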
To use the packed method in training, I ran the pack_pretraining_data_tfrec script on the TFRecords with --max-files set to 10 (default 100). However, the script appears to load all TFRecords into memory sequentially before it starts packing and creating the strategy files, so it filled up all 742 GB of available memory and the packing failed.
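One possible workaround would be to give the packing script only a small group of TFRecord files per invocation. The sketch below only groups file names into batches of a chosen size; the helper name `batches` is hypothetical, and how each batch is then handed to pack_pretraining_data_tfrec (for example through a separate input directory per batch) is an assumption that the readme does not confirm. Packing each batch on its own would also produce a per-batch strategy, which may differ from a strategy computed over the whole dataset.

```python
def batches(paths, max_files=10):
    """Split a list of file paths into consecutive groups of at most max_files."""
    if max_files < 1:
        raise ValueError("max_files must be at least 1")
    ordered = sorted(paths)
    return [ordered[i:i + max_files] for i in range(0, len(ordered), max_files)]


# Example with illustrative names: 25 files become groups of 10, 10 and 5.
groups = batches([f"part-{i:05d}.tfrecord" for i in range(25)], max_files=10)
```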
I therefore tried the unpacked method. For that, I converted the TFRecords into binary files using the record_to_binary script from the Graphcore v1.0 submission (https://github.com/mlcommons/training_results_v1.0/tree/master/Graphcore/benchmarks/bert/implementations/popart/bert_data).
When I run the training process, I get corrupt data.
Questions: is the following the right procedure?
1/ Converting the dataset part files into TFRecords one at a time.
2/ Converting the TFRecords part-000* into binary files. Can the resulting part-*** files be used for the unpacked method?
3/ How can I limit max-files to 10 or 25 files in packing?
Please advise if there is any alternative method to pack the BERT wiki dataset for MLPerf v2.0 BERT training on Gaudi.