This repository was archived by the owner on Sep 18, 2025. It is now read-only.

Habana BERT Training on AWS Gaudi1 #1

@rajeshitshoulders


Hi, I need to run the MLPerf 2.0 Intel-Habana BERT training on an AWS Gaudi1 processor with the image Deep Learning AMI Habana PyTorch 1.12.0 SynapseAI 1.6.0 (Ubuntu 20.04) 20220928. I followed the README below.

Readme: https://github.com/mlcommons/training_results_v2.0/tree/main/Intel-HabanaLabs/benchmarks
Dataset : https://github.com/mlcommons/training/tree/master/language_model/tensorflow/bert#download-and-preprocess-datasets
AWS image: Deep Learning AMI Habana PyTorch 1.12.0 SynapseAI 1.6.0 (Ubuntu 20.04) 20220928, Gaudi1 VM, 742 GB RAM, 96 cores

I'm having trouble converting the datasets into tf_records; the packing script pack_pretraining_data_tfrec never succeeds because of memory issues.

To convert the unzipped dataset results_text.zip into tf_records, I ran pretraining/create_pretraining_data.py with the option --input_file=/root/datasets/results4/part-0000*, and it took almost all of the available 742 GB of memory plus swap.

So I converted each part file into a tf_record individually, using a for loop.
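
A minimal sketch of the per-file loop described above. It only builds and prints one command per part file (a dry run). The output directory, the `.tfrecord` suffix, and the helper name `build_commands` are hypothetical; `--output_file` is assumed to be the script's output flag, and any other required flags (vocabulary file, sequence length, and so on) would have to be passed through `extra_args` as the README specifies.

```python
import glob
import os
import sys


def build_commands(input_files, output_dir,
                   script="pretraining/create_pretraining_data.py",
                   extra_args=()):
    """Return one command (as an argument list) per input part file."""
    commands = []
    for path in sorted(input_files):
        # Hypothetical naming scheme: one output file per input part file.
        out = os.path.join(output_dir, os.path.basename(path) + ".tfrecord")
        commands.append([
            sys.executable, script,
            f"--input_file={path}",
            f"--output_file={out}",  # assumed flag name, check the script
            *extra_args,
        ])
    return commands


# Dry run: print the commands instead of executing them.
part_files = glob.glob("/root/datasets/results4/part-0000*")
for cmd in build_commands(part_files, "/root/datasets/tfrecords"):
    print(" ".join(cmd))
```

Running each printed command in its own process (for example with `subprocess.run(cmd, check=True)`) means memory is released between files, which is the point of converting one part at a time.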

To use the packed method in training, I ran the script pack_pretraining_data_tfrec on the tf_records with --max-files set to 10 (default 100), but it looks like the script loads all tf_records into memory sequentially before it starts packing to create the strategy files, so it filled up all the available 742 GB of memory and the packing failed.

So I tried the unpacked method instead. For that, I converted the tf_records into binary files using the record_to_binary script from the Graphcore v1.0 submission (https://github.com/mlcommons/training_results_v1.0/tree/master/Graphcore/benchmarks/bert/implementations/popart/bert_data).
When I run the training process, I get corrupt data.
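
One way to narrow down where the corruption comes from is to check that each tf_record file is structurally intact before converting it. The sketch below is an assumption-laden sanity check, not part of any of the scripts mentioned: it walks the standard TFRecord framing (8-byte little-endian length, 4-byte CRC, payload, 4-byte CRC) using seeks, so it never loads a record into memory. It does not verify the CRC values, and the function name `count_tfrecord_records` is hypothetical.

```python
import os
import struct


def count_tfrecord_records(path):
    """Count records by walking the TFRecord framing.

    Raises ValueError if the file ends in the middle of a record.
    CRC values are skipped, not verified.
    """
    size = os.path.getsize(path)
    pos = 0
    count = 0
    with open(path, "rb") as f:
        while pos < size:
            if pos + 12 > size:
                raise ValueError(f"truncated header at byte {pos} in {path}")
            f.seek(pos)
            # 8-byte little-endian payload length; the 4-byte length CRC follows.
            (length,) = struct.unpack("<Q", f.read(8))
            end = pos + 12 + length + 4  # header + payload + payload CRC
            if end > size:
                raise ValueError(f"truncated record at byte {pos} in {path}")
            pos = end
            count += 1
    return count
```

A file that raises here was cut off or damaged during the tf_record conversion itself; a file that passes but still trains as corrupt points at the binary conversion or at a format mismatch with the data loader.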

Questions: Is this the right procedure?
1/ Converting the dataset part files into tf_records one at a time.
2/ Converting the tf_records part-000* into binary files? Can the resulting part-*** files be used for the unpacked method?
3/ How do I limit max-files to 10 or 25 files in packing?

Please advise if there is any alternative method to pack the BERT wiki dataset for MLPerf v2.0 BERT training on Gaudi.
