Datasets used for training, validation, and testing can be found here (direct link to Dropbox)
- Python 3.12.x
- CUDA 12.6.x
- cuDNN 9.10.x
- PyTorch 2.7.0
- TorchVision 0.22.0
- (optional) Docker + NVIDIA Container Toolkit
- (optional) ClearML
- Install dependencies from `requirements.txt` or the Conda environment from `environment.yml`
- Download weights from the releases page
- Follow the steps in the example notebook
Datasets can be in two different formats:
- LMDB

  The structure should be approximately as follows:

  ```
  ├── test
  │   ├── CUTE80
  │   │   ├── data.mdb
  │   │   └── lock.mdb
  │   ├── IC03_860
  │   │   ├── data.mdb
  │   │   └── lock.mdb
  │   ├── IC03_867
  │   │   ├── data.mdb
  │   │   └── lock.mdb
  │   ...
  ├── train
  │   ├── MJ
  │   │   ├── data.mdb
  │   │   └── lock.mdb
  │   └── ST
  │       ├── data.mdb
  │       └── lock.mdb
  └── val
      ├── MJ_valid
      │   ├── data.mdb
      │   └── lock.mdb
      └── extra_val
          ├── data.mdb
          └── lock.mdb
  ```

  More about the LMDB internal structure can be found in `LmdbDataset.__getitem__`.

- JSON

  ```
  ├── test
  │   ├── ann
  │   │   └── 1.json
  │   └── img
  │       └── 1.png
  ├── train
  │   ├── ann
  │   │   └── 2.json
  │   └── img
  │       └── 2.jpg
  └── val
      ├── ann
      │   └── 3.json
      └── img
          └── 3.jpeg
  ```

  Each JSON file must contain 2 fields: `description` (the real target) and `name` (the image filename without extension):

  ```json
  {"description": "kioto", "name": "2"}
  ```

  Here `kioto` is the real target (what should be recognized by the model) and `2` is the image filename without extension, which could be `2.jpg`, `2.png`, etc.
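As an illustrative sketch of reading this JSON layout (not the repository's actual dataset class — the function name and use of `pathlib` here are my own), each annotation can be paired with its image by the shared filename stem:

```python
import json
from pathlib import Path


def load_json_split(split_dir):
    """Yield (image_path, target_text) pairs for one split (train/val/test).

    Assumes the layout described above: <split>/ann/<name>.json and
    <split>/img/<name>.<ext>, where each JSON file holds
    {"description": ..., "name": ...}. Illustrative only, not the repo's loader.
    """
    split_dir = Path(split_dir)
    samples = []
    for ann_path in sorted((split_dir / "ann").glob("*.json")):
        ann = json.loads(ann_path.read_text())
        # The annotation stores the filename without extension,
        # so search the img directory for any matching extension.
        matches = sorted((split_dir / "img").glob(ann["name"] + ".*"))
        if matches:
            samples.append((matches[0], ann["description"]))
    return samples
```

A real loader would additionally decode the image (e.g. with PIL) and apply transforms; this sketch only resolves the annotation/image pairing.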
The main configuration file is `configs/config.yaml`. You can modify it to suit your needs. Almost all fields have comments to help you understand what they do.
NOTE 1: `LABELS` is case-sensitive. If you want to train a model that can determine the case of the text, you need to include both lowercase and uppercase labels, e.g. `aAbBcC...` for case-sensitive training and `abc...` for case-insensitive training.
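To make the case-sensitivity note concrete, here is a hedged sketch (the mapping logic is illustrative, not the repository's actual tokenizer) of how a `LABELS` string might turn into a character-to-index encoding:

```python
def build_encoder(labels: str, case_sensitive: bool):
    """Return a function mapping text to integer indices over LABELS.

    Illustrative only: with case_sensitive=False the input text is
    lowercased first, so a charset like "abc..." suffices; with
    case_sensitive=True the charset must list both cases, e.g. "aAbBcC...".
    """
    stoi = {ch: i for i, ch in enumerate(labels)}

    def encode(text: str):
        if not case_sensitive:
            text = text.lower()
        # Characters outside LABELS are silently dropped in this sketch.
        return [stoi[ch] for ch in text if ch in stoi]

    return encode


# "Cab" and "cab" encode identically under the case-insensitive map,
# but differently under the case-sensitive one.
encode_ci = build_encoder("abc", case_sensitive=False)
encode_cs = build_encoder("aAbBcC", case_sensitive=True)
```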
NOTE 2: If you use Docker, don't change `DATASET_PATH` in `config.yaml`. This path is used inside the Docker container and is not accessible from your host machine.
NOTE 3: Training results will be saved in the `outputs` directory for both local and Docker training.
- Install dependencies from `requirements.txt` or the Conda environment from `environment.yml`
- Change `DATASET_PATH` in `config.yaml` to point to your dataset
- Change `DATASET_TYPE` in `config.yaml` to match your dataset type (`lmdb` or `json`)
- Run the script:

  ```shell
  python main.py --config=./configs/config.yaml --output-dir=outputs --device=0
  ```
- Change `DATASET_PATH` in `.env` to point to your dataset
- Change `DATASET_TYPE` in `config.yaml` to match your dataset type (`lmdb` or `json`)
- Build and run the Docker container:

  ```shell
  source .env && docker compose build && docker compose run vitstr
  ```
ViTSTR-Transducer: Cross-Attention-Free Vision Transformer Transducer for Scene Text Recognition
Text recognition (optical character recognition) with deep learning methods, ICCV 2019
Vision Transformer for Fast and Efficient Scene Text Recognition









