ViTSTR-Transducer implementation for text recognition

Datasets used for training, validation and testing can be found here (direct link to DropBox)

Architecture

Prerequisites:

Python 3.12.x
CUDA 12.6.x
cuDNN 9.10.x
PyTorch 2.7.0
TorchVision 0.22.0
(optional) Docker + NVIDIA Container Toolkit
(optional) ClearML
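
A quick way to compare an existing environment against the list above is to query the installed package metadata. The following is a minimal sketch, not part of the repository: the helper names `installed_versions` and `report` are hypothetical, and it only checks Python, PyTorch and TorchVision (CUDA and cuDNN are not inspected).

```python
import sys
from importlib import metadata

# Versions taken from the prerequisites list; compared by prefix only,
# so a local build tag such as "+cu126" still counts as a match.
EXPECTED = {"torch": "2.7.0", "torchvision": "0.22.0"}


def installed_versions(packages=EXPECTED):
    """Return {package: installed version, or None if it is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


def report(packages=EXPECTED):
    """Build one status line for Python and one per package."""
    py = sys.version_info
    lines = [f"python {py.major}.{py.minor} (expected 3.12)"]
    for name, version in installed_versions(packages).items():
        expected = packages[name]
        if version is None:
            status = "not installed"
        elif version.startswith(expected):
            status = "ok"
        else:
            status = f"differs from expected {expected}"
        lines.append(f"{name} {version}: {status}")
    return lines


if __name__ == "__main__":
    print("\n".join(report()))
```

Because the sketch reads package metadata instead of importing `torch`, it runs even when the GPU stack is not set up yet.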

Inference using pretrained weights

Install the dependencies from requirements.txt, or create the Conda environment from environment.yml
Download the weights from the releases page
Follow the steps in the example notebook (inference_example.ipynb)

Train on your own data

Dataset

Datasets can be in two different formats:

LMDB

The structure should be approximately as follows:

├── test
│   ├── CUTE80
│   │   ├── data.mdb
│   │   └── lock.mdb
│   ├── IC03_860
│   │   ├── data.mdb
│   │   └── lock.mdb
│   ├── IC03_867
│   │   ├── data.mdb
│   │   └── lock.mdb
│  ...
├── train
│   ├── MJ
│   │   ├── data.mdb
│   │   └── lock.mdb
│   └── ST
│       ├── data.mdb
│       └── lock.mdb
└── val
    ├── MJ_valid
    │   ├── data.mdb
    │   └── lock.mdb
    └── extra_val
        ├── data.mdb
        └── lock.mdb

Details of the LMDB internal structure can be found in LmdbDataset.__getitem__.
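
Before starting a long training run, it can help to confirm that the directory layout matches the tree above. The sketch below is an illustration only and is not part of the repository: the function name `find_incomplete_lmdb_dirs` is hypothetical, and it assumes every dataset subdirectory should contain both `data.mdb` and `lock.mdb`, as shown in the tree. It does not open the databases or inspect their keys.

```python
from pathlib import Path

SPLITS = ("train", "val", "test")
# Assumption: both files are expected, as in the directory tree above.
REQUIRED_FILES = ("data.mdb", "lock.mdb")


def find_incomplete_lmdb_dirs(root):
    """Return a sorted list of layout problems found under the dataset root."""
    root = Path(root)
    problems = []
    for split in SPLITS:
        split_dir = root / split
        if not split_dir.is_dir():
            problems.append(f"{split}: split directory is missing")
            continue
        subsets = [p for p in split_dir.iterdir() if p.is_dir()]
        if not subsets:
            problems.append(f"{split}: no dataset subdirectories")
        for subset in subsets:
            for filename in REQUIRED_FILES:
                if not (subset / filename).is_file():
                    problems.append(f"{split}/{subset.name}: {filename} is missing")
    return sorted(problems)


if __name__ == "__main__":
    import sys

    # Usage: python check_lmdb_layout.py /path/to/dataset
    for problem in find_incomplete_lmdb_dirs(sys.argv[1]):
        print(problem)
```

An empty result means every split exists and every subdirectory holds both files.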

JSON

├── test
│   ├── ann
│   │   └── 1.json
│   └── img
│       └── 1.png
├── train
│   ├── ann
│   │   └── 2.json
│   └── img
│       └── 2.jpg
└── val
    ├── ann
    │   └── 3.json
    └── img
        └── 3.jpeg

Each JSON file must contain two fields: description (the target text) and name (the image filename without extension). For example:

{"description": "kioto", "name": "2"}

Here kioto is the target (the text the model should recognize) and 2 is the image filename without extension, so the image could be 2.jpg, 2.png, etc.
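
To make the pairing between annotations and images concrete, here is a minimal sketch of how one split could be read. It is not the repository's loader: the function name `load_split` and the `IMAGE_EXTENSIONS` tuple are assumptions made for illustration, and the actual dataset class in src may resolve extensions differently.

```python
import json
from pathlib import Path

# Assumption: these extensions cover the images in your dataset.
IMAGE_EXTENSIONS = (".png", ".jpg", ".jpeg")


def load_split(split_dir):
    """Return (image_path, target) pairs for one split, e.g. the train directory."""
    split_dir = Path(split_dir)
    samples = []
    for ann_path in sorted((split_dir / "ann").glob("*.json")):
        with ann_path.open(encoding="utf-8") as f:
            ann = json.load(f)
        missing = {"description", "name"} - set(ann)
        if missing:
            raise ValueError(f"{ann_path}: missing fields {sorted(missing)}")
        # "name" carries no extension, so try each candidate in turn.
        candidates = [split_dir / "img" / (ann["name"] + ext) for ext in IMAGE_EXTENSIONS]
        image_path = next((p for p in candidates if p.is_file()), None)
        if image_path is None:
            raise FileNotFoundError(f"{ann_path}: no image named {ann['name']!r} in img/")
        samples.append((image_path, ann["description"]))
    return samples
```

Running this over train, val and test before training surfaces annotations with missing fields or missing images early.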

Configuration

The main configuration file is configs/config.yaml. You can modify it to suit your needs. Almost all fields have comments explaining what they do.

NOTE 1: LABELS is case-sensitive. To train a model that distinguishes upper and lower case, include both lowercase and uppercase characters in LABELS. Example: aAbBcC... for case-sensitive training and abc... for case-insensitive training.
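
One practical consequence of NOTE 1 is that targets may contain characters that LABELS does not cover. The sketch below shows one way to detect that. It is an illustration only: the two label strings are hypothetical stand-ins for the LABELS value in configs/config.yaml, and `unsupported_characters` is not a function from the repository.

```python
import string

# Hypothetical LABELS values; the real one lives in configs/config.yaml.
# Character order is arbitrary here and does not affect the check.
CASE_INSENSITIVE_LABELS = string.ascii_lowercase + string.digits
CASE_SENSITIVE_LABELS = string.ascii_letters + string.digits


def unsupported_characters(targets, labels):
    """Return the sorted characters that occur in targets but not in labels."""
    allowed = set(labels)
    return sorted({ch for target in targets for ch in target if ch not in allowed})


if __name__ == "__main__":
    targets = ["kioto", "Kioto"]
    print(unsupported_characters(targets, CASE_INSENSITIVE_LABELS))  # ['K']
    print(unsupported_characters(targets, CASE_SENSITIVE_LABELS))    # []
```

If the first call reports uppercase letters, either extend LABELS or lowercase the targets, depending on whether the model should predict case.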

NOTE 2: If you use Docker, don't change DATASET_PATH in config.yaml. This path refers to a location inside the Docker container and is not accessible from your host machine.

NOTE 3: Training results will be saved in the outputs directory for both local and Docker training.

Local training

Install the dependencies from requirements.txt, or create the Conda environment from environment.yml
Change DATASET_PATH in config.yaml to point to your dataset
Change DATASET_TYPE in config.yaml to match your dataset type (lmdb or json)

Run the script:

python main.py --config=./configs/config.yaml --output-dir=outputs --device=0

Training with Docker

Change DATASET_PATH in .env to point to your dataset
Change DATASET_TYPE in config.yaml to match your dataset type (lmdb or json)

Build and run the Docker container:

source .env && docker compose build && docker compose run vitstr

References

ViTSTR-Transducer: Cross-Attention-Free Vision Transformer Transducer for Scene Text Recognition

Text recognition (optical character recognition) with deep learning methods, ICCV 2019

Vision Transformer for Fast and Efficient Scene Text Recognition
