MolGrapher

This is the repository for MolGrapher: Graph-based Visual Recognition of Chemical Structures. MolGrapher is a model to convert molecule images into molecular graphs.

Citation

If you find this repository useful, please consider citing:

@inproceedings{Morin_2023_ICCV,
	title        = {{MolGrapher: Graph-based Visual Recognition of Chemical Structures}},
	author       = {Morin, Lucas and Danelljan, Martin and Agea, Maria Isabel and Nassar, Ahmed and Weber, Valery and Meijer, Ingmar and Staar, Peter and Yu, Fisher},
	year         = 2023,
	month        = {October},
	booktitle    = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
	pages        = {19552--19561}
}

Installation

Create a virtual environment.

python3.11 -m venv molgrapher-env
source molgrapher-env/bin/activate

Option 1: Install MolGrapher for CPU.

For Linux:

pip install -e ".[cpu]"

For macOS:

pip install torch==2.1.2 torchvision==0.16.2 "paddlepaddle==0.0.0" -f https://www.paddlepaddle.org.cn/whl/mac/cpu/develop.html "git+https://github.com/lucas-morin/MolGrapher#egg=molgrapher"

Option 2: Install MolGrapher for GPU. (Tested on x86_64, Linux Ubuntu 20.04, CUDA 11.7, cuDNN 8.4)

pip install -e ".[gpu]"

To install and run MolGrapher using Docker, please refer to README_DOCKER.md.

Inference

Python

from molgrapher.models.molgrapher_model import MolgrapherModel

model = MolgrapherModel()
images_or_paths = ["./data/benchmarks/default/images/image_0.png"] 
annotations = model.predict_batch(images_or_paths)

annotations is a list of dictionaries with the following fields:

[
    {
        'smi': 'O=C(O)C1=CC=C(C2=C(...',                      # MolGrapher SMILES prediction
        'conf': 0.991,                                        # MolGrapher confidence
        'file-info': {
            'filename': '...',                                # Input image filename
            'image_nbr': 1
        },
        'abbreviations_ocr': [...],                           # Detected OCR text
        'abbreviations': [...],                               # Post-processed detected OCR text
        'annotator': {'program': 'MolGrapher', 'version': '1.0.0'},
    },
    ...
]
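
As an illustration of how this output might be consumed, the sketch below keeps only predictions above a confidence threshold and maps each input filename to its predicted SMILES. It relies only on the 'smi', 'conf' and 'file-info' fields shown above. The function name select_confident, the 0.9 threshold, and the sample values are assumptions made for this example and are not part of MolGrapher; skipping entries with a missing 'smi' or 'conf' is a defensive choice, not documented behavior.

```python
from typing import Any

# Hypothetical sample mirroring the structure shown above; values are illustrative only.
sample_annotations: list[dict[str, Any]] = [
    {"smi": "CCO", "conf": 0.991, "file-info": {"filename": "image_0.png", "image_nbr": 1}},
    {"smi": "c1ccccc1", "conf": 0.42, "file-info": {"filename": "image_1.png", "image_nbr": 1}},
]


def select_confident(annotations, min_conf=0.9):
    """Return {filename: SMILES} for predictions with confidence >= min_conf."""
    selected = {}
    for annotation in annotations:
        smiles = annotation.get("smi")
        conf = annotation.get("conf")
        # Defensive assumption: skip entries without a prediction or a confidence.
        if smiles is None or conf is None:
            continue
        if conf >= min_conf:
            selected[annotation["file-info"]["filename"]] = smiles
    return selected


confident = select_confident(sample_annotations, min_conf=0.9)
print(confident)  # {'image_0.png': 'CCO'}
```

With real predictions, pass the list returned by model.predict_batch instead of sample_annotations. The appropriate threshold depends on the data and should be chosen empirically.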

Script

Place your input images in: MolGrapher/data/benchmarks/default/.
Run MolGrapher:

bash molgrapher/scripts/annotate/run.sh

Read predictions in: MolGrapher/data/predictions/default/.
(Optional) Visualize predictions in: MolGrapher/data/visualization/predictions/default/.
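
The same folder-based workflow can also be approximated from Python by listing the images in a directory and passing them to predict_batch, as in the Python example above. The following is a minimal sketch: the helper names collect_image_paths and run_inference and the set of accepted file extensions are assumptions made for this example, not part of the MolGrapher API.

```python
from pathlib import Path

# Assumption: which extensions to treat as images; adjust to your data.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg"}


def collect_image_paths(directory, extensions=IMAGE_EXTENSIONS):
    """Return the sorted paths (as strings) of image files directly inside `directory`."""
    return sorted(
        str(path)
        for path in Path(directory).iterdir()
        if path.is_file() and path.suffix.lower() in extensions
    )


def run_inference(directory):
    """Run MolGrapher on every image found in `directory` and return the annotations."""
    # Imported here so that the helper above can be used without MolGrapher installed.
    from molgrapher.models.molgrapher_model import MolgrapherModel

    model = MolgrapherModel()
    return model.predict_batch(collect_image_paths(directory))
```

For example, run_inference("./data/benchmarks/default/images/") would return one annotation dictionary per image, in the format described in the Python section.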

Model

Models are available on Hugging Face. They are automatically downloaded when running the model's inference. The model parameters are documented here.

Docling Integration

Docling is a toolkit to extract the content and structure from PDF documents. It recognizes page layout, reading order, table structure, code, and formulas, classifies images, and more. Here, we combine docling and MolGrapher:

Docling segments and classifies chemical-structure images from document pages,
MolGrapher converts these images to SMILES.

Install docling in the molgrapher environment.

pip install docling

Option 1: Convert a PDF document with docling and enrich it with MolGrapher annotations.

Example:

bash molgrapher/scripts/annotate/docling/docling_convert_and_enrich.sh ./data/pdfs/US9259003_page_4.pdf ./data/docling_documents/US9259003_page_4/
# bash [script] [pdf-path] [docling-document-directory-path]

Option 2: Enrich an existing docling document with MolGrapher annotations.

Example:

python3 molgrapher/scripts/annotate/docling/enrich_docling_document.py --docling-document-directory-path ./data/docling_documents/US9259003_page_4/  
# python3 [script] --docling-document-directory-path [docling-document-directory-path]

The docling document, enriched with SMILES predictions, is stored in [docling-document-directory-path]. For more information, please refer to docling.
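
To enrich several docling documents in one go, the Option 2 command can be wrapped in a small Python loop. This is only a sketch: it assumes that every subdirectory of a given parent folder is a docling document directory and that it is run from the repository root. The helper names build_enrich_command and enrich_all are hypothetical; only the script path and the --docling-document-directory-path flag come from the command above.

```python
import subprocess
import sys
from pathlib import Path

# Script path and flag taken from the Option 2 command; paths are relative to the repository root.
ENRICH_SCRIPT = "molgrapher/scripts/annotate/docling/enrich_docling_document.py"


def build_enrich_command(document_dir):
    """Build the argument list equivalent to the Option 2 command for one document directory."""
    return [
        sys.executable,
        ENRICH_SCRIPT,
        "--docling-document-directory-path",
        str(document_dir),
    ]


def enrich_all(parent_dir):
    """Run the enrichment script once per subdirectory of `parent_dir`.

    Assumption: each subdirectory is a docling document directory.
    """
    for document_dir in sorted(Path(parent_dir).iterdir()):
        if document_dir.is_dir():
            # check=True stops at the first failing document instead of continuing silently.
            subprocess.run(build_enrich_command(document_dir), check=True)
```

For example, enrich_all("./data/docling_documents/") would process ./data/docling_documents/US9259003_page_4/ along with any sibling directories.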

USPTO-30K Benchmark

USPTO-30K is available on Hugging Face.

USPTO-10K contains 10,000 clean molecules, i.e. without any abbreviated groups.
USPTO-10K-abb contains 10,000 molecules with superatom groups.
USPTO-10K-L contains 10,000 clean molecules with more than 70 atoms.

Synthetic Dataset

The synthetic dataset is available on Hugging Face. Images and graphs are generated using MolDepictor.

Training

To train the keypoint detector:

python3 ./molgrapher/scripts/train/train_keypoint_detector.py

To train the node classifier:

python3 ./molgrapher/scripts/train/train_graph_classifier.py
