MolGrapher

This is the repository for MolGrapher: Graph-based Visual Recognition of Chemical Structures. MolGrapher is a model to convert molecule images into molecular graphs.

Citation

If you find this repository useful, please consider citing:

@inproceedings{Morin_2023_ICCV,
	title        = {{MolGrapher: Graph-based Visual Recognition of Chemical Structures}},
	author       = {Morin, Lucas and Danelljan, Martin and Agea, Maria Isabel and Nassar, Ahmed and Weber, Valery and Meijer, Ingmar and Staar, Peter and Yu, Fisher},
	year         = 2023,
	month        = {October},
	booktitle    = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
	pages        = {19552--19561}
}

Installation

Create a virtual environment.

python3.11 -m venv molgrapher-env
source molgrapher-env/bin/activate

Option 1: Install MolGrapher for CPU.

For Linux:

pip install -e ".[cpu]"

For macOS:

pip install torch==2.1.2 torchvision==0.16.2 "paddlepaddle==0.0.0" -f https://www.paddlepaddle.org.cn/whl/mac/cpu/develop.html "git+https://github.com/lucas-morin/MolGrapher#egg=molgrapher"

Option 2: Install MolGrapher for GPU. (Tested for x86_64, Linux Ubuntu 20.04, CUDA 11.7, CUDNN 8.4)

pip install -e ".[gpu]"
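
After either install option, a quick sanity check can confirm which PyTorch backend is active. The snippet below is a sketch and is not part of MolGrapher: `describe_torch_backend` is a hypothetical helper name, and it relies only on standard PyTorch calls (`torch.cuda.is_available`, `torch.cuda.get_device_name`).

```python
import importlib.util


def describe_torch_backend() -> str:
    """Describe the installed PyTorch backend (hypothetical helper, not a MolGrapher API)."""
    # Avoid an ImportError if the environment was not set up correctly.
    if importlib.util.find_spec("torch") is None:
        return "torch is not installed in this environment"

    import torch

    if torch.cuda.is_available():
        # Report the first visible CUDA device.
        return f"torch {torch.__version__} with CUDA device: {torch.cuda.get_device_name(0)}"
    return f"torch {torch.__version__} running on CPU"


print(describe_torch_backend())
```

If the GPU option was installed but the message reports CPU, the CUDA toolkit or driver is probably not visible to PyTorch.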

To install and run MolGrapher using Docker, please refer to README_DOCKER.md.

Inference

Python

from molgrapher.models.molgrapher_model import MolgrapherModel

model = MolgrapherModel()
images_or_paths = ["./data/benchmarks/default/images/image_0.png"] 
annotations = model.predict_batch(images_or_paths) 

annotations is a list of dictionaries with the following fields:

[
    {
        'smi': 'O=C(O)C1=CC=C(C2=C(...',                      # MolGrapher SMILES prediction
        'conf': 0.991,                                        # MolGrapher confidence
        'file-info': {
            'filename': '...',                                # Input image filename
            'image_nbr': 1       
        }, 
        'abbreviations_ocr': [...],                           # Detected OCR text
        'abbreviations': [...],                               # Post-processed detected OCR text
        'annotator': {'program': 'MolGrapher', 'version': '1.0.0'},
   },
   ...
]
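
As a minimal sketch of how this output might be post-processed, the snippet below splits predictions by confidence and exports them as CSV. It uses only the 'smi', 'conf', and 'file-info' fields shown above. The helper names, the 0.9 threshold, and the sample values are illustrative assumptions, not part of MolGrapher.

```python
import csv
import io


def split_by_confidence(annotations, threshold=0.9):
    """Split annotations into (confident, uncertain) lists. The threshold is an arbitrary example value."""
    confident, uncertain = [], []
    for annotation in annotations:
        target = confident if annotation["conf"] >= threshold else uncertain
        target.append(annotation)
    return confident, uncertain


def to_csv(annotations) -> str:
    """Serialize filename, SMILES, and confidence of each annotation as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["filename", "smiles", "confidence"])
    for annotation in annotations:
        writer.writerow(
            [annotation["file-info"]["filename"], annotation["smi"], annotation["conf"]]
        )
    return buffer.getvalue()


# Invented sample data that mimics the structure of `model.predict_batch` output.
sample = [
    {"smi": "CCO", "conf": 0.991, "file-info": {"filename": "image_0.png", "image_nbr": 1}},
    {"smi": "C1=CC=CC=C1", "conf": 0.42, "file-info": {"filename": "image_1.png", "image_nbr": 2}},
]

confident, uncertain = split_by_confidence(sample)
print(to_csv(confident))
```

Low-confidence predictions collected in `uncertain` are natural candidates for manual review.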

Script

  1. Place your input images in: MolGrapher/data/benchmarks/default/.

  2. Run MolGrapher:

bash molgrapher/scripts/annotate/run.sh

  3. Read predictions in: MolGrapher/data/predictions/default/.

  4. (Optional) Visualize predictions in: MolGrapher/data/visualization/predictions/default/.
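
A similar directory-based workflow can be sketched from Python. The helpers below are hypothetical and not part of MolGrapher: they collect the .png files in a directory and pass them to `predict_batch` in fixed-size chunks. This assumes `predict_batch` accepts any list of image paths, as in the Python example above. The batch size of 16 is an arbitrary choice.

```python
from pathlib import Path


def list_images(directory):
    """Return the sorted paths of all .png files directly inside `directory`."""
    return sorted(str(path) for path in Path(directory).glob("*.png"))


def chunked(items, size):
    """Yield consecutive slices of `items` holding at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def annotate_directory(model, directory, batch_size=16):
    """Run `model.predict_batch` over a directory, one chunk at a time."""
    annotations = []
    for batch in chunked(list_images(directory), batch_size):
        annotations.extend(model.predict_batch(batch))
    return annotations


print(list(chunked(["a.png", "b.png", "c.png"], 2)))
# [['a.png', 'b.png'], ['c.png']]
```

With a loaded model, this would be called as `annotate_directory(MolgrapherModel(), "./data/benchmarks/default/images/")`.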

Model

Models are available on Hugging Face. They are downloaded automatically when inference is run. The model parameters are documented here.

Docling Integration

Docling is a toolkit to extract the content and structure from PDF documents. It recognizes page layout, reading order, table structure, code, and formulas, classifies images, and more. Here, we combine Docling and MolGrapher:

  • Docling segments and classifies chemical-structure images from document pages,
  • MolGrapher converts images to SMILES.

Install docling in the molgrapher environment.

pip install docling

Option 1: Convert a PDF document with docling and enrich it with MolGrapher annotations.

Example:

bash molgrapher/scripts/annotate/docling/docling_convert_and_enrich.sh ./data/pdfs/US9259003_page_4.pdf ./data/docling_documents/US9259003_page_4/
# bash [script] [pdf-path] [docling-document-directory-path]

Option 2: Enrich an existing docling document with MolGrapher annotations.

Example:

python3 molgrapher/scripts/annotate/docling/enrich_docling_document.py --docling-document-directory-path ./data/docling_documents/US9259003_page_4/  
# python3 [script] --docling-document-directory-path [docling-document-directory-path]

The docling document, enriched with SMILES predictions, is stored in [docling-document-directory-path]. For more information, please refer to docling.

USPTO-30K Benchmark

USPTO-30K is available on Hugging Face. It consists of three subsets:

  • USPTO-10K contains 10,000 clean molecules, i.e. without any abbreviated groups.
  • USPTO-10K-abb contains 10,000 molecules with superatom groups.
  • USPTO-10K-L contains 10,000 clean molecules with more than 70 atoms.

Synthetic Dataset

The synthetic dataset is available on Hugging Face. Images and graphs are generated using MolDepictor.

Training

To train the keypoint detector:

python3 ./molgrapher/scripts/train/train_keypoint_detector.py

To train the node classifier:

python3 ./molgrapher/scripts/train/train_graph_classifier.py
