This is the repository for MolGrapher: Graph-based Visual Recognition of Chemical Structures. MolGrapher is a model that converts images of molecules into molecular graphs.
If you find this repository useful, please consider citing:
@inproceedings{Morin_2023_ICCV,
title = {{MolGrapher: Graph-based Visual Recognition of Chemical Structures}},
author = {Morin, Lucas and Danelljan, Martin and Agea, Maria Isabel and Nassar, Ahmed and Weber, Valery and Meijer, Ingmar and Staar, Peter and Yu, Fisher},
year = 2023,
month = {October},
booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
pages = {19552--19561}
}
Create a virtual environment.
python3.11 -m venv molgrapher-env
source molgrapher-env/bin/activate
Option 1: Install MolGrapher for CPU.
For Linux:
pip install -e .["cpu"]
For macOS:
pip install torch==2.1.2 torchvision==0.16.2 "paddlepaddle==0.0.0" -f https://www.paddlepaddle.org.cn/whl/mac/cpu/develop.html "git+https://github.com/lucas-morin/MolGrapher#egg=molgrapher"
Option 2: Install MolGrapher for GPU. (Tested on x86_64, Linux Ubuntu 20.04, CUDA 11.7, cuDNN 8.4)
pip install -e .["gpu"]
To install and run MolGrapher using Docker, please refer to README_DOCKER.md.
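After installing, a quick sanity check can confirm that the main dependencies are importable from the active environment. The snippet below is a minimal sketch, not part of MolGrapher: the list of modules is an assumption based on the packages named in the install commands (`paddle` is the import name of the `paddlepaddle` package), and `check_modules` is a hypothetical helper.

```python
import importlib.util

# Assumed list of import names, taken from the install commands above.
# "paddle" is the import name of the "paddlepaddle" package.
REQUIRED_MODULES = ["torch", "torchvision", "paddle", "molgrapher"]


def check_modules(names=REQUIRED_MODULES):
    """Map each top-level module name to True if it can be found, without importing it."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


status = check_modules()
for name, found in status.items():
    print(f"{name}: {'ok' if found else 'MISSING'}")
```

If any module is reported as missing, check that the virtual environment is activated and that the install command finished without errors.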
from molgrapher.models.molgrapher_model import MolgrapherModel
model = MolgrapherModel()
images_or_paths = ["./data/benchmarks/default/images/image_0.png"]
annotations = model.predict_batch(images_or_paths)
annotations is a list of dictionaries with the following fields:
[
{
'smi': 'O=C(O)C1=CC=C(C2=C(...', # MolGrapher SMILES prediction
'conf': 0.991, # MolGrapher confidence
'file-info': {
'filename': '...', # Input image filename
'image_nbr': 1
},
'abbreviations_ocr': [...], # Detected OCR text
'abbreviations': [...], # Post-processed detected OCR text
'annotator': {'program': 'MolGrapher', 'version': '1.0.0'},
},
...
]
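A common next step is to keep only confident predictions and export them. The sketch below works on a hand-written sample that mimics the fields listed above, so it runs without the model. The 0.9 threshold, the helper names `keep_confident` and `to_csv`, and the idea that a failed prediction may have an empty or missing `smi` are assumptions for illustration, not documented MolGrapher behavior.

```python
import csv
import io


def keep_confident(annotations, min_conf=0.9):
    """Keep annotations that have a non-empty SMILES and conf >= min_conf."""
    return [
        a for a in annotations
        if a.get("smi") and a.get("conf", 0.0) >= min_conf
    ]


def to_csv(annotations):
    """Serialize filename, SMILES and confidence as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["filename", "smi", "conf"])
    for a in annotations:
        writer.writerow([a["file-info"]["filename"], a["smi"], a["conf"]])
    return buffer.getvalue()


# Hand-written sample with the same structure as the output of predict_batch.
sample = [
    {"smi": "CCO", "conf": 0.97,
     "file-info": {"filename": "image_0.png", "image_nbr": 1}},
    {"smi": "c1ccccc1", "conf": 0.42,
     "file-info": {"filename": "image_1.png", "image_nbr": 1}},
]

confident = keep_confident(sample)
print(to_csv(confident))
```

With real predictions, replace `sample` with the `annotations` list returned by `model.predict_batch`. A suitable threshold depends on the data and should be chosen by inspecting a few predictions.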
- Place your input images in: MolGrapher/data/benchmarks/default/.
- Run MolGrapher: bash molgrapher/scripts/annotate/run.sh
- Read predictions in: MolGrapher/data/predictions/default/.
- (Optional) Visualize predictions in: MolGrapher/data/visualization/predictions/default/.
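The first step above, placing images in the input directory, can be scripted when the images live elsewhere. This is a minimal sketch: `stage_images` is a hypothetical helper, the set of accepted file extensions is an assumption, and the target directory should be whichever input directory the step above specifies.

```python
import shutil
from pathlib import Path

# Assumed set of image extensions; adjust to the formats you actually use.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


def stage_images(source_dir, target_dir):
    """Copy image files from source_dir into target_dir and return the copied names."""
    source, target = Path(source_dir), Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    for path in sorted(source.iterdir()):
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES:
            shutil.copy2(path, target / path.name)
            copied.append(path.name)
    return copied


# Example call (paths are placeholders):
# stage_images("/path/to/my/images", "data/benchmarks/default")
```

Files are copied rather than moved, so the originals stay untouched. Files with the same name in the target directory are overwritten.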
Models are available on Hugging Face. They are automatically downloaded when running the model's inference. The model parameters are documented here.
Docling is a toolkit to extract the content and structure from PDF documents. It recognizes page layout, reading order, table structure, code, and formulas, classifies images, and more.
Here, we combine docling and MolGrapher:
- Docling segments and classifies chemical-structure images from document pages.
- MolGrapher converts images to SMILES.
Install docling in the molgrapher environment.
pip install docling
Option 1: Convert a PDF document with docling and enrich it with MolGrapher annotations.
Example:
bash molgrapher/scripts/annotate/docling/docling_convert_and_enrich.sh ./data/pdfs/US9259003_page_4.pdf ./data/docling_documents/US9259003_page_4/
# bash [script] [pdf-path] [docling-document-directory-path]
Option 2: Enrich an existing docling document with MolGrapher annotations.
Example:
python3 molgrapher/scripts/annotate/docling/enrich_docling_document.py --docling-document-directory-path ./data/docling_documents/US9259003_page_4/
# python3 [script] --docling-document-directory-path [docling-document-directory-path]
The docling document, enriched with SMILES predictions, is stored in [docling-document-directory-path].
For more information, please refer to docling.
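To process several PDFs, the Option 1 script can be called once per file. The sketch below only builds and runs the command lines shown in the example above. `build_commands` and `run_all` are hypothetical helpers, and writing each document to its own output directory named after the PDF is an assumption that mirrors the example paths.

```python
import subprocess
from pathlib import Path

# Script path taken from the Option 1 example; run from the repository root.
SCRIPT = "molgrapher/scripts/annotate/docling/docling_convert_and_enrich.sh"


def build_commands(pdf_dir, output_root):
    """Build one 'bash [script] [pdf-path] [output-directory]' command per PDF."""
    commands = []
    for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
        # Assumption: one output directory per PDF, named after the file stem.
        out_dir = str(Path(output_root) / pdf.stem) + "/"
        commands.append(["bash", SCRIPT, str(pdf), out_dir])
    return commands


def run_all(commands):
    """Run the commands one after another and stop at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)


# Example call (paths follow the example above):
# run_all(build_commands("./data/pdfs", "./data/docling_documents"))
```

Building the commands separately from running them makes it easy to print and review them before starting a long batch.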
USPTO-30K is available on Hugging Face.
- USPTO-10K contains 10,000 clean molecules, i.e. without any abbreviated groups.
- USPTO-10K-abb contains 10,000 molecules with superatom groups.
- USPTO-10K-L contains 10,000 clean molecules with more than 70 atoms.
The synthetic dataset is available on Hugging Face. Images and graphs are generated using MolDepictor.
To train the keypoint detector:
python3 ./molgrapher/scripts/train/train_keypoint_detector.py
To train the node classifier:
python3 ./molgrapher/scripts/train/train_graph_classifier.py
