A Comprehensive Database of Biologically Inactive Compounds
InertDB is a curated chemical database designed to address the lack of biologically inactive compounds in predictive modeling for AI-based drug discovery. This limitation often leads to biased datasets dominated by active compounds, reducing the diversity and robustness of machine learning models.
InertDB bridges this gap by providing:
- Curated Inactive Compounds (CICs): 3,205 inactive compounds rigorously curated from PubChem BioAssays.
- Generated Inactive Compounds (GICs): 64,368 potential inactive compounds generated using deep generative AI trained on the CICs.
By offering a comprehensive resource for biologically inactive small molecules and expanding the chemical space with GICs, InertDB aims to enhance the robustness and accuracy of predictive AI models in toxicology and pharmacology.
- Diverse Assays: CICs are extracted from over 260 million PubChem bioassay results, leveraging an NLP-based assay diversity metric.
- AI-Generated Inactives: GICs supplement chemical space using RNN-based deep generative AI (inertdb_generator.py).
- Low PAINS Content: Minimizes frequent false positives in high-throughput screening.
- Drug-Like Properties: CICs exhibit physicochemical properties comparable to approved drugs.
- Validated Performance: Predictive modeling benchmarks (LIT-PCBA and MUV) show significant improvements.
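One way the inactives might serve as negatives in a training set is sketched below. This is an illustration only, not the procedure used in the benchmarks above: the function name `build_labeled_set` and its `ratio` and `seed` parameters are hypothetical, and overlap between actives and inactives is detected by exact string match, which assumes both lists use the same SMILES canonicalization.

```python
import random


def build_labeled_set(actives, inactives, ratio=1.0, seed=0):
    """Combine actives (label 1) with sampled inactives (label 0).

    ratio is the desired number of inactives per active; if fewer
    inactives are available than requested, all of them are used.
    """
    active_set = set(actives)
    # Drop any inactive that also appears among the actives
    # (exact string match; assumes consistent canonical SMILES).
    candidates = sorted(set(inactives) - active_set)
    n_wanted = min(len(candidates), round(len(active_set) * ratio))

    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    negatives = rng.sample(candidates, n_wanted)

    labeled = [(smi, 1) for smi in sorted(active_set)]
    labeled += [(smi, 0) for smi in negatives]
    rng.shuffle(labeled)
    return labeled
```

Keeping the active-to-inactive ratio explicit makes it easier to see how class balance affects a model, which is the kind of bias the database is meant to reduce.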
InertDB/
├── data/ # Pre-processed datasets of CICs and GICs
│ ├── inertdb_cic_v2024.03.smi
│ ├── inertdb_gic_v2024.03.smi
│
├── inertdb_generator.py # Script for generating additional GICs
├── README.md # Project documentation (this file)
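The .smi files under data/ can be read with a few lines of Python. The sketch below is a minimal reader, assuming that each non-empty line starts with a SMILES string and that anything after the first whitespace (such as an identifier) can be ignored; the exact column layout of the files is not stated here, so check it before relying on this. The function name `read_smi` is hypothetical.

```python
from pathlib import Path


def read_smi(path):
    """Return the first whitespace-separated token of each non-empty line.

    Assumption: that token is the SMILES string. A header line, if the
    file has one, would be returned as an ordinary entry.
    """
    smiles = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        tokens = line.split()
        if tokens:  # skip blank lines
            smiles.append(tokens[0])
    return smiles
```

For example, `read_smi("data/inertdb_cic_v2024.03.smi")` would return the CIC entries as a list of strings once the file is available locally.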
Download the CICs and GICs datasets:
wget https://raw.githubusercontent.com/ann081993/InertDB/main/data/inertdb_cic_v2024.03.smi
wget https://raw.githubusercontent.com/ann081993/InertDB/main/data/inertdb_gic_v2024.03.smi
Use the provided script to generate new GICs using the pre-trained generative AI model.
Ensure the following Python packages are installed, or install the dependencies from requirements.txt:
- tensorflow
- numpy
- rdkit
conda create -n inertdb python=3.10
conda activate inertdb
pip install -r requirements.txt
Generate additional GICs by specifying the number of iterations:
python inertdb_generator.py -n NUM_GENERATIONS -o OUTPUT_FILE
- NUM_GENERATIONS: Number of iterations to generate (each iteration produces 1,000 SMILES).
- OUTPUT_FILE: Name of the file to save the generated GICs (default: gic.txt).
Example:
python inertdb_generator.py -n 5 -o my_gics.txt
This generates up to 5,000 SMILES strings and saves the valid, unique SMILES to my_gics.txt.
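The "up to" figure follows from 1,000 SMILES per iteration, with invalid and duplicate strings removed before saving. The sketch below shows that arithmetic, plus one possible post-processing step: keeping only generated SMILES that are not already in a known set, such as the downloaded GICs. The helper names are hypothetical and not part of inertdb_generator.py, and the comparison is an exact string match, so it assumes both sets are written in the same canonical form. The optional validity check uses RDKit's `Chem.MolFromSmiles`, imported lazily so the rest runs without RDKit.

```python
def max_generated(num_generations, per_iteration=1000):
    """Upper bound on generated SMILES (1,000 per iteration)."""
    return num_generations * per_iteration


def novel_smiles(generated, known, is_valid=None):
    """Keep generated SMILES that are unique, valid, and not in known.

    is_valid is an optional callable taking a SMILES string and
    returning True or False; when omitted, no validity check is made.
    """
    seen = set(known)
    kept = []
    for smi in generated:
        smi = smi.strip()
        if not smi or smi in seen:
            continue  # blank, duplicate, or already known
        if is_valid is not None and not is_valid(smi):
            continue
        seen.add(smi)
        kept.append(smi)
    return kept


def rdkit_is_valid(smi):
    """Validity check: RDKit returns None for unparsable SMILES."""
    from rdkit import Chem  # lazy import; requires rdkit
    return Chem.MolFromSmiles(smi) is not None
```

With RDKit installed, `novel_smiles(new_gics, existing_gics, is_valid=rdkit_is_valid)` would return only the parsable strings that are new relative to `existing_gics`.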
If you use InertDB in your research, please consider citing the following publication:
@article{An2025,
author = {Seungchan An and Yeonjin Lee and Junpyo Gong and Seokyoung Hwang and In Guk Park and Jayhyun Cho and Min Ju Lee and Minkyu Kim and Yun Pyo Kang and Minsoo Noh},
title = {InertDB as a generative AI-expanded resource of biologically inactive small molecules from PubChem},
journal = {Journal of Cheminformatics},
year = {2025},
volume = {17},
pages = {49},
doi = {10.1186/s13321-025-00999-1},
url = {https://doi.org/10.1186/s13321-025-00999-1}
}
InertDB is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. The curated dataset is freely available for academic and non-commercial research purposes. For commercial use, a license agreement is required. Please contact [ann081993 at snu dot ac dot kr] or refer to the LICENSE for details.

