Refactor storage to use GridFS for binary samples and add orphan cleanup #84

r0ny123 · 2025-06-13T20:02:02Z

This commit addresses issue #80 by refactoring the MongoDB storage backend to store binary sample data in GridFS. GridFS splits each file into chunks, so it can hold binaries that would not fit within MongoDB's 16 MiB BSON document limit, and it is MongoDB's recommended mechanism for such files.

Key changes include:

- Modified `MongoDbStorage.py` and `SampleEntry.py` to integrate GridFS.
- When samples are added via `addSmdaReport`, their binary content (assumed to be in `smda_report.buffer`) is now stored in GridFS. A reference (GridFS file ID) is stored in the `SampleEntry`.
- Sample retrieval methods (`getSampleById`, `getSampleBySha256`) now fetch binary data from GridFS if a `gridfs_id` is present.
- Deleting a sample via `deleteSample` now also removes the corresponding binary file from GridFS.
- Implemented a new utility method `cleanup_orphan_gridfs_objects` in `MongoDbStorage.py`. This method identifies and deletes GridFS files that are no longer referenced by any `SampleEntry` in the database, helping to reclaim storage space.
- Added comprehensive integration tests in `tests/testStorage.py` to cover:
  - Storing and retrieving sample binary data via GridFS.
  - Deletion of samples and their corresponding GridFS files.
  - Correct identification and removal of orphan GridFS files by the cleanup utility under different scenarios.
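
For reviewers who want a concrete picture of the orphan cleanup described above, here is a minimal sketch of the idea. It is not the code from this PR. The function names `find_orphan_ids` and `cleanup_orphans` and the `dry_run` flag are hypothetical. The sketch assumes `fs` behaves like a `gridfs.GridFS` instance (`find()` yielding objects with an `_id` attribute, and `delete(file_id)`), that `samples` behaves like a pymongo `Collection` (`distinct(key)`), and that sample documents keep their reference in a `gridfs_id` field.

```python
from typing import Any, Iterable, List, Set


def find_orphan_ids(stored_ids: Iterable[Any], referenced_ids: Iterable[Any]) -> List[Any]:
    """Return the stored file IDs that no sample entry references."""
    # Samples without a binary may carry gridfs_id = None; ignore those.
    referenced: Set[Any] = {rid for rid in referenced_ids if rid is not None}
    return [fid for fid in stored_ids if fid not in referenced]


def cleanup_orphans(fs: Any, samples: Any, dry_run: bool = False) -> List[Any]:
    """Delete GridFS files that no sample document references.

    Assumption: `fs` is GridFS-like and `samples` is Collection-like,
    as described in the text above. Returns the list of orphan IDs.
    """
    # Every file currently held in GridFS.
    stored_ids = [grid_out._id for grid_out in fs.find()]
    # Every GridFS ID still referenced by a sample document.
    referenced_ids = samples.distinct("gridfs_id")
    orphans = find_orphan_ids(stored_ids, referenced_ids)
    if not dry_run:
        for file_id in orphans:
            fs.delete(file_id)
    return orphans
```

With a real deployment this might be called as `cleanup_orphans(gridfs.GridFS(db), db["samples"])`, where the collection name `samples` is only a placeholder. One caveat of this approach: a file written to GridFS just before its sample document is inserted would look like an orphan, so a cleanup of this kind is probably best run while no imports are in flight, or first with `dry_run=True`.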

This refactoring improves the scalability and efficiency of handling binary samples within mcrit.


r0ny123 closed this Jun 13, 2025

r0ny123 deleted the feat/gridfs-storage branch June 13, 2025 20:03
