Contributor

@r0ny123 r0ny123 commented Jun 13, 2025

This commit addresses issue #80 by refactoring the MongoDB storage backend to store binary sample data in GridFS. GridFS splits each file into chunks, so it can hold binaries that exceed MongoDB's 16 MB document size limit, and it is the mechanism MongoDB recommends for such files.

Key changes include:

  • Modified MongoDbStorage.py and SampleEntry.py to integrate GridFS.
  • When samples are added via addSmdaReport, their binary content (assumed to be in smda_report.buffer) is now stored in GridFS. A reference (GridFS file ID) is stored in the SampleEntry.
  • Sample retrieval methods (getSampleById, getSampleBySha256) now fetch binary data from GridFS if a gridfs_id is present.
  • Deleting a sample via deleteSample now also removes the corresponding binary file from GridFS.
  • Implemented a new utility method cleanup_orphan_gridfs_objects in MongoDbStorage.py. This method identifies and deletes GridFS files that are no longer referenced by any SampleEntry in the database, helping to reclaim storage space.
  • Added comprehensive integration tests in tests/testStorage.py to cover:
    • Storing and retrieving sample binary data via GridFS.
    • Deletion of samples and their corresponding GridFS files.
    • Correct identification and removal of orphan GridFS files by the cleanup utility under different scenarios.
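
The store, retrieve, and orphan-cleanup flow described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the function names `store_sample_binary`, `load_sample_binary`, and `cleanup_orphans` are hypothetical, as is the `sha256` document key, and only `gridfs_id` is taken from the description. The `fs` argument is assumed to behave like PyMongo's `gridfs.GridFS` (`put`, `get`, `delete`, `find`), and `samples` like a PyMongo collection (`insert_one`, `distinct`).

```python
def store_sample_binary(fs, samples, sample_doc, binary):
    """Write the binary to GridFS and keep only its file ID on the sample document.

    Hypothetical helper: `fs` is assumed to be a gridfs.GridFS-like object and
    `samples` a pymongo collection-like object.
    """
    # put() returns the ID of the newly created GridFS file.
    file_id = fs.put(binary, filename=sample_doc["sha256"])
    samples.insert_one({**sample_doc, "gridfs_id": file_id})
    return file_id


def load_sample_binary(fs, sample_doc):
    """Return the binary for a sample, or None if it has no GridFS reference."""
    gridfs_id = sample_doc.get("gridfs_id")
    if gridfs_id is None:
        return None
    # get() returns a file-like object; read() yields the full content.
    return fs.get(gridfs_id).read()


def cleanup_orphans(fs, samples):
    """Delete GridFS files that no sample document references; return the count."""
    referenced = set(samples.distinct("gridfs_id"))
    # Collect IDs first so we do not delete while iterating over the listing.
    orphan_ids = [f._id for f in fs.find() if f._id not in referenced]
    for file_id in orphan_ids:
        fs.delete(file_id)
    return len(orphan_ids)
```

With a real database, `fs` would typically be created as `gridfs.GridFS(db)`. Note that in this sketch a file written by `put` but not yet referenced by an inserted document looks like an orphan, so a cleanup of this kind is safest when no samples are being added concurrently.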

Moving binaries out of the sample documents and into GridFS improves the scalability and efficiency of handling binary samples within mcrit.

@r0ny123 r0ny123 closed this Jun 13, 2025
@r0ny123 r0ny123 deleted the feat/gridfs-storage branch June 13, 2025 20:03