
Refactor: Implement Singleton ModelManager to fix VRAM memory leak#521

Open
Siddhazntx wants to merge 5 commits into AOSSIE-Org:main from Siddhazntx:enhancement/model-manager-singleton

Conversation


@Siddhazntx Siddhazntx commented Mar 3, 2026

Addressed Issues:

Fixes #520

Screenshots/Recordings:

Server Startup Output:

> [nltk_data]    |
> [nltk_data]  Done downloading collection popular
> Starting Flask App...
> Initializing Shared ModelManager... Loading massive models into memory ONCE.
> You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>.

Additional Notes:

Summary of Changes:

  • Memory Optimization: Implemented a ModelManager Singleton in main.py to prevent server.py from loading the 3GB t5-large model and spacy tools multiple times into RAM/VRAM.
  • Refactoring: Updated MCQGenerator, ShortQGenerator, and ParaphraseGenerator to utilize lightweight pointers to the shared models via the ModelManager. This drops the overall memory footprint from ~9GB down to ~3GB and significantly speeds up server boot times.
  • Test Fix: Updated the stale /get_answer endpoint string in test_server.py to the correct /get_shortq_answer route. All backend tests now pass locally without any Out-Of-Memory (OOM) crashes.

This is strictly an internal architectural improvement; the external API behavior remains completely unchanged.
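The singleton-plus-shared-references pattern described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the heavy loads (the t5-large tokenizer/model, spaCy, sense2vec) are replaced with placeholders so the sketch stays self-contained, and `MCQGenerator` is a simplified stand-in for the real generator classes.

```python
class ModelManager:
    """Process-wide holder for heavy shared resources (illustrative sketch)."""
    _instance = None

    def __new__(cls):
        if cls._instance is None:
            instance = super().__new__(cls)
            instance._load_models()  # heavy work runs exactly once
            cls._instance = instance
        return cls._instance

    def _load_models(self):
        # Placeholders for the real loads, e.g.
        # T5Tokenizer.from_pretrained("t5-large"), spacy.load(...), etc.
        self.tokenizer = object()
        self.model = object()
        self.device = "cpu"


class MCQGenerator:
    """Simplified stand-in: holds references to the shared models, not copies."""
    def __init__(self):
        shared = ModelManager()
        self.tokenizer = shared.tokenizer
        self.model = shared.model
        self.device = shared.device
```

Because every generator resolves to the same `ModelManager` instance, constructing additional generators adds only a few attribute references instead of another multi-gigabyte model load.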

Checklist

  • My PR addresses a single issue, fixes a single bug or makes a single improvement.
  • My code follows the project's code style and conventions
  • If applicable, I have made corresponding changes or additions to the documentation
  • If applicable, I have made corresponding changes or additions to tests
  • My changes generate no new warnings or errors
  • I have joined the Discord server and I will share a link to this PR with the project maintainers there
  • I have read the Contribution Guidelines
  • Once I submit my PR, CodeRabbit AI will automatically review it and I will address CodeRabbit's comments.

AI Usage Disclosure

Check one of the checkboxes below:

  • This PR does not contain AI-generated code at all.
  • This PR contains AI-generated code. I have tested the code locally and I am responsible for it.

Summary by CodeRabbit

  • Refactor

    • Share heavy ML resources across question generators to reduce startup cost and improve inference consistency
    • Improved device handling and disabled unnecessary gradient tracking for more reliable, faster predictions
  • Chores

    • Renamed test API endpoint for short-question answers to better reflect its purpose


coderabbitai bot commented Mar 3, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 97bf585 and 687653a.

📒 Files selected for processing (1)
  • backend/Generator/main.py

📝 Walkthrough


Introduces a ModelManager singleton that lazily loads and shares heavy ML models and NLP tools across generators; refactors MCQGenerator, ShortQGenerator, ParaphraseGenerator, and BoolQGenerator to use the shared resources. AnswerPredictor now moves its NLI model to the detected device, sets eval mode, and runs boolean prediction without grad. Test endpoint renamed.

Changes

  • Model manager & generators (backend/Generator/main.py): Adds ModelManager singleton for lazy, shared loading of tokenizer/model/device/nlp/s2v/fdist/normalized_levenshtein; refactors MCQGenerator, ShortQGenerator, ParaphraseGenerator, and BoolQGenerator to obtain shared resources instead of constructing duplicates.
  • Answer prediction fixes (backend/Generator/main.py): Updates AnswerPredictor to load/push the NLI model to self.device, call model.eval(), decorate predict_boolean_answer with @torch.no_grad(), and move input tensors to the correct device before inference.
  • Test endpoint rename (backend/test_server.py): Renames the test route and test function from /get_answer / test_get_answer to /get_shortq_answer / test_get_shortq_answer and updates the printed output formatting.
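The AnswerPredictor changes summarized above (move the model to the detected device, switch to eval mode, disable gradient tracking) follow a standard PyTorch inference pattern. A hedged sketch, with a tiny linear layer standing in for the real NLI model and only the method name taken from the walkthrough:

```python
import torch

class AnswerPredictor:
    """Inference-only wrapper (sketch; the real class loads an NLI model)."""

    def __init__(self, model):
        # Detect the device once and keep the model and inputs on it
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.model = model.to(self.device)
        self.model.eval()  # disable dropout/batch-norm updates for inference

    @torch.no_grad()  # skip autograd bookkeeping: faster, lower memory use
    def predict_boolean_answer(self, inputs):
        inputs = inputs.to(self.device)  # inputs must match the model's device
        return self.model(inputs)

# Tiny stand-in for the real NLI model
predictor = AnswerPredictor(torch.nn.Linear(4, 2))
logits = predictor.predict_boolean_answer(torch.ones(1, 4))
```

Running under `@torch.no_grad()` means the returned tensor carries no autograd graph, which is what keeps inference memory flat across requests.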

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant ShortQGenerator
    participant ModelManager
    participant NLIModel

    Client->>ShortQGenerator: POST /get_shortq_answer (input)
    ShortQGenerator->>ModelManager: request shared resources (tokenizer, model, device, nlp, s2v, fdist)
    ModelManager-->>ShortQGenerator: return shared references (lazily loaded if needed)
    ShortQGenerator->>ShortQGenerator: prepare inputs (tokenize, preprocess, move to device)
    ShortQGenerator->>NLIModel: call AnswerPredictor.predict_boolean_answer (inputs on device, no-grad, eval)
    NLIModel-->>ShortQGenerator: boolean prediction
    ShortQGenerator-->>Client: response (answer)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐇 I nibble code with care and cheer,
A single Manager now sits near.
No triple loads that made me grieve,
Generators share — less to achieve.
Hoppity hop, the server's clear!

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

  • Docstring Coverage (⚠️ Warning): Docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
  • Out of Scope Changes check (❓ Inconclusive): The PR includes a minor test endpoint rename (test_get_answer to test_get_shortq_answer), a necessary correction unrelated to the core objective but internally justified. Resolution: clarify whether the rename was part of the original scope or should be separated into a distinct PR for better change isolation.
✅ Passed checks (3 passed)
  • Description Check (✅ Passed): Check skipped because CodeRabbit's high-level summary is enabled.
  • Title check (✅ Passed): The title 'Refactor: Implement Singleton ModelManager to fix VRAM memory leak' accurately summarizes the main change: introducing a ModelManager singleton to resolve memory inefficiency.
  • Linked Issues check (✅ Passed): The PR implements the ModelManager singleton pattern to eliminate redundant model loading across generators, achieving the primary objective of issue #520.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
backend/test_server.py (1)

77-77: Align test naming/log label with the new endpoint.

Endpoint update is correct, but keeping old test_get_answer naming/output text is misleading during debugging.

✏️ Suggested cleanup
-def test_get_answer():
+def test_get_shortq_answer():
     endpoint = '/get_shortq_answer'
@@
-    print(f'/get_answer Response: {response}')
+    print(f'/get_shortq_answer Response: {response}')
@@
-    test_get_answer()
+    test_get_shortq_answer()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/test_server.py` at line 77, The test still uses the old name/label
"test_get_answer" while the endpoint variable was changed to
'/get_shortq_answer'; rename the test function and any log/assert messages to
match the new endpoint (e.g., change function name test_get_answer ->
test_get_shortq_answer and update any printed/logged strings or pytest ids), and
update any references to the endpoint in the test body that mention the old name
so test output and labels reflect endpoint variable endpoint =
'/get_shortq_answer'.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/Generator/main.py`:
- Around line 34-57: The ModelManager singleton is not thread-safe: both __new__
and __init__ use unsynchronized check-then-act on _instance and _is_initialized
causing multiple heavy initializations; fix by adding a class-level
threading.Lock (e.g., _init_lock) and use it to guard creation/initialization
with double-checked locking — acquire the lock in __new__ around the _instance
check before calling super().__new__, and/or acquire it at start of __init__ and
re-check _is_initialized inside the lock before loading models (update
references to ModelManager._instance and self._is_initialized accordingly) so
only one thread performs the heavy model loads.

---

Nitpick comments:
In `@backend/test_server.py`:
- Line 77: The test still uses the old name/label "test_get_answer" while the
endpoint variable was changed to '/get_shortq_answer'; rename the test function
and any log/assert messages to match the new endpoint (e.g., change function
name test_get_answer -> test_get_shortq_answer and update any printed/logged
strings or pytest ids), and update any references to the endpoint in the test
body that mention the old name so test output and labels reflect endpoint
variable endpoint = '/get_shortq_answer'.

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fc3bf1a and f525d73.

📒 Files selected for processing (2)
  • backend/Generator/main.py
  • backend/test_server.py


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/Generator/main.py`:
- Line 25: Remove the duplicate import of the threading module by keeping a
single "import threading" statement and deleting the redundant one; locate both
occurrences of the "import threading" import statements in main.py and remove
the second/redundant import so the module is only imported once (resolving the
Ruff F811 redefinition).

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f525d73 and 97bf585.

📒 Files selected for processing (2)
  • backend/Generator/main.py
  • backend/test_server.py

@Siddhazntx
Author

@coderabbitai, I have addressed both of these findings in the latest commit:

  1. Thread-Safety: Added a class-level threading.Lock() and implemented double-checked locking in both __new__ and __init__ within the ModelManager class. This ensures heavy model initialization is strictly single-threaded and prevents race conditions.
  2. Test Naming: Renamed test_get_answer to test_get_shortq_answer in test_server.py and updated the corresponding print statements to accurately reflect the endpoint being tested.

Thanks for catching these!
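For reference, the double-checked locking described in point 1 typically looks like the following sketch. Names and structure are assumed for illustration; the body of `_load_models` is a placeholder for the real model loading.

```python
import threading

class ModelManager:
    """Thread-safe lazy singleton via double-checked locking (sketch)."""
    _instance = None
    _lock = threading.Lock()

    def __new__(cls):
        if cls._instance is None:          # fast path: no lock once created
            with cls._lock:
                if cls._instance is None:  # re-check while holding the lock
                    instance = super().__new__(cls)
                    instance._load_models()  # heavy work, exactly once
                    cls._instance = instance
        return cls._instance

    def _load_models(self):
        # Placeholder for loading t5-large, spaCy, sense2vec, etc.
        self.loaded = True
```

The second check inside the lock is the important part: it stops two threads that both saw `_instance` as None from each performing the expensive load.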


coderabbitai bot commented Mar 3, 2026

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify thread-safety implementation in ModelManager
echo "=== Checking ModelManager for threading.Lock implementation ==="
ast-grep --pattern $'class ModelManager:
  $$$
'

echo -e "\n=== Checking for threading import ==="
rg -n "import threading" backend/Generator/main.py

echo -e "\n=== Checking for Lock usage in ModelManager ==="
rg -n -A5 -B5 "threading.Lock" backend/Generator/main.py

echo -e "\n=== Checking test_server.py for test_get_shortq_answer ==="
rg -n "def test_get_shortq_answer" backend/test_server.py

echo -e "\n=== Checking for old test_get_answer function ==="
rg -n "def test_get_answer" backend/test_server.py

echo -e "\n=== Checking endpoint usage in test ==="
rg -n "get_shortq_answer" backend/test_server.py

Length of output: 3120


@Siddhazntx, excellent work! I've verified both fixes:

Thread-Safety: Confirmed that threading.Lock() is properly declared as a class-level variable and double-checked locking is correctly implemented in both __new__ and __init__. The pattern ensures thread-safe singleton initialization.

Test Naming: Verified that test_get_answer has been successfully renamed to test_get_shortq_answer throughout the test file, including the endpoint variable and test runner invocation.

Both issues are now fully addressed. Great job on the quick turnaround and thorough implementation! 🎉


If you found this review helpful, would you consider giving us a shout-out on X?

Thank you for using CodeRabbit!



Development

Successfully merging this pull request may close these issues.

[Enhancement]: Fix redundant model loading with Singleton ModelManager
