Conversation
AOSSIE-Org#428) When use_mediawiki=1 is passed but the MediaWiki API call fails (SSL error, timeout, network unreachable, etc.), the entire request crashed with a 500 response. Users saw a generic error and lost their input.

Changes:
- Wrap mediawikiapi.summary() in try/except inside process_input_text()
- On failure, log a warning and continue with the original input text
- Return a "warning" field in the JSON response so the frontend can notify users that Wikipedia enrichment was skipped
- Update all 7 endpoints that use MediaWiki: /get_mcq, /get_boolq, /get_shortq, /get_problems, /get_shortq_hard, /get_mcq_hard, /get_boolq_hard
- Add conftest.py with a session-scoped fixture to prevent heavy ML model loading during tests
- Add 30 new pytest tests covering SSL errors, connection errors, timeouts, successful calls, and text preservation

Co-authored-by: Cursor <cursoragent@cursor.com>
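The fallback described above can be sketched roughly as follows. This is a minimal illustration, not the PR's exact code: `fetch_summary` stands in for `mediawikiapi.summary`, and the function name and warning wording are hypothetical.

```python
import logging

def enrich_text(input_text, fetch_summary):
    """Try to enrich input_text via a summary fetcher; fall back on failure.

    fetch_summary is a stand-in for mediawikiapi.summary. Any exception
    (SSL error, timeout, unreachable network) triggers the fallback.
    """
    warning = None
    try:
        text = fetch_summary(input_text)
    except Exception as exc:  # broad on purpose: network failures vary widely
        logging.warning("Wikipedia enrichment skipped: %s", exc)
        text = input_text  # preserve the user's original input
        warning = "Wikipedia enrichment was skipped due to an API error."
    return text, warning

# A failing fetcher leaves the original text intact and sets a warning:
def boom(_):
    raise TimeoutError("Connection timed out")

text, warning = enrich_text("photosynthesis", boom)
```

The endpoint can then generate questions from `text` as usual and attach `warning` to the JSON response only when it is set.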
- Use defensive .get() with default [] on output["questions"] and output["Boolean_Questions"] to prevent KeyError when the generator returns an empty dict (major)
- Rename test_ssl_error_hard_endpoints to test_connection_error_hard_endpoints to match the actual exception (minor)

Co-authored-by: Cursor <cursoragent@cursor.com>
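The defensive-access change can be illustrated with a tiny sketch; the helper name here is hypothetical, since the PR applies `.get()` inline at the call sites:

```python
def extract_question_lists(output):
    """Read generator output defensively: a missing key yields []
    instead of raising KeyError when the generator returns an empty dict."""
    questions = output.get("questions", [])
    boolean_questions = output.get("Boolean_Questions", [])
    return questions, boolean_questions

# An empty generator result no longer crashes the endpoint:
empty = extract_question_lists({})
```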
- Add python-pptx dependency to requirements.txt
- Implement extract_text_from_pptx() in the FileProcessor class
  - Extracts text from text frames (titles, text boxes, placeholders)
  - Extracts text from table cells across all slides
- Update process_file() to route .pptx files to the new extractor
- Update frontend labels and file-accept filters in:
  - eduaid_web (Text_Input.jsx)
  - extension (TextInput.jsx)
- Add 6 unit tests covering:
  - Simple text extraction
  - Table text extraction
  - Empty presentations
  - process_file() routing for .pptx
  - Unsupported .ppt rejection
  - Multi-slide extraction

Closes AOSSIE-Org#361
- Add explicit .ppt handling with warning log for unsupported legacy format
- Rewrite tests to import the real FileProcessor from Generator.main
(mocking heavy ML dependencies via sys.modules to avoid model loading)
- Sync upload hint text with accept filter in both web and extension
('PDF, PPTX, TXT, DOCX, MP3 supported')
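A rough sketch of the traversal that extract_text_from_pptx() performs. To stay runnable without python-pptx installed, it walks duck-typed objects that mirror the python-pptx model (slide.shapes, shape.has_text_frame, shape.has_table), so the stand-in classes below are illustrative rather than the PR's exact code:

```python
def extract_slide_text(slides):
    """Collect text from text frames and table cells across all slides."""
    parts = []
    for slide in slides:
        for shape in slide.shapes:
            # Titles, text boxes, and placeholders expose a text frame
            if getattr(shape, "has_text_frame", False):
                text = shape.text_frame.text.strip()
                if text:
                    parts.append(text)
            # Tables are walked row by row, cell by cell
            if getattr(shape, "has_table", False):
                for row in shape.table.rows:
                    for cell in row.cells:
                        if cell.text.strip():
                            parts.append(cell.text.strip())
    return "\n".join(parts)

# Minimal stand-ins for a quick check (real code would open the file with
# pptx.Presentation and iterate presentation.slides):
class _Cell:
    def __init__(self, text): self.text = text

class _Row:
    def __init__(self, cells): self.cells = cells

class _Table:
    def __init__(self, rows): self.rows = rows

class _Frame:
    def __init__(self, text): self.text = text

class _Shape:
    def __init__(self, frame=None, table=None):
        self.has_text_frame = frame is not None
        self.has_table = table is not None
        self.text_frame, self.table = frame, table

class _Slide:
    def __init__(self, shapes): self.shapes = shapes

slides = [_Slide([
    _Shape(frame=_Frame("Title")),
    _Shape(table=_Table([_Row([_Cell("A1"), _Cell("B1")])])),
])]
```

With python-pptx, the same loop runs over `Presentation(path).slides` directly; legacy .ppt files never reach it because the router rejects them first.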
Add a TextProcessor utility that breaks large documents into overlapping 'Context Blocks' before they reach the question-generation pipeline. This prevents OOM errors and pipeline hangs when processing large PDFs (50+ pages).

Changes:
- Add backend/utils/text_processor.py with RecursiveCharacterTextSplitter-style chunking (zero new dependencies, pure Python stdlib)
- Add process_file_chunked() method to FileProcessor in Generator/main.py
- Update the /upload endpoint in server.py to return chunks alongside content (backward-compatible: the existing 'content' field is preserved)
- Add 38 unit tests in test_text_processor.py covering chunking logic, overlap, edge cases, and metadata generation

The splitter uses a separator hierarchy (paragraph > line > sentence > word > character) with configurable chunk_size (default 1000 chars) and chunk_overlap (default 200 chars), keeping each chunk within the T5 model's 512-token limit.
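A minimal sketch of the overlap mechanics, assuming fixed-size character windows only. The actual splitter additionally prefers paragraph, line, sentence, and word boundaries before falling back to raw characters, and its metadata fields may differ from the illustrative ones below:

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=200):
    """Split text into overlapping chunks with simple positional metadata."""
    if not text or not text.strip():
        return []
    step = chunk_size - chunk_overlap  # each new window starts 800 chars later
    chunks = []
    for start in range(0, len(text), step):
        chunks.append({
            "index": len(chunks),
            "start": start,
            "text": text[start:start + chunk_size],
        })
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the document
    return chunks

doc = "".join(str(i % 10) for i in range(2500))
chunks = chunk_text(doc)
# Consecutive chunks share chunk_overlap characters:
# chunks[0]["text"][-200:] == chunks[1]["text"][:200]
```

The overlap means a sentence cut at a window boundary still appears whole in the next chunk, which keeps each block self-contained for downstream question generation.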
📝 Walkthrough

This pull request introduces PowerPoint (.pptx) text extraction support, chunked text processing via a new TextProcessor utility, and MediaWiki-based text enrichment with warning propagation. FileProcessor gains PPTX extraction and chunked-processing methods; the Flask server integrates these features and propagates warnings across the MCQ/boolean/short-answer generation endpoints. Comprehensive test coverage validates PPTX extraction, TextProcessor chunking behavior, and graceful MediaWiki fallback. Web UI updates reflect expanded file-format support (PPTX, TXT, DOCX).
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant Server as Flask Server
    participant FileProc as FileProcessor
    participant TextProc as TextProcessor
    Client->>Server: POST /upload (PPTX file)
    Server->>FileProc: process_file(file)
    FileProc->>FileProc: extract_text_from_pptx()
    FileProc->>TextProc: chunk_document(text)
    TextProc->>TextProc: _recursive_split() with overlaps
    TextProc-->>FileProc: [chunks with metadata]
    FileProc-->>Server: extracted text + chunks
    Server-->>Client: {content, chunks, num_chunks}
```
```mermaid
sequenceDiagram
    participant Client
    participant Server as Flask Server
    participant MediaWiki as MediaWiki API
    participant Generator as MCQ/BoolQ/ShortQ
    Client->>Server: POST /get_mcq?text=...&use_mediawiki=1
    Server->>MediaWiki: fetch_enriched_text()
    alt MediaWiki Success
        MediaWiki-->>Server: enriched_text
        Server->>Generator: generate(enriched_text)
    else MediaWiki Failure
        MediaWiki-->>Server: Exception
        Server->>Generator: generate(original_text)
        Server->>Server: wiki_warning = "API failed"
    end
    Generator-->>Server: questions
    Server-->>Client: {output, warning?}
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
🚥 Pre-merge checks: ✅ Passed checks (3 passed)
Actionable comments posted: 1
🧹 Nitpick comments (6)
backend/server.py (1)
472-492: Consider passing `source_type` to `chunk_document` for better metadata.

The upload endpoint processes various file types (PDF, PPTX, DOCX, TXT), but chunking doesn't capture the source type in metadata. This could help downstream RAG pipelines.
💡 Suggested enhancement
```diff
  content = file_processor.process_file(file)
  if content:
+     # Derive source type from file extension
+     ext = file.filename.rsplit('.', 1)[-1].lower() if '.' in file.filename else 'unknown'
-     chunks = text_processor.chunk_document(content)
+     chunks = text_processor.chunk_document(content, source_type=ext)
      return jsonify({
          "content": content,
          "chunks": chunks,
          "num_chunks": len(chunks),
      })
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/server.py` around lines 472-492: The upload_file endpoint currently calls file_processor.process_file(file) and then text_processor.chunk_document(content) without informing chunk_document of the original source; update upload_file to determine a source_type (e.g., from file.mimetype or the file.filename extension) and pass it into text_processor.chunk_document as a second argument (e.g., chunk_document(content, source_type)), propagate that source_type into the chunk metadata inside chunk_document, and include source_type in the upload_file JSON response alongside content, chunks, and num_chunks so downstream RAG pipelines can use the source information.

backend/Generator/main.py (2)
408-415: Legacy `.ppt` format returns empty content silently.

The warning is logged, but the caller receives an empty string without any indication that the file format was unsupported. Consider returning an error indicator or raising an exception so upstream code can inform the user.
💡 Suggested improvement to surface the warning
```diff
  elif file.filename.endswith('.pptx'):
      content = self.extract_text_from_pptx(file_path)
  elif file.filename.endswith('.ppt'):
      import logging
      logging.warning(
          "Legacy .ppt format is not supported. "
          "Please convert to .pptx and try again."
      )
+     # Return a sentinel or raise so callers know extraction failed
+     content = ""  # Explicitly mark as unsupported
  os.remove(file_path)
- return content
+ return content
```

Alternatively, return a tuple `(content, warning)` or raise a custom exception so the API layer can return a proper error response to the user.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/Generator/main.py` around lines 408 - 415, The branch handling legacy .ppt files currently only logs a warning and returns no content; change it so the caller can detect the unsupported format by raising an exception (e.g., raise ValueError or a custom UnsupportedFormatError) when file.filename.endswith('.ppt') instead of only logging, and include a clear message like "Legacy .ppt format is not supported; please convert to .pptx" so upstream code (that calls extract_text_from_pptx / the file-handling path) can catch the exception and return an appropriate user-facing error.
420-436: Verify `file` object state after `process_file` consumes it.

`process_file` calls `file.save(file_path)`, which may consume the file stream. The subsequent access to `file.filename` (Line 432) should work since `filename` is a metadata attribute, but if `process_file` were ever modified to alter the file object, this could break.

The implementation is correct as-is, but consider extracting the filename before calling `process_file` for defensive coding.

♻️ Optional defensive refactor
```diff
  def process_file_chunked(self, file, chunk_size=1000, chunk_overlap=200):
      """Process file and return chunked text for large documents.

      Returns a list of chunk dicts (see TextProcessor.chunk_document).
      Falls back to an empty list when the file type is unsupported
      or the extracted text is empty.
      """
+     # Capture filename before process_file potentially modifies file object
+     filename = file.filename
      content = self.process_file(file)
      if not content:
          return []
      # Determine source type from filename
-     ext = os.path.splitext(file.filename)[1].lstrip('.').lower()
+     ext = os.path.splitext(filename)[1].lstrip('.').lower()
      return self.text_processor.chunk_document(
          content,
          source_type=ext or "unknown",
          chunk_size=chunk_size,
          chunk_overlap=chunk_overlap,
      )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/Generator/main.py` around lines 420-436: Extract the filename/extension from the incoming file before calling process_file to avoid relying on file state after consumption: in process_file_chunked capture ext = os.path.splitext(file.filename)[1].lstrip('.').lower() (or set "unknown" if filename falsy) before calling self.process_file(file), then pass that ext into self.text_processor.chunk_document instead of reading file.filename after process_file; keep the existing fallback to an empty list when content is falsy.

backend/test_text_processor.py (3)
183-203: Overlap test has a weak assertion fallback.

The test first checks for substring overlap, then falls back to checking for shared words between the first two chunks only. The fallback could pass even if overlap isn't working correctly (e.g., common words like "the", "is").
Consider strengthening the assertion or documenting why word overlap is an acceptable proxy.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/test_text_processor.py` around lines 183 - 203, The fallback assertion in test_overlap_present is too weak because it only checks shared words between chunks[0] and chunks[1], which can pass due to common stopwords; update the test_overlap_present to iterate all adjacent chunk pairs returned by TextProcessor.chunk_text (use the same chunk_size and chunk_overlap values), and for each pair assert a stronger overlap: either that the last chunk_overlap characters of chunks[i] appear at the start of chunks[i+1], or that there is at least one shared n-gram of length >= min(3, chunk_overlap_word_count) after removing common stopwords; reference the test function test_overlap_present, the TextProcessor.chunk_text method, LONG_PARAGRAPH, and the chunk_overlap parameter when implementing the stronger check.
165-173: Large tolerance (25%) for chunk size validation.

The test allows chunks up to 250 characters when `chunk_size=200`, a 25% tolerance. While the comment explains this is for edge cases with indivisible tokens, this seems generous.

Consider tightening the tolerance or adding a comment explaining the specific scenario that requires such a large margin.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/test_text_processor.py` around lines 165 - 173, The test test_chunks_respect_size_limit uses a very large 25% tolerance for chunk size when constructing TextProcessor(chunk_size=200) and asserting chunks from TextProcessor.chunk_text(LONG_PARAGRAPH) are <=250; tighten this by reducing the allowed overshoot (e.g., assert len(chunk) <= 220) or, if 250 is required, replace the generic comment with a precise rationale describing the exact edge case (e.g., unbreakable token/very long word or specific punctuation sequences) that forces the larger overshoot so the test documents why TextProcessor.chunk_text may emit chunks larger than chunk_size.
113-115: Update the type hint to `Optional[str]` or remove the test.

The implementation does handle `None` gracefully via short-circuit evaluation (line 84: `if not text or not text.strip(): return []` prevents calling `.strip()` on `None`). However, the type hint declares `text: str`, not `Optional[str]`, which creates a contract violation.

Decide whether `None` handling is intentional: if yes, update the type hint to `text: Optional[str]`; if no, remove this test and rely on type checking to catch invalid calls.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/test_text_processor.py` around lines 113 - 115, The test test_none_returns_empty asserts TextProcessor.chunk_text accepts None but the method signature uses text: str; decide intent and fix accordingly—if None should be supported, change the chunk_text signature to accept Optional[str] (import Optional from typing) and update any docstring/annotations to reflect Optional[str]; if None should not be accepted, remove the test test_none_returns_empty from backend/test_text_processor.py so the type contract remains text: str. Reference: TextProcessor.chunk_text and the test test_none_returns_empty.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@backend/test_wikipedia_fallback.py`:
- Around line 115-128: The test function
test_ssl_error_get_problems_returns_200_with_warning currently sets
wiki_mock.summary.side_effect = TimeoutError(...) but is intended to simulate an
SSL failure; update the test to raise an SSL-related exception (e.g., set
wiki_mock.summary.side_effect = ssl.SSLError("...") or
requests.exceptions.SSLError(...)) and update the docstring/name if you prefer
to reflect the actual exception; locate the failure injection at
wiki_mock.summary.side_effect in the
test_ssl_error_get_problems_returns_200_with_warning function and replace the
TimeoutError with an appropriate SSLError instance.
---
Nitpick comments:
In `@backend/Generator/main.py`:
- Around line 408-415: The branch handling legacy .ppt files currently only logs
a warning and returns no content; change it so the caller can detect the
unsupported format by raising an exception (e.g., raise ValueError or a custom
UnsupportedFormatError) when file.filename.endswith('.ppt') instead of only
logging, and include a clear message like "Legacy .ppt format is not supported;
please convert to .pptx" so upstream code (that calls extract_text_from_pptx /
the file-handling path) can catch the exception and return an appropriate
user-facing error.
- Around line 420-436: Extract the filename/extension from the incoming file
before calling process_file to avoid relying on file state after consumption: in
process_file_chunked capture ext =
os.path.splitext(file.filename)[1].lstrip('.').lower() (or set "unknown" if
filename falsy) before calling self.process_file(file), then pass that ext into
self.text_processor.chunk_document instead of reading file.filename after
process_file; keep the existing fallback to an empty list when content is falsy.
In `@backend/server.py`:
- Around line 472-492: The upload_file endpoint currently calls
file_processor.process_file(file) and then
text_processor.chunk_document(content) without informing chunk_document of the
original source; update upload_file to determine a source_type (e.g., from
file.mimetype or file.filename extension) and pass it into
text_processor.chunk_document as a second argument (e.g.,
chunk_document(content, source_type)), propagate that source_type into the chunk
metadata inside chunk_document, and include source_type in the upload_file JSON
response alongside content, chunks, and num_chunks so downstream RAG pipelines
can use the source information.
In `@backend/test_text_processor.py`:
- Around line 183-203: The fallback assertion in test_overlap_present is too
weak because it only checks shared words between chunks[0] and chunks[1], which
can pass due to common stopwords; update the test_overlap_present to iterate all
adjacent chunk pairs returned by TextProcessor.chunk_text (use the same
chunk_size and chunk_overlap values), and for each pair assert a stronger
overlap: either that the last chunk_overlap characters of chunks[i] appear at
the start of chunks[i+1], or that there is at least one shared n-gram of length
>= min(3, chunk_overlap_word_count) after removing common stopwords; reference
the test function test_overlap_present, the TextProcessor.chunk_text method,
LONG_PARAGRAPH, and the chunk_overlap parameter when implementing the stronger
check.
- Around line 165-173: The test test_chunks_respect_size_limit uses a very large
25% tolerance for chunk size when constructing TextProcessor(chunk_size=200) and
asserting chunks from TextProcessor.chunk_text(LONG_PARAGRAPH) are <=250;
tighten this by reducing the allowed overshoot (e.g., assert len(chunk) <= 220)
or, if 250 is required, replace the generic comment with a precise rationale
describing the exact edge case (e.g., unbreakable token/very long word or
specific punctuation sequences) that forces the larger overshoot so the test
documents why TextProcessor.chunk_text may emit chunks larger than chunk_size.
- Around line 113-115: The test test_none_returns_empty asserts
TextProcessor.chunk_text accepts None but the method signature uses text: str;
decide intent and fix accordingly—if None should be supported, change the
chunk_text signature to accept Optional[str] (import Optional from typing) and
update any docstring/annotations to reflect Optional[str]; if None should not be
accepted, remove the test test_none_returns_empty from
backend/test_text_processor.py so the type contract remains text: str.
Reference: TextProcessor.chunk_text and the test test_none_returns_empty.
ℹ️ Review info
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (11)
- backend/Generator/main.py
- backend/conftest.py
- backend/server.py
- backend/test_pptx_extraction.py
- backend/test_text_processor.py
- backend/test_wikipedia_fallback.py
- backend/utils/__init__.py
- backend/utils/text_processor.py
- eduaid_web/src/pages/Text_Input.jsx
- extension/src/pages/text_input/TextInput.jsx
- requirements.txt
```python
def test_ssl_error_get_problems_returns_200_with_warning(self, client):
    """SSL failure should NOT crash the combined /get_problems endpoint."""
    with patch("server.mediawikiapi") as wiki_mock:
        wiki_mock.summary.side_effect = TimeoutError("Connection timed out")
        resp = client.post(
            "/get_problems",
            json={"input_text": SAMPLE_TEXT, "use_mediawiki": 1},
        )
        assert resp.status_code == 200
        data = resp.get_json()
        assert "output_mcq" in data
        assert "output_boolq" in data
        assert "output_shortq" in data
        assert "warning" in data
```
Test name doesn't match the exception being tested.
The test is named test_ssl_error_get_problems_returns_200_with_warning but raises TimeoutError instead of an SSL-related exception.
🔧 Suggested fix
```diff
- def test_ssl_error_get_problems_returns_200_with_warning(self, client):
-     """SSL failure should NOT crash the combined /get_problems endpoint."""
+ def test_timeout_error_get_problems_returns_200_with_warning(self, client):
+     """Timeout failure should NOT crash the combined /get_problems endpoint."""
      with patch("server.mediawikiapi") as wiki_mock:
          wiki_mock.summary.side_effect = TimeoutError("Connection timed out")
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
def test_timeout_error_get_problems_returns_200_with_warning(self, client):
    """Timeout failure should NOT crash the combined /get_problems endpoint."""
    with patch("server.mediawikiapi") as wiki_mock:
        wiki_mock.summary.side_effect = TimeoutError("Connection timed out")
        resp = client.post(
            "/get_problems",
            json={"input_text": SAMPLE_TEXT, "use_mediawiki": 1},
        )
        assert resp.status_code == 200
        data = resp.get_json()
        assert "output_mcq" in data
        assert "output_boolq" in data
        assert "output_shortq" in data
        assert "warning" in data
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@backend/test_wikipedia_fallback.py` around lines 115 - 128, The test function
test_ssl_error_get_problems_returns_200_with_warning currently sets
wiki_mock.summary.side_effect = TimeoutError(...) but is intended to simulate an
SSL failure; update the test to raise an SSL-related exception (e.g., set
wiki_mock.summary.side_effect = ssl.SSLError("...") or
requests.exceptions.SSLError(...)) and update the docstring/name if you prefer
to reflect the actual exception; locate the failure injection at
wiki_mock.summary.side_effect in the
test_ssl_error_get_problems_returns_200_with_warning function and replace the
TimeoutError with an appropriate SSLError instance.
Addressed Issues:
Adds a PDF chunking utility (utils/text_processor.py) for RAG preparation to prevent OOM errors and pipeline hangs when processing large documents (50+ pages).
Screenshots/Recordings:
All 38 unit tests pass locally in 0.36s.
Additional Notes:
- The `/upload` endpoint still returns content as before, plus new `chunks` and `num_chunks` fields
- RecursiveCharacterTextSplitter-style separator hierarchy: paragraph → line → sentence → word → character

Checklist
I have used the following AI models and tools: Gemini (Antigravity coding assistant)