feat: Add PPTX file upload support for quiz generation (#361)#492
feat: Add PPTX file upload support for quiz generation (#361)#492zohaib-7035 wants to merge 5 commits intoAOSSIE-Org:mainfrom
Conversation
AOSSIE-Org#428) When use_mediawiki=1 is passed but the MediaWiki API call fails (SSL error, timeout, network unreachable, etc.), the entire request crashed with a 500 response. Users saw a generic error and lost their input. Changes: - Wrap mediawikiapi.summary() in try/except inside process_input_text() - On failure, log a warning and continue with the original input text - Return a "warning" field in the JSON response so the frontend can notify users that Wikipedia enrichment was skipped - All 7 endpoints that use MediaWiki are updated: /get_mcq, /get_boolq, /get_shortq, /get_problems, /get_shortq_hard, /get_mcq_hard, /get_boolq_hard - Add conftest.py with session-scoped fixture to prevent heavy ML model loading during tests - Add 30 new pytest tests covering SSL errors, connection errors, timeouts, successful calls, and text preservation Co-authored-by: Cursor <cursoragent@cursor.com>
- Use defensive .get() with default [] on output["questions"] and output["Boolean_Questions"] to prevent KeyError when generator returns an empty dict (major) - Rename test_ssl_error_hard_endpoints to test_connection_error_hard_endpoints to match actual exception (minor) Co-authored-by: Cursor <cursoragent@cursor.com>
- Add python-pptx dependency to requirements.txt - Implement extract_text_from_pptx() in FileProcessor class - Extracts text from text frames (titles, text boxes, placeholders) - Extracts text from table cells across all slides - Update process_file() to route .pptx files to the new extractor - Update frontend labels and file accept filters in: - eduaid_web (Text_Input.jsx) - extension (TextInput.jsx) - Add 6 unit tests covering: - Simple text extraction - Table text extraction - Empty presentations - process_file routing for .pptx - Unsupported .ppt rejection - Multi-slide extraction Closes AOSSIE-Org#361
📝 WalkthroughWalkthroughAdds PPTX upload/extraction support (new FileProcessor in Changes
Sequence DiagramssequenceDiagram
participant User as User/Client
participant Frontend as Frontend
participant Backend as Backend Server
participant FileProc as File Processor
participant Generator as Quiz Generator
User->>Frontend: Upload PPTX file
Frontend->>Backend: POST /upload (file)
Backend->>FileProc: process_file(file)
FileProc->>FileProc: extract_text_from_pptx()
FileProc-->>Backend: return extracted text
Backend->>Generator: generate_quiz(extracted_text)
Generator-->>Backend: quiz results
Backend-->>Frontend: HTTP 200 JSON (output)
Frontend-->>User: Display quiz
sequenceDiagram
participant User as User/Client
participant Frontend as Frontend
participant Backend as Backend Server
participant MediaWiki as MediaWiki API
participant Generator as Quiz Generator
User->>Frontend: Submit text (use_mediawiki=1)
Frontend->>Backend: POST /get_mcq (text, flag)
Backend->>Backend: process_input_text(text)
Backend->>MediaWiki: attempt enrichment
alt MediaWiki Success
MediaWiki-->>Backend: enriched_text
Backend->>Generator: generate_mcq(enriched_text)
Generator-->>Backend: MCQs
Backend-->>Frontend: response (output, no warning)
else MediaWiki Failure
MediaWiki--xBackend: error/timeout
Backend->>Backend: log warning, use original text
Backend->>Generator: generate_mcq(original_text)
Generator-->>Backend: MCQs
Backend-->>Frontend: response (output + warning)
end
Frontend-->>User: Display quiz (+/- warning)
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
backend/Generator/main.py (1)
394-410:⚠️ Potential issue | 🟠 MajorTemp upload cleanup is not guaranteed on extraction failures.
If any extractor raises (e.g., malformed PPTX), execution won’t reach Line 409, leaving orphaned files in
uploads/.🛠️ Proposed hardening
def process_file(self, file): file_path = os.path.join(self.upload_folder, file.filename) file.save(file_path) content = "" - - if file.filename.endswith('.txt'): - with open(file_path, 'r') as f: - content = f.read() - elif file.filename.endswith('.pdf'): - content = self.extract_text_from_pdf(file_path) - elif file.filename.endswith('.docx'): - content = self.extract_text_from_docx(file_path) - elif file.filename.endswith('.pptx'): - content = self.extract_text_from_pptx(file_path) - - os.remove(file_path) - return content + ext = os.path.splitext(file.filename)[1].lower() + try: + if ext == '.txt': + with open(file_path, 'r', encoding='utf-8') as f: + content = f.read() + elif ext == '.pdf': + content = self.extract_text_from_pdf(file_path) + elif ext == '.docx': + content = self.extract_text_from_docx(file_path) + elif ext == '.pptx': + content = self.extract_text_from_pptx(file_path) + finally: + if os.path.exists(file_path): + os.remove(file_path) + return content🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/Generator/main.py` around lines 394 - 410, The process_file function can leave uploaded files in uploads/ if an extractor raises because os.remove(file_path) is only called on the normal path; wrap the save/processing in a try/finally so file_path is always removed (or use a context-managed temp file), e.g., ensure the block that calls extract_text_from_pdf, extract_text_from_docx, extract_text_from_pptx or reads the .txt is inside a try and move os.remove(file_path) into the finally so cleanup always runs; keep raising or logging the original exception after cleanup so calling code can handle extractor errors.backend/server.py (1)
456-464:⚠️ Potential issue | 🔴 Critical
/get_boolq_hardcurrently calls the generator with an invalid answer style.
answer_style="true_false"is not supported byQuestionGeneratorand will fail at runtime. Also, Line 463 appliesmake_question_harderto each whole item instead of question text.✅ Safe fix (use existing bool generator path)
`@app.route`("/get_boolq_hard", methods=["POST"]) def get_boolq_hard(): data = request.get_json() input_text = data.get("input_text", "") use_mediawiki = data.get("use_mediawiki", 0) - input_questions = data.get("input_question", []) + max_questions = data.get("max_questions", 4) input_text, wiki_warning = process_input_text(input_text, use_mediawiki) - # Generate questions using the same QG model - generated = qg.generate( - article=input_text, - num_questions=input_questions, - answer_style="true_false" - ) - - # Apply transformation to make each question harder - harder_questions = [make_question_harder(q) for q in generated] + generated = BoolQGen.generate_boolq( + {"input_text": input_text, "max_questions": max_questions} + ) + harder_questions = [ + make_question_harder(q) + for q in generated.get("Boolean_Questions", []) + ] result = {"output": harder_questions} if wiki_warning: result["warning"] = wiki_warning return jsonify(result)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/server.py` around lines 456 - 464, The call to qg.generate uses an invalid answer_style ("true_false") and then maps make_question_harder over entire items instead of their question text; update the qg.generate invocation to use the boolean-compatible generator path or a supported answer_style (e.g., the existing bool generator variant used elsewhere in the project) so generation succeeds, assign its result to generated as before, and change the list comprehension that builds harder_questions to call make_question_harder on the question string within each generated item (e.g., make_question_harder(item["question"])) rather than on the whole item.
🧹 Nitpick comments (1)
eduaid_web/src/pages/Text_Input.jsx (1)
188-190: Consider deriving upload copy from a shared allowed-types constant.This same list is duplicated in both web and extension UIs. Defining one shared constant (or helper) for accepted extensions + display label would prevent drift like the current copy mismatch.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@eduaid_web/src/pages/Text_Input.jsx` around lines 188 - 190, Derive the upload copy and accept list from a single shared constant instead of hardcoding them in the JSX: create an exported constant (e.g., ALLOWED_FILE_TYPES or allowedFileTypes with shape {accept: ".pdf,.pptx,.txt,.docx,.mp3", label: "PDF, PPTX, MP3 supported"}) and import it where Text_Input.jsx uses fileInputRef and handleFileUpload; replace the hardcoded accept attribute and the paragraph text ("Choose a file...") to reference allowedFileTypes.accept and allowedFileTypes.label so both the input accept and visible copy stay in sync (also update the extension UI to import the same constant).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@backend/Generator/main.py`:
- Around line 406-407: The routing currently only matches
file.filename.endswith('.pptx') and therefore rejects .ppt files; add explicit
handling for '.ppt' by detecting file.filename.endswith('.ppt') and either (a)
call a new helper extract_text_from_ppt(file_path) that extracts text from
legacy PPTs, or (b) convert the .ppt to .pptx (e.g., via soffice/LibreOffice
headless conversion) and then call the existing
extract_text_from_pptx(file_path_converted); update or add the helper function
name (extract_text_from_ppt) or conversion flow and ensure the branch logic
calls the correct helper so both .ppt and .pptx are supported by the upload
processing path.
In `@backend/test_pptx_extraction.py`:
- Around line 16-55: The test file currently defines its own FileProcessor class
which duplicates production logic and masks regressions; replace the local class
by importing the real implementation (or extract the shared functionality into a
small module and import that from both places). Specifically, remove the
duplicate FileProcessor in this test and import the production FileProcessor
from the module containing the real class (or move
extract_text_from_pptx/process_file into a new shared module and have both
backend/Generator/main.py and backend/test_pptx_extraction.py import that
module) so tests exercise the actual production symbols (FileProcessor,
extract_text_from_pptx, process_file).
In `@extension/src/pages/text_input/TextInput.jsx`:
- Around line 237-244: The upload hint text "PDF, PPTX, MP3 supported" is out of
sync with the input accept list (which also allows .txt and .docx); update the
hint text near the JSX where the upload message is rendered to reflect the
actual accepted types or remove .txt/.docx from the accept attribute if those
types shouldn't be allowed—locate the hint element adjacent to the <input
type="file" ref={fileInputRef} onChange={handleFileUpload}
accept=".pdf,.pptx,.txt,.docx,.mp3" and make the text match the accept list (or
vice versa) so users see the correct supported file types.
---
Outside diff comments:
In `@backend/Generator/main.py`:
- Around line 394-410: The process_file function can leave uploaded files in
uploads/ if an extractor raises because os.remove(file_path) is only called on
the normal path; wrap the save/processing in a try/finally so file_path is
always removed (or use a context-managed temp file), e.g., ensure the block that
calls extract_text_from_pdf, extract_text_from_docx, extract_text_from_pptx or
reads the .txt is inside a try and move os.remove(file_path) into the finally so
cleanup always runs; keep raising or logging the original exception after
cleanup so calling code can handle extractor errors.
In `@backend/server.py`:
- Around line 456-464: The call to qg.generate uses an invalid answer_style
("true_false") and then maps make_question_harder over entire items instead of
their question text; update the qg.generate invocation to use the
boolean-compatible generator path or a supported answer_style (e.g., the
existing bool generator variant used elsewhere in the project) so generation
succeeds, assign its result to generated as before, and change the list
comprehension that builds harder_questions to call make_question_harder on the
question string within each generated item (e.g.,
make_question_harder(item["question"])) rather than on the whole item.
---
Nitpick comments:
In `@eduaid_web/src/pages/Text_Input.jsx`:
- Around line 188-190: Derive the upload copy and accept list from a single
shared constant instead of hardcoding them in the JSX: create an exported
constant (e.g., ALLOWED_FILE_TYPES or allowedFileTypes with shape {accept:
".pdf,.pptx,.txt,.docx,.mp3", label: "PDF, PPTX, MP3 supported"}) and import it
where Text_Input.jsx uses fileInputRef and handleFileUpload; replace the
hardcoded accept attribute and the paragraph text ("Choose a file...") to
reference allowedFileTypes.accept and allowedFileTypes.label so both the input
accept and visible copy stay in sync (also update the extension UI to import the
same constant).
ℹ️ Review info
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
backend/Generator/main.pybackend/conftest.pybackend/server.pybackend/test_pptx_extraction.pybackend/test_wikipedia_fallback.pyeduaid_web/src/pages/Text_Input.jsxextension/src/pages/text_input/TextInput.jsxrequirements.txt
backend/Generator/main.py
Outdated
| elif file.filename.endswith('.pptx'): | ||
| content = self.extract_text_from_pptx(file_path) |
There was a problem hiding this comment.
Current implementation does not satisfy the stated .ppt / .pptx objective.
Routing was added only for .pptx. .ppt remains unsupported (and tests currently codify rejection). If .ppt is truly in scope for issue #361 objectives, this is still incomplete.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@backend/Generator/main.py` around lines 406 - 407, The routing currently only
matches file.filename.endswith('.pptx') and therefore rejects .ppt files; add
explicit handling for '.ppt' by detecting file.filename.endswith('.ppt') and
either (a) call a new helper extract_text_from_ppt(file_path) that extracts text
from legacy PPTs, or (b) convert the .ppt to .pptx (e.g., via
soffice/LibreOffice headless conversion) and then call the existing
extract_text_from_pptx(file_path_converted); update or add the helper function
name (extract_text_from_ppt) or conversion flow and ensure the branch logic
calls the correct helper so both .ppt and .pptx are supported by the upload
processing path.
- Add explicit .ppt handling with warning log for unsupported legacy format
- Rewrite tests to import the real FileProcessor from Generator.main
(mocking heavy ML dependencies via sys.modules to avoid model loading)
- Sync upload hint text with accept filter in both web and extension
('PDF, PPTX, TXT, DOCX, MP3 supported')
There was a problem hiding this comment.
🧹 Nitpick comments (4)
backend/utils/file_processor.py (1)
13-18: Consider closing the PDF document explicitly.
fitz.open()returns aDocumentobject that holds file resources. While it will be garbage-collected eventually, explicitly closing it is more robust, especially if processing many files.♻️ Proposed fix using context manager
def extract_text_from_pdf(self, file_path): - doc = fitz.open(file_path) - text = "" - for page in doc: - text += page.get_text() - return text + with fitz.open(file_path) as doc: + text = "" + for page in doc: + text += page.get_text() + return text🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/utils/file_processor.py` around lines 13 - 18, The extract_text_from_pdf function opens a fitz Document with fitz.open(file_path) but never closes it, risking leaked file handles; update extract_text_from_pdf to either use a context manager (with fitz.open(file_path) as doc:) or ensure doc.close() in a finally block so the Document is always closed after reading pages (refer to extract_text_from_pdf and the fitz.open/doc usage to locate the change).backend/test_pptx_extraction.py (1)
82-82: Minor: Unused lambda parameter.The
destparameter in the mock'sside_effectlambda is unused. You can use_to indicate it's intentionally ignored.♻️ Optional fix
- mock_file.save = MagicMock(side_effect=lambda dest: None) + mock_file.save = MagicMock(side_effect=lambda _: None)Apply similarly on line 97.
Also applies to: 97-97
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/test_pptx_extraction.py` at line 82, The test uses MagicMock(side_effect=lambda dest: None) for mock_file.save with an unused parameter name; update the side_effect lambdas to use a throwaway parameter name (e.g., lambda _: None) for mock_file.save (and the other occurrence at the second lambda mentioned) so the unused argument is explicitly ignored; locate the MagicMock side_effect definitions for mock_file.save to apply the change.backend/server.py (1)
68-79: Consider catching more specific exceptions for MediaWiki failures.While the broad
Exceptioncatch ensures graceful degradation (per the docstring), it may mask unrelated programming errors. The MediaWikiAPI typically raises specific exceptions likeMediaWikiExceptionor network-related errors.Also, the log message on line 72 uses an EN DASH (
–) instead of a standard hyphen (-).♻️ Suggested improvement
- except Exception: + except (OSError, ConnectionError, TimeoutError, Exception) as exc: logger.warning( - "Wikipedia enrichment failed – continuing without it.", + "Wikipedia enrichment failed - continuing without it.", exc_info=True, )If you want to be more precise, you could import and catch
mediawikiapi.MediaWikiExceptionspecifically, falling back to a broader catch for unexpected errors.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/server.py` around lines 68 - 79, The broad except around mediawikiapi.summary should be narrowed: import and catch mediawikiapi.MediaWikiException (and/or the library's network error class) specifically to handle expected MediaWiki failures, log via logger.warning(..., exc_info=True) for that specific exception, and then add a final fallback except Exception to preserve graceful degradation; also replace the EN DASH in the logger warning message with a standard hyphen. Reference mediawikiapi.summary, mediawikiapi.MediaWikiException, logger.warning and the existing return block when applying the change.backend/Generator/main.py (1)
25-25: Remove unused import:Presentationfrompptx.This import is not used anywhere in
main.py. SinceFileProcessor(which usesPresentation) was moved tobackend/utils/file_processor.py, this is leftover dead code from the refactoring.♻️ Proposed fix
-from pptx import Presentation🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/Generator/main.py` at line 25, Remove the unused import "Presentation" from pptx at the top of main.py; this was left over after FileProcessor was moved (FileProcessor now uses Presentation in its own module), so delete the line "from pptx import Presentation" and run a quick search in main.py for any remaining references to Presentation to confirm no other changes are needed.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@backend/Generator/main.py`:
- Line 25: Remove the unused import "Presentation" from pptx at the top of
main.py; this was left over after FileProcessor was moved (FileProcessor now
uses Presentation in its own module), so delete the line "from pptx import
Presentation" and run a quick search in main.py for any remaining references to
Presentation to confirm no other changes are needed.
In `@backend/server.py`:
- Around line 68-79: The broad except around mediawikiapi.summary should be
narrowed: import and catch mediawikiapi.MediaWikiException (and/or the library's
network error class) specifically to handle expected MediaWiki failures, log via
logger.warning(..., exc_info=True) for that specific exception, and then add a
final fallback except Exception to preserve graceful degradation; also replace
the EN DASH in the logger warning message with a standard hyphen. Reference
mediawikiapi.summary, mediawikiapi.MediaWikiException, logger.warning and the
existing return block when applying the change.
In `@backend/test_pptx_extraction.py`:
- Line 82: The test uses MagicMock(side_effect=lambda dest: None) for
mock_file.save with an unused parameter name; update the side_effect lambdas to
use a throwaway parameter name (e.g., lambda _: None) for mock_file.save (and
the other occurrence at the second lambda mentioned) so the unused argument is
explicitly ignored; locate the MagicMock side_effect definitions for
mock_file.save to apply the change.
In `@backend/utils/file_processor.py`:
- Around line 13-18: The extract_text_from_pdf function opens a fitz Document
with fitz.open(file_path) but never closes it, risking leaked file handles;
update extract_text_from_pdf to either use a context manager (with
fitz.open(file_path) as doc:) or ensure doc.close() in a finally block so the
Document is always closed after reading pages (refer to extract_text_from_pdf
and the fitz.open/doc usage to locate the change).
ℹ️ Review info
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
backend/Generator/__init__.pybackend/Generator/main.pybackend/conftest.pybackend/server.pybackend/test_pptx_extraction.pybackend/utils/file_processor.pyeduaid_web/src/pages/Text_Input.jsxextension/src/pages/text_input/TextInput.jsx
🚧 Files skipped from review as they are similar to previous changes (1)
- eduaid_web/src/pages/Text_Input.jsx
Summary
Closes #361
Adds support for
.pptx(PowerPoint) file uploads, allowing users to extract text from presentations for quiz generation.Changes
Backend
python-pptxdependency torequirements.txtextract_text_from_pptx()method inFileProcessorclass (backend/Generator/main.py)process_file()to route.pptxfiles to the new extractorFrontend
eduaid_web/src/pages/Text_Input.jsxextension/src/pages/text_input/TextInput.jsxacceptattribute to file inputs to filter for supported file typesTests
backend/test_pptx_extraction.pywith 6 unit tests:process_file()routing for.pptx.ppt(legacy binary) rejectionHow to Test
cd backend pip install python-pptx python -m pytest test_pptx_extraction.py -vAll 6 tests pass ✅
Notes
.pptx(modern XML-based format) is supported. Legacy.ppt(binary format) is not supported bypython-pptxand will gracefully return empty content.PyMuPDF) and DOCX (mammoth) file processors.Summary by CodeRabbit
New Features
Tests