…jpeg) using OpenCV + Tesseract
… safety, cross-platform tesseract detection)
📝 Walkthrough
The changes introduce OCR-based image text extraction to the FileProcessor class, integrate it into file processing workflows with guaranteed cleanup, update deprecated tokenization API calls to current Transformers library standards, and adjust method signatures for consistency.
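The PR's actual Tesseract detection code is not shown on this page, so the following is only a minimal sketch of how cross-platform detection could work using the standard library. The function name `find_tesseract` and the fallback paths in `DEFAULT_CANDIDATES` are assumptions for illustration, not taken from the PR.

```python
import os
import shutil
from typing import Iterable, Optional

# Assumed fallback locations (hypothetical list); adjust for your own installs.
DEFAULT_CANDIDATES = (
    r"C:\Program Files\Tesseract-OCR\tesseract.exe",
    "/opt/homebrew/bin/tesseract",
    "/usr/local/bin/tesseract",
    "/usr/bin/tesseract",
)


def find_tesseract(candidates: Iterable[str] = DEFAULT_CANDIDATES) -> Optional[str]:
    """Return a path to the tesseract binary, or None if it cannot be found."""
    # Prefer whatever is on PATH; this works the same on every platform.
    found = shutil.which("tesseract")
    if found:
        return found
    # Otherwise fall back to the first candidate path that exists as a file.
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None
```

If a path is found, it could then be assigned to `pytesseract.pytesseract.tesseract_cmd` before running OCR; returning `None` lets the caller decide whether to skip image files or report a clear error.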
Estimated code review effort
🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks | ✅ 3 | ❌ 2
❌ Failed checks (2 warnings)
✅ Passed checks (3 passed)
🧹 Nitpick comments (1)
backend/Generator/main.py (1)
413-429: Handle unsupported file types explicitly in `process_file`.
Without an `else` branch, unsupported extensions return an empty string, which makes invalid input indistinguishable from valid-but-empty extraction.
🧩 Suggested update

     try:
         if filename.endswith('.txt'):
             with open(file_path, 'r') as f:
                 content = f.read()
         elif filename.endswith('.pdf'):
             content = self.extract_text_from_pdf(file_path)
         elif filename.endswith('.docx'):
             content = self.extract_text_from_docx(file_path)
         elif filename.endswith(('.png', '.jpg', '.jpeg')):
             content = self.extract_text_from_image(file_path)
    +    else:
    +        raise ValueError(f"Unsupported file type: {file.filename}")
     finally:
         if os.path.exists(file_path):
             os.remove(file_path)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/Generator/main.py` around lines 413 - 429, In process_file (the try block handling filename/extension checks) add an explicit else branch for unsupported file types: after the existing if/elif chain (which calls extract_text_from_pdf, extract_text_from_docx, extract_text_from_image), set content to None or raise a descriptive exception (e.g., raise ValueError(f"Unsupported file type: {filename}")) so callers can distinguish unsupported formats from valid empty extractions; ensure the finally block still removes file_path as before and adjust any downstream code to handle the new None/exception case accordingly.
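To illustrate the point of this review comment outside the real class, here is a minimal standalone sketch of extension dispatch with an explicit failure for unsupported types and cleanup in `finally`. The names `extract_and_cleanup`, `read_text`, and the `extractors` mapping are hypothetical and do not appear in `backend/Generator/main.py`; the real method dispatches with an if/elif chain instead.

```python
import os
from typing import Callable, Dict


def read_text(path: str) -> str:
    """Hypothetical extractor for plain-text files."""
    with open(path, "r", encoding="utf-8") as f:
        return f.read()


def extract_and_cleanup(file_path: str,
                        extractors: Dict[str, Callable[[str], str]]) -> str:
    """Extract text from file_path and always delete the file afterwards."""
    # Lower-case the extension so '.PNG' and '.png' are treated alike.
    ext = os.path.splitext(file_path)[1].lower()
    try:
        extractor = extractors.get(ext)
        if extractor is None:
            # Fail loudly so callers can tell "unsupported" from "empty".
            raise ValueError(f"Unsupported file type: {ext or file_path}")
        return extractor(file_path)
    finally:
        # Runs on success and on error, so the upload never lingers.
        if os.path.exists(file_path):
            os.remove(file_path)
```

Because the `raise` sits inside the `try`, the temporary file is still removed when the extension is rejected, which is the behaviour the comment asks to preserve.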
Since PR #516 already addresses this issue and is ready for merge, I am closing this PR to avoid duplication. Thank you.
📌 Description
Fixes #511
This PR updates deprecated HuggingFace tokenizer parameters to ensure compatibility with newer versions of the Transformers library.
The deprecated argument:
pad_to_max_length=True
has been replaced with the recommended parameters:
padding="max_length",
truncation=True
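The replacement described above can be sketched as a small keyword-argument migration. The helper `migrate_tokenizer_kwargs` is hypothetical and not part of the PR or of Transformers; it only shows the mapping from the deprecated flag to the two current parameters, and it assumes, as this PR does, that truncation should be enabled alongside padding.

```python
def migrate_tokenizer_kwargs(kwargs: dict) -> dict:
    """Return a copy of kwargs with pad_to_max_length replaced."""
    migrated = dict(kwargs)
    # Remove the deprecated flag whether it was True or False.
    legacy = migrated.pop("pad_to_max_length", None)
    if legacy:
        # Keep any explicit padding/truncation the caller already set.
        migrated.setdefault("padding", "max_length")
        migrated.setdefault("truncation", True)
    return migrated


old_kwargs = {"max_length": 512, "pad_to_max_length": True}
new_kwargs = migrate_tokenizer_kwargs(old_kwargs)
```

The resulting dictionary could then be passed to a tokenizer call as `tokenizer(text, **new_kwargs)`. In the PR itself the arguments are written directly at each call site rather than through a helper.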
🔧 Changes Made
Updated tokenization calls in:
This removes deprecation warnings and aligns the codebase with current HuggingFace API standards.
Impact
📸 Screenshots / Recordings
Not applicable.
This change updates deprecated tokenizer parameters and does not affect UI behavior.
Testing
✅ Checklist
🤖 AI Usage Disclosure
AI tools used:
Summary by CodeRabbit
New Features
Improvements