fix/tokenizer-padding-update by mahek2016 · Pull Request #517 · AOSSIE-Org/EduAid

mahek2016 · 2026-03-03T08:17:09Z

📌 Description

Fixes #511

This PR updates deprecated HuggingFace tokenizer parameters to ensure compatibility with newer versions of the Transformers library.

The deprecated argument:

pad_to_max_length=True

has been replaced with the recommended parameters:

padding="max_length",
truncation=True

🔧 Changes Made

Updated tokenization calls in:

backend/Generator/main.py
backend/Generator/mcq.py

This removes deprecation warnings and aligns the codebase with current HuggingFace API standards.

Impact

No change in functional behavior
No API structure modifications
No UI impact
No tokenizer deprecation warnings
Improved forward compatibility with future Transformers releases

📸 Screenshots / Recordings

Not applicable.
This change updates deprecated tokenizer parameters and does not affect UI behavior.

Testing

Verified question generation works correctly
Confirmed no tokenizer warnings appear during execution
Ensured model outputs remain unchanged

✅ Checklist

My PR addresses a single issue, fixes a single bug or makes a single improvement
My code follows the project's code style and conventions
No breaking changes introduced
My changes generate no new warnings or errors
I have joined the Discord server and will share this PR
I have read the Contribution Guidelines
I will address CodeRabbit review comments

🤖 AI Usage Disclosure

This PR contains AI-assisted code. I have tested the code locally and I am responsible for it.

AI tools used:

ChatGPT (OpenAI)

Summary by CodeRabbit

New Features
- Added optical character recognition (OCR) to extract text from image files (PNG and JPG formats)
Improvements
- Enhanced file cleanup to ensure temporary uploaded files are always removed, even when errors occur during processing

…jpeg) using OpenCV + Tesseract

… safety, cross-platform tesseract detection)

…x_length'

coderabbitai · 2026-03-03T08:17:29Z

📝 Walkthrough

Walkthrough

The changes introduce OCR-based image text extraction to the FileProcessor class, integrate it into file processing workflows with guaranteed cleanup, update deprecated tokenization API calls to current Transformers library standards, and adjust method signatures for consistency.

Changes

Cohort / File(s)	Summary
Image OCR Integration `backend/Generator/main.py`	Adds new public method `extract_text_from_image()` using OpenCV and pytesseract for OCR parsing. Integrates OCR into `process_file()` to handle png, jpg, jpeg formats. Moves file removal into finally block to guarantee cleanup regardless of content extraction outcome. Updates beam_search_decoding call signatures with additional parameter.
Tokenization API Updates `backend/Generator/mcq.py`	Replaces deprecated `tokenizer.batch_encode_plus()` calls with updated arguments in `generate_multiple_choice_questions()` and `generate_normal_questions()`. Changes from `pad_to_max_length=True` to `padding="max_length", truncation=True, return_tensors="pt"` to align with current Transformers library API.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

📸 Through images keen eyes now peek,
OCR extracts the text we seek,
Deprecated warnings fade away,
New API calls save the day,
With cleanup secure, hopping right,
Our code shines brilliant and bright! 🐰

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

Check name	Status	Explanation	Resolution
Out of Scope Changes check	⚠️ Warning	The PR contains out-of-scope changes: new image OCR parsing functionality (extract_text_from_image method and integration) in main.py is not related to the tokenizer deprecation fix described in issue `#511`.	Remove OCR-related changes from this PR and create a separate PR for the image OCR feature to maintain focus on the tokenizer argument deprecation fix.
Docstring Coverage	⚠️ Warning	Docstring coverage is 15.38% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (3 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'fix/tokenizer-padding-update' is specific and directly related to the primary objective of replacing deprecated tokenizer arguments with updated HuggingFace Transformers API parameters.
Linked Issues check	✅ Passed	The PR successfully addresses issue `#511` by replacing deprecated pad_to_max_length=True with padding='max_length' and truncation=True in both mcq.py and main.py files, aligning with the Transformers API requirements.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment

Tip

Try Coding Plans. Let us write the prompt for your AI agent so you can ship faster (with fewer bugs).
Share your feedback on Discord.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

🧹 Nitpick comments (1)

backend/Generator/main.py (1)

413-429: Handle unsupported file types explicitly in process_file.

Without an else branch, unsupported extensions return an empty string, which makes invalid input indistinguishable from valid-but-empty extraction.

🧩 Suggested update

         try:
             if filename.endswith('.txt'):
                 with open(file_path, 'r') as f:
                     content = f.read()

             elif filename.endswith('.pdf'):
                 content = self.extract_text_from_pdf(file_path)

             elif filename.endswith('.docx'):
                 content = self.extract_text_from_docx(file_path)

             elif filename.endswith(('.png', '.jpg', '.jpeg')):
                 content = self.extract_text_from_image(file_path)
+            else:
+                raise ValueError(f"Unsupported file type: {file.filename}")

         finally:
             if os.path.exists(file_path):
                 os.remove(file_path)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@backend/Generator/main.py` around lines 413 - 429, In process_file (the try
block handling filename/extension checks) add an explicit else branch for
unsupported file types: after the existing if/elif chain (which calls
extract_text_from_pdf, extract_text_from_docx, extract_text_from_image), set
content to None or raise a descriptive exception (e.g., raise
ValueError(f"Unsupported file type: {filename}")) so callers can distinguish
unsupported formats from valid empty extractions; ensure the finally block still
removes file_path as before and adjust any downstream code to handle the new
None/exception case accordingly.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@backend/Generator/main.py`:
- Around line 413-429: In process_file (the try block handling
filename/extension checks) add an explicit else branch for unsupported file
types: after the existing if/elif chain (which calls extract_text_from_pdf,
extract_text_from_docx, extract_text_from_image), set content to None or raise a
descriptive exception (e.g., raise ValueError(f"Unsupported file type:
{filename}")) so callers can distinguish unsupported formats from valid empty
extractions; ensure the finally block still removes file_path as before and
adjust any downstream code to handle the new None/exception case accordingly.

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fc3bf1a and 1c32071.

📒 Files selected for processing (2)

backend/Generator/main.py
backend/Generator/mcq.py

mahek2016 · 2026-03-03T08:21:55Z

Since PR #516 already addresses this issue and is ready for merge, I am closing this PR to avoid duplication.

Thank you.

mahek2016 added 4 commits March 2, 2026 21:18

feat(backend): add local OCR support for image uploads (.png, .jpg, .…

dd0c63d

…jpeg) using OpenCV + Tesseract

fix(ocr): address CodeRabbit review comments (OCR invocation, cleanup…

b39830c

… safety, cross-platform tesseract detection)

fix: correct indentation errors in OCR integration

72fd5dc

fix(tokenizer): replace deprecated pad_to_max_length with padding='ma…

1c32071

…x_length'

coderabbitai bot reviewed Mar 3, 2026

View reviewed changes

mahek2016 closed this Mar 3, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix/tokenizer-padding-update#517

fix/tokenizer-padding-update#517
mahek2016 wants to merge 4 commits intoAOSSIE-Org:mainfrom
mahek2016:fix/tokenizer-padding-update

mahek2016 commented Mar 3, 2026 •

edited by coderabbitai bot

Loading

Uh oh!

coderabbitai bot commented Mar 3, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Poem

❌ Failed checks (2 warnings)

Uh oh!

coderabbitai bot left a comment

Uh oh!

mahek2016 commented Mar 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

mahek2016 commented Mar 3, 2026 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

📌 Description

🔧 Changes Made

Impact

📸 Screenshots / Recordings

Testing

✅ Checklist

🤖 AI Usage Disclosure

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Mar 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Poem

❌ Failed checks (2 warnings)

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

mahek2016 commented Mar 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

mahek2016 commented Mar 3, 2026 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Mar 3, 2026 •

edited

Loading