Skip to content

fix/tokenizer-padding-update#517

Closed
mahek2016 wants to merge 4 commits intoAOSSIE-Org:mainfrom
mahek2016:fix/tokenizer-padding-update
Closed

fix/tokenizer-padding-update#517
mahek2016 wants to merge 4 commits intoAOSSIE-Org:mainfrom
mahek2016:fix/tokenizer-padding-update

Conversation

@mahek2016
Copy link

@mahek2016 mahek2016 commented Mar 3, 2026

📌 Description

Fixes #511

This PR updates deprecated HuggingFace tokenizer parameters to ensure compatibility with newer versions of the Transformers library.

The deprecated argument:

pad_to_max_length=True

has been replaced with the recommended parameters:

padding="max_length",
truncation=True


🔧 Changes Made

Updated tokenization calls in:

  • backend/Generator/main.py
  • backend/Generator/mcq.py

This removes deprecation warnings and aligns the codebase with current HuggingFace API standards.


Impact

  • No change in functional behavior
  • No API structure modifications
  • No UI impact
  • No tokenizer deprecation warnings
  • Improved forward compatibility with future Transformers releases

📸 Screenshots / Recordings

Not applicable.
This change updates deprecated tokenizer parameters and does not affect UI behavior.


Testing

  • Verified question generation works correctly
  • Confirmed no tokenizer warnings appear during execution
  • Ensured model outputs remain unchanged

✅ Checklist

  • My PR addresses a single issue, fixes a single bug or makes a single improvement
  • My code follows the project's code style and conventions
  • No breaking changes introduced
  • My changes generate no new warnings or errors
  • I have joined the Discord server and will share this PR
  • I have read the Contribution Guidelines
  • I will address CodeRabbit review comments

🤖 AI Usage Disclosure

  • This PR contains AI-assisted code. I have tested the code locally and I am responsible for it.

AI tools used:

  • ChatGPT (OpenAI)

Summary by CodeRabbit

  • New Features

    • Added optical character recognition (OCR) to extract text from image files (PNG and JPG formats)
  • Improvements

    • Enhanced file cleanup to ensure temporary uploaded files are always removed, even when errors occur during processing

@coderabbitai
Copy link

coderabbitai bot commented Mar 3, 2026

📝 Walkthrough

Walkthrough

The changes introduce OCR-based image text extraction to the FileProcessor class, integrate it into file processing workflows with guaranteed cleanup, update deprecated tokenization API calls to current Transformers library standards, and adjust method signatures for consistency.

Changes

Cohort / File(s) Summary
Image OCR Integration
backend/Generator/main.py
Adds new public method extract_text_from_image() using OpenCV and pytesseract for OCR parsing. Integrates OCR into process_file() to handle png, jpg, jpeg formats. Moves file removal into finally block to guarantee cleanup regardless of content extraction outcome. Updates beam_search_decoding call signatures with additional parameter.
Tokenization API Updates
backend/Generator/mcq.py
Replaces deprecated tokenizer.batch_encode_plus() calls with updated arguments in generate_multiple_choice_questions() and generate_normal_questions(). Changes from pad_to_max_length=True to padding="max_length", truncation=True, return_tensors="pt" to align with current Transformers library API.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

📸 Through images keen eyes now peek,
OCR extracts the text we seek,
Deprecated warnings fade away,
New API calls save the day,
With cleanup secure, hopping right,
Our code shines brilliant and bright! 🐰

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

Check name Status Explanation Resolution
Out of Scope Changes check ⚠️ Warning The PR contains out-of-scope changes: new image OCR parsing functionality (extract_text_from_image method and integration) in main.py is not related to the tokenizer deprecation fix described in issue #511. Remove OCR-related changes from this PR and create a separate PR for the image OCR feature to maintain focus on the tokenizer argument deprecation fix.
Docstring Coverage ⚠️ Warning Docstring coverage is 15.38% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'fix/tokenizer-padding-update' is specific and directly related to the primary objective of replacing deprecated tokenizer arguments with updated HuggingFace Transformers API parameters.
Linked Issues check ✅ Passed The PR successfully addresses issue #511 by replacing deprecated pad_to_max_length=True with padding='max_length' and truncation=True in both mcq.py and main.py files, aligning with the Transformers API requirements.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Tip

Try Coding Plans. Let us write the prompt for your AI agent so you can ship faster (with fewer bugs).
Share your feedback on Discord.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
backend/Generator/main.py (1)

413-429: Handle unsupported file types explicitly in process_file.

Without an else branch, unsupported extensions return an empty string, which makes invalid input indistinguishable from valid-but-empty extraction.

🧩 Suggested update
         try:
             if filename.endswith('.txt'):
                 with open(file_path, 'r') as f:
                     content = f.read()

             elif filename.endswith('.pdf'):
                 content = self.extract_text_from_pdf(file_path)

             elif filename.endswith('.docx'):
                 content = self.extract_text_from_docx(file_path)

             elif filename.endswith(('.png', '.jpg', '.jpeg')):
                 content = self.extract_text_from_image(file_path)
+            else:
+                raise ValueError(f"Unsupported file type: {file.filename}")

         finally:
             if os.path.exists(file_path):
                 os.remove(file_path)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/Generator/main.py` around lines 413 - 429, In process_file (the try
block handling filename/extension checks) add an explicit else branch for
unsupported file types: after the existing if/elif chain (which calls
extract_text_from_pdf, extract_text_from_docx, extract_text_from_image), set
content to None or raise a descriptive exception (e.g., raise
ValueError(f"Unsupported file type: {filename}")) so callers can distinguish
unsupported formats from valid empty extractions; ensure the finally block still
removes file_path as before and adjust any downstream code to handle the new
None/exception case accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@backend/Generator/main.py`:
- Around line 413-429: In process_file (the try block handling
filename/extension checks) add an explicit else branch for unsupported file
types: after the existing if/elif chain (which calls extract_text_from_pdf,
extract_text_from_docx, extract_text_from_image), set content to None or raise a
descriptive exception (e.g., raise ValueError(f"Unsupported file type:
{filename}")) so callers can distinguish unsupported formats from valid empty
extractions; ensure the finally block still removes file_path as before and
adjust any downstream code to handle the new None/exception case accordingly.

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fc3bf1a and 1c32071.

📒 Files selected for processing (2)
  • backend/Generator/main.py
  • backend/Generator/mcq.py

@mahek2016
Copy link
Author

Since PR #516 already addresses this issue and is ready for merge, I am closing this PR to avoid duplication.

Thank you.

@mahek2016 mahek2016 closed this Mar 3, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Update deprecated pad_to_max_length argument in tokenization pipeline

1 participant