feat(pdf): add font-based heading detection and refactor PDF/Markdown parsing#413

Merged
zhoujh01 merged 3 commits into main from feat/auto-group-large-directories
Mar 4, 2026
@qin-ctx qin-ctx commented Mar 4, 2026

Description

Add font-based heading detection for PDFs, refactor PDF bookmark extraction, and clean up Markdown parser. Remove auto-group directory logic. Translate Chinese comments to English.

Related Issue

Fixes #393

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Refactoring (no functional changes)
  • Performance improvement
  • Test update

Changes Made

  • PDFParser: Add _detect_headings_by_font() for font-size-based heading detection with configurable thresholds
  • PDFParser: Add heading detection strategy (heading_detection: bookmarks | font | auto | none) with bookmarks-to-font fallback
  • PDFParser: Refactor _extract_bookmarks() for clarity, translate all Chinese comments to English
  • PDFConfig: Add heading_detection, font_heading_min_delta, max_heading_levels config options
  • MarkdownParser: Remove auto-group directory logic (max_children_per_dir, _auto_group_sections())
  • ParserConfig: Remove max_children_per_dir field
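
As a rough illustration of the font-based detection and the strategy fallback listed above, here is a minimal sketch. It is not the code from this PR: the `Span` input shape, the function signatures, the default values, and the exact meaning of `auto` (bookmarks first, font sizes only when no bookmarks exist) are assumptions made for the example. Only the option names `heading_detection`, `font_heading_min_delta`, and `max_heading_levels` come from the change list.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Heading = Tuple[int, str]  # (level, text); hypothetical shape for this sketch


@dataclass
class Span:
    """A run of text with one font size (assumed input, not the PR's type)."""
    text: str
    font_size: float


def detect_headings_by_font(
    spans: Sequence[Span],
    font_heading_min_delta: float = 1.5,  # assumed default
    max_heading_levels: int = 3,          # assumed default
) -> List[Heading]:
    """Return spans whose font is noticeably larger than the body font."""
    if not spans:
        return []

    # Treat the font size that covers the most characters as body text.
    chars_per_size: Counter = Counter()
    for span in spans:
        chars_per_size[round(span.font_size, 1)] += len(span.text)
    body_size = chars_per_size.most_common(1)[0][0]

    # Sizes at least `font_heading_min_delta` above body become heading
    # levels, largest first. Sizes beyond `max_heading_levels` are ignored.
    heading_sizes = sorted(
        (s for s in chars_per_size if s - body_size >= font_heading_min_delta),
        reverse=True,
    )[:max_heading_levels]
    level_of = {size: i + 1 for i, size in enumerate(heading_sizes)}

    return [
        (level_of[round(s.font_size, 1)], s.text.strip())
        for s in spans
        if round(s.font_size, 1) in level_of and s.text.strip()
    ]


def resolve_headings(
    heading_detection: str,
    bookmarks: Sequence[Heading],
    spans: Sequence[Span],
) -> List[Heading]:
    """Pick a heading source; 'auto' falls back from bookmarks to fonts."""
    if heading_detection == "none":
        return []
    if heading_detection == "bookmarks":
        return list(bookmarks)
    if heading_detection == "font":
        return detect_headings_by_font(spans)
    if heading_detection == "auto":
        return list(bookmarks) if bookmarks else detect_headings_by_font(spans)
    raise ValueError(f"unknown heading_detection: {heading_detection!r}")
```

In this sketch, a PDF without an outline still yields a heading hierarchy under `auto`, while a PDF with bookmarks keeps the author-provided structure untouched.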

Testing

  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I have tested this on the following platforms:
    • Linux
    • macOS
    • Windows

Checklist

  • My code follows the project coding style
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

Screenshots (if applicable)

N/A

Additional Notes

Generated with Claude Code

…exceeds limit

When a document is split into many parts, automatically organize them into
subdirectories to avoid having too many files in a single directory. Also
refactors PDF bookmark extraction for clarity and uses defaultdict.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… English

Remove max_children_per_dir config and auto-grouping of sections into
subdirectories when file count exceeds a limit. Translate all Chinese
comments and docstrings to English across pdf.py and parser_config.py.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.

@qin-ctx qin-ctx changed the title from "feat(parse): auto-group sections into subdirectories when file count exceeds limit" to "feat(pdf): add font-based heading detection and refactor PDF/Markdown parsing" on Mar 4, 2026
…s_with_merge

The separate function was only needed for the removed auto-group logic.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@zhoujh01 zhoujh01 merged commit a5a80e4 into main Mar 4, 2026
5 checks passed
@zhoujh01 zhoujh01 deleted the feat/auto-group-large-directories branch March 4, 2026 11:09
@github-project-automation github-project-automation bot moved this from Backlog to Done in OpenViking project Mar 4, 2026
Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

[Feature]: Improve PDF parsing structure preservation and directory organization
