
Add invoice PDF renaming utility script#37

Merged
kzuraw merged 5 commits into main from claude/rename-invoice-pdfs-6Vk4c
Jan 27, 2026

Conversation

Owner

@kzuraw kzuraw commented Jan 27, 2026

Summary

Add a new utility script rename_invoices.py that automatically renames invoice PDFs to a standardized format with date prefixes based on file creation metadata.

Changes

  • New script: python/rename_invoices.py - Renames invoice PDFs from company_name invoice_number.pdf to yyyy-mm company_name invoice_number.pdf format
    • Extracts file creation date (year, month) from file metadata
    • Formats invoice numbers by removing whitespace and converting / to -
    • Sanitizes filenames to remove invalid characters
    • Supports --dry-run flag to preview changes before applying them
    • Skips files that already have date prefixes or have unexpected formats
    • Handles edge cases like duplicate target filenames
  • Updated README.md: Added documentation section for the new script with usage instructions
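The naming rules listed above can be sketched roughly as follows. This is a minimal illustration, not the code from python/rename_invoices.py: the helper names `format_invoice_number` and `build_target_name` are hypothetical, and the date prefix is passed in as a plain string.

```python
import re


def format_invoice_number(raw: str) -> str:
    """Drop all whitespace and turn '/' into '-' (hypothetical helper)."""
    return re.sub(r"\s+", "", raw).replace("/", "-")


def build_target_name(date_prefix: str, company: str, invoice: str) -> str:
    """Assemble 'yyyy-mm company_name invoice_number.pdf' (hypothetical helper)."""
    return f"{date_prefix} {company} {format_invoice_number(invoice)}.pdf"
```

For example, `build_target_name("2026-01", "Acme", "INV 7")` would give `2026-01 Acme INV7.pdf`.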

Implementation Details

  • Uses file st_birthtime (creation time) when available, falls back to st_mtime for cross-platform compatibility
  • Implements robust error handling for malformed filenames and existing target files
  • Provides clear user feedback with counts of renamed and skipped files
  • Follows the same CLI pattern as the existing rename_epubs.py script using Click framework
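A minimal sketch of the st_birthtime / st_mtime fallback described above, assuming the prefix is derived in local time. The function name `prefix_from_stat` is hypothetical; it takes any stat-like object so the fallback is easy to check in isolation.

```python
from datetime import datetime


def prefix_from_stat(stat) -> str:
    """Return a 'yyyy-mm' prefix from a stat-like object (hypothetical helper)."""
    # st_birthtime is not present on every platform, so read it defensively
    # and fall back to the modification time when it is missing or zero.
    timestamp = getattr(stat, "st_birthtime", None) or stat.st_mtime
    return datetime.fromtimestamp(timestamp).strftime("%Y-%m")
```

With a real file this would be called as `prefix_from_stat(Path("some invoice.pdf").stat())`.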

https://claude.ai/code/session_01AgLRG2yeADzgzH9VjTPUW2

Extracts dates from PDF content and renames invoice files to
'yyyy-mm company_name invoice_number.pdf' format with formatted
invoice numbers (whitespace removed, / converted to -).

https://claude.ai/code/session_01AgLRG2yeADzgzH9VjTPUW2

Remove pypdf dependency - use file metadata instead of parsing PDF content.

https://claude.ai/code/session_01AgLRG2yeADzgzH9VjTPUW2

claude bot commented Jan 27, 2026

Code Review for PR #37: Add invoice PDF renaming utility script

Thank you for this contribution! The script follows the established patterns well and implements a useful utility. Here are my findings:

✅ Strengths

  1. Consistent with project conventions: Follows the same structure as rename_epubs.py with PEP 723 inline dependencies and Click framework
  2. Good error handling: Handles edge cases like duplicate files, malformed filenames, and files already having date prefixes
  3. Clear user feedback: Provides informative output during both dry-run and actual execution
  4. Cross-platform compatibility: Gracefully falls back from st_birthtime to st_mtime for systems that don't support creation time
  5. Documentation: README.md is properly updated with usage instructions

🐛 Issues & Bugs

Critical: Missing backslash in sanitize_filename (python/rename_invoices.py:35)

The invalid_chars string is missing the backslash character. It should be: '<>:"/\|?*' to match rename_epubs.py:63. This is important for cross-platform compatibility.
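One detail worth noting when writing that character set as Python source: the backslash needs to be escaped (or the string made raw), since `\|` is not a recognized escape sequence. A small sketch, with a hypothetical constant name and a `str.translate`-based variant rather than the loop used in the scripts:

```python
# Hypothetical constant name; the character set is the one quoted in the review.
INVALID_CHARS = '<>:"/\\|?*'  # "\\" in source is a single literal backslash


def remove_invalid_chars(name: str) -> str:
    """Delete every character listed in INVALID_CHARS."""
    return name.translate({ord(ch): None for ch in INVALID_CHARS})
```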

Bug: Missing continue statement (python/rename_invoices.py:66-67)

When a file already has a date prefix, the code echoes a message but doesn't continue to skip further processing. Add an explicit 'continue' after line 67 to prevent the code from trying to parse and rename these files.
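The control flow being asked for might look like the sketch below. It is an illustration only: `files_to_process` and the message text are hypothetical, and the real script works on Path objects rather than bare stems.

```python
import re

DATE_PREFIX = re.compile(r"^\d{4}-\d{2}\s")


def files_to_process(stems: list[str]) -> list[str]:
    """Return the stems that still need renaming (hypothetical helper)."""
    pending = []
    for stem in stems:
        if DATE_PREFIX.match(stem):
            print(f"Skipping (already has date prefix): {stem}")
            continue  # without this, the stem would fall through to parsing
        pending.append(stem)
    return pending
```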

Potential Issue: Company name parsing too simplistic (python/rename_invoices.py:71-76)

The script assumes company name is the first word using split(maxsplit=1). This works for single-word companies but might not handle multi-word company names as expected. Consider documenting the expected input format more explicitly.

🔍 Code Quality & Best Practices

  1. Python version inconsistency: rename_invoices.py uses >=3.12 while rename_epubs.py uses >=3.14. Consider standardizing across all scripts.

  2. Sanitization differences: The sanitize_filename function differs from rename_epubs.py - it doesn't normalize whitespace with re.sub(). Consider adding this for consistency.

  3. Date prefix regex: The regex allows any 4-digit year and 2-digit month without validation (e.g., month 99 would pass).
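For the date prefix point, a stricter pattern that only accepts months 01 to 12 could look like this. It is a sketch; the names `VALID_DATE_PREFIX` and `has_valid_date_prefix` are not from the PR.

```python
import re

# Four-digit year, then a month restricted to 01..12, then whitespace.
VALID_DATE_PREFIX = re.compile(r"^\d{4}-(0[1-9]|1[0-2])\s")


def has_valid_date_prefix(stem: str) -> bool:
    """True when the stem starts with a plausible 'yyyy-mm ' prefix."""
    return VALID_DATE_PREFIX.match(stem) is not None
```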

🚀 Performance

No concerns - the script efficiently processes files with minimal I/O operations.

🔒 Security

No security concerns identified. The script properly sanitizes filenames and provides dry-run mode.

📝 Suggestions

  1. Add examples in the docstring showing input/output transformations
  2. Consider counting files that already have date prefixes separately in the summary
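The second suggestion could be approached by recording one outcome label per file and summarizing with a Counter. The labels and the `summarize` helper below are hypothetical, chosen only to illustrate separate counts.

```python
from collections import Counter


def summarize(outcomes: list[str]) -> str:
    """Build a one-line summary from per-file outcome labels."""
    counts = Counter(outcomes)  # missing labels count as 0
    return (
        f"Renamed: {counts['renamed']}, "
        f"already prefixed: {counts['already-prefixed']}, "
        f"skipped: {counts['skipped']}"
    )
```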

Summary

The code is well-structured and functional, but please address the missing backslash in invalid_chars and add the explicit continue statement after detecting files with existing date prefixes. These are important for correctness and preventing potential bugs.

Great work overall! 🎉

- Process files that already have date prefix (strip and reprocess)
- Update to Python 3.14
- Add missing backslash to invalid_chars for cross-platform compatibility

https://claude.ai/code/session_01AgLRG2yeADzgzH9VjTPUW2

cloudflare-workers-and-pages bot commented Jan 27, 2026

Deploying tools with Cloudflare Pages

Latest commit: 8d663f5
Status: ✅  Deploy successful!
Preview URL: https://d2342909.tools-4nd.pages.dev
Branch Preview URL: https://claude-rename-invoice-pdfs-6.tools-4nd.pages.dev

claude bot commented Jan 27, 2026

Pull Request Review

Overall, this is a well-structured utility script that follows the repository's conventions. The code is clean and readable with good error handling. I've identified a few areas for improvement:

✅ Strengths

  1. Consistent with existing patterns - Follows the same structure as rename_epubs.py (Click CLI, dry-run support, PEP 723 dependencies)
  2. Good error handling - Checks for edge cases like duplicate filenames and malformed input
  3. Clear user feedback - Informative output messages and proper summary statistics
  4. Documentation - Well-documented with docstrings and README updates

🐛 Potential Bugs

1. Date prefix detection issue (python/rename_invoices.py:66)

if re.match(r"^\d{4}-\d{2}\s", stem):
    stem = re.sub(r"^\d{4}-\d{2}\s+", "", stem)

Issue: The regex patterns are inconsistent: match() checks for a single whitespace character after the date, while sub() removes one or more. Because match() only anchors at the start of the string, a file named 2025-01 Company INV123.pdf passes the check and the prefix is still stripped in full, so this is a consistency and readability problem rather than a behavioral one.

Fix: Use consistent regex pattern:

if re.match(r"^\d{4}-\d{2}\s+", stem):
    stem = re.sub(r"^\d{4}-\d{2}\s+", "", stem)
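One way to keep the check and the substitution from drifting apart is to compile the pattern once and use it for both. A sketch, assuming a hypothetical helper `strip_date_prefix` that also reports whether a prefix was found:

```python
import re

# Single source of truth for the prefix pattern.
DATE_PREFIX = re.compile(r"^\d{4}-\d{2}\s+")


def strip_date_prefix(stem: str) -> tuple[str, bool]:
    """Return (stem without prefix, whether a prefix was removed)."""
    stripped, count = DATE_PREFIX.subn("", stem, count=1)
    return stripped, count == 1
```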

2. Already-correctly-named files counted incorrectly (python/rename_invoices.py:96)

When a file is already correctly named (new_path == pdf_path), the script skips it but doesn't provide the same feedback as rename_epubs.py does. This could be confusing for users.

Current behavior: Silent skip (no counter increment)

Suggested improvement: Add explicit handling like in rename_epubs.py:95-97:

if new_path == pdf_path:
    click.echo(f"Already named correctly: {pdf_path.name}")
    continue

⚠️ Edge Cases to Consider

3. Company names with multiple words (python/rename_invoices.py:70)

parts = stem.split(maxsplit=1)

The current logic assumes the company name is a single word. If the input file is Acme Corp INV-123.pdf, it will parse as:

  • Company: Acme
  • Invoice: Corp INV-123

Impact: This might be the intended behavior if users are expected to use underscore-separated company names (e.g., Acme_Corp). The PR description shows company_name invoice_number.pdf, which suggests underscores are already expected.

Recommendation: Document this requirement more explicitly in the docstring, or consider a more robust parsing strategy (e.g., looking for invoice number patterns).
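As an illustration of what "looking for invoice number patterns" could mean, the sketch below treats the first token after the company name that contains a digit as the start of the invoice number. This heuristic is an assumption, not something the PR implements, and it would misparse invoice numbers whose first token has no digit (e.g. "Acme FV 12").

```python
from typing import Optional


def split_company_and_invoice(stem: str) -> Optional[tuple[str, str]]:
    """Split 'Company Name INV-123' into (company, invoice), or None."""
    tokens = stem.split()
    for i, token in enumerate(tokens):
        # Skip index 0 so a company like '3M' is not taken as the invoice.
        if i > 0 and any(ch.isdigit() for ch in token):
            return " ".join(tokens[:i]), " ".join(tokens[i:])
    return None  # no invoice-like token found
```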

4. Cross-platform date handling (python/rename_invoices.py:21)

timestamp = getattr(stat, "st_birthtime", None) or stat.st_mtime

Consideration: On Linux systems, st_birthtime is typically not available, so this will fall back to st_mtime (modification time). This could give unexpected results if PDFs are edited after creation. Consider documenting this platform-specific behavior.

🔒 Security

No security concerns identified. The script:

  • ✅ Uses Path for safe filesystem operations
  • ✅ Validates input directory exists
  • ✅ Sanitizes filenames to prevent directory traversal
  • ✅ Checks for existing files before overwriting

🚀 Performance

Performance is appropriate for the use case (processing local PDF files). No concerns.

📝 Minor Suggestions

5. Inconsistent whitespace handling

In rename_invoices.py:66, the sanitization differs from rename_epubs.py:66:

  • This PR: name.strip() only
  • rename_epubs.py: re.sub(r"\s+", " ", name) then name.strip()

The epub version normalizes multiple spaces to single spaces, which is more robust. Consider applying the same pattern here for consistency.

📋 Summary

Category       Status
Code Quality   ✅ Good
Consistency    ✅ Follows repo patterns
Bugs           ⚠️ 1 minor regex bug (easily fixed)
Edge Cases     ⚠️ Document company name format assumption
Security       ✅ No concerns
Performance    ✅ Appropriate

Recommended Actions

Must fix:

Nice to have:

Despite these issues, the code is production-ready with the regex fix applied. Great work maintaining consistency with the existing codebase! 🎉


claude bot commented Jan 27, 2026

Pull Request Review

Thanks for this contribution! I've reviewed the PR and have the following feedback:

Strengths

  1. Consistent with existing patterns: The script follows the same structure as rename_epubs.py (Click framework, dry-run flag, similar output formatting)
  2. Good error handling: Properly handles edge cases like duplicate files, unexpected formats, and malformed filenames
  3. Cross-platform compatibility: Falls back to st_mtime when st_birthtime is unavailable
  4. Clear documentation: README entry and docstrings are helpful

🐛 Bug: Date Prefix Stripping Logic

Location: python/rename_invoices.py:66-67

This logic strips existing date prefixes but then re-adds them based on current file creation time. This creates a problem where files already correctly named won't match new_path == pdf_path because the stem was modified. If file creation dates change (e.g., after copying), it could incorrectly rename already-correct files.

Recommendation: Remove the date-stripping logic (lines 65-67) and add an early check like rename_epubs.py:95-97 before the duplicate check.
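The ordering suggested here (already-named check first, then the duplicate check, then the rename) can be sketched as below. The function name `rename_one` and the returned labels are hypothetical; the scripts themselves use Click for output and counting.

```python
from pathlib import Path


def rename_one(pdf_path: Path, new_name: str, dry_run: bool = False) -> str:
    """Rename a single file and report what happened (hypothetical helper)."""
    new_path = pdf_path.with_name(new_name)
    if new_path == pdf_path:
        return "already-named"  # early check, before the duplicate check
    if new_path.exists():
        return "target-exists"  # never overwrite another file
    if dry_run:
        return "would-rename"  # preview only, filesystem untouched
    pdf_path.rename(new_path)
    return "renamed"
```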

🔧 Minor Issues

  1. Filename parsing assumption (python/rename_invoices.py:69-73): Assumes company name is a single word. If input is "Big Company invoice 123.pdf", it parses company="Big" and invoice="Company invoice 123". Document this assumption or enhance parsing.

  2. Inconsistency with sanitize_filename: rename_epubs.py normalizes multiple spaces but rename_invoices.py doesn't. Add name = re.sub(r"\s+", " ", name) for consistency.

🎯 Code Quality

Type hints, function decomposition, and PEP 723 compliance all look good!

🔒 Security & Performance

No concerns identified.


Summary

Solid addition! Please address the date prefix stripping bug and consider the minor improvements.

🤖 Review by Claude Code

- Skip files that already have yyyy-mm prefix instead of re-processing
- Add early check for files already named correctly before duplicate check

https://claude.ai/code/session_01AgLRG2yeADzgzH9VjTPUW2

claude bot commented Jan 27, 2026

PR Review: Add invoice PDF renaming utility script

Overall Assessment

Good implementation that follows the established patterns in the repository. The code is clean, well-documented, and includes proper error handling. However, there are a few issues that should be addressed before merging.

Issues Found

1. Critical: Missing duplicate filename handling in rename_epubs.py pattern ⚠️

The new script handles duplicate target filenames by skipping them, but rename_epubs.py doesn't check for this case. If two different epub files have the same author/title metadata, the second rename will fail with a confusing error. This is actually a bug in the existing script that this PR inadvertently highlights.

Recommendation: This PR's approach is correct. Consider fixing rename_epubs.py in a follow-up PR to match this pattern.

2. Bug: Inconsistent sanitize_filename behavior

The sanitize_filename() function differs between the two scripts:

rename_invoices.py (line 32-37):

def sanitize_filename(name: str) -> str:
    """Remove characters that are invalid in filenames."""
    invalid_chars = '<>:"/\\|?*'
    for char in invalid_chars:
        name = name.replace(char, "")
    return name.strip()

rename_epubs.py (line 61-67):

def sanitize_filename(name: str) -> str:
    """Remove characters that are invalid in filenames."""
    invalid_chars = '<>:"/\\|?*'
    for char in invalid_chars:
        name = name.replace(char, "")
    name = re.sub(r"\s+", " ", name)  # <-- Missing in rename_invoices.py
    return name.strip()

The epub version normalizes multiple whitespace characters to a single space, but the invoice version doesn't. This could lead to filenames with excessive whitespace if the company name or invoice number has multiple consecutive spaces.

Recommendation: Add the whitespace normalization line to rename_invoices.py.
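A small sketch of why the normalization matters: removing an invalid character that sits between two spaces leaves a double space behind. The `normalize_whitespace` flag is added here purely to compare the two behaviors and is not part of either script.

```python
import re

INVALID_CHARS = '<>:"/\\|?*'


def sanitize_filename(name: str, normalize_whitespace: bool = True) -> str:
    """Remove invalid characters, optionally collapsing runs of whitespace."""
    for char in INVALID_CHARS:
        name = name.replace(char, "")
    if normalize_whitespace:
        name = re.sub(r"\s+", " ", name)  # collapse runs left by removals
    return name.strip()
```

With this sketch, "Acme / Corp" keeps two consecutive spaces when normalization is off and becomes "Acme Corp" when it is on.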

3. Potential issue: Filename parsing assumption

Line 71 uses stem.split(maxsplit=1) which assumes the company name is a single word. This means:

  • Acme 12345.pdf → works
  • Acme Corp 12345.pdf → fails (interprets "Acme" as company, "Corp 12345" as invoice number)

Recommendation: Document this limitation clearly in the docstring and consider adding an example of expected input format.

4. Minor: Platform compatibility note

Line 21 uses st_birthtime with a fallback to st_mtime. st_birthtime is available on macOS and some BSD systems (and, since Python 3.12, on Windows), but not on Linux, where this will always use st_mtime (modification time). That may not accurately represent when the invoice was created if the file was copied or modified.

Recommendation: Consider documenting this behavior in the docstring or README.

Code Quality ✅

Strengths:

  • ✅ Follows PEP 723 inline dependency specification correctly
  • ✅ Consistent with existing rename_epubs.py pattern
  • ✅ Good error handling for malformed filenames and edge cases
  • ✅ Clear user feedback with counts
  • ✅ Proper use of Click framework
  • ✅ Type hints on function signatures
  • ✅ Skip logic for already-processed files
  • ✅ Dry-run functionality implemented correctly

Style:

  • ✅ Clean, readable code
  • ✅ Good function decomposition
  • ✅ Appropriate docstrings

Security Considerations ✅

  • ✅ Sanitizes filenames to prevent path traversal or invalid characters
  • ✅ No arbitrary code execution risks
  • ✅ Proper use of Path objects instead of string concatenation

Performance Considerations ✅

  • ✅ Efficient for typical use cases (small number of PDFs)
  • ✅ No unnecessary file I/O

Documentation ✅

  • ✅ README.md updated appropriately
  • ✅ Usage examples provided
  • ✅ Follows existing documentation pattern

Recommendations Summary

Must fix before merge:

  1. Add whitespace normalization to sanitize_filename() to match rename_epubs.py

Should consider:

  2. Document the single-word company name limitation
  3. Add a note about file date metadata behavior across platforms

Follow-up work:

  4. Consider extracting sanitize_filename() to a shared utility module to maintain consistency
  5. Add duplicate filename handling to rename_epubs.py in a separate PR

Overall, this is a solid PR with good code quality. The main issue is the missing whitespace normalization. Once addressed, this should be good to merge!

kzuraw merged commit 0558145 into main Jan 27, 2026
2 checks passed
kzuraw deleted the claude/rename-invoice-pdfs-6Vk4c branch January 27, 2026 17:46