Merged
Implements solution for issue #37: automated checking of bibliographic entries against external sources to verify accuracy of metadata.

New features:
- bibverify.py: Python script to verify .bib entries against CrossRef API
- Parallel batch processing for efficient verification of large bibliographies
- DOI-based and title-based lookup strategies with fuzzy matching
- Comprehensive verification of titles, authors, years, journals, volumes, pages
- Detailed discrepancy reporting with suggestions for corrections
- Thread-safe parallel processing with configurable worker count
- BIBVERIFY_README.md: Complete documentation and usage guide

Technical details:
- Uses CrossRef REST API (170M+ records, free, unlimited)
- Supports 1-20 parallel workers for scalable performance
- Smart fuzzy matching with configurable similarity thresholds
- Respects API rate limits with built-in delays
- Framework in place for future auto-fix functionality

Performance: Can verify 6,151 entries in ~30-60 minutes with 10 workers (compared to 5-8 hours sequentially)

Related to: #37
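The parallel, fuzzy-matched verification described above could look roughly like the sketch below. It is not the code in bibverify.py: `title_similarity`, `verify_entry`, `verify_all`, and the injected `lookup` callable are hypothetical names, the 0.85 title threshold is borrowed from a later commit in this PR, and the real script's CrossRef requests and rate-limit delays are left out so the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor
from difflib import SequenceMatcher


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so cosmetic differences do not count."""
    return " ".join(text.lower().split())


def title_similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio between two titles."""
    return SequenceMatcher(None, _normalize(a), _normalize(b)).ratio()


def verify_entry(entry: dict, lookup, threshold: float = 0.85) -> dict:
    """Compare one entry against the record returned by `lookup` (hypothetical callable)."""
    record = lookup(entry)
    if record is None:
        return {"key": entry["key"], "status": "not_found"}
    score = title_similarity(entry["title"], record["title"])
    status = "ok" if score >= threshold else "mismatch"
    return {"key": entry["key"], "status": status, "score": score}


def verify_all(entries, lookup, workers: int = 10) -> list:
    """Verify entries in a thread pool; results are returned in input order."""
    workers = max(1, min(20, workers))  # the PR describes a 1-20 worker range
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda e: verify_entry(e, lookup), entries))
```

Passing the lookup in as a callable keeps the network layer separate from the comparison logic, so the matching can be tested with a stub before pointing it at a real API.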
Test results show:
- Processing speed: 17.5 entries/second with 10 workers
- Full bibliography (6,151 entries) estimated at ~6 minutes
- Found discrepancies in 78% of tested entries
- Common issues: missing DOIs, formatting differences, some genuine errors

Performance demonstrates that automated verification is highly feasible and practical for the CDL bibliography at scale.
Major improvements based on user feedback:

1. Conservative Match Verification:
   - Requires ALL of: title ≥85%, authors ≥70%, journal ≥60%, year ≤1 difference
   - Rejects uncertain matches rather than reporting false positives
   - Fixes GuoEtal20 false positive (0% author match correctly rejected)
2. Focus on Metadata Accuracy:
   - Only verifies volume/pages/number when confident match found
   - Removed DOI suggestions (not needed per formatting guide)
   - Detects common errors (DOI in pages field)
3. Results Improvement:
   - 54% verified (vs 2% before)
   - 14% real errors (vs 78% false positives before)
   - No more wrong-paper suggestions

Test results on 100 entries:
- Real errors found: volume mismatches, DOI in pages, year discrepancies
- False positives eliminated: GuoEtal20, etc.
- Processing speed: ~12 entries/sec with conservative matching

Addresses feedback on issue #37.
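The "requires ALL of" rule can be sketched as a single gate that a candidate record must pass before any volume or page comparison is attempted. The thresholds are the ones listed above. Everything else is an assumption for illustration: the function names, the dict-shaped entries, and the use of `difflib` on plain strings (the actual script may compare author lists differently).

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a 0..1 ratio between two strings, ignoring case and extra spaces."""
    left = " ".join(a.lower().split())
    right = " ".join(b.lower().split())
    return SequenceMatcher(None, left, right).ratio()


def is_confident_match(entry: dict, record: dict) -> bool:
    """Accept a candidate only if every criterion passes (hypothetical helper)."""
    checks = (
        similarity(entry["title"], record["title"]) >= 0.85,
        similarity(entry["authors"], record["authors"]) >= 0.70,
        similarity(entry["journal"], record["journal"]) >= 0.60,
        abs(int(entry["year"]) - int(record["year"])) <= 1,
    )
    return all(checks)
```

Because the checks are combined with `all`, a near-perfect title match cannot rescue a candidate whose authors do not match, which is the kind of false positive the GuoEtal20 case describes.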
…ication results

Changes:
- Integrated bibverify.py documentation into main README.md
- Removed standalone BIBVERIFY_README.md (now in README.md)
- Removed VERIFICATION_TEST_RESULTS.md (superseded by full report)
- Added full_verification_report.txt with complete verification results

Full verification results (6,151 entries in 6min 11sec):
- ✓ Verified: 3,988 entries (65%)
- ✗ Errors: 724 entries (12%) - real metadata issues
- ⚠ Warnings: 1,434 entries (23%) - not in CrossRef or uncertain match

Common errors found:
- Volume/issue number mismatches
- Page range errors (off-by-one)
- DOI in pages field instead of doi field
- Year discrepancies (preprint vs published)

The bibverify tool successfully demonstrates feasibility of automated bibliographic accuracy verification at scale, addressing issue #37.
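One of the common errors listed above, a DOI stored in the pages field, is easy to detect locally without any API call. The check below is a minimal sketch rather than the check bibverify.py performs: `DOI_PATTERN` and `pages_field_holds_doi` are hypothetical names, and the regular expression is a loose approximation of the modern DOI shape (`10.` followed by a registrant code and a suffix), optionally prefixed by a doi.org URL.

```python
import re

# Loose DOI shape: optional resolver URL, "10.", 4-9 digit registrant, "/", suffix.
DOI_PATTERN = re.compile(
    r"(?:https?://(?:dx\.)?doi\.org/)?10\.\d{4,9}/\S+",
    re.IGNORECASE,
)


def pages_field_holds_doi(pages: str) -> bool:
    """Return True when the whole pages value looks like a DOI rather than a page range."""
    return bool(DOI_PATTERN.fullmatch(pages.strip()))
```

Matching the whole field rather than searching inside it keeps ordinary page ranges and article numbers such as `e1001234` from being flagged.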
Removed per user request - full report not needed in repository.
auto verification