-
Notifications
You must be signed in to change notification settings - Fork 36
refactor: consolidate 3 extraction loops into 1 parametric function #5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Unify extractTextFromContentWithContext, extractTextFromContentWithBounds, and extractTextWithMcidTracking into a single extractContentStream() using an ExtractionMode union (stream/bounds/structured). Extract shared logic into pushOperand() and lookupFont() helpers. Remove 4 duplicate text write helpers. Rename ParseError to ParseErrorRecord to avoid collision with parser.ParseError. Fix decompress.zig to import Object from parser.zig instead of root.zig.
📝 WalkthroughWalkthroughThe changes introduce a unified extraction pipeline architecture for PDF content with an ExtractionMode union for controlling operator dispatch. Additionally, ParseError is renamed to ParseErrorRecord, and internal imports are restructured. The refactoring integrates MCID tracking, font-aware text processing, and new writer utilities while maintaining backward compatibility. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~30 minutes Poem
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@src/root.zig`:
- Around line 976-978: The fixed-size text_buf ([4096]u8) and text_pos may
truncate long text spans; change to a growable buffer: introduce a compile-time
constant (e.g., const TEXT_BUF_INITIAL = 4096) and replace text_buf/text_pos
with an allocator-backed slice (allocating TEXT_BUF_INITIAL) that is resized
(realloc/grow) when needed before writes, updating all uses to track
length/capacity instead of a raw array; alternatively, if dynamic allocation is
undesired, expose the buffer size as a compile-time constant and document the
limitation so callers can tune it.
- Around line 927-938: The code currently constructs an ExtractionContext with
.xref_table = undefined and .object_cache = undefined which is fragile; change
ExtractionMode.stream.ctx to be an optional pointer (e.g., ?*const
ExtractionContext) and pass null instead of undefined when no context is
available in the call site in extractContentStream; then update handleDoOperator
to test for a null ctx (e.g., if (ctx == null) return) and safely unwrap ctx
only when present before accessing xref_table or object_cache; ensure all other
call sites and any pattern matches are updated to the new optional type so no
undefined pointer values are used.
🧹 Nitpick comments (2)
src/root.zig (2)
869-905: Consider adding overflow detection for debugging.When
count.* >= operands.len, operands are silently dropped. While 64 operands is generous for standard PDF operators, silently ignoring overflow could mask issues with malformed PDFs or extraction bugs.💡 Optional: Add debug assertion or logging
.number => |n| { if (count.* < operands.len) { operands[count.*] = .{ .number = n }; count.* += 1; - } + } else if (builtin.mode == .Debug) { + std.debug.print("Warning: operand overflow at count {d}\n", .{count.*}); + } },
730-747: Consider simplifying the labeled block pattern.The
nw_blk:labeled block pattern is valid but unusual. A simpler approach might improve readability:💡 Alternative pattern
- if (nw_blk: { - var nw: NullWriter = .{}; - break :nw_blk extractContentStream(content, .{ .structured = &extractor }, &self.font_cache, page_num, arena, &nw); - }) |_| { + var nw: NullWriter = .{}; + const extract_ok = extractContentStream(content, .{ .structured = &extractor }, &self.font_cache, page_num, arena, &nw); + if (extract_ok) |_| { // Collect text in structure tree order ... } else |_| {}
| try extractContentStream(content, .{ .stream = .{ | ||
| .resources = null, | ||
| .ctx = &.{ | ||
| .allocator = allocator, | ||
| .data = &.{}, | ||
| .xref_table = undefined, | ||
| .object_cache = undefined, | ||
| .font_cache = font_cache, | ||
| .page_num = page_num, | ||
| .depth = 0, | ||
| }, | ||
| } }, font_cache, page_num, allocator, writer); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using undefined for pointer fields is fragile.
The ExtractionContext is created with .xref_table = undefined and .object_cache = undefined. While handleDoOperator guards against this with if (ctx.data.len == 0) return, this pattern is fragile—future modifications could accidentally dereference these undefined pointers, causing undefined behavior.
💡 Suggested safer approach
Consider using an optional context or a sentinel pattern:
-fn extractTextFromContent(
- allocator: std.mem.Allocator,
- content: []const u8,
- page_num: usize,
- font_cache: *const std.StringHashMap(encoding.FontEncoding),
- writer: anytype,
-) !void {
- // Simple path without Form XObject support (for backward compatibility)
- try extractContentStream(content, .{ .stream = .{
- .resources = null,
- .ctx = &.{
- .allocator = allocator,
- .data = &.{},
- .xref_table = undefined,
- .object_cache = undefined,
- .font_cache = font_cache,
- .page_num = page_num,
- .depth = 0,
- },
- } }, font_cache, page_num, allocator, writer);
-}
+fn extractTextFromContent(
+ allocator: std.mem.Allocator,
+ content: []const u8,
+ page_num: usize,
+ font_cache: *const std.StringHashMap(encoding.FontEncoding),
+ writer: anytype,
+) !void {
+ // Simple path without Form XObject support - pass null ctx to skip Do handling
+ try extractContentStream(content, .{ .stream = .{
+ .resources = null,
+ .ctx = null, // Signals no XObject resolution available
+ } }, font_cache, page_num, allocator, writer);
+}This would require updating ExtractionMode.stream.ctx to be ?*const ExtractionContext and adjusting handleDoOperator accordingly.
🤖 Prompt for AI Agents
In `@src/root.zig` around lines 927 - 938, The code currently constructs an
ExtractionContext with .xref_table = undefined and .object_cache = undefined
which is fragile; change ExtractionMode.stream.ctx to be an optional pointer
(e.g., ?*const ExtractionContext) and pass null instead of undefined when no
context is available in the call site in extractContentStream; then update
handleDoOperator to test for a null ctx (e.g., if (ctx == null) return) and
safely unwrap ctx only when present before accessing xref_table or object_cache;
ensure all other call sites and any pattern matches are updated to the new
optional type so no undefined pointer values are used.
| // Text buffer for structured mode (MCID tracking) | ||
| var text_buf: [4096]u8 = undefined; | ||
| var text_pos: usize = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed 4096-byte buffer may truncate long text spans.
The text_buf used for MCID tracking in structured mode has a fixed 4096-byte size. While this is typically sufficient, very long text spans (e.g., from a single Tj operator with extensive text) would be silently truncated.
Consider either:
- Documenting this limitation
- Using dynamic allocation when the buffer is insufficient
- Increasing the buffer size with a compile-time constant for easier tuning
🤖 Prompt for AI Agents
In `@src/root.zig` around lines 976 - 978, The fixed-size text_buf ([4096]u8) and
text_pos may truncate long text spans; change to a growable buffer: introduce a
compile-time constant (e.g., const TEXT_BUF_INITIAL = 4096) and replace
text_buf/text_pos with an allocator-backed slice (allocating TEXT_BUF_INITIAL)
that is resized (realloc/grow) when needed before writes, updating all uses to
track length/capacity instead of a raw array; alternatively, if dynamic
allocation is undesired, expose the buffer size as a compile-time constant and
document the limitation so callers can tune it.
Unify extractTextFromContentWithContext, extractTextFromContentWithBounds, and extractTextWithMcidTracking into a single extractContentStream() using an ExtractionMode union (stream/bounds/structured). Extract shared logic into pushOperand() and lookupFont() helpers. Remove 4 duplicate text write helpers. Rename ParseError to ParseErrorRecord to avoid collision with parser.ParseError. Fix decompress.zig to import Object from parser.zig instead of root.zig.
Summary by CodeRabbit
ParseErrortoParseErrorRecord