refactor: consolidate 3 extraction loops into 1 parametric function #5

Lulzx · 2026-02-05T19:57:11Z

Unify extractTextFromContentWithContext, extractTextFromContentWithBounds, and extractTextWithMcidTracking into a single extractContentStream() using an ExtractionMode union (stream/bounds/structured). Extract shared logic into pushOperand() and lookupFont() helpers. Remove 4 duplicate text write helpers. Rename ParseError to ParseErrorRecord to avoid collision with parser.ParseError. Fix decompress.zig to import Object from parser.zig instead of root.zig.

Summary by CodeRabbit

Breaking Changes
- Renamed ParseError to ParseErrorRecord
New Features
- Introduced multiple extraction modes for flexible operator handling
- Enhanced structure tree extraction with improved MCID support
- Improved Form XObject and text extraction
- Better font processing throughout extraction pipeline
Refactor
- Unified content extraction pathway for consistent behavior across modes

Unify extractTextFromContentWithContext, extractTextFromContentWithBounds, and extractTextWithMcidTracking into a single extractContentStream() using an ExtractionMode union (stream/bounds/structured). Extract shared logic into pushOperand() and lookupFont() helpers. Remove 4 duplicate text write helpers. Rename ParseError to ParseErrorRecord to avoid collision with parser.ParseError. Fix decompress.zig to import Object from parser.zig instead of root.zig.

coderabbitai · 2026-02-05T19:57:30Z

📝 Walkthrough

Walkthrough

The changes introduce a unified extraction pipeline architecture for PDF content with an ExtractionMode union for controlling operator dispatch. Additionally, ParseError is renamed to ParseErrorRecord, and internal imports are restructured. The refactoring integrates MCID tracking, font-aware text processing, and new writer utilities while maintaining backward compatibility.

Changes

Cohort / File(s)	Summary
Import Restructuring `src/decompress.zig`	Import path for Object type updated from `root.zig` to `parser.zig`.
Type Renaming and API Updates `src/root.zig`	ParseError renamed to ParseErrorRecord; Document.errors type updated from `std.ArrayList(ParseError)` to `std.ArrayList(ParseErrorRecord)`.
Extraction Pipeline Architecture `src/root.zig`	New public ExtractionMode union introduced for controlling extraction dispatch (stream, bounds, structured). Unified content extraction pathway via extractContentStream added; Form XObject and text extraction refactored to use new approach. MCID tracking and structure-tree-based extraction support added.
Writer Utilities and Font Handling `src/root.zig`	BufferWriter and NullWriter helper utilities added; font-lookup helper introduced and integrated into extraction paths. Font-aware text writing mechanisms (writeTextToBuffer, writeTJArrayToBuffer) implemented for MCID-aware processing.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~30 minutes

Poem

🐰 A parser's journey, made fresh and clean,
New extraction modes—a unified scene!
From streams to structures, the pathways align,
Font-aware buffering, now all intertwine—
hop hop ✨

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'refactor: consolidate 3 extraction loops into 1 parametric function' clearly and accurately describes the main change: consolidating three extraction functions into a single parametric function, which aligns with the PR objectives.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch dev

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 2

🤖 Fix all issues with AI agents

In `@src/root.zig`:
- Around line 976-978: The fixed-size text_buf ([4096]u8) and text_pos may
truncate long text spans; change to a growable buffer: introduce a compile-time
constant (e.g., const TEXT_BUF_INITIAL = 4096) and replace text_buf/text_pos
with an allocator-backed slice (allocating TEXT_BUF_INITIAL) that is resized
(realloc/grow) when needed before writes, updating all uses to track
length/capacity instead of a raw array; alternatively, if dynamic allocation is
undesired, expose the buffer size as a compile-time constant and document the
limitation so callers can tune it.
- Around line 927-938: The code currently constructs an ExtractionContext with
.xref_table = undefined and .object_cache = undefined which is fragile; change
ExtractionMode.stream.ctx to be an optional pointer (e.g., ?*const
ExtractionContext) and pass null instead of undefined when no context is
available in the call site in extractContentStream; then update handleDoOperator
to test for a null ctx (e.g., if (ctx == null) return) and safely unwrap ctx
only when present before accessing xref_table or object_cache; ensure all other
call sites and any pattern matches are updated to the new optional type so no
undefined pointer values are used.

🧹 Nitpick comments (2)

src/root.zig (2)

869-905: Consider adding overflow detection for debugging.

When count.* >= operands.len, operands are silently dropped. While 64 operands is generous for standard PDF operators, silently ignoring overflow could mask issues with malformed PDFs or extraction bugs.

💡 Optional: Add debug assertion or logging

         .number => |n| {
             if (count.* < operands.len) {
                 operands[count.*] = .{ .number = n };
                 count.* += 1;
-            }
+            } else if (builtin.mode == .Debug) {
+                std.debug.print("Warning: operand overflow at count {d}\n", .{count.*});
+            }
         },

730-747: Consider simplifying the labeled block pattern.

The nw_blk: labeled block pattern is valid but unusual. A simpler approach might improve readability:

💡 Alternative pattern

-                    if (nw_blk: {
-                        var nw: NullWriter = .{};
-                        break :nw_blk extractContentStream(content, .{ .structured = &extractor }, &self.font_cache, page_num, arena, &nw);
-                    }) |_| {
+                    var nw: NullWriter = .{};
+                    const extract_ok = extractContentStream(content, .{ .structured = &extractor }, &self.font_cache, page_num, arena, &nw);
+                    if (extract_ok) |_| {
                         // Collect text in structure tree order
                         ...
                     } else |_| {}

coderabbitai · 2026-02-05T20:00:19Z

src/root.zig

+    try extractContentStream(content, .{ .stream = .{
+        .resources = null,
+        .ctx = &.{
+            .allocator = allocator,
+            .data = &.{},
+            .xref_table = undefined,
+            .object_cache = undefined,
+            .font_cache = font_cache,
+            .page_num = page_num,
+            .depth = 0,
+        },
+    } }, font_cache, page_num, allocator, writer);


⚠️ Potential issue | 🟡 Minor

Using undefined for pointer fields is fragile.

The ExtractionContext is created with .xref_table = undefined and .object_cache = undefined. While handleDoOperator guards against this with if (ctx.data.len == 0) return, this pattern is fragile—future modifications could accidentally dereference these undefined pointers, causing undefined behavior.

💡 Suggested safer approach

Consider using an optional context or a sentinel pattern:

-fn extractTextFromContent( - allocator: std.mem.Allocator, - content: []const u8, - page_num: usize, - font_cache: *const std.StringHashMap(encoding.FontEncoding), - writer: anytype, -) !void { - // Simple path without Form XObject support (for backward compatibility) - try extractContentStream(content, .{ .stream = .{ - .resources = null, - .ctx = &.{ - .allocator = allocator, - .data = &.{}, - .xref_table = undefined, - .object_cache = undefined, - .font_cache = font_cache, - .page_num = page_num, - .depth = 0, - }, - } }, font_cache, page_num, allocator, writer); -} +fn extractTextFromContent( + allocator: std.mem.Allocator, + content: []const u8, + page_num: usize, + font_cache: *const std.StringHashMap(encoding.FontEncoding), + writer: anytype, +) !void { + // Simple path without Form XObject support - pass null ctx to skip Do handling + try extractContentStream(content, .{ .stream = .{ + .resources = null, + .ctx = null, // Signals no XObject resolution available + } }, font_cache, page_num, allocator, writer); +}

This would require updating ExtractionMode.stream.ctx to be ?*const ExtractionContext and adjusting handleDoOperator accordingly.

🤖 Prompt for AI Agents

In `@src/root.zig` around lines 927 - 938, The code currently constructs an ExtractionContext with .xref_table = undefined and .object_cache = undefined which is fragile; change ExtractionMode.stream.ctx to be an optional pointer (e.g., ?*const ExtractionContext) and pass null instead of undefined when no context is available in the call site in extractContentStream; then update handleDoOperator to test for a null ctx (e.g., if (ctx == null) return) and safely unwrap ctx only when present before accessing xref_table or object_cache; ensure all other call sites and any pattern matches are updated to the new optional type so no undefined pointer values are used.

coderabbitai · 2026-02-05T20:00:19Z

src/root.zig

+    // Text buffer for structured mode (MCID tracking)
+    var text_buf: [4096]u8 = undefined;
+    var text_pos: usize = 0;


⚠️ Potential issue | 🟡 Minor

Fixed 4096-byte buffer may truncate long text spans.

The text_buf used for MCID tracking in structured mode has a fixed 4096-byte size. While this is typically sufficient, very long text spans (e.g., from a single Tj operator with extensive text) would be silently truncated.

Consider either:

Documenting this limitation

Using dynamic allocation when the buffer is insufficient

Increasing the buffer size with a compile-time constant for easier tuning

🤖 Prompt for AI Agents

In `@src/root.zig` around lines 976 - 978, The fixed-size text_buf ([4096]u8) and text_pos may truncate long text spans; change to a growable buffer: introduce a compile-time constant (e.g., const TEXT_BUF_INITIAL = 4096) and replace text_buf/text_pos with an allocator-backed slice (allocating TEXT_BUF_INITIAL) that is resized (realloc/grow) when needed before writes, updating all uses to track length/capacity instead of a raw array; alternatively, if dynamic allocation is undesired, expose the buffer size as a compile-time constant and document the limitation so callers can tune it.

coderabbitai bot reviewed Feb 5, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

refactor: consolidate 3 extraction loops into 1 parametric function #5

refactor: consolidate 3 extraction loops into 1 parametric function #5

Uh oh!

Lulzx commented Feb 5, 2026 •

edited by coderabbitai bot

Loading

Uh oh!

coderabbitai bot commented Feb 5, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Poem

Uh oh!

coderabbitai bot left a comment

Uh oh!

coderabbitai bot Feb 5, 2026

Uh oh!

coderabbitai bot Feb 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

refactor: consolidate 3 extraction loops into 1 parametric function #5

Are you sure you want to change the base?

refactor: consolidate 3 extraction loops into 1 parametric function #5

Uh oh!

Conversation

Lulzx commented Feb 5, 2026 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Feb 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Poem

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Feb 5, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Feb 5, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Lulzx commented Feb 5, 2026 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Feb 5, 2026 •

edited

Loading