Skip to content

Conversation

@Lulzx
Copy link
Owner

@Lulzx Lulzx commented Feb 5, 2026

Unify extractTextFromContentWithContext, extractTextFromContentWithBounds, and extractTextWithMcidTracking into a single extractContentStream() using an ExtractionMode union (stream/bounds/structured). Extract shared logic into pushOperand() and lookupFont() helpers. Remove 4 duplicate text write helpers. Rename ParseError to ParseErrorRecord to avoid collision with parser.ParseError. Fix decompress.zig to import Object from parser.zig instead of root.zig.

Summary by CodeRabbit

  • Breaking Changes
    • Renamed ParseError to ParseErrorRecord
  • New Features
    • Introduced multiple extraction modes for flexible operator handling
    • Enhanced structure tree extraction with improved MCID support
    • Improved Form XObject and text extraction
    • Better font processing throughout extraction pipeline
  • Refactor
    • Unified content extraction pathway for consistent behavior across modes

Unify extractTextFromContentWithContext, extractTextFromContentWithBounds,
and extractTextWithMcidTracking into a single extractContentStream()
using an ExtractionMode union (stream/bounds/structured). Extract shared
logic into pushOperand() and lookupFont() helpers. Remove 4 duplicate
text write helpers. Rename ParseError to ParseErrorRecord to avoid
collision with parser.ParseError. Fix decompress.zig to import Object
from parser.zig instead of root.zig.
@coderabbitai
Copy link

coderabbitai bot commented Feb 5, 2026

📝 Walkthrough

Walkthrough

The changes introduce a unified extraction pipeline architecture for PDF content with an ExtractionMode union for controlling operator dispatch. Additionally, ParseError is renamed to ParseErrorRecord, and internal imports are restructured. The refactoring integrates MCID tracking, font-aware text processing, and new writer utilities while maintaining backward compatibility.

Changes

Cohort / File(s) Summary
Import Restructuring
src/decompress.zig
Import path for Object type updated from root.zig to parser.zig.
Type Renaming and API Updates
src/root.zig
ParseError renamed to ParseErrorRecord; Document.errors type updated from std.ArrayList(ParseError) to std.ArrayList(ParseErrorRecord).
Extraction Pipeline Architecture
src/root.zig
New public ExtractionMode union introduced for controlling extraction dispatch (stream, bounds, structured). Unified content extraction pathway via extractContentStream added; Form XObject and text extraction refactored to use new approach. MCID tracking and structure-tree-based extraction support added.
Writer Utilities and Font Handling
src/root.zig
BufferWriter and NullWriter helper utilities added; font-lookup helper introduced and integrated into extraction paths. Font-aware text writing mechanisms (writeTextToBuffer, writeTJArrayToBuffer) implemented for MCID-aware processing.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~30 minutes

Poem

🐰 A parser's journey, made fresh and clean,
New extraction modes—a unified scene!
From streams to structures, the pathways align,
Font-aware buffering, now all intertwine—
hop hop

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'refactor: consolidate 3 extraction loops into 1 parametric function' clearly and accurately describes the main change: consolidating three extraction functions into a single parametric function, which aligns with the PR objectives.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch dev

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@src/root.zig`:
- Around line 976-978: The fixed-size text_buf ([4096]u8) and text_pos may
truncate long text spans; change to a growable buffer: introduce a compile-time
constant (e.g., const TEXT_BUF_INITIAL = 4096) and replace text_buf/text_pos
with an allocator-backed slice (allocating TEXT_BUF_INITIAL) that is resized
(realloc/grow) when needed before writes, updating all uses to track
length/capacity instead of a raw array; alternatively, if dynamic allocation is
undesired, expose the buffer size as a compile-time constant and document the
limitation so callers can tune it.
- Around line 927-938: The code currently constructs an ExtractionContext with
.xref_table = undefined and .object_cache = undefined which is fragile; change
ExtractionMode.stream.ctx to be an optional pointer (e.g., ?*const
ExtractionContext) and pass null instead of undefined when no context is
available in the call site in extractContentStream; then update handleDoOperator
to test for a null ctx (e.g., if (ctx == null) return) and safely unwrap ctx
only when present before accessing xref_table or object_cache; ensure all other
call sites and any pattern matches are updated to the new optional type so no
undefined pointer values are used.
🧹 Nitpick comments (2)
src/root.zig (2)

869-905: Consider adding overflow detection for debugging.

When count.* >= operands.len, operands are silently dropped. While 64 operands is generous for standard PDF operators, silently ignoring overflow could mask issues with malformed PDFs or extraction bugs.

💡 Optional: Add debug assertion or logging
         .number => |n| {
             if (count.* < operands.len) {
                 operands[count.*] = .{ .number = n };
                 count.* += 1;
-            }
+            } else if (builtin.mode == .Debug) {
+                std.debug.print("Warning: operand overflow at count {d}\n", .{count.*});
+            }
         },

730-747: Consider simplifying the labeled block pattern.

The nw_blk: labeled block pattern is valid but unusual. A simpler approach might improve readability:

💡 Alternative pattern
-                    if (nw_blk: {
-                        var nw: NullWriter = .{};
-                        break :nw_blk extractContentStream(content, .{ .structured = &extractor }, &self.font_cache, page_num, arena, &nw);
-                    }) |_| {
+                    var nw: NullWriter = .{};
+                    const extract_ok = extractContentStream(content, .{ .structured = &extractor }, &self.font_cache, page_num, arena, &nw);
+                    if (extract_ok) |_| {
                         // Collect text in structure tree order
                         ...
                     } else |_| {}

Comment on lines +927 to +938
try extractContentStream(content, .{ .stream = .{
.resources = null,
.ctx = &.{
.allocator = allocator,
.data = &.{},
.xref_table = undefined,
.object_cache = undefined,
.font_cache = font_cache,
.page_num = page_num,
.depth = 0,
},
} }, font_cache, page_num, allocator, writer);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Using undefined for pointer fields is fragile.

The ExtractionContext is created with .xref_table = undefined and .object_cache = undefined. While handleDoOperator guards against this with if (ctx.data.len == 0) return, this pattern is fragile—future modifications could accidentally dereference these undefined pointers, causing undefined behavior.

💡 Suggested safer approach

Consider using an optional context or a sentinel pattern:

-fn extractTextFromContent(
-    allocator: std.mem.Allocator,
-    content: []const u8,
-    page_num: usize,
-    font_cache: *const std.StringHashMap(encoding.FontEncoding),
-    writer: anytype,
-) !void {
-    // Simple path without Form XObject support (for backward compatibility)
-    try extractContentStream(content, .{ .stream = .{
-        .resources = null,
-        .ctx = &.{
-            .allocator = allocator,
-            .data = &.{},
-            .xref_table = undefined,
-            .object_cache = undefined,
-            .font_cache = font_cache,
-            .page_num = page_num,
-            .depth = 0,
-        },
-    } }, font_cache, page_num, allocator, writer);
-}
+fn extractTextFromContent(
+    allocator: std.mem.Allocator,
+    content: []const u8,
+    page_num: usize,
+    font_cache: *const std.StringHashMap(encoding.FontEncoding),
+    writer: anytype,
+) !void {
+    // Simple path without Form XObject support - pass null ctx to skip Do handling
+    try extractContentStream(content, .{ .stream = .{
+        .resources = null,
+        .ctx = null, // Signals no XObject resolution available
+    } }, font_cache, page_num, allocator, writer);
+}

This would require updating ExtractionMode.stream.ctx to be ?*const ExtractionContext and adjusting handleDoOperator accordingly.

🤖 Prompt for AI Agents
In `@src/root.zig` around lines 927 - 938, The code currently constructs an
ExtractionContext with .xref_table = undefined and .object_cache = undefined
which is fragile; change ExtractionMode.stream.ctx to be an optional pointer
(e.g., ?*const ExtractionContext) and pass null instead of undefined when no
context is available in the call site in extractContentStream; then update
handleDoOperator to test for a null ctx (e.g., if (ctx == null) return) and
safely unwrap ctx only when present before accessing xref_table or object_cache;
ensure all other call sites and any pattern matches are updated to the new
optional type so no undefined pointer values are used.

Comment on lines +976 to +978
// Text buffer for structured mode (MCID tracking)
var text_buf: [4096]u8 = undefined;
var text_pos: usize = 0;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Fixed 4096-byte buffer may truncate long text spans.

The text_buf used for MCID tracking in structured mode has a fixed 4096-byte size. While this is typically sufficient, very long text spans (e.g., from a single Tj operator with extensive text) would be silently truncated.

Consider either:

  1. Documenting this limitation
  2. Using dynamic allocation when the buffer is insufficient
  3. Increasing the buffer size with a compile-time constant for easier tuning
🤖 Prompt for AI Agents
In `@src/root.zig` around lines 976 - 978, The fixed-size text_buf ([4096]u8) and
text_pos may truncate long text spans; change to a growable buffer: introduce a
compile-time constant (e.g., const TEXT_BUF_INITIAL = 4096) and replace
text_buf/text_pos with an allocator-backed slice (allocating TEXT_BUF_INITIAL)
that is resized (realloc/grow) when needed before writes, updating all uses to
track length/capacity instead of a raw array; alternatively, if dynamic
allocation is undesired, expose the buffer size as a compile-time constant and
document the limitation so callers can tune it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant