Conversation

@vinyas-bharadwaj
Contributor

@vinyas-bharadwaj vinyas-bharadwaj commented Oct 8, 2025

Description

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Code refactoring
  • Performance improvement
  • Other (please describe):

Related Issue

Fixes #81

Changes Made

  • Filters binary files from diffs for better LLM context
  • Adds --dry-run flag to preview LLM prompt without API call

Testing

  • Tested with Gemini API
  • Tested with Grok API
  • Tested on Windows
  • Tested on Linux
  • Tested on macOS
  • Added/updated tests (if applicable)

Checklist

  • My code follows the project's code style
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings or errors
  • I have tested this in a real Git repository
  • I have read the CONTRIBUTING.md guidelines

Screenshots (if applicable)

Additional Notes


For Hacktoberfest Participants

  • This PR is submitted as part of Hacktoberfest 2025

Thank you for your contribution! 🎉

Summary by CodeRabbit

  • New Features

    • Binary files are excluded from unstaged, staged, and untracked change lists and diffs to reduce noise.
    • Untracked text files still show small text contents when applicable, with sensitive environment data scrubbed.
    • Expanded and improved text/binary detection (many added extensions plus common extensionless filenames) for more accurate diffs.
  • Chores

    • Dependency ordering adjusted with no functional impact.

@coderabbitai
Contributor

coderabbitai bot commented Oct 8, 2025

Walkthrough

Adds binary-file detection and filtering: introduces IsBinaryFile, expands IsTextFile extensions and extensionless filename handling, and updates git change collection to exclude binary files from unstaged, staged, and untracked listings and diffs. Also reorders one dependency line in go.mod. No other public API changes.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Dependency ordering: go.mod | Reordered github.com/google/shlex within the require block; no version change or functional impact. |
| Utils: file type detection: internal/utils/utils.go | Added exported IsBinaryFile(filename string) bool. Expanded IsTextFile to recognise many additional text extensions and common extensionless text filenames; logic otherwise unchanged. |
| Git operations: binary filtering: internal/git/operations.go | Added parsing helpers for git --name-status output, plus functions to filter out binary entries and extract non-binary filenames. Updated GetChanges to exclude binary files from unstaged, staged, and untracked listings and to request diffs only for non-binary files; untracked text files are still read and scrubbed as before. |

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Caller as Caller
  participant GetChanges as GetChanges
  participant GitCLI as git CLI
  participant Utils as utils (IsBinaryFile/IsTextFile)

  Caller->>GetChanges: GetChanges()
  GetChanges->>GitCLI: git diff --name-status (unstaged)
  GetChanges->>GetChanges: parse name-status, filterBinaryFiles
  GetChanges->>Utils: classify filenames (IsBinaryFile/IsTextFile)
  Utils-->>GetChanges: non-binary filenames
  GetChanges->>GitCLI: git diff -- <non-binary files>
  GetChanges-->>Caller: Unstaged summary + diffs (text only)

  GetChanges->>GitCLI: git diff --cached --name-status (staged)
  GetChanges->>GetChanges: parse name-status, filterBinaryFiles
  GetChanges->>Utils: classify filenames
  Utils-->>GetChanges: non-binary filenames
  GetChanges->>GitCLI: git diff --cached -- <non-binary files>
  GetChanges-->>Caller: Staged summary + diffs (text only)

  GetChanges->>GitCLI: git ls-files --others --exclude-standard (untracked)
  loop per file
    GetChanges->>Utils: IsBinaryFile / IsTextFile
    Utils-->>GetChanges: classification
  end
  GetChanges->>GitCLI: read contents of eligible text untracked files
  GetChanges-->>Caller: Untracked list + contents (text only)

  note over GetChanges,Utils: Binary files excluded from lists and diffs

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested labels

enhancement, go

Suggested reviewers

  • DFanso

Poem

I nibble through the repo's maze,
Skipping blobs and heavy glaze.
Text I pick, small and bright,
Big bytes rest beyond my sight.
Hop, patch, and fluff—the diff's just right. 🥕

Pre-merge checks and finishing touches

✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title Check | ✅ Passed | The title clearly summarizes the main change by stating the goal of reducing API costs through filtering binary and other unwanted files, matching the pull request’s primary focus without extraneous detail or jargon. |
| Linked Issues Check | ✅ Passed | The changes implement the core requirement from issue #81 by detecting and excluding binary and non-text files in diffs before API calls, using newly added parsing and filtering logic in operations.go and the IsBinaryFile utility in utils.go. |
| Out of Scope Changes Check | ✅ Passed | All modifications directly support filtering out unwanted files, updating dependency ordering, and extending file type detection, without introducing functionality unrelated to the binary-filtering objective. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%. |

Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2c50506 and 32f7d9d.

📒 Files selected for processing (3)
  • go.mod (1 hunks)
  • internal/git/operations.go (4 hunks)
  • internal/utils/utils.go (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
internal/git/operations.go (2)
internal/utils/utils.go (3)
  • IsBinaryFile (41-69)
  • IsTextFile (19-38)
  • IsSmallFile (72-81)
internal/scrubber/scrubber.go (1)
  • ScrubEnvFile (193-231)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Build Go Binary (windows-latest)
🔇 Additional comments (4)
internal/utils/utils.go (1)

24-27: LGTM! Good expansion of text file extensions.

The additional extensions (SQL, R, Scala, Kotlin, Swift, Objective-C, Perl, Lua, Vim, CSV, logs, configs, lock files, etc.) provide comprehensive coverage for common text-based development files.

internal/git/operations.go (3)

94-118: Unstaged changes filtering logic is sound.

The approach of filtering the name-status output and then fetching diff content only for non-binary files is correct and aligns with the PR objectives to reduce API costs.

However, this depends on fixing the helper functions to correctly handle rename/copy status codes (see previous comment).


127-151: Staged changes filtering logic is consistent with unstaged.

Good parallel implementation for staged changes using the same filtering strategy.

However, this also depends on fixing the helper functions to correctly handle rename/copy status codes (see earlier comment).


160-200: LGTM! Untracked files filtering is well-implemented.

The filtering logic correctly:

  • Excludes binary files from the untracked list
  • Only reads content for files that are both text and small
  • Applies special scrubbing for .env files
  • Handles read errors gracefully by continuing

This implementation effectively reduces API costs by preventing binary file content from being sent to the LLM.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (2)
internal/utils/utils.go (1)

56-87: LGTM! Binary file detection is comprehensive and well-documented.

The function covers a broad set of binary formats across multiple categories, and the comment at lines 84-85 clearly documents the intentional behavior for unknown extensions (treated as non-binary by default). The SVG exclusion from the binary list (line 61 comment) correctly aligns with its XML text nature.
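
As a rough illustration of the behaviour described above, an extension-based check might look like the sketch below. The name isBinaryFile, the map binaryExtensions, and the short extension list are assumptions made for this example; the list in the PR is longer, and lower-casing the extension is also an assumption.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// binaryExtensions is a small illustrative subset (hypothetical);
// .svg is deliberately absent because SVG is XML text.
var binaryExtensions = map[string]bool{
	".png": true, ".jpg": true, ".gif": true, ".pdf": true,
	".zip": true, ".exe": true, ".so": true, ".woff": true,
}

// isBinaryFile reports whether filename has a known binary extension.
// Unknown extensions fall through to false, i.e. non-binary by default.
func isBinaryFile(filename string) bool {
	ext := strings.ToLower(filepath.Ext(filename))
	return binaryExtensions[ext]
}

func main() {
	for _, name := range []string{"logo.PNG", "icon.svg", "notes.xyz", "main.go"} {
		fmt.Printf("%-10s binary=%v\n", name, isBinaryFile(name))
	}
}
```

Defaulting unknown extensions to non-binary keeps new source file types visible to the LLM, at the cost of occasionally letting an unlisted binary format through.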

internal/git/operations.go (1)

67-123: LGTM! Excellent refactoring eliminates code duplication.

The introduction of processGitStatusOutput as the single source of truth, with filterBinaryFiles and extractNonBinaryFiles as lightweight wrappers, successfully addresses the code duplication concern from the previous review. The returnFilenames parameter cleanly controls whether filenames are extracted, making the code more maintainable.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 32f7d9d and e0752f3.

📒 Files selected for processing (2)
  • internal/git/operations.go (4 hunks)
  • internal/utils/utils.go (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
internal/git/operations.go (2)
internal/utils/utils.go (3)
  • IsBinaryFile (57-87)
  • IsTextFile (19-54)
  • IsSmallFile (90-99)
internal/scrubber/scrubber.go (1)
  • ScrubEnvFile (193-231)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Build Go Binary (windows-latest)
🔇 Additional comments (4)
internal/utils/utils.go (2)

24-27: LGTM! Comprehensive text file extension coverage.

The additions cover a wide range of programming languages (.sql, .r, .scala, .kt, .swift, .m, .pl, .lua, .vim), configuration formats (.csv, .log, .cfg, .conf, .ini, .toml, .lock), and build files (.gitignore, .dockerfile, .makefile, .cmake, .pro, .pri, .svg). SVG is correctly classified as text (XML-based) rather than binary.


37-51: LGTM! Extensionless text file detection is well-implemented.

The logic correctly identifies common extensionless text files (README, Dockerfile, Makefile, etc.) by checking the base filename when no extension is present. This prevents false classification of important configuration files.
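
A minimal sketch of that fallback, assuming hypothetical lookup tables textExtensions and textBasenames and case-insensitive matching (the PR's actual lists and casing rules may differ):

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// Illustrative subsets only; both tables are hypothetical.
var textExtensions = map[string]bool{".go": true, ".md": true, ".toml": true, ".svg": true}
var textBasenames = map[string]bool{"readme": true, "dockerfile": true, "makefile": true, "license": true}

// isTextFile checks the extension first and falls back to the base
// filename only when the file has no extension at all.
func isTextFile(filename string) bool {
	ext := strings.ToLower(filepath.Ext(filename))
	if ext != "" {
		return textExtensions[ext]
	}
	base := strings.ToLower(filepath.Base(filename))
	return textBasenames[base]
}

func main() {
	for _, name := range []string{"docker/Dockerfile", "README", "bin/tool", "main.go"} {
		fmt.Printf("%-18s text=%v\n", name, isTextFile(name))
	}
}
```

Note that filepath.Ext treats a dotfile such as .gitignore as consisting entirely of its extension, which is why entries like .gitignore can live in the extension list rather than the basename list.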

internal/git/operations.go (2)

136-193: LGTM! Binary file filtering is consistently applied.

The unstaged and staged sections now correctly filter binary files before including them in the diff output and limit git diff operations to non-binary files only. This reduces API costs by avoiding transmission of binary content to the LLM. The pattern is consistent and the logic is sound, assuming the underlying parseGitNameStatus function is fixed per the earlier comment.


203-243: LGTM! Untracked file filtering and content extraction is well-implemented.

Binary files are correctly excluded from the untracked file list (lines 211-214), and content extraction is limited to text files that are small and non-binary (line 224). The special handling for .env files with ScrubEnvFile (lines 233-235) is appropriate for security. This completes the binary filtering across all git change categories.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e0752f3 and 654dad4.

📒 Files selected for processing (1)
  • internal/git/operations.go (4 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
internal/git/operations.go (2)
internal/utils/utils.go (3)
  • IsBinaryFile (57-87)
  • IsTextFile (19-54)
  • IsSmallFile (90-99)
internal/scrubber/scrubber.go (1)
  • ScrubEnvFile (193-231)

Comment on lines +39 to +65
	parts := strings.Split(line, "\t")
	if len(parts) < 2 {
		return parseGitStatusLine{}
	}

	status := parts[0]

	// Handle rename/copy status codes (e.g., "R100", "C75")
	if len(status) > 1 && (status[0] == 'R' || status[0] == 'C') {
		// For rename/copy, we expect: "R100\toldname\tnewname" or "C75\toldname\tnewname"
		if len(parts) >= 3 {
			// For renames/copies, both old and new filenames need to be checked
			oldFile := parts[1]
			newFile := parts[2]
			return parseGitStatusLine{
				status:    status,
				filenames: []string{oldFile, newFile},
			}
		}
	}

	// Handle regular status codes (M, A, D, etc.)
	filename := parts[1]
	return parseGitStatusLine{
		status:    status,
		filenames: []string{filename},
	}
Contributor

⚠️ Potential issue | 🔴 Critical

Unquote name-status paths before classification/diffing

git diff --name-status emits C-quoted paths for any path containing "unusual" bytes. With core.quotepath at its default (true) that includes every non-ASCII byte; double quotes, backslashes, and control characters are quoted regardless of the setting. For example, a binary file café.png shows up as M\t"caf\303\251.png". (A space on its own does not trigger quoting in this output.) We forward that quoted string straight into utils.IsBinaryFile and the later git diff -- … call. Two bad things happen:

  • filepath.Ext on the quoted string yields .png" (with the trailing quote), so we fail to recognise the file as binary and end up sending the very data we meant to block.
  • exec.Command passes the quotes and escape sequences verbatim, so git diff never matches the real path and the diff content silently disappears.

Please strip the C-style quoting (and unescape sequences) before returning filenames. A tiny helper around strconv.Unquote for both the rename/copy and regular branches fixes it:

-import (
-	"fmt"
-	"os"
-	"os/exec"
-	"path/filepath"
-	"strings"
+import (
+	"fmt"
+	"os"
+	"os/exec"
+	"path/filepath"
+	"strconv"
+	"strings"
 )
@@
 func parseGitNameStatus(line string) parseGitStatusLine {
@@
-			oldFile := parts[1]
-			newFile := parts[2]
+			oldFile := unquoteGitPath(parts[1])
+			newFile := unquoteGitPath(parts[2])
@@
-	filename := parts[1]
+	filename := unquoteGitPath(parts[1])
@@
 }
+
+func unquoteGitPath(path string) string {
+	if len(path) >= 2 && path[0] == '"' && path[len(path)-1] == '"' {
+		if unquoted, err := strconv.Unquote(path); err == nil {
+			return unquoted
+		}
+	}
+	return path
+}

Without this, any path that git quotes (non-ASCII bytes, embedded double quotes, backslashes, control characters) will either bypass the binary filter or vanish from the diff, defeating the main goal of the PR.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
import (
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
	"strconv"
	"strings"
)

func parseGitNameStatus(line string) parseGitStatusLine {
	parts := strings.Split(line, "\t")
	if len(parts) < 2 {
		return parseGitStatusLine{}
	}
	status := parts[0]
	// Handle rename/copy status codes (e.g., "R100", "C75")
	if len(status) > 1 && (status[0] == 'R' || status[0] == 'C') {
		// For rename/copy, we expect: "R100\toldname\tnewname" or "C75\toldname\tnewname"
		if len(parts) >= 3 {
			// For renames/copies, both old and new filenames need to be checked
			oldFile := unquoteGitPath(parts[1])
			newFile := unquoteGitPath(parts[2])
			return parseGitStatusLine{
				status:    status,
				filenames: []string{oldFile, newFile},
			}
		}
	}
	// Handle regular status codes (M, A, D, etc.)
	filename := unquoteGitPath(parts[1])
	return parseGitStatusLine{
		status:    status,
		filenames: []string{filename},
	}
}

func unquoteGitPath(path string) string {
	if len(path) >= 2 && path[0] == '"' && path[len(path)-1] == '"' {
		if unquoted, err := strconv.Unquote(path); err == nil {
			return unquoted
		}
	}
	return path
}

Owner

@DFanso DFanso left a comment

LGTM 🎉

@DFanso DFanso added enhancement New feature or request hacktoberfest Eligible for Hacktoberfest hacktoberfest-accepted Approved Hacktoberfest contribution go Pull requests that update go code labels Oct 9, 2025
@DFanso DFanso merged commit 9739e70 into DFanso:main Oct 9, 2025
8 checks passed

Labels

enhancement New feature or request go Pull requests that update go code hacktoberfest Eligible for Hacktoberfest hacktoberfest-accepted Approved Hacktoberfest contribution

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] Skip tracking changes on binary files, images to sending unwanted data to LLM

2 participants