-
Notifications
You must be signed in to change notification settings - Fork 18
Reducing API costs by filtering out binary files and other unwanted files #98
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
WalkthroughAdds binary-file detection and filtering: introduces Changes
Sequence Diagram(s)sequenceDiagram
autonumber
participant Caller as Caller
participant GetChanges as GetChanges
participant GitCLI as git CLI
participant Utils as utils (IsBinaryFile/IsTextFile)
Caller->>GetChanges: GetChanges()
GetChanges->>GitCLI: git diff --name-status (unstaged)
GetChanges->>GetChanges: parse name-status, filterBinaryFiles
GetChanges->>Utils: classify filenames (IsBinaryFile/IsTextFile)
Utils-->>GetChanges: non-binary filenames
GetChanges->>GitCLI: git diff -- <non-binary files>
GetChanges-->>Caller: Unstaged summary + diffs (text only)
GetChanges->>GitCLI: git diff --cached --name-status (staged)
GetChanges->>GetChanges: parse name-status, filterBinaryFiles
GetChanges->>Utils: classify filenames
Utils-->>GetChanges: non-binary filenames
GetChanges->>GitCLI: git diff --cached -- <non-binary files>
GetChanges-->>Caller: Staged summary + diffs (text only)
GetChanges->>GitCLI: git ls-files --others --exclude-standard (untracked)
loop per file
GetChanges->>Utils: IsBinaryFile / IsTextFile
Utils-->>GetChanges: classification
end
GetChanges->>GitCLI: read contents of eligible text untracked files
GetChanges-->>Caller: Untracked list + contents (text only)
note over GetChanges,Utils: Binary files excluded from lists and diffs
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Suggested labels
Suggested reviewers
Poem
Pre-merge checks and finishing touches✅ Passed checks (5 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
go.mod(1 hunks)internal/git/operations.go(4 hunks)internal/utils/utils.go(2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
internal/git/operations.go (2)
internal/utils/utils.go (3)
IsBinaryFile(41-69)IsTextFile(19-38)IsSmallFile(72-81)internal/scrubber/scrubber.go (1)
ScrubEnvFile(193-231)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Build Go Binary (windows-latest)
🔇 Additional comments (4)
internal/utils/utils.go (1)
24-27: LGTM! Good expansion of text file extensions.The additional extensions (SQL, R, Scala, Kotlin, Swift, Objective-C, Perl, Lua, Vim, CSV, logs, configs, lock files, etc.) provide comprehensive coverage for common text-based development files.
internal/git/operations.go (3)
94-118: Unstaged changes filtering logic is sound.The approach of filtering the name-status output and then fetching diff content only for non-binary files is correct and aligns with the PR objectives to reduce API costs.
However, this depends on fixing the helper functions to correctly handle rename/copy status codes (see previous comment).
127-151: Staged changes filtering logic is consistent with unstaged.Good parallel implementation for staged changes using the same filtering strategy.
However, this also depends on fixing the helper functions to correctly handle rename/copy status codes (see earlier comment).
160-200: LGTM! Untracked files filtering is well-implemented.The filtering logic correctly:
- Excludes binary files from the untracked list
- Only reads content for files that are both text and small
- Applies special scrubbing for
.envfiles- Handles read errors gracefully by continuing
This implementation effectively reduces API costs by preventing binary file content from being sent to the LLM.
…ike Makefile and Dockerfile
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
♻️ Duplicate comments (2)
internal/utils/utils.go (1)
56-87: LGTM! Binary file detection is comprehensive and well-documented.The function covers a broad set of binary formats across multiple categories, and the comment at lines 84-85 clearly documents the intentional behavior for unknown extensions (treated as non-binary by default). The SVG exclusion from the binary list (line 61 comment) correctly aligns with its XML text nature.
internal/git/operations.go (1)
67-123: LGTM! Excellent refactoring eliminates code duplication.The introduction of
processGitStatusOutputas the single source of truth, withfilterBinaryFilesandextractNonBinaryFilesas lightweight wrappers, successfully addresses the code duplication concern from the previous review. ThereturnFilenamesparameter cleanly controls whether filenames are extracted, making the code more maintainable.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
internal/git/operations.go(4 hunks)internal/utils/utils.go(2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
internal/git/operations.go (2)
internal/utils/utils.go (3)
IsBinaryFile(57-87)IsTextFile(19-54)IsSmallFile(90-99)internal/scrubber/scrubber.go (1)
ScrubEnvFile(193-231)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Build Go Binary (windows-latest)
🔇 Additional comments (4)
internal/utils/utils.go (2)
24-27: LGTM! Comprehensive text file extension coverage.The additions cover a wide range of programming languages (.sql, .r, .scala, .kt, .swift, .m, .pl, .lua, .vim), configuration formats (.csv, .log, .cfg, .conf, .ini, .toml, .lock), and build files (.gitignore, .dockerfile, .makefile, .cmake, .pro, .pri, .svg). SVG is correctly classified as text (XML-based) rather than binary.
37-51: LGTM! Extensionless text file detection is well-implemented.The logic correctly identifies common extensionless text files (README, Dockerfile, Makefile, etc.) by checking the base filename when no extension is present. This prevents false classification of important configuration files.
internal/git/operations.go (2)
136-193: LGTM! Binary file filtering is consistently applied.The unstaged and staged sections now correctly filter binary files before including them in the diff output and limit
git diffoperations to non-binary files only. This reduces API costs by avoiding transmission of binary content to the LLM. The pattern is consistent and the logic is sound, assuming the underlyingparseGitNameStatusfunction is fixed per the earlier comment.
203-243: LGTM! Untracked file filtering and content extraction is well-implemented.Binary files are correctly excluded from the untracked file list (lines 211-214), and content extraction is limited to text files that are small and non-binary (line 224). The special handling for
.envfiles withScrubEnvFile(lines 233-235) is appropriate for security. This completes the binary filtering across all git change categories.
…instead of spaces
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
internal/git/operations.go(4 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
internal/git/operations.go (2)
internal/utils/utils.go (3)
IsBinaryFile(57-87)IsTextFile(19-54)IsSmallFile(90-99)internal/scrubber/scrubber.go (1)
ScrubEnvFile(193-231)
| parts := strings.Split(line, "\t") | ||
| if len(parts) < 2 { | ||
| return parseGitStatusLine{} | ||
| } | ||
|
|
||
| status := parts[0] | ||
|
|
||
| // Handle rename/copy status codes (e.g., "R100", "C75") | ||
| if len(status) > 1 && (status[0] == 'R' || status[0] == 'C') { | ||
| // For rename/copy, we expect: "R100\toldname\tnewname" or "C75\toldname\tnewname" | ||
| if len(parts) >= 3 { | ||
| // For renames/copies, both old and new filenames need to be checked | ||
| oldFile := parts[1] | ||
| newFile := parts[2] | ||
| return parseGitStatusLine{ | ||
| status: status, | ||
| filenames: []string{oldFile, newFile}, | ||
| } | ||
| } | ||
| } | ||
|
|
||
| // Handle regular status codes (M, A, D, etc.) | ||
| filename := parts[1] | ||
| return parseGitStatusLine{ | ||
| status: status, | ||
| filenames: []string{filename}, | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unquote name-status paths before classification/diffing
git diff --name-status emits C-quoted paths whenever core.quotepath is left at its (true) default—for example a binary file image file.png shows up as M\t"image file.png". We forward that quoted string straight into utils.IsBinaryFile and the later git diff -- … call. Two bad things happen:
.Ext("\"image file.png\"")yields.png", so we fail to recognise it as binary and end up sending the very data we meant to block.exec.Commandpasses the quotes verbatim, sogit diffnever matches the real path and the diff content silently disappears.
Please strip the C-style quoting (and unescape sequences) before returning filenames. A tiny helper around strconv.Unquote for both the rename/copy and regular branches fixes it:
-import (
- "fmt"
- "os"
- "os/exec"
- "path/filepath"
- "strings"
+import (
+ "fmt"
+ "os"
+ "os/exec"
+ "path/filepath"
+ "strconv"
+ "strings"
)
@@
func parseGitNameStatus(line string) parseGitStatusLine {
@@
- oldFile := parts[1]
- newFile := parts[2]
+ oldFile := unquoteGitPath(parts[1])
+ newFile := unquoteGitPath(parts[2])
@@
- filename := parts[1]
+ filename := unquoteGitPath(parts[1])
@@
}
+
+func unquoteGitPath(path string) string {
+ if len(path) >= 2 && path[0] == '"' && path[len(path)-1] == '"' {
+ if unquoted, err := strconv.Unquote(path); err == nil {
+ return unquoted
+ }
+ }
+ return path
+}Without this, any path that needs quoting (spaces, non-ASCII, etc.) will either bypass the binary filter or vanish from the diff, defeating the main goal of the PR. -->
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| parts := strings.Split(line, "\t") | |
| if len(parts) < 2 { | |
| return parseGitStatusLine{} | |
| } | |
| status := parts[0] | |
| // Handle rename/copy status codes (e.g., "R100", "C75") | |
| if len(status) > 1 && (status[0] == 'R' || status[0] == 'C') { | |
| // For rename/copy, we expect: "R100\toldname\tnewname" or "C75\toldname\tnewname" | |
| if len(parts) >= 3 { | |
| // For renames/copies, both old and new filenames need to be checked | |
| oldFile := parts[1] | |
| newFile := parts[2] | |
| return parseGitStatusLine{ | |
| status: status, | |
| filenames: []string{oldFile, newFile}, | |
| } | |
| } | |
| } | |
| // Handle regular status codes (M, A, D, etc.) | |
| filename := parts[1] | |
| return parseGitStatusLine{ | |
| status: status, | |
| filenames: []string{filename}, | |
| } | |
| import ( | |
| "fmt" | |
| "os" | |
| "os/exec" | |
| "path/filepath" | |
| "strconv" | |
| "strings" | |
| ) | |
| func parseGitNameStatus(line string) parseGitStatusLine { | |
| parts := strings.Split(line, "\t") | |
| if len(parts) < 2 { | |
| return parseGitStatusLine{} | |
| } | |
| status := parts[0] | |
| // Handle rename/copy status codes (e.g., "R100", "C75") | |
| if len(status) > 1 && (status[0] == 'R' || status[0] == 'C') { | |
| // For rename/copy, we expect: "R100\toldname\tnewname" or "C75\toldname\tnewname" | |
| if len(parts) >= 3 { | |
| // For renames/copies, both old and new filenames need to be checked | |
| oldFile := unquoteGitPath(parts[1]) | |
| newFile := unquoteGitPath(parts[2]) | |
| return parseGitStatusLine{ | |
| status: status, | |
| filenames: []string{oldFile, newFile}, | |
| } | |
| } | |
| } | |
| // Handle regular status codes (M, A, D, etc.) | |
| filename := unquoteGitPath(parts[1]) | |
| return parseGitStatusLine{ | |
| status: status, | |
| filenames: []string{filename}, | |
| } | |
| } | |
| func unquoteGitPath(path string) string { | |
| if len(path) >= 2 && path[0] == '"' && path[len(path)-1] == '"' { | |
| if unquoted, err := strconv.Unquote(path); err == nil { | |
| return unquoted | |
| } | |
| } | |
| return path | |
| } |
DFanso
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM 🎉
Description
Type of Change
Related Issue
Fixes #81
Changes Made
Testing
Checklist
Screenshots (if applicable)
Additional Notes
For Hacktoberfest Participants
Thank you for your contribution! 🎉
Summary by CodeRabbit
New Features
Chores