Reducing API costs by filtering out binary files and other unwanted files #98

vinyas-bharadwaj · 2025-10-08T18:48:37Z

Description

Type of Change

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Documentation update
Code refactoring
Performance improvement
Other (please describe):

Related Issue

Fixes #81

Changes Made

Filters binary files from diffs for better LLM context
Adds --dry-run flag to preview LLM prompt without API call

Testing

Checklist

My code follows the project's code style
I have performed a self-review of my code
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings or errors
I have tested this in a real Git repository
I have read the CONTRIBUTING.md guidelines

Screenshots (if applicable)

Additional Notes

For Hacktoberfest Participants

This PR is submitted as part of Hacktoberfest 2025

Thank you for your contribution! 🎉

Summary by CodeRabbit

New Features
- Binary files are excluded from unstaged, staged, and untracked change lists and diffs to reduce noise.
- Untracked text files still show small text contents when applicable, with sensitive environment data scrubbed.
- Expanded and improved text/binary detection (many added extensions plus common extensionless filenames) for more accurate diffs.
Chores
- Dependency ordering adjusted with no functional impact.

coderabbitai · 2025-10-08T18:49:02Z

Walkthrough

Adds binary-file detection and filtering: introduces IsBinaryFile, expands IsTextFile extensions and extensionless filename handling, and updates git change collection to exclude binary files from unstaged, staged, and untracked listings and diffs. Also reorders one dependency line in go.mod. No other public API changes.

Changes

Cohort / File(s)	Summary
Dependency ordering `go.mod`	Reordered `github.com/google/shlex` within the `require` block; no version change or functional impact.
Utils: file type detection `internal/utils/utils.go`	Added exported `IsBinaryFile(filename string) bool`. Expanded `IsTextFile` to recognise many additional text extensions and common extensionless text filenames; logic otherwise unchanged.
Git operations: binary filtering `internal/git/operations.go`	Added parsing helpers for `git --name-status`, functions to filter out binary entries and extract non-binary filenames. Updated `GetChanges` to exclude binary files from unstaged, staged, and untracked listings and to request diffs only for non-binary files; untracked text files still read and scrubbed as before.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Caller as Caller
  participant GetChanges as GetChanges
  participant GitCLI as git CLI
  participant Utils as utils (IsBinaryFile/IsTextFile)

  Caller->>GetChanges: GetChanges()
  GetChanges->>GitCLI: git diff --name-status (unstaged)
  GetChanges->>GetChanges: parse name-status, filterBinaryFiles
  GetChanges->>Utils: classify filenames (IsBinaryFile/IsTextFile)
  Utils-->>GetChanges: non-binary filenames
  GetChanges->>GitCLI: git diff -- <non-binary files>
  GetChanges-->>Caller: Unstaged summary + diffs (text only)

  GetChanges->>GitCLI: git diff --cached --name-status (staged)
  GetChanges->>GetChanges: parse name-status, filterBinaryFiles
  GetChanges->>Utils: classify filenames
  Utils-->>GetChanges: non-binary filenames
  GetChanges->>GitCLI: git diff --cached -- <non-binary files>
  GetChanges-->>Caller: Staged summary + diffs (text only)

  GetChanges->>GitCLI: git ls-files --others --exclude-standard (untracked)
  loop per file
    GetChanges->>Utils: IsBinaryFile / IsTextFile
    Utils-->>GetChanges: classification
  end
  GetChanges->>GitCLI: read contents of eligible text untracked files
  GetChanges-->>Caller: Untracked list + contents (text only)

  note over GetChanges,Utils: Binary files excluded from lists and diffs
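
One detail the sequence above implies: `git diff --` with an empty path list behaves like a plain `git diff` and covers every changed file, so a caller that filters paths would likely want to skip the call when no text files survive the filter. The sketch below illustrates that guard in Go. `buildDiffArgs` is a hypothetical helper name, not a function from this PR, and whether the PR handles the empty case this way is not confirmed by the review.

```go
package main

import "fmt"

// buildDiffArgs returns the arguments for a `git diff` call limited to the
// given paths. Hypothetical helper, for illustration only.
// ok is false when paths is empty, because `git diff --` with no paths
// would diff every changed file, binaries included.
func buildDiffArgs(cached bool, paths []string) (args []string, ok bool) {
	if len(paths) == 0 {
		return nil, false
	}
	args = []string{"diff"}
	if cached {
		args = append(args, "--cached")
	}
	// "--" separates options from paths, so a file named like a flag is safe.
	args = append(args, "--")
	args = append(args, paths...)
	return args, true
}

func main() {
	if args, ok := buildDiffArgs(true, []string{"main.go", "docs/README.md"}); ok {
		// These could be passed to exec.Command("git", args...).
		fmt.Println(args)
	}
	if _, ok := buildDiffArgs(false, nil); !ok {
		fmt.Println("no text files changed; skipping git diff")
	}
}
```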

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested labels

enhancement, go

Suggested reviewers

DFanso

Poem

I nibble through the repo's maze,
Skipping blobs and heavy glaze.
Text I pick, small and bright,
Big bytes rest beyond my sight.
Hop, patch, and fluff—the diff's just right. 🥕

Pre-merge checks and finishing touches

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check	✅ Passed	The title clearly summarizes the main change by stating the goal of reducing API costs through filtering binary and other unwanted files, matching the pull request’s primary focus without extraneous detail or jargon.
Linked Issues Check	✅ Passed	The changes implement the core requirement from issue #81 by detecting and excluding binary and non-text files in diffs before API calls, using newly added parsing and filtering logic in operations.go and the IsBinaryFile utility in utils.go.
Out of Scope Changes Check	✅ Passed	All modifications directly support filtering out unwanted files, updating dependency ordering, and extending file type detection without introducing unrelated functionality or code unrelated to the objectives of binary filtering.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.


coderabbitai

Actionable comments posted: 2

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2c50506 and 32f7d9d.

📒 Files selected for processing (3)

go.mod (1 hunks)
internal/git/operations.go (4 hunks)
internal/utils/utils.go (2 hunks)

🧰 Additional context used

🧬 Code graph analysis (1)

internal/git/operations.go (2)

internal/utils/utils.go (3)

IsBinaryFile (41-69)

IsTextFile (19-38)

IsSmallFile (72-81)

internal/scrubber/scrubber.go (1)

ScrubEnvFile (193-231)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: Build Go Binary (windows-latest)

🔇 Additional comments (4)

internal/utils/utils.go (1)

24-27: LGTM! Good expansion of text file extensions.

The additional extensions (SQL, R, Scala, Kotlin, Swift, Objective-C, Perl, Lua, Vim, CSV, logs, configs, lock files, etc.) provide comprehensive coverage for common text-based development files.

internal/git/operations.go (3)

94-118: Unstaged changes filtering logic is sound.

The approach of filtering the name-status output and then fetching diff content only for non-binary files is correct and aligns with the PR objectives to reduce API costs.

However, this depends on fixing the helper functions to correctly handle rename/copy status codes (see previous comment).

127-151: Staged changes filtering logic is consistent with unstaged.

Good parallel implementation for staged changes using the same filtering strategy.

However, this also depends on fixing the helper functions to correctly handle rename/copy status codes (see earlier comment).

160-200: LGTM! Untracked files filtering is well-implemented.

The filtering logic correctly:

Excludes binary files from the untracked list

Only reads content for files that are both text and small

Applies special scrubbing for .env files

Handles read errors gracefully by continuing

This implementation effectively reduces API costs by preventing binary file content from being sent to the LLM.
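
The per-file decision listed above can be sketched as a small pure function. This is an illustration under assumptions, not the PR's code: `decideUntracked` and `untrackedDecision` are hypothetical names, and the three booleans stand in for the results of `IsBinaryFile`, `IsTextFile`, and `IsSmallFile`, whose exact signatures are not shown in this review.

```go
package main

import "fmt"

// untrackedDecision describes what to do with one untracked file.
type untrackedDecision struct {
	List        bool // include the filename in the untracked list
	ReadContent bool // also read the file and include its content
}

// decideUntracked applies the rules from the review comment:
// binary files are dropped entirely, and content is read only for
// files that are both text and small. Hypothetical helper.
func decideUntracked(isBinary, isText, isSmall bool) untrackedDecision {
	if isBinary {
		return untrackedDecision{} // neither listed nor read
	}
	return untrackedDecision{List: true, ReadContent: isText && isSmall}
}

func main() {
	fmt.Printf("binary image:   %+v\n", decideUntracked(true, false, true))
	fmt.Printf("small text:     %+v\n", decideUntracked(false, true, true))
	fmt.Printf("large text:     %+v\n", decideUntracked(false, true, false))
	fmt.Printf("unknown, small: %+v\n", decideUntracked(false, false, true))
}
```

Keeping the decision separate from file I/O makes the rules easy to unit test without touching a real repository.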


coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (2)

internal/utils/utils.go (1)

56-87: LGTM! Binary file detection is comprehensive and well-documented.

The function covers a broad set of binary formats across multiple categories, and the comment at lines 84-85 clearly documents the intentional behavior for unknown extensions (treated as non-binary by default). The SVG exclusion from the binary list (line 61 comment) correctly aligns with its XML text nature.
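
For readers without the diff at hand, extension-based detection of this kind might look like the sketch below. It is a minimal illustration, not the PR's implementation: the extension list is a small made-up subset, `isBinaryFile` is a lowercase stand-in for the exported function, and lowercasing the extension before lookup is an assumption.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// binaryExts is a small illustrative subset; the real list is longer.
// ".svg" is deliberately absent because SVG is XML text.
var binaryExts = map[string]bool{
	".png": true, ".jpg": true, ".gif": true, ".pdf": true,
	".zip": true, ".exe": true, ".so": true, ".woff2": true,
}

// isBinaryFile reports whether filename has a known binary extension.
// Unknown or missing extensions are treated as non-binary by default.
func isBinaryFile(filename string) bool {
	ext := strings.ToLower(filepath.Ext(filename))
	return binaryExts[ext]
}

func main() {
	for _, name := range []string{"logo.PNG", "icon.svg", "data.unknown", "Makefile"} {
		fmt.Printf("%-14s binary=%v\n", name, isBinaryFile(name))
	}
}
```

Defaulting unknown extensions to non-binary errs on the side of including content, which is why the review pairs this check with the separate text-file and size checks.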

internal/git/operations.go (1)

67-123: LGTM! Excellent refactoring eliminates code duplication.

The introduction of processGitStatusOutput as the single source of truth, with filterBinaryFiles and extractNonBinaryFiles as lightweight wrappers, successfully addresses the code duplication concern from the previous review. The returnFilenames parameter cleanly controls whether filenames are extracted, making the code more maintainable.
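
The single-pass design described above can be sketched as follows. This is a minimal illustration rather than the PR's code: `processStatus`, `filterBinary`, `nonBinaryFiles`, and the `.png`-only `isBinary` stub are hypothetical names, and returning both the old and new path for rename/copy entries is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// isBinary is a stub standing in for utils.IsBinaryFile.
func isBinary(name string) bool {
	return strings.HasSuffix(strings.ToLower(name), ".png")
}

// processStatus walks `git diff --name-status` output once. It drops every
// line that mentions a binary path and returns the kept lines, plus the
// kept filenames when returnFilenames is true.
func processStatus(output string, returnFilenames bool) (lines, files []string) {
	for _, line := range strings.Split(output, "\n") {
		parts := strings.Split(line, "\t")
		if len(parts) < 2 {
			continue // blank or malformed line
		}
		names := parts[1:] // one path, or old and new path for R/C entries
		binary := false
		for _, n := range names {
			if isBinary(n) {
				binary = true
				break
			}
		}
		if binary {
			continue
		}
		lines = append(lines, line)
		if returnFilenames {
			files = append(files, names...)
		}
	}
	return lines, files
}

// filterBinary returns the name-status text without binary entries.
func filterBinary(output string) string {
	lines, _ := processStatus(output, false)
	return strings.Join(lines, "\n")
}

// nonBinaryFiles returns only the filenames of non-binary entries.
func nonBinaryFiles(output string) []string {
	_, files := processStatus(output, true)
	return files
}

func main() {
	out := "M\tmain.go\nA\tlogo.png\nR100\told.md\tnew.md\n"
	fmt.Println(filterBinary(out))
	fmt.Println(nonBinaryFiles(out))
}
```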

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 32f7d9d and e0752f3.

📒 Files selected for processing (2)

internal/git/operations.go (4 hunks)
internal/utils/utils.go (2 hunks)

🧰 Additional context used

🧬 Code graph analysis (1)

internal/git/operations.go (2)

internal/utils/utils.go (3)

IsBinaryFile (57-87)

IsTextFile (19-54)

IsSmallFile (90-99)

internal/scrubber/scrubber.go (1)

ScrubEnvFile (193-231)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: Build Go Binary (windows-latest)

🔇 Additional comments (4)

internal/utils/utils.go (2)

24-27: LGTM! Comprehensive text file extension coverage.

The additions cover a wide range of programming languages (.sql, .r, .scala, .kt, .swift, .m, .pl, .lua, .vim), configuration formats (.csv, .log, .cfg, .conf, .ini, .toml, .lock), and build files (.gitignore, .dockerfile, .makefile, .cmake, .pro, .pri, .svg). SVG is correctly classified as text (XML-based) rather than binary.

37-51: LGTM! Extensionless text file detection is well-implemented.

The logic correctly identifies common extensionless text files (README, Dockerfile, Makefile, etc.) by checking the base filename when no extension is present. This prevents false classification of important configuration files.
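
A rough sketch of this two-step lookup is shown below. It is not the PR's code: `isTextFile`, `textExts`, and `textNames` are illustrative names with deliberately tiny tables, and case-insensitive matching is an assumption. Note that Go's `filepath.Ext` returns the whole name for a dotfile such as `.gitignore`, which is why such names can live in the extension table.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// textExts is a tiny illustrative subset of text extensions.
// Dotfiles appear here because filepath.Ext(".gitignore") == ".gitignore".
var textExts = map[string]bool{
	".go": true, ".md": true, ".toml": true, ".svg": true, ".gitignore": true,
}

// textNames lists common extensionless text files, lower-cased.
var textNames = map[string]bool{
	"readme": true, "dockerfile": true, "makefile": true, "license": true,
}

// isTextFile checks the extension first and falls back to the base
// filename only when there is no extension at all.
func isTextFile(filename string) bool {
	if ext := strings.ToLower(filepath.Ext(filename)); ext != "" {
		return textExts[ext]
	}
	return textNames[strings.ToLower(filepath.Base(filename))]
}

func main() {
	for _, name := range []string{"Dockerfile", "build/Makefile", ".gitignore", "photo.png", "mystery"} {
		fmt.Printf("%-16s text=%v\n", name, isTextFile(name))
	}
}
```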

internal/git/operations.go (2)

136-193: LGTM! Binary file filtering is consistently applied.

The unstaged and staged sections now correctly filter binary files before including them in the diff output and limit git diff operations to non-binary files only. This reduces API costs by avoiding transmission of binary content to the LLM. The pattern is consistent and the logic is sound, assuming the underlying parseGitNameStatus function is fixed per the earlier comment.

203-243: LGTM! Untracked file filtering and content extraction is well-implemented.

Binary files are correctly excluded from the untracked file list (lines 211-214), and content extraction is limited to text files that are small and non-binary (line 224). The special handling for .env files with ScrubEnvFile (lines 233-235) is appropriate for security. This completes the binary filtering across all git change categories.


coderabbitai

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e0752f3 and 654dad4.

📒 Files selected for processing (1)

internal/git/operations.go (4 hunks)

🧰 Additional context used

🧬 Code graph analysis (1)

internal/git/operations.go (2)

internal/utils/utils.go (3)

IsBinaryFile (57-87)

IsTextFile (19-54)

IsSmallFile (90-99)

internal/scrubber/scrubber.go (1)

ScrubEnvFile (193-231)

coderabbitai · 2025-10-08T19:27:39Z

internal/git/operations.go

+	parts := strings.Split(line, "\t")
+	if len(parts) < 2 {
+		return parseGitStatusLine{}
+	}
+
+	status := parts[0]
+
+	// Handle rename/copy status codes (e.g., "R100", "C75")
+	if len(status) > 1 && (status[0] == 'R' || status[0] == 'C') {
+		// For rename/copy, we expect: "R100\toldname\tnewname" or "C75\toldname\tnewname"
+		if len(parts) >= 3 {
+			// For renames/copies, both old and new filenames need to be checked
+			oldFile := parts[1]
+			newFile := parts[2]
+			return parseGitStatusLine{
+				status:    status,
+				filenames: []string{oldFile, newFile},
+			}
+		}
+	}
+
+	// Handle regular status codes (M, A, D, etc.)
+	filename := parts[1]
+	return parseGitStatusLine{
+		status:    status,
+		filenames: []string{filename},
+	}


⚠️ Potential issue | 🔴 Critical

Unquote name-status paths before classification/diffing

git diff --name-status emits C-quoted paths when a path contains bytes git treats as unusual: control characters, double quotes, backslashes, and, while core.quotepath is left at its (true) default, non-ASCII bytes. Spaces alone do not trigger quoting. For example, a binary file named imagé.png shows up as M\t"imag\303\251.png". We forward that quoted string straight into utils.IsBinaryFile and the later git diff -- … call. Two bad things happen:

filepath.Ext on the quoted name yields .png" (with the trailing quote), so we fail to recognise it as binary and end up sending the very data we meant to block.

exec.Command passes the quotes and escape sequences verbatim, so git diff never matches the real path and the diff content silently disappears.

Please strip the C-style quoting (and unescape sequences) before returning filenames. A tiny helper around strconv.Unquote for both the rename/copy and regular branches fixes it:

-import (
-	"fmt"
-	"os"
-	"os/exec"
-	"path/filepath"
-	"strings"
+import (
+	"fmt"
+	"os"
+	"os/exec"
+	"path/filepath"
+	"strconv"
+	"strings"
 )
@@ func parseGitNameStatus(line string) parseGitStatusLine {
@@
-			oldFile := parts[1]
-			newFile := parts[2]
+			oldFile := unquoteGitPath(parts[1])
+			newFile := unquoteGitPath(parts[2])
@@
-	filename := parts[1]
+	filename := unquoteGitPath(parts[1])
@@
 }
+
+func unquoteGitPath(path string) string {
+	if len(path) >= 2 && path[0] == '"' && path[len(path)-1] == '"' {
+		if unquoted, err := strconv.Unquote(path); err == nil {
+			return unquoted
+		}
+	}
+	return path
+}

Without this, any path that needs quoting (non-ASCII bytes, control characters, embedded quotes, etc.) will either bypass the binary filter or vanish from the diff, defeating the main goal of the PR.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

-	parts := strings.Split(line, "\t")
-	if len(parts) < 2 {
-		return parseGitStatusLine{}
-	}
-
-	status := parts[0]
-
-	// Handle rename/copy status codes (e.g., "R100", "C75")
-	if len(status) > 1 && (status[0] == 'R' || status[0] == 'C') {
-		// For rename/copy, we expect: "R100\toldname\tnewname" or "C75\toldname\tnewname"
-		if len(parts) >= 3 {
-			// For renames/copies, both old and new filenames need to be checked
-			oldFile := parts[1]
-			newFile := parts[2]
-			return parseGitStatusLine{
-				status:    status,
-				filenames: []string{oldFile, newFile},
-			}
-		}
-	}
-
-	// Handle regular status codes (M, A, D, etc.)
-	filename := parts[1]
-	return parseGitStatusLine{
-		status:    status,
-		filenames: []string{filename},
-	}
+import (
+	"fmt"
+	"os"
+	"os/exec"
+	"path/filepath"
+	"strconv"
+	"strings"
+)
+
+func parseGitNameStatus(line string) parseGitStatusLine {
+	parts := strings.Split(line, "\t")
+	if len(parts) < 2 {
+		return parseGitStatusLine{}
+	}
+
+	status := parts[0]
+
+	// Handle rename/copy status codes (e.g., "R100", "C75")
+	if len(status) > 1 && (status[0] == 'R' || status[0] == 'C') {
+		// For rename/copy, we expect: "R100\toldname\tnewname" or "C75\toldname\tnewname"
+		if len(parts) >= 3 {
+			// For renames/copies, both old and new filenames need to be checked
+			oldFile := unquoteGitPath(parts[1])
+			newFile := unquoteGitPath(parts[2])
+			return parseGitStatusLine{
+				status:    status,
+				filenames: []string{oldFile, newFile},
+			}
+		}
+	}
+
+	// Handle regular status codes (M, A, D, etc.)
+	filename := unquoteGitPath(parts[1])
+	return parseGitStatusLine{
+		status:    status,
+		filenames: []string{filename},
+	}
+}
+
+func unquoteGitPath(path string) string {
+	if len(path) >= 2 && path[0] == '"' && path[len(path)-1] == '"' {
+		if unquoted, err := strconv.Unquote(path); err == nil {
+			return unquoted
+		}
+	}
+	return path
+}
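
To see what the suggested helper does, the standalone Go sketch below runs the same `unquoteGitPath` logic on a few inputs. The sample paths are made up for illustration. Git's C-style quoting writes non-ASCII bytes as three-digit octal escapes, which `strconv.Unquote` accepts.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strconv"
)

// unquoteGitPath strips git's C-style quoting when the path is wrapped in
// double quotes and parses cleanly; otherwise it returns the input as-is.
func unquoteGitPath(path string) string {
	if len(path) >= 2 && path[0] == '"' && path[len(path)-1] == '"' {
		if unquoted, err := strconv.Unquote(path); err == nil {
			return unquoted
		}
	}
	return path
}

func main() {
	// A quoted path as git could print it for a file with a non-ASCII name.
	quoted := `"imag\303\251.png"`

	// Before unquoting, the extension still carries the closing quote.
	fmt.Println(filepath.Ext(quoted))

	// After unquoting, the extension is a plain ".png".
	fmt.Println(filepath.Ext(unquoteGitPath(quoted)))

	// Paths without surrounding quotes pass through unchanged.
	fmt.Println(unquoteGitPath("plain.go"))
}
```

Falling back to the original string on a parse error keeps the helper safe for filenames that merely start and end with a quote character.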

DFanso

LGTM 🎉

vinyas-bharadwaj added 3 commits October 7, 2025 19:20

fixed the issue with the logo

0cddb3e

Filters binary files in order to reduce API costs

f25b3ac

Merge branch 'DFanso:main' into main

32f7d9d

coderabbitai bot reviewed Oct 8, 2025


vinyas-bharadwaj added 2 commits October 8, 2025 18:59

Removed svg from the binary file filtering and fixed the rename/copy status parsing

afbe485

Improved text file detection by checking common extensionless files like Makefile and Dockerfile

e0752f3

coderabbitai bot reviewed Oct 8, 2025


Fixes parsing of git --name-status output by splitting lines on tabs instead of spaces

654dad4

coderabbitai bot reviewed Oct 8, 2025


DFanso approved these changes Oct 9, 2025


DFanso assigned vinyas-bharadwaj Oct 9, 2025

DFanso added the enhancement, hacktoberfest, hacktoberfest-accepted, and go labels Oct 9, 2025

DFanso merged commit 9739e70 into DFanso:main Oct 9, 2025
8 checks passed
