@mustansir14 mustansir14 commented Dec 11, 2025

Description:

This PR adds GitLab project details (project ID, project name, and project owner name) to the metadata of each chunk.

Achieving this wasn't straightforward because the chunks are generated by the git source, which calls an injected SourceMetadataFunc to create the metadata. The signature of this function does not include the GitLab project ID.

The solution implemented here is to maintain a repoToProjectCache map that stores the project details for each repo. The cache is populated as we enumerate the repos, and the callback function reads from it to populate the project details.
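
For readers skimming the thread, here is a minimal sketch of that idea, not the PR's actual code. It assumes a simplified callback signature and a hypothetical chunkMetadata type; the real SourceMetadataFunc takes more arguments and returns the project's own metadata type. The point it illustrates is that the callback only receives the repository URL and recovers the project details from the cache.

```go
package main

import (
	"fmt"
	"sync"
)

// project holds the details attached to each chunk (modelled on the PR's type).
type project struct {
	id    int
	name  string
	owner string
}

// repoToProjectCache maps a repo URL to its project details.
type repoToProjectCache struct {
	mu    sync.RWMutex
	cache map[string]*project
}

func newRepoToProjectCache() *repoToProjectCache {
	return &repoToProjectCache{cache: make(map[string]*project)}
}

func (c *repoToProjectCache) set(repo string, p *project) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.cache[repo] = p
}

func (c *repoToProjectCache) get(repo string) (*project, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	p, ok := c.cache[repo]
	return p, ok
}

// chunkMetadata is a hypothetical stand-in for the real metadata type.
type chunkMetadata struct {
	Repository   string
	File         string
	ProjectID    int
	ProjectName  string
	ProjectOwner string
}

// newMetadataFunc returns a callback whose signature carries no project
// argument; it looks the project up in the cache by repository URL.
func newMetadataFunc(c *repoToProjectCache) func(file, repository string) chunkMetadata {
	return func(file, repository string) chunkMetadata {
		md := chunkMetadata{Repository: repository, File: file}
		if p, ok := c.get(repository); ok {
			md.ProjectID = p.id
			md.ProjectName = p.name
			md.ProjectOwner = p.owner
		}
		return md
	}
}

func main() {
	c := newRepoToProjectCache()
	// Populated during enumeration (example.com URL is a placeholder).
	c.set("https://gitlab.example.com/group/repo.git", &project{id: 42, name: "repo", owner: "alice"})

	mdFunc := newMetadataFunc(c)
	fmt.Printf("%+v\n", mdFunc("main.go", "https://gitlab.example.com/group/repo.git"))
}
```

On a cache miss the sketch leaves the project fields at their zero values rather than failing, which keeps chunk generation independent of enumeration order.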

The only concerning part of this solution is memory usage. To put the impact in perspective, I found a public organization with ~3000 projects and ran benchmarks before and after making the changes. (I know this isn't anywhere close to the largest organizations we've come across, but it is the biggest public one I could find.)

Results before making changes:

Total memory usage: 43234992 bytes (43.23 MB)

Running tool: /usr/local/go/bin/go test -benchmem -run=^$ -bench ^BenchmarkMemoryUsage$ github.com/trufflesecurity/trufflehog/v3/pkg/sources/gitlab

2025-12-11T18:58:01+05:00	info-0	context	starting projects enumeration	{"list_options": {"pagination":"keyset","order_by":"id","sort":"asc"}, "all_available": false}
2025-12-11T18:59:01+05:00	info-0	context	Enumerated GitLab projects	{"count": 2930}
Memory used during benchmark:
Allocated (Alloc): 43234992 bytes
Total Allocated (TotalAlloc): 43234992 bytes
Memory obtained from system (Sys): 17629184 bytes
goos: linux
goarch: amd64
pkg: github.com/trufflesecurity/trufflehog/v3/pkg/sources/gitlab
cpu: 12th Gen Intel(R) Core(TM) i5-1235U
BenchmarkMemoryUsage-12    	       1	59642136351 ns/op	42916920 B/op	  246843 allocs/op
PASS
ok  	github.com/trufflesecurity/trufflehog/v3/pkg/sources/gitlab	60.146s

Results after this change:

Total memory usage: 44666216 bytes (44.66 MB)

Running tool: /usr/local/go/bin/go test -benchmem -run=^$ -bench ^BenchmarkMemoryUsage$ github.com/trufflesecurity/trufflehog/v3/pkg/sources/gitlab

goos: linux
goarch: amd64
pkg: github.com/trufflesecurity/trufflehog/v3/pkg/sources/gitlab
cpu: 12th Gen Intel(R) Core(TM) i5-1235U
=== RUN   BenchmarkMemoryUsage
BenchmarkMemoryUsage
2025-12-11T18:38:02+05:00       info-0  context starting projects enumeration   {"list_options": {"pagination":"keyset","order_by":"id","sort":"asc"}, "all_available": false}
2025-12-11T18:39:09+05:00       info-0  context Enumerated GitLab projects      {"count": 2930}
Memory used during benchmark:
Allocated (Alloc): 44666216 bytes
Total Allocated (TotalAlloc): 44666216 bytes
Memory obtained from system (Sys): 21823488 bytes
BenchmarkMemoryUsage-12                1        66827846825 ns/op       44350440 B/op     261384 allocs/op
PASS
ok      github.com/trufflesecurity/trufflehog/v3/pkg/sources/gitlab     67.355s
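
To make the comparison concrete, the difference between the two runs can be worked out from the Alloc figures and the project count reported above. The helper below is only an illustrative sketch (memoryOverhead is a hypothetical name, and it assumes the "after" value is not smaller than the "before" value). Both runs executed a single benchmark iteration, so the result should be read as a rough estimate rather than a precise measurement.

```go
package main

import "fmt"

// memoryOverhead returns the absolute difference in bytes, the relative
// increase in percent, and the average extra bytes per project.
// Assumes after >= before; uint64 subtraction would wrap otherwise.
func memoryOverhead(before, after uint64, projects int) (delta uint64, percent, perProject float64) {
	delta = after - before
	percent = float64(delta) / float64(before) * 100
	perProject = float64(delta) / float64(projects)
	return delta, percent, perProject
}

func main() {
	// Alloc before, Alloc after, and the enumerated project count from the logs.
	delta, pct, per := memoryOverhead(43234992, 44666216, 2930)
	fmt.Printf("delta=%d bytes (%.2f%%), about %.0f bytes per project\n", delta, pct, per)
}
```

With the numbers from this PR, that is an increase of 1431224 bytes (about 3.3%), or roughly 488 bytes per enumerated project.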

Note that this benchmark doesn't perform a full scan (I tried that first, but it would have taken hours). Instead, I tweaked the code to expose the callback function and called it directly after enumerating the repos. This is the code used for benchmarking:

func BenchmarkMemoryUsage(b *testing.B) {
	// Enable memory allocation reporting
	b.ReportAllocs()

	// Run a specific function and measure memory usage
	var mStart, mEnd runtime.MemStats

	// Get memory stats before execution
	runtime.ReadMemStats(&mStart)

	connection := &sourcespb.GitLab{
		Endpoint: "https://gitlab.eclipse.org",
	}

	s := Source{}

	conn, err := anypb.New(connection)
	if err != nil {
		b.Fatal(err)
	}

	err = s.Init(context.Background(), "benchmark", 0, 0, false, conn, 100)
	if err != nil {
		b.Errorf("Source.Init() error = %v", err)
		return
	}

	feature.UseSimplifiedGitlabEnumeration.Store(true)

	// Run the benchmark loop
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		reporter := &sourcestest.TestReporter{}
		err = s.Enumerate(context.Background(), reporter)
		if err != nil {
			b.Errorf("Source.Enumerate() error = %v", err)
			return
		}

		for _, unit := range reporter.Units {
			repository, _ := unit.SourceUnitID()
			_ = s.cfg.SourceMetadataFunc("test file", "test email", "test commit", "test timestamp", repository, "test local path", 1)
		}
	}

	// Get memory stats after execution
	runtime.ReadMemStats(&mEnd)

	// Calculate the memory usage
	allocDiff := mEnd.Alloc - mStart.Alloc
	totalAllocDiff := mEnd.TotalAlloc - mStart.TotalAlloc
	sysDiff := mEnd.Sys - mStart.Sys

	// Log the memory stats
	fmt.Printf("Memory used during benchmark:\n")
	fmt.Printf("Allocated (Alloc): %d bytes\n", allocDiff)
	fmt.Printf("Total Allocated (TotalAlloc): %d bytes\n", totalAllocDiff)
	fmt.Printf("Memory obtained from system (Sys): %d bytes\n", sysDiff)
}

Checklist:

  • Tests passing (make test-community)?
  • Lint passing (make lint; this requires golangci-lint)?

@mustansir14 mustansir14 requested a review from a team December 11, 2025 14:32
@mustansir14 mustansir14 requested a review from a team as a code owner December 11, 2025 14:32

@amanfcp amanfcp left a comment

LGTM! A few suggestions

Comment on lines 1213 to 1228
// Cache project metadata if available.
if gitUnit, ok := unit.(git.SourceUnit); ok && gitUnit.Metadata != nil {
	if gitUnit.Metadata["project_id"] != "" {
		project := &project{}
		projectId, err := strconv.Atoi(gitUnit.Metadata["project_id"])
		if err != nil {
			ctx.Logger().Error(err, "could not convert project_id metadata to int", "project_id", gitUnit.Metadata["project_id"])
		} else {
			project.id = projectId
		}
		project.name = gitUnit.Metadata["project_name"]
		project.owner = gitUnit.Metadata["project_owner"]
		s.repoToProjCache.set(repoURL, project)
	}

}

Would recommend early return flow here. It's kind of difficult to read 😅

Contributor Author

Hmm, totally agree. I'll make the changes
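
One possible shape for the early-return flow suggested here is to move the parsing into a small helper. This is a sketch only: projectFromMetadata is a hypothetical name, and the caller is assumed to keep the logging and the cache write. It keeps the quoted snippet's behaviour of still returning name and owner when the ID cannot be parsed.

```go
package main

import (
	"fmt"
	"strconv"
)

type project struct {
	id    int
	name  string
	owner string
}

// projectFromMetadata builds a project from unit metadata.
// It returns a nil project when there is no project_id, so the caller can
// simply skip caching. If project_id is not a valid integer, the project is
// still returned (with id left at zero) together with an error to log.
func projectFromMetadata(metadata map[string]string) (*project, error) {
	rawID := metadata["project_id"] // indexing a nil map is safe and yields ""
	if rawID == "" {
		return nil, nil
	}

	p := &project{
		name:  metadata["project_name"],
		owner: metadata["project_owner"],
	}

	id, err := strconv.Atoi(rawID)
	if err != nil {
		return p, fmt.Errorf("could not convert project_id metadata to int: %w", err)
	}
	p.id = id
	return p, nil
}

func main() {
	p, err := projectFromMetadata(map[string]string{
		"project_id":    "123",
		"project_name":  "testy",
		"project_owner": "testermctestface",
	})
	fmt.Println(p.id, p.name, p.owner, err)
}
```

The call site then reduces to a type assertion, a call to the helper, an optional log line for the error, and a single cache write when the returned project is non-nil.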

Comment on lines 941 to 943
if proj.Owner != nil {
metadata["project_owner"] = proj.Owner.Email
}

We're missing the username check here as handled in gitlabProjectToCacheProject

Contributor Author

Great catch, thanks
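
A shared helper is one way to keep both code paths consistent. The sketch below is an assumption about what the "username check" might look like: the thread does not show gitlabProjectToCacheProject, so the precedence (email first, username as fallback) and the owner type are hypothetical.

```go
package main

import "fmt"

// owner is a hypothetical stand-in for the owner type on a GitLab project.
type owner struct {
	Email    string
	Username string
}

// ownerLabel returns the owner's email when present and falls back to the
// username otherwise. A nil owner yields an empty string.
// The precedence here is an assumption, not taken from the PR.
func ownerLabel(o *owner) string {
	if o == nil {
		return ""
	}
	if o.Email != "" {
		return o.Email
	}
	return o.Username
}

func main() {
	fmt.Println(ownerLabel(&owner{Username: "testermctestface"}))
}
```

Calling the same helper from both places would prevent the two paths from drifting apart again.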

…data' of mustansir:mustansir14/trufflehog into INS-206-Store-GitLab-project-ID-in-secret-location-metadata
@rosecodym rosecodym requested a review from a team December 15, 2025 16:51

@rosecodym rosecodym left a comment

There is some scanning pipeline machinery that cannot yet support this, so this PR won't work as-is. I'm requesting changes until I (or maybe @mcastorina) have some time to go through it and figure out whether "won't work" actually means there will be problems, or it just won't work.

@rosecodym rosecodym left a comment

This looks pretty straightforward, but I have some notes and questions.

}
// extract project path from repo URL
// https://gitlab.com/testermctestface/testy.git => testermctestface/testy
repoPath := strings.TrimSuffix(strings.Join(strings.Split(repoUrl, "/")[len(strings.Split(repoUrl, "/"))-2:], "/"), ".git")

Did you consider properly parsing this as a URL instead of doing a simple string split?

Relatedly, this looks like it will panic if the repo URL is sufficiently malformed. Even if string splitting is the best solution, we should avoid that.
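
As an illustration of the URL-parsing alternative, here is a minimal sketch using the standard net/url package. projectPathFromRepoURL is a hypothetical helper name. Note one deliberate difference from the quoted line: it returns the full path rather than only the last two segments, which matters for projects nested in subgroups. It also assumes HTTP(S)-style URLs; SCP-style remotes such as git@host:group/repo.git are not handled.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// projectPathFromRepoURL extracts the project path from a repo URL, e.g.
// https://gitlab.com/testermctestface/testy.git => testermctestface/testy.
// It returns an error instead of panicking on malformed input.
func projectPathFromRepoURL(repoURL string) (string, error) {
	u, err := url.Parse(repoURL)
	if err != nil {
		return "", fmt.Errorf("parsing repo URL: %w", err)
	}

	// Drop surrounding slashes first, then the optional ".git" suffix.
	path := strings.TrimSuffix(strings.Trim(u.Path, "/"), ".git")
	if path == "" {
		return "", fmt.Errorf("repo URL %q has no project path", repoURL)
	}
	return path, nil
}

func main() {
	path, err := projectPathFromRepoURL("https://gitlab.com/testermctestface/testy.git")
	fmt.Println(path, err)
}
```

By contrast, the slice expression in the quoted line panics whenever the input contains no "/" at all, because the computed start index becomes negative.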

}

// Clear the repo to project cache when done.
defer s.repoToProjCache.clear()

@rosecodym rosecodym Dec 16, 2025

While I respect your concern for cache size, I don't believe clearing this cache is necessary. Source structures don't survive past a single scan, so clearing them at the end of the scan doesn't reclaim any memory that wouldn't be soon reclaimed anyway, and introduces the possibility of race conditions in the case that the git-parsing code sends work to separate goroutines.

Comment on lines +85 to +122
type project struct {
	id    int
	name  string
	owner string
}

type repoToProjectCache struct {
	sync.RWMutex

	cache map[string]*project
}

func (r *repoToProjectCache) get(repo string) (*project, bool) {
	r.RLock()
	defer r.RUnlock()
	proj, ok := r.cache[repo]
	return proj, ok
}

func (r *repoToProjectCache) set(repo string, proj *project) {
	r.Lock()
	defer r.Unlock()

	r.cache[repo] = proj
}

func (r *repoToProjectCache) del(repo string) {
	r.Lock()
	defer r.Unlock()

	delete(r.cache, repo)
}

func (r *repoToProjectCache) clear() {
	r.Lock()
	defer r.Unlock()

	clear(r.cache)

What do you think of moving this to a separate file (still within this package)?

// ensure project details for specified repos are cached
// this is required to populate metadata during chunking
for _, repo := range repos {
	s.ensureProjectInCache(ctx, repo)

It took me a few passes back and forth to figure out why sometimes you cache projects using ensureProjectInCache, and sometimes you directly cache the value returned from gitlabProjectToCacheProject. I eventually got there, but laying out the call tree a bit differently would definitely have helped me understand what was going on much more quickly. What I realized I was looking for was a single "entry point" into caching, e.g.:

func (s *Source) cacheProject(repoUrl string, project *gitlab.Project)

When a gitlab.Project is available, this could be used directly. When only a repo URL is available, it could be transformed into a gitlab.Project using a second helper function:

func (s *Source) getGitlabProject(repoURL string) *gitlab.Project

This way, every function call that caches something looks the "same," and the function that acquires new (necessary) information for caching has a clear name. (ensure is a suboptimal function verb, because it's vague.)

What do you think of this?
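
To make the proposed layout concrete, here is a rough sketch under several assumptions: gitlabProject stands in for gitlab.Project, the lookup field stands in for whatever API call resolves a repo URL to a project, and getGitlabProject returns an error in addition to the project (the signature suggested above does not). Locking is omitted for brevity.

```go
package main

import (
	"errors"
	"fmt"
)

// gitlabProject is a hypothetical stand-in for gitlab.Project.
type gitlabProject struct {
	ID        int
	Name      string
	OwnerName string
}

type project struct {
	id    int
	name  string
	owner string
}

type source struct {
	cache map[string]*project
	// lookup is a hypothetical hook standing in for the GitLab API call.
	lookup func(repoURL string) (*gitlabProject, error)
}

// cacheProject is the single entry point for writing to the cache.
func (s *source) cacheProject(repoURL string, p *gitlabProject) {
	if p == nil {
		return
	}
	s.cache[repoURL] = &project{id: p.ID, name: p.Name, owner: p.OwnerName}
}

// getGitlabProject acquires the project when only a repo URL is available.
func (s *source) getGitlabProject(repoURL string) (*gitlabProject, error) {
	if s.lookup == nil {
		return nil, errors.New("no project lookup configured")
	}
	return s.lookup(repoURL)
}

func main() {
	s := &source{
		cache: make(map[string]*project),
		lookup: func(repoURL string) (*gitlabProject, error) {
			return &gitlabProject{ID: 1, Name: "testy", OwnerName: "testermctestface"}, nil
		},
	}

	// Path 1: a project is already in hand (e.g. during enumeration).
	s.cacheProject("https://gitlab.example.com/a/b.git", &gitlabProject{ID: 2, Name: "b", OwnerName: "a"})

	// Path 2: only a repo URL is available, so fetch first and then cache.
	if p, err := s.getGitlabProject("https://gitlab.example.com/c/d.git"); err == nil {
		s.cacheProject("https://gitlab.example.com/c/d.git", p)
	}

	fmt.Println(len(s.cache))
}
```

Both paths end in the same cacheProject call, so every cache write looks the same at the call site.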

@rosecodym rosecodym dismissed their stale review December 16, 2025 18:51

i misread the pr when i requested changes

4 participants