-
Notifications
You must be signed in to change notification settings - Fork 2.2k
[INS-206] Store Gitlab Project ID in secret location metadata #4601
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
[INS-206] Store Gitlab Project ID in secret location metadata #4601
Conversation
amanfcp
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! A few suggestions
pkg/sources/gitlab/gitlab.go
Outdated
| // Cache project metadata if available. | ||
| if gitUnit, ok := unit.(git.SourceUnit); ok && gitUnit.Metadata != nil { | ||
| if gitUnit.Metadata["project_id"] != "" { | ||
| project := &project{} | ||
| projectId, err := strconv.Atoi(gitUnit.Metadata["project_id"]) | ||
| if err != nil { | ||
| ctx.Logger().Error(err, "could not convert project_id metadata to int", "project_id", gitUnit.Metadata["project_id"]) | ||
| } else { | ||
| project.id = projectId | ||
| } | ||
| project.name = gitUnit.Metadata["project_name"] | ||
| project.owner = gitUnit.Metadata["project_owner"] | ||
| s.repoToProjCache.set(repoURL, project) | ||
| } | ||
|
|
||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would recommend early return flow here. It's kind of difficult to read 😅
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, totally agree. I'll make the changes
pkg/sources/gitlab/gitlab.go
Outdated
| if proj.Owner != nil { | ||
| metadata["project_owner"] = proj.Owner.Email | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're missing the username check here as handled in gitlabProjectToCacheProject
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great catch, thanks
…data' of mustansir:mustansir14/trufflehog into INS-206-Store-GitLab-project-ID-in-secret-location-metadata
rosecodym
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is some scanning pipeline machinery that cannot yet support this, so this PR won't work as-is. I'm requesting changes until I (or maybe @mcastorina) has some time to go through it and figure out whether "won't work" actually means there will be problems, or it just won't work.
rosecodym
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks pretty straightforward, but I have some notes and questions.
| } | ||
| // extract project path from repo URL | ||
| // https://gitlab.com/testermctestface/testy.git => testermctestface/testy | ||
| repoPath := strings.TrimSuffix(strings.Join(strings.Split(repoUrl, "/")[len(strings.Split(repoUrl, "/"))-2:], "/"), ".git") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you consider properly parsing this as a URL instead of doing a simple string split?
Relatedly, this looks like it will panic if the repo URL is sufficiently malformed. Even if string splitting is the best solution, we should avoid that.
| } | ||
|
|
||
| // Clear the repo to project cache when done. | ||
| defer s.repoToProjCache.clear() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While I respect your concern for cache size, I don't believe clearing this cache is necessary. Source structures don't survive past a single scan, so clearing them at the end of the scan doesn't reclaim any memory that wouldn't be soon reclaimed anyway, and introduces the possibility of race conditions in the case that the git-parsing code sends work to separate goroutines.
| type project struct { | ||
| id int | ||
| name string | ||
| owner string | ||
| } | ||
|
|
||
| type repoToProjectCache struct { | ||
| sync.RWMutex | ||
|
|
||
| cache map[string]*project | ||
| } | ||
|
|
||
| func (r *repoToProjectCache) get(repo string) (*project, bool) { | ||
| r.RLock() | ||
| defer r.RUnlock() | ||
| proj, ok := r.cache[repo] | ||
| return proj, ok | ||
| } | ||
|
|
||
| func (r *repoToProjectCache) set(repo string, proj *project) { | ||
| r.Lock() | ||
| defer r.Unlock() | ||
|
|
||
| r.cache[repo] = proj | ||
| } | ||
|
|
||
| func (r *repoToProjectCache) del(repo string) { | ||
| r.Lock() | ||
| defer r.Unlock() | ||
|
|
||
| delete(r.cache, repo) | ||
| } | ||
|
|
||
| func (r *repoToProjectCache) clear() { | ||
| r.Lock() | ||
| defer r.Unlock() | ||
|
|
||
| clear(r.cache) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think of moving this to a separate file (still within this package)?
| // ensure project details for specified repos are cached | ||
| // this is required to populate metadata during chunking | ||
| for _, repo := range repos { | ||
| s.ensureProjectInCache(ctx, repo) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It took me a few passes back and forth to figure out why sometimes you cache projects using ensureProjectInCache, and sometimes you directly cache the the value returned from gitlabProjectToCacheProject. I eventually got there, but laying out the call tree a bit differently would definitely have helped me understand what was going on much more quickly. What I realized I was looking for was a single "entry point" into caching, e.g.:
func (s *Source) cacheProject(repoUrl string, project *gitlab.Project)
When a gitlab.Project is available, this could be used directly. When only a repo URL is available, it could be transformed into a gitlab.Project using a second helper function:
func (s *Source) getGitlabProject(repoURL string) *gitlab.Project
This way, every function call that caches something looks the "same," and the function that acquires new (necessary) information for caching has a clear name. (ensure is a suboptimal function verb, because it's vague.)
What do you think of this?
i misread the pr when i requested changes
Description:
This PR adds the changes to include gitlab project details like project ID, project name and project owner name to the metadata of a chunk.
Achieving this wasn't straightforward as the chunks are generated by the
gitsource and it calls an injectedSourceMetadataFuncto create the metadata. The signature of this function obviously does not include the gitlab project ID.The solution implemented here is to maintain a
repoToProjectCachemap that stores the project details for a repo. This cache is populated as we enumerate the repos and the callback function uses this cache to populate the project details.The only concerning part of this solution is memory usage, so just to put in perspective how much this impacts that, I found a public organization with ~3000 projects and ran benchmarks for before and after making the changes. (I know this isn't anywhere close to the largest organizations we've come across, but this is the biggest one I could find that was public)
Results before making changes:
Total memory usage: 43234992 bytes (43.23 MBs)
Results after this change:
Total memory usage: 44666216 bytes (44.66 MBs)
It's important to note that this doesn't really perform a full scan (I tried doing that first but it would take hours), so I tweaked the code to expose the callback function and directly called that after enumerating the repos. This is the code used for benchmarking:
Checklist:
make test-community)?make lintthis requires golangci-lint)?