-
Notifications
You must be signed in to change notification settings - Fork 2.2k
Enable optional symlink following for fs source #4565
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
| } | ||
|
|
||
| if fileInfo.Mode()&os.ModeSymlink != 0 { | ||
| if !s.followSymlinks && fileInfo.Mode()&os.ModeSymlink != 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fileInfo.Mode() would always be non-symlink if os.Stat is used, right? Since Stat follows symlinks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have redone this preserving the original first Lstat and then doing a Stat after if followSymlinks is enabled and it is a symlink to make it more clear.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unresolving because I don't see the original Lstat anywhere. Am I missing it?
9f9d140 to
b8dddcd
Compare
| } | ||
|
|
||
| if fileInfo.Mode()&os.ModeSymlink != 0 { | ||
| if !s.followSymlinks && fileInfo.Mode()&os.ModeSymlink != 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unresolving because I don't see the original Lstat anywhere. Am I missing it?
| // Why LRU cache instead of a map: | ||
| // - Bounded memory: Limits to 10k paths (~1MB) even for massive directory trees | ||
| // - Per-path reset: Cache is recreated for each scan path to prevent accumulation | ||
| // - Loop detection: Prevents scanning the same file multiple times via different symlinks | ||
| // | ||
| // Why depth-1 limiting: | ||
| // - Prevents infinite loops: Symlink chains (A->B->C->...) are limited | ||
| // - Predictable behavior: Users know exactly which symlinks will be followed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does a cache get us that Stat / EvalSymlinks does not address? Looks like if there's a cycle, those functions return an error: too many links
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, the cycle we're worried about is a symlink to a directory and then recursively scanning that. Do you think we could get away with saving the visited directories only?
| // If followSymlinks is enabled and this is a symlink, check for loops | ||
| if s.followSymlinks && fileInfo.Mode()&os.ModeSymlink != 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this is always false since fileInfo comes from Stat when followSymlinks is true. Is there an example or test case that exercises this scenario?
| // isDirectChild checks if a path is a direct child of any scan root path. | ||
| // This enforces depth-1 symlink following to prevent: | ||
| // - Infinite symlink loops | ||
| // - Deep directory traversal through symlinks | ||
| // | ||
| // Returns true only if the symlink's parent directory matches a scan root path. | ||
| func (s *Source) isDirectChild(path string) bool { | ||
| dir := filepath.Clean(filepath.Dir(path)) | ||
| _, isRoot := s.scanRootPaths[dir] | ||
| return isRoot | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When I read "depth-1 limiting" I thought it meant we allow only one symlink depth, but this looks like it is limiting symbolic link targets to a very specific directory structure.
Why are we doing this? It seems like something that can be easily tripped on and I'm not seeing why it needs to be that way.
| t.Run("only follow first level symlink in chain", func(t *testing.T) { | ||
| conn, err := anypb.New(&sourcespb.Filesystem{ | ||
| Paths: []string{tempDir}, | ||
| FollowSymlinks: true, | ||
| }) | ||
| require.NoError(t, err) | ||
|
|
||
| s := Source{} | ||
| err = s.Init(ctx, "test symlink chain", 0, 0, true, conn, 1) | ||
| require.NoError(t, err) | ||
|
|
||
| reporter := sourcestest.TestReporter{} | ||
| err = s.ChunkUnit(ctx, sources.CommonSourceUnit{ | ||
| ID: tempDir, | ||
| }, &reporter) | ||
| require.NoError(t, err) | ||
|
|
||
| // Should have 2 chunks: real_file and symlink1 (which resolves to real_file) | ||
| // symlink2 also resolves to the same real_file, so loop detection prevents duplicate scanning | ||
| // This is correct behavior - we don't want to scan the same content multiple times | ||
| assert.Equal(t, 2, len(reporter.Chunks), "Expected two chunks from real file and first symlink") | ||
| }) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I agree that this is testing "only follow first level symlink in chain"
It scans the entire directory vs just symlink1.
Description:
Adds an option to follow symlinks in the fs source.
Checklist:
make test-community)?make lintthis requires golangci-lint)?