Contributor

@dustin-decker dustin-decker commented Nov 20, 2025

Description:

Adds an option to follow symlinks in the fs source.

Checklist:

  • Tests passing (make test-community)?
  • Lint passing (make lint; this requires golangci-lint)?

@dustin-decker dustin-decker requested a review from a team November 20, 2025 18:05
@dustin-decker dustin-decker requested review from a team as code owners November 20, 2025 18:05
}

-if fileInfo.Mode()&os.ModeSymlink != 0 {
+if !s.followSymlinks && fileInfo.Mode()&os.ModeSymlink != 0 {
Contributor

fileInfo.Mode() would never report os.ModeSymlink if os.Stat is used, right? Stat follows symlinks.

Contributor Author

I have redone this to make it clearer: it preserves the original Lstat first, then does a Stat afterwards if followSymlinks is enabled and the path is a symlink.

Contributor

Unresolving because I don't see the original Lstat anywhere. Am I missing it?

@dustin-decker dustin-decker requested a review from a team as a code owner November 21, 2025 19:04

Comment on lines +48 to +55
// Why LRU cache instead of a map:
// - Bounded memory: Limits to 10k paths (~1MB) even for massive directory trees
// - Per-path reset: Cache is recreated for each scan path to prevent accumulation
// - Loop detection: Prevents scanning the same file multiple times via different symlinks
//
// Why depth-1 limiting:
// - Prevents infinite loops: Symlink chains (A->B->C->...) are limited
// - Predictable behavior: Users know exactly which symlinks will be followed
Contributor

What does a cache get us that Stat / EvalSymlinks does not address? Looks like if there's a cycle, those functions return an error: too many links

Contributor

Ah, the cycle we're worried about is a symlink to a directory and then recursively scanning that. Do you think we could get away with saving the visited directories only?

Comment on lines +157 to +158
// If followSymlinks is enabled and this is a symlink, check for loops
if s.followSymlinks && fileInfo.Mode()&os.ModeSymlink != 0 {
Contributor

I believe this is always false since fileInfo comes from Stat when followSymlinks is true. Is there an example or test case that exercises this scenario?

Comment on lines +310 to +320
// isDirectChild checks if a path is a direct child of any scan root path.
// This enforces depth-1 symlink following to prevent:
// - Infinite symlink loops
// - Deep directory traversal through symlinks
//
// Returns true only if the symlink's parent directory matches a scan root path.
func (s *Source) isDirectChild(path string) bool {
	dir := filepath.Clean(filepath.Dir(path))
	_, isRoot := s.scanRootPaths[dir]
	return isRoot
}
Contributor

When I read "depth-1 limiting" I thought it meant we allow only one symlink depth, but this looks like it is limiting symbolic link targets to a very specific directory structure.

Why are we doing this? It seems like something that can be easily tripped on and I'm not seeing why it needs to be that way.

Comment on lines +675 to +696
t.Run("only follow first level symlink in chain", func(t *testing.T) {
	conn, err := anypb.New(&sourcespb.Filesystem{
		Paths:          []string{tempDir},
		FollowSymlinks: true,
	})
	require.NoError(t, err)

	s := Source{}
	err = s.Init(ctx, "test symlink chain", 0, 0, true, conn, 1)
	require.NoError(t, err)

	reporter := sourcestest.TestReporter{}
	err = s.ChunkUnit(ctx, sources.CommonSourceUnit{
		ID: tempDir,
	}, &reporter)
	require.NoError(t, err)

	// Should have 2 chunks: real_file and symlink1 (which resolves to real_file)
	// symlink2 also resolves to the same real_file, so loop detection prevents duplicate scanning
	// This is correct behavior - we don't want to scan the same content multiple times
	assert.Equal(t, 2, len(reporter.Chunks), "Expected two chunks from real file and first symlink")
})
Contributor

I'm not sure I agree that this is testing "only follow first level symlink in chain"

It scans the entire directory vs just symlink1.

3 participants