-
Notifications
You must be signed in to change notification settings - Fork 0
Refactor and optimize archive and manifest handling with benchmarks #2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Added benchmarks for: - Zstd compression/decompression with and without context reuse - Compression level comparisons - Buffer allocation strategies - Manifest marshal/unmarshal operations - Archive header operations - Lookup table key strategies
Major changes: - Reorganized code into pkg/archive and pkg/manifest packages - Created clean CLI tool in cmd/evrtools - Used idiomatic Go naming conventions (CamelCase exports) - Added comprehensive documentation and comments - Consolidated duplicate types (removed redundancy between tool/ and evrManifests/) - Added unit tests and benchmarks for new packages - Updated README with library usage examples - Updated Makefile with proper targets Benchmark results show: - Context reuse for ZSTD decompression: ~3.7x faster (6290ns vs 1688ns) - Zero allocations with context reuse - CombinedInt64Key lookup: ~2.7x faster than StructKey Legacy code in tool/ and evrManifests/ retained for backwards compatibility.
Header operations (archive): - Marshal: 136.3 ns → 1.05 ns (130x faster), 3 allocs → 0 allocs - Unmarshal: 134.5 ns → 3.8 ns (35x faster), 2 allocs → 0 allocs Manifest operations: - Marshal: 1,354,843 ns → 122,781 ns (11x faster) - Memory: 3,228,085 B → 729,093 B (4.4x reduction) - Allocs: 9 → 1 (9x reduction) - Unmarshal: 1,345,174 ns → 154,367 ns (8.7x faster) - Memory: 1,474,805 B → 737,286 B (2x reduction) - Allocs: 8 → 3 (2.7x reduction) Changes: - Replaced bytes.Buffer + binary.Write with direct LittleEndian encoding - Pre-calculate and allocate exact buffer sizes - Use inline field encoding instead of reflection-based binary package - Added size constants for all binary structures
Frame content lookup: - LinearScan: 2619 ns → PrebuiltIndex: 7 ns (374x faster) - Build frame index map before extraction loop - Eliminates O(n²) complexity in package extraction String formatting: - fmt.Sprintf: 68.5 ns/op, 1 alloc → strconv.FormatInt: 26.5 ns/op, 0 allocs - Use strconv.FormatInt/FormatUint for hex/decimal conversion - 2.6x faster with no allocations Other optimizations: - Builder.incrementSection: removed loop, use direct arithmetic - Package.Extract: cache created directories to avoid repeated MkdirAll - Added benchmarks for frame index and hex formatting strategies
Performance improvements: - Added EncodeTo/DecodeFrom methods to Header for zero-allocation encoding - Reader now uses embedded headerBuf array instead of allocating - Writer now uses embedded headerBuf array instead of allocating - Added BinarySize and EncodeTo methods to Manifest for pre-allocated encoding Benchmark results: - Header DecodeFrom: 3.8x faster than UnmarshalBinary (1.0ns vs 3.8ns) - Archive Encode: 13→11 allocations (15% reduction) - Archive Decode: 10→9 allocations (10% reduction) Remaining allocations are at practical minimum: - zstd compression/decompression buffers - Manifest slice allocations for data storage
Changes: - Remove legacy tool/ package (duplicated pkg/ functionality) - Remove legacy evrManifests/ package (unused manifest versions) - Remove legacy main.go CLI (replaced by cmd/evrtools) - Update module path from goopsie to EchoTools organization - Clean up benchmark log files - Update Makefile (remove legacy targets) - Update README with current structure and usage Final structure: cmd/evrtools/ - CLI application pkg/archive/ - ZSTD archive format handling pkg/manifest/ - EVR manifest/package operations All tests pass, build verified.
Changes: - Add -decimal-names flag to CLI (default: false, uses hex) - When -decimal-names is set, extract uses decimal format for filenames - Add WithDecimalNames() option to manifest.Extract() - Type symbols remain hex in directory names - File symbols can now be decimal (old behavior) or hex (new default) Usage: evrtools -mode extract ... -decimal-names # Use decimal filenames
…; delete obsolete 'evrtools' binary
Changes: - Hex filenames are now formatted as uint64 (e.g. 0xc8c33e483b601ab6) - Decimal filenames remain int64 (e.g. -3980269165710665034) - Type symbols are now formatted as uint64 hex
| } | ||
|
|
||
| // Grow slice if needed | ||
| for int(chunkNum) >= len(files) { |
Check failure
Code scanning / CodeQL
Incorrect conversion between integer types High
strconv.ParseInt
Show autofix suggestion
Hide autofix suggestion
Copilot Autofix
AI 6 days ago
In general, to fix this kind of issue you should avoid converting a potentially large int64 (from ParseInt with size 64) to a smaller integer type without checking that the value lies within the target type’s range. Either (a) parse directly into the appropriate size, or (b) enforce explicit bounds before casting.
Here, the problem is the conversion from int64 (chunkNum) to int in the loop condition for int(chunkNum) >= len(files). The simplest fix that preserves existing behavior is to avoid the narrowing conversion entirely by working in int64 consistently and using int64(len(files)) in the comparison. That way, no unchecked narrowing occurs, and the code is safe regardless of the platform’s int size.
Concretely in pkg/manifest/scanner.go:
- Replace the loop condition
for int(chunkNum) >= len(files) {with a comparison that castslen(files)toint64:for chunkNum >= int64(len(files)) {. - Replace the subsequent use of
files[chunkNum]withfiles[chunkNumInt]wherechunkNumIntis a safely convertedintwith checked range or, more simply, keep usingfiles[chunkNum]by convertingchunkNumtointafter proving it fits via the loop bound. However, because slice indices must beint, we need one conversion. - The best approach: keep the loop condition in
int64, and inside the loop and before indexing, convertchunkNumonce after a bounds check that ensureschunkNumis within theintrange (at least non‑negative and not exceedingmath.MaxInton this platform). In practice, we can leverage the invariant thatchunkNum >= 0(directory names for chunks should be non‑negative); we should add an explicit check to enforce this and to ensurechunkNum <= math.MaxInt. This makes the index conversion safe.
So:
- Add an import of
math(already used earlier in other examples but not in this file). - After parsing
chunkNum, add checks:if chunkNum < 0 { return fmt.Errorf("invalid chunk number: %d", chunkNum) }if chunkNum > int64(math.MaxInt) { return fmt.Errorf("chunk number too large: %d", chunkNum) }
- Then we can safely use
int(chunkNum)in the loop and index expressions, since we’ve guaranteed it is withinint’s range.
This approach retains the structure of the code and satisfies CodeQL by adding explicit upper (and lower) bound checks before the narrowing conversion.
-
Copy modified line R5 -
Copy modified lines R44-R50 -
Copy modified line R80
| @@ -2,6 +2,7 @@ | ||
|
|
||
| import ( | ||
| "fmt" | ||
| "math" | ||
| "os" | ||
| "path/filepath" | ||
| "strconv" | ||
| @@ -40,6 +41,13 @@ | ||
| if err != nil { | ||
| return fmt.Errorf("parse chunk number: %w", err) | ||
| } | ||
| // Ensure chunkNum is within the valid range for int to avoid overflow on conversion. | ||
| if chunkNum < 0 { | ||
| return fmt.Errorf("invalid chunk number: %d", chunkNum) | ||
| } | ||
| if chunkNum > int64(math.MaxInt) { | ||
| return fmt.Errorf("chunk number too large: %d", chunkNum) | ||
| } | ||
|
|
||
| typeSymbol, err := strconv.ParseInt(parts[len(parts)-2], 10, 64) | ||
| if err != nil { | ||
| @@ -69,7 +77,7 @@ | ||
| files = append(files, nil) | ||
| } | ||
|
|
||
| files[chunkNum] = append(files[chunkNum], file) | ||
| files[int(chunkNum)] = append(files[int(chunkNum)], file) | ||
| return nil | ||
| }) | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just switch to using uint64 from the start, chunkNum will never be negative
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR refactors the project into an idiomatic Go structure with significant performance optimizations and enhanced functionality. The changes move from a monolithic main.go to a well-organized package structure, implement direct binary encoding/decoding for better performance, and add comprehensive benchmarks.
Key Changes:
- Reorganized codebase into
pkg/andcmd/structure following Go conventions - Optimized binary encoding/decoding and extraction with context reuse and pre-allocated buffers
- Added CLI option for decimal filenames and improved command-line interface
Reviewed changes
Copilot reviewed 22 out of 26 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
pkg/manifest/manifest.go |
Core manifest types and optimized binary encoding/decoding |
pkg/manifest/package.go |
Multi-part package extraction with performance optimizations |
pkg/manifest/builder.go |
Package building functionality from scanned files |
pkg/manifest/scanner.go |
Directory scanning for package building |
pkg/archive/header.go |
ZSTD archive header handling with zero-allocation encoding |
pkg/archive/reader.go |
Streaming decompression with reusable buffers |
pkg/archive/writer.go |
Streaming compression with header management |
cmd/evrtools/main.go |
Refactored CLI application with cleaner flag handling |
README.md |
Updated documentation with library usage examples and project structure |
Makefile |
Build automation and development workflow targets |
go.mod |
Updated module path and Go version |
| Legacy files | Removed obsolete code from previous implementation |
Comments suppressed due to low confidence (2)
ready-at-dawn-echo-arena:1
- This file contains an absolute path to a local installation directory. This appears to be a development/local configuration file that should not be committed to version control as it exposes system paths and won't work for other users.
_data:1 - This file contains an absolute path to a local installation directory. This appears to be a development/local configuration file that should not be committed to version control as it exposes system paths and won't work for other users.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
@copilot implement the comments. |
|
@thesprockee I've opened a new pull request, #3, to work on those changes. Once the pull request is ready, I'll request review from you. |
bb33f9a to
67e01c7
Compare
Introduce a new idiomatic Go project structure, enhancing organization and documentation. Implement significant performance optimizations in binary encoding/decoding and extraction processes, achieving substantial speed and memory allocation improvements. Add benchmark tests for various operations to facilitate optimization analysis. Introduce a CLI option for decimal filenames and clean up legacy code for better maintainability. Update dependencies and streamline performance analysis by removing obsolete files.