Collaborator

@rayandrew rayandrew commented Aug 30, 2025

Hi @izzet and @hariharan-devarajan,

This PR deprecates zindex and replaces it with our in-house indexer, DFTracerIndexer.
The integration and the PR as a whole are done, but let's hold off on merging until the dftracer-utils package is available on PyPI.

I changed only a few lines of code, so it should be easy to review.

My initial finding is that it improves the speed of both indexing and reading.

[UPDATED] 10/21/2025

The file info is as follows:

gzip -l trace.pfw.gz
compressed                   uncompressed                   ratio    uncompressed_name
784954304 (784.954304 MB)    5566780464 (5.566780464 GB)    85.9%    trace.pfw
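
As a quick sanity check on the figures above, the ratio column can be reproduced from the two sizes. This is a minimal sketch: the function name `space_saved` is made up for illustration, and it assumes the ratio is simply the fraction of space saved (gzip's own calculation may account for header bytes slightly differently).

```python
def space_saved(compressed: int, uncompressed: int) -> float:
    """Fraction of the uncompressed size that compression removed."""
    return 1.0 - compressed / uncompressed


# Sizes in bytes, copied from the gzip -l output above.
ratio = space_saved(784_954_304, 5_566_780_464)
print(f"{ratio:.1%}")  # prints 85.9%
```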
  • Reading a line range that starts at an arbitrary offset: lines 10M to 20M (10M lines in total)
========== Building DFT Index
Benchmark 1: python ./temp/dft-build.py
  Time (mean ± σ):      4.713 s ±  0.014 s    [User: 4.473 s, System: 0.221 s]
  Range (min … max):    4.692 s …  4.732 s    10 runs
 
========== Reading from DFT Index, lines 10000000 to 20000000 (10M lines)
Benchmark 1: python ./temp/dft-read.py
  Time (mean ± σ):     838.5 ms ±   3.2 ms    [User: 712.2 ms, System: 123.9 ms]
  Range (min … max):   834.3 ms … 843.1 ms    10 runs
 
========== Building ZI Index
Benchmark 1: python ./temp/zi-build.py
  Time (mean ± σ):     55.164 s ±  0.491 s    [User: 51.866 s, System: 2.887 s]
  Range (min … max):   54.101 s … 55.703 s    10 runs
 
========== Reading from ZI Index, lines 10000000 to 20000000 (10M lines)
Benchmark 1: python ./temp/zi-read.py
  Time (mean ± σ):     21.820 s ±  0.226 s    [User: 21.127 s, System: 0.647 s]
  Range (min … max):   21.508 s … 22.105 s    10 runs
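
To put the four Python benchmarks side by side, the speedups can be worked out from the mean times. This is only arithmetic on the numbers reported above; the dictionary keys are labels chosen for this sketch, not names from the PR.

```python
# Mean wall-clock times in seconds, copied from the hyperfine output above.
mean_s = {
    "dft_build": 4.713,
    "dft_read": 0.8385,
    "zi_build": 55.164,
    "zi_read": 21.820,
}

# Speedup = zindex time divided by DFTracerIndexer time.
build_speedup = mean_s["zi_build"] / mean_s["dft_build"]
read_speedup = mean_s["zi_read"] / mean_s["dft_read"]

print(f"build: {build_speedup:.1f}x, read: {read_speedup:.1f}x")
# prints build: 11.7x, read: 26.0x
```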
  • Of course, the C++ version is much faster
 hyperfine --warmup 3 "./build/bin/dftracer_reader trace.pfw.gz --start 10000000 --end 20000000 --mode lines --read-buffer-size $((1 * 1024 * 1024))"
Benchmark 1: ./build/bin/dftracer_reader trace.pfw.gz --start 10000000 --end 20000000 --mode lines --read-buffer-size 1048576
  Time (mean ± σ):     540.1 ms ±   3.3 ms    [User: 492.7 ms, System: 45.4 ms]
  Range (min … max):   537.1 ms … 546.1 ms    10 runs
  • Reading by byte offset is faster still
# check how many bytes we read in lines mode
./build/bin/dftracer_reader trace.pfw.gz --start 10000000 --end 20000000 --mode lines --read-buffer-size $((1 * 1024 * 1024)) | wc -c
558899481

hyperfine "./build/bin/dftracer_reader trace.pfw.gz --start 50050000 --end 608949479 --mode line_bytes --read-buffer-size $((1 * 1024 * 1024))"  # roughly 558899481 bytes
Benchmark 1: ./build/bin/dftracer_reader trace.pfw.gz --start 50050000 --end 608949479 --mode line_bytes --read-buffer-size 1048576
  Time (mean ± σ):     382.6 ms ±   5.9 ms    [User: 369.1 ms, System: 12.0 ms]
  Range (min … max):   375.0 ms … 396.8 ms    10 runs
  • Validation
./build/bin/dftracer_reader trace.pfw.gz --start 10000000 --end 20000000 --mode lines --read-buffer-size $((1 * 1024 * 1024)) | wc -l
10000001
./build/bin/dftracer_reader trace.pfw.gz --start 50050000 --end 608949479 --mode line_bytes --read-buffer-size 1048576 | wc -l
10324290 # reading more lines but faster!
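
For a rough sense of scale, the lines-mode timings can be turned into throughput using the 558899481 bytes counted with `wc -c`. This sketch assumes the Python and zindex runs produced the same bytes as the C++ lines-mode run (they read the same line range); the line_bytes run is left out because its exact output size is not reported.

```python
# Bytes produced for lines 10000000 to 20000000, from the wc -c output above.
BYTES_READ = 558_899_481

# Mean read times in seconds for the same line range.
read_mean_s = {
    "C++ dftracer_reader (lines)": 0.5401,
    "Python DFT index": 0.8385,
    "Python zindex": 21.820,
}

# Throughput in decimal megabytes per second of uncompressed output.
throughput_mb_s = {
    name: BYTES_READ / seconds / 1e6 for name, seconds in read_mean_s.items()
}

for name, mb_s in throughput_mb_s.items():
    print(f"{name}: {mb_s:.0f} MB/s")
```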
  • Random access speed (20 MiB windows at different byte offsets)
Benchmark 1: ./build/bin/dftracer_reader trace.pfw.gz --start 0 --end 20971520 --mode line_bytes --read-buffer-size 1048576
  Time (mean ± σ):      24.5 ms ±   0.7 ms    [User: 20.2 ms, System: 3.4 ms]
  Range (min … max):    23.2 ms …  28.3 ms    109 runs
 
+ start=104857600
+ end=125829120
+ hyperfine --warmup 3 './build/bin/dftracer_reader trace.pfw.gz --start 104857600 --end 125829120 --mode line_bytes --read-buffer-size 1048576'
Benchmark 1: ./build/bin/dftracer_reader trace.pfw.gz --start 104857600 --end 125829120 --mode line_bytes --read-buffer-size 1048576
  Time (mean ± σ):      29.6 ms ±   3.4 ms    [User: 24.7 ms, System: 3.6 ms]
  Range (min … max):    28.0 ms …  61.2 ms    91 runs
 
  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.
 
+ start=1073741824
+ end=1094713344
+ hyperfine --warmup 3 './build/bin/dftracer_reader trace.pfw.gz --start 1073741824 --end 1094713344 --mode line_bytes --read-buffer-size 1048576'
Benchmark 1: ./build/bin/dftracer_reader trace.pfw.gz --start 1073741824 --end 1094713344 --mode line_bytes --read-buffer-size 1048576
  Time (mean ± σ):      24.0 ms ±   1.5 ms    [User: 19.3 ms, System: 3.7 ms]
  Range (min … max):    21.7 ms …  28.6 ms    115 runs
 
+ start=3221225472
+ end=3242196992
+ hyperfine --warmup 3 './build/bin/dftracer_reader trace.pfw.gz --start 3221225472 --end 3242196992 --mode line_bytes --read-buffer-size 1048576'
Benchmark 1: ./build/bin/dftracer_reader trace.pfw.gz --start 3221225472 --end 3242196992 --mode line_bytes --read-buffer-size 1048576
  Time (mean ± σ):      23.7 ms ±   1.2 ms    [User: 19.2 ms, System: 3.5 ms]
  Range (min … max):    22.3 ms …  28.8 ms    113 runs
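
The four runs above can be summarized as follows. The sketch only restates the reported offsets and mean times; since the last window starts beyond the 784954304-byte compressed size, the offsets are presumably positions in the uncompressed stream.

```python
MIB = 1024 * 1024

# (start, end, mean time in ms), copied from the hyperfine runs above.
runs = [
    (0, 20_971_520, 24.5),
    (104_857_600, 125_829_120, 29.6),
    (1_073_741_824, 1_094_713_344, 24.0),
    (3_221_225_472, 3_242_196_992, 23.7),
]

for start, end, mean_ms in runs:
    window_mib = (end - start) // MIB
    print(f"offset {start // MIB:>4} MiB, window {window_mib} MiB, mean {mean_ms} ms")

# Difference between the slowest and fastest mean across all offsets.
spread_ms = max(r[2] for r in runs) - min(r[2] for r in runs)
print(f"spread across offsets: {spread_ms:.1f} ms")
```

Every window is 20 MiB, and the mean time stays within about 6 ms no matter how deep into the file the window starts.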

Takeaways

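  • Building the index with zindex is slow (about 55 s versus 4.7 s with DFTracerIndexer), although this matters less because it is a one-time operation
  • The Python bindings add overhead, unfortunately (about 839 ms versus 540 ms for the C++ reader), because objects have to be copied so that Python can manage them
  • DFTracerIndexer supports random offsets, and it is fast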

Contributor

Copilot AI left a comment

Pull Request Overview

This PR deprecates the zindex library and replaces it with a new in-house DFTracerIndexer from the dftracer-utils package. The change aims to improve indexing and reading performance for trace files.

Key changes:

  • Replace zindex_py dependency with dftracer-utils package
  • Refactor file indexing to use DFTracerIndexer instead of zindex
  • Modify batch processing to work with byte offsets rather than line numbers
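
To illustrate what batching by byte offsets can look like, here is a minimal sketch. It is not the code from this PR: the helper `byte_batches` and the 64 MiB batch size are hypothetical, and the ranges are half-open `[start, end)`, which may differ from how the real reader treats its end offset.

```python
from typing import Iterator, Tuple


def byte_batches(total_bytes: int, batch_bytes: int) -> Iterator[Tuple[int, int]]:
    """Yield half-open (start, end) byte ranges that cover [0, total_bytes)."""
    if batch_bytes <= 0:
        raise ValueError("batch_bytes must be positive")
    for start in range(0, total_bytes, batch_bytes):
        # The final batch is clipped to the end of the data.
        yield start, min(start + batch_bytes, total_bytes)


# Uncompressed size of trace.pfw from the PR description; 64 MiB is an
# arbitrary batch size picked for this example.
batches = list(byte_batches(5_566_780_464, 64 * 1024 * 1024))
print(len(batches), batches[0], batches[-1])
```

Unlike line-number batches, byte ranges can be computed without knowing how many lines the file has; aligning each range to line boundaries is then left to the reader.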

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| pyproject.toml | Updates dependency from zindex_py to dftracer-utils package |
| dfanalyzer/dftracer.py | Replaces zindex implementation with DFTracerIndexer and refactors processing logic |

codecov-commenter commented Aug 30, 2025

Codecov Report

❌ Patch coverage is 70.78652% with 26 lines in your changes missing coverage. Please review.
✅ Project coverage is 66.91%. Comparing base (d76b03d) to head (5b439ac).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| dfanalyzer/dftracer.py | 70.78% | 26 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #26      +/-   ##
==========================================
+ Coverage   57.48%   66.91%   +9.42%     
==========================================
  Files          26       26              
  Lines        2164     2167       +3     
==========================================
+ Hits         1244     1450     +206     
+ Misses        920      717     -203     
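
The percentages in this report follow from the hit and line counts. The sketch below recomputes them; the patch size of 89 lines (63 covered, 26 missing) is not stated in the report and is inferred here from the 70.78652% figure.

```python
def pct(hits: int, lines: int) -> float:
    """Coverage as a percentage of lines hit."""
    return 100.0 * hits / lines


base = pct(1244, 2164)  # main; the report shows 57.48%
head = pct(1450, 2167)  # this PR; the report shows 66.91%

# Inferred, not reported: 26 missing lines at 70.78652% implies 89 patch lines.
patch = pct(89 - 26, 89)

print(f"base {base:.3f}%, head {head:.3f}%, patch {patch:.5f}%")
```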

@izzet izzet added the enhancement New feature or request label Aug 31, 2025
@rayandrew rayandrew force-pushed the feat/deprecate-zindex branch from 77c8ff1 to c7a9676 Compare October 20, 2025 20:36
@rayandrew
Collaborator Author

Please wait until a new version of dftracer-utils is released; then we can get this merged.

@rayandrew rayandrew changed the title from "deprecate zindex and introduce DFTracerIndexer" to "restructure and deprecate zindex" Oct 21, 2025
@rayandrew rayandrew force-pushed the feat/deprecate-zindex branch from 0b32944 to f102ef8 Compare October 21, 2025 21:29
@rayandrew rayandrew force-pushed the feat/deprecate-zindex branch from edec548 to ea0a030 Compare October 22, 2025 01:37
Member

@hariharan-devarajan hariharan-devarajan left a comment

Looks good.

@hariharan-devarajan hariharan-devarajan merged commit 1465b8e into main Oct 22, 2025
0 of 3 checks passed
@rayandrew rayandrew deleted the feat/deprecate-zindex branch October 22, 2025 21:38
@rayandrew rayandrew mentioned this pull request Oct 22, 2025