
@eric-tramel (Contributor) commented Jan 28, 2026

Summary

  • Adds detailed progress logging during cell-by-cell column generation in ColumnWiseDatasetBuilder
  • Extracts progress tracking into a testable ProgressTracker class
  • Uses thread-safe counters to track completion, success, and failure counts
  • Reports progress every ~10% with percentage complete, success/failure counts, processing rate (rec/s), and ETA
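
The bullets above could be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the method names (`record_success`, `record_failure`), the attribute names, and the `total // 10` interval rule are all assumptions made for the example.

```python
import logging
import threading
import time

logger = logging.getLogger(__name__)


class ProgressTracker:
    """Hypothetical sketch; the PR's real class may differ in names and behavior."""

    def __init__(self, total_records: int, label: str) -> None:
        self.total_records = int(total_records)
        self.label = label
        # Assumed rule: report roughly every 10% of records, at least every record.
        self.log_interval = max(1, self.total_records // 10)
        self.completed = 0
        self.succeeded = 0
        self.failed = 0
        self._next_log_at = self.log_interval
        self._start = time.perf_counter()
        self._lock = threading.Lock()

    def record_success(self) -> None:
        self._record(success=True)

    def record_failure(self) -> None:
        self._record(success=False)

    def _record(self, success: bool) -> None:
        # All counter updates happen under one lock so concurrent workers
        # cannot lose increments or log the same milestone twice.
        with self._lock:
            self.completed += 1
            if success:
                self.succeeded += 1
            else:
                self.failed += 1
            is_last = self.completed == self.total_records
            if self.completed < self._next_log_at and not is_last:
                return
            self._next_log_at += self.log_interval
            completed, ok, failed = self.completed, self.succeeded, self.failed

        # Format and emit the log line outside the lock to keep the critical section short.
        elapsed = time.perf_counter() - self._start
        rate = completed / elapsed if elapsed > 0 else 0.0
        remaining = max(self.total_records - completed, 0)
        eta = remaining / rate if rate > 0 else 0.0
        percent = 100 * completed // self.total_records if self.total_records else 100
        logger.info(
            "%s progress: %d/%d (%d%%) complete, %d ok, %d failed, %.2f rec/s, eta %.1fs",
            self.label, completed, self.total_records, percent, ok, failed, rate, eta,
        )
```

Each worker would call `record_success()` or `record_failure()` once per finished cell. Snapshotting the counters inside the lock and logging outside it is one possible design choice; it keeps slow log handlers from blocking other workers.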

Example output

🐙 Processing LLM_TEXT column 'response' with 8 concurrent workers
🧭 LLM_TEXT column 'response' will report progress every 10 record(s).
📈 LLM_TEXT column 'response' progress: 10/100 (10%) complete, 10 ok, 0 failed, 2.45 rec/s, eta 36.7s
📈 LLM_TEXT column 'response' progress: 20/100 (20%) complete, 19 ok, 1 failed, 2.51 rec/s, eta 31.9s
📈 LLM_TEXT column 'response' progress: 30/100 (30%) complete, 28 ok, 2 failed, 2.48 rec/s, eta 28.2s
...
📈 LLM_TEXT column 'response' progress: 100/100 (100%) complete, 97 ok, 3 failed, 2.52 rec/s, eta 0.0s
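
The ETA values in the sample lines are consistent with dividing the remaining records by the reported rate (for example, 90 / 2.45 ≈ 36.7 s). A small sketch of that arithmetic, assuming a hypothetical helper named `format_progress` and leaving out the emoji prefix:

```python
def format_progress(label: str, completed: int, total: int,
                    ok: int, failed: int, rate: float) -> str:
    """Build a progress line in the style of the example output.

    Hypothetical helper for illustration; the PR may compute these values differently.
    """
    percent = 100 * completed // total if total else 100
    # ETA is the remaining record count divided by the observed rate (records/second).
    eta = (total - completed) / rate if rate > 0 else 0.0
    return (
        f"{label} progress: {completed}/{total} ({percent}%) complete, "
        f"{ok} ok, {failed} failed, {rate:.2f} rec/s, eta {eta:.1f}s"
    )


print(format_progress("LLM_TEXT column 'response'", 20, 100, 19, 1, 2.51))
```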

Test plan

  • Run dataset generation with cell-by-cell columns and verify progress logs appear
  • Verify progress updates at ~10% intervals
  • Test with varying batch sizes to confirm interval calculation works correctly
  • Unit tests for ProgressTracker class (19 tests covering thread safety, edge cases, logging)
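
For the batch-size check in the test plan, one plausible interval rule is integer division by ten with a floor of one record. The function name `progress_interval` and the rule itself are assumptions for this sketch; they agree with the "every 10 record(s)" line for 100 records but are not confirmed by the PR.

```python
def progress_interval(total_records: int) -> int:
    """Assumed rule: report about every 10% of records, never less than every record."""
    return max(1, int(total_records) // 10)


# Try a range of batch sizes, including small and empty ones.
for total in (0, 5, 25, 100, 1000):
    print(f"{total} records: report every {progress_interval(total)} record(s)")
```

With this rule a batch of 25 records would report every 2 records (8%), which is why the interval is only approximately 10%.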

Adds detailed progress logging during cell-by-cell column generation
with thread-safe counters. Reports progress every ~10% with completion
percentage, success/failure counts, processing rate, and ETA.
@eric-tramel requested a review from a team as a code owner January 28, 2026 21:06
Move progress tracking logic from nested functions with nonlocal
variables into a separate, testable ProgressTracker class. This
improves code readability and testability while maintaining the
same functionality.
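
To illustrate the kind of refactor this commit describes, here is a generic before-and-after sketch. Both `make_counter` and `Counter` are hypothetical names; neither is taken from the PR.

```python
def make_counter():
    """Closure style: state lives in an enclosing scope and is rebound via nonlocal."""
    completed = 0

    def record() -> int:
        nonlocal completed
        completed += 1
        return completed

    return record


class Counter:
    """Class style: the same state as an attribute, which a test can inspect directly."""

    def __init__(self) -> None:
        self.completed = 0

    def record(self) -> int:
        self.completed += 1
        return self.completed
```

The closure hides `completed` from callers, so a test can only observe return values. The class exposes the same state as an attribute, which tends to make assertions simpler.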
The ProgressTracker requires total_records to be an integer for
comparison operations.
Add 19 tests covering:
- Initialization and configuration
- Success and failure recording
- Thread safety under concurrent access
- Logging behavior at intervals
- Edge cases (zero records, small totals)
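
A thread-safety test of the kind listed above might look like the following sketch. `CountingTracker` is a stand-in defined here so the example is self-contained; it is not the PR's `ProgressTracker`, and the test names are illustrative.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class CountingTracker:
    """Stand-in for the class under test (hypothetical, not the PR's code)."""

    def __init__(self, total_records: int) -> None:
        self.total_records = int(total_records)
        self.succeeded = 0
        self.failed = 0
        self._lock = threading.Lock()

    @property
    def completed(self) -> int:
        return self.succeeded + self.failed

    def record(self, success: bool) -> None:
        # The lock makes each increment atomic with respect to other threads.
        with self._lock:
            if success:
                self.succeeded += 1
            else:
                self.failed += 1


def test_counts_are_consistent_under_concurrency() -> None:
    tracker = CountingTracker(total_records=1000)
    # Every tenth record fails: 100 failures and 900 successes.
    outcomes = [i % 10 != 0 for i in range(1000)]
    with ThreadPoolExecutor(max_workers=8) as pool:
        list(pool.map(tracker.record, outcomes))
    assert tracker.succeeded == 900
    assert tracker.failed == 100
    assert tracker.completed == tracker.total_records


def test_zero_records_starts_empty() -> None:
    tracker = CountingTracker(total_records=0)
    assert tracker.completed == 0
```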
@eric-tramel self-assigned this Jan 28, 2026
@eric-tramel added the enhancement (New feature or request) label Jan 28, 2026
@johnnygreco (Contributor) left a comment

Love pulling it out into a class like this! Thanks @eric-tramel!

Just the smallest of small nits

…_builders/utils/progress_tracker.py

Co-authored-by: Johnny Greco <jogreco@nvidia.com>
3 participants