Conversation

@zasdfgbnm
Collaborator

No description provided.

@github-actions

github-actions bot commented Dec 17, 2025

Review updated until commit 1092227

Description

  • Enhanced PrecomputedValues with detailed error messages and IR node tracking for better debugging

  • Added comprehensive debug logging throughout binding and validation processes

  • Improved tensor metadata binding with detailed debug output

  • Added two new test cases for stream-based matrix multiplication operations

  • Fixed unshardedSizes to handle allocation domains properly

Changes walkthrough

Relevant files

Enhancement

evaluator_common.cpp (csrc/evaluator_common.cpp, +62/-16)
Enhanced debugging and error reporting in PrecomputedValues

  • Enhanced bindValue calls to include IR node parameter for better
    tracking
  • Improved validation error messages with IR node details and binding
    history
  • Added debug logging throughout binding and validation processes
  • Enhanced tensor metadata binding with detailed debug output

evaluator_common.h (csrc/evaluator_common.h, +20/-5)
Updated PrecomputedValues interface for IR node tracking

  • Updated bindValue method signatures to include optional IR node
    parameter
  • Modified binding log to store tuples with IR node information
  • Added debug output for rebinding scenarios with value comparison

fusion_executor_cache.cpp (csrc/runtime/fusion_executor_cache.cpp, +6/-0)
Added debug logging for fusion execution phases

  • Added debug output for fusion IR printing before execution
  • Added debug logging for compilation and execution phases

fusion_kernel_runtime.cpp (csrc/runtime/fusion_kernel_runtime.cpp, +3/-0)
Enhanced debugging in fusion compilation process

  • Added debug logging for PrecomputedValues creation and binding process
  • Enhanced compilation phase debugging output

Bug fix

execution_utils.cpp (csrc/multidevice/execution_utils.cpp, +6/-0)
Fixed unshardedSizes to handle allocation domains

  • Added check for allocation domain in unshardedSizes function
  • Fixed handling of sharded allocation domains

Tests

test_stream.cpp (tests/cpp/test_stream.cpp, +94/-0)
Added stream-based matrix multiplication test cases

  • Added Matmul test case with stream parallelization
  • Added TwoMatmuls test case with stream-based input sharding
  • Both tests validate stream-based matrix multiplication operations

    PR Reviewer Guide

    Here are some key observations to aid the review process:

    🧪 PR contains tests
    ⚡ Recommended focus areas for review
    Performance Impact

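    Debug output was added throughout the codebase, especially in evaluator_common.cpp. In the snippet below, the debug() and toString() calls appear to run unconditionally on every rebinding, so the cost of building those strings is paid even when nobody reads the output. The debug statements should be guarded by a debug option or removed before merging.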

    void bindValue_(
        int index,
        const PolymorphicValue& value,
        const Val* ir_node = nullptr) {
      if (index < 0 || is_constant_[index]) {
        return;
      }
    
      // Debug: show if we're rebinding a value
      if (defined_[index]) {
        debug() << "[DEBUG] REBINDING index " << index;
        if (ir_node != nullptr) {
          debug() << " (node: " << ir_node->toString() << ")";
        }
        debug() << " from " << PolymorphicValue_functions::toString(values_[index]) 
                << " to " << PolymorphicValue_functions::toString(value) << std::endl;
      }
    
      defined_[index] = true;
      values_[index] = value;
      binding_log_.emplace_back(index, value, ir_node);
      validate();
    }
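
One way to guard such output is to check a flag before any string is built. The following is a minimal sketch, not code from this PR: `Value`, `toString`, `g_debug_enabled`, and `logRebinding` are hypothetical stand-ins, and a real change would presumably use the project's own debug option mechanism.

```cpp
#include <cassert>
#include <iostream>
#include <sstream>
#include <string>

// Hypothetical stand-in for a value whose string conversion is expensive.
struct Value {
  int v;
};

// Counts how often formatting actually runs, to make the cost visible.
int g_format_calls = 0;

std::string toString(const Value& x) {
  ++g_format_calls;
  return std::to_string(x.v);
}

// Hypothetical runtime switch (assumption: the real project has its own).
bool g_debug_enabled = false;

// Logs a rebinding only when debugging is on. The early return means
// toString() is never called in the disabled case.
void logRebinding(
    std::ostream& os,
    int index,
    const Value& from,
    const Value& to) {
  if (!g_debug_enabled) {
    return;
  }
  os << "[DEBUG] REBINDING index " << index << " from " << toString(from)
     << " to " << toString(to) << '\n';
}
```

With the flag off, a call to `logRebinding` writes nothing and never invokes `toString`; with it on, the message is formatted as before.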
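    API Change

    The bindValue method signatures now take an optional ir_node parameter. Because the parameter defaults to nullptr, existing call sites continue to compile unchanged, so the change is source-compatible. It is still an interface change worth reviewing: call sites that want IR node tracking must be updated to pass the node, and the binding log now stores it alongside each value.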

    void bindValue(int index, const T& value, const Val* ir_node = nullptr) {
      bindValue_(index, PolymorphicValue(value), ir_node);
    }
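
The effect of the defaulted parameter can be illustrated with a small, self-contained sketch. `Node` and `Binder` are hypothetical stand-ins, not the PR's `Val` and `PrecomputedValues` types; only the shape of the signature mirrors the snippet above.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for an IR node type.
struct Node {
  std::string name;
};

class Binder {
 public:
  // One log entry per binding; node is nullptr when the caller passed none.
  struct Entry {
    int index;
    long value;
    const Node* node;
  };

  // The trailing parameter defaults to nullptr, so existing two-argument
  // calls keep compiling and simply record no node.
  void bindValue(int index, long value, const Node* ir_node = nullptr) {
    log_.push_back({index, value, ir_node});
  }

  const std::vector<Entry>& log() const {
    return log_;
  }

 private:
  std::vector<Entry> log_;
};
```

An old-style call such as `b.bindValue(0, 4)` and a new-style call such as `b.bindValue(1, 8, &n)` both compile against the same signature; only the second carries node information into the log.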
    Validation Frequency

    The validate() function is now called at the end of every bindValue_ call (line 235 in evaluator_common.h). Each call iterates over the entire binding_log_, so binding n values performs on the order of n^2 comparisons in total. Consider validating once, after all binding operations are done.

    void PrecomputedValues::validate() {
      FUSER_PERF_SCOPE("PrecomputedValuess::Validate");
      using namespace PolymorphicValue_functions;
      for (const auto& [index, expected_value, ir_node] : binding_log_) {
        if (!isSame(values_[index], expected_value)) {
          std::stringstream error_msg;
          error_msg << "Precomputed values failed to validate.\n"
                    << "Something unexpected changed between the compilation and "
                       "execution.\n";
          if (ir_node != nullptr) {
            error_msg << "IR node: " << ir_node->toString() << "\n";
          }
          error_msg << "Computed value: " << toString(values_[index]) << "\n"
                    << "Expected value: " << toString(expected_value);
    
          // Debug: Show binding history for this index
          debug() << "[DEBUG] ===== VALIDATION FAILED =====" << std::endl;
          debug() << "[DEBUG] Binding history for index " << index << ":" << std::endl;
          for (const auto& [idx, val, node] : binding_log_) {
            if (idx == index) {
              debug() << "[DEBUG]   Bound to: " << toString(val);
              if (node != nullptr) {
                debug() << " (node: " << node->toString() << ")";
              }
              debug() << std::endl;
            }
          }
          debug() << "[DEBUG] ================================" << std::endl;
    
          NVF_ERROR(false, error_msg.str());
        }
      }
      has_valid_values_ = true;
    }
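
Deferring validation to the end of binding could look roughly like the sketch below. It is an illustration under simplifying assumptions, not the PR's implementation: `Bindings` is a hypothetical class, values are plain `long`, and failures are reported with `std::runtime_error` rather than `NVF_ERROR`.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical simplified container: bind() only records, validate() checks.
class Bindings {
 public:
  explicit Bindings(std::size_t n) : values_(n, 0) {}

  // Records the binding without validating, so a batch of n binds costs O(n).
  void bind(std::size_t index, long value) {
    values_[index] = value;
    log_.emplace_back(index, value);
    valid_ = false;
  }

  // Walks the log once. Throws if any logged value differs from the value
  // currently stored at that index, i.e. the index was rebound inconsistently.
  void validate() {
    ++validate_calls_;
    for (const auto& entry : log_) {
      if (values_[entry.first] != entry.second) {
        throw std::runtime_error("Precomputed values failed to validate.");
      }
    }
    valid_ = true;
  }

  bool valid() const {
    return valid_;
  }

  int validateCalls() const {
    return validate_calls_;
  }

 private:
  std::vector<long> values_;
  std::vector<std::pair<std::size_t, long>> log_;
  bool valid_ = false;
  int validate_calls_ = 0;
};
```

The caller binds all inputs and then calls `validate()` once. The trade-off is that an inconsistent rebinding is detected at the end of the batch rather than at the offending call, which makes the failure slightly harder to localize.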

    Test failures

    • (Medium, 15) NVFuser evaluator assertion failures in stream / multidevice matmul & linear tests

      Test Name A100 A100 (dist.) GB200 GB200 (dist.) H100 H100 (dist.) Source
      StreamTest.TwoMatmuls Link
      tests.python.direct.test_stream.test_two_matmuls_inlinable[nvfuser_direct_test=eager]
      tests.python.direct.test_stream.test_two_matmuls_inlinable[nvfuser_direct_test=lru_cache]
      tests.python.multidevice.test_overlap.test_row_parallel_linear_forward
    • (Medium, 6) NVFuser stream matmul profiler event count mismatch in tests.python.direct.test_stream

      Test Name A100 GB200 H100 Source
      tests.python.direct.test_stream.test_matmul[nvfuser_direct_test=eager]
      tests.python.direct.test_stream.test_matmul[nvfuser_direct_test=lru_cache]

    @zasdfgbnm
    Collaborator Author

    !test


    2 participants