Conversation

@csarofeen
Collaborator


Reduces cumulative template instantiation time by 22% (clang -ftime-trace).
Moves isSame, ceildiv, max, fmax, min, and fmin from the header to the .cpp file to reduce template instantiation costs. These functions use PolymorphicValue operators that trigger ForAllTypes recursion.
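For readers unfamiliar with the pattern: the fix is a plain header/implementation split, so that only one translation unit instantiates the PolymorphicValue operators. A minimal sketch with a hypothetical signature and body (the real declarations live in csrc/polymorphic_value.h):

  // polymorphic_value.h: declaration only. Including this header no longer
  // instantiates any PolymorphicValue operator templates.
  NVF_API bool isSame(const PolymorphicValue& a, const PolymorphicValue& b);

  // polymorphic_value.cpp: the one translation unit that pays the
  // instantiation cost of the operator over all PolymorphicValue alternatives.
  bool isSame(const PolymorphicValue& a, const PolymorphicValue& b) {
    return a == b;  // hypothetical body; the actual implementation may differ
  }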
@csarofeen
Collaborator Author

!test

@github-actions

github-actions bot commented Dec 24, 2025

Description

  • Move template-heavy functions from headers to implementation files to reduce compilation time

  • Relocate getDataType and castToDtype from type.h to polymorphic_value.cpp

  • Move operator-using functions (isSame, ceildiv, max, fmax, min, fmin) from polymorphic_value.h to cpp

  • Reduce cumulative template instantiation time by 22% (clang -ftime-trace)

Changes walkthrough

Relevant files (Enhancement)

csrc/polymorphic_value.cpp: Add template-heavy function implementations (+131/-0)

  • Added implementations for getDataType and castToDtype (moved from type.h)
  • Added implementations for isSame, ceildiv, max, fmax, min, fmin (moved from polymorphic_value.h)
  • These functions use for_all_types or operators that trigger heavy template instantiation

csrc/polymorphic_value.h: Convert inline functions to forward declarations (+15/-61)

  • Removed inline implementations of isSame, ceildiv, max, fmax, min, fmin
  • Added forward declarations with NVF_API for these functions
  • Moved implementations to polymorphic_value.cpp to reduce template bloat

csrc/type.h: Convert inline functions to forward declarations (+6/-57)

  • Removed inline implementations of getDataType and castToDtype
  • Added forward declarations with NVF_API for these functions
  • Moved implementations to polymorphic_value.cpp to reduce template instantiation

PR Reviewer Guide

Here are some key observations to aid the review process:

🧪 No relevant tests
⚡ Recommended focus areas for review

Template instantiation

The moved functions still use PolymorphicValue::for_all_types, which triggers heavy template instantiation. While this is now contained to a single translation unit, the performance impact should be measured to ensure the build-speed improvements are significant enough to justify the code reorganization.

      PolymorphicValue::for_all_types([&value, &dtype](auto _) {
        using T = typename decltype(_)::type;
        if constexpr (IsPrimitiveNativeType<T>::value) {
          if (value.is<T>()) {
            dtype = NativeTypeToDataType<T>::type;
          }
        } else if constexpr (std::is_same_v<T, std::vector<PolymorphicValue>>) {
          if (value.is<T>()) {
            const auto& vec = value.as<T>();
            size_t size = vec.size();
            NVF_CHECK(size > 0, "Empty array is not supported");
            dtype =
                ArrayType{std::make_shared<DataType>(getDataType(vec[0])), size};
          }
        } else if constexpr (std::is_same_v<T, Pointer>) {
          // For pointers in polymorphic value, we only store the data size of the
          // pointee, so it is impossible to infer the pointer type.
          NVF_CHECK(!value.is<T>(), "Can not infer pointer type.");
        } else if constexpr (std::is_same_v<T, StructHandle>) {
          if (value.is<T>()) {
            dtype = value.as<T>().type();
          }
        } else if constexpr (std::is_same_v<T, Opaque>) {
          if (value.is<T>()) {
            const auto& opaque = value.as<T>();
            dtype = DataType(OpaqueType{
                .type_info = opaque.any().type(), .size = opaque.size()});
          }
        }
      });
  NVF_CHECK(dtype.has_value(), "Unknown dtype for ", value.type().name());
  return dtype.value();
}
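For context on why this helper is expensive, here is a minimal, hypothetical sketch of a for_all_types-style visitor (not nvFuser's actual definition). The generic lambda's body is stamped out once per type in the list, in every translation unit that calls it, which is exactly the cost the PR confines to a single .cpp:

  template <typename T>
  struct TypeTag {
    using type = T;
  };

  template <typename... Ts>
  struct TypeList {
    // Invokes f once per type in Ts; each call instantiates f's body anew.
    template <typename F>
    static void for_all_types(F&& f) {
      (f(TypeTag<Ts>{}), ...);  // C++17 fold over the comma operator
    }
  };

  // Usage mirrors the quoted code above:
  //   TypeList<int, double, bool>::for_all_types([](auto tag) {
  //     using T = typename decltype(tag)::type;
  //     // one instantiation of this body per T
  //   });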
Error handling

The moved functions include NVF_CHECK statements for error handling. Ensure these error paths are properly tested and that the error messages remain clear and helpful for debugging.

    NVF_CHECK(size > 0, "Empty array is not supported");
    ...
    NVF_CHECK(!value.is<T>(), "Can not infer pointer type.");
    ...
    NVF_CHECK(dtype.has_value(), "Unknown dtype for ", value.type().name());
Test failures

• (High, 44) NCCL NVLS multicast binding errors in multi-device distributed tests (multidevice/* on dlcluster_viking_ci)

  Test Name (H100, dist.):
  tests.python.multidevice.test_communication.test_allgather
  tests.python.multidevice.test_communication.test_allgather_expanded_broadcast
  tests.python.multidevice.test_communication.test_allreduce
  tests.python.multidevice.test_communication.test_reduce_scatter
  tests.python.multidevice.test_communication.test_reduce_scatter_noncontiguous
  tests.python.multidevice.test_dtensor.test_column_parallel_linear
  tests.python.multidevice.test_dtensor.test_plus_one
  tests.python.multidevice.test_dtensor.test_row_parallel_linear
  tests.python.multidevice.test_expert_parallel.test_dispatch_and_combine
  tests.python.multidevice.test_matmul.test_column_parallel_grouped_mm
  ... with 34 more test failures omitted. Check internal logs.

• (High, 1) NCCL invalid group usage in the multidevice overlap test test_overlap_allgather_matmul_shard_outermost

  Test Name (H100, dist.):
  tests.python.multidevice.test_overlap.test_overlap_allgather_matmul_shard_outermost[backend_type=CommunicatorBackend.cuda]
