diff --git a/parallel_algorithms/P3732/numeric_ranges_algorithms.md b/parallel_algorithms/P3732/numeric_ranges_algorithms.md
index b959de4..c6b8275 100644
--- a/parallel_algorithms/P3732/numeric_ranges_algorithms.md
+++ b/parallel_algorithms/P3732/numeric_ranges_algorithms.md
@@ -1,8 +1,8 @@
---
title: Numeric range algorithms
-document: P3732R0
-date: 2025-06-14
+document: P3732R1
+date: 2025-07-15
audience: SG1,SG9
author:
  - name: Ruslan Arutyunyan
@@ -38,6 +38,28 @@ We propose `ranges` algorithm overloads (both parallel and non-parallel) for the

* Abhilash Majumder (NVIDIA)

+# Revision history
+
+## R0, submitted 2025-06-14
+
+R0 is the original draft, prepared before the June 2025 Sofia WG21 meeting. SG1 reviewed it during the Sofia meeting and gave the following feedback.
+
+- SG1 agrees (via poll 4/5/1/0/0) that users should have a way to specify an identity value. SG1 asks whether there is any need to specify this as a compile-time value, or whether a run-time-only interface would suffice. One concern is the potential cost of broadcasting an identity value at run time to all threads, versus initializing each thread's accumulator to a value known at compile time.
+
+- SG1 has no objection to adding `transform_*` variants of algorithms.
+
+- SG1 asks us to add `reduce_into` and `transform_reduce_into` (via poll 4/4/0/0/0), that is, versions of `reduce` and `transform_reduce` that write the reduction result to an output range of one element. (We asked SG1 to take this poll because LEWG rejected an analogous design for std::linalg reduction-like algorithms such as dot product and norms.)
+
+- SG1 members would like separate proposals on fixing _`movable-box`_ trivial copyability, and fixing performance issues with views in general.
+
+## R1, to be submitted 2025-07-15
+
+- Revise non-wording sections
+
+  - Explain `reduce_into` and `transform_reduce_into`
+
+  - Show different designs for specifying identity value
+
# What we propose

We propose `ranges` overloads (both parallel and non-parallel) of the following algorithms:
@@ -48,14 +70,18 @@ We propose `ranges` overloads (both parallel and non-parallel) of the following

* `exclusive_scan` and `transform_exclusive_scan`.

+These correspond to the existing algorithms with the same names in the `<numeric>` header.
+Therefore, we call them "numeric range(s) algorithms."
+
We also propose adding parallel and non-parallel convenience wrappers:

* `ranges::sum` and `ranges::product` for special cases of `reduce` with addition and multiplication, respectively; and

* `ranges::dot` for the special case of binary `transform_reduce` with transform `multiplies{}` and reduction `plus{}`.

-The following sections explain why we propose these algorithms and not others. This relates to other aspects of the design
-besides algorithm selection, such as whether to include optional projection parameters.
+The following sections explain why we propose these algorithms and not others.
+This relates to other aspects of the design besides algorithm selection,
+such as whether to include optional projection parameters.

# Design

@@ -171,7 +197,9 @@ as `ranges::transform_inclusive_scan(r, o, f, g)` with `g` as the transform oper
The binary variant of `transform_reduce` is different. Unlike `reduce` and most other numeric algorithms, it takes two
input sequences and applies a binary function to the pairs of elements from both sequences. Projections, being unary functions,
-cannot replace the binary transform function of the algorithm. 
`transform_view` is similarly of no help unless it is combined with +cannot replace the binary transform function of the algorithm. +Likewise, `transform_view` by itself cannot replace +the binary transform function unless it is combined with `zip_view` and operates on tuples of elements. `zip_transform_view` is a convenient way to express this combination; applying `reduce` to `zip_transform_view` gives the necessary result (code examples are shown below). @@ -226,7 +254,7 @@ assert(out2 == expected); ``` ::: -The code without projections using a single big lambda to express the binary operation. Users have to read the big lambda +The code without projections uses a single big lambda to express the binary operation. Users have to read the big lambda to see what it does. So does the compiler, which can hinder optimization if it's not good at inlining. In contrast, the version with projections lets users read out loud what it does. It also separates the "selection" or "query" part of the transform from the "arithmetic" or "computation" part. The power of @@ -235,11 +263,13 @@ natural to extend this separation to selection logic as well. ##### Unary transform -It's harder to avoid a lambda, as the function that does an operation, in the unary `transform` case. Most of the named -C++ Standard Library arithmetic function objects are binary. Currying them into unary functions in C++ requires either -making a lambda (which defeats the purpose somewhat) or using something like `std::bind_front` (which is verbose). On the -other hand, using a projection still has the benefit of separating the "selection" part of the transform from the -"computation" part. +In the unary `transform` case, it's harder to avoid using a lambda. +Most of the named C++ Standard Library arithmetic function objects are binary. +Currying them into unary functions in C++ requires either +making a lambda (which defeats the purpose somewhat) or +using something like `std::bind_front` (which is verbose). +On the other hand, using a projection still has the benefit of +separating the "selection" part of the transform from the "computation" part. ```c++ struct foo {}; @@ -326,10 +356,13 @@ assert(result_no_proj == 52); ##### Binary `transform_reduce` -As we described above, expressing the functionality of binary `transform_reduce` using only `reduce` requires `zip_transform_view` -or something like it, making the `reduce`-only version more verbose. Users may also find it troublesome that `zip_view` and `zip_transform_view` +As we explained above, expressing the functionality of binary `transform_reduce` +using only `reduce` requires `zip_transform_view` or something like it. +This makes the `reduce`-only version more verbose. +Users may also find it troublesome that `zip_view` and `zip_transform_view` are not pipeable: there is no `{v1, v2} | views::zip` syntax, for example. -On the other hand, it's a toss-up which version is easier to understand. Users either need to learn what a "zip transform view" does, +On the other hand, it's a toss-up which version is easier to understand. +Users either need to learn what `zip_transform_view` does, or they need to learn about `transform_reduce` and know which of the two function arguments does what. ```c++ @@ -361,11 +394,12 @@ elements from the two input ranges into a single value. The algorithm then reduc function and the initial value. 
It's perhaps misleading that this binary function is called a "transform"; it's
really a kind of "inner" reduction on corresponding elements of the two input ranges.

-One can imagine a ranges analog of C++17 binary `transform_reduce` that takes two projection functions, as in the example
-below. It's not too hard for a casual reader to tell that the last two arguments of `reduce` apply to each of the input
-sequences in turn, but that's still more consecutive function arguments than for any other algorithm in the C++ Standard
-Library. Without projections, users need to resort to `transform_view`, but this more verbose syntax makes it more
-clear which functions do what.
+One can imagine a ranges analog of C++17 binary `transform_reduce`
+that takes two projection functions, as in the example below.
+The result has four function arguments in a row,
+which is more than for any other algorithm in the Standard Library.
+Without projections, users need to resort to `transform_view`,
+but this more verbose syntax makes it clearer which functions do what.

```c++
struct foo {};
@@ -375,12 +409,12 @@ std::vector<std::pair<std::string, int>> v2{
  {"thirteen", 13}, {"seventeen", 17}, {"nineteen", 19}};
constexpr int init = 3;

-// With projections
+// With projections: 4 functions in a row
auto result_proj = std::ranges::transform_reduce(v1, v2, init,
  std::plus{}, std::multiplies{},
  get_element<0>{}, get_element<1>{});
assert(result_proj == 396);

-// Without projections
+// Without projections: clearer where get_element happens
auto result_no_proj = std::ranges::transform_reduce(
  std::views::transform(v1, get_element<0>{}),
  std::views::transform(v2, get_element<1>{}),
@@ -490,7 +524,8 @@ parallel execution.

Let's review what we learned from the above discussion.

-- In general and particularly for `ranges::transform`, projections improve readability and expose optimization potential,
+- Projections improve readability of `ranges::transform`.
+- Projections expose optimization potential,
  by separating the selection part of an algorithm from the computation part.
- None of the existing `fold_*` `ranges` algorithms (the closest things the Standard Library currently has to
  `ranges::reduce`) take projections.
@@ -526,6 +561,55 @@ reduction algorithms to have projections.
`ranges::transform_{in,ex}clusive_scan` as well as `ranges::{in,ex}clusive_scan`, and do not provide projections for
any of them.

+### `reduce_into` and `transform_reduce_into`
+
+We propose new algorithms `reduce_into` and `transform_reduce_into`.
+These work like `reduce` and `transform_reduce`,
+except that instead of returning the reduction result by value,
+they write it to the first element of an output range.
+
+The `reduce_into` algorithm has
+[precedent in the Thrust library](https://nvidia.github.io/cccl/thrust/api_docs/algorithms/reductions.html).
+Its performance advantage is that the algorithm can write its result
+directly to special memory associated with parallel execution,
+such as accelerator memory or a NUMA (Non-Uniform Memory Access) domain
+where the algorithm's threads run.
+
+#### Output should be a single iterator
+
+[@P3179R8] (parallel range algorithms) always specifies output ranges
+as sized ranges, instead of as single iterators.
+However, in the case of `*reduce_into`,
+the output range always has exactly one element.
+Thus, requiring a sized range would bring no safety improvement.
+Users would end up needing to go through a possibly error-prone
+syntax ritual to turn their one output iterator into a sized range. 
+The use cases below illustrate this.
+
+```c++
+std::vector<float> input_range{3.0f, 5.0f, 7.0f};
+float out_value{};
+// Assume my_alloc and deleter allocate and free a single float
+// in special memory (e.g., accelerator memory).
+unique_ptr<float[], decltype(deleter)> out{my_alloc(sizeof(float)), deleter};
+
+// Input sized range, output iterator
+ranges::reduce_into(input_range, /* ... */ &out_value);
+ranges::reduce_into(input_range, /* ... */ out.get());
+assert(out_value == out[0]);
+
+// Input sized range, output sized range (size 1)
+ranges::reduce_into(input_range, /* ... */ span{&out_value, 1});
+ranges::reduce_into(input_range, /* ... */ span{out.get(), 1});
+assert(out_value == out[0]);
+```
+
+#### Output iterator could be single-pass
+
+The output range needs to be copyable for parallelization reasons,
+so its iterator can't be merely an `output_iterator`.
+However, the iterator could be single-pass,
+because the algorithm writes at most one element, at most once.
+The Standard does not currently have an iterator category to express this distinction.
+
### We propose convenience wrappers to replace some algorithms

#### `accumulate`
@@ -585,27 +669,37 @@ We do not propose `adjacent_transform` for the reasons described above.

#### `partial_sum`

-The `partial_sum` algorithm performs operations sequentially. The existing ranges library does not have an equivalent
-algorithm with this left-to-right sequential behavior, nor do we propose such an algorithm. For users who want this
-behavior, [@P2760R1] suggests a view instead of an algorithm. [@P3351R2], "`views::scan`," proposes this view; it
-is currently in SG9 (Ranges Study Group) review.
-
-Users of `partial_sum` who are not concerned about the order of operations can call `inclusive_scan` instead, which we
-propose here. We considered adding a convenience wrapper for the same special case of an inclusive prefix plus-scan that
-`partial_sum` supports. However, names like `partial_sum` or `prefix_sum` would obscure whether this is an inclusive or
-exclusive scan. Also, we already have `std::partial_sum` that operates in order. Using the same name as a convenient wrapper
-on top of out-of-order `*_scan`, we propose in the paper, is misleading. We think it's not a very convenient convenience
-wrapper if users have to look these aspects up every time they use it.
-
-If WG21 did want a convenience wrapper, one option would be to give this common use case a longer but more explicit name,
+The `partial_sum` algorithm combines elements sequentially, from left to right.
+It behaves like an order-constrained version of `inclusive_scan`.
+
+Our proposal focuses on algorithms that permit reordering binary operations.
+For users who want an order-constrained partial sum,
+[@P3351R2], "`views::scan`," proposes a view with the same left-to-right behavior.
+This paper is currently in SG9 (Ranges Study Group) review.
+
+Users of `partial_sum` who are not concerned about the order of operations
+can call the `inclusive_scan` algorithm (proposed here) instead,
+as the sketch below illustrates.
+We considered adding a convenience wrapper for the same special case
+of an inclusive prefix plus-scan that `partial_sum` supports.
+However, names like `partial_sum` or `prefix_sum` would obscure
+whether this is an inclusive or exclusive scan.
+Also, the existing `partial_sum` algorithm operates left-to-right.
+A new algorithm with the same name and almost the same interface,
+but with a different order of operations, could be misleading.
+We think it's not a very convenient convenience wrapper
+if users have to look up its behavior every time they use it. 
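+
+Here is a minimal sketch of the correspondence mentioned above,
+assuming the `ranges::inclusive_scan` overloads proposed in this paper:
+
+```c++
+std::vector<int> in{5, 7, 11, 13, 17};
+std::vector<int> out1(in.size());
+std::vector<int> out2(in.size());
+
+// Left-to-right by specification.
+std::partial_sum(in.begin(), in.end(), out1.begin());
+
+// May reorder the additions; for an associative and commutative
+// operation like integer addition, the result is the same:
+// 5, 12, 23, 36, 53.
+std::ranges::inclusive_scan(in, out2, std::plus{});
+
+assert(out1 == out2);
+```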
+
+If WG21 did want a convenience wrapper, one option would be
+to give this common use case a longer but more explicit name,
like `inclusive_sum_scan`.

### We don't propose "the lost algorithm" (noncommutative parallel reduce)

The Standard lacks an analog of `reduce` that can assume associativity but not commutativity of binary operations.
-One author of this proposal refers to this as "the lost algorithm" (in e.g.,
-[Episode 25 of "ASDP: The Podcast"](https://adspthepodcast.com/2021/05/14/Episode-25.html)). We do not propose this
-algorithm, but we would welcome a separate proposal to do so.
+One author of this proposal refers to this as "the lost algorithm";
+see [Episode 25 of "ADSP: The Podcast"](https://adspthepodcast.com/2021/05/14/Episode-25.html).
+We do not propose this algorithm, but we would welcome a separate proposal to do so.

The current numeric algorithms express a variety of permissions to reorder binary operations.

@@ -636,7 +730,7 @@ with a two-sided identity element.

This proposal leaves the described algorithm out of scope. We think the right way would be to propose a new algorithm
with a distinct name. A reasonable choice of name would be `fold` (just `fold` by itself, not `fold_left` or `fold_right`).

-### We don't propose `reduce_with_iter`
+### We don't propose `reduce_with_iter` {#no-reduce-with-iter}

A hypothetical `reduce_with_iter` algorithm would look like `fold_left_with_iter`, but would permit reordering of binary
operations. It would return both an iterator to one past the last input element, and the computed value. The only reason
@@ -644,6 +738,8 @@ for a reduction to return an iterator would be if the input range is single-pass
input range really should be using one of the `fold*` algorithms instead of `reduce*`. As a result, we do not propose the
analogous `reduce_with_iter` here.

+Note that the previous paragraph effectively argues for `*reduce` to require at least forward ranges.
+
Just like `fold_left`, the `reduce` algorithm should return just the computed value. Section 4.4 of [@P2322R6] argues
that this makes it easier to use, and improves consistency with other `ranges` algorithms like `ranges::count` and
`ranges::any_of`. It is also consistent with [@P3179R8]. Furthermore, even if a `reduce_with_iter` algorithm were to
@@ -659,7 +755,7 @@ initial value. An example would be `min` on a range of `int` values, where calle
represents an actual value in the range, or a fake "identity" element (that callers may get as a result when the range
is empty).

-We do not propose `reduce_first` here, only outline arguments against and for adding it.
+We do not propose `reduce_first` here; we only outline arguments for and against adding it.

#### Arguments against `reduce_first`

@@ -680,8 +776,12 @@ See [](#initial-value-vs-identity) for more detailed analysis.
1. Some equivalent of `reduce_first` can be used as a building block for parallel reduction with unknown identity, if
   no other solution is proposed.
1. Even though `min_element`, `max_element`, and `minmax_element` exist, users may still want to combine multiple
   reductions into a single pass, where some of the reductions are min and/or max, while others have a natural identity.
-As an example, users may want the minimum of an array of integers (with no natural identity), along with the least
-index of the array element with the minimum value (whose natural identity is zero). This happens often enough that MPI
+
+As an example of combining multiple reductions into a single pass,
+users may want the minimum of an array of integers (with no natural identity),
+along with the least index of the array element with the minimum value
+(whose natural identity is zero).
+This happens often enough that MPI
(the Message Passing Interface for distributed-memory parallel computing) has predefined reduction operations for
minimum and its index (`MINLOC`) and maximum and its index (`MAXLOC`). On the other hand, even `MINLOC` and `MAXLOC`
have reasonable choices of fake "identity" elements that work in practice, e.g., for `MINLOC`, `INT_MAX` for the minimum value
@@ -727,30 +827,47 @@ requirement for the non-parallel algorithms we propose. This leaves us with two

* (multipass) forward ranges.

-The various reduction and scan algorithms we propose can combine the elements of the range in any order. For this reason,
-we make the non-parallel algorithms take (multipass) forward ranges, even though this is not consistent with the existing
-non-parallel `<numeric>` algorithms. If users have single-pass iterators, they should just call one of the `fold_*`
-algorithms, or use the `views::scan` proposed elsewhere. This has the benefit of letting us specify `ranges::reduce`
-to return just the value. We don't propose a separate algorithm `reduce_with_iter`, as we explain elsewhere in this
-proposal.
+We believe there is no value in `*reduce` and `*_scan` taking single-pass input ranges,
+because these algorithms can combine the elements of their input range(s) in any order.
+Suppose that an algorithm had that freedom to rearrange operations,
+yet was constrained to read the input ranges exactly once, in left-to-right order.
+The only way such an algorithm could exploit that freedom
+would be to copy the input ranges into temporary storage.
+Users who want that could just copy the input ranges into contiguous storage themselves.
+
+For this reason, we make the non-parallel algorithms take (multipass) forward ranges,
+even though this is not consistent with the existing non-parallel `<numeric>` algorithms.
+If users have single-pass iterators, they should just call one of the `fold_*` algorithms,
+or use the `views::scan` proposed in [@P3351R2].
+This has the benefit of letting us specify `ranges::reduce` to return just the value.
+We don't propose a separate `reduce_with_iter` algorithm
+to return both the value and the past-the-end input iterator,
+as we explain [in the relevant section](#no-reduce-with-iter).

## Constexpr parallel algorithms?

[@P2902R1] proposes to add `constexpr` to the parallel algorithms. [@P3179R8] does not object to this; see Section 2.10.
We continue the approach of [@P3179R8] in not opposing [@P2902R1]'s approach, but also not depending on it.

-## Reduction's initial value vs. its identity element {#initial-value-vs-identity}
+## Specifying a reduction's identity element {#initial-value-vs-identity}

-It's important to distinguish between a reduction's initial value, and its identity element. C++17's `std::reduce` takes an
-optional initial value `T init` that is included in the terms of the reduction. This is not necessarily the same as the
-identity element for a reduction, which is a value that does not change the reduction's result, no matter how many times it
-is included. The following example illustrates.
+### Initial value not necessarily the same as identity value
+
+It's important to distinguish between a reduction's initial value and its identity value.
+C++17's parallel `*reduce` algorithms take an optional *initial value* `T init`.
+When omitted, it defaults to a value-initialized `T{}`,
+and it is included in the terms of the reduction.
+The initial value is not necessarily the same as an *identity value* for a reduction,
+which is a value that does not change the reduction's result,
+no matter how many times it is included as a term.
+We say "an" identity value because it need not be unique.
+An identity value can serve as an initial value, but not vice versa.
+The following example illustrates.

```c++
std::vector<float> v{5.0f, 7.0f, 11.0f};

// Default initial value is float{}, which is 0.0f.
-// It is also the identity for std::plus<>, the default operation
+// It is also the identity for std::plus<>, the default operation.
float result = std::reduce(v.begin(), v.end());
assert(result == 23.0f);

@@ -771,31 +888,36 @@ result = std::reduce(v.begin(), v.end(), 0.0f);
assert(result == 23.0f);
```

-The identity element can serve as an initial value, but not vice versa. This is especially important for parallelism.
-
-#### Initial value matters most for sequential reduction
+### Initial value matters most for sequential reduction

-From the serial execution perspective, it is easy to miss importance of the reduction identity. Let's consider typical code
-that sums elements of an indexed array.
+Users who never use parallel reductions may miss the importance of the reduction identity.
+Let's consider typical code that sums elements of an indexed array.

```c++
float sum = 0.0f;
for (size_t i = 0; i < v.size(); ++i) {
  sum += v[i];
}
```

+Here the initial value `0.0f` is simply the first term of the sum;
+nothing requires it to be an identity value of the binary operation.
+
+### Identity matters for parallel reduction
+
+A parallel reduction partitions its input and gives each thread (or other execution agent)
+its own accumulator. Each accumulator must start at a value that does not change
+the result, that is, at an identity value of the binary operation.
+
+#### Deducing a default identity value
+
+For well-known binary operations, an implementation could deduce
+a reasonable default identity value from the operation and the range's value type.
+
+```c++
+// Default binary operation std::plus<>;
+// deduce identity: range_value_t<R>{}
+auto result1 = std::ranges::reduce(exec_policy, range);
+
+// Deduce identity: range_value_t<R>{}
+auto result2 = std::ranges::reduce(exec_policy, range, std::plus{});
+
+// If range_value_t<R> is arithmetic or std::complex,
+// deduce identity: range_value_t<R>(1)
+auto result3 = std::ranges::reduce(exec_policy, range, std::multiplies{});
+
+// Should this even work?
+// The "identity" is really -Inf if range_value_t<R> has that,
+// but this is one of those cases where users might not like that behavior.
+auto result4 = std::ranges::reduce(exec_policy, range, std::ranges::min);
+```
+
+#### Do not force `T{}` to be an identity value
+
+The identity value of a binary operator that returns `T`
+is not necessarily `T{}` (a value-initialized `T`)
+for all operators and types.
+
+- For `std::multiplies{}` it's `T(1)`.
+
+- For "addition" in the max-plus ("tropical") algebra it's `-Inf`.
+
+We don't want to force users to wrap reduction result types
+so that `T{}` defines the identity (if it exists) for `operator+(T, T)`.
+
+- What if the operator has no identity value?
+
+- What if `T` differs from the input range's value type?
+
+- What if users want to use the same value type for different binary operators, such as `double` for `plus`, `multiplies`, and `ranges::max`?
+
+- If we make users write a nondefaulted default constructor for `T`, they are more likely to make `T` not trivially constructible, and thus hinder optimizations.
+
+Note that this differs from std::linalg:
+"A value-initialized object of linear algebra value type shall act as the additive identity"
+([linalg.reqs.val] 3). However:
+
+- std::linalg does not take user-defined binary operators; it always uses `operator+` for reductions.
+
+- std::linalg needs "zero" for reasons other than reductions, e.g., its support for user-defined complex number types (_`imag-if-needed`_).
+
+For these reasons, we think it's reasonable
+to make a different design choice for numeric range algorithms.
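+
+To illustrate why `T{}` cannot serve as a universal identity,
+here is a minimal sketch using the existing C++17 `std::reduce`,
+where the caller supplies the identity by hand as the initial value:
+
+```c++
+std::vector<double> v{2.0, 4.0, 8.0};
+
+// The multiplicative identity is 1.0. Seeding with T{} == 0.0
+// would zero out the entire product.
+double product = std::reduce(v.begin(), v.end(), 1.0, std::multiplies{});
+assert(product == 64.0);
+
+// "Addition" in the max-plus ("tropical") algebra: the identity is -Inf.
+double mx = std::reduce(v.begin(), v.end(),
+  -std::numeric_limits<double>::infinity(),
+  [](double a, double b) { return std::max(a, b); });
+assert(mx == 8.0);
+```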
+
+### Interface sketches for specifying identity
+
+#### Design goals
+
+1. Avoid confusion with the C++17 `std::reduce` initial value.
+
+    - Strongly counterindicates designs that just pass in the identity as `T id`.
+      Doing that would easily confuse users who want to switch
+      from C++17 numeric algorithms to the new ranges versions.
+
+2. Let users specify a different binary operation and identity for the same reduction result type.
+
+    - Counterindicates a purely compile-time traits system.
+      For an example of such a system, please see the
+      ["Reduction Variables"](https://github.khronos.org/SYCL_Reference/iface/reduction-variables.html)
+      section of the SYCL Reference.
+
+3. Let users specify an identity even if their binary operation is a lambda.
+
+    - Also counterindicates a purely compile-time traits system.
+
+4. Let users use the identity value to specify a reduction result type
+   that differs from `range_value_t`,
+   just as they can use the initial value's type to do that with C++17 `reduce`.
+
+5. Provide a default identity value if possible.
+
+    - Conform to the fold expression default identity if possible.
+
+    - Algorithms must know both `range_value_t` and the binary operation in order to deduce a default identity value.
+
+6. Let users specify a nondefault identity value "in line" with invoking the algorithm, without needing to specialize a class.
+
+#### Design 1: `reduce_identity{value}`
+
+Users would pass in their identity value
+by wrapping it in a named struct `reduce_identity`.
+
+```c++
+template<class T>
+struct reduce_identity {
+  T value{};
+};
+```
+
+The default identity value would be `T{}`, because the default needs to be _something_.
+Users would have two ways to provide a nondefault value.
+
+1. Construct `reduce_identity` with a nondefault value using aggregate initialization:
+   `reduce_identity{nondefault_value}`
+
+2. Specialize `reduce_identity` so that `declval<reduce_identity<T>>().value` is the value
+
+For example, users could inherit their specialization from `constant_wrapper`.
+
+```c++
+namespace impl {
+  inline constexpr my_number some_value = /* value goes here */;
+}
+template<>
+struct reduce_identity<my_number> :
+  constant_wrapper<impl::some_value>
+{};
+```
+
+Here are some use cases.
+
+```c++
+// User explicitly opts into "most negative integer"
+// as the identity for min. This should not be the default,
+// as the C++ Standard Library has no way to know
+// whether this represents a valid input value.
+constexpr auto lowest = std::numeric_limits<int>::lowest();
+auto result5 = std::ranges::reduce(exec_policy, range,
+  std::ranges::min, reduce_identity{lowest});
+
+// range_value_t<R> is float, but the identity value is double
+// (even though it's otherwise the default value, zero).
+// std::plus should use operator()(double, double) -> double.
+auto result6 = std::ranges::reduce(exec_policy, range,
+  std::plus{}, reduce_identity{0.0});
+```
+
+Advantages of this approach:
+
+- Users would see in plain text the purpose of this function argument.
+
+- Algorithms could overload on it without risk of ambiguity.
+
+- The struct is an aggregate, which would maximize potential for optimizations.
+
+- It would not impose requirements on the user's binary function.
+
+Disadvantages:
+
+- The algorithm could not use this to deduce a default identity value from a binary operation.
+
+- A specialization of `reduce_identity` would take effect for all binary operations on `T`.
+
+#### Design 2: `reduce_operation{binary_op, value}`
+
+The `reduce_operation` struct would encapsulate the binary operation
+and its identity value into a single argument.
+This would make it easier for an implementation to deduce a default identity value.
+The default identity would match the identity for the corresponding binary fold.
+Users' specializations would only affect that (operation, value type) pair.
+It's a named struct, so algorithms could overload on it without ambiguity.
+
+Here are some use cases.
+
+```c++
+auto result7 = std::ranges::reduce(exec_policy, range,
+  reduce_operation{custom_binary_op, custom_value});
+
+constexpr auto minus_Inf = -std::numeric_limits<float>::infinity();
+auto result5 = std::ranges::reduce(exec_policy, range,
+  reduce_operation{std::ranges::min, minus_Inf});
+
+// range_value_t<R> is float
+auto result6 = std::ranges::reduce(exec_policy, range,
+  reduce_operation{std::plus{}, /* double */ 0.0});
+
+// Deduce identity: range_value_t<R>{}
+auto result2 = std::ranges::reduce(exec_policy, range, std::plus{});
+```
+
+Here is an implementation sketch.
+
+```c++
+// Keep this an aggregate if possible.
+// We can constrain BinaryOp on being invocable with T, T,
+// but we don't know range_value_t<R> at this point,
+// so we don't know yet if the algorithm is well-formed with BinaryOp.
+template<class BinaryOp, class T>
+  requires semiregular<BinaryOp> && invocable<BinaryOp&, T, T>
+struct reduce_operation {
+  BinaryOp op;
+  T value{};
+};
+
+// std::plus with arithmetic types -> T{}.
+//
+// We can't specialize for all T, because users might have
+// defined std::plus<U> such that T{} is not the identity.
+template<class U, class T>
+  requires(
+    (is_arithmetic_v<T>) &&
+    (is_void_v<U> || (is_arithmetic_v<U> && is_convertible_v<T, U>))
+  )
+struct reduce_operation<std::plus<U>, T> {
+  static constexpr std::plus<U> op{};
+  static constexpr T value = T{};
+};
+
+// (std::plus or std::multiplies) with std::complex takes more effort,
+// since e.g., std::complex<double> + float is not well-formed.
+
+// std::multiplies with arithmetic types -> T(1).
+//
+// We can't specialize for all T, because users might have
+// defined std::multiplies<U> such that T(1) is not the identity.
+template<class U, class T>
+  requires(
+    (is_arithmetic_v<T>) &&
+    (is_void_v<U> || (is_arithmetic_v<U> && is_convertible_v<T, U>))
+  )
+struct reduce_operation<std::multiplies<U>, T> {
+  static constexpr std::multiplies<U> op{};
+  static constexpr T value = T(1);
+};
+
+// ... more specializations for Standard binary function objects ...
+```
+
+This is a more complicated design than the `reduce_identity` struct shown above.
+However, it offers a straightforward path to providing a reasonable default
+for commonly used binary operations.
+
+This design does impose the requirement that `BinaryOp`
+must be copy-constructible or move-constructible.
+That's fine for parallel algorithms, but not strictly necessary for non-parallel algorithms.
+On the other hand, users who want a noncopyable `BinaryOp` should probably use `fold_*` instead.
+
+### If users can specify an identity value, do they need an initial value?
+
+#### `*reduce` algorithms should not take both
+
+- Providing both would confuse users and would specify the result type redundantly.
+
+- There is no performance benefit to providing an initial value if an identity value is known.
+
+```c++
+std::vector<int> v{5, 11, 7};
+const int max_identity = std::numeric_limits<int>::lowest();
+
+// identity as initial value
+int result1 = ranges::reduce(v, max_identity, ranges::max);
+assert(result1 == 11);
+
+// identity as, well, identity
+int result2 = ranges::reduce(v,
+  reduce_operation{ranges::max, max_identity});
+assert(result2 == 11);
+
+std::vector<int> empty_vec;
+int result3 = ranges::reduce(empty_vec,
+  reduce_operation{ranges::max, max_identity});
+assert(result3 == max_identity);
+```
+
+#### `*_scan` algorithms would benefit from an initial value
+
+- The initial value affects every element of the output.
+
+- Without it, one would need an extra `transform` pass over the output.
+
+- For exclusive scan, one can't use `transform_exclusive_scan` to work around a non-identity initial value.
+
+```c++
+std::vector<int> in{5, 7, 11, 13, 17};
+std::vector<int> out(size_t(5));
+const int init = 3;
+auto binary_op = std::plus{};
+
+// out: 8, 15, 26, 39, 56
+ranges::inclusive_scan(in, out, binary_op, init);
+
+// out: 3, 8, 15, 26, 39
+// Unlike C++17 exclusive_scan, init goes after binary_op;
+// see the next example block for why.
+ranges::exclusive_scan(in, out, binary_op, init);
+
+// Applying the initial value through the unary op transforms
+// every element, so it does NOT reproduce the scans above.
+auto unary_op = [op = binary_op] (auto x) { return op(x, 3); };
+
+// out: 8, 18, 32, 48, 68 -- not the desired 8, 15, 26, 39, 56
+ranges::transform_inclusive_scan(in, out, binary_op, unary_op);
+
+// out: 0, 8, 18, 32, 48 -- not the desired 3, 8, 15, 26, 39
+// (using the additive identity 0 as the initial value)
+ranges::transform_exclusive_scan(in, out, binary_op, unary_op, 0);
+```
+
+#### Avoid mixing up identity and initial value
+
+C++17 `*reduce` and `*_scan` take the initial value as a bare, undecorated `T init`.
+
+If the new algorithms take a bare `T identity`, then users could be confused when switching from C++17 to the new algorithms.
+
+"Decorating" the identity by wrapping it in a struct prevents confusion. It also lets algorithms take both an initial value and an identity.
+
+```c++
+std::vector<int> in{-8, 6, -4, 2, 0, 10, -12};
+std::vector<int> out(size_t(7));
+const int init = 7;
+auto binary_op = std::ranges::max;
+
+// inclusive_scan doesn't need an initial value.
+
+// out: -8, 6, 6, 6, 6, 10, 10
+std::ranges::inclusive_scan(in, out, binary_op);
+
+// out: 7, 7, 7, 7, 7, 10, 10
+std::ranges::inclusive_scan(in, out, binary_op, init);
+
+// Suppose the user knows that they
+// will never see values smaller than -9.
+const int identity_value = -10;
+
+// out: 7, 7, 7, 7, 7, 10, 10
+std::ranges::inclusive_scan(in, out,
+  reduce_operation{binary_op, identity_value},
+  init);
+
+// exclusive_scan needs an initial value.
+// The identity is a reasonable default initial value,
+// if you have it.
+//
+// C++17 *exclusive_scan puts init left of binary_op,
+// while inclusive_scan puts init right of binary_op.
+// We find this weird, so we don't do it.
+
+// out: 7, 7, 7, 7, 7, 7, 10
+std::ranges::exclusive_scan(in, out, binary_op, init);
+
+// out: -10, -8, 6, 6, 6, 6, 10
+std::ranges::exclusive_scan(in, out,
+  reduce_operation{binary_op, identity_value});
+
+// out: 7, 7, 7, 7, 7, 7, 10
+std::ranges::exclusive_scan(in, out,
+  reduce_operation{binary_op, identity_value}, init);
+```

-Based on the above considerations, we conclude that there are good reasons to consider a mechanism for users to explicitly
-specify the identity element for parallel reduction. There are options of how that could be achieved, of which we list a few.
+
+### Conclusions

-- Add an optional extra parameter for the value of identity, defaulting to value initialization.
-- Change the meaning of the `init` parameter for parallel algorithms to represent identity instead of the initial value.
-- Provide a customization point similar to `sycl::known_identity` that also defaults to value initialization but can be
-  specialized for a given operation.
-Similarly to `std::linalg`, require that for numeric parallel algorithms a value-initialized object shall act as the identity element.
+It's important for both performance and functionality
+that users be able to specify an identity value for parallel reductions.
+Designs for this should avoid confusion when switching from
+the C++17 parallel numeric algorithms to the new ranges versions.
+We would like feedback from SG9 and LEWG on their preferred design.

-At this point, we do not propose any of these options. We would like to hear feedback from SG1 and SG9 on exploring this further.
+Our proposed `*reduce` algorithms do not need an initial value parameter.
+For our proposed `*_scan` algorithms, an initial value could improve performance
+in some cases by avoiding an additional pass over all the output elements.
+The `*exclusive_scan` algorithms need an initial value,
+because it defines the first element of the output range.
+The initial value could default to the identity, if it exists and is known.

## `ranges::reduce` design

In this section, we focus on `ranges::reduce`'s design. The discussion here applies generally to the other algorithms
we propose.

-### No default parameters
+### No default binary operation or initial value

Section 5.1 of [@P2760R1] states:

-> One thing is clear: `ranges::reduce` should *not* take a default binary operation *nor* a default initial [value]
-> parameter. The user needs to supply both.
+> One thing is clear: `ranges::reduce` should *not* take a default binary operation
+> *nor* a default initial [value] parameter. The user needs to supply both.

This motivates the following convenience wrappers:
@@ -896,8 +1376,9 @@ This motivates the following convenience wrappers:

- `ranges::product(r)` for `ranges::reduce` with `init = range_value_t<R>(1)` and `multiplies{}` as the reduce
  operation; and
- `ranges::dot(x, y)` for binary `ranges::transform_reduce` with `init = T()` where
-  `T = decltype(declval<range_value_t<R1>>() * declval<range_value_t<R2>>())`, `multiplies{}` as the transform operation,
-  and `plus{}` as the reduce operation.
+  `T = decltype(declval<range_value_t<R1>>() * declval<range_value_t<R2>>())`,
+  `multiplies{}` is the transform operation,
+  and `plus{}` is the reduce operation.

One argument *for* a default initial value in `std::reduce` is that `int` literals like `0` or `1` do not behave in the
expected way with a sequence of `float` or `double`. For `ranges::reduce`, however, making its return value type imitate
@@ -1032,8 +1513,10 @@ position.) We express what `reduce` does using *GENERALIZED_SUM*.

## Enabling list-initialization for proposed algorithms

-Our proposal follows the same principles as described in [@P2248R8] paper. We want to enable the use case with constructing
-`init` from curly braces.
+Our proposal follows the same principles as described in [@P2248R8],
+"Enabling list-initialization for algorithms."
+We want to enable the use case where users construct a nondefault initial value
+using curly braces without naming the type.

```c++
#include <numeric>
#include <vector>

int main() {
  std::vector<double> v{0.25, 0.75};
  // The initial value is list-initialized from braces;
  // its type (double) is never spelled out.
  auto result = std::ranges::reduce(v, {0.5}, std::plus{});
}
```

-Thus, we need to add a default template argument to `T init` in the proposed signatures. While [@P2248R8] does not propose
-a default template parameter for `init` in `<numeric>` header, we want to address this design question from the beginning
+Supporting this use case requires that we add
+a default template argument to `T init` in the proposed signatures.
+While [@P2248R8] does not propose a default template parameter
+for `init` in the `<numeric>` header,
+we want to address this design question from the beginning
for the new set of algorithms, because the `fold_*` family already has this feature.
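+
+A sketch of what this could look like for the non-parallel `ranges::reduce`
+(constraints elided; the default for `T` is the point of interest):
+
+```c++
+template<ranges::forward_range R,
+         class T = ranges::range_value_t<R>,
+         class BinaryOp>
+constexpr T reduce(R&& r, T init, BinaryOp binary_op);
+```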

# Implementation

@@ -1060,7 +1546,7 @@ implementation is done as experimental with the following deviations from this p
- Algorithms do not have constraints
- `reduce` has more overloads (without init and without binary operation)
- `*_scan` return type is not `in_out_result`
-- Convenience wrappers, proposed in the paper are not implemented. The implementation is expected to be trivial, though.
+- The convenience wrappers proposed in this paper are not implemented. Their implementation is expected to be trivial, though.

# Wording