diff --git a/parallel_algorithms/P3732/numeric_ranges_algorithms.md b/parallel_algorithms/P3732/numeric_ranges_algorithms.md
index b959de4..c6b8275 100644
--- a/parallel_algorithms/P3732/numeric_ranges_algorithms.md
+++ b/parallel_algorithms/P3732/numeric_ranges_algorithms.md
@@ -1,8 +1,8 @@
---
title: Numeric range algorithms
-document: P3732R0
-date: 2025-06-14
+document: P3732R1
+date: 2025-07-15
audience: SG1,SG9
author:
  - name: Ruslan Arutyunyan
@@ -38,6 +38,28 @@ We propose `ranges` algorithm overloads (both parallel and non-parallel) for the

* Abhilash Majumder (NVIDIA)

+# Revision history
+
+## R0, submitted 2025-06-14
+
+R0 is the original draft, prepared before the June 2025 Sofia WG21 meeting. SG1 reviewed it during the Sofia meeting and gave the following feedback.
+
+- SG1 agrees (via poll 4/5/1/0/0) that users should have a way to specify an identity value. SG1 asks whether there is any need to specify this as a compile-time value, or whether a run-time-only interface would suffice. One concern is the potential cost of broadcasting an identity value at run time to all threads, versus initializing each thread's accumulator to a value known at compile time.
+
+- SG1 has no objection to adding `transform_*` variants of algorithms.
+
+- SG1 asks us to add `reduce_into` and `transform_reduce_into` (via poll 4/4/0/0/0), that is, versions of `reduce` and `transform_reduce` that write the reduction result to an output range of one element. (We asked SG1 to take this poll because LEWG rejected an analogous design for std::linalg reduction-like algorithms such as dot product and norms.)
+
+- SG1 members would like separate proposals on fixing _`movable-box`_ trivial copyability, and fixing performance issues with views in general.
+
+## R1, to be submitted 2025-07-15
+
+- Revise non-wording sections
+
+  - Explain `reduce_into` and `transform_reduce_into`
+
+  - Show different designs for specifying identity value
+
# What we propose

We propose `ranges` overloads (both parallel and non-parallel) of the following algorithms:
@@ -48,14 +70,18 @@ We propose `ranges` overloads (both parallel and non-parallel) of the following

* `exclusive_scan` and `transform_exclusive_scan`.

+These correspond to the existing algorithms with the same names in the `<numeric>` header.
+Therefore, we call them "numeric range(s) algorithms."
+
We also propose adding parallel and non-parallel convenience wrappers:

* `ranges::sum` and `ranges::product` for special cases of `reduce` with addition and multiplication, respectively; and

* `ranges::dot` for the special case of binary `transform_reduce` with transform `multiplies{}` and reduction `plus{}`.

-The following sections explain why we propose these algorithms and not others. This relates to other aspects of the design
-besides algorithm selection, such as whether to include optional projection parameters.
+The following sections explain why we propose these algorithms and not others.
+This relates to other aspects of the design besides algorithm selection,
+such as whether to include optional projection parameters.

# Design

@@ -171,7 +197,9 @@ as `ranges::transform_inclusive_scan(r, o, f, g)` with `g` as the transform oper
The binary variant of `transform_reduce` is different. Unlike `reduce` and most other numeric algorithms, it takes two
input sequences and applies a binary function to the pairs of elements from both sequences. Projections, being unary functions,
-cannot replace the binary transform function of the algorithm. 
`transform_view` is similarly of no help unless it is combined with +cannot replace the binary transform function of the algorithm. +Likewise, `transform_view` by itself cannot replace +the binary transform function unless it is combined with `zip_view` and operates on tuples of elements. `zip_transform_view` is a convenient way to express this combination; applying `reduce` to `zip_transform_view` gives the necessary result (code examples are shown below). @@ -226,7 +254,7 @@ assert(out2 == expected); ``` ::: -The code without projections using a single big lambda to express the binary operation. Users have to read the big lambda +The code without projections uses a single big lambda to express the binary operation. Users have to read the big lambda to see what it does. So does the compiler, which can hinder optimization if it's not good at inlining. In contrast, the version with projections lets users read out loud what it does. It also separates the "selection" or "query" part of the transform from the "arithmetic" or "computation" part. The power of @@ -235,11 +263,13 @@ natural to extend this separation to selection logic as well. ##### Unary transform -It's harder to avoid a lambda, as the function that does an operation, in the unary `transform` case. Most of the named -C++ Standard Library arithmetic function objects are binary. Currying them into unary functions in C++ requires either -making a lambda (which defeats the purpose somewhat) or using something like `std::bind_front` (which is verbose). On the -other hand, using a projection still has the benefit of separating the "selection" part of the transform from the -"computation" part. +In the unary `transform` case, it's harder to avoid using a lambda. +Most of the named C++ Standard Library arithmetic function objects are binary. +Currying them into unary functions in C++ requires either +making a lambda (which defeats the purpose somewhat) or +using something like `std::bind_front` (which is verbose). +On the other hand, using a projection still has the benefit of +separating the "selection" part of the transform from the "computation" part. ```c++ struct foo {}; @@ -326,10 +356,13 @@ assert(result_no_proj == 52); ##### Binary `transform_reduce` -As we described above, expressing the functionality of binary `transform_reduce` using only `reduce` requires `zip_transform_view` -or something like it, making the `reduce`-only version more verbose. Users may also find it troublesome that `zip_view` and `zip_transform_view` +As we explained above, expressing the functionality of binary `transform_reduce` +using only `reduce` requires `zip_transform_view` or something like it. +This makes the `reduce`-only version more verbose. +Users may also find it troublesome that `zip_view` and `zip_transform_view` are not pipeable: there is no `{v1, v2} | views::zip` syntax, for example. -On the other hand, it's a toss-up which version is easier to understand. Users either need to learn what a "zip transform view" does, +On the other hand, it's a toss-up which version is easier to understand. +Users either need to learn what `zip_transform_view` does, or they need to learn about `transform_reduce` and know which of the two function arguments does what. ```c++ @@ -361,11 +394,12 @@ elements from the two input ranges into a single value. The algorithm then reduc function and the initial value. 
It's perhaps misleading that this binary function is called a "transform"; it's
really a kind of "inner" reduction on corresponding elements of the two input ranges.

-One can imagine a ranges analog of C++17 binary `transform_reduce` that takes two projection functions, as in the example
-below. It's not too hard for a casual reader to tell that the last two arguments of `reduce` apply to each of the input
-sequences in turn, but that's still more consecutive function arguments than for any other algorithm in the C++ Standard
-Library. Without projections, users need to resort to `transform_view`, but this more verbose syntax makes it more
-clear which functions do what.
+One can imagine a ranges analog of C++17 binary `transform_reduce`
+that takes two projection functions, as in the example below.
+The result has four function arguments in a row,
+which is more than for any other algorithm in the Standard Library.
+Without projections, users need to resort to `transform_view`,
+but this more verbose syntax makes it clearer which functions do what.

```c++
struct foo {};
@@ -375,12 +409,12 @@ std::vector<std::pair<std::string, int>> v2{
  {"thirteen", 13}, {"seventeen", 17}, {"nineteen", 19}};
constexpr int init = 3;

-// With projections
+// With projections: 4 functions in a row
auto result_proj = std::ranges::transform_reduce(v1, v2, init,
  std::plus{}, std::multiplies{},
  get_element<0>{}, get_element<1>{});
assert(result_proj == 396);

-// Without projections
+// Without projections: clearer where get_element happens
auto result_no_proj = std::ranges::transform_reduce(
  std::views::transform(v1, get_element<0>{}),
  std::views::transform(v2, get_element<1>{}),
@@ -490,7 +524,8 @@ parallel execution.

Let's review what we learned from the above discussion.

-- In general and particularly for `ranges::transform`, projections improve readability and expose optimization potential,
+- Projections improve readability of `ranges::transform`.
+- Projections expose optimization potential,
  by separating the selection part of an algorithm from the computation part.
- None of the existing `fold_*` `ranges` algorithms (the closest things the Standard Library currently has to
  `ranges::reduce`) take projections.
@@ -526,6 +561,55 @@ reduction algorithms to have projections.
`ranges::transform_{in,ex}clusive_scan` as well as `ranges::{in,ex}clusive_scan`, and do not provide projections for
any of them.

+### `reduce_into` and `transform_reduce_into`
+
+We propose new algorithms `reduce_into` and `transform_reduce_into`.
+These work like `reduce` and `transform_reduce`,
+except that instead of returning the reduction result by value,
+they write it to the first element of an output range.
+
+The `reduce_into` algorithm has
+[precedent in the Thrust library](https://nvidia.github.io/cccl/thrust/api_docs/algorithms/reductions.html).
+Its performance advantage is that the algorithm can write its result
+directly to special memory associated with parallel execution,
+such as accelerator memory or a NUMA (Non-Uniform Memory Access) domain
+where the algorithm's threads run.
+
+#### Output should be a single iterator
+
+[@P3179R8] (parallel range algorithms) always specifies output ranges
+as sized ranges, instead of as single iterators.
+However, in the case of `*reduce_into`,
+the output range always has exactly one element.
+Thus, requiring a sized range would bring no safety improvement.
+Users would end up needing to go through a possibly error-prone
+syntax ritual to turn their one output iterator into a sized range. 
+The use cases below illustrate this.
+
+```c++
+std::vector<float> input_range{3.0f, 5.0f, 7.0f};
+float out_value{};
+// Assume my_alloc and deleter allocate and free a single float
+// in special memory (e.g., accelerator memory).
+unique_ptr<float[], decltype(deleter)> out{my_alloc(sizeof(float)), deleter};
+
+// Input sized range, output iterator
+ranges::reduce_into(input_range, /* ... */ &out_value);
+ranges::reduce_into(input_range, /* ... */ out.get());
+assert(out_value == out[0]);
+
+// Input sized range, output sized range (size 1)
+ranges::reduce_into(input_range, /* ... */ span{&out_value, 1});
+ranges::reduce_into(input_range, /* ... */ span{out.get(), 1});
+assert(out_value == out[0]);
+```
+
+#### Output iterator could be single-pass
+
+The output range needs to be copyable for parallelization reasons,
+so its iterator can't be merely an `output_iterator`.
+However, the iterator could be single-pass,
+because the algorithm writes at most one element, at most once.
+The Standard does not currently have an iterator category to express this distinction.
+
### We propose convenience wrappers to replace some algorithms

#### `accumulate`
@@ -585,27 +669,37 @@ We do not propose `adjacent_transform` for the reasons described above.

#### `partial_sum`

-The `partial_sum` algorithm performs operations sequentially. The existing ranges library does not have an equivalent
-algorithm with this left-to-right sequential behavior, nor do we propose such an algorithm. For users who want this
-behavior, [@P2760R1] suggests a view instead of an algorithm. [@P3351R2], "`views::scan`," proposes this view; it
-is currently in SG9 (Ranges Study Group) review.
-
-Users of `partial_sum` who are not concerned about the order of operations can call `inclusive_scan` instead, which we
-propose here. We considered adding a convenience wrapper for the same special case of an inclusive prefix plus-scan that
-`partial_sum` supports. However, names like `partial_sum` or `prefix_sum` would obscure whether this is an inclusive or
-exclusive scan. Also, we already have `std::partial_sum` that operates in order. Using the same name as a convenient wrapper
-on top of out-of-order `*_scan`, we propose in the paper, is misleading. We think it's not a very convenient convenience
-wrapper if users have to look these aspects up every time they use it.
-
-If WG21 did want a convenience wrapper, one option would be to give this common use case a longer but more explicit name,
+The `partial_sum` algorithm combines elements sequentially, from left to right.
+It behaves like an order-constrained version of `inclusive_scan`.
+
+Our proposal focuses on algorithms that permit reordering binary operations.
+For users who want an order-constrained partial sum,
+[@P3351R2], "`views::scan`," proposes a view with the same left-to-right behavior.
+This paper is currently in SG9 (Ranges Study Group) review.
+
+Users of `partial_sum` who are not concerned about the order of operations
+can call the `inclusive_scan` algorithm (proposed here) instead,
+as the sketch below illustrates.
+We considered adding a convenience wrapper for the same special case
+of an inclusive prefix plus-scan that `partial_sum` supports.
+However, names like `partial_sum` or `prefix_sum` would obscure
+whether this is an inclusive or exclusive scan.
+Also, the existing `partial_sum` algorithm operates left-to-right.
+A new algorithm with the same name and almost the same interface,
+but with a different order of operations, could be misleading.
+We think it's not a very convenient convenience wrapper
+if users have to look up its behavior every time they use it. 
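+
+Here is a minimal sketch of the correspondence mentioned above,
+assuming the `ranges::inclusive_scan` overloads proposed in this paper:
+
+```c++
+std::vector<int> in{5, 7, 11, 13, 17};
+std::vector<int> out1(in.size());
+std::vector<int> out2(in.size());
+
+// Left-to-right by specification.
+std::partial_sum(in.begin(), in.end(), out1.begin());
+
+// May reorder the additions; for an associative and commutative
+// operation like integer addition, the result is the same:
+// 5, 12, 23, 36, 53.
+std::ranges::inclusive_scan(in, out2, std::plus{});
+
+assert(out1 == out2);
+```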
+
+If WG21 did want a convenience wrapper, one option would be
+to give this common use case a longer but more explicit name,
like `inclusive_sum_scan`.

### We don't propose "the lost algorithm" (noncommutative parallel reduce)

The Standard lacks an analog of `reduce` that can assume associativity but not commutativity of binary operations.
-One author of this proposal refers to this as "the lost algorithm" (in e.g.,
-[Episode 25 of "ASDP: The Podcast"](https://adspthepodcast.com/2021/05/14/Episode-25.html)). We do not propose this
-algorithm, but we would welcome a separate proposal to do so.
+One author of this proposal refers to this as "the lost algorithm";
+see [Episode 25 of "ADSP: The Podcast"](https://adspthepodcast.com/2021/05/14/Episode-25.html).
+We do not propose this algorithm, but we would welcome a separate proposal to do so.

The current numeric algorithms express a variety of permissions to reorder binary operations.

@@ -636,7 +730,7 @@ with a two-sided identity element.

This proposal leaves the described algorithm out of scope. We think the right way would be to propose a new algorithm
with a distinct name. A reasonable choice of name would be `fold` (just `fold` by itself, not `fold_left` or `fold_right`).

-### We don't propose `reduce_with_iter`
+### We don't propose `reduce_with_iter` {#no-reduce-with-iter}

A hypothetical `reduce_with_iter` algorithm would look like `fold_left_with_iter`, but would permit reordering of binary
operations. It would return both an iterator to one past the last input element, and the computed value. The only reason
@@ -644,6 +738,8 @@ for a reduction to return an iterator would be if the input range is single-pass
input range really should be using one of the `fold*` algorithms instead of `reduce*`. As a result, we do not propose the
analogous `reduce_with_iter` here.

+Note that the previous paragraph effectively argues for `*reduce` to require at least forward ranges.
+
Just like `fold_left`, the `reduce` algorithm should return just the computed value. Section 4.4 of [@P2322R6] argues
that this makes it easier to use, and improves consistency with other `ranges` algorithms like `ranges::count` and
`ranges::any_of`. It is also consistent with [@P3179R8]. Furthermore, even if a `reduce_with_iter` algorithm were to
@@ -659,7 +755,7 @@ initial value. An example would be `min` on a range of `int` values, where calle
represents an actual value in the range, or a fake "identity" element (that callers may get as a result when the range
is empty).

-We do not propose `reduce_first` here, only outline arguments against and for adding it.
+We do not propose `reduce_first` here; we only outline arguments for and against adding it.

#### Arguments against `reduce_first`

@@ -680,8 +776,12 @@ See [](#initial-value-vs-identity) for more detailed analysis.
1. Some equivalent of `reduce_first` can be used as a building block for parallel reduction with unknown identity, if
   no other solution is proposed.
1. Even though `min_element`, `max_element`, and `minmax_element` exist, users may still want to combine multiple
   reductions into a single pass, where some of the reductions are min and/or max, while others have a natural identity.
-As an example, users may want the minimum of an array of integers (with no natural identity), along with the least
-index of the array element with the minimum value (whose natural identity is zero). This happens often enough that MPI
+
+As an example of combining multiple reductions into a single pass,
+users may want the minimum of an array of integers (with no natural identity),
+along with the least index of the array element with the minimum value
+(whose natural identity is zero).
+This happens often enough that MPI
(the Message Passing Interface for distributed-memory parallel computing) has predefined reduction operations for
minimum and its index (`MINLOC`) and maximum and its index (`MAXLOC`). On the other hand, even `MINLOC` and `MAXLOC`
have reasonable choices of fake "identity" elements that work in practice, e.g., for `MINLOC`, `INT_MAX` for the minimum value
@@ -727,30 +827,47 @@ requirement for the non-parallel algorithms we propose. This leaves us with two

* (multipass) forward ranges.

-The various reduction and scan algorithms we propose can combine the elements of the range in any order. For this reason,
-we make the non-parallel algorithms take (multipass) forward ranges, even though this is not consistent with the existing
-non-parallel `<numeric>` algorithms. If users have single-pass iterators, they should just call one of the `fold_*`
-algorithms, or use the `views::scan` proposed elsewhere. This has the benefit of letting us specify `ranges::reduce`
-to return just the value. We don't propose a separate algorithm `reduce_with_iter`, as we explain elsewhere in this
-proposal.
+We believe there is no value in `*reduce` and `*_scan` taking single-pass input ranges,
+because these algorithms can combine the elements of their input range(s) in any order.
+Suppose that an algorithm had that freedom to rearrange operations,
+yet was constrained to read the input ranges exactly once, in left-to-right order.
+The only way such an algorithm could exploit that freedom
+would be to copy the input ranges into temporary storage.
+Users who want that could just copy the input ranges into contiguous storage themselves.
+
+For this reason, we make the non-parallel algorithms take (multipass) forward ranges,
+even though this is not consistent with the existing non-parallel `<numeric>` algorithms.
+If users have single-pass iterators, they should just call one of the `fold_*` algorithms,
+or use the `views::scan` proposed in [@P3351R2].
+This has the benefit of letting us specify `ranges::reduce` to return just the value.
+We don't propose a separate `reduce_with_iter` algorithm
+to return both the value and the past-the-end input iterator,
+as we explain [in the relevant section](#no-reduce-with-iter).

## Constexpr parallel algorithms?

[@P2902R1] proposes to add `constexpr` to the parallel algorithms. [@P3179R8] does not object to this; see Section 2.10.
We continue the approach of [@P3179R8] in not opposing [@P2902R1]'s approach, but also not depending on it.

-## Reduction's initial value vs. its identity element {#initial-value-vs-identity}
+## Specifying a reduction's identity element {#initial-value-vs-identity}

-It's important to distinguish between a reduction's initial value, and its identity element. C++17's `std::reduce` takes an
-optional initial value `T init` that is included in the terms of the reduction. This is not necessarily the same as the
-identity element for a reduction, which is a value that does not change the reduction's result, no matter how many times it
-is included. The following example illustrates.
+### Initial value not necessarily the same as identity value
+
+It's important to distinguish between a reduction's initial value and its identity value.
+C++17's parallel `*reduce` algorithms take an optional *initial value* `T init`.
+When omitted, it defaults to a value-initialized `T{}`,
+and it is included in the terms of the reduction.
+The initial value is not necessarily the same as an *identity value* for a reduction,
+which is a value that does not change the reduction's result,
+no matter how many times it is included as a term.
+We say "an" identity value because it need not be unique.
+An identity value can serve as an initial value, but not vice versa.
+The following example illustrates.

```c++
std::vector<float> v{5.0f, 7.0f, 11.0f};

// Default initial value is float{}, which is 0.0f.
-// It is also the identity for std::plus<>, the default operation
+// It is also the identity for std::plus<>, the default operation.
float result = std::reduce(v.begin(), v.end());
assert(result == 23.0f);

@@ -771,31 +888,36 @@ result = std::reduce(v.begin(), v.end(), 0.0f);
assert(result == 23.0f);
```

-The identity element can serve as an initial value, but not vice versa. This is especially important for parallelism.
-
-#### Initial value matters most for sequential reduction
+### Initial value matters most for sequential reduction

-From the serial execution perspective, it is easy to miss importance of the reduction identity. Let's consider typical code
-that sums elements of an indexed array.
+Users who never use parallel reductions may miss the importance of the reduction identity.
+Let's consider typical code that sums elements of an indexed array.

```c++
float sum = 0.0f;
for (size_t i = 0; i < v.size(); ++i) {
  sum += v[i];
}
```

+Here the initial value `0.0f` is simply the first term of the sum;
+nothing requires it to be an identity value of the binary operation.
+
+### Identity matters for parallel reduction
+
+A parallel reduction partitions its input and gives each thread (or other execution agent)
+its own accumulator. Each accumulator must start at a value that does not change
+the result, that is, at an identity value of the binary operation.
+
+#### Deducing a default identity value
+
+For well-known binary operations, an implementation could deduce
+a reasonable default identity value from the operation and the range's value type.
+
+```c++
+// Default binary operation std::plus<>;
+// deduce identity: range_value_t<R>{}
+auto result1 = std::ranges::reduce(exec_policy, range);
+
+// Deduce identity: range_value_t<R>{}
+auto result2 = std::ranges::reduce(exec_policy, range, std::plus{});
+
+// If range_value_t<R> is arithmetic or std::complex,
+// deduce identity: range_value_t<R>(1)
+auto result3 = std::ranges::reduce(exec_policy, range, std::multiplies{});
+
+// Should this even work?
+// The "identity" is really -Inf if range_value_t<R> has that,
+// but this is one of those cases where users might not like that behavior.
+auto result4 = std::ranges::reduce(exec_policy, range, std::ranges::min);
+```
+
+#### Do not force `T{}` to be an identity value
+
+The identity value of a binary operator that returns `T`
+is not necessarily `T{}` (a value-initialized `T`)
+for all operators and types.
+
+- For `std::multiplies{}` it's `T(1)`.
+
+- For "addition" in the max-plus ("tropical") algebra it's `-Inf`.
+
+We don't want to force users to wrap reduction result types
+so that `T{}` defines the identity (if it exists) for `operator+(T, T)`.
+
+- What if the operator has no identity value?
+
+- What if `T` differs from the input range's value type?
+
+- What if users want to use the same value type for different binary operators, such as `double` for `plus`, `multiplies`, and `ranges::max`?
+
+- If we make users write a nondefaulted default constructor for `T`, they are more likely to make `T` not trivially constructible, and thus hinder optimizations.
+
+Note that this differs from std::linalg:
+"A value-initialized object of linear algebra value type shall act as the additive identity"
+([linalg.reqs.val] 3). However:
+
+- std::linalg does not take user-defined binary operators; it always uses `operator+` for reductions.
+
+- std::linalg needs "zero" for reasons other than reductions, e.g., its support for user-defined complex number types (_`imag-if-needed`_).
+
+For these reasons, we think it's reasonable
+to make a different design choice for numeric range algorithms.
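+
+To illustrate why `T{}` cannot serve as a universal identity,
+here is a minimal sketch using the existing C++17 `std::reduce`,
+where the caller supplies the identity by hand as the initial value:
+
+```c++
+std::vector<double> v{2.0, 4.0, 8.0};
+
+// The multiplicative identity is 1.0. Seeding with T{} == 0.0
+// would zero out the entire product.
+double product = std::reduce(v.begin(), v.end(), 1.0, std::multiplies{});
+assert(product == 64.0);
+
+// "Addition" in the max-plus ("tropical") algebra: the identity is -Inf.
+double mx = std::reduce(v.begin(), v.end(),
+  -std::numeric_limits<double>::infinity(),
+  [](double a, double b) { return std::max(a, b); });
+assert(mx == 8.0);
+```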
+
+### Interface sketches for specifying identity
+
+#### Design goals
+
+1. Avoid confusion with the C++17 `std::reduce` initial value.
+
+    - Strongly counterindicates designs that just pass in the identity as `T id`.
+      Doing that would easily confuse users who want to switch
+      from C++17 numeric algorithms to the new ranges versions.
+
+2. Let users specify a different binary operation and identity for the same reduction result type.
+
+    - Counterindicates a purely compile-time traits system.
+      For an example of such a system, please see the
+      ["Reduction Variables"](https://github.khronos.org/SYCL_Reference/iface/reduction-variables.html)
+      section of the SYCL Reference.
+
+3. Let users specify an identity even if their binary operation is a lambda.
+
+    - Also counterindicates a purely compile-time traits system.
+
+4. Let users use the identity value to specify a reduction result type
+   that differs from `range_value_t`,
+   just as they can use the initial value's type to do that with C++17 `reduce`.
+
+5. Provide a default identity value if possible.
+
+    - Conform to the fold expression default identity if possible.
+
+    - Algorithms must know both `range_value_t` and the binary operation in order to deduce a default identity value.
+
+6. Let users specify a nondefault identity value "in line" with invoking the algorithm, without needing to specialize a class.
+
+#### Design 1: `reduce_identity{value}`
+
+Users would pass in their identity value
+by wrapping it in a named struct `reduce_identity`.
+
+```c++
+template<class T>
+struct reduce_identity {
+  T value{};
+};
+```
+
+The default identity value would be `T{}`, because the default needs to be _something_.
+Users would have two ways to provide a nondefault value.
+
+1. Construct `reduce_identity` with a nondefault value using aggregate initialization:
+   `reduce_identity{nondefault_value}`
+
+2. Specialize `reduce_identity` so that `declval<reduce_identity<T>>().value` is the value
+
+For example, users could inherit their specialization from `constant_wrapper`.
+
+```c++
+namespace impl {
+  inline constexpr my_number some_value = /* value goes here */;
+}
+template<>
+struct reduce_identity<my_number> :
+  constant_wrapper<impl::some_value>
+{};
+```
+
+Here are some use cases.
+
+```c++
+// User explicitly opts into "most negative integer"
+// as the identity for min. This should not be the default,
+// as the C++ Standard Library has no way to know
+// whether this represents a valid input value.
+constexpr auto lowest = std::numeric_limits<int>::lowest();
+auto result5 = std::ranges::reduce(exec_policy, range,
+  std::ranges::min, reduce_identity{lowest});
+
+// range_value_t<R> is float, but the identity value is double
+// (even though it's otherwise the default value, zero).
+// std::plus should use operator()(double, double) -> double.
+auto result6 = std::ranges::reduce(exec_policy, range,
+  std::plus{}, reduce_identity{0.0});
+```
+
+Advantages of this approach:
+
+- Users would see in plain text the purpose of this function argument.
+
+- Algorithms could overload on it without risk of ambiguity.
+
+- The struct is an aggregate, which would maximize potential for optimizations.
+
+- It would not impose requirements on the user's binary function.
+
+Disadvantages:
+
+- The algorithm could not use this to deduce a default identity value from a binary operation.
+
+- A specialization of `reduce_identity` would take effect for all binary operations on `T`.
+
+#### Design 2: `reduce_operation{binary_op, value}`
+
+The `reduce_operation` struct would encapsulate the binary operation
+and its identity value into a single argument.
+This would make it easier for an implementation to deduce a default identity value.
+The default identity would match the identity for the corresponding binary fold.
+Users' specializations would only affect that (operation, value type) pair.
+It's a named struct, so algorithms could overload on it without ambiguity.
+
+Here are some use cases.
+
+```c++
+auto result7 = std::ranges::reduce(exec_policy, range,
+  reduce_operation{custom_binary_op, custom_value});
+
+constexpr auto minus_Inf = -std::numeric_limits<float>::infinity();
+auto result5 = std::ranges::reduce(exec_policy, range,
+  reduce_operation{std::ranges::min, minus_Inf});
+
+// range_value_t<R> is float
+auto result6 = std::ranges::reduce(exec_policy, range,
+  reduce_operation{std::plus{}, /* double */ 0.0});
+
+// Deduce identity: range_value_t<R>{}
+auto result2 = std::ranges::reduce(exec_policy, range, std::plus{});
+```
+
+Here is an implementation sketch.
+
+```c++
+// Keep this an aggregate if possible.
+// We can constrain BinaryOp on being invocable with T, T,
+// but we don't know range_value_t<R> at this point,
+// so we don't know yet if the algorithm is well-formed with BinaryOp.
+template<class BinaryOp, class T>
+  requires semiregular<BinaryOp> && invocable<BinaryOp&, T, T>
+struct reduce_operation {
+  BinaryOp op;
+  T value{};
+};
+
+// std::plus with arithmetic types -> T{}.
+//
+// We can't specialize for all T, because users might have
+// defined std::plus<U> such that T{} is not the identity.
+template<class U, class T>
+  requires(
+    (is_arithmetic_v<T>) &&
+    (is_void_v<U> || (is_arithmetic_v<U> && is_convertible_v<T, U>))
+  )
+struct reduce_operation<std::plus<U>, T> {
+  static constexpr std::plus<U> op{};
+  static constexpr T value = T{};
+};
+
+// (std::plus or std::multiplies) with std::complex takes more effort,
+// since e.g., std::complex<double> + float is not well-formed.
+
+// std::multiplies with arithmetic types -> T(1).
+//
+// We can't specialize for all T, because users might have
+// defined std::multiplies<U> such that T(1) is not the identity.
+template<class U, class T>
+  requires(
+    (is_arithmetic_v<T>) &&
+    (is_void_v<U> || (is_arithmetic_v<U> && is_convertible_v<T, U>))
+  )
+struct reduce_operation<std::multiplies<U>, T> {
+  static constexpr std::multiplies<U> op{};
+  static constexpr T value = T(1);
+};
+
+// ... more specializations for Standard binary function objects ...
+```
+
+This is a more complicated design than the `reduce_identity` struct shown above.
+However, it offers a straightforward path to providing a reasonable default
+for commonly used binary operations.
+
+This design does impose the requirement that `BinaryOp`
+must be copy-constructible or move-constructible.
+That's fine for parallel algorithms, but not strictly necessary for non-parallel algorithms.
+On the other hand, users who want a noncopyable `BinaryOp` should probably use `fold_*` instead.
+
+### If users can specify an identity value, do they need an initial value?
+
+#### `*reduce` algorithms should not take both
+
+- Providing both would confuse users and would specify the result type redundantly.
+
+- There is no performance benefit to providing an initial value if an identity value is known.
+
+```c++
+std::vector<int> v{5, 11, 7};
+const int max_identity = std::numeric_limits<int>::lowest();
+
+// identity as initial value
+int result1 = ranges::reduce(v, max_identity, ranges::max);
+assert(result1 == 11);
+
+// identity as, well, identity
+int result2 = ranges::reduce(v,
+  reduce_operation{ranges::max, max_identity});
+assert(result2 == 11);
+
+std::vector<int> empty_vec;
+int result3 = ranges::reduce(empty_vec,
+  reduce_operation{ranges::max, max_identity});
+assert(result3 == max_identity);
+```
+
+#### `*_scan` algorithms would benefit from an initial value
+
+- The initial value affects every element of the output.
+
+- Without it, one would need an extra `transform` pass over the output.
+
+- For exclusive scan, one can't use `transform_exclusive_scan` to work around a non-identity initial value.
+
+```c++
+std::vector<int> in{5, 7, 11, 13, 17};
+std::vector<int> out(size_t(5));
+const int init = 3;
+auto binary_op = std::plus{};
+
+// out: 8, 15, 26, 39, 56
+ranges::inclusive_scan(in, out, binary_op, init);
+
+// out: 3, 8, 15, 26, 39
+// Unlike C++17 exclusive_scan, init goes after binary_op;
+// see the next example block for why.
+ranges::exclusive_scan(in, out, binary_op, init);
+
+// Applying the initial value through the unary op transforms
+// every element, so it does NOT reproduce the scans above.
+auto unary_op = [op = binary_op] (auto x) { return op(x, 3); };
+
+// out: 8, 18, 32, 48, 68 -- not the desired 8, 15, 26, 39, 56
+ranges::transform_inclusive_scan(in, out, binary_op, unary_op);
+
+// out: 0, 8, 18, 32, 48 -- not the desired 3, 8, 15, 26, 39
+// (using the additive identity 0 as the initial value)
+ranges::transform_exclusive_scan(in, out, binary_op, unary_op, 0);
+```
+
+#### Avoid mixing up identity and initial value
+
+C++17 `*reduce` and `*_scan` take the initial value as a bare, undecorated `T init`.
+
+If the new algorithms take a bare `T identity`, then users could be confused when switching from C++17 to the new algorithms.
+
+"Decorating" the identity by wrapping it in a struct prevents confusion. It also lets algorithms take both an initial value and an identity.
+
+```c++
+std::vector<int> in{-8, 6, -4, 2, 0, 10, -12};
+std::vector<int> out(size_t(7));
+const int init = 7;
+auto binary_op = std::ranges::max;
+
+// inclusive_scan doesn't need an initial value.
+
+// out: -8, 6, 6, 6, 6, 10, 10
+std::ranges::inclusive_scan(in, out, binary_op);
+
+// out: 7, 7, 7, 7, 7, 10, 10
+std::ranges::inclusive_scan(in, out, binary_op, init);
+
+// Suppose the user knows that they
+// will never see values smaller than -9.
+const int identity_value = -10;
+
+// out: 7, 7, 7, 7, 7, 10, 10
+std::ranges::inclusive_scan(in, out,
+  reduce_operation{binary_op, identity_value},
+  init);
+
+// exclusive_scan needs an initial value.
+// The identity is a reasonable default initial value,
+// if you have it.
+//
+// C++17 *exclusive_scan puts init left of binary_op,
+// while inclusive_scan puts init right of binary_op.
+// We find this weird, so we don't do it.
+
+// out: 7, 7, 7, 7, 7, 7, 10
+std::ranges::exclusive_scan(in, out, binary_op, init);
+
+// out: -10, -8, 6, 6, 6, 6, 10
+std::ranges::exclusive_scan(in, out,
+  reduce_operation{binary_op, identity_value});
+
+// out: 7, 7, 7, 7, 7, 7, 10
+std::ranges::exclusive_scan(in, out,
+  reduce_operation{binary_op, identity_value}, init);
+```

-Based on the above considerations, we conclude that there are good reasons to consider a mechanism for users to explicitly
-specify the identity element for parallel reduction. There are options of how that could be achieved, of which we list a few.
+
+### Conclusions

-- Add an optional extra parameter for the value of identity, defaulting to value initialization.
-- Change the meaning of the `init` parameter for parallel algorithms to represent identity instead of the initial value.
-- Provide a customization point similar to `sycl::known_identity` that also defaults to value initialization but can be
-  specialized for a given operation.
-Similarly to `std::linalg`, require that for numeric parallel algorithms a value-initialized object shall act as the identity element.
+It's important for both performance and functionality
+that users be able to specify an identity value for parallel reductions.
+Designs for this should avoid confusion when switching from
+the C++17 parallel numeric algorithms to the new ranges versions.
+We would like feedback from SG9 and LEWG on their preferred design.

-At this point, we do not propose any of these options. We would like to hear feedback from SG1 and SG9 on exploring this further.
+Our proposed `*reduce` algorithms do not need an initial value parameter.
+For our proposed `*_scan` algorithms, an initial value could improve performance
+in some cases by avoiding an additional pass over all the output elements.
+The `*exclusive_scan` algorithms need an initial value,
+because it defines the first element of the output range.
+The initial value could default to the identity, if it exists and is known.

## `ranges::reduce` design

In this section, we focus on `ranges::reduce`'s design. The discussion here applies generally to the other algorithms
we propose.

-### No default parameters
+### No default binary operation or initial value

Section 5.1 of [@P2760R1] states:

-> One thing is clear: `ranges::reduce` should *not* take a default binary operation *nor* a default initial [value]
-> parameter. The user needs to supply both.
+> One thing is clear: `ranges::reduce` should *not* take a default binary operation
+> *nor* a default initial [value] parameter. The user needs to supply both.

This motivates the following convenience wrappers:
@@ -896,8 +1376,9 @@ This motivates the following convenience wrappers:

- `ranges::product(r)` for `ranges::reduce` with `init = range_value_t<R>(1)` and `multiplies{}` as the reduce
  operation; and
- `ranges::dot(x, y)` for binary `ranges::transform_reduce` with `init = T()` where
-  `T = decltype(declval<range_value_t<R1>>() * declval<range_value_t<R2>>())`, `multiplies{}` as the transform operation,
-  and `plus{}` as the reduce operation.
+  `T = decltype(declval<range_value_t<R1>>() * declval<range_value_t<R2>>())`,
+  `multiplies{}` is the transform operation,
+  and `plus{}` is the reduce operation.

One argument *for* a default initial value in `std::reduce` is that `int` literals like `0` or `1` do not behave in the
expected way with a sequence of `float` or `double`. For `ranges::reduce`, however, making its return value type imitate
@@ -1032,8 +1513,10 @@ position.) We express what `reduce` does using *GENERALIZED_SUM*.

## Enabling list-initialization for proposed algorithms

-Our proposal follows the same principles as described in [@P2248R8] paper. We want to enable the use case with constructing
-`init` from curly braces.
+Our proposal follows the same principles as described in [@P2248R8],
+"Enabling list-initialization for algorithms."
+We want to enable the use case where users construct a nondefault initial value
+using curly braces without naming the type.

```c++
#include <numeric>
#include <vector>

int main() {
  std::vector<double> v{0.25, 0.75};
  // The initial value is list-initialized from braces;
  // its type (double) is never spelled out.
  auto result = std::ranges::reduce(v, {0.5}, std::plus{});
}
```

-Thus, we need to add a default template argument to `T init` in the proposed signatures. While [@P2248R8] does not propose
-a default template parameter for `init` in `<numeric>` header, we want to address this design question from the beginning
+Supporting this use case requires that we add
+a default template argument to `T init` in the proposed signatures.
+While [@P2248R8] does not propose a default template parameter
+for `init` in the `<numeric>` header,
+we want to address this design question from the beginning
for the new set of algorithms, because the `fold_*` family already has this feature.
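+
+A sketch of what this could look like for the non-parallel `ranges::reduce`
+(constraints elided; the default for `T` is the point of interest):
+
+```c++
+template<ranges::forward_range R,
+         class T = ranges::range_value_t<R>,
+         class BinaryOp>
+constexpr T reduce(R&& r, T init, BinaryOp binary_op);
+```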

# Implementation

@@ -1060,7 +1546,7 @@ implementation is done as experimental with the following deviations from this p
- Algorithms do not have constraints
- `reduce` has more overloads (without init and without binary operation)
- `*_scan` return type is not `in_out_result`
-- Convenience wrappers, proposed in the paper are not implemented. The implementation is expected to be trivial, though.
+- The convenience wrappers proposed in this paper are not implemented. Their implementation is expected to be trivial, though.

# Wording