From 9ef3cc82814a75d2535cd8b9e4a020928621a890 Mon Sep 17 00:00:00 2001
From: Frances Liu <francestfls@gmail.com>
Date: Tue, 5 Nov 2024 09:58:28 -0800
Subject: [PATCH 1/2] Add docs for local testing mode

---
 doc/source/serve/advanced-guides/dev-workflow.md | 13 +++++++++++++
 1 file changed, 13 insertions(+)
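
Reviewer note (not part of the commit): the environment-variable route documented in this patch could be wrapped for tests roughly as follows. This is a minimal sketch using only the standard library. The helper name `forced_local_testing_mode` is hypothetical, and the patch does not say whether Ray Serve reads the variable at import time or at call time, so the sketch assumes the context is entered before `ray.serve` is imported.

```python
import os
from contextlib import contextmanager

# Variable name taken from the docs added in this patch.
ENV_VAR = "RAY_SERVE_FORCE_LOCAL_TESTING_MODE"


@contextmanager
def forced_local_testing_mode():
    """Hypothetical helper: set the env var to "1", then restore the old state."""
    previous = os.environ.get(ENV_VAR)
    os.environ[ENV_VAR] = "1"
    try:
        yield
    finally:
        if previous is None:
            # The variable was unset before, so unset it again.
            os.environ.pop(ENV_VAR, None)
        else:
            os.environ[ENV_VAR] = previous
```

Restoring the previous value keeps one test from leaking the setting into the next; code that imports `ray.serve` and calls `serve.run` would go inside the `with` block.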

diff --git a/doc/source/serve/advanced-guides/dev-workflow.md b/doc/source/serve/advanced-guides/dev-workflow.md
index de8b6745a400..6f1eea1b8d23 100644
--- a/doc/source/serve/advanced-guides/dev-workflow.md
+++ b/doc/source/serve/advanced-guides/dev-workflow.md
@@ -79,6 +79,19 @@ After you're done testing, you can shut down Ray Serve by interrupting the `serv
 
 Note that rerunning `serve run` will redeploy all deployments. To prevent redeploying those deployments whose code hasn't changed, you can use `serve deploy`; see the [Production Guide](serve-in-production) for details.
 
+### Local Testing Mode
+
+Ray Serve supports a local testing mode that runs your application logic in-process, without deploying to a Ray cluster. This is useful for unit testing and debugging.
+
+To enable local testing mode, set the `_local_testing_mode` flag to `True` when calling `serve.run`:
+
+```python
+serve.run(app, _local_testing_mode=True)
+```
+
+Alternatively, you can set the environment variable `RAY_SERVE_FORCE_LOCAL_TESTING_MODE=1` to force local testing mode for all `serve.run` invocations.
+
+In local testing mode, deployments run in the local process, which enables faster development iterations and the use of debugging tools such as `pdb`. However, some features, such as converting a `DeploymentResponse` to a Ray `ObjectRef`, are not supported in this mode.
 ## Testing on a remote cluster
 
 To test on a remote cluster, you'll use `serve run` again, but this time you'll pass in an `--address` argument to specify the address of the Ray cluster to connect to.  For remote clusters, this address has the form `ray://<head-node-ip-address>:10001`; see [Ray Client](ray-client-ref) for more information.

From 741311f5553180c47a81375f4ac6ac8644ba889d Mon Sep 17 00:00:00 2001
From: "promptless[bot]" <179508745+promptless[bot]@users.noreply.github.com>
Date: Tue, 5 Nov 2024 18:02:13 +0000
Subject: [PATCH 2/2] Docs update (ee0512a)

---
 doc/source/serve/advanced-guides/dev-workflow.md | 13 +++++++++++++
 doc/source/serve/http-guide.md                   | 16 ++++++++++++++++
 doc/source/serve/model_composition.md            |  8 +-------
 3 files changed, 30 insertions(+), 7 deletions(-)

diff --git a/doc/source/serve/advanced-guides/dev-workflow.md b/doc/source/serve/advanced-guides/dev-workflow.md
index de8b6745a400..2c9221044e03 100644
--- a/doc/source/serve/advanced-guides/dev-workflow.md
+++ b/doc/source/serve/advanced-guides/dev-workflow.md
@@ -79,6 +79,19 @@ After you're done testing, you can shut down Ray Serve by interrupting the `serv
 
 Note that rerunning `serve run` will redeploy all deployments. To prevent redeploying those deployments whose code hasn't changed, you can use `serve deploy`; see the [Production Guide](serve-in-production) for details.
 
+### Local Testing Mode
+
+Ray Serve supports a local testing mode that runs your deployments locally for faster development and testing. To enable it, set the `_local_testing_mode` flag to `True` in the `serve.run` call. This mode is particularly useful for writing unit tests for your application and model composition logic.
+
+To enable local testing mode, you can modify your `serve.run` call as follows:
+
+```python
+serve.run(app, _local_testing_mode=True)
+```
+
+In local testing mode, user code for each deployment runs in a background thread, and existing `DeploymentHandle` code works unchanged. This mode also supports debugging tools such as `pdb`.
+
+Note: Local testing mode is currently a private feature and may not support every feature available in standard mode.
 ## Testing on a remote cluster
 
 To test on a remote cluster, you'll use `serve run` again, but this time you'll pass in an `--address` argument to specify the address of the Ray cluster to connect to.  For remote clusters, this address has the form `ray://<head-node-ip-address>:10001`; see [Ray Client](ray-client-ref) for more information.
diff --git a/doc/source/serve/http-guide.md b/doc/source/serve/http-guide.md
index 054ac9ff2145..62d9a55718a2 100644
--- a/doc/source/serve/http-guide.md
+++ b/doc/source/serve/http-guide.md
@@ -18,6 +18,7 @@ Considering your use case, you can choose the right level of abstraction:
 
 
 (serve-http)=
+```md
 ## Calling Deployments via HTTP
 When you deploy a Serve application, the [ingress deployment](serve-key-concepts-ingress-deployment) (the one passed to `serve.run`) is exposed over HTTP.
 
@@ -58,6 +59,21 @@ To prevent an async call from being interrupted by `asyncio.CancelledError`, use
 When the request is cancelled, a cancellation error is raised inside the `SnoringSleeper` deployment's `__call__()` method. However, the cancellation is not raised inside the `snore()` call, so `ZZZ` is printed even if the request is cancelled. Note that `asyncio.shield` cannot be used on a `DeploymentHandle` call to prevent the downstream handler from being cancelled. You need to explicitly handle the cancellation error in that handler as well.
 
 (serve-fastapi-http)=
+
+### Local Testing Mode
+
+Ray Serve supports a local testing mode that runs your deployments locally for faster development and testing. To enable it, set the `_local_testing_mode` flag to `True` in `serve.run()`. This mode is particularly useful for writing unit tests for your application and model composition logic.
+
+To enable local testing mode, you can use the following code snippet:
+
+```python
+serve.run(your_deployment.bind(), _local_testing_mode=True)
+```
+
+In local testing mode, user code for each deployment runs in a background thread, using the same `UserCallableWrapper` that runs in replica actors. This mode also introduces new `Router` and `ReplicaResult` implementations that interact with the user code so that existing `DeploymentHandle` code keeps working.
+
+Note that certain features, such as converting `DeploymentResponses` to `ObjectRefs`, are not supported in local testing mode. If you encounter a use case requiring this feature, please file a feature request on GitHub.
+```
 ## FastAPI HTTP Deployments
 
 If you want to define more complex HTTP handling logic, Serve integrates with [FastAPI](https://fastapi.tiangolo.com/). This allows you to define a Serve deployment using the {mod}`@serve.ingress <ray.serve.ingress>` decorator that wraps a FastAPI app with its full range of features. The most basic example of this is shown below, but for more details on all that FastAPI has to offer such as variable routes, automatic type validation, dependency injection (e.g., for database connections), and more, please check out [their documentation](https://fastapi.tiangolo.com/).
diff --git a/doc/source/serve/model_composition.md b/doc/source/serve/model_composition.md
index 28de84c558e6..fa98328af48e 100644
--- a/doc/source/serve/model_composition.md
+++ b/doc/source/serve/model_composition.md
@@ -113,12 +113,7 @@ Note how the response from the `Adder` handle passes directly to the `Multiplier
 
 ## Streaming DeploymentHandle calls
 
-You can also use `DeploymentHandles` to make streaming method calls that return multiple outputs.
-To make a streaming call, the method must be a generator and you must set `handle.options(stream=True)`.
-Then, the handle call returns a {mod}`DeploymentResponseGenerator <ray.serve.handle.DeploymentResponseGenerator>` instead of a unary `DeploymentResponse`.
-You can use `DeploymentResponseGenerators` as a sync or async generator, like in an `async for` code block.
-Similar to `DeploymentResponse.result()`, avoid using a `DeploymentResponseGenerator` as a sync generator within a deployment, as that blocks other requests from executing concurrently on that replica.
-Note that you can't pass `DeploymentResponseGenerators` to other handle calls.
+You can also use `DeploymentHandles` to make streaming method calls that return multiple outputs. To make a streaming call, the method must be a generator and you must set `handle.options(stream=True)`. Then, the handle call returns a {mod}`DeploymentResponseGenerator <ray.serve.handle.DeploymentResponseGenerator>` instead of a unary `DeploymentResponse`. You can use `DeploymentResponseGenerators` as a sync or async generator, like in an `async for` code block. Similar to `DeploymentResponse.result()`, avoid using a `DeploymentResponseGenerator` as a sync generator within a deployment, as that blocks other requests from executing concurrently on that replica. Note that you can't pass `DeploymentResponseGenerators` to other handle calls. If you have a use case requiring this feature, please file a feature request on GitHub.
 
 Example:
 
@@ -127,7 +122,6 @@ Example:
 :end-before: __streaming_example_end__
 :language: python
 ```
-
 ## Advanced: Pass a DeploymentResponse in a nested object [FULLY DEPRECATED]
 
 :::{warning}