gpui: Implement GPU device loss recovery for Linux X11 by larstalian · Pull Request #43070 · zed-industries/zed

larstalian · 2025-11-19T14:21:06Z

Closes #32318
Closes #27751
Partial fix for #23288 (X11 only, not Wayland)

Release Notes:

Fixed Zed hanging with 100% CPU usage after resuming from sleep on Linux X11 with NVIDIA GPUs

cla-bot · 2025-11-19T14:21:10Z

We require contributors to sign our Contributor License Agreement, and we don't have @larstalian on file. You can sign our CLA at https://zed.dev/cla. Once you've signed, post a comment here that says '@cla-bot check'.

larstalian · 2025-11-19T14:25:40Z

crates/gpui/src/platform/blade/blade_renderer.rs

    path_intermediate_msaa_texture: Option<gpu::Texture>,
    path_intermediate_msaa_texture_view: Option<gpu::TextureView>,
    rendering_parameters: RenderingParameters,
+    skip_draws: bool,


This matches the Windows implementation.
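As a rough illustration of what a flag like this could do, here is a minimal sketch. Only the `skip_draws` field name comes from the diff; the `Renderer` type, the method names and the rule for clearing the flag are assumptions made for the example, not the PR's actual code.

```rust
/// Hypothetical stand-in for the renderer; only the `skip_draws` field
/// name comes from the diff above.
#[derive(Default)]
struct Renderer {
    skip_draws: bool,
    frames_drawn: u32,
}

impl Renderer {
    /// Device loss detected: GPU resources are gone, so stop drawing.
    fn on_device_lost(&mut self) {
        self.skip_draws = true;
    }

    /// A scene was rebuilt against the recreated resources (assumed trigger).
    fn on_fresh_scene(&mut self) {
        self.skip_draws = false;
    }

    /// Returns `true` if the frame was drawn, `false` if it was skipped.
    fn draw(&mut self) -> bool {
        if self.skip_draws {
            return false;
        }
        self.frames_drawn += 1;
        true
    }
}
```

Skipping a frame is cheap compared with submitting work that references destroyed resources, which is presumably why a plain boolean is enough here.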

larstalian · 2025-11-19T14:42:59Z

Tested on Wayland: no GPU hang like on X11, but sprites still don't render after suspend/resume.

Vanuan · 2025-11-25T02:02:16Z

crates/gpui/src/platform/linux/x11/window.rs

+            Self::attempt_gpu_recovery(&mut inner, self.0.x_window, &self.0.xcb, 0)
+        {
+            log::error!("GPU recovery failed: {}", recovery_err);
+            std::process::exit(1);


Well, silently exiting the app is probably no better than freezing? Maybe panic!() would be better, so that eventually the panic hook could handle this and re-spawn Zed with a message on start: "Some problem occurred, here are the details"?

For example:

    use std::{panic, env, process};

    // Define an environment variable used to check if the process is a 'restarted' instance
    const RESTART_ENV_VAR: &str = "APP_RESTARTED_FROM_CRASH";

    /// Placeholder function to simulate the logic for relaunching the process.
    /// In a real-world scenario, this would likely involve using the 'exec' system call
    /// or signaling an external supervisor.
    fn relaunch_process(error_details: &str) {
        eprintln!("--- Relaunch Initiated ---");
        eprintln!("Error to pass: '{}'", error_details);
        eprintln!("In a real app, this would execute a new instance with the error info.");

        // Example:
        // process::Command::new(env::current_exe().unwrap())
        //     .env(RESTART_ENV_VAR, error_details)
        //     .spawn().unwrap();

        // For this example, we'll just exit gracefully after 'relaunching'
        process::exit(1);
    }

    fn main() {
        // 1. Check on start for the 'restarted' flag
        if let Ok(error_details) = env::var(RESTART_ENV_VAR) {
            println!("*** ⚠️ Application started as a restarted instance! ***");
            println!("Previous crash reason: {}", error_details);

            // Clear the environment variable so future runs are clean (optional)
            env::remove_var(RESTART_ENV_VAR);

            // Here you would implement logic like:
            // - Load previous state from disk
            // - Present a crash report to the user
            // - Attempt a less resource-intensive mode
        }

        // 2. Install the custom panic hook
        panic::set_hook(Box::new(|panic_info| {
            // --- CUSTOM PANIC HANDLER ---
            eprintln!("\n🚨 Custom Panic Hook Activated 🚨");
            eprintln!("Panic occurred at: {}", panic_info.location().unwrap());

            // Extract a simple error string to pass to the new process
            let error_message = format!(
                "Panic: {}",
                panic_info.payload().downcast_ref::<&str>().unwrap_or(&"Unknown Error")
            );

            // Call the relaunch logic (which will likely exit the current process)
            // NOTE: This call is the key to your "respawn" concept.
            relaunch_process(&error_message);
        }));

        // --- APPLICATION LOGIC ---
        println!("Application running normally...");
        println!("Simulating a critical error now...");

        // 3. Trigger the panic
        // This will cause the panic hook to fire.
        panic!("GPU device lost");

        // This line is never reached
        println!("Application finished.");
    }

larstalian · 2025-11-25

Okay, great, thanks, I will address that! Additionally, after testing this solution for a couple of days, I have found that it occasionally does not recover properly. I am investigating and coding up a fix for that. I will set the PR as draft until that is fixed. I really appreciate the feedback.

larstalian · 2025-11-26T09:08:24Z

crates/gpui/src/platform/blade/blade_atlas.rs

+    pub fn get_texture_info(&self, id: AtlasTextureId) -> Option<BladeTextureInfo> {
        let lock = self.0.lock();
-        let texture = &lock.storage[id];
-        BladeTextureInfo {
+        let textures = &lock.storage[id.kind];
+        let texture = textures.textures.get(id.index as usize)?.as_ref()?;
+        Some(BladeTextureInfo {
            raw_view: texture.raw_view,
-        }
+        })


After GPU recovery the atlas is cleared, but queued scene frames can outlive the cleared atlas and may still reference old texture IDs. skip_draws should prevent rendering until fresh scenes are generated, but this is additional safety: skip instead of panic. This was the panic that occasionally happened in the first commit.
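The "skip instead of panic" idea can be sketched in isolation as follows. The `Atlas`, `TextureId` and `draw_sprites` names are hypothetical simplifications for this example (the real atlas in the diff is also keyed by texture kind); only the `get(..)?.as_ref()?` lookup pattern mirrors the change above.

```rust
/// Hypothetical, simplified texture ID (the real one also carries a kind).
#[derive(Clone, Copy, Debug, PartialEq)]
struct TextureId {
    index: u32,
}

struct Texture {
    raw_view: u64,
}

#[derive(Default)]
struct Atlas {
    textures: Vec<Option<Texture>>,
}

impl Atlas {
    fn insert(&mut self, raw_view: u64) -> TextureId {
        self.textures.push(Some(Texture { raw_view }));
        TextureId { index: (self.textures.len() - 1) as u32 }
    }

    /// Simulates the atlas being cleared during GPU recovery.
    fn clear(&mut self) {
        self.textures.clear();
    }

    /// Checked lookup: a stale ID yields `None` instead of an index panic.
    fn get_raw_view(&self, id: TextureId) -> Option<u64> {
        let texture = self.textures.get(id.index as usize)?.as_ref()?;
        Some(texture.raw_view)
    }
}

/// Counts the sprites that can still be drawn, silently skipping stale IDs.
fn draw_sprites(atlas: &Atlas, ids: &[TextureId]) -> usize {
    ids.iter().filter_map(|id| atlas.get_raw_view(*id)).count()
}
```

In this sketch a stale ID could still alias a newly inserted texture once the atlas is refilled, so the checked lookup is only a backstop; skipping draws until fresh scenes exist remains the primary guard.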

I think @kvark is the appropriate person to review this, though he has a strong opinion that this is unrecoverable without restarting the application.

Vanuan · 2025-11-27T03:25:46Z

crates/gpui/src/platform/blade/blade_renderer.rs

+                    "your device information is: {:?}",
+                    self.gpu.device_information()
                );
+                return Err(anyhow::anyhow!("GPU device hung or lost"));


Actually, it depends on the wait_for implementation; this could be just a normal frame timeout. That said, MAX_FRAME_TIME_MS is set to 10 seconds, so the two usually coincide. The major architectural issue is that blade doesn't surface error codes to the application layer, so it's impossible to know whether the device is really lost. This is a good start, although it might lead to false positives on some platforms.

Vanuan · 2025-11-27T03:30:38Z

crates/gpui/src/platform/blade/blade_renderer.rs

-            );
-            while !self.gpu.wait_for(&last_sp, MAX_FRAME_TIME_MS) {}
        }
+        Ok(())


If this approach is accepted, the next logical step would be to extend wait_for or change the bool to Ok/Err upstream.
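To make the bool-versus-Result point concrete, here is a small sketch. Every name in it (`DriverStatus`, `DeviceLost`, `wait_for_bool`, `wait_for_result`) is hypothetical and chosen for the example; it does not reproduce blade's actual signatures.

```rust
/// Hypothetical driver-level outcomes of waiting on a sync point.
#[derive(Clone, Copy, Debug)]
enum DriverStatus {
    Signaled,
    TimedOut,
    DeviceLost,
}

/// Hypothetical error type for the Result-based shape.
#[derive(Debug, PartialEq)]
struct DeviceLost;

/// Boolean shape: a timeout and a lost device both come back as `false`.
fn wait_for_bool(status: DriverStatus) -> bool {
    matches!(status, DriverStatus::Signaled)
}

/// Result shape: `Ok(false)` is a timeout worth retrying, `Err` is terminal.
fn wait_for_result(status: DriverStatus) -> Result<bool, DeviceLost> {
    match status {
        DriverStatus::Signaled => Ok(true),
        DriverStatus::TimedOut => Ok(false),
        DriverStatus::DeviceLost => Err(DeviceLost),
    }
}
```

With the boolean shape, a retry loop such as the removed `while !self.gpu.wait_for(..) {}` has no way to tell a slow frame from a dead device, so it would presumably keep spinning; the Result shape gives the caller a distinct value to break on.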

Vanuan · 2025-11-27T04:09:10Z

crates/gpui/src/platform/linux/x11/window.rs

+                "GPU recovery failed after device loss. This may be a driver issue. \
+                 Please try restarting Zed. Error: {}",
+                recovery_err
+            );


Some info for context. This is the first step, as it will surface telemetry to Zed developers so that they can easily understand how often this issue happens. As it appears on every suspend/resume, this is A LOT and may overwhelm Sentry. If we could reliably detect the device-lost error, it would be possible to try an alternative approach to recovery: instead of crashing the app, we could try restarting it (App::Restart()). But that would lose telemetry. Maybe extract the App::Restart context and send it to the crash-handling server so that it could restart automatically if the issue is known and we don't need telemetry. But then there's a restart-loop problem if the crash happens during initialization. I wonder how Chromium solves it.

Anyway, acknowledging a problem is always a good start.

Having that information, consider adjusting the panic message to be useful to those reviewing Sentry issues rather than to the user. Maybe a simple "GPU hung" would be ok.
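One common way to bound the restart-loop problem raised above is to cap the number of automatic restarts within a time window. The sketch below is only an illustration of that idea; `RestartGuard` and its parameters are hypothetical and do not describe how Zed or Chromium handle it.

```rust
use std::time::{Duration, Instant};

/// Hypothetical guard: allow at most `max_restarts` within `window`.
struct RestartGuard {
    max_restarts: usize,
    window: Duration,
    recent: Vec<Instant>,
}

impl RestartGuard {
    fn new(max_restarts: usize, window: Duration) -> Self {
        RestartGuard { max_restarts, window, recent: Vec::new() }
    }

    /// Returns `true` if another automatic restart is allowed at `now`.
    fn should_restart(&mut self, now: Instant) -> bool {
        // Forget restarts that fell out of the sliding window.
        let window = self.window;
        self.recent.retain(|t| now.duration_since(*t) < window);
        if self.recent.len() >= self.max_restarts {
            // Too many recent restarts: give up and surface the error.
            return false;
        }
        self.recent.push(now);
        true
    }
}
```

In a real application the restart history would have to survive the process exiting, for example in a file or an environment variable handed to the new instance; here it lives in memory only to keep the sketch short.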

Vanuan · 2025-11-28T23:24:58Z

Check out how blade reports GPU crash:
https://github.com/kvark/blade/blob/8cd905be95d9bcf476f69790fcfdbeb615c5154c/blade-graphics/src/vulkan/command.rs#L427C1-L451

Maybe you could copy paste some of this or just call this function if the context is available.

Vanuan · 2025-11-29T04:24:53Z

FYI I'm trying to push a proper detection for device lost: kvark/blade#286

larstalian · 2025-11-30T16:48:14Z

> FYI I'm trying to push a proper detection for device lost: kvark/blade#286

Yeah, okay. I have some changes locally now that seem to recover properly, but they are larger and more complex. It turned out to be a bigger fix than I first envisioned.

Vanuan · 2025-11-30T17:19:21Z

The only robust way I see is separating rendering process from application process and adding (protobuf?) RPC. So that any crash will lead to thin layer restart and immediate release of resources. But that would probably be a move towards chromium and electron which Zed developers despise despite their role in its success

larstalian · 2025-12-03T11:28:38Z

Tested on Ubuntu 24.04 with Nvidia RTX 5060. Recovery now works reliably with multiple windows open. Once kvark/blade#286 lands, we can replace the catch_unwind with proper wait_for_result() error handling.
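For readers unfamiliar with the interim approach, this is roughly what catching a panic and extracting its message looks like with the standard library. `render_frame`, `try_render` and the panic text are placeholders invented for the example, not blade's real function or message.

```rust
use std::panic::{self, AssertUnwindSafe};

/// Hypothetical stand-in for a renderer call that panics on device loss.
fn render_frame(device_lost: bool) -> u32 {
    if device_lost {
        panic!("GPU has crashed"); // placeholder message
    }
    1
}

/// Runs one frame and turns a panic into an `Err` carrying its message.
fn try_render(device_lost: bool) -> Result<u32, String> {
    panic::catch_unwind(AssertUnwindSafe(|| render_frame(device_lost))).map_err(|payload| {
        // Panic payloads are usually `&str` or `String`, depending on how
        // the panic was raised.
        if let Some(message) = payload.downcast_ref::<&str>() {
            message.to_string()
        } else if let Some(message) = payload.downcast_ref::<String>() {
            message.clone()
        } else {
            "unknown panic".to_string()
        }
    })
}
```

This only works when panics unwind (not with `panic = "abort"`), and the default panic hook still prints the message, which is part of why a Result-returning API is the cleaner long-term option.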

reflectronic · 2026-02-13T19:32:37Z

Thank you for the pull request and I'm sorry that we did not review it in a timely manner. I think #46758 has made this patch obsolete, so I'll be closing this pull request. I encourage you to try the WGPU renderer and see if improvements to the device lost handling can be made there.

Vanuan · 2026-02-13T22:50:24Z

According to brief research, it appears that wgpu, like Blade, does not expose Vulkan errors and uses abstractions throughout. It's somewhat more opaque: it does not allow direct detection of device loss events (though there is the Device::lost abstraction) and sometimes treats timeouts like successful renders or device-lost events. Like Blade, wgpu doesn't provide mechanisms for automatic device recovery, so that has to be handled by the application (Zed or GPUI).

Because resources like textures and buffers are tied to a specific Device instance, once that device is lost, those handles are dead. The application (whether it's GPUI or a standalone app) must manually:

Detect the loss (via Device::lost).
Request a new Adapter and Device.
Re-upload all necessary GPU resources.

So this patch is not obsolete per se; it just needs migrating to the wgpu API.
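The three steps listed above can be sketched without any GPU library. All types here (`Device`, `Texture`, `Renderer`) are hypothetical stand-ins, not the wgpu API; the point is only that resources are tied to one device and must be rebuilt from CPU-side data after a loss.

```rust
/// Hypothetical stand-ins for a GPU device and a texture; not the wgpu API.
struct Device {
    generation: u32,
    lost: bool,
}

struct Texture {
    device_generation: u32,
    pixels: Vec<u8>,
}

struct Renderer {
    device: Device,
    cpu_copies: Vec<Vec<u8>>, // CPU-side source data that survives device loss
    textures: Vec<Texture>,
}

impl Renderer {
    fn new(cpu_copies: Vec<Vec<u8>>) -> Self {
        let mut renderer = Renderer {
            device: Device { generation: 0, lost: false },
            cpu_copies,
            textures: Vec::new(),
        };
        renderer.upload_all();
        renderer
    }

    /// Step 3: re-create every GPU resource from its CPU-side copy.
    fn upload_all(&mut self) {
        let generation = self.device.generation;
        self.textures = self
            .cpu_copies
            .iter()
            .map(|pixels| Texture { device_generation: generation, pixels: pixels.clone() })
            .collect();
    }

    /// Returns `true` if a recovery was performed.
    fn recover_if_lost(&mut self) -> bool {
        // Step 1: detect the loss (here just a flag set by the caller).
        if !self.device.lost {
            return false;
        }
        // Step 2: obtain a new device; handles from the old one are dead.
        self.device = Device { generation: self.device.generation + 1, lost: false };
        self.upload_all();
        true
    }
}
```

Keeping CPU-side copies (or a way to regenerate them) is the design cost of this approach: anything that exists only in GPU memory cannot be restored after the device is gone.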

Vanuan · 2026-02-14T01:18:16Z

PTAL #49154

larstalian force-pushed the 23288-gpu-recovery-linux-x11 branch from a5a9c22 to 4825134 Compare November 19, 2025 14:24

cla-bot bot added the cla-signed The user has signed the Contributor License Agreement label Nov 19, 2025

larstalian commented Nov 19, 2025

SomeoneToIgnore assigned mikayla-maki and reflectronic Nov 19, 2025

larstalian force-pushed the 23288-gpu-recovery-linux-x11 branch 2 times, most recently from 5bf3125 to 1fe0972 Compare November 19, 2025 14:41

larstalian force-pushed the 23288-gpu-recovery-linux-x11 branch 2 times, most recently from 7528494 to e322149 Compare November 19, 2025 15:02

larstalian marked this pull request as ready for review November 19, 2025 15:13

gpui: Implement GPU device loss recovery for Linux X11

1647db6

larstalian force-pushed the 23288-gpu-recovery-linux-x11 branch from e322149 to 1647db6 Compare November 19, 2025 15:34

Vanuan mentioned this pull request Nov 24, 2025

Zed is sometimes unresponsive when the OS awakes from sleep #7940

Open

1 task

Vanuan reviewed Nov 25, 2025

larstalian marked this pull request as draft November 25, 2025 13:39

skip stale textures after recovery, panic instead of silent exit

210de6c

larstalian commented Nov 26, 2025

rerun ci

018b5c8

larstalian marked this pull request as ready for review November 26, 2025 09:56

larstalian requested a review from Vanuan November 26, 2025 09:56

update global gpu context after recovery for new windows

658be18

larstalian marked this pull request as draft November 26, 2025 13:12

Vanuan reviewed Nov 27, 2025

zelenenka added this to Quality Week – December 2025 Nov 27, 2025

github-project-automation bot moved this to Community PRs in Quality Week – December 2025 Nov 27, 2025

Vanuan mentioned this pull request Nov 28, 2025

Improve wait_for API to return Result kvark/blade#285

Open

Vanuan mentioned this pull request Nov 29, 2025

wait_for API: Boolean to Result kvark/blade#286

Open

larstalian added 2 commits December 1, 2025 11:25

multi window recovery

39a74bc

extract panic message from blade

da1fef3

larstalian marked this pull request as ready for review December 3, 2025 10:59

mikayla-maki removed their assignment Jan 26, 2026

reflectronic closed this Feb 13, 2026

github-project-automation bot moved this from Community PRs to Done in Quality Week – December 2025 Feb 13, 2026

Vanuan mentioned this pull request Feb 14, 2026

linux: Implement GPU device loss recovery for wgpu #49154

Draft
