gpui: Implement GPU device loss recovery for Linux X11 #43070

Closed
larstalian wants to merge 6 commits into zed-industries:main from larstalian:23288-gpu-recovery-linux-x11

Conversation

@larstalian

Closes #32318
Closes #27751
Partial fix for #23288 (X11 only, not Wayland)

Release Notes:

  • Fixed Zed hanging with 100% CPU usage after resuming from sleep on Linux X11 with NVIDIA GPUs

cla-bot bot commented Nov 19, 2025

We require contributors to sign our Contributor License Agreement, and we don't have @larstalian on file. You can sign our CLA at https://zed.dev/cla. Once you've signed, post a comment here that says '@cla-bot check'.

@larstalian larstalian force-pushed the 23288-gpu-recovery-linux-x11 branch from a5a9c22 to 4825134 Compare November 19, 2025 14:24
@cla-bot cla-bot bot added the cla-signed The user has signed the Contributor License Agreement label Nov 19, 2025
path_intermediate_msaa_texture: Option<gpu::Texture>,
path_intermediate_msaa_texture_view: Option<gpu::TextureView>,
rendering_parameters: RenderingParameters,
skip_draws: bool,
Author

This matches the Windows implementation.

@larstalian larstalian force-pushed the 23288-gpu-recovery-linux-x11 branch 2 times, most recently from 5bf3125 to 1fe0972 Compare November 19, 2025 14:41
@larstalian
Author

Tested on Wayland: there is no GPU hang as on X11, but sprites still don't render after suspend/resume.

@larstalian larstalian force-pushed the 23288-gpu-recovery-linux-x11 branch 2 times, most recently from 7528494 to e322149 Compare November 19, 2025 15:02
@larstalian larstalian marked this pull request as ready for review November 19, 2025 15:13
Self::attempt_gpu_recovery(&mut inner, self.0.x_window, &self.0.xcb, 0)
{
log::error!("GPU recovery failed: {}", recovery_err);
std::process::exit(1);

@Vanuan Vanuan Nov 25, 2025

Well, silently exiting the app is probably no better than freezing, is it? Maybe panic!() would be better, so that the panic hook could eventually handle this and respawn Zed with a message on start: "Some problem occurred, here are the details"?

@Vanuan Vanuan Nov 25, 2025

For example:

use std::{panic, env, process};

// Define an environment variable used to check if the process is a 'restarted' instance
const RESTART_ENV_VAR: &str = "APP_RESTARTED_FROM_CRASH";

/// Placeholder function to simulate the logic for relaunching the process.
/// In a real-world scenario, this would likely involve using the 'exec' system call
/// or signaling an external supervisor.
fn relaunch_process(error_details: &str) {
    eprintln!("--- Relaunch Initiated ---");
    eprintln!("Error to pass: '{}'", error_details);
    eprintln!("In a real app, this would execute a new instance with the error info.");
    // Example: process::Command::new(env::current_exe().unwrap())
    //      .env(RESTART_ENV_VAR, error_details)
    //      .spawn().unwrap();
    
    // For this example, we'll just exit gracefully after 'relaunching'
    process::exit(1); 
}

fn main() {
    // 1. Check on start for the 'restarted' flag
    if let Ok(error_details) = env::var(RESTART_ENV_VAR) {
        println!("*** ⚠️ Application started as a restarted instance! ***");
        println!("Previous crash reason: {}", error_details);
        
        // Clear the environment variable so future runs are clean (optional)
        env::remove_var(RESTART_ENV_VAR);
        
        // Here you would implement logic like:
        // - Load previous state from disk
        // - Present a crash report to the user
        // - Attempt a less resource-intensive mode
    }

    // 2. Install the custom panic hook
    panic::set_hook(Box::new(|panic_info| {
        // --- CUSTOM PANIC HANDLER ---
        eprintln!("\n🚨 Custom Panic Hook Activated 🚨");
        eprintln!("Panic occurred at: {}", panic_info.location().unwrap());
        
        // Extract a simple error string to pass to the new process
        let error_message = format!("Panic: {}", panic_info.payload().downcast_ref::<&str>().unwrap_or(&"Unknown Error"));
        
        // Call the relaunch logic (which will likely exit the current process)
        // NOTE: This call is the key to your "respawn" concept.
        relaunch_process(&error_message);
    }));

    // --- APPLICATION LOGIC ---
    println!("Application running normally...");
    println!("Simulating a critical error now...");
    
    // 3. Trigger the panic
    // This will cause the panic hook to fire.
    panic!("GPU device lost");
    
    // This line is never reached
    println!("Application finished.");
}

Author

Okay, great, thanks. I will address that! Additionally, after testing this solution for a couple of days, I have found that it occasionally does not recover properly. I am investigating and coding up a fix for that, and will set the PR as draft until it is fixed. I really appreciate the feedback.

@larstalian larstalian marked this pull request as draft November 25, 2025 13:39
Comment on lines +95 to +101
pub fn get_texture_info(&self, id: AtlasTextureId) -> Option<BladeTextureInfo> {
let lock = self.0.lock();
let texture = &lock.storage[id];
BladeTextureInfo {
let textures = &lock.storage[id.kind];
let texture = textures.textures.get(id.index as usize)?.as_ref()?;
Some(BladeTextureInfo {
raw_view: texture.raw_view,
}
})
Author

After GPU recovery the atlas is cleared, but queued scene frames can outlive the cleared atlas and may still reference old texture IDs. skip_draws should prevent rendering until fresh scenes are generated, but this adds a safety net: a stale lookup is skipped instead of panicking. This was the panic that occasionally happened with the first commit.
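
A minimal sketch of the "skip instead of panic" lookup described above. The `Atlas`, `TextureId`, and `TextureInfo` types are simplified, hypothetical stand-ins rather than the real gpui definitions; only the `Option`-returning lookup mirrors the change under review.

```rust
// Hypothetical stand-ins for the atlas types; not the real gpui definitions.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct TextureId {
    pub index: u32,
}

#[derive(Debug, PartialEq)]
pub struct TextureInfo {
    pub raw_view: u64,
}

#[derive(Default)]
pub struct Atlas {
    // A slot is `None` once its texture has been freed.
    textures: Vec<Option<TextureInfo>>,
}

impl Atlas {
    pub fn insert(&mut self, raw_view: u64) -> TextureId {
        self.textures.push(Some(TextureInfo { raw_view }));
        TextureId { index: (self.textures.len() - 1) as u32 }
    }

    /// Drops every texture, as would happen when the atlas is cleared after recovery.
    pub fn clear(&mut self) {
        self.textures.clear();
    }

    /// Returns `None` for a stale id instead of panicking on an out-of-range index.
    pub fn get_texture_info(&self, id: TextureId) -> Option<&TextureInfo> {
        self.textures.get(id.index as usize)?.as_ref()
    }
}

/// Counts how many queued sprites can still be drawn; stale ones are skipped.
pub fn drawable_sprites(atlas: &Atlas, queued: &[TextureId]) -> usize {
    queued
        .iter()
        .filter(|id| atlas.get_texture_info(**id).is_some())
        .count()
}
```

A frame queued before `clear()` then simply draws nothing for the ids it still holds, and the next freshly built scene repopulates the atlas.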

I think @kvark is the appropriate person to review this, though he has a strong opinion that this is unrecoverable without restarting the application.

@larstalian larstalian marked this pull request as ready for review November 26, 2025 09:56
@larstalian larstalian requested a review from Vanuan November 26, 2025 09:56
@larstalian larstalian marked this pull request as draft November 26, 2025 13:12
"your device information is: {:?}",
self.gpu.device_information()
);
return Err(anyhow::anyhow!("GPU device hung or lost"));

Actually, it depends on the wait_for implementation; this could be just a normal frame timeout. MAX_FRAME_TIME_MS is set to 10 seconds, though, so the two usually coincide. The major architectural issue is that blade doesn't surface error codes to the application layer, so it's impossible to know whether the device is really lost. This is a good start, but it might lead to false positives on some platforms.

);
while !self.gpu.wait_for(&last_sp, MAX_FRAME_TIME_MS) {}
}
Ok(())

@Vanuan Vanuan Nov 27, 2025

If this approach is accepted, the next logical step would be to extend wait_for upstream, or to change its bool return value to Ok/Err.
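
To make the bool-versus-Result point concrete, here is a rough sketch of what a `Result`-returning wait could look like. Everything in it (`WaitError`, `RawStatus`, `wait_result`, `next_action`) is hypothetical and is not blade's API; it only illustrates how a caller could tell a timeout apart from a lost device.

```rust
use std::fmt;

/// Hypothetical error type. In the code under review, `wait_for` returns a plain `bool`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum WaitError {
    /// The sync point was not reached in time; the device may still be fine.
    Timeout,
    /// The backend reported that the device is gone.
    DeviceLost,
}

impl fmt::Display for WaitError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            WaitError::Timeout => write!(f, "timed out waiting for the GPU"),
            WaitError::DeviceLost => write!(f, "GPU device lost"),
        }
    }
}

/// Hypothetical raw status, standing in for whatever the graphics backend reports.
#[derive(Debug, Clone, Copy)]
pub enum RawStatus {
    Signaled,
    NotReady,
    DeviceLost,
}

/// Maps the raw status to a `Result` instead of collapsing it into a `bool`.
pub fn wait_result(status: RawStatus) -> Result<(), WaitError> {
    match status {
        RawStatus::Signaled => Ok(()),
        RawStatus::NotReady => Err(WaitError::Timeout),
        RawStatus::DeviceLost => Err(WaitError::DeviceLost),
    }
}

#[derive(Debug, PartialEq, Eq)]
pub enum FrameAction {
    Present,
    KeepWaiting,
    Recover,
}

/// With a typed error, only a real device loss triggers recovery.
pub fn next_action(result: Result<(), WaitError>) -> FrameAction {
    match result {
        Ok(()) => FrameAction::Present,
        Err(WaitError::Timeout) => FrameAction::KeepWaiting,
        Err(WaitError::DeviceLost) => FrameAction::Recover,
    }
}
```

With this shape, a slow frame no longer has to be treated as a lost device, which is the false-positive risk raised in the review.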

"GPU recovery failed after device loss. This may be a driver issue. \
Please try restarting Zed. Error: {}",
recovery_err
);

@Vanuan Vanuan Nov 27, 2025

Some info for context. This is the first step, as it will surface telemetry to Zed developers so that they can easily see how often this issue happens. Since it appears on every suspend/resume, that is A LOT and may overwhelm Sentry. If we could reliably detect the device-lost error, we could try an alternative approach to recovery: instead of crashing the app, restart it (App::Restart()). But that would lose telemetry. Maybe the App::Restart context could be extracted and sent to the crash-handling server, so that it could restart automatically when the issue is known and we don't need telemetry. But then there's a restart-loop problem if the crash happens during initialization. I wonder how Chromium solves it.

Anyway, acknowledging a problem is always a good start.

Given that, consider adjusting the panic message to be useful to those reviewing Sentry issues rather than to the user. Maybe a simple "GPU hung" would be OK.
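
One possible guard against the restart loop mentioned above is to carry a restart counter across instances and stop relaunching once it reaches a limit. This is only a sketch under assumptions: `RESTART_COUNT_VAR`, `MAX_RESTARTS`, and both functions are hypothetical names, and nothing here reflects how Zed or Chromium handles it.

```rust
use std::{env, io, process::Command};

/// Hypothetical variable name; not something Zed defines.
pub const RESTART_COUNT_VAR: &str = "APP_GPU_RESTART_COUNT";
/// Hypothetical limit on consecutive automatic restarts.
pub const MAX_RESTARTS: u32 = 3;

/// Returns the counter to hand to the next instance, or `None` once the limit is reached.
pub fn next_restart_count(raw: Option<&str>) -> Option<u32> {
    // A missing or malformed value counts as "never restarted".
    let previous = raw.and_then(|v| v.trim().parse::<u32>().ok()).unwrap_or(0);
    (previous < MAX_RESTARTS).then(|| previous + 1)
}

/// Spawns a fresh copy of the current executable unless the limit is reached.
/// Returns `Ok(true)` if a new instance was started.
pub fn relaunch_with_guard() -> io::Result<bool> {
    let raw = env::var(RESTART_COUNT_VAR).ok();
    match next_restart_count(raw.as_deref()) {
        Some(count) => {
            Command::new(env::current_exe()?)
                .env(RESTART_COUNT_VAR, count.to_string())
                .spawn()?;
            Ok(true)
        }
        // Too many restarts in a row: give up and let the crash surface.
        None => Ok(false),
    }
}
```

A real implementation would also need to reset the counter after the app has run healthily for some time; otherwise a fourth unrelated device loss hours later would be treated as part of the same loop.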

Vanuan commented Nov 28, 2025

Check out how blade reports a GPU crash:
https://github.com/kvark/blade/blob/8cd905be95d9bcf476f69790fcfdbeb615c5154c/blade-graphics/src/vulkan/command.rs#L427C1-L451

Maybe you could copy-paste some of this, or just call this function if the context is available.

Vanuan commented Nov 29, 2025

FYI, I'm trying to push proper detection of device loss: kvark/blade#286

@larstalian
Author

FYI, I'm trying to push proper detection of device loss: kvark/blade#286

Yeah, okay. I have some changes locally now that seem to recover properly, but the change is larger and more complex. It turned out to be a somewhat bigger fix than I first envisioned.

Vanuan commented Nov 30, 2025

The only robust way I see is separating the rendering process from the application process and adding an RPC layer (protobuf?), so that any crash leads to a restart of a thin layer and an immediate release of resources. But that would probably be a move towards Chromium and Electron, which Zed developers despise despite their role in its success.

@larstalian larstalian marked this pull request as ready for review December 3, 2025 10:59
@larstalian
Author

Tested on Ubuntu 24.04 with an NVIDIA RTX 5060. Recovery now works reliably with multiple windows open. Once kvark/blade#286 lands, we can replace the catch_unwind with proper wait_for_result() error handling.
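
For readers unfamiliar with the catch_unwind stopgap, a minimal sketch of the general pattern follows. The `draw_guarded` helper is hypothetical and is not the code in this PR; it only shows how a panic raised inside a draw call can be turned into an error value that a recovery path can act on.

```rust
use std::panic::{self, AssertUnwindSafe};

/// Hypothetical helper: runs `draw` and converts a panic into an `Err` carrying its message.
/// This only works when the binary is built with unwinding panics (not `panic = "abort"`).
pub fn draw_guarded<F: FnOnce()>(draw: F) -> Result<(), String> {
    panic::catch_unwind(AssertUnwindSafe(draw)).map_err(|payload| {
        // Panic payloads are usually `&str` or `String`, depending on how `panic!` was called.
        if let Some(message) = payload.downcast_ref::<&str>() {
            message.to_string()
        } else if let Some(message) = payload.downcast_ref::<String>() {
            message.clone()
        } else {
            "unknown panic".to_string()
        }
    })
}
```

The `AssertUnwindSafe` wrapper is a promise that any state touched by `draw` is rebuilt or discarded after a panic, which is why a typed error from the wait call would be the cleaner long-term replacement.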

@mikayla-maki mikayla-maki removed their assignment Jan 26, 2026
@reflectronic
Member

Thank you for the pull request and I'm sorry that we did not review it in a timely manner. I think #46758 has made this patch obsolete, so I'll be closing this pull request. I encourage you to try the WGPU renderer and see if improvements to the device lost handling can be made there.

@github-project-automation github-project-automation bot moved this from Community PRs to Done in Quality Week – December 2025 Feb 13, 2026
Vanuan commented Feb 13, 2026

According to brief research, it appears that wgpu, like Blade, does not expose Vulkan errors and uses abstractions throughout. It is somewhat more opaque: it does not allow direct detection of device loss events (though there is the Device::lost abstraction), and it sometimes treats timeouts as successful renders or as device-lost events. Like Blade, wgpu doesn't provide a mechanism for automatic device recovery; that should be handled by the application (Zed or GPUI).

Because resources like textures and buffers are tied to a specific Device instance, once that device is lost, those handles are dead. The application (whether it's GPUI or a standalone app) must manually:

  • Detect the loss (via Device::lost).
  • Request a new Adapter and Device.
  • Re-upload all necessary GPU resources

So this patch is not obsolete per se; it just needs migrating to the wgpu API.
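
The three steps listed above can be modelled without any graphics library. The sketch below is a toy model under assumptions: `Device`, `Texture`, and `Renderer` are hypothetical types, and no wgpu calls are shown. It only illustrates why resources must be re-uploaded once the device they were created on is replaced.

```rust
/// Hypothetical stand-in for a GPU device; `generation` changes on every recreation.
#[derive(Debug)]
pub struct Device {
    generation: u64,
}

/// A GPU resource is only valid for the device generation that created it.
#[derive(Debug)]
pub struct Texture {
    generation: u64,
}

pub struct Renderer {
    device: Device,
    device_lost: bool,
    /// CPU-side copies kept so that textures can be uploaded again.
    sources: Vec<Vec<u8>>,
    textures: Vec<Texture>,
}

impl Renderer {
    pub fn new(sources: Vec<Vec<u8>>) -> Self {
        let mut renderer = Renderer {
            device: Device { generation: 0 },
            device_lost: false,
            sources,
            textures: Vec::new(),
        };
        renderer.upload_all();
        renderer
    }

    /// Step 1: a loss notification (for example from a callback) only sets a flag.
    pub fn notify_device_lost(&mut self) {
        self.device_lost = true;
    }

    fn upload_all(&mut self) {
        let generation = self.device.generation;
        self.textures = self.sources.iter().map(|_| Texture { generation }).collect();
    }

    /// Steps 2 and 3: create a new device, then re-upload every resource.
    fn recover(&mut self) {
        self.device = Device { generation: self.device.generation + 1 };
        self.upload_all();
        self.device_lost = false;
    }

    /// Draws one frame and returns how many textures were usable.
    pub fn draw(&mut self) -> usize {
        if self.device_lost {
            self.recover();
        }
        let current = self.device.generation;
        self.textures.iter().filter(|t| t.generation == current).count()
    }

    pub fn device_generation(&self) -> u64 {
        self.device.generation
    }
}
```

The design choice worth noting is that recovery depends on keeping CPU-side copies (or a way to regenerate them); without those, there is nothing to re-upload after the old handles die.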

Vanuan commented Feb 14, 2026

PTAL #49154

Development

Successfully merging this pull request may close these issues.

  • Zed hangs after exiting sleep mode in Linux
  • Lock screen shown instead of Zed when the computer is unlocked

4 participants