gpui: Implement GPU device loss recovery for Linux X11#43070
gpui: Implement GPU device loss recovery for Linux X11#43070larstalian wants to merge 6 commits intozed-industries:mainfrom
Conversation
|
We require contributors to sign our Contributor License Agreement, and we don't have @larstalian on file. You can sign our CLA at https://zed.dev/cla. Once you've signed, post a comment here that says '@cla-bot check'. |
a5a9c22 to
4825134
Compare
| path_intermediate_msaa_texture: Option<gpu::Texture>, | ||
| path_intermediate_msaa_texture_view: Option<gpu::TextureView>, | ||
| rendering_parameters: RenderingParameters, | ||
| skip_draws: bool, |
There was a problem hiding this comment.
this matches the windows implementation
5bf3125 to
1fe0972
Compare
|
Tested on wayland - no GPU hang like X11, but sprites still don't render after suspend/resume |
7528494 to
e322149
Compare
e322149 to
1647db6
Compare
| Self::attempt_gpu_recovery(&mut inner, self.0.x_window, &self.0.xcb, 0) | ||
| { | ||
| log::error!("GPU recovery failed: {}", recovery_err); | ||
| std::process::exit(1); |
There was a problem hiding this comment.
Well, silently exiting the app probably not better than freezing? Maybe panic!() would be better? So that eventually, the panic hook could handle this and re-spawn Zed with a message on start "Some problem occurred, here's the details"?
There was a problem hiding this comment.
For example:
use std::{panic, env, process};
// Define an environment variable used to check if the process is a 'restarted' instance
const RESTART_ENV_VAR: &str = "APP_RESTARTED_FROM_CRASH";
/// Placeholder function to simulate the logic for relaunching the process.
/// In a real-world scenario, this would likely involve using the 'exec' system call
/// or signaling an external supervisor.
fn relaunch_process(error_details: &str) {
eprintln!("--- Relaunch Initiated ---");
eprintln!("Error to pass: '{}'", error_details);
eprintln!("In a real app, this would execute a new instance with the error info.");
// Example: process::Command::new(env::current_exe().unwrap())
// .env(RESTART_ENV_VAR, error_details)
// .spawn().unwrap();
// For this example, we'll just exit gracefully after 'relaunching'
process::exit(1);
}
fn main() {
// 1. Check on start for the 'restarted' flag
if let Ok(error_details) = env::var(RESTART_ENV_VAR) {
println!("*** ⚠️ Application started as a restarted instance! ***");
println!("Previous crash reason: {}", error_details);
// Clear the environment variable so future runs are clean (optional)
env::remove_var(RESTART_ENV_VAR);
// Here you would implement logic like:
// - Load previous state from disk
// - Present a crash report to the user
// - Attempt a less resource-intensive mode
}
// 2. Install the custom panic hook
panic::set_hook(Box::new(|panic_info| {
// --- CUSTOM PANIC HANDLER ---
eprintln!("\n🚨 Custom Panic Hook Activated 🚨");
eprintln!("Panic occurred at: {}", panic_info.location().unwrap());
// Extract a simple error string to pass to the new process
let error_message = format!("Panic: {}", panic_info.payload().downcast_ref::<&str>().unwrap_or(&"Unknown Error"));
// Call the relaunch logic (which will likely exit the current process)
// NOTE: This call is the key to your "respawn" concept.
relaunch_process(&error_message);
}));
// --- APPLICATION LOGIC ---
println!("Application running normally...");
println!("Simulating a critical error now...");
// 3. Trigger the panic
// This will cause the panic hook to fire.
panic!("GPU device lost");
// This line is never reached
println!("Application finished.");
}There was a problem hiding this comment.
Okay great thanks, i will address that! Additionally, after testing this solution a couple days i have found that it occasionally does not recover properly. I am investigating and coding up a fix for that. Will set the PR as draft until that is fixed. I really appreciate the feedback
| pub fn get_texture_info(&self, id: AtlasTextureId) -> Option<BladeTextureInfo> { | ||
| let lock = self.0.lock(); | ||
| let texture = &lock.storage[id]; | ||
| BladeTextureInfo { | ||
| let textures = &lock.storage[id.kind]; | ||
| let texture = textures.textures.get(id.index as usize)?.as_ref()?; | ||
| Some(BladeTextureInfo { | ||
| raw_view: texture.raw_view, | ||
| } | ||
| }) |
There was a problem hiding this comment.
After GPU recovery the atlas is cleared, but queued scene frames may still reference old texture IDs, since they can outlive the cleared atlas. skip_draw should prevent rendering until fresh scenes are generated, but this is additional safety: skip instead of panic. This was the panic that occasionally happened in the first commit.
There was a problem hiding this comment.
I think @kvark is appropriate person to review this. Though he has a strong opinion that this is unrecoverable without restarting the application.
| "your device information is: {:?}", | ||
| self.gpu.device_information() | ||
| ); | ||
| return Err(anyhow::anyhow!("GPU device hung or lost")); |
There was a problem hiding this comment.
Actually, it depends on wait_for implementation, could be just a normal frame timeout. Although, MAX_FRAME_TIME_MS is set to 10 seconds, so it usually coincides. The major architectural issue is that blade doesn't surface error codes to the application layer, so it's impossible to know whether device is really lost. But this is a good start, although it might lead to false positives on some platforms.
| ); | ||
| while !self.gpu.wait_for(&last_sp, MAX_FRAME_TIME_MS) {} | ||
| } | ||
| Ok(()) |
There was a problem hiding this comment.
If this approach is accepted, next logical step would to to extend wait_for or change bool to Ok/Err upstream
| "GPU recovery failed after device loss. This may be a driver issue. \ | ||
| Please try restarting Zed. Error: {}", | ||
| recovery_err | ||
| ); |
There was a problem hiding this comment.
Some info for context. This is the first step as it will surface telemetry to Zed developers so that they can easily understand how often does this issue happen. As it appears on every suspend resume, this is A LOT and may overwhelm Sentry. If we could reliably determine the device lost error, it would be possible to try alternative approach to recovery: instead of crashing the app we could try restarting it (App:: Restart()). But it will lose telemetry. Maybe extract App::Restart context and sending it to the crash handling server so that it could restart automatically if the issue is known and we don't need telemetry. But then, there's a restart loop problem if the crash is during initialization. I wonder how chromium solves it.
Anyway, acknowledging a problem is always a good start.
Having that information, consider adjusting the panic message to be useful to those reviewing Sentry issues rather than to the user. Maybe a simple "GPU hung" would be ok.
|
Check out how blade reports GPU crash: Maybe you could copy paste some of this or just call this function if the context is available. |
|
FIY I'm trying to push a proper detection for device lost: kvark/blade#286 |
yeah okay. I have some changes locally now that seems to recover properly, but its a larger and more complex. Turned out to be a bit bigger fix than i first envisioned. |
|
Tested on Ubuntu 24.04 with Nvidia RTX 5060. Recovery now works reliably with multiple windows open. Once kvark/blade#286 lands, we can replace the catch_unwind with proper wait_for_result() error handling. |
|
Thank you for the pull request and I'm sorry that we did not review it in a timely manner. I think #46758 has made this patch obsolete, so I'll be closing this pull request. I encourage you to try the WGPU renderer and see if improvements to the device lost handling can be made there. |
|
According to brief research, it appears that wgpu, like Blade, does not expose Vulkan errors and uses abstractions throughout. It's kind of more opaque does not allow detection of device loss events (though there's abstraction Because resources like textures and buffers are tied to a specific Device instance, once that device is lost, those handles are dead. The application (whether it's GPUI or a standalone app) must manually:
So this patch is not obsolete per se, it's just needs migrating to Wgpu API. |
|
PTAL #49154 |

Closes #32318
Closes #27751
Partial fix for #23288 (X11 only, not Wayland)
Release Notes: