I use the VAE to encode x into latents and then train the diffusion model on them:
```python
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-ema").to(device)
# encode to latents and apply the SD latent scaling factor
x = vae.encode(x).latent_dist.sample().mul_(0.18215)
```
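For context, this is roughly how the encoding sits in my training loop. It is only a sketch: `train_loader`, `diffusion_model`, `diffusion_loss`, and `optimizer` are placeholders for my actual dataloader, denoiser, loss, and optimizer, and the VAE is kept frozen in eval mode:

```python
import torch
from diffusers import AutoencoderKL

device = "cuda" if torch.cuda.is_available() else "cpu"

# Frozen pretrained VAE, used only to map images to latents
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-ema").to(device)
vae.requires_grad_(False)
vae.eval()

for x, c in train_loader:  # x: images normalized to [-1, 1], c: condition / class labels
    x = x.to(device)
    with torch.no_grad():
        # Encode to latents and apply the SD scaling factor 0.18215
        z = vae.encode(x).latent_dist.sample().mul_(0.18215)
    # Train the diffusion model on the scaled latents (placeholder loss)
    loss = diffusion_loss(diffusion_model, z, c.to(device))
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```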
Then I sample latents and decode them:
```python
with torch.no_grad():
    z = ema_sample_method(opt.n_sample, z_shape, guide_w=opt.w)
    # undo the latent scaling before decoding back to pixel space
    x_gen = vae.decode(z / 0.18215).sample
```
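For completeness, this is how I turn the decoded tensor into viewable images, assuming the decoder output is roughly in [-1, 1]; the output path is just an example:

```python
from torchvision.utils import save_image

# Map decoder output from [-1, 1] to [0, 1] and clamp before saving a grid
x_gen = (x_gen.clamp(-1, 1) + 1) / 2
save_image(x_gen, "samples.png", nrow=8)
```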
The generated images look very poor.

I hope there is a solution; any pointers on what might be wrong with this setup would be appreciated.