Multi-GPU training causes RuntimeError #2

@Hsintien-Ng

Description

Hi, I tried to run train.py on the FFHQ dataset in multi-GPU mode. However, I hit the following RuntimeError:

    training_process(rank, world_size, opt, device)
  File "/home/xintian/workspace/GRAM-main/training_loop.py", line 217, in training_process
    d_loss = process.train_D(real_imgs, real_poses, generator_ddp, discriminator_ddp, optimizer_D, scaler, config, device)
  File "/home/xintian/workspace/GRAM-main/processes/processes.py", line 38, in train_D
    g_imgs, g_pos = generator_ddp(subset_z, **config['camera'])
  File "/home/xintian/anaconda3/envs/torch18/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/xintian/anaconda3/envs/torch18/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 705, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/home/xintian/anaconda3/envs/torch18/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/xintian/workspace/GRAM-main/generators/generators.py", line 112, in forward
    img, _ = self.renderer.render(self._intersections, self._volume(z, truncation_psi), img_size, camera_origin, camera_pos, fov, ray_start, ray_end, z.device)
  File "/home/xintian/workspace/GRAM-main/generators/renderers/manifold_renderer.py", line 194, in render
    coarse_output = volume(transformed_points, transformed_ray_directions_expanded).reshape(batchsize, img_size * img_size, self.num_manifolds, 4)
  File "/home/xintian/workspace/GRAM-main/generators/generators.py", line 76, in <lambda>
    return lambda points, ray_directions: self.representation.get_radiance(z, points, ray_directions, truncation_psi)
  File "/home/xintian/workspace/GRAM-main/generators/representations/gram.py", line 317, in get_radiance
    return self.rf_network(x, z, ray_directions, truncation_psi)
  File "/home/xintian/anaconda3/envs/torch18/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/xintian/workspace/GRAM-main/generators/representations/gram.py", line 260, in forward
    frequencies_2, phase_shifts_2 = self.mapping_network(z2)
  File "/home/xintian/anaconda3/envs/torch18/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/xintian/workspace/GRAM-main/generators/representations/gram.py", line 93, in forward
    frequencies_offsets = self.network(z)
  File "/home/xintian/anaconda3/envs/torch18/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/xintian/anaconda3/envs/torch18/lib/python3.6/site-packages/torch/nn/modules/container.py", line 119, in forward
    input = module(input)
  File "/home/xintian/anaconda3/envs/torch18/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/xintian/anaconda3/envs/torch18/lib/python3.6/site-packages/torch/nn/modules/linear.py", line 94, in forward
    return F.linear(input, self.weight, self.bias)
  File "/home/xintian/anaconda3/envs/torch18/lib/python3.6/site-packages/torch/nn/functional.py", line 1753, in linear
    return torch._C._nn.linear(input, weight, bias)
RuntimeError: Expected tensor for 'out' to have the same device as tensor for argument #2 'mat1'; but device 1 does not equal 0 (while checking arguments for addmm)

How can I fix this? By the way, when I run train.py in single-GPU mode, it works fine.
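
For context, the addmm error at the bottom of the traceback ("device 1 does not equal 0") typically means that, inside the DistributedDataParallel forward pass on rank 1, some input tensor still lives on cuda:0 while the module's weights live on cuda:1. Below is a minimal sketch of the usual per-rank setup that avoids this in PyTorch DDP; it is not GRAM's actual code, and `setup_ddp` plus the variable names are illustrative only.

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def setup_ddp(rank, world_size, generator):
    """Illustrative per-rank setup: bind this process to GPU `rank` and
    make sure the module's parameters live on that GPU before wrapping."""
    # Assumes MASTER_ADDR/MASTER_PORT are set in the environment, as usual for DDP.
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)              # default CUDA device for this process
    device = torch.device(f"cuda:{rank}")
    generator = generator.to(device)         # parameters on this rank's GPU
    generator_ddp = DDP(generator, device_ids=[rank], output_device=rank)
    return generator_ddp, device


# Inside the training loop, every batch tensor must use the same per-rank device, e.g.:
#   real_imgs = real_imgs.to(device, non_blocking=True)
#   subset_z  = torch.randn(batch_size, z_dim, device=device)
# If a latent or conditioning tensor stays on cuda:0 while the weights sit on cuda:1,
# F.linear raises the "device 1 does not equal 0" addmm error seen above.
```

The key detail is that any latent such as subset_z is created or moved with the current rank's device before being passed to generator_ddp, rather than defaulting to cuda:0.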
