
Inconsistent GPU usage when embeddings exist vs. when they don’t #6

@stormliucong

Description

When an HPO embedding cache already exists, the run emits CUDA initialization warnings ("Unexpected error from cudaGetDeviceCount() ... Error 803" and "Can't initialize NVML"), and the computation falls back to CPU (see Example 1).

However, when no embedding cache exists, these warnings do not appear and inference runs on the GPU as expected (see Example 2).
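A quick way to confirm the CPU fallback is to run the same check PyTorch performs at import time. A minimal sketch, assuming the same PyTorch environment that inference.py uses (only standard torch calls, nothing PhenoGPT2-specific):

import torch

# When the Error 803 / NVML warnings from Example 1 appear,
# is_available() returns False and everything downstream silently runs on CPU.
print("torch:", torch.__version__)
print("built with CUDA:", torch.version.cuda)
print("cuda available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device count:", torch.cuda.device_count())
    print("current device:", torch.cuda.current_device())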

Environment:
This behavior was observed on a Slurm-managed HPC cluster (HMS Biogrid). The exact environment setup may be a contributing factor, but I’m not entirely sure.
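Since this runs as a Slurm job, it may also help to dump what the allocated node actually exposes in each of the two cases. A rough diagnostic sketch; the SLURM_* variable names depend on the cluster's GRES configuration, and pynvml is only needed if it happens to be installed:

import os

for var in ("CUDA_VISIBLE_DEVICES", "SLURM_JOB_GPUS", "SLURM_GPUS_ON_NODE", "LD_LIBRARY_PATH"):
    print(f"{var}={os.environ.get(var)}")

# Try to reproduce the "Can't initialize NVML" warning outside of torch:
try:
    import pynvml
    pynvml.nvmlInit()
    print("NVML OK, driver version:", pynvml.nvmlSystemGetDriverVersion())
except Exception as exc:
    print("NVML init failed:", exc)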

Example 1

Executing: python.phenogpt2 /home/ch262025/PhenoGPT2/inference.py -i "/home/ch262025/PhenoGPT2/data/example/task_list_subset.json" -o "/home/ch262025/PhenoGPT2/data/results/example_testing" -model_dir "/programs/local/biogrids/phenogpt2/models/PhenoGPT2-EHR" -index "0" -negation --text_only
/programs/x86_64-linux/phenogpt2/51acdf1/.pixi/envs/default/lib/python3.11/site-packages/torch/cuda/__init__.py:129: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 803: system has unsupported display driver / cuda driver combination (Triggered internally at /home/conda/feedstock_root/build_artifacts/libtorch_1744247799952/work/c10/cuda/CUDAFunctions.cpp:109.)
  return torch._C._cuda_getDeviceCount() > 0
/programs/x86_64-linux/phenogpt2/51acdf1/.pixi/envs/default/lib/python3.11/site-packages/torch/cuda/__init__.py:734: UserWarning: Can't initialize NVML
  warnings.warn("Can't initialize NVML")
`torch_dtype` is deprecated! Use `dtype` instead!
Detected existing HPO Database Embeddings

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]
Loading checkpoint shards: 100%|██████████| 4/4 [00:00<00:00, 79.09it/s]
start phenogpt2
/home/ch262025/PhenoGPT2/data/results/example_testing
use_vision: False

  0%|          | 0/10 [00:00<?, ?it/s]The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.

Example 2

Executing: python.phenogpt2 /home/ch262025/PhenoGPT2/inference.py -i "/home/ch262025/PhenoGPT2/data/example/task_list_subset.json" -o "/home/ch262025/PhenoGPT2/data/results/example_testing" -model_dir "/programs/local/biogrids/phenogpt2/models/PhenoGPT2-EHR" -index "0" -negation --text_only
No existing HPO Database Embeddings are stored - Running embedding now
Embedding HPO database::   0%|          | 0/40451 [00:00<?, ?it/s]Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.

Embedding HPO database::   0%|          | 1/40451 [00:00<1:59:52,  5.62it/s]
Embedding HPO database::   0%|          | 19/40451 [00:00<08:13, 81.94it/s] 
...
Embedding HPO database:: 100%|██████████| 40451/40451 [03:04<00:00, 219.69it/s]
`torch_dtype` is deprecated! Use `dtype` instead!

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]
Loading checkpoint shards:  25%|██▌       | 1/4 [00:01<00:03,  1.13s/it]
Loading checkpoint shards:  50%|█████     | 2/4 [00:02<00:02,  1.11s/it]
Loading checkpoint shards:  75%|███████▌  | 3/4 [00:03<00:01,  1.10s/it]
Loading checkpoint shards: 100%|██████████| 4/4 [00:03<00:00,  1.26it/s]
Loading checkpoint shards: 100%|██████████| 4/4 [00:03<00:00,  1.10it/s]
start phenogpt2
/home/ch262025/PhenoGPT2/data/results/example_testing
use_vision: False

  0%|          | 0/10 [00:00<?, ?it/s]The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
2025-10-02 16:14:02.715 | DEBUG    | PyRuSH.PyRuSHSentencizer:predict:100 - ....
...
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Merging all chunks together: 0it [00:00, ?it/s]
Merging all chunks together: 1it [00:00, 32768.00it/s]

 10%|█         | 1/10 [00:06<00:58,  6.53s/it]The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
...
Merging all chunks together: 0it [00:00, ?it/s]
Merging all chunks together: 1it [00:00, 38130.04it/s]

100%|██████████| 10/10 [00:38<00:00,  3.29s/it]
100%|██████████| 10/10 [00:38<00:00,  3.89s/it]

