
Issues with large workloads (e.g., tf-matmul) showing high CPU RAM usage and client instability on both RTX 2080 (12GB) and RTX 4090 (24GB) using nvshare #25

@Rory109

Description


Hi, I've been testing the nvshare mechanism and have encountered some concerning issues when running larger TensorFlow and PyTorch workloads.

Initially, on a system with an NVIDIA GeForce RTX 2080 (12GB VRAM) and ~120GB CPU RAM (deployed on Kubernetes):

- While smaller workloads like tf-matmul-small and pytorch-add-small run successfully, the larger pytorch-add fails to run.
- The larger tf-matmul exhibits what appears to be significant memory leakage and, critically, leads to instability.
- A key observation from the scheduler logs during the tf-matmul run on the RTX 2080 is that the client is constantly being removed and then re-registering.
To further investigate, I tested the tf-matmul (larger version) workload on a more powerful system with an NVIDIA GeForce RTX 4090 (24GB VRAM) and ample CPU RAM. Interestingly, I encountered a similar problem: the CPU RAM was exhausted by the tf-matmul process. Furthermore, monitoring the RTX 4090 showed that only about 1GB of its 24GB VRAM was being utilized during the run, while system memory filled up.
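For anyone trying to reproduce this, a minimal sketch of the kind of check I used may help: run a repeated large matmul (similar in spirit to the tf-matmul example, here done with NumPy so it is self-contained) while sampling the process's peak resident set size, to see whether host RAM keeps growing. The function names here are my own, not part of nvshare.

```python
# Hypothetical reproduction sketch (not nvshare code): repeated large
# matmuls while sampling peak RSS, to spot unbounded host-RAM growth.
import resource
import numpy as np

def peak_rss_kib():
    # On Linux, ru_maxrss is the peak resident set size in KiB.
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

before = peak_rss_kib()
a = np.random.rand(1024, 1024).astype(np.float32)
b = np.random.rand(1024, 1024).astype(np.float32)
for _ in range(5):
    c = a @ b  # on a leaking setup, peak RSS climbs on every iteration
after = peak_rss_kib()
print(c.shape, after >= before)
```

In parallel I watched `nvidia-smi` for VRAM usage, which is how I saw only ~1GB of the 24GB being used while system memory filled up.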

Could you provide some insight into what might be causing this behavior? The fact that the RTX 4090 shows minimal VRAM usage while system RAM is exhausted (along with the client instability) is particularly puzzling.

Any guidance on these observations would be greatly appreciated.

Thanks!

Here are some screenshots of it running on the RTX 4090:
[two screenshots attached]
