
Issues with large workloads (e.g., tf-matmul) showing high CPU RAM usage and client instability on both RTX 2080 (12GB) and RTX 4090 (24GB) using nvshare #25

@Rory109

Description


Hi, I've been testing the nvshare mechanism and have encountered some concerning issues when running larger TensorFlow and PyTorch workloads.

Initially, on a system with an NVIDIA GeForce RTX 2080 (12GB VRAM) and ~120GB CPU RAM (deployed on Kubernetes):

- While smaller workloads like tf-matmul-small and pytorch-add-small run successfully, the larger pytorch-add fails to run.
- The larger tf-matmul exhibits what appears to be significant memory leakage and, critically, leads to instability.
- A key observation from the scheduler logs during the tf-matmul run on the RTX 2080 is that the client is constantly being removed and then re-registering.
To further investigate, I tested the tf-matmul (larger version) workload on a more powerful system with an NVIDIA GeForce RTX 4090 (24GB VRAM) and ample CPU RAM. Interestingly, I encountered a similar problem: the CPU RAM was exhausted by the tf-matmul process. Furthermore, monitoring the RTX 4090 showed that only about 1GB of its 24GB VRAM was being utilized during the run, while system memory filled up.
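For anyone trying to reproduce this, a minimal sketch of the kind of check I used may help: run a repeated large matmul (similar in spirit to the tf-matmul example, here done with NumPy so it is self-contained) while sampling the process's peak resident set size, to see whether host RAM keeps growing. The function names here are my own, not part of nvshare.

```python
# Hypothetical reproduction sketch (not nvshare code): repeated large
# matmuls while sampling peak RSS, to spot unbounded host-RAM growth.
import resource
import numpy as np

def peak_rss_kib():
    # On Linux, ru_maxrss is the peak resident set size in KiB.
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

before = peak_rss_kib()
a = np.random.rand(1024, 1024).astype(np.float32)
b = np.random.rand(1024, 1024).astype(np.float32)
for _ in range(5):
    c = a @ b  # on a leaking setup, peak RSS climbs on every iteration
after = peak_rss_kib()
print(c.shape, after >= before)
```

In parallel I watched `nvidia-smi` for VRAM usage, which is how I saw only ~1GB of the 24GB being used while system memory filled up.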

Could you provide some insight into what might be causing this behavior? The fact that the RTX 4090 shows minimal VRAM usage while system RAM is exhausted (along with the client instability) is particularly puzzling.

Any guidance on these observations would be greatly appreciated.

Thanks!

Here are some screenshots of it running on the RTX 4090:
[two screenshots attached]
