Deadlock from re-entrant cuInit #27

@isawang92

Description

Hello, I am a student at National Taiwan University of Science and Technology. I would like to base some research on your program, but after carefully reading the nvshare code and deploying it on Kubernetes, I ran into the following deadlock/loop:

When TensorFlow in the workload calls cuInit for the first time, the call also triggers initialize_client. initialize_client starts the client_fn thread, and before initialize_client completes, client_fn calls real_cuInit.

Inside real_cuInit, dlsym("cuInit") is invoked, which is intercepted by your interposer and returns the wrapper. This re-enters the wrapper cuInit, which again calls pthread_once(&init_done, initialize_client). Since initialize_client has not finished, pthread_once blocks, and cuInit gets stuck.

Moreover, I think that even without the client_fn issue, based on the program logic and my actual runs, real_cuInit and cuInit could still end up calling each other repeatedly: real_cuInit looks up "cuInit", gets the wrapper back, and the wrapper calls real_cuInit again, so the call never returns.
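To make the pattern concrete, here is a minimal sketch of one way the re-entry might be avoided. It is a simulation, not nvshare's actual code: `driver_cuInit` and `hooked_lookup` are hypothetical stand-ins for the real driver entry point and the interposed dlsym, the `internal_lookup` flag is my own addition, and the `pthread_join` inside `initialize_client` is only an assumption that models initialization waiting on the client thread.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <string.h>

typedef int (*cu_init_fn)(unsigned int);

static int driver_calls = 0;   /* times the stand-in driver function ran */
static int wrapper_calls = 0;  /* times the wrapper was entered */

/* Hypothetical stand-in for the driver's own cuInit. */
static int driver_cuInit(unsigned int flags) {
    (void)flags;
    driver_calls++;
    return 0;
}

static int wrapper_cuInit(unsigned int flags);

/* Set only while this thread performs an internal symbol lookup. */
static _Thread_local int internal_lookup = 0;

/* Hypothetical stand-in for the interposed dlsym: external callers get
 * the wrapper, internal lookups get the driver function. */
static cu_init_fn hooked_lookup(const char *name) {
    if (strcmp(name, "cuInit") != 0)
        return NULL;
    return internal_lookup ? driver_cuInit : wrapper_cuInit;
}

/* Resolves and calls the driver function without re-entering the wrapper. */
static int real_cuInit(unsigned int flags) {
    internal_lookup = 1;
    cu_init_fn fn = hooked_lookup("cuInit");
    internal_lookup = 0;
    return fn(flags);
}

static pthread_once_t init_done = PTHREAD_ONCE_INIT;

static void *client_fn(void *arg) {
    (void)arg;
    real_cuInit(0);  /* would block in pthread_once if it got the wrapper */
    return NULL;
}

static void initialize_client(void) {
    pthread_t client_thread;
    int rc = pthread_create(&client_thread, NULL, client_fn, NULL);
    assert(rc == 0);
    /* Assumption: initialization waits on the client thread. */
    pthread_join(client_thread, NULL);
}

static int wrapper_cuInit(unsigned int flags) {
    wrapper_calls++;
    pthread_once(&init_done, initialize_client);
    return real_cuInit(flags);
}
```

In this model, a single call to `wrapper_cuInit(0)` enters the wrapper once and reaches the stand-in driver function twice (once from client_fn, once from the wrapper itself). Without the `internal_lookup` flag, client_fn would receive the wrapper and block in pthread_once, which is the hang I am seeing. The flag is thread-local, so it also covers the case where the lookup happens on the same thread that called cuInit.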

I would greatly appreciate your advice and help. Thank you very much.
