GPUs are an expensive resource, and deep learning practitioners have to monitor the health and usage of their GPUs, such as the temperature, memory, utilization, and users. This can be done from the terminal with tools like `nvidia-smi` and `gpustat`. Often, however, it is not convenient to `ssh` into servers just to check the GPU status. `gpuview` is meant to mitigate this by running a lightweight web dashboard on top of [`gpustat`][repo_gpustat].

With `gpuview` one can monitor GPUs on the go, through a web browser. Moreover, **multiple GPU servers** can be registered into one `gpuview` dashboard, and all stats are aggregated and accessible from one place.

The dashboard features **live auto-refresh** (every 3 seconds) and includes interactive tooltips, temperature-based color coding, and pause/resume controls for real-time GPU monitoring.

Dashboard view of nine GPUs across multiple servers:

## Setup

Python is required; `gpuview` has been tested with both Python 2.7 and 3.

> `gpuview` installs the latest version of `gpustat` from PyPI; therefore, its commands are also available from the terminal.

## Usage

`gpuview` can be used in two modes: as a temporary process or as a background service.

### Run gpuview

Once `gpuview` is installed, it can be started as follows:

```sh
gpuview run --safe-zone
```

This will start the dashboard at `http://0.0.0.0:9988`.

By default, `gpuview` runs at `0.0.0.0` and port `9988`, but these can be changed using `--host` and `--port`. The `--safe-zone` option reports all details, including usernames; it can be left off for security reasons.

For testing and development purposes, you can run `gpuview` with synthetic data:

```sh
gpuview run --demo
```

This displays fake GPU statistics and is useful when developing on systems without NVIDIA GPUs or when showcasing the dashboard.

## API Endpoints

`gpuview` provides REST API endpoints for programmatic access:

* `GET /api/gpustat/self` - Returns GPU statistics for the main host
* `GET /api/gpustat/all` - Returns aggregated GPU statistics for all registered hosts

**Legacy endpoints:**

* `GET /gpustat` - Returns GPU statistics for the local host (backward compatibility)

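
As an illustration, the `/api/gpustat/self` endpoint could be queried from Python roughly as follows. This is a sketch under assumptions that the text does not confirm: the response is JSON, and it follows `gpustat`'s JSON field names (`hostname`, `gpus`, `utilization.gpu`, `memory.used`, `memory.total`). The helper names `fetch_gpustat` and `summarize` are hypothetical, and the shape of the aggregated `/api/gpustat/all` response is not covered here.

```python
import json
from urllib.request import urlopen


def fetch_gpustat(base_url, path="/api/gpustat/self", timeout=5):
    """Fetch one gpuview endpoint and decode its body as JSON (assumed format)."""
    with urlopen(base_url.rstrip("/") + path, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def summarize(payload):
    """Return one text line per GPU.

    Assumes gpustat-style keys; missing keys are rendered as None
    rather than raising, since the schema is an assumption.
    """
    host = payload.get("hostname", "?")
    lines = []
    for gpu in payload.get("gpus", []):
        lines.append("{}[{}] {}: {}% util, {}/{} MB".format(
            host,
            gpu.get("index"),
            gpu.get("name"),
            gpu.get("utilization.gpu"),
            gpu.get("memory.used"),
            gpu.get("memory.total"),
        ))
    return lines
```

With a dashboard running locally on the default port, `summarize(fetch_gpustat("http://localhost:9988"))` would return one line per GPU if these assumptions hold.
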
### Run as a Service

To run `gpuview` permanently, it needs to be deployed as a background service. This requires `sudo` privileges. The following command needs to be executed only once:

```sh
gpuview service [--safe-zone] [--exclude-self]
```

If successful, the `gpuview` service starts immediately and will also autostart at boot time. It can be controlled using `supervisorctl start|stop|restart gpuview`.

### Runtime options

There are a few important options in `gpuview`; use `-h` to see them all.

```sh
gpuview -h
```

* `run` : Start `gpuview` dashboard server
  * `--host` : URL or IP address of host (default: 0.0.0.0)
  * `--port` : Port number to listen on (default: 9988)
  * `--safe-zone` : Safe to report all details, e.g. usernames
  * `--exclude-self` : Don't report to others but to self-dashboard
  * `--demo` : Run with fake data for testing purposes
  * `-d`, `--debug` : Run server in debug mode (for developers)
* `add` : Add a GPU host to the dashboard
  * `--url` : URL of host [IP:Port], e.g. X.X.X.X:9988
  * `--name` : Optional readable name for the host, e.g. Node101
* `remove` : Remove a registered host from dashboard
* `-v`, `--version` : Print versions of `gpuview` and `gpustat`
* `-h`, `--help` : Print help for command-line options

### Monitoring multiple hosts

To aggregate the stats of multiple machines, they can be registered to one dashboard using their address and the port number on which their `gpuview` service is running.

Register a host to monitor as follows:

```sh
gpuview add --url <ip:port> --name <name>
```

Remove a registered host as follows:

```sh
gpuview remove --url <ip:port> --name <name>
```

Display all registered hosts/nodes as follows:

```sh
gpuview hosts
```

The `gpuview` service needs to run on all hosts that will be monitored.

> Tip: `gpuview` can be set up on a non-GPU machine, such as a laptop, to monitor remote GPU servers.

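
Registering several servers one by one can be scripted. The following is a minimal Python sketch that only builds and prints the `gpuview add` commands (a dry run); the helper name `build_add_commands` and the host addresses are hypothetical, and only the `add`, `--url`, and `--name` options documented above are used.

```python
import shlex


def build_add_commands(hosts):
    """Build one `gpuview add` argument list per (url, name) pair."""
    commands = []
    for url, name in hosts:
        cmd = ["gpuview", "add", "--url", url]
        if name:  # --name is documented as optional
            cmd += ["--name", name]
        commands.append(cmd)
    return commands


# Hypothetical hosts; replace with real <ip:port> values.
HOSTS = [("10.0.0.11:9988", "Node101"), ("10.0.0.12:9988", None)]

for cmd in build_add_commands(HOSTS):
    # Dry run: print each command instead of executing it.
    print(shlex.join(cmd))
```

To actually register the hosts, each argument list could be passed to `subprocess.run(cmd, check=True)` on the machine that runs the dashboard.
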
## etc

Helpful tips related to the underlying performance are available at the [`gpustat`][repo_gpustat] repo.

For the sake of simplicity, `gpuview` does not have user authentication in place. As a security measure, it does not report sensitive details such as usernames by default. This can be changed if the service is running in a trusted network, using the `--safe-zone` option to report all details.

The `--exclude-self` option of the `run` command can be used to prevent other dashboards from getting stats of the current machine. This way the stats are shown only on the host's own dashboard.

Detailed view of GPUs across multiple servers.

## License

`gpuview` is licensed under the [MIT License](LICENSE), which is a permissive open-source license that allows you to freely use, modify, and distribute this software.