A tool to collect GPU metrics from DCGM Exporter instances and forward them to cast.ai.
The exporter can run as a sidecar to the DCGM DaemonSet, or as a single instance service in the cluster.
When it runs as a sidecar, DCGM_HOST should be set. In this case it will only scrape metrics from that particular
instance of DCGM and send them to cast.ai.
If it is deployed as a single instance in the cluster, it will automatically discover the DCGM instances and scrape
the metrics from them. If the DCGM instances have some custom labels, make sure to properly set the DCGM_LABELS
environment variable.
It is also possible to deploy the DCGM exporter but have it configured to read the metrics from an existing nv-hostengine.
Make sure that these fields are exposed by DCGM exporter as metrics:
DCGM_FI_PROF_SM_ACTIVE
DCGM_FI_PROF_SM_OCCUPANCY
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE
DCGM_FI_PROF_DRAM_ACTIVE
DCGM_FI_PROF_PCIE_TX_BYTES
DCGM_FI_PROF_PCIE_RX_BYTES
DCGM_FI_PROF_GR_ENGINE_ACTIVE
DCGM_FI_DEV_FB_TOTAL
DCGM_FI_DEV_FB_FREE
DCGM_FI_DEV_FB_USED
DCGM_FI_DEV_PCIE_LINK_GEN
DCGM_FI_DEV_PCIE_LINK_WIDTH
DCGM_FI_DEV_GPU_TEMP
DCGM_FI_DEV_MEMORY_TEMP
DCGM_FI_DEV_POWER_USAGE
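One way to verify the list above is to save the exporter's Prometheus output to a file and look for each field name. The following is a minimal sketch, not part of this project: the function name missing_dcgm_fields is hypothetical, and it assumes the metrics were saved in the standard Prometheus text format, where every sample line starts with the metric name.

```shell
#!/bin/sh
# Print every required DCGM field that has no sample line in a saved
# Prometheus-format metrics file. Prints nothing when all fields are present.
# Usage: missing_dcgm_fields <metrics-file>   (hypothetical helper name)
missing_dcgm_fields() {
  metrics_file=$1
  for field in \
    DCGM_FI_PROF_SM_ACTIVE \
    DCGM_FI_PROF_SM_OCCUPANCY \
    DCGM_FI_PROF_PIPE_TENSOR_ACTIVE \
    DCGM_FI_PROF_DRAM_ACTIVE \
    DCGM_FI_PROF_PCIE_TX_BYTES \
    DCGM_FI_PROF_PCIE_RX_BYTES \
    DCGM_FI_PROF_GR_ENGINE_ACTIVE \
    DCGM_FI_DEV_FB_TOTAL \
    DCGM_FI_DEV_FB_FREE \
    DCGM_FI_DEV_FB_USED \
    DCGM_FI_DEV_PCIE_LINK_GEN \
    DCGM_FI_DEV_PCIE_LINK_WIDTH \
    DCGM_FI_DEV_GPU_TEMP \
    DCGM_FI_DEV_MEMORY_TEMP \
    DCGM_FI_DEV_POWER_USAGE
  do
    # A sample line begins with the metric name followed by '{' (labels)
    # or a space; this avoids matching longer names that share a prefix.
    if ! grep -Eq "^${field}[{ ]" "$metrics_file"; then
      echo "$field"
    fi
  done
}
```

To use it, fetch the metrics from a DCGM exporter instance into a file and pass that file to the function. The address below is a placeholder, and port 9400 is assumed to be the exporter's listen port, so adjust both to your setup:
$ curl -s http://<dcgm-exporter-address>:9400/metrics > dcgm-metrics.txt
$ missing_dcgm_fields dcgm-metrics.txt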
You can clone this repository and install the chart with the following commands:
$ cd charts/gpu-metrics-exporter
$ helm install <deployment-name> -f values.yaml .
Where:
<deployment-name> is a name of your choice
You can add the cast.ai repository and install the chart with the following commands:
$ helm repo add castai https://castai.github.io/charts
$ helm repo update
$ helm pull castai/gpu-metrics-exporter --untar
$ cd gpu-metrics-exporter
$ helm install --generate-name castai/gpu-metrics-exporter -f values.yaml
By default, it will be deployed as a sidecar to the DCGM exporter. If you don't want to deploy it as a sidecar, in the values.yaml file you can:
- Set dcgmExporter.enabled to false
- Set the DCGM_HOST and DCGM_LABELS environment variables in gpuMetricsExporter.config of the values.yaml file
  - DCGM_HOST is the address of the DCGM exporter instance
  - DCGM_LABELS is a comma-separated list of labels that the DCGM instances have
- If you want to deploy the DCGM exporter but have it configured to read the metrics from an existing nv-hostengine,
  you can:
  - set dcgmExporter.useExternalHostEngine to true in the values.yaml file; it will then try to connect to port 5555 of the node.
  - set the