This repository demonstrates how to set up a Computer Vision application on Kubernetes, making use of NVIDIA GPU acceleration for Machine Learning workloads. It integrates with the robust and scalable surveillance solution at kerberos.io.
- Overview
- Prerequisites
- Installation
- Project Structure
- Pipelines
- Kubeflow Components
- Usage Examples
- Monitoring with Prometheus and Grafana
- Troubleshooting
- Contributing
- License
This project provides a complete end-to-end solution for computer vision workloads on Kubernetes with NVIDIA GPU support. It includes:
- GPU-enabled Kubernetes configurations
- Kubeflow pipelines for ML workflows
- Pre-built components for common computer vision tasks
- Integration with Kerberos.io surveillance platform
- Monitoring and observability with Prometheus and Grafana
The project leverages Kubeflow for MLOps on Kubernetes, helping Machine Learning Engineers track training sessions and experiments with different parameters and inputs.
Before running the examples, you need:
- A running Kubernetes cluster
- NVIDIA drivers installed on all nodes
- NVIDIA Container Toolkit properly installed
- A running instance of Kerberos Vault
- kubectl configured to communicate with your cluster
- (Optional) Kubeflow installed for pipeline execution
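Once these prerequisites are in place, a quick way to confirm that the cluster actually advertises GPU resources (assuming the NVIDIA device plugin or GPU Operator is deployed) is to check the nodes:

```bash
# GPU nodes should list nvidia.com/gpu under Capacity and Allocatable
kubectl describe nodes | grep -i "nvidia.com/gpu"
```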
Follow the official NVIDIA Container Toolkit installation guide.
For the containerd runtime (for example on k3s), you may need to update your containerd configuration template as shown below:

```toml
# /var/lib/rancher/k3s/agent/etc/containerd/config.toml.tmpl
[plugins.cri.containerd.runtimes.runc]
runtime_type = "io.containerd.runtime.v1.linux"

[plugins.linux]
runtime = "nvidia-container-runtime"
```
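After editing the template, the k3s service usually needs to be restarted so containerd picks up the new configuration. The exact service name depends on whether the node runs the k3s server or agent:

```bash
# Restart k3s so containerd reloads the template
sudo systemctl restart k3s        # on a server node
# or
sudo systemctl restart k3s-agent  # on an agent node
```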
Use the GPU smoke pipeline to verify that your cluster can access GPUs:

```bash
cd gpu_smoke_pipeline
kubectl apply -f gpu_smoke_pod.yaml
```

Check the logs to confirm GPU access:

```bash
kubectl logs <pod-name>
```

If you want to run the Kubeflow pipelines, follow the official Kubeflow installation guide.
```
nvidia-gpu-kubernetes/
├── converting_pipeline/         # Pipeline for converting models between frameworks
├── gpu_smoke_pipeline/          # Simple pipeline to test GPU access
├── inference_pipeline/          # Pipeline for running inference on videos
├── training_pipeline/           # Pipeline for training models
├── visual_inference_pipeline/   # Pipeline for visual inference tasks
└── kubeflow_components/         # Reusable Kubeflow components
    ├── convert-darknet-2-tf-op/                # Convert Darknet models to TensorFlow
    ├── convert-pascal-2-yolo-op/               # Convert Pascal VOC to YOLO format
    ├── detect-video/                           # Video detection component
    ├── download-and-extract-google-drive-op/   # Download from Google Drive
    ├── merge-files-op/                         # Merge multiple files
    ├── predict-metadata-op/                    # Predict metadata component
    └── predict-video-op/                       # Video prediction component
```
A simple pipeline that verifies GPU access in your Kubernetes cluster. It runs the nvidia-smi command inside a TensorFlow container and outputs the results to the logs.
Location: gpu_smoke_pipeline/
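In essence, a GPU smoke test schedules a pod that requests a GPU and runs nvidia-smi. The sketch below is an illustration only; the actual gpu_smoke_pod.yaml in this repository may differ in naming and image:

```bash
# Minimal GPU test pod (illustrative; assumes the device plugin exposes nvidia.com/gpu)
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: gpu-smoke
    image: tensorflow/tensorflow:latest-gpu
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
```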
Converts models from Darknet framework to TensorFlow format, enabling them to be used with TensorFlow-based inference pipelines.
Location: converting_pipeline/
A complete training pipeline that includes:
- Downloading datasets from Google Drive
- Converting data formats (Pascal VOC to YOLO)
- Training models
- Merging output files
Location: training_pipeline/
Runs object detection on videos using trained models, with integration to Kerberos.io for video sources and Prometheus for metrics collection.
Location: inference_pipeline/
A specialized pipeline for visual inference tasks with enhanced visualization capabilities.
Location: visual_inference_pipeline/
The project includes several reusable Kubeflow components:
Transforms a model from Darknet framework format to TensorFlow format.
Location: kubeflow_components/convert-darknet-2-tf-op/
Converts Pascal VOC format annotations to YOLO format.
Location: kubeflow_components/convert-pascal-2-yolo-op/
Subscribes to a Kafka broker for video IDs from Kerberos Vault, downloads videos, and runs object detection tasks.
Location: kubeflow_components/detect-video/
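Because this component consumes video IDs from a Kafka topic, it can help to inspect that topic when debugging. A tool such as kafkacat works for this; the broker address and topic name below are placeholders and should be taken from your Kerberos Vault / Kafka configuration:

```bash
# Read the first five messages from the (placeholder) topic of video IDs, then exit
kafkacat -b <broker-host>:9092 -t <video-topic> -C -o beginning -c 5
```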
Downloads and extracts files from Google Drive.
Location: kubeflow_components/download-and-extract-google-drive-op/
Merges multiple files into a single output.
Location: kubeflow_components/merge-files-op/
Predicts metadata for input data.
Location: kubeflow_components/predict-metadata-op/
Runs prediction on video files using TensorFlow models.
Location: kubeflow_components/predict-video-op/
To run the GPU smoke test:

```bash
cd gpu_smoke_pipeline
kubectl apply -f gpu_smoke_pod.yaml
kubectl logs -f <pod-name>
```

To run a Kubeflow pipeline:

- First, ensure Kubeflow is installed in your cluster
- Navigate to the desired pipeline directory
- Compile and upload the pipeline:
```bash
cd inference_pipeline
python make_pipeline.py
```

- Access the Kubeflow UI and run the uploaded pipeline
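If the Kubeflow UI is not already exposed, a port-forward to the pipelines UI service can be used. The service name and namespace below assume a standalone Kubeflow Pipelines installation and may differ in a full Kubeflow deployment:

```bash
# Forward the Kubeflow Pipelines UI to http://localhost:8080
kubectl port-forward -n kubeflow svc/ml-pipeline-ui 8080:80
```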
Each component includes a build_image.sh script to build the Docker image:

```bash
cd kubeflow_components/detect-video
./build_image.sh
```
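The exact contents of build_image.sh differ per component, but the scripts follow the usual build-and-push pattern. The registry and image name below are placeholders, not the repository's actual values:

```bash
#!/bin/bash
# Illustrative sketch of a component build script (not the repository's exact script)
docker build -t <registry>/<component-name>:latest .
docker push <registry>/<component-name>:latest
```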
The inference pipeline includes integration with Prometheus for collecting metrics and Grafana for visualization.

Install Prometheus using Helm:

```bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install [RELEASE_NAME] prometheus-community/kube-prometheus-stack
```

Expose Grafana to your local machine (the service name below assumes the release name prometheus-stack):

```bash
kubectl port-forward service/prometheus-stack-grafana 3000:80
```

Access the dashboard at http://localhost:3000.
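The kube-prometheus-stack chart stores the generated Grafana admin password in a Kubernetes secret named after the release (here assumed to be prometheus-stack, matching the service above):

```bash
# Print the Grafana admin password
kubectl get secret prometheus-stack-grafana \
  -o jsonpath="{.data.admin-password}" | base64 --decode; echo
```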
- Verify NVIDIA drivers are installed on all nodes
- Check that the NVIDIA Container Toolkit is properly configured
- Run the GPU smoke pipeline to test GPU access
- Check pod logs for error messages
- Verify all required resources are available
- Ensure proper permissions for service accounts
- Monitor GPU utilization using nvidia-smi (see the commands after this list)
- Check resource requests and limits in pod specifications
- Consider optimizing batch sizes and model configurations
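As a starting point, GPU utilization inside a running workload and the GPUs advertised by a node can be checked as follows (pod and node names are placeholders):

```bash
# Run nvidia-smi inside a GPU-enabled pod to see live utilization
kubectl exec -it <pod-name> -- nvidia-smi

# Check how many GPUs a node advertises and how many are allocated
kubectl describe node <node-name> | grep -i "nvidia.com/gpu"
```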
We welcome contributions! Please follow these steps:
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
- Follow Python PEP 8 style guidelines
- Include documentation for new components
- Update README.md for new features
This project is licensed under the MIT License - see the LICENSE file for details.
Copyright (c) 2021 Kerberos.io