This repository demonstrates how to set up a Computer Vision application on Kubernetes, making use of NVIDIA GPU acceleration for Machine Learning workloads. It integrates with the robust and scalable surveillance solution at kerberos.io.
- Overview
- Prerequisites
- Installation
- Project Structure
- Pipelines
- Kubeflow Components
- Usage Examples
- Monitoring with Prometheus and Grafana
- Troubleshooting
- Contributing
- License
This project provides a complete end-to-end solution for computer vision workloads on Kubernetes with NVIDIA GPU support. It includes:
- GPU-enabled Kubernetes configurations
- Kubeflow pipelines for ML workflows
- Pre-built components for common computer vision tasks
- Integration with Kerberos.io surveillance platform
- Monitoring and observability with Prometheus and Grafana
The project leverages Kubeflow for MLOps on Kubernetes, helping Machine Learning Engineers track training sessions and experiments with different parameters and inputs.
Before running the examples, you need:
- A running Kubernetes cluster
- NVIDIA drivers installed on all nodes
- NVIDIA Container Toolkit properly installed
- A running instance of Kerberos Vault
- kubectl configured to communicate with your cluster
- (Optional) Kubeflow installed for pipeline execution
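Once these prerequisites are in place, a quick way to confirm that the cluster actually advertises GPU resources (assuming the NVIDIA device plugin or GPU Operator is deployed) is to check the nodes:

```bash
# GPU nodes should list nvidia.com/gpu under Capacity and Allocatable
kubectl describe nodes | grep -i "nvidia.com/gpu"
```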
Follow the official NVIDIA Container Toolkit installation guide.
For the containerd runtime (for example on k3s), you may need to update your containerd configuration template as shown below:

```toml
# /var/lib/rancher/k3s/agent/etc/containerd/config.toml.tmpl
[plugins.cri.containerd.runtimes.runc]
runtime_type = "io.containerd.runtime.v1.linux"

[plugins.linux]
runtime = "nvidia-container-runtime"
```
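After editing the template, the k3s service usually needs to be restarted so containerd picks up the new configuration. The exact service name depends on whether the node runs the k3s server or agent:

```bash
# Restart k3s so containerd reloads the template
sudo systemctl restart k3s        # on a server node
# or
sudo systemctl restart k3s-agent  # on an agent node
```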
Use the GPU smoke pipeline to verify that your cluster can access GPUs:

```bash
cd gpu_smoke_pipeline
kubectl apply -f gpu_smoke_pod.yaml
```

Check the logs to confirm GPU access:

```bash
kubectl logs <pod-name>
```

If you want to run the Kubeflow pipelines, follow the official Kubeflow installation guide.
```
nvidia-gpu-kubernetes/
├── converting_pipeline/         # Pipeline for converting models between frameworks
├── gpu_smoke_pipeline/          # Simple pipeline to test GPU access
├── inference_pipeline/          # Pipeline for running inference on videos
├── training_pipeline/           # Pipeline for training models
├── visual_inference_pipeline/   # Pipeline for visual inference tasks
└── kubeflow_components/         # Reusable Kubeflow components
    ├── convert-darknet-2-tf-op/                # Convert Darknet models to TensorFlow
    ├── convert-pascal-2-yolo-op/               # Convert Pascal VOC to YOLO format
    ├── detect-video/                           # Video detection component
    ├── download-and-extract-google-drive-op/   # Download from Google Drive
    ├── merge-files-op/                         # Merge multiple files
    ├── predict-metadata-op/                    # Predict metadata component
    └── predict-video-op/                       # Video prediction component
```
A simple pipeline that verifies GPU access in your Kubernetes cluster. It runs the nvidia-smi command inside a TensorFlow container and outputs the results to the logs.
Location: gpu_smoke_pipeline/
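In essence, a GPU smoke test schedules a pod that requests a GPU and runs nvidia-smi. The sketch below is an illustration only; the actual gpu_smoke_pod.yaml in this repository may differ in naming and image:

```bash
# Minimal GPU test pod (illustrative; assumes the device plugin exposes nvidia.com/gpu)
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: gpu-smoke
    image: tensorflow/tensorflow:latest-gpu
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
```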
Converts models from Darknet framework to TensorFlow format, enabling them to be used with TensorFlow-based inference pipelines.
Location: converting_pipeline/
A complete training pipeline that includes:
- Downloading datasets from Google Drive
- Converting data formats (Pascal VOC to YOLO)
- Training models
- Merging output files
Location: training_pipeline/
Runs object detection on videos using trained models, with integration to Kerberos.io for video sources and Prometheus for metrics collection.
Location: inference_pipeline/
A specialized pipeline for visual inference tasks with enhanced visualization capabilities.
Location: visual_inference_pipeline/
The project includes several reusable Kubeflow components:
Transforms a model from Darknet framework format to TensorFlow format.
Location: kubeflow_components/convert-darknet-2-tf-op/
Converts Pascal VOC format annotations to YOLO format.
Location: kubeflow_components/convert-pascal-2-yolo-op/
Subscribes to a Kafka broker for video IDs from Kerberos Vault, downloads videos, and runs object detection tasks.
Location: kubeflow_components/detect-video/
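Because this component consumes video IDs from a Kafka topic, it can help to inspect that topic when debugging. A tool such as kafkacat works for this; the broker address and topic name below are placeholders and should be taken from your Kerberos Vault / Kafka configuration:

```bash
# Read the first five messages from the (placeholder) topic of video IDs, then exit
kafkacat -b <broker-host>:9092 -t <video-topic> -C -o beginning -c 5
```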
Downloads and extracts files from Google Drive.
Location: kubeflow_components/download-and-extract-google-drive-op/
Merges multiple files into a single output.
Location: kubeflow_components/merge-files-op/
Predicts metadata for input data.
Location: kubeflow_components/predict-metadata-op/
Runs prediction on video files using TensorFlow models.
Location: kubeflow_components/predict-video-op/
To run the GPU smoke test:

```bash
cd gpu_smoke_pipeline
kubectl apply -f gpu_smoke_pod.yaml
kubectl logs -f <pod-name>
```

To run a Kubeflow pipeline:

- First, ensure Kubeflow is installed in your cluster
- Navigate to the desired pipeline directory
- Compile and upload the pipeline:
```bash
cd inference_pipeline
python make_pipeline.py
```

- Access the Kubeflow UI and run the uploaded pipeline
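If the Kubeflow UI is not already exposed, a port-forward to the pipelines UI service can be used. The service name and namespace below assume a standalone Kubeflow Pipelines installation and may differ in a full Kubeflow deployment:

```bash
# Forward the Kubeflow Pipelines UI to http://localhost:8080
kubectl port-forward -n kubeflow svc/ml-pipeline-ui 8080:80
```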
Each component includes a build_image.sh script to build the Docker image:

```bash
cd kubeflow_components/detect-video
./build_image.sh
```
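The exact contents of build_image.sh differ per component, but the scripts follow the usual build-and-push pattern. The registry and image name below are placeholders, not the repository's actual values:

```bash
#!/bin/bash
# Illustrative sketch of a component build script (not the repository's exact script)
docker build -t <registry>/<component-name>:latest .
docker push <registry>/<component-name>:latest
```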
The inference pipeline includes integration with Prometheus for collecting metrics and Grafana for visualization.

Install Prometheus using Helm:

```bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install [RELEASE_NAME] prometheus-community/kube-prometheus-stack
```

Expose Grafana to your local machine (the service name below assumes the release name prometheus-stack):

```bash
kubectl port-forward service/prometheus-stack-grafana 3000:80
```

Access the dashboard at http://localhost:3000.
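The kube-prometheus-stack chart stores the generated Grafana admin password in a Kubernetes secret named after the release (here assumed to be prometheus-stack, matching the service above):

```bash
# Print the Grafana admin password
kubectl get secret prometheus-stack-grafana \
  -o jsonpath="{.data.admin-password}" | base64 --decode; echo
```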
- Verify NVIDIA drivers are installed on all nodes
- Check that the NVIDIA Container Toolkit is properly configured
- Run the GPU smoke pipeline to test GPU access
- Check pod logs for error messages
- Verify all required resources are available
- Ensure proper permissions for service accounts
- Monitor GPU utilization using nvidia-smi (see the commands after this list)
- Check resource requests and limits in pod specifications
- Consider optimizing batch sizes and model configurations
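As a starting point, GPU utilization inside a running workload and the GPUs advertised by a node can be checked as follows (pod and node names are placeholders):

```bash
# Run nvidia-smi inside a GPU-enabled pod to see live utilization
kubectl exec -it <pod-name> -- nvidia-smi

# Check how many GPUs a node advertises and how many are allocated
kubectl describe node <node-name> | grep -i "nvidia.com/gpu"
```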
We welcome contributions! Please follow these steps:
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
- Follow Python PEP 8 style guidelines
- Include documentation for new components
- Update README.md for new features
This project is licensed under the MIT License - see the LICENSE file for details.
Copyright (c) 2021 Kerberos.io