NVIDIA GPU Kubernetes for Computer Vision

License: MIT

This repository demonstrates how to set up a Computer Vision application on Kubernetes, making use of NVIDIA GPU acceleration for Machine Learning workloads. It integrates with the robust and scalable surveillance solution at kerberos.io.

Table of Contents

  • Overview
  • Prerequisites
  • Installation
  • Project Structure
  • Pipelines
  • Kubeflow Components
  • Usage Examples
  • Monitoring with Prometheus and Grafana
  • Troubleshooting
  • Contributing
  • License

Overview

This project provides a complete end-to-end solution for computer vision workloads on Kubernetes with NVIDIA GPU support. It includes:

  • GPU-enabled Kubernetes configurations
  • Kubeflow pipelines for ML workflows
  • Pre-built components for common computer vision tasks
  • Integration with Kerberos.io surveillance platform
  • Monitoring and observability with Prometheus and Grafana

The project leverages Kubeflow for MLOps on Kubernetes, helping Machine Learning Engineers track training sessions and experiments with different parameters and inputs.
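
For readers new to Kubeflow, here is a minimal sketch of how a pipeline is defined with the kfp v1 SDK. The image tag, step name, and parameter are illustrative placeholders, not components from this repository.

# Minimal kfp v1 pipeline sketch; all names and the image are placeholders.
import kfp
from kfp import dsl

@dsl.pipeline(
    name="example-pipeline",
    description="A single GPU-backed training step.",
)
def example_pipeline(epochs: int = 10):
    train = dsl.ContainerOp(
        name="train",
        image="tensorflow/tensorflow:2.4.1-gpu",  # hypothetical image
        command=["python", "train.py", "--epochs", str(epochs)],
    )
    train.set_gpu_limit(1)  # schedule this step on a node with one NVIDIA GPU

if __name__ == "__main__":
    kfp.compiler.Compiler().compile(example_pipeline, "example_pipeline.yaml")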

Prerequisites

Before running the examples, you need:

  1. A running Kubernetes cluster
  2. NVIDIA drivers installed on all nodes
  3. NVIDIA Container Toolkit properly installed
  4. A running instance of Kerberos Vault
  5. kubectl configured to communicate with your cluster
  6. (Optional) Kubeflow installed for pipeline execution

Installation

1. NVIDIA Container Toolkit Setup

Follow the official NVIDIA Container Toolkit installation guide.

For the containerd runtime (as used by k3s), you may need to update your configuration as shown below:

# /var/lib/rancher/k3s/agent/etc/containerd/config.toml.tmpl

[plugins.cri.containerd.runtimes.runc]
  # Use the legacy v1 Linux runtime shim so the runtime binary can be overridden.
  runtime_type = "io.containerd.runtime.v1.linux"

  [plugins.linux]
  # Point the v1 runtime plugin at the NVIDIA runtime wrapper.
  runtime = "nvidia-container-runtime"

2. Verify GPU Access

Use the GPU smoke pipeline to verify that your cluster can access GPUs:

cd gpu_smoke_pipeline
kubectl apply -f gpu_smoke_pod.yaml

Check the logs to confirm GPU access:

kubectl logs <pod-name>

3. Install Kubeflow (Optional)

If you want to run the Kubeflow pipelines, follow the official Kubeflow installation guide.

Project Structure

nvidia-gpu-kubernetes/
├── converting_pipeline/                       # Pipeline for converting models between frameworks
├── gpu_smoke_pipeline/                        # Simple pipeline to test GPU access
├── inference_pipeline/                        # Pipeline for running inference on videos
├── training_pipeline/                         # Pipeline for training models
├── visual_inference_pipeline/                 # Pipeline for visual inference tasks
└── kubeflow_components/                       # Reusable Kubeflow components
    ├── convert-darknet-2-tf-op/               # Convert Darknet models to TensorFlow
    ├── convert-pascal-2-yolo-op/              # Convert Pascal VOC to YOLO format
    ├── detect-video/                          # Video detection component
    ├── download-and-extract-google-drive-op/  # Download from Google Drive
    ├── merge-files-op/                        # Merge multiple files
    ├── predict-metadata-op/                   # Predict metadata component
    └── predict-video-op/                      # Video prediction component

Pipelines

GPU Smoke Pipeline

A simple pipeline that verifies GPU access in your Kubernetes cluster. It runs the nvidia-smi command inside a TensorFlow container and outputs the results to the logs.

Location: gpu_smoke_pipeline/
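
Besides inspecting the nvidia-smi output, a framework-level check can run inside the same TensorFlow container. The snippet below is a hedged example of such a check, not necessarily what gpu_smoke_pod.yaml executes:

# Hypothetical GPU smoke check: compare the driver-level view (nvidia-smi)
# with TensorFlow's own view of the available GPUs.
import subprocess

import tensorflow as tf

subprocess.run(["nvidia-smi"], check=True)

# An empty list here means TensorFlow cannot use the GPU even if the driver sees it.
print("Visible GPUs:", tf.config.list_physical_devices("GPU"))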

Converting Pipeline

Converts models from Darknet framework to TensorFlow format, enabling them to be used with TensorFlow-based inference pipelines.

Location: converting_pipeline/

Training Pipeline

A complete training pipeline that includes:

  • Downloading datasets from Google Drive
  • Converting data formats (Pascal VOC to YOLO)
  • Training models
  • Merging output files

Location: training_pipeline/

Inference Pipeline

Runs object detection on videos using trained models, with integration to Kerberos.io for video sources and Prometheus for metrics collection.

Location: inference_pipeline/

Visual Inference Pipeline

A specialized pipeline for visual inference tasks with enhanced visualization capabilities.

Location: visual_inference_pipeline/

Kubeflow Components

The project includes several reusable Kubeflow components:

convert-darknet-2-tf-op

Transforms a model from Darknet framework format to TensorFlow format.

Location: kubeflow_components/convert-darknet-2-tf-op/

convert-pascal-2-yolo-op

Converts Pascal VOC format annotations to YOLO format.

Location: kubeflow_components/convert-pascal-2-yolo-op/
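
The coordinate math behind this conversion is standard: Pascal VOC stores absolute pixel corners, while YOLO stores the box center and size normalized by the image dimensions. A minimal sketch of that arithmetic (the component's actual parsing code may differ):

# Standard Pascal VOC -> YOLO box conversion.
# VOC: absolute corners (xmin, ymin, xmax, ymax).
# YOLO: normalized (x_center, y_center, width, height).
def voc_to_yolo(xmin, ymin, xmax, ymax, img_w, img_h):
    x_center = (xmin + xmax) / 2.0 / img_w
    y_center = (ymin + ymax) / 2.0 / img_h
    width = (xmax - xmin) / img_w
    height = (ymax - ymin) / img_h
    return x_center, y_center, width, height

# A 100x200-pixel box in the top-left corner of a 640x480 image:
print(voc_to_yolo(0, 0, 100, 200, 640, 480))  # (0.078125, 0.2083..., 0.15625, 0.4166...)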

detect-video

Subscribes to a Kafka broker for video IDs from Kerberos Vault, downloads videos, and runs object detection tasks.

Location: kubeflow_components/detect-video/
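
The sketch below illustrates that flow with the kafka-python and requests libraries. The topic name, broker address, payload field, and Vault download URL are assumptions for illustration, not the component's actual configuration:

# Hedged sketch of the detect-video flow: consume video IDs from Kafka,
# fetch each recording, then hand it to the detector.
import json

import requests
from kafka import KafkaConsumer  # kafka-python

consumer = KafkaConsumer(
    "vault-events",                  # hypothetical topic name
    bootstrap_servers="kafka:9092",  # hypothetical broker address
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
)

for message in consumer:
    video_id = message.value["key"]  # assumed payload field
    resp = requests.get(f"http://vault/api/videos/{video_id}")  # illustrative URL
    with open(f"/tmp/{video_id}.mp4", "wb") as f:
        f.write(resp.content)
    # ... run object detection on the downloaded file ...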

download-and-extract-google-drive-op

Downloads and extracts files from Google Drive.

Location: kubeflow_components/download-and-extract-google-drive-op/
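
As a rough illustration, such a step can be built with the gdown package and Python's tarfile module; the URL, file ID, and archive format below are placeholders, not necessarily what this component uses:

# Hedged sketch: fetch an archive from Google Drive and unpack it.
import tarfile

import gdown

url = "https://drive.google.com/uc?id=FILE_ID"  # placeholder file ID
gdown.download(url, "dataset.tar.gz", quiet=False)

with tarfile.open("dataset.tar.gz") as tar:
    tar.extractall("dataset/")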

merge-files-op

Merges multiple files into a single output.

Location: kubeflow_components/merge-files-op/

predict-metadata-op

Predicts metadata for input data.

Location: kubeflow_components/predict-metadata-op/

predict-video-op

Runs prediction on video files using TensorFlow models.

Location: kubeflow_components/predict-video-op/
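
A rough sketch of frame-by-frame prediction with OpenCV and a Keras model follows; the model path, input size, and loading API are assumptions, and the component may load and decode its model differently:

# Hedged sketch: run a TensorFlow model over every frame of a video.
import cv2
import numpy as np
import tensorflow as tf

model = tf.keras.models.load_model("model/")  # hypothetical model path
cap = cv2.VideoCapture("input.mp4")           # hypothetical input video

while True:
    ok, frame = cap.read()
    if not ok:
        break
    # Resize and normalize; 416x416 is a typical YOLO input size.
    blob = cv2.resize(frame, (416, 416)).astype(np.float32) / 255.0
    preds = model.predict(blob[np.newaxis, ...])
    # ... decode detections from `preds` ...

cap.release()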

Usage Examples

Running the GPU Smoke Test

cd gpu_smoke_pipeline
kubectl apply -f gpu_smoke_pod.yaml
kubectl logs -f <pod-name>

Running a Kubeflow Pipeline

  1. Ensure Kubeflow is installed in your cluster
  2. Navigate to the desired pipeline directory
  3. Compile and upload the pipeline (a sketch of a typical make_pipeline.py follows this list):

cd inference_pipeline
python make_pipeline.py

  4. Access the Kubeflow UI and run the uploaded pipeline
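
The exact contents of each make_pipeline.py vary; the sketch below shows a typical kfp v1 compile-and-upload pattern, with a hypothetical endpoint and pipeline function:

# Hedged sketch of a typical make_pipeline.py. The imported pipeline
# function and the Kubeflow host URL are placeholders.
import kfp

from pipeline import inference_pipeline  # hypothetical module and function

kfp.compiler.Compiler().compile(inference_pipeline, "inference_pipeline.yaml")

client = kfp.Client(host="http://localhost:8080")  # hypothetical KFP endpoint
client.upload_pipeline("inference_pipeline.yaml", pipeline_name="inference-pipeline")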

Building Custom Components

Each component includes a build_image.sh script to build the Docker image:

cd kubeflow_components/detect-video
./build_image.sh

Monitoring with Prometheus and Grafana

The inference pipeline includes integration with Prometheus for collecting metrics and Grafana for visualization.
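
On the application side, metrics are typically exposed with the Python prometheus_client library; the metric name and port below are illustrative, not necessarily what the inference pipeline uses:

# Hedged sketch: expose a detection counter for Prometheus to scrape.
from prometheus_client import Counter, start_http_server

DETECTIONS = Counter("detections_total", "Number of objects detected")

start_http_server(8000)  # serves metrics at :8000/metrics

for _ in ["person", "car"]:  # stand-in for real detector output
    DETECTIONS.inc()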

Prometheus Setup

Install Prometheus using Helm:

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus-stack prometheus-community/kube-prometheus-stack

Grafana Access

Expose Grafana to your local machine (the service name below assumes the Helm release was named prometheus-stack):

kubectl port-forward service/prometheus-stack-grafana 3000:80

Access the dashboard at http://localhost:3000.

Troubleshooting

GPU Not Detected

  1. Verify NVIDIA drivers are installed on all nodes
  2. Check that the NVIDIA Container Toolkit is properly configured
  3. Run the GPU smoke pipeline to test GPU access

Pipeline Failures

  1. Check pod logs for error messages
  2. Verify all required resources are available
  3. Ensure proper permissions for service accounts

Performance Issues

  1. Monitor GPU utilization using nvidia-smi
  2. Check resource requests and limits in pod specifications
  3. Consider optimizing batch sizes and model configurations

Contributing

We welcome contributions! Please follow these steps:

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests if applicable
  5. Submit a pull request

Code Style

  • Follow Python PEP 8 style guidelines
  • Include documentation for new components
  • Update README.md for new features

License

This project is licensed under the MIT License - see the LICENSE file for details.

Copyright (c) 2021 Kerberos.io
