docs: add KernelCI + labgrid integration research by aparcar · Pull Request #214 · aparcar/openwrt-tests

aparcar · 2026-01-25T08:27:26Z

Comprehensive research document analyzing how to integrate KernelCI
as a backend for OpenWrt testing infrastructure while preserving
the existing labgrid-based test framework.

Key findings:

- KernelCI's new pull-mode architecture enables secure lab federation
- Labgrid adapter approach (used by Pengutronix) is recommended
- KCIDB-ng provides standardized results submission API
- Phased implementation starting with results integration

Document includes:

- Current infrastructure analysis (7 labs, 38+ devices)
- KernelCI architecture overview (Maestro, KCIDB, Events)
- Four integration options with trade-offs
- Detailed 4-phase implementation plan
- Technical specifications and code examples

Major update to the KernelCI integration document focusing on self-hosted deployment for OpenWrt firmware testing.
Key additions:
- Complete Docker Compose deployment stack
  - MongoDB, Redis, MinIO for storage
  - KernelCI API (Maestro) and Pipeline services
  - Dashboard with OpenWrt-specific views
  - Traefik reverse proxy with TLS
- Multi-source firmware management
  - Official OpenWrt releases (snapshot, stable, oldstable)
  - GitHub PR artifact integration
  - Custom developer upload API
  - Buildbot webhook integration
- Comprehensive health check system
  - Periodic device health monitoring
  - Automatic device disable on failures
  - GitHub issue creation/closure
  - Visual fleet status dashboard
- OpenWrt-specific adaptations
  - Custom firmware schema (replaces kernel builds)
  - Test plan definitions matching existing pytest suite
  - Feature-based job scheduling
  - Device capability mapping
- Labgrid adapter for pull-mode operation
  - Labs stay behind firewalls
  - Job polling from central KernelCI
  - Preserves existing 38+ device targets
- 5-phase implementation plan with clear deliverables

Implements the self-hosted KernelCI infrastructure for OpenWrt testing:
Docker Compose Stack:
- MongoDB 7.0 for data storage with initialization script
- Redis 7 for pub/sub messaging
- MinIO for S3-compatible artifact storage
- KernelCI API (Maestro) for job management
- Traefik reverse proxy with automatic TLS
- Pipeline services (trigger, scheduler, health, results)
- Dashboard for result visualization
Configuration:
- api-config.toml: KernelCI API settings with OpenWrt customizations
- pipeline.yaml: Firmware sources, test plans, scheduler settings
- mongo-init.js: Database collections and indexes
- .env.example: Environment variable template
Pipeline Core Modules:
- models.py: Pydantic models for firmware, jobs, results, devices
- config.py: Configuration loading from env and YAML
- api_client.py: Async HTTP client for KernelCI API
Key Features:
- Multi-source firmware support (official, PR, custom, buildbot)
- Test plan definitions matching existing pytest suite
- Device type mapping to OpenWrt targets
- Health check configuration
- JWT authentication
- S3 artifact storage

Implements firmware source modules for multi-source firmware ingestion:
Official Release Source (official.py):
- Scans downloads.openwrt.org for profiles.json files
- Supports snapshot, stable, and oldstable releases
- Extracts firmware metadata and artifact URLs
- Calculates checksums for verification
- Configurable target filtering for efficiency
GitHub PR Source (github_pr.py):
- Monitors PRs with trigger labels (ci-test-requested)
- Extracts firmware from workflow run artifacts
- Parses target info from artifact names
- Supports PR status updates and comments
- Automatic artifact download and extraction
Custom Upload Handler (custom.py):
- FastAPI router for firmware uploads
- Validates file size and extensions
- Stores firmware in MinIO
- Generates unique firmware IDs
- Auto-detects firmware type from filename
Firmware Trigger Service (firmware_trigger.py):
- Main orchestration service
- Initializes and manages all sources
- Periodic scanning with configurable intervals
- Creates firmware entries in KernelCI API
- Publishes events for job scheduling
- Includes health check endpoint
- FastAPI server for upload API
Base Classes:
- FirmwareSource abstract base class
- Consistent interface for all source types
- Async generator pattern for scanning
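The "FirmwareSource abstract base class" with an "async generator pattern for scanning" could look roughly like the sketch below. Only the class name `FirmwareSource` comes from the commit message; the `Firmware` record, its fields, the `scan()` method name, `StaticSource`, and `collect()` are assumptions made for illustration.

```python
import asyncio
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import AsyncIterator


@dataclass
class Firmware:
    """Hypothetical metadata record; the field names are illustrative."""
    source: str
    target: str
    version: str
    url: str


class FirmwareSource(ABC):
    """Common interface: every source yields Firmware records lazily."""

    name = "base"

    @abstractmethod
    def scan(self) -> AsyncIterator[Firmware]:
        """Yield firmware discovered by this source."""


class StaticSource(FirmwareSource):
    """Toy source that yields a fixed list instead of doing network I/O."""

    name = "static"

    def __init__(self, items):
        self._items = list(items)

    async def scan(self) -> AsyncIterator[Firmware]:
        for item in self._items:
            await asyncio.sleep(0)  # stand-in for an HTTP request
            yield item


async def collect(sources):
    """Drain every source in turn and return all firmware found."""
    found = []
    for source in sources:
        async for firmware in source.scan():
            found.append(firmware)
    return found
```

Because each source is an async generator, the trigger service can start creating API entries for the first results while later requests are still in flight.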

Implements the bridge between KernelCI and labgrid test labs using pull-mode architecture where labs fetch jobs from the central API.
Labgrid Adapter (kernelci/labgrid-adapter/):
- Dockerfile with QEMU and serial tools
- Pull-mode job poller (poller.py)
  - Registers lab with KernelCI API
  - Sends periodic heartbeats
  - Polls for pending jobs matching device capabilities
  - Claims and dispatches jobs to executor
- Test executor (executor.py)
  - Downloads firmware artifacts with caching
  - Builds pytest command with labgrid integration
  - Captures console logs and test output
  - Parses pytest JSON results
  - Uploads logs to MinIO storage
- Main service (service.py)
  - Discovers devices from target YAML files
  - Extracts features from labgrid configs
  - Coordinates poller and executor
  - Handles graceful shutdown
  - Configuration via environment variables
Test Scheduler (openwrt-pipeline/test_scheduler.py):
- Listens for new firmware events
- Finds compatible devices based on target/subtarget
- Creates test jobs with appropriate test plans
- Feature-based test plan assignment
- Priority-based scheduling (PR > snapshot > stable)
- Handles job monitoring and timeouts
Key Features:
- Labs stay behind firewalls (pull-mode)
- Automatic device discovery from target files
- Feature-based test filtering
- Firmware caching for efficiency
- Console log capture and upload
- pytest JSON result parsing

Implements comprehensive device health monitoring with automated notifications and device management.
Device Registry (health/device_registry.py):
- Tracks health status for all devices
- Status levels: healthy, failing, disabled, unknown
- Configurable failure thresholds (warning, disable)
- Last check and consecutive failure tracking
- Bulk status queries and summary generation
- Automatic status transitions based on results
Notification Manager (health/notifications.py):
- GitHub issue creation for disabled devices
- Auto-close issues when devices recover
- Issue caching to prevent duplicates
- Formatted issue body with device details
- Console log links in issues
- Resolution steps documentation
Health Check Scheduler (health/scheduler.py):
- Periodic check scheduling based on interval
- High-priority health check job creation
- Job completion monitoring
- Result processing with status updates
- Recovery detection and notification
- Manual health check trigger API
- Status reporting endpoint
Key Features:
- Devices automatically disabled after threshold failures
- GitHub issues track device problems
- Automatic issue closure on recovery
- Minimal tests (shell + SSH) for quick checks
- Skip firmware flash for health checks
- Concurrent schedule and monitor loops
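The status levels and "configurable failure thresholds (warning, disable)" suggest a small state machine driven by consecutive failures. The following is a minimal sketch of that idea, not the code in device_registry.py; the names `DeviceHealth` and `record_check` and the default thresholds of 1 and 3 are assumptions.

```python
from dataclasses import dataclass
from enum import Enum


class HealthStatus(str, Enum):
    HEALTHY = "healthy"
    FAILING = "failing"
    DISABLED = "disabled"
    UNKNOWN = "unknown"


@dataclass
class DeviceHealth:
    name: str
    status: HealthStatus = HealthStatus.UNKNOWN
    consecutive_failures: int = 0


def record_check(device, passed, warn_at=1, disable_at=3):
    """Apply one health check result; return (old_status, new_status).

    warn_at and disable_at are assumed defaults, not values from the PR.
    """
    old = device.status
    if passed:
        # Any successful check resets the counter and re-enables the device.
        device.consecutive_failures = 0
        device.status = HealthStatus.HEALTHY
    else:
        device.consecutive_failures += 1
        if device.consecutive_failures >= disable_at:
            device.status = HealthStatus.DISABLED
        elif device.consecutive_failures >= warn_at:
            device.status = HealthStatus.FAILING
    return old, device.status
```

Returning both the old and the new status lets a caller detect transitions, for example to open an issue on `failing -> disabled` and close it on `disabled -> healthy`.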

Add React TypeScript components for the KernelCI dashboard:
- DeviceFleetStatus: Visual overview of devices across all labs with health status indicators, feature tags, and quick actions
- FirmwareMatrix: Matrix view showing test results with devices as rows and firmware versions as columns, with drill-down to individual tests
- HealthCheckDashboard: Device health monitoring with summary stats, device status table, health check history timeline, and manual controls
- PRStatusView: GitHub PR testing status with PR list, test progress, job details, and direct links to GitHub and artifacts
Components are designed to integrate with KernelCI dashboard or can be deployed as a custom dashboard extension.

Update labgrid adapter configuration to use the modern gRPC-based coordinator instead of the legacy Crossbar/WAMP protocol:
- Rename lg_crossbar config to lg_coordinator (host:port format)
- Set LG_COORDINATOR environment variable for pytest execution
- Add grpcio dependencies to requirements.txt

- Remove unused imports across all modules
- Fix f-strings without placeholders (use plain strings for structlog)
- Rename ambiguous variable 'l' to 'lbl' in github_pr.py
- Remove unused local variables
- Sort imports with isort rules
- Apply consistent code formatting

- Add ruff and isort configuration to pyproject.toml
- Configure ruff to handle import sorting (I rules)
- Remove test_lan_interface_has_neighbor which fails inconsistently (IPv6 multicast ping doesn't always return DUP! responses)
- Update test plan configs to remove the flaky test

- Break long f-strings across multiple lines
- Extract long shell commands into variables
- Wrap long docstrings at 88 characters
- Fix commented code line lengths

Remove custom dashboard components - use the standard KernelCI dashboard instead (ghcr.io/kernelci/dashboard). The dashboard connects to the same API and provides all needed visualization.
Move health check from pipeline to labgrid-adapter:
- Health checks are a lab maintenance concern, not public-facing
- Lab maintainers run checks locally, not via KernelCI
- Add standalone health_check.py tool for lab maintainers
Removed:
- kernelci/dashboard/ (custom React components)
- kernelci/openwrt-pipeline/openwrt_pipeline/health/ (pipeline health)
- pipeline-health and pipeline-results services from docker-compose
Added:
- labgrid_kci_adapter/health_check.py (lab-side tool)

Add automatic health check functionality to the labgrid adapter:
- Health checks run every 24 hours by default (configurable via HEALTH_CHECK_INTERVAL environment variable)
- Devices that fail health checks are removed from the job pool
- Devices that recover are automatically re-added
- Initial health check runs at startup before accepting jobs
Configuration options:
- HEALTH_CHECK_INTERVAL: seconds between checks (default: 86400 = 24h)
- HEALTH_CHECK_ENABLED: set to false to disable (default: true)
This ensures only working devices receive test jobs from KernelCI, and lab maintainers are informed via logs when devices fail.
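The two environment variables and their defaults can be read as in the sketch below. The variable names and defaults come from the commit message; `HealthCheckConfig`, `load_health_config`, and the extra spellings accepted for "false" (`0`, `no`, `off`) are assumptions.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class HealthCheckConfig:
    enabled: bool = True
    interval: int = 86400  # seconds, 24h


def load_health_config(env=None):
    """Read HEALTH_CHECK_ENABLED and HEALTH_CHECK_INTERVAL from a mapping."""
    env = os.environ if env is None else env
    raw_enabled = env.get("HEALTH_CHECK_ENABLED", "true").strip().lower()
    # Only "false" is documented; the other spellings are an assumption.
    enabled = raw_enabled not in {"false", "0", "no", "off"}
    interval = int(env.get("HEALTH_CHECK_INTERVAL", "86400"))
    if interval <= 0:
        raise ValueError("HEALTH_CHECK_INTERVAL must be a positive number of seconds")
    return HealthCheckConfig(enabled=enabled, interval=interval)
```

Passing the mapping explicitly, instead of always reading `os.environ`, keeps the function easy to test.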

API Client:
- Rewrite to use KernelCI's Node-based API (/latest/nodes endpoint)
- Jobs are nodes with kind=job, tests are nodes with kind=test
- Use state field (available/running/done) for job lifecycle
- Add OpenWrt-specific helpers (create_firmware_node, create_test_job)
Job Poller:
- Update to query /latest/nodes with kind=job, state=available
- Claim jobs by updating node state to 'running'
- Simplified implementation without custom lab registration
GitHub Status:
- Add GitHubStatusPoster for commit status and PR comments
- Post test results as commit statuses with device context
- Create detailed PR comments for test failures
- Support multi-device testing with separate status contexts
Documentation:
- Update README with Node-based API reference
- Document lab configuration and health checks
- Remove references to removed services (pipeline-health, pipeline-results)
- Add API examples for node operations
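The poller's two steps (query `/latest/nodes` for `kind=job`, `state=available`, then claim by setting the state to `running`) can be sketched without any network code. The endpoint, `kind`, and `state` values come from the commit message; the `data.device_type` filter, the `limit` parameter, the `data.lab` field, and both function names are hypothetical.

```python
from urllib.parse import urlencode


def poll_url(api_base, device_type, limit=10):
    """Build the query URL for available jobs of one device type."""
    params = {
        "kind": "job",
        "state": "available",
        "limit": limit,
        # Hypothetical filter field; the real job schema may differ.
        "data.device_type": device_type,
    }
    return f"{api_base.rstrip('/')}/latest/nodes?{urlencode(params)}"


def claim(node, lab_name):
    """Return a copy of the node marked as running by this lab."""
    if node.get("state") != "available":
        raise ValueError(f"node {node.get('id')} is not available")
    claimed = dict(node)
    claimed["state"] = "running"
    # Recording the lab name under data.lab is an assumption.
    claimed["data"] = {**node.get("data", {}), "lab": lab_name}
    return claimed
```

A read-then-update claim like this is not atomic on its own: two labs polling at the same moment could both see the node as available, so the server would need to reject the second update.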

Major changes:
- Use pytest.main() with ResultCollectorPlugin instead of subprocess
- Consolidate duplicate firmware ID generation into base.py
- Consolidate duplicate firmware type detection into base.py
- Fix test_scheduler to use correct Node-based API methods
- Remove unused pub/sub subscribe stub from api_client
- Remove dead code (_scan_all_targets unreachable yield)
- Move inline import to top-level in custom.py
- Update documentation for pytest execution and health checks
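Running pytest in-process works by passing a plugin object to `pytest.main()` and implementing the `pytest_runtest_logreport` hook on it. The sketch below shows one way such a `ResultCollectorPlugin` might be written; the class name is from the commit message, while the `results` attribute and the `run_tests` helper are assumptions.

```python
class ResultCollectorPlugin:
    """Collect one outcome per test node ID from pytest's report hook."""

    def __init__(self):
        self.results = {}

    def pytest_runtest_logreport(self, report):
        # pytest emits a report for each of setup, call, and teardown.
        # Keep the call outcome, plus any setup/teardown that did not pass
        # (for example a skip raised during setup).
        if report.when == "call" or report.outcome != "passed":
            self.results[report.nodeid] = report.outcome


def run_tests(test_dir, extra_args=()):
    """Run pytest in-process and return (exit_code, {nodeid: outcome})."""
    import pytest  # imported lazily; the collector itself has no dependency

    collector = ResultCollectorPlugin()
    exit_code = pytest.main([str(test_dir), "-q", *extra_args], plugins=[collector])
    return int(exit_code), collector.results
```

Compared with a subprocess, this avoids parsing a JSON report file, at the cost of sharing the interpreter with the tests.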

Following the LAVA approach where tests are fetched at job execution time, add support for:
1. Per-job test fetching: Job definition includes tests_repo URL, adapter fetches tests when executing (recommended for shared tests)
2. Static sync: Configure TESTS_REPO_URL to sync tests on startup and periodically (simpler setup for fixed test sets)
This ensures all labs run the same version of tests without manual synchronization.
New config options:
- TESTS_REPO_URL: Git URL for static test sync
- TESTS_REPO_BRANCH: Branch to use (default: main)
- TESTS_SYNC_INTERVAL: Seconds between syncs (default: 3600)
Job data options:
- tests_repo: Git URL for per-job test fetch
- tests_branch: Branch to use (default: main)

Simplify test synchronization:
- Pull tests from git before each job execution
- Clone if repo doesn't exist, update if it does
- Remove background sync loop (no more TESTS_SYNC_INTERVAL)
This is simpler and follows LAVA pattern more closely where tests are fetched at job execution time.
Config options:
- TESTS_REPO_URL: Git URL for tests (pulled before each job)
- TESTS_REPO_BRANCH: Branch to use (default: main)
Jobs can override with tests_repo/tests_branch in job data.
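"Clone if repo doesn't exist, update if it does" reduces to choosing between two sets of git commands. A minimal sketch follows; `ensure_tests` is a name the PR uses elsewhere, but this signature, the `sync_commands` helper, and the choice of `fetch` plus `reset --hard` for the update path are assumptions.

```python
import subprocess
from pathlib import Path


def sync_commands(repo_url, branch, dest):
    """Return the git commands needed to bring dest up to date."""
    dest = Path(dest)
    if (dest / ".git").is_dir():
        # Existing checkout: fetch the branch and hard-reset onto it so
        # local modifications never leak into a test run.
        return [
            ["git", "-C", str(dest), "fetch", "origin", branch],
            ["git", "-C", str(dest), "checkout", branch],
            ["git", "-C", str(dest), "reset", "--hard", f"origin/{branch}"],
        ]
    return [["git", "clone", "--branch", branch, repo_url, str(dest)]]


def ensure_tests(repo_url, branch="main", dest="tests-repo"):
    """Clone or update the tests repository and return its path."""
    for command in sync_commands(repo_url, branch, dest):
        subprocess.run(command, check=True)
    return Path(dest)
```

Separating command selection from execution means the decision logic can be tested without a git binary or network access.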

Configure proper tree/branch mapping for OpenWrt:
- Tree: openwrt
- Branches: main (SNAPSHOT), openwrt-24.10, openwrt-25.12
Node structure now includes:
- group: tree identifier for dashboard grouping
- data.kernel_revision: {tree, branch, commit, url}
- path: [tree, branch, target, subtarget, profile]
This enables the KernelCI dashboard to properly display test results organized by branch.
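The node fields listed above can be assembled as in this sketch. The keys `group`, `data.kernel_revision`, and `path` and their contents follow the commit message; the function name and its parameters are hypothetical, and any other fields a real node requires are omitted.

```python
def firmware_node_fields(branch, commit, url, target, subtarget, profile, tree="openwrt"):
    """Build the tree/branch-related fields of a firmware node."""
    return {
        "group": tree,  # tree identifier used for dashboard grouping
        "data": {
            "kernel_revision": {
                "tree": tree,
                "branch": branch,
                "commit": commit,
                "url": url,
            },
        },
        "path": [tree, branch, target, subtarget, profile],
    }
```

For example, a snapshot build for `ath79/generic` would use `branch="main"` and produce a path starting with `["openwrt", "main", "ath79", "generic", ...]`.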

Instead of hardcoding versions in pipeline.yaml, now fetches active branches dynamically from downloads.openwrt.org/.versions.json
This automatically discovers:
- main (SNAPSHOT builds)
- stable (current release from stable_version)
- oldstable (previous release series from versions_list)
Changes:
- Add versions.py module with get_active_branches()
- Update firmware_trigger to create sources dynamically
- Simplify pipeline.yaml to just specify targets
Config now only needs:
  targets: [ath79/generic, x86/64, ...]
  include_snapshot: true
  include_oldstable: true
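Deriving branches from an already-parsed `.versions.json` document might work as sketched here. The keys `stable_version` and `versions_list` and the function names `version_to_branch()` and `get_active_branches()` appear in the commit messages; the signatures, the returned mapping, and the rule used to pick oldstable (newest non-rc version from an older series) are assumptions.

```python
def version_to_branch(version):
    """Map '24.10.2' to 'openwrt-24.10' and 'SNAPSHOT' to 'main'."""
    if version.upper() == "SNAPSHOT":
        return "main"
    major, minor = version.split(".")[:2]
    return f"openwrt-{major}.{minor}"


def _key(version):
    """Turn '23.05.5' into (23, 5, 5) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def get_active_branches(versions, include_snapshot=True, include_oldstable=True):
    """Return {'main'|'stable'|'oldstable': branch_name} from parsed JSON."""
    stable = versions["stable_version"]
    active = {}
    if include_snapshot:
        active["main"] = "main"
    active["stable"] = version_to_branch(stable)
    if include_oldstable:
        # Assumed rule: newest final release from a series older than stable.
        older = [
            v for v in versions.get("versions_list", [])
            if "-" not in v and _key(v)[:2] < _key(stable)[:2]
        ]
        if older:
            active["oldstable"] = version_to_branch(max(older, key=_key))
    return active
```

The two `include_*` parameters mirror the `include_snapshot` and `include_oldstable` options that remain in pipeline.yaml.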

The labgrid-kci-adapter is now a generic, reusable component that can be used by any project connecting labgrid to KernelCI.
Changes:
- Make MinIO bucket name configurable (MINIO_LOGS_BUCKET)
- Add comprehensive README for labgrid-adapter explaining:
  - Architecture and features
  - Configuration options
  - How to use with other projects
  - Test structure and job format
- Update main README to document modular architecture
The adapter is designed to be extracted into its own repository for use by other projects beyond OpenWrt.

labgrid-adapter tests:
- test_test_sync.py: Tests for ensure_tests(), git operations
- test_executor.py: Tests for ResultCollectorPlugin, TestExecutor
openwrt-pipeline tests:
- test_versions.py: Tests for version_to_branch(), get_active_branches()
- test_api_client.py: Tests for KernelCIClient, node operations
All tests use pytest with async support and mocking for external dependencies (HTTP clients, git operations).

Allow specifying a subdirectory within the tests repository that contains the actual test files. This supports monorepo structures where tests might be in a subfolder like "tests/openwrt".
- Add TESTS_REPO_SUBDIR config option (default: empty string)
- Update ensure_tests() to accept subdir parameter
- Return path to subdirectory when specified
- Validate subdirectory exists after clone/update
- Add comprehensive tests for subdirectory functionality

…ices
When a lab has multiple physical devices of the same type (e.g., 3x openwrt_one), it can now run tests for different firmware versions in parallel across all available devices.
Changes:
- Add LabgridClient to query coordinator for available places
- Update poller to track jobs per device type (not just job IDs)
- Query labgrid coordinator for free slots before claiming jobs
- Claim multiple jobs for same device type if places available
- Fix bug where device name was checked against job ID set
This follows the model: one job per (firmware_version, device_type), with parallel execution when multiple physical devices exist.
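The claiming rule (take as many jobs per device type as there are free places) can be sketched as a pure function. How `LabgridClient` reports free places is not shown in the commit message, so the `free_places` mapping, the job dict keys, and the function name below are assumptions.

```python
def select_jobs_to_claim(available_jobs, free_places):
    """Pick job IDs to claim without exceeding free places per device type.

    available_jobs: iterable of {"id": ..., "device_type": ...} dicts,
                    in the order they should be served.
    free_places:    {device_type: number of free labgrid places}.
    """
    remaining = dict(free_places)  # do not mutate the caller's mapping
    claimed = []
    for job in available_jobs:
        device_type = job["device_type"]
        if remaining.get(device_type, 0) > 0:
            remaining[device_type] -= 1
            claimed.append(job["id"])
    return claimed
```

With three free `openwrt_one` places and four pending jobs for that type, the first three are claimed and the fourth waits for the next poll.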

Adds infrastructure to distinguish between firmware tests (OpenWrt functionality) and kernel selftests (Linux kernel validation). Each test type can require different firmware images and device capabilities.
New components:
- asu_client.py: Client for sysupgrade.openwrt.org API to build custom images with additional packages (bash, python3, kselftest packages)
- test_types.py: Defines TestType enum, ImageProfile, and TestTypeConfig with required capabilities and packages for each test type
Key changes:
- Scheduler creates jobs per test type, building custom images via ASU when needed (kselftest requires packages not in standard images)
- Jobs include test_type field for lab filtering
- Devices declare capabilities (serial_console, isolated_network, etc.)
- Labs can filter jobs by supported_test_types config
- Pipeline config includes enabled_test_types and device capabilities
Test types:
- firmware: Standard OpenWrt tests, uses official images
- kselftest: Kernel tests, requires custom image with kselftest packages
The kselftest packages (kselftests-net, kselftests-timers, etc.) are assumed to exist in OpenWrt feeds - they will be created separately.

Update package names to match the actual OpenWrt kselftest packages:
- kselftests-size: Binary size test
- kselftests-kcmp: Process comparison tests
- kselftests-rtc: Real-time clock tests
- kselftests-timers: Timer subsystem tests
- kselftests-futex: Futex tests
- kselftests-exec: Program execution tests
- kselftests-clone3: clone3 syscall tests
- kselftests-openat2: openat2 syscall tests
- kselftests-mincore: mincore syscall tests
- kselftests-mqueue: POSIX message queue tests
- kselftests-net: Networking stack tests
- kselftests-sigaltstack: Signal alternate stack tests
- kselftests-splice: splice syscall tests
- kselftests-sync: sync_file_range tests
Added corresponding test plans in pipeline.yaml for each subsystem.

Add KTAP (Kernel Test Anything Protocol) parser to extract individual subtest results from kselftest output. This allows KernelCI to report granular pass/fail status for each kselftest subtest instead of just the overall test result.
Changes:
- Add ktap_parser.py with support for:
  - TAP version 13/14 and KTAP version 1 formats
  - Nested subtests via 2-space indentation
  - Directives: SKIP, TODO, XFAIL, TIMEOUT, ERROR
  - Hierarchical test naming (e.g., "kselftest.net.socket.af_inet")
- Update executor.py to:
  - Capture stdout per-test for KTAP parsing
  - Detect KTAP output and expand into individual TestResult objects
  - Fall back to standard pytest result handling when no KTAP detected
- Add pytest wrapper tests in tests/kselftest/ that run kselftest subsystems and print KTAP output for capture
- Add comprehensive unit tests for KTAP parser
This follows the LAVA/KernelCI pattern where test results are reported as flat nodes with hierarchical names, allowing the dashboard to show individual subtest results.
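To make the nesting and naming rules concrete, here is a deliberately small parser sketch, not the contents of ktap_parser.py. It relies on the KTAP convention that indented subtest lines appear before their parent's result line, handles only the SKIP, ERROR, and TIMEOUT directives (TODO and XFAIL are left out), and all names in it (`KtapResult`, `parse_ktap`, `RESULT_RE`) are assumptions.

```python
import re
from dataclasses import dataclass

RESULT_RE = re.compile(
    r"^(?P<indent> *)(?P<ok>ok|not ok) (?P<num>\d+)(?: -)? ?"
    r"(?P<name>[^#]*?)\s*(?:# *(?P<directive>\w+).*)?$"
)


@dataclass
class KtapResult:
    parts: list
    status: str

    @property
    def name(self):
        return ".".join(self.parts)


def parse_ktap(text, root="kselftest"):
    """Flatten nested (K)TAP results into hierarchically named entries."""
    results = []
    pending = {}  # depth -> results still waiting for their parent's name
    for line in text.splitlines():
        match = RESULT_RE.match(line)
        if match is None:
            continue  # version lines, plans, and diagnostics
        depth = len(match["indent"]) // 2
        name = match["name"].strip() or f"test_{match['num']}"
        directive = (match["directive"] or "").upper()
        if directive == "SKIP":
            status = "skip"
        elif directive in ("ERROR", "TIMEOUT"):
            status = "error"
        else:
            status = "pass" if match["ok"] == "ok" else "fail"
        # Everything seen one level deeper since the previous result at this
        # depth belongs to this test, so prefix it with this test's name.
        subtree = pending.pop(depth + 1, [])
        for child in subtree:
            child.parts.insert(0, name)
        result = KtapResult([name], status)
        results.append(result)
        pending.setdefault(depth, []).extend(subtree + [result])
    for result in results:
        result.parts.insert(0, root)
    return results
```

Each returned entry carries its full dotted name, which matches the "flat nodes with hierarchical names" reporting model described above.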

Fix critical bug where KTAP status values were not correctly mapped:
- KTAP returns: "pass", "fail", "skip", "error"
- Pytest returns: "passed", "failed", "skipped"
- Created separate status maps for each to avoid all KTAP results being incorrectly marked as ERROR
Also:
- Add docstring to TestStatus enum in ktap_parser.py noting it mirrors models.TestStatus (kept separate to avoid pydantic dependency)
- Add comprehensive integration tests for KTAP-executor bridge:
  - Test _try_parse_ktap with valid/invalid KTAP
  - Test nested KTAP subtests parsing
  - Test _convert_results expands KTAP into multiple TestResults
  - Test mixed KTAP and regular pytest results
  - Test stdout capture from report sections
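The bug is easy to reproduce in miniature: looking up a KTAP status in a table keyed by pytest outcomes misses every entry and falls through to the error default. The raw status strings come from the commit message; the table names, the uppercase target values, and `normalize_status` are assumptions.

```python
KTAP_STATUS = {"pass": "PASS", "fail": "FAIL", "skip": "SKIP", "error": "ERROR"}
PYTEST_STATUS = {"passed": "PASS", "failed": "FAIL", "skipped": "SKIP"}


def normalize_status(raw, source):
    """Map a raw status to a common value using the table for its source."""
    table = KTAP_STATUS if source == "ktap" else PYTEST_STATUS
    # Unknown values fall back to ERROR, which is why using the wrong
    # table silently turned every KTAP result into an error.
    return table.get(raw, "ERROR")
```

With separate tables, `normalize_status("pass", "ktap")` yields `"PASS"`, whereas the same value looked up in the pytest table yields `"ERROR"`.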

Improve kselftest fixtures with proper error handling:
- Add KselftestError and KselftestTimeout exception classes
- Wrap shell_command.run() in try/except for timeout handling
- Add _validate_ktap_output() to warn if output isn't KTAP format
- Log warnings for empty output or missing KTAP markers
- Log info for non-zero exit codes (normal for failed subtests)
Add comprehensive README.md documenting:
- KTAP format overview
- Fixture usage examples
- Test plan mapping table
- Result flow diagram
- Troubleshooting guide
- Required packages list
- Device capabilities requirements

- Update executor to use https:// for MinIO log URLs when minio_secure=true
- Update KCIDB bridge to expand test_results array into individual test entries
- Each test now has its own KCIDB entry with path format: device.plan.test_name
- Increase job query limit to 500 with state=done filter for better coverage
- Log URLs are attached to each individual test entry
Dashboard now shows:
- Individual test names (test_shell, test_uname, etc.)
- Per-test status (PASS, SKIP, FAIL)
- Clickable log URLs for each test
Files modified:
- kernelci/labgrid-adapter/labgrid_kci_adapter/executor.py
- kernelci/openwrt-pipeline/openwrt_pipeline/kcidb_bridge.py
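Expanding a job's `test_results` array into one entry per test, each with a `device.plan.test_name` path and the shared log URL, might look like this sketch. The path format and the `test_results` and `log_url` names come from the commit message; the job layout (`data.device`, `data.plan`, per-test `name` and `status`) and the function name are hypothetical, and a real KCIDB submission needs further fields not shown here.

```python
def expand_test_results(job):
    """Turn one finished job node into a list of per-test entries."""
    data = job["data"]
    prefix = f"{data['device']}.{data['plan']}"
    return [
        {
            "path": f"{prefix}.{test['name']}",
            "status": test["status"].upper(),
            "log_url": data.get("log_url"),  # same log attached to every test
        }
        for test in data.get("test_results", [])
    ]
```

A job with no `test_results` simply produces an empty list, so the bridge can skip it without special handling.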

Add support for capturing the kernel boot log (serial console output) during device boot via labgrid's --lg-log option.
Changes:
- Add --lg-log parameter to pytest to capture serial console output
- Add _upload_boot_log() method to find and upload labgrid console logs
- Update _upload_log() to accept custom log names
- Add boot_log_url field to JobResult model
- Store boot_log_url in job data when submitting results
Boot logs are now available at: https://storage.openwrt-kci.aparcar.org/logs/logs/{job_id}/boot.log
This provides visibility into:
- Bootloader output (U-Boot, stage1)
- Kernel boot messages
- Device initialization
- Boot failures before tests start

…est output Add --log-cli-level=CONSOLE and --lg-colored-steps to pytest args, matching the Makefile approach. This streams the labgrid serial console output (boot log) directly into the pytest output, making it visible in the single log_url in the KCIDB dashboard. The combined boot log + pytest output is now available in one file without needing separate log URLs.

Fetch log content from log_url and extract a relevant excerpt (up to 16KB as per KCIDB schema limit). The excerpt prioritizes:
- pytest summary sections (passed/failed)
- Error messages and failures
- Last portion of log as fallback
This populates the 'Log Excerpt' section in the KCIDB dashboard instead of showing 'No Log Excerpt available'.
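One way to implement that priority order is to start the excerpt at the first pytest failure or summary header and fall back to the tail of the log when none is present. This is a sketch under assumptions: the marker strings match pytest's default section headers, the 16 KB limit is applied to UTF-8 bytes, and `extract_excerpt` is a hypothetical name.

```python
MAX_EXCERPT_BYTES = 16 * 1024

# Substrings of pytest's default section headers, e.g.
# "======== FAILURES ========" or "==== short test summary info ====".
MARKERS = ("= FAILURES =", "= ERRORS =", "short test summary info")


def extract_excerpt(log, limit=MAX_EXCERPT_BYTES):
    """Return at most `limit` bytes of the most relevant part of a log."""
    positions = [log.find(marker) for marker in MARKERS]
    positions = [pos for pos in positions if pos >= 0]
    if positions:
        # Start at the beginning of the line holding the earliest marker.
        begin = log.rfind("\n", 0, min(positions)) + 1
        data = log[begin:].encode("utf-8")[:limit]
    else:
        # Fallback: the last part of the log.
        data = log.encode("utf-8")[-limit:]
    # Drop any multi-byte character cut in half by the byte slicing.
    return data.decode("utf-8", errors="ignore")
```

If the failure section alone exceeds the limit, this version keeps its beginning and drops the trailing summary; a fuller implementation could reserve space for both.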

claude and others added 30 commits January 24, 2026 20:14

style: fix line length violations for ruff compliance

040c9ca

aparcar added 2 commits February 4, 2026 03:33

aparcar wants to merge 32 commits into main from claude/kernelci-labgrid-research-3yMbl

aparcar commented Jan 25, 2026
