docs: add KernelCI + labgrid integration research #214

Open
aparcar wants to merge 32 commits into main from claude/kernelci-labgrid-research-3yMbl

aparcar (Owner) commented Jan 25, 2026

Comprehensive research document analyzing how to integrate KernelCI
as a backend for OpenWrt testing infrastructure while preserving
the existing labgrid-based test framework.

Key findings:

  • KernelCI's new pull-mode architecture enables secure lab federation
  • Labgrid adapter approach (used by Pengutronix) is recommended
  • KCIDB-ng provides standardized results submission API
  • Phased implementation starting with results integration

Document includes:

  • Current infrastructure analysis (7 labs, 38+ devices)
  • KernelCI architecture overview (Maestro, KCIDB, Events)
  • Four integration options with trade-offs
  • Detailed 4-phase implementation plan
  • Technical specifications and code examples

claude and others added 30 commits January 24, 2026 20:14
Major update to the KernelCI integration document focusing on
self-hosted deployment for OpenWrt firmware testing.

Key additions:
- Complete Docker Compose deployment stack
  - MongoDB, Redis, MinIO for storage
  - KernelCI API (Maestro) and Pipeline services
  - Dashboard with OpenWrt-specific views
  - Traefik reverse proxy with TLS

- Multi-source firmware management
  - Official OpenWrt releases (snapshot, stable, oldstable)
  - GitHub PR artifact integration
  - Custom developer upload API
  - Buildbot webhook integration

- Comprehensive health check system
  - Periodic device health monitoring
  - Automatic device disable on failures
  - GitHub issue creation/closure
  - Visual fleet status dashboard

- OpenWrt-specific adaptations
  - Custom firmware schema (replaces kernel builds)
  - Test plan definitions matching existing pytest suite
  - Feature-based job scheduling
  - Device capability mapping

- Labgrid adapter for pull-mode operation
  - Labs stay behind firewalls
  - Job polling from central KernelCI
  - Preserves existing 38+ device targets

- 5-phase implementation plan with clear deliverables
Implements the self-hosted KernelCI infrastructure for OpenWrt testing:

Docker Compose Stack:
- MongoDB 7.0 for data storage with initialization script
- Redis 7 for pub/sub messaging
- MinIO for S3-compatible artifact storage
- KernelCI API (Maestro) for job management
- Traefik reverse proxy with automatic TLS
- Pipeline services (trigger, scheduler, health, results)
- Dashboard for result visualization

Configuration:
- api-config.toml: KernelCI API settings with OpenWrt customizations
- pipeline.yaml: Firmware sources, test plans, scheduler settings
- mongo-init.js: Database collections and indexes
- .env.example: Environment variable template

Pipeline Core Modules:
- models.py: Pydantic models for firmware, jobs, results, devices
- config.py: Configuration loading from env and YAML
- api_client.py: Async HTTP client for KernelCI API

Key Features:
- Multi-source firmware support (official, PR, custom, buildbot)
- Test plan definitions matching existing pytest suite
- Device type mapping to OpenWrt targets
- Health check configuration
- JWT authentication
- S3 artifact storage
Implements firmware source modules for multi-source firmware ingestion:

Official Release Source (official.py):
- Scans downloads.openwrt.org for profiles.json files
- Supports snapshot, stable, and oldstable releases
- Extracts firmware metadata and artifact URLs
- Calculates checksums for verification
- Configurable target filtering for efficiency

GitHub PR Source (github_pr.py):
- Monitors PRs with trigger labels (ci-test-requested)
- Extracts firmware from workflow run artifacts
- Parses target info from artifact names
- Supports PR status updates and comments
- Automatic artifact download and extraction

Custom Upload Handler (custom.py):
- FastAPI router for firmware uploads
- Validates file size and extensions
- Stores firmware in MinIO
- Generates unique firmware IDs
- Auto-detects firmware type from filename

Firmware Trigger Service (firmware_trigger.py):
- Main orchestration service
- Initializes and manages all sources
- Periodic scanning with configurable intervals
- Creates firmware entries in KernelCI API
- Publishes events for job scheduling
- Includes health check endpoint
- FastAPI server for upload API

Base Classes:
- FirmwareSource abstract base class
- Consistent interface for all source types
- Async generator pattern for scanning
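
The base-class design described above could look roughly like the following. This is a minimal sketch, not the code from the PR: the `Firmware` fields, `StaticSource`, and `collect()` are hypothetical names chosen for illustration, and only the idea of an abstract `FirmwareSource` with an async-generator `scan()` comes from the commit message.

```python
import asyncio
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import AsyncIterator


@dataclass(frozen=True)
class Firmware:
    """Hypothetical firmware record; the real models.py fields may differ."""
    source: str
    target: str
    version: str


class FirmwareSource(ABC):
    """Common interface: every source yields Firmware objects from scan()."""

    @abstractmethod
    def scan(self) -> AsyncIterator[Firmware]:
        """Yield firmware discovered by this source."""


class StaticSource(FirmwareSource):
    """Toy source backed by a fixed list, standing in for official.py etc."""

    def __init__(self, items: list[Firmware]) -> None:
        self._items = items

    async def scan(self) -> AsyncIterator[Firmware]:
        for item in self._items:
            await asyncio.sleep(0)  # placeholder for real network I/O
            yield item


async def collect(sources: list[FirmwareSource]) -> list[Firmware]:
    """Drain every source in turn, as a trigger service might do."""
    found: list[Firmware] = []
    for source in sources:
        async for firmware in source.scan():
            found.append(firmware)
    return found
```

Because each source is an async generator, the caller can start creating firmware entries as soon as the first item arrives instead of waiting for a full scan to finish.
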
Implements the bridge between KernelCI and labgrid test labs using
pull-mode architecture where labs fetch jobs from the central API.

Labgrid Adapter (kernelci/labgrid-adapter/):
- Dockerfile with QEMU and serial tools
- Pull-mode job poller (poller.py)
  - Registers lab with KernelCI API
  - Sends periodic heartbeats
  - Polls for pending jobs matching device capabilities
  - Claims and dispatches jobs to executor
- Test executor (executor.py)
  - Downloads firmware artifacts with caching
  - Builds pytest command with labgrid integration
  - Captures console logs and test output
  - Parses pytest JSON results
  - Uploads logs to MinIO storage
- Main service (service.py)
  - Discovers devices from target YAML files
  - Extracts features from labgrid configs
  - Coordinates poller and executor
  - Handles graceful shutdown
- Configuration via environment variables

Test Scheduler (openwrt-pipeline/test_scheduler.py):
- Listens for new firmware events
- Finds compatible devices based on target/subtarget
- Creates test jobs with appropriate test plans
- Feature-based test plan assignment
- Priority-based scheduling (PR > snapshot > stable)
- Handles job monitoring and timeouts

Key Features:
- Labs stay behind firewalls (pull-mode)
- Automatic device discovery from target files
- Feature-based test filtering
- Firmware caching for efficiency
- Console log capture and upload
- pytest JSON result parsing
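
To illustrate the scheduling rules listed above (target matching, feature-based plan assignment, and PR > snapshot > stable priority), here is a small sketch. The `Device` and `Plan` classes, the dictionary shapes, and the numeric priority values are assumptions; only the ordering itself is taken from the commit message.

```python
from dataclasses import dataclass, field

# Assumed numeric priorities; lower sorts first. Only the ordering
# PR > snapshot > stable comes from the commit message.
PRIORITY = {"pr": 0, "snapshot": 1, "stable": 2}


@dataclass
class Device:
    name: str
    target: str  # e.g. "ath79/generic"
    features: set = field(default_factory=set)


@dataclass
class Plan:
    name: str
    required_features: set = field(default_factory=set)


def plan_jobs(firmware: dict, devices: list, plans: list) -> list:
    """Create one job per (compatible device, runnable plan) pair."""
    jobs = []
    for device in devices:
        if device.target != firmware["target"]:
            continue  # firmware was built for a different target/subtarget
        for plan in plans:
            # A plan is runnable only if the device offers every feature it needs.
            if plan.required_features <= device.features:
                jobs.append(
                    {"device": device.name, "plan": plan.name, "source": firmware["source"]}
                )
    return jobs


def by_priority(jobs: list) -> list:
    """Order jobs so PR builds run before snapshot, and snapshot before stable."""
    return sorted(jobs, key=lambda job: PRIORITY.get(job["source"], len(PRIORITY)))
```
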
Implements comprehensive device health monitoring with automated
notifications and device management.

Device Registry (health/device_registry.py):
- Tracks health status for all devices
- Status levels: healthy, failing, disabled, unknown
- Configurable failure thresholds (warning, disable)
- Last check and consecutive failure tracking
- Bulk status queries and summary generation
- Automatic status transitions based on results

Notification Manager (health/notifications.py):
- GitHub issue creation for disabled devices
- Auto-close issues when devices recover
- Issue caching to prevent duplicates
- Formatted issue body with device details
- Console log links in issues
- Resolution steps documentation

Health Check Scheduler (health/scheduler.py):
- Periodic check scheduling based on interval
- High-priority health check job creation
- Job completion monitoring
- Result processing with status updates
- Recovery detection and notification
- Manual health check trigger API
- Status reporting endpoint

Key Features:
- Devices automatically disabled after threshold failures
- GitHub issues track device problems
- Automatic issue closure on recovery
- Minimal tests (shell + SSH) for quick checks
- Skip firmware flash for health checks
- Concurrent schedule and monitor loops
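
The automatic status transitions can be pictured as a small state machine driven by consecutive failures. The sketch below assumes default thresholds of 1 (warning) and 3 (disable) and a hypothetical `record_result()` helper; the four status names are the ones listed in the commit message.

```python
from dataclasses import dataclass
from enum import Enum


class Status(str, Enum):
    HEALTHY = "healthy"
    FAILING = "failing"
    DISABLED = "disabled"
    UNKNOWN = "unknown"


@dataclass
class DeviceHealth:
    name: str
    status: Status = Status.UNKNOWN
    consecutive_failures: int = 0


def record_result(
    device: DeviceHealth,
    passed: bool,
    warning_threshold: int = 1,  # assumed default
    disable_threshold: int = 3,  # assumed default
) -> Status:
    """Update a device after one health check and return its new status."""
    if passed:
        # Any success resets the counter, which also covers recovery.
        device.consecutive_failures = 0
        device.status = Status.HEALTHY
        return device.status
    device.consecutive_failures += 1
    if device.consecutive_failures >= disable_threshold:
        device.status = Status.DISABLED
    elif device.consecutive_failures >= warning_threshold:
        device.status = Status.FAILING
    return device.status
```

A notification layer could compare the status before and after the call to decide when to open or close a GitHub issue.
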
Add React TypeScript components for the KernelCI dashboard:

- DeviceFleetStatus: Visual overview of devices across all labs with
  health status indicators, feature tags, and quick actions

- FirmwareMatrix: Matrix view showing test results with devices as rows
  and firmware versions as columns, with drill-down to individual tests

- HealthCheckDashboard: Device health monitoring with summary stats,
  device status table, health check history timeline, and manual controls

- PRStatusView: GitHub PR testing status with PR list, test progress,
  job details, and direct links to GitHub and artifacts

Components are designed to integrate with the KernelCI dashboard or to be
deployed as a custom dashboard extension.
Update labgrid adapter configuration to use the modern gRPC-based
coordinator instead of the legacy Crossbar/WAMP protocol:

- Rename lg_crossbar config to lg_coordinator (host:port format)
- Set LG_COORDINATOR environment variable for pytest execution
- Add grpcio dependencies to requirements.txt
- Remove unused imports across all modules
- Fix f-strings without placeholders (use plain strings for structlog)
- Rename ambiguous variable 'l' to 'lbl' in github_pr.py
- Remove unused local variables
- Sort imports with isort rules
- Apply consistent code formatting
- Add ruff and isort configuration to pyproject.toml
- Configure ruff to handle import sorting (I rules)
- Remove test_lan_interface_has_neighbor which fails inconsistently
  (IPv6 multicast ping doesn't always return DUP! responses)
- Update test plan configs to remove the flaky test
- Break long f-strings across multiple lines
- Extract long shell commands into variables
- Wrap long docstrings at 88 characters
- Fix commented code line lengths
Remove custom dashboard components - use the standard KernelCI dashboard
instead (ghcr.io/kernelci/dashboard). The dashboard connects to the
same API and provides all needed visualization.

Move health check from pipeline to labgrid-adapter:
- Health checks are a lab maintenance concern, not public-facing
- Lab maintainers run checks locally, not via KernelCI
- Add standalone health_check.py tool for lab maintainers

Removed:
- kernelci/dashboard/ (custom React components)
- kernelci/openwrt-pipeline/openwrt_pipeline/health/ (pipeline health)
- pipeline-health and pipeline-results services from docker-compose

Added:
- labgrid_kci_adapter/health_check.py (lab-side tool)
Add automatic health check functionality to the labgrid adapter:

- Health checks run every 24 hours by default (configurable via
  HEALTH_CHECK_INTERVAL environment variable)
- Devices that fail health checks are removed from the job pool
- Devices that recover are automatically re-added
- Initial health check runs at startup before accepting jobs

Configuration options:
- HEALTH_CHECK_INTERVAL: seconds between checks (default: 86400 = 24h)
- HEALTH_CHECK_ENABLED: set to false to disable (default: true)

This ensures only working devices receive test jobs from KernelCI,
and lab maintainers are informed via logs when devices fail.
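
Reading the two options from the environment might look like this. It is a sketch under assumptions: the function name and the validation of non-positive intervals are illustrative, while the variable names and defaults are the ones given above.

```python
import os
from typing import Mapping, Optional, Tuple


def load_health_config(env: Optional[Mapping[str, str]] = None) -> Tuple[bool, int]:
    """Return (enabled, interval_seconds) using the documented defaults."""
    env = os.environ if env is None else env
    # Only the literal value "false" disables checks, per the description above.
    enabled = env.get("HEALTH_CHECK_ENABLED", "true").strip().lower() != "false"
    interval = int(env.get("HEALTH_CHECK_INTERVAL", "86400"))  # 86400 s = 24 h
    if interval <= 0:
        # Assumed safeguard; not described in the commit message.
        raise ValueError("HEALTH_CHECK_INTERVAL must be a positive number of seconds")
    return enabled, interval
```
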
API Client:
- Rewrite to use KernelCI's Node-based API (/latest/nodes endpoint)
- Jobs are nodes with kind=job, tests are nodes with kind=test
- Use state field (available/running/done) for job lifecycle
- Add OpenWrt-specific helpers (create_firmware_node, create_test_job)

Job Poller:
- Update to query /latest/nodes with kind=job, state=available
- Claim jobs by updating node state to 'running'
- Simplified implementation without custom lab registration

GitHub Status:
- Add GitHubStatusPoster for commit status and PR comments
- Post test results as commit statuses with device context
- Create detailed PR comments for test failures
- Support multi-device testing with separate status contexts

Documentation:
- Update README with Node-based API reference
- Document lab configuration and health checks
- Remove references to removed services (pipeline-health, pipeline-results)
- Add API examples for node operations
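
The poll-and-claim flow can be sketched on plain dictionaries, without contacting the API. The `kind` and `state` values are the ones named above; the `data.device_type` field and both function names are assumptions made for this example.

```python
def find_available_jobs(nodes: list, device_types: set) -> list:
    """Filter nodes the way a query for kind=job, state=available would."""
    return [
        node
        for node in nodes
        if node.get("kind") == "job"
        and node.get("state") == "available"
        # Assumed field: where the job records the device type it needs.
        and node.get("data", {}).get("device_type") in device_types
    ]


def claim(node: dict) -> bool:
    """Claim a job by moving it from 'available' to 'running'."""
    if node.get("state") != "available":
        return False  # already taken or finished
    node["state"] = "running"
    return True
```

With several labs polling the same API, the state change would need to be atomic on the server side (for example a conditional update); this in-memory sketch does not address that.
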
Major changes:
- Use pytest.main() with ResultCollectorPlugin instead of subprocess
- Consolidate duplicate firmware ID generation into base.py
- Consolidate duplicate firmware type detection into base.py
- Fix test_scheduler to use correct Node-based API methods
- Remove unused pub/sub subscribe stub from api_client
- Remove dead code (_scan_all_targets unreachable yield)
- Move inline import to top-level in custom.py
- Update documentation for pytest execution and health checks
Following the LAVA approach where tests are fetched at job execution
time, add support for:

1. Per-job test fetching: Job definition includes tests_repo URL,
   adapter fetches tests when executing (recommended for shared tests)

2. Static sync: Configure TESTS_REPO_URL to sync tests on startup
   and periodically (simpler setup for fixed test sets)

This ensures all labs run the same version of tests without manual
synchronization.

New config options:
- TESTS_REPO_URL: Git URL for static test sync
- TESTS_REPO_BRANCH: Branch to use (default: main)
- TESTS_SYNC_INTERVAL: Seconds between syncs (default: 3600)

Job data options:
- tests_repo: Git URL for per-job test fetch
- tests_branch: Branch to use (default: main)
Simplify test synchronization:
- Pull tests from git before each job execution
- Clone if repo doesn't exist, update if it does
- Remove background sync loop (no more TESTS_SYNC_INTERVAL)

This is simpler and follows the LAVA pattern more closely, where tests
are fetched at job execution time.

Config options:
- TESTS_REPO_URL: Git URL for tests (pulled before each job)
- TESTS_REPO_BRANCH: Branch to use (default: main)

Jobs can override with tests_repo/tests_branch in job data.
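
One way to express "clone if the repo doesn't exist, update if it does" is shown below. This is a sketch: the function signatures are assumptions (only the name `ensure_tests` appears in this PR), and checking out `FETCH_HEAD` in detached mode is a choice made here so that a per-job branch override works even if that branch was never checked out locally.

```python
import subprocess
from pathlib import Path


def git_sync_commands(repo_url: str, dest: Path, branch: str = "main") -> list:
    """Return the git commands needed to bring dest up to date with branch."""
    if (dest / ".git").is_dir():
        # Existing checkout: fetch the branch tip and move the work tree to it.
        return [
            ["git", "-C", str(dest), "fetch", "origin", branch],
            ["git", "-C", str(dest), "checkout", "--force", "--detach", "FETCH_HEAD"],
        ]
    # First run: clone the requested branch.
    return [["git", "clone", "--branch", branch, repo_url, str(dest)]]


def ensure_tests(repo_url: str, dest: Path, branch: str = "main") -> Path:
    """Run the sync commands before a job; raises CalledProcessError on failure."""
    for command in git_sync_commands(repo_url, dest, branch):
        subprocess.run(command, check=True)
    return dest
```

Keeping command construction separate from execution makes the clone-versus-update decision easy to unit test without network access.
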
Configure proper tree/branch mapping for OpenWrt:
- Tree: openwrt
- Branches: main (SNAPSHOT), openwrt-24.10, openwrt-25.12

Node structure now includes:
- group: tree identifier for dashboard grouping
- data.kernel_revision: {tree, branch, commit, url}
- path: [tree, branch, target, subtarget, profile]

This enables the KernelCI dashboard to properly display
test results organized by branch.
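
As an illustration of the node layout, the helper below assembles the three fields listed above. The function name and argument list are hypothetical, and other fields a real node needs (such as `kind` or `state`) are omitted.

```python
def build_node(
    branch: str,
    commit: str,
    target: str,
    subtarget: str,
    profile: str,
    url: str,
    tree: str = "openwrt",
) -> dict:
    """Assemble the grouping fields described in this commit."""
    return {
        "group": tree,
        "path": [tree, branch, target, subtarget, profile],
        "data": {
            "kernel_revision": {
                "tree": tree,
                "branch": branch,
                "commit": commit,
                "url": url,
            }
        },
    }
```
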
Instead of hardcoding versions in pipeline.yaml, fetch the active
branches dynamically from downloads.openwrt.org/.versions.json.

This automatically discovers:
- main (SNAPSHOT builds)
- stable (current release from stable_version)
- oldstable (previous release series from versions_list)

Changes:
- Add versions.py module with get_active_branches()
- Update firmware_trigger to create sources dynamically
- Simplify pipeline.yaml to just specify targets

Config now only needs:
  targets: [ath79/generic, x86/64, ...]
  include_snapshot: true
  include_oldstable: true
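
A sketch of the branch discovery follows, operating on an already-parsed `.versions.json` dictionary. The keys `stable_version` and `versions_list` are the ones named above; the return shape (branch name mapped to version), the release-candidate handling, and the rule "newest version from an older series" for oldstable are assumptions.

```python
def version_to_branch(version: str) -> str:
    """Map '24.10.2' to 'openwrt-24.10'; SNAPSHOT builds live on main."""
    if version.upper() == "SNAPSHOT":
        return "main"
    major, minor = version.split(".")[:2]
    return f"openwrt-{major}.{minor}"


def _sort_key(version: str) -> tuple:
    """Order versions numerically, with release candidates before the final."""
    base, _, suffix = version.partition("-")
    numbers = tuple(int(part) for part in base.split("."))
    return (numbers, suffix == "", suffix)


def get_active_branches(
    versions: dict, include_snapshot: bool = True, include_oldstable: bool = True
) -> dict:
    """Return {branch: version} for main, stable and (optionally) oldstable."""
    branches = {}
    if include_snapshot:
        branches["main"] = "SNAPSHOT"
    stable = versions["stable_version"]
    branches[version_to_branch(stable)] = stable
    if include_oldstable:
        stable_series = _sort_key(stable)[0][:2]
        older = [
            v for v in versions.get("versions_list", [])
            if _sort_key(v)[0][:2] < stable_series
        ]
        if older:
            newest_old = max(older, key=_sort_key)
            branches[version_to_branch(newest_old)] = newest_old
    return branches
```
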
The labgrid-kci-adapter is now a generic, reusable component that
can be used by any project connecting labgrid to KernelCI.

Changes:
- Make MinIO bucket name configurable (MINIO_LOGS_BUCKET)
- Add comprehensive README for labgrid-adapter explaining:
  - Architecture and features
  - Configuration options
  - How to use with other projects
  - Test structure and job format
- Update main README to document modular architecture

The adapter is designed to be extracted into its own repository
for use by other projects beyond OpenWrt.
labgrid-adapter tests:
- test_test_sync.py: Tests for ensure_tests(), git operations
- test_executor.py: Tests for ResultCollectorPlugin, TestExecutor

openwrt-pipeline tests:
- test_versions.py: Tests for version_to_branch(), get_active_branches()
- test_api_client.py: Tests for KernelCIClient, node operations

All tests use pytest with async support and mocking for external
dependencies (HTTP clients, git operations).
Allow specifying a subdirectory within the tests repository that
contains the actual test files. This supports monorepo structures
where tests might be in a subfolder like "tests/openwrt".

- Add TESTS_REPO_SUBDIR config option (default: empty string)
- Update ensure_tests() to accept subdir parameter
- Return path to subdirectory when specified
- Validate subdirectory exists after clone/update
- Add comprehensive tests for subdirectory functionality
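
The subdirectory handling might be implemented along these lines. The helper name is hypothetical, and the check that rejects paths escaping the repository is an extra safeguard added in this sketch, not something the commit message describes.

```python
from pathlib import Path


def resolve_tests_dir(repo_dir: Path, subdir: str = "") -> Path:
    """Return the directory holding the tests, validating TESTS_REPO_SUBDIR."""
    root = repo_dir.resolve()
    if not subdir:
        return root  # default: tests live at the repository root
    path = (root / subdir).resolve()
    # Assumed safeguard: refuse values such as "../elsewhere" or absolute paths.
    if path != root and root not in path.parents:
        raise ValueError(f"subdirectory escapes the repository: {subdir!r}")
    if not path.is_dir():
        raise FileNotFoundError(f"subdirectory not found after sync: {subdir!r}")
    return path
```
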
…ices

When a lab has multiple physical devices of the same type (e.g., 3x
openwrt_one), it can now run tests for different firmware versions
in parallel across all available devices.

Changes:
- Add LabgridClient to query coordinator for available places
- Update poller to track jobs per device type (not just job IDs)
- Query labgrid coordinator for free slots before claiming jobs
- Claim multiple jobs for same device type if places available
- Fix bug where device name was checked against job ID set

This follows the model: one job per (firmware_version, device_type),
with parallel execution when multiple physical devices exist.
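
The slot-based claiming can be sketched as follows. The input shapes are assumptions: `free_places` stands for whatever the coordinator query returns, reduced to a count of free places per device type, and each pending job is a dictionary with `id` and `device_type`.

```python
def select_jobs(pending: list, free_places: dict) -> list:
    """Pick as many pending jobs per device type as there are free places."""
    remaining = dict(free_places)  # copy so the caller's counts stay untouched
    claimed = []
    for job in pending:
        device_type = job["device_type"]
        if remaining.get(device_type, 0) > 0:
            remaining[device_type] -= 1
            claimed.append(job["id"])
    return claimed
```

For example, with three free `openwrt_one` places and four pending jobs for that type, the first three jobs are claimed and the fourth waits for the next poll.
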
Adds infrastructure to distinguish between firmware tests (OpenWrt
functionality) and kernel selftests (Linux kernel validation). Each
test type can require different firmware images and device capabilities.

New components:
- asu_client.py: Client for sysupgrade.openwrt.org API to build custom
  images with additional packages (bash, python3, kselftest packages)
- test_types.py: Defines TestType enum, ImageProfile, and TestTypeConfig
  with required capabilities and packages for each test type

Key changes:
- Scheduler creates jobs per test type, building custom images via ASU
  when needed (kselftest requires packages not in standard images)
- Jobs include test_type field for lab filtering
- Devices declare capabilities (serial_console, isolated_network, etc.)
- Labs can filter jobs by supported_test_types config
- Pipeline config includes enabled_test_types and device capabilities

Test types:
- firmware: Standard OpenWrt tests, uses official images
- kselftest: Kernel tests, requires custom image with kselftest packages

The kselftest packages (kselftests-net, kselftests-timers, etc.) are
assumed to exist in OpenWrt feeds - they will be created separately.
Update package names to match the actual OpenWrt kselftest packages:
- kselftests-size: Binary size test
- kselftests-kcmp: Process comparison tests
- kselftests-rtc: Real-time clock tests
- kselftests-timers: Timer subsystem tests
- kselftests-futex: Futex tests
- kselftests-exec: Program execution tests
- kselftests-clone3: clone3 syscall tests
- kselftests-openat2: openat2 syscall tests
- kselftests-mincore: mincore syscall tests
- kselftests-mqueue: POSIX message queue tests
- kselftests-net: Networking stack tests
- kselftests-sigaltstack: Signal alternate stack tests
- kselftests-splice: splice syscall tests
- kselftests-sync: sync_file_range tests

Added corresponding test plans in pipeline.yaml for each subsystem.
Add KTAP (Kernel Test Anything Protocol) parser to extract individual
subtest results from kselftest output. This allows KernelCI to report
granular pass/fail status for each kselftest subtest instead of just
the overall test result.

Changes:
- Add ktap_parser.py with support for:
  - TAP version 13/14 and KTAP version 1 formats
  - Nested subtests via 2-space indentation
  - Directives: SKIP, TODO, XFAIL, TIMEOUT, ERROR
  - Hierarchical test naming (e.g., "kselftest.net.socket.af_inet")

- Update executor.py to:
  - Capture stdout per-test for KTAP parsing
  - Detect KTAP output and expand into individual TestResult objects
  - Fall back to standard pytest result handling when no KTAP detected

- Add pytest wrapper tests in tests/kselftest/ that run kselftest
  subsystems and print KTAP output for capture

- Add comprehensive unit tests for KTAP parser

This follows the LAVA/KernelCI pattern where test results are reported
as flat nodes with hierarchical names, allowing the dashboard to show
individual subtest results.
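
A much-reduced version of such a parser is sketched below to show how 2-space nesting turns into flat, dotted names. It is not the `ktap_parser.py` from this PR: it ignores plan lines and diagnostics, reports only leaf tests, and its directive mapping (SKIP to "skip", TIMEOUT and ERROR to "error") is an assumption.

```python
import re

# One result line: optional 2-space indentation, "ok"/"not ok", a test
# number, an optional name, and an optional "# DIRECTIVE reason" suffix.
RESULT_RE = re.compile(
    r"^(?P<indent>(?:  )*)(?P<status>not ok|ok)\s+(?P<num>\d+)\s*(?:-\s*)?"
    r"(?P<name>[^#]*?)\s*(?:#\s*(?P<directive>\w+).*)?$"
)


def parse_ktap(output: str, prefix: str = "") -> list:
    """Return [(hierarchical_name, status)] for every leaf test."""
    pending = {}  # nesting depth -> results waiting for their parent line
    for line in output.splitlines():
        match = RESULT_RE.match(line.rstrip())
        if not match:
            continue  # version headers, plans ("1..N") and "# ..." diagnostics
        depth = len(match["indent"]) // 2
        name = match["name"] or match["num"]
        directive = (match["directive"] or "").upper()
        if directive == "SKIP":
            status = "skip"
        elif directive in ("TIMEOUT", "ERROR"):
            status = "error"
        else:
            status = "pass" if match["status"] == "ok" else "fail"
        # In KTAP, subtests are printed before their parent's result line.
        children = pending.pop(depth + 1, [])
        bucket = pending.setdefault(depth, [])
        if children:
            bucket.extend((f"{name}.{child}", st) for child, st in children)
        else:
            bucket.append((name, status))
    top_level = pending.get(0, [])
    return [(f"{prefix}.{n}" if prefix else n, st) for n, st in top_level]
```

Used on a nested report, the parent line contributes only its name:

```python
sample = """KTAP version 1
1..2
  KTAP version 1
  1..2
  ok 1 af_inet
  not ok 2 af_unix # SKIP no support
ok 1 socket
not ok 2 bind
"""
for name, status in parse_ktap(sample, prefix="kselftest.net"):
    print(name, status)
```
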
Fix critical bug where KTAP status values were not correctly mapped:
- KTAP returns: "pass", "fail", "skip", "error"
- Pytest returns: "passed", "failed", "skipped"
- Created separate status maps for each to avoid all KTAP results
  being incorrectly marked as ERROR

Also:
- Add docstring to TestStatus enum in ktap_parser.py noting it mirrors
  models.TestStatus (kept separate to avoid pydantic dependency)
- Add comprehensive integration tests for KTAP-executor bridge:
  - Test _try_parse_ktap with valid/invalid KTAP
  - Test nested KTAP subtests parsing
  - Test _convert_results expands KTAP into multiple TestResults
  - Test mixed KTAP and regular pytest results
  - Test stdout capture from report sections
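
The bug and its fix can be reproduced in a few lines. The source strings are the ones quoted above; the normalized names (PASS, FAIL, SKIP, ERROR) and the `normalize()` helper are assumptions for this sketch.

```python
# Separate tables, because the two producers spell their statuses differently.
KTAP_STATUS = {"pass": "PASS", "fail": "FAIL", "skip": "SKIP", "error": "ERROR"}
PYTEST_STATUS = {"passed": "PASS", "failed": "FAIL", "skipped": "SKIP"}


def normalize(raw: str, source: str) -> str:
    """Map a raw status to the common vocabulary; unknown values become ERROR."""
    table = KTAP_STATUS if source == "ktap" else PYTEST_STATUS
    return table.get(raw, "ERROR")
```

Looking up a KTAP value such as "pass" in the pytest table misses and falls through to ERROR, which matches the symptom described in this commit.
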
Improve kselftest fixtures with proper error handling:
- Add KselftestError and KselftestTimeout exception classes
- Wrap shell_command.run() in try/except for timeout handling
- Add _validate_ktap_output() to warn if output isn't KTAP format
- Log warnings for empty output or missing KTAP markers
- Log info for non-zero exit codes (normal for failed subtests)

Add comprehensive README.md documenting:
- KTAP format overview
- Fixture usage examples
- Test plan mapping table
- Result flow diagram
- Troubleshooting guide
- Required packages list
- Device capabilities requirements
- Update executor to use https:// for MinIO log URLs when minio_secure=true
- Update KCIDB bridge to expand test_results array into individual test entries
- Each test now has its own KCIDB entry with path format: device.plan.test_name
- Increase job query limit to 500 with state=done filter for better coverage
- Log URLs are attached to each individual test entry

Dashboard now shows:
- Individual test names (test_shell, test_uname, etc.)
- Per-test status (PASS, SKIP, FAIL)
- Clickable log URLs for each test

Files modified:
- kernelci/labgrid-adapter/labgrid_kci_adapter/executor.py
- kernelci/openwrt-pipeline/openwrt_pipeline/kcidb_bridge.py
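
The expansion step could be sketched like this. The job dictionary shape and function name are assumptions, and the entries carry only the fields discussed above; identifiers and other fields required by the KCIDB schema are left out.

```python
def expand_results(job: dict) -> list:
    """Turn one finished job into one entry per test in its test_results."""
    device = job["device"]
    plan = job["plan"]
    log_url = job.get("log_url")
    entries = []
    for result in job.get("test_results", []):
        entry = {
            # Path format from the commit message: device.plan.test_name
            "path": f"{device}.{plan}.{result['name']}",
            "status": result["status"],
        }
        if log_url:
            entry["log_url"] = log_url  # the same job log is attached to each test
        entries.append(entry)
    return entries
```
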
Add support for capturing the kernel boot log (serial console output)
during device boot via labgrid's --lg-log option.

Changes:
- Add --lg-log parameter to pytest to capture serial console output
- Add _upload_boot_log() method to find and upload labgrid console logs
- Update _upload_log() to accept custom log names
- Add boot_log_url field to JobResult model
- Store boot_log_url in job data when submitting results

Boot logs are now available at:
  https://storage.openwrt-kci.aparcar.org/logs/logs/{job_id}/boot.log

This provides visibility into:
- Bootloader output (U-Boot, stage1)
- Kernel boot messages
- Device initialization
- Boot failures before tests start
…est output

Add --log-cli-level=CONSOLE and --lg-colored-steps to pytest args,
matching the Makefile approach. This streams the labgrid serial console
output (boot log) directly into the pytest output, making it visible
in the single log_url in the KCIDB dashboard.

The combined boot log + pytest output is now available in one file
without needing separate log URLs.
Fetch log content from log_url and extract a relevant excerpt
(up to 16KB as per KCIDB schema limit). The excerpt prioritizes:
- pytest summary sections (passed/failed)
- Error messages and failures
- Last portion of log as fallback

This populates the 'Log Excerpt' section in the KCIDB dashboard
instead of showing 'No Log Excerpt available'.
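
A simplified version of the excerpt logic is sketched below. It treats the 16 KB limit as a byte count on UTF-8 text, looks for pytest's section banners (FAILURES, ERRORS, short test summary info), and otherwise falls back to the tail of the log. The function name and the exact prioritization are assumptions; the real bridge may weigh sections differently.

```python
import re

MAX_EXCERPT_BYTES = 16 * 1024  # limit mentioned in the commit message

# pytest prints section banners such as "===== FAILURES =====".
SECTION_RE = re.compile(
    r"^=+ (?:FAILURES|ERRORS|short test summary info) =+$", re.MULTILINE
)


def extract_excerpt(log: str, limit: int = MAX_EXCERPT_BYTES) -> str:
    """Return at most `limit` bytes of the most relevant part of a log."""
    data = log.encode("utf-8")
    if len(data) <= limit:
        return log  # small logs are kept whole
    match = SECTION_RE.search(log)
    if match:
        # Start at the first failure/summary banner and keep what fits.
        chunk = log[match.start():].encode("utf-8")[:limit]
    else:
        # Fallback: the end of the log usually shows how the run finished.
        chunk = data[-limit:]
    # Drop any multi-byte character that was cut in half by the slicing.
    return chunk.decode("utf-8", errors="ignore")
```
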