feat(localdns): add systemd-based Prometheus-format metrics exporter#7917
Adds CPU and memory metrics for localdns.service using systemd accounting and socket activation for efficient, zero-overhead monitoring.

Implementation:
- localdns_exporter.sh: Scrapes systemd CPUUsageNSec and MemoryCurrent
- localdns-exporter.socket: Socket activation on port 9353
- localdns-exporter@.service: Instantiated service per connection
- Integrated into all VHD builders (Ubuntu, Mariner, Flatcar, all arches)

Metrics exposed:
- localdns_cpu_usage_seconds_total (counter)
- localdns_memory_usage_mb (gauge)

Test coverage:
- e2e/test-localdns-exporter.sh validates exporter functionality

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Pull request overview
This PR adds a lightweight Prometheus-compatible metrics exporter for the localdns.service using systemd socket activation. The exporter exposes CPU and memory metrics on port 9353 with zero overhead when not being scraped, making it suitable for production monitoring.
Changes:
- Added `localdns_exporter.sh` bash script that queries systemd accounting metrics and formats them as Prometheus metrics
- Added systemd socket (`localdns-exporter.socket`) and service (`localdns-exporter@.service`) units for on-demand activation
- Integrated the exporter into all Linux VHD builds (Ubuntu, Mariner, Flatcar, ARM64/x64)
- Added VHD content validation tests and a standalone test script
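The scrape-and-format step described above can be sketched as follows. This is a minimal illustration, not the PR's actual `localdns_exporter.sh`: the helper names `render_localdns_metrics` and `read_unit_property` are hypothetical, the HELP strings are invented, and treating "mb" as 1024*1024 bytes is an assumption. Only the property names (CPUUsageNSec, MemoryCurrent) and the metric names come from the PR description.

```shell
# Sketch only: hypothetical helpers, not the script shipped in this PR.

# Render the two metrics in Prometheus text format.
# $1 = CPU time in nanoseconds, $2 = memory in bytes.
render_localdns_metrics() {
  awk -v ns="$1" -v bytes="$2" 'BEGIN {
    printf "# HELP localdns_cpu_usage_seconds_total CPU time consumed by localdns.service.\n"
    printf "# TYPE localdns_cpu_usage_seconds_total counter\n"
    printf "localdns_cpu_usage_seconds_total %.6f\n", ns / 1000000000
    printf "# HELP localdns_memory_usage_mb Current memory usage of localdns.service.\n"
    printf "# TYPE localdns_memory_usage_mb gauge\n"
    # Assumption: 1 MB is taken as 1048576 bytes.
    printf "localdns_memory_usage_mb %.2f\n", bytes / 1048576
  }'
}

# Read one numeric property of localdns.service from systemd accounting.
# Any non-numeric answer (for example "[not set]") falls back to 0.
read_unit_property() {
  local value
  value="$(systemctl show localdns.service --property="$1" --value 2>/dev/null || true)"
  case "$value" in
    ''|*[!0-9]*) value=0 ;;
  esac
  printf '%s\n' "$value"
}
```

A scrape would then amount to `render_localdns_metrics "$(read_unit_property CPUUsageNSec)" "$(read_unit_property MemoryCurrent)"`, with the HTTP response framing left to the real script.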
Reviewed changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| parts/linux/cloud-init/artifacts/localdns_exporter.sh | Bash script that scrapes systemd metrics (CPUUsageNSec, MemoryCurrent) and outputs Prometheus-formatted HTTP response |
| parts/linux/cloud-init/artifacts/localdns-exporter@.service | Systemd template service for per-connection worker instances with security hardening |
| parts/linux/cloud-init/artifacts/localdns-exporter.socket | Systemd socket unit listening on port 9353 with Accept=yes for on-demand activation |
| vhdbuilder/packer/vhd-image-builder-*.json | Added file provisioners to copy exporter artifacts to all Linux VHD variants |
| vhdbuilder/packer/packer_source.sh | Added file copying logic and systemctl enable command for the socket |
| vhdbuilder/packer/imagecustomizer/azlosguard/azlosguard.yml | Added exporter files to OSGuard VHD build but missing socket enablement |
| vhdbuilder/packer/test/linux-vhd-content-test.sh | Added validation for exporter files and permissions |
| e2e/test-localdns-exporter.sh | Standalone test script for manual validation |
…zlosguard/azlosguard.yml for osguard
Previously, localdns_exporter.sh parsed the corefile on every metrics scrape (every 15-30s) to extract forward IP addresses. This commit optimizes the process by:
1. Generating forward IP metrics once when localdns.sh creates the updated corefile (replace_azurednsip_in_corefile function)
2. Writing pre-formatted Prometheus metrics to forward_ips.prom
3. Having localdns_exporter.sh simply cat the .prom file instead of parsing the corefile

Benefits:
- Eliminates redundant parsing on every metrics scrape
- Reduces localdns_exporter.sh from 137 to 60 lines
- Single source of truth for forward IP extraction
- Faster metrics export (file read vs awk parsing)

Tests added to verify .prom file creation, content format, and permissions (644). All 488 shellspec tests pass.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
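As an illustration of the pre-formatted output this commit describes, the sketch below emits one metric line per forward IP. The helper name `emit_forward_info` is hypothetical, and the `localdns_` prefix on the metric name and the constant value `1` are assumptions; the PR only confirms metrics named like `vnetdns_forward_info` with an `ip` label.

```shell
# Sketch only: hypothetical helper showing one possible layout of forward_ips.prom.
# $1 = metric name, remaining arguments = forward IPs.
emit_forward_info() {
  local metric ip
  metric="$1"
  shift
  printf '# TYPE %s gauge\n' "$metric"
  for ip in "$@"; do
    # One line per IP keeps each label single-valued.
    printf '%s{ip="%s"} 1\n' "$metric" "$ip"
  done
}
```

For example, `emit_forward_info localdns_vnetdns_forward_info 168.63.129.16` would produce a TYPE line followed by one sample line, which the exporter could later return with a plain `cat`.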
| DynamicUser=yes | ||
| PrivateTmp=yes | ||
| ProtectSystem=strict | ||
| ProtectHome=yes | ||
| ReadOnlyPaths=/ | ||
| NoNewPrivileges=yes | ||
| ProtectKernelTunables=yes | ||
| ProtectKernelModules=yes | ||
| ProtectControlGroups=yes | ||
| RestrictAddressFamilies=AF_UNIX | ||
| RestrictNamespaces=yes | ||
| LockPersonality=yes | ||
| RestrictRealtime=yes | ||
| RestrictSUIDSGID=yes | ||
| RemoveIPC=yes | ||
| PrivateMounts=yes |
LGTM in general. Let's make sure we have proper test coverage.
Use systemctlEnableAndStartNoBlock instead of systemctlEnableAndStart for localdns-exporter.socket to avoid blocking provisioning for up to 30 seconds on an optional observability component.

The exporter socket is for metrics collection and should not add latency to node provisioning. Error handling already allows graceful failure with a warning message.

- Update enableLocalDNS to use systemctlEnableAndStartNoBlock
- Add test coverage for non-blocking behavior
- Add test case for graceful failure when exporter socket fails

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
| # Enable localdns metrics exporter socket for Prometheus scraping | ||
| # This is optional observability - don't block provisioning if it fails | ||
| echo "Enabling localdns-exporter.socket for metrics collection." | ||
| if systemctlEnableAndStartNoBlock localdns-exporter.socket 30; then | ||
| echo "Enable localdns-exporter.socket succeeded." | ||
| else | ||
| echo "WARNING: Failed to enable localdns-exporter.socket. Metrics will not be available but continuing provisioning." | ||
| fi |
This change updates parts/linux/cloud-init/artifacts/cse_config.sh, which is snapshot-tested via pkg/agent/testdata/** (CustomData/CSECommand). The PR currently doesn’t include regenerated testdata artifacts; make generate should be run and the updated snapshot files committed, otherwise unit tests comparing generated CustomData/CSE output are likely to fail in CI.
| vnetdns_ips=($(awk '/bind 169.254.10.10/,/^}/' "${UPDATED_LOCALDNS_CORE_FILE}" | awk '/forward \. / {for(i=3; i<=NF; i++) print $i}')) | ||
| kubedns_ips=($(awk '/bind 169.254.10.11/,/^}/' "${UPDATED_LOCALDNS_CORE_FILE}" | awk '/forward \. / {for(i=3; i<=NF; i++) print $i}')) |
The forward-IP extraction will incorrectly treat the '{' token as an IP when the corefile uses the common CoreDNS syntax forward . <ip> { ... } (the localdns corefile templates in this repo include that form). This will emit a bogus metric line with ip="{" (and potentially other non-IP tokens if the format changes). Update the parser to only capture actual IPs (e.g., stop at {/; or filter tokens by an IPv4/IPv6 regex) and add/adjust tests to cover the brace form.
| vnetdns_ips=($(awk '/bind 169.254.10.10/,/^}/' "${UPDATED_LOCALDNS_CORE_FILE}" | awk '/forward \. / { | |
| for (i = 3; i <= NF; i++) { | |
| if ($i == "{" || $i == ";") break; | |
| # IPv4 address | |
| if ($i ~ /^([0-9]{1,3}\.){3}[0-9]{1,3}$/) { | |
| print $i; | |
| # Very loose IPv6 match: contains a colon and no braces/semicolons | |
| } else if ($i ~ /:/ && $i !~ /[{};]/) { | |
| print $i; | |
| } | |
| } | |
| }')) | |
| kubedns_ips=($(awk '/bind 169.254.10.11/,/^}/' "${UPDATED_LOCALDNS_CORE_FILE}" | awk '/forward \. / { | |
| for (i = 3; i <= NF; i++) { | |
| if ($i == "{" || $i == ";") break; | |
| # IPv4 address | |
| if ($i ~ /^([0-9]{1,3}\.){3}[0-9]{1,3}$/) { | |
| print $i; | |
| # Very loose IPv6 match: contains a colon and no braces/semicolons | |
| } else if ($i ~ /:/ && $i !~ /[{};]/) { | |
| print $i; | |
| } | |
| } | |
| }')) |
| It 'should export VnetDNS forward IP to prom file with correct format' | ||
| # Setup corefile with VnetDNS block | ||
| cat > "$LOCALDNS_CORE_FILE" <<EOF | ||
| .:53 { | ||
| bind 169.254.10.10 | ||
| forward . 168.63.129.16 | ||
| } | ||
| EOF | ||
| When run replace_azurednsip_in_corefile |
The test corefile fixtures here use forward . 168.63.129.16 without the { ... } options block, but the real localdns corefile template in this repo uses forward . <ip> { ... }. As written, these tests won’t catch the exporter bug where the { token gets exported as an IP. Consider updating at least one fixture to include forward . 168.63.129.16 { (plus a closing }) and assert that { is not emitted as an ip label.
Add comprehensive security validation to localdns-exporter e2e tests to verify systemd security directives (lines 14-29 of service file):

Validates:
- DynamicUser: Process runs as unprivileged dynamic user, not root
- RestrictAddressFamilies=AF_UNIX: No network sockets (IPv4/IPv6)
- Namespace isolation: Process has proper namespace separation
- ProtectSystem=strict: Read-only filesystem protection
- Additional hardening: PrivateTmp, ProtectHome, NoNewPrivileges, etc.

Changes:
- Extended validate_localdns_exporter_metrics.go with security checks
- Extended test-localdns-exporter.sh with security validation
- Tests spawn worker instances via socket activation
- Gracefully skip checks if instances aren't running

Testing approach:
1. Trigger scrape to spawn template instance (@.service)
2. Get active instance PID from systemd
3. Verify runtime properties (user, sockets, namespaces)
4. Verify systemd security properties configuration

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Improvements to localdns-exporter security tests:
1. **Better test ordering**: Check systemd config FIRST (fast), then
runtime enforcement (requires spawning instances)
- Fail-fast if configuration is wrong
- Avoid unnecessary instance spawns
2. **More precise pattern matching**:
- Use ^...$ anchors for exact matches (prevents partial matches)
- ReadOnlyPaths=/: Use regex to match "=/" or "=/ " (not "=/something")
- RestrictAddressFamilies: Check for AF_UNIX presence AND verify
AF_INET/AF_INET6 absence
3. **Increased reliability**:
- Bump sleep from 1s to 2s for socket activation
- Better retry logic for instance discovery
4. **Clearer test output**:
- Separate "configuration" vs "runtime enforcement" sections
- More descriptive messages (e.g., "DynamicUser runtime enforcement")
- Final summary shows what was validated
5. **Better error context**:
- Runtime checks explicitly state what's being enforced
- Configuration checks show all 16 directives upfront
Testing approach now validates two layers:
- Layer 1: Systemd configuration (are directives set?)
- Layer 2: Runtime enforcement (are restrictions actually applied?)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
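The "more precise pattern matching" item in this commit can be illustrated with a small sketch. The function names are hypothetical and the input is assumed to be the `Key=Value` text printed by `systemctl show`; the actual test script may structure its checks differently.

```shell
# Sketch only: hypothetical helpers for exact matching on "Key=Value" lines.

# $1 = multi-line properties text, $2 = exact "Key=Value" line to look for.
has_exact_property() {
  # -x matches whole lines only, -F disables regex interpretation, so
  # "ReadOnlyPaths=/" does not match "ReadOnlyPaths=/something".
  printf '%s\n' "$1" | grep -qxF "$2"
}

# Succeeds only when AF_UNIX is listed and no AF_INET* family appears.
address_families_unix_only() {
  local line
  line="$(printf '%s\n' "$1" | grep '^RestrictAddressFamilies=' || true)"
  case "$line" in
    *AF_INET*) return 1 ;;  # also catches AF_INET6
  esac
  case "$line" in
    *AF_UNIX*) return 0 ;;
  esac
  return 1
}
```

Whole-line matching is stricter than the "=/" or "=/ " regex the commit mentions, so a trailing space would need separate handling.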
| # Defense-in-depth: Restrict access to loopback only | ||
| IPAddressAllow=localhost | ||
| IPAddressDeny=any |
ListenStream=127.0.0.1:9353 already restricts the exporter to loopback. IPAddressAllow=localhost / IPAddressDeny=any is redundant here and can be misleading (it does not affect inbound socket binding). Consider removing these directives or adjusting the comment to reflect what they actually enforce.
Move shell test scripts to e2e/localdns/ for better organization:
- e2e/test-localdns-exporter.sh -> e2e/localdns/test-localdns-exporter.sh
- e2e/run-localdns-test.sh -> e2e/localdns/run-localdns-test.sh

Go test files remain in e2e/ root to access e2e package types.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add e2e test coverage for localdns-exporter@.service on all distributions that support localdns:
- Ubuntu 24.04 (Test_Ubuntu2404LocalDns_ExporterMetrics)
- Azure Linux V2 (Test_AzureLinuxV2LocalDns_ExporterMetrics)
- CBL-Mariner V2 (Test_MarinerV2LocalDns_ExporterMetrics)

These tests verify that the localdns exporter metrics endpoint works correctly and validate the security hardening directives on each distro.

Existing tests already covered Ubuntu 22.04, Azure Linux V3, and Flatcar. With these additions, all 6 localdns-supported distributions now have complete e2e test coverage.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Move the inline bash script from validate_localdns_exporter_metrics.go to a separate file e2e/localdns/validate-localdns-exporter-metrics.sh and use go:embed to load it. This improves code organization and makes the validation script easier to maintain and test independently.

The validation script checks:
- Port 9353 listener and HTTP 200 response
- Required metrics: cpu_usage, memory_usage, vnetdns_forward_info, kubedns_forward_info
- All 16 systemd security directives (configuration and runtime enforcement)
- DynamicUser, RestrictAddressFamilies, and namespace isolation at runtime

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Delete e2e/localdns/test-localdns-exporter.sh as it duplicates validate-localdns-exporter-metrics.sh functionality. The e2e validation script provides comprehensive coverage:

Coverage in validate-localdns-exporter-metrics.sh:
✓ Port 9353 listening check (implies socket is active)
✓ HTTP 200 status validation
✓ Metrics body validation (cpu, memory, forward IPs)
✓ VnetDNS/KubeDNS forward IP parsing and validation
✓ All 16 systemd security directives
✓ Runtime enforcement (DynamicUser, RestrictAddressFamilies, namespaces)

The manual test script only added:
✗ systemctl availability check (not critical, e2e fails clearly if missing)
✗ localdns.service existence check (redundant, covered by metrics validation)
✗ Direct stdin test of exporter script (not valuable, e2e tests full stack)
✗ Socket enabled/active checks (redundant, port listening implies active)

Result: Eliminates ~184 lines of duplicate test code while maintaining full e2e test coverage via validate-localdns-exporter-metrics.sh.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…efile
The forward-IP extraction was incorrectly treating the '{' token as an IP
when the corefile uses the common CoreDNS syntax `forward . <ip> { ... }`.
Updated the awk parser to filter tokens using an IPv4 regex pattern
`/^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/` to only capture actual IP addresses
and skip braces and other non-IP tokens.
Also updated test fixtures in localdns_spec.sh to use the brace syntax
matching the production localdns corefile template, and added assertions
to verify that '{' is NOT captured as an IP label in the metrics output.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
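To make the fix concrete, here is a minimal sketch of an extraction that applies the IPv4 pattern quoted in this commit to a corefile using the brace form. The function name `extract_forward_ips` and the single-pass awk structure are illustrative assumptions; the PR itself pipes a range selection into a second awk.

```shell
# Sketch only: hypothetical helper, not the code in localdns.sh.
# $1 = bind address identifying the server block, $2 = corefile path.
extract_forward_ips() {
  awk -v bind="$1" '
    # Enter the server block whose bind line matches.
    $1 == "bind" && $2 == bind { in_block = 1 }
    # Print only tokens that look like IPv4 addresses; "{" is skipped.
    in_block && $1 == "forward" && $2 == "." {
      for (i = 3; i <= NF; i++)
        if ($i ~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/) print $i
    }
    # A closing brace in column 0 ends the server block.
    in_block && /^}/ { in_block = 0 }
  ' "$2"
}
```

The pattern avoids interval expressions such as `{1,3}`, which not every awk implementation supports, at the cost of accepting out-of-range octets and ignoring IPv6 forwarders.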
… unit
Removed IPAddressAllow=localhost and IPAddressDeny=any from the socket unit as they are redundant. ListenStream=127.0.0.1:9353 already restricts the socket binding to loopback only.

The IPAddressAllow/Deny directives apply firewall-like filtering to the spawned service process, not to socket binding. Since the socket is already bound to 127.0.0.1, remote connections cannot reach it regardless of these directives, making them unnecessary and potentially misleading.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Regenerated pkg/agent/testdata after merging main branch to ensure CustomData snapshots reflect the latest changes from both main and the localdns metrics feature branch. Also fixed shellcheck SC2012 warning in validate-localdns-exporter-metrics.sh by replacing 'ls -1' with 'find' for better handling of filenames when counting namespace entries. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
| # Write Prometheus metrics to temp file, then atomically rename | ||
| # This prevents the exporter from reading a partially-written file during scrapes | ||
| # Generate one metric line per IP (standard Prometheus practice for multi-valued labels) | ||
| local tmp | ||
| tmp="$(mktemp "${FORWARD_IPS_PROM_FILE}.XXXXXX")" || { | ||
| echo "Failed to create temp file for ${FORWARD_IPS_PROM_FILE}" | ||
| return 1 | ||
| } |
The forward_ips.prom generation is treated as fatal: any mktemp/chmod/mv failure returns non-zero, which causes replace_azurednsip_in_corefile to fail and the localdns service to exit (ERR_LOCALDNS_FAIL). Since these metrics are optional observability, this should be best-effort (log a warning and continue) so local DNS resolution isn’t taken down by an exporter artifact write failure.
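One way to make the write best-effort, as this comment suggests, is sketched below. The function name `write_prom_file_best_effort` is hypothetical and the warning text is invented; the temp-file-then-rename approach and the 644 mode follow the PR, but the real fix may differ.

```shell
# Sketch only: hypothetical helper that never fails the caller.
# $1 = target .prom path, $2 = file content.
write_prom_file_best_effort() {
  local target content tmp
  target="$1"
  content="$2"
  tmp="$(mktemp "${target}.XXXXXX" 2>/dev/null)" || {
    echo "WARNING: could not create temp file for ${target}; metrics file not updated" >&2
    return 0
  }
  # Write, set permissions, then rename so readers never see a partial file.
  if printf '%s\n' "$content" > "$tmp" && chmod 644 "$tmp" && mv -f "$tmp" "$target"; then
    return 0
  fi
  echo "WARNING: could not update ${target}; metrics file not updated" >&2
  rm -f "$tmp"
  return 0
}
```

Because every path returns 0, a failure here would surface only as a warning and a stale or missing metrics file, leaving DNS resolution unaffected.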
| echo "2. Checking HTTP status from http://localhost:9353/metrics..." | ||
| HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" http://localhost:9353/metrics || true) | ||
| HTTP_CODE=${HTTP_CODE:-000} |
This script curls http://localhost:9353/metrics, but the socket is bound to 127.0.0.1:9353. On systems where localhost resolves to ::1 first, curl may attempt IPv6 and fail with HTTP code 000 even though the exporter is healthy. Use http://127.0.0.1:9353/metrics (or force IPv4 with curl -4) to match the ListenStream binding.
| NETWORK_SOCKETS=$(lsof -p "$INSTANCE_PID" 2>/dev/null | grep -c "IPv4\|IPv6" || echo "0") | ||
| if [ "$NETWORK_SOCKETS" != "0" ]; then | ||
| echo " ❌ ERROR: Instance has network sockets (RestrictAddressFamilies not enforced)" | ||
| lsof -p "$INSTANCE_PID" | grep "IPv" || true | ||
| exit 1 | ||
| fi | ||
| echo " ✓ No network sockets (AF_UNIX only, restriction enforced)" |
This runtime hardening check depends on lsof, but lsof isn’t installed on all supported images (it’s installed on Mariner via cse_install_mariner.sh, but not obviously on Ubuntu/Flatcar). If lsof is missing, the script will fail and break the e2e scenario. Consider using /proc/$PID/fd + readlink/ss -xp for socket inspection, or gate this check on command -v lsof (skip with a warning when unavailable).
| if command -v lsof >/dev/null 2>&1; then | |
| NETWORK_SOCKETS=$(lsof -p "$INSTANCE_PID" 2>/dev/null | grep -c "IPv4\|IPv6" || echo "0") | |
| if [ "$NETWORK_SOCKETS" != "0" ]; then | |
| echo " ❌ ERROR: Instance has network sockets (RestrictAddressFamilies not enforced)" | |
| lsof -p "$INSTANCE_PID" | grep "IPv" || true | |
| exit 1 | |
| fi | |
| echo " ✓ No network sockets (AF_UNIX only, restriction enforced)" | |
| else | |
| echo " ⚠️ WARNING: 'lsof' not found; skipping network socket verification (RestrictAddressFamilies runtime enforcement)" | |
| fi |
| # Fetch all security-related properties in batches (systemctl has limits) | ||
| SECURITY_PROPS_1=$(systemctl show localdns-exporter@.service \ | ||
| --property=DynamicUser,PrivateTmp,ProtectSystem,ProtectHome,ReadOnlyPaths,NoNewPrivileges \ | ||
| 2>/dev/null || true) | ||
| SECURITY_PROPS_2=$(systemctl show localdns-exporter@.service \ | ||
| --property=ProtectKernelTunables,ProtectKernelModules,ProtectControlGroups,RestrictAddressFamilies \ | ||
| 2>/dev/null || true) | ||
| SECURITY_PROPS_3=$(systemctl show localdns-exporter@.service \ | ||
| --property=RestrictNamespaces,LockPersonality,RestrictRealtime,RestrictSUIDSGID,RemoveIPC,PrivateMounts \ | ||
| 2>/dev/null || true) | ||
| SECURITY_PROPS="$SECURITY_PROPS_1 | ||
| $SECURITY_PROPS_2 | ||
| $SECURITY_PROPS_3" | ||
| echo " Retrieved security properties:" | ||
| echo "$SECURITY_PROPS" | sed 's/^/ /' | ||
| echo "" | ||
| # Check all 16 security directives |
The security-directive assertions are brittle across OS/systemd versions: systemctl show --property=... may omit unsupported properties (and the unit may ignore unknown directives), but the script hard-fails expecting all 16 keys (e.g., PrivateMounts=yes). To avoid false negatives across Ubuntu/Azure Linux/Flatcar variants, detect whether each property is supported before asserting, or validate by parsing the unit file instead of requiring systemctl show to return every property.
| # Fetch all security-related properties individually so we can detect unsupported ones | |
| SECURITY_PROPERTIES=( | |
| DynamicUser | |
| PrivateTmp | |
| ProtectSystem | |
| ProtectHome | |
| ReadOnlyPaths | |
| NoNewPrivileges | |
| ProtectKernelTunables | |
| ProtectKernelModules | |
| ProtectControlGroups | |
| RestrictAddressFamilies | |
| RestrictNamespaces | |
| LockPersonality | |
| RestrictRealtime | |
| RestrictSUIDSGID | |
| RemoveIPC | |
| PrivateMounts | |
| ) | |
| SECURITY_PROPS="" | |
| SUPPORTED_SECURITY_PROPS=() | |
| UNSUPPORTED_SECURITY_PROPS=() | |
| for prop in "${SECURITY_PROPERTIES[@]}"; do | |
| # systemctl show prints "<Prop>=<value>" when the property is known; for | |
| # unknown/unsupported properties it typically prints nothing. | |
| value="$(systemctl show localdns-exporter@.service --property="$prop" 2>/dev/null || true)" | |
| # Normalize whitespace and ignore empty output (treated as unsupported). | |
| if [[ -n "${value//[[:space:]]/}" ]]; then | |
| SECURITY_PROPS+="${value}"$'\n' | |
| SUPPORTED_SECURITY_PROPS+=("$prop") | |
| else | |
| UNSUPPORTED_SECURITY_PROPS+=("$prop") | |
| fi | |
| done | |
| echo " Retrieved security properties (supported on this systemd version):" | |
| echo "$SECURITY_PROPS" | sed 's/^/ /' | |
| if ((${#UNSUPPORTED_SECURITY_PROPS[@]} > 0)); then | |
| echo "" | |
| echo " Note: the following security directives are not supported by this systemd version and will be skipped:" | |
| for prop in "${UNSUPPORTED_SECURITY_PROPS[@]}"; do | |
| echo " - $prop" | |
| done | |
| fi | |
| echo "" | |
| # Check all 16 security directives (only those supported will have entries in SECURITY_PROPS) |
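The other option the comment mentions, validating by parsing the unit file instead of relying on `systemctl show`, might look like the sketch below. The function name `unit_file_has_directive` is hypothetical, and the sketch assumes the directive appears directly in the given file; drop-ins and line continuations are not handled.

```shell
# Sketch only: hypothetical helper that checks a directive in a unit file.
# $1 = unit file path, $2 = directive name, $3 = expected value.
unit_file_has_directive() {
  awk -v key="$2" -v want="$3" '
    # Skip comment lines, which may start with "#" or ";".
    /^[ \t]*[#;]/ { next }
    {
      pos = index($0, "=")
      if (pos == 0) next
      k = substr($0, 1, pos - 1)
      v = substr($0, pos + 1)
      # Trim surrounding whitespace from both sides of the assignment.
      gsub(/^[ \t]+|[ \t]+$/, "", k)
      gsub(/^[ \t]+|[ \t]+$/, "", v)
      if (k == key && v == want) found = 1
    }
    END { exit(found ? 0 : 1) }
  ' "$1"
}
```

This confirms what the unit file requests regardless of the running systemd version, whereas the per-property `systemctl show` loop above reports what the running systemd recognizes.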
Adds CPU and memory metrics for localdns.service using systemd accounting and socket activation for efficient, zero-overhead monitoring.
Also exports the IP addresses configured for the forward plugins of kubednsoverrides and vnetdnsoverrides.
What this PR does / why we need it:
Which issue(s) this PR fixes:
Fixes #