Skip to content

test: setup packet capture steps for debugging secure TLS bootstrapping flakes#7939

Draft
cameronmeissner wants to merge 5 commits intomainfrom
cameissner/stls-pcap-debug
Draft

test: setup packet capture steps for debugging secure TLS bootstrapping flakes#7939
cameronmeissner wants to merge 5 commits intomainfrom
cameissner/stls-pcap-debug

Conversation

@cameronmeissner
Copy link
Contributor

@cameronmeissner cameronmeissner commented Feb 24, 2026

What this PR does / why we need it:

for debugging only - NOT FOR MERGE

Which issue(s) this PR fixes:

Fixes #

Copilot AI review requested due to automatic review settings February 24, 2026 22:26
@cameronmeissner cameronmeissner changed the title test: setup packet capture steps for debugging secure TLS bootstrappi… test: setup packet capture steps for debugging secure TLS bootstrapping flakes Feb 24, 2026
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a systemd-based tcpdump capture step intended to help debug secure TLS bootstrapping flakes, wiring it into the Linux cloud-init/CSE flow and updating generated CustomData test snapshots.

Changes:

  • Add aks-pcap.service to Linux node custom data and start it during secure TLS bootstrapping configuration.
  • Install tcpdump on Mariner/AzureLinux images.
  • Modify aks-log-collector.sh behavior, including max-zip-size handling and (notably) reducing the default set of collected artifacts.

Reviewed changes

Copilot reviewed 22 out of 70 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
pkg/agent/testdata/CustomizedImageLinuxGuard/CustomData Regenerated CustomData snapshot reflecting added unit and other cloud-init changes.
pkg/agent/testdata/CustomizedImage/CustomData Regenerated CustomData snapshot reflecting added unit and other cloud-init changes.
parts/linux/cloud-init/nodecustomdata.yml Writes aks-pcap.service into the node via cloud-init.
parts/linux/cloud-init/artifacts/mariner/cse_install_mariner.sh Adds tcpdump to Mariner/AzureLinux dependency install list.
parts/linux/cloud-init/artifacts/cse_helpers.sh Introduces a new error code for aks-pcap startup failure.
parts/linux/cloud-init/artifacts/cse_config.sh Starts/enables aks-pcap when configuring secure TLS bootstrapping.
parts/linux/cloud-init/artifacts/aks-log-collector.sh Updates zip-size enforcement logic and changes what gets collected by default.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 23 out of 78 changed files in this pull request and generated 6 comments.

Comment on lines +163 to +168
# adding this last file made the zip go over that size, remove that file and try the next one.
# Using continue instead of break ensures smaller subsequent files can still be included.
FILE_SIZE=$(stat --printf "%s" "${ZIP}")
if [ "$FILE_SIZE" -ge "$MAX_SIZE" ]; then
echo "WARNING: ZIP file size $FILE_SIZE >= $MAX_SIZE after adding $file; removing it and trying next file."
zip -d "${ZIP}" "$file"
Copy link

Copilot AI Feb 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ENFORCE_MAX_ZIP_SIZE logic change introduces a subtle behavioral difference that could cause issues:

Old behavior (lines removed): When zip exceeded MAX_SIZE, the script would:

  1. Remove the last file that pushed it over
  2. STOP processing (break)

New behavior (lines 150-169): When zip exceeds MAX_SIZE:

  1. Skip adding more files until a file is added
  2. If that file pushes over limit, remove it and CONTINUE trying next files
  3. Only stops when hitting the pre-check at line 153

Problem: The continue logic (line 165-169) is inside the "if test -e" block, so smaller files after a large one can still be added. However, the pre-check (line 150-157) will prevent ANY files from being added once the limit is hit, creating inconsistent behavior depending on file ordering.

Recommendation: Simplify to always break when limit is exceeded (restore original behavior), OR move the pre-check outside the file loop for consistent "stop when full" behavior.

Suggested change
# adding this last file made the zip go over that size, remove that file and try the next one.
# Using continue instead of break ensures smaller subsequent files can still be included.
FILE_SIZE=$(stat --printf "%s" "${ZIP}")
if [ "$FILE_SIZE" -ge "$MAX_SIZE" ]; then
echo "WARNING: ZIP file size $FILE_SIZE >= $MAX_SIZE after adding $file; removing it and trying next file."
zip -d "${ZIP}" "$file"
# adding this last file made the zip go over that size, remove that file and stop adding more.
FILE_SIZE=$(stat --printf "%s" "${ZIP}")
if [ "$FILE_SIZE" -ge "$MAX_SIZE" ]; then
echo "WARNING: ZIP file size $FILE_SIZE >= $MAX_SIZE after adding $file; removing it and stopping further additions."
zip -d "${ZIP}" "$file"
break

Copilot uses AI. Check for mistakes.
Copilot AI review requested due to automatic review settings February 27, 2026 21:40
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 23 out of 78 changed files in this pull request and generated 3 comments.

[Service]
Type=oneshot
RemainAfterExit=yes
SuccessExitStatus=124
Copy link

Copilot AI Feb 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SuccessExitStatus=124 suggests a normal timeout exit should be treated as success, but the current aks-pcap.sh uses || which prevents the unit from ever exiting with 124 (it will run the collector and return that exit code instead). Align the unit semantics with the script (either remove SuccessExitStatus=124 or change the script to preserve the 124 exit code on timeout).

Suggested change
SuccessExitStatus=124

Copilot uses AI. Check for mistakes.
Comment on lines +281 to +287
AKS_PCAP_SERVICE_SRC=/home/packer/aks-pcap.service
AKS_PCAP_SERVICE_DEST=/etc/systemd/system/aks-pcap.service
cpAndMode $AKS_PCAP_SERVICE_SRC $AKS_PCAP_SERVICE_DEST 0644

AKS_PCAP_SCRIPT_SRC=/home/packer/aks-pcap.sh
AKS_PCAP_SCRIPT_DEST=/opt/azure/containers/aks-pcap.sh
cpAndMode $AKS_PCAP_SCRIPT_SRC $AKS_PCAP_SCRIPT_DEST 0744
Copy link

Copilot AI Feb 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

copyPackerFiles now unconditionally copies /home/packer/aks-pcap.{service,sh}. Some packer templates (e.g., Flatcar builders) don’t upload these files, so this will break VHD builds with “file not found”. Either add the file provisioners to every template that runs pre-install-dependencies.sh/copyPackerFiles, or make these copies conditional on the source file existing.

Copilot uses AI. Check for mistakes.
Comment on lines 56 to 63
# AKS specific entries
GLOBS+=(/etc/cni/net.d/*)
GLOBS+=(/etc/containerd/*)
GLOBS+=(/etc/default/kubelet)
GLOBS+=(/etc/kubernetes/manifests/*)
GLOBS+=(/var/log/azure-cni*)
GLOBS+=(/var/log/azure-cns*)
GLOBS+=(/var/log/azure-ipam*)
GLOBS+=(/var/log/azure-vnet*)
GLOBS+=(/var/log/cilium-cni*)
GLOBS+=(/var/run/azure-vnet*)
GLOBS+=(/var/run/azure-cns*)

# GPU specific entries
GLOBS+=(/var/log/nvidia*.log)
GLOBS+=(/var/log/azure/nvidia*.log)
GLOBS+=(/var/log/fabricmanager*.log)

# based on MANIFEST_FULL from Azure Linux Agent's log collector
# https://github.com/Azure/WALinuxAgent/blob/master/azurelinuxagent/common/logcollector_manifests.py
GLOBS+=(/var/lib/waagent/provisioned)
GLOBS+=(/etc/fstab)
GLOBS+=(/etc/ssh/sshd_config)
GLOBS+=(/boot/grub*/grub.c*)
GLOBS+=(/boot/grub*/menu.lst)
GLOBS+=(/etc/*-release)
GLOBS+=(/etc/HOSTNAME)
GLOBS+=(/etc/hostname)
GLOBS+=(/etc/apt/sources.list)
GLOBS+=(/etc/apt/sources.list.d/*)
GLOBS+=(/etc/network/interfaces)
GLOBS+=(/etc/network/interfaces.d/*.cfg)
GLOBS+=(/etc/netplan/*.yaml)
GLOBS+=(/etc/nsswitch.conf)
GLOBS+=(/etc/resolv.conf)
GLOBS+=(/run/systemd/resolve/stub-resolv.conf)
GLOBS+=(/run/resolvconf/resolv.conf)
GLOBS+=(/etc/sysconfig/iptables)
GLOBS+=(/etc/sysconfig/network)
GLOBS+=(/etc/sysconfig/network/ifcfg-eth*)
GLOBS+=(/etc/sysconfig/network/routes)
GLOBS+=(/etc/sysconfig/network-scripts/ifcfg-eth*)
GLOBS+=(/etc/sysconfig/network-scripts/route-eth*)
GLOBS+=(/etc/ufw/ufw.conf)
GLOBS+=(/etc/waagent.conf)
GLOBS+=(/var/lib/hyperv/.kvp_pool_*)
GLOBS+=(/var/lib/dhcp/dhclient.eth0.leases)
GLOBS+=(/var/lib/dhclient/dhclient-eth0.leases)
GLOBS+=(/var/lib/wicked/lease-eth0-dhcp-ipv4.xml)
GLOBS+=(/var/log/azure/custom-script/handler.log)
GLOBS+=(/var/log/azure/run-command/handler.log)
GLOBS+=(/var/lib/waagent/ovf-env.xml)
GLOBS+=(/var/lib/waagent/*/status/*.status)
GLOBS+=(/var/lib/waagent/*/config/*.settings)
GLOBS+=(/var/lib/waagent/*/config/HandlerState)
GLOBS+=(/var/lib/waagent/*/config/HandlerStatus)
GLOBS+=(/var/lib/waagent/SharedConfig.xml)
GLOBS+=(/var/lib/waagent/ManagedIdentity-*.json)
GLOBS+=(/var/lib/waagent/waagent_status.json)
GLOBS+=(/var/lib/waagent/*/error.json)
GLOBS+=(/var/log/cloud-init*)
GLOBS+=(/var/log/azure/aks/aks-node.pcap)
GLOBS+=(/var/log/azure/*/*)
GLOBS+=(/var/log/azure/*/*/*)
GLOBS+=(/var/log/syslog*)
GLOBS+=(/var/log/rsyslog*)
GLOBS+=(/var/log/messages*)
GLOBS+=(/var/log/kern*)
GLOBS+=(/var/log/dmesg*)
GLOBS+=(/var/log/dpkg*)
GLOBS+=(/var/log/yum*)
GLOBS+=(/var/log/boot*)
GLOBS+=(/var/log/auth*)
GLOBS+=(/var/log/secure*)
GLOBS+=(/var/log/journal*)
GLOBS+=(/etc/default/kubelet)

Copy link

Copilot AI Feb 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change removes most of the default AKS log bundle content (CNI/containerd/k8s manifests, waagent, network state, etc.) and leaves only the pcap + a small set of logs. That’s a major behavior regression for supportability. If the goal is to add the pcap for secure TLS bootstrap debugging, keep the existing globs/collectors and append the pcap, or gate the reduced collection behind a debug tag/flag.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants