test: setup packet capture steps for debugging secure TLS bootstrapping flakes#7939
test: setup packet capture steps for debugging secure TLS bootstrapping flakes#7939cameronmeissner wants to merge 5 commits intomainfrom
Conversation
There was a problem hiding this comment.
Pull request overview
Adds a systemd-based tcpdump capture step intended to help debug secure TLS bootstrapping flakes, wiring it into the Linux cloud-init/CSE flow and updating generated CustomData test snapshots.
Changes:
- Add
aks-pcap.serviceto Linux node custom data and start it during secure TLS bootstrapping configuration. - Install
tcpdumpon Mariner/AzureLinux images. - Modify
aks-log-collector.shbehavior, including max-zip-size handling and (notably) reducing the default set of collected artifacts.
Reviewed changes
Copilot reviewed 22 out of 70 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| pkg/agent/testdata/CustomizedImageLinuxGuard/CustomData | Regenerated CustomData snapshot reflecting added unit and other cloud-init changes. |
| pkg/agent/testdata/CustomizedImage/CustomData | Regenerated CustomData snapshot reflecting added unit and other cloud-init changes. |
| parts/linux/cloud-init/nodecustomdata.yml | Writes aks-pcap.service into the node via cloud-init. |
| parts/linux/cloud-init/artifacts/mariner/cse_install_mariner.sh | Adds tcpdump to Mariner/AzureLinux dependency install list. |
| parts/linux/cloud-init/artifacts/cse_helpers.sh | Introduces a new error code for aks-pcap startup failure. |
| parts/linux/cloud-init/artifacts/cse_config.sh | Starts/enables aks-pcap when configuring secure TLS bootstrapping. |
| parts/linux/cloud-init/artifacts/aks-log-collector.sh | Updates zip-size enforcement logic and changes what gets collected by default. |
| # adding this last file made the zip go over that size, remove that file and try the next one. | ||
| # Using continue instead of break ensures smaller subsequent files can still be included. | ||
| FILE_SIZE=$(stat --printf "%s" "${ZIP}") | ||
| if [ "$FILE_SIZE" -ge "$MAX_SIZE" ]; then | ||
| echo "WARNING: ZIP file size $FILE_SIZE >= $MAX_SIZE after adding $file; removing it and trying next file." | ||
| zip -d "${ZIP}" "$file" |
There was a problem hiding this comment.
The ENFORCE_MAX_ZIP_SIZE logic change introduces a subtle behavioral difference that could cause issues:
Old behavior (lines removed): When zip exceeded MAX_SIZE, the script would:
- Remove the last file that pushed it over
- STOP processing (break)
New behavior (lines 150-169): When zip exceeds MAX_SIZE:
- Skip adding more files until a file is added
- If that file pushes over limit, remove it and CONTINUE trying next files
- Only stops when hitting the pre-check at line 153
Problem: The continue logic (line 165-169) is inside the "if test -e" block, so smaller files after a large one can still be added. However, the pre-check (line 150-157) will prevent ANY files from being added once the limit is hit, creating inconsistent behavior depending on file ordering.
Recommendation: Simplify to always break when limit is exceeded (restore original behavior), OR move the pre-check outside the file loop for consistent "stop when full" behavior.
| # adding this last file made the zip go over that size, remove that file and try the next one. | |
| # Using continue instead of break ensures smaller subsequent files can still be included. | |
| FILE_SIZE=$(stat --printf "%s" "${ZIP}") | |
| if [ "$FILE_SIZE" -ge "$MAX_SIZE" ]; then | |
| echo "WARNING: ZIP file size $FILE_SIZE >= $MAX_SIZE after adding $file; removing it and trying next file." | |
| zip -d "${ZIP}" "$file" | |
| # adding this last file made the zip go over that size, remove that file and stop adding more. | |
| FILE_SIZE=$(stat --printf "%s" "${ZIP}") | |
| if [ "$FILE_SIZE" -ge "$MAX_SIZE" ]; then | |
| echo "WARNING: ZIP file size $FILE_SIZE >= $MAX_SIZE after adding $file; removing it and stopping further additions." | |
| zip -d "${ZIP}" "$file" | |
| break |
| [Service] | ||
| Type=oneshot | ||
| RemainAfterExit=yes | ||
| SuccessExitStatus=124 |
There was a problem hiding this comment.
SuccessExitStatus=124 suggests a normal timeout exit should be treated as success, but the current aks-pcap.sh uses || which prevents the unit from ever exiting with 124 (it will run the collector and return that exit code instead). Align the unit semantics with the script (either remove SuccessExitStatus=124 or change the script to preserve the 124 exit code on timeout).
| SuccessExitStatus=124 |
| AKS_PCAP_SERVICE_SRC=/home/packer/aks-pcap.service | ||
| AKS_PCAP_SERVICE_DEST=/etc/systemd/system/aks-pcap.service | ||
| cpAndMode $AKS_PCAP_SERVICE_SRC $AKS_PCAP_SERVICE_DEST 0644 | ||
|
|
||
| AKS_PCAP_SCRIPT_SRC=/home/packer/aks-pcap.sh | ||
| AKS_PCAP_SCRIPT_DEST=/opt/azure/containers/aks-pcap.sh | ||
| cpAndMode $AKS_PCAP_SCRIPT_SRC $AKS_PCAP_SCRIPT_DEST 0744 |
There was a problem hiding this comment.
copyPackerFiles now unconditionally copies /home/packer/aks-pcap.{service,sh}. Some packer templates (e.g., Flatcar builders) don’t upload these files, so this will break VHD builds with “file not found”. Either add the file provisioners to every template that runs pre-install-dependencies.sh/copyPackerFiles, or make these copies conditional on the source file existing.
| # AKS specific entries | ||
| GLOBS+=(/etc/cni/net.d/*) | ||
| GLOBS+=(/etc/containerd/*) | ||
| GLOBS+=(/etc/default/kubelet) | ||
| GLOBS+=(/etc/kubernetes/manifests/*) | ||
| GLOBS+=(/var/log/azure-cni*) | ||
| GLOBS+=(/var/log/azure-cns*) | ||
| GLOBS+=(/var/log/azure-ipam*) | ||
| GLOBS+=(/var/log/azure-vnet*) | ||
| GLOBS+=(/var/log/cilium-cni*) | ||
| GLOBS+=(/var/run/azure-vnet*) | ||
| GLOBS+=(/var/run/azure-cns*) | ||
|
|
||
| # GPU specific entries | ||
| GLOBS+=(/var/log/nvidia*.log) | ||
| GLOBS+=(/var/log/azure/nvidia*.log) | ||
| GLOBS+=(/var/log/fabricmanager*.log) | ||
|
|
||
| # based on MANIFEST_FULL from Azure Linux Agent's log collector | ||
| # https://github.com/Azure/WALinuxAgent/blob/master/azurelinuxagent/common/logcollector_manifests.py | ||
| GLOBS+=(/var/lib/waagent/provisioned) | ||
| GLOBS+=(/etc/fstab) | ||
| GLOBS+=(/etc/ssh/sshd_config) | ||
| GLOBS+=(/boot/grub*/grub.c*) | ||
| GLOBS+=(/boot/grub*/menu.lst) | ||
| GLOBS+=(/etc/*-release) | ||
| GLOBS+=(/etc/HOSTNAME) | ||
| GLOBS+=(/etc/hostname) | ||
| GLOBS+=(/etc/apt/sources.list) | ||
| GLOBS+=(/etc/apt/sources.list.d/*) | ||
| GLOBS+=(/etc/network/interfaces) | ||
| GLOBS+=(/etc/network/interfaces.d/*.cfg) | ||
| GLOBS+=(/etc/netplan/*.yaml) | ||
| GLOBS+=(/etc/nsswitch.conf) | ||
| GLOBS+=(/etc/resolv.conf) | ||
| GLOBS+=(/run/systemd/resolve/stub-resolv.conf) | ||
| GLOBS+=(/run/resolvconf/resolv.conf) | ||
| GLOBS+=(/etc/sysconfig/iptables) | ||
| GLOBS+=(/etc/sysconfig/network) | ||
| GLOBS+=(/etc/sysconfig/network/ifcfg-eth*) | ||
| GLOBS+=(/etc/sysconfig/network/routes) | ||
| GLOBS+=(/etc/sysconfig/network-scripts/ifcfg-eth*) | ||
| GLOBS+=(/etc/sysconfig/network-scripts/route-eth*) | ||
| GLOBS+=(/etc/ufw/ufw.conf) | ||
| GLOBS+=(/etc/waagent.conf) | ||
| GLOBS+=(/var/lib/hyperv/.kvp_pool_*) | ||
| GLOBS+=(/var/lib/dhcp/dhclient.eth0.leases) | ||
| GLOBS+=(/var/lib/dhclient/dhclient-eth0.leases) | ||
| GLOBS+=(/var/lib/wicked/lease-eth0-dhcp-ipv4.xml) | ||
| GLOBS+=(/var/log/azure/custom-script/handler.log) | ||
| GLOBS+=(/var/log/azure/run-command/handler.log) | ||
| GLOBS+=(/var/lib/waagent/ovf-env.xml) | ||
| GLOBS+=(/var/lib/waagent/*/status/*.status) | ||
| GLOBS+=(/var/lib/waagent/*/config/*.settings) | ||
| GLOBS+=(/var/lib/waagent/*/config/HandlerState) | ||
| GLOBS+=(/var/lib/waagent/*/config/HandlerStatus) | ||
| GLOBS+=(/var/lib/waagent/SharedConfig.xml) | ||
| GLOBS+=(/var/lib/waagent/ManagedIdentity-*.json) | ||
| GLOBS+=(/var/lib/waagent/waagent_status.json) | ||
| GLOBS+=(/var/lib/waagent/*/error.json) | ||
| GLOBS+=(/var/log/cloud-init*) | ||
| GLOBS+=(/var/log/azure/aks/aks-node.pcap) | ||
| GLOBS+=(/var/log/azure/*/*) | ||
| GLOBS+=(/var/log/azure/*/*/*) | ||
| GLOBS+=(/var/log/syslog*) | ||
| GLOBS+=(/var/log/rsyslog*) | ||
| GLOBS+=(/var/log/messages*) | ||
| GLOBS+=(/var/log/kern*) | ||
| GLOBS+=(/var/log/dmesg*) | ||
| GLOBS+=(/var/log/dpkg*) | ||
| GLOBS+=(/var/log/yum*) | ||
| GLOBS+=(/var/log/boot*) | ||
| GLOBS+=(/var/log/auth*) | ||
| GLOBS+=(/var/log/secure*) | ||
| GLOBS+=(/var/log/journal*) | ||
| GLOBS+=(/etc/default/kubelet) | ||
|
|
There was a problem hiding this comment.
This change removes most of the default AKS log bundle content (CNI/containerd/k8s manifests, waagent, network state, etc.) and leaves only the pcap + a small set of logs. That’s a major behavior regression for supportability. If the goal is to add the pcap for secure TLS bootstrap debugging, keep the existing globs/collectors and append the pcap, or gate the reduced collection behind a debug tag/flag.
What this PR does / why we need it:
for debugging only - NOT FOR MERGE
Which issue(s) this PR fixes:
Fixes #