
Conversation

goldberl (Contributor) commented Jan 30, 2026

Proposed Commit Message

fix(ec2): wait for NICs before failing datasource discovery

On some EC2 instance types, network interfaces may not 
be present during the init-local stage when the EC2 
datasource attempts to discover metadata.

Previously, the datasource would fail immediately if no
eligible NICs were found, causing metadata and userdata
retrieval to fail on first boot. This resulted in missing
SSH keys and required a reboot to recover.

Add a bounded wait for eligible NICs before failing datasource
discovery. This avoids a race condition and ensures userdata is
applied correctly on first boot.

Fixes GH-6697

Additional Context

Addresses #6697
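
For illustration, here is a minimal sketch of the bounded-wait idea (this is not the actual patch: the helper name, timeout, and poll interval are assumptions; the two debug messages match the log lines shown in the test steps below):

    # Sketch only: poll for eligible NICs with a bounded deadline instead of
    # failing on the first empty result. find_candidate_nics is passed in to
    # keep this self-contained; in cloud-init it would be
    # cloudinit.net.find_candidate_nics.
    import logging
    import time

    LOG = logging.getLogger(__name__)

    def wait_for_candidate_nics(find_candidate_nics, timeout=30, poll=1):
        start = time.monotonic()
        while True:
            nics = find_candidate_nics()
            if nics:
                LOG.debug(
                    "Eligible NICs found after %ss: %s",
                    int(time.monotonic() - start),
                    nics,
                )
                return nics
            if time.monotonic() - start >= timeout:
                return []  # caller reports the "at least one eligible NIC" error
            LOG.debug("No NICs yet, waiting for udev/network...")
            time.sleep(poll)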

Test Steps

  1. Launch an Ubuntu 24.04 Noble Minimal EC2 AMI (an hpc7a.96xlarge instance reproduces the issue every time)
  2. Wait several minutes after the instance reaches the running state, then attempt to SSH into the instance.
    You will be hit with an error:
ubuntu@<public-ip>: Permission denied (publickey).

You can SSH into the machine on the first boot using the AWS UI.
Running cloud-init status --long shows that cloud-init falls back to DataSourceNone, with metadata and NIC errors:

ubuntu@ubuntu:~$ cloud-init status --long
status: done
extended_status: degraded done
boot_status_code: enabled-by-generator
last_update: Thu, 01 Jan 1970 00:06:08 +0000
detail: DataSourceNone
errors: []
recoverable_errors:
ERROR:
        - Unable to get response from urls: ['http://169.254.169.254/latest/api/token', 'http://[fd00:ec2::254]/latest/api/token']
        - Unable to get metadata
        - The instance must have at least one eligible NIC
        - The instance must have at least one eligible NIC
        - The instance must have at least one eligible NIC
WARNING:
        - Calling 'http://[fd00:ec2::254]/latest/api/token' failed [232/240s]: request error [HTTPConnectionPool(host='fd00:ec2::254', port=80): Max retries exceeded with url: /latest/api/token (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7a26fe907200>: Failed to establish a new connection: [Errno 101] Network is unreachable'))]
        - IMDS's HTTP endpoint is probably disabled
        - Used fallback datasource
  3. Reboot the machine; you should now be able to SSH in.
    If you check cloud-init status --long, it will now show detail: DataSourceEc2, but the NIC errors will still appear:
ubuntu@ip-172-31-23-161:~$ cloud-init status --long
status: done
extended_status: degraded done
boot_status_code: enabled-by-generator
last_update: Thu, 01 Jan 1970 00:00:08 +0000
detail: DataSourceEc2
errors: []
recoverable_errors:
ERROR:
	- The instance must have at least one eligible NIC
	- The instance must have at least one eligible NIC
	- The instance must have at least one eligible NIC
	- The instance must have at least one eligible NIC
WARNING:
	- Calling 'http://[fd00:ec2::254]/latest/api/token' failed [0/240s]: request error [HTTPConnectionPool(host='fd00:ec2::254', port=80): Max retries exceeded with url: /latest/api/token (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x76a37851f8c0>: Failed to establish a new connection: [Errno 101] Network is unreachable'))]
  4. Apply this patch to the /usr/lib/python3/dist-packages/cloudinit/sources/DataSourceEc2.py file
  5. Clean cloud-init and reboot: sudo cloud-init clean --logs --reboot
  6. SSH back in after the reboot

After applying the patch, cleaning cloud-init, and rebooting, cloud-init status --long will show:

status: done
extended_status: done
boot_status_code: enabled-by-generator
last_update: Thu, 01 Jan 1970 00:00:10 +0000
detail: DataSourceEc2Local
errors: []
recoverable_errors: {}

You can verify the patch via the following commands:

sudo cat /var/log/cloud-init.log | grep "No NICs yet"
sudo cat /var/log/cloud-init.log | grep "Eligible NICs found"

The logs will show something like:

2026-01-30 15:04:20,710 - DataSourceEc2.py[DEBUG]: No NICs yet, waiting for udev/network...
2026-01-30 15:04:21,711 - DataSourceEc2.py[DEBUG]: Eligible NICs found after 1s: ['enp34s0']

Merge type

  • Squash merge using "Proposed Commit Message"
  • Rebase and merge unique commits. Requires commit messages per-commit each referencing the pull request number (#<PR_NUM>)

goldberl and others added 4 commits January 29, 2026 18:07
On some EC2 instance types, particularly Nitro-based systems,
network interfaces may not be present during the init-local
stage when the EC2 datasource attempts to discover metadata.

Previously, the datasource would fail immediately if no
eligible NICs were found, causing metadata and userdata
retrieval to fail on first boot. This resulted in missing
SSH keys and required a reboot to recover.

Add a bounded wait for eligible NICs before failing datasource
discovery. This avoids a race condition and ensures userdata is
applied correctly on first boot.
@goldberl goldberl marked this pull request as ready for review January 30, 2026 18:10
@holmanb holmanb self-assigned this Feb 2, 2026
holmanb (Member) commented Feb 2, 2026

> Clean cloud-init and reboot: sudo cloud-init clean --logs --reboot

This leaves behind a netplan configuration on the filesystem - which means that this code might not be exercised by this test. Can you please retest this change using sudo cloud-init clean --logs --config all and report back with full logs?

Per my comment in the bug, it would be good to know why the network-online service timed out after 120s - how long does this code have to wait before the device comes online? Any idea why it is so long?

I notice that the device appears to be configured due to cloud-init calling netplan apply - if this is what makes the network accessible, then I doubt that simply polling on the existence of the device is a sufficient solution.

goldberl (Contributor, Author) commented Feb 3, 2026

@holmanb Thanks for your comments. I re-tested by applying my patch, running sudo cloud-init clean --logs --config all, and rebooting the machine. Here are the logs after that reboot: cloud-init-after-patch-applied-reboot.tar.gz

It looks like the patch ran properly:

goldberl@lemon-box:~/cases/425440-aws-cloud-init/cloud-init-logs-2026-02-03$ rg NICs
var/log/cloud-init.log
31:2026-02-03 21:36:15,310 - DataSourceEc2.py[DEBUG]: No NICs yet, waiting for udev/network...
46:2026-02-03 21:36:16,320 - DataSourceEc2.py[DEBUG]: Eligible NICs found after 1s: ['enp34s0']

I'll keep investigating why the network-online service timed out and get back to you

goldberl (Contributor, Author) commented Feb 3, 2026

So I took a look at all three sets of logs (the first two are the same logs I uploaded to the bug) and created a timeline for each from the journal.txt file.

First boot of the machine
Timeline: service starts => NIC renamed => service times out => NIC is connected => NIC is routable

# starting systemd-networkd-wait-online.service
Jan 30 17:45:05.890129 ubuntu systemd[1]: Starting systemd-networkd-wait-online.service - Wait for Network to be Configured...

# NIC is renamed from eth0 to enp34s0
Jan 30 17:45:06.326522 ubuntu systemd-networkd[1698]: eth0: Interface name change detected, renamed to enp34s0.

# Network service times out here after 120 seconds
Jan 30 17:47:06.072785 ubuntu systemd-networkd-wait-online[1706]: Timeout occurred while waiting for network connectivity.

# NIC is connected
Jan 30 17:50:59.286733 ubuntu systemd-networkd[2539]: enp34s0: Gained carrier

# NIC is routable
Jan 30 17:51:00.928161 ubuntu systemd-networkd[2539]: enp34s0: Gained IPv6LL

Reboot of the machine before the patch
Timeline: service is skipped => NIC renamed => NIC is connected => NIC is routable

# systemd-networkd-wait-online.service is just skipped
Jan 30 17:53:26.787705 ubuntu systemd[1]: systemd-networkd-wait-online.service - Wait for Network to be Configured was skipped because of an unmet condition check (ConditionPathIsSymbolicLink=/run/systemd/generator/network-online.target.wants/systemd-networkd-wait-online.service).

# NIC is renamed from eth0 to enp34s0
Jan 30 17:53:27.715702 ubuntu kernel: ena 0000:22:00.0 enp34s0: renamed from eth0

# NIC is connected
Jan 30 17:53:27.777878 ubuntu systemd-networkd[1707]: enp34s0: Gained carrier

# NIC is routable
Jan 30 17:53:29.477749 ip-172-31-23-132 systemd-networkd[2544]: enp34s0: Gained IPv6LL

Reboot of the machine after the patch
Timeline: NIC renamed => NIC is connected => service starts => NIC is routable => service finishes

# NIC renamed from eth0 to enp34s0
Feb 03 21:36:16.167499 ip-172-31-30-97 kernel: ena 0000:22:00.0 enp34s0: renamed from eth0

# NIC is connected
Feb 03 21:36:18.327684 ip-172-31-30-97 systemd-networkd[2384]: enp34s0: Gained carrier

# starting systemd-networkd-wait-online.service
Feb 03 21:36:18.384721 ip-172-31-30-97 systemd[1]: Starting systemd-networkd-wait-online.service - Wait for Network to be Configured...

# NIC is routable
Feb 03 21:36:20.279704 ip-172-31-30-97 systemd-networkd[2384]: enp34s0: Gained IPv6LL

# systemd-networkd-wait-online.service finishes
Feb 03 21:36:20.298234 ip-172-31-30-97 systemd[1]: Finished systemd-networkd-wait-online.service - Wait for Network to be Configured.

So on the first boot of the machine, systemd-networkd-wait-online starts before the NIC is ready and times out because it can't find the device, as @holmanb mentioned in the bug comment.

A customer case file mentions that this issue happens intermittently on all hpc7a instances but always on hpc7a.96xlarge, which is why that size was recommended for testing. On the smaller instances, the intermittent failures look like a race condition; perhaps larger instances like hpc7a.96xlarge take longer to provision their NICs, which would explain why it always happens there.

As for the fix, it seems the original code just tried to find the NIC once, and then moved on:

    if self.perform_dhcp_setup:  # Setup networking in init-local stage.
        if util.is_FreeBSD():
            LOG.debug("FreeBSD doesn't support running dhclient with -sf")
            return False
        candidate_nics = net.find_candidate_nics()
        LOG.debug("Looking for the primary NIC in: %s", candidate_nics)
        if len(candidate_nics) < 1:
            LOG.error("The instance must have at least one eligible NIC")
            return False

So I thought polling for its existence would be helpful. But it seems the NIC also has to be provisioned correctly (not just exist) before we can move on; does that seem correct?

So instead of polling only for NIC existence, should the patch also wait until the NIC is actually usable for metadata retrieval?
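
If so, one possible reading of "usable" at this stage is a link/carrier check via sysfs. The sketch below is only an illustration under that assumption: the helper names are invented, and per the netplan observation above, carrier alone may still not be enough.

    # Hypothetical sketch (names invented): wait until a NIC not only exists
    # but also reports link/carrier, by reading standard Linux sysfs paths.
    import time

    def nic_has_carrier(nic: str) -> bool:
        """Return True if the kernel reports carrier for this interface."""
        try:
            with open(f"/sys/class/net/{nic}/carrier") as f:
                return f.read().strip() == "1"
        except OSError:
            # Reading 'carrier' fails with EINVAL while the interface is down.
            return False

    def wait_until_usable(nics, timeout=60, poll=1):
        """Return the subset of nics with carrier, polling up to timeout seconds."""
        start = time.monotonic()
        while time.monotonic() - start < timeout:
            up = [n for n in nics if nic_has_carrier(n)]
            if up:
                return up
            time.sleep(poll)
        return []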

goldberl (Contributor, Author) commented Feb 9, 2026

I’ve rearranged the existing logic so that we wait until metadata is reachable over the NIC instead of just polling for the NIC's existence.

There is currently one failing test (test_aws_token_403_fails_without_retries) and I’m not certain what the best approach is to fix it.

Feedback on how to handle this properly would be appreciated.
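
For reference, here is a rough standalone illustration of the "wait until metadata is reachable" idea. It is not the PR's code; it only demonstrates bounded retries against the IMDSv2 token endpoint that appears in the logs above, with assumed deadline and poll values.

    # Illustrative only: poll the IMDSv2 token endpoint with a bounded
    # deadline instead of failing on the first network error.
    import time
    import urllib.request

    IMDS_TOKEN_URL = "http://169.254.169.254/latest/api/token"

    def imds_reachable(timeout_s: float = 2.0) -> bool:
        """Try to fetch an IMDSv2 session token; False on any network error."""
        req = urllib.request.Request(
            IMDS_TOKEN_URL,
            method="PUT",
            headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
        )
        try:
            with urllib.request.urlopen(req, timeout=timeout_s):
                return True
        except OSError:  # URLError, timeouts, connection refused, etc.
            return False

    def wait_for_metadata(deadline_s: int = 120, poll_s: int = 2) -> bool:
        """Poll until IMDS responds or the deadline expires."""
        start = time.monotonic()
        while time.monotonic() - start < deadline_s:
            if imds_reachable():
                return True
            time.sleep(poll_s)
        return False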
