
Conversation

goldberl (Contributor) commented Jan 30, 2026

Proposed Commit Message

fix(ec2): wait for NICs before failing datasource discovery

On some EC2 instance types, network interfaces may not 
be present during the init-local stage when the EC2 
datasource attempts to discover metadata.

Previously, the datasource would fail immediately if no
eligible NICs were found, causing metadata and userdata
retrieval to fail on first boot. This resulted in missing
SSH keys and required a reboot to recover.

Add a bounded wait for eligible NICs before failing datasource
discovery. This avoids a race condition and ensures userdata is
applied correctly on first boot.

Fixes GH-6697

Additional Context

Addresses #6697
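
For illustration, here is a minimal sketch of the bounded-wait idea (this is not the actual patch: the helper name, timeout, and poll interval are assumptions; the two debug messages match the log lines shown in the test steps below):

    # Sketch only: poll for eligible NICs with a bounded deadline instead of
    # failing on the first empty result. find_candidate_nics is passed in to
    # keep this self-contained; in cloud-init it would be
    # cloudinit.net.find_candidate_nics.
    import logging
    import time

    LOG = logging.getLogger(__name__)

    def wait_for_candidate_nics(find_candidate_nics, timeout=30, poll=1):
        start = time.monotonic()
        while True:
            nics = find_candidate_nics()
            if nics:
                LOG.debug(
                    "Eligible NICs found after %ss: %s",
                    int(time.monotonic() - start),
                    nics,
                )
                return nics
            if time.monotonic() - start >= timeout:
                return []  # caller reports the "at least one eligible NIC" error
            LOG.debug("No NICs yet, waiting for udev/network...")
            time.sleep(poll)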

Test Steps

  1. Launch an Ubuntu 24.04 Noble Minimal EC2 AMI (an hpc7a.96xlarge instance reproduces the issue every time)
  2. Wait several minutes after the instance reaches the running state, then attempt to SSH into the instance.
    You will be hit with an error:
ubuntu@<public-ip>: Permission denied (publickey).

You can SSH into the machine on the first boot using the AWS UI.
Running cloud-init status --long shows that cloud-init falls back to DataSourceNone, with metadata and NIC errors:

ubuntu@ubuntu:~$ cloud-init status --long
status: done
extended_status: degraded done
boot_status_code: enabled-by-generator
last_update: Thu, 01 Jan 1970 00:06:08 +0000
detail: DataSourceNone
errors: []
recoverable_errors:
ERROR:
        - Unable to get response from urls: ['http://169.254.169.254/latest/api/token', 'http://[fd00:ec2::254]/latest/api/token']
        - Unable to get metadata
        - The instance must have at least one eligible NIC
        - The instance must have at least one eligible NIC
        - The instance must have at least one eligible NIC
WARNING:
        - Calling 'http://[fd00:ec2::254]/latest/api/token' failed [232/240s]: request error [HTTPConnectionPool(host='fd00:ec2::254', port=80): Max retries exceeded with url: /latest/api/token (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7a26fe907200>: Failed to establish a new connection: [Errno 101] Network is unreachable'))]
        - IMDS's HTTP endpoint is probably disabled
        - Used fallback datasource
  3. Reboot the machine; you should now be able to SSH in.
    If you check cloud-init status --long, it will now show detail: DataSourceEc2, but the NIC errors will still appear:
ubuntu@ip-172-31-23-161:~$ cloud-init status --long
status: done
extended_status: degraded done
boot_status_code: enabled-by-generator
last_update: Thu, 01 Jan 1970 00:00:08 +0000
detail: DataSourceEc2
errors: []
recoverable_errors:
ERROR:
	- The instance must have at least one eligible NIC
	- The instance must have at least one eligible NIC
	- The instance must have at least one eligible NIC
	- The instance must have at least one eligible NIC
WARNING:
	- Calling 'http://[fd00:ec2::254]/latest/api/token' failed [0/240s]: request error [HTTPConnectionPool(host='fd00:ec2::254', port=80): Max retries exceeded with url: /latest/api/token (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x76a37851f8c0>: Failed to establish a new connection: [Errno 101] Network is unreachable'))]
  4. Apply this patch to the /usr/lib/python3/dist-packages/cloudinit/sources/DataSourceEc2.py file
  5. Clean cloud-init and reboot: sudo cloud-init clean --logs --reboot
  6. SSH back in after the reboot

After applying the patch, cleaning cloud-init, and rebooting, cloud-init status --long will show:

status: done
extended_status: done
boot_status_code: enabled-by-generator
last_update: Thu, 01 Jan 1970 00:00:10 +0000
detail: DataSourceEc2Local
errors: []
recoverable_errors: {}

You can verify the patch via the following commands:

sudo cat /var/log/cloud-init.log | grep "No NICs yet"
sudo cat /var/log/cloud-init.log | grep "Eligible NICs found"

The logs will show something like:

2026-01-30 15:04:20,710 - DataSourceEc2.py[DEBUG]: No NICs yet, waiting for udev/network...
2026-01-30 15:04:21,711 - DataSourceEc2.py[DEBUG]: Eligible NICs found after 1s: ['enp34s0']

Merge type

  • Squash merge using "Proposed Commit Message"
  • Rebase and merge unique commits. Requires commit messages per-commit each referencing the pull request number (#<PR_NUM>)

goldberl and others added 4 commits January 29, 2026 18:07
On some EC2 instance types, particularly Nitro-based systems,
network interfaces may not be present during the init-local
stage when the EC2 datasource attempts to discover metadata.

Previously, the datasource would fail immediately if no
eligible NICs were found, causing metadata and userdata
retrieval to fail on first boot. This resulted in missing
SSH keys and required a reboot to recover.

Add a bounded wait for eligible NICs before failing datasource
discovery. This avoids a race condition and ensures userdata is
applied correctly on first boot.
@goldberl goldberl marked this pull request as ready for review January 30, 2026 18:10
@holmanb holmanb self-assigned this Feb 2, 2026
holmanb (Member) commented Feb 2, 2026

> Clean cloud-init and reboot: sudo cloud-init clean --logs --reboot

This leaves behind a netplan configuration on the filesystem - which means that this code might not be exercised by this test. Can you please retest this change using sudo cloud-init clean --logs --config all and report back with full logs?

Per my comment in the bug, it would be good to know why the network-online service timed out after 120s - how long does this code have to wait before the device comes online? Any idea why it is so long?

I notice that the device appears to be configured due to cloud-init calling netplan apply - if this is what makes the network accessible, then I doubt that simply polling on the existence of the device is a sufficient solution.

goldberl (Contributor, Author) commented Feb 3, 2026

@holmanb Thanks for your comments. I re-tested by applying my patch, running sudo cloud-init clean --logs --config all, and rebooting the machine. Here are the logs after that reboot: cloud-init-after-patch-applied-reboot.tar.gz

It looks like the patch ran properly:

goldberl@lemon-box:~/cases/425440-aws-cloud-init/cloud-init-logs-2026-02-03$ rg NICs
var/log/cloud-init.log
31:2026-02-03 21:36:15,310 - DataSourceEc2.py[DEBUG]: No NICs yet, waiting for udev/network...
46:2026-02-03 21:36:16,320 - DataSourceEc2.py[DEBUG]: Eligible NICs found after 1s: ['enp34s0']

I'll keep investigating why the network-online service timed out and get back to you

goldberl (Contributor, Author) commented Feb 3, 2026

So I took a look at all three sets of logs (the first two are the same logs I uploaded to the bug) and created a timeline for each from the journal.txt file.

First boot of the machine
Timeline: service starts => NIC renamed => service times out => NIC is connected => NIC is routable

# starting systemd-networkd-wait-online.service
Jan 30 17:45:05.890129 ubuntu systemd[1]: Starting systemd-networkd-wait-online.service - Wait for Network to be Configured...

# NIC is renamed from eth0 to enp34s0
Jan 30 17:45:06.326522 ubuntu systemd-networkd[1698]: eth0: Interface name change detected, renamed to enp34s0.

# Network service times out here after 120 seconds
Jan 30 17:47:06.072785 ubuntu systemd-networkd-wait-online[1706]: Timeout occurred while waiting for network connectivity.

# NIC is connected
Jan 30 17:50:59.286733 ubuntu systemd-networkd[2539]: enp34s0: Gained carrier

# NIC is routable
Jan 30 17:51:00.928161 ubuntu systemd-networkd[2539]: enp34s0: Gained IPv6LL

Reboot of the machine before the patch
Timeline: service is skipped => NIC renamed => NIC is connected => NIC is routable

# systemd-networkd-wait-online.service is just skipped
Jan 30 17:53:26.787705 ubuntu systemd[1]: systemd-networkd-wait-online.service - Wait for Network to be Configured was skipped because of an unmet condition check (ConditionPathIsSymbolicLink=/run/systemd/generator/network-online.target.wants/systemd-networkd-wait-online.service).

# NIC is renamed from eth0 to enp34s0
Jan 30 17:53:27.715702 ubuntu kernel: ena 0000:22:00.0 enp34s0: renamed from eth0

# NIC is connected
Jan 30 17:53:27.777878 ubuntu systemd-networkd[1707]: enp34s0: Gained carrier

# NIC is routable
Jan 30 17:53:29.477749 ip-172-31-23-132 systemd-networkd[2544]: enp34s0: Gained IPv6LL

Reboot of the machine after the patch
Timeline: NIC renamed => NIC is connected => service starts => NIC is routable => service finishes

# NIC renamed from eth0 to enp34s0
Feb 03 21:36:16.167499 ip-172-31-30-97 kernel: ena 0000:22:00.0 enp34s0: renamed from eth0

# NIC is connected
Feb 03 21:36:18.327684 ip-172-31-30-97 systemd-networkd[2384]: enp34s0: Gained carrier

# starting systemd-networkd-wait-online.service
Feb 03 21:36:18.384721 ip-172-31-30-97 systemd[1]: Starting systemd-networkd-wait-online.service - Wait for Network to be Configured...

# NIC is routable
Feb 03 21:36:20.279704 ip-172-31-30-97 systemd-networkd[2384]: enp34s0: Gained IPv6LL

# systemd-networkd-wait-online.service finishes
Feb 03 21:36:20.298234 ip-172-31-30-97 systemd[1]: Finished systemd-networkd-wait-online.service - Wait for Network to be Configured.

So on the first boot of the machine, systemd-networkd-wait-online starts before the NIC is ready and times out because it can't find the device, as @holmanb mentioned in the bug comment.

A customer case file mentions that this issue happens intermittently on all hpc7a instances but always on hpc7a.96xlarge, which is why that size was recommended for testing. On the smaller instances, the intermittent failures look like a race condition; perhaps larger instances like hpc7a.96xlarge take longer to provision their NICs, which would explain why it always happens there.

As for the fix, it seems the original code just tried to find the NIC once, and then moved on:

    if self.perform_dhcp_setup:  # Setup networking in init-local stage.
        if util.is_FreeBSD():
            LOG.debug("FreeBSD doesn't support running dhclient with -sf")
            return False
        candidate_nics = net.find_candidate_nics()
        LOG.debug("Looking for the primary NIC in: %s", candidate_nics)
        if len(candidate_nics) < 1:
            LOG.error("The instance must have at least one eligible NIC")
            return False

So I thought polling for its existence would be helpful. But it seems the NIC also has to be provisioned correctly (not just exist) before we can move on; does that seem correct?

So instead of polling only for NIC existence, should the patch also wait until the NIC is actually usable for metadata retrieval?
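
If so, one possible reading of "usable" at this stage is a link/carrier check via sysfs. The sketch below is only an illustration under that assumption: the helper names are invented, and per the netplan observation above, carrier alone may still not be enough.

    # Hypothetical sketch (names invented): wait until a NIC not only exists
    # but also reports link/carrier, by reading standard Linux sysfs paths.
    import time

    def nic_has_carrier(nic: str) -> bool:
        """Return True if the kernel reports carrier for this interface."""
        try:
            with open(f"/sys/class/net/{nic}/carrier") as f:
                return f.read().strip() == "1"
        except OSError:
            # Reading 'carrier' fails with EINVAL while the interface is down.
            return False

    def wait_until_usable(nics, timeout=60, poll=1):
        """Return the subset of nics with carrier, polling up to timeout seconds."""
        start = time.monotonic()
        while time.monotonic() - start < timeout:
            up = [n for n in nics if nic_has_carrier(n)]
            if up:
                return up
            time.sleep(poll)
        return []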

goldberl (Contributor, Author) commented Feb 9, 2026

I’ve rearranged the existing logic so that we wait until metadata is reachable over the NIC instead of just polling for the NIC's existence.

There is currently one failing test (test_aws_token_403_fails_without_retries) and I’m not certain what the best approach is to fix it.

Feedback on how to handle this properly would be appreciated.
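
For reference, here is a rough standalone illustration of the "wait until metadata is reachable" idea. It is not the PR's code; it only demonstrates bounded retries against the IMDSv2 token endpoint that appears in the logs above, with assumed deadline and poll values.

    # Illustrative only: poll the IMDSv2 token endpoint with a bounded
    # deadline instead of failing on the first network error.
    import time
    import urllib.request

    IMDS_TOKEN_URL = "http://169.254.169.254/latest/api/token"

    def imds_reachable(timeout_s: float = 2.0) -> bool:
        """Try to fetch an IMDSv2 session token; False on any network error."""
        req = urllib.request.Request(
            IMDS_TOKEN_URL,
            method="PUT",
            headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
        )
        try:
            with urllib.request.urlopen(req, timeout=timeout_s):
                return True
        except OSError:  # URLError, timeouts, connection refused, etc.
            return False

    def wait_for_metadata(deadline_s: int = 120, poll_s: int = 2) -> bool:
        """Poll until IMDS responds or the deadline expires."""
        start = time.monotonic()
        while time.monotonic() - start < deadline_s:
            if imds_reachable():
                return True
            time.sleep(poll_s)
        return False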
