fix (EC2): wait for NICs before failing datasource discovery #6698
base: main
On some EC2 instance types, particularly Nitro-based systems, network interfaces may not be present during the init-local stage when the EC2 datasource attempts to discover metadata.

Previously, the datasource would fail immediately if no eligible NICs were found, causing metadata and userdata retrieval to fail on first boot. This resulted in missing SSH keys and required a reboot to recover.

Add a bounded wait for eligible NICs before failing datasource discovery. This avoids a race condition and ensures userdata is applied correctly on first boot.
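For readers skimming the thread, the "bounded wait" idea can be sketched roughly as below. This is only an illustration and not the code in this PR: `wait_for_nics`, `find_nics`, and the default timeout and interval are hypothetical names and values, and the real datasource logic in `DataSourceEc2.py` may look quite different.

```python
import time
from typing import Callable, List


def wait_for_nics(
    find_nics: Callable[[], List[str]],
    timeout: float = 10.0,
    interval: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> List[str]:
    """Poll find_nics() until it returns something or the timeout expires.

    find_nics is a hypothetical callable standing in for whatever the
    datasource uses to list eligible interfaces. sleep and clock are
    injectable so the loop can be tested without real delays.
    """
    deadline = clock() + timeout
    while True:
        nics = find_nics()
        if nics:
            # At least one eligible NIC showed up; stop waiting.
            return nics
        if clock() >= deadline:
            # Bounded: give up so discovery can fail as it did before.
            return []
        sleep(interval)
```

The loop always checks at least once, so a machine whose NIC is already present pays no extra delay, and a machine whose NIC never appears still fails after roughly `timeout` seconds.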
…nto ec2-wait-for-nic
This leaves behind a netplan configuration on the filesystem, which means that this code might not be exercised by this test. Can you please retest this change using

Per my comment in the bug, it would be good to know why the network-online service timed out after 120s. How long does this code have to wait before the device comes online? Any idea why it is so long?

I notice that the device appears to be configured due to cloud-init calling
@holmanb Thanks for your comments. I retried applying my patch and then doing

It seems like the patch was run properly.

I'll keep investigating why the network-online service timed out and get back to you.
So I took a look at all three of these logs (the first two are the same logs I uploaded to the bug)
And created a timeline for each from the
First boot of machine
Reboot of machine before the patch
Reboot of machine after the patch
So in the first boot of machine,
It was mentioned in a customer case file that this issue happens intermittently on all
As for the fix, it seems the original code just tried to find the NIC once and then moved on:
So I thought polling for its existence would be helpful. But it seems the NIC also has to be provisioned correctly (not just exist) before we can move on. Does that seem correct? If so, instead of polling only for NIC existence, should the patch also wait until the NIC is actually usable/reachable for metadata?
3edb35f to 9274842
…just checking for NIC existence
I’ve rearranged the existing logic so that we wait until the NIC can actually reach the metadata instead of just polling for its existence. There is currently one failing test. Feedback on how to handle this properly would be appreciated.
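To make the distinction between "NIC exists" and "NIC can reach metadata" concrete, here is a rough sketch of the second kind of wait. It is an assumption-laden illustration, not the PR's implementation: `wait_for_usable_nic`, `list_candidates`, and `can_reach_metadata` are hypothetical names, and the probe is left as a caller-supplied callable rather than any real cloud-init function.

```python
import time
from typing import Callable, Iterable, Optional


def wait_for_usable_nic(
    list_candidates: Callable[[], Iterable[str]],
    can_reach_metadata: Callable[[str], bool],
    timeout: float = 30.0,
    interval: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Optional[str]:
    """Return the first NIC that passes the metadata probe, or None.

    Both callables are hypothetical stand-ins: list_candidates lists the
    interfaces that currently exist, and can_reach_metadata reports whether
    metadata could be fetched over a given interface.
    """
    deadline = clock() + timeout
    while True:
        # Re-list on every pass, since NICs may appear late.
        for nic in list_candidates():
            if can_reach_metadata(nic):
                return nic
        if clock() >= deadline:
            return None
        sleep(interval)
```

With this shape, an interface that exists but is not yet provisioned is simply probed again on the next pass instead of being accepted or rejected on the first look.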
Proposed Commit Message
Additional Context
Addresses #6697
Test Steps
You will be hit with an error:
You can SSH into the machine on the first boot using the AWS UI
Running cloud-init status --long shows you cloud-init falls back to DataSourceNone and there are metadata and NIC errors
If you check cloud-init status --long, it will now have detail: DataSourceEc2 but the NIC errors will still appear:
/usr/lib/python3/dist-packages/cloudinit/sources/DataSourceEc2.py file
sudo cloud-init clean --logs --reboot
After applying the patch, cleaning cloud-init and rebooting, cloud-init status --long will say
And you can check the patch via the following commands:
The logs will show something like:
Merge type