Conversation


@xDev789 xDev789 commented Nov 28, 2025

Reuse existing client endpoint when configuring server interface.

Description

The WireGuard container now preserves the peer's endpoint when configuring the server interface, which allows for uninterrupted connectivity in the event of a PublicKey resource reconciliation (see the sketch below).

Fixes #3090
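
A minimal sketch of the idea, assuming the wgctrl library (golang.zx2c4.com/wgctrl) to drive WireGuard from Go; the package, function names, and the ReplacePeers-based flow are illustrative, not the exact Liqo code path:

```go
package wgserver

import (
	"net"

	"golang.zx2c4.com/wgctrl"
	"golang.zx2c4.com/wgctrl/wgtypes"
)

// currentEndpoint returns the endpoint WireGuard has already discovered
// for the given peer on iface, or nil if the peer is not present yet.
func currentEndpoint(c *wgctrl.Client, iface string, peer wgtypes.Key) *net.UDPAddr {
	dev, err := c.Device(iface)
	if err != nil {
		return nil
	}
	for _, p := range dev.Peers {
		if p.PublicKey == peer {
			return p.Endpoint
		}
	}
	return nil
}

// configurePeer re-applies the peer configuration (e.g. on a PublicKey
// reconciliation) while carrying over the previously discovered endpoint,
// so the tunnel is not interrupted by the update.
func configurePeer(c *wgctrl.Client, iface string, peer wgtypes.Key, allowedIPs []net.IPNet) error {
	return c.ConfigureDevice(iface, wgtypes.Config{
		// Replacing peers would otherwise drop the endpoint the kernel
		// has learned from the client's packets.
		ReplacePeers: true,
		Peers: []wgtypes.PeerConfig{{
			PublicKey:  peer,
			Endpoint:   currentEndpoint(c, iface, peer), // reuse the discovered endpoint
			AllowedIPs: allowedIPs,
		}},
	})
}
```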

How Has This Been Tested?

  • make unit
  • liqoctl install on Kind clusters

@adamjensenbot
Collaborator

Hi @xDev789. Thanks for your PR!

I am @adamjensenbot.
You can interact with me by issuing a slash command in the first line of a comment.
Currently, I understand the following commands:

  • /rebase: Rebase this PR onto the master branch (you can add the option test=true to launch the tests when the rebase operation is completed)
  • /merge: Merge this PR into the master branch
  • /build: Build Liqo components
  • /test: Launch the E2E and Unit tests
  • /hold, /unhold: Add/remove the hold label to prevent merging with /merge

Make sure this PR appears in the liqo changelog by adding one of the following labels:

  • feat: 🚀 New Feature
  • fix: 🐛 Bug Fix
  • refactor: 🧹 Code Refactoring
  • docs: 📝 Documentation
  • style: 💄 Code Style
  • perf: 🐎 Performance Improvement
  • test: ✅ Tests
  • chore: 🚚 Dependencies Management
  • build: 📦 Builds Management
  • ci: 👷 CI/CD
  • revert: ⏪ Reverts Previous Changes

cheina97 previously approved these changes Dec 16, 2025
@cheina97
Member

/rebase

@cheina97
Member

/rebase test=true

@cheina97
Member

Hi @xDev789, I've checked your PR and something is not clear to me. You are forcing the wg interface to always use the same peer endpoint, but what happens if it changes? I tested it with Cilium and noticed that this field is populated with the IP assigned to the cilium_host@cilium_net interface on the node. What happens if the gateway is rescheduled on another node? How did you test this PR?

@cheina97
Member

/rebase test=true

Reuse existing client endpoint when configuring server interface.
@cheina97
Member

I tried moving pods from one node to another, keeping the wrong IP in the server, and it seems that everything is working. It seems that the peer's endpoint is set in server mode but its value is ignored. I need to run some additional tests.

@xDev789
Author

xDev789 commented Dec 16, 2025

Hi @cheina97! Thank you for reviewing my PR. You are right to be sceptical about reusing the same peer endpoint, but it is only explicitly set when the configureDevice function gets called (i.e. when the PublicKey custom resource gets reconciled). If the peer endpoint changes afterwards, WireGuard will update it through its automatic endpoint roaming. This is the same mechanism Liqo currently relies on, except that without this PR, reconciling the PublicKey CR results in a connection interruption because the discovered peer endpoint gets erased. As for the testing part, I've tested it on two Kind clusters with two nodes each and it worked as expected. We also use the stable version of Liqo with these patches applied to establish tunnels between the control and worker clusters.
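
To make the roaming behaviour above concrete, here is a hypothetical check with wgctrl (the interface name is made up for illustration) that prints the endpoint WireGuard currently holds for each peer; after a client moves, this value changes on its own, without any call to configureDevice:

```go
package main

import (
	"fmt"
	"log"

	"golang.zx2c4.com/wgctrl"
)

func main() {
	client, err := wgctrl.New()
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// "liqo-tunnel" is an illustrative interface name, not necessarily
	// the one Liqo actually creates.
	dev, err := client.Device("liqo-tunnel")
	if err != nil {
		log.Fatal(err)
	}
	for _, p := range dev.Peers {
		// Endpoint reflects the source address of the last authenticated
		// packet from the peer, i.e. where WireGuard has roamed to.
		fmt.Printf("peer %s endpoint=%v last-handshake=%v\n",
			p.PublicKey, p.Endpoint, p.LastHandshakeTime)
	}
}
```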

@cheina97 cheina97 self-requested a review December 16, 2025 15:34
@cheina97 cheina97 dismissed their stale review December 16, 2025 15:34

Noticed a potential issue

@cheina97
Member

> Hi @cheina97! Thank you for reviewing my PR. You are right to be sceptical about reusing the same peer endpoint, but it is only explicitly set when the configureDevice function gets called (i.e. when the PublicKey custom resource gets reconciled). If the peer endpoint changes afterwards, WireGuard will update it through its automatic endpoint roaming. This is the same mechanism Liqo currently relies on, except that without this PR, reconciling the PublicKey CR results in a connection interruption because the discovered peer endpoint gets erased. As for the testing part, I've tested it on two Kind clusters with two nodes each and it worked as expected. We also use the stable version of Liqo with these patches applied to establish tunnels between the control and worker clusters.

I've tried it, and it seems that WireGuard was not able to automatically reconfigure the IP, since it is set forcefully by the controller. Without the explicit setup, that IP is updated correctly instead. I'm not against this change, but I would prefer to wait a little bit and test it properly.

Can you share all the scenarios you have tried? I would like to know which provider you used and which CNI you were running (also with the Kind clusters).

@xDev789
Author

xDev789 commented Dec 17, 2025

> I've tried it, and it seems that WireGuard was not able to automatically reconfigure the IP, since it is set forcefully by the controller. Without the explicit setup, that IP is updated correctly instead. I'm not against this change, but I would prefer to wait a little bit and test it properly.
>
> Can you share all the scenarios you have tried? I would like to know which provider you used and which CNI you were running (also with the Kind clusters).

I totally agree with you; I strongly believe in shipping a quality product and I'm not trying to rush this change either. We use RKE2 with Cilium in kube-proxy replacement mode for both the server and client clusters. With the Kind clusters, I also used Cilium. I've tried rescheduling the gateway client pod on another node. I've also tried the HA scenario where another pod obtains the lease and becomes the active gateway. Both scenarios resulted in the connection being re-established as expected. Could you share more details on the problematic case? You said it worked initially but some other tests helped you spot an issue.

@cheina97
Copy link
Member

> > I've tried it, and it seems that WireGuard was not able to automatically reconfigure the IP, since it is set forcefully by the controller. Without the explicit setup, that IP is updated correctly instead. I'm not against this change, but I would prefer to wait a little bit and test it properly.
> > Can you share all the scenarios you have tried? I would like to know which provider you used and which CNI you were running (also with the Kind clusters).
>
> I totally agree with you; I strongly believe in shipping a quality product and I'm not trying to rush this change either. We use RKE2 with Cilium in kube-proxy replacement mode for both the server and client clusters. With the Kind clusters, I also used Cilium. I've tried rescheduling the gateway client pod on another node. I've also tried the HA scenario where another pod obtains the lease and becomes the active gateway. Both scenarios resulted in the connection being re-established as expected. Could you share more details on the problematic case? You said it worked initially but some other tests helped you spot an issue.

Thanks for your help; we just need to test it properly with other CNIs. I just need some time to run the tests and process the results.


Development

Successfully merging this pull request may close these issues.

The connection between k8s clusters reconnects about every 10 hours
