Autoscale or initial setup is taking too long #688
Replies: 1 comment · 5 replies
So it seems that something was broken with the NAT gateway. I deleted everything and recreated the cluster and the NAT. Now the deployment is faster, and autoscaling takes about 3-5 minutes to spin up new nodes.
Glad you sorted it out. Would you mind describing your configuration in detail? It may be worth adding to the docs.
Of course. I plan to share my full experience later, but here is what has happened so far.

Step 1

Create a NAT server. Super important: do not forget to create the route 0.0.0.0/0 and point it to your NAT server (for example, 10.0.0.2).

Step 2

Warning: if you already have a cluster up and you want to rebuild from scratch, DELETE any etcd backups (if you are using them). I was using S3 backups and had to delete those as well; otherwise something related to caching goes wrong.

On the first attempt, I had the networking like this:

But eventually, I changed that to 0.0.0.0/0 as shown below.

Then I tried the same setup as in the documentation, Private_clusters_with_public_network_interface_disabled. That didn't work as expected, so again I had to mix and match. I updated the DNS to Hetzner's. This change can probably be avoided, but I was too tired to retry it.

So the final cluster_config.yaml is the following:

Step 3

As I had decided to use Traefik for ingress, my initial assumption, which was wrong, was that the setup would create a Hetzner Load Balancer to expose the cluster. So, in order to achieve this, and to avoid dealing with shared storage that early, I decided to use cert-manager.

So, as a first step, the following needs to be changed to traefik.

Install cert-manager so that Traefik does not have to save acme.json, which would require managing shared storage for HA Traefik.

Verify:

Then use cert-manager-cloudflare.yaml

And apply

Finally, update Traefik with the following traefik-config.yaml

At this stage, you should be able to start deploying your applications. So far, I have tested deploying a Next.js application with the following YAML.

What comes next

The next step for me is to test the CSI driver. One of the use cases I need to manage, while basic and probably not ideal, is WordPress hosting. At first, I was thinking of Longhorn to get RWX, but the longer I thought about it, the clearer the drawback of shared storage between nodes became: if I need 1 TB, I have to spend a lot of money on nodes with large disks.

So the hcloud volumes seem to be a nice solution, at least for the initial setup, although I will not have RWX. I saw some posts about a solution for this, but if anyone has a suggestion, feel free to share.
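The routing requirement from Step 1 can be sanity-checked before rebuilding anything. Below is a minimal sketch, not part of the tool itself: the helper name `check_nat_route` is hypothetical, and the private network range `10.0.0.0/16` is an assumption chosen to match the example gateway 10.0.0.2. It only confirms that the gateway sits inside the private network and that the route destination really is a default route.

```python
import ipaddress


def check_nat_route(network_cidr, gateway_ip, route_destination="0.0.0.0/0"):
    """Return a list of problems found with a NAT default route (empty if none)."""
    problems = []
    network = ipaddress.ip_network(network_cidr)
    gateway = ipaddress.ip_address(gateway_ip)
    destination = ipaddress.ip_network(route_destination)

    # The NAT server must have an address inside the private network,
    # otherwise the nodes cannot reach it as a next hop.
    if gateway not in network:
        problems.append(f"gateway {gateway} is outside {network}")

    # A default route covers every destination, which means prefix length 0.
    if destination.prefixlen != 0:
        problems.append(f"{destination} is not a default route")

    return problems


# Assumed values: 10.0.0.0/16 is illustrative, 10.0.0.2 is the example NAT server.
print(check_nat_route("10.0.0.0/16", "10.0.0.2"))
```

An empty list means both checks passed. It says nothing about whether the NAT server itself forwards traffic correctly.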
Thanks a lot! I will need to investigate adding proper support for dedicated servers at some point, and this info can be helpful with that too.
Thank you for this great tool. I will get back with more feedback and some suggestions/feature requests, let's say.
I have a bastion machine which I also use as NAT for the nodes.
The cluster-config.yml has this:
If I SSH into a node, I can ping Google, quay.io, or anything else without problems. I do not really understand the following,
and then:
Is this a network configuration issue, or a problem with the versions? I was expecting the cluster init or scale-up to be fast, not around 15 minutes.
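A successful ping does not show how long TCP connections through the NAT take, and image pulls depend on those. The following is a rough diagnostic sketch to run on a node, assuming Python 3 is available there. The helper name `time_tcp_connect` is hypothetical, and port 443 is an assumption based on registries being served over HTTPS.

```python
import socket
import time


def time_tcp_connect(host, port=443, timeout=5.0):
    """Return the seconds needed to open a TCP connection, or None on failure."""
    start = time.monotonic()
    try:
        # create_connection resolves the name and completes the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return None
```

Calling `time_tcp_connect("quay.io")` on a node and on the bastion and comparing the results gives a first hint. A much higher value on the node, or `None`, points toward the NAT or routing setup rather than the software versions.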