Quota predictions are wrong, especially for non-standard control plane counts.
This is bad, as it's allowing clusters to be created that go over quota, and CAPO gets stuck 409-ing forever.
See:
azimuth/api/azimuth/cluster_api/base.py, lines 128 to 154 in e70f579:

```python
def _from_api_cluster_template(self, ct):
    """
    Converts a cluster template from the Kubernetes API to a DTO.
    """
    values = ct.spec["values"]
    # We only need to account for the etcd volume if it has type Volume
    etcd_volume_size = 0
    etcd_volume = values.get("etcd", {}).get("blockDevice")
    if etcd_volume and etcd_volume.get("type", "Volume") == "Volume":
        etcd_volume_size = etcd_volume["size"]
    return dto.ClusterTemplate(
        ct.metadata.name,
        ct.spec.label,
        ct.spec.get("description"),
        values["kubernetesVersion"],
        ct.spec.get("deprecated", False),
        values.get("controlPlane", {}).get("machineCount", 3),
        etcd_volume_size,
        values.get("controlPlane", {}).get("machineRootVolume", {}).get("diskSize")
        or 0,
        values.get("nodeGroupDefaults", {})
        .get("machineRootVolume", {})
        .get("diskSize")
        or 0,
        ct.spec.get("tags", []),
        dateutil.parser.parse(ct.metadata["creationTimestamp"]),
    )
```
If I'm tracing the logic right, it's using values like `controlPlane.machineCount` and the root volume disk sizes, which don't actually exist in the cluster templates:
https://github.com/azimuth-cloud/ansible-collection-azimuth-ops/blob/e70cc3d808af3b206033d04c595860aeb74660b3/roles/azimuth_capi_operator/defaults/main.yml#L557-L566
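As a minimal sketch of that defaulting behaviour (the `values` dict here is hypothetical, standing in for a template whose Helm values omit `controlPlane`, as in the azimuth-ops defaults linked above):

```python
# A template with no controlPlane key, like the azimuth-ops defaults
values = {"kubernetesVersion": "1.29.0"}  # hypothetical template values

# These are the same lookups the DTO conversion performs
control_plane_count = values.get("controlPlane", {}).get("machineCount", 3)
root_volume_size = (
    values.get("controlPlane", {}).get("machineRootVolume", {}).get("diskSize") or 0
)

print(control_plane_count)  # 3, regardless of the real control plane size
print(root_volume_size)     # 0, regardless of the real root volume
```

So whenever the template omits these keys, the quota prediction silently uses 3 control plane machines and 0-size root volumes.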
I.e. consider the quota predictions for this cluster, which has 6 nodes (5 control plane + 1 worker), each with 2 CPU, 8GB RAM and a 50GB disk:
- CPUs are wrong: the prediction assumes only 8 more cores, i.e. only 4 new machines.
  - Because, as linked above, `machineCount` is missing from the template and defaults to 3; add the 1 worker node and you get 4.
- RAM is wrong: the prediction assumes only 32GB more, again 4 new machines.
  - Same cause as CPUs.
- Volume storage is potentially wrong: it's only counting the 3×10GB from Metrics/Logs/Alertmanager, not any control plane/node group volumes. Here, though, we haven't set root volume specs for the machines, so they just use the ephemeral root disk for the flavour.
  - Similar to above: the control plane `machineRootVolume.diskSize` is missing and defaults to 0, and the node group default `diskSize` is missing and defaults to 0. The Metrics/Logs/Alertmanager volumes total the 30GB.
- Machine count happens to be right here. But on Dev, where we use 1 control plane node, it would be wrong.
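The mismatch above can be sketched numerically (a back-of-envelope check using the figures from this cluster, not Azimuth code):

```python
# Cluster described above: 5 control plane nodes + 1 worker,
# each 2 vCPU / 8 GB RAM
actual_machines = 5 + 1
# Prediction: machineCount missing -> defaults to 3, plus the 1 worker
predicted_machines = 3 + 1

cpu_per_machine, ram_per_machine = 2, 8

predicted_cpus = predicted_machines * cpu_per_machine  # 8 cores predicted
actual_cpus = actual_machines * cpu_per_machine        # 12 cores really needed
predicted_ram = predicted_machines * ram_per_machine   # 32 GB predicted
actual_ram = actual_machines * ram_per_machine         # 48 GB really needed

print(predicted_cpus, actual_cpus)  # 8 12
print(predicted_ram, actual_ram)    # 32 48
```

So the check under-predicts by 4 cores and 16GB for this cluster, which is how it can pass while the real create blows the quota.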
I'm struggling to trace all the value sources as I'm unfamiliar with the codebase, so apologies if this issue is confusing, but I hope I've conveyed the central problem. As I understand it:
- Some values are properly found from the node sizes/flavours:

Lines 1351 to 1399 in e70f579:

```python
def kubernetes_cluster_check_quotas(session, cluster, template, **data):
    """
    Check the quotas for a Kubernetes cluster.
    """
    calculator = scheduling.KubernetesClusterCalculator(session)
    # Calculate the resources used by the current cluster
    if cluster:
        # Index the sizes that have already been loaded so we don't have to load
        # them again
        known_sizes = {}
        if "control_plane_size" in data:
            known_sizes[data["control_plane_size"].id] = data["control_plane_size"]
        for ng in data.get("node_groups", []):
            known_sizes[ng["machine_size"].id] = ng["machine_size"]
        # Calculate the data for the current state of the cluster
        current_data = {
            "control_plane_size": (
                known_sizes[cluster.control_plane_size_id]
                if cluster.control_plane_size_id in known_sizes
                else session.find_size(cluster.control_plane_size_id)
            ),
            "node_groups": [
                {
                    "name": ng.name,
                    "machine_size": (
                        known_sizes[ng.machine_size_id]
                        if ng.machine_size_id in known_sizes
                        else session.find_size(ng.machine_size_id)
                    ),
                    "autoscale": ng.autoscale,
                    "count": ng.count,
                    "min_count": ng.min_count,
                    "max_count": ng.max_count,
                }
                for ng in cluster.node_groups
            ],
            "monitoring_enabled": cluster.monitoring_enabled,
            "monitoring_metrics_volume_size": cluster.monitoring_metrics_volume_size,
            "monitoring_logs_volume_size": cluster.monitoring_logs_volume_size,
        }
        # Calculate the resources for the current state of the cluster
        current_resources = calculator.calculate(template, **current_data)
        # Overwrite with any changes from the incoming data
        data = {**current_data, **data}
    else:
        current_resources = None
    future_resources = calculator.calculate(template, **data)
    checker = scheduling.QuotaChecker(session)
    return [future_resources, *checker.check(future_resources, current_resources)]
```
Lines 1426 to 1428 in e70f579:

```python
_, fits, quotas = kubernetes_cluster_check_quotas(
    session, None, **input_serializer.validated_data
)
```

- Other values, like the control plane count and node volume information, are being found from the template but don't exist there:
azimuth/api/azimuth/cluster_api/base.py, lines 128 to 154 in e70f579:

```python
def _from_api_cluster_template(self, ct):
    """
    Converts a cluster template from the Kubernetes API to a DTO.
    """
    values = ct.spec["values"]
    # We only need to account for the etcd volume if it has type Volume
    etcd_volume_size = 0
    etcd_volume = values.get("etcd", {}).get("blockDevice")
    if etcd_volume and etcd_volume.get("type", "Volume") == "Volume":
        etcd_volume_size = etcd_volume["size"]
    return dto.ClusterTemplate(
        ct.metadata.name,
        ct.spec.label,
        ct.spec.get("description"),
        values["kubernetesVersion"],
        ct.spec.get("deprecated", False),
        values.get("controlPlane", {}).get("machineCount", 3),
        etcd_volume_size,
        values.get("controlPlane", {}).get("machineRootVolume", {}).get("diskSize")
        or 0,
        values.get("nodeGroupDefaults", {})
        .get("machineRootVolume", {})
        .get("diskSize")
        or 0,
        ct.spec.get("tags", []),
        dateutil.parser.parse(ct.metadata["creationTimestamp"]),
    )
```

- These should be added to the template in azimuth-ops, or found dynamically like the other values.
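One possible direction, sketched below. This is not the actual fix and all names here are hypothetical, not real Azimuth APIs: prefer the values from the cluster being created/updated when they're present, and only fall back to the template (and only last of all to a hard-coded default).

```python
def effective_control_plane_count(cluster_values: dict, template_values: dict) -> int:
    """Resolve the control plane machine count, preferring the most
    specific source: cluster values, then template values, then a
    hard-coded fallback (hypothetical helper, for illustration only)."""
    for values in (cluster_values, template_values):
        count = values.get("controlPlane", {}).get("machineCount")
        if count is not None:
            return count
    # Last resort only; ideally azimuth-ops templates would always set this
    return 3

print(effective_control_plane_count({"controlPlane": {"machineCount": 5}}, {}))  # 5
print(effective_control_plane_count({}, {"controlPlane": {"machineCount": 1}}))  # 1
print(effective_control_plane_count({}, {}))                                     # 3
```

The same precedence could apply to the root volume sizes, so that a dev cluster with 1 control plane node or a HA cluster with 5 would both be predicted correctly.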
Side note: it'd be nice to have quota checks for additional OpenStack quotas that aren't already covered here, like the network security group count. These should also be added to the "Quotas" page on the sidebar.
