Kubernetes Cluster Quota Predictions are wrong - clusters being made that can never create #427

@AlexCK-STFC

Quota predictions are wrong, especially for non-standard control plane counts.

This is bad, as it's allowing clusters that go over quota to be created, and CAPO then gets stuck 409-ing forever.

See:

def _from_api_cluster_template(self, ct):
    """
    Converts a cluster template from the Kubernetes API to a DTO.
    """
    values = ct.spec["values"]
    # We only need to account for the etcd volume if it has type Volume
    etcd_volume_size = 0
    etcd_volume = values.get("etcd", {}).get("blockDevice")
    if etcd_volume and etcd_volume.get("type", "Volume") == "Volume":
        etcd_volume_size = etcd_volume["size"]
    return dto.ClusterTemplate(
        ct.metadata.name,
        ct.spec.label,
        ct.spec.get("description"),
        values["kubernetesVersion"],
        ct.spec.get("deprecated", False),
        values.get("controlPlane", {}).get("machineCount", 3),
        etcd_volume_size,
        values.get("controlPlane", {}).get("machineRootVolume", {}).get("diskSize")
        or 0,
        values.get("nodeGroupDefaults", {})
        .get("machineRootVolume", {})
        .get("diskSize")
        or 0,
        ct.spec.get("tags", []),
        dateutil.parser.parse(ct.metadata["creationTimestamp"]),
    )

If I'm tracing the logic right, it's using values like controlPlane.machineCount, and root volume disk size, that don't actually exist in the cluster templates:
https://github.com/azimuth-cloud/ansible-collection-azimuth-ops/blob/e70cc3d808af3b206033d04c595860aeb74660b3/roles/azimuth_capi_operator/defaults/main.yml#L557-L566
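To illustrate the fallback, here is a minimal sketch of what those lookups return when the keys are absent. The `values` dict is a made-up stand-in for a template's values (an assumption, not copied from azimuth-ops); the three lookups are the ones from the function quoted above.

```python
# Hypothetical template values with no controlPlane or nodeGroupDefaults keys.
# This dict is an assumption for illustration, not a real azimuth-ops template.
values = {"kubernetesVersion": "1.30.0"}

# Same lookups as in _from_api_cluster_template
control_plane_count = values.get("controlPlane", {}).get("machineCount", 3)
control_plane_root_volume = (
    values.get("controlPlane", {}).get("machineRootVolume", {}).get("diskSize") or 0
)
node_group_root_volume = (
    values.get("nodeGroupDefaults", {}).get("machineRootVolume", {}).get("diskSize")
    or 0
)

print(control_plane_count)        # 3
print(control_plane_root_volume)  # 0
print(node_group_root_volume)     # 0
```

So a template that omits these keys is always treated as 3 control plane nodes with no root volumes, whatever the cluster actually uses.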

For example, consider the quota predictions for this cluster, which has 6 nodes (5 control plane + 1 worker), each with 2 CPUs, 8GB RAM and a 50GB disk:

Image
  • CPUs are wrong: it assumes 8 more cores, which is only 4 new machines rather than 6.
    • As linked above, machineCount is missing and defaults to 3, plus the 1 worker node.
  • RAM is wrong: it assumes 32GB more, which is again only 4 new machines.
    • Same cause as CPU.
  • Volume Storage is potentially wrong: it's only counting the 3*10GB from Metrics/Logs/Alertmanager, not any control plane/node group volumes. Here, though, we haven't set root volume specs for the machines, so they just use the ephemeral root disk from the flavour.
  • Machine count is somehow right here. But on Dev, where we use 1 control plane node, it is wrong.
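The CPU and RAM figures above can be reproduced with a little arithmetic. This is only a sketch: `Size` and `predicted_usage` are hypothetical names, not Azimuth's calculator, and it deliberately leaves out the machine count prediction, which is evidently computed some other way since it comes out right.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Size:
    """A machine flavour; field names are illustrative, not Azimuth's DTO."""

    cpus: int
    ram_gb: int


def predicted_usage(size: Size, control_plane_count: int, worker_count: int) -> dict:
    """CPU and RAM for a cluster whose machines all share one size."""
    machines = control_plane_count + worker_count
    return {"cpus": machines * size.cpus, "ram_gb": machines * size.ram_gb}


size = Size(cpus=2, ram_gb=8)

# What the prediction appears to assume: the default of 3 control plane nodes
assumed = predicted_usage(size, control_plane_count=3, worker_count=1)
# What the cluster actually needs: 5 control plane nodes
actual = predicted_usage(size, control_plane_count=5, worker_count=1)

print(assumed)  # {'cpus': 8, 'ram_gb': 32}
print(actual)   # {'cpus': 12, 'ram_gb': 48}
```

The gap of 4 cores and 16GB is exactly the two control plane nodes that the default of 3 leaves out.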

I'm struggling to trace all the value sources as I'm unfamiliar with the codebase. Apologies for this issue being confusing. But I hope I've conveyed the central issue.

But as I understand it:

  • Some values are properly found from the node sizes/flavours:

    azimuth/api/azimuth/views.py

    Lines 1351 to 1399 in e70f579

    def kubernetes_cluster_check_quotas(session, cluster, template, **data):
        """
        Check the quotas for a Kubernetes cluster.
        """
        calculator = scheduling.KubernetesClusterCalculator(session)
        # Calculate the resources used by the current cluster
        if cluster:
            # Index the sizes that have already been loaded so we don't have to load them
            # again
            known_sizes = {}
            if "control_plane_size" in data:
                known_sizes[data["control_plane_size"].id] = data["control_plane_size"]
            for ng in data.get("node_groups", []):
                known_sizes[ng["machine_size"].id] = ng["machine_size"]
            # Calculate the data for the current state of the cluster
            current_data = {
                "control_plane_size": (
                    known_sizes[cluster.control_plane_size_id]
                    if cluster.control_plane_size_id in known_sizes
                    else session.find_size(cluster.control_plane_size_id)
                ),
                "node_groups": [
                    {
                        "name": ng.name,
                        "machine_size": (
                            known_sizes[ng.machine_size_id]
                            if ng.machine_size_id in known_sizes
                            else session.find_size(ng.machine_size_id)
                        ),
                        "autoscale": ng.autoscale,
                        "count": ng.count,
                        "min_count": ng.min_count,
                        "max_count": ng.max_count,
                    }
                    for ng in cluster.node_groups
                ],
                "monitoring_enabled": cluster.monitoring_enabled,
                "monitoring_metrics_volume_size": cluster.monitoring_metrics_volume_size,
                "monitoring_logs_volume_size": cluster.monitoring_logs_volume_size,
            }
            # Calculate the resources for the current state of the cluster
            current_resources = calculator.calculate(template, **current_data)
            # Overwrite with any changes from the incoming data
            data = {**current_data, **data}
        else:
            current_resources = None
        future_resources = calculator.calculate(template, **data)
        checker = scheduling.QuotaChecker(session)
        return [future_resources, *checker.check(future_resources, current_resources)]

    azimuth/api/azimuth/views.py

    Lines 1426 to 1428 in e70f579

    _, fits, quotas = kubernetes_cluster_check_quotas(
        session, None, **input_serializer.validated_data
    )
  • Other values, like control plane count and node volume information, are looked up in the template by _from_api_cluster_template (quoted above) but don't exist there.
    • These should either be added to the template in azimuth-ops or found dynamically like the other values.
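As a rough sketch of the "found dynamically" option, the lookup could prefer an explicitly supplied count and only fall back to the template and then the default. Everything here is hypothetical: `resolve_control_plane_count` and the `control_plane_count` key are invented for illustration and are not part of Azimuth's API, and I don't know where the real count would best come from.

```python
DEFAULT_CONTROL_PLANE_COUNT = 3  # same default as the quoted template lookup


def resolve_control_plane_count(template_values, overrides=None):
    """Prefer an explicit override, then the template value, then the default.

    `overrides` stands in for wherever the real count could be read from
    (hypothetical; not an existing Azimuth structure).
    """
    overrides = overrides or {}
    if overrides.get("control_plane_count") is not None:
        return overrides["control_plane_count"]
    return template_values.get("controlPlane", {}).get(
        "machineCount", DEFAULT_CONTROL_PLANE_COUNT
    )


# An explicit count wins over a template that says nothing
print(resolve_control_plane_count({}, {"control_plane_count": 5}))  # 5
# The template value is used when it is present
print(resolve_control_plane_count({"controlPlane": {"machineCount": 1}}))  # 1
# Otherwise the behaviour matches today's default
print(resolve_control_plane_count({}))  # 3
```

The same precedence would apply to the root volume sizes, which currently fall back to 0.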

Side Note: It'd be nice to have quota checks for additional OpenStack quotas that aren't already here, like network security group count. These should also be added to the "Quotas" page on the sidebar.

Image
