feat: public pricing API + DaemonSet-aware node scoring #4

Merged
Guimove merged 5 commits into main from fix/pricing-enrichment
Feb 12, 2026
Conversation

Guimove (Owner) commented Feb 12, 2026

Summary

  • Replace the AWS Pricing API (pricing:GetProducts) with the public runs-on.com API — no pricing IAM permission needed
  • Remove aws-sdk-go-v2/service/pricing dependency entirely
  • Favor fewer, larger nodes per AWS EKS best practices
  • Update default instance families to current gen (m7i, c7i, r7i, m7a, c7a, r7a)
  • Fix "0 scenarios" log bug

Scoring changes

Per AWS: "fewer, larger instances are better, especially with many DaemonSets"

| Nodes | Old Score | New Score        |
|-------|-----------|------------------|
| 1     | 20        | 20               |
| 3     | 60        | 90               |
| 5–15  | 100       | 100              |
| 30    | 100       | 85               |
| 40    | 100       | 70 (DS penalty)  |
| 100   | 80        | 55 (DS penalty)  |

DaemonSet penalty: min(dsCount × nodeCount / 100 × 2, 20) — with 12 DS and 40 nodes, that's a −9.6 point penalty.
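A minimal sketch of that penalty (the function name is hypothetical; the factor of 2 per 100 node-DaemonSet pairs is inferred from the −9.6 worked example for 12 DS × 40 nodes):

```go
package main

import "fmt"

// daemonSetPenalty grows with dsCount × nodeCount, since every DaemonSet
// runs a pod on every node, and is capped at 20 points. The factor 2.0 is
// chosen to reproduce the -9.6 example above and is an assumption.
func daemonSetPenalty(dsCount, nodeCount int) float64 {
	p := float64(dsCount) * float64(nodeCount) / 100.0 * 2.0
	if p > 20 {
		return 20
	}
	return p
}

func main() {
	fmt.Println(daemonSetPenalty(12, 40))  // the PR's example: 9.6
	fmt.Println(daemonSetPenalty(12, 100)) // large cluster hits the 20-point cap
}
```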

Test plan

  • go test -race ./... — all pass
  • golangci-lint run — 0 issues
  • Real cluster: 880 workloads + 12 DS, pricing shows correct $/month values

Replace the AWS Pricing API (requires pricing:GetProducts IAM permission)
with the public runs-on.com EC2 pricing API. No authentication required,
includes both on-demand and spot prices, updated hourly.

- Remove aws-sdk-go-v2/service/pricing dependency entirely
- Simplify AWSProvider: only needs ec2:DescribeInstanceTypes permission
- EnrichWithPricing called automatically in GetInstanceTypes
- Update default instance families to current gen (m7i, c7i, r7i, etc.)
- Fix "0 scenarios" log message (was hardcoded, never updated)

Per AWS docs: "fewer, larger instances are better, especially if you
have a lot of DaemonSets" — each DS runs on every node, so more nodes
means more wasted resources on DS replicas.

- Refine resilience scoring: sweet spot at 5-15 nodes, progressive
  penalty above 30 nodes instead of flat 100 for 3-50
- Add DaemonSet overhead penalty: high DS count × high node count
  reduces the resilience score (up to -20 points)
- Pass DaemonSetCount from orchestrator to scorer
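The refined tiers can be sketched as a piecewise function. The anchor values (1 → 20, 3 → 90, 5–15 → 100, 30 → 85) come from the scoring table in this PR; the slopes between and beyond the anchors are assumptions:

```go
package main

import "fmt"

// resilienceScore maps node count to a resilience score: sweet spot at
// 5-15 nodes, progressive penalty above that. Interpolation between the
// table's anchor points is assumed, not taken from the PR.
func resilienceScore(nodes int) float64 {
	switch {
	case nodes <= 1:
		return 20 // single node: no redundancy
	case nodes < 5:
		return 90 // small but redundant (3 nodes scores 90)
	case nodes <= 15:
		return 100 // sweet spot
	case nodes <= 30:
		return 100 - float64(nodes-15) // progressive penalty, reaching 85 at 30
	default:
		s := 85 - float64(nodes-30)*0.5 // assumed slope past 30 nodes
		if s < 50 {
			s = 50
		}
		return s
	}
}

func main() {
	for _, n := range []int{1, 3, 10, 30} {
		fmt.Printf("%d nodes -> %.0f\n", n, resilienceScore(n))
	}
}
```

The DaemonSet penalty described above would then be subtracted from this base score.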
Add cluster-wide P95 CPU/memory and observed min/max node count queries
to capture HPA/autoscaler scaling peaks that per-pod instant snapshots
miss. Enforce a configurable minimum node count (default 3) as an HA
constraint in bin-packing. Compute scaling efficiency per candidate
instance type and penalize poor trough utilization in scoring.

- Add ClusterAggregateMetrics and ScalingEfficiency model types
- Replace unused peak replica queries with 4 cluster aggregate PromQL
- Add MinNodes to SimulationConfig (default 3) with BFD padding
- Compute trough CPU utilization from scaling ratio
- Penalize resilience score when trough utilization < 30%
- Display cluster P95, node range, min nodes in report headers
- Show [trough: XX%] warning in table/markdown notes
- Update README with full documentation and correct AWS requirements
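The trough-utilization rule can be sketched as follows. Only the 30% threshold comes from the commit message; the 15-point maximum and the linear shape are assumptions, and the function name is hypothetical:

```go
package main

import "fmt"

// troughPenalty applies no penalty at or above 30% trough CPU utilization,
// then a linear penalty below it, up to an assumed maximum of 15 points.
func troughPenalty(troughUtil float64) float64 {
	if troughUtil >= 0.30 {
		return 0
	}
	return (0.30 - troughUtil) / 0.30 * 15
}

func main() {
	fmt.Println(troughPenalty(0.50)) // healthy trough: no penalty
	fmt.Println(troughPenalty(0.15)) // half the threshold: mid-range penalty
}
```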
Per-pod effective sizing uses max(request, P95_usage) which inflates
CPU when pods over-request relative to actual usage. This led to
"compute-optimized" classification on a cluster that was actually
memory-bound (0.8 vCPU, 9.4 GiB → 11.75 GiB/vCPU).

Prefer cluster-level aggregate P95 CPU/memory (from the full metrics
window) for classification when available. Fall back to per-pod
totals when aggregate metrics are absent.
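A sketch of the aggregate-first classification. The thresholds are assumptions, loosely based on AWS family shapes (C-series ≈ 2 GiB/vCPU, M-series ≈ 4, R-series ≈ 8); the worked example is the cluster from this commit:

```go
package main

import "fmt"

// classify picks an instance-family class from the cluster-level P95
// memory-to-CPU ratio. Threshold values are assumptions.
func classify(p95CPUCores, p95MemGiB float64) string {
	ratio := p95MemGiB / p95CPUCores
	switch {
	case ratio < 3:
		return "compute-optimized"
	case ratio <= 6:
		return "general-purpose"
	default:
		return "memory-optimized"
	}
}

func main() {
	// The PR's example cluster: 0.8 vCPU and 9.4 GiB at P95
	// is 11.75 GiB/vCPU, clearly memory-bound.
	fmt.Println(classify(0.8, 9.4))
}
```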
When auto-classifying to an extreme (compute or memory-optimized),
also include M-series (general-purpose) families. Per-pod requests
may skew the bin-packing constraint away from the aggregate
classification — M-series provides a balanced middle ground that the
scorer evaluates alongside the primary family.

Fixes clusters where aggregate usage is memory-heavy but per-pod CPU
requests are inflated, causing R-series to be CPU-saturated (97%) with
wasted memory (32%).
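The family-widening described here can be sketched as below. `candidateFamilies` is a hypothetical helper; the family names are the PR's updated defaults (m7i, c7i, r7i and their AMD a-variants):

```go
package main

import "fmt"

// candidateFamilies widens an extreme auto-classification with
// general-purpose M-series, so the scorer can fall back to a balanced
// shape when per-pod requests skew the bin-packing constraint.
func candidateFamilies(class string) []string {
	switch class {
	case "compute-optimized":
		return []string{"c7i", "c7a", "m7i", "m7a"}
	case "memory-optimized":
		return []string{"r7i", "r7a", "m7i", "m7a"}
	default:
		return []string{"m7i", "m7a"}
	}
}

func main() {
	fmt.Println(candidateFamilies("memory-optimized"))
}
```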
Guimove merged commit 83e1e7c into main, Feb 12, 2026
3 checks passed
Guimove deleted the fix/pricing-enrichment branch February 12, 2026 14:42