117 changes: 117 additions & 0 deletions .claude/test-cheatsheet.md
@@ -0,0 +1,117 @@
# Mantis Testing Cheat Sheet

Quick reference for common testing commands and patterns.

## 🚀 Quick Commands

### Run Control Plane Tests
```bash
# All standard tests
./gradlew :mantis-control-plane:mantis-control-plane-server:test --rerun-tasks

# All Akka tests (single-threaded)
./gradlew :mantis-control-plane:mantis-control-plane-server:akkaTest --rerun-tasks

# Both suites
./gradlew :mantis-control-plane:mantis-control-plane-server:test :mantis-control-plane:mantis-control-plane-server:akkaTest --rerun-tasks

# Specific test class
./gradlew :mantis-control-plane:mantis-control-plane-server:test --tests "*.ResourceClusterActorTest" --rerun-tasks

# Specific test method
./gradlew :mantis-control-plane:mantis-control-plane-server:akkaTest --tests "*.JobClusterAkkaTest.testScaleStage" --rerun-tasks
```

## 🤖 Subagent Prompts (Copy-Paste)

### Basic
```
Run mantis-control-plane-server akkaTest and fix any failures
```

### Comprehensive
```
Use a subagent to run both test suites, analyze all failures, and create a detailed report
```

### Parallel
```
Run test and akkaTest in parallel using subagents, then summarize results
```

### Investigate
```
Investigate why testAssignmentTimeout is failing using a subagent
```

### Background
```
Run the full test suite in the background
```

## 📊 Check Results

```bash
# View HTML report
open mantis-control-plane/mantis-control-plane-server/build/reports/tests/test/index.html
open mantis-control-plane/mantis-control-plane-server/build/reports/tests/akkaTest/index.html

# Find failures in XML
find mantis-control-plane/mantis-control-plane-server/build/test-results -name "*.xml" -exec grep -l "failure" {} \;

# Extract failure details
grep -A 20 "<failure" mantis-control-plane/mantis-control-plane-server/build/test-results/test/TEST-*.xml
```

## 🔍 Common Issues & Fixes

### NullPointerException from Mocks
```java
// Add stub in test setup
when(gateway.submitTask(any())).thenReturn(CompletableFuture.completedFuture(Ack.getInstance()));
```

### Scheduler API Verification Failures
```java
// OLD (fails)
verify(schedulerMock).scheduleWorkers(any());

// NEW (correct)
verify(schedulerMock).upsertReservation(any());
```

### Assignment Timeout Test
```java
// For timeout tests, use hanging future
CompletableFuture<Ack> hangingFuture = new CompletableFuture<>();
when(gateway.submitTask(any())).thenReturn(hangingFuture);
```
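
For intuition on why a never-completed future exercises the timeout path, the standalone snippet below reproduces the behavior with plain `java.util.concurrent` and JUnit 5, independent of the Mantis test harness.

```java
import static org.junit.jupiter.api.Assertions.assertThrows;

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import org.junit.jupiter.api.Test;

class HangingFutureSketchTest {
    @Test
    void hangingFutureTripsCallerTimeout() {
        // The future is never completed, so any bounded wait on it ends in a
        // TimeoutException -- the same mechanism the hanging-future stub uses
        // to drive the assignment-timeout path.
        CompletableFuture<String> hangingFuture = new CompletableFuture<>();
        assertThrows(TimeoutException.class,
            () -> hangingFuture.get(100, TimeUnit.MILLISECONDS));
    }
}
```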

## 📁 Key Files

| File | Purpose |
|------|---------|
| `JobTestHelper.java:535` | Mock scheduler setup |
| `ResourceClusterActorTest.java` | Main resource cluster tests |
| `JobClusterAkkaTest.java` | Job cluster Akka tests |
| `build.gradle` (root) | `akkaTest` task config |

## 🎯 Test Suites

| Suite | Count | Runtime | Notes |
|-------|-------|---------|-------|
| `test` | 330+ | ~2 min | Standard unit tests |
| `akkaTest` | 81+ | ~5 sec | Akka actor tests, single-threaded |

## 💡 Pro Tips

1. **Always use `--rerun-tasks`** to bypass Gradle cache
2. **Run specific tests first** when debugging
3. **Check HTML reports** for detailed failure info
4. **Use subagents** for multi-step workflows
5. **Run flaky tests multiple times** to confirm stability

## 🔗 See Also

- [Full Testing Guide](.claude/testing-guide.md) - Comprehensive testing documentation
- [Subagent Examples](.claude/subagent-examples.md) - Copy-paste subagent prompts
44 changes: 22 additions & 22 deletions .github/workflows/nebula-snapshot.yml
@@ -38,25 +38,25 @@ jobs:
NETFLIX_OSS_REPO_PASSWORD: ${{ secrets.ORG_NETFLIXOSS_PASSWORD }}
with:
arguments: --info --stacktrace build -x test snapshot
- name: Create PR Comment String
run: |
src=${GITHUB_WORKSPACE}/build/versions.txt
dest=${GITHUB_WORKSPACE}/build/versions2.txt
echo 'resolutionStrategy {' > $dest
awk '{print " force \"" $0 "\""}' $src >> $dest
echo '}' >> $dest
echo "PR_STR<<EOF" >> $GITHUB_ENV
cat ${dest} >> $GITHUB_ENV
echo 'EOF' >> $GITHUB_ENV
- name: Upload
uses: mshick/add-pr-comment@v2
with:
message: |
## Uploaded Artifacts
To use these artifacts in your Gradle project, paste the following lines in your build.gradle.
```
${{ env.PR_STR }}
```
message-id: ${{ github.event.number }} # For sticky messages
repo-token: ${{ secrets.GITHUB_TOKEN }}
allow-repeats: false # This is the default
# - name: Create PR Comment String
# run: |
# src=${GITHUB_WORKSPACE}/build/versions.txt
# dest=${GITHUB_WORKSPACE}/build/versions2.txt
# echo 'resolutionStrategy {' > $dest
# awk '{print " force \"" $0 "\""}' $src >> $dest
# echo '}' >> $dest
# echo "PR_STR<<EOF" >> $GITHUB_ENV
# cat ${dest} >> $GITHUB_ENV
# echo 'EOF' >> $GITHUB_ENV
# - name: Upload
# uses: mshick/add-pr-comment@v2
# with:
# message: |
# ## Uploaded Artifacts
# To use these artifacts in your Gradle project, paste the following lines in your build.gradle.
# ```
# ${{ env.PR_STR }}
# ```
# message-id: ${{ github.event.number }} # For sticky messages
# repo-token: ${{ secrets.GITHUB_TOKEN }}
# allow-repeats: false # This is the default
38 changes: 20 additions & 18 deletions build.gradle
@@ -118,8 +118,9 @@ allprojects {
}
}

def printAllReleasedArtifacts = project.tasks.create('printAllReleasedArtifacts')
project.snapshot.configure { finalizedBy printAllReleasedArtifacts }
// def printAllReleasedArtifacts = project.tasks.create('printAllReleasedArtifacts')
// project.snapshot.configure { finalizedBy printAllReleasedArtifacts }
// project.tasks.named("build").configure { finalizedBy printAllReleasedArtifacts }
subprojects {
apply plugin: 'java-library'

@@ -172,22 +172,23 @@ subprojects {
options.compilerArgs << "-Xlint:deprecation"
}

project.plugins.withType(MavenPublishPlugin) {
def printReleasedArtifact = project.tasks.create('printReleasedArtifact')
printReleasedArtifact.doLast {
def file1 = file("${buildDir}/tmp/publishNebulaPublicationToNetflixOSSRepository/maven-metadata.xml")
def file2 = file("${buildDir}/tmp/publishNebulaPublicationToNetflixOSSRepository/snapshot-maven-metadata.xml")
def xmlText = file1.exists() ? file1.text : (file2.exists() ? file2.text : "file not found")
def xml = new XmlParser(false, false).parseText(xmlText)
def snapshotVersion = xml.versioning.snapshotVersions.snapshotVersion[0].'value'.text()
logger.lifecycle("${project.group}:${project.name}:${snapshotVersion}")
file("${project.rootProject.buildDir}/versions.txt").append("${project.group}:${project.name}:${snapshotVersion}" + '\n')
}

printReleasedArtifact.dependsOn(project.rootProject.snapshot)
printAllReleasedArtifacts.dependsOn("${project.path}:printReleasedArtifact")
}

// project.plugins.withType(MavenPublishPlugin) {
// def printReleasedArtifact = project.tasks.create('printReleasedArtifact')
// printReleasedArtifact.doLast {
// def file1 = file("${buildDir}/tmp/publishNebulaPublicationToNetflixOSSRepository/maven-metadata.xml")
// def file2 = file("${buildDir}/tmp/publishNebulaPublicationToNetflixOSSRepository/snapshot-maven-metadata.xml")
// def xmlText = file1.exists() ? file1.text : (file2.exists() ? file2.text : "file not found")
// logger.quiet("maven-metadata.xml prefix: ${xmlText.take(100)}")
// def xml = new XmlParser(false, false).parseText(xmlText)
// def snapshotVersion = xml.versioning.snapshotVersions.snapshotVersion[0].'value'.text()
// logger.lifecycle("${project.group}:${project.name}:${snapshotVersion}")
// file("${project.rootProject.buildDir}/versions.txt").append("${project.group}:${project.name}:${snapshotVersion}" + '\n')
// }

// printAllReleasedArtifacts.dependsOn("${project.path}:printReleasedArtifact")
// }

// test tasks config
task akkaTest(type: Test) {
maxParallelForks = 1
filter {
107 changes: 107 additions & 0 deletions context/archive/reservation/context-scheduling-logic.md
@@ -0,0 +1,107 @@
# Scheduling Logic and Resource Cluster Interactions

This document explains how job creation and job scaling translate into scheduling and resource changes, and how those changes interact with the Resource Cluster and the Resource Cluster Scaler.

## A) Job Creation → Scheduling → Task Execution

```mermaid
sequenceDiagram
autonumber
participant Client
participant JCM as JobClustersManagerActor
participant JC as JobClusterActor
participant JA as JobActor
participant MS as MantisScheduler
participant RCSA as ResourceClusterAwareSchedulerActor
participant RC as ResourceClusterActor
participant TE as Task Executor

Client->>JCM: SubmitJobRequest
JCM->>JC: forward(request)
JC->>JA: InitJob (initialize workers)
JA->>MS: scheduleWorkers(BatchScheduleRequest)
MS->>RCSA: BatchScheduleRequestEvent
RCSA->>RC: getTaskExecutorsFor(allocationRequests)
RC-->>RCSA: allocations (TaskExecutorID -> AllocationRequest)
RCSA->>TE: submitTask(executeStageRequestFactory.of(...))
TE-->>RCSA: Ack / Failure
RCSA->>JA: WorkerLaunched / WorkerLaunchFailed
Note over RC: Inventory & matching by ExecutorStateManagerImpl
```

Key notes:
- Job submission enters at `JobClustersManagerActor.onJobSubmit`, is forwarded to `JobClusterActor`, and `JobActor.initialize` then constructs the workers.
- `JobActor` enqueues the workers with the resource-cluster-aware scheduler as a batch schedule request.
- `ResourceClusterAwareSchedulerActor` queries `ResourceClusterActor` for best-fit executors and pushes tasks to the TEs via `TaskExecutorGateway` (see the sketch below).
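
To make the push model concrete, here is a minimal, self-contained sketch of the allocate-then-push sequence; `AllocationRequest`, `TaskExecutorId`, and the one-method `TaskExecutorGateway` are illustrative stand-ins, not the real Mantis types or signatures.

```java
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;

class PushSchedulingSketch {
    record AllocationRequest(String jobId, int workerIndex) {}
    record TaskExecutorId(String hostId) {}

    interface TaskExecutorGateway {
        CompletableFuture<Void> submitTask(AllocationRequest request); // ack on success
    }

    // Resource-cluster side: match each request to an available executor.
    // Requests left over when executors run out become pending demand.
    static Map<TaskExecutorId, AllocationRequest> allocate(
            List<AllocationRequest> requests, Deque<TaskExecutorId> availableExecutors) {
        Map<TaskExecutorId, AllocationRequest> allocations = new LinkedHashMap<>();
        for (AllocationRequest request : requests) {
            if (availableExecutors.isEmpty()) {
                break; // remainder is recorded as pending (see the scaling coupling below)
            }
            allocations.put(availableExecutors.pop(), request);
        }
        return allocations;
    }

    // Scheduler side: push each allocated task to its executor's gateway and
    // translate the outcome into a launched / launch-failed signal for the job.
    static void pushTasks(Map<TaskExecutorId, AllocationRequest> allocations,
                          Map<TaskExecutorId, TaskExecutorGateway> gateways) {
        allocations.forEach((executor, request) ->
            gateways.get(executor).submitTask(request).whenComplete((ignored, error) -> {
                // error == null -> report WorkerLaunched to the job actor
                // error != null -> report WorkerLaunchFailed and let the scheduler retry
            }));
    }
}
```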

## B) Job Stage Scale-Up/Down and Resource Cluster Scaler Loop

```mermaid
sequenceDiagram
autonumber
participant JA as JobActor
participant WM as WorkerManager
participant MS as MantisScheduler
participant RCSA as ResourceClusterAwareSchedulerActor
participant RC as ResourceClusterActor
participant RCS as ResourceClusterScalerActor
participant Host as ResourceClusterHostActor
participant TE as Task Executor

JA->>JA: onScaleStage(StageScalingPolicy, ScalerRule min/max)
JA->>WM: scaleStage(newNumWorkers)
alt Scale Up
WM->>MS: scheduleWorkers(BatchScheduleRequest)
MS->>RCSA: BatchScheduleRequestEvent
RCSA->>RC: getTaskExecutorsFor(...)
RC-->>RCSA: allocations or NoResourceAvailable
RCSA->>TE: submitTask(...)
note over RC: pendingJobRequests updated when not enough TEs
else Scale Down
WM->>RC: unscheduleAndTerminateWorker(...)
end

loop Periodic Capacity Loop (every scalerPullThreshold)
RCS->>RC: GetClusterUsageRequest
RC-->>RCS: GetClusterUsageResponse (idle/total by SKU)
RCS->>RCS: apply rules (minIdleToKeep, maxIdleToKeep, cooldown)
alt ScaleUp decision
RCS->>Host: ScaleResourceRequest(desired size increase)
Host-->>RC: New TEs provisioned
TE->>RC: TaskExecutorRegistration + Heartbeats
RC->>RC: mark available (unless disabled)
else ScaleDown decision
RCS->>RC: GetClusterIdleInstancesRequest
RC-->>RCS: idle TE list
RCS->>Host: ScaleResourceRequest(desired size decrease, idleInstances)
RCS->>RC: DisableTaskExecutorsRequest(idle targets during drain)
end
end
```

Key notes:
- Job scale-up adds workers; if there are not enough TEs, the scheduler records the unmet demand as pending, so the scaler sees fewer effective idle slots and scales capacity up.
- Scale-down removes workers and may free idle TEs; the scaler can shrink capacity by selecting idle instances to drain and disabling them while they are deprovisioned (a simplified rule sketch follows).
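
The rule step in the loop above can be pictured as a small decision function. The sketch below is illustrative, not the actual `ResourceClusterScalerActor` code; it assumes the scaler compares effective idle capacity per SKU against a `[minIdleToKeep, maxIdleToKeep]` band and skips decisions during cooldown.

```java
// Sketch of the per-SKU rule evaluation: keep effective idle capacity inside
// the [minIdleToKeep, maxIdleToKeep] band, and do nothing while cooling down.
class ScalerRuleSketch {
    enum Decision { SCALE_UP, SCALE_DOWN, NO_OP }

    static Decision evaluate(int effectiveIdle, int minIdleToKeep,
                             int maxIdleToKeep, boolean inCooldown) {
        if (inCooldown) {
            return Decision.NO_OP;       // respect the cooldown window
        }
        if (effectiveIdle < minIdleToKeep) {
            return Decision.SCALE_UP;    // ask the host actor for more TEs
        }
        if (effectiveIdle > maxIdleToKeep) {
            return Decision.SCALE_DOWN;  // select idle TEs to drain and disable
        }
        return Decision.NO_OP;
    }
}
```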

## Code Reference Map

- Submission & job initialization:
- `JobClustersManagerActor.onJobSubmit(...)`
- `JobActor.onJobInitialize(...)`, `initialize(...)`, `submitInitialWorkers(...)`, `queueTasks(...)`

- Scheduling (push model):
- `ResourceClusterAwareSchedulerActor.onBatchScheduleRequestEvent(...)`, `onAssignedScheduleRequestEvent(...)`
- `ResourceClusterActor` + `ExecutorStateManagerImpl.findBestFit(...)`, `findBestGroupBySizeNameMatch(...)`, `findBestGroupByFitnessCalculator(...)`

- Job scaling:
- `JobActor.onScaleStage(...)` → `WorkerManager.scaleStage(...)`

- Resource-cluster autoscaling:
- `ResourceClusterScalerActor.onTriggerClusterUsageRequest(...)`, `onGetClusterUsageResponse(...)`, `onGetClusterIdleInstancesResponse(...)`
- `ExecutorStateManagerImpl.getClusterUsage(...)` (idle = available − pending), `pendingJobRequests`
- `ResourceClusterActor.onTaskExecutorRegistration(...)` (TE joins, becomes available)

## Coupling Between Scheduling and Scaling

- When allocations fail due to lack of capacity, the resource cluster records pending demand per job/group; the scaler subtracts this pending from available to compute effective idle, triggering scale-up.
- New TEs register with the resource cluster and are immediately considered for subsequent scheduling waves.
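
A toy model of that bookkeeping, assuming (per the reference map above) that effective idle is available minus pending; `UsageSketch` and its method names are hypothetical, not `ExecutorStateManagerImpl`'s API.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative bookkeeping: failed allocations register pending demand per SKU,
// and the usage the scaler sees subtracts that pending from available capacity.
class UsageSketch {
    private final Map<String, Integer> availableBySku = new HashMap<>();
    private final Map<String, Integer> pendingBySku = new HashMap<>();

    void executorRegistered(String sku) {                // a new TE joins and becomes available
        availableBySku.merge(sku, 1, Integer::sum);
    }

    void allocationFellShort(String sku, int missing) {  // demand the cluster could not satisfy
        pendingBySku.merge(sku, missing, Integer::sum);
    }

    int effectiveIdle(String sku) {                      // what the scaler compares to its rules
        int available = availableBySku.getOrDefault(sku, 0);
        int pending = pendingBySku.getOrDefault(sku, 0);
        return Math.max(0, available - pending);
    }
}
```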