117 changes: 117 additions & 0 deletions .claude/test-cheatsheet.md
@@ -0,0 +1,117 @@
# Mantis Testing Cheat Sheet

Quick reference for common testing commands and patterns.

## 🚀 Quick Commands

### Run Control Plane Tests
```bash
# All standard tests
./gradlew :mantis-control-plane:mantis-control-plane-server:test --rerun-tasks

# All Akka tests (single-threaded)
./gradlew :mantis-control-plane:mantis-control-plane-server:akkaTest --rerun-tasks

# Both suites
./gradlew :mantis-control-plane:mantis-control-plane-server:test :mantis-control-plane:mantis-control-plane-server:akkaTest --rerun-tasks

# Specific test class
./gradlew :mantis-control-plane:mantis-control-plane-server:test --tests "*.ResourceClusterActorTest" --rerun-tasks

# Specific test method
./gradlew :mantis-control-plane:mantis-control-plane-server:akkaTest --tests "*.JobClusterAkkaTest.testScaleStage" --rerun-tasks
```

## 🤖 Subagent Prompts (Copy-Paste)

### Basic
```
Run mantis-control-plane-server akkaTest and fix any failures
```

### Comprehensive
```
Use a subagent to run both test suites, analyze all failures, and create a detailed report
```

### Parallel
```
Run test and akkaTest in parallel using subagents, then summarize results
```

### Investigate
```
Investigate why testAssignmentTimeout is failing using a subagent
```

### Background
```
Run the full test suite in the background
```

## 📊 Check Results

```bash
# View HTML report
open mantis-control-plane/mantis-control-plane-server/build/reports/tests/test/index.html
open mantis-control-plane/mantis-control-plane-server/build/reports/tests/akkaTest/index.html

# Find failures in XML
find mantis-control-plane/mantis-control-plane-server/build/test-results -name "*.xml" -exec grep -l "failure" {} \;

# Extract failure details
grep -A 20 "<failure" mantis-control-plane/mantis-control-plane-server/build/test-results/test/TEST-*.xml
```

## 🔍 Common Issues & Fixes

### NullPointerException from Mocks
```java
// Add stub in test setup
when(gateway.submitTask(any())).thenReturn(CompletableFuture.completedFuture(Ack.getInstance()));
```

### Scheduler API Verification Failures
```java
// OLD (fails)
verify(schedulerMock).scheduleWorkers(any());

// NEW (correct)
verify(schedulerMock).upsertReservation(any());
```

### Assignment Timeout Test
```java
// For timeout tests, use hanging future
CompletableFuture<Ack> hangingFuture = new CompletableFuture<>();
when(gateway.submitTask(any())).thenReturn(hangingFuture);
```
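
For intuition on why a never-completed future exercises the timeout path, the standalone snippet below reproduces the behavior with plain `java.util.concurrent` and JUnit 5, independent of the Mantis test harness.

```java
import static org.junit.jupiter.api.Assertions.assertThrows;

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import org.junit.jupiter.api.Test;

class HangingFutureSketchTest {
    @Test
    void hangingFutureTripsCallerTimeout() {
        // The future is never completed, so any bounded wait on it ends in a
        // TimeoutException -- the same mechanism the hanging-future stub uses
        // to drive the assignment-timeout path.
        CompletableFuture<String> hangingFuture = new CompletableFuture<>();
        assertThrows(TimeoutException.class,
            () -> hangingFuture.get(100, TimeUnit.MILLISECONDS));
    }
}
```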

## 📁 Key Files

| File | Purpose |
|------|---------|
| `JobTestHelper.java:535` | Mock scheduler setup |
| `ResourceClusterActorTest.java` | Main resource cluster tests |
| `JobClusterAkkaTest.java` | Job cluster Akka tests |
| `build.gradle` (root) | `akkaTest` task config |

## 🎯 Test Suites

| Suite | Count | Runtime | Notes |
|-------|-------|---------|-------|
| `test` | 330+ | ~2 min | Standard unit tests |
| `akkaTest` | 81+ | ~5 sec | Akka actor tests, single-threaded |

## 💡 Pro Tips

1. **Always use `--rerun-tasks`** to bypass Gradle cache
2. **Run specific tests first** when debugging
3. **Check HTML reports** for detailed failure info
4. **Use subagents** for multi-step workflows
5. **Run flaky tests multiple times** to confirm stability

## 🔗 See Also

- [Full Testing Guide](.claude/testing-guide.md) - Comprehensive testing documentation
- [Subagent Examples](.claude/subagent-examples.md) - Copy-paste subagent prompts
44 changes: 22 additions & 22 deletions .github/workflows/nebula-snapshot.yml
@@ -38,25 +38,25 @@ jobs:
NETFLIX_OSS_REPO_PASSWORD: ${{ secrets.ORG_NETFLIXOSS_PASSWORD }}
with:
arguments: --info --stacktrace build -x test snapshot
- name: Create PR Comment String
run: |
src=${GITHUB_WORKSPACE}/build/versions.txt
dest=${GITHUB_WORKSPACE}/build/versions2.txt
echo 'resolutionStrategy {' > $dest
awk '{print " force \"" $0 "\""}' $src >> $dest
echo '}' >> $dest
echo "PR_STR<<EOF" >> $GITHUB_ENV
cat ${dest} >> $GITHUB_ENV
echo 'EOF' >> $GITHUB_ENV
- name: Upload
uses: mshick/add-pr-comment@v2
with:
message: |
## Uploaded Artifacts
To use these artifacts in your Gradle project, paste the following lines in your build.gradle.
```
${{ env.PR_STR }}
```
message-id: ${{ github.event.number }} # For sticky messages
repo-token: ${{ secrets.GITHUB_TOKEN }}
allow-repeats: false # This is the default
# - name: Create PR Comment String
# run: |
# src=${GITHUB_WORKSPACE}/build/versions.txt
# dest=${GITHUB_WORKSPACE}/build/versions2.txt
# echo 'resolutionStrategy {' > $dest
# awk '{print " force \"" $0 "\""}' $src >> $dest
# echo '}' >> $dest
# echo "PR_STR<<EOF" >> $GITHUB_ENV
# cat ${dest} >> $GITHUB_ENV
# echo 'EOF' >> $GITHUB_ENV
# - name: Upload
# uses: mshick/add-pr-comment@v2
# with:
# message: |
# ## Uploaded Artifacts
# To use these artifacts in your Gradle project, paste the following lines in your build.gradle.
# ```
# ${{ env.PR_STR }}
# ```
# message-id: ${{ github.event.number }} # For sticky messages
# repo-token: ${{ secrets.GITHUB_TOKEN }}
# allow-repeats: false # This is the default
38 changes: 20 additions & 18 deletions build.gradle
@@ -118,8 +118,9 @@ allprojects {
}
}

def printAllReleasedArtifacts = project.tasks.create('printAllReleasedArtifacts')
project.snapshot.configure { finalizedBy printAllReleasedArtifacts }
// def printAllReleasedArtifacts = project.tasks.create('printAllReleasedArtifacts')
// project.snapshot.configure { finalizedBy printAllReleasedArtifacts }
// project.tasks.named("build").configure { finalizedBy printAllReleasedArtifacts }
subprojects {
apply plugin: 'java-library'

@@ -172,22 +172,23 @@ subprojects {
options.compilerArgs << "-Xlint:deprecation"
}

project.plugins.withType(MavenPublishPlugin) {
def printReleasedArtifact = project.tasks.create('printReleasedArtifact')
printReleasedArtifact.doLast {
def file1 = file("${buildDir}/tmp/publishNebulaPublicationToNetflixOSSRepository/maven-metadata.xml")
def file2 = file("${buildDir}/tmp/publishNebulaPublicationToNetflixOSSRepository/snapshot-maven-metadata.xml")
def xmlText = file1.exists() ? file1.text : (file2.exists() ? file2.text : "file not found")
def xml = new XmlParser(false, false).parseText(xmlText)
def snapshotVersion = xml.versioning.snapshotVersions.snapshotVersion[0].'value'.text()
logger.lifecycle("${project.group}:${project.name}:${snapshotVersion}")
file("${project.rootProject.buildDir}/versions.txt").append("${project.group}:${project.name}:${snapshotVersion}" + '\n')
}

printReleasedArtifact.dependsOn(project.rootProject.snapshot)
printAllReleasedArtifacts.dependsOn("${project.path}:printReleasedArtifact")
}

// project.plugins.withType(MavenPublishPlugin) {
// def printReleasedArtifact = project.tasks.create('printReleasedArtifact')
// printReleasedArtifact.doLast {
// def file1 = file("${buildDir}/tmp/publishNebulaPublicationToNetflixOSSRepository/maven-metadata.xml")
// def file2 = file("${buildDir}/tmp/publishNebulaPublicationToNetflixOSSRepository/snapshot-maven-metadata.xml")
// def xmlText = file1.exists() ? file1.text : (file2.exists() ? file2.text : "file not found")
// logger.quiet("maven-metadata.xml prefix: ${xmlText.take(100)}")
// def xml = new XmlParser(false, false).parseText(xmlText)
// def snapshotVersion = xml.versioning.snapshotVersions.snapshotVersion[0].'value'.text()
// logger.lifecycle("${project.group}:${project.name}:${snapshotVersion}")
// file("${project.rootProject.buildDir}/versions.txt").append("${project.group}:${project.name}:${snapshotVersion}" + '\n')
// }

// printAllReleasedArtifacts.dependsOn("${project.path}:printReleasedArtifact")
// }

// test tasks config
task akkaTest(type: Test) {
maxParallelForks = 1
filter {
107 changes: 107 additions & 0 deletions context/archive/reservation/context-scheduling-logic.md
@@ -0,0 +1,107 @@
# Scheduling Logic and Resource Cluster Interactions

This document explains how job creation and job scaling translate into scheduling and resource changes, and how those changes interact with the Resource Cluster and the Resource Cluster Scaler.

## A) Job Creation → Scheduling → Task Execution

```mermaid
sequenceDiagram
autonumber
participant Client
participant JCM as JobClustersManagerActor
participant JC as JobClusterActor
participant JA as JobActor
participant MS as MantisScheduler
participant RCSA as ResourceClusterAwareSchedulerActor
participant RC as ResourceClusterActor
participant TE as Task Executor

Client->>JCM: SubmitJobRequest
JCM->>JC: forward(request)
JC->>JA: InitJob (initialize workers)
JA->>MS: scheduleWorkers(BatchScheduleRequest)
MS->>RCSA: BatchScheduleRequestEvent
RCSA->>RC: getTaskExecutorsFor(allocationRequests)
RC-->>RCSA: allocations (TaskExecutorID -> AllocationRequest)
RCSA->>TE: submitTask(executeStageRequestFactory.of(...))
TE-->>RCSA: Ack / Failure
RCSA->>JA: WorkerLaunched / WorkerLaunchFailed
Note over RC: Inventory & matching by ExecutorStateManagerImpl
```

Key notes:
- Job submission enters at `JobClustersManagerActor.onJobSubmit`, is forwarded to `JobClusterActor`, and `JobActor.initialize` then constructs the workers.
- `JobActor` enqueues the workers with the resource-cluster-aware scheduler as a batch schedule request.
- `ResourceClusterAwareSchedulerActor` queries `ResourceClusterActor` for best-fit executors and pushes tasks to the TEs via `TaskExecutorGateway` (see the sketch below).
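
To make the push model concrete, here is a minimal, self-contained sketch of the allocate-then-push sequence; `AllocationRequest`, `TaskExecutorId`, and the one-method `TaskExecutorGateway` are illustrative stand-ins, not the real Mantis types or signatures.

```java
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;

class PushSchedulingSketch {
    record AllocationRequest(String jobId, int workerIndex) {}
    record TaskExecutorId(String hostId) {}

    interface TaskExecutorGateway {
        CompletableFuture<Void> submitTask(AllocationRequest request); // ack on success
    }

    // Resource-cluster side: match each request to an available executor.
    // Requests left over when executors run out become pending demand.
    static Map<TaskExecutorId, AllocationRequest> allocate(
            List<AllocationRequest> requests, Deque<TaskExecutorId> availableExecutors) {
        Map<TaskExecutorId, AllocationRequest> allocations = new LinkedHashMap<>();
        for (AllocationRequest request : requests) {
            if (availableExecutors.isEmpty()) {
                break; // remainder is recorded as pending (see the scaling coupling below)
            }
            allocations.put(availableExecutors.pop(), request);
        }
        return allocations;
    }

    // Scheduler side: push each allocated task to its executor's gateway and
    // translate the outcome into a launched / launch-failed signal for the job.
    static void pushTasks(Map<TaskExecutorId, AllocationRequest> allocations,
                          Map<TaskExecutorId, TaskExecutorGateway> gateways) {
        allocations.forEach((executor, request) ->
            gateways.get(executor).submitTask(request).whenComplete((ignored, error) -> {
                // error == null -> report WorkerLaunched to the job actor
                // error != null -> report WorkerLaunchFailed and let the scheduler retry
            }));
    }
}
```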

## B) Job Stage Scale-Up/Down and Resource Cluster Scaler Loop

```mermaid
sequenceDiagram
autonumber
participant JA as JobActor
participant WM as WorkerManager
participant MS as MantisScheduler
participant RCSA as ResourceClusterAwareSchedulerActor
participant RC as ResourceClusterActor
participant RCS as ResourceClusterScalerActor
participant Host as ResourceClusterHostActor
participant TE as Task Executor

JA->>JA: onScaleStage(StageScalingPolicy, ScalerRule min/max)
JA->>WM: scaleStage(newNumWorkers)
alt Scale Up
WM->>MS: scheduleWorkers(BatchScheduleRequest)
MS->>RCSA: BatchScheduleRequestEvent
RCSA->>RC: getTaskExecutorsFor(...)
RC-->>RCSA: allocations or NoResourceAvailable
RCSA->>TE: submitTask(...)
note over RC: pendingJobRequests updated when not enough TEs
else Scale Down
WM->>RC: unscheduleAndTerminateWorker(...)
end

loop Periodic Capacity Loop (every scalerPullThreshold)
RCS->>RC: GetClusterUsageRequest
RC-->>RCS: GetClusterUsageResponse (idle/total by SKU)
RCS->>RCS: apply rules (minIdleToKeep, maxIdleToKeep, cooldown)
alt ScaleUp decision
RCS->>Host: ScaleResourceRequest(desired size increase)
Host-->>RC: New TEs provisioned
TE->>RC: TaskExecutorRegistration + Heartbeats
RC->>RC: mark available (unless disabled)
else ScaleDown decision
RCS->>RC: GetClusterIdleInstancesRequest
RC-->>RCS: idle TE list
RCS->>Host: ScaleResourceRequest(desired size decrease, idleInstances)
RCS->>RC: DisableTaskExecutorsRequest(idle targets during drain)
end
end
```

Key notes:
- Job scale-up adds workers; if there are not enough TEs, the scheduler records the unmet demand as pending, so the scaler sees fewer effective idle slots and scales capacity up.
- Scale-down removes workers and may free idle TEs; the scaler can shrink capacity by selecting idle instances to drain and disabling them while they are deprovisioned (a simplified rule sketch follows).
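
The rule step in the loop above can be pictured as a small decision function. The sketch below is illustrative, not the actual `ResourceClusterScalerActor` code; it assumes the scaler compares effective idle capacity per SKU against a `[minIdleToKeep, maxIdleToKeep]` band and skips decisions during cooldown.

```java
// Sketch of the per-SKU rule evaluation: keep effective idle capacity inside
// the [minIdleToKeep, maxIdleToKeep] band, and do nothing while cooling down.
class ScalerRuleSketch {
    enum Decision { SCALE_UP, SCALE_DOWN, NO_OP }

    static Decision evaluate(int effectiveIdle, int minIdleToKeep,
                             int maxIdleToKeep, boolean inCooldown) {
        if (inCooldown) {
            return Decision.NO_OP;       // respect the cooldown window
        }
        if (effectiveIdle < minIdleToKeep) {
            return Decision.SCALE_UP;    // ask the host actor for more TEs
        }
        if (effectiveIdle > maxIdleToKeep) {
            return Decision.SCALE_DOWN;  // select idle TEs to drain and disable
        }
        return Decision.NO_OP;
    }
}
```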

## Code Reference Map

- Submission & job initialization:
- `JobClustersManagerActor.onJobSubmit(...)`
- `JobActor.onJobInitialize(...)`, `initialize(...)`, `submitInitialWorkers(...)`, `queueTasks(...)`

- Scheduling (push model):
- `ResourceClusterAwareSchedulerActor.onBatchScheduleRequestEvent(...)`, `onAssignedScheduleRequestEvent(...)`
- `ResourceClusterActor` + `ExecutorStateManagerImpl.findBestFit(...)`, `findBestGroupBySizeNameMatch(...)`, `findBestGroupByFitnessCalculator(...)`

- Job scaling:
- `JobActor.onScaleStage(...)` → `WorkerManager.scaleStage(...)`

- Resource-cluster autoscaling:
- `ResourceClusterScalerActor.onTriggerClusterUsageRequest(...)`, `onGetClusterUsageResponse(...)`, `onGetClusterIdleInstancesResponse(...)`
- `ExecutorStateManagerImpl.getClusterUsage(...)` (idle = available − pending), `pendingJobRequests`
- `ResourceClusterActor.onTaskExecutorRegistration(...)` (TE joins, becomes available)

## Coupling Between Scheduling and Scaling

- When allocations fail due to lack of capacity, the resource cluster records pending demand per job/group; the scaler subtracts this pending from available to compute effective idle, triggering scale-up.
- New TEs register with the resource cluster and are immediately considered for subsequent scheduling waves.
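
A toy model of that bookkeeping, assuming (per the reference map above) that effective idle is available minus pending; `UsageSketch` and its method names are hypothetical, not `ExecutorStateManagerImpl`'s API.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative bookkeeping: failed allocations register pending demand per SKU,
// and the usage the scaler sees subtracts that pending from available capacity.
class UsageSketch {
    private final Map<String, Integer> availableBySku = new HashMap<>();
    private final Map<String, Integer> pendingBySku = new HashMap<>();

    void executorRegistered(String sku) {                // a new TE joins and becomes available
        availableBySku.merge(sku, 1, Integer::sum);
    }

    void allocationFellShort(String sku, int missing) {  // demand the cluster could not satisfy
        pendingBySku.merge(sku, missing, Integer::sum);
    }

    int effectiveIdle(String sku) {                      // what the scaler compares to its rules
        int available = availableBySku.getOrDefault(sku, 0);
        int pending = pendingBySku.getOrDefault(sku, 0);
        return Math.max(0, available - pending);
    }
}
```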