A framework for blind behavioral testing of live API systems. Ship an independent test suite that validates your platform's security, auth, RBAC, input handling, and UI behavior — without importing a single line of your application code.
Traditional testing verifies code from the inside. Blind scenario testing verifies behavior from the outside. Your test suite:
- Lives in a separate repository from the platform it tests
- Has zero access to source code, internal models, or implementation details
- Treats the platform as a black box and asserts only on observable HTTP behavior
- Runs against a digital twin — a Docker-composed replica of your production stack with a mock LLM backend
This approach catches regressions that unit tests miss: broken auth cookies, missing security headers, RBAC gaps, injection vulnerabilities, and policy enforcement failures. Because the tests don't know how the platform works internally, they can't be accidentally coupled to implementation details.
| Problem | How blind testing solves it |
|---|---|
| Tests coupled to implementation details | Tests only know HTTP endpoints and expected status codes |
| Security regressions slip through | Dedicated injection, header, and privilege escalation scenarios |
| Auth/RBAC changes break silently | Role-based scenarios catch permission drift immediately |
| "Works on my machine" | Docker-composed digital twin is the single source of truth |
| LLM costs during testing | Mock LLM server returns deterministic, keyword-matched responses |
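Scenarios assert only on what a response exposes; a security-header check, for instance, reduces to a pure function over the response's header map. A minimal, hypothetical helper (the required-header set below is an assumption; tune it to your platform's policy):

```python
# Headers every response is expected to carry. This set is an assumption,
# not the framework's actual list -- adjust it to your security policy.
REQUIRED_SECURITY_HEADERS = (
    "X-Content-Type-Options",
    "X-Frame-Options",
    "Content-Security-Policy",
    "Strict-Transport-Security",
)


def missing_security_headers(headers):
    """Return the required headers absent from a response's header mapping."""
    present = {name.lower() for name in headers}
    return [h for h in REQUIRED_SECURITY_HEADERS if h.lower() not in present]


# Example: a response missing CSP and HSTS
sample = {"X-Content-Type-Options": "nosniff", "X-Frame-Options": "DENY"}
print(missing_security_headers(sample))
# → ['Content-Security-Policy', 'Strict-Transport-Security']
```

A scenario then asserts `missing_security_headers(resp.headers) == []` and never needs to know how the platform sets those headers internally.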
- Python 3.11+
- Docker & Docker Compose
- Your platform's source code (for building the digital twin)
```shell
# Clone this repo
git clone https://github.com/NathanMaine/blind-scenario-testing.git
cd blind-scenario-testing

# Create a virtual environment
python -m venv .venv && source .venv/bin/activate

# Install dependencies
pip install -e .

# Install Playwright browsers (for UI tests)
playwright install chromium

# Copy and configure environment
cp .env.example .env
# Edit .env — set PLATFORM_PATH to your platform's source directory
```

```shell
# Full sweep: start stack, seed data, run all tests, tear down
make sweep

# Or step by step:
make up    # Start the digital twin stack
make seed  # Create test users and upload fixture documents
make test  # Run all scenario tests
make down  # Tear down the stack

# Run subsets:
make test-api  # API tests only (skip UI/Playwright)
make test-ui   # UI tests only
```

```
blind-scenario-testing/
├── conftest.py                # Root fixtures: base URLs and httpx clients
├── pytest.ini                 # Test configuration and markers
├── pyproject.toml             # Dependencies
├── Makefile                   # Lifecycle targets (up/seed/test/down/sweep)
├── docker-compose.test.yml    # Digital twin stack definition
├── .env.example               # Environment variable template
│
├── mock_llm/                  # Ollama-compatible mock LLM server
│   ├── Dockerfile
│   ├── server.py              # FastAPI server (OpenAI + Ollama endpoints)
│   └── responses.py           # Keyword→response rules (customize this!)
│
├── fixtures/
│   ├── seed.py                # Bootstrap test users and documents
│   ├── users.json             # Test user definitions (roles + credentials)
│   └── documents/             # Fixture documents for RAG testing
│       └── sample_document.txt
│
├── scenarios/                 # All test scenarios
│   ├── conftest.py            # Pre-authenticates test users
│   ├── auth/                  # Authentication tests
│   │   ├── test_login_flows.py
│   │   └── test_cookie_security.py
│   ├── authz/                 # Authorization / RBAC tests
│   │   ├── test_rbac_enforcement.py
│   │   └── test_privilege_escalation.py
│   ├── chat/                  # Chat pipeline tests
│   │   └── test_chat_pipeline.py
│   ├── health/                # Health endpoint tests
│   │   └── test_health_endpoints.py
│   ├── security/              # Security header + injection tests
│   │   ├── test_security_headers.py
│   │   └── test_injection_attacks.py
│   └── ui/                    # Playwright browser tests
│       ├── conftest.py        # Browser/page fixtures + screenshot on failure
│       └── test_login_ui.py
│
├── ci/
│   ├── run-sweep.sh           # Full sweep script with cleanup trap
│   └── github-actions.yml     # Example GitHub Actions workflow
│
└── reports/                   # Generated test reports (gitignored)
```
Edit `docker-compose.test.yml` to add your platform's services. The mock LLM is already configured — just wire your app to point at `http://mock-llm:11434` for its LLM backend.
```yaml
services:
  mock-llm:
    build: ./mock_llm    # Already configured

  your-app:
    build:
      context: ${PLATFORM_PATH}
    ports:
      - "18080:8080"
    environment:
      - LLM_HOST=http://mock-llm:11434
    depends_on:
      mock-llm:
        condition: service_healthy
```

Edit `scenarios/conftest.py` to match your platform's login endpoint:
```python
def _login(client, username, password):
    resp = client.post("/your/login/endpoint", json={
        "username": username,
        "password": password,
    })
    resp.raise_for_status()
    return resp.json()["your_token_field"]
```

Update `fixtures/users.json` with your platform's roles and credentials.
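The schema of `users.json` is yours to define; one hypothetical shape a seed script could consume (all usernames, passwords, and role names below are placeholders):

```json
[
  {"username": "admin@example.com",  "password": "change-me", "role": "admin"},
  {"username": "viewer@example.com", "password": "change-me", "role": "viewer"}
]
```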
Edit `mock_llm/responses.py` with domain-specific keyword/response pairs:

```python
KEYWORD_RESPONSES = [
    (["your-domain-term"], "Domain-specific response here."),
    (["another-keyword"], "Another response."),
]
```

Use the included example scenarios as templates. Each scenario category has its own directory under `scenarios/`. The pattern is simple:
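To see how such rules might be consumed, here is a sketch of a first-match-wins lookup with a default fallback; the rules shown and the fallback behavior are assumptions for illustration, not the shipped `server.py` logic:

```python
# Hypothetical rules: (keywords, canned response) pairs, checked in order.
KEYWORD_RESPONSES = [
    (["invoice", "billing"], "Your invoice total is $0.00."),
    (["reset", "password"], "Use the password reset link."),
]
DEFAULT_RESPONSE = "I don't have information about that."


def pick_response(prompt: str) -> str:
    """Return the first rule whose keywords appear in the prompt (case-insensitive)."""
    lowered = prompt.lower()
    for keywords, response in KEYWORD_RESPONSES:
        if any(k in lowered for k in keywords):
            return response
    return DEFAULT_RESPONSE


print(pick_response("How do I reset my account?"))  # → Use the password reset link.
```

Because the lookup is deterministic, the same prompt always yields the same response, which is what lets chat-pipeline assertions stay stable across runs.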
```python
import pytest

pytestmark = pytest.mark.auth


class TestYourFeature:
    def test_something_works(self, client, admin_headers):
        resp = client.get("/your/endpoint", headers=admin_headers)
        assert resp.status_code == 200
```

To add a new scenario category:

- Create a new directory under `scenarios/` (e.g., `scenarios/payments/`)
- Add an `__init__.py`
- Register the marker in `pytest.ini`
- Write your test files
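Registering the marker in `pytest.ini` keeps pytest from warning about unknown marks; a sketch, with illustrative descriptions and a hypothetical `payments` entry:

```ini
[pytest]
markers =
    auth: authentication scenarios
    authz: authorization / RBAC scenarios
    payments: payment scenarios (your new category)
```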
| Category | What It Tests |
|---|---|
| `auth` | Login flows, cookie security, password policy, session management, MFA |
| `authz` | RBAC enforcement, privilege escalation, unauthenticated access |
| `chat` | Chat pipeline, response structure, input validation |
| `health` | Service health endpoints, component status |
| `security` | HTTP headers, injection attacks (SQL, XSS, path traversal), oversized payloads |
| `ui` | Browser-based login, interaction, and error handling |
- This repo = behavioral assertions only. No application imports.
- Your platform repo = the system under test. Built via Docker.
- Mock LLM = deterministic responses. Zero GPU, zero cost, fully reproducible.
- Session-scoped fixtures for base URLs and auth tokens (created once)
- Throwaway users with UUID suffixes for mutation-heavy tests
- Each UI test gets a fresh browser context
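The UUID-suffix convention can be as small as a helper that mints a disposable username per test; a sketch (the prefix and suffix length are arbitrary choices, not the framework's actual naming scheme):

```python
import uuid


def throwaway_username(prefix: str = "scenario-user") -> str:
    """Mint a unique, disposable username so mutation-heavy tests never collide."""
    return f"{prefix}-{uuid.uuid4().hex[:8]}"


# Each call yields a distinct user that can be created, mutated, and abandoned.
a, b = throwaway_username(), throwaway_username()
print(a)        # e.g. scenario-user-3f9c1a2b
print(a != b)   # → True
```

Because every run creates fresh names, destructive tests can run in parallel and never depend on leftover state from a previous sweep.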
Tests assert on both "the right thing happened" and "the wrong thing didn't happen":
```python
# Not just: "did login succeed?"
# Also: "does a forged JWT get rejected?"
# Also: "does a viewer get 403 on admin endpoints?"
```

Copy `ci/github-actions.yml` to your repo as `.github/workflows/scenario-sweep.yml`. You'll need:
- A `PLATFORM_TOKEN` secret with access to your platform repo
- The `repository` field updated to point at your platform
To run the same sweep locally:

```shell
./ci/run-sweep.sh
```

This runs the full lifecycle with automatic cleanup on failure.
If your platform handles sensitive data, create a `scenarios/pii/` category:

```python
class TestPIIBlocking:
    def test_ssn_blocked(self, client, admin_headers):
        resp = client.post("/api/chat", json={
            "message": "My SSN is 123-45-6789",
        }, headers=admin_headers)
        # Assert the platform blocks or redacts PII
        assert resp.status_code in (200, 502)
        if resp.status_code == 200:
            data = resp.json()
            assert data.get("policy_decision") in ("blocked_pii", "redacted")
```

To verify throttling behavior, hammer an endpoint and expect a 429:

```python
class TestRateLimiting:
    def test_burst_triggers_throttle(self, client, admin_headers):
        results = []
        for _ in range(50):
            resp = client.post("/api/chat",
                               json={"message": "test"},
                               headers=admin_headers)
            results.append(resp.status_code)
        assert 429 in results, "No rate limiting detected"
```

To exercise the document pipeline end to end:

```python
class TestDocumentLifecycle:
    def test_upload_and_retrieve(self, client, admin_headers):
        with open("fixtures/documents/sample_document.txt", "rb") as f:
            resp = client.post("/api/documents",
                               files={"file": ("test.txt", f)},
                               headers=admin_headers)
        assert resp.status_code in (200, 201)
```

MIT License. See LICENSE for details.
This framework was extracted from a production blind testing system used to validate a CMMC compliance platform. The original implementation ran 150+ parametrized scenarios across authentication, authorization, PII/CUI protection, chat pipelines, document management, gateway policies, audit logging, and browser UI — all without importing a single line of the platform's source code.
The "Dark Factory" pattern (named for manufacturing's lights-out factory concept) treats your entire platform as a sealed black box. You build a digital twin, point your tests at it, and assert on behavior. If a test fails, the platform has a behavioral regression — no matter what the internal code looks like.