feat: enforce strict code quality (v0.14.0)#469

Merged
greynewell merged 1 commit into main from feat/strict-code-quality-v0.14.0
Feb 13, 2026
Conversation


@greynewell greynewell commented Feb 13, 2026

Summary

  • Expand Ruff linting rules (B, UP, SIM, RUF, C4, PIE, PT, ASYNC, S, T20) and add mypy type checking with Pydantic plugin across all 134 source files
  • Fix all 72 ruff lint violations across 36 files and all 267 mypy type errors across 39 files
  • Add mypy pre-commit hook (local, using uv run mypy) and CI type-check job
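
For reference, a pyproject.toml sketch of roughly what such a configuration could look like; the exact layout and option values here are an assumption for illustration, not copied from the PR:

```toml
[tool.ruff.lint]
# Rule families listed in the PR summary, on top of the defaults
select = ["E", "F", "B", "UP", "SIM", "RUF", "C4", "PIE", "PT", "ASYNC", "S", "T20"]

[tool.mypy]
plugins = ["pydantic.mypy"]
strict = true
```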

Test plan

  • uvx ruff check src/ tests/ — All checks passed
  • uvx ruff format --check src/ tests/ — 266 files already formatted
  • uv run mypy src/mcpbr/ — Success: no issues found in 134 source files
  • uv run pytest -m "not integration" — 4293 passed, 0 failures
  • All pre-commit hooks pass (sync-version, ruff, ruff-format, mypy, trailing-whitespace, end-of-file-fixer, check-yaml, check-added-large-files, check-merge-conflict)

🤖 Generated with Claude Code

Summary by CodeRabbit

Release Notes: Version 0.14.0

  • New Features

    • Added comprehensive type checking with mypy integration in CI/pre-commit workflows
    • Expanded code quality enforcement with stricter linting rules
  • Bug Fixes

    • Fixed multiple code quality violations and type errors across the codebase
    • Improved error handling and timezone handling consistency
    • Enhanced validation and safety checks throughout infrastructure providers
  • Documentation

    • Updated development guidelines to include type checking requirements


coderabbitai bot commented Feb 13, 2026

📝 Walkthrough


This PR implements a comprehensive code quality and type-safety initiative, bumping versions from 0.13.4 to 0.14.0 across manifests, adding mypy type-checking infrastructure with pre-commit and CI integration, expanding Ruff lint rules, modernizing timezone handling via UTC constants, improving exception handling patterns, and adding type annotations throughout the codebase.

Changes

Cohort / File(s) | Summary
Version Bumping
.claude-plugin/marketplace.json, .claude-plugin/package.json, .claude-plugin/plugin.json, package.json, pyproject.toml, src/mcpbr/__init__.py
Bumped project version from 0.13.4 to 0.14.0 across all manifest and package configuration files. Exposed __version__ in public API via __all__.
Type Checking Infrastructure
.pre-commit-config.yaml, .github/workflows/ci.yml, pyproject.toml, AGENTS.md, CHANGELOG.md
Added mypy to pre-commit hooks and introduced dedicated CI type-check job. Expanded Ruff lint rules, added mypy configuration with strict options and per-file ignore mappings. Updated documentation to reflect type-checking requirements.
Public API & Type Signature Updates
src/mcpbr/benchmarks/__init__.py, src/mcpbr/reporting.py, src/mcpbr/dashboard.py, src/mcpbr/sdk.py, src/mcpbr/log_formatter.py, src/mcpbr/infrastructure/base.py
Exposed BENCHMARK_REGISTRY and expanded benchmark class exports. Broadened return types to dict[str, Any]. Changed create_app return type from quoted to concrete FastAPI. Added InstanceLogWriter class; updated log_file parameter union type. Made collect_artifacts concrete with return type Path | None.
Timezone Modernization
src/mcpbr/cache.py, src/mcpbr/audit.py, src/mcpbr/analytics/database.py, src/mcpbr/docker_cache.py, src/mcpbr/graceful_degradation.py, src/mcpbr/privacy.py, src/mcpbr/profiler.py, src/mcpbr/state_tracker.py, src/mcpbr/tutorial.py, src/mcpbr/dataset_versioning.py, src/mcpbr/reproducibility.py, tests/*
Replaced timezone.utc with UTC constant from datetime module across timestamp generation, caching, auditing, profiling, and state management. Import pattern updated to from datetime import UTC, datetime.
Exception Handling Improvements
src/mcpbr/evaluation.py, src/mcpbr/harness.py, src/mcpbr/harnesses.py, src/mcpbr/distributed.py, tests/infrastructure/test_*.py, tests/test_eval_reliability.py, tests/test_graceful_degradation.py
Replaced asyncio.TimeoutError with built-in TimeoutError for broader exception coverage. Updated exception chaining with from e to preserve context. Introduced contextlib.suppress() patterns for non-fatal cleanup paths.
Type Annotation Enhancements
src/mcpbr/benchmarks/arc.py, src/mcpbr/benchmarks/bigcodebench.py, src/mcpbr/benchmarks/gaia.py, src/mcpbr/benchmarks/hellaswag.py, src/mcpbr/benchmarks/mbpp.py, src/mcpbr/benchmarks/leetcode.py, src/mcpbr/custom_metrics.py, src/mcpbr/profiler.py, src/mcpbr/providers.py, src/mcpbr/streaming.py
Added explicit type annotations for local variables: augmented_tasks: list[dict[str, Any]], intermediate lists (costs: list[float], runtimes: list[float]), and method return types. Moved Callable import from typing to collections.abc.
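
The annotation style described above can be sketched as follows (`summarize` and `pick` are illustrative names, not functions from the PR):

```python
from collections.abc import Callable  # preferred over typing.Callable
from typing import Any

def summarize(results: list[dict[str, Any]]) -> dict[str, float]:
    # explicit local annotations on intermediate lists, as added in the PR
    costs: list[float] = [r["cost"] for r in results]
    runtimes: list[float] = [r["runtime"] for r in results]
    return {"total_cost": sum(costs), "max_runtime": max(runtimes)}

# Callable imported from collections.abc rather than typing
pick: Callable[[dict[str, Any]], float] = lambda r: r["cost"]

stats = summarize([{"cost": 1.0, "runtime": 2.0}, {"cost": 3.0, "runtime": 1.0}])
```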
Control Flow & Code Refactoring
src/mcpbr/agent.py, src/mcpbr/analytics/adversarial.py, src/mcpbr/benchmarks/cybergym.py, src/mcpbr/config_migration.py, src/mcpbr/docker_prewarm.py, src/mcpbr/dry_run.py, src/mcpbr/few_shot.py, src/mcpbr/graceful_degradation.py, src/mcpbr/prompt_security.py, src/mcpbr/sandbox.py, src/mcpbr/config_inheritance.py, src/mcpbr/env_expansion.py
Simplified control flow with single conditions instead of nested checks, replaced loops with any() expressions, consolidated URL protocol checks with tuple startswith(), replaced Enum with StrEnum, extended FindingSeverity enum with HIGH and CRITICAL members, refactored problem statement templates to dictionary mapping.
CLI & Infrastructure Refinements
src/mcpbr/cli.py, src/mcpbr/docker_env.py, src/mcpbr/infrastructure/aws.py, src/mcpbr/infrastructure/azure.py, src/mcpbr/infrastructure/gcp.py, src/mcpbr/infrastructure/k8s.py, src/mcpbr/infrastructure/cloudflare.py, src/mcpbr/infrastructure/local.py
Introduced ExitStack for resource management, added assertions for preconditions (ssh_client, instance_id), enhanced config guards with ValueError raises, updated method signatures for get_run_status to dict[str, Any], improved type safety and null-checking across SSH/SFTP operations.
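
The ExitStack pattern mentioned above can be sketched as follows (`first_lines` is a hypothetical helper, not code from the PR):

```python
import tempfile
from contextlib import ExitStack
from pathlib import Path

def first_lines(paths: list[Path]) -> list[str]:
    # ExitStack keeps a variable number of files open and guarantees
    # all of them are closed, even if a later open() fails
    with ExitStack() as stack:
        files = [stack.enter_context(p.open()) for p in paths]
        return [f.readline().rstrip("\n") for f in files]

with tempfile.TemporaryDirectory() as d:
    paths = []
    for i in range(3):
        p = Path(d) / f"f{i}.txt"
        p.write_text(f"line{i}\n")
        paths.append(p)
    lines = first_lines(paths)
```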
Test Infrastructure Updates
tests/infrastructure/*.py, tests/test_*.py
Refactored nested context managers into single with statements using tuple syntax, replaced IOError with OSError, changed fixture decorators from @pytest.fixture() to @pytest.fixture, consolidated async mock patterns, renamed unused variables to underscored variants, updated regex patterns to raw strings, replaced asyncio.TimeoutError with TimeoutError in mock scenarios.
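
The parenthesized-with refactor described above looks roughly like this (the patch targets are illustrative, not taken from the test suite):

```python
import os
from unittest.mock import patch

# Python 3.10+ allows parenthesized context managers, replacing
# the nested with-blocks in the refactored tests
with (
    patch("os.cpu_count", return_value=99),
    patch("os.getpid", return_value=12345),
):
    cpus = os.cpu_count()
    pid = os.getpid()
```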
Minor Cleanups & Utilities
scripts/sync_version.py, src/mcpbr/plugin_registry.py, src/mcpbr/compat.py, src/mcpbr/config.py, src/mcpbr/config_validator.py, src/mcpbr/config_wizard.py, src/mcpbr/notifications.py, src/mcpbr/resource_limits.py, src/mcpbr/sampling.py, src/mcpbr/smoke_test.py, src/mcpbr/storage/*.py
Removed unnecessary pass statements, added noqa comments for linting, improved Windows path validation bounds checking, added type ignore hints for optional dependencies, introduced explicit local variable typing for responses, simplified method bodies for abstract classes, enhanced JSON/YAML error messages with problem values.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related issues

  • greynewell/mcpbr#467: Directly addresses static analysis enforcement with mypy type-checking configuration, expanded Ruff lint rules, pre-commit hook integration, and documentation updates—all central to this PR's objectives.

Possibly related PRs

  • greynewell/mcpbr#377: Modifies the same analytics modules touched by this PR (src/mcpbr/analytics/*), indicating overlapping type annotation and code quality improvements.
  • greynewell/mcpbr#380: Shares version bumping and plugin manifest updates across .claude-plugin/* configuration files.
  • greynewell/mcpbr#389: Related through timeout exception handling changes, replacing asyncio.TimeoutError with TimeoutError across evaluation and harness modules.

Suggested labels

enhancement, type: infrastructure, size: XL


🐰 A quality quest complete at last,
Types checked and lint rules cast,
UTC zones now align,
Exceptions refined,
Code standards holding fast!

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
  • Description Check — ✅ Passed — Check skipped: CodeRabbit's high-level summary is enabled.
  • Title Check — ✅ Passed — The title accurately summarizes the main change: enforcing strict code quality by expanding linting rules and adding type checking, with a version bump to 0.14.0.
  • Docstring Coverage — ✅ Passed — Docstring coverage is 96.04%, which is sufficient; the required threshold is 80.00%.
  • Merge Conflict Detection — ✅ Passed — No merge conflicts detected when merging into main.


@greynewell greynewell force-pushed the feat/strict-code-quality-v0.14.0 branch 3 times, most recently from 6bc1ca3 to 5b020d1 on February 13, 2026 at 22:56
…king (v0.14.0)

Add comprehensive static analysis enforcement across all 134 source files:

- Expand Ruff rules: B, UP, SIM, RUF, C4, PIE, PT, ASYNC, S, T20
- Add mypy type checking with Pydantic plugin and CI type-check job
- Fix 72 ruff lint violations across 36 files
- Fix 267 mypy type errors across 39 files
- All 4293 tests pass with zero regressions

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@greynewell greynewell force-pushed the feat/strict-code-quality-v0.14.0 branch from 5b020d1 to b9c29c7 on February 13, 2026 at 23:00
@greynewell greynewell merged commit 0f6cf4b into main Feb 13, 2026
12 of 13 checks passed
@coderabbitai coderabbitai bot left a comment
Actionable comments posted: 8

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
tests/test_benchmark_integration.py (1)

111-116: ⚠️ Potential issue | 🟡 Minor

Avoid broad except Exception in thread aggregation (BLE001).

This blanket catch triggers Ruff BLE001 and can mask programming errors. Catch specific expected exceptions (e.g., network/load failures) or re-raise unexpected ones after recording the error.

tests/test_eval_reliability.py (1)

148-154: ⚠️ Potential issue | 🟡 Minor

Stale docstring/comment and near-duplicate test after TimeoutError unification.

Line 149's docstring still says "should catch asyncio.TimeoutError (not just TimeoutError)" and line 153's comment references "Python <3.11 compat", but the side effect now raises the built-in TimeoutError. Additionally, this test is now functionally identical to test_builtin_timeout_error_still_caught (lines 170–185) — both raise TimeoutError() and assert the same outcome.

Consider either:

  • Removing one of the two duplicate tests, or
  • If you want to keep explicit coverage for both exception types, restoring asyncio.TimeoutError() in this test (it's still valid in 3.11+ as an alias) and updating the comment to clarify the intent.

Either way, the docstring and comment on lines 149/153 should be updated to reflect reality.

src/mcpbr/infrastructure/azure.py (1)

572-586: ⚠️ Potential issue | 🟡 Minor

datetime.now() without UTC is inconsistent with the PR's timezone migration.

Line 585 uses datetime.now().isoformat() which produces a naive local timestamp, while the rest of this PR consistently migrates to datetime.now(UTC). This should be updated for consistency.

Proposed fix
-            from datetime import datetime
+            from datetime import UTC, datetime

             assert self.vm_name is not None
             assert self.vm_ip is not None
             run_state = RunState(
                 vm_name=self.vm_name,
                 vm_ip=self.vm_ip,
                 resource_group=self.azure_config.resource_group,
                 location=self.azure_config.location,
                 ssh_key_path=str(self.ssh_key_path),
                 config_path=str(self.config.config_path)
                 if hasattr(self.config, "config_path")
                 else "",
-                started_at=datetime.now().isoformat(),
+                started_at=datetime.now(UTC).isoformat(),
             )
🤖 Fix all issues with AI agents
In `@pyproject.toml`:
- Around line 137-138: Remove the redundant per-file-ignores entry
`"infrastructure/**/*.py" = ["S603", "S607"]` from pyproject.toml; instead keep
the existing `"tests/**/*.py"` and the `"src/mcpbr/infrastructure/**/*.py"`
entries and ensure the remaining TOML mapping remains valid after deleting
that line so the file parses correctly.

In `@src/mcpbr/env_expansion.py`:
- Around line 139-148: The exclusion `not key_lower.endswith("_key")` in the
conditional is too broad and skips sensitive names; update the check in
env_expansion.py around the `if (` that uses key_lower and value so that `_key`
names are treated safely — either add sensitive `_key` patterns (e.g.,
"secret_key", "private_key", "encryption_key", "signing_key") to the first
branch's keyword list checked with `any(keyword in key_lower ...)`, or replace
the blanket exclusion with an allowlist of known-benign `_key` suffixes (e.g.,
"cache_key", "primary_key") and keep all other `_key` endings subject to the
API-key heuristic, ensuring warnings.append still references path and value
length checks as before.

In `@src/mcpbr/log_formatter.py`:
- Line 509: The line that opens the log file contains an unnecessary noqa
suppression ("# noqa: SIM115")—remove that directive from the open call so the
line becomes a plain open(log_file_path, "w") call; locate the open(...)
invocation where the variable log_file is assigned and drop the trailing "#
noqa: SIM115" comment.

In `@src/mcpbr/profiler.py`:
- Around line 152-157: The broad "except Exception:" in the memory-sampling
block of the Profiler (around the except handling that sets
self.enable_memory_profiling) triggers Ruff BLE001; replace the broad catch with
specific exceptions (e.g., except (psutil.Error, OSError):) or, if the broad
catch is intentional for best-effort telemetry, annotate it with the linter
exemption comment "# noqa: BLE001" so Ruff stops flagging it; update the except
block around the memory profiling sampling in profiler.py (the handler that
currently has "# noqa: S110 -- best-effort memory sampling; non-critical
telemetry") accordingly.

In `@src/mcpbr/reproducibility.py`:
- Around line 114-119: The loop in get package collection uses unstable metadata
indexing (dist.metadata["Name"]/["Version"]) which can KeyError; change it to
use the stable Distribution properties (dist.name and dist.version) when
building the packages dict in the function that iterates over distributions(),
e.g., replace references to dist.metadata["Name"] and dist.metadata["Version"]
with dist.name and dist.version and optionally guard with or "" to preserve
current behavior; keep the surrounding exception handling as a last-resort
fallback.

In `@tests/test_benchmark_integration.py`:
- Around line 65-66: Replace the broad "except Exception as e" that writes to
result["error"] with a targeted handler that only catches expected errors from
benchmark creation/loading (e.g., ValueError, IOError, OSError or your
domain-specific exceptions) and records them into result["error"]; for anything
else, re-raise the exception so tests fail and provide a traceback. Concretely:
change the "except Exception as e" block to either "except (ValueError, IOError)
as e:" (adding any project-specific exception classes) or use "except Exception
as e: if isinstance(e, (ValueError, IOError, YourBenchmarkError)):
result['error'] = ... else: raise" to avoid swallowing unexpected failures while
preserving the existing error message formatting.

In `@tests/test_harness_notifications.py`:
- Around line 42-48: The test currently assigns a hardcoded literal directly to
config.slack_bot_token (triggering Ruff S105); change it to assign the token to
a non-literal variable (e.g., token_var = "xoxb-token"), set
config.slack_bot_token = token_var, and update the assertion to compare
result["slack_bot_token"] against that variable instead of a literal, leaving
the rest of the test (discord/slack_webhook assertions and use of
_build_notify_config) unchanged.

In `@tests/test_thinking_budget.py`:
- Around line 283-288: Replace the hard-coded "/tmp/test" and the broad
contextlib.suppress(Exception) in the patched test block with a pytest tmp_path
fixture and a narrow suppression (or none); specifically update the
patch/context manager that surrounds the call to harness.solve(task,
"/tmp/test", timeout=1) to call harness.solve(task, tmp_path / "test",
timeout=1) and change contextlib.suppress(Exception) to
contextlib.suppress(TimeoutError) (or remove the suppressor entirely), and apply
the same change to the other async test block that patches
mcpbr.harnesses._run_cli_command and shutil.which so both tests use tmp_path and
only suppress TimeoutError.
🧹 Nitpick comments (16)
src/mcpbr/analytics/regression_detector.py (1)

348-353: Nit: prefer truthy check over len(...) > 0.

Since this PR enables the SIM rule set, len(task_regressions) > 0 can be simplified to just task_regressions. An empty list is falsy in Python, so the semantics are identical.

♻️ Suggested simplification
         elif (
             cost_regression_detected
             or latency_regression_detected
             or token_regression_detected
-            or len(task_regressions) > 0
+            or task_regressions
         ):
src/mcpbr/analytics/correlation.py (1)

50-50: Prefer strict=True here — lengths are already validated.

Since lines 39–42 guarantee len(x) == len(y), using strict=True provides defense-in-depth at no cost, and better communicates the intent that these sequences must be the same length. strict=False (the default) silences B905 but opts out of the safety net.

Suggested change
-    numerator = math.fsum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y, strict=False))
+    numerator = math.fsum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y, strict=True))
tests/test_audit.py (1)

985-985: Consider strict=True here instead of strict=False.

Line 983 already asserts both lists have length 2, so they're guaranteed equal at this point. Using strict=True would be the more defensive choice — it catches length mismatches at the zip call itself rather than silently truncating, which is especially valuable if the assertion on line 983 is ever relaxed or removed.

Suggested change
-            for json_entry, csv_row in zip(json_data, csv_rows, strict=False):
+            for json_entry, csv_row in zip(json_data, csv_rows, strict=True):
src/mcpbr/graceful_degradation.py (1)

111-120: Redundant entries in _TRANSIENT_ERRORS.

asyncio.TimeoutError is an alias for TimeoutError since Python 3.11, IOError is an alias for OSError since Python 3.3, and ConnectionResetError/ConnectionRefusedError/ConnectionAbortedError are subclasses of ConnectionError. These overlap but are harmless — just something to clean up if you want a tighter tuple.

tests/test_graceful_degradation.py (1)

109-113: This test is now a duplicate of test_timeout_is_transient.

After replacing asyncio.TimeoutError() with TimeoutError(), this test does exactly the same thing as test_timeout_is_transient (line 67). Since they're aliases in Python 3.11+, both tests instantiate TimeoutError and assert TRANSIENT. Consider removing this test or, if you want to keep explicit coverage for the alias, at least updating the name/docstring to reflect that it's intentionally testing the alias equivalence.

tests/test_docker_retry.py (1)

22-24: Ruff ARG001 false positive — mock_docker_client is a pytest fixture dependency.

The static analysis hint flags mock_docker_client as unused, but it's required here as an implicit fixture dependency: it ensures docker.from_env is patched before DockerEnvironmentManager() is instantiated on line 24. This is idiomatic pytest and safe to ignore. If Ruff's ARG001 rule is enforced project-wide, you could suppress it with a leading underscore or a # noqa: ARG001 comment, but that's entirely optional.

src/mcpbr/notifications.py (1)

493-494: Type annotation looks good.

The explicit url variable with str | None type annotation correctly reflects the API response structure and improves type safety.

Optional: Consider try/except/else structure (Ruff TRY300).

The static analyzer suggests moving the return statement to an else block for more explicit control flow. However, the current pattern is perfectly readable and commonly used. This is purely a style preference.

♻️ Optional refactor to satisfy TRY300
     try:
         response = requests.post(
             "https://api.github.com/gists",
             headers={
                 "Authorization": f"token {github_token}",
                 "Accept": "application/vnd.github.v3+json",
             },
             json={
                 "description": description,
                 "public": False,
                 "files": {
                     "results.json": {"content": results_json},
                 },
             },
             timeout=30,
         )
         response.raise_for_status()
-        url: str | None = response.json().get("html_url")
-        return url
     except Exception as e:
         logger.warning("Failed to create GitHub Gist: %s", e)
         return None
+    else:
+        url: str | None = response.json().get("html_url")
+        return url
src/mcpbr/sampling.py (1)

136-140: Optional: Simplify defaultdict usage.

Since groups is a defaultdict(list), line 139 is redundant. The defaultdict automatically creates an empty list for missing keys, so you can remove the explicit assignment.

♻️ Proposed simplification
     groups: dict[str, list[dict[str, Any]]] = defaultdict(list)
     for task in tasks:
         key = str(task.get(stratify_field, "_unknown_"))
-        groups[key] = groups.get(key, [])
         groups[key].append(task)
.pre-commit-config.yaml (1)

19-26: Consider scoping the hook trigger to src/mcpbr/ files only.

The hook runs mypy on src/mcpbr/ but triggers on any Python file change (including tests) due to types: [python] without a files filter. Adding a files pattern would avoid unnecessary mypy runs when only test files change.

       - id: mypy
         name: mypy
         entry: uv run --extra dev mypy src/mcpbr/
         language: system
         pass_filenames: false
         types: [python]
+        files: ^src/mcpbr/
src/mcpbr/smoke_test.py (1)

164-165: Acceptable type: ignore for asyncio.to_thread callable typing.

The # type: ignore[arg-type] is a reasonable suppression here since asyncio.to_thread has strict callable typing that doesn't align well with SDK method signatures passed with keyword arguments. Consider adding a brief explanation in the comment (e.g., # type: ignore[arg-type] # asyncio.to_thread callable signature mismatch) for future readers.
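
A hedged sketch of the asyncio.to_thread pattern at issue (`fetch_status` is a hypothetical stand-in for the SDK method):

```python
import asyncio
import time

def fetch_status(delay: float, *, label: str) -> str:
    time.sleep(delay)  # blocking call, offloaded to a worker thread
    return f"ok:{label}"

async def main() -> str:
    # keyword arguments are forwarded to the function in the worker thread
    return await asyncio.to_thread(fetch_status, 0.01, label="smoke")

result = asyncio.run(main())
```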

src/mcpbr/benchmarks/mcptoolbench.py (1)

397-399: enumerate is unnecessary — _i is never used.

The index variable _i from enumerate is unused. The enumerate wrapper can be dropped, and strict=False is already the default for zip.

♻️ Simplify the loop
-            for _i, (agent_call, gt_call) in enumerate(
-                zip(agent_calls, ground_truth, strict=False)
-            ):
+            for agent_call, gt_call in zip(agent_calls, ground_truth):
CHANGELOG.md (1)

10-32: Changelog entry looks good and accurately documents the v0.14.0 changes.

Minor note: The [0.13.0] section (line 34) doesn't have a corresponding link reference at the bottom of the file. This appears to be pre-existing but worth fixing while you're editing the file.

src/mcpbr/config_validator.py (1)

492-492: dict(error) conversion is redundant but harmless.

Pydantic's e.errors() already returns list[dict[str, Any]], so error is already a dict. The dict() call is a no-op here but may satisfy mypy if the inferred type was more narrow. Not worth changing, just noting.

src/mcpbr/infrastructure/aws.py (1)

417-417: Replace assert statements with explicit runtime error checks for provider state validation.

All six preconditions use assert, which are stripped with python -O, allowing None to flow into AWS CLI and paramiko calls. Use RuntimeError or similar explicit error handling instead for robustness.

🛠️ Example for one guard
-        assert self.instance_id is not None
+        if self.instance_id is None:
+            raise RuntimeError("Instance ID not set; call _create_instance() first")

Affected lines: 417, 476, 607, 771, 825, 884–885

src/mcpbr/infrastructure/gcp.py (2)

688-702: Naive datetime.now() — consider using UTC for consistency with the broader PR.

The AI summary notes this PR normalizes timezone handling to UTC across multiple modules. Line 701 uses datetime.now() which produces a naive (timezone-unaware) timestamp, inconsistent with that pattern.

Suggested fix
-        from datetime import datetime
+        from datetime import UTC, datetime
...
-                started_at=datetime.now().isoformat(),
+                started_at=datetime.now(UTC).isoformat(),

364-364: Assertions for mypy narrowing are acceptable but note they're no-ops under python -O.

The assertions on self.instance_ip, self.ssh_client, etc. help mypy but are skipped if Python runs with optimizations enabled (-O/-OO). For ssh_client, _ssh_exec (line 399) already has a proper runtime guard; the direct self.ssh_client.exec_command calls on lines 749 and 802-803 bypass that guard. If this code could ever run under -O, those paths would hit an AttributeError on None.

Given the codebase context (CLI tool, unlikely to run optimized), this is low risk.

Also applies to: 498-498, 748-748, 802-802, 861-861

Comment on lines +137 to +138
"infrastructure/**/*.py" = ["S603", "S607"]
"src/mcpbr/infrastructure/**/*.py" = ["S603", "S607", "S108"]

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check if there's a top-level infrastructure/ directory
fd -t d '^infrastructure$' --max-depth 2

# Also list the directory structure to understand the layout
echo "=== Root level directories ==="
ls -la | grep '^d' | head -20

# Examine the relevant section of pyproject.toml
echo "=== Ruff per-file-ignores section ==="
sed -n '135,145p' pyproject.toml

Repository: greynewell/mcpbr

Length of output: 1729


🏁 Script executed:

#!/bin/bash
# Check for Python files in infrastructure directories
echo "=== Python files in docs/infrastructure ==="
find docs/infrastructure -name "*.py" -type f 2>/dev/null | head -20

echo "=== Python files in tests/infrastructure ==="
find tests/infrastructure -name "*.py" -type f 2>/dev/null | head -20

echo "=== All files in docs/infrastructure ==="
find docs/infrastructure -type f 2>/dev/null

echo "=== All files in tests/infrastructure ==="
find tests/infrastructure -type f 2>/dev/null

Repository: greynewell/mcpbr

Length of output: 1226


Remove redundant per-file-ignores pattern on Line 137.

"infrastructure/**/*.py" on line 137 matches files in tests/infrastructure/ (13 test files found), but these are already covered by line 136's "tests/**/*.py" pattern, which takes precedence in ruff's rule matching. This makes line 137 unreachable for its only Python files. Remove it to clarify the configuration.

Proposed fix
 [tool.ruff.lint.per-file-ignores]
 "tests/**/*.py" = ["S", "T20"]
-"infrastructure/**/*.py" = ["S603", "S607"]
 "src/mcpbr/infrastructure/**/*.py" = ["S603", "S607", "S108"]
 "scripts/**/*.py" = ["T20", "S"]
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-"infrastructure/**/*.py" = ["S603", "S607"]
-"src/mcpbr/infrastructure/**/*.py" = ["S603", "S607", "S108"]
+[tool.ruff.lint.per-file-ignores]
+"tests/**/*.py" = ["S", "T20"]
+"src/mcpbr/infrastructure/**/*.py" = ["S603", "S607", "S108"]
+"scripts/**/*.py" = ["T20", "S"]
🤖 Prompt for AI Agents
In `@pyproject.toml` around lines 137 - 138, Remove the redundant per-file-ignores
entry `"infrastructure/**/*.py" = ["S603", "S607"]` from pyproject.toml; instead
keep the existing `"tests/**/*.py"` and the `"src/mcpbr/infrastructure/**/*.py"`
entries and ensure the remaining TOML mapping remains valid after deleting
that line so the file parses correctly.

Comment on lines +139 to +148
# Check for API keys (specific patterns with lower threshold,
# generic "key" suffix with higher threshold to avoid false positives)
if (
    any(keyword in key_lower for keyword in ["api_key", "api-key", "apikey"])
    and len(value) > 5
) or (key_lower.endswith("key") and not key_lower.endswith("_key") and len(value) > 10):
    warnings.append(
        f"Possible API key hardcoded at '{path}'. "
        f"Consider using environment variables: ${{API_KEY}}"
    )

⚠️ Potential issue | 🟡 Minor

The not key_lower.endswith("_key") exclusion is overly broad and misses genuinely sensitive keys.

The second branch excludes every key ending in _key to avoid false positives on benign names like "cache_key" or "primary_key". However, this also silently skips "secret_key", "private_key", "encryption_key", and "signing_key" — all of which are common names for hardcoded secrets.

Consider using an allowlist of known-benign _key suffixes instead of blanket exclusion, or add the sensitive _key patterns to the first branch's keyword list.

Suggested approach: expand the keyword list in the first branch
             if (
-                any(keyword in key_lower for keyword in ["api_key", "api-key", "apikey"])
+                any(keyword in key_lower for keyword in [
+                    "api_key", "api-key", "apikey",
+                    "secret_key", "private_key", "encryption_key", "signing_key",
+                ])
                 and len(value) > 5
             ) or (key_lower.endswith("key") and not key_lower.endswith("_key") and len(value) > 10):
🤖 Prompt for AI Agents
In `@src/mcpbr/env_expansion.py` around lines 139 - 148, The exclusion `not
key_lower.endswith("_key")` in the conditional is too broad and skips sensitive
names; update the check in env_expansion.py around the `if (` that uses
key_lower and value so that `_key` names are treated safely — either add
sensitive `_key` patterns (e.g., "secret_key", "private_key", "encryption_key",
"signing_key") to the first branch's keyword list checked with `any(keyword in
key_lower ...)`, or replace the blanket exclusion with an allowlist of
known-benign `_key` suffixes (e.g., "cache_key", "primary_key") and keep all
other `_key` endings subject to the API-key heuristic, ensuring warnings.append
still references path and value length checks as before.

 if log_file_path:
     log_file_path.parent.mkdir(parents=True, exist_ok=True)
-    log_file = open(log_file_path, "w")
+    log_file = open(log_file_path, "w")  # noqa: SIM115 - caller closes the file handle

⚠️ Potential issue | 🟡 Minor

Remove unused noqa directive.

Ruff flags this as RUF100 because SIM115 is not in the enabled rule set, so the suppression is unnecessary.

Fix
-        log_file = open(log_file_path, "w")  # noqa: SIM115 - caller closes the file handle
+        log_file = open(log_file_path, "w")  # caller closes the file handle
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
log_file = open(log_file_path, "w") # noqa: SIM115 - caller closes the file handle
log_file = open(log_file_path, "w") # caller closes the file handle
🧰 Tools
🪛 Ruff (0.15.0)

[warning] 509-509: Unused noqa directive (non-enabled: SIM115)

Remove unused noqa directive

(RUF100)

🤖 Prompt for AI Agents
In `@src/mcpbr/log_formatter.py` at line 509: the line that opens the log file
contains an unnecessary noqa suppression ("# noqa: SIM115")—remove that
directive from the open call so the line becomes a plain open(log_file_path,
"w") call; locate the open(...) invocation where the variable log_file is
assigned and drop the trailing "# noqa: SIM115" comment.

Comment on lines 152 to 157
        except ImportError:
            # psutil not available, disable memory profiling
            self.enable_memory_profiling = False
-        except Exception:
+        except Exception:  # noqa: S110 -- best-effort memory sampling; non-critical telemetry
            # Failed to sample memory, skip silently
            pass

⚠️ Potential issue | 🟡 Minor

Ruff BLE001 still triggered by the broad Exception catch.

# noqa: S110 doesn’t silence BLE001. Either narrow the exception (e.g., psutil.Error, OSError) or add # noqa: BLE001 if the broad catch is intentional to keep lint green.

🛠️ Suggested narrow catch
-        except Exception:  # noqa: S110 -- best-effort memory sampling; non-critical telemetry
+        except (psutil.Error, OSError):  # noqa: S110 -- best-effort memory sampling; non-critical telemetry
             # Failed to sample memory, skip silently
             pass

As per coding guidelines: "Run linting checks with Ruff and ensure all issues are resolved before committing".

🧰 Tools
🪛 Ruff (0.15.0)

[warning] 155-155: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents
In `@src/mcpbr/profiler.py` around lines 152-157: the broad "except Exception:"
in the memory-sampling block of the Profiler (around the except handling that
sets self.enable_memory_profiling) triggers Ruff BLE001; replace the broad catch
with specific exceptions (e.g., except (psutil.Error, OSError):) or, if the
broad catch is intentional for best-effort telemetry, annotate it with the
linter exemption comment "# noqa: BLE001" so Ruff stops flagging it; update the
except block around the memory profiling sampling in profiler.py (the handler
that currently has "# noqa: S110 -- best-effort memory sampling; non-critical
telemetry") accordingly.

Comment on lines 114 to +119
        for dist in distributions():
-            name = dist.metadata.get("Name", "")
-            version = dist.metadata.get("Version", "")
+            name = dist.metadata["Name"] or ""
+            version = dist.metadata["Version"] or ""
            if name:
                packages[name] = version
-    except Exception:
+    except Exception:  # noqa: S110 -- best-effort package collection; environment may lack importlib.metadata

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🌐 Web query:

importlib.metadata Distribution metadata "Name" "Version" keys guaranteed

💡 Result:

Yes—for a standards-compliant distribution metadata file, Name and Version are required core-metadata fields (along with Metadata-Version). The current PyPA Core Metadata spec (v2.5) explicitly lists Metadata-Version, Name, and Version as the required fields. [1]

However, importlib.metadata does not give you an absolute “guarantee” for arbitrary installed environments (metadata can be missing/corrupt). The stdlib docs describe importlib.metadata as operating on installed distributions whose metadata is defined by the Core Metadata specs, and its PackageMetadata behaves like a mapping where obj['Key'] is equivalent to obj.get('Key') (so a missing key can yield None rather than being guaranteed present). [2]

Practical takeaways:

  • If you need the version reliably, prefer importlib.metadata.version(dist_name) / Distribution.version (it’s the dedicated API). [2]
  • If you read dist.metadata, treat dist.metadata.get("Name") / .get("Version") as “should exist per spec”, but still defensively handle missing values. [1][2]

Sources: [1] PyPA Core Metadata specification (required fields) • [2] Python importlib.metadata docs (PackageMetadata behavior and scope)


🌐 Web query:

importlib.metadata.Distribution.metadata safe access pattern KeyError

💡 Result:

Distribution.metadata is a PackageMetadata (dict-like) object. To avoid KeyError across versions/backports, don’t rely on md['Key']—use get() / get_all() (and defaults) instead. In stdlib docs, subscripting is described as equivalent to get(), but the backport (importlib-metadata) has explicitly warned that missing-key behavior is changing toward raising KeyError. [1], [2]

from importlib.metadata import distribution, PackageNotFoundError

try:
    md = distribution("some-dist").metadata
except PackageNotFoundError:
    md = None

if md is not None:
    name    = md.get("Name")                 # -> str | None
    license = md.get("License", "")          # -> default if missing

    # fields that can appear multiple times (e.g., Classifier, Project-URL)
    classifiers = md.get_all("Classifier") or []   # -> list[str]

If you truly want strict behavior, do:

value = md.get("License")
if value is None:
    ...  # handle missing

Sources: Python importlib.metadata docs for PackageMetadata/subscripting. [1] Backport (importlib-metadata) changelog noting missing-key behavior moving toward KeyError. [2]

[1] (docs.python.org)
[2] (importlib-metadata.readthedocs.io)

Citations:


🌐 Web query:

Python importlib.metadata how to safely access distribution metadata

💡 Result:

To safely access installed distribution metadata with importlib.metadata, use the public API (version, metadata, distribution, requires, files) and defensively handle (a) “package not installed” and (b) “field not present”.

1) Handle “not installed” (PackageNotFoundError)

Most query functions raise importlib.metadata.PackageNotFoundError if the named distribution isn’t installed. [1]

from importlib import metadata

try:
    v = metadata.version("wheel")
except metadata.PackageNotFoundError:
    v = None

2) Prefer metadata() / distribution().metadata and use dict-like access

metadata(name) returns a PackageMetadata object; treat it as a mapping and use .get() (or in) for optional fields. [2]

from importlib import metadata

try:
    m = metadata.metadata("wheel")  # PackageMetadata
except metadata.PackageNotFoundError:
    m = None

summary = m.get("Summary") if m else None
license_ = m.get("License") if m else None
requires_python = m.get("Requires-Python") if m else None

If you need more than metadata (e.g., .version, .origin), use distribution():

from importlib import metadata

try:
    dist = metadata.distribution("wheel")
except metadata.PackageNotFoundError:
    dist = None

version = dist.version if dist else None
requires_python = dist.metadata.get("Requires-Python") if dist else None

3) Don’t rely on the underlying message type / parsing quirks

The returned metadata object may be implemented using email.message types, but that’s not a stable API—stick to dictionary-like access methods/syntax. [3]

4) Treat files() as optional (None)

files(name) can return None if the distribution is found but file records are missing. [2]

from importlib import metadata

try:
    files = metadata.files("some-dist")  # may be None
except metadata.PackageNotFoundError:
    files = None

5) If you want normalized JSON-compatible keys, use .json

PackageMetadata.json provides PEP 566-style JSON-compatible metadata (helpful for normalized key names like requires_python). [2]

from importlib import metadata

m = metadata.metadata("wheel")
requires_python = m.json.get("requires_python")

Sources: Python importlib.metadata documentation (APIs, PackageNotFoundError, PackageMetadata, .json, files() returning None). [1][2][3]


Prefer dist.name and dist.version properties over metadata indexing.

Direct metadata indexing (dist.metadata["Name"] / ["Version"]) can raise KeyError across Python versions and backports. While the outer except Exception catches this, it masks the error and returns an empty packages dict. Instead, use the stable Distribution.name and Distribution.version properties, or defensively access metadata via .get():

💡 Suggested approaches

Option 1 (Preferred—uses stable properties):

-            name = dist.metadata["Name"] or ""
-            version = dist.metadata["Version"] or ""
+            name = dist.name or ""
+            version = dist.version or ""

Option 2 (If metadata access is required):

-            name = dist.metadata["Name"] or ""
-            version = dist.metadata["Version"] or ""
+            name = dist.metadata.get("Name", "") or ""
+            version = dist.metadata.get("Version", "") or ""
📝 Committable suggestion


Suggested change
        for dist in distributions():
-            name = dist.metadata["Name"] or ""
-            version = dist.metadata["Version"] or ""
+            name = dist.name or ""
+            version = dist.version or ""
            if name:
                packages[name] = version
    except Exception:  # noqa: S110 -- best-effort package collection; environment may lack importlib.metadata
🧰 Tools
🪛 Ruff (0.15.0)

[warning] 119-119: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents
In `@src/mcpbr/reproducibility.py` around lines 114-119: the package-collection
loop uses unstable metadata indexing (dist.metadata["Name"]/["Version"]),
which can raise KeyError; change it to use the stable Distribution properties
(dist.name and dist.version) when building the packages dict in the function
that iterates over distributions(), e.g., replace references to
dist.metadata["Name"] and dist.metadata["Version"] with dist.name and
dist.version and optionally guard with or "" to preserve current behavior; keep
the surrounding exception handling as a last-resort fallback.

Comment on lines +65 to 66
except Exception as e:
result["error"] = f"{type(e).__name__}: {str(e)[:300]}"

⚠️ Potential issue | 🟡 Minor

Avoid broad except Exception (BLE001).

Catching all exceptions here will trip Ruff's BLE001 and can hide unexpected failures. Prefer catching the expected exception types from benchmark creation/loading, or let unexpected exceptions bubble up so the test fails with a stack trace.

🧰 Tools
🪛 Ruff (0.15.0)

[warning] 65-65: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents
In `@tests/test_benchmark_integration.py` around lines 65-66: replace the broad
"except Exception as e" that writes to result["error"] with a targeted handler
that only catches expected errors from benchmark creation/loading (e.g.,
ValueError, IOError, OSError or your domain-specific exceptions) and records
them into result["error"]; for anything else, re-raise the exception so tests
fail and provide a traceback. Concretely: change the "except Exception as e"
block to either "except (ValueError, IOError) as e:" (adding any
project-specific exception classes) or use "except Exception as e: if
isinstance(e, (ValueError, IOError, YourBenchmarkError)): result['error'] = ...
else: raise" to avoid swallowing unexpected failures while preserving the
existing error message formatting.
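
The record-or-re-raise pattern described above can be sketched as follows; run_step and the chosen exception types are illustrative, not the test's actual API:

```python
# Hypothetical sketch: record expected failures in the result dict,
# but let anything unexpected propagate so the test fails loudly.
def run_step(step, result: dict) -> None:
    try:
        step()
    except (ValueError, OSError) as e:  # expected benchmark-loading errors
        result["error"] = f"{type(e).__name__}: {str(e)[:300]}"


def bad_config():
    raise ValueError("bad benchmark config")


def surprise():
    raise KeyError("unexpected")


result: dict = {}
run_step(bad_config, result)  # expected failure: recorded, not raised

try:
    run_step(surprise, result)
except KeyError:
    pass  # unexpected errors still reach the test runner with a traceback
```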

Comment on lines +42 to +48
        config.slack_bot_token = "xoxb-token"
        config.slack_channel = "#evals"
        config.github_token = None
        result = _build_notify_config(config)
        assert result["slack_webhook"] == "https://hooks.slack.com/test"
        assert result["discord_webhook"] == "https://discord.com/api/webhooks/test"
-        assert result["slack_bot_token"] == "xoxb-token"  # noqa: S105
+        assert result["slack_bot_token"] == "xoxb-token"

⚠️ Potential issue | 🟠 Major

Fix Ruff S105 by avoiding hardcoded literal assignment to slack_bot_token.

Removing noqa reintroduces S105 on Lines 42 and 48. Consider assigning from a non-literal variable and asserting against that variable to keep the test intent while satisfying lint.

✅ Proposed change
-        config.slack_bot_token = "xoxb-token"
+        token_value = "xoxb-token"
+        config.slack_bot_token = token_value
@@
-        assert result["slack_bot_token"] == "xoxb-token"
+        assert result["slack_bot_token"] == token_value
📝 Committable suggestion


Suggested change
-        config.slack_bot_token = "xoxb-token"
+        token_value = "xoxb-token"
+        config.slack_bot_token = token_value
        config.slack_channel = "#evals"
        config.github_token = None
        result = _build_notify_config(config)
        assert result["slack_webhook"] == "https://hooks.slack.com/test"
        assert result["discord_webhook"] == "https://discord.com/api/webhooks/test"
-        assert result["slack_bot_token"] == "xoxb-token"
+        assert result["slack_bot_token"] == token_value
🧰 Tools
🪛 Ruff (0.15.0)

[error] 42-42: Possible hardcoded password assigned to: "slack_bot_token"

(S105)


[error] 48-48: Possible hardcoded password assigned to: "slack_bot_token"

(S105)

🤖 Prompt for AI Agents
In `@tests/test_harness_notifications.py` around lines 42-48: the test currently
assigns a hardcoded literal directly to config.slack_bot_token (triggering Ruff
S105); change it to assign the token to a non-literal variable (e.g., token_var
= "xoxb-token"), set config.slack_bot_token = token_var, and update the
assertion to compare result["slack_bot_token"] against that variable instead of
a literal, leaving the rest of the test (discord/slack_webhook assertions and
use of _build_notify_config) unchanged.

Comment on lines +283 to +288
        with (
            patch("mcpbr.harnesses._run_cli_command", side_effect=mock_run_cli),
            patch("mcpbr.harnesses.shutil.which", return_value="/usr/bin/claude"),
            contextlib.suppress(Exception),
        ):
            await harness.solve(task, "/tmp/test", timeout=1)

⚠️ Potential issue | 🟡 Minor

Avoid hard-coded /tmp paths and broad exception suppression in these tests.

Using /tmp/test trips Ruff S108 and is less portable. Also, suppressing Exception can hide unexpected failures; narrow it to TimeoutError (or remove it) while using tmp_path.

Suggested update (apply to both async tests)
@@
-    @pytest.mark.asyncio
-    async def test_local_mode_sets_max_thinking_tokens(self):
+    @pytest.mark.asyncio
+    async def test_local_mode_sets_max_thinking_tokens(self, tmp_path):
@@
-        with (
-            patch("mcpbr.harnesses._run_cli_command", side_effect=mock_run_cli),
-            patch("mcpbr.harnesses.shutil.which", return_value="/usr/bin/claude"),
-            contextlib.suppress(Exception),
-        ):
-            await harness.solve(task, "/tmp/test", timeout=1)
+        with (
+            patch("mcpbr.harnesses._run_cli_command", side_effect=mock_run_cli),
+            patch("mcpbr.harnesses.shutil.which", return_value="/usr/bin/claude"),
+            contextlib.suppress(TimeoutError),
+        ):
+            await harness.solve(task, str(tmp_path), timeout=1)
@@
-    @pytest.mark.asyncio
-    async def test_local_mode_no_env_when_thinking_budget_none(self):
+    @pytest.mark.asyncio
+    async def test_local_mode_no_env_when_thinking_budget_none(self, tmp_path):
@@
-        with (
-            patch("mcpbr.harnesses._run_cli_command", side_effect=mock_run_cli),
-            patch("mcpbr.harnesses.shutil.which", return_value="/usr/bin/claude"),
-            contextlib.suppress(Exception),
-        ):
-            await harness.solve(task, "/tmp/test", timeout=1)
+        with (
+            patch("mcpbr.harnesses._run_cli_command", side_effect=mock_run_cli),
+            patch("mcpbr.harnesses.shutil.which", return_value="/usr/bin/claude"),
+            contextlib.suppress(TimeoutError),
+        ):
+            await harness.solve(task, str(tmp_path), timeout=1)

As per coding guidelines, "Run linting checks with Ruff and ensure all issues are resolved before committing" and "tests/**/*.py: Place tests in tests/ directory, mirror source file structure (e.g., src/mcpbr/foo.py → tests/test_foo.py), use pytest fixtures for common setup, and mark integration tests with @pytest.mark.integration".

Also applies to: 316-321

🧰 Tools
🪛 Ruff (0.15.0)

[error] 288-288: Probable insecure usage of temporary file or directory: "/tmp/test"

(S108)

🤖 Prompt for AI Agents
In `@tests/test_thinking_budget.py` around lines 283-288: replace the hard-coded
"/tmp/test" and the broad contextlib.suppress(Exception) in the patched test
block with a pytest tmp_path fixture and a narrow suppression (or none);
specifically update the patch/context manager that surrounds the call to
harness.solve(task, "/tmp/test", timeout=1) to call harness.solve(task, tmp_path
/ "test", timeout=1) and change contextlib.suppress(Exception) to
contextlib.suppress(TimeoutError) (or remove the suppressor entirely), and apply
the same change to the other async test block that patches
mcpbr.harnesses._run_cli_command and shutil.which so both tests use tmp_path and
only suppress TimeoutError.
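
Why the narrow suppression matters can be shown with a self-contained sketch; the solve stand-in below is illustrative, not the harness's real API:

```python
import contextlib


def solve(fail_with: type[BaseException]) -> None:
    # Stand-in for harness.solve: raises whatever failure we simulate.
    raise fail_with("simulated failure")


# The expected timeout is swallowed...
with contextlib.suppress(TimeoutError):
    solve(TimeoutError)

# ...but an unexpected bug still surfaces instead of being hidden.
try:
    with contextlib.suppress(TimeoutError):
        solve(RuntimeError)
except RuntimeError:
    outcome = "unexpected error propagated"
```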

@greynewell greynewell deleted the feat/strict-code-quality-v0.14.0 branch February 15, 2026 21:56
