[feat] Add `DaytonaRunner` for code `evaluators` #3258

junaway · 2025-12-20T00:10:55Z

No description provided.

vercel · 2025-12-20T00:11:00Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Review	Updated (UTC)
agenta-documentation	Ready	Preview, Comment	Jan 2, 2026 10:22am

Copilot

Pull request overview

This PR implements and tests Daytona-based code evaluation functionality, transitioning from the legacy local sandbox to a new SDK-based approach. It includes improvements to code editor indentation handling for Python/code blocks and adds example evaluators for testing various dependencies and API endpoints.

Key Changes

Replaced legacy custom_code_run with new sdk_custom_code_run that uses the SDK's workflow-based evaluator system
Enhanced code editor to preserve exact indentation for Python/code (no transformations) while maintaining space-to-tab conversion for JSON/YAML
Added example evaluators for testing OpenAI, NumPy, and Agenta API endpoints in Daytona environments

Reviewed changes

Copilot reviewed 20 out of 25 changed files in this pull request and generated 10 comments.


File	Description
`api/oss/src/services/evaluators_service.py`	Implements new SDK-based custom code runner function that delegates to workflow system
`api/oss/src/resources/evaluators/evaluators.py`	Updates default code template with deprecation note for app_params
`sdk/agenta/sdk/workflows/runners/daytona.py`	Adds environment variables (OPENAI_API_KEY, AGENTA_HOST, AGENTA_CREDENTIALS) to sandbox
`sdk/agenta/sdk/workflows/runners/local.py`	Exposes built-in Python types (dict, list, str, etc.) to restricted environment
`sdk/agenta/sdk/decorators/running.py`	Adds fallback to request.credentials in credential resolution chain
`web/oss/src/components/Editor/plugins/code/utils/pasteUtils.ts`	Preserves exact indentation for Python/code, converts spaces to tabs for JSON/YAML
`web/oss/src/components/Editor/plugins/code/plugins/IndentationPlugin.tsx`	Uses 4 spaces for Python/code tab insertion, 2 spaces for JSON/YAML
`web/oss/src/components/Editor/plugins/code/plugins/AutoFormatAndValidateOnPastePlugin.tsx`	Skips indentation transformation for Python/code, maintains it for JSON/YAML
`examples/python/evaluators/openai/*.py`	Adds OpenAI SDK evaluators for testing API availability and exact match comparisons
`examples/python/evaluators/numpy/*.py`	Adds NumPy evaluators for testing library availability and character counting
`examples/python/evaluators/basic/*.py`	Adds basic evaluators using Python stdlib for string matching, length checks, JSON validation
`examples/python/evaluators/ag/*.py`	Adds Agenta API endpoint evaluators for health, secrets, and config endpoints
`examples/python/evaluators/*.md`	Provides comprehensive documentation (README, QUICKSTART, SUMMARY) for evaluators


examples/python/evaluators/numpy/dependency_check.py

examples/python/evaluators/ag/secrets_check.py

examples/python/evaluators/ag/configs_check.py

sdk/agenta/sdk/workflows/runners/daytona.py

web/oss/src/components/Editor/plugins/code/utils/pasteUtils.ts

examples/python/evaluators/openai/exact_match.py

examples/python/evaluators/ag/health_check.py

examples/python/evaluators/openai/dependency_check.py

examples/python/evaluators/numpy/dependency_check.py


Copilot

Pull request overview

Copilot reviewed 32 out of 37 changed files in this pull request and generated 8 comments.


sdk/agenta/sdk/workflows/handlers.py

sdk/agenta/sdk/workflows/runners/daytona.py

api/oss/src/services/evaluators_service.py

examples/python/evaluators/openai/exact_match.py

sdk/agenta/sdk/workflows/runners/local.py

sdk/agenta/sdk/workflows/runners/daytona.py

examples/python/evaluators/openai/dependency_check.py

Add standard provider keys from vault as env vars. Add templates. Fix credentials (and thus secrets and traces) in evaluator playground.

Copilot

Pull request overview

Copilot reviewed 182 out of 299 changed files in this pull request and generated 3 comments.


Copilot · 2025-12-25T17:37:07Z

sdk/agenta/sdk/workflows/runners/daytona.py

+            runtime = runtime or "python"
+
+            # Select general snapshot
+            snapshot_id = os.getenv("DAYTONA_SNAPSHOT")


The environment variable name changed from AGENTA_SERVICES_SANDBOX_SNAPSHOT_PYTHON to DAYTONA_SNAPSHOT, but this is inconsistent with the naming pattern used elsewhere (e.g., AGENTA_HOST, AGENTA_API_URL). Consider using AGENTA_DAYTONA_SNAPSHOT or documenting why the AGENTA_ prefix was dropped for this variable.

Copilot · 2025-12-25T17:37:07Z

examples/test_daytona_scripts.py

+
+def _run_file(daytona: Daytona, runtime: str, path: Path) -> None:
+    code = path.read_text(encoding="utf-8")
+    wrapped = _wrap_python(code) if runtime == "python" else _wrap_js(code)
+


The sandbox creation doesn't specify a snapshot ID, but the _create_sandbox method in daytona.py requires DAYTONA_SNAPSHOT to be set. This will fail if the environment variable is not configured. Consider adding explicit snapshot configuration or error handling.

Suggested change

-def _run_file(daytona: Daytona, runtime: str, path: Path) -> None:
-    code = path.read_text(encoding="utf-8")
-    wrapped = _wrap_python(code) if runtime == "python" else _wrap_js(code)
+def _require_daytona_snapshot() -> str:
+    """Ensure that DAYTONA_SNAPSHOT is configured before creating sandboxes."""
+    snapshot = os.getenv("DAYTONA_SNAPSHOT")
+    if not snapshot:
+        raise RuntimeError(
+            "DAYTONA_SNAPSHOT is required to create Daytona sandboxes. "
+            "Please set the environment variable to a valid snapshot ID."
+        )
+    return snapshot
+
+
+def _run_file(daytona: Daytona, runtime: str, path: Path) -> None:
+    code = path.read_text(encoding="utf-8")
+    wrapped = _wrap_python(code) if runtime == "python" else _wrap_js(code)
+    # Validate that the required snapshot configuration is present before creating a sandbox.
+    _require_daytona_snapshot()

Copilot · 2025-12-25T17:37:08Z

api/oss/src/routers/evaluators_router.py

+    tracing_ctx = TracingContext.get()
+    tracing_ctx.credentials = credentials

-        with running_context_manager(RunningContext.get()):
-            running_ctx = RunningContext.get()
-            running_ctx.credentials = f"Secret {secret_token}"
+    ctx = RunningContext.get()
+    ctx.credentials = credentials

+    with tracing_context_manager(tracing_ctx):


The context objects are retrieved and modified before being passed to context managers. This pattern could lead to issues if the contexts are modified elsewhere between get() and the context manager entry. Consider retrieving fresh contexts inside the managers or ensuring contexts are isolated.

web/oss/src/services/testsets/api/index.ts

+    const response = await axios.post(
+        `${getAgentaApiUrl()}/testsets/revisions/${revisionId}/archive?project_id=${projectId}`,
+    )


General fix: Ensure that the user-controlled revisionId is validated/normalized on the client before being interpolated into the URL path. Reject values that are not in an expected safe format (e.g., a UUID or a restricted ID pattern), and avoid letting path traversal sequences or reserved URL meta-characters be passed through. If invalid, throw or refuse to make the request.

Best concrete fix in this code: In web/oss/src/services/testsets/api/index.ts, in archiveTestsetRevision, validate revisionId before constructing the URL. A minimal and safe approach is:

Introduce a small local validator (e.g., isSafeRevisionId) in this file that enforces a strict pattern (e.g., only letters, digits, hyphen, underscore, and limited length).

Call this validator at the top of archiveTestsetRevision. If the ID is invalid, throw an error instead of making the HTTP request.

Use encodeURIComponent when interpolating revisionId into the URL path, to prevent any unexpected interpretation of characters.

This keeps the current API shape and behavior for valid IDs, while making it impossible for a malicious query parameter to inject dangerous characters or path segments into the URL used by axios.post. No changes are necessary to the calling code in useTestcaseActions other than benefiting from the safer implementation.

Concretely:

In web/oss/src/services/testsets/api/index.ts, add a small helper function isSafeRevisionId near the archiveTestsetRevision function.

In archiveTestsetRevision, before using revisionId, check if (!isSafeRevisionId(revisionId)) throw new Error("Invalid revision ID").

When building the URL, wrap revisionId with encodeURIComponent(revisionId).

No imports are needed; we only use built-in RegExp and encodeURIComponent.


Copilot

Pull request overview

Copilot reviewed 183 out of 299 changed files in this pull request and generated 6 comments.


Copilot · 2025-12-25T18:15:13Z

api/oss/src/services/evaluators_service.py

+from openai import AsyncOpenAI
+
+# COMMENTED OUT: autoevals dependency removed
+# from autoevals.ragas import Faithfulness, ContextRelevancy


Corrected spelling of 'Relevancy' to 'Relevance' in the comment.

Suggested change

-# from autoevals.ragas import Faithfulness, ContextRelevancy
+# from autoevals.ragas import Faithfulness, ContextRelevancy # Commented out due to autoevals removal, corrected spelling of 'Relevance'

Copilot · 2025-12-25T18:15:13Z

web/oss/src/components/Editor/plugins/code/plugins/AutoFormatAndValidateOnPastePlugin.tsx

+                // Get the actual language from the CodeBlock node, or default to "code"
+                const language = $isCodeBlockNode(parentBlock) ? parentBlock.getLanguage() : "code"


The fallback to 'code' when parentBlock is not a CodeBlockNode may mask errors. Consider logging a warning or throwing an error if the parent is unexpectedly not a CodeBlockNode, as this likely indicates a programming error.

Suggested change

-                // Get the actual language from the CodeBlock node, or default to "code"
-                const language = $isCodeBlockNode(parentBlock) ? parentBlock.getLanguage() : "code"
+                // Get the actual language from the CodeBlock node, or default to "code".
+                // If parentBlock is not a CodeBlockNode, log a warning as this likely indicates
+                // a structural/editor bug, but still fall back to "code" to preserve behavior.
+                let language: string
+                if ($isCodeBlockNode(parentBlock)) {
+                    language = parentBlock.getLanguage()
+                } else {
+                    log("Paste: Expected parentBlock to be a CodeBlockNode", {
+                        selection,
+                        anchorNode,
+                        currentLine,
+                        parentBlock,
+                    })
+                    language = "code"
+                }

Copilot · 2025-12-25T18:15:13Z

sdk/agenta/sdk/workflows/runners/local.py

+        # Local runner only supports Python
+        if runtime != "python":
+            raise ValueError(
+                f"LocalRunner only supports 'python' runtime, got: {runtime}"


Removing RestrictedPython eliminates sandboxing protections. The local runner now executes arbitrary Python code without restrictions. This is a significant security regression if untrusted code can be executed. Ensure that the local runner is only used in trusted development environments and that production deployments use the Daytona runner.

Copilot · 2025-12-25T18:15:14Z

sdk/agenta/sdk/workflows/runners/daytona.py

+            agenta_credentials = (
+                RunningContext.get().credentials
+                #
+                or ""


String slicing agenta_credentials[7:] assumes 'ApiKey ' prefix is exactly 7 characters. However, the check is for 'ApiKey ' (with space), which is also 7 characters, so this is correct. But if the prefix format changes (e.g., 'ApiKey ' with two spaces), this will fail silently. Consider using agenta_credentials.removeprefix('ApiKey ') for robustness.

Suggested change

-                or ""
+                agenta_credentials.removeprefix("ApiKey ")

Copilot · 2025-12-25T18:15:14Z

web/oss/src/components/Editor/plugins/code/plugins/IndentationPlugin.tsx

+                        // Insert spaces instead of tab character
+                        // Use 4 spaces for Python/code (PEP 8 standard)
+                        // Use 2 spaces for JSON/YAML (typical formatting)
+                        const spaces = language === "json" || language === "yaml" ? "  " : "    "


Consider extracting these magic numbers (2 spaces for JSON/YAML, 4 spaces for code/Python/JavaScript/TypeScript) into named constants at the module level. This would make the indentation standards more visible and easier to modify consistently across the codebase.

Copilot · 2025-12-25T18:15:14Z

examples/test_daytona_scripts.py

+    for runtime, folder in BASIC_DIRS.items():
+        if not folder.exists():
+            continue
+        pattern = "*.py" if runtime == "python" else "*.js" if runtime == "javascript" else "*.ts"


This nested ternary expression is difficult to read. Consider using a dictionary mapping or if-elif-else structure for better clarity.

Suggested change

-        pattern = "*.py" if runtime == "python" else "*.js" if runtime == "javascript" else "*.ts"
+        if runtime == "python":
+            pattern = "*.py"
+        elif runtime == "javascript":
+            pattern = "*.js"
+        else:
+            pattern = "*.ts"
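The comment also mentions a dictionary mapping as an alternative. A minimal sketch of that option follows. `GLOB_BY_RUNTIME` and `glob_pattern_for` are hypothetical names, not part of the test script.

```python
# Hypothetical lookup table; the three runtimes are the ones named in the PR.
GLOB_BY_RUNTIME = {
    "python": "*.py",
    "javascript": "*.js",
    "typescript": "*.ts",
}


def glob_pattern_for(runtime: str) -> str:
    """Return the file glob for a runtime, rejecting unknown runtimes."""
    try:
        return GLOB_BY_RUNTIME[runtime]
    except KeyError:
        raise ValueError(f"Unsupported runtime: {runtime!r}") from None
```

Unlike the original expression, which treats every non-Python, non-JavaScript runtime as TypeScript, this version raises on an unknown runtime. Whether that stricter behavior is wanted is a design choice for the author.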


Copilot

Copilot reviewed 168 out of 310 changed files in this pull request and generated 2 comments.


Copilot · 2025-12-26T16:20:35Z

web/oss/src/components/SharedDrawers/AddToTestsetDrawer/atoms/testsetQueries.ts

+} from "@/oss/state/testsetSelection"
+
+/**
+ * Testset Queries - Clean atom-based data fetching


Corrected spelling of 'recieve' to 'receive' in comment.

Copilot · 2025-12-26T16:20:36Z

web/oss/src/components/Editor/plugins/code/nodes/Base64Node.tsx

+        }
+    }, [parsed.fullValue])
+
+    const isPdf = mimeType === "application/pdf"


The variable isPdf is declared but never used in the component. Consider removing it or using it in the conditional rendering logic if PDF-specific behavior is intended.


Copilot

Pull request overview

Copilot reviewed 48 out of 55 changed files in this pull request and generated 20 comments.


Copilot · 2026-01-02T10:30:47Z

sdk/agenta/sdk/workflows/runners/local.py

        """
-        Execute provided Python code safely using RestrictedPython.
+        Execute provided Python code directly.


The LocalRunner now executes code directly without any sandboxing or restrictions, but the docstring still references "safe execution". The comment on line 8 says "Local code runner using direct Python execution" which is accurate, but the run method docstring should be updated to reflect that this is NOT safe execution and should only be used in trusted environments.

Copilot · 2026-01-02T10:30:47Z

sdk/agenta/sdk/workflows/sandbox.py

+    Execute the provided code safely.

-    Uses the configured runner (local RestrictedPython or remote Daytona)
+    Uses the configured runner (local or remote Daytona)


The docstring for execute_code_safely still says the function executes code "safely", but with the LocalRunner now using direct exec() without restrictions, this is misleading. The function name and docstring should be updated to reflect that safety depends on the runner implementation, and LocalRunner is not actually safe.

Copilot · 2026-01-02T10:30:48Z

web/oss/src/components/Editor/plugins/code/utils/pasteUtils.ts

+            // NO transformation for Python/code - keep indent exactly as-is
+            // Just add the indent as a plain text node (preserves spaces AND tabs)
+            if (indent.length > 0) {
+                codeLine.append($createCodeHighlightNode(indent, "plain", false, null))


The comment on line 248 says "NO transformation for Python/code - keep indent exactly as-is" which is accurate, but then the comment on line 249 says "Just add the indent as a plain text node (preserves spaces AND tabs)". These could be combined into a single, clearer comment explaining that for Python/JS/TS, indentation is preserved exactly as pasted (both spaces and tabs) by inserting it as a plain text node.

Copilot · 2026-01-02T10:30:48Z

sdk/agenta/sdk/workflows/runners/daytona.py

+            runtime = runtime or "python"
+
+            # Select general snapshot
+            snapshot_id = os.getenv("DAYTONA_SNAPSHOT")


The environment variable name has changed from AGENTA_SERVICES_SANDBOX_SNAPSHOT_PYTHON to DAYTONA_SNAPSHOT. This appears to be a breaking change that could affect existing deployments. Consider either maintaining backward compatibility by checking both variable names, or documenting this breaking change clearly in migration notes.
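The backward-compatible option could look roughly like the sketch below, which prefers the new variable and falls back to the legacy one. `resolve_snapshot_id` is a hypothetical helper, and the `env` parameter exists only so the function can be exercised without touching the real environment.

```python
import os
from typing import Mapping, Optional

NEW_VAR = "DAYTONA_SNAPSHOT"
LEGACY_VAR = "AGENTA_SERVICES_SANDBOX_SNAPSHOT_PYTHON"


def resolve_snapshot_id(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the snapshot ID, preferring the new variable over the legacy one."""
    env = os.environ if env is None else env
    # Empty strings are treated the same as unset variables.
    snapshot_id = env.get(NEW_VAR) or env.get(LEGACY_VAR)
    if not snapshot_id:
        raise RuntimeError(
            f"No Daytona snapshot configured. Set {NEW_VAR} "
            f"(the legacy {LEGACY_VAR} is also accepted)."
        )
    return snapshot_id
```

Existing deployments that still set only the legacy variable would keep working under this approach, while new deployments can use the shorter name.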

Copilot · 2026-01-02T10:30:48Z

sdk/agenta/sdk/workflows/runners/daytona.py

-            if response_error:
-                log.error(f"Sandbox execution error: {response_error}")
-                raise RuntimeError(f"Sandbox execution failed: {response_error}")
+            if response_exit_code and response_exit_code != 0:


The code checks if response_exit_code is truthy before checking if it's non-zero. However, if exit_code is 0 (success), the expression response_exit_code and response_exit_code != 0 would be False (correct). But if exit_code is None (when the attribute doesn't exist), this would also be False, potentially masking errors. Consider explicitly checking if response_exit_code is not None and response_exit_code != 0 to distinguish between "no exit code" and "exit code is 0".
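To actually distinguish the three cases, the missing exit code needs its own branch, because `None` and `0` are both falsy. A minimal sketch, with `classify_exit_code` as a hypothetical helper and the three labels chosen only for illustration:

```python
from typing import Optional


def classify_exit_code(exit_code: Optional[int]) -> str:
    """Separate "no exit code reported" from success and failure."""
    if exit_code is None:
        # The response carried no exit code; the caller decides how to treat this.
        return "unknown"
    return "ok" if exit_code == 0 else "failed"
```

The caller can then decide whether "unknown" should be logged, retried, or raised as an error instead of being silently treated as success.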

Copilot · 2026-01-02T10:30:51Z

api/oss/src/resources/evaluators/evaluators.py

+                    "code": "from typing import Dict, Union, Any\n\n\ndef evaluate(\n    app_params: Dict[str, str],  # deprecated; currently receives {}\n    inputs: Dict[str, str],\n    output: Union[str, Dict[str, Any]],\n    correct_answer: str,\n) -> float:\n    if output == correct_answer:\n        return 1.0\n    return 0.0\n",
+                },
+                "description": "Exact match evaluator implemented in Python.",
+            },
+            {
+                "key": "javascript_default",
+                "name": "Exact Match (JavaScript)",
+                "values": {
+                    "requires_llm_api_keys": False,
+                    "runtime": "javascript",
+                    "correct_answer_key": "correct_answer",
+                    "code": 'function evaluate(appParams, inputs, output, correctAnswer) {\n  void appParams\n  void inputs\n\n  const outputStr =\n    typeof output === "string" ? output : JSON.stringify(output)\n\n  return outputStr === String(correctAnswer) ? 1.0 : 0.0\n}\n',
+                },
+                "description": "Exact match evaluator implemented in JavaScript.",
+            },
+            {
+                "key": "typescript_default",
+                "name": "Exact Match (TypeScript)",
+                "values": {
+                    "requires_llm_api_keys": False,
+                    "runtime": "typescript",
+                    "correct_answer_key": "correct_answer",
+                    "code": 'type OutputValue = string | Record<string, unknown>\n\nfunction evaluate(\n  app_params: Record<string, string>,\n  inputs: Record<string, string>,\n  output: OutputValue,\n  correct_answer: string\n): number {\n  void app_params\n  void inputs\n\n  const outputStr =\n    (typeof output === "string" ? output : JSON.stringify(output)) as string\n\n  return outputStr === String(correct_answer) ? 1.0 : 0.0\n}\n',
+                },


The preset code values are stored as long single-line strings with embedded newlines (\n). This makes the code difficult to read and maintain in the resource file. Consider using multiline strings or loading these presets from separate files to improve readability and maintainability of the evaluator preset code.
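As a sketch of the multiline-string option, the Python preset could be written with `textwrap.dedent` so the evaluator body reads as ordinary code. The constant name `PYTHON_EXACT_MATCH` is hypothetical, and how the resource file would reference it is an assumption.

```python
import textwrap

# Hypothetical constant; the text matches the Python preset string in the diff.
PYTHON_EXACT_MATCH = textwrap.dedent(
    """\
    from typing import Dict, Union, Any


    def evaluate(
        app_params: Dict[str, str],  # deprecated; currently receives {}
        inputs: Dict[str, str],
        output: Union[str, Dict[str, Any]],
        correct_answer: str,
    ) -> float:
        if output == correct_answer:
            return 1.0
        return 0.0
    """
)
```

The preset entry would then use `"code": PYTHON_EXACT_MATCH` in place of the escaped single-line literal, leaving the stored value unchanged.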

Copilot · 2026-01-02T10:30:51Z

sdk/agenta/sdk/workflows/runners/daytona.py

+            response = sandbox.process.code_run(wrapped_code)
+            response_stdout = response.result if hasattr(response, "result") else ""


The response handling uses response.result as the stdout content on line 250, but the production code comment history shows that previously it was response.stdout. The test script on line 119 also uses resp.result. However, there's no clear documentation about the Daytona API version being used. Consider documenting which Daytona SDK version this code is compatible with to avoid confusion about the correct attribute names.

Copilot · 2026-01-02T10:30:51Z

sdk/agenta/sdk/workflows/runners/daytona.py

            if not snapshot_id:
                raise RuntimeError(
-                    "AGENTA_SERVICES_SANDBOX_SNAPSHOT_PYTHON environment variable is required. "
-                    "Set it to the Daytona sandbox ID or snapshot name you want to use."
+                    f"No Daytona snapshot configured for runtime '{runtime}'. "
+                    f"Set DAYTONA_SNAPSHOT environment variable."
                )


The error message references runtime variable but uses a generic message format. When DAYTONA_SNAPSHOT is not set, the error message says "No Daytona snapshot configured for runtime '{runtime}'", but the snapshot selection logic doesn't actually vary by runtime - it uses the same DAYTONA_SNAPSHOT for all runtimes. This could be misleading. Consider clarifying the error message to reflect that a single snapshot is used for all runtimes.

Copilot · 2026-01-02T10:30:52Z

sdk/agenta/sdk/workflows/runners/daytona.py

+            agenta_api_key = (
+                agenta_credentials[7:]
+                if agenta_credentials.startswith("ApiKey ")
+                else ""
+            )


The code extracts API key from credentials by checking if it starts with "ApiKey " and slicing from position 7, but if the credentials string is exactly "ApiKey " (with no actual key following), this would result in an empty string, which would still be added to env vars. Consider adding validation to ensure the extracted API key is non-empty.
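A small helper could cover both this point and the earlier `removeprefix` suggestion. The sketch below assumes Python 3.9+ for `str.removeprefix`. `extract_api_key` is a hypothetical name, and returning `None` for a missing key is an assumption about what the caller wants.

```python
from typing import Optional

API_KEY_PREFIX = "ApiKey "


def extract_api_key(credentials: Optional[str]) -> Optional[str]:
    """Return the key that follows the 'ApiKey ' prefix, or None if absent."""
    if not credentials or not credentials.startswith(API_KEY_PREFIX):
        return None
    # removeprefix avoids hard-coding the prefix length as a slice index.
    api_key = credentials.removeprefix(API_KEY_PREFIX).strip()
    # An empty remainder means no usable key was supplied.
    return api_key or None
```

With a `None` result, the runner can skip setting the corresponding environment variable rather than exporting an empty value.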

Copilot · 2026-01-02T10:30:52Z

sdk/agenta/sdk/workflows/runners/daytona.py

+            # Fallback: attempt to extract a JSON object containing "result"
+            for line in reversed(output_lines):
+                if "result" not in line:
+                    continue
+                start = line.find("{")
+                end = line.rfind("}")
+                if start == -1 or end == -1 or end <= start:
+                    continue
+                try:
+                    result_obj = json.loads(line[start : end + 1])


The fallback result parsing logic has a potential issue. The code finds the last occurrence of '}' with rfind("}"), but this could match a closing brace that isn't part of the result JSON object. For example, if the output contains nested JSON or code snippets, this could incorrectly identify a brace position. Consider using a more robust JSON extraction approach or validating that the extracted substring is actually valid JSON before attempting to parse it.
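One more robust approach is to let the JSON decoder find where each object ends instead of pairing the first `{` with the last `}`. The sketch below uses `json.JSONDecoder.raw_decode` from the standard library. `extract_result_object` is a hypothetical helper, not the runner's actual code.

```python
import json
from typing import Any, Dict, Optional

_DECODER = json.JSONDecoder()


def extract_result_object(line: str) -> Optional[Dict[str, Any]]:
    """Return the first JSON object in `line` that has a "result" key, if any."""
    start = line.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and
            # ignores any trailing text after it.
            obj, _end = _DECODER.raw_decode(line, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict) and "result" in obj:
            return obj
        # Try the next opening brace in the line.
        start = line.find("{", start + 1)
    return None
```

For a line such as `log {"a": 1} {"result": 0.5} }`, the `find`/`rfind` slice would span both objects and the stray brace and fail to parse, whereas this scan skips the first object and returns the second.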

jp-agenta added 9 commits December 19, 2025 11:29

adding evaluators (WIP)

f75cb59

adding evaluators (WIP)

c2c553a

fixing evaluators

5a8dcd0

Merge branch 'release/v0.69.5' into chore/check-daytona-code-evaluator

91e69e8

testing numpy/openai/agenta

a602930

fix typos in init

59f4797

confirm works with localhost if public host

6c297c9

fix playground

a717366

fix presets

b3d90f2

Copilot AI review requested due to automatic review settings December 20, 2025 00:10

vercel bot deployed to Preview December 20, 2025 00:11 View deployment

Copilot started reviewing on behalf of junaway December 20, 2025 00:11 View session

jp-agenta added 2 commits December 20, 2025 01:12

remove blaot

5bdc802

remove bloat

5304e0f

vercel bot deployed to Preview December 20, 2025 00:13 View deployment

fix daytona imports

00958cc

vercel bot deployed to Preview December 20, 2025 00:15 View deployment

remove openai key from daytona

4071a3d

vercel bot deployed to Preview December 20, 2025 00:16 View deployment

Copilot AI reviewed Dec 20, 2025


jp-agenta added 2 commits December 23, 2025 12:31

WIP add runtimes

7d3ac94

Merge branch 'fix/remove-autoevals-and-rag-evaluators' into chore/check-daytona-code-evaluator

84bbdaa

vercel bot deployed to Preview December 23, 2025 11:32 View deployment

Copilot AI review requested due to automatic review settings December 23, 2025 11:39

Merge branch 'main' into chore/check-daytona-code-evaluator

a4ffa8c

Copilot started reviewing on behalf of junaway December 23, 2025 11:39 View session

vercel bot deployed to Preview December 23, 2025 11:40 View deployment

Copilot AI reviewed Dec 23, 2025


WIP

93d7bb5

Add standard provider keys from vault as env vars. Add templates. Fix credentials (and thus secrets and traces) in evaluator playground.

apply es lint

59a6e6b

vercel bot deployed to Preview December 23, 2025 18:01 View deployment

Copilot AI review requested due to automatic review settings December 25, 2025 17:36

vercel bot deployed to Preview December 25, 2025 17:37 View deployment

Copilot AI reviewed Dec 25, 2025


github-advanced-security bot found potential problems Dec 25, 2025


junaway force-pushed the chore/check-daytona-code-evaluator branch from 1562c7d to 59a6e6b Compare December 25, 2025 18:06

jp-agenta added 2 commits December 25, 2025 19:11

Merge branch 'frontend-feat/new-testsets-integration' into chore/check-daytona-code-evaluator

cdf1ae0

fix merge issues

51856c6

Copilot AI review requested due to automatic review settings December 25, 2025 18:14

vercel bot deployed to Preview December 25, 2025 18:15 View deployment

Copilot AI reviewed Dec 25, 2025


Merge branch 'frontend-feat/new-testsets-integration' into chore/check-daytona-code-evaluator

3966be2

vercel bot deployed to Preview December 25, 2025 18:39 View deployment

junaway changed the title ~~[feat] Add daytona code evaluators~~ [feat] Add DaytonaRunner for code evaluators Dec 25, 2025

Merge branch 'frontend-feat/new-testsets-integration' into chore/check-daytona-code-evaluator

8a8d9df

Copilot AI review requested due to automatic review settings December 26, 2025 16:20

Copilot AI reviewed Dec 26, 2025


vercel bot deployed to Preview December 26, 2025 16:20 View deployment

junaway marked this pull request as draft December 29, 2025 10:27

junaway changed the base branch from main to frontend-feat/new-testsets-integration December 29, 2025 10:27

Merge branch 'frontend-feat/new-testsets-integration' into chore/check-daytona-code-evaluator

18e2e3c

vercel bot deployed to Preview December 30, 2025 08:53 View deployment

Merge branch 'frontend-feat/new-testsets-integration' into chore/check-daytona-code-evaluator

9ce5afe

Copilot AI review requested due to automatic review settings January 2, 2026 10:19

Copilot started reviewing on behalf of junaway January 2, 2026 10:20 View session

vercel bot deployed to Preview January 2, 2026 10:21 View deployment

ruff format

d9d6858

vercel bot deployed to Preview January 2, 2026 10:22 View deployment

Copilot AI reviewed Jan 2, 2026


@@ -397,11 +397,23 @@
              * @param revisionId - The ID of the revision to archive
              * @returns The archived revision data
              */
+            function isSafeRevisionId(revisionId: string): boolean {
+                // Allow only typical ID characters; adjust pattern if backend uses a stricter format (e.g., UUID)
+                // This prevents path traversal and other special characters from being used in the URL path segment.
+                return /^[A-Za-z0-9_-]{1,128}$/.test(revisionId)
+            }
             export async function archiveTestsetRevision(revisionId: string) {
+                if (!isSafeRevisionId(revisionId)) {
+                    throw new Error("Invalid revision ID")
+                }
                 const {projectId} = getProjectValues()
+                const safeRevisionId = encodeURIComponent(revisionId)
                 const response = await axios.post(
-                    `${getAgentaApiUrl()}/testsets/revisions/${revisionId}/archive?project_id=${projectId}`,
+                    `${getAgentaApiUrl()}/testsets/revisions/${safeRevisionId}/archive?project_id=${projectId}`,
                 )
                 return response.data

