feat: image generation, edit and analysis with Responses API#414

Open
nickpismenkov wants to merge 10 commits into main from feat/image-responses-api

Conversation

@nickpismenkov
Contributor

No description provided.

@gemini-code-assist
Contributor

Summary of Changes

Hello @nickpismenkov, I'm Gemini Code Assist [1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly extends the Responses API to support advanced multimodal interactions, specifically focusing on image generation, editing, and analysis. It enables the system to process image-related requests, route them to specialized models, and return image outputs or text-based analyses of image inputs. The changes involve updates across the API and service layers, including new data models for image content and usage, and a robust suite of tests to validate these new functionalities.

Highlights

  • Image Generation and Editing: Introduced comprehensive support for image generation and editing capabilities within the Responses API, allowing users to create and modify images programmatically.
  • Multimodal Input for Text Models: Enabled multimodal input for text-based models, allowing images to be provided as part of the input for analysis, leading to text-based responses.
  • New Data Structures and Usage Tracking: Added new data structures, OutputImage and ImageOutputData, to represent image responses, and integrated image_count into the Usage statistics for tracking image generation.
  • API and Service Layer Integration: Modified the API and service layers to detect image generation/editing requests, route them to appropriate inference providers, and handle base64 image decoding for input images.
  • Expanded Test Coverage: Introduced extensive end-to-end tests for image generation, editing, and analysis through the Responses API, ensuring proper functionality, conversation threading, and backward compatibility for existing text completions.
Changelog
  • Cargo.lock
    • Added base64 dependency to support encoding/decoding image data.
  • crates/api/src/lib.rs
    • Updated test model configurations to include new text models (zai-org/GLM-4.7, openai/gpt-oss-120b), a multimodal image analysis model (Qwen/Qwen3-VL-30B-A3B-Instruct), and an image generation model (black-forest-labs/FLUX.2-klein-4B).
  • crates/api/src/routes/completions.rs
    • Modified convert_chat_request_to_service and convert_text_request_to_service to include multimodal_content: None in CompletionMessage creation, preparing for multimodal support.
  • crates/api/src/routes/conversations.rs
    • Added handling for services::responses::models::ResponseContentItem::OutputImage in convert_output_item_to_conversation_item, currently skipping direct inclusion of images in conversation content.
  • crates/api/src/routes/responses.rs
    • Added explicit handling for ResponseContentItem::OutputImage in convert_to_input_part, ensuring images are not converted to text input parts.
  • crates/api/tests/common/mod.rs
    • Updated setup_qwen_image_model to use black-forest-labs/FLUX.2-klein-4B instead of Qwen/Qwen-Image-2512 for image generation tests, including display name and description updates.
  • crates/api/tests/e2e_audio_image.rs
    • Replaced references to Qwen/Qwen-Image-2512 with black-forest-labs/FLUX.2-klein-4B across various image generation and edit test cases.
  • crates/api/tests/e2e_responses_images.rs
    • Added a new end-to-end test file (e2e_responses_images.rs) to cover image generation, editing, analysis, conversation threading, usage tracking, and backward compatibility for the Responses API.
  • crates/services/Cargo.toml
    • Added base64 = "0.22" dependency.
  • crates/services/src/completions/mod.rs
    • Modified ChatMessage creation to prioritize multimodal_content if available, falling back to content for text.
    • Implemented get_inference_provider_pool for the CompletionServiceTrait to allow access to the inference provider pool for image operations.
  • crates/services/src/completions/ports.rs
    • Added multimodal_content: Option<serde_json::Value> to CompletionMessage for multimodal input support.
    • Added get_inference_provider_pool to the CompletionServiceTrait.
  • crates/services/src/responses/models.rs
    • Introduced OutputImage enum variant for ResponseContentItem and ResponseOutputContent to represent image outputs.
    • Defined ImageOutputData struct to hold base64 image data, URL, and revised prompt.
    • Added image_count: Option<i32> to the Usage struct for tracking generated images.
    • Added new_image_only constructor to Usage for image-specific usage reporting.
  • crates/services/src/responses/service.rs
    • Added build_multimodal_content function to construct OpenAI-compatible multimodal content arrays from response parts.
    • Refactored stream processing functions to use direct service_helpers imports.
    • Implemented a new logic block in stream_response to detect image generation models and route requests to image_generation or image_edit inference providers.
    • Added is_image_generation_model to identify models capable of image generation/editing.
    • Added extract_prompt_from_request and extract_input_image_from_request to parse image-related request parameters, including base64 image decoding.
    • Updated various CompletionMessage creations to include multimodal_content: None.
    • Modified prepare_messages to handle multimodal_content when processing ResponseInput::Items containing images.
  • crates/services/src/responses/service_helpers.rs
    • Added emit_output_image_created function to emit streaming events when an image output is created.
  • crates/services/src/responses/tools/mcp.rs
    • Updated CompletionMessage creation to include multimodal_content: None.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

  • Code Review (/gemini review): Performs a code review for the current pull request in its current state.
  • Pull Request Summary (/gemini summary): Provides a summary of the current pull request in its current state.
  • Comment (@gemini-code-assist): Responds in comments when explicitly tagged, both in pull request comments and review comments.
  • Help (/gemini help): Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.


@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces significant new functionality for image generation, editing, and analysis through the Responses API. The implementation is well-structured, separating the logic for image generation from image analysis. The addition of a comprehensive e2e test suite is a great contribution to ensure the stability of these new features. My review includes a high-severity suggestion to improve the mechanism for detecting image generation models by using model capabilities from the database instead of hardcoded string matching, which will make the system more robust and scalable. I've also included a medium-severity suggestion to improve the robustness of a test helper.

Comment on lines 2592 to 2598

fn is_image_generation_model(model_name: &str) -> bool {
    let model_lower = model_name.to_lowercase();
    model_lower.contains("dall-e")
        || model_lower.contains("flux")
        || model_lower.contains("sd")
        || model_lower.contains("stable-diffusion")
}

high

Relying on string matching against the model name to determine if it's an image generation model is brittle and not easily extensible. A more robust approach would be to use the model's capabilities stored in the database.

I suggest the following refactoring:

  1. Expose model details via CompletionServiceTrait: Add a method like async fn get_model(&self, model_name: &str) -> Result<Option<Model>> to CompletionServiceTrait and implement it in CompletionServiceImpl to fetch model details from models_repository.

  2. Use model capabilities in process_response_stream: Before the if condition at line 923, resolve the model using the new service method. Then, check if the model's output_modalities contain "image".

This would look something like this:

// In process_response_stream
let model = context
    .completion_service
    .get_model(&context.request.model)
    .await?
    .ok_or_else(|| {
        errors::ResponseError::InvalidParams(format!(
            "Model '{}' not found",
            &context.request.model
        ))
    })?;

let is_image_gen = model.output_modalities.contains(&"image".to_string());

if is_image_gen {
    // Image generation logic...
}

This change would make the routing logic more reliable and automatically support new image generation models as they are added to the database without requiring code changes.

Comment on lines 186 to 192

let first_response_id = if let Some(start) = first_response_text.find("\"id\":\"resp_") {
    let start_uuid = start + "\"id\":\"resp_".len();
    let end = first_response_text[start_uuid..].find('"').unwrap();
    &first_response_text[start_uuid..start_uuid + end]
} else {
    panic!("Could not find response ID in first response");
};

medium

Parsing the response ID from the raw text stream is a bit brittle and might break if the JSON formatting or event order changes. A more robust approach would be to parse the Server-Sent Events (SSE) stream properly. You could split the response text by \n\n, then for each event block, find the data: line, parse the JSON, and extract the ID. This would make the test more resilient to minor changes in the stream format.
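A std-only sketch of that approach (the helper name `extract_response_id` is hypothetical; a real test would deserialize each `data:` payload with serde_json rather than substring search):

```rust
/// Split an SSE body into events on "\n\n", look only at `data:` payloads,
/// and pull the first `resp_...` id out of them. String extraction stands in
/// for full JSON parsing here to keep the sketch dependency-free.
fn extract_response_id(sse_body: &str) -> Option<String> {
    for event in sse_body.split("\n\n") {
        for line in event.lines() {
            if let Some(payload) = line.strip_prefix("data: ") {
                if let Some(start) = payload.find("\"id\":\"resp_") {
                    let id_start = start + "\"id\":\"".len();
                    let rest = &payload[id_start..];
                    if let Some(end) = rest.find('"') {
                        return Some(rest[..end].to_string());
                    }
                }
            }
        }
    }
    None
}

fn main() {
    let body = "event: response.created\ndata: {\"id\":\"resp_abc123\",\"object\":\"response\"}\n\ndata: [DONE]\n\n";
    assert_eq!(extract_response_id(body).as_deref(), Some("resp_abc123"));
    assert_eq!(extract_response_id("event: ping\n\n"), None);
}
```

Because the search is scoped to one `data:` payload at a time, reordered events or extra whitespace between events no longer break the extraction.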

@claude

claude bot commented Feb 4, 2026

Critical Issues Found ⚠️

I've reviewed this PR focusing on production safety, security, and performance for the new image generation/editing features. Here are the critical issues that must be addressed:


🔴 CRITICAL - Must Fix Before Merge

1. Memory Exhaustion Vulnerability (Security + Performance)

Location: crates/services/src/responses/service.rs:2649

The base64 decoder has no size limits before attempting to decode. A malicious user could send a 500MB base64 string causing memory exhaustion and pod crashes in production.

Fix: Add validation before decode:

const MAX_BASE64_IMAGE_SIZE: usize = 10_485_760; // 10MB base64
if base64_str.len() > MAX_BASE64_IMAGE_SIZE {
    return Err(errors::ResponseError::InvalidParams(
        format!("Image size exceeds maximum of {}MB", MAX_BASE64_IMAGE_SIZE / 1_048_576)
    ));
}

2. Blocking Operation in Async Context (Performance)

Location: crates/services/src/responses/service.rs:2649-2650

Base64 decoding of large images (e.g., 50MB) is a CPU-intensive synchronous operation running in async context, blocking the runtime thread. In multi-cluster production, this reduces throughput significantly.

Fix: Use spawn_blocking:

let base64_str_owned = base64_str.to_string();
let decoded = tokio::task::spawn_blocking(move || {
    base64::engine::general_purpose::STANDARD.decode(&base64_str_owned)
})
.await
.map_err(|e| errors::ResponseError::InternalError(format!("Decode task failed: {e}")))?
.map_err(|e| errors::ResponseError::InvalidParams(format!("Failed to decode base64: {e}")))?;

3. Missing MIME Type Validation (Security)

Location: crates/services/src/responses/service.rs:2645

The code accepts any data URL without validating the MIME type. This could lead to:

  • XSS if image data is rendered in admin dashboards
  • Smuggling non-image data through the system

Fix: Validate MIME type strictly:

if url_str.starts_with("data:image/png;base64,") || url_str.starts_with("data:image/jpeg;base64,") {
    let comma_pos = url_str.find(',').ok_or_else(||
        errors::ResponseError::InvalidParams("Invalid data URL format".to_string()))?;
    let base64_str = &url_str[comma_pos + 1..];
    // ... decode with size check
} else {
    return Err(errors::ResponseError::InvalidParams(
        "Image must be PNG or JPEG with base64 encoding".to_string()));
}

4. Memory Amplification Issue (Performance)

Location: crates/inference_providers/src/vllm/mod.rs:503

Image bytes are cloned multiple times:

  1. Base64 string in request (~133MB for 100MB image)
  2. Decoded Vec (100MB)
  3. image_data.to_vec() clone for multipart form (100MB)

Memory multiplier: 100 concurrent image edits = 33GB memory!

Fix: Avoid the clone in multipart form:

// Instead of:
let image_part = reqwest::multipart::Part::bytes(image_data.to_vec())

// Use stream or find reqwest API that accepts Arc<[u8]>

5. Database Scalability Problem (Production Safety)

Location: crates/services/src/responses/service.rs:1019-1022

Entire base64 image data is stored in the response_items PostgreSQL JSONB field. A single 10MB image becomes ~13MB base64 JSON, causing:

  • Database bloat
  • Slow conversation queries
  • Expensive backups
  • PostgreSQL 1GB JSONB limit risk

Fix: Store images in S3/object storage:

// Upload to S3 first
let image_url = file_service.upload_image(workspace_id, &image_bytes).await?;

// Store only URL in database
content: vec![models::ResponseContentItem::OutputImage {
    data: vec![], // Empty, not used
    url: Some(image_url),
}],

6. Image Edit Detection is Ambiguous (Logic)

Location: crates/services/src/responses/service.rs:940-953

The code routes to image edit if ANY input item contains an image. This creates problems:

  • User sends text prompt + reference image → incorrectly treated as edit
  • Cannot distinguish between: image editing, image analysis (text output), or image generation with reference

Fix: Use more explicit routing logic:

// Only treat as edit if:
// 1. Model supports editing AND
// 2. Input has image AND  
// 3. Prompt indicates editing intent (or explicit edit parameter)
let is_edit = Self::is_image_generation_model(&context.request.model) 
    && has_input_image 
    && context.request.operation_type == Some("edit"); // Add explicit field

🟡 HIGH PRIORITY - Fix Soon

7. Privacy Compliance Risk (CLAUDE.md violation)

Per CLAUDE.md: "NEVER LOG: File contents - Uploaded file data or processed file content"

Action needed: Audit all log statements to ensure image data is never logged:

  • Check error messages don't include request.input
  • Verify response_text in logs doesn't contain base64 images
  • Add tests to validate no image data in logs

8. Missing Image-Specific Rate Limiting (Security + Cost)

Image generation is 100x more expensive (compute/memory) than text. Without separate rate limits:

  • Attackers can exhaust resources within general rate limits
  • Costs can spike unexpectedly

Fix: Add separate rate limit for image operations in rate limiter service.


9. Incomplete Usage Tracking (Billing)

Usage::image_count tracks only the count, missing critical billing data:

  • Image resolution (1024x1024 vs 2048x2048 = very different cost)
  • Operation type (generation vs edit have different pricing)

Fix: Add fields to Usage model:

pub image_resolution: Option<String>,
pub image_operation: Option<String>, // "generation", "edit"

10. No Validated Image Size After Decode (Security)

After base64 decode, bytes are used directly without validating they're actually a valid image. Could crash backends with malformed data.

Fix: Validate PNG/JPEG magic bytes after decode.
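A minimal sketch of that check, using only std (`sniff_image_format` is a hypothetical helper; the real code would map a `None` here to `ResponseError::InvalidParams`):

```rust
/// Check decoded image bytes against the PNG and JPEG magic numbers
/// before forwarding them to an inference backend, so malformed data
/// is rejected at the boundary instead of crashing a provider.
fn sniff_image_format(bytes: &[u8]) -> Option<&'static str> {
    const PNG_MAGIC: [u8; 8] = [0x89, b'P', b'N', b'G', 0x0D, 0x0A, 0x1A, 0x0A];
    const JPEG_MAGIC: [u8; 3] = [0xFF, 0xD8, 0xFF];
    if bytes.starts_with(&PNG_MAGIC) {
        Some("image/png")
    } else if bytes.starts_with(&JPEG_MAGIC) {
        Some("image/jpeg")
    } else {
        None
    }
}

fn main() {
    let png_header = [0x89, b'P', b'N', b'G', 0x0D, 0x0A, 0x1A, 0x0A, 0, 0];
    assert_eq!(sniff_image_format(&png_header), Some("image/png"));
    assert_eq!(sniff_image_format(b"not an image"), None);
}
```

Magic-byte sniffing also lets the declared MIME type from the data URL be cross-checked against the actual bytes.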


✅ Positive Observations

  • Good use of Arc for large image data to avoid clones during retries
  • Comprehensive test coverage in e2e_responses_images.rs
  • Proper separation of image and text completion flows
  • Multimodal content handling is well-structured

Recommendation

Do not merge until issues #1-6 are resolved. These represent critical security, performance, and scalability risks in production deployment.

Issues #7-10 should be addressed before general availability but could be fixed in follow-up PRs if needed for timeline reasons.

Happy to discuss any of these findings! 🚀


Copilot AI left a comment

Pull request overview

This pull request adds comprehensive image generation, editing, and analysis capabilities to the Responses API. The implementation routes image generation model requests (dall-e, flux, stable-diffusion) to the image generation endpoint while maintaining text model support for image analysis.

Changes:

  • Added image generation and editing support through the Responses API with proper streaming events and usage tracking
  • Implemented multimodal content handling for image analysis (text models with image inputs)
  • Introduced new data structures (OutputImage, ImageOutputData, image_count in Usage) to support image outputs

Reviewed changes

Copilot reviewed 18 out of 19 changed files in this pull request and generated 5 comments.

Show a summary per file
  • crates/services/src/responses/service.rs: Core image generation/edit routing logic, multimodal content building, and prompt/image extraction helpers
  • crates/services/src/responses/service_helpers.rs: Added emit_output_image_created() event emitter for streaming image generation events
  • crates/services/src/responses/models.rs: New OutputImage content type, ImageOutputData struct, and image_count field in Usage
  • crates/services/src/completions/ports.rs: Added multimodal_content field to CompletionMessage for image analysis support
  • crates/services/src/completions/mod.rs: Updated message conversion to use multimodal_content when available
  • crates/api/src/routes/responses.rs: Skip OutputImage when converting to input parts
  • crates/api/src/routes/conversations.rs: Skip OutputImage in conversation content conversion
  • crates/api/src/routes/completions.rs: Set multimodal_content to None for chat/text completions
  • crates/api/tests/e2e_responses_images.rs: Comprehensive E2E tests for image generation, editing, analysis, and backward compatibility
  • crates/api/tests/e2e_audio_image.rs: Updated model name from Qwen/Qwen-Image-2512 to black-forest-labs/FLUX.2-klein-4B
  • crates/api/tests/common/mod.rs: Updated image model setup to use FLUX 2 Klein
  • crates/api/src/lib.rs: Added FLUX model to test mock providers
  • crates/services/Cargo.toml: Added base64 dependency for image data URL decoding
  • crates/services/src/responses/tools/mcp.rs: Set multimodal_content to None for MCP approval responses
  • Cargo.lock: Updated bytes dependency to 1.11.1
Comments suppressed due to low confidence (4)

crates/services/src/responses/service.rs:2658

  • The function only handles base64 data URLs (data:...) but doesn't handle HTTP/HTTPS URLs. If a user provides an image URL like "https://example.com/image.png", the function will silently skip it and return "No input image found" error. Consider either fetching the image from HTTP URLs or returning a more specific error message that indicates only data URLs are supported.
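A std-only sketch of the explicit-error variant of that suggestion (`ImageUrlError` and `base64_payload` are hypothetical stand-ins for `errors::ResponseError` and the extraction helper):

```rust
/// Classify an image URL before decoding: accept only base64 data URLs,
/// and fail loudly on remote or malformed URLs instead of silently
/// skipping them and surfacing a vague "No input image found" later.
#[derive(Debug, PartialEq)]
enum ImageUrlError {
    RemoteUrlUnsupported,
    MalformedDataUrl,
}

fn base64_payload(url: &str) -> Result<&str, ImageUrlError> {
    if url.starts_with("http://") || url.starts_with("https://") {
        return Err(ImageUrlError::RemoteUrlUnsupported);
    }
    if let Some(rest) = url.strip_prefix("data:") {
        // A well-formed data URL has "mediatype;base64," before the payload.
        return rest
            .split_once(',')
            .map(|(_, payload)| payload)
            .ok_or(ImageUrlError::MalformedDataUrl);
    }
    Err(ImageUrlError::MalformedDataUrl)
}

fn main() {
    assert_eq!(base64_payload("data:image/png;base64,aGk="), Ok("aGk="));
    assert_eq!(
        base64_payload("https://example.com/image.png"),
        Err(ImageUrlError::RemoteUrlUnsupported)
    );
    assert_eq!(
        base64_payload("data:image/png;base64invalid"),
        Err(ImageUrlError::MalformedDataUrl)
    );
}
```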

crates/services/src/responses/service.rs:2618

  • When input is of type ResponseInput::Items, the function only checks for Parts variant but doesn't handle the Text variant of ResponseContent. If an item has content: ResponseContent::Text(text), the prompt won't be extracted and the function will incorrectly return "No text prompt found for image request" error. Consider handling both ResponseContent::Text and ResponseContent::Parts variants.

crates/services/src/responses/service.rs:2657

  • When a data URL is found but doesn't contain a comma (line 647), the function continues to the next image instead of returning an error. This means a malformed data URL like "data:image/png;base64invalid" will be silently skipped. Consider returning an explicit error when a data URL is detected but is malformed.

crates/services/src/responses/service.rs:2598

  • The model detection logic using "sd" substring is too broad and could match unintended models. For example, a model named "bsd-chat" or "wisdom-7b" would be incorrectly detected as an image generation model. Consider using a more specific pattern like checking for "sd-" prefix or full model family names like "stable-diffusion-" to avoid false positives.
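A sketch of the stricter token-based check that comment suggests (`looks_like_image_model` is a hypothetical replacement for the PR's `is_image_generation_model`):

```rust
/// Match "sd" only as a standalone token (split on non-alphanumeric
/// separators) rather than as a bare substring, so model names like
/// "wisdom-7b" or "bsd-chat" are not routed to image generation.
fn looks_like_image_model(model_name: &str) -> bool {
    let lower = model_name.to_lowercase();
    let has_sd_token = lower
        .split(|c: char| !c.is_ascii_alphanumeric())
        .any(|tok| tok == "sd" || tok == "sdxl");
    lower.contains("dall-e")
        || lower.contains("flux")
        || lower.contains("stable-diffusion")
        || has_sd_token
}

fn main() {
    assert!(looks_like_image_model("stabilityai/sd-turbo"));
    assert!(looks_like_image_model("black-forest-labs/FLUX.2-klein-4B"));
    assert!(!looks_like_image_model("wisdom-7b"));
    assert!(!looks_like_image_model("bsd-chat"));
}
```

This is still string matching, so the capability-based lookup suggested earlier in the review remains the more robust long-term fix.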


Comment on lines +364 to +377

            "type": "text",
            "text": text
        }));
    }
    models::ResponseContentPart::InputImage {
        image_url,
        detail: _,
    } => {
        let url_str = match image_url {
            models::ResponseImageUrl::String(s) => s.clone(),
            models::ResponseImageUrl::Object { url } => url.clone(),
        };
        content_array.push(serde_json::json!({
            "type": "image_url",

Copilot AI Feb 4, 2026

The detail field from InputImage is being discarded when building multimodal content. According to the OpenAI API format, the detail parameter (with values like "auto", "low", or "high") should be included in the image_url object to control image resolution processing. This field should be preserved when constructing the multimodal content array.
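A std-only sketch of carrying the hint through (`image_url_part` is a hypothetical helper, with plain string formatting standing in for the `serde_json::json!` call used in `build_multimodal_content`):

```rust
/// Build an OpenAI-style image_url content part, keeping the optional
/// `detail` hint ("auto" | "low" | "high") instead of dropping it.
fn image_url_part(url: &str, detail: Option<&str>) -> String {
    match detail {
        Some(d) => format!(
            r#"{{"type":"image_url","image_url":{{"url":"{url}","detail":"{d}"}}}}"#
        ),
        None => format!(r#"{{"type":"image_url","image_url":{{"url":"{url}"}}}}"#),
    }
}

fn main() {
    let part = image_url_part("https://example.com/cat.png", Some("high"));
    assert!(part.contains("\"detail\":\"high\""));
    assert!(!image_url_part("https://example.com/cat.png", None).contains("detail"));
}
```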

Comment on lines +308 to +340

    setup_qwen_model(&server).await;
    let image_model = setup_qwen_image_model(&server).await;
    let org = setup_org_with_credits(&server, 100_000_000_000i64).await;
    let api_key = get_api_key_for_org(&server, org.id).await;

    // Request multiple images (n=2 or n=3)
    let response = server
        .post("/v1/responses")
        .add_header("Authorization", format!("Bearer {api_key}"))
        .add_header("User-Agent", MOCK_USER_AGENT)
        .json(&json!({
            "model": image_model,
            "input": "Generate beautiful images",
            "stream": true
        }))
        .await;

    assert_eq!(response.status_code(), 200);
    let response_text = response.text();

    // Verify we got output_image.created
    assert!(
        response_text.contains("output_image"),
        "Expected output_image for image generation"
    );

    println!("Multiple images test passed!");
}

/// Test that text models route to text completion
/// Verifies that only image-capable models trigger image generation routing
#[tokio::test]
async fn test_unknown_model_routes_to_text_completion() {

The test comment claims it "Verifies that the n parameter is respected" but the test doesn't pass an n parameter in the request (line 323-327), and the implementation hardcodes n to 1 (line 981 in service.rs). Either the test should be updated to pass an n parameter and verify multiple images are returned, or the test comment and name should be updated to reflect what it actually tests.

Comment on lines +956 to +964

    // Determine if this is image editing or image analysis
    // Image Edit: input contains only image (or image with minimal text instructions)
    // Image Analysis: input contains image + substantive text query (e.g., "what's in this image?")
    // Image Generation: no input image (pure text-to-image)
    let (has_input_image, has_input_text) =
        Self::analyze_input_content(&context.request);

    // Routing logic:

Encryption headers (MODEL_PUB_KEY) are only added to image generation requests (lines 929-936) but not to image edit requests (lines 958-964). If encryption is needed for image generation, it should likely also be applied to image editing for consistency and security. Consider adding the extra_params with encryption headers to the ImageEditParams as well.

Comment on lines +998 to +1008
    response_format: Some("b64_json".to_string()),
    extra: extra_params,
};

context
    .completion_service
    .get_inference_provider_pool()
    .image_generation(params, context.body_hash.clone())
    .await
    .map_err(|e| {
        errors::ResponseError::InvalidParams(format!(

If the image generation or edit response contains an empty data array (response.response.data.is_empty()), the function will create a ResponseOutputItem with empty image data and report image_count as 0, which may not accurately reflect an error condition. Consider adding validation to ensure at least one image was generated, or handling empty responses as an error case.
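A std-only sketch of that validation (`GeneratedImage` and `ImageGenError` are hypothetical stand-ins for the provider's response and error types):

```rust
/// Treat an empty `data` array from the image provider as a hard error
/// instead of emitting an output item with no image and image_count = 0.
struct GeneratedImage {
    b64_json: String,
}

#[derive(Debug, PartialEq)]
struct ImageGenError(String);

fn first_image(data: Vec<GeneratedImage>) -> Result<GeneratedImage, ImageGenError> {
    data.into_iter()
        .next()
        .ok_or_else(|| ImageGenError("provider returned no images".to_string()))
}

fn main() {
    assert!(first_image(vec![]).is_err());
    let ok = first_image(vec![GeneratedImage { b64_json: "aGk=".into() }]);
    assert_eq!(ok.unwrap().b64_json, "aGk=");
}
```

Surfacing the failure here also keeps usage reporting honest, since no image_count is recorded for a request that produced nothing.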

Comment on lines +1075 to +1076
    response_id: ctx.response_id_str.clone(),
    previous_response_id: ctx.previous_response_id.clone(),

The image generation flow returns early at line 1075 without waiting for or handling the title_task_handle that was started at line 907. This means the background title generation task will continue running, but if it fails or times out, there's no handling of that case. In the normal text completion flow (line 1200-1215), the code waits for title generation with a timeout. Consider adding similar handling for the image generation path to ensure consistent behavior and proper cleanup.


Development

Successfully merging this pull request may close these issues.

Feature: image generation, edit and analysis with Responses API
