feat: image generation, edit and analysis with Responses API #414
nickpismenkov wants to merge 10 commits into main from
Conversation
Summary of Changes

Hello @nickpismenkov, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request extends the Responses API to support multimodal interactions, specifically image generation, editing, and analysis. It enables the system to process image-related requests, route them to specialized models, and return image outputs or text-based analyses of image inputs. The changes span the API and service layers, including new data models for image content and usage, plus a suite of tests to validate the new functionality.
Code Review
This pull request introduces significant new functionality for image generation, editing, and analysis through the Responses API. The implementation is well-structured, separating the logic for image generation from image analysis. The addition of a comprehensive e2e test suite is a great contribution to ensure the stability of these new features. My review includes a high-severity suggestion to improve the mechanism for detecting image generation models by using model capabilities from the database instead of hardcoded string matching, which will make the system more robust and scalable. I've also included a medium-severity suggestion to improve the robustness of a test helper.
```rust
fn is_image_generation_model(model_name: &str) -> bool {
    let model_lower = model_name.to_lowercase();
    model_lower.contains("dall-e")
        || model_lower.contains("flux")
        || model_lower.contains("sd")
        || model_lower.contains("stable-diffusion")
}
```
Relying on string matching against the model name to determine if it's an image generation model is brittle and not easily extensible. A more robust approach would be to use the model's capabilities stored in the database.
I suggest the following refactoring:
1. Expose model details via `CompletionServiceTrait`: add a method like `async fn get_model(&self, model_name: &str) -> Result<Option<Model>>` to `CompletionServiceTrait` and implement it in `CompletionServiceImpl` to fetch model details from `models_repository`.
2. Use model capabilities in `process_response_stream`: before the `if` condition at line 923, resolve the model using the new service method, then check whether the model's `output_modalities` contains `"image"`.
This would look something like this:
```rust
// In process_response_stream
let model = context
    .completion_service
    .get_model(&context.request.model)
    .await?
    .ok_or_else(|| {
        errors::ResponseError::InvalidParams(format!(
            "Model '{}' not found",
            &context.request.model
        ))
    })?;
let is_image_gen = model.output_modalities.contains(&"image".to_string());
if is_image_gen {
    // Image generation logic...
}
```

This change would make the routing logic more reliable and automatically support new image generation models as they are added to the database without requiring code changes.
```rust
let first_response_id = if let Some(start) = first_response_text.find("\"id\":\"resp_") {
    let start_uuid = start + "\"id\":\"resp_".len();
    let end = first_response_text[start_uuid..].find('"').unwrap();
    &first_response_text[start_uuid..start_uuid + end]
} else {
    panic!("Could not find response ID in first response");
};
```
Parsing the response ID from the raw text stream is brittle and might break if the JSON formatting or event order changes. A more robust approach would be to parse the Server-Sent Events (SSE) stream properly: split the response text by `\n\n`, then for each event block find the `data:` line, parse the JSON, and extract the ID. This would make the test more resilient to minor changes in the stream format.
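For illustration, a minimal sketch of that approach; the helper name is hypothetical and the top-level `id` field location is an assumption based on the test snippet above:

```rust
/// Sketch: extract the first `resp_`-prefixed response ID from an SSE body.
/// Splits on blank lines, strips the `data: ` prefix, and parses each
/// payload as JSON rather than scanning raw text.
fn extract_response_id(sse_body: &str) -> Option<String> {
    for event in sse_body.split("\n\n") {
        for line in event.lines() {
            let Some(data) = line.strip_prefix("data: ") else { continue };
            let Ok(json) = serde_json::from_str::<serde_json::Value>(data) else { continue };
            if let Some(id) = json.get("id").and_then(|v| v.as_str()) {
                if id.starts_with("resp_") {
                    return Some(id.to_string());
                }
            }
        }
    }
    None
}
```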
Critical Issues Found
Pull request overview
This pull request adds comprehensive image generation, editing, and analysis capabilities to the Responses API. The implementation routes image generation model requests (dall-e, flux, stable-diffusion) to the image generation endpoint while maintaining text model support for image analysis.
Changes:
- Added image generation and editing support through the Responses API with proper streaming events and usage tracking
- Implemented multimodal content handling for image analysis (text models with image inputs)
- Introduced new data structures (OutputImage, ImageOutputData, image_count in Usage) to support image outputs; a hedged sketch of these shapes follows below
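For orientation only, a minimal sketch of the shapes named above; every field other than `image_count` is an assumption, not the actual definition from `models.rs`:

```rust
// Hypothetical shapes for illustration; see crates/services/src/responses/models.rs
// for the real definitions.
#[derive(Debug, Clone, serde::Serialize, serde::Deserialize)]
pub struct ImageOutputData {
    pub b64_json: Option<String>, // base64 payload (response_format "b64_json")
    pub url: Option<String>,      // hosted URL, when a provider returns one
}

#[derive(Debug, Clone, serde::Serialize, serde::Deserialize)]
pub struct Usage {
    pub input_tokens: u64,
    pub output_tokens: u64,
    pub image_count: u32, // new field: number of images produced
}
```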
Reviewed changes
Copilot reviewed 18 out of 19 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| crates/services/src/responses/service.rs | Core image generation/edit routing logic, multimodal content building, and prompt/image extraction helpers |
| crates/services/src/responses/service_helpers.rs | Added emit_output_image_created() event emitter for streaming image generation events |
| crates/services/src/responses/models.rs | New OutputImage content type, ImageOutputData struct, and image_count field in Usage |
| crates/services/src/completions/ports.rs | Added multimodal_content field to CompletionMessage for image analysis support |
| crates/services/src/completions/mod.rs | Updated message conversion to use multimodal_content when available |
| crates/api/src/routes/responses.rs | Skip OutputImage when converting to input parts |
| crates/api/src/routes/conversations.rs | Skip OutputImage in conversation content conversion |
| crates/api/src/routes/completions.rs | Set multimodal_content to None for chat/text completions |
| crates/api/tests/e2e_responses_images.rs | Comprehensive E2E tests for image generation, editing, analysis, and backward compatibility |
| crates/api/tests/e2e_audio_image.rs | Updated model name from Qwen/Qwen-Image-2512 to black-forest-labs/FLUX.2-klein-4B |
| crates/api/tests/common/mod.rs | Updated image model setup to use FLUX 2 Klein |
| crates/api/src/lib.rs | Added FLUX model to test mock providers |
| crates/services/Cargo.toml | Added base64 dependency for image data URL decoding |
| crates/services/src/responses/tools/mcp.rs | Set multimodal_content to None for MCP approval responses |
| Cargo.lock | Updated bytes dependency to 1.11.1 |
Comments suppressed due to low confidence (4)
crates/services/src/responses/service.rs:2658
- The function only handles base64 data URLs (data:...) but doesn't handle HTTP/HTTPS URLs. If a user provides an image URL like "https://example.com/image.png", the function will silently skip it and return "No input image found" error. Consider either fetching the image from HTTP URLs or returning a more specific error message that indicates only data URLs are supported.
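A hedged sketch of surfacing a clearer error; the error variant follows the snippets elsewhere in this review, and the surrounding URL-extraction loop is assumed:

```rust
// Sketch: reject http(s) image URLs explicitly instead of skipping them,
// since only base64 data URLs are currently decoded.
if url_str.starts_with("http://") || url_str.starts_with("https://") {
    return Err(errors::ResponseError::InvalidParams(
        "Only data URLs are supported for input images; fetching HTTP(S) URLs is not implemented"
            .to_string(),
    ));
}
```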
crates/services/src/responses/service.rs:2618
- When input is of type `ResponseInput::Items`, the function only checks for the `Parts` variant and doesn't handle the `Text` variant of `ResponseContent`. If an item has `content: ResponseContent::Text(text)`, the prompt won't be extracted and the function will incorrectly return a "No text prompt found for image request" error. Consider handling both `ResponseContent::Text` and `ResponseContent::Parts` variants.
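A hedged sketch of handling both variants; the `InputText` part name and the exact field shapes are assumptions based on other snippets in this PR:

```rust
// Sketch: extract prompt text from either content variant.
match &item.content {
    models::ResponseContent::Text(text) => prompt.push_str(text),
    models::ResponseContent::Parts(parts) => {
        for part in parts {
            if let models::ResponseContentPart::InputText { text } = part {
                prompt.push_str(text);
            }
        }
    }
}
```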
crates/services/src/responses/service.rs:2657
- When a data URL is found but doesn't contain a comma (line 647), the function continues to the next image instead of returning an error. This means a malformed data URL like "data:image/png;base64invalid" will be silently skipped. Consider returning an explicit error when a data URL is detected but is malformed.
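A hedged sketch of failing loudly instead; variable and error names follow the other snippets in this review:

```rust
// Sketch: a data URL without a ',' separator is malformed, e.g.
// "data:image/png;base64invalid" — return an error rather than skipping it.
let Some((_, b64_payload)) = url_str.split_once(',') else {
    return Err(errors::ResponseError::InvalidParams(
        "Malformed data URL in input image: missing ',' before the payload".to_string(),
    ));
};
```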
crates/services/src/responses/service.rs:2598
- The model detection logic using "sd" substring is too broad and could match unintended models. For example, a model named "bsd-chat" or "wisdom-7b" would be incorrectly detected as an image generation model. Consider using a more specific pattern like checking for "sd-" prefix or full model family names like "stable-diffusion-" to avoid false positives.
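A hedged sketch of a tighter check; the helper name is illustrative:

```rust
// Sketch: match the Stable Diffusion family without catching arbitrary
// substrings like "bsd-chat" or "wisdom-7b".
fn is_stable_diffusion(model_lower: &str) -> bool {
    model_lower.contains("stable-diffusion")
        || model_lower == "sd"
        || model_lower.starts_with("sd-")
        || model_lower.contains("/sd-")
}
```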
| "type": "text", | ||
| "text": text | ||
| })); | ||
| } | ||
| models::ResponseContentPart::InputImage { | ||
| image_url, | ||
| detail: _, | ||
| } => { | ||
| let url_str = match image_url { | ||
| models::ResponseImageUrl::String(s) => s.clone(), | ||
| models::ResponseImageUrl::Object { url } => url.clone(), | ||
| }; | ||
| content_array.push(serde_json::json!({ | ||
| "type": "image_url", |
The detail field from InputImage is being discarded when building multimodal content. According to the OpenAI API format, the detail parameter (with values like "auto", "low", or "high") should be included in the image_url object to control image resolution processing. This field should be preserved when constructing the multimodal content array.
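A hedged sketch of preserving it, assuming `detail` is an `Option<String>` on the `InputImage` variant:

```rust
// Sketch: carry `detail` ("auto" | "low" | "high") through to the
// image_url object instead of discarding it.
models::ResponseContentPart::InputImage { image_url, detail } => {
    let url_str = match image_url {
        models::ResponseImageUrl::String(s) => s.clone(),
        models::ResponseImageUrl::Object { url } => url.clone(),
    };
    let mut image_url_obj = serde_json::json!({ "url": url_str });
    if let Some(detail) = detail {
        image_url_obj["detail"] = serde_json::json!(detail);
    }
    content_array.push(serde_json::json!({
        "type": "image_url",
        "image_url": image_url_obj,
    }));
}
```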
```rust
    setup_qwen_model(&server).await;
    let image_model = setup_qwen_image_model(&server).await;
    let org = setup_org_with_credits(&server, 100_000_000_000i64).await;
    let api_key = get_api_key_for_org(&server, org.id).await;

    // Request multiple images (n=2 or n=3)
    let response = server
        .post("/v1/responses")
        .add_header("Authorization", format!("Bearer {api_key}"))
        .add_header("User-Agent", MOCK_USER_AGENT)
        .json(&json!({
            "model": image_model,
            "input": "Generate beautiful images",
            "stream": true
        }))
        .await;

    assert_eq!(response.status_code(), 200);
    let response_text = response.text();

    // Verify we got output_image.created
    assert!(
        response_text.contains("output_image"),
        "Expected output_image for image generation"
    );

    println!("Multiple images test passed!");
}

/// Test that text models route to text completion
/// Verifies that only image-capable models trigger image generation routing
#[tokio::test]
async fn test_unknown_model_routes_to_text_completion() {
```
The test comment claims it "Verifies that the n parameter is respected", but the test doesn't pass an n parameter in the request (lines 323-327), and the implementation hardcodes n to 1 (line 981 in service.rs). Either the test should be updated to pass an n parameter and verify multiple images are returned, or the test comment and name should be updated to reflect what it actually tests.
```rust
// Determine if this is image editing or image analysis
// Image Edit: input contains only image (or image with minimal text instructions)
// Image Analysis: input contains image + substantive text query (e.g., "what's in this image?")
// Image Generation: no input image (pure text-to-image)
let (has_input_image, has_input_text) =
    Self::analyze_input_content(&context.request);

// Routing logic:
```
Encryption headers (MODEL_PUB_KEY) are only added to image generation requests (lines 929-936) but not to image edit requests (lines 958-964). If encryption is needed for image generation, it should likely also be applied to image editing for consistency and security. Consider adding the extra_params with encryption headers to the ImageEditParams as well.
```rust
    response_format: Some("b64_json".to_string()),
    extra: extra_params,
};

context
    .completion_service
    .get_inference_provider_pool()
    .image_generation(params, context.body_hash.clone())
    .await
    .map_err(|e| {
        errors::ResponseError::InvalidParams(format!(
```
If the image generation or edit response contains an empty data array (response.response.data.is_empty()), the function will create a ResponseOutputItem with empty image data and report image_count as 0, which may not accurately reflect an error condition. Consider adding validation to ensure at least one image was generated, or handling empty responses as an error case.
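A hedged sketch of that validation; field paths follow the snippet above, and the error variant is an assumption:

```rust
// Sketch: treat an empty image array as a failure instead of emitting an
// empty output item with image_count == 0.
if response.response.data.is_empty() {
    return Err(errors::ResponseError::InternalError(
        "Image generation returned no images".to_string(),
    ));
}
```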
```rust
response_id: ctx.response_id_str.clone(),
previous_response_id: ctx.previous_response_id.clone(),
```
The image generation flow returns early at line 1075 without waiting for or handling the title_task_handle that was started at line 907. This means the background title generation task will continue running, but if it fails or times out, there's no handling of that case. In the normal text completion flow (line 1200-1215), the code waits for title generation with a timeout. Consider adding similar handling for the image generation path to ensure consistent behavior and proper cleanup.
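A hedged sketch of mirroring that handling on the image path; `title_task_handle` and the timeout budget are assumptions drawn from the comment above:

```rust
// Sketch: await the background title task with a timeout before the early
// return, so failures are logged instead of silently dropped.
if let Some(handle) = title_task_handle {
    match tokio::time::timeout(std::time::Duration::from_secs(10), handle).await {
        Ok(Ok(())) => {}
        Ok(Err(join_err)) => tracing::warn!("title generation task failed: {join_err}"),
        Err(_elapsed) => tracing::warn!("title generation timed out"),
    }
}
```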
No description provided.