Conversation


@Ladas Ladas commented Dec 3, 2025

Summary

WIP: experiment with OTEL auto-instrumentation for the weather agent

Related issue(s)

Relates to kagenti/kagenti#436

Ladas and others added 10 commits November 27, 2025 15:43
Implements comprehensive OTEL observability for the weather service agent, with
Phoenix integration, baggage propagation, and GenAI semantic conventions.

## Changes

### New Module: observability.py
- ObservabilityConfig: Reads OTEL config from environment variables
- setup_observability(): Configures OTEL tracer with proper resource attributes
- Baggage propagation functions for context tracking (user_id, request_id, etc.)
- Extract baggage from HTTP headers
- Phoenix project routing via resource attributes
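
A minimal sketch of what this module could look like, assuming the helper names above; the exact env-var names, the default endpoint, and the LangChain instrumentor wiring are assumptions, not the final implementation:

```python
# observability.py (sketch) -- names mirror the PR description; defaults are illustrative.
import os
from dataclasses import dataclass, field

from openinference.instrumentation.langchain import LangChainInstrumentor
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor


@dataclass
class ObservabilityConfig:
    """Reads OTEL settings from environment variables."""
    # Collector endpoint; the port (4318 vs. 8335) depends on the deployment, see later commits.
    endpoint: str = field(default_factory=lambda: os.getenv(
        "OTEL_EXPORTER_OTLP_ENDPOINT", "http://otel-collector.kagenti-system:4318"))
    namespace: str = field(default_factory=lambda: os.getenv("POD_NAMESPACE", "default"))
    service_name: str = field(default_factory=lambda: os.getenv("OTEL_SERVICE_NAME", "weather-service"))


def setup_observability(config: ObservabilityConfig | None = None) -> trace.Tracer:
    """Configure the tracer provider with the resource attributes listed further below."""
    config = config or ObservabilityConfig()
    resource = Resource.create({
        "service.name": config.service_name,
        "service.namespace": config.namespace,
        "k8s.namespace.name": config.namespace,
        "phoenix.project.name": f"{config.namespace}-agents",  # Phoenix project routing
        "deployment.environment": "kind-local",
    })
    provider = TracerProvider(resource=resource)
    provider.add_span_processor(
        BatchSpanProcessor(OTLPSpanExporter(endpoint=f"{config.endpoint}/v1/traces"))
    )
    trace.set_tracer_provider(provider)
    # Auto-instrument LangChain (LLM/Tool/Chain spans) before any agent code runs.
    LangChainInstrumentor().instrument(tracer_provider=provider)
    return trace.get_tracer("openinference.instrumentation.agent")
```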

### Updated Dependencies (pyproject.toml)
- opentelemetry-api>=1.20.0
- opentelemetry-sdk>=1.20.0
- opentelemetry-exporter-otlp>=1.20.0
- opentelemetry-exporter-otlp-proto-grpc>=1.20.0
- opentelemetry-instrumentation>=0.41b0
- openinference-semantic-conventions>=0.1.0

### Updated __init__.py
- Replace basic tracer setup with comprehensive setup_observability()
- Ensures OpenInference instrumentation loads before agent code
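
A sketch of the intended ordering; the `weather_service` module path and `WeatherAgentExecutor` name are assumptions:

```python
# weather_service/__init__.py (sketch): configure observability before the agent
# module is imported, so OpenInference hooks LangChain before it is first used.
from .observability import setup_observability

tracer = setup_observability()

from .agent import WeatherAgentExecutor  # noqa: E402  -- deliberately after setup
```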

### Updated agent.py
- Remove duplicate LangChainInstrumentor() call (now in observability.py)
- Extract baggage context from request headers
- Set baggage for each request (user_id, request_id, task_id, etc.)
- Log trace info for debugging
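
Roughly, the executor change could look like the following; the header names, the request-context fields, and the `set_baggage_context()` helper are taken from this description, but their exact shapes are assumptions:

```python
# agent.py (sketch)
import logging

from opentelemetry import trace

from .observability import set_baggage_context  # helper described above

logger = logging.getLogger(__name__)


class WeatherAgentExecutor:
    async def execute(self, context, event_queue):
        headers = getattr(context, "headers", None) or {}
        # Attach per-request baggage so it propagates to every span below.
        set_baggage_context(
            user_id=headers.get("user-id"),
            request_id=headers.get("request-id"),
            task_id=context.task_id,
            context_id=context.context_id,
        )
        # Log trace info for debugging.
        span_ctx = trace.get_current_span().get_span_context()
        logger.info("trace_id=%032x span_id=%016x", span_ctx.trace_id, span_ctx.span_id)
        # ... run the LangChain graph as before ...
```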

## Features

✅ Auto-instrumentation with OpenInference (LangChain LLM, Tool, Chain spans)
✅ OTEL baggage propagation across all services
✅ Phoenix project routing (team1-agents, etc.)
✅ GenAI semantic conventions compliance
✅ K8s metadata via resource attributes
✅ Comprehensive logging for observability debugging

## Resource Attributes

All traces include:
- service.name: weather-service
- service.namespace: {namespace}
- k8s.namespace.name: {namespace}
- phoenix.project.name: {namespace}-agents
- deployment.environment: kind-local

## Baggage Context

Baggage propagates across all spans:
- user_id: User identifier
- request_id: Unique request ID
- task_id: A2A task ID
- context_id: A2A context ID
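
A minimal sketch of the baggage helper (assumed to live in observability.py); the keyword-argument interface is an assumption:

```python
from opentelemetry import baggage, context


def set_baggage_context(**values):
    """Write each non-empty value into OTEL baggage and attach it to the
    current context so downstream spans and outgoing calls can read it."""
    ctx = context.get_current()
    for key, value in values.items():
        if value:
            ctx = baggage.set_baggage(key, str(value), context=ctx)
    return context.attach(ctx)  # caller may detach the token when the request ends
```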

## Testing

To test locally:
1. Deploy with OTEL environment variables set
2. Send request with user-id and request-id headers
3. Check Phoenix UI for traces with baggage attributes
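
For example (hypothetical port and payload; the agent's actual A2A request shape may differ):

```python
import requests

resp = requests.post(
    "http://localhost:8000/",  # port-forwarded weather agent (assumed)
    headers={"user-id": "alice", "request-id": "req-123"},
    json={"query": "What's the weather in Prague?"},
)
print(resp.status_code, resp.text)
```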

## Related

- Part of Phoenix Integration (TODO_PHOENIX_INTEGRATION.md)
- Phase 1: Agent Instrumentation
- Prepares for E2E tests in Phase 2

Signed-off-by: Claude Code AI Assistant <noreply@anthropic.com>
Signed-off-by: Ladislav Smola <lsmola@redhat.com>
Signed-off-by: Ladislav Smola <lsmola@redhat.com>
- Replace baggage.get_current() with context.get_current() (API fix)
- Add configurable OTLP exporter supporting both gRPC and HTTP protocols
- Default to HTTP/protobuf for wider compatibility
- Update default endpoint to use correct kagenti-system namespace and port 4318
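
A sketch of that protocol switch, using the standard OTEL_EXPORTER_OTLP_PROTOCOL variable; the function name and defaults are assumptions:

```python
import os

from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import (
    OTLPSpanExporter as GrpcSpanExporter,
)
from opentelemetry.exporter.otlp.proto.http.trace_exporter import (
    OTLPSpanExporter as HttpSpanExporter,
)


def build_span_exporter(endpoint: str):
    """Pick the gRPC or HTTP/protobuf OTLP exporter, defaulting to HTTP."""
    protocol = os.getenv("OTEL_EXPORTER_OTLP_PROTOCOL", "http/protobuf")
    if protocol == "grpc":
        return GrpcSpanExporter(endpoint=endpoint, insecure=True)
    return HttpSpanExporter(endpoint=f"{endpoint}/v1/traces")
```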

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Ladislav Smola <lsmola@redhat.com>
The OTEL Collector in kagenti-system is configured to listen on port
8335 (via --set override), not the standard 4318. Updated the default
endpoint in observability.py to match.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Ladislav Smola <lsmola@redhat.com>
- Add create_agent_span() for root AGENT span with OI attributes
- Wrap graph execution with using_attributes for session/user tracking
- Add a2a.task_id, a2a.context_id, user.id to spans for filtering
- Set input.value and output.value on agent spans

This ensures all LangChain auto-instrumented spans are properly
nested under an AGENT span and have session/user context.
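
A sketch of how these pieces could fit together; the OpenInference attribute constants exist in openinference-semantic-conventions, while create_agent_span()'s signature and the a2a.* keys are assumptions:

```python
from contextlib import contextmanager

from openinference.instrumentation import using_attributes
from openinference.semconv.trace import OpenInferenceSpanKindValues, SpanAttributes


@contextmanager
def create_agent_span(tracer, name, *, user_id, task_id, context_id, input_value):
    with tracer.start_as_current_span(name) as span:
        span.set_attribute(SpanAttributes.OPENINFERENCE_SPAN_KIND,
                           OpenInferenceSpanKindValues.AGENT.value)
        span.set_attribute(SpanAttributes.INPUT_VALUE, input_value)
        span.set_attribute("user.id", user_id)
        span.set_attribute("a2a.task_id", task_id)
        span.set_attribute("a2a.context_id", context_id)
        yield span


def run_graph(tracer, graph, query, *, user_id, task_id, context_id):
    # Session/user context flows down into the auto-instrumented LangChain spans.
    with using_attributes(session_id=context_id, user_id=user_id):
        with create_agent_span(tracer, "weather_agent_task",
                               user_id=user_id, task_id=task_id,
                               context_id=context_id, input_value=query) as span:
            result = graph.invoke({"messages": [("user", query)]})
            span.set_attribute(SpanAttributes.OUTPUT_VALUE, str(result))
            return result
```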

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
The previous commit removed these imports, but they are still used by the
set_baggage_context function.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Use the OpenInference-compatible tracer name "openinference.instrumentation.agent"
so that root AGENT spans pass through the OTEL Collector's filter/phoenix
processor, which only allows spans from "openinference.instrumentation.*".

Previously used "weather_service.observability" which was filtered out.
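
In other words, the instrumentation scope name is what the collector filter matches on:

```python
from opentelemetry import trace

# Matches the collector's "openinference.instrumentation.*" allow-list.
tracer = trace.get_tracer("openinference.instrumentation.agent")
```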

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Add as_root=True parameter to create_agent_span()
- Use empty Context() to break parent inheritance from A2A SDK telemetry
- Ensures weather_agent_task appears as root span in Phoenix UI
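
A sketch of that mechanism (signature is an assumption); note the next commit reverts this in favour of a collector-side filter change:

```python
from opentelemetry import trace
from opentelemetry.context import Context


def start_root_span(tracer: trace.Tracer, name: str, as_root: bool = True):
    # An empty Context() carries no active span, so no parent is inherited.
    ctx = Context() if as_root else None
    return tracer.start_as_current_span(name, context=ctx)
```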

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Instead of breaking parent context in create_agent_span(), we'll allow
a2a.utils.telemetry spans through the OTEL Collector filter. This
preserves the complete trace hierarchy.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Phase 1 of trace propagation implementation:

1. observability.py:
   - Configure W3C Trace Context and Baggage propagators
   - Add extract_trace_context() for incoming HTTP headers
   - Add inject_trace_context() for outgoing HTTP calls
   - Add trace_context_from_headers() context manager (sketched below)

2. agent.py:
   - Wrap entire execute method with trace_context_from_headers()
   - All spans now become children of incoming traceparent
   - Enables proper parent-child relationships across A2A calls

This enables:
- Single connected trace from HTTP request through LLM calls
- Multi-agent call flows with proper trace hierarchy
- Phoenix visibility into complete request flow
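
A minimal sketch of these helpers, using the standard OTEL propagation APIs; the function names mirror the commit message, the bodies are assumptions:

```python
from contextlib import contextmanager

from opentelemetry import context as otel_context
from opentelemetry.baggage.propagation import W3CBaggagePropagator
from opentelemetry.propagate import extract, inject, set_global_textmap
from opentelemetry.propagators.composite import CompositePropagator
from opentelemetry.trace.propagation.tracecontext import TraceContextTextMapPropagator

# W3C Trace Context + Baggage as the global propagators.
set_global_textmap(CompositePropagator(
    [TraceContextTextMapPropagator(), W3CBaggagePropagator()]))


def extract_trace_context(headers: dict) -> otel_context.Context:
    """Build an OTEL context from incoming headers (traceparent, baggage)."""
    return extract(headers)


def inject_trace_context(headers: dict) -> dict:
    """Add traceparent/baggage headers to an outgoing HTTP call."""
    inject(headers)
    return headers


@contextmanager
def trace_context_from_headers(headers: dict):
    """Attach the extracted context so spans created inside become children
    of the incoming traceparent."""
    token = otel_context.attach(extract_trace_context(headers))
    try:
        yield
    finally:
        otel_context.detach(token)
```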

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@Ladas Ladas marked this pull request as draft December 3, 2025 09:23