The Black Box Problem
Your coding agent just spent 3 minutes and $2.40 to add a print statement. Or worse, it confidently submitted a patch that broke production. You check the logs: "Task completed successfully."
This is the reality of running AI agents in production. They're powerful, autonomous, and completely opaque. Traditional logging tells you that something happened. Distributed tracing tells you why.
Consider what happens when a coding agent tackles a bug fix:
- It reads the issue description
- It explores the codebase (5-10 file reads)
- It forms a hypothesis
- It writes a fix
- It tests the fix
- It iterates (maybe 3-4 times)
- It submits
Each step involves LLM calls, tool executions, and decisions. Without tracing, you see the input and output. With tracing, you see the reasoning chain, the dead ends, the token burn rate, and exactly where things went wrong.
Why Agents Are Different
Tracing a traditional web service is straightforward: request comes in, database gets queried, response goes out. Agents are fundamentally different:
Non-deterministic execution paths. The same task might take 3 iterations or 15, depending on how the LLM interprets the problem.
Recursive tool use. An agent might call a tool, realize it needs more context, call another tool, then return to the first tool with new information.
Accumulating context. Each LLM call builds on previous calls. A bug in iteration 2 might not manifest until iteration 7 when the context window overflows.
Cost proportional to complexity. Unlike fixed-cost API calls, agent costs scale with task difficulty in unpredictable ways.
This is why OpenTelemetry's GenAI semantic conventions matter. They provide a standard vocabulary for capturing AI-specific telemetry that traditional tracing frameworks weren't designed for.
What We're Capturing
A complete agent trace includes:
| Signal | What It Captures |
|---|---|
| Agent invocation | Task description, agent name, conversation ID |
| LLM calls | Model, temperature, max_tokens, system prompt |
| Input messages | User prompts, conversation history |
| Output messages | Assistant responses, tool call requests |
| Tool executions | Tool name, arguments, results, duration |
| Token usage | Input tokens, output tokens, total cost |
| Errors | Exception types, error messages, stack traces |
OpenTelemetry GenAI Semantic Conventions
OpenTelemetry defines semantic conventions for GenAI that standardize how AI telemetry is captured. Key attributes include:
Agent Attributes
| Attribute | Description | Example |
|---|---|---|
| gen_ai.operation.name | Operation type | invoke_agent, chat, execute_tool |
| gen_ai.agent.name | Agent identifier | CodingAgent |
| gen_ai.agent.id | Unique agent instance ID | agent_abc123 |
| gen_ai.agent.description | What the agent does | Writes and debugs code |
| gen_ai.conversation.id | Session/conversation ID | conv_xyz789 |
LLM Call Attributes
| Attribute | Description | Example |
|---|---|---|
| gen_ai.provider.name | LLM provider | openai, anthropic |
| gen_ai.request.model | Requested model | gpt-4o |
| gen_ai.response.model | Actual model used | gpt-4o-2024-08-06 |
| gen_ai.request.temperature | Temperature setting | 0.0 |
| gen_ai.request.max_tokens | Max output tokens | 4096 |
| gen_ai.response.finish_reasons | Why generation stopped | ["stop"], ["tool_calls"] |
Token Usage Attributes
| Attribute | Description | Example |
|---|---|---|
| gen_ai.usage.input_tokens | Prompt tokens | 1250 |
| gen_ai.usage.output_tokens | Completion tokens | 380 |
Tool Execution Attributes
| Attribute | Description | Example |
|---|---|---|
| gen_ai.tool.name | Tool identifier | read_file, execute_command |
| gen_ai.tool.type | Tool category | function, extension |
| gen_ai.tool.call.id | Unique call ID | call_abc123 |
| gen_ai.tool.call.arguments | Input arguments (JSON) | {"path": "/src/main.py"} |
| gen_ai.tool.call.result | Output result (JSON) | {"content": "..."} |
| gen_ai.tool.description | What the tool does | Read file contents |
Content Attributes (Opt-In)
| Attribute | Description |
|---|---|
| gen_ai.system_instructions | System prompt |
| gen_ai.input.messages | Full input message array |
| gen_ai.output.messages | Full output message array |
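These content attributes are where sensitive data tends to live, so it helps to gate them explicitly rather than set them unconditionally. A minimal sketch of such a gate, reusing the OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT toggle discussed under Privacy Considerations below (the helper names here are ours, not part of the conventions):

import json
import os

def content_capture_enabled() -> bool:
    # Content attributes are opt-in; honor the standard toggle.
    return os.getenv(
        "OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT", "false"
    ).lower() == "true"

def record_messages(span, input_messages, output_messages) -> None:
    # Only attach full message content when explicitly enabled.
    if content_capture_enabled():
        span.set_attribute("gen_ai.input.messages", json.dumps(input_messages))
        span.set_attribute("gen_ai.output.messages", json.dumps(output_messages))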
Architecture
┌─────────────────────────────────────────────────────────────┐
│                        Coding Agent                          │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────┐   │
│  │   Planner   │──│  Executor   │──│    Tool Runtime     │   │
│  │    (LLM)    │  │   (Loop)    │  │  (read/write/exec)  │   │
│  └──────┬──────┘  └──────┬──────┘  └──────────┬──────────┘   │
│         │                │                    │              │
│         └────────────────┼────────────────────┘              │
│                          │                                   │
│                  OpenTelemetry SDK                            │
│                          │                                   │
└──────────────────────────┼───────────────────────────────────┘
                           │
                           ▼
                ┌─────────────────────┐
                │   OTel Collector    │
                │   (OTLP Receiver)   │
                └──────────┬──────────┘
                           │
                           ▼
                ┌─────────────────────┐
                │      Parseable      │
                │     /v1/traces      │
                └─────────────────────┘
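The agent side of this diagram is the standard OpenTelemetry Python SDK pointed at the collector's OTLP/HTTP port. A minimal sketch of that wiring, assuming the packages installed in the Quick Start below (the service name is illustrative):

from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

# Export spans to the local OTel Collector over OTLP/HTTP (port 4318).
provider = TracerProvider(resource=Resource.create({"service.name": "coding-agent"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4318/v1/traces"))
)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("coding-agent")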
The Instrumented SWE-agent
We've instrumented SWE-agent, a real-world coding agent from Princeton NLP that achieves state-of-the-art results on software engineering benchmarks. SWE-agent can autonomously fix GitHub issues, navigate complex codebases, and submit pull requests.
Why SWE-agent? Because it represents the complexity of production agents:
- Multi-step reasoning: It doesn't just generate code; it explores, hypothesizes, tests, and iterates
- Rich tool ecosystem: File operations, shell commands, git operations, code search
- Real-world benchmarks: Tested against SWE-bench, a dataset of real GitHub issues
- LiteLLM integration: Works with any LLM provider (OpenAI, Anthropic, local models)
The instrumentation adds OpenTelemetry tracing to three key files:
| File | What It Traces |
|---|---|
| sweagent/telemetry.py | OTel setup, span utilities, conversation tracking |
| sweagent/agent/models.py | LLM calls with token usage, model parameters, response metadata |
| sweagent/agent/agents.py | Agent runs, tool executions, iteration tracking |
See the full TRACING.md documentation for setup instructions.
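Tracing is opt-in via the SWE_AGENT_ENABLE_TRACING environment variable used in the Quick Start below, so an un-instrumented run behaves exactly as before. A rough sketch of what that kind of guard can look like (not the fork's exact code; see TRACING.md for the real setup):

import os
from opentelemetry import trace

def get_tracer(name: str) -> trace.Tracer:
    # Hand back a no-op tracer unless tracing is explicitly enabled, so the
    # instrumented code paths run unchanged with negligible overhead.
    if os.getenv("SWE_AGENT_ENABLE_TRACING", "").lower() != "true":
        return trace.get_tracer(name, tracer_provider=trace.NoOpTracerProvider())
    return trace.get_tracer(name)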
Quick Start
# Clone the instrumented fork and install it with the OpenTelemetry dependencies
git clone https://github.com/Debanitrkl/SWE-agent
cd SWE-agent
pip install -e .
pip install opentelemetry-api opentelemetry-sdk opentelemetry-exporter-otlp-proto-http

# Enable tracing and point the exporter at your collector
export SWE_AGENT_ENABLE_TRACING=true
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318
export OPENAI_API_KEY="your-key"

# Run against a GitHub issue
python -m sweagent run \
    --config config/default.yaml \
    --agent.model.name "gpt-4o" \
    --env.repo.github_url=https://github.com/your-org/your-repo \
    --problem_statement.github_url=https://github.com/your-org/your-repo/issues/123
What Happens Under the Hood
When you run SWE-agent with tracing enabled, here's the telemetry flow:
- Agent starts → Root span created with gen_ai.operation.name = "invoke_agent"
- Problem loaded → agent.problem_id attribute set with the GitHub issue URL
- First LLM call → Child span with model, temperature, and token counts
- Tool execution → Child span for each cat, edit, python command
- Iteration loop → More LLM calls, more tool spans, all nested under the root
- Completion → Root span closed with total tokens, success status, and duration
The result is a complete trace that shows exactly how the agent reasoned through the problem.

Key Instrumentation Patterns
The agent uses three span types following GenAI semantic conventions:
1. Agent Invocation Span
The root span wraps the entire agent run:
with tracer.start_as_current_span("invoke_agent CodingAgent") as span:
    span.set_attribute("gen_ai.operation.name", "invoke_agent")
    span.set_attribute("gen_ai.agent.name", "CodingAgent")
    span.set_attribute("gen_ai.conversation.id", conversation_id)
    # ... agent loop runs here
    span.set_attribute("gen_ai.usage.input_tokens", total_input_tokens)
2. LLM Call Spans
Each model call gets its own span:
with tracer.start_as_current_span(f"chat {model}") as span:
    span.set_attribute("gen_ai.operation.name", "chat")
    span.set_attribute("gen_ai.request.model", model)
    span.set_attribute("gen_ai.request.temperature", 0.0)
    response = client.chat.completions.create(...)
    span.set_attribute("gen_ai.usage.input_tokens", response.usage.prompt_tokens)
    span.set_attribute("gen_ai.usage.output_tokens", response.usage.completion_tokens)
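The LLM call attributes table earlier also lists response-side fields. Those can be recorded inside the same span; a sketch assuming the OpenAI-style response object used above (model and choices[].finish_reason):

    # Still inside the "chat {model}" span: record what the provider returned.
    span.set_attribute("gen_ai.response.model", response.model)
    span.set_attribute(
        "gen_ai.response.finish_reasons",
        [choice.finish_reason for choice in response.choices],
    )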
3. Tool Execution Spans
Each tool call is traced with arguments and results:
with tracer.start_as_current_span(f"execute_tool {tool_name}") as span:
    span.set_attribute("gen_ai.operation.name", "execute_tool")
    span.set_attribute("gen_ai.tool.name", tool_name)
    span.set_attribute("gen_ai.tool.call.id", tool_call_id)
    span.set_attribute("gen_ai.tool.call.arguments", json.dumps(args))
    result = execute(tool_name, args)
    span.set_attribute("gen_ai.tool.call.result", json.dumps(result))
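Tool failures are worth capturing explicitly, since the error.type field powers the failure queries and alerts later in this post. One way to do it, sketched as a variation of the span above (execute is the same placeholder):

from opentelemetry.trace import StatusCode

with tracer.start_as_current_span(f"execute_tool {tool_name}") as span:
    span.set_attribute("gen_ai.operation.name", "execute_tool")
    span.set_attribute("gen_ai.tool.name", tool_name)
    try:
        result = execute(tool_name, args)
    except Exception as exc:
        span.record_exception(exc)  # attaches message and stack trace as a span event
        span.set_attribute("error.type", type(exc).__qualname__)
        span.set_status(StatusCode.ERROR, str(exc))
        raise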
OpenTelemetry Collector Configuration
Configure the collector to forward traces to Parseable:
receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:4318
exporters:
  otlphttp:
    endpoint: http://localhost:8000
    headers:
      Authorization: Basic YWRtaW46YWRtaW4=
      X-P-Stream: swe-agent-traces
      X-P-Log-Source: otel-traces
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlphttp]
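With the collector running, it can help to confirm the pipeline end to end before a full agent run: emit a single throwaway span using the SDK wiring sketched in the Architecture section, then check the swe-agent-traces stream in Parseable.

# Assumes the TracerProvider/OTLP exporter setup from the Architecture sketch.
with tracer.start_as_current_span("invoke_agent smoke-test") as span:
    span.set_attribute("gen_ai.operation.name", "invoke_agent")
    span.set_attribute("gen_ai.agent.name", "smoke-test")

provider.shutdown()  # flush the batch processor before the process exits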
Trace Hierarchy: Reading the Agent's Mind

A typical agent trace looks like this:
invoke_agent SWE-agent (45.2s) [problem: django/django#12345]
├── chat gpt-4o (3.1s) [iteration 1]
│   └── [tokens: 2100 in, 450 out] "Let me explore the codebase..."
├── execute_tool find_file (0.8s)
│   └── [args: {"file_name": "models.py"}]
├── execute_tool cat (0.1s)
│   └── [args: {"path": "django/db/models/base.py", "lines": "1-50"}]
├── chat gpt-4o (2.8s) [iteration 2]
│   └── [tokens: 3200 in, 380 out] "I see the issue is in the save() method..."
├── execute_tool cat (0.1s)
│   └── [args: {"path": "django/db/models/base.py", "lines": "700-800"}]
├── chat gpt-4o (4.2s) [iteration 3]
│   └── [tokens: 4100 in, 520 out] "I'll create a fix..."
├── execute_tool edit (0.2s)
│   └── [args: {"path": "django/db/models/base.py", "start": 745, "end": 752}]
├── execute_tool python (2.1s)
│   └── [args: {"command": "python -m pytest tests/model_tests.py -x"}]
├── chat gpt-4o (2.1s) [iteration 4]
│   └── [tokens: 4800 in, 180 out] "Tests pass. Submitting..."
└── execute_tool submit (0.1s)
    └── [result: "Patch submitted successfully"]

This trace tells a story. You can see:
- The exploration phase (iterations 1-2): The agent is reading files, building context
- The hypothesis (iteration 3): Token output spikes as it generates the fix
- The validation (iteration 4): It runs tests before submitting
- The cost breakdown: 14,200 input tokens, 1,530 output tokens ≈ $0.05
Now imagine this trace for an agent that took 15 iterations and $2.40. You'd immediately see where it got stuck, which files it kept re-reading, and whether it was spinning on a syntax error.
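The $0.05 figure is simple arithmetic over the token totals in the trace, assuming GPT-4o list pricing of $2.50 per million input tokens and $10 per million output tokens (the same rates used in the cost query later in this post):

# Back-of-envelope cost for the trace above.
input_tokens, output_tokens = 14_200, 1_530
cost = input_tokens * 2.50 / 1_000_000 + output_tokens * 10.00 / 1_000_000
print(f"${cost:.3f}")  # ≈ $0.051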
Query Traces in Parseable
Once traces flow into Parseable, you can query them with SQL. This is where observability becomes actionable.
Trace Schema
After ingestion, your traces will have fields like:
| Field | Description |
|---|---|
| trace_id | Unique trace identifier |
| span_id | Unique span identifier |
| parent_span_id | Parent span (for hierarchy) |
| span_name | Operation name |
| span_kind | INTERNAL, CLIENT |
| span_duration_ms | Duration in milliseconds |
| gen_ai.operation.name | invoke_agent, chat, execute_tool |
| gen_ai.agent.name | Agent identifier |
| gen_ai.request.model | Model requested |
| gen_ai.usage.input_tokens | Input token count |
| gen_ai.usage.output_tokens | Output token count |
| gen_ai.tool.name | Tool that was executed |
| gen_ai.tool.call.arguments | Tool input (JSON) |
| gen_ai.tool.call.result | Tool output (JSON) |
Example Queries
Agent runs with token usage:
SELECT
"span_trace_id" AS trace_id,
"gen_ai.conversation.id" AS conversation_id,
"agent.problem_id" AS task,
"gen_ai.usage.input_tokens" AS input_tokens,
"gen_ai.usage.output_tokens" AS output_tokens,
"agent.iterations" AS iterations,
"agent.has_submission" AS completed,
("span_duration_ns" / 1000000000.0) AS duration_seconds,
"p_timestamp"
FROM "swe-agent-traces"
WHERE "gen_ai.operation.name" = 'invoke_agent'
ORDER BY "p_timestamp" DESC
LIMIT 20;

LLM calls per agent run:
SELECT
"span_trace_id" AS trace_id,
COUNT(*) AS llm_calls,
SUM(CAST("gen_ai.usage.input_tokens" AS DOUBLE)) AS total_input_tokens,
SUM(CAST("gen_ai.usage.output_tokens" AS DOUBLE)) AS total_output_tokens,
AVG("span_duration_ns" / 1e6) AS avg_latency_ms
FROM "swe-agent-traces"
WHERE "gen_ai.operation.name" = 'chat'
GROUP BY "span_trace_id"
ORDER BY total_output_tokens DESC;

Tool usage breakdown:
SELECT
"swe-agent-traces"."gen_ai.tool.name" AS tool,
COUNT(*) AS call_count,
AVG("swe-agent-traces"."span_duration_ns") / 1e6 AS avg_duration_ms,
SUM(CASE WHEN "swe-agent-traces"."error.type" IS NOT NULL THEN 1 ELSE 0 END) AS errors
FROM "swe-agent-traces"
WHERE "swe-agent-traces"."gen_ai.operation.name" = 'execute_tool'
GROUP BY tool
ORDER BY call_count DESC;

Failed tool executions:
SELECT
"span_trace_id" AS trace_id,
"gen_ai.tool.name" AS tool,
"gen_ai.tool.call.arguments" AS arguments,
"gen_ai.tool.call.result" AS result,
"error.type" AS error_type,
p_timestamp
FROM "swe-agent-traces"
WHERE "gen_ai.operation.name" = 'execute_tool'
AND "error.type" IS NOT NULL
ORDER BY p_timestamp DESC;

Cost estimation (GPT-4o pricing):
SELECT
DATE_TRUNC('day', "swe-agent-traces".p_timestamp) AS day,
COUNT(DISTINCT "swe-agent-traces".span_trace_id) AS agent_runs,
SUM(CAST("swe-agent-traces"."gen_ai.usage.input_tokens" AS DOUBLE)) AS input_tokens,
SUM(CAST("swe-agent-traces"."gen_ai.usage.output_tokens" AS DOUBLE)) AS output_tokens,
ROUND(
SUM(CAST("swe-agent-traces"."gen_ai.usage.input_tokens" AS DOUBLE)) * 0.0000025 +
SUM(CAST("swe-agent-traces"."gen_ai.usage.output_tokens" AS DOUBLE)) * 0.00001,
4
) AS estimated_cost_usd
FROM "swe-agent-traces"
WHERE "swe-agent-traces"."gen_ai.operation.name" = 'chat'
GROUP BY day
ORDER BY day DESC;
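The same queries can be run programmatically. A hedged sketch using the requests library, assuming Parseable's SQL query endpoint (POST /api/v1/query with query, startTime, and endTime fields) and the basic-auth credentials from the collector config; check the API reference for your Parseable version:

import requests

PARSEABLE_URL = "http://localhost:8000"  # same instance the collector exports to

query = """
SELECT COUNT(*) AS llm_calls
FROM "swe-agent-traces"
WHERE "gen_ai.operation.name" = 'chat'
"""

resp = requests.post(
    f"{PARSEABLE_URL}/api/v1/query",
    json={
        "query": query,
        "startTime": "2025-01-01T00:00:00Z",  # adjust to your time window
        "endTime": "2025-01-02T00:00:00Z",
    },
    auth=("admin", "admin"),  # default credentials, matching the collector config
)
resp.raise_for_status()
print(resp.json())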

Set Up Alerts
High Token Usage Alert
- Navigate to Alerts → Create Alert
- Configure:
  - Dataset: swe-agent-traces
  - Filter: gen_ai.operation.name = 'invoke_agent'
  - Monitor Field: gen_ai.usage.output_tokens
  - Aggregation: SUM
  - Alert Type: Threshold
  - Condition: Greater than 50000 in 1 hour

Tool Failure Anomaly
- Create alert:
  - Dataset: swe-agent-traces
  - Filter: gen_ai.operation.name = 'execute_tool' AND error.type IS NOT NULL
  - Monitor Field: All rows (*)
  - Aggregation: COUNT
  - Alert Type: Anomaly Detection
  - Sensitivity: High

Slow Agent Runs
- Create alert:
  - Dataset: swe-agent-traces
  - Filter: gen_ai.operation.name = 'invoke_agent'
  - Monitor Field: span_duration_ms
  - Aggregation: AVG
  - Alert Type: Threshold
  - Condition: Greater than 120000 (2 minutes)
Privacy Considerations
The instrumentation captures prompts, responses, and tool arguments by default. This is invaluable for debugging but raises concerns for production:
What gets captured:
- Full prompt text (may contain user data, API keys, PII)
- LLM responses (may contain generated secrets or sensitive logic)
- Tool arguments (file paths, command outputs, code snippets)
Mitigation strategies:
- Disable content capture entirely:
# Don't set these attributes in production
# span.set_attribute("gen_ai.input.messages", ...)
# span.set_attribute("gen_ai.output.messages", ...)
# span.set_attribute("gen_ai.tool.call.arguments", ...)
# span.set_attribute("gen_ai.tool.call.result", ...)
- Use environment variable:
export OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT=false
- Truncate large content:
def truncate(content: str, max_len: int = 1000) -> str:
    if len(content) > max_len:
        return content[:max_len] + "...[truncated]"
    return content
- Redact sensitive patterns:
import re

def redact_secrets(content: str) -> str:
    patterns = [
        (r'sk-[a-zA-Z0-9]{48}', '[OPENAI_KEY]'),
        (r'ghp_[a-zA-Z0-9]{36}', '[GITHUB_TOKEN]'),
        (r'password["\s:=]+["\']?[\w@#$%^&*]+', 'password=[REDACTED]'),
    ]
    for pattern, replacement in patterns:
        content = re.sub(pattern, replacement, content, flags=re.IGNORECASE)
    return content
The right balance depends on your use case. For development and debugging, capture everything. For production, capture metadata (tokens, durations, tool names) but redact content.
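One way to put the last two strategies together, reusing the truncate and redact_secrets helpers above at the point where the tool span from earlier sets its content attributes (the CAPTURE_CONTENT gate mirrors the environment-variable toggle; span, args, and result come from that pattern):

import json
import os

CAPTURE_CONTENT = (
    os.getenv("OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT", "false").lower() == "true"
)

def safe_content(content: str) -> str:
    # Redact known secret patterns first, then cap the size.
    return truncate(redact_secrets(content))

# At the point where the tool span records its content attributes:
if CAPTURE_CONTENT:
    span.set_attribute("gen_ai.tool.call.arguments", safe_content(json.dumps(args)))
    span.set_attribute("gen_ai.tool.call.result", safe_content(json.dumps(result)))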
The Bigger Picture: Agent Observability
Tracing is just the beginning. Once you have structured telemetry flowing into Parseable, you can build:
Cost attribution dashboards. Which teams are burning the most tokens? Which problem types are most expensive to solve?
Performance baselines. What's the p50/p95 latency for your agent? How does it vary by model, task type, or time of day?
Failure analysis. When agents fail, what's the common pattern? Context window overflow? Tool errors? Rate limits?
A/B testing infrastructure. Compare GPT-4o vs Claude 3.5 on the same tasks. Which model produces better patches with fewer iterations?
Regression detection. Did that prompt change increase token usage? Did the new tool implementation slow things down?
The traces you're collecting today become the training data for understanding agent behavior at scale.
Conclusion
Coding agents are no longer experimental. They're fixing real bugs, writing production code, and costing real money. The question isn't whether to use them; it's how to operate them responsibly.
Traditional observability tools weren't built for this. They don't understand that a 3-second "API call" is actually an LLM reasoning through a complex problem. They don't know that token counts matter more than request counts. They can't show you the chain of thought that led to a broken patch.
OpenTelemetry's GenAI semantic conventions change this. They give us a standard vocabulary for AI telemetry. And Parseable gives us the SQL interface to actually use it, with the storage economics to keep months of agent history without breaking the budget.
The instrumented SWE-agent fork is your starting point. Clone it, run it against a real issue, and watch the traces flow in. You'll never look at agent logs the same way again.
Every prompt. Every tool call. Every token. All queryable with SQL. That's the future of agent observability.

